Comments (7)

paulu-aws commented on August 17, 2024

@martyn-swift, originally I had in mind an S3 trigger on the data-lake bucket that would invoke a Lambda function to insert records into Elasticsearch. However, the cheat code for this is Quilt: https://quiltdata.com/
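
For reference, the S3-trigger-plus-Lambda idea could be sketched roughly as below. This is only a minimal illustration, not code from this repo: the function names and the document fields are hypothetical, and the actual Elasticsearch indexing call is left as a stub because it depends on your domain and client of choice.

```python
from urllib.parse import unquote_plus


def documents_from_s3_event(event):
    """Turn an S3 notification event into one index document per object."""
    docs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        docs.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
            "event_time": record.get("eventTime"),
        })
    return docs


def handler(event, context):
    """Lambda entry point (hypothetical). Indexing itself is left as a stub."""
    docs = documents_from_s3_event(event)
    # Assumption: send `docs` to your Elasticsearch domain here.
    return {"documents": len(docs)}
```

Keeping the event parsing in its own function makes it easy to test without deploying anything.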

Full disclosure: I don't work for them, and this is my own opinion, not my employer's. Quilt is an awesome product for this kind of use case. I'm not sure I'd try to implement my own indexing and Elasticsearch architecture if Quilt were an option.

from data-lake-as-code.

martyn-swift commented on August 17, 2024

Hi @paulu-aws, what would the architecture look like for this? Do you have a diagram?

martyn-swift commented on August 17, 2024

@paulu-aws, does the Glue job write to S3 in large batches? Would the job trigger per object? Is EventBridge an alternative for batching up S3 PUT events?

paulu-aws commented on August 17, 2024

@martyn-swift, part of the beauty of Glue is how DynamicFrames abstract away this kind of detail, UNLESS you really want to control it. The format and format_options parameters of glueContext.write_dynamic_frame.from_options() (block size, for example) give you some control over how the write to S3 behaves. However, I'd advise against trying to outsmart it. The dynamic frame is going to do a much more efficient job of mapping writes across your DPUs in parallel into S3 than anything someone might cook up. Live the dream. Let the dynamic frame do its job. That lets you focus on enrolling more datasets and less on plumbing.
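
If you do want to touch those knobs, a minimal sketch might look like the following. The helper name write_parquet and the 128 MiB value are my own assumptions for illustration; the only Glue call used is write_dynamic_frame.from_options(), with blockSize passed through format_options for the Parquet format. Inside a real job, glue_context would be your GlueContext and frame a DynamicFrame.

```python
def write_parquet(glue_context, frame, path, block_size=128 * 1024 * 1024):
    """Write a DynamicFrame to S3 as Parquet, overriding only the block size.

    `write_parquet` is a hypothetical helper; everything else about the
    write (file count, parallelism) is left to the DynamicFrame.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": path},
        format="parquet",
        # Assumption: 128 MiB is an illustrative value, not a recommendation.
        format_options={"blockSize": block_size},
    )
```

Note that this only shapes the files being written. It does not change how many S3 events the write produces.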

paulu-aws commented on August 17, 2024

@martyn-swift, I'll also mention that if you are trying to plug the Glue job into an event-driven architecture, it's probably not a good idea to rely on S3 triggers as the message bus. S3 triggers characterize write behavior on an S3 bucket, not logical processing steps. Multi-file outputs, failed writes, object versions, etc. all become challenges when you rely on S3 triggers as an eventing mechanism. Better options, in order of most to least complex (IMO), would be Amazon EventBridge, Amazon MQ, AWS Step Functions, and Amazon SNS. All of those services can be called directly from inside your Glue job using Python (or Java) APIs to send messages over the duration of your job. For example, directly after the .write_dynamic_frame call you'll know that the files are written, plus the bucket, key name, format, and options used, and you can pass that along to your preferred messaging bus.
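
As a rough sketch of that last point, here is what passing the write details to EventBridge could look like with boto3's put_events. The Source and DetailType strings, the detail fields, and both function names are hypothetical placeholders; only the put_events call and its Entries shape come from boto3. The client is injectable so the message-building logic can be exercised outside of Glue.

```python
import json


def build_write_event(bucket, prefix, fmt, format_options=None):
    """Describe a completed write as an EventBridge entry."""
    return {
        "Source": "datalake.glue",        # hypothetical source name
        "DetailType": "DatasetWritten",   # hypothetical detail type
        "Detail": json.dumps({
            "bucket": bucket,
            "prefix": prefix,
            "format": fmt,
            "format_options": format_options or {},
        }),
    }


def announce_write(entry, events_client=None):
    """Send one entry to the default event bus."""
    if events_client is None:
        import boto3  # available in the Glue Python runtime
        events_client = boto3.client("events")
    return events_client.put_events(Entries=[entry])
```

You would call announce_write(build_write_event(...)) right after the write returns, so the event describes a logical step rather than individual object PUTs.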

martyn-swift commented on August 17, 2024

@paulu-aws, can you give me an example of the SNS call? Does it use boto3, with the call placed after job.commit()?

paulu-aws commented on August 17, 2024

@martyn-swift,

The SNS call would look like any other Boto3 call you might see in the API basics examples. You just need to make sure your Glue job's execution role has IAM permissions to publish to the SNS topic. You may eventually find yourself peppering in several .publish() calls to SNS topics over the course of your Glue job's write operations to provide updates or metrics.
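
A minimal sketch of such a call, assuming a topic you have already created: the helper name publish_update and the payload fields are hypothetical, while publish() with TopicArn, Subject, and Message is the regular boto3 SNS API. The execution role would need sns:Publish on that topic.

```python
import json


def publish_update(topic_arn, subject, payload, sns_client=None):
    """Publish a small JSON status message to an SNS topic."""
    if sns_client is None:
        import boto3  # available in the Glue Python runtime
        sns_client = boto3.client("sns")
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject=subject,
        Message=json.dumps(payload),
    )
```

For example, after a write step you might call publish_update(topic_arn, "Dataset written", {"bucket": bucket, "prefix": prefix}), where topic_arn, bucket, and prefix are values from your own job.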

Just keep in mind that doing this will start you down a path of blending business workflow state with your data engineering code, and those two are probably best kept separate. As a one-off or short-term solution, SNS inside the Glue job is fine. But anything else really deserves a more sophisticated business workflow framework like AWS Step Functions. Here is a quick Step Functions state machine I mocked up in a few minutes that triggers a Glue workflow and publishes to SNS topics WITHOUT requiring any changes to the Glue job itself.

[image: mocked-up Step Functions state machine]
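
Since the screenshot does not come through as text, here is a rough sketch of a state machine in the same spirit, built as a Python dict and serialized to Amazon States Language JSON. It is not a reconstruction of the pictured mock-up: as a simplifying assumption it runs a single Glue job through the glue:startJobRun.sync integration rather than a Glue workflow, and the state names, messages, and function name are hypothetical.

```python
import json


def glue_then_notify(job_name, success_topic_arn, failure_topic_arn):
    """Build a state machine definition: run a Glue job, then notify via SNS."""
    def notify(topic_arn, message, **transition):
        return {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": topic_arn, "Message": message},
            **transition,
        }

    return {
        "Comment": "Run a Glue job and publish the outcome to SNS",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                # .sync makes Step Functions wait for the job run to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "Next": "NotifySuccess",
            },
            "NotifySuccess": notify(
                success_topic_arn, f"Glue job {job_name} succeeded", End=True
            ),
            "NotifyFailure": notify(
                failure_topic_arn, f"Glue job {job_name} failed", Next="Failed"
            ),
            "Failed": {"Type": "Fail"},
        },
    }


if __name__ == "__main__":
    # Placeholder names and ARNs, for illustration only.
    definition = glue_then_notify(
        "my-glue-job",
        "arn:aws:sns:us-east-1:123456789012:job-succeeded",
        "arn:aws:sns:us-east-1:123456789012:job-failed",
    )
    print(json.dumps(definition, indent=2))
```

The point of the design is that the notification logic lives in the workflow definition, so the Glue script itself stays free of messaging code.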
