awslabs / project-lakechain

:zap: Cloud-native, AI-powered, document processing pipelines on AWS.

Home Page: https://awslabs.github.io/project-lakechain/

License: Apache License 2.0

JavaScript 4.79% TypeScript 83.21% Dockerfile 0.63% Python 11.12% Shell 0.24%
aws computer-vision document-processing generative-ai machine-learning natural-language-processing retrieval-augmented-generation serverless hacktoberfest aws-cdk

project-lakechain's Introduction





Project Lakechain

Cloud-native, AI-powered, document processing pipelines on AWS.



🔖 Features

  • 🤖 Composable — Composable API to express document processing pipelines using middlewares.
  • ☁️ Scalable — Scales out of the box. Process millions of documents, and scale to zero automatically when done.
  • ⚡ Cost Efficient — Uses cost-optimized architectures to reduce costs and drive a pay-as-you-go model.
  • 🚀 Ready to Use — 60+ built-in middlewares for common document processing tasks, ready to be deployed.
  • 🦎 GPU and CPU Support — Use the right compute type to balance performance and cost.
  • 📦 Bring Your Own — Create your own transform middlewares to process documents and extend Lakechain.
  • 📙 Ready-Made Examples — Quickstart your journey by leveraging 50+ examples we've built for you.

🚀 Getting Started

👉 Head to our documentation, which contains all the information required to understand the project and quickly start building!

What's Lakechain โ“

Project Lakechain is an experimental framework based on the AWS Cloud Development Kit (CDK) that makes it easy to express and deploy scalable document processing pipelines on AWS using infrastructure as code. It emphasizes modularity of pipelines, and provides 40+ ready-to-use components for prototyping complex document pipelines that can scale out of the box to millions of documents.

This project has been designed to help AWS customers build and scale different types of document processing pipelines, spanning a wide array of use cases including metadata extraction, document conversion, NLP analysis, text summarization, translation, audio transcription, computer vision, Retrieval Augmented Generation pipelines, and much more!

Show me the code โ—

👇 Below is an example of a pipeline that deploys the AWS infrastructure to automatically transcribe audio files uploaded to S3, in just a few lines of code. It scales to millions of documents.
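A sketch of such a pipeline, assuming the builder-style API and the package names shown in the Lakechain documentation (`@project-lakechain/s3-event-trigger`, `@project-lakechain/transcribe-audio-processor`, `@project-lakechain/s3-storage-connector`); exact names and methods should be verified against the docs:

```typescript
import * as cdk from 'aws-cdk-lib';
import { CacheStorage } from '@project-lakechain/core';
import { S3EventTrigger } from '@project-lakechain/s3-event-trigger';
import { TranscribeAudioProcessor } from '@project-lakechain/transcribe-audio-processor';
import { S3StorageConnector } from '@project-lakechain/s3-storage-connector';

class TranscriptionStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // Buckets for input audio documents and transcription results.
    const source = new cdk.aws_s3.Bucket(this, 'Source');
    const destination = new cdk.aws_s3.Bucket(this, 'Destination');
    const cache = new CacheStorage(this, 'Cache');

    // Monitor the source bucket for uploaded audio documents.
    const trigger = new S3EventTrigger.Builder()
      .withScope(this)
      .withIdentifier('Trigger')
      .withCacheStorage(cache)
      .withBucket(source)
      .build();

    // Transcribe audio documents and store the results.
    trigger
      .pipe(new TranscribeAudioProcessor.Builder()
        .withScope(this)
        .withIdentifier('Transcribe')
        .withCacheStorage(cache)
        .build())
      .pipe(new S3StorageConnector.Builder()
        .withScope(this)
        .withIdentifier('Storage')
        .withCacheStorage(cache)
        .withDestinationBucket(destination)
        .build());
  }
}
```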





LICENSE

See LICENSE.

project-lakechain's People

Contributors

github-actions[bot], hqarroum


project-lakechain's Issues

Feature request: Support SES as a trigger

Use case

Support Amazon SES as a trigger for pipelines, allowing a pipeline to be triggered whenever an e-mail is received.

Solution/User Experience

Provide a SesTrigger that will allow customers to react to e-mail reception within pipelines.

Alternative solutions

No response

Feature request: Implement onCreate() method for middlewares

Use case

Middlewares currently require manually calling the super.bind() function when they are constructed. There is currently no hook for a middleware to know when a subclass implementation is done creating its constructs.

Solution/User Experience

We propose adding an onCreate method that middlewares can override to create their resources.
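A minimal sketch of how such a hook could work, with hypothetical names (`create`, `isBound`) standing in for the real Middleware API:

```typescript
abstract class Middleware {
  private bound = false;

  // Called by the framework once the subclass is done creating its
  // constructs, removing the need to call super.bind() manually.
  protected onCreate(): void {
    this.bind();
  }

  protected bind(): void { this.bound = true; }
  isBound(): boolean { return this.bound; }

  // Factory that constructs the middleware, then fires the hook.
  static create<T extends Middleware>(ctor: new () => T): T {
    const instance = new ctor();
    instance.onCreate();
    return instance;
  }
}

class MyMiddleware extends Middleware {
  protected override onCreate(): void {
    // Create constructs here, then let the base class bind.
    super.onCreate();
  }
}

const m = Middleware.create(MyMiddleware);
```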

Alternative solutions

No response

Feature request: Move Tar and Zip middlewares to Fargate

Use case

The Tar and Zip inflate middlewares are based on Lambda, which has a maximum execution time of 15 minutes. Moving the implementation from Lambda to Fargate would be more scalable for customers and ensure that very large archives can be inflated.

Solution/User Experience

Move to Fargate in order to benefit from a longer execution time when inflating archives.

Alternative solutions

No response

Feature request: Add connector for FAISS

Use case

We want to add support for a FAISS index for a very low-cost, non-production setup. This would be a new middleware acting as a storage connector, taking embeddings from other middlewares in a pipeline and storing them in a FAISS index in a given storage.

Solution/User Experience

It would be possible to use an S3 bucket as a means of low-cost storage. The FAISS storage connector would be based on a Lambda compute with a reserved concurrency of 1, loading the index from the S3 bucket and writing it back.
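A sketch of the single-writer load-modify-write cycle this design enables: reserved concurrency of 1 means only one invocation mutates the index at a time, so reading from and writing back to S3 is safe. The S3 calls are stubbed as injected functions and all names are hypothetical:

```typescript
// Stand-in for a serialized FAISS index: a list of embedding vectors.
type Index = number[][];

async function addEmbeddings(
  load: () => Promise<Index>,        // e.g. GetObject on the index file
  save: (i: Index) => Promise<void>, // e.g. PutObject on the index file
  embeddings: number[][]
): Promise<Index> {
  // Safe only because reserved concurrency of 1 guarantees a single
  // writer; otherwise concurrent invocations would lose updates.
  const index = await load();
  index.push(...embeddings);
  await save(index);
  return index;
}
```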

Alternative solutions

No response

Bug: Transcribe processor does not enrich metadata

Expected Behaviour

The transcribe audio processor should enrich the output document metadata with the detected language of the text.

Current Behaviour

No language metadata is set on the output of the transcribed document.

Code snippet

No response

Steps to Reproduce

No response

Possible Solution

No response

Project Lakechain version

latest

Execution logs

No response

Bug: Batching window doesn't translate to undefined when set to 0

Expected Behaviour

In AWS Lambda, setting a batching window on the event source mapping between a middleware's SQS input queue and its Lambda compute delays the processing of items from the input queue. We want to ensure that when the batching window is set to zero, the value passed to the event source mapping is undefined.
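The intended normalization can be sketched as follows; the helper name is hypothetical, and the real implementation would return a CDK Duration rather than a plain number:

```typescript
// A zero batching window should not be forwarded to the event source
// mapping; returning undefined lets Lambda process items immediately.
function normalizeBatchingWindow(seconds: number): number | undefined {
  return seconds > 0 ? seconds : undefined;
}
```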

Current Behaviour

No response

Code snippet

No response

Steps to Reproduce

No response

Possible Solution

No response

Project Lakechain version

latest

Execution logs

No response

Feature request: Implement reduce step for pipelines

Use case

Pipelines can map their execution into different parallel branches. We need to implement a way to reduce branches into a single aggregate, which can be used to apply transformations to a collection of documents at once.
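A minimal sketch of the reduce semantics, with a hypothetical DocumentEvent shape standing in for Lakechain's cloud events:

```typescript
interface DocumentEvent {
  url: string;
  metadata: Record<string, unknown>;
}

// Collect the outputs of several parallel branches into one aggregate
// collection that downstream middlewares can transform at once.
function reduceBranches(branches: DocumentEvent[][]): DocumentEvent[] {
  return branches.reduce((acc, branch) => acc.concat(branch), [] as DocumentEvent[]);
}
```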

Solution/User Experience

No response

Alternative solutions

No response

Feature request: Add A/V transcoding middleware

Use case

Customers own a variety of audio and video documents that they want to process to address different use cases:

  • Transcoding audio and video documents.
  • Modifying the container of audio and video documents.
  • Generating a series of different audio and video formats suited for OTT applications.

Solution/User Experience

Provide a TranscodingProcessor middleware capable of taking input documents as a batch or sequence of documents, and outputting a collection of results based on the transcoding intent.

Alternative solutions

No response

Bug: All CloudWatch Logs must be part of middleware log group

Expected Behaviour

All CloudWatch Logs produced by a middleware should be written to that middleware's log group.

Current Behaviour

Some log groups are created outside of the control of middlewares, for example for the S3 auto-delete Lambda, or for the bucket notification custom resource created by the AWS CDK.

Code snippet

No response

Steps to Reproduce

No response

Possible Solution

No response

Project Lakechain version

latest

Execution logs

No response

Feature request: Support Ollama

Use case

Add the ability for customers to run models supported by Ollama using a unified interface.

Solution/User Experience

Provide an OllamaTextProcessor middleware that would manage the CPU/GPU infrastructure on behalf of the customer and package Ollama within a Docker container.

Alternative solutions

No response

Bug: Translate processor does not switch language

Expected Behaviour

The translate middleware should set the new language on each translated output document.

Current Behaviour

No response

Code snippet

No response

Steps to Reproduce

No response

Possible Solution

No response

Project Lakechain version

latest

Execution logs

No response

Feature request: Split the cloud event spec out of the TypeScript SDK

Use case

The TypeScript SDK currently contains the definition of cloud events. This tightly couples the cloud event specification to the TypeScript SDK.

Solution/User Experience

Separate the cloud event specification into its own repository.

Alternative solutions

No response

Feature request: Migrate Lambdas to Node.js 20

Use case

Migrate all Lambda functions runtimes to Node.js 20.

Solution/User Experience

Middlewares are still responsible for defining the Lambda runtime in this release, so we simply bump the runtime. This will require re-testing and re-benchmarking all middlewares.

Alternative solutions

No response

Feature request: Bedrock Knowledge Bases Connector

Use case

Integrate Bedrock Knowledge Bases with Lakechain to allow customers to publish their documents to a Bedrock Knowledge Base.

Solution/User Experience

No response

Alternative solutions

No response

Docs: define "production-ready"

What were you searching in the docs?

https://awslabs.github.io/project-lakechain/general/faq/

Is Project Lakechain production-ready?

Is this related to an existing documentation section?

https://awslabs.github.io/project-lakechain/general/faq/

How can we improve?

It is not clear what "production-ready" means.

What makes this project not production-ready?

What steps are required to make it production-ready?

Acknowledgment

  • I understand the final update might be different from my proposed suggestion, or refused.

Feature request: PostgreSQL connector

Use case

Using RDS Aurora PostgreSQL as a vector database.

Solution/User Experience

I'd like to use Aurora PostgreSQL with pgvector for storage.

This is also a supported storage backend for Bedrock Knowledge Bases, by the way.

Alternative solutions

Using Pinecone.

Feature request: Vector storage does not allow specifying document ID indexing logic

Use case

Today, the vector storage connector uses the document URL, or the chunk identifier if the document is a chunk, as the document identifier provided to OpenSearch when indexing. This is a problem for documents that change often, as it can lead to duplication of modified chunks in the OpenSearch storage.

Solution/User Experience

Provide a way for end users to define how they want the vector storage connector to index documents (e.g. append-only, or a potential removal of previous chunks before insertion).
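One possible shape for such an API, with hypothetical names; the default mirrors the current behaviour described above, while an append-only strategy avoids overwriting previous chunks:

```typescript
interface Document {
  url: string;
  chunkId?: string;
}

// Users supply the indexing strategy instead of the connector
// hard-coding the document URL / chunk id as the OpenSearch _id.
type IdStrategy = (doc: Document) => string;

// Default: the chunk identifier if present, otherwise the document URL.
const defaultStrategy: IdStrategy = (doc) => doc.chunkId ?? doc.url;

// Append-only: a version suffix ensures re-processed documents never
// overwrite previously indexed chunks.
const appendOnlyStrategy = (version: string): IdStrategy =>
  (doc) => `${defaultStrategy(doc)}#${version}`;
```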

Alternative solutions

No response

Feature request: Add support for removal policy

Use case

Add the ability for middlewares to support a user defined removal policy.

Solution/User Experience

Add the .withRemovalPolicy method to the Middleware API.

Alternative solutions

No response

Bug: Conditionals are bound via middleware name, not instance

Expected Behaviour

In the Middleware class, conditionals are associated with middlewares by name instead of by instance. This could cause confused conditional retrieval when using multiple instances of the same middleware.

Current Behaviour

Bind conditionals per middleware instance or per CDK resource name, both of which are guaranteed to be unique.
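The proposed instance-keyed binding can be sketched with a plain Map; the Middleware and Conditional shapes here are simplified stand-ins:

```typescript
class Middleware {
  constructor(readonly name: string) {}
}
type Conditional = (event: unknown) => boolean;

// Keying the map by the middleware object itself cannot confuse two
// instances that happen to share the same name.
const conditionals = new Map<Middleware, Conditional>();

const a = new Middleware('transform');
const b = new Middleware('transform'); // same name, distinct instance

conditionals.set(a, () => true);
conditionals.set(b, () => false);
```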

Code snippet

No response

Steps to Reproduce

No response

Possible Solution

No response

Project Lakechain version

latest

Execution logs

No response

Feature request: Leverage batch operations in middlewares

Use case

Batch operations improve the performance of middlewares across multiple use cases (GPU batch operations, API calls with multiple elements, etc.). Middlewares today do not fully leverage batch operations, and a refactor of several middlewares would be required to improve throughput and performance.

Solution/User Experience

Middlewares can leverage batch operations when consuming messages from their input SQS queue. We want to refactor some middlewares to drop the use of partial responses and leverage the batch of input messages.
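The idea can be sketched as follows, with a simplified message shape; a single batched call to the transform (e.g. a GPU inference or a multi-element API call) amortizes per-call overhead:

```typescript
interface SqsMessage {
  messageId: string;
  body: string;
}

// Process a whole SQS batch in one call instead of one message at a
// time, returning results keyed by message id.
function processBatch(
  messages: SqsMessage[],
  transform: (bodies: string[]) => string[]
): Map<string, string> {
  const results = transform(messages.map((m) => m.body));
  return new Map(
    messages.map((m, i) => [m.messageId, results[i]] as [string, string])
  );
}
```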

Alternative solutions

No response

Feature request: Implement an E-mail text processor

Use case

Parse e-mails at scale, including .eml and .msg documents.

Solution/User Experience

Narrative

The e-mail text processor makes it easy to extract the textual content of e-mail documents and pipe it to other middlewares for further processing. This middleware can extract text, HTML, and structured JSON from e-mail documents.

Alternative solutions

No response
