aws-pdf-textract-pipeline

๐Ÿ” Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS Textract. Built with AWS CDK + TypeScript.

This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.

Getting Started

Run the following commands to install dependencies, build the CDK stack, and deploy it to AWS.

yarn install
yarn build
cdk bootstrap
cdk deploy

Overview

The following is an overview of each process performed by this CDK stack.

  1. Scrape PDF download URLs from a website

    Scraping data from the COGCC website.

  2. Store PDF download URL in DynamoDB

  3. Download the PDF to S3

    A Lambda function is triggered when a new PDF download URL is written to DynamoDB.

  4. Process the PDF with AWS Textract

    A second Lambda function is triggered when a PDF has been downloaded to the S3 bucket.

  5. Process the AWS Textract results

    When AWS Textract emits an SNS event, a Lambda function is triggered to process the result.

  6. Save the processed Textract result to DynamoDB

    After the full result is pruned down to the desired data structure, the data is saved in DynamoDB.
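
As an illustration of the pruning in steps 5 and 6, here is a minimal sketch of how a raw Textract result could be reduced to a flat key-value object before it is saved. It assumes the document was analyzed with Textract's FORMS feature, so the result contains KEY_VALUE_SET blocks; the source does not say which features this pipeline uses. The function names (relatedIds, textOf, extractKeyValues) are hypothetical and are not taken from this repository.

```typescript
// Minimal subset of the Textract Block shape needed for this sketch.
interface Relationship {
  Type: string; // e.g. "CHILD" or "VALUE"
  Ids: string[];
}

interface Block {
  Id: string;
  BlockType: string; // e.g. "KEY_VALUE_SET" or "WORD"
  EntityTypes?: string[]; // ["KEY"] or ["VALUE"] on KEY_VALUE_SET blocks
  Text?: string;
  Relationships?: Relationship[];
}

// Collect the block IDs a block points at for one relationship type.
function relatedIds(block: Block, type: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships || []) {
    if (rel.Type === type) {
      ids.push(...rel.Ids);
    }
  }
  return ids;
}

// Join the text of a block's WORD children. Other child types
// (such as SELECTION_ELEMENT checkboxes) are ignored in this sketch.
function textOf(block: Block, byId: { [id: string]: Block }): string {
  const words: string[] = [];
  for (const id of relatedIds(block, "CHILD")) {
    const child = byId[id];
    if (child && child.BlockType === "WORD" && child.Text) {
      words.push(child.Text);
    }
  }
  return words.join(" ");
}

// Hypothetical helper: reduce a list of Textract blocks to { key: value }.
function extractKeyValues(blocks: Block[]): { [key: string]: string } {
  const byId: { [id: string]: Block } = {};
  for (const block of blocks) {
    byId[block.Id] = block;
  }

  const result: { [key: string]: string } = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes || []).indexOf("KEY") !== -1;
    if (!isKey) {
      continue;
    }

    const key = textOf(block, byId);
    const valueParts: string[] = [];
    // A KEY block links to its VALUE block(s) through a "VALUE" relationship.
    for (const id of relatedIds(block, "VALUE")) {
      const valueBlock = byId[id];
      if (valueBlock) {
        valueParts.push(textOf(valueBlock, byId));
      }
    }
    if (key) {
      result[key] = valueParts.join(" ");
    }
  }
  return result;
}
```

The returned object is small enough to store as a single DynamoDB item, whereas the raw block list for a multi-page PDF may not be. If two keys on a form share the same text, the later one overwrites the earlier one in this sketch.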

Scripts

  • yarn install - installs dependencies
  • yarn build - builds the production-ready CDK Stack
  • yarn test - runs Jest
  • cdk bootstrap - bootstraps AWS CloudFormation for your CDK deploy
  • cdk deploy - deploys the CDK stack to AWS

Notes

  • If a PDF download URL has already been added to the pdfUrlsTable DynamoDB table, the pipeline will not re-execute for the PDF.

  • Includes tests with Jest.

  • Visual Studio Code with the Format on Save setting turned on is recommended.
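
The first note above (a URL already in pdfUrlsTable does not re-run the pipeline) can be enforced in more than one way. The sketch below shows one option, a DynamoDB conditional write, and is not necessarily how this repository does it. It assumes that the table's partition key is an attribute named url, which the source does not confirm, and the helper names buildPutIfAbsent and isDuplicateWrite are hypothetical. The returned object has the shape of the parameters accepted by a DynamoDB document-client put call.

```typescript
// Hypothetical item shape; the real table schema is not shown in the source.
interface PdfUrlItem {
  url: string;
  createdAt: string;
}

// Shape of the put parameters built below.
interface PutIfAbsentParams {
  TableName: string;
  Item: PdfUrlItem;
  ConditionExpression: string;
  ExpressionAttributeNames: { [placeholder: string]: string };
}

// Build put parameters that only succeed when no item with this key exists.
// Assumption: "url" is the partition key of the table.
function buildPutIfAbsent(
  tableName: string,
  url: string,
  now: Date
): PutIfAbsentParams {
  return {
    TableName: tableName,
    Item: { url, createdAt: now.toISOString() },
    // attribute_not_exists on the key attribute rejects the write if an
    // item with the same key is already stored.
    ConditionExpression: "attribute_not_exists(#url)",
    // A placeholder avoids clashes with DynamoDB reserved words.
    ExpressionAttributeNames: { "#url": "url" },
  };
}

// A rejected conditional write surfaces as ConditionalCheckFailedException.
// Depending on the SDK version the identifier is on `name` or `code`.
function isDuplicateWrite(err: { name?: string; code?: string }): boolean {
  const expected = "ConditionalCheckFailedException";
  return err.name === expected || err.code === expected;
}
```

With this approach a duplicate URL never produces a new item, so the downstream Lambda functions have nothing new to react to. The caller would treat an error for which isDuplicateWrite returns true as "already processed" rather than as a failure.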

Built with

Additional Resources

License

Open source under the MIT License.

Built with ❤️ by aeksco
