
Assignment2

Live Application Links

Google Colab codelabs

Problem Statement

This project aims to streamline access to finance professional development resources by developing a data engineering solution that structures and aggregates content from the CFA Institute’s website. The outcome will support an intelligent application to enhance finance professionals' learning experiences, demonstrated through comprehensive documentation and a GitHub repository showcasing the workflow and results.

Project Goals

Key tasks include:

  1. Automated Web Scraping: Extract and structure information from 224 refresher readings into a CSV dataset.
  2. PDF Text Extraction: Utilize PyPDF2 and Grobid to extract text from PDF files, organizing it into structured text files.
  3. Database Integration: Upload the structured data to a Snowflake database.
  4. Cloud Storage: Store the dataset and text files in an AWS S3 bucket.

Technologies Used

Scrapy, Playwright, GROBID, Python, Snowflake, AWS, SQLAlchemy, PyPDF2

Data Sources

The web pages are scraped from this main page:

https://www.cfainstitute.org/en/membership/professional-development/refresher-readings#first=200&sort=%40refreadingcurriculumyear%20descending

The PDF archives can be found at Assignment2/Datasets/Sample_PDFs/

Prerequisites

  1. Python 3.6 or later from https://www.python.org/downloads
  2. AWS credentials
  3. Snowflake credentials
  4. Docker

Project Structure

├─ .gitignore
├─ Diagrams
│  ├─ Architecture Diagram.py
├─ Datasets
│  ├─ CFA.csv
│  ├─ Grobid
│  │  ├─ Grobid_RR_2024_l1_combined.txt
│  │  ├─ Grobid_RR_2024_l1_combined.xml
│  │  ├─ Grobid_RR_2024_l2_combined.txt
│  │  ├─ Grobid_RR_2024_l2_combined.xml
│  │  ├─ Grobid_RR_2024_l3_combined.txt
│  │  ├─ Grobid_RR_2024_l3_combined.xml
│  │  └─ metadata.csv
│  ├─ PyPDF
│  │  ├─ PyPDF_RR_2024_l1_combined.txt
│  │  ├─ PyPDF_RR_2024_l2_combined.txt
│  │  └─ PyPDF_RR_2024_l3_combined.txt
│  └─ Sample_PDFs
│     ├─ 2024-l1-topics-combined-2.pdf
│     ├─ 2024-l2-topics-combined-2.pdf
│     └─ 2024-l3-topics-combined-2.pdf
├─ Pwspider
│  ├─ Pwspider
│  │  ├─ WebScrapingandDatasetCreation.ipynb
│  │  ├─ __init__.py
│  │  ├─ items.py
│  │  ├─ middlewares.py
│  │  ├─ pipelines.py
│  │  ├─ settings.py
│  │  └─ spiders
│  │     ├─ CFA.csv
│  │     ├─ DEBUG.log
│  │     ├─ __init__.py
│  │     ├─ pwspidey.py
│  │     └─ requirements.txt
│  └─ scrapy.cfg
├─ README.md
└─ notebooks
   ├─ Cloud_storage_integration.ipynb
   ├─ PDF Extraction.ipynb
   ├─ SQLAlchemy_Snowflake_Database.py
   └─ WebScrapingandDatasetCreation.ipynb


How to run Application locally

To run Web Scraping locally follow these steps:

  1. Create a virtual environment on local setup and activate it.

    CLI macOS:
      python3 -m venv .venv
      source .venv/bin/activate

    CLI Windows:
      py -m venv .venv
      .venv\Scripts\activate

  2. Install the requirements (Scrapy and Playwright) from requirements.txt in the path Assignment2/Pwspider/Pwspider/spiders/. CLI on Terminal: pip install -r requirements.txt. You may also need to run playwright install once to download the browser binaries Playwright uses.

  3. To run the spider on the terminal:

    • Navigate to the folder Assignment2/Pwspider/Pwspider/spiders/
    • Run the CLI command: "scrapy crawl pwspidey -o CFA.csv -s LOG_FILE=DEBUG.log", where pwspidey is the name of the spider, CFA.csv is the file the scraped data is written to, and DEBUG.log is the debug log for the run. (A minimal sketch of the spider is shown after these steps.)

    To run the spider on Notebook environment:

    • Launch Jupyter notebook.
    • Navigate to Assignment2/Pwspider/Pwspider/ and open WebScrapingandDatasetCreation.ipynb
    • Change the notebook kernel to the scraping virtual environment.
    • Run the notebook.
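
    For reference, the sketch below shows roughly what a Scrapy + Playwright spider such as pwspidey can look like. The CSS selectors, field names, and settings are illustrative assumptions, not the exact markup or code used against the CFA Institute page; treat it as a minimal sketch rather than the actual spider.

    # pwspidey_sketch.py -- minimal sketch of a Scrapy + Playwright spider;
    # selectors and field names are placeholders, not the real spider's code.
    import scrapy

    class PwSpidey(scrapy.Spider):
        name = "pwspidey"
        start_url = (
            "https://www.cfainstitute.org/en/membership/professional-development/"
            "refresher-readings"
        )

        custom_settings = {
            # Route requests through Playwright so JavaScript-rendered results load.
            "DOWNLOAD_HANDLERS": {
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
        }

        def start_requests(self):
            yield scrapy.Request(self.start_url, meta={"playwright": True})

        def parse(self, response):
            # Selector classes are illustrative; the real spider targets the page's actual markup.
            for reading in response.css("div.coveo-result-frame"):
                yield {
                    "title": reading.css("a.CoveoResultLink::text").get(),
                    "url": reading.css("a.CoveoResultLink::attr(href)").get(),
                    "year": reading.css("span.year::text").get(),
                }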

Running Grobid Locally with Docker:

  1. Install Docker: https://docs.docker.com/desktop/install/windows-install/

  2. Choose a Grobid Docker image: the lightweight image lfoppiano/grobid is used here.

  3. Start the Grobid Docker container:

docker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:latest
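
Once the container is running, PDFs can be sent to Grobid's processFulltextDocument REST endpoint, which returns the parsed document as TEI XML. The snippet below is a minimal sketch assuming the container listens on localhost:8070; the input and output file paths are illustrative.

    # grobid_extract_sketch.py -- minimal sketch; assumes the Grobid container is
    # listening on localhost:8070; input/output paths are illustrative.
    import requests

    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    with open("Datasets/Sample_PDFs/2024-l1-topics-combined-2.pdf", "rb") as pdf:
        # Grobid parses the PDF and returns the document as TEI XML.
        response = requests.post(GROBID_URL, files={"input": pdf}, timeout=300)
    response.raise_for_status()

    with open("Grobid_RR_2024_l1_combined.xml", "w", encoding="utf-8") as out:
        out.write(response.text)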

PyPDF2: Requirements

   pip install PyPDF2
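
For reference, a minimal sketch of the PyPDF2 extraction step is shown below; the file names mirror the repository layout, but the exact code in PDF Extraction.ipynb may differ.

    # pypdf_extract_sketch.py -- minimal sketch of page-by-page text extraction;
    # file names mirror the repository layout but are illustrative.
    from PyPDF2 import PdfReader

    reader = PdfReader("Datasets/Sample_PDFs/2024-l1-topics-combined-2.pdf")

    # Concatenate the text of every page into one combined text file.
    with open("PyPDF_RR_2024_l1_combined.txt", "w", encoding="utf-8") as out:
        for page in reader.pages:
            out.write(page.extract_text() or "")
            out.write("\n")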

Snowflake Upload: Requirements

  1. SQLAlchemy (with the snowflake-sqlalchemy dialect for connecting to Snowflake)
  2. Pandas

Run the Python script SQLAlchemy_Snowflake_Database.py in the path Assignment2/notebooks/ to upload the dataset to the Snowflake database.
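
A minimal sketch of the upload step is shown below; it assumes the snowflake-sqlalchemy dialect is installed, and the account, warehouse, database, and table names are placeholders rather than the project's actual configuration.

    # snowflake_upload_sketch.py -- minimal sketch; account, warehouse, database
    # and table names are placeholders, not the project's real credentials.
    import pandas as pd
    from snowflake.sqlalchemy import URL
    from sqlalchemy import create_engine

    engine = create_engine(URL(
        account="your_account",
        user="your_user",
        password="your_password",
        database="CFA_DB",
        schema="PUBLIC",
        warehouse="COMPUTE_WH",
    ))

    df = pd.read_csv("Datasets/CFA.csv")
    # Write the scraped dataset into a Snowflake table, replacing it if it exists.
    df.to_sql("cfa_readings", con=engine, if_exists="replace", index=False)
    engine.dispose()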

AWS S3 upload: Requirements

  1. S3 credentials

Run the Jupyter notebook Cloud_storage_integration.ipynb in the path Assignment2/notebooks/ to upload the dataset and extracted text files to AWS S3.
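
A minimal sketch of the S3 upload is shown below, assuming boto3 is used for the transfer and that AWS credentials are already configured (e.g. via environment variables or ~/.aws/credentials); the bucket and key names are placeholders.

    # s3_upload_sketch.py -- minimal sketch; assumes AWS credentials are configured
    # via environment variables or ~/.aws/credentials; bucket/key names are placeholders.
    import boto3

    s3 = boto3.client("s3")

    # Upload the scraped dataset and one of the extracted text files.
    s3.upload_file("Datasets/CFA.csv", "your-bucket-name", "datasets/CFA.csv")
    s3.upload_file(
        "Datasets/Grobid/Grobid_RR_2024_l1_combined.txt",
        "your-bucket-name",
        "grobid/Grobid_RR_2024_l1_combined.txt",
    )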

References

https://www.zenrows.com/blog/scrapy-playwright#why-use-playwright-with-scrapy
https://github.com/kermitt2/grobid
https://pypdf.readthedocs.io/en/stable/
https://diagrams.mingrammer.com/
https://github.com/ashrithagoramane/DAMG7245-Spring24/tree/main/repository_structure

Learning Outcomes

  1. Web scraping tools and techniques
  2. PDF Extraction Tools and Techniques
  3. SQL and AWS S3 upload methods

Team Information and Contribution

Name                    Contribution %   Contributions
Nidhi Nitin Kulkarani   35%              Web Scraping and Dataset Creation, Documentation, README
Riya Singh              35%              PDF Extraction, Grobid local installation and PDF extraction via Grobid and PyPDF2, Documentation, README
Deepakraja Rajendran    30%              Snowflake and Amazon S3 database upload, Cloud Storage Integration Diagram
