
This project forked from stefen-taime/moderndataengineerpipeline


Building a Robust Data Pipeline: Integrating Proxy Rotation, Kafka, MongoDB, Redis, Logstash, Elasticsearch, and MinIO for Efficient Web Scraping

Home Page: https://medium.com/@stefentaime_10958/moderndataengineering-building-a-robust-data-pipeline-integrating-proxy-rotation-kafka-mongodb-9a908d1bd94f


ModernDataEngineerPipeline's Introduction

ModernDataEngineering: Building a Robust Data Pipeline Integrating Proxy Rotation, Kafka, MongoDB, Redis, Logstash, Elasticsearch, and MinIO for Efficient Web Scraping

Utilizing Proxies and User-Agent Rotation

Proxies and Rotating User Agents: To overcome anti-scraping measures, our system uses a combination of proxies and rotating user agents. Proxies mask the scraper’s IP address, making it difficult for websites to detect and block them. Additionally, rotating user-agent strings further disguises the scraper, simulating requests from different browsers and devices.
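As a minimal sketch of this rotation pattern using the requests library (the user-agent strings and proxy addresses below are placeholders, not values from this repository):

# Illustrative only: rotate proxies and user agents per request.
# The USER_AGENTS and PROXIES values are placeholders.
import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]

def fetch(url: str) -> requests.Response:
    """Issue a request through a random proxy with a random user agent."""
    proxy = random.choice(PROXIES)
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(
        url,
        headers=headers,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )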

Storing Proxies in Redis: Valid proxies are crucial for uninterrupted scraping. Our system stores and manages these proxies in a Redis database. Redis, known for its high performance, acts as an efficient, in-memory data store for managing our proxy pool. This setup allows quick access and updating of proxy lists, ensuring that our scraping agents always have access to working proxies.
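A Redis-backed proxy pool can be sketched like this with redis-py; the key name "proxies" and the connection settings are assumptions rather than values from this repo:

# Illustrative sketch of a Redis-backed proxy pool using redis-py.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_proxy(proxy: str) -> None:
    # A Redis set gives O(1) membership checks and no duplicates.
    r.sadd("proxies", proxy)

def get_random_proxy():
    # SRANDMEMBER returns a random proxy without removing it from the pool.
    return r.srandmember("proxies")

def drop_proxy(proxy: str) -> None:
    # Evict a proxy once it starts failing.
    r.srem("proxies", proxy)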

RSS Feed Extraction and Kafka Integration

Extracting News from RSS Feeds: The system is configured to extract news from various RSS feeds. RSS, a web feed format that lets users and applications receive website updates in a standardized, computer-readable form, is an excellent source for automated news aggregation.

Quality Validation and Kafka Integration: Once the news is extracted, its quality is validated. The validated news data is then published to a Kafka topic (topic A). Kafka, a distributed streaming platform, is used here for its ability to handle high-throughput data feeds, ensuring efficient and reliable data transfer.
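A condensed sketch of this extract-validate-publish flow might look like the following; the feedparser and kafka-python libraries, the feed URL, and the topic name "news" are assumptions, not necessarily what this repo uses:

# Illustrative sketch: pull entries from an RSS feed, apply a simple
# quality check, and publish the survivors to a Kafka topic.
import json
import feedparser
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

feed = feedparser.parse("https://feeds.bbci.co.uk/news/rss.xml")
for entry in feed.entries:
    item = {
        "_id": entry.get("id") or entry.get("link"),
        "title": entry.get("title"),
        "description": entry.get("summary"),
        "link": entry.get("link"),
    }
    # Minimal quality gate: drop items missing a title or a link.
    if item["title"] and item["link"]:
        producer.send("news", value=item)

producer.flush()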

Data Flow and Storage

MongoDB Integration with Kafka Connect: Kafka Connect Mongo Sink consumes data from Kafka topic A and stores it in MongoDB.

MongoDB, a NoSQL database, is ideal for handling large volumes of unstructured data. The upsert functionality, based on the _id field, ensures that the data in MongoDB is current and avoids duplicates.
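For reference, a sink configuration of this kind typically looks like the payload below, ready to POST to the Connect REST API as in step 7 of the startup guide. Property names follow the official MongoDB Kafka Connector; the topic, database, and collection names are assumptions:

# Hedged example of a MongoDB sink connector configuration.
mongo_sink_config = {
    "name": "mongo-sink",
    "config": {
        "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
        "topics": "news",
        "connection.uri": "mongodb://mongo:27017",
        "database": "news",
        "collection": "articles",
        # Take _id from the message value so writes upsert on _id.
        "document.id.strategy":
            "com.mongodb.kafka.connect.sink.processor.id.strategy.ProvidedInValueStrategy",
    },
}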

Data Accessibility in FastAPI: The collected data in MongoDB is made accessible through FastAPI with OAuth 2.0 authentication, providing secure and efficient access for administrators.
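A minimal sketch of such an OAuth 2.0-protected FastAPI endpoint, with placeholder token validation and no real MongoDB query, could look like this:

# Illustrative only: the real api/main.py will differ.
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import OAuth2PasswordBearer

app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")

def current_user(token: str = Depends(oauth2_scheme)) -> str:
    # Placeholder check: a real app would decode and verify a JWT here.
    if token != "secret-demo-token":
        raise HTTPException(status_code=401, detail="Invalid token")
    return "admin"

@app.get("/api/v1/admin/news")
def list_news(user: str = Depends(current_user)):
    # A real implementation would query MongoDB here.
    return {"user": user, "items": []}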

Logstash and Elasticsearch Integration: Logstash monitors MongoDB replica sets for document changes, capturing these as events. These events are then indexed in Elasticsearch, a powerful search and analytics engine. This integration allows for real-time data analysis and quick search capabilities.
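Logstash handles this step in the project itself; purely to illustrate the underlying data flow, here is the equivalent expressed with pymongo change streams and the Elasticsearch Python client (database, collection, and index names are placeholders):

# Conceptual illustration only: the project uses Logstash for this step.
from pymongo import MongoClient
from elasticsearch import Elasticsearch

mongo = MongoClient("mongodb://mongo:27017")  # must be a replica set
es = Elasticsearch("http://localhost:9200")

# watch() emits an event for every insert/update/delete on the collection;
# full_document="updateLookup" includes the current document on updates.
with mongo.news.articles.watch(full_document="updateLookup") as stream:
    for change in stream:
        if change["operationType"] in ("insert", "update", "replace"):
            doc = change["fullDocument"]
            es.index(index="news", id=str(doc["_id"]), document=doc)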

Data Persistence with Kafka Connect S3-Minio Sink: To ensure data persistence, Kafka Connect S3-Minio Sink is employed. It consumes records from Kafka topic A and stores them in MinIO, a high-performance object storage system. This step is crucial for long-term data storage and backup.
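A hedged example of such an S3 sink configuration pointed at MinIO (property names follow the Confluent S3 sink connector; the bucket name, endpoint, and flush size are assumptions):

# Hedged example of an S3 sink connector configuration targeting MinIO.
s3_sink_config = {
    "name": "minio-sink",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "news",
        "s3.bucket.name": "news-archive",
        "store.url": "http://minio:9000",  # MinIO's S3-compatible endpoint
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "100",
    },
}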

Public Data Access and Search

Elasticsearch for Public Search: The data collected and indexed in Elasticsearch is made publicly accessible through FastAPI. This setup allows users to perform fast and efficient searches across the aggregated data.

Here are some example API calls and their intended functionality; a hedged sketch of each request follows the list:

  • Basic Request Without Any Parameters: returns news items with no filters applied.

  • Search with a General Keyword: searches across multiple fields (such as title, description, and author). For example, GET http://localhost:8000/api/v1/news/?search=Arsenal returns news items where "Arsenal" appears in the title, description, or author.

  • Search in a Specific Field: restricts the match to a single field, such as the title.

  • Filter by Language: limits results to a given language.

  • Combining General Search with Language Filter: applies a keyword search and a language filter at once.

  • Combining Specific Field Search with Language Filter: narrows a single-field match to one language.
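The sketch below shows what each of these six query styles might look like with the requests library; the title and language parameter names are assumptions inferred from the descriptions above, not confirmed from the API code:

# Hedged examples of the six query styles; `title` and `language` are assumed names.
import requests

BASE = "http://localhost:8000/api/v1/news/"

requests.get(BASE)                                         # basic, no parameters
requests.get(BASE, params={"search": "Arsenal"})           # general keyword
requests.get(BASE, params={"title": "Arsenal"})            # specific field
requests.get(BASE, params={"language": "en"})              # language filter
requests.get(BASE, params={"search": "Arsenal", "language": "en"})  # keyword + language
requests.get(BASE, params={"title": "Arsenal", "language": "en"})   # field + language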

ModernDataEngineerPipeline - Startup Guide

This guide provides step-by-step instructions for setting up and running the "ModernDataEngineerPipeline" project.

Setup Steps

1. Clone the Repository

Start by cloning the repository from GitHub:

git clone https://github.com/Stefen-Taime/ModernDataEngineerPipeline

2. Navigate to the Project Directory

cd ModernDataEngineerPipeline

3. Launch Services with Docker

Use docker-compose to build and start the services:

docker-compose up --build -d

3.1 Use MongoDB and Redis Clusters

You can use managed MongoDB and Redis clusters from MongoDB Atlas and Redis Cloud (free trial clusters are available), or deploy local MongoDB and Redis instances with Docker instead.

4. Navigate to the src Folder

cd src

5. Run the Proxy Handler

Execute proxy_handler.py to retrieve proxies and store them in Redis:

python proxy_handler.py

6. Handle RSS Feeds with Kafka

Use rss_handler.py to produce messages to Kafka:

python rss_handler.py

7. Add JSON Sink Connectors

Add the two JSON sink connectors found in the connect folder via Confluent Connect, or register them through the Connect REST API (a sketch follows below).
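If you go the API route, registering a connector is a single POST to the Connect REST endpoint. A hedged sketch follows; the Connect host/port and the JSON file name are assumptions:

# Hedged sketch: register a connector through the Kafka Connect REST API.
import json
import requests

CONNECT_URL = "http://localhost:8083/connectors"

with open("connect/mongo-sink.json") as f:  # hypothetical file name
    payload = json.load(f)

resp = requests.post(
    CONNECT_URL,
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
resp.raise_for_status()
print(resp.json())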

8. Launch Logstash

Run Logstash using Docker:

docker exec -it <container_id> /bin/bash -c "mkdir -p ~/logstash_data && bin/logstash -f pipeline/ingest_pipeline.conf --path.data /usr/share/logstash/logstash_data"

9. Start the API

Finally, start the API:

cd api
python main.py

Follow these steps to set up and run the "ModernDataEngineerPipeline" project.

