reddit-etl

Description

A data pipeline for extracting and visualizing basic activity in the r/ukraine Reddit community.

Project Architecture

  • Data extracted with PRAW: The Python Reddit API Wrapper
  • Visualized with Tableau and Google Data Studio
  • Orchestrated with Airflow in Docker
  • Hosted on Raspberry Pi 3B

Project Output

Project setup

Setup Airflow on Docker

(details here)

  • Download docker-compose.yaml

    curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.2/docker-compose.yaml'

  • Add the required libraries to docker-compose.yaml

    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-praw pandas pygsheets}

    This is a development/test feature only. It should never be used in production (build a custom image instead, as described in the docs), but it is enough for this project.

  • In the project/airflow directory run

    mkdir -p ./dags ./logs ./plugins
  • Run the following command to create a .env file next to docker-compose.yaml:

    echo -e "AIRFLOW_UID=$(id -u)" > .env
  • Initialize the database to run database migrations and create the first user account

    docker-compose up airflow-init

    The created account has the login airflow and the password airflow

  • Now start Airflow

    docker-compose up

    The Airflow web server is available at http://localhost:8080
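Once the containers are up, it can be handy to confirm that Airflow is ready before opening the UI. The following is a minimal sketch, assuming the web server exposes its standard /health endpoint on the same port and that the response reports a status for the metadatabase and scheduler components. The function names are illustrative and not part of this project.

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Return True when both the metadatabase and the scheduler report 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def fetch_health(base_url: str = "http://localhost:8080") -> dict:
    """Fetch and decode the JSON document served at <base_url>/health."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires the Airflow containers from the steps above to be running.
    print("Airflow healthy:", is_healthy(fetch_health()))
```

Keeping the check separate from the HTTP call makes it easy to reuse in a retry loop while the containers are still starting.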

Setup Reddit App

  • First, you need an active Reddit account. If you don't have one, create one
  • Go to reddit.com/prefs/apps. If that link doesn't work for you, try old.reddit.com/prefs/apps/
  • Select create another app. Make sure you select the script option
  • Fill in the description and optional fields, then click create app
  • Next, you will see your client id and secret. These values are needed in the next step
  • Create a pipeline.conf file in the dags/extraction directory with the credentials from the previous step
    [reddit]
    client_id = your_script_client_id
    client_secret = your_script_client_secret
    
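As an illustration of how these credentials might be consumed, here is a minimal sketch that reads pipeline.conf with the standard library and builds a read-only PRAW client. The function names, the user agent string, and the selected submission fields are assumptions made for this example and may differ from what the project's DAG actually does.

```python
import configparser
from datetime import datetime, timezone


def load_reddit_credentials(conf_path: str) -> dict:
    """Read client_id and client_secret from the [reddit] section of the file."""
    parser = configparser.ConfigParser()
    if not parser.read(conf_path):
        raise FileNotFoundError(conf_path)
    section = parser["reddit"]
    return {
        "client_id": section["client_id"],
        "client_secret": section["client_secret"],
    }


def build_reddit_client(credentials: dict, user_agent: str = "reddit-etl sketch"):
    """Create a read-only PRAW client (user_agent value is a placeholder)."""
    import praw  # imported lazily so the config helper works without PRAW

    return praw.Reddit(
        client_id=credentials["client_id"],
        client_secret=credentials["client_secret"],
        user_agent=user_agent,
    )


def submission_to_row(submission) -> dict:
    """Flatten a submission into a plain dict (field choice is illustrative)."""
    created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
    return {
        "id": submission.id,
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created_utc": created.isoformat(),
    }


if __name__ == "__main__":
    # Needs network access and valid credentials in pipeline.conf.
    credentials = load_reddit_credentials("dags/extraction/pipeline.conf")
    reddit = build_reddit_client(credentials)
    rows = [submission_to_row(s) for s in reddit.subreddit("ukraine").hot(limit=10)]
    print(rows)
```

The resulting list of dicts can be passed directly to pandas.DataFrame for further processing.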

Google Sheets API Setup

  • Go to Google Sheets and create a new sheet. Leave it empty
  • Create a project if you don't already have one
  • Next, create a service account from Google API Console
  • Search for Google Drive API (Sheets API) and enable it
  • Then go to Credentials and click + CREATE CREDENTIALS > Service Account

  • Click on your newly created account and go to the Keys tab

  • Click Add Key > Create New Key

  • Select JSON as the key type and click CREATE

  • Save the key to the project folder
  • Share the sheet with the email address from the client_email field of the key file
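To illustrate the last two steps, here is a minimal sketch that reads the service account address from the downloaded JSON key and writes a DataFrame to the shared sheet with pygsheets. The function names, the key path, and the sheet title are placeholders for this example, and the project's own loading code may be organized differently.

```python
import json


def service_account_email(key_path: str) -> str:
    """Return the client_email value from a service account JSON key file."""
    with open(key_path, encoding="utf-8") as key_file:
        return json.load(key_file)["client_email"]


def upload_dataframe(key_path: str, sheet_title: str, frame) -> None:
    """Write a pandas DataFrame to the first worksheet, starting at cell A1."""
    import pygsheets  # imported lazily so the helper above works without it

    client = pygsheets.authorize(service_file=key_path)
    worksheet = client.open(sheet_title).sheet1
    worksheet.set_dataframe(frame, (1, 1))


if __name__ == "__main__":
    # "service_account.json" is a placeholder path for the key saved above.
    print("Share the sheet with:", service_account_email("service_account.json"))
```

Opening the sheet fails with a not-found error if it has not been shared with the printed address, since the service account only sees files it has been given access to.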

Credentials

  • praw
  • pandas
  • pygsheets

The idea was taken from https://github.com/ABZ-Aaron/Reddit-API-Pipeline
