
cs-498-project's Introduction

ETL for YouTube Captions

TODO

  • DynamoDB schema to be updated

    • store video ID, video tags
  • ETL script

    • two files
      • txn file: CSV file with columns "videoId", "list of words"
  • Create a small front-end (REST API endpoint)

    • Given one or more keywords
    • Returns (JSON) a list of matching videos
  • make sure that the topics are meaningful

  • Finalize the project report

This code downloads captions from YouTube using the YouTube Data API and saves them in a local DynamoDB instance.

We use Docker and Docker-Compose to ensure that the environment is reproducible. Please make sure you have both Docker and Docker Compose installed on your local machine before proceeding further.

1. Build docker image

$ ./build_docker_image.sh

This creates three Docker images:

  • cs498_etl:latest

    This image is used to perform ETL operations such as downloading YouTube video details and captions.

  • cs498_ds:latest

    This image is used to perform data science activities such as training a predictive model and running inference.

  • cs498_api:latest

    This image provides a REST API so that users can query videos by label.
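As a rough illustration of the kind of lookup the API image serves, the sketch below matches keywords against labels and returns JSON. The in-memory `VIDEO_TAGS` mapping, the sample video IDs, and the `find_videos` name are hypothetical; the real service would read the label and video relation from DynamoDB rather than from a dictionary.

```python
import json

# Hypothetical sample data: label -> video IDs. The real service would
# read this relation from the 'video_tags' table instead.
VIDEO_TAGS = {
    "python": ["vid001", "vid002"],
    "docker": ["vid002", "vid003"],
}


def find_videos(keywords, video_tags=VIDEO_TAGS):
    """Return a JSON string listing videos that match any of the keywords."""
    matches = set()
    for keyword in keywords:
        # Labels are compared case-insensitively; unknown keywords match nothing.
        matches.update(video_tags.get(keyword.strip().lower(), []))
    return json.dumps({"keywords": list(keywords), "videos": sorted(matches)})


print(find_videos(["Python", "docker"]))
```

Returning a sorted, de-duplicated list keeps the response stable when a video carries more than one matching label.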

2. Create tables in local dynamodb instance

$ ./create_tables.sh

This script creates two tables in DynamoDB:

  • Table 'videos'

    This table is used during ETL operations to store the list of YouTube videos to be processed.

  • Table 'video_tags'

    This table is used during ETL, after model inference, to store the relation between labels and videos.
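For readers who prefer Python over the shell script, the two tables could be defined roughly as follows. This is only a sketch: the table names come from the list above, but the key attribute names (`video_id`, `tag`), the throughput values, and the helper names are assumptions, since the schema is still marked as to be updated. It also assumes `boto3` is installed and that a region and (dummy) credentials are configured, as they would be for the `aws` CLI.

```python
def table_definition(table_name, hash_key, range_key=None):
    """Build keyword arguments for a DynamoDB create_table call."""
    key_schema = [{"AttributeName": hash_key, "KeyType": "HASH"}]
    attributes = [{"AttributeName": hash_key, "AttributeType": "S"}]
    if range_key is not None:
        key_schema.append({"AttributeName": range_key, "KeyType": "RANGE"})
        attributes.append({"AttributeName": range_key, "AttributeType": "S"})
    return {
        "TableName": table_name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attributes,
        # Placeholder capacity values for a local instance.
        "ProvisionedThroughput": {"ReadCapacityUnits": 1, "WriteCapacityUnits": 1},
    }


# Key names are assumed for illustration; the actual schema may differ.
TABLES = [
    table_definition("videos", "video_id"),
    table_definition("video_tags", "tag", "video_id"),
]


def create_tables(endpoint_url="http://localhost:8000"):
    """Create the tables on the local DynamoDB endpoint."""
    import boto3  # imported lazily so the definitions can be inspected without boto3

    client = boto3.client("dynamodb", endpoint_url=endpoint_url)
    for definition in TABLES:
        client.create_table(**definition)
```

Calling `create_tables()` with the local instance running would issue one `create_table` request per definition.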

3. Download captions

To download video captions using the YouTube Data API, you'll need an API key from Google. Please refer to the YouTube Data API Overview.

$ ./download_captions.sh [API_KEY] [PLAYLIST_ID]

This will save captions inside the ./captions folder.

./etl_captions/main.py has three variables you might want to adjust:

  • PLAYLIST_ID

This is the ID of the playlist you want to download videos from.

  • MAX_RESULTS

The number of results requested per YouTube API call (maximum value 50).

  • MAX_TRIES

Maximum number of requests made to the YouTube API (this avoids infinite loops).
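How MAX_RESULTS and MAX_TRIES interact can be sketched as a bounded pagination loop. This is not the code in main.py: `collect_items`, `fetch_page`, and `fake_fetch` are hypothetical names, the value 10 for MAX_TRIES is an example, and the fake fetcher only imitates the `items` / `nextPageToken` shape of a paginated playlist response so the loop can run without an API key.

```python
MAX_RESULTS = 50  # page size; the API caps this at 50
MAX_TRIES = 10    # upper bound on requests, to avoid infinite loops (example value)


def collect_items(fetch_page, max_results=MAX_RESULTS, max_tries=MAX_TRIES):
    """Collect items page by page until no pages remain or max_tries is hit.

    fetch_page is a hypothetical callable taking (max_results, page_token)
    and returning a dict with "items" and an optional "nextPageToken".
    """
    items, page_token = [], None
    for _ in range(max_tries):
        response = fetch_page(max_results, page_token)
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return items


def fake_fetch(max_results, page_token):
    """Stand-in for the API: serves 120 fake items in pages."""
    start = int(page_token or 0)
    end = min(start + max_results, 120)
    page = {"items": list(range(start, end))}
    if end < 120:
        page["nextPageToken"] = str(end)
    return page


print(len(collect_items(fake_fetch)))
```

With these settings the loop stops after the third page; lowering `max_tries` truncates the result instead of looping indefinitely.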

Creating an unsupervised model

Explore data

We would like to find out which words occur in the transcriptions and which ones are relevant to NLP-based unsupervised learning.

To do so, we first launch the Jupyter notebook:

$ ./explore_topic_models.sh

This prints a URL to the console that can be used to connect to the notebook:

[I 00:22:05.940 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 00:22:08.484 NotebookApp] Serving notebooks from local directory: /app/notebooks
[I 00:22:08.484 NotebookApp] The Jupyter Notebook is running at:
[I 00:22:08.484 NotebookApp] http://(2e118104df80 or 127.0.0.1):8080/?token=28d18090a6c3ddfe2bf364b471cbc7ab70c7cadfe53231b2
[I 00:22:08.485 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 00:22:08.493 NotebookApp] No web browser found: could not locate runnable browser.
[C 00:22:08.493 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://(2e118104df80 or 127.0.0.1):8080/?token=28d18090a6c3ddfe2bf364b471cbc7ab70c7cadfe53231b2

Use the URL to access the notebook and open test1.ipynb from the notebooks folder.
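A first exploration step of this kind might look like the sketch below, which counts word occurrences across captions. It is not taken from test1.ipynb: the stop-word list, the sample captions, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "we", "this"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def word_counts(captions):
    """Count token occurrences across a mapping of video ID -> caption text."""
    counts = Counter()
    for text in captions.values():
        counts.update(tokenize(text))
    return counts


# Made-up captions standing in for downloaded transcripts.
captions = {
    "vid001": "In this video we train a model and evaluate the model.",
    "vid002": "We deploy the model to a server.",
}
print(word_counts(captions).most_common(3))
```

Words that dominate every transcript are candidates for the stop-word list, while rarer, topic-specific words are the ones a topic model can use.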

Topic modeling

To build the TF-IDF based LSI model:

$ ./build_topic_model.sh [NUM_TOPICS] [NUM_WORDS_IN_TOPIC]
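The TF-IDF weighting that feeds the LSI step can be illustrated in plain Python. This sketch covers only the weighting, using raw term frequency and idf = log(N / df); it does not perform the LSI decomposition itself, and the library and exact formula used by build_topic_model.sh are not stated here, so treat the details as assumptions.

```python
import math
from collections import Counter


def tfidf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Uses raw term frequency and idf = log(N / df); libraries may apply
    different smoothing or normalization.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Made-up tokenized documents for illustration.
docs = [
    ["docker", "image", "build"],
    ["docker", "compose", "network"],
    ["topic", "model", "model"],
]
for doc_weights in tfidf(docs):
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Terms that appear in every document receive a weight of zero, which is why TF-IDF is a common preprocessing step before projecting documents onto a small number of topics.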

DynamoDB Queries

  • List Tables

aws dynamodb list-tables --endpoint-url http://localhost:8000

  • Show Videos

aws dynamodb scan --table-name videos --endpoint-url http://localhost:8000

  • Delete Videos Table

aws dynamodb delete-table --table-name=videos --endpoint-url http://localhost:8000
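If these queries need to be scripted, a small helper can assemble the same commands so that the local endpoint is never forgotten. The `dynamodb_command` helper is hypothetical and only builds the argument list; it uses the endpoint URL and table name shown in the commands above.

```python
import shlex

ENDPOINT_URL = "http://localhost:8000"  # local DynamoDB endpoint used above


def dynamodb_command(operation, table_name=None, endpoint_url=ENDPOINT_URL):
    """Build the argument list for an `aws dynamodb` call against the local endpoint."""
    args = ["aws", "dynamodb", operation]
    if table_name is not None:
        args += ["--table-name", table_name]
    args += ["--endpoint-url", endpoint_url]
    return args


for args in (dynamodb_command("list-tables"), dynamodb_command("scan", "videos")):
    print(shlex.join(args))  # pass `args` to subprocess.run to execute instead
```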

cs-498-project's People

Contributors

nileshmalpekar, jgarciadiaz
