TalkSearch scraper

This scraper is a command-line tool that extracts information from YouTube playlists and pushes it to Algolia.

Usage

yarn index {config_name}

How it works

The ./configs/ folder contains custom configs, each containing a list of playlists to index.
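As a rough illustration, a config could look like the sketch below. The exact shape of a config file is not described here, so the playlists field name, the placeholder playlist IDs and the isValidConfig helper are all assumptions made for the example.

```javascript
// Hypothetical config, as it might appear in ./configs/my_conference.js.
// The `playlists` field name and the IDs are placeholders, not the real schema.
const config = {
  playlists: [
    'PLxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'PLyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy',
  ],
};

// Hypothetical helper: a config is usable if it lists at least one
// non-empty playlist ID.
function isValidConfig(candidate) {
  return (
    Array.isArray(candidate.playlists) &&
    candidate.playlists.length > 0 &&
    candidate.playlists.every((id) => typeof id === 'string' && id.length > 0)
  );
}

console.log(isValidConfig(config));
```

In a real config file the object would be exported (for example with module.exports) so that the command can load it by name.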

The command will use the YouTube API to fetch data about the defined playlists and push them to Algolia.

Captions will be extracted from the videos if they are available. Each record in Algolia will represent one caption, and will also contain a .video, .playlist and .channel key. Algolia's distinct feature is used to group records of the same video together, so that only the most relevant caption is displayed for each video.
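The record layout described above can be sketched as follows. This is not the scraper's actual code: buildRecords, the caption fields (start, content) and the choice of video.id as the grouping attribute are assumptions. Only distinct and attributeForDistinct are real Algolia index settings.

```javascript
// Build one record per caption. Each record repeats the video, playlist and
// channel metadata so that it can be searched and displayed on its own.
function buildRecords(video, playlist, channel, captions) {
  return captions.map((caption) => ({
    video,
    playlist,
    channel,
    caption, // assumed shape: { start, content }
  }));
}

// Index settings that group records belonging to the same video.
// Grouping on `video.id` is an assumption about the record schema.
const settings = {
  attributeForDistinct: 'video.id',
  distinct: true,
};

const records = buildRecords(
  { id: 'abc123', title: 'Opening keynote' },
  { id: 'PL001', title: 'Conference 2018' },
  { id: 'UC001', title: 'Conference channel' },
  [
    { start: 0, content: 'Welcome everyone' },
    { start: 12, content: 'Let us get started' },
  ]
);

console.log(records.length, settings);
```

With distinct enabled, a search that matches several captions of the same video returns a single hit for that video, the best-ranked one.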

Each channel will have its own index called {channel_name}_{channel_id}. All videos of all playlists will be saved in this index, but can be filtered based on the channel.id and playlist.id keys of the records.
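A minimal sketch of the naming and filtering scheme, under stated assumptions: indexNameFor and playlistFilter are hypothetical helpers, and the scraper may normalize the channel name before building the index name. The filters string uses Algolia's attribute:"value" filter syntax.

```javascript
// Index name as described above: {channel_name}_{channel_id}.
// Any normalization of the channel name is left out of this sketch.
function indexNameFor(channel) {
  return `${channel.name}_${channel.id}`;
}

// Search parameters restricting results to a single playlist.
function playlistFilter(playlistId) {
  return { filters: `playlist.id:"${playlistId}"` };
}

console.log(indexNameFor({ name: 'myconf', id: 'UC001' }));
console.log(playlistFilter('PL001'));
```

For such a filter to work in Algolia, the filtered attribute generally has to be declared in the index's attributesForFaceting setting.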

Development

Start with yarn install to load all the dependencies.

The project will need ENV variables to connect to the services.

  • ALGOLIA_APP_ID and ALGOLIA_API_KEY for pushing records to Algolia
  • YOUTUBE_API_KEY to connect to the YouTube API

We suggest using a tool like direnv to load those variables through the use of a .envrc file.
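Before running the command, it can help to check that the three variables are set. The sketch below is one way to do that; the missingEnv helper is hypothetical and not part of the project.

```javascript
// The three variables listed above.
const REQUIRED_ENV = ['ALGOLIA_APP_ID', 'ALGOLIA_API_KEY', 'YOUTUBE_API_KEY'];

// Return the names of the required variables that are unset or empty.
function missingEnv(env = process.env) {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

// Example with an explicit object instead of the real environment.
const sample = { ALGOLIA_APP_ID: 'app', YOUTUBE_API_KEY: 'key' };
console.log(missingEnv(sample));
```

Calling missingEnv() with no argument checks the real process environment, which is what direnv populates from the .envrc file.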

Once everything is installed, you should be able to run yarn index {youtube_url}

Options

--log

This flag should be used when debugging an indexing run. It writes the responses of all HTTP calls to disk (in the ./logs directory), which is useful for analysing the calls.

--to-cache and --from-cache

When --to-cache is present, the data obtained from the YouTube API will be saved on disk instead of being transformed into records and pushed to Algolia.

When --from-cache is present, data will be read directly from disk (as saved with --to-cache) instead of being obtained through the API before being sent to Algolia.

Together, these two options let you debug the crawling of data and the transformation of data into records independently of each other.
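The two-stage flow can be sketched as follows. This is an illustration only: the cache location, the file naming, and the writeCache, readCache and run helpers are assumptions, and fetchFromApi and pushRecords stand in for the real crawling and indexing steps.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical cache location; the real tool's path may differ.
const cacheDir = path.join(os.tmpdir(), 'talksearch-cache-sketch');

function cachePath(name) {
  return path.join(cacheDir, `${name}.json`);
}

// Save crawled data to disk as JSON.
function writeCache(name, data) {
  fs.mkdirSync(cacheDir, { recursive: true });
  fs.writeFileSync(cachePath(name), JSON.stringify(data, null, 2));
}

// Read previously cached data back from disk.
function readCache(name) {
  return JSON.parse(fs.readFileSync(cachePath(name), 'utf8'));
}

// Stage 1 (crawl) and stage 2 (transform and push) can be run separately:
// first with toCache, then with fromCache.
async function run(options, name, fetchFromApi, pushRecords) {
  const data = options.fromCache ? readCache(name) : await fetchFromApi(name);
  if (options.toCache) {
    writeCache(name, data);
    return; // stop before transforming and pushing
  }
  await pushRecords(data);
}
```

A first run with toCache set exercises only the crawling step; a second run with fromCache set replays the saved data through the push step without calling the API again.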
