ytts's Introduction

YouTube Transcript Search

Live Demo: http://www.blankadventure.com/ytts

YTTS is an exploration of semantic search over a YouTube channel. The idea was sparked by Mr Carlson's Lab, whose videos are deep dives into antique electronics repair. The videos are information-dense, and I often found myself wishing I could recall some clever method or technique for circuit debugging, noise reduction, or performance testing. Being able to semantically search the content for various topics seemed like exactly what I needed, and this is my attempt.

A few things would be needed to accomplish this: (1) scraping the channel for video transcripts; (2) transforming the transcripts; (3) storing to a database; and (4) enabling the search.

(1) Scraping the channel for video transcripts

Thankfully, YT provides transcripts for (most of?) their videos, so we don't have to actually generate those. Two packages were used: scrapetube and youtube_transcript_api, whose functionality was combined in videos.py to enable collecting a set of transcripts for a YT channel.
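As a rough illustration of how the two packages might be combined (this is a sketch, not the contents of videos.py), the snippet below lists a channel's video IDs and fetches a transcript for each. The helper names channel_video_ids, default_fetch, and collect_transcripts are hypothetical. The transcript call has changed between youtube_transcript_api releases, so the version check here is an assumption to verify against whatever version is installed.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Transcript = List[dict]


def channel_video_ids(channel_id: str, limit: Optional[int] = None) -> Iterator[str]:
    """Yield video IDs for a channel (hypothetical helper built on scrapetube)."""
    import scrapetube  # third-party: pip install scrapetube

    # scrapetube.get_channel yields one dict per video, including a "videoId" key.
    for video in scrapetube.get_channel(channel_id, limit=limit):
        yield video["videoId"]


def default_fetch(video_id: str) -> Transcript:
    """Fetch one transcript as a list of dicts (assumed to vary by library version)."""
    from youtube_transcript_api import YouTubeTranscriptApi  # third-party

    if hasattr(YouTubeTranscriptApi, "get_transcript"):
        # Older releases expose a static call that returns a list of dicts.
        return YouTubeTranscriptApi.get_transcript(video_id)
    # Newer releases use an instance method; to_raw_data() gives plain dicts.
    return YouTubeTranscriptApi().fetch(video_id).to_raw_data()


def collect_transcripts(
    video_ids: Iterable[str],
    fetch: Callable[[str], Transcript] = default_fetch,
) -> Dict[str, Transcript]:
    """Map each video ID to its transcript, skipping videos that fail."""
    transcripts: Dict[str, Transcript] = {}
    for video_id in video_ids:
        try:
            transcripts[video_id] = fetch(video_id)
        except Exception as exc:  # e.g. transcripts disabled or unavailable
            print(f"skipping {video_id}: {exc}")
    return transcripts
```

Passing the fetch function as an argument keeps the collection loop easy to exercise without network access, for example with a stub that returns canned transcripts.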

(2) Transforming the transcripts

The transcript for a single video consists of a list of dictionaries, where each dictionary contains a single (text) sentence and a start and end timestamp. It's important to note that each "sentence" is really just a string of some number of words, and isn't necessarily a sentence in the proper grammatical sense. Because of this, the contextual value of any one sentence may be limited. To improve context, we can join multiple sentences together to form larger text samples. Additionally, a rolling-window aggregator can be used to provide "overlapped" (or shared) context between samples. The start timestamp of the first sentence in each sample must be carried through this operation (the end timestamp is not used). The transcript does not include the video title, which is obtained separately and inserted as an additional piece of metadata. In videos.py, chunk_generator handles this functionality. It is a generator that yields a single sample in the form:

{'text': 'its a Panasonic radio receiver that winds up like a clock then it searches for radio stations lets do a tear down on this Ill do a circuit description then lets fix it and see how this mechanism actually works should be a lot of fun lets get started heres the Panasonic radar mtic radio that were going to be tearing down troubleshooting and were going to',
 'metadata': {'timestamp': 4.319,
  'title': 'Panasonic Radar Matic Receiver Teardown With Circuit Description, Troubleshooing, And Resurrection!',
  'video': '6B_-WznDq1g'},
 'uid': '6B_-WznDq1g_0'
}

These samples are then added to our database of choice.
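The rolling-window idea described above can be sketched as follows. This is not the actual chunk_generator from videos.py: the function name rolling_chunks, its parameters, and the assumption that each transcript entry carries 'text' and 'start' keys are all illustrative. The defaults of 6 and 4 simply mirror one of the chunk size and advance settings listed in the results table.

```python
from typing import Dict, Iterator, List


def rolling_chunks(
    transcript: List[Dict],
    video_id: str,
    title: str,
    chunk_size: int = 6,
    advance: int = 4,
) -> Iterator[Dict]:
    """Yield overlapping samples built from consecutive transcript entries.

    Each window holds up to `chunk_size` entries and the window start moves
    forward by `advance` entries, so consecutive samples share
    `chunk_size - advance` entries when advance < chunk_size.
    """
    if chunk_size < 1 or advance < 1:
        raise ValueError("chunk_size and advance must be positive")

    for index, start in enumerate(range(0, len(transcript), advance)):
        window = transcript[start:start + chunk_size]
        yield {
            "text": " ".join(entry["text"].strip() for entry in window),
            "metadata": {
                # Keep the start time of the first entry in the window.
                "timestamp": window[0]["start"],
                "title": title,
                "video": video_id,
            },
            "uid": f"{video_id}_{index}",
        }
        # Stop once a window has reached the end of the transcript.
        if start + chunk_size >= len(transcript):
            break
```

With ten entries and the defaults, this yields two samples: entries 0 to 5 and entries 4 to 9, with entries 4 and 5 shared between them.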

(3) Storing to a database

The current implementation relies on chromadb (https://www.trychroma.com/), an open-source embeddings database. The code at present uses the default all-MiniLM-L6-v2 embedding model, but chromadb wraps a variety of other embedding models as well.
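A minimal sketch of loading samples into chromadb, assuming samples shaped like the example in section (2). The helper names to_add_kwargs and store_samples, the collection name, and the storage path are hypothetical and not taken from the project. When no embedding function is passed, chromadb falls back to its default embedding model.

```python
from typing import Dict, Iterable, List


def to_add_kwargs(samples: Iterable[Dict]) -> Dict[str, List]:
    """Split samples into the parallel lists that Collection.add expects."""
    samples = list(samples)
    return {
        "ids": [s["uid"] for s in samples],
        "documents": [s["text"] for s in samples],
        "metadatas": [s["metadata"] for s in samples],
    }


def store_samples(
    samples: Iterable[Dict],
    collection_name: str = "ytts_demo",  # hypothetical name
    path: str = "./chroma_db",  # hypothetical on-disk location
):
    """Add samples to a persistent collection and return the collection."""
    import chromadb  # third-party: pip install chromadb

    client = chromadb.PersistentClient(path=path)
    # No embedding_function is given, so the library default is used.
    collection = client.get_or_create_collection(name=collection_name)
    collection.add(**to_add_kwargs(samples))
    return collection
```

A stored collection can then be searched with something like collection.query(query_texts=["reducing hum in a receiver"], n_results=5), which returns the closest documents together with their metadata.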

(4) Online & Results

The following collections have been built and tested. Qualitatively, the all-mpnet-base-v2 model seemed to produce the best search results.

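| Collection Name               | Channels      | Chunk Method | Chunk Size | Advance By | Model               |
|-------------------------------|---------------|--------------|------------|------------|---------------------|
| roberta-base-squad2_word20_15 | combined      | words        | 20         | 15         | roberta-base-squad2 |
| Thesignalpath                 | TheSignalPath | transcripts  | 6          | 4          | all-MiniLM-L6-v2    |
| MrCarlsonslab                 | MrCarlsonsLab | transcripts  | 6          | 4          | all-MiniLM-L6-v2    |
| all-mpnet-base-v2_word20_15   | combined      | words        | 20         | 15         | all-mpnet-base-v2   |
| combined                      | combined      | transcripts  | 6          | 4          | all-MiniLM-L6-v2    |
| word_chunk_20_15              | combined      | words        | 20         | 15         | all-MiniLM-L6-v2    |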
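Because every sample keeps its video ID and start timestamp, a search hit can be turned into a link that opens the video at the matching moment. The sketch below assumes a result dictionary shaped like the one chromadb returns from a query (one inner list per query text). The helper names hit_url and format_hits are hypothetical, and whether the live demo builds its links this way is not stated here.

```python
from typing import Dict, List


def hit_url(metadata: Dict) -> str:
    """Build a YouTube link that starts playback at the sample's timestamp."""
    seconds = int(metadata["timestamp"])  # drop the fractional part
    return f"https://www.youtube.com/watch?v={metadata['video']}&t={seconds}s"


def format_hits(results: Dict) -> List[str]:
    """Render the hits for the first query as 'title | url | text' lines."""
    documents = results["documents"][0]
    metadatas = results["metadatas"][0]
    return [
        f"{meta['title']} | {hit_url(meta)} | {text}"
        for text, meta in zip(documents, metadatas)
    ]
```

For the sample shown in section (2), hit_url produces https://www.youtube.com/watch?v=6B_-WznDq1g&t=4s.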
