YouTube Transcript Search

Live Demo: http://www.blankadventure.com/ytts

YTTS is an exploration of enabling semantic search capabilities on a YouTube channel. The idea was initially sparked by Mr. Carlson Lab, whose videos are a deep-dive into antique electronics repair. His videos are information-dense and I would often find myself wishing I could remember some clever method or technqiue for circuit debugging, noise reduction, or performance testing. The ability to semantically search his content for various topcis seemed like exactly the thing I needed, and so this is my attempt.

A few things would be needed to accomplish this: (1) scraping the channel for video transcripts; (2) transforming the transcripts; (3) storing to a database; and (4) enabling the search.

(1) Scraping the channel for video transcripts

Thankfully, YT provides transcripts for (most of?) their videos, so we don't have to actually generate those. Two packages were used: scrapetube and youtube_transcript_api, whose functionality was combined in videos.py to enable collecting a set of transcripts for a YT channel.

(2) Transforming the transcripts

The transcript for a single video consists of a list of dictionaries, where each dictionary contains a single (text) sentence and a start and end timestamp. It's important to note that each "sentence" is really just a string of some number of words, and isn't necessarily a sentence in the proper grammatical sense. Because of this, the contextual value of a particular sentence may be limited. To improve context, we can append multiple sentences together forming larger text samples. Additionally, a rolling-window aggregator can be used to provide "overlapped" (or shared) context between samples. The appropriate start timestamp must be maintained across this operation (the end timestamp is not used). The transcript does not include the video title, which is obtained separately, and is inserted as an addtional piece of metadata. In videos.py, chunk_generator handles this functionality. It is a generator that yields a single sample in the form:

{'text': 'its a Panasonic radio receiver that winds up like a clock then it searches for radio stations lets do a tear down on this Ill do a circuit description then lets fix it and see how this mechanism actually works should be a lot of fun lets get started heres the Panasonic radar mtic radio that were going to be tearing down troubleshooting and were going to',
 'metadata': {'timestamp': 4.319,
  'title': 'Panasonic Radar Matic Receiver Teardown With Circuit Description, Troubleshooing, And Resurrection!',
  'video': '6B_-WznDq1g'},
 'uid': '6B_-WznDq1g_0'
}

These samples are then subsequently added to our database of choice.

(3) Storing to a database

The current implemenation relies on chromadb (https://www.trychroma.com/), an open-source embeddings database. The code at present uses the default all-MiniLM-L6-v2 model, but chromadb wraps a variety of other LLMs as well.

(4) Online & Results

The following collections have been built and tested. Qualitatively, the all-mpnet model seemed to produce the best search results.

Collection Name	channels	Chunk Method	Chunk Size	Adv By	Model
roberta-base-squad2_word20_15	combined	words	20	15	roberta-base-squad2
Thesignalpath	TheSignalPath	transcripts	6	4	all-MiniLM-L6-v2
MrCarlsonslab	MrCarlsonsLab	transcripts	6	4	all-MiniLM-L6-v2
all-mpnet-base-v2_word20_15	combined	words	20	15	all-mpnet-base-v2
combined	combined	transcripts	6	4	all-MiniLM-L6-v2
word_chunk_20_15	combined	words	20	15	all-MiniLM-L6-v2

blankadventure / ytts Goto Github PK

ytts's Introduction

YouTube Transcript Search

ytts's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent