transcripttopdf's Introduction

YouTube Transcript to Readable PDF Pipeline

Pipeline

Overview

This project converts YouTube transcripts into readable PDF documents with time stamps. The purpose is to give knowledge seekers a convenient way to follow along and absorb the insights shared in a video. For inspiration, I have used Y Combinator, a content creator known for exceptional expertise and captivating delivery.

Motivation

Y Combinator's YouTube content is invaluable for anyone seeking knowledge. However, video content can be hard to consume, especially when you want to return to a specific point of interest. By converting the transcripts into readable PDFs with time stamps, this project aims to make that information easier to navigate and reference.

Features

  1. YouTube Video Information Extraction: The pipeline begins by extracting information about Y Combinator's YouTube videos, using Pytube to obtain the video_id, title, and description programmatically.

  2. YouTube Transcript Extraction: The pipeline then extracts the transcripts of those videos, using Youtube-Transcript-Api to obtain each transcription along with its timestamp programmatically.

  3. Text Preprocessing: The extracted transcripts are preprocessed to remove noise that may be present in the raw transcript data.

  4. Time Stamp Integration: The pipeline embeds the time stamps in the PDF document so users can easily refer back to specific moments in the video. It also inserts the video's chapters as subheadings, each with a reference to that heading. Clicking a time stamp opens the video at the corresponding moment.

  5. Markdown Generation: The preprocessed transcript is then formatted into a readable markdown document. Each line of text is accompanied by the time stamp at which it was spoken in the video.

  6. PDF Generation: The preprocessed transcript is then formatted into a readable PDF document with the help of Pypandoc.
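
As a rough illustration of steps 3 to 5, here is a minimal sketch in plain Python. It is not the project's actual code. It assumes each transcript segment is a dict with "text" and "start" (seconds) keys, and that noise means bracketed cues such as [Music]. The function names and the WATCH_URL template are hypothetical and chosen only for this example.

```python
import re

# Assumed link format: the "t" query parameter asks YouTube to start playback at that second.
WATCH_URL = "https://www.youtube.com/watch?v={video_id}&t={seconds}s"


def clean_text(text: str) -> str:
    """Drop bracketed cues such as [Music] and collapse all whitespace runs."""
    text = re.sub(r"\[[^\]]*\]", " ", text)
    return " ".join(text.split())


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or H:MM:SS once the video passes one hour."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def timestamp_link(video_id: str, seconds: float) -> str:
    """Build a markdown link whose label is the time stamp and whose target is the video at that moment."""
    url = WATCH_URL.format(video_id=video_id, seconds=int(seconds))
    return f"[{format_timestamp(seconds)}]({url})"


def to_markdown(title: str, video_id: str, segments: list[dict]) -> str:
    """Emit one markdown paragraph per non-empty segment, prefixed by its clickable time stamp."""
    lines = [f"# {title}", ""]
    for seg in segments:
        text = clean_text(seg["text"])
        if not text:
            continue  # the segment contained only noise
        lines.append(f"{timestamp_link(video_id, seg['start'])} {text}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        {"text": "[Music]", "start": 0.0},
        {"text": "welcome to\nstartup school", "start": 75.4},
    ]
    print(to_markdown("Demo", "abc123", demo))
```

Because the links are ordinary markdown, they should stay clickable when the markdown is converted to PDF, although the exact rendering depends on the converter and template used.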

Example

PDF Example

Full PDF Combined Startup School

Contributing

Contributions to this project are welcome. If you have any suggestions, bug reports, or feature requests, please open an issue on the GitHub repository.

  • You can replace Youtube-Transcript-Api with Whisper if the transcript is not available.
  • You can merge the whole playlist into one PDF file.
  • You can chat with your generated PDF using pdfGPT or an online service such as ChatPDF.
  • You can create better PDF styles and share them with the community. Be wild!

License

This project is licensed under the MIT License.

Issues

If you encounter any problems, please open an issue along with a detailed description.

  • The timestamps of the chapters are not exact, so you may need to adjust them manually for better subheading transitions.
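
One possible way to reduce the manual adjustment is to snap each chapter start to the nearest transcript segment start, so a subheading never lands in the middle of a segment. The sketch below is only an assumption about how that could be done, not something the project implements. The (title, start_seconds) chapter shape and both function names are hypothetical.

```python
from bisect import bisect_left


def snap_to_segment(chapter_start: float, segment_starts: list[float]) -> float:
    """Return the segment start closest to chapter_start (segment_starts must be sorted)."""
    if not segment_starts:
        return chapter_start
    i = bisect_left(segment_starts, chapter_start)
    # Only the neighbours on either side of the insertion point can be the closest.
    candidates = segment_starts[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda s: abs(s - chapter_start))


def snap_chapters(chapters: list[tuple[str, float]], segments: list[dict]) -> list[tuple[str, float]]:
    """Align every (title, start_seconds) chapter with a transcript segment boundary."""
    starts = sorted(seg["start"] for seg in segments)
    return [(title, snap_to_segment(start, starts)) for title, start in chapters]
```

Snapping only moves a heading to the closest segment boundary. If the chapter marker itself is far off, the result still needs a manual check.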

Acknowledgments

  • Y Combinator for creating exceptional content and inspiring this project.

Enjoy converting YouTube transcripts into readable PDFs with time stamps using this pipeline, and delve deeper into the wealth of knowledge shared by HUMANITY!

Reach out

Be sure to reach out if you have any questions or want to hop on a feature.
