Giter Club home page Giter Club logo

tube-clues's Introduction

Overview

The purpose of this repository is a pipeline for YouTube video analysis and specifically fact-check. Videos can have transcripts retrieved or generated using Whisper with a flag. There are also prompts in progress for analysis of the videos. Currently the only prompt is for unverifiable or potentially misleading facts. Eventually the goal is to have a fact check pipeline and other pipelines for analyzing other elements of transcripts.

Open Issues

  • Develop some sort of opinionation/bias measure and way to display it
  • Add diarization -> sometimes the videos play clips within them and that makes the transcript nonsensical without diarization
    • Even with diarization, sometimes speakers themselves read a section of text they are quoting: it'd be nice to capture that they're quoting someone else in those scenarios
  • Connect to search tool (similar to stochasticity) for fact checking
    • Create UX to present identified potentially misleading or false facts and then give option to investigate further
  • Develop a fact checking pipeline
    • Add a reliable way to find relevant facts from transcript
    • Add a way to search for each of the facts
      • Currently thinking of doing transcript->list of facts->formulate search queries->search
      • This is a lot of steps though so I'll try transcripts->formulate search queries, skipping list of facts at some point
    • Many of these facts are so complicated lol
    • Display in UX
  • Possibly return a list of related articles from other news sources
  • Utilize video title
  • Add error telemetry

Stretch Goals or Low Priority

  • Add channel analysis
  • Add graph-based visualization (ex: comparisons on factors such as opinions per minute, political lean, misleading facts per 1000 words)
  • Basic visualization around custom prompts
  • Add support for longer videos, maybe by breaking up the video
  • UX for inputting youtube video link and getting information back
    • UX for processing multiple videos (such as multiple videos from one channel)
    • UX to visualize results from query / video analysis
  • Currently YouTube transcripts are not cached, potential for bottleneck there due to YouTube API limits
  • Currently targeting negative attacks, but could also think about pieces that defend a position or person

Bugs

  • Issues with initial package install (mentioned in pyproject.toml)
  • Sometimes transcripts don't download the first time and need to be re-run (potentially fixed with switching to something like yt-dlc instead of yt-dlp?)

Closed Bugs

  • Fixed occassional segfault crash by setting cache_discovery to false in youtube API object config

Completed

✓ Ability to enter a link and run the processing pipeline

✓ Add support for downloading youtube video audio automatically

✓ Local caching for transcript retrieval and for video MP3

✓ Separate functions into different files by function (maybe helper, main, and transcript acquisition?)

✓ Have some form of web UX

✓ Consider figuring out some UX for connecting overarching claims to the facts that are being presented

✓ Limit to reject videos over 15 minutes

✓ Split each type of analysis into its own flow

✓ Added caching for GPT output

✓ Add a bias and opinionation flow that looks for negative claims that are not backed up

✓ Option to use YouTube transcript if it exists for video (as opposed to using Whisper)

✓ Redis support for caching and locking

✓ Host website

✓ Custom prompt support

✓ Start deleting audio once done with it

✓ Add way to prevent multiple clicks from causing multiple requests

✓ Avoid version overlapping response when response is long

✓ When response JSON can't be decoded, print the raw JSON on the page instead of crashing

✓ Add text streaming for GPT results

✓ Longer transcripts and parallelized transcript processing (videos up to 1 hour)

Identifying Trustworthiness of a News Piece Notes

The overarching goal of this system is to be able to find a measure of quality for news pieces and essays. In my view there are two types of good news pieces. The first is a news piece that transmits information about an event in a way that represents both sides (if two sides exist). The second is a news piece that has a "thesis" it is attempting to convey about a situation or event. Examples of this sort of news piece would be ones that are talking about a new piece of legislation and are advocating for or against it.

The challenge comes from identifying news pieces that are more interested in presenting a perspective than in fairly representing multiple sides of an issue. These pieces come with a viewpoint pre-conceived and then either present the facts purposely in a way that leaves out other relevant factors or purposely deceive consumers with false facts. If a clear false fact is encountered such as a statistic, it can sometimes be trivial to do a search and identify it. However there are a few challenges that need to be overcome in this realm.

One major challenge is that presenters in many cases simply can avoid facts and figures or use ones that are too vague to check. If a presenter says "this new legislation increases violent crime." It's technically a claim of fact, but we need to identify as an "opinion" because it's not something that can actually be verified. But what if they used other facts or data before that claim which back it up? Where is the line here? The line between "this connection is obvious, it doesn't need to be explained" and "this connection is just an opinion" is a tough one to explain or quantify.

Maybe logical fallacies is what I'm looking for? False facts, conjured opinions, and logical fallacies? Dog whistles are also very challenging. Often the video doesn't say anything substantive on the surface, but you have to read between the lines. Also we're looking for effect on the viewer - for example if a presenter is talking about something random but drops a line about ineffective leadership, then that's an influence on viewers even though there were no arguments and no facts and that line may not be a part of a "main" argument from the presenter at all.

Some things that The Disinformation Index uses are "article bias" and "senational language." They also talk about language that "demeans", "belittles" or "otherwise targets" individuals, groups, or institutions. The values they measure by are listed as bias, negative targeting, out-group inferiority, sensational language, sensational visuals, sources, attribution, headline accuracy, lede present (and whether it's fact-based), and byline information. One clear weakness that the TDI has is that its measures are inherently qualitative. Their method is to have three reviewers look at each article and answer questions about those measures.

Article on linguistic differences between real and fake news. One of the things the discusses is central versus peripheral persuasion (summary) which is called the "Elaboration Likelihood Model" (ELM). ELM lines up with the intuition that articles that are higher on heuristics or hand-waving are likely to be less trustworthy and that articles that contain specific facts (assuming they're true facts) are more likely to be useful.

tube-clues's People

Contributors

arysef avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.