arxiv-summaries-workflow

This repo gives public access to the scripts I use to automate my weekly arXiv paper-skimming series on YouTube. I don't think anybody will find the majority of these files useful, but who knows.

If you're looking for the backlog of papers I've looked into each week in my videos, check out the following two files:

papers_downloaded.csv: includes every single paper that has shown up on the weekly paper videos since 2024/06/21. From reading the abstracts of these papers I selected which papers would make their way into the following file
papers_kept.csv: includes every single paper that has shown up on the weekly substack newsletter/podcast since 2024/06/21. These are the papers that I actually bother starting to read, some percentage of which get deleted part of the way through, some get read but never discussed again, and some get read & talked about on the channel in one of my paper breakdown videos

Repo Contents

arxiv-search.py - this script opens up an app window with a list of paper titles and lets you download these papers into pdfs/ with the click of a button. It selects them according to the search criteria in search_terms_include.txt and search_terms_exclude.txt plus some settings in the config. By default the search terms are the ones I prefer, and it shows you the most recent papers you've not yet seen, capped at 2000 total (I don't recommend sifting through that many in one sitting, it's mind-numbing). Whenever this is run to completion, every single paper in the list gets added to papers_seen.csv. Whenever you download a paper, the script writes the arXiv link into links.txt for later use and a bunch of info into papers_downloaded.csv, in the hopes that I'll one day be able to train a model to select papers for me using these two csv files.
arxiv-link-downloader.py - this script takes as input any number of arxiv links and downloads them as well as adds them to links.txt, papers_seen.csv and papers_downloaded.csv
open-links.py - this opens all of the links in links.txt in your browser in order. Technically you can manually add non-arxiv papers to this file and it'll still work as long as you properly use the separator | between name & link and your link includes the full https:// and whatnot; I do this whenever I need to add a paper not found on arxiv
newsletter-podcast.py - this will consume all PDFs in the pdfs-to-summarize/ folder and use OpenAI's API to generate summaries which will go into newsletter.txt. It then turns this newsletter into an mp3 file for a podcast using OpenAI's TTS
timestamps.py - a script that generates YouTube chapter timestamps based on the pdfs that have been summarized. Hit a configurable hotkey (I use `) to indicate that a new YT chapter should start, and esc to end the script. It creates timestamps.txt, which is what I copy & paste into my YT description. This script likely requires you to go into your computer's settings and allow special permissions so it can monitor your keyboard. It also auto-replaces certain strings with shorter versions (ex. "Large Language Model" becomes "LLM"), with the specific strings configurable in the config. If you run it a second time (meaning a timestamps.txt file already exists), then instead of recording timestamps it will trim lines, starting with those covering the shortest time period, until the text gets below a specified character count (4,475 by default; YT's max description length is 5,000 characters).
thumbnail.py - creates a simple thumbnail for my YT videos after they've been recorded. Basically a screenshot is taken from the video, then the right 2/3 of the screen gets overlaid with a specified color, and either black or white text (160pt font) is written over that color depending on what shows up better. By default the text says "New AI Papers Published MMM DD YYYY" but you can change this by calling it like in the example below. Colors include red, green, yellow, purple, blue, cyan, magenta, orange, pink, black, and white. Here's an example call:

python thumbnail.py "path/to/video.mov" --text "Alternative\nTitle of\nYour Choice" --coverage 3/4 --color red --font_size 170

cleanup.py - this will take any pdf files that have been copy & pasted from pdfs/ to pdfs-to-summarize/ and create corresponding .md files before sending them into your Obsidian vault. You need to specify the location of your Obsidian vault in config.py in order for it to work. When sending files to Obsidian, it also records the fact that you decided to keep them by adding lines to papers_kept.csv; if you want to use that csv but don't want to use Obsidian then I'm sorry but I'm lazy, so you'll have to mess with the code yourself. Then it deletes all of the files that are generated by all the other scripts. I'd recommend running this after you download the repo since I may have left it populated with a bunch of files on my last push by accident.
config.py - Where you can change a couple settings if you'd like.
papers_seen.csv - includes every single paper that had its title pass in front of my eyes BEFORE the video; this is basically every single paper that gets published to arxiv under the AI category and every tangentially related category. From reading these titles I selected which papers would go in the following file
papers_downloaded.csv - includes every single paper that has shown up on the weekly paper videos since 2024/06/21. From reading the abstracts of these papers I selected which papers would make their way into the following file
papers_kept.csv - includes every single paper that has shown up on the weekly substack newsletter/podcast since 2024/06/21. These are the papers that I actually bother starting to read, some percentage of which get deleted part of the way through, some read but never discussed again, and some read & talked about on the channel in one of my paper breakdown videos
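
The trimming pass described under timestamps.py can be sketched roughly as follows. This is not the code from timestamps.py, just a minimal illustration of the idea, and it assumes each line of timestamps.txt looks like "M:SS Title" or "H:MM:SS Title". The names parse_timestamp and trim_chapters are made up for this example. The sketch also chooses to never drop the first or last chapter, which is my own assumption and not something the script is known to do.

```python
def parse_timestamp(stamp: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' into a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def trim_chapters(lines: list[str], max_chars: int = 4475) -> list[str]:
    """Drop the shortest chapters until the joined text fits in max_chars.

    Assumed line format (hypothetical): '<timestamp> <title>'.
    """
    kept = list(lines)
    while len("\n".join(kept)) > max_chars:
        starts = [parse_timestamp(line.split(" ", 1)[0]) for line in kept]
        # A chapter's duration is the gap until the next chapter starts.
        # The last chapter has no known end, so it gets no duration here.
        durations = [starts[i + 1] - starts[i] for i in range(len(kept) - 1)]
        # Assumption: keep the first chapter (index 0) and the last one.
        candidates = range(1, len(durations))
        if not candidates:
            break  # nothing left that this sketch is willing to remove
        shortest = min(candidates, key=lambda i: durations[i])
        del kept[shortest]
    return kept
```

When a chapter is removed, its time is absorbed by the chapter before it, which is why the durations are recomputed on every pass.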

SETUP

Clone the repository to your local machine.
Install the required Python packages by running pip install -r requirements.txt in your terminal within your virtual environment
Obtain an API key from OpenAI and save it in a file named key_openai.txt in the root directory of the repository.
Run cleanup.py to get rid of all the pdf and text files that I may or may not have left in here by accident on the most recent push. I'd also recommend deleting papers_seen.csv, papers_downloaded.csv, and papers_kept.csv, assuming you plan to use them to train your own custom recommendation model (like I hope to one day) and don't want my data confounding yours.
If you don't use obsidian then open config.py and set send-to-obsidian = False. If you plan to send files to an Obsidian vault then open config.py and define directories for your/obsidian/vault/location/here and your/obsidian/vault/location/here/attachments-folder. Also inside config.py you can edit frontmatter_lines to fit your tagging system
Maybe peruse config.py to check settings and try to gain a better understanding of this monstrosity I've created. I suggest editing prompts to fit your use-case.

USAGE

Write out your search terms in search_terms_include.txt and search_terms_exclude.txt to fit your use-case. Each search term should be on its own line. If you just want all of today's newest papers then leave both blank. For me personally I exclude papers that I know I'm not going to be interested in, for example anything related to the medical field. Also if you'd like to include more than the most recent day's papers then open up config.py and set restrict_to_most_recent = False. By default the maximum number of papers to include in search is 2000, but again you can change this in config.py.
Run arxiv-search.py, wait for it to finish printing out every title and link to console, and then it should create a little app window. Drag to expand this window and you'll see a bunch of buttons with names of papers. Click on a paper and it'll be downloaded to pdfs/
- If no papers show up and you get a blank window that's the arXiv API wrapper bugging out. Just run it a couple times until it works, preferably waiting at least 15 minutes between attempts
Run open-links.py and all the papers you downloaded will pop up in your browser.
4a. If you want to publish YT vids like I do, run timestamps.py as close to when you hit record as possible and use the hotkey to indicate whenever you switch to a new paper. Press esc to end that script.
4b. If the timestamps that get printed into timestamps.txt are too long to fit into your YT description, then first I recommend manually trimming down the titles or even cutting out entire lines. If you'd like to automatically cut out lines corresponding to the shortest time lengths, run timestamps.py again, immediately hit esc, and you'll see that many of them are gone.
Run the newsletter-podcast.py script to generate a newsletter and podcast based on all the pdf files in pdfs-to-summarize/
With your recorded video call python thumbnail.py "path/to/your/video.mov" --color pink (or any other color)
Once you're finished run cleanup.py. This will send the pdfs in pdfs-to-summarize/ to your obsidian vault and create corresponding markdown notes for them, and then delete all files created by the previous scripts.
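
To make the include/exclude search terms from the first step more concrete, here's a rough sketch of what "one term per line, leave both files blank to get everything" could mean. I'm not claiming this is how arxiv-search.py applies them (it may well pass the terms to the arXiv API instead of filtering locally). The function names load_terms and matches are made up for illustration, and the case-insensitive substring matching is an assumption.

```python
from pathlib import Path


def load_terms(path: str) -> list[str]:
    """Read one search term per line, skipping blank lines."""
    file = Path(path)
    if not file.exists():
        return []
    lines = file.read_text(encoding="utf-8").splitlines()
    return [line.strip().lower() for line in lines if line.strip()]


def matches(text: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if text passes the include/exclude terms.

    Assumption: terms are lowercase and matched as plain substrings.
    """
    lowered = text.lower()
    # Any excluded term rejects the paper outright.
    if any(term in lowered for term in exclude):
        return False
    # An empty include list means "keep everything that wasn't excluded".
    return not include or any(term in lowered for term in include)
```

For example, with "language model" in the include file and "medical" in the exclude file, a title like "Scaling Laws for Language Models" passes while "Medical Language Models" does not.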

Potential TODOs

Set up open-links.py to only open one link at a time and sync up with the hotkey from timestamps.py for opening new links (this would likely require merging them into one single file). The reason I might do this is to save on RAM usage, as the many open tabs can get taxing for my 8GB MacBook Air
Train a model (BERT based?) off of papers_seen.csv, papers_downloaded.csv, and papers_kept.csv to automatically grab for me the papers that I find interesting in a given week rather than having to read through the boring list myself
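
For the first TODO, a very rough starting point might look like the sketch below. It is not existing code from this repo. It assumes each line of links.txt is "name | link" (the separator described for open-links.py), splits on the last | in case a title contains one, and waits for Enter instead of hooking into the timestamps.py hotkey. The names parse_links and open_one_at_a_time are hypothetical.

```python
import webbrowser


def parse_links(text: str) -> list[tuple[str, str]]:
    """Parse 'name | link' lines into (name, link) pairs."""
    entries = []
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank or malformed lines
        # Split on the last separator in case the title itself contains '|'.
        name, link = line.rsplit("|", 1)
        entries.append((name.strip(), link.strip()))
    return entries


def open_one_at_a_time(entries, opener=webbrowser.open):
    """Open each link only after the user presses Enter."""
    for name, link in entries:
        input(f"Press Enter to open: {name}")
        opener(link)
```

You'd call it with something like open_one_at_a_time(parse_links(open("links.txt").read())). Syncing it up with the chapter hotkey would still need the merge with timestamps.py mentioned above.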
