Lucas Laughlin
Demo Site: http://paper-recommender-frontend.s3-website-us-west-2.amazonaws.com/
Table of Contents:
-
- Related write up of project
-
- Use the spaCy tokenizer to tokenize abstracts for TF-IDF. Note: RoBERTa and GPT-2 have their own tokenizers
- Define functions for performing K-means clustering
- Functions to perform dimension reduction on embeddings with PCA, UMAP, and t-SNE
- Functions to plot 2D embeddings once dimension reduction has been performed
- Functions to extract the k nearest neighbors given a target embedding and a list of embeddings
- Functions for calculating co-citation, reference, and citation scores given a target embedding and a list of nearest neighbor embeddings
- Perform a full run of getting TF-IDF embeddings from abstracts: cluster, plot, and calculate the co-citation scores
- Perform a full run of getting Doc2Vec embeddings from abstracts: cluster, plot, and calculate the co-citation scores
- Fine-tune RoBERTa on the abstract data and extract document embeddings. Then calculate co-citation scores.
-
-
Take raw JSON data, sample by year/oi, then save into a df
-
Summarize df and remove duplicates
-
Create an interaction df, sparse interaction matrix and idx2auth/idx2doc mappings
-
Convert interaction df of authors and paper names to contain only ids (ids = index in unique authors/papers list)
-
Create new datasets by setting a minimum author frequency. This was an attempt to improve results by reducing the sparsity of the interaction matrix
-
Create and test functions to measure co-citations using the OpenCitations API
-
-
- Download the tokenizer, tokenize abstracts, and create DataLoader objects for fine-tuning and validation
- Download GPT-2 and fine-tune it
- Use the fine-tuned GPT-2 hidden layer outputs to create document embeddings
-
- Upload the document df to a MongoDB database
- Remove all documents from MongoDB
- Prototype for querying MongoDB using pymongo
-
- Contains a Flask server with trained embeddings and a k-nearest-neighbor model. It has been deployed to an AWS EC2 instance
-
- Contains a React-based front-end application. It allows you to search through all arXiv high energy physics papers. Once you have selected a paper, it provides you with the 10 most similar papers.
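
The retrieval step described above (embed abstracts, then return the most similar papers for a selected one) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the whitespace tokenizer stands in for spaCy, the function names `tokenize`, `tfidf_vectors`, `cosine`, and `nearest_neighbors` are hypothetical, the sample abstracts are made up, and cosine similarity is an assumed choice of distance.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplified stand-in for the spaCy tokenizer: lowercase, split on
    # whitespace, strip surrounding punctuation.
    tokens = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = {}
        for term, count in counts.items():
            tf = count / len(toks)
            # Smoothed inverse document frequency (an assumed variant).
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(target_idx, vectors, k=10):
    """Indices of the k documents most similar to vectors[target_idx]."""
    scored = [
        (cosine(vectors[target_idx], vec), i)
        for i, vec in enumerate(vectors)
        if i != target_idx  # never recommend the selected paper itself
    ]
    # Highest similarity first; ties broken by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Made-up abstracts, for illustration only.
abstracts = [
    "Quark gluon plasma in heavy ion collisions",
    "Heavy ion collisions and the quark gluon plasma phase",
    "Supersymmetric string theory compactifications",
    "String theory and supersymmetric black holes",
]
vectors = tfidf_vectors(abstracts)
neighbors = nearest_neighbors(0, vectors, k=2)
```

With `k=10` the same call would correspond to the "10 most similar papers" behavior of the front end; swapping `tfidf_vectors` for Doc2Vec, RoBERTa, or GPT-2 embeddings would leave the neighbor search unchanged as long as the vectors are comparable.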