ArXiv analysis

Run online variational LDA on all the abstracts from the arXiv. The implementation is based on Matt Hoffman's GPL-licensed code.

Usage

You'll need a mongod instance running on the port given by the environment variable MONGO_PORT and a redis-server instance running on the port given by the REDIS_PORT environment variable.
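
As a minimal sketch of how a script might pick up these two ports, the snippet below reads them from the environment and only then opens the clients. The function names `service_ports` and `connect` are hypothetical, and the assumption that both servers listen on localhost is not stated in this README.

```python
import os


def service_ports(environ=os.environ):
    """Read the MongoDB and Redis ports from the environment as integers."""
    ports = {}
    for name in ("MONGO_PORT", "REDIS_PORT"):
        value = environ.get(name)
        if value is None:
            raise RuntimeError("environment variable %s is not set" % name)
        ports[name] = int(value)
    return ports


def connect(environ=os.environ):
    """Open clients for both services (requires pymongo and redis)."""
    # Imported here so the port check above works without the packages.
    import pymongo
    import redis

    ports = service_ports(environ)
    # Assumption: both servers run on localhost.
    mongo = pymongo.MongoClient("localhost", ports["MONGO_PORT"])
    cache = redis.Redis(host="localhost", port=ports["REDIS_PORT"])
    return mongo, cache
```

Calling `service_ports()` before starting a long scrape gives an early, readable error if either variable is missing.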

The code depends on the Python packages numpy, scipy, requests, pymongo and redis.

  • mkdir abstracts
  • ./analysis.py scrape abstracts: scrapes all the metadata from the arXiv OAI interface and saves the raw XML responses as abstracts/raw-*.xml. This takes a long time because of the arXiv's flow control policies. It took me approximately 6 hours.
  • ./analysis.py parse abstracts/raw-*.xml: parses the raw responses and saves the abstracts to a MongoDB database called arxiv, in the collection called abstracts.
  • ./analysis.py build-vocab: counts all the words in the corpus, dropping stop words and any word with fewer than 3 characters.
  • ./analysis.py get-vocab 100 5000 > vocab.txt: lists the vocabulary, skipping the 100 most common words and keeping 5000 words in total.
  • ./analysis.py run vocab.txt: runs online variational LDA on articles selected at random from the database. The topic distributions are stored in the lambda-*.txt files. This will run forever, so just kill it whenever you feel like it.
  • ./analysis.py vocab.txt lambda-100.txt: lists the topics and their most common words at step 100.
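
The build-vocab and get-vocab steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in analysis.py: the tokenizer (lowercase runs of ASCII letters), the tiny stop-word list, and the names `count_words` and `select_vocab` are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the one the project uses.
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are"}


def count_words(abstracts, min_length=3, stop_words=STOP_WORDS):
    """Count words, skipping stop words and words shorter than min_length."""
    counts = Counter()
    for text in abstracts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= min_length and word not in stop_words:
                counts[word] += 1
    return counts


def select_vocab(counts, skip, keep):
    """Drop the `skip` most common words, then keep the next `keep` words."""
    # Sort by descending count; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[skip:skip + keep]


if __name__ == "__main__":
    docs = ["The galaxy and the star", "A star is a star", "galaxy of dust"]
    for word in select_vocab(count_words(docs), skip=1, keep=2):
        print(word)
```

Skipping the most frequent words is a cheap way to remove corpus-wide filler that the stop-word list misses, while the cap on vocabulary size keeps the topic matrix small.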

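Listing the topics amounts to ranking the words in each row of a lambda file. The sketch below assumes a layout that this README does not confirm: one whitespace-separated row per topic and one column per line of vocab.txt. Normalizing a row gives the expected word probabilities for that topic. The names `load_lambda` and `top_words` are hypothetical.

```python
def load_lambda(path):
    """Read a whitespace-separated matrix: one row per topic (assumed layout)."""
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


def top_words(lambda_rows, vocab, n=10):
    """Return, for each topic, its n most probable (word, probability) pairs."""
    topics = []
    for row in lambda_rows:
        total = sum(row)
        # Indices of the n largest entries in this row.
        ranked = sorted(range(len(row)), key=lambda i: -row[i])[:n]
        topics.append([(vocab[i], row[i] / total) for i in ranked])
    return topics


if __name__ == "__main__":
    vocab = ["dust", "galaxy", "star"]
    rows = [[1.0, 3.0, 6.0], [5.0, 4.0, 1.0]]
    for k, words in enumerate(top_words(rows, vocab, n=2)):
        print(k, " ".join("%s(%.2f)" % pair for pair in words))
```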