arxiv-search's People

Contributors

dependabot[bot], edayers, ericphanson, larsmennen

Forkers

gimseng

arxiv-search's Issues

Animations

  • Searchbox focus
  • Animate positions of leaderboards
  • Animate paper moving in and out of library when you save it.
  • Some fancy papers loading animation
  • Meta-numbers fade in and out slowly.

Top authors

Given a search query, facet by authors. What are the top authors by relevance score?
Can this data be used to give autocompletions in rich search?
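
A minimal sketch of the faceting idea, assuming each search hit carries a relevance score and a list of author names. The `score` and `authors` keys are hypothetical, not the project's actual schema, and the sum is done client-side here purely for illustration.

```python
from collections import defaultdict

def top_authors(hits, n=5):
    """Rank authors by the summed relevance score of their matching papers."""
    totals = defaultdict(float)
    for hit in hits:
        for author in hit["authors"]:
            totals[author] += hit["score"]
    # Highest total first; ties are broken alphabetically for stable output.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Made-up hits, only to show the shape of the input.
hits = [
    {"score": 3.0, "authors": ["A. Author", "B. Author"]},
    {"score": 1.5, "authors": ["B. Author"]},
    {"score": 1.0, "authors": ["C. Author"]},
]
print(top_authors(hits, n=2))
```

The same ranked list could plausibly feed autocompletion: the names returned for the current query are natural candidates to suggest first.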

Rich search

The search box should accept keywords such as in:, prim: and by: that trigger autocompletion, fancy formatting and so on. This means that all queries can be made using only the keyboard, with focus staying in one search box.
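
One way the keyword handling might look, as a rough sketch. The keyword list, the quoting rule and the `parse_query` name are assumptions made for illustration; the issue does not specify what each keyword means or how values are delimited.

```python
import re

# Hypothetical keyword set; the issue names in:, prim: and by: "etc".
KEYWORDS = ("in", "prim", "by")

# An optional "keyword:" prefix followed by a quoted phrase or a bare word.
TOKEN = re.compile(r'(?:(%s):)?("[^"]*"|\S+)' % "|".join(KEYWORDS))

def parse_query(text):
    """Split a query into keyword filters and free-text terms."""
    filters, terms = {}, []
    for keyword, value in TOKEN.findall(text):
        value = value.strip('"')
        if keyword:
            filters.setdefault(keyword, []).append(value)
        else:
            terms.append(value)
    return filters, terms

print(parse_query('by:"Özgür Açık" prim:math.AG spin chains'))
```

Tokens that merely look like a keyword (for example `inside:foo`) fall through as plain search terms, so the user can still type them.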

Queries with unicode accents cause a server error

Reproduction

Type the query "Özgür Açık". Look at server logs.
I got:

UnicodeEncodeError: 'ascii' codec can't encode character '\xd6' in position 95: ordinal not in range(128)
-7626696693452858257
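
The log does not show which call raised the error, so the following is only a sketch of the general failure mode, not a diagnosis. It assumes the query string is being encoded as ASCII somewhere (for example when building a URL or writing to a stream) and shows two standard-library ways of handling the text as UTF-8 instead.

```python
from urllib.parse import quote

query = "Özgür Açık"

# Encoding to ASCII fails on the first accented character, as in the log.
try:
    query.encode("ascii")
except UnicodeEncodeError as err:
    print(err)

# Encoding explicitly as UTF-8 succeeds for this text.
utf8_bytes = query.encode("utf-8")

# If the query ends up in a URL, percent-encode it; quote() uses UTF-8 by default.
url_safe = quote(query)
print(url_safe)
```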

Recommendations perf

Right now, to recommend papers based on a user's library, we send a list of all of the papers in the library to ES, which identifies keywords for those papers and searches against them. This is slow for ES, because it ends up having to weight and search against a great many keywords.

A better way seems to be to do some kind of topic modelling. Sam was telling me about it on the board, and I took a picture:
[image: img_5748]

I forget how it all works, but on the technical side, we would process the papers one by one, building sparse vectors that record which words are associated with which topics, and for each paper produce a sparse vector of that paper's weights for each topic.

So a paper could be 20% topic 1 and 30% topic 2, etc.

We would run all the papers through this algo, identify say the top 10 topics for each paper, and upload those as a field on ES, keeping track of the weights.

Then when a user adds a paper to the library, we would grab the topics/weights, and keep an average interest vector for the user (or something like that). Then when we query elasticsearch, instead of passing a list of papers to find similar papers to, we would instead pass a list of top topics (with weights), to score the papers by.
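
A minimal sketch of the bookkeeping described above, assuming sparse vectors are plain `{topic_id: weight}` dicts and that the user's interest vector is a simple running mean. The function names are hypothetical, and the topic model itself is out of scope here.

```python
def top_topics(weights, k=10):
    """Keep the k heaviest topics of a sparse {topic_id: weight} vector."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])

def add_paper(interest, count, paper_topics):
    """Update a running average of topic weights after one more saved paper."""
    new_count = count + 1
    topics = set(interest) | set(paper_topics)
    updated = {
        t: (interest.get(t, 0.0) * count + paper_topics.get(t, 0.0)) / new_count
        for t in topics
    }
    return updated, new_count

# Two made-up papers: 20%/30% on topics 1 and 2, then 50%/10% on topics 2 and 7.
interest, count = {}, 0
interest, count = add_paper(interest, count, {1: 0.2, 2: 0.3})
interest, count = add_paper(interest, count, {2: 0.5, 7: 0.1})
print(top_topics(interest, k=2))
```

The point of the running mean is that the query sent to the search backend stays the same size (at most k topics) no matter how many papers are in the library.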

This seems like it would be much more performant, would scale well with the number of papers in the user's library, and might give better recs too. We could try different topic modelling algorithms as well.

This is a different type of processing workflow, because (1) there is internal state (the model of which words correspond to what topics), and (2) we don't want to parallelize when learning the model from the corpus.

I propose that an EC2 instance with its own storage be dedicated to the task, along with a dedicated queue. Papers get thrown on the queue and processed one at a time by the EC2 instance, which then updates ES.

CSS Responsive Layout

I want to sort out how the website lays itself out for different media types.
At the moment it's quite ad hoc and breaks violently in some corner cases, for example if the paper includes some very wide non-linebreaking text.

Above 1000px, center the grid but let the top header bar bleed to the edges.
Below 1000px, results content area shrinks.
Below 600px, sidebar goes to collapsed menu below search bar.
Try to do this all with CSS only. Try and do it so it's robust to content changes.

Stuck waiting for slow meta.

If you double-click Recommended (with no query, filters, etc.), the app freezes. It should instead forget about the first query and load the new one, which should be very fast since it just fetches papers in time order.
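
The "forget about the first query" behaviour can be sketched as a latest-request-wins guard. This is written in Python only to illustrate the pattern; the client is JavaScript and would need the equivalent there, and the `LatestOnly` class is a hypothetical name, not existing code.

```python
class LatestOnly:
    """Accept a response only if it belongs to the most recent request."""

    def __init__(self):
        self._latest = 0
        self.result = None

    def start(self):
        """Register a new request and return its token."""
        self._latest += 1
        return self._latest

    def finish(self, token, result):
        """Store the result unless a newer request has started since."""
        if token != self._latest:
            return False  # stale response: drop it
        self.result = result
        return True

guard = LatestOnly()
first = guard.start()    # first click
second = guard.start()   # second click supersedes the first
print(guard.finish(second, "new results"), guard.finish(first, "old results"))
```

With this guard the slow first response can arrive whenever it likes; it is simply ignored rather than blocking the UI.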

Minify client javascript

At the moment the client must download several megabytes of JavaScript!
It should really be getting react, katex and friends from a CDN, and our code should be minified.

Client side routing

The URL bar should reflect the state of the app.
Also the back button should work.

migration of core pipeline to EC2

Right now, the WatchStatus.. lambda reads the dynamodb table stream and fires (wrapped) Lambdas accordingly.

Instead, WatchStatus should put the info to fire the Lambda into an SQS queue.

An EC2 instance should be configured to read the SQS queue and process events, by "firing" the associated lambda (i.e. running the corresponding code on the EC2 instance itself).

Why?

  1. Cost: running 10 million Lambdas for 10 seconds each costs $200, and that's a realistic number of Lambdas for processing the million papers in the arXiv.
  2. Still scales reasonably well. The SQS queue has nice properties so we don't miss events, and we can always boot up more EC2 instances to process the queue in parallel. We keep the parallelized structure we had from the Lambdas (each queue event causes 1 thing to happen which updates the table, triggering the next event down the line, which could even be processed by a different EC2 instance).
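
The cost figure in point 1 can be sanity-checked with a back-of-envelope calculation. The prices and the 128 MB memory size below are assumptions (they are not stated in the issue and may not match current pricing), and the free tier is ignored.

```python
# Assumed list prices and memory size; none of these come from the issue.
PRICE_PER_GB_SECOND = 0.0000166667   # USD, assumed
PRICE_PER_MILLION_REQUESTS = 0.20    # USD, assumed
MEMORY_GB = 0.125                    # 128 MB, assumed

def lambda_cost(invocations, seconds_each, memory_gb=MEMORY_GB):
    """Estimate the bill as compute time plus per-request charges."""
    compute = invocations * seconds_each * memory_gb * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests

print(round(lambda_cost(10_000_000, 10), 2))
```

Under these assumptions the estimate lands in the same range as the $200 quoted above; a larger memory setting would scale the compute part proportionally.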

What needs to be done?

Need to

  • learn about SQS and EC2 a bit,
  • update the cloudformation template to deploy SQS and EC2,
  • update watchstatus.. to send events to the SQS queue instead of firing wrappers,
  • either poll the SQS queue from EC2 or figure out how to fire code from events. Write a small handler to run the correct function based on the details from the queue, ideally one which can fire both Python and JavaScript events. (Python 3.5's subprocess.run() sounds pretty good for firing these off.)
  • configure appropriate cloudwatch logging for what the EC2 instance does.
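
A rough sketch of the small handler from the list above, using subprocess.run(). The message format (a JSON body with `runtime`, `script` and `args` fields) and the `node` executable name are assumptions for illustration; polling the queue itself is left out.

```python
import json
import subprocess
import sys

# Hypothetical mapping from a "runtime" field to the interpreter to launch.
INTERPRETERS = {"python": sys.executable, "javascript": "node"}

def build_command(event):
    """Turn a queue event into an argv list for subprocess.run()."""
    runtime = event["runtime"]
    if runtime not in INTERPRETERS:
        raise ValueError("unknown runtime: %r" % runtime)
    return [INTERPRETERS[runtime], event["script"]] + list(event.get("args", []))

def handle_message(body):
    """Run the function named in one queue message (a JSON string)."""
    event = json.loads(body)
    # check=True raises CalledProcessError on a non-zero exit, so a failed
    # job is not silently treated as processed.
    return subprocess.run(build_command(event), check=True, stdout=subprocess.PIPE)
```

Passing an argv list rather than a shell string avoids quoting problems with whatever arguments arrive in the message. A message would only be deleted from the queue after `handle_message` returns without raising.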

Full text storage

How do we store the full text? What about the .tex?

  1. We could store each as a separate file in an s3 bucket. But it costs ~5 pounds for a million put/get requests to s3.

  2. We could upload each directly to ES and access them by querying ES. These queries should be fast, because given the arxiv_id, it's just a lookup (ES's rows are indexed by arxiv id). But it still could be a decent bit of load on the ES server if it's handling lots of these.

  3. Something else ??

I'm leaning towards (2) but want to see if there are other options.

elasticsearch integration in AWS pipeline

Doesn't work yet. We need:

  1. The wrapper gets a field ESfield as an input and sets the field ESfield in elasticsearch to the output of the Lambda which it runs.
  2. The OAI harvester on AWS should update ES.
  3. We should have a function to scroll through ES and set fulltext and havethumb on the dynamodb table to "have" if ES says we have them.
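
Item 1 could be sketched roughly as follows. The `make_wrapper` name and the injected `update_doc(doc_id, body)` callable are hypothetical; the callable stands in for whatever client call performs the update, so the sketch runs without a cluster.

```python
def make_wrapper(func, es_field, update_doc):
    """Wrap func so that its output is also written to one ES field.

    update_doc(doc_id, body) is a hypothetical callable that performs the
    partial update; it is injected so the sketch runs without a cluster.
    """
    def wrapper(arxiv_id, *args, **kwargs):
        output = func(arxiv_id, *args, **kwargs)
        # {"doc": {...}} is the body shape of an Elasticsearch partial update.
        update_doc(arxiv_id, {"doc": {es_field: output}})
        return output
    return wrapper

# Stand-in for the real update call: just record what would be sent.
updates = []
count_words = make_wrapper(
    lambda arxiv_id, text: len(text.split()),
    "word_count",
    lambda doc_id, body: updates.append((doc_id, body)),
)
count_words("1234.5678", "a short abstract")
print(updates)
```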
