ericphanson / arxiv-search

Elasticsearch-backed rewrite of arxiv-sanity

License: MIT License

Shell 0.28% Python 17.57% JavaScript 6.62% CSS 43.32% HTML 6.58% TypeScript 25.52% TSQL 0.11%

arxiv-search's Issues

Login or create button should be greyed out if the fields are empty.

Noticed people clicking it thinking it would open a modal dialogue.

Login feedback

Give alerts/flares about login state to the user.

Animations

Searchbox focus
Animate positions of leaderboards
Animate paper moving in and out of library when you save it.
Some fancy papers loading animation
Meta-numbers fade in and out slowly.

Author names rendering on 1803.10179v2

Should render Jo~ao Ara'ujo properly (as João Araújo).

Top authors

Given a search query, facet by authors. What are the top authors by relevance score?
Can this data be used to give autocompletions in rich search?
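One way to picture "top authors by relevance score" is to sum the scores of the hits each author appears in. The sketch below does this client-side on a list of hits; the `score` and `authors` field names and the sample data are assumptions for illustration, not the project's actual schema, and in practice this would more likely be an aggregation run inside Elasticsearch.

```python
from collections import defaultdict


def top_authors(hits, n=3):
    """Sum relevance scores per author and return the n highest-scoring authors."""
    totals = defaultdict(float)
    for hit in hits:
        for author in hit["authors"]:
            totals[author] += hit["score"]
    # Sort by total score (descending), breaking ties alphabetically.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))[:n]


# Made-up hits, standing in for the results of one search query.
hits = [
    {"score": 2.0, "authors": ["A. Author", "B. Author"]},
    {"score": 1.5, "authors": ["B. Author"]},
    {"score": 0.5, "authors": ["C. Author"]},
]

print(top_authors(hits, n=2))
```

The same ranked list could feed the autocompletion idea: the top authors for the current query are natural suggestions.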

The search box should allow keywords such as in:, prim: and by: that trigger autocompletion, fancy formatting and so on. This means that all queries can be made using just the keyboard, focussed in one searchbox.
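A minimal sketch of how such a query string could be split into keyword filters and free text. The keyword set is taken from the issue; what each keyword means, the function name `parse_query`, and the quoting rule for multi-word values are assumptions.

```python
import shlex

# Keywords named in the issue; their exact semantics are not specified there.
KEYWORDS = {"in", "prim", "by"}


def parse_query(text):
    """Split a search string into {keyword: [values]} filters and free text."""
    filters = {}
    free_text = []
    # shlex.split keeps quoted values together, e.g. by:"Jane Doe".
    for token in shlex.split(text):
        key, sep, value = token.partition(":")
        if sep and key in KEYWORDS and value:
            filters.setdefault(key, []).append(value)
        else:
            free_text.append(token)
    return filters, " ".join(free_text)


print(parse_query('by:"Jane Doe" prim:quant-ph entropy bounds'))
```

Note that `shlex.split` raises `ValueError` on unbalanced quotes, so a real search box would need to handle a lone apostrophe or a half-typed quoted value.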

katex error: "Generalized test ideals, sharp F-purity, and sharp test elements"

search for "Generalized test ideals, sharp F-purity, and sharp test elements"

Queries with unicode accents cause a server error

Reproduction

Type the query "Özgür Açık". Look at server logs.
I got:

UnicodeEncodeError: 'ascii' codec can't encode character '\xd6' in position 95: ordinal not in range(128)
-7626696693452858257
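The character '\xd6' in the traceback is the "Ö" from the query. The sketch below reproduces the same kind of error and shows two standard ways to avoid it. Where exactly the server encodes the query as ASCII is not known from the log, so this only illustrates the failure mode, not the actual fix.

```python
from urllib.parse import quote

query = "Özgür Açık"

# Encoding as ASCII fails on the first non-ASCII character.
try:
    query.encode("ascii")
except UnicodeEncodeError as error:
    print("ascii failed:", error.reason)

# Encoding as UTF-8 always succeeds and round-trips.
encoded = query.encode("utf-8")

# If the query ends up in a URL, percent-encode it (UTF-8 by default).
url_safe = quote(query)
print(url_safe)
```

If the error comes from printing or logging rather than from building a request, the usual culprit is an ASCII default locale on the server, which is a configuration issue rather than a code one.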

Recommendations perf

Right now, to rec papers based on a user's library, we send a list of all of the papers in the library to ES, which identifies keywords for those papers, and searches against them. This is slow for ES, because it ends up having to weight and search against many many keywords.

A better way seems to be to do some kind of topic modelling. Sam was telling me about it on the board, and I took a picture:

I forget how it all works, but on the technical side, we would process the papers 1 by 1, building sparse vectors which correspond to an understanding of which words are associated to which topics, and for each paper produce a sparse vector corresponding to weights for that paper for each topic.

So a paper could be 20% topic 1 and 30% topic 2, etc.

We would run all the papers through this algo, identify say the top 10 topics for each paper, and upload those as a field on ES, keeping track of the weights.

Then when a user adds a paper to the library, we would grab the topics/weights, and keep an average interest vector for the user (or something like that). Then when we query elasticsearch, instead of passing a list of papers to find similar papers to, we would instead pass a list of top topics (with weights), to score the papers by.
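A minimal sketch of the bookkeeping described above, assuming each paper's topics arrive as a sparse `{topic_id: weight}` dict. The function names and the choice of a plain running average are assumptions; the issue itself leaves the exact form open ("or something like that").

```python
def top_topics(weights, k=10):
    """Keep the k heaviest topics of a sparse {topic_id: weight} vector."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:k])


def add_paper(interest, count, paper_topics):
    """Return the user's running-average interest vector after one more paper."""
    new_count = count + 1
    topics = set(interest) | set(paper_topics)
    updated = {
        t: (interest.get(t, 0.0) * count + paper_topics.get(t, 0.0)) / new_count
        for t in topics
    }
    return updated, new_count


# A user saves two papers; the topic ids and weights are made up.
interest, count = {}, 0
interest, count = add_paper(interest, count, {1: 0.2, 2: 0.3})
interest, count = add_paper(interest, count, {2: 0.5, 3: 0.1})

# The heaviest topics are what would be sent to Elasticsearch for scoring.
print(top_topics(interest, k=2))
```

Because only the average vector and a paper count are stored per user, the size of the query stays bounded by the number of topics kept, not by the size of the library.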

This seems like it would be much more performant, scales well with the number of papers the user gets, and might give better recs too. We could try different topic modelling algs as well.

This is a different type of processing workflow, because (1) there is internal state (the model of which words correspond to what topics), and (2) we don't want to parallelize when learning the model from the corpus.

I propose an EC2 instance with its own storage be dedicated to the task, along with a dedicated queue. Papers get thrown on the queue, processed one at a time by the EC2 instance, which then updates ES.

recruit summer students!!

Author carnage on 1801.06898v1

Has country names as authors!

UI is dropping the rendering of papers in some cases.

Example:

search for abc (no quotes). Then the results get overwritten by the second page, or something like that.

Save button should look like a button

Perhaps give it a fancy animation

CSS Responsive Layout

I want to sort out how the website lays itself out for different media types.
At the moment it's quite ad-hoc and breaks violently in some corner cases, for example if the paper includes some very wide non-linebreaking text.

Above 1000px, center the grid but let the top header bar bleed to the edges.
Below 1000px, results content area shrinks.
Below 600px, sidebar goes to collapsed menu below search bar.
Try to do this all with CSS only. Try and do it so it's robust to content changes.

An EC2 instance should be configured to read the SQS queue and process events, by "firing" the associated lambda (i.e. running the corresponding code on the EC2 instance itself).

Why?

Cost: running 10 million Lambdas for 10 seconds each costs $200, and that's a realistic number of Lambdas for processing the million papers on the arXiv.
Still scales reasonably well. The SQS queue has nice properties so we don't miss events, and we can always boot up more EC2 instances to process the queue in parallel. We keep the parallelized structure we had from the lambdas (each queue event causes 1 thing to happen which updates the table, triggering the next event down the line, which could even be processed by a different EC2 instance).
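The $200 figure can be sanity-checked with a short calculation. The memory size (128 MB) and both prices below are assumptions chosen for illustration, not values stated in the issue; check current AWS pricing before relying on them.

```python
def lambda_cost(invocations, seconds_each, memory_gb,
                price_per_gb_second, price_per_million_requests):
    """Estimate a Lambda bill as compute charge plus per-request charge."""
    compute = invocations * seconds_each * memory_gb * price_per_gb_second
    requests = invocations / 1e6 * price_per_million_requests
    return compute + requests


estimate = lambda_cost(
    invocations=10 * 10**6,           # 10 million Lambdas, as in the issue
    seconds_each=10,                  # 10 seconds each, as in the issue
    memory_gb=0.125,                  # assumption: 128 MB functions
    price_per_gb_second=0.00001667,   # assumption: USD per GB-second
    price_per_million_requests=0.20,  # assumption: USD per million requests
)
print(round(estimate, 2))
```

Under these assumptions the estimate lands close to the quoted $200, and it scales linearly with memory size, so larger functions would cost proportionally more.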

What needs to be done?

Need to

learn about SQS and EC2 a bit,
update the cloudformation template to deploy SQS and EC2,
update watchstatus.. to send events to the SQS queue instead of firing wrappers
either poll the SQS queue from EC2 or figure out how to fire code from events. Write a small handler to run the correct function based on the details from the queue. Ideally one which can fire both python and javascript events. (Python 3.5's subprocess.run() sounds pretty good for firing these off).
configure appropriate cloudwatch logging for what the EC2 instance does.
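The polling option and the "small handler" could look roughly like the sketch below. It assumes `sqs` is a boto3 SQS client (for example `boto3.client("sqs")`) and that each message body is JSON with `type` and `arxiv_id` keys; the message format, the `HANDLERS` table and its `echo` entry are all hypothetical.

```python
import json
import subprocess
import sys

# Hypothetical table mapping an event type to the command that handles it.
# A JavaScript handler would be another entry, e.g. ["node", "handler.js"].
HANDLERS = {
    "echo": [sys.executable, "-c", "import sys; print(sys.argv[1])"],
}


def dispatch(body, handlers=HANDLERS):
    """Run the command registered for this event and return its stdout."""
    event = json.loads(body)
    command = handlers[event["type"]] + [event["arxiv_id"]]
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if the handler exits with a non-zero status
    )
    return result.stdout.strip()


def poll_once(sqs, queue_url, handlers=HANDLERS):
    """Long-poll the queue once, handle one message, then delete it."""
    response = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    outputs = []
    for message in response.get("Messages", []):
        outputs.append(dispatch(message["Body"], handlers))
        # Delete only after the handler succeeded, so failures are retried.
        sqs.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    return outputs
```

Deleting the message only after `dispatch` returns means a crashed handler leaves the event on the queue, which matches the "we don't miss events" property the issue relies on.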

(1) We could store each as a separate file in an s3 bucket. But it costs ~5 pounds for a million put/get requests to s3.
(2) We could upload each directly to ES and access them by querying ES. These queries should be fast, because given the arxiv_id, it's just a lookup (ES's rows are indexed by arxiv id). But it still could be a decent bit of load on the ES server if it's handling lots of these.
(3) Something else?

I'm leaning towards (2) but want to see if there are other options.

elasticsearch integration in AWS pipeline

Doesn't work yet. Still needed:

wrapper gets a field ESfield as an input. Sets the field ESfield in elasticsearch to the output of the Lambda which it runs.
The OAI harvester on AWS should update ES.
We should have a function to scroll through ES and set fulltext and havethumb on the dynamodb table to have if ES says we have them.