ericphanson / arxiv-search
Elasticsearch-backed rewrite of arxiv-sanity
License: MIT License
Noticed people clicking it thinking it would open a modal dialogue.
Give alerts/flares about login state to the user.
Should render Jo~ao Ara'ujo properly
Given a search query, facet by authors. What are the top authors by relevance score?
Can this data be used to give autocompletions in rich search?
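As a rough illustration of "top authors by relevance score", here is a minimal Python sketch that sums scores per author over a list of search hits. The hit shape (`authors`, `score`) and the author names are assumptions made for the example, not the project's actual response format, and summing is only one possible way to combine scores.

```python
from collections import defaultdict

def top_authors(hits, n=5):
    """Sum relevance scores per author and return the n highest totals."""
    totals = defaultdict(float)
    for hit in hits:
        for author in hit["authors"]:
            totals[author] += hit["score"]
    # Sort by total score descending, then by name for a stable order.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical hits, for illustration only.
hits = [
    {"authors": ["A. Author", "B. Author"], "score": 2.0},
    {"authors": ["B. Author"], "score": 1.5},
    {"authors": ["C. Author"], "score": 3.0},
]
print(top_authors(hits, n=2))
```

The same ranked list could plausibly feed author autocompletion, since it is already ordered by how relevant each author is to the current query.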
The search box should allow in:, prim:, by:, etc. keywords that cause autocompletion and fancy formatting and so on. This means that all queries can be made using just the keyboard, focussed in one search box.
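One way the keyword idea could work is to split the raw text into keyword filters and free-text terms before building the search. The sketch below is an assumption-heavy illustration: the keyword set is just the three prefixes named above, their meanings are left opaque, and `parse_query` and the sample values are hypothetical names for this example.

```python
import re

# Hypothetical keyword set, taken from the prefixes named in this issue.
KEYWORDS = ("in", "prim", "by")

# An optional "key:" prefix followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|(\S+))')

def parse_query(text):
    """Split a search string into keyword filters and free-text terms."""
    filters = {}
    terms = []
    for match in TOKEN.finditer(text):
        key, quoted, bare = match.group(1), match.group(2), match.group(3)
        value = quoted if quoted is not None else bare
        if key in KEYWORDS:
            filters.setdefault(key, []).append(value)
        elif key is not None:
            # Unknown prefix: keep it as ordinary text.
            terms.append(f"{key}:{value}")
        else:
            terms.append(value)
    return filters, " ".join(terms)

filters, free_text = parse_query('by:"Özgür Açık" prim:math.AC test ideals')
print(filters, free_text)
```

Quoted values let a keyword carry a phrase with spaces, such as a full author name. Note that this sketch drops the quotes around free-text phrases, which a real implementation would probably want to preserve.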
search for "Generalized test ideals, sharp F-purity, and sharp test elements"
Type the query "Özgür Açık". Look at server logs.
I got:
UnicodeEncodeError: 'ascii' codec can't encode character '\xd6' in position 95: ordinal not in range(128)
-7626696693452858257
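The error has the shape Python produces when a string containing non-ASCII characters is encoded with the ASCII codec ('\xd6' is 'Ö'). The exact place in the server where this happens is not known from the log line alone, so the following is only a minimal reproduction of the error class, not a diagnosis.

```python
query = "Özgür Açık"

# Encoding with the ASCII codec fails on the first non-ASCII character.
try:
    query.encode("ascii")
except UnicodeEncodeError as err:
    message = str(err)
    print(message)

# Encoding as UTF-8 succeeds and round-trips without loss.
encoded = query.encode("utf-8")
assert encoded.decode("utf-8") == query
```

A common source of this error is printing or logging to a stream whose encoding defaults to ASCII. If that turns out to be the case here, setting `PYTHONIOENCODING=utf-8` in the environment, or calling `sys.stdout.reconfigure(encoding="utf-8")` on Python 3.7+, is worth trying.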
Right now, to rec papers based on a user's library, we send a list of all of the papers in the library to ES, which identifies keywords for those papers, and searches against them. This is slow for ES, because it ends up having to weight and search against many many keywords.
A better way seems to be to do some kind of topic modelling. Sam was telling me about it on the board, and I took a picture:
I forget how it all works, but on the technical side, we would process the papers one by one, building sparse vectors which correspond to an understanding of which words are associated with which topics, and for each paper produce a sparse vector corresponding to weights for that paper for each topic.
So a paper could be 20% topic 1 and 30% topic 2, etc.
We would run all the papers through this algo, identify say the top 10 topics for each paper, and upload those as a field on ES, keeping track of the weights.
Then when a user adds a paper to the library, we would grab the topics/weights, and keep an average interest vector for the user (or something like that). Then when we query elasticsearch, instead of passing a list of papers to find similar papers to, we would instead pass a list of top topics (with weights), to score the papers by.
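A minimal sketch of the interest-vector idea, assuming each paper's topics are stored as a sparse mapping from topic id to weight. The function names, topic ids, and the choice of a plain average and a dot product are illustrative assumptions; the actual topic model and the Elasticsearch scoring are not specified here.

```python
from collections import defaultdict

def interest_vector(library):
    """Average the sparse topic vectors of every paper in a library."""
    totals = defaultdict(float)
    for topics in library:
        for topic, weight in topics.items():
            totals[topic] += weight
    n = len(library)
    return {t: w / n for t, w in totals.items()} if n else {}

def top_topics(vector, k=10):
    """Return the k heaviest topics, heaviest first."""
    return sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

def score(paper_topics, interests):
    """Dot product of two sparse vectors: higher means a closer match."""
    return sum(w * interests.get(t, 0.0) for t, w in paper_topics.items())

# Hypothetical library of two papers with made-up topic weights.
library = [{"t1": 0.2, "t2": 0.3}, {"t2": 0.5, "t3": 0.1}]
interests = interest_vector(library)
print(top_topics(interests, k=2))
print(score({"t2": 1.0}, interests))
```

The point of the sketch is that the query size depends on the number of top topics kept (`k`), not on the number of papers in the library.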
This seems like it would be much more performant, scales well with the number of papers the user gets, and might give better recs too. We could try different topic modelling algs as well.
This is a different type of processing workflow, because (1) there is internal state (the model of which words correspond to what topics), and (2) we don't want to parallelize when learning the model from the corpus.
I propose an EC2 instance with its own storage be dedicated to the task, along with a dedicated queue. Papers get thrown on the queue, processed one at a time by the EC2 instance, which then updates ES.
Has country names as authors!
Example:
search for abc (no quotes). Then the results get overwritten by the second page, or something like that.
Perhaps give it a fancy animation
I want to sort out how the website lays itself out for different media types.
At the moment it's quite ad-hoc and breaks violently in some corner cases; for example if the paper includes some very wide non-linebreaking text.
If more than 1000px; center the grid but let top header bar bleed to edges.
Below 1000px, results content area shrinks.
Below 600px, sidebar goes to collapsed menu below search bar.
Try to do this all with CSS only. Try and do it so it's robust to content changes.
If you double-click recommended (with no query or filters, etc.), it freezes. It should just forget about the first query and load the new one, which should be very fast since it just gets time-ordered papers.
At the moment the client must download several megabytes of javascript!
It should really be getting React, KaTeX and friends from a CDN, and our code should be minified.
The URL bar should reflect the state of the app.
Also the back button should work.
Right now, the WatchStatus lambda reads the DynamoDB table stream and fires (wrapped) Lambdas accordingly.
Instead, WatchStatus should put the info to fire the Lambda into an SQS queue.
An EC2 instance should be configured to read the SQS queue and process events, by "firing" the associated lambda (i.e. running the corresponding code on the EC2 instance itself).
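The "fire the lambda locally" step could look roughly like the sketch below: messages name a task, and the worker looks up the matching function and runs it in-process. The standard-library `queue.Queue` stands in for SQS, and the message format, the `fetch_paper` task name, and the sample arXiv id are all assumptions made for illustration.

```python
import json
import queue

# Registry mapping task names to the functions that used to run as Lambdas.
HANDLERS = {}

def handler(name):
    """Decorator that registers a function under a task name."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("fetch_paper")  # hypothetical task name
def fetch_paper(payload):
    return f"fetched {payload['arxiv_id']}"

def drain(q):
    """Process queued messages one at a time until the queue is empty."""
    results = []
    while True:
        try:
            body = q.get_nowait()
        except queue.Empty:
            return results
        message = json.loads(body)
        fn = HANDLERS[message["task"]]
        results.append(fn(message["payload"]))

# In-memory stand-in for the SQS queue that WatchStatus would write to.
q = queue.Queue()
q.put(json.dumps({"task": "fetch_paper", "payload": {"arxiv_id": "1234.5678"}}))
print(drain(q))
```

Keeping the handlers as plain functions behind a registry means the same code could still be wrapped as a Lambda later if that turns out to be useful.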
Why?
What needs to be done?
Need to
So our many users can shower us with well deserved praise.
Two parts to this: client-side, need to make a page, make charts, etc.
Server-side: need to figure out how to get this type of data from elasticsearch. Seems like something similar to https://stackoverflow.com/questions/32593282/n-grams-with-frequency-number-using-elasticsearch.
How do we store the full text? What about the .tex?
1. We could store each as a separate file in an S3 bucket. But it costs ~5 pounds for a million put/get requests to S3.
2. We could upload each directly to ES and access them by querying ES. These queries should be fast, because given the arxiv_id, it's just a lookup (ES's rows are indexed by arxiv id). But it still could be a decent bit of load on the ES server if it's handling lots of these.
3. Something else?
I'm leaning towards (2) but want to see if there are other options.
doesn't work yet-- need