
fatcat-scholar's Introduction

Internet Archive Scholar

IA Scholar is an effort within the Internet Archive to track, preserve, index, and serve scholarly articles.

Our focus is on Open Access content that might otherwise disappear from the web, but we also aim to build an open bibliographic database covering all scholarly content.

This is the source code for scholar.archive.org, a full-text web search interface over the 25+ million open research papers in the Internet Archive.

All of the technical heavy lifting of harvesting, crawling, and metadata corrections is handled by the fatcat service; this service is just a bare-bones, read-only search interface. Unlike the basic fatcat.wiki search, this index allows querying the full content of papers when available.

Overview

This repository is fairly small and contains:

  • src/scholar/: Python code for the web service and indexing pipeline
  • src/scholar/templates/: HTML templates for the web interface
  • tests/: Python test files
  • proposals/: design documentation and change proposals
  • data/: empty directory used by the indexing pipeline

A data pipeline converts each group of one or more fatcat "release" entities (grouped under a single "work" entity) into a single search index document. Elasticsearch is used as the full-text search engine. A simple web interface parses search requests and formats Elasticsearch results with highlights and first-page thumbnails.
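As an illustration of the shape of that transform (a minimal sketch only; the field names and selection logic here are hypothetical, not the actual schema):

# Minimal sketch of the work-grouping transform; field and function names are
# hypothetical, not the actual schema used by the indexing pipeline.
from typing import Any

def work_to_es_doc(work_id: str, releases: list[dict[str, Any]]) -> dict[str, Any]:
    # Pick one "preferred" release to represent the work (the real logic is more involved)
    preferred = releases[0]
    return {
        "work_id": work_id,
        "title": preferred.get("title"),
        "release_ids": [r["ident"] for r in releases],
        "year": preferred.get("release_year"),
        # fulltext body, if any release has associated extracted text
        "body": next((r["fulltext"] for r in releases if r.get("fulltext")), None),
    }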

Getting Started for Developers

You'll need Python 3. We test against 3.11; your mileage may vary with older versions. Ensure that the pip and venv modules are available (on Debian these need to be installed manually via apt).

Most tasks are run using a Makefile; make help will show all options.

Working on the indexing pipeline effectively requires internal access to the Internet Archive cluster and services, though some contributions and bugfixes are probably possible without staff access.

To install dependencies for the first time, run:

make dep

then run the tests (to ensure everything is working):

make test

To start the web interface, run:

make serve

While developing the web interface, you will almost certainly need an example database running locally. A docker-compose file in extra/docker/ can be used to run Elasticsearch 7.x locally. The make dev-index command will reset the local index with the correct schema mapping, and index any intermediate files in the ./data/ directory. We don't have an out-of-the-box solution for non-IA staff at this step (yet).
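A typical local loop might look like the following (assuming the compose file in extra/docker/ is named docker-compose.yml and that intermediate files already exist in ./data/):

docker-compose -f extra/docker/docker-compose.yml up -d
make dev-index
make serve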

After making changes to any user interface strings, the interface translation file (".pot") needs to be updated with make extract-i18n. When these changes are merged to master, the Weblate translation system will be updated automatically.

This repository uses ruff for code formatting and mypy for type checking; please run make fmt and make lint before submitting a pull request.

Contributing

Software, copy-editing, translation, and other contributions to this repository are welcome! For content and metadata corrections, or for identifying new content to include, the best place to start is the fatcat repository. Learn more in the fatcat guide. You can chat and ask questions on gitter.im/internetarchive/fatcat.

Contributors in this project are asked to abide by our Code of Conduct.

The web interface is translated using the Weblate platform, under the internetarchive/fatcat-scholar project.

The software license for this repository is the Affero General Public License v3+ (AGPL v3+), as described in the LICENSE.md file. We ask that you acknowledge the license terms when making your first contribution.

For software developers, the "help wanted" tag in GitHub Issues is a way to discover bugs and tasks that external folks could contribute to.

fatcat-scholar's People

Contributors

alfonso133, aniketshahane, astro-marco, atalanttore, bnewbold, cclauss, coding-young, comradekingu, dandelionsprout, eugenia-russell, f0rb1d, fitojb, gbrlgn, gdamdam, giorgio93p, jaimemf, kovalevartem, miku, milotype, nautilusx, padanian, rochacbruno, santossi, sr093906, sreenikethmadgula, stonkol, t1011, vilmibm, weblate, yireun


fatcat-scholar's Issues

Continuous updates from fatcat catalog

The current experimental index is a one-shot, based on a 2020-08 export of fatcat release entities. Of course we want updates to flow from fatcat to the scholar index in the same way that entity updates currently flow to the fatcat metadata search index.

The rough plan for this feature is:

  • two new Kafka topics: one for work identifiers needing re-indexing, and one for "heavy intermediate" JSON objects (ready for transform and indexing)
  • changes to the fatcat entity updater to find all work identifiers that need updating as a result of an editgroup, and publish these to the needs-updating queue. It is important that these are de-duplicated within the editgroup. We already grab all updated releases (eg, even for new file creation), so this should be easy.
  • a new worker (consume+publish) to generate heavy intermediate objects per work identifier (see the sketch below)
  • a new worker (consume) to transform heavy intermediates to the elastic schema and send them to the index. This should probably work in batches of 50+ documents at a time.
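A minimal sketch of the consume+publish worker, assuming confluent-kafka; the topic names, group id, and fetch_heavy_intermediate() helper are hypothetical placeholders, not the actual pipeline:

import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "scholar-intermediate-worker",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["scholar.work-ids-needing-update"])

def fetch_heavy_intermediate(work_ident: str) -> dict:
    # Hypothetical helper: would fetch release entities, fulltext, thumbnails, etc.
    raise NotImplementedError

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    work_ident = msg.value().decode("utf-8")
    intermediate = fetch_heavy_intermediate(work_ident)
    # keying by work identifier keeps Kafka partitioning aligned with the work entity
    producer.produce(
        "scholar.heavy-intermediates",
        key=work_ident,
        value=json.dumps(intermediate),
    )
    producer.flush()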

Because scholar/fulltext index updates are relatively expensive compared to regular fatcat entity index updates, we might want to consider some optimizations:

  • match kafka partitioning/sharding to elasticsearch partitioning/sharding (eg, based on work identifier) to minimize cross-index updates in elasticsearch
  • some "linger" delay between a work update and the index update, in case the work is rapidly updated again. Considering all edits within an editgroup will catch some of this, but, for example, the daily arxiv imports will result in new release entities, then just an hour or two later a new file entity, both of which will result in work-level update requests.
  • checking the difference between the old and new document at index time, and not re-indexing if nothing has changed. Not sure if we can pull the "body" field from the index (this should not be publicly possible, at least), but we can infer that content from fulltext metadata. This would save re-indexing churn for some edits.

Review fulltext schema and queries for performance, index size

In August, I made some changes to the elasticsearch schema and query structure to try to improve query latency (aka, speed and performance). I kept a log of notes on these experiments here: https://github.com/internetarchive/fatcat-scholar/blob/master/notes/scaling_works.md

These should be reviewed and revisited as we move towards broad release. In particular, more quantitative measurements would be nice to have, and it seems like index_phrases may have caused a significant increase in disk utilization (50%? memory is fuzzy) with little or no latency improvement.

External requests could be async/await

Calls to Elasticsearch, GROBID, and Fatcat API are done as regular synchronous HTTP fetches. FastAPI is an async framework, and we should be making these calls using async code paths (to be clear, meaning async/await syntax, not the /_async_search/ API).

For the citation query code path (requests to GROBID, fatcat elasticsearch, and the fatcat API), the plan is to move all this code into fuzzycat and use async versions of the code (eg, aiohttp for GROBID and the fatcat API, the async/await version of the elasticsearch python library). For the primary queries against the fulltext index, we are waiting on support in elasticsearch-dsl (elastic/elasticsearch-dsl-py#1480). Alternatively, we might be able to use the existing elasticsearch-dsl to generate the query object (as a dict), and pass that to an async elasticsearch-py function call. Not sure.
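A rough sketch of what the GROBID call might look like with aiohttp (the host, endpoint path, and form field name are assumptions from memory, not taken from the current codebase):

import asyncio
import aiohttp

GROBID_HOST = "http://localhost:8070"  # hypothetical

async def parse_citation(session: aiohttp.ClientSession, raw_citation: str) -> str:
    # POST the raw citation string to GROBID and return the TEI-XML fragment
    async with session.post(
        f"{GROBID_HOST}/api/processCitation",
        data={"citations": raw_citation},
    ) as resp:
        resp.raise_for_status()
        return await resp.text()

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tei = await parse_citation(session, "Alam, S., et al. (2013). HTTP Mailbox. arXiv.")
        print(tei)

if __name__ == "__main__":
    asyncio.run(main())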

Fulltext linking: show type (PDF, HTML, etc); display multiple access options; title link behavior

Current behavior is that the blue SERP result links ("title") go to fatcat.wiki landing pages, while thumbnail links go to the IA copy of the fulltext PDF. Analytics show that users click on both, but clicking the title is more common than clicking the thumbnail.

We want to support HTML fulltext as well, but should indicate the format visually so users know what they are getting. We also want to display multiple access options, eg alternative domains.

Display volume, issue, pages

In the "journal" row, after the journal name. Not sure what the style should be, eg:

Vol. 1, Iss. 2, p3-4

Or full words?

Volume 1, Issue 2, p3-4

Or citation style?

1(2) p3-4

social "cards" when sharing links

At a minimum, there should be a short blurb about the site when links are shared from the home page (with i18n based on the language prefix).

This is implemented via HTML meta tags. IIRC there is a pseudo-standard, but also a bunch of platform-specific tricks and rules. Sigh. We should probably show the logo and some description.

For search URLs, could share the same card, or something custom that includes the query string.

For "exact match" search URLs, could show the thumbnail, if one exists?

Relatedly, we should try to opt in to Google's "search box in search results" feature. I tried to do this with fatcat.wiki but have failed (so far), although it works for "internet archive": https://www.google.com/search?q=internet+archive

HTML error pages (eg, 404, 500)

For requests that hit an error (eg, 404 or generic 500) and are coming from a browser (eg, not requesting a JSON response), we should display generic "not found" or "server error" pages.

It would be good to detect whether the URL path has a language prefix and show a translated error page, though we need to be careful not to cause a recursive error loop (eg, if the error/exception was with translation itself).
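A minimal sketch of the browser-vs-JSON dispatch, assuming a FastAPI app (the HTML bodies here are placeholders; a real handler would render templates and include the translation-failure guard described above):

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse, JSONResponse
from starlette.exceptions import HTTPException as StarletteHTTPException

app = FastAPI()

@app.exception_handler(StarletteHTTPException)
async def http_error_handler(request: Request, exc: StarletteHTTPException):
    # Browsers send Accept: text/html; API clients generally do not
    wants_html = "text/html" in request.headers.get("accept", "")
    if wants_html:
        title = "Not Found" if exc.status_code == 404 else "Server Error"
        return HTMLResponse(f"<h1>{title}</h1>", status_code=exc.status_code)
    return JSONResponse({"detail": exc.detail}, status_code=exc.status_code)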

only highlight query string in result highlights

The returned API results also highlight the search language filter (in the "_highlights" field), not just the query terms.

I think this is not really useful. I would suggest splitting the language from the query terms (basic normalization of the search parameters).

So instead of "q=lang:en+foobar", something like "lang=en&q=foobar".
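A naive sketch of pulling the lang: filter out of the raw query string before it reaches the highlighter (purely illustrative, not the current parser):

import re
from typing import Optional, Tuple

LANG_PATTERN = re.compile(r"\blang:(\w{2,3})\b")

def split_lang_filter(q: str) -> Tuple[Optional[str], str]:
    # Extract a "lang:xx" filter and return (lang, remaining query terms)
    match = LANG_PATTERN.search(q)
    lang = match.group(1) if match else None
    terms = LANG_PATTERN.sub("", q).strip()
    return lang, terms

# split_lang_filter("lang:en foobar") == ("en", "foobar")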

Generated bibliography files have incomplete author info

I queried https://scholar.archive.org/search?q=sawood and clicked the "cite this work" button to copy the BibTeX records of the first few results. I noticed that the author records in them were inaccurate/incomplete.


HTTP Mailbox - Asynchronous RESTful Communication [ARTICLE]
Sawood Alam and Charles L. Cartledge and Michael L. Nelson
2013 arXiv   PRE-PRINT

Resulted in:

@article{nelson_2013, 
  title={HTTP Mailbox - Asynchronous RESTful Communication}, 
  abstractNote={We describe HTTP Mailbox, a mechanism to enable RESTful HTTP communication in
an asynchronous mode with a full range of HTTP methods otherwise unavailable to
standard clients and servers. HTTP Mailbox allows for broadcast and multicast
semantics via HTTP. We evaluate a reference implementation using ApacheBench (a
server stress testing tool) demonstrating high throughput (on 1,000 concurrent
requests) and a systemic error rate of 0.01%. Finally, we demonstrate our HTTP
Mailbox implementation in a human assisted web preservation application called
"Preserve Me".}, 
  author={Nelson}, 
  year={2013}, 
  month={May}}

Support for Various HTTP Methods on the Web [ARTICLE]
Sawood Alam and Charles L. Cartledge and Michael L. Nelson
2014 arXiv   PRE-PRINT

Resulted in:

@article{nelson_2014, 
  title={Support for Various HTTP Methods on the Web}, 
  abstractNote={We examine how well various HTTP methods are supported by public web
services. We sample 40,870 live URIs from the DMOZ collection (a curated
directory of World Wide Web URIs) and found that about 55% URIs claim support
(in the Allow header) for GET and POST methods, but less than 2% of the URIs
claim support for one or more of PUT, PATCH, or DELETE methods.}, 
  author={Nelson}, 
  year={2014}, 
  month={May}}

Web Archive Profiling Through Fulltext Search [CHAPTER]
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, David S. H. Rosenthal
2016 Lecture Notes in Computer Science

Resulted in:

@inbook{alam_nelson_sompel_rosenthal_2016, 
  title={Web Archive Profiling Through Fulltext Search}, 
  DOI={10.1007/978-3-319-43997-6_10}, 
  publisher={Springer International Publishing}, 
  author={Alam and Nelson and Sompel and Rosenthal}, 
  year={2016}}

Web archive profiling through CDX summarization
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, David S. H. Rosenthal
2016 International Journal on Digital Libraries

Resulted in:

@article{alam_nelson_sompel_balakireva_shankar_rosenthal_2016, 
  title={Web archive profiling through CDX summarization}, 
  volume={17}, 
  DOI={10.1007/s00799-016-0184-4}, 
  publisher={Springer Nature}, 
  author={Alam and Nelson and Sompel and Balakireva and Shankar and Rosenthal}, 
  year={2016}, 
  month={Jul}}

Fix periods (eg, for Croatian translation) in filter bar CSS (right-to-left)

Currently, as sort of a hack, we use right-to-left (RTL) mode for the filters sidebar, to right-justify text without wrapping.

This breaks display of, eg, Croatian numerals, which have a period after the number. In the current CSS, the period is displayed on the far left instead of just after the number as expected. This is presumably due to the RTL mode.

I think the fix will be to use different CSS to achieve right-justification. Will need to test the fix in a RTL language (eg, Arabic) before deploying.

RSS feeds for search queries

This feature would allow creation of RSS feed endpoints for any search query. The feed would allow users to "subscribe" to new search hits.

Some implementation thoughts:

  • if query is embedded in feed URL, no need to retain any server-side state
  • the utility of this may depend on having decent subject/categorization metadata? or maybe not, if keywords are used
  • the feed could be sorted/filtered by 1) release date of works, 2) index document update time, or 3) index document creation time; or maybe some combination? index document creation time will not be a stable/long-term metadata field (eg, when re-indexing, all document creation times will change)
  • it is an assumption that this would not result in many actual queries and much search engine load, as RSS is usually only fetched... daily? If there were a lot of load, we could cache results in elasticsearch itself (eg, hash the query string, check whether there is a cached result from the past N hours, and only run the query and update the cache if stale; store results in a separate index; see the sketch below)
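A sketch of the query-hash caching idea from the last bullet; the cache index name is hypothetical and the caller supplies the normal search code path:

import hashlib
from datetime import datetime, timedelta, timezone

CACHE_MAX_AGE = timedelta(hours=6)

def query_cache_key(query_string: str) -> str:
    # normalize lightly, then hash, so equivalent queries share a cache entry
    return hashlib.sha256(query_string.strip().lower().encode("utf-8")).hexdigest()

def get_feed_results(es_client, run_search_query, query_string: str) -> list:
    key = query_cache_key(query_string)
    cached = es_client.get(index="scholar_feed_cache", id=key, ignore=[404])
    if cached.get("found"):
        fetched = datetime.fromisoformat(cached["_source"]["fetched"])
        if datetime.now(timezone.utc) - fetched < CACHE_MAX_AGE:
            return cached["_source"]["hits"]
    hits = run_search_query(query_string)  # caller-provided: the normal search code path
    es_client.index(
        index="scholar_feed_cache",
        id=key,
        body={"fetched": datetime.now(timezone.utc).isoformat(), "hits": hits},
    )
    return hits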

i18n/hr: 'Support and Acknowledgements' paragraph on About page not displaying translation (even though translation exists)

For some reason the last paragraph, Support and Acknowledgements, at https://scholar-qa.archive.org/hr/about does not show in Croatian, although the strings are translated on Weblate (https://hosted.weblate.org/translate/internetarchive/fatcat-scholar/hr/?checksum=b661a6477d95136c&sort_by=-priority,position)

Also, on the page https://scholar-qa.archive.org/hr/help there are headings Tags and Persistent Identifiers which are not translated and do not exist on Weblate.

Search result localization and "boosting"

For cases where the interface language has been explicitly selected, we could have that selection change how search results are ranked and displayed. For example, documents in the selected language could be "boosted" to appear higher in the search results, and we could swap the title / original_title display order in some cases.

Off the top of my head, some things this would involve:

  • include a language filter as a "boost" in the elasticsearch query (along with the existing boosts) to impact ranking when appropriate (see the sketch below)
  • conditional logic for title display preference (probably jinja2 macro)
  • possibly a query filter, parameter, or some other indication that this change is happening and a way to opt-out
  • testing of these changes for performance, search quality, and user experience
  • documentation ("help") updates to describe this change
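A sketch of how the language boost might look with elasticsearch-dsl; the index name, field name, and boost value are hypothetical, not the production query:

from elasticsearch_dsl import Q, Search

def boosted_search(base_query, ui_lang: str | None) -> Search:
    # Boost documents matching the user's interface language without
    # filtering out other languages
    s = Search(index="scholar_fulltext")
    if ui_lang:
        q = Q(
            "bool",
            must=[base_query],
            should=[Q("term", lang={"value": ui_lang, "boost": 1.5})],
        )
    else:
        q = base_query
    return s.query(q)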

Better query parsing

A particular user request is to be able to paste a citation string into the search box and have "the right thing" happen in most cases. The current query parser (Elasticsearch's built-in one) doesn't work well for this; it expects a structured query string (with booleans, etc).

A great solution would be a custom query parser with perfect detection of user intent that "does the expected thing". In the meanwhile, more practically, we could try to differentiate between regular queries and citation string queries, and have two code paths. The query string path would be the current behavior. The citation string path would use, eg, GROBID and/or biblio-glutton to parse the raw citation into a structured citation, then try to do a fuzzy match against the live fatcat metadata index (generally faster than the scholar fulltext index), and if there is a hit, do an exact identifier lookup against scholar elasticsearch. The latter half of this code path would be similar to the current behavior for identifier lookups (eg, remove all filters and sort order).
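A crude heuristic for routing between the two code paths might look like this (the thresholds are arbitrary placeholders and would need tuning against real query logs):

import re

# Markers of an intentionally structured query: quotes, booleans, field:value syntax
STRUCTURED_MARKERS = re.compile(r'("|\bAND\b|\bOR\b|\bNOT\b|\w+:\S)')

def looks_like_citation(q: str) -> bool:
    if STRUCTURED_MARKERS.search(q):
        return False  # the user is already writing a structured query
    words = q.split()
    has_year = bool(re.search(r"\b(19|20)\d{2}\b", q))
    many_commas = q.count(",") >= 2
    return len(words) >= 8 and (has_year or many_commas)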

Gitlab CI testing

Internet Archive uses Gitlab for most of our projects, so this project uses a .gitlab-ci.yml file, and we want to stick with that for testing. The way we have done this for the fatcat repository is to set up a mirroring account on gitlab.com and run the CI there. This works surprisingly well!

Alternatively, if there were an off-the-shelf way to run GitHub Actions using the .gitlab-ci.yml file (eg, some shim Dockerfile?), that would work as well.

January 2021 UI iteration bugs

Tracking issue for small UI (HTML/CSS) bugs with the current iteration of the web interface:

  • title / highlight left-alignment doesn't match. probably because of <details> padding
  • safari (iOS and desktop) ghost thumbnail (SVG) is very tall in fulltext drop-down
  • desktop circle buttons (under access links) not centered correctly (again)
  • no highlight results: should not have a vertical gap between external identifiers (green links) and journal metadata row
  • mobile: no-access circle buttons look weird... floating but not centered
  • mobile: access links generally need position fixes. eg, vertical margin same between current and next search result. maybe should be centered
  • mobile: next/previous links are mangled (line wrap)

Example no-highlight/no-link page: https://scholar-qa.archive.org/search?q=SUBCUTANEOUS+RUPTURE+OF+THE+ACHILLES+TENDON&filter_availability=everything

Thanks to @avdempsey for catching many of these.

Small web UI tweaks (CSS, etc)

  • header bar has rounded edges; should be rectangular (just a "bar")
  • swap out logo for officially purchased/generated, to remove last remaining watermark artifacts (at least one tester has noticed)
  • sometimes bottom footer still doesn't "stick" correctly. Eg, the search query error page

DBLP: "of" to "&" ?

<td>{% trans %}Digital Bibliography of Logic Programming{% endtrans %}</td>

Reading in DBLP in Wikipedia:

DBLP originally stood for DataBase systems and Logic Programming. As a backronym, it has been taken to stand for Digital Bibliography & Library Project;[9] however, it is now preferred that the acronym be simply a name, hence the new title "The DBLP Computer Science Bibliography".[10]

Should the "of" be replaced with &amp; so that it renders as "&"?

Remove search query from goatcounter tracking

Currently the search query is leaked into goatcounter tracking via the page "title". Only one recent "title" ends up getting displayed in goatcounter, but this is still a medium-priority privacy concern.

It should be possible to fix this by configuring the goatcounter tracker on specific pages, or by never grabbing the page title.

better query support for exact matching

Partially a query parsing issue, but likely to also be an indexing issue.

Especially in technical fields, or when doing digital humanities-style queries, there are a lot of valid queries which include meta characters. It is not clear how to represent many of these in the Lucene query syntax, or how to escape out to a simpler syntax. It is also not clear how many of these can even be handled by the query engine. Some examples (an escaping sketch follows the list):

  • A* search in computer science ("A star" algorithm)
  • identifiers used in bio-medicine. could try to query by prefix, suffix, or sub-patterns. sometimes dashes, periods, spaces, or other characters have meaning
  • math. even simple things like searching for exponentiation. or symbols like β (\beta in LaTeX). appear in titles, abstracts, body, citations, etc. do we flatten these down (in a unicode-aware way) to, eg, "b" for indexing? expand "beta"? other issues: function syntax, arrows, primes, dots, set inclusion, real numbers ("R"), integers ("N"), dot product, etc.
  • chemical formula: arrows, other notation
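One piece of this is the "escape out to a simpler syntax" option; a sketch of escaping Lucene query-syntax metacharacters so a query like "A* search" can be treated literally (the reserved character list is from memory and should be checked against the Lucene/Elasticsearch docs):

import re

# Lucene reserved characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \ /
LUCENE_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def escape_lucene(q: str) -> str:
    return LUCENE_SPECIAL.sub(r"\\\1", q)

# escape_lucene("A* search") == r"A\* search"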

Implement OAI-PMH API

There might be reasons not to do this that I am forgetting, but if there were an OAI-PMH API for this catalog, the PDF links could be ingested by Unpaywall. This would help for, e.g., 1800s articles in The Lancet, which are otherwise not accessible without a subscription. If that would be appropriate, I am happy to try to implement it.

Croatian translation

Please implement "Croatian" in the language menu and update translations from Weblate.

search results page occasionally dumps escaped HTML for part of the page

Repro steps:

  • open safari incognito window, go to: scholar.archive.org

  • search "Beyonce"

  • search results appear as HTML string

  • on page refresh, elements are no longer strings, styles are properly applied.

(Screenshot: side-by-side results page with string HTML and its element via the inspector.)

(Screenshot: side-by-side inspector/view diff of the sidebar with the HTML string.)

Localize dates and numbers

Please enable localized forms for dates (in this case year) and numbers.

For instance, Croatian needs a period after the year. Also, a period is used as thousands separator … other languages use different forms …

(Screenshot: localized punctuation examples.)
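If the project's i18n stack is Babel-based (an assumption here, not confirmed from the codebase), locale-aware formatting could look roughly like this:

from datetime import date
from babel.dates import format_date
from babel.numbers import format_decimal

# Croatian date formats include the trailing period, and "." is the thousands separator
print(format_date(date(2020, 1, 5), format="long", locale="hr"))  # eg, "5. siječnja 2020."
print(format_decimal(1234567, locale="hr"))                       # "1.234.567"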

Ability to override stemming in queries

It would be good to be able to override stemming at query time. Stemming by default probably makes the most sense, even for phrase queries, but in some cases we would want to return only exact matches.

For example, "dancing" matches "dance". We could have an alternative field or flag so it only matches "dancing".

In terms of implementation, this could definitely be done by including an extra non-stemmed copy of all fields. For example: everything.exact:"dancing".
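A sketch of what that could look like in the index mapping (field names and analyzer choices are illustrative, not the current schema):

# A stemmed default field with a non-stemmed ".exact" subfield, so queries
# can target everything.exact for literal matches. Illustrative only.
EVERYTHING_MAPPING = {
    "everything": {
        "type": "text",
        "analyzer": "english",  # stemmed: "dancing" -> "danc"
        "fields": {
            "exact": {
                "type": "text",
                "analyzer": "standard",  # tokenized and lowercased, but not stemmed
            },
        },
    },
}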

Related: case-sensitivity, projecting down diacritics (et al), and off-by-N fuzzy typo tolerance.

A blog post noting this issue: https://jurnsearch.wordpress.com/2020/09/28/internet-archive-scholar-is-live/

PDFs with GROBID 'success' but empty GROBID fulltext body lack any access option on scholar.archive.org (eg, presentation slides)

Edit: was "Some works archived on fatcat appear as unarchived in fatcat-scholar"

I have been unable to find a pattern, but some works archived on fatcat appear as unarchived in fatcat-scholar. They are also not shown in search results on scholar unless filter_availability=everything is set. In the fatcat search, they are listed as bright archives.

The archived files were created by the savepaper-now bot a couple of days ago, so I don't think it's an issue of invalid data or of fatcat and fatcat-scholar being out of sync.

Indexing: add field (or tag?) for preservation status

A feature difference between scholar and fatcat is that fatcat displays non-IA preservation status. As a first step towards supporting this in scholar, we could/should show "Keepers" preservation status for individual works.
This could include the specific keepers (LOCKSS, Portico, Hathitrust, etc), or just indicate "in at least one keeper". It could link directly to the keeper's registry on issn.org (eg, https://keepers.issn.org/?q=api/search&search[]=MUST=allissn=0958-1596&search[]=MUST_EXIST=keepers), or to fatcat, or just be a visual label (like the other tags are currently).
