Comments (1)
What seems to be happening in this particular corner case is that GROBID parses the PDF successfully, but the "body" is empty. Here is what the "intermediate" hydrated object looks like: https://archive.org/~bnewbold/tmp/release_hneyekhayrdwvpvzdlnbzqalfu.scholar_intermediate.json
Because the body is empty, we don't create a "fulltext" sub-object in the indexed document here: https://github.com/internetarchive/fatcat-scholar/blob/master/fatcat_scholar/transform.py#L250
There are a few ways we could try harder here. We could link to the file even though the extracted text is empty (i.e., add access options even if the fulltext object is empty). We could detect the empty GROBID body earlier in the pipeline and substitute raw extracted text ("pdftotext") at that point, so there would be at least something to index. We could try to improve GROBID extraction for slides, or detect that case and always use a different tool.
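As a rough illustration of the second option, here is a minimal sketch of the fallback logic. The function name `choose_fulltext_body` and its arguments are hypothetical and are not taken from the actual fatcat-scholar transform code; it only assumes that both extracted texts are available as plain strings (or `None`).

```python
from typing import Optional, Tuple


def choose_fulltext_body(
    grobid_body: Optional[str],
    pdftotext_body: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Pick the text to index, preferring the GROBID body.

    Returns (text, source), where source is "grobid", "pdftotext", or None.
    Hypothetical helper for illustration; not part of fatcat-scholar.
    """
    candidates = (("grobid", grobid_body), ("pdftotext", pdftotext_body))
    for source, text in candidates:
        # Treat None and whitespace-only strings as empty.
        if text and text.strip():
            return text.strip(), source
    # Neither extractor produced usable text.
    return None, None


# Example: GROBID parsed the PDF but returned an empty body (the slides case).
text, source = choose_fulltext_body("", "Slide 1: Introduction\n")
print(source, repr(text))
```

With something like this in place, the indexed document could still get a "fulltext" sub-object whenever either extractor produced text, and the `source` value could be recorded so lower-quality raw text is distinguishable later.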
Slides are in scope for both fatcat and scholar, and it would be good to fix this. This is on the back burner for me for the near future, but if you (or somebody else) would like to dig in and try to improve the behavior, I would be happy to review and give pointers.
I'm going to edit the title of this issue; I hope that is OK with you.