Comments (8)
An update: maybe check for special characters only; see https://en.wikipedia.org/wiki/Specials_(Unicode_block).
Many non-ASCII or non-Latin chars are actually useful, such as '≈', '∞'...
Rolling back the test function...
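A minimal sketch of what "check for special characters only" could look like, assuming the check is limited to the Unicode Specials block (U+FFF0 to U+FFFF). The function name `find_specials` is hypothetical and is not part of the existing test code.

```python
# Code point range of the Unicode "Specials" block.
SPECIALS_START, SPECIALS_END = 0xFFF0, 0xFFFF


def find_specials(text):
    """Return (index, character) pairs for characters in the Specials block."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if SPECIALS_START <= ord(ch) <= SPECIALS_END
    ]


# U+FFFD (replacement character) is flagged; useful symbols are not.
print(find_specials("Zeitschrift f\ufffdr Physik"))
print(find_specials("a ≈ ∞"))
```

With this kind of check, symbols such as '≈' and '∞' pass, while the replacement character U+FFFD is reported.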
from limesoup.
Sure, I'll do it.
For all parsers, pay attention to special HTML symbols in parsed metadata. For example, DOI 10.1002/adsc.201190008 (Wiley) has "Advanced Synthesis &amp; Catalysis", which should be "Advanced Synthesis & Catalysis". This is to be solved in the Wiley parser. @zjensen262 @zhugeyicixin Could you find journals in Springer that have similar problems?
from limesoup.
@zjensen262 I saw your format_text implementation in the latest pull request. However, I think a better approach than regular expressions is to use an existing library, such as the one described in https://stackoverflow.com/a/2087446/2310794.
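A minimal sketch of the library-based approach, assuming Python 3. It uses `html.unescape` from the standard library to decode HTML entities instead of hand-written regular expressions. The helper name `clean_metadata` is hypothetical, and it is an assumption that this matches what the linked answer recommends.

```python
import html


def clean_metadata(value):
    """Decode HTML entities (e.g. &amp;, &lt;, &#252;) in a metadata string."""
    return html.unescape(value)


# A journal title containing an HTML entity.
print(clean_metadata("Advanced Synthesis &amp; Catalysis"))
```

The advantage over a regex is that named, decimal, and hexadecimal entities are all handled by one well-tested call.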
from limesoup.
I checked the Springer parser and wrote a new test function (see #37). There are no HTML characters in the parsed result. But we might want to pay attention to some special characters that are not readable for humans. Currently, the test function issues a warning if a character is not in one of the "normal sets" (the exact sets need discussion): (1) ASCII, (2) extended Latin, (3) Greek letters.
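The check described above could be sketched roughly as follows. This is an illustration only, not the test function from #37: the name `find_unusual_chars` is hypothetical, and classifying characters by their Unicode name prefix ("LATIN", "GREEK") is an assumption about how the "normal sets" might be defined.

```python
import unicodedata


def find_unusual_chars(text):
    """Return characters that are not ASCII, Latin, or Greek letters."""
    unusual = []
    for ch in text:
        if ord(ch) < 128:
            continue  # plain ASCII
        # unicodedata.name returns the default "" for unnamed characters.
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN ") or name.startswith("GREEK "):
            continue  # extended Latin or Greek letter
        unusual.append(ch)
    return unusual


print(find_unusual_chars("Zeitschrift für Physik"))        # nothing flagged
print(find_unusual_chars("Pfl\ufffdgers Archiv"))          # flags U+FFFD
```

Note that a check of this kind also flags legitimate symbols such as '≈', so it is better suited to a warning than to a hard test failure.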
from limesoup.
As an example of the weird special characters, here are some journal names that are hard to read (taken from scraping rather than parsing, and used only as examples).
Pfl�gers Archiv - European Journal of Physiology
Pfl�gers Archiv European Journal of Physiology
Fresenius' Zeitschrift f�r Analytische Chemie
Monatshefte f�r Chemie/Chemical Monthly
Zeitschrift f�r Physik D Atoms, Molecules and Clusters
Monatshefte f�r Chemie Chemical Monthly
Monatshefte f�r Chemie
Monatshefte f�r Chemie / Chemical Monthly
Zeitschrift f�r Physik B Condensed Matter
Zeitschrift f�r Physik B Condensed Matter and Quanta
Zeitschrift f�r Analytische Chemie
Archiv f�r Mikrobiologie
Langenbecks Archiv f�r Chirurgie
Fresenius Zeitschrift f�r Analytische Chemie
Zeitschrift f�r Physik
Naunyn-Schmiedebergs Archiv f�r Experimentelle Pathologie und Pharmakologie
Zeitschrift f�r Lebensmittel-Untersuchung und -Forschung
Archiv f�r Elektrotechnik
Internationales Archiv f�r Arbeitsmedizin
Archiv f�r Toxikologie
Zeitschrift f�r Rheumatologie
Zeitschrift f�r Physik A Atoms and Nuclei
Archiv f�r Klinische und Experimentelle Dermatologie
Zeitschrift f�r Physik A Atomic Nuclei
Journal of Orofacial Orthopedics / Fortschritte der Kieferorthop�die
Naunyn-Schmiedebergs Archiv f�r Pharmakologie und Experimentelle Pathologie
Naunyn-Schmiedeberg's Archiv f�r Experimentelle Pathologie und Pharmakologie
≪UML≫ 2000 - The Unified Modeling Language
B - Ba … Cu - Zr
≪UML≫ 2001 - The Unified Modeling Language. Modeling Languages, Concepts, and Tools
·Nature
«Nature
from limesoup.
This is due to the encoding of the files. Perhaps the HTML file is encoded in ISO 8859-1 but was opened as UTF-8. The weird symbol stands in for non-ASCII letters from European alphabets, such as the German "ü" and "ä".
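The effect can be reproduced with a short sketch, assuming the bytes on disk are ISO 8859-1 (Latin-1) and are decoded as UTF-8 with replacement of invalid bytes. The sample title is one of the journal names listed in this thread; whether the scraper actually decodes this way is an assumption.

```python
# Bytes as they would be stored in an ISO 8859-1 encoded file.
raw = "Zeitschrift für Physik".encode("latin-1")

# Decoding with the wrong codec: the byte for "ü" (0xFC) is not valid UTF-8,
# so it is replaced by U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the matching codec recovers the original text.
right = raw.decode("latin-1")

print(wrong)
print(right)
```

If this is the cause, the fix belongs where the file is read (choosing the correct encoding), not in the parser.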
from limesoup.
The test function get_non_ascii_latin in soup_tester.py currently checks for non-ASCII or non-Latin chars. Since these problems are not due to the parser, consider removing it from the unit test. Maybe start an issue in the scraper repo.
from limesoup.
Resolved in 5feae369f0245d102c3497f43884996fab7a55db.
from limesoup.