Comments (8)

hhaoyan commented on June 25, 2024

An update: maybe check only for characters in the Unicode Specials block; see https://en.wikipedia.org/wiki/Specials_(Unicode_block)

Many non-ASCII or non-Latin characters are actually useful, such as '≈' and '∞'.

Rolling back the test function...
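
A minimal sketch of what such a check could look like, assuming the goal is to flag only code points in the Specials block (U+FFF0 to U+FFFF). The helper name find_specials is hypothetical and is not part of limesoup.

```python
# Hypothetical helper, not part of limesoup: flag only code points that
# fall in the Unicode Specials block (U+FFF0 to U+FFFF).
SPECIALS_START, SPECIALS_END = 0xFFF0, 0xFFFF


def find_specials(text):
    """Return (index, character) pairs for characters in the Specials block."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if SPECIALS_START <= ord(ch) <= SPECIALS_END
    ]


# Useful symbols such as '≈' and '∞' are not reported.
print(find_specials("x ≈ ∞"))
# The replacement character U+FFFD is reported.
print(find_specials("Zeitschrift f\ufffdr Physik"))
```

With a check like this, math symbols pass silently and only the replacement character and its neighbors in the block trigger a report.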

from limesoup.

zhugeyicixin commented on June 25, 2024

Sure, I'll do it.

For all parsers, pay attention to special HTML symbols in parsed metadata. For example, DOI 10.1002/adsc.201190008 (Wiley) has "Advanced Synthesis &amp; Catalysis", which should be "Advanced Synthesis & Catalysis". This is to be solved in the Wiley parser @zjensen262

@zhugeyicixin Could you find journals in Springer that have similar problems?

hhaoyan commented on June 25, 2024

@zjensen262 I saw your format_text implementation in the latest pull request. However, I think a better approach than regular expressions is to use a library, as in https://stackoverflow.com/a/2087446/2310794.
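
As one possible illustration of the library route, Python's standard library provides html.unescape, which decodes HTML character references without hand-written regular expressions. This is only a sketch: the wrapper name unescape_metadata is hypothetical, and it is not a claim about how format_text works or what the linked answer recommends.

```python
import html


def unescape_metadata(value):
    """Decode HTML character references (such as &amp;) in a metadata string."""
    # html.unescape handles named, decimal, and hexadecimal references.
    return html.unescape(value)


print(unescape_metadata("Advanced Synthesis &amp; Catalysis"))
```

Strings without any character references pass through unchanged, so applying this to every metadata field should be safe in the common case.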

zhugeyicixin commented on June 25, 2024

I checked the Springer parser and wrote a new test function (see #37). There are no HTML characters in the parsed result. However, we might want to watch for special characters that are not readable for humans. Currently, I raise a warning in the test function if a character is not in one of the "normal sets" (the exact sets need discussion): (1) ASCII, (2) extended Latin, (3) Greek letters.
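
A rough sketch of such a check, assuming the "normal sets" can be approximated by the prefix of each character's Unicode name. The function name find_unusual_chars and the ALLOWED_SCRIPTS tuple are hypothetical; this is not the test function from #37.

```python
import unicodedata

# Assumption: "extended Latin" and "Greek letters" are approximated by the
# prefix of the Unicode character name. Characters such as MICRO SIGN or
# OHM SIGN do not carry these prefixes and would still be reported.
ALLOWED_SCRIPTS = ("LATIN", "GREEK")


def find_unusual_chars(text):
    """Return the sorted characters that are not ASCII, Latin, or Greek."""
    unusual = set()
    for ch in text:
        if ord(ch) < 128:  # plain ASCII is always accepted
            continue
        name = unicodedata.name(ch, "")  # "" for unnamed code points
        if not name.startswith(ALLOWED_SCRIPTS):
            unusual.add(ch)
    return sorted(unusual)


print(find_unusual_chars("Zeitschrift für Physik, α-phase"))
print(find_unusual_chars("x ≈ y \ufffd"))
```

A test could emit a warning whenever the returned list is non-empty, instead of failing outright.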

zhugeyicixin commented on June 25, 2024

As an example of the weird special characters, here are some journal names that are hard to read (they come from scraping rather than parsing and are used only as examples).

Pfl�gers Archiv - European Journal of Physiology
Pfl�gers Archiv European Journal of Physiology
Fresenius' Zeitschrift f�r Analytische Chemie
Monatshefte f�r Chemie/Chemical Monthly
Zeitschrift f�r Physik D Atoms, Molecules and Clusters
Monatshefte f�r Chemie Chemical Monthly
Monatshefte f�r Chemie
Monatshefte f�r Chemie / Chemical Monthly
Zeitschrift f�r Physik B Condensed Matter
Zeitschrift f�r Physik B Condensed Matter and Quanta
Zeitschrift f�r Analytische Chemie
Archiv f�r Mikrobiologie
Langenbecks Archiv f�r Chirurgie
Fresenius Zeitschrift f�r Analytische Chemie
Zeitschrift f�r Physik
Naunyn-Schmiedebergs Archiv f�r Experimentelle Pathologie und Pharmakologie
Zeitschrift f�r Lebensmittel-Untersuchung und -Forschung
Archiv f�r Elektrotechnik
Internationales Archiv f�r Arbeitsmedizin
Archiv f�r Toxikologie
Zeitschrift f�r Rheumatologie
Zeitschrift f�r Physik A Atoms and Nuclei
Archiv f�r Klinische und Experimentelle Dermatologie
Zeitschrift f�r Physik A Atomic Nuclei
Journal of Orofacial Orthopedics / Fortschritte der Kieferorthop�die
Naunyn-Schmiedebergs Archiv f�r Pharmakologie und Experimentelle Pathologie
Naunyn-Schmiedeberg's Archiv f�r Experimentelle Pathologie und Pharmakologie
≪UML≫ 2000 - The Unified Modeling Language
B - Ba … Cu - Zr
≪UML≫ 2001 - The Unified Modeling Language. Modeling Languages, Concepts, and Tools
·Nature
«Nature

hhaoyan commented on June 25, 2024

This is due to the file encoding. Perhaps the HTML file is encoded in ISO 8859-1 but you opened it as UTF-8. The weird symbols are non-ASCII letters from European alphabets, such as the German "ü" and "ä".
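
The effect can be reproduced in a few lines of Python. This is only an illustration of the suspected cause, using a made-up in-memory string in place of the scraped HTML file:

```python
# Encode a title as ISO 8859-1, the encoding the HTML file is suspected to use.
original = "Zeitschrift für Physik"
raw_bytes = original.encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails on the byte for "ü"; with
# errors="replace" the decoder substitutes U+FFFD, the replacement character.
wrong = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the matching encoding recovers the original text.
right = raw_bytes.decode("iso-8859-1")

print(wrong)
print(right)
```

If this is indeed the cause, the fix belongs where the file is read (by passing the correct encoding) rather than in the parser.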

hhaoyan commented on June 25, 2024

The test function get_non_ascii_latin in soup_tester.py currently checks for non-ASCII or non-Latin characters. Since these problems are not caused by the parser, consider removing it from the unit tests. Maybe open an issue in the scraper repo.

hhaoyan commented on June 25, 2024

Resolved in 5feae369f0245d102c3497f43884996fab7a55db.
