
limesoup's People

Contributors

eddotman, grootel, hhaoyan, hhliu0, jmadeano, nicolas-mng, olgagkononova, shaunrong, tiagobotari, vtshitoyan, zherenwang, zjensen262


limesoup's Issues

[AllParsers] Special HTML symbols in parser.

For all parsers, pay attention to special HTML symbols in parsed metadata. For example, DOI 10.1002/adsc.201190008 (Wiley) has the journal name rendered as Advanced Synthesis &amp; Catalysis, which should be Advanced Synthesis & Catalysis. This is to be solved in the Wiley parser @zjensen262

@zhugeyicixin Could you find journals in Springer that have similar problems?
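As a possible workaround for this class of problem, here is a minimal sketch, assuming the root cause is a double-escaped HTML entity (my reading of the example above). It uses only the standard library's html.unescape; the helper name is illustrative:

```python
import html

def unescape_twice(text):
    # Metadata that was escaped twice ("&amp;amp;") needs two passes;
    # a single pass already fixes the common "&amp;" case, and a second
    # pass is a no-op for already-clean text.
    return html.unescape(html.unescape(text))

print(unescape_twice('Advanced Synthesis &amp;amp; Catalysis'))
# -> Advanced Synthesis & Catalysis
```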

Feedback on ACS parser

Some feedback from Olga based on about 10 randomly chosen papers.

General:

  • Unparsed symbols such as &mgr;, &pgr;, +, ...
  • Figure and table captions are embedded in the text. They should either be removed or kept as separate paragraphs.
  • Runs of '\n' appear where spaces should be.
  • Reference numbers need to be removed.
  • Paragraphs in the text are not separated.

10.1021/ja068965r:

  • Where applicable, names of paragraphs should be extracted as separate headings:
    "X-ray Crystallography. Crystals of H2GL2 and CuGL2 were grown from concentrated MeOH/H2O solutions of the respective compounds, whereas crystals of NiGL2 were obtained via slow..."
    "name": "X-ray Crystallography"
    "content": "Crystals of H2GL2 and CuGL2 were grown from concentrated MeOH/H2O solutions of the respective compounds, whereas crystals of NiGL2 were obtained via slow..."

10.1021/ja0024340:
Weird symbols:
"\nThe energy of this transition state lies 21.1 kcal/mol above\nthe separated species. Using a typical18 &Dgr;S⧧ of −27 cal deg-1\nmol-1,"

Feedback on Wiley parser

Here is a list of issues found for the Wiley parser:

  • References are not removed from paragraphs.
  • Duplicate paragraphs, for example https://doi.org/10.1111/j.1551-2916.2011.04722.x.
  • Some paragraphs don't have a section name, for example 10.1002/anie.201305377.
  • Missing paragraphs, for example 10.1111/jace.12629 and 10.1002/adma.201103895.
  • Subsection names are not parsed, for example 10.1002/2017JE005343.

Issue 2 is the most urgent one.

Elsevier parser issue with the parse_formula function

The parser raises a TypeError when parsing some papers. Example DOIs:

  • 10.1016/j.solmat.2004.07.052
  • 10.1016/j.solmat.2015.12.001
  • 10.1016/j.apcatb.2008.01.005

Exception information:

must be str, not NoneType

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/home/hhuo/anaconda3/envs/synthesis/lib/python3.6/site-packages/rpyc/core/protocol.py", line 329, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/home/hhuo/anaconda3/envs/synthesis/lib/python3.6/site-packages/rpyc/core/protocol.py", line 590, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/hhuo/Projects/Codes/synthesis-api-hub/synthesis_api_hub/worker.py", line 20, in wrapper
    ret = f(self, *args, **kwargs)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/api_worker.py", line 38, in parse_elsevier
    return ElsevierSoup.parse(html_string)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/lime_soup.py", line 57, in parse
    return self._next.parse(html_str)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/lime_soup.py", line 69, in parse
    results = self._next.parse(results)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/lime_soup.py", line 67, in parse
    results = self._parse(html_str)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/ElsevierSoup.py", line 33, in _parse
    parser.parse_formula(rules=[{'name': 'formula'}])
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/parser/parser_paper_elsevier.py", line 113, in parse_formula
    label.string = ' ' + label.string + ' '
TypeError: must be str, not NoneType
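For reference, a hedged sketch of a possible guard for the failing line. The helper name pad_label and the sample HTML are hypothetical; the underlying point is that BeautifulSoup's .string is None whenever a tag wraps more than one child, so get_text() is a safer fallback:

```python
from bs4 import BeautifulSoup

def pad_label(label):
    # label.string is None when the tag wraps more than one child
    # (e.g. <label>Eq. <b>(1)</b></label>), which is exactly what
    # triggers "must be str, not NoneType" above; get_text() joins
    # all nested text and never returns None.
    text = label.string if label.string is not None else label.get_text()
    label.string = ' ' + text + ' '

soup = BeautifulSoup('<label>Eq. <b>(1)</b></label>', 'html.parser')
pad_label(soup.label)
print(soup.label.string)  # -> ' Eq. (1) '
```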

Parser Version

Currently, all parsers' versions are recorded as the repo version in the production database. We need to assign each parser its own version number and update the DB content according to each parser's version number.
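A minimal sketch of what per-parser versioning could look like. Everything here is hypothetical, not existing LimeSoup API: the class name, the parser_version attribute, and the 'Parser-Version' field are assumptions for illustration.

```python
# Hypothetical sketch only: parser_version and 'Parser-Version'
# are invented names, not part of the real LimeSoup code.

class ElsevierSoupV2:
    parser_version = '0.2.1'  # bumped independently of the repo version

    @classmethod
    def parse(cls, html_str):
        obj = {'DOI': None, 'Sections': []}  # placeholder for real parsing
        obj['Parser-Version'] = cls.parser_version  # stamped per record
        return obj
```

Records in the DB could then be re-parsed whenever their stored version lags the parser's current one.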

Issues for ElsevierSoup

1.zip
This is an .xml document I obtained through text mining from an Elsevier journal. When I run the following in PyCharm:

from LimeSoup import ElsevierSoup
with open('1.xml', 'r', encoding='utf-8') as f:
    xml_str = f.read()
data = ElsevierSoup.parse(xml_str)
print(data)

the printed result was:
{'Journal': None, 'DOI': None, 'Title': None, 'Keywords': [], 'Sections': []}
I'm curious what I did wrong and why I did not get the expected results.
Thank you!

Formulas in gif format

I found that some ECS papers have GIF images in place of formulas and numbers.
For example: http://jes.ecsdl.org/content/157/3/J69.full
<span class="inline-formula" id="inline-formula-38"><img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif"></span>

Can we check how many such cases there are and do something about them?
Thank you.
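To get a rough count across a corpus, something like the following sketch might work. The helper name and the class-matching heuristic are assumptions based only on the snippet quoted above:

```python
from bs4 import BeautifulSoup

def count_gif_formulas(html_str):
    # Count <img> formula renderings like the one quoted above:
    # a "math" class and a .gif source.
    soup = BeautifulSoup(html_str, 'html.parser')
    return sum(
        1 for img in soup.find_all('img')
        if img.get('src', '').endswith('.gif') and 'math' in img.get('class', [])
    )
```

Running this over the stored ECS HTML files would give an estimate of how widespread the problem is.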

which file format should be given

I am unable to understand the use case for LimeSoup.
I am not sure which format the article should be given in here:

with open(article, 'r', encoding='utf-8') as f:
    html_str = f.read()

Moreover, I guess this usage is for a single article. What if there are thousands of articles to be parsed?
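For what it's worth, a batch loop is straightforward to build on top of the single-article usage. Here is a hedged sketch; the helper name parse_all and the glob pattern are illustrative, not part of LimeSoup:

```python
import glob

def parse_all(pattern, parse):
    """Parse every file matching `pattern` with the given soup's
    parse() callable, collecting failures instead of aborting the
    whole batch on the first bad file."""
    results, failures = [], []
    for path in sorted(glob.glob(pattern)):
        with open(path, 'r', encoding='utf-8') as f:
            raw = f.read()
        try:
            results.append(parse(raw))
        except Exception as exc:
            failures.append((path, str(exc)))
    return results, failures
```

Usage would then look like `results, failures = parse_all('articles/*.html', ElsevierSoup.parse)`, with failures logged for later inspection.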

[Springer]Paragraphs containing bullet points

The "Conclusion" section of "10.1007/s40964-017-0023-1" has only 1 paragraph, while the parsed result has 3. I think this is because bullet points are used in that paper (see the unit test LimeSoup/test/test_springer/test_springer.py).
@IAmGrootel

Feedback on the Springer Parser

Here is some feedback from Tanjin, who analyzed the Springer parser's results on a few papers.

  1. Many blanks are inserted, especially around subscripts/superscripts. This makes it difficult to parse chemical formulas correctly.
    E.g.:
    Pb(Zr x Ti 1− x )O 3
    Pb 0.97 Nd 0.02 (Zr 0.55 Ti 0.45 )O 3 (PNZT)
    ScTaO 4
    Ar + ion
    Mg 2 Ni
    7.49 × 10 3 kg/m 3
    1.5 J/cm 2
    CuK α
    k -space

  2. Paragraphs in the same section are not separated.
    E.g.: the Introduction section of the paper 10.1007/s00339-013-8138-9.

  3. References are not removed

  4. Some text is missing in sections that have sub-sections.
    E.g.: the Methods section is missing for the paper 10.1007/bf01142064.

  5. I am not sure if we need to keep formulas in the same format.
    E.g., some formulas start and end with "$$", while others start and end with "\(" as the boundary.
    Formula 1: $$ \sigma_{\text{wh}} = \sqrt { \sigma_{\text{sat}}^{2} - \left( {\sigma_{\text{sat}}^{2} - \sigma_{0}^{2} } \right)\exp ( - r(\varepsilon - \varepsilon_{0} ))} $$
    Formula 2: \( {\dot{{\varepsilon }}} \)?

I think we should at least address the first 4 points. Happy to discuss this further.
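On point 1, a rough post-processing heuristic could rejoin element symbols and stoichiometric digits. This sketch is illustrative only and deliberately does not handle every case above (superscripts like "10 3" or Greek subscripts like "CuK α" would need extra rules):

```python
import re

def collapse_formula_spaces(text):
    # Heuristic sketch: join a letter to the digit that follows it
    # ("Mg 2 Ni" -> "Mg2 Ni"), then join a digit to a following
    # element symbol ("Mg2 Ni" -> "Mg2Ni"). Exponent-style spacing
    # such as "10 3" is left untouched on purpose.
    text = re.sub(r'(?<=[A-Za-z]) +(?=\d)', '', text)
    text = re.sub(r'(?<=\d) +(?=[A-Z][a-z]?\b)', '', text)
    return text
```

Something like this could run over parsed paragraphs as a stopgap while the real fix lands in the Springer parser's subscript/superscript handling.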

Olivetti Group - html parsing running errors

@eddotman @zjensen262
We pulled a branch from master and tried running the ECS parsers using:

from LimeSoup.ECSSoup import ECSSoup 
data = ECSSoup.parse(ECS_htmls[0])

Where ECS_htmls is a list of html strings.

But we get an error:

NameError                                 Traceback (most recent call last)
<ipython-input-6-12ee0748abcb> in <module>()
      1 from LimeSoup.ECSSoup import ECSSoup
----> 2 data = ECSSoup.parse(ECS_htmls[0])

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     50         if not self._next:
     51             raise ValueError("Please provide at least one parsing rule ingredient to the soup")
---> 52         return self._next.parse(html_str)
     53 
     54 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     60 
     61     def parse(self, html_str):
---> 62         results = self._parse(html_str)
     63         if self._next:
     64             results = self._next.parse(results)

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\ECSSoup.pyc in _parse(parser_obj)
    164         # Collect information from the paper using ParserPaper
    165         # Create tag from selection function in ParserPaper
--> 166         parser.deal_with_sections()
    167         obj['Sections'] = parser.data_sections
    168         return {'obj': obj, 'html_txt': parser.raw_html}

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_paper.pyc in deal_with_sections(self)
     53         """
     54         parameters = {'name': re.compile('^section_h[0-6]'), 'recursive': False}
---> 55         parse_section = self.create_parser_section(self.soup, parameters, parser_type=self.parser_type)
     56         self.data_sections = parse_section.data
     57         self.headings_sections = parse_section.heading

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_paper.pyc in create_parser_section(soup, parameters, parser_type)
     73         :return:
     74         """
---> 75         return ParserSections(soup, parameters, parser_type=parser_type)
     76 
     77     @staticmethod

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_section.py in __init__(self, soup, parameters, debugging, parser_type)
     37             #self.save_soup_to_file('some_thing_wrong_chieldren.html')
     38             warnings.warn(' Some think is wrong in children!=1')
---> 39             exit()
     40         self.soup1 = self.soup1[0]
     41         self.parameters = parameters

NameError: global name 'exit' is not defined

This was tried on the following DOIs: 10.1149/1.3492151, 10.1149/1.3492174, 10.1149/1.3492188. The HTML files could be opened in Chrome and appeared to render properly there.

With RSC we were able to run:
data = RSCSoup.parse(RSC_htmls[0])
Here we get an issue with empty entries in data: DOI, Journal, and Keywords are all empty. We tried this on the DOIs: 10.1039/B210215C, 10.1039/B210393C, 10.1039/C000028K

So for the ECS parser we were wondering if this error has a fix. And for the RSC parser we wanted to check whether the missing entries are expected behavior or whether we should be attempting to extract that information from the HTML files.

[Springer] Paper title and journal name

  1. It is natural to think the paper title/journal name is a string rather than a list. We have discussed this in the PR comments.

  2. Some weird pages have several paper titles, for example:
    10.1007/BF01161620
    10.1007/s10230-014-0302-8

  3. The parser needs to be fixed for some journals, for example:

10.1007/s10562-004-3745-x: parsed Journal is ['Catalysis Letters', 'J. Catal.', 'J. Am. Chem. Soc.', 'J. Phys. Chem.', 'Catal. Lett.', 'Angew. Chem. Int. Edn.', 'J. Ind. Rng. Chem.', 'J. Catal.']

10.1007/s11244-005-2883-8: parsed Journal is ['Topics in Catalysis', 'Stud. Surf. Sci. Catal.', 'Appl. Catal. A: General', 'Stud. Surf. Sci. Catal.', 'Top Catal.', 'J. Phys. Chem.', 'Top. Catal.', 'Top. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Brennstoff-Chem.', 'Angew. Chem.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Catalysis Today', 'Fuel Process Technol.', 'Appl. Cat. A: General', 'CIT']

So I think maybe we should:

  1. Change the type of Journal and Title from list to str.
  2. Maybe get rid of HTML files containing several titles, if they are useless?
  3. Fix the parser for Journal if we want to keep this field. Since the journal name is already known during scraping, we could also skip parsing Journal entirely.

What do you think? @IAmGrootel @hhaoyan

['obj']['Sections'] contains None

Sometimes the parser returns data['obj']['Sections'] with a None in it. IMO None should not be in the list and should be removed in coming versions.

For example:

html_str = """<div id="wrapper"><div class="left_head"><a class="simple" href="http://pubs.rsc.org"><img class="rsc-logo" border="0" src="http://pubs.rsc.org/content/NewImages/royal-society-of-chemistry-logo.png" alt="Royal Society of Chemistry"></a><br><span class="btnContainer"><a class="btn btn--tiny btn--primary" target="_blank" title="Link to PDF version" href="http://pubs.rsc.org/en/content/articlepdf/2012/CC/C1CC90183D">View PDF Version</a></span><span class="btnContainer"><a class="btn btn--tiny btn--nobg" title="Link to previous article (id:C1CC90192C)" href="http://pubs.rsc.org/en/content/articlehtml/2012/CC/C1CC90192C" target="_BLANK">Previous Article</a></span><span class="btnContainer"><a class="btn btn--tiny btn--nobg" title="Link to next article (id:C1CC90182F)" href="http://pubs.rsc.org/en/content/articlehtml/2012/CC/C1CC90182F" target="_BLANK">Next Article</a></span></div><div class="right_head"> </div><div class="article_info"> DOI: <a target="_blank" title="Link to landing page via DOI" href="https://doi.org/10.1039/C1CC90183D">10.1039/C1CC90183D</a>
(Editorial)
<span class="italic"><a title="Link to journal home page" href="https://doi.org/10.1039/1364-548X/1996">Chem. Commun.</a></span>, 2012, <strong>48</strong>, 18-18</div><h1 id="sect127"><span class="title_heading">A message from the new <span class="italic">ChemComm</span> chair</span></h1><p class="header_text">
      <span class="bold">
        
          
            Richard R. 
            Schrock
          
          
        
      </span>
    </p><div id="art-admin"><table><tbody><tr><td class="biogPlate"><img alt="" src="http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2012/CC/c1cc90183d/c1cc90183d-p1.gif"><b></b><p><b>Richard R. Schrock</b></p></td><td><i></i><p>Richard R. Schrock received his PhD in inorganic chemistry from Harvard in 1971. After spending one year as an NSF postdoctoral fellow at the University of Cambridge and three years at the Central Research and Development Department of E. I. DuPont de Nemours and Co., he moved to M.I.T. in 1975 where he became full professor in 1980 and the Frederick G. Keyes Professor of Chemistry in 1989. His interests include the inorganic and organometallic chemistry of early transition metals and catalytic processes involving them. In 2005 he shared the Nobel Prize in chemistry with Robert Grubbs and Yves Chauvin for the “development of the metathesis method in organic synthesis.”</p></td></tr></tbody></table><hr>
    
      <span>I accepted the position of <span class="italic">ChemComm</span> Editorial Board Chair with honour and pride in the summer of 2011. Steeped in history, <span class="italic">ChemComm</span> continues to be one of the leading journals for important and urgent research across all chemical disciplines. It was largely because of the journal's standing in the chemical community that I agreed to take the role and lead the Editorial Board for the next four years. In this brief message, I would like to layout my vision for <span class="italic">ChemComm</span> from 2012.</span>
      <p class="otherpara">First, I want to thank Professor Peter Kündig, University of Geneva, who retires from the Chairman's role at the end of 2011. In his four years as Chair, <span class="italic">ChemComm</span> has seen its impact factor rise year on year while the number of articles published has increased by 50%; this is a truly remarkable achievement. I hope to be able to look back on similarly impressive results in four years time. Thank you Peter for your leadership, vision and energy.</p>
      <p class="otherpara">Looking to the future, 2012 will be a landmark year for <span class="italic">ChemComm</span>. Starting in January the journal will publish 100 issues per year. <span class="italic">ChemComm</span> will be the first chemistry journal to achieve such a remarkable feat. The journal will be hitting your desks twice a week, with each issue packed with a mixture of high quality communications and reviews. This doubling in frequency is a consequence of the significant growth of the journal, with annual submissions now close to 8000. The most rapid growth is in the number of submissions from Asia, in particular China, where <span class="italic">ChemComm</span> is both well known and popular. We hope to maintain these links with Asia while ensuring we continue to build strong support from other key countries that are leading the way in chemical research.</p>
      <p class="otherpara">Most importantly, we will continue to focus on further improving the quality of the journal through vigorous and fair peer review. Marshalled by our Associate Editors, who are all world-renowned scientists, and the dedicated professional Editors based in Cambridge, UK, we will strive to deliver the very best customer service at a speed that sets <span class="italic">ChemComm</span> apart from its competitors.</p>
      <p class="otherpara">In summary, I am very much looking forward to working with the Editorial Board and steering the journal through this exciting period of its life. On behalf of the Editorial Board, I would like to thank all our referees and authors who continue to contribute to the journal’s success.</p>
      <p class="otherpara">Richard R. Schrock</p>
      <p class="otherpara">F. G. Keyes Professor of Chemistry</p>
      <p class="otherpara">Editorial Board Chair, <span class="italic">ChemComm</span></p>
    
  <table><tbody><tr><td><hr></td></tr><tr><td><b>This journal is © The Royal Society of Chemistry 2012</b></td></tr></tbody></table></div></div>"""

from LimeSoup.RSCSoup import RSCSoup
RSCSoup.parse(html_str)
# Gives:
# {'obj': {'DOI': '', 'Title': ['A message from the new ChemComm chair'], 'Keywords': [], 'Journal': [], 'Sections': [None]}, 'html_txt': '<section_h1>\n</section_h1>'}
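Until this is fixed in the parser itself, a caller-side workaround is easy to sketch (the helper name is hypothetical):

```python
def drop_none_sections(parsed):
    # Caller-side workaround: strip None entries from the parser output
    # until the parser itself stops emitting them.
    obj = parsed.get('obj', parsed)  # handle both wrapped and bare results
    obj['Sections'] = [s for s in obj.get('Sections', []) if s is not None]
    return parsed

parsed = {'obj': {'DOI': '', 'Sections': [None]}}
print(drop_none_sections(parsed)['obj']['Sections'])  # -> []
```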

Issues with Elsevier journals

In the following journals, the text appears not to be divided into sections:
Advanced Powder Technol.
Journal of Catalysis
Chemical Engineering Research and Design
Desalination

Also check for the Author Index and remove it.

[ECS] Need to remove references from text

I think I have already opened a similar issue before: in ECS papers, reference numbers are very often left in front of sentences.
Example DOIs: 10.1149/1.1420706, 10.1149/1.1565141, 10.1149/1.3606475, 10.1149/2.003203jes
Please remove ALL reference numbers from the text.

I believe it can be solved by removing statements similar to this:
<sup><a class="xref-bibr" href="#ref-25" id="xref-ref-25-1">25</a></sup>
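That removal could be sketched with BeautifulSoup roughly as follows; the helper name and the parent-tag check are assumptions based on the snippet above:

```python
from bs4 import BeautifulSoup

def strip_reference_numbers(html_str):
    # Drop <sup><a class="xref-bibr">N</a></sup> citation markers
    # (and bare xref-bibr anchors) before extracting text.
    soup = BeautifulSoup(html_str, 'html.parser')
    for a in soup.find_all('a', class_='xref-bibr'):
        target = a.parent if a.parent.name == 'sup' else a
        target.decompose()
    return soup.get_text()

html = ('<p>as shown.<sup><a class="xref-bibr" href="#ref-25" '
        'id="xref-ref-25-1">25</a></sup> Next</p>')
print(strip_reference_numbers(html))  # -> 'as shown. Next'
```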

RSCSoup

DOI: 10.1039/B003394O
Problem: content is nested inside a tag; maybe it is not worth correcting.

Missing materials names and temperatures

Please check the paragraph "5af36e44ce31211cf1712941" in the Paragraphs collection.
It is missing a precursor and temperatures.
I wonder if there are similar issues elsewhere in the same journal.
