
limesoup's People

Contributors

eddotman, grootel, hhaoyan, hhliu0, jmadeano, nicolas-mng, olgagkononova, shaunrong, tiagobotari, vtshitoyan, zherenwang, zjensen262


limesoup's Issues

[AllParsers] Special HTML symbols in parser.

For all parsers, pay attention to special HTML symbols in parsed metadata. For example, DOI 10.1002/adsc.201190008 (Wiley) has the journal name rendered as Advanced Synthesis &amp; Catalysis, which should be Advanced Synthesis & Catalysis. This is to be solved in the Wiley parser @zjensen262

@zhugeyicixin Could you find journals in Springer that have similar problems?
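As a possible workaround for this class of problem, here is a minimal sketch, assuming the root cause is a double-escaped HTML entity (my reading of the example above). It uses only the standard library's html.unescape; the helper name is illustrative:

```python
import html

def unescape_twice(text):
    # Metadata that was escaped twice ("&amp;amp;") needs two passes;
    # a single pass already fixes the common "&amp;" case, and a second
    # pass is a no-op for already-clean text.
    return html.unescape(html.unescape(text))

print(unescape_twice('Advanced Synthesis &amp;amp; Catalysis'))
# -> Advanced Synthesis & Catalysis
```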

Feedback on ACS parser

Some feedback from Olga based on about 10 randomly chosen papers.

General:

  • Unparsed symbols such as &mgr;, &pgr;, +, ...
  • Figure and table captions are embedded in the text. They should either be removed or kept as separate paragraphs.
  • Runs of '\n' appear where spaces should be.
  • Reference numbers need to be removed.
  • Paragraphs in the text are not separated.

10.1021/ja068965r:

  • Where applicable, names of paragraphs should be extracted as separate headings:
    "X-ray Crystallography. Crystals of H2GL2 and CuGL2 were grown from concentrated MeOH/H2O solutions of the respective compounds, whereas crystals of NiGL2 were obtained via slow..."
    "name": "X-ray Crystallography"
    "content": "Crystals of H2GL2 and CuGL2 were grown from concentrated MeOH/H2O solutions of the respective compounds, whereas crystals of NiGL2 were obtained via slow..."

10.1021/ja0024340:
Weird symbols:
"\nThe energy of this transition state lies 21.1 kcal/mol above\nthe separated species. Using a typical18 &Dgr;S⧧ of −27 cal deg-1\nmol-1,"

Feedback on Wiley parser

Here is a list of issues found for the Wiley parser:

  • References are not removed from paragraphs.
  • Duplicate paragraphs, for example https://doi.org/10.1111/j.1551-2916.2011.04722.x.
  • Some paragraphs don't have a section name, for example 10.1002/anie.201305377.
  • Missing paragraphs, for example 10.1111/jace.12629 and 10.1002/adma.201103895.
  • Subsection names are not parsed, for example 10.1002/2017JE005343.

Issue 2 is the most urgent one.

Elsevier parser issue with the parse_formula function

The parser raises a TypeError when parsing some papers. Example DOIs:

  • 10.1016/j.solmat.2004.07.052
  • 10.1016/j.solmat.2015.12.001
  • 10.1016/j.apcatb.2008.01.005

Exception information:

must be str, not NoneType

========= Remote Traceback (1) =========
Traceback (most recent call last):
  File "/home/hhuo/anaconda3/envs/synthesis/lib/python3.6/site-packages/rpyc/core/protocol.py", line 329, in _dispatch_request
    res = self._HANDLERS[handler](self, *args)
  File "/home/hhuo/anaconda3/envs/synthesis/lib/python3.6/site-packages/rpyc/core/protocol.py", line 590, in _handle_call
    return obj(*args, **dict(kwargs))
  File "/home/hhuo/Projects/Codes/synthesis-api-hub/synthesis_api_hub/worker.py", line 20, in wrapper
    ret = f(self, *args, **kwargs)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/api_worker.py", line 38, in parse_elsevier
    return ElsevierSoup.parse(html_string)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/lime_soup.py", line 57, in parse
    return self._next.parse(html_str)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/lime_soup.py", line 69, in parse
    results = self._next.parse(results)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/lime_soup.py", line 67, in parse
    results = self._parse(html_str)
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/ElsevierSoup.py", line 33, in _parse
    parser.parse_formula(rules=[{'name': 'formula'}])
  File "/home/hhuo/Projects/Codes/LimeSoup/LimeSoup/parser/parser_paper_elsevier.py", line 113, in parse_formula
    label.string = ' ' + label.string + ' '
TypeError: must be str, not NoneType
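For reference, a hedged sketch of a possible guard for the failing line. The helper name pad_label and the sample HTML are hypothetical; the underlying point is that BeautifulSoup's .string is None whenever a tag wraps more than one child, so get_text() is a safer fallback:

```python
from bs4 import BeautifulSoup

def pad_label(label):
    # label.string is None when the tag wraps more than one child
    # (e.g. <label>Eq. <b>(1)</b></label>), which is exactly what
    # triggers "must be str, not NoneType" above; get_text() joins
    # all nested text and never returns None.
    text = label.string if label.string is not None else label.get_text()
    label.string = ' ' + text + ' '

soup = BeautifulSoup('<label>Eq. <b>(1)</b></label>', 'html.parser')
pad_label(soup.label)
print(soup.label.string)  # -> ' Eq. (1) '
```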

Parser Version

Currently, all parsers' versions are recorded as the repo version in the production database. We need to assign each parser its own version number and update the DB content according to each parser's version number.
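A minimal sketch of what per-parser versioning could look like. Everything here is hypothetical, not existing LimeSoup API: the class name, the parser_version attribute, and the 'Parser-Version' field are assumptions for illustration.

```python
# Hypothetical sketch only: parser_version and 'Parser-Version'
# are invented names, not part of the real LimeSoup code.

class ElsevierSoupV2:
    parser_version = '0.2.1'  # bumped independently of the repo version

    @classmethod
    def parse(cls, html_str):
        obj = {'DOI': None, 'Sections': []}  # placeholder for real parsing
        obj['Parser-Version'] = cls.parser_version  # stamped per record
        return obj
```

Records in the DB could then be re-parsed whenever their stored version lags the parser's current one.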

Issues for ElsevierSoup

1.zip
This is an .xml document I obtained through text mining from an Elsevier journal. When I run the following in PyCharm:

from LimeSoup import ElsevierSoup
with open('1.xml', 'r', encoding='utf-8') as f:
    xml_str = f.read()
data = ElsevierSoup.parse(xml_str)
print(data)

the printed result was:
{'Journal': None, 'DOI': None, 'Title': None, 'Keywords': [], 'Sections': []}
I'm curious what I did wrong and why I did not get the expected results.
Thank you!

Formulas in gif format

I found that some ECS papers have GIF images in place of formulas and numbers.
For example: http://jes.ecsdl.org/content/157/3/J69.full
<span class="inline-formula" id="inline-formula-38"><img class="math mml" alt="Formula" src="J69/embed/mml-math-38.gif"></span>

Can we check how many such cases there are and do something about them?
Thank you.
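To get a rough count across a corpus, something like the following sketch might work. The helper name and the class-matching heuristic are assumptions based only on the snippet quoted above:

```python
from bs4 import BeautifulSoup

def count_gif_formulas(html_str):
    # Count <img> formula renderings like the one quoted above:
    # a "math" class and a .gif source.
    soup = BeautifulSoup(html_str, 'html.parser')
    return sum(
        1 for img in soup.find_all('img')
        if img.get('src', '').endswith('.gif') and 'math' in img.get('class', [])
    )
```

Running this over the stored ECS HTML files would give an estimate of how widespread the problem is.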

which file format should be given

I am unable to understand the use case for LimeSoup.
I am not sure which format the article should be given in here:

with open(article, 'r', encoding='utf-8') as f:
    html_str = f.read()

Moreover, I guess this usage is for a single article. What if there are thousands of articles to be parsed?
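For what it's worth, a batch loop is straightforward to build on top of the single-article usage. Here is a hedged sketch; the helper name parse_all and the glob pattern are illustrative, not part of LimeSoup:

```python
import glob

def parse_all(pattern, parse):
    """Parse every file matching `pattern` with the given soup's
    parse() callable, collecting failures instead of aborting the
    whole batch on the first bad file."""
    results, failures = [], []
    for path in sorted(glob.glob(pattern)):
        with open(path, 'r', encoding='utf-8') as f:
            raw = f.read()
        try:
            results.append(parse(raw))
        except Exception as exc:
            failures.append((path, str(exc)))
    return results, failures
```

Usage would then look like `results, failures = parse_all('articles/*.html', ElsevierSoup.parse)`, with failures logged for later inspection.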

[Springer]Paragraphs containing bullet points

The "Conclusion" section of "10.1007/s40964-017-0023-1" has only 1 paragraph, while the parsed result has 3. I think this is because bullet points are used in that paper (see the unit test LimeSoup/test/test_springer/test_springer.py).
@IAmGrootel

Feedback on the Springer Parser

Here is some feedback from Tanjin, who analyzed the Springer parser's results on a few papers.

  1. Many blanks are inserted, especially around subscripts/superscripts. This makes it difficult to parse chemical formulas correctly.
    E.g.:
    Pb(Zr x Ti 1− x )O 3
    Pb 0.97 Nd 0.02 (Zr 0.55 Ti 0.45 )O 3 (PNZT)
    ScTaO 4
    Ar + ion
    Mg 2 Ni
    7.49 × 10 3 kg/m 3
    1.5 J/cm 2
    CuK α
    k -space

  2. Paragraphs in the same section are not separated.
    E.g.: the Introduction section of the paper 10.1007/s00339-013-8138-9.

  3. References are not removed

  4. Some text is missing in sections that have sub-sections.
    E.g.: the Methods section is missing for the paper 10.1007/bf01142064.

  5. I am not sure if we need to keep formulas in the same format.
    E.g., some formulas start and end with "$$", while others start and end with "\(" as the boundary.
    Formula 1: $$ \sigma_{\text{wh}} = \sqrt { \sigma_{\text{sat}}^{2} - \left( {\sigma_{\text{sat}}^{2} - \sigma_{0}^{2} } \right)\exp ( - r(\varepsilon - \varepsilon_{0} ))} $$
    Formula 2: \( {\dot{{\varepsilon }}} \)?

I think we should at least address the first 4 points. Happy to discuss this further.
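On point 1, a rough post-processing heuristic could rejoin element symbols and stoichiometric digits. This sketch is illustrative only and deliberately does not handle every case above (superscripts like "10 3" or Greek subscripts like "CuK α" would need extra rules):

```python
import re

def collapse_formula_spaces(text):
    # Heuristic sketch: join a letter to the digit that follows it
    # ("Mg 2 Ni" -> "Mg2 Ni"), then join a digit to a following
    # element symbol ("Mg2 Ni" -> "Mg2Ni"). Exponent-style spacing
    # such as "10 3" is left untouched on purpose.
    text = re.sub(r'(?<=[A-Za-z]) +(?=\d)', '', text)
    text = re.sub(r'(?<=\d) +(?=[A-Z][a-z]?\b)', '', text)
    return text
```

Something like this could run over parsed paragraphs as a stopgap while the real fix lands in the Springer parser's subscript/superscript handling.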

Olivetti Group - html parsing running errors

@eddotman @zjensen262
We pulled a branch from master and tried running the ECS parsers using:

from LimeSoup.ECSSoup import ECSSoup 
data = ECSSoup.parse(ECS_htmls[0])

Where ECS_htmls is a list of html strings.

But we get an error:

NameError                                 Traceback (most recent call last)
<ipython-input-6-12ee0748abcb> in <module>()
      1 from LimeSoup.ECSSoup import ECSSoup
----> 2 data = ECSSoup.parse(ECS_htmls[0])

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     50         if not self._next:
     51             raise ValueError("Please provide at least one parsing rule ingredient to the soup")
---> 52         return self._next.parse(html_str)
     53 
     54 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     62         results = self._parse(html_str)
     63         if self._next:
---> 64             results = self._next.parse(results)
     65         return results
     66 

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\lime_soup.pyc in parse(self, html_str)
     60 
     61     def parse(self, html_str):
---> 62         results = self._parse(html_str)
     63         if self._next:
     64             results = self._next.parse(results)

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\ECSSoup.pyc in _parse(parser_obj)
    164         # Collect information from the paper using ParserPaper
    165         # Create tag from selection function in ParserPaper
--> 166         parser.deal_with_sections()
    167         obj['Sections'] = parser.data_sections
    168         return {'obj': obj, 'html_txt': parser.raw_html}

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_paper.pyc in deal_with_sections(self)
     53         """
     54         parameters = {'name': re.compile('^section_h[0-6]'), 'recursive': False}
---> 55         parse_section = self.create_parser_section(self.soup, parameters, parser_type=self.parser_type)
     56         self.data_sections = parse_section.data
     57         self.headings_sections = parse_section.heading

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_paper.pyc in create_parser_section(soup, parameters, parser_type)
     73         :return:
     74         """
---> 75         return ParserSections(soup, parameters, parser_type=parser_type)
     76 
     77     @staticmethod

C:\Users\avang\Documents\GitHub\LimeSoup\LimeSoup\parser\parser_section.py in __init__(self, soup, parameters, debugging, parser_type)
     37             #self.save_soup_to_file('some_thing_wrong_chieldren.html')
     38             warnings.warn(' Some think is wrong in children!=1')
---> 39             exit()
     40         self.soup1 = self.soup1[0]
     41         self.parameters = parameters

NameError: global name 'exit' is not defined

This was tried on the following DOIs: 10.1149/1.3492151, 10.1149/1.3492174, 10.1149/1.3492188. The HTML files could be opened in Chrome and appeared to render properly there.

With RSC we were able to run:
data = RSCSoup.parse(RSC_htmls[0])
Here we get an issue with empty entries in data: DOI, Journal, and Keywords are all empty. We tried this on the DOIs: 10.1039/B210215C, 10.1039/B210393C, 10.1039/C000028K

So for the ECS parser we were wondering if this error has a fix. And for the RSC parser we wanted to check whether the missing entries are expected behavior or whether we should be attempting to extract that information from the HTML files.

[Springer] Paper title and journal name

  1. It is natural to think the paper title/journal name is a string rather than a list. We have discussed this in the PR comments.

  2. Some weird pages have several paper titles, for example:
    10.1007/BF01161620
    10.1007/s10230-014-0302-8

  3. The parser needs to be fixed for some journals, for example:

10.1007/s10562-004-3745-x: parsed Journal is ['Catalysis Letters', 'J. Catal.', 'J. Am. Chem. Soc.', 'J. Phys. Chem.', 'Catal. Lett.', 'Angew. Chem. Int. Edn.', 'J. Ind. Rng. Chem.', 'J. Catal.']

10.1007/s11244-005-2883-8: parsed Journal is ['Topics in Catalysis', 'Stud. Surf. Sci. Catal.', 'Appl. Catal. A: General', 'Stud. Surf. Sci. Catal.', 'Top Catal.', 'J. Phys. Chem.', 'Top. Catal.', 'Top. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Brennstoff-Chem.', 'Angew. Chem.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Stud. Surf. Sci. Catal.', 'Catalysis Today', 'Fuel Process Technol.', 'Appl. Cat. A: General', 'CIT']

So I think maybe we should:

  1. Change the type of Journal and Title from list to str.
  2. Maybe get rid of HTML files containing several titles, if they are useless?
  3. Fix the parser for Journal if we want to keep this field. Since the journal name is already known during scraping, we could also skip parsing Journal entirely.

What do you think? @IAmGrootel @hhaoyan

['obj']['Sections'] contains None

Sometimes the parser returns data['obj']['Sections'] with a None in it. IMO None should not be in the list and should be removed in coming versions.

For example:

html_str = """<div id="wrapper"><div class="left_head"><a class="simple" href="http://pubs.rsc.org"><img class="rsc-logo" border="0" src="http://pubs.rsc.org/content/NewImages/royal-society-of-chemistry-logo.png" alt="Royal Society of Chemistry"></a><br><span class="btnContainer"><a class="btn btn--tiny btn--primary" target="_blank" title="Link to PDF version" href="http://pubs.rsc.org/en/content/articlepdf/2012/CC/C1CC90183D">View PDF Version</a></span><span class="btnContainer"><a class="btn btn--tiny btn--nobg" title="Link to previous article (id:C1CC90192C)" href="http://pubs.rsc.org/en/content/articlehtml/2012/CC/C1CC90192C" target="_BLANK">Previous Article</a></span><span class="btnContainer"><a class="btn btn--tiny btn--nobg" title="Link to next article (id:C1CC90182F)" href="http://pubs.rsc.org/en/content/articlehtml/2012/CC/C1CC90182F" target="_BLANK">Next Article</a></span></div><div class="right_head"> </div><div class="article_info"> DOI: <a target="_blank" title="Link to landing page via DOI" href="https://doi.org/10.1039/C1CC90183D">10.1039/C1CC90183D</a>
(Editorial)
<span class="italic"><a title="Link to journal home page" href="https://doi.org/10.1039/1364-548X/1996">Chem. Commun.</a></span>, 2012, <strong>48</strong>, 18-18</div><h1 id="sect127"><span class="title_heading">A message from the new <span class="italic">ChemComm</span> chair</span></h1><p class="header_text">
      <span class="bold">
        
          
            Richard R. 
            Schrock
          
          
        
      </span>
    </p><div id="art-admin"><table><tbody><tr><td class="biogPlate"><img alt="" src="http://pubs.rsc.org/services/images/RSCpubs.ePlatform.Service.FreeContent.ImageService.svc/ImageService/Articleimage/2012/CC/c1cc90183d/c1cc90183d-p1.gif"><b></b><p><b>Richard R. Schrock</b></p></td><td><i></i><p>Richard R. Schrock received his PhD in inorganic chemistry from Harvard in 1971. After spending one year as an NSF postdoctoral fellow at the University of Cambridge and three years at the Central Research and Development Department of E. I. DuPont de Nemours and Co., he moved to M.I.T. in 1975 where he became full professor in 1980 and the Frederick G. Keyes Professor of Chemistry in 1989. His interests include the inorganic and organometallic chemistry of early transition metals and catalytic processes involving them. In 2005 he shared the Nobel Prize in chemistry with Robert Grubbs and Yves Chauvin for the “development of the metathesis method in organic synthesis.”</p></td></tr></tbody></table><hr>
    
      <span>I accepted the position of <span class="italic">ChemComm</span> Editorial Board Chair with honour and pride in the summer of 2011. Steeped in history, <span class="italic">ChemComm</span> continues to be one of the leading journals for important and urgent research across all chemical disciplines. It was largely because of the journal's standing in the chemical community that I agreed to take the role and lead the Editorial Board for the next four years. In this brief message, I would like to layout my vision for <span class="italic">ChemComm</span> from 2012.</span>
      <p class="otherpara">First, I want to thank Professor Peter Kündig, University of Geneva, who retires from the Chairman's role at the end of 2011. In his four years as Chair, <span class="italic">ChemComm</span> has seen its impact factor rise year on year while the number of articles published has increased by 50%; this is a truly remarkable achievement. I hope to be able to look back on similarly impressive results in four years time. Thank you Peter for your leadership, vision and energy.</p>
      <p class="otherpara">Looking to the future, 2012 will be a landmark year for <span class="italic">ChemComm</span>. Starting in January the journal will publish 100 issues per year. <span class="italic">ChemComm</span> will be the first chemistry journal to achieve such a remarkable feat. The journal will be hitting your desks twice a week, with each issue packed with a mixture of high quality communications and reviews. This doubling in frequency is a consequence of the significant growth of the journal, with annual submissions now close to 8000. The most rapid growth is in the number of submissions from Asia, in particular China, where <span class="italic">ChemComm</span> is both well known and popular. We hope to maintain these links with Asia while ensuring we continue to build strong support from other key countries that are leading the way in chemical research.</p>
      <p class="otherpara">Most importantly, we will continue to focus on further improving the quality of the journal through vigorous and fair peer review. Marshalled by our Associate Editors, who are all world-renowned scientists, and the dedicated professional Editors based in Cambridge, UK, we will strive to deliver the very best customer service at a speed that sets <span class="italic">ChemComm</span> apart from its competitors.</p>
      <p class="otherpara">In summary, I am very much looking forward to working with the Editorial Board and steering the journal through this exciting period of its life. On behalf of the Editorial Board, I would like to thank all our referees and authors who continue to contribute to the journal’s success.</p>
      <p class="otherpara">Richard R. Schrock</p>
      <p class="otherpara">F. G. Keyes Professor of Chemistry</p>
      <p class="otherpara">Editorial Board Chair, <span class="italic">ChemComm</span></p>
    
  <table><tbody><tr><td><hr></td></tr><tr><td><b>This journal is © The Royal Society of Chemistry 2012</b></td></tr></tbody></table></div></div>"""

from LimeSoup.RSCSoup import RSCSoup
RSCSoup.parse(html_str)
# Gives:
# {'obj': {'DOI': '', 'Title': ['A message from the new ChemComm chair'], 'Keywords': [], 'Journal': [], 'Sections': [None]}, 'html_txt': '<section_h1>\n</section_h1>'}
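Until this is fixed in the parser itself, a caller-side workaround is easy to sketch (the helper name is hypothetical):

```python
def drop_none_sections(parsed):
    # Caller-side workaround: strip None entries from the parser output
    # until the parser itself stops emitting them.
    obj = parsed.get('obj', parsed)  # handle both wrapped and bare results
    obj['Sections'] = [s for s in obj.get('Sections', []) if s is not None]
    return parsed

parsed = {'obj': {'DOI': '', 'Sections': [None]}}
print(drop_none_sections(parsed)['obj']['Sections'])  # -> []
```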

Issues with Elsevier journals

In the following journals, the text appears not to be divided into sections:
Advanced Powder Technol.
Journal of Catalysis
Chemical Engineering Research and Design
Desalination

Also check for the Author Index and remove it.

[ECS] Need to remove references from text

I think I have already opened a similar issue before: in ECS papers, reference numbers are very often left in front of sentences.
Example DOIs: 10.1149/1.1420706, 10.1149/1.1565141, 10.1149/1.3606475, 10.1149/2.003203jes
Please remove ALL reference numbers from the text.

I believe it can be solved by removing statements similar to this:
<sup><a class="xref-bibr" href="#ref-25" id="xref-ref-25-1">25</a></sup>
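That removal could be sketched with BeautifulSoup roughly as follows; the helper name and the parent-tag check are assumptions based on the snippet above:

```python
from bs4 import BeautifulSoup

def strip_reference_numbers(html_str):
    # Drop <sup><a class="xref-bibr">N</a></sup> citation markers
    # (and bare xref-bibr anchors) before extracting text.
    soup = BeautifulSoup(html_str, 'html.parser')
    for a in soup.find_all('a', class_='xref-bibr'):
        target = a.parent if a.parent.name == 'sup' else a
        target.decompose()
    return soup.get_text()

html = ('<p>as shown.<sup><a class="xref-bibr" href="#ref-25" '
        'id="xref-ref-25-1">25</a></sup> Next</p>')
print(strip_reference_numbers(html))  # -> 'as shown. Next'
```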

RSCSoup

DOI: 10.1039/B003394O
Problem: content is nested inside a tag; maybe it is not worth correcting.

Missing materials names and temperatures

Please check the paragraph "5af36e44ce31211cf1712941" in the Paragraphs collection.
It is missing a precursor and temperatures.
I wonder if there are similar issues elsewhere in the same journal.
