boilerpy3's People

Contributors

cramaker, jmriebold, jsirois

boilerpy3's Issues

AttributeError: 'NoneType' object has no attribute 'lower'

I am trying out boilerpy3 with a simple WARC file, but I get the following error:

Error parsing HTML
Traceback (most recent call last):
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 81, in parse_doc
bp_parser.feed(input_str)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 652, in feed
self.end_document()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 459, in end_document
self.flush_block()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 534, in flush_block
if self.last_start_tag.lower() == "title":
AttributeError: 'NoneType' object has no attribute 'lower'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 87, in parse_doc
bp_parser.feed(input_str)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 652, in feed
self.end_document()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 459, in end_document
self.flush_block()
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/parser.py", line 534, in flush_block
if self.last_start_tag.lower() == "title":
AttributeError: 'NoneType' object has no attribute 'lower'
Traceback (most recent call last):
File "/home/mani/Workspace/Researches/Thesis/nparacrawl/npc-miner/main.py", line 23, in <module>
app.create_db()
File "/home/mani/Workspace/Researches/Thesis/nparacrawl/npc-miner/main.py", line 19, in create_db
print(text_extractor.get_content(text))
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 33, in get_content
return self.get_doc(text).content
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/extractors.py", line 49, in get_doc
self.filter.process(doc)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/filters.py", line 98, in process
is_updated |= filtr.process(doc)
File "/home/mani/.python-venvs/research1/lib/python3.8/site-packages/boilerpy3/filters.py", line 859, in process
for tb in doc.text_blocks:
AttributeError: 'NoneType' object has no attribute 'text_blocks'
Example Domain
This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.

I was using this WARC file for testing. The text is extracted successfully, but the error messages above are still printed. For complex web pages I get no output at all, only the error message. I have tried Python 3.8 and 3.9. Is anybody aware of this?
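The first traceback fails at `self.last_start_tag.lower()`, which suggests the attribute is still `None` when a text block is flushed before any start tag has been seen. Below is a minimal sketch of that failure mode and one possible guard, built only on the standard library's `html.parser`. The class `TitleAwareParser` and the helper `parse` are hypothetical names for illustration; this is not boilerpy3's actual code.

```python
from html.parser import HTMLParser


class TitleAwareParser(HTMLParser):
    """Hypothetical stand-in for a parser that remembers the last start tag."""

    def __init__(self):
        super().__init__()
        self.last_start_tag = None  # stays None until a start tag is seen
        self.title = None
        self.blocks = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # A new element starts: flush the text collected so far.
        self.flush_block()
        self.last_start_tag = tag

    def handle_data(self, data):
        self._buffer.append(data)

    def flush_block(self):
        text = "".join(self._buffer).strip()
        self._buffer.clear()
        if not text:
            return
        # Guard: fall back to "" when no start tag has been seen yet,
        # so .lower() is never called on None.
        if (self.last_start_tag or "").lower() == "title":
            self.title = text
        else:
            self.blocks.append(text)

    def close(self):
        super().close()
        self.flush_block()  # flush whatever text is left at end of input


def parse(html):
    """Return (title, text_blocks) for the given markup."""
    parser = TitleAwareParser()
    parser.feed(html)
    parser.close()
    return parser.title, parser.blocks
```

With the guard in place, input that contains text but no tags at all (for example a WARC record body that is not HTML) ends up as an ordinary text block instead of raising `AttributeError`.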

IndexError: pop from empty list

Running the code below fails while parsing HTML. The error is raised from boilerpy3's parse_doc function: the parser tries to pop an element from an empty list, which indicates that it hit an unexpected condition during parsing. The parser fails, and the program cannot complete.

A sample input HTML file is attached: fail1.html.zip

To replicate the issue, the following code was used:

from boilerpy3 import extractors
extractor = extractors.ArticleSentencesExtractor()
# x is the HTML string being parsed (presumably the contents of the attached sample)
doc = extractor.get_doc(x)
page_contents = doc.content
Error parsing HTML
Traceback (most recent call last):
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
    bp_parser.feed(input_str)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
    HTMLParser.feed(self, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
    self.handle_endtag(elem)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
    self.end_element(tag)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
    self.label_stacks.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 114, in parse_doc
    bp_parser.feed(input_str)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
    HTMLParser.feed(self, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
    self.handle_endtag(elem)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
    self.end_element(tag)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
    self.label_stacks.pop()
IndexError: pop from empty list
Traceback (most recent call last):
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 108, in parse_doc
    bp_parser.feed(input_str)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
    HTMLParser.feed(self, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
    self.handle_endtag(elem)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
    self.end_element(tag)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
    self.label_stacks.pop()
IndexError: pop from empty list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 114, in parse_doc
    bp_parser.feed(input_str)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 657, in feed
    HTMLParser.feed(self, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 110, in feed
    self.goahead(0)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 172, in goahead
    k = self.parse_endtag(i)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/html/parser.py", line 420, in parse_endtag
    self.handle_endtag(elem)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 667, in handle_endtag
    self.end_element(tag)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/parser.py", line 495, in end_element
    self.label_stacks.pop()
IndexError: pop from empty list
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1578, in <module>
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 63, in get_doc
    doc = self.parse_doc(text)
  File "/Users/mac/Desktop/django/envs/seotool/lib/python3.9/site-packages/boilerpy3/extractors.py", line 118, in parse_doc
    raise HTMLExtractionError from ex
boilerpy3.exceptions.HTMLExtractionError
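The traceback ends in `self.label_stacks.pop()` inside `end_element`, which is consistent with an end tag arriving when no element is open (for example a stray `</div>`). The following is a rough sketch of that situation using only the standard library. `BalancedTagParser`, `VOID_TAGS`, and `find_stray_end_tags` are hypothetical names, and the stack handling is a simplification rather than boilerpy3's implementation.

```python
from html.parser import HTMLParser

# Elements that never take an end tag, so they must not be pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BalancedTagParser(HTMLParser):
    """Hypothetical parser that keeps one stack entry per open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.stray_end_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.open_tags:
            # Guard: an unguarded pop() here raises
            # "IndexError: pop from empty list".
            self.stray_end_tags.append(tag)
            return
        self.open_tags.pop()


def find_stray_end_tags(html):
    """Return end tags that arrive while no element is open."""
    parser = BalancedTagParser()
    parser.feed(html)
    parser.close()
    return parser.stray_end_tags
```

Running `find_stray_end_tags` over a failing page is one way to check whether unbalanced end tags are present before handing the HTML to the extractor.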

Malformed HTML

Hi,
Sometimes when I pass HTML to get_doc, it logs this warning and returns empty content:

WARNING:boilerpy3:Warning: SAX input contains nested A elements -- You have probably hit a bug in your HTML parser (e.g., NekoHTML bug #2909310). Please clean the HTML externally and feed it to BoilerPy3 again. Trying to recover somehow..

Is there an automated way to fix this?
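Since the warning asks for the HTML to be cleaned externally, one possible approach is to flatten nested `<a>` elements before extraction. The sketch below uses only the standard library: it closes an open anchor before another one starts and drops the resulting stray `</a>`. `AnchorFlattener` and `flatten_nested_anchors` are hypothetical names, processing instructions and unusual declarations are not preserved, and it has not been tested against boilerpy3 itself.

```python
from html.parser import HTMLParser


class AnchorFlattener(HTMLParser):
    """Re-emit markup, closing an open <a> before another <a> starts."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.in_anchor:
                self.out.append("</a>")  # close the outer anchor first
            self.in_anchor = True
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a":
            if not self.in_anchor:
                return  # drop the stray </a> left over from flattening
            self.in_anchor = False
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def flatten_nested_anchors(html):
    """Return the markup with nested <a> elements flattened."""
    parser = AnchorFlattener()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The cleaned string can then be passed to `get_doc` in place of the original HTML. Whether this removes the warning depends on how the nesting arises in the input, so it is worth checking on a page that currently triggers it.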
