Comments (9)
The ArticleExtractor appears to assume that li elements are part of a menu of
some sort. Generally this is correct, but it seems we can assume that menus are
not normally ordered lists?
Working from that assumption, I was able to modify the code to accept li
elements that are in an ol, merging them into one text block with the order
number added before each li. I have not had the chance to test my modifications
against a wide variety of articles, but it seems to work as expected.
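For readers who want to see the idea in isolation, here is a minimal sketch of merging the li items of an ol into one numbered text block. It is not the boilerpipe patch described above (that code is not shown in this thread); it uses Python's standard html.parser, and the names OrderedListMerger and merge_ordered_lists are made up for illustration. Lists nested inside an li are not handled specially.

```python
from html.parser import HTMLParser


class OrderedListMerger(HTMLParser):
    """Hypothetical helper: one numbered text block per top-level <ol>."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, one per top-level <ol>
        self._ol_depth = 0     # how many <ol> elements are currently open
        self._items = None     # item texts of the <ol> being collected
        self._current = None   # text pieces of the <li> being collected

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self._ol_depth += 1
            if self._ol_depth == 1:
                self._items = []
        elif tag == "li" and self._ol_depth == 1:
            self._current = []

    def handle_endtag(self, tag):
        if tag == "li" and self._ol_depth == 1 and self._current is not None:
            # Collapse whitespace so each item becomes a single line.
            text = " ".join("".join(self._current).split())
            if text:
                self._items.append(text)
            self._current = None
        elif tag == "ol" and self._ol_depth > 0:
            self._ol_depth -= 1
            if self._ol_depth == 0:
                # Prefix each item with its order number and join into one block.
                numbered = [f"{i}. {t}" for i, t in enumerate(self._items, 1)]
                self.blocks.append("\n".join(numbered))
                self._items = None

    def handle_data(self, data):
        # Text outside an <ol> <li> (e.g. a <ul> menu) is ignored here.
        if self._current is not None:
            self._current.append(data)


def merge_ordered_lists(html):
    parser = OrderedListMerger()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling merge_ordered_lists on a page fragment returns a list of strings, one per ordered list, while li elements in a ul are left for the existing menu heuristic to deal with.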
Original comment by [email protected]
on 15 Mar 2012 at 10:03
from boilerpipe.
I think it is relatively safe to assume that, but at the same time I doubt this
is always the case; some obscure reason may push a developer to use an <ol>
instead of a <ul>.
Also, it is very common for articles to have <ul>s inside them. Is there
something that can be added that looks for leading/trailing content blocks
greater than X length? Alternatively, you could look for lists inside an
element that also contains a large volume of text.
It is unlikely that a menu div would contain lists as well as a large volume of
text (where the text is outside of the list element).
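A rough sketch of that second heuristic, assuming a container's markup is available as a string: count the text that sits outside any ul/ol and treat the lists as article content only when that count passes a threshold. This is only an illustration in Python's standard html.parser, not boilerpipe code; the function name lists_look_like_content and the default of 200 characters are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class _TextSplitCounter(HTMLParser):
    """Counts non-whitespace characters inside and outside <ul>/<ol>."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0
        self.inside_chars = 0
        self.outside_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.list_depth += 1

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.list_depth > 0:
            self.list_depth -= 1

    def handle_data(self, data):
        n = len("".join(data.split()))  # ignore whitespace-only runs
        if self.list_depth > 0:
            self.inside_chars += n
        else:
            self.outside_chars += n


def lists_look_like_content(container_html, min_outside_chars=200):
    """True if the container has list text plus enough text outside the lists.

    min_outside_chars is a hypothetical threshold (the "X length" above).
    """
    counter = _TextSplitCounter()
    counter.feed(container_html)
    counter.close()
    return counter.inside_chars > 0 and counter.outside_chars >= min_outside_chars
```

A bare navigation div returns False because almost all of its text is inside the list, whereas an article body with paragraphs around a list returns True. The threshold would need tuning against real pages.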
Hope that all makes sense?
Original comment by [email protected]
on 16 Mar 2012 at 7:12
Partially fixed in r170. The issue with LIs still needs to be tackled, although
there are other reasons for this behavior.
Please try again and tell me if you are happy with the results.
Cheers,
Christian
Original comment by ckkohl79
on 21 Mar 2012 at 10:10
- Changed state: Fixed
I tried this using
http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.seomoz.org%2Fugc%2Flink-building-management&extractor=ArticleExtractor&output=htmlFragment
and it doesn't seem to have made much of a difference. Has the appspot version
been updated yet?
Notice how they have used an ol and a ul within the same article.
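For anyone wanting to repeat this check against other pages, the request URL above can be assembled programmatically. The sketch below only rebuilds the query string seen in this thread (the url, extractor and output parameters); the helper name build_extract_url is hypothetical, and nothing here assumes any other parameters of the web service.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the URL quoted in this thread.
BASE = "http://boilerpipe-web.appspot.com/extract"


def build_extract_url(page_url, extractor="ArticleExtractor", output="htmlFragment"):
    """Percent-encode the page URL and append the options used in the thread."""
    query = urlencode({"url": page_url, "extractor": extractor, "output": output})
    return f"{BASE}?{query}"


if __name__ == "__main__":
    print(build_extract_url("http://www.seomoz.org/ugc/link-building-management"))
```

urlencode takes care of escaping the colon and slashes in the page URL, which is where hand-built URLs tend to go wrong.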
Original comment by [email protected]
on 22 Mar 2012 at 8:48
Please try again. It's now live on boilerpipe-web.
(Before, it was only on SVN trunk.)
Original comment by ckkohl79
on 22 Mar 2012 at 5:48
Looking much better. The only remaining issue is that it seems to have trimmed
out two of the <ol> <li>s:
http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2Fwww.seomoz.org%2Fugc%2Flink-building-management&extractor=ArticleExtractor&output=htmlFragment
These two:
Majestic SEO - Deeper than OSE but contains noisy, unfiltered data.
Official Google Toolbar (PageRank) - Single metric. Infrequently updated.
Great work by the way, works almost perfectly now :-)
Original comment by [email protected]
on 22 Mar 2012 at 5:53
[deleted comment]
Can something like the ArticleMetadataFilter be used to remove the "82 Thumbs
Up, 1 Thumbs Down" block?
Original comment by [email protected]
on 22 Mar 2012 at 8:34
Sorry, tucker, was that directed at me?
Original comment by [email protected]
on 26 Mar 2012 at 2:59