Hi Holger, to be clear, Full-Text RSS itself doesn't do anything magic here. It simply carries out the string replacement. So in your example, the <p> replacement on the input HTML you provided:
<h2 class="clay-subheader" ...>
<span class="ordered-list-item">
<p class="list-item-text">1.</p>
</span>
You don’t have to read everyone’s book.
</h2>
becomes the following after string replacement:
<h2 class="clay-subheader" ...>
<span class="ordered-list-item">
<anydifferenttag>1.</p>
</span>
You don’t have to read everyone’s book.
</h2>
The magic that you're seeing happens with the HTML parser when it parses the above. Full-Text RSS relies on HTML5-PHP to parse HTML, which tries to follow HTML5 parsing rules. It's kind of the same way a browser will try to make sense of a malformed HTML document.
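To make the split between the two steps concrete, here is a minimal Python sketch of the first step only. It is not Full-Text RSS code: `replace_string` and `TagCounter` are hypothetical names, the attributes elided with `...` in the example are left out, and Python's `html.parser` is used purely as a tokenizer to list the tags, since it does not perform the HTML5 tree repair described above.

```python
from html.parser import HTMLParser


def replace_string(html: str, find: str, replace: str) -> str:
    """Literal substitution only; it knows nothing about HTML structure."""
    return html.replace(find, replace)


class TagCounter(HTMLParser):
    """Records start and end tags in the order they appear in the markup."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)


source = (
    '<h2 class="clay-subheader">'
    '<span class="ordered-list-item">'
    '<p class="list-item-text">1.</p>'
    '</span>'
    'You don\u2019t have to read everyone\u2019s book.'
    '</h2>'
)

# Only the opening tag matches the search string, so the closing </p> stays.
replaced = replace_string(source, '<p class="list-item-text">', '<anydifferenttag>')

counter = TagCounter()
counter.feed(replaced)
counter.close()

print(counter.opened)  # tags that were opened
print(counter.closed)  # tags that were closed
```

The opened and closed lists no longer pair up, and that mismatch is what the HTML parser then has to resolve on its own.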
I think the ideal way to deal with such changes is with more advanced rules, some of which were supported by the original Instapaper rules that these are based on. Things such as unwrap: XPath or move_into(), which would let you manipulate the DOM without string replacement. We haven't seen many cases where those are essential, so we haven't added support for them.
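For comparison, a DOM-level unwrap could look roughly like the Python sketch below. This is an illustration only: Full-Text RSS has no such rule, the `unwrap` helper is hypothetical, and it assumes that "unwrap" means replacing an element with its own children. It uses `xml.dom.minidom`, so the input has to be well-formed, and the attributes elided with `...` in the example are left out.

```python
from xml.dom import minidom


def unwrap(node):
    """Replace `node` with its children, keeping their order."""
    parent = node.parentNode
    while node.firstChild is not None:
        # insertBefore detaches the child from `node` before re-inserting it.
        parent.insertBefore(node.firstChild, node)
    parent.removeChild(node)


source = (
    '<h2 class="clay-subheader">'
    '<span class="ordered-list-item">'
    '<p class="list-item-text">1.</p>'
    '</span>'
    'You don\u2019t have to read everyone\u2019s book.'
    '</h2>'
)

doc = minidom.parseString(source)

# Hypothetical selection step: a real rule would pick the nodes with an XPath.
for p in list(doc.getElementsByTagName("p")):
    unwrap(p)

result = doc.documentElement.toxml()
print(result)
```

Because the change is made on the tree, the opening and closing tags are removed together and the parser never has to repair anything.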
Whether to use the string replacement method you've highlighted and rely on the parser to figure out the correct structure, I'm not sure. It's a bit hacky. I personally wouldn't do it for minor formatting improvements, but I don't have very strong feelings about it.
There are some site config files where we've used string replacement to signal an end to the document where a suitable XPath couldn't be found or it was just much simpler than constructing one to isolate the content. Imagine something like this:
replace_string(<!--end of article-->):</body></html>
We're not creating well-formed HTML, but we're hoping the HTML parser does what we want and ends the document at the point the comment is encountered and ignores all the other elements that follow.
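A rough Python sketch of the intent, under loud assumptions: the page content and the helper names `apply_end_marker` and `hoped_for_document` are made up for illustration, and the explicit truncation only stands in for the outcome we hope the parser produces. A real HTML5 parser may handle markup after the closing tags differently, which is why this remains a hope and not a guarantee.

```python
END_MARKER = "<!--end of article-->"
CLOSING = "</body></html>"


def apply_end_marker(html: str) -> str:
    """The string replacement step: swap the comment for closing tags."""
    return html.replace(END_MARKER, CLOSING)


def hoped_for_document(html: str) -> str:
    """Stand-in for the hoped-for parse: keep text up to the first closing sequence."""
    head, sep, _ = html.partition(CLOSING)
    return head + sep


page = (
    "<html><body><article><p>Story text.</p></article>"
    "<!--end of article-->"
    '<div class="related">Unrelated links</div>'
    "</body></html>"
)

replaced = apply_end_marker(page)
trimmed = hoped_for_document(replaced)

print(replaced)  # now contains two closing sequences, so it is not well-formed
print(trimmed)   # the document we are hoping to end up with
```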
from ftr-site-config.