fivefilters / ftr-site-config

Site-specific article extraction rules to aid content extractors, feed readers, and 'read later' applications.

Home Page: https://www.fivefilters.org/full-text-rss/

License: Other

article-extracting xpath extract-rules

ftr-site-config's People

Contributors

angeloprado eyebar mikevister dev101 williamgateszhao gonebushx glandos xxxazxxx paour vdquang tomtaylor newspaperclub coolius zinnober weishaupt patelatharva cirnod tcitworld jbfavre kaistian joglomedia kunalsood timgws karmak23 daverage jordidg mishnit stesie j0k3r benages szech lukas0907 corion zhoubug haeckle applejo taocwang nblock nmabhinandan zertrin thrilleratplay kdecherf andresth dariottolo jcharaoui atomike0 hejman tinloaf amreldib jeroenseegers shtrom rgugliel ngosang jkraemer sanjayankur31 fylleth667 alosultan tharts dschoepe kaliwdsn bonbadil 70b43r reinouts mynamehurts biva dthes lumiru alecmtetwa94 diversoft kobemtl andrey-str arakeis leblanc-simon lmontrieux marblepebble xfilosofx tuian mirelsol rfc2119 gadgetcoma guddl lapoigne 2xyo zedascouves dev-drprasad nclshart monsieurpouet sophieforceno azrdev kreativmonkey westgen darthoctopus webracer999 rurik19 crazyquesadilla rsanzante khinsen moritanosuke cyrusmg phyks

ftr-site-config's Issues

create config for theverge.com

this article doesn't render properly at all here.

testing on this site would lead me to think a good start would be the Xpaths:

//div[contains(concat(' ',normalize-space(@class),' '),' l-wrapper ')]//div[contains(concat(' ',normalize-space(@class),' '),' l-segment ') and (contains(concat(' ',normalize-space(@class),' '),' l-feature '))]
//div[contains(concat(' ',normalize-space(@class),' '),' c-entry-hero__content ')]

but i'm not sure how to add two xpaths at once?
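
One possible answer, as a sketch: a single `body` line can join several expressions with the XPath union operator `|` (the nature.com patch further down this page does exactly that). Assuming the two expressions from this issue are the right ones for theverge.com, which has not been verified here, the config could look like this:

```text
# Sketch only: both XPaths are copied from this issue and are untested.
# The XPath union operator "|" selects nodes matching either expression.
body: //div[contains(concat(' ',normalize-space(@class),' '),' l-wrapper ')]//div[contains(concat(' ',normalize-space(@class),' '),' l-segment ') and (contains(concat(' ',normalize-space(@class),' '),' l-feature '))] | //div[contains(concat(' ',normalize-space(@class),' '),' c-entry-hero__content ')]
```

Writing the two expressions as separate `body:` lines is also valid, but as far as the Full-Text RSS documentation describes it, those are tried in order and only the first match is used, so the union form is the better fit when both blocks should appear in the output.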

computerbase.de isn't working

Hi,
If you change body to "body: //div[@class='text-content']", it will work again.
Regards,
FiShi

Show script version on debug

It would be great if the script version were displayed in the debug data.

readretro.com isn't part of the list

I'd like someone to add a filter for this site; currently it only shows a summary.

ansible documentation pages cannot be parsed

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=http%3A%2F%2Fdocs.ansible.com%2Fansible%2Flatest%2Fpam_limits_module.html

body: //div[@id='pam-limits-modify-linux-pam-limits']
test_url: http://docs.ansible.com/ansible/latest/pam_limits_module.html

Ability to bypass redirection

Vk.com redirects to /badbrowser.php, so the content can't be fetched. But if we ignore the redirection, we have access to the content.
Can we somehow disable redirection for a site?

Amazon.com: no result

Hi,

I don't know if this is the right place for this. Fetching Amazon.com links in Wallabag and on http://f43.me/feed/test yields no result.
For example:

Thanks

Use stackoverflow config for stackexchange

Hello everyone,
I don't want to open a PR for this as it is a matter of personal taste, but I think the site config for stackoverflow.com should maybe be copied for .stackexchange.com, as the websites share the same engine and the latter site config is, in my opinion, sub-optimal.

faz.net.txt is broken

gizmodo/lifehacker don't work (they store text in JSON-LD now)

Hi, both gizmodo and lifehacker now store article text in JSON-LD:

<script type="application/ld+json">...</script>

Could you, please, add support for JSON-LD documents?

Provide guidance on file naming

Will someone please provide guidance on file naming?

I keep opening files as .example.com.txt. Primarily because most of the first listed files have that naming scheme. A few of my PRs have their file names adjusted, but I don’t fully understand why.

Adding something to the readme’s contributing section would be helpful, at least for me, if you maintainers don’t want to keep adjusting file names.

tweakers.net pattern doesn't work anymore

Version of Full-Text RSS: 3.9.11
Version of Site Patterns: 2021-05-26T01:09:01Z

Most of the time I get an [unable to retrieve full-text content] error when using Tweakers.net. Since the pattern is from 2018 and the website has been redesigned since then, the pattern should be updated.

With the point-and-click interface, I could select the body in 3 types of articles on the site. Test links are in the patterns below:

News article:

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Ftweakers.net%2Fnieuws%2F182324%2Fgoogle-probeerde-telefoonmakers-privacy-instellingen-te-laten-verstoppen.html

body: //div[contains(concat(' ',normalize-space(@class),' '),' article ')]
test_url: https://tweakers.net/nieuws/182324/google-probeerde-telefoonmakers-privacy-instellingen-te-laten-verstoppen.html

Multi-page articles (not every page can be parsed, so I think it is best to handle only the first page):

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Ftweakers.net%2Freviews%2F9040%2Fbluetooth-trackers-round-up-zoekt-en-gij-zult-niet-altijd-vinden.html

body: //div[contains(concat(' ',normalize-space(@class),' '),' centeredContent ')]
test_url: https://tweakers.net/reviews/9040/bluetooth-trackers-round-up-zoekt-en-gij-zult-niet-altijd-vinden.html

Software updates:

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Ftweakers.net%2Fdownloads%2F56134%2Fparallels-desktop-160.html

body: //div[contains(concat(' ',normalize-space(@class),' '),' articleColumn ')]
test_url: https://tweakers.net/downloads/56134/parallels-desktop-160.html

I'm not sure how to edit the pattern to cover all 3 types of articles and test it, since I use an RSS reader that relies on this service (Bazqux) rather than hosting it myself. I hope this helps with updating the pattern.
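
As a sketch of how the three generated snippets above might be merged into a single tweakers.net.txt: site config files can carry several `body:` and `test_url:` lines. The expressions below are copied unchanged from this issue; the merged file itself is untested.

```text
# Sketch: the three generated patterns from this issue merged into one file.
# Untested; each body expression is copied as-is from the snippets above.
body: //div[contains(concat(' ',normalize-space(@class),' '),' article ')]
body: //div[contains(concat(' ',normalize-space(@class),' '),' centeredContent ')]
body: //div[contains(concat(' ',normalize-space(@class),' '),' articleColumn ')]

test_url: https://tweakers.net/nieuws/182324/google-probeerde-telefoonmakers-privacy-instellingen-te-laten-verstoppen.html
test_url: https://tweakers.net/reviews/9040/bluetooth-trackers-round-up-zoekt-en-gij-zult-niet-altijd-vinden.html
test_url: https://tweakers.net/downloads/56134/parallels-desktop-160.html
```

Multiple `body:` lines are, as far as the Full-Text RSS documentation describes it, tried in order until one matches, so the order matters: if the first expression also matches an unrelated element on review or download pages, the later lines would never be reached and the lines would need reordering.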

Paywall Nextinpact new version

The website nextinpact has released a new version of their website and it seems articles cannot be extracted anymore.

inpact-hardware paywall

Hi,
I am trying to add https://www.inpact-hardware.com and, more specifically, the paywall part. It seems the login form is submitted as a JSON payload and I don't understand how to configure it. Any hints?

Undocumented patterns

I'm trying to add some validation tests for the site configs to avoid mistakes in them, and I found some undocumented patterns in some of the files (after that, I'll submit fixes for other files).

Here is the list:

convert_double_br_tags, for example .blog.163.com.txt
strip_comments, for example .blogspot.com.txt
move_into, for example 500px.com.txt
autodetect_next_page, for example 5by5.tv.txt
dissolve, for example acroswing.fr.txt
native_ad_clue, for example arstechnica.com.txt
footnotes, for example blogs.msdn.com.txt
wrap_in, for example blogs.smithsonianmag.com.txt
if_page_contains, for example gamasutra.com.txt
single_page_link_in_feed, for example techmeme.com.txt

I was wondering if these patterns are obsolete, new, unused, etc.
I can't find them in the documentation nor in the current open source version of Full-Text RSS. Have they been introduced in the current version of Full-Text RSS? (which means we can't see how they are handled)

Let me know 🙂

phoronix.com.txt seems to not be working.

I can't retrieve the full article, I just get a full summary.

Notebookcheck broken

https://www.notebookcheck.net/Final-nail-in-the-coffin-Bar-raising-AMD-Ryzen-9-5950X-somehow-lags-behind-four-Intel-parts-including-the-Core-i9-10900K-in-average-bench-on-UserBenchmark-despite-higher-1-core-and-4-core-scores.503581.0.html

Only one sentence will be parsed from that website:

This is on Wallabag 2.3.8.

androidpolice.com: Missing images

For example in the following article: http://www.androidpolice.com/2015/02/25/galaxy-s6-appears-leaked-glory/
All the gallery images are missing from the generated RSS feed.

reddit comments visible

Hello
What do I need to change to make Reddit comments visible?

Sandbox for testing configuration.

I think it's cool that we can send configuration files specific to a site.

Would it be too much to ask for a sandbox to test the configuration files before posting them? :-)

axios.com: unable to fetch.

Example article: https://www.axios.com/amazons-eating-the-media-too-2467710448.html

ad.nl and related sites, it seems, no longer work

Using the latest Wallabag Docker image, sites published by DPG Media no longer work: the cookie wall isn't automatically bypassed. (Ex. from source: http://www.ad.nl/ad/nl/10444/Offside/article/detail/4043834/2015/05/31/Dani-Alves-voetbalt-met-drol-op-zijn-hoofd.dhtml)

Stumbled upon this by trying to copy and adapt settings for ad.nl and parool.nl to demorgen.be (and, perhaps hln.be), as they should all be using the same cookie wall technology.

It'd be awesome if these could be fixed/added.

Possible additional test URL: https://www.demorgen.be/nieuws/locked-and-loaded-trump-dreigt-met-militaire-actie-tegen-iran-olieprijzen-schieten-omhoog~b82ff8c9/

Also affected: volkskrant.nl, see ex. https://www.volkskrant.nl/nieuws-achtergrond/privacywaakhond-verbiedt-cookiemuren-op-websites-ja-ook-die-op-volkskrant-nl~b7dab7ee/

unable to parse zeit.de

It seems the last change (#773) to zeit.de.txt concerning the cookie no longer works.
The cookie content is now a timestamp in the format "2020-09-03T19:45:59.150Z".
I tried inserting that instead of the previous fixed text, and it even worked on f43.me.
http_header(Cookie): zonconsent="2020-09-03T19:45:59.150Z"
When I changed the timestamp to August 3rd as a test, it did not work.
So the question is: how can the date (and time?) be set dynamically?

Test-URL

florian-casse.fr does not parse

I used wallabag to add it and it wasn't fetched, so I came here because of the tool at http://siteconfig.fivefilters.org/.

I made the selection but it still doesn't parse. Any thoughts?

https://florian-casse.fr/clear-dns-cache-on-vcsa-6-5-and-later/

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Fflorian-casse.fr%2Fclear-dns-cache-on-vcsa-6-5-and-later%2F

body: //div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]
test_url: https://florian-casse.fr/clear-dns-cache-on-vcsa-6-5-and-later/

Linux.com's original feed is missing

Linux.com's original feed (https://www.linux.com/feeds/original-content/rss) is missing as well.

Website Arrêt sur Images has wrong paywall URL

The connection URL in arretsurimages.net.txt should not be:
http://www.arretsurimages.net/forum/login.php
Instead, it should be:
http://www.arretsurimages.net
The login form is now directly accessible from the home page. Fields still have the same ids: username and password.
This may help fix the paywall detection, although I can't test it myself.
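
A sketch of what the changed login section could look like. The directive names below (`requires_login`, `login_uri`, `login_username_field`, `login_password_field`) are the login directives used by wallabag-style site configs and are an assumption here, since the current contents of arretsurimages.net.txt are not shown in this issue. Only the URL and the two field names come from the report above.

```text
# Sketch only, untested. Directive names are assumed (wallabag-style login
# directives); the URL and the field names are taken from this issue.
requires_login: yes
login_uri: http://www.arretsurimages.net
login_username_field: username
login_password_field: password
```

Whether the home-page form actually posts to this URL is not confirmed by the issue, so the `login_uri` value should be checked against the form's action attribute before relying on it.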

faz.net paywall articles shows payment-hint instead of the teaser as content

faz.net has free articles and f+ articles behind a paywall. For these f+ articles, apart from the headline, only a short teaser is displayed on the page and then a notice that you should pay 2.95 EUR per week to continue reading.

In this case, Full-Text RSS only takes the payment terms as content and not the teaser.

Can someone please adjust the site-config so that the teaser is taken over for f+ articles? Gladly also with a short (German) hint that you can read more only with a subscription.

examples:

free:
http://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fwww.faz.net%2Faktuell%2Fsport%2Folympische-winterspiele%2Fdeutsches-team%2Fkatharina-althaus-holt-olympia-silber-in-china-im-skispringen-17782373.html&max=3

f+:
http://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fwww.faz.net%2Faktuell%2Fpolitik%2Fausland%2Fdie-bergbaustadt-broken-hill-in-australien-setzt-auf-solarenergie-17735957.html&max=3

Since I host Full-Text RSS on my own server, just the required lines for the config would also be fine for me.

taz.de.txt does not work

Fetching content from taz.de articles no longer works, see:

https://www.taz.de/Einwanderungspolitik-der-USA/!5500980/

Reddit filter change

Reddit has changed the way they set up their pages (at some point, not recently) and the filter now includes comments and the sidebar, which should be stripped.

COPYING/license file missing

Hi,

I think it'd be good if you'd provide a COPYING file clearly stating the license of this work.

I am considering using these files as well, but I'm a bit confused about the legal situation:

on http://help.fivefilters.org/customer/portal/articles/223153-site-patterns you say in the same spirit, have made available our own additions for anyone to use
likewise on http://www.keyvan.net/2011/03/content-extraction/
from http://fivefilters.org/content-only/ however you can either download old versions (for free; licensed AGPL) which don't have patterns included or you can pay (and get the patterns)

So after all it sounds like "hey, it's public domain" ... right?
If so, could you please add a COPYING file stating so :-)

cheers
~stesie

Articles from aps.dz cannot be fetched

Hello,

my Wallabag cannot retrieve articles of the official Algerian press agency APS (http://www.aps.dz/) - and unfortunately, the site config builder gives me either a blank page or a 404 error when I try to load an article. Any ideas why this website doesn't work?

Thanks in advance!

The Verge site config does not pull images

I've played around with it a lot. It does not pull any images. This is happening with a lot of sites for me, but since the verge's feed is easily testable, I'm reporting that as the test case.

outlined words get censored on psychologytoday.com

This article about Richard Stallman is really weird to read on parsed content, because all the "dotted underline" words are just removed from the output. I reproduced this on the test site.

The last sentence of the second paragraph should read like:

If you’ve heard of open source (free software’s practice sans its moral stance) or Linux (really GNU, plus a program called Linux), you can thank Stallman.

(Emphasis on the removed word added.)

Instead it is:

If you’ve heard of open source (free software’s practice sans its stance) or Linux (really GNU, plus a program called Linux), you can thank Stallman.

Thanks!

Blogs using PluXML don't have any content

It seems to be common to blogs using PluXML, regardless of the theme. Some examples:

A blog using another theme, which works:

http://p3ter.fr/firefox-os-mes-retours-sur-le-zte-open-c.html

Specify title in site config file

In this example, wallabag can extract the article without a specific site config, but it can't set the title. Is there a way to specify where the title is in the site config?

http://siteconfig.fivefilters.org/grab.php?url=https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access

body: //div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]
test_url: https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access
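
A minimal sketch of one way to do this: site configs accept a `title:` line holding an XPath expression, as the lesserwrong.com config elsewhere on this page shows. The title expression below is an assumption (an `h1` with a hypothetical class `entry-title`) and has to be checked against the actual page source; the `body` and `test_url` lines are copied from the generated config above.

```text
# Sketch only, untested.
# ASSUMPTION: the headline is an <h1> with class "entry-title"; this class
# name is hypothetical and must be verified in the page's HTML.
title: //h1[contains(concat(' ',normalize-space(@class),' '),' entry-title ')]
body: //div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]
test_url: https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access
```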

Lesserwrong.com page cannot be parsed

Hi,

I'd like to parse pages from www.lesserwrong.com. I've tried creating a site config based on this page:
https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

This is what my site config looks like:

title: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-title ')]//h1
body: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-html ')]
date: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-metadata-date ')]
author: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-author ')]
test_url: https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

As far as I can tell, these XPaths all point to the correct elements inside that page. However, the tool at https://f43.me/feed/test still fails to parse the page. Did I mess up the site config, or is this a bug in the parser (and if so, is this the right repository to report such a bug)?

hardware.info - social buttons not excluded

Test link: https://nl.hardware.info/nieuws/71789/be-quiet-maakt-pure-rock-2-in-komende-weken-beschikbaar
Full-Text RSS version: 3.9.5
Site Patterns: 2020-04-27T10:13:58Z

The social buttons under the article are not excluded. This is for all articles on the website.

The buttons in question:

The buttons as they appear in an RSS feed reader after parsing by Full-Text RSS:

thaitable.com

//div[@id='main'] would be a good start for that. It includes a mailing list block and other crap, but I can't find a better match.

Impossible to catch anything from Bloomberg

I tried to fetch https://www.bloomberg.com/news/articles/2018-10-16/the-dirt-on-clean-electric-cars but I can't get anything, both with wallabag and https://f43.me/feed/test or even with http://siteconfig.fivefilters.org/

Any idea?

blogs.gnome.org fail to load

Failing test blog post:
https://blogs.gnome.org/aday/2017/08/08/the-gnome-way/

That's the error message from https://f43.me/feed/test

Request throw exception (with a response): Client error response [url] https://blogs.gnome.org/aday/2017/08/08/the-gnome-way/?not-changed [status code] 403 [reason phrase] Bad Behavior

I've tried your point and click page to find a proper site-filter but failed to do so. The fetched content is always this:

HTTP Error 403

We're sorry, but we could not fulfill your request for /aday/2017/08/08/the-gnome-way/?not-changed on this server.

An invalid request was received from your browser. This may be caused by a malfunctioning proxy server or browser privacy software.

Your technical support key is: b01f-6435-1756-6707

You can use this key to fix this problem yourself.

If you are unable to fix the problem yourself, please contact allanday at gnome.org and be sure to provide the technical support key shown above.

nature.com Improvement

This patch makes article body extraction for nature.com more exact:

--- a/nature.com.txt    2021-07-23 12:11:36.331873505 +0200
+++ b/nature.com.txt    2021-07-23 12:11:17.747730246 +0200
@@ -2,7 +2,7 @@
 date: //meta[@name="dc.date"]/@content
 date: //meta[@name="prism.publicationDate"]/@content
 author: //meta[@name='dc.creator']/@content
-body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' c-article-body ')]
 
 strip: //div[contains(concat(' ',normalize-space(@id),' '),' further-reading-section ')]

Issue fetching tagesanzeiger.ch

Hi guys,

I have trouble fetching the content of articles from tagesanzeiger.ch.
They have a loading page in front of the content and I don't know how to get around it.

Test-URL:
https://m.tagesanzeiger.ch/articles/13916862
Siteconfig:
http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Fm.tagesanzeiger.ch%2Farticles%2F13916862

filmstarts.de: slideshows

Although I wrote this config file myself, I can't figure this issue out.
This site basically has 2 very different types of articles:

the normal articles with some text, images and trailer videos
slideshows that are structured completely differently internally

I have no idea how to make both types work correctly. Currently the normal articles work fine, but the slide-shows come up like this

Here is the config file so far:

unable to parse "checklist" newyorker article

Steps to reproduce: parse https://www.newyorker.com/magazine/2007/12/10/the-checklist
expected: content parsed successfully
actual: fails to make content readable

Using https://f43.me/feed/test, the "Body after Readability" step is essentially an empty HTML document. Setting prune and tidy to no with a custom site config doesn't change the result on the test page.

Site config builder not working for sun-connect-news.org

I'm trying to create a site config for www.sun-connect-news.org, using this article as an example: http://www.sun-connect-news.org/de/articles/market/details/is-power-supply-a-priority-for-african-people/

But I'm not able to use the site config builder (http://siteconfig.fivefilters.org/): nothing appears in CSS and XPath fields :(

See here: http://siteconfig.fivefilters.org/grab.php?url=http%3A%2F%2Fwww.sun-connect-news.org%2Fde%2Farticles%2Fmarket%2Fdetails%2Fis-power-supply-a-priority-for-african-people%2F

Can you help me? Thanks!

Kenfm.de is missing

Hi, since I don't understand git so well and the limit of 1000 files is reached in the online editor, I upload the file here. Hope someone integrates it.
kenfm.de.txt
Thank you!

futurezone.at.txt is bad and should be deleted

The website www.futurezone.at changed its design and several article-related features. Using the old config returns only one picture and no content. If I delete that config, I get the full content.

The problem is that the config comes back whenever I update all the configs from GitHub. So please delete it, or, if someone is able to fix the config, that would be nice too.

regards, Andyt

How to contribute?

I'm trying to create filters with your tool at http://siteconfig.fivefilters.org/. How can I contribute them once they work?

The filter below worked; it's for virten.net:

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Fwww.virten.net%2F2013%2F04%2Fintel-cpu-evc-matrix%2F

body: //div[contains(concat(' ',normalize-space(@class),' '),' post-single ')]
test_url: https://www.virten.net/2013/04/intel-cpu-evc-matrix/

houzz.com: missing images

I have tried to write a site config for houzz.com, but I keep hitting a brick wall.

Example: http://www.houzz.com/ideabooks/38003610/list/the-top-10-houzz-articles-of-2014

   [13-Mar-2015 14:20:52 Australia/Sydney] PHP Warning:  DOMNode::cloneNode(): ID printLogo already defined in /home/articles/public_html/full-text-rss/libraries/content-extractor/ContentExtractor.php on line 527
   [13-Mar-2015 14:20:52 Australia/Sydney] PHP Warning:  DOMNode::cloneNode(): ID footercontent already defined in /home/articles/public_html/full-text-rss/libraries/content-extractor/ContentExtractor.php on line 527

create a config for blog.google.com

Hello,
Can you please create a config for the blog.google.com site?

Thank you,
Timo