ftr-site-config's People

Contributors

bilelmoussaoui, burkemw3, coolius, digicommons, doc75, elibadou384, fivefilters, holgerausb, j0k3r, jangernert, janjastrow, jordidg, kdecherf, kreativmonkey, lukas0907, marmo, moneytoo, ngosang, nicosomb, shtrom, silberzwiebel, simounet, snptrs, stesie, strubbl, techexo, thiagotalma, timgws, tomtaylor, zinnober

ftr-site-config's Issues

create config for theverge.com

this article doesn't render properly at all here.

Testing on this site suggests that these XPaths would be a good start:

  • //div[contains(concat(' ',normalize-space(@class),' '),' l-wrapper ')]//div[contains(concat(' ',normalize-space(@class),' '),' l-segment ') and (contains(concat(' ',normalize-space(@class),' '),' l-feature '))]
  • //div[contains(concat(' ',normalize-space(@class),' '),' c-entry-hero__content ')]

But I'm not sure how to add two XPaths at once.
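One way to combine two XPaths is the XPath union operator `|`, which the nature.com patch further down this page already uses in a `body:` line. The following is a minimal Python sketch that only builds the strings; the helper names `class_xpath` and `union` are made up for illustration, and the resulting line has not been tested against theverge.com.

```python
def class_xpath(tag: str, *class_names: str) -> str:
    """Build an XPath matching <tag> elements that carry every given class token."""
    tests = [
        f"contains(concat(' ',normalize-space(@class),' '),' {name} ')"
        for name in class_names
    ]
    return f"//{tag}[" + " and ".join(tests) + "]"


def union(*xpaths: str) -> str:
    """Join alternative XPaths with the XPath union operator."""
    return " | ".join(xpaths)


# The two candidates from the issue text (hypothetical variable names).
feature = class_xpath("div", "l-wrapper") + class_xpath("div", "l-segment", "l-feature")
hero = class_xpath("div", "c-entry-hero__content")

# Prints a single body: line that selects nodes matched by either expression.
print("body: " + union(feature, hero))
```

A union selects every node matched by either side, so both blocks end up in the extracted body if both are present on the page.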

ansible documentation pages cannot be parsed

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=http%3A%2F%2Fdocs.ansible.com%2Fansible%2Flatest%2Fpam_limits_module.html

body: //div[@id='pam-limits-modify-linux-pam-limits']
test_url: http://docs.ansible.com/ansible/latest/pam_limits_module.html

Ability to bypass redirection

Vk.com redirects to /badbrowser.php, so the content cannot be fetched. If the redirection is ignored, the content is accessible.
Can we somehow disable redirection for a site?

Use stackoverflow config for stackexchange

Hello everyone,
I don't want to open a PR for this since it is partly a matter of taste, but I think the site config for stackoverflow.com (link) should be copied to .stackexchange.com (link): both sites run on the same engine, and the current .stackexchange.com config is, in my opinion, sub-optimal.

Provide guidance on file naming

Will someone please provide guidance on file naming?

I keep submitting files named .example.com.txt, primarily because most of the first listed files follow that naming scheme. A few of my PRs have had their file names adjusted, but I don’t fully understand why.

Adding something to the readme’s contributing section would be helpful, at least for me, if you maintainers don’t want to keep adjusting file names.

tweakers.net pattern doesn't work anymore

Version of Full-Text RSS: 3.9.11
Version of Site Patterns: 2021-05-26T01:09:01Z

Most of the time I get an [unable to retrieve full-text content] error with Tweakers.net. The pattern dates from 2018 and the website has been redesigned since then, so the pattern should be updated.

With the point-and-click interface, I could select the body in 3 types of articles on the site. Test links are included in the patterns below:

News article:

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Ftweakers.net%2Fnieuws%2F182324%2Fgoogle-probeerde-telefoonmakers-privacy-instellingen-te-laten-verstoppen.html

body: //div[contains(concat(' ',normalize-space(@class),' '),' article ')]
test_url: https://tweakers.net/nieuws/182324/google-probeerde-telefoonmakers-privacy-instellingen-te-laten-verstoppen.html

Multi-page articles (not every page can be parsed, so I think it is best to handle only the first page):

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Ftweakers.net%2Freviews%2F9040%2Fbluetooth-trackers-round-up-zoekt-en-gij-zult-niet-altijd-vinden.html

body: //div[contains(concat(' ',normalize-space(@class),' '),' centeredContent ')]
test_url: https://tweakers.net/reviews/9040/bluetooth-trackers-round-up-zoekt-en-gij-zult-niet-altijd-vinden.html

Software updates:

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Ftweakers.net%2Fdownloads%2F56134%2Fparallels-desktop-160.html

body: //div[contains(concat(' ',normalize-space(@class),' '),' articleColumn ')]
test_url: https://tweakers.net/downloads/56134/parallels-desktop-160.html

I'm not sure how to edit the pattern to cover all 3 types of articles or how to test it, since I use an RSS reader that relies on this service (BazQux) rather than self-hosting it. I hope this helps with updating the pattern.
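The three generated snippets above could be folded into one file. The sketch below is a hedged illustration in Python: it assumes that repeated `body:` and `test_url:` lines are accepted in a single config, the way the nature.com config quoted further down this page repeats `date:`. The function name `merge_site_configs` is hypothetical, and the merged result is untested against tweakers.net.

```python
def merge_site_configs(*configs: str) -> str:
    """Merge site config snippets: drop comments, blanks and duplicate lines."""
    seen = set()
    merged = []
    for config in configs:
        for line in config.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if line not in seen:
                seen.add(line)
                merged.append(line)
    # Stable sort: keeps the original order but moves test_url lines to the end.
    merged.sort(key=lambda entry: entry.startswith("test_url:"))
    return "\n".join(merged)


NEWS = """
body: //div[contains(concat(' ',normalize-space(@class),' '),' article ')]
test_url: https://tweakers.net/nieuws/182324/google-probeerde-telefoonmakers-privacy-instellingen-te-laten-verstoppen.html
"""
REVIEW = """
body: //div[contains(concat(' ',normalize-space(@class),' '),' centeredContent ')]
test_url: https://tweakers.net/reviews/9040/bluetooth-trackers-round-up-zoekt-en-gij-zult-niet-altijd-vinden.html
"""
DOWNLOAD = """
body: //div[contains(concat(' ',normalize-space(@class),' '),' articleColumn ')]
test_url: https://tweakers.net/downloads/56134/parallels-desktop-160.html
"""

print(merge_site_configs(NEWS, REVIEW, DOWNLOAD))
```

If repeated `body:` lines turn out not to behave as assumed, the same three expressions can instead be joined into one line with the XPath union operator `|`.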

Paywall Nextinpact new version

The website nextinpact has released a new version of their website and it seems articles cannot be extracted anymore.

Undocumented patterns

I'm trying to add some validation tests for the site configs to catch mistakes in them, and I found undocumented patterns in some of the files (I'll submit fixes for other files as a follow-up).

Here is the list:

I was wondering whether these patterns are obsolete, new, unused, etc.
I can't find them in the documentation or in the current open source version of Full-Text RSS. Were they introduced in the current version of Full-Text RSS? (That would mean we can't see how they are handled.)

Let me know 🙂
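The kind of validation described in this issue might look roughly like the Python sketch below. It is not the author's actual test code. `KNOWN_DIRECTIVES` is an illustrative subset limited to directives that appear in configs quoted on this page, not the full documented list, and the function name is hypothetical.

```python
import re

# Illustrative subset only: directives seen in configs quoted on this page.
KNOWN_DIRECTIVES = {"title", "body", "date", "author", "strip", "test_url", "http_header"}

# Matches "name: value" and "name(argument): value".
LINE_RE = re.compile(r"^(?P<name>[a-z_]+)(?:\((?P<arg>[^)]*)\))?\s*:\s*(?P<value>.*)$")


def unknown_directives(config_text: str) -> list[tuple[int, str]]:
    """Return (line number, directive) pairs for directives not in the known set."""
    found = []
    for number, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE_RE.match(line)
        # Lines that do not even look like a directive are reported verbatim.
        name = match.group("name") if match else line
        if name not in KNOWN_DIRECTIVES:
            found.append((number, name))
    return found


sample = "# comment\nbody: //div\nhttp_header(Cookie): a=b\nfoo_bar: 1\n"
print(unknown_directives(sample))
```

Run over every file in the repository, a check like this would produce the list of candidates to document or remove.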

Sandbox for testing configuration.

I think it's cool that we can send configuration files specific to a site.

Would it be too much to ask for a sandbox to test configuration files before posting them? :-)

ad.nl and related sites, it seems, no longer work

Using the latest Wallabag Docker image, sites published by DPG Media no longer work: the cookie wall isn't automatically bypassed. (Ex. from source: http://www.ad.nl/ad/nl/10444/Offside/article/detail/4043834/2015/05/31/Dani-Alves-voetbalt-met-drol-op-zijn-hoofd.dhtml)

Stumbled upon this by trying to copy and adapt settings for ad.nl and parool.nl to demorgen.be (and, perhaps hln.be), as they should all be using the same cookie wall technology.

It'd be awesome if these could be fixed/added.

Possible additional test URL: https://www.demorgen.be/nieuws/locked-and-loaded-trump-dreigt-met-militaire-actie-tegen-iran-olieprijzen-schieten-omhoog~b82ff8c9/

Also affected: volkskrant.nl, see ex. https://www.volkskrant.nl/nieuws-achtergrond/privacywaakhond-verbiedt-cookiemuren-op-websites-ja-ook-die-op-volkskrant-nl~b7dab7ee/

unable to parse zeit.de

It seems the last change (#773) to zeit.de.txt concerning the cookie no longer works.
The cookie content is now a timestamp in the format "2020-09-03T19:45:59.150Z".
I tried inserting that in place of the previous fixed text, and it even worked on f43.me.
http_header(Cookie): zonconsent="2020-09-03T19:45:59.150Z"
When I changed the timestamp to August 3rd as a test, it didn't work.
So the question is: how can the date (and time?) be set dynamically?

Test-URL
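Nothing in this thread shows the site config format computing a value by itself, so the sketch below only illustrates how a timestamp in the quoted format could be generated outside the config, for example by a script that rewrites the `http_header(Cookie)` line. The function names are hypothetical, and whether zeit.de accepts any sufficiently recent timestamp is an untested assumption.

```python
from datetime import datetime, timezone


def consent_timestamp(moment: datetime) -> str:
    """Format a datetime like 2020-09-03T19:45:59.150Z (UTC, millisecond precision)."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def cookie_header_line(moment: datetime) -> str:
    """Render the config line from the issue with a freshly formatted timestamp."""
    return f'http_header(Cookie): zonconsent="{consent_timestamp(moment)}"'


# Prints the header line for the current moment.
print(cookie_header_line(datetime.now(timezone.utc)))
```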

florian-casse.fr does not parse

I used wallabag to add it and the content wasn't fetched, so I came here because of the tool at http://siteconfig.fivefilters.org/.

I made the selection, but the page still doesn't parse. Any thoughts?

https://florian-casse.fr/clear-dns-cache-on-vcsa-6-5-and-later/

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Fflorian-casse.fr%2Fclear-dns-cache-on-vcsa-6-5-and-later%2F

body: //div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]
test_url: https://florian-casse.fr/clear-dns-cache-on-vcsa-6-5-and-later/

Website Arrêt sur Images has wrong paywall URL

The login URL in arretsurimages.net.txt should not be:
http://www.arretsurimages.net/forum/login.php
Instead, it should be:
http://www.arretsurimages.net
The login form is now directly accessible from the home page. The fields still have the same ids: username and password.
This may help fix the paywall detection, although I can't test it myself.

faz.net paywall articles shows payment-hint instead of the teaser as content

faz.net has free articles and F+ articles behind a paywall. For F+ articles, the page shows only the headline and a short teaser, followed by a notice that you have to pay 2.95 EUR per week to continue reading.

In this case, Full-Text RSS takes only the payment notice as content and not the teaser.

Can someone please adjust the site config so that the teaser is extracted for F+ articles? Ideally with a short (German) hint that the rest is available only with a subscription.

Examples:

free:
http://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fwww.faz.net%2Faktuell%2Fsport%2Folympische-winterspiele%2Fdeutsches-team%2Fkatharina-althaus-holt-olympia-silber-in-china-im-skispringen-17782373.html&max=3

f+:
http://ftr.fivefilters.net/makefulltextfeed.php?url=https%3A%2F%2Fwww.faz.net%2Faktuell%2Fpolitik%2Fausland%2Fdie-bergbaustadt-broken-hill-in-australien-setzt-auf-solarenergie-17735957.html&max=3

Since I host Full-Text RSS on my own server, just the required config lines would also be fine for me.
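For anyone reproducing the two test links above against another instance, the URLs can be rebuilt with standard percent-encoding. This is a small Python sketch; the function name `make_fulltext_url` is hypothetical, and the default endpoint is simply the one used in the examples, to be replaced by the address of a self-hosted installation.

```python
from urllib.parse import urlencode

# Endpoint taken from the example links in this issue.
DEFAULT_ENDPOINT = "http://ftr.fivefilters.net/makefulltextfeed.php"


def make_fulltext_url(article_url: str, max_items: int = 3,
                      endpoint: str = DEFAULT_ENDPOINT) -> str:
    """Build a makefulltextfeed.php URL with the article URL percent-encoded."""
    return endpoint + "?" + urlencode({"url": article_url, "max": max_items})


free_article = (
    "https://www.faz.net/aktuell/sport/olympische-winterspiele/deutsches-team/"
    "katharina-althaus-holt-olympia-silber-in-china-im-skispringen-17782373.html"
)
print(make_fulltext_url(free_article))
```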

Reddit filter change

Reddit has changed the way they set up their pages (at some point, not recently), and the filter now includes the comments and the sidebar, which should be stripped.

COPYING/license file missing

Hi,

I think it would be good to provide a COPYING file that clearly states the license of this work.

I am considering using these files as well, but I'm a bit confused about the legal situation.

So it sounds like "hey, it's public domain" ... right?
If so, could you please add a COPYING file stating so? :-)

cheers
~stesie

Articles from aps.dz cannot be fetched

Hello,

my Wallabag cannot retrieve articles of the official Algerian press agency APS (http://www.aps.dz/) - and unfortunately, the site config builder gives me either a blank page or a 404 error when I try to load an article. Any ideas why this website doesn't work?

Thanks in advance!

The Verge site config does not pull images

I've played around with it a lot. It does not pull any images. This is happening with a lot of sites for me, but since The Verge's feed is easy to test, I'm reporting it as the test case.

outlined words get censored on psychologytoday.com

This article about Richard Stallman is really odd to read as parsed content, because all the words with a dotted underline are simply removed from the output. I reproduced this on the test site.

The last sentence of the second paragraph should read like:

If you’ve heard of open source (free software’s practice sans its moral stance) or Linux (really GNU, plus a program called Linux), you can thank Stallman.

(Emphasis on the removed word added.)

Instead it is:

If you’ve heard of open source (free software’s practice sans its stance) or Linux (really GNU, plus a program called Linux), you can thank Stallman.

Thanks!

Specify title in site config file

In this example wallabag can extract the article without a specific site config, but it can't set the title. Is there a way to specify where the title is in the site config?

http://siteconfig.fivefilters.org/grab.php?url=https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access

body: //div[contains(concat(' ',normalize-space(@class),' '),' entry-content ')]
test_url: https://www.kapilarya.com/fix-you-cant-access-this-shared-folder-because-your-organizations-security-policies-block-unauthenticated-guest-access

Lesserwrong.com page cannot be parsed

Hi,

I'd like to parse pages from www.lesserwrong.com. I've tried creating a site config based on this page:
https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

This is what my site config looks like:

title: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-title ')]//h1
body: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-html ')]
date: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-body-metadata-date ')]
author: //div[contains(concat(' ',normalize-space(@class),' '),' posts-page-content-header-author ')]
test_url: https://www.lesserwrong.com/rationality/what-do-we-mean-by-rationality

As far as I can tell, these XPaths all point to the correct elements inside that page. However, the tool at https://f43.me/feed/test still fails to parse the page. Did I mess up the site config, or is this a bug in the parser (and if so, is this the right repository to report such a bug)?

thaitable.com

//div[@id='main'] would be a good start for that. It includes a mailing list block and other clutter, but I can't find a better match.

blogs.gnome.org fail to load

Failing test blog post:
https://blogs.gnome.org/aday/2017/08/08/the-gnome-way/

This is the error message from https://f43.me/feed/test:

Request throw exception (with a response): Client error response [url] https://blogs.gnome.org/aday/2017/08/08/the-gnome-way/?not-changed [status code] 403 [reason phrase] Bad Behavior

I tried your point-and-click page to find a proper site filter but failed. The fetched content is always this:

HTTP Error 403

We're sorry, but we could not fulfill your request for /aday/2017/08/08/the-gnome-way/?not-changed on this server.

An invalid request was received from your browser. This may be caused by a malfunctioning proxy server or browser privacy software.

Your technical support key is: b01f-6435-1756-6707

You can use this key to fix this problem yourself.

If you are unable to fix the problem yourself, please contact allanday at gnome.org and be sure to provide the technical support key shown above.

nature.com Improvement

This patch makes article body extraction for nature.com more exact:

--- a/nature.com.txt    2021-07-23 12:11:36.331873505 +0200
+++ b/nature.com.txt    2021-07-23 12:11:17.747730246 +0200
@@ -2,7 +2,7 @@
 date: //meta[@name="dc.date"]/@content
 date: //meta[@name="prism.publicationDate"]/@content
 author: //meta[@name='dc.creator']/@content
-body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')]
+body: //div[contains(concat(' ',normalize-space(@class),' '),' article__body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' article-body ')] | //div[contains(concat(' ',normalize-space(@class),' '),' c-article-body ')]
 
 strip: //div[contains(concat(' ',normalize-space(@id),' '),' further-reading-section ')]

filmstarts.de: slideshows

Although I wrote this config file myself, I can't figure this issue out.
This site basically has 2 very different types of articles:

  1. normal articles with some text, images and trailer videos
  2. slideshows that are structured completely differently internally

I have no idea how to make both types work correctly. Currently the normal articles work fine, but the slideshows come out like this:

screenshot from 2016-05-20 02-07-53

Here is the config file so far:

Site config builder not working for sun-connect-news.org

Kenfm.de is missing

Hi, since I don't understand git very well and the online editor's limit of 1000 files has been reached, I am uploading the file here. I hope someone integrates it.
kenfm.de.txt
Thank you!

futurezone.at.txt is bad and should be deleted

The website www.futurezone.at changed its design and several features, according to the article. With the old config, only one picture is returned and no content. If I delete that config, I get the full content.

The problem is that the old config comes back whenever I update all configs from GitHub. So please delete the file, or, if someone is able to fix the config, that would be nice too.

regards, Andyt

How to contribute?

I'm trying to create filters with your tool at http://siteconfig.fivefilters.org/. How can I contribute once they work?

The filter below worked; it is for virten.net:

# Generated by FiveFilters.org's web-based selection tool
# Place this file inside your site_config/custom/ folder
# Source: http://siteconfig.fivefilters.org/grab.php?url=https%3A%2F%2Fwww.virten.net%2F2013%2F04%2Fintel-cpu-evc-matrix%2F

body: //div[contains(concat(' ',normalize-space(@class),' '),' post-single ')]
test_url: https://www.virten.net/2013/04/intel-cpu-evc-matrix/

houzz.com: missing images

I have tried to write a site config for houzz.com, but I keep hitting a brick wall.

Example: http://www.houzz.com/ideabooks/38003610/list/the-top-10-houzz-articles-of-2014

   [13-Mar-2015 14:20:52 Australia/Sydney] PHP Warning:  DOMNode::cloneNode(): ID printLogo already defined in /home/articles/public_html/full-text-rss/libraries/content-extractor/ContentExtractor.php on line 527
   [13-Mar-2015 14:20:52 Australia/Sydney] PHP Warning:  DOMNode::cloneNode(): ID footercontent already defined in /home/articles/public_html/full-text-rss/libraries/content-extractor/ContentExtractor.php on line 527
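Warnings of the form "ID ... already defined" are what an XML/HTML DOM reports when the same `id` value occurs more than once in a document. Whether that is related to the missing images is not established here. As a hedged diagnostic sketch, the Python snippet below lists duplicated ids in a saved copy of a page; the class and function names are hypothetical and the sample markup is invented.

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Count how often each id attribute value occurs in a document."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags via the default handle_startendtag.
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html: str) -> list[str]:
    """Return the sorted id values that appear more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(value for value, count in collector.ids.items() if count > 1)


# Invented sample markup with one repeated id.
sample_html = '<div id="printLogo"></div><p id="intro"></p><img id="printLogo"/>'
print(duplicate_ids(sample_html))
```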
