Comments (6)
There's even an example in the docs:
Another useful option with XPath and CSS filters is ``exclude``.
Elements selected by this ``exclude`` expression are removed from the
final result. For example, the following job will not have any ``<a>``
tag in its results:

.. code:: yaml

   url: https://example.org/css-exclude.html
   filter:
     - css:
         selector: body
         exclude: a

And for that one, we even have testcase data:
https://example.org/css-exclude.html:
  input: |-
    <html>
    <body>
    <h1>A page in a book</h1>
    <p>And some paragraph, too. <a href="http://example.net/">Also check out example.net!</a></p>
    </body>
    </html>
  output: |
    <body>
    <h1>A page in a book</h1>
    <p>And some paragraph, too. </p>
    </body>
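To make the testcase concrete, here is a minimal sketch of the select-then-exclude idea in plain Python. It is not urlwatch's implementation: it uses the standard library's ``xml.etree.ElementTree`` as a stand-in for a real HTML parser, matches by tag name only instead of full CSS, and the helper name ``select_and_exclude`` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Input copied from the testcase data above.
PAGE = """<html>
<body>
<h1>A page in a book</h1>
<p>And some paragraph, too. <a href="http://example.net/">Also check out example.net!</a></p>
</body>
</html>"""


def select_and_exclude(markup, selector_tag, exclude_tag):
    """Hypothetical helper: pick the first element with selector_tag,
    then drop every descendant with exclude_tag before serializing."""
    root = ET.fromstring(markup)
    selected = root if root.tag == selector_tag else root.find(f".//{selector_tag}")
    if selected is None:
        return ""
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(selected.iter()):
        for child in list(parent):
            if child.tag == exclude_tag:
                # Note: ElementTree also discards the removed child's tail text.
                parent.remove(child)
    return ET.tostring(selected, encoding="unicode").strip()


result = select_and_exclude(PAGE, "body", "a")
print(result)
```

With the testcase input, the printed markup should correspond to the expected ``output`` above: the ``<body>`` element without the ``<a>`` tag.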
I guess it might be that ``selector: "*"`` is the problem, and using ``body`` or ``html`` as the selector might be better?
from urlwatch.
Closing this as "works for me" based on your comment.
If it's something that can be done line-based, ``grepi`` is the filter you might want, using regular expressions.
Unfortunately it is impossible using grep :(
Then I found out about ``css: selector: "*" exclude: ".ads"``. But this seems to somehow clone the output multiple times and makes the output unusable.
This ``exclude`` subfilter should actually be working. Do you have a minimal example (input / filter / expected output / actual output) that we can use to investigate the issue?
Thanks for the input!
Yes, replacing ``*`` with ``body`` did the trick! Now nothing gets cloned or otherwise mangled. Awesome.
Just for reference, these were my inputs:
stuff.html:
<div class="thumbnail" tr="Produktliste">
<a href="https://some-site.info/BRAND-Special-modern-type-Product-long-Muster-weiss" name="1263/24/91/40"><img src="/$WS/sitesite/websale8_shop-sitesite/produkte/medien/bilder/klein/OL6WAHM1_M0_1.jpg" title="BRAND Special modern fit Product long Muster wei" alt="BRAND Special modern fit Product long Muster wei" class="artikel"></a>
<div class="nachhaltigkeit-stoerer"></div>
</div>
<div class="product_list_div_content" >
<a href="https://some-site.info/BRAND-Special-modern-type-Product-long-Muster-weiss" class="produktLink pr-name"><span>BRAND Special modern fit Product long Muster wei</span></a>
<a href="https://some-site.info/BRAND-Special-modern-type-Product-long-Muster-weiss" class="produktLink discount_ja sale-stoerer BRAND-stoerer">%</a>
<div class="preisinfo">
<div class='produktbewertung bewertung_OL6WAHM1M0'></div>
<p class="uvpprice">UVP <span class="uvp">69.95 €</span></p>
<p class="price price_sale">54,99 €</p>
</div>
<div class="GroesseVisibleOnHover">
<div class="weitere_farben">
<p>Weitere Farben:</p>
<div class="weitere_farbe">
<a href="https://some-site.info/BRAND-Special-modern-type-Product-long-Muster-navy" itemprop="isSimilarTo"><img src="/$WS/sitesite/websale8_shop-sitesite/produkte/medien/bilder/klein/OL6WAHM1_M6_1.jpg" alt="BRAND Special modern fit Product long Muster navy preisreduziert" title="BRAND Special modern fit Product long Muster navy preisreduziert"></a>
</div>
</div>
</div>
</div>
<br class="clear">
</div>
command: "cat ~/stuff.html"
filter:
  - css:
      selector: "*"
      exclude: ".weitere_farben,div.thumbnail > a"
  - html2text:
      method: lynx
      nolist:
  - strip
WRONG output using ``*``:
BRAND Special modern fit Product long Muster wei %
UVP 69.95 €
54,99 €
BRAND Special modern fit Product long Muster wei %
UVP 69.95 €
54,99 €
BRAND Special modern fit Product long Muster wei %
UVP 69.95 €
54,99 €
BRAND Special modern fit Product long Muster wei %
UVP 69.95 €
54,99 €
BRAND Special modern fit Product long Muster wei BRAND
Special modern fit Product long Muster wei %
UVP 69.95 €
54,99 €
UVP 69.95 €
69.95 €
54,99 €
Correct output using ``body``, even though there's no body tag in the input ;-)
BRAND Special modern fit Product long Muster wei %
UVP 69.95 €
54,99 €
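The repeated blocks in the ``*`` output above are consistent with every matched element being serialized on its own: ``*`` matches each ancestor and each descendant, so nested text shows up once per enclosing element. The following is only a sketch of that effect, assuming per-element serialization; it uses the standard library's ``xml.etree.ElementTree`` rather than urlwatch's actual parser, and the tiny ``DOC`` snippet and helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced version of the product markup above.
DOC = '<body><div class="preisinfo"><p class="price">54,99</p></div></body>'


def select_all(root):
    """Mimic a `*` selector: every element in document order."""
    return list(root.iter())


def serialize(elements):
    """Serialize each selected element independently, one per line."""
    return "\n".join(ET.tostring(el, encoding="unicode") for el in elements)


root = ET.fromstring(DOC)
star_output = serialize(select_all(root))  # body, div and p are each emitted
body_output = serialize([root])            # only the outermost element is emitted

print(star_output.count("54,99"))  # 3
print(body_output.count("54,99"))  # 1
```

Selecting a single outer element such as ``body`` avoids the repetition because the nested content is emitted only once. It presumably works without a literal ``<body>`` tag in the input because HTML parsers typically wrap fragments in ``html`` and ``body`` elements.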