Comments (7)

Jamstah commented on August 16, 2024

grep and grepi can be used directly, so you could do something like:

filter:
  - grep: 'price: <span>.*</span>'  # keep only the lines that match
  - re.sub:
      pattern: '^.*(price: <span>.*</span>).*$'
      repl: '\1'

from urlwatch.

Jamstah commented on August 16, 2024

findall might be easier, though. What are you thinking for the output: just put each match on a new line?
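A minimal sketch of that idea in plain Python, assuming the filter would simply call `re.findall` and join the results with newlines. The function name `findall_filter` and the sample HTML are hypothetical and are not taken from urlwatch.

```python
import re


def findall_filter(data: str, pattern: str) -> str:
    """Return every non-overlapping match of pattern in data, one per line."""
    # Without capture groups, re.findall returns a list of matched strings.
    matches = re.findall(pattern, data)
    return "\n".join(matches)


# Hypothetical input: two prices on separate lines of a page.
html = "<p>price: <span>2.39</span></p>\n<p>price: <span>4.10</span></p>"
print(findall_filter(html, r"price: <span>.*?</span>"))
```

The non-greedy `.*?` keeps each match from running on to a later `</span>` on the same line.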

f0sh commented on August 16, 2024

> findall might be easier, though. What are you thinking for the output: just put each match on a new line?

Yes, that's what I was thinking too.

I haven't checked the source yet to see how it is implemented, but it felt like it could be integrated more easily. Maybe the similar names of re.sub in urlwatch and in Python's re module fooled me.

Jamstah commented on August 16, 2024

Yes, it's not hard to add. It takes a little more than re.sub because you have to do something with the matches, whereas re.sub just gives you the string to return.
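To illustrate the difference in plain Python: `re.sub` hands back a finished string, while `re.findall` hands back a list, and that list contains tuples as soon as the pattern has two or more capture groups. The helper `format_matches`, its `sep` parameter, and the sample text are assumptions made for this sketch; how an actual urlwatch filter formats groups is not stated here.

```python
import re


def format_matches(pattern: str, data: str, sep: str = " ") -> str:
    """Turn the list returned by re.findall into one line per match."""
    lines = []
    for match in re.findall(pattern, data):
        # With two or more groups, re.findall yields tuples instead of strings.
        if isinstance(match, tuple):
            match = sep.join(match)
        lines.append(match)
    return "\n".join(lines)


text = "apple: 1.20 EUR, pear: 0.80 EUR"
pattern = r"(\w+): ([\d.]+) EUR"

# re.sub already returns the string to pass on.
replaced = re.sub(pattern, r"\1=\2", text)

# re.findall returns a list that still has to be turned into output.
listed = format_matches(pattern, text)

print(replaced)
print(listed)
```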

https://github.com/thp/urlwatch/blob/master/lib/urlwatch/filters.py#L831

Jamstah commented on August 16, 2024

I actually have a couple of places where this would simplify my filters, so I have put an implementation in #805. See what you think.

thp commented on August 16, 2024

For filtering out HTML elements, use the CSS or XPath filters. Never use regex.

f0sh commented on August 16, 2024

> For filtering out HTML elements, use the CSS or XPath filters. Never use regex.

For me, that was not the intention here. It's more that you want to extract certain data. I just tried to make my problem clearer by taking the previous example from the urlwatch docs. The scenario is more that you have a p element and want to extract some data from it, for example: "[...] and therefore for blablabla we set the price of 2.39€.[...]". The idea is to grab only the 2.39€ without the whole text.

With re.sub you always have to build a regex that matches the whole paragraph, which is error-prone. And the grep solution only works on full lines.
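A small Python sketch of that difference, using a made-up paragraph and a simplified price pattern (both are assumptions for illustration). The `re.sub` variant has to consume everything around the price, and if the pattern does not match it silently returns the whole paragraph. The `re.findall` variant only describes the price itself and returns nothing when there is no price.

```python
import re

# Simplified, hypothetical price pattern: digits, a dot, digits, a euro sign.
PRICE = r"\d+\.\d+€"


def extract_with_sub(text: str) -> str:
    # The pattern must swallow all text before and after the price;
    # (?s) lets "." match newlines so multi-line paragraphs are covered.
    return re.sub(r"(?s)^.*?(" + PRICE + r").*$", r"\1", text)


def extract_with_findall(text: str) -> str:
    # Only the price itself has to be described.
    return "\n".join(re.findall(PRICE, text))


paragraph = "Some intro text, and therefore we set the price of 2.39€. More text."
no_price = "This paragraph mentions no price at all."

print(extract_with_sub(paragraph))
print(extract_with_findall(paragraph))
print(repr(extract_with_sub(no_price)))      # unchanged input
print(repr(extract_with_findall(no_price)))  # empty string
```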

I actually got inspired by changedetection.io, which I tried recently because of its GUI, and it has this nice data extraction feature. However, its scripting is much more troublesome, so I would like to stick with urlwatch.

OT: It's just a bit frustrating that open source so often reinvents the wheel instead of joining forces. It would have been amazing if changedetection.io had used urlwatch under the hood to build a more powerful solution.
