
Comments (7)

dgtlmoon avatar dgtlmoon commented on June 11, 2024

the problem is this regex specifically /^\s*(\d+\s*)+$/

can you explain what this is for?

from changedetection.io.

dgtlmoon avatar dgtlmoon commented on June 11, 2024
Yes, it's quadratic time.  If the string being searched has N characters, first it fails to find "x" in all N of 'em, then `.*` advances by one and it fails to find "x" in the trailing N-1 characters, then again in the trailing N-2, and so on.  N + N-1 + N-2 + ... + 1 is quadratic in N.

That's how this kind of regexp engine works.  And it's mild, as such things go:  you can also create poor regexps that take time _exponential_ in N that fail to match certain strings.
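To make the arithmetic concrete, here is a minimal sketch that counts the character checks described above. It is a simplified model of the scan, not a trace of Python's actual regexp engine, and `naive_scan_cost` is a hypothetical helper name.

```python
def naive_scan_cost(n):
    """Count character checks when searching for "x" from every start
    offset of an n-character string that contains no "x"."""
    text = "a" * n
    checks = 0
    for start in range(n):       # each round begins one character further in
        for ch in text[start:]:  # scan the remaining tail for "x"
            checks += 1
            if ch == "x":
                break
    return checks


for n in (10, 100, 1000):
    # N + (N-1) + ... + 1 equals N * (N + 1) / 2
    print(n, naive_scan_cost(n), n * (n + 1) // 2)
```

Multiplying the input length by 10 multiplies the work by roughly 100, which is what quadratic growth means in practice.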

It's unlikely this will change without replacing Python's regexp implementation entirely.  For why, see Jeffrey Friedl's book "Mastering Regular Expressions" (published by O'Reilly).  That book also spells out techniques for crafting regexps that don't suck ;-)  It's not a small topic, alas.

https://bugs.python.org/issue35915


dgtlmoon avatar dgtlmoon commented on June 11, 2024

the problem is your regex. the other problem is that the system doesn't time out: the regex works, but it takes an exponentially long time to run. your regex is bad and the system doesn't catch it.

the fix is to place the call to

changed_detected, update_obj, contents = update_handler.run_changedetection(uuid,
in a thread and wrap that thread with a timeout; on timeout, it should throw an error that suggests checking all regexes etc.

maybe something like


import re
import threading


def search_with_timeout(pattern, text, timeout=3):
    """Run re.search in a worker thread and give up after `timeout` seconds."""
    result = [None]

    def worker():
        result[0] = re.search(pattern, text)

    # Create and start the thread; daemon=True so that a runaway search
    # cannot keep the process alive at exit
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    # Wait for the thread to finish or for the timeout to expire
    thread.join(timeout)

    # If the thread is still alive, the search exceeded the timeout
    if thread.is_alive():
        # Python threads cannot be terminated from outside, so the search
        # keeps running in the background; all we can do is report it.
        # Caveat: CPython's re holds the GIL while matching, so for a
        # runaway pattern join() itself may not return on time.
        raise TimeoutError("Search operation timed out, check your regexes")

    return result[0]

# Example usage
pattern = r'your_pattern_here'
text = 'your_text_here'

try:
    result = search_with_timeout(pattern, text)
except TimeoutError as e:
    print(e)
else:
    if result:
        print("Match found:", result.group())
    else:
        print("No match found.")
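Since a Python thread cannot be killed, and CPython's `re` does not release the GIL while matching, one alternative is to run the search in a child process that can actually be terminated. The following is only a sketch under those assumptions: `search_in_subprocess` and `_CHILD` are hypothetical names, not part of changedetection.io, and starting a new interpreter per search is slow.

```python
import subprocess
import sys

# Script run by the child interpreter: the pattern comes from argv, the text
# from stdin. Exit code 0 means "match", 3 means "no match"; anything else
# is an error (for example an invalid pattern).
_CHILD = (
    "import re, sys\n"
    "m = re.search(sys.argv[1], sys.stdin.read())\n"
    "sys.stdout.write(m.group() if m else '')\n"
    "sys.exit(0 if m else 3)\n"
)


def search_in_subprocess(pattern, text, timeout=3):
    """Return the matched text, or None if the pattern does not match.

    Raises subprocess.TimeoutExpired if the search exceeds `timeout`
    seconds; subprocess.run kills the child process in that case.
    """
    proc = subprocess.run(
        [sys.executable, "-c", _CHILD, pattern],
        input=text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        return proc.stdout
    if proc.returncode == 3:
        return None
    raise RuntimeError(proc.stderr)
```

A caller would catch `subprocess.TimeoutExpired` and surface an error suggesting the user check their regexes, as proposed above.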


jgupta avatar jgupta commented on June 11, 2024

the problem is this regex specifically /^\s*(\d+\s*)+$/

can you explain what this is for?

I am not good at regex. It was created by ChatGPT 4. My intention was to ignore lines that have just numbers or spaces. Below is the entire ChatGPT 4 response.

To match a line containing only numbers that may include spaces between them, such as "8 4 6 2 1 9 6 6", and also match lines that have just a number without spaces, you can use the following regular expression:

/^\s*(\d+\s*)+$/ 

This regular expression does the following:

  • ^ asserts the position at the start of the line.
  • \s* matches any whitespace characters (like spaces or tabs) zero or more times.
  • (\d+\s*) is a group that matches one or more digits followed by zero or more whitespace characters.
  • + after the group (\d+\s*) means that this group can appear one or more times, allowing for multiple numbers separated by spaces.
  • $ asserts the position at the end of the line.

So, this regex will match a line with a single number, as well as lines with multiple numbers separated by spaces.


jgupta avatar jgupta commented on June 11, 2024

The following works and should be better, according to ChatGPT 4.

The regular expression you've provided /^\s*(\d+\s*)+$/ is intended to match lines that consist solely of numbers with optional whitespace characters between them. However, it is susceptible to what's known as "catastrophic backtracking," which can occur when the regex engine has to evaluate a large number of possible ways to match a pattern. This happens because the pattern (\d+\s*)+ is highly ambiguous: \d+ can match as many digit characters as possible, and \s* can match as many whitespace characters as possible. The + at the end allows this entire group to repeat, creating many possible combinations for the regex engine to try and match.

To improve the performance of this regex, we can try to make the quantifiers less ambiguous and remove unnecessary repetition. A revised version might look like this:

/^\s*\d+(?:\s+\d+)*\s*$/

Here’s what’s changed:

  • Instead of (\d+\s*)+, it now uses \d+(?:\s+\d+)*, which will match a number, followed by zero or more groups of one or more whitespace characters followed by another number. This pattern is less prone to backtracking because the + inside the non-capturing group (?: ... ) requires at least one whitespace character to be present for a match to continue, eliminating the ambiguity of \s*.
  • The non-capturing group (?: ... ) is used with * to match any additional numbers separated by whitespace without capturing them, which is more efficient in many regex engines.

This optimized pattern should perform much better because it guides the regex engine more precisely, reducing the potential for excessive backtracking.
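As a quick sanity check, the sketch below compares the two patterns on a few short sample lines and then runs only the revised pattern on a long digit run followed by a non-digit, the kind of input that makes the original backtrack catastrophically. The sample strings and the `agree` helper are illustrative assumptions, not taken from the original report.

```python
import re

ORIGINAL = re.compile(r"^\s*(\d+\s*)+$")
REVISED = re.compile(r"^\s*\d+(?:\s+\d+)*\s*$")


def agree(line):
    """True if both patterns give the same match/no-match verdict."""
    return bool(ORIGINAL.match(line)) == bool(REVISED.match(line))


samples = ["8 4 6 2 1 9 6 6", "42", "  7  8  ", "", "abc", "1 2 a", "12x"]
for line in samples:
    print(repr(line), bool(REVISED.match(line)), agree(line))

# Only the revised pattern is tried here: the original can split a run of
# digits between repetitions of its group in exponentially many ways.
hostile = "1" * 5000 + "x"
print(REVISED.match(hostile))
```

The two patterns agree on these samples; that does not prove they are equivalent on every input.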


dgtlmoon avatar dgtlmoon commented on June 11, 2024

I guess this is the classic old problem of pasting code that you don't fully understand.


jgupta avatar jgupta commented on June 11, 2024

I guess this is the classic old problem of pasting code that you don't fully understand.

So true. That is exactly why good software should have all kinds of safety nets wherever it allows users to inject code into an input box.

Your proposed solution to time out and throw an error looks good.

