<div class="snippet-clipboard-content notranslate position-relative overflow-auto" data-snippet-clip

The output of self.cal.parse('1 month', start) is <c

Ok, I turned on lots of logging. Interesting results: <div class="snippet-clipboar

testMonths fails about parsedatetime HOT 14 CLOSED

bear commented on August 18, 2024

testMonths fails

from parsedatetime.

Comments (14)

zed commented on August 18, 2024

Whether the test succeeds depends on PYTHONHASHSEED envvar e.g., if it equals 12 then all tests pass:

$ PYTHONHASHSEED=12 python3 run_tests.py
.....................................................
----------------------------------------------------------------------
Ran 53 tests in 0.292s

OK

but with PYTHONHASHSEED=123 testMonths fails. The hash value is also may depend on the python version. The above values are for Python 3.4.

The result (as expected) doesn't depend on the PyICU version installed. I've tried with 1.5 and 1.8.

from parsedatetime.

zed commented on August 18, 2024

testMonths continues to fail on the earliest parsedatetime version that I can run 7c5e6a0

from parsedatetime.

fake-name commented on August 18, 2024

Can confirm this issue. I'm seeing strings like 1 month ago fail to be parsed at all in the current version from pip.

Edit: Just checked out the HEAD. I'm seeing the exact same behaviour as zed describes.

from parsedatetime.

fake-name commented on August 18, 2024

The output of self.cal.parse('1 month', start) is (time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=3, tm_sec=0, tm_wday=6, tm_yday=292, tm_isdst=-1), 2), so it looks like the parser is incorrectly classifying the string as a time, rather then a date.

This is supported by comparing the output of self.cal.parse('1 month', start) and start:

Parsed:  
time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=4, tm_sec=25, tm_wday=6, tm_yday=292, tm_isdst=-1)
Now:  
time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=3, tm_sec=25, tm_wday=6, tm_yday=292, tm_isdst=-1)

So, it appears that the parser is incorrectly parsing the "1" as a minute quantity with a positive offset.

from parsedatetime.

fake-name commented on August 18, 2024

Ok, I turned on lots of logging. Interesting results:

herp@mainnas:/media/Storage/Scripts/parsedatetime$ python3 run_tests.py parsedatetime.tests.TestUnits.test.testMonths
daysInMonth(11, 2014)

parse()
parse (top of loop): [1 month][]
parse 1 [][][]
parse 2 [][][]
CRE_UNITS matched
parse 3 [1 m][][onth]
parse 4 [1 m][][onth]
parse 5 [1 m][][onth]
parse 6 [1 m][][onth]
parse 7 [1 m][][onth]
parse 8 [1 m][][onth]
parse 9 [1 m][][onth]
parse A [1 m][][onth]
parse (bottom) [ onth][1 m][][onth]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False
dayStr False, modifier False, modifier2 False, units True, qunits False
_evalString(1 m, time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=12, tm_sec=49, tm_wday=6, tm_yday=292, tm_isdst=-1))
unitsFlag is set
_buildTime: [1 ][][m]
units m --> realunit minutes
parse (top of loop): [ onth][]
parse 1 [][][]
parse 2 [][][]
parse 3 [][][]
parse 4 [][][]
parse 5 [][][]
parse 6 [][][]
parse 7 [][][]
parse 8 [][][]
parse 9 [][][]
parse A [][][]
parse (bottom) [][][][]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False
dayStr False, modifier False, modifier2 False, units False, qunits False
parse() return dateFlag 0 timeFlag 2 totalTime time.struct_time(tm_year=2014, tm_mon=10, tm_mday=19, tm_hour=6, tm_min=13, tm_sec=49, tm_wday=6, tm_yday=292, tm_isdst=-1)
F
======================================================================
FAIL: testMonths (parsedatetime.tests.TestUnits.test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/media/Storage/Scripts/parsedatetime/parsedatetime/tests/TestUnits.py", line 110, in testMonths
    self.assertTrue(_compareResults(self.cal.parse('1 month',  start), (target, 1)))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.144s

So it looks like, /sometimes only/, the "m" is being interpreted as the prefix for "minutes", rather then as part of the value "months".

It always happens in the parse 3 stage.

from parsedatetime.

fake-name commented on August 18, 2024

Ok, I've figured out where the issue is.

Basically, it appears that the ordering of the contents of self.locale.re_values['numbers'] is non-determinstic. self.locale.re_values['numbers'] contains both the literal m and months.

When the literal m sorts before months, the test fails, because the regex binds to the m preferentially. When the months sorts first, everything works fine.

I think the solution here would be to add a word boundary match to the self.RE_UNITS regex, so the regex has to match an entire word, rather then just the starting boundary and the first n characters.

In other words:

        self.RE_UNITS     = r'''(?P<qty>(-?(\b(%(numbers)s)\b|\d+)\s*
                                         (?P<units>((\b%(units)s)s?))
                                        ))''' % self.locale.re_values

Becomes

        self.RE_UNITS     = r'''(?P<qty>(-?(\b(%(numbers)s)\b|\d+)\s*
                                         (?P<units>((\b(%(units)s))s?\b))
                                        ))''' % self.locale.re_values

Note that I had to add an additional set of parenthesis, because as it is, (\b%(units)s) is evaluating to (\bsecond|seconds|sec|s|month|months|mth|year|years|yr|y|hour|hours|hr|h|minute|minutes|min|m|week|weeks|wk|w|day|days|dy|d), and as such only the first item is actually even using the \b boundary.

The boundary grouping issue is also occuring in the RE_QUNITS regex, so there are certain conditions where it improperly matches as well:

parse() return dateFlag 1 timeFlag 0 totalTime time.struct_time(tm_year=2014, tm_mon=11, tm_mday=19, tm_hour=6, tm_min=34, tm_sec=0, tm_wday=2, tm_yday=323, tm_isdst=-1)
parse()
parse (top of loop): [1month][]
parse 1 [][][]
parse 2 [][][]
M None
parse 3 [][][]
CRE_QUNITS matched
parse 4 [1m][][onth]
parse 5 [1m][][onth]
parse 6 [1m][][onth]
parse 7 [1m][][onth]
parse 8 [1m][][onth]
parse 9 [1m][][onth]
parse A [1m][][onth]
parse (bottom) [ onth][1m][][onth]
weekday False, dateStd False, dateStr False, time False, timeStr False, meridian False

(with PYTHONHASHSEED=152)

from parsedatetime.

fake-name commented on August 18, 2024

Actually, considering some of the comments, I think the core of the issue might be the values of self.locale.re_values['units'] which both seems to include single char date characters, which contradicts some of the comments in the code:

units year|years|yr|y|day|days|dy|d|hour|hours|hr|h|minute|minutes|min|m|second|seconds|sec|s|week|weeks|wk|w|month|months|mth


        # Given string is a time string with units like "5 hrs 30 min"
        if self.unitsFlag:
--- SNIP ---
        # Given string is a time string with single char units like "5 h 30 m"
        if self.qunitsFlag:

from parsedatetime.

fake-name commented on August 18, 2024

Ok, so the root of the issue is that single-character duration labels are getting into self.locale.units somehow.

I filtered them out:

(line 2226... in __init__.py)

            l = []
            for s in self.locale.units:
                l = l + self.locale.units[s]
---- to ----
            l = []
            for s in self.locale.units:
                l = l + [x for x in self.locale.units[s] if len(x) > 1]

And it fixes the issue initially described. I still see periodic failures in the 1month match, but I suspect that's a deeper issue with the regex.

Sorry if this thread has gotten a bit long. This is an interesting problem, and it's affecting a project I'm working on, so I've been poking at it.

from parsedatetime.

fake-name commented on August 18, 2024

Ok, the fix as I described above fixes the month parsing issue, but breaks a number of the other tests, primarily those using the single-char time notation.

I suspect what it's actually doing is preventing the self.RE_UNITS regex from occationally working in some contexts, which is exposing an existing issue with the self.RE_QUNITS regex that was already there, and just papered over by the accidental functionality of the self.RE_UNITS regex, but the regexes are melting my brain at this point, so I'm going to stop for the moment.

from parsedatetime.

zed commented on August 18, 2024

Note: locale.units is a dictionary that is why the order depends on PYTHONHASHSEED. I've created PR #57 that fixes this issue.

from parsedatetime.

bear commented on August 18, 2024

Let me know if either of you would like to be added as contributors - your both actively using and fixing bugs.

from parsedatetime.

fake-name commented on August 18, 2024

I'd be willing to spend some time trying to make the regex stuff more concise.

Particularly, is the use of named %(name)s string formatting a legacy compatibility thing? Moving to '{val}'.format(val=something) makes things MUCH more readable, and makes scoping much easier to understand.

The .format() method was added in python 2.6. Is < 2.6 still a support target?

from parsedatetime.

bear commented on August 18, 2024

< 2.6 is not longer supported in the main branch - 2.7+ and 3.0 are it's goals

from parsedatetime.

fake-name commented on August 18, 2024

Ok, cool. The .format() method was introduced in 2.6 anyways, so it's a bit of a moot point.

from parsedatetime.

testMonths fails about parsedatetime HOT 14 CLOSED

Comments (14)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent