danieljdufour / date-extractor

Extract dates from text

License: Apache License 2.0

Python 97.85% Shell 2.15%

nlp date datetime python extractor extract-dates arabic kurdish french sorani chinese taiwan parser parse iso time temporal


date-extractor's Issues

Add flag for which langs to scan

For speed reasons

Error in extraction of dates for upcoming years

When trying to extract dates in upcoming years, I get unexpected results. Here are a few failing cases.

from date_extractor import extract_dates

extract_dates('can you find correct date here 2033')
[datetime.datetime(1920, 3, 3, 0, 0)]

extract_dates('can you find correct date here june 2033')
[]

extract_dates('can you find correct date here 2 june 2033')
[datetime.datetime(1920, 6, 2, 0, 0)]

extract_dates('can you find correct date here 12 january 2018')
[datetime.datetime(1920, 1, 12, 0, 0)]

extract_dates('can you find correct date here 1 january 2018')
[datetime.datetime(1920, 1, 1, 0, 0)]

Using a period as a separator changes the date value altogether

>>> extract_dates("31.12.1986")[0].strftime('%m/%d/%Y')
'03/01/2012'

Inconsistency in the following cases

I am seeing inconsistent behavior in the following cases.

In the first case I get ValueError: day is out of range for month:

extract_dates('can you find correct date here 31 april 2017')
Traceback (most recent call last):
File "", line 1, in
File "/home/pranavwaila/anaconda2/lib/python2.7/site-packages/date_extractor/__init__.py", line 192, in extract_dates
completes = [datetime(normalize_year(d['year']),int(d['month']),int(d['day'])) for d in completes]
ValueError: day is out of range for month

whereas when I pass an out-of-range day for December, it is handled:

extract_dates('can you find correct date here 32 december 2017')
[]
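One way to make both cases behave the same is to catch the ValueError that datetime raises for impossible dates and skip that candidate. The sketch below is not the library's code: build_dates and the candidate dictionaries are hypothetical stand-ins for the list comprehension shown in the traceback.

```python
from datetime import datetime


def build_dates(candidates):
    """Convert year/month/day dicts to datetimes, skipping impossible dates.

    Hypothetical helper for illustration; not part of date_extractor.
    """
    dates = []
    for d in candidates:
        try:
            dates.append(datetime(int(d["year"]), int(d["month"]), int(d["day"])))
        except ValueError:
            # e.g. 31 April or 32 December: drop the candidate instead of crashing
            continue
    return dates


candidates = [
    {"year": 2017, "month": 4, "day": 31},   # invalid: April has 30 days
    {"year": 2017, "month": 12, "day": 32},  # invalid: no 32nd day
    {"year": 2017, "month": 4, "day": 30},   # valid
]
valid = build_dates(candidates)
```

With this guard, "31 april 2017" would yield an empty list just like "32 december 2017" does.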

2015 is interpreted incorrectly

Currently, 2015 is interpreted as year 20, month 1, day 5.

It should be interpreted as the year 2015.

Changing

p["date"] = (
"(?P"
+ "|".join(
[p["iso"] , p["mdy"], p["dmy"], p["ymd"], p["my"] , p["y"]]
)
+ ")"
)

to

p["date"] = (
"(?P"
+ "|".join(
[p["iso"] , p["y"], p["mdy"], p["dmy"], p["ymd"], p["my"] ]
)
+ ")"
)

i.e. moving p["y"] ahead of the other patterns (right after p["iso"]) solves this. Please share your thoughts.
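The reason reordering helps is that Python's re module tries alternatives from left to right and stops at the first one that matches, so an earlier, shorter pattern can claim part of a four-digit year. A minimal sketch of that behavior follows; the two patterns and the group name date are simplified assumptions, not the library's actual expressions.

```python
import re

# Simplified stand-ins for the library's patterns (assumptions for illustration).
two_digit_year = r"\d{2}"
four_digit_year = r"\d{4}"


def first_match(alternatives, text):
    """Join alternatives with | and return what the combined pattern matches."""
    pattern = "(?P<date>" + "|".join(alternatives) + ")"
    return re.search(pattern, text).group("date")


# The alternative listed first wins, even if a later one would match more text.
short_first = first_match([two_digit_year, four_digit_year], "2015")
long_first = first_match([four_digit_year, two_digit_year], "2015")
```

Here short_first is "20" and long_first is "2015", which mirrors the effect of moving p["y"] earlier in the list.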

Extract Only Year from text

Thanks for this great project.
Currently I am able to extract dates, but when the text contains only a year, for example "In year 2011 the incident happened.", the program returns "2011-01-01 00:00:00+00".

We need it to return something like "2011-01-01 12:14:12+00" instead.
Can you please let me know what I should change in the library to achieve this?

The basic aim is to differentiate an original "1st Jan 2011" from a bare "2011".

Thanks

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 125: character maps to <undefined>

Running Python 3.7.6 in a Jupyter notebook. The error occurs when importing with "from date_extractor import extract_dates".

Update Build Tools

https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html#summary

detect_format, train on column before extraction

Assuming all values in a CSV column are formatted the same way, I should be able to train on a column of dates before extracting them.

Two new methods:

from date_extractor import detect_format

data = [None, "", "10/31/23", "1/2/23"]

detect_format(data)
"%m/%d/%Y"


from date_extractor import prepare

extract_date = prepare(data)
for date in data:
    extract_date(date)
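As a rough sketch of how such format detection might work, one option is to try a fixed list of candidate strptime formats and return the first one that parses every non-empty value. The name guess_format and the candidate list are assumptions for illustration; this is not the proposed detect_format itself.

```python
from datetime import datetime

# Candidate formats to try, in priority order (an assumed, non-exhaustive list).
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%d/%m/%Y", "%d/%m/%y"]


def guess_format(values):
    """Return the first candidate format that parses all non-empty values."""
    samples = [v for v in values if v]  # skip None and empty strings
    if not samples:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            for sample in samples:
                datetime.strptime(sample, fmt)
        except ValueError:
            continue  # at least one value did not fit this format
        return fmt
    return None


data = [None, "", "10/31/23", "1/2/23"]
detected = guess_format(data)
```

Note that with two-digit years, as in the sample data, this sketch settles on "%m/%d/%y" because "%Y" only accepts four-digit years.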

I think it would be great if the extract_dates function could return the original matched text.
For example,
extract_dates('This happened 2020-01-01')
would return both the parsed date and the original date text (2020-01-01).
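To illustrate the idea, a regex-based extractor can keep the matched substring next to the parsed value by reading it from the match object. The sketch below handles only ISO-style dates; extract_dates_with_text is a hypothetical name, not an existing function in the library.

```python
import re
from datetime import datetime

# Matches only ISO-style YYYY-MM-DD dates (a deliberately narrow assumption).
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_dates_with_text(text):
    """Return (matched_text, datetime) pairs for ISO dates found in text."""
    results = []
    for match in ISO_DATE.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            parsed = datetime(year, month, day)
        except ValueError:
            continue  # skip impossible dates such as 2021-02-30
        results.append((match.group(0), parsed))
    return results


pairs = extract_dates_with_text("This happened 2020-01-01")
```

The match object also exposes match.start() and match.end(), which could be returned if callers need the position of the date in the text.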

add retrain method

This will clear current model/detected patterns.

3-digit number taken as a year

I tried the string "R-6/941/KAMDAR ROAD, 01/01/2021", and it takes 941 as the year 1941.

Add support for timezones

New date formatting

I was processing a bunch of text blobs and the date/time is written like this:
23:49:58 on 11/9/2020.
Would it be hard to add support in date-extractor for a time that comes before the date?
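Until such support exists, text in this exact shape can be handled outside the library with the standard library alone. The sketch below assumes the date is month/day/year, which the sample does not confirm, and extract_time_before_date is a hypothetical helper name.

```python
import re
from datetime import datetime

# "HH:MM:SS on M/D/YYYY"; month-first order is an assumption about the data.
TIME_THEN_DATE = re.compile(r"\d{1,2}:\d{2}:\d{2} on \d{1,2}/\d{1,2}/\d{4}")


def extract_time_before_date(text):
    """Return datetimes for every 'time on date' occurrence in text."""
    found = []
    for match in TIME_THEN_DATE.finditer(text):
        try:
            found.append(datetime.strptime(match.group(0), "%H:%M:%S on %m/%d/%Y"))
        except ValueError:
            continue  # looks like a timestamp but is not a valid one
    return found


stamps = extract_time_before_date("Logged 23:49:58 on 11/9/2020.")
```

If the source data uses day-first dates, swapping "%m/%d" for "%d/%m" in the format string is the only change needed.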

UnicodeDecodeError on Import

Getting this error during import, running Python 3.6.3:

File "myproject\myfile.py", line 6, in <module>
    from date_extractor import extract_dates
  File "myproject\venv\lib\site-packages\date_extractor\__init__.py", line 6, in <module>
    from . import enumerations
  File "myproject\venv\lib\site-packages\date_extractor\enumerations.py", line 83, in <module>
    lines = f.read().split("\n")
  File "myproject\venv\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>

Works fine on another machine running 3.5.?. Any clues?
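The traceback suggests the data file is opened without an explicit encoding, so Python falls back to the platform default, which is cp1252 on many Windows installs. Byte 0x81 has no mapping in cp1252 but appears in UTF-8 encodings of Arabic letters. The sketch below reproduces the failure with a temporary file and shows that passing encoding="utf-8" avoids it; read_lines is a hypothetical helper, not the library's code.

```python
import os
import tempfile


def read_lines(path, encoding):
    """Read a text file with an explicit encoding and split it into lines."""
    with open(path, encoding=encoding) as f:
        return f.read().split("\n")


# The Arabic letter feh encodes to bytes D9 81 in UTF-8, so the file contains 0x81.
sample = "فبراير\nFebruary"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    try:
        read_lines(path, "cp1252")
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True  # same error class as in the traceback above
    lines = read_lines(path, "utf-8")
finally:
    os.remove(path)
```

This would also explain why the import works on machines whose default encoding is UTF-8.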

add unaccented versions of Arabic anglicized months

Issue with numbers

Hi,
Thanks a lot for your great work.
I have an issue with numbers: most numbers in the text are converted to dates.
For example:
for example
text2="The meeting will be held at paris Allé 6, 0208 paris. Election 30 of a chairperson in france. page 18 of 20"

This returns:
[datetime.datetime(2008, 2, 6, 0, 0, tzinfo=), datetime.datetime(1930, 1, 1, 0, 0, tzinfo=), datetime.datetime(2018, 1, 1, 0, 0, tzinfo=), datetime.datetime(1920, 1, 1, 0, 0, tzinfo=)]
As you can see, none of the numbers here should be extracted as dates.
Is there any solution?
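One possible mitigation, sketched here outside the library, is to accept a bare number as a year only when it is exactly four digits, not adjacent to other digits, and within a plausible range. The range 1500 to 2099 and the name find_bare_years are assumptions for illustration.

```python
import re

# Exactly four digits, not touching other digits, in 1500-2099 (assumed range).
BARE_YEAR = re.compile(r"(?<!\d)(?:1[5-9]\d{2}|20\d{2})(?!\d)")


def find_bare_years(text):
    """Return the four-digit substrings that look like plausible years."""
    return BARE_YEAR.findall(text)


text2 = ("The meeting will be held at paris Allé 6, 0208 paris. "
         "Election 30 of a chairperson in france. page 18 of 20")
years_in_text2 = find_bare_years(text2)
years_in_address = find_bare_years("R-6/941/KAMDAR ROAD, 01/01/2021")
```

Under these rules the sample text yields no years at all, and the address string yields only "2021". The trade-off is that two-digit and three-digit years are never recognized.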