danieljdufour / date-extractor
Extract dates from text
License: Apache License 2.0
For speed reasons
When trying to extract dates in upcoming years, I am getting unexpected results. Here are a few failing cases.
from date_extractor import extract_dates
extract_dates('can you find correct date here 2033')
[datetime.datetime(1920, 3, 3, 0, 0)]
extract_dates('can you find correct date here june 2033')
[]
extract_dates('can you find correct date here 2 june 2033')
[datetime.datetime(1920, 6, 2, 0, 0)]
extract_dates('can you find correct date here 12 january 2018')
[datetime.datetime(1920, 1, 12, 0, 0)]
extract_dates('can you find correct date here 1 january 2018')
[datetime.datetime(1920, 1, 1, 0, 0)]
`>>> extract_dates("31.12.1986")[0].strftime('%m/%d/%Y')
'03/01/2012'`
I am finding some inconsistency in the following cases:
In the first case I get ValueError: day is out of range for month:
extract_dates('can you find correct date here 31 april 2017')
Traceback (most recent call last):
File "", line 1, in
File "/home/pranavwaila/anaconda2/lib/python2.7/site-packages/date_extractor/__init__.py", line 192, in extract_dates
completes = [datetime(normalize_year(d['year']),int(d['month']),int(d['day'])) for d in completes]
ValueError: day is out of range for month
whereas when I pass an out-of-range day for December, it is handled:
extract_dates('can you find correct date here 32 december 2017')
[]
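One way to make the two cases behave the same is to skip any candidate that does not form a valid calendar date. The following is a minimal sketch, not the library's code: `build_dates` is a hypothetical helper, the candidate dictionaries only mirror the `year`/`month`/`day` keys visible in the traceback, and the library's `normalize_year` step is left out because its behavior is not shown here.

```python
from datetime import datetime

def build_dates(candidates):
    """Turn year/month/day dicts into datetimes, skipping impossible dates.

    Hypothetical helper: it mirrors the dict keys seen in the traceback
    but is not part of date_extractor.
    """
    dates = []
    for d in candidates:
        try:
            dates.append(datetime(int(d["year"]), int(d["month"]), int(d["day"])))
        except ValueError:
            # e.g. 31 April: datetime() rejects it, so drop the candidate
            continue
    return dates

candidates = [
    {"year": 2017, "month": 4, "day": 31},  # invalid, skipped
    {"year": 2017, "month": 4, "day": 30},  # valid, kept
]
result = build_dates(candidates)
```

With this guard, "31 april 2017" would yield an empty list, the same as "32 december 2017", instead of raising.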
Currently, 2015 is interpreted as year 20, month 1, day 5,
but it should be interpreted as the year 2015.
Changing
p["date"] = (
"(?P<date>"
+ "|".join(
[p["iso"], p["mdy"], p["dmy"], p["ymd"], p["my"], p["y"]]
)
+ ")"
)
to
p["date"] = (
"(?P<date>"
+ "|".join(
[p["iso"], p["y"], p["mdy"], p["dmy"], p["ymd"], p["my"]]
)
+ ")"
)
i.e. putting p["y"] at the start solves this. Please share your thoughts.
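The reason the order matters is that Python's `re` module tries the alternatives of `A|B` from left to right and stops at the first one that matches at a given position. A minimal sketch of that effect, using two simplified patterns that are assumptions for illustration and not the library's real ones:

```python
import re

# Hypothetical, simplified stand-ins for the library's patterns.
PATTERNS = {
    "ymd": r"(?P<ymd>\d{2}\d\d)",   # two-digit year, one-digit month, one-digit day
    "y": r"(?P<y>(?:19|20)\d{2})",  # four-digit year
}

def winning_alternative(order, text):
    """Return the name of the alternative that matched first, or None."""
    combined = re.compile("|".join(PATTERNS[name] for name in order))
    match = combined.search(text)
    if match is None:
        return None
    # Exactly one named group is populated: the alternative that won.
    return next(name for name, value in match.groupdict().items() if value is not None)

print(winning_alternative(["ymd", "y"], "2015"))  # ymd
print(winning_alternative(["y", "ymd"], "2015"))  # y
```

Moving the year-only pattern forward changes which interpretation wins for bare four-digit numbers, so it may also change results for inputs that currently parse as intended through the other patterns.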
Thanks for this great project.
Currently I am able to extract dates, but when the text contains only a year, for example "In year 2011 the incident happened.", the program returns "2011-01-01 00:00:00+00".
We need it to return "2011-01-01 12:14:12+00" instead.
Can you please let me know what I should change in the library to achieve this?
The basic aim is to differentiate an original "1st Jan 2011" from a bare "2011".
Thanks
Running Python 3.7.6 in a Jupyter notebook. The error occurred when importing with "from date_extractor import extract_dates".
Assuming all values in a CSV column are formatted the same way, I should be able to train on a column of dates before detecting dates.
Two new methods:
from date_extractor import detect_format
data = [None, "", "10/31/23", "1/2/23"]
detect_format(data)
"%m/%d/%Y"
from date_extractor import prepare
extract_date = prepare(data)
for date in data:
extract_date(date)
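A rough sketch of how the two proposed methods could work, assuming a fixed list of candidate formats and `datetime.strptime` as the validator. The function names follow the request above, but the candidate list and the behavior are assumptions, not an existing date_extractor API. Note that for two-digit years this sketch reports `%y` rather than `%Y`, because `strptime` requires four digits for `%Y`.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; extend as needed.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%m/%d/%y", "%d/%m/%Y", "%d/%m/%y", "%Y-%m-%d"]

def detect_format(values, candidates=CANDIDATE_FORMATS):
    """Return the first format that parses every non-empty value, else None."""
    samples = [v for v in values if v]  # ignore None and empty strings
    if not samples:
        return None
    for fmt in candidates:
        try:
            for sample in samples:
                datetime.strptime(sample, fmt)
        except ValueError:
            continue  # at least one sample does not fit this format
        return fmt
    return None

def prepare(values):
    """Detect the column format once and return a parser bound to it."""
    fmt = detect_format(values)

    def extract_date(value):
        if not value or fmt is None:
            return None
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            return None

    return extract_date

data = [None, "", "10/31/23", "1/2/23"]
extract_date = prepare(data)
parsed = [extract_date(value) for value in data]
```

Detecting the format once per column avoids re-trying every pattern on every cell, which is where the speed gain would come from.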
Hi,
I think it would be great if the extract_dates function could return the original matched text.
i.e.:
extract_dates('This happened 2020-01-01')
would return the matches and the original date text (2020-01-01)
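To illustrate the idea, here is a minimal sketch that returns the parsed date together with the matched text and its position. It handles only ISO-style `YYYY-MM-DD` dates with a hand-written regex; `extract_dates_with_text` is a hypothetical name and not part of the library.

```python
import re
from datetime import datetime

# Assumed pattern for illustration: ISO-style dates only.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates_with_text(text):
    """Return (datetime, matched_text, (start, end)) for each valid ISO date."""
    results = []
    for match in ISO_DATE.finditer(text):
        year, month, day = (int(part) for part in match.groups())
        try:
            parsed = datetime(year, month, day)
        except ValueError:
            continue  # looks like a date but is not a real calendar day
        results.append((parsed, match.group(0), match.span()))
    return results

found = extract_dates_with_text("This happened 2020-01-01")
```

Keeping the span alongside the text lets a caller highlight or replace the original substring.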
This will clear current model/detected patterns.
I tried the string "R-6/941/KAMDAR ROAD, 01/01/2021", and it takes 941 as the year 1941.
I was processing a bunch of text blobs and the date/time is written like this:
23:49:58 on 11/9/2020.
Would it be hard to add support for a time before the date to date-extractor?
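As a workaround outside the library, a small regex plus `datetime.strptime` can handle this specific layout. This is a sketch under two assumptions: the text always uses the form "HH:MM:SS on M/D/YYYY", and the date is month-first (so 11/9/2020 means 9 November). `extract_time_before_date` is a hypothetical helper.

```python
import re
from datetime import datetime

# Assumed layout: "<HH:MM:SS> on <M/D/YYYY>", month first.
TIME_ON_DATE = re.compile(r"\b(\d{1,2}:\d{2}:\d{2}) on (\d{1,2}/\d{1,2}/\d{4})\b")

def extract_time_before_date(text):
    """Return datetimes for every 'time on date' occurrence in the text."""
    found = []
    for match in TIME_ON_DATE.finditer(text):
        combined = f"{match.group(1)} {match.group(2)}"
        try:
            found.append(datetime.strptime(combined, "%H:%M:%S %m/%d/%Y"))
        except ValueError:
            continue  # e.g. hour 25 or month 13
    return found

stamps = extract_time_before_date("23:49:58 on 11/9/2020.")
```

If the blobs use day-first dates instead, swapping `%m/%d` for `%d/%m` in the format string would be the only change needed.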
Getting this error during import, running Python 3.6.3:
File "myproject\myfile.py", line 6, in <module>
from date_extractor import extract_dates
File "myproject\venv\lib\site-packages\date_extractor\__init__.py", line 6, in <module>
from . import enumerations
File "myproject\venv\lib\site-packages\date_extractor\enumerations.py", line 83, in <module>
lines = f.read().split("\n")
File "myproject\venv\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 129: character maps to <undefined>
Works fine on another machine running 3.5.?. Any clues?
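The traceback shows the file being decoded with cp1252, which is the default text encoding on many Windows setups, and byte 0x81 is undefined in that code page. A likely cause is a data file opened without an explicit encoding. The sketch below shows the general fix of passing `encoding="utf-8"` to `open`; it assumes the data file is UTF-8, and `read_lines` and the sample file are illustrative, not the library's code.

```python
import os
import tempfile

def read_lines(path):
    """Read a text file as UTF-8 regardless of the platform default encoding."""
    with open(path, encoding="utf-8") as f:
        return f.read().split("\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "months.txt")
    # "Á" is encoded in UTF-8 as the bytes C3 81, so it contains byte 0x81.
    with open(path, "w", encoding="utf-8") as f:
        f.write("janvier\nÁprilis")
    lines = read_lines(path)

print(lines)
```

Without the explicit encoding, the same read succeeds on machines whose default is UTF-8 and fails on those defaulting to cp1252, which would be consistent with it working on another machine.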
Hi
Thanks a lot for your great job.
I have some issues regarding numbers: most of the numbers in the text are converted to dates.
For example:
text2="The meeting will be held at paris Allé 6, 0208 paris. Election 30 of a chairperson in france. page 18 of 20"
then we get
[datetime.datetime(2008, 2, 6, 0, 0, tzinfo=<UTC>), datetime.datetime(1930, 1, 1, 0, 0, tzinfo=<UTC>), datetime.datetime(2018, 1, 1, 0, 0, tzinfo=<UTC>), datetime.datetime(1920, 1, 1, 0, 0, tzinfo=<UTC>)]
As you can see, none of the numbers here should be extracted as dates.
Is there any solution?
Thanks
Extract hour, minutes, and seconds.
I needed to parse dates in my project, and sometimes I was getting dates with the month given as its first 4 letters (except for May, which has only 3). Can this be considered as a feature?
So we can avoid 31st of September situations
For the string "5/1/2016 ", the result is "date": "2016-05-01"; the day and month are swapped. Is there a parameter to control this manually? If not, kindly provide a fix.
>>> extract_date("some_text_20140205") or print("Uhoh...")
datetime.datetime(2014, 2, 5, 0, 0, tzinfo=<UTC>)
>>> extract_date("some_text20140205") or print("Uhoh...")
Uhoh...