scrapinghub / dateparser

python parser for human readable dates

License: BSD 3-Clause "New" or "Revised" License

Python 99.98% Shell 0.02%

hacktoberfest

dateparser's Introduction

Scrapinghub command line client

shub is the Scrapinghub command line client. It allows you to deploy projects or dependencies, schedule spiders, and retrieve scraped data or logs without leaving the command line.

Requirements

Python >= 3.6

Installation

If you have pip installed on your system, you can install shub from the Python Package Index:

pip install shub

Please note that if you are using Python < 3.6, you should pin shub to 2.13.0 or lower.

We also supply stand-alone binaries. You can find them in our latest GitHub release.

Documentation

Documentation is available online via Read the Docs: https://shub.readthedocs.io/, or in the docs directory.


dateparser's Issues

Wrong date parsing when year changes

While scraping a website I encountered this issue:

Dec 14 11:00 is parsed as datetime.datetime(2015, 12, 14, 11, 0), whereas it was supposed to mean datetime.datetime(2014, 12, 14, 11, 0), because it was a post from 2014.

I think there should be a parameter like only_allow_past_dates that disables future date parsing and interprets such a string only as a date that has already passed.
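A minimal sketch of the requested behavior, using only the standard library. The function name resolve_year_in_past is hypothetical and is not part of dateparser; it assumes the month, day, and time have already been parsed and only the year is missing.

```python
from datetime import datetime


def resolve_year_in_past(month, day, hour, minute, now):
    """Pick the most recent year for which the date is not in the future."""
    candidate = datetime(now.year, month, day, hour, minute)
    if candidate > now:
        # The date has not happened yet this year, so assume last year.
        candidate = candidate.replace(year=now.year - 1)
    return candidate


# "Dec 14 11:00" seen while scraping on 1 March 2015.
print(resolve_year_in_past(12, 14, 11, 0, now=datetime(2015, 3, 1)))
# 2014-12-14 11:00:00
```

A real implementation would also need to handle 29 February, which this sketch does not.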

"tomorrow 12:05am" returns None - support for future dates

"tomorrow 12:05am" returns None

Offer a more direct way to parse a date

The most common use-case is by far just getting the date for a given date string, without really caring which language it is in.

Right now, the way to do that has been:

>>> from dateparser.date import DateDataParser
>>> ddp = DateDataParser(allow_redetect_language=True)
>>> ddp.get_date_data(u'24 de Janeiro de 2014')['date_obj']
datetime.datetime(2014, 1, 24, 0, 0)
>>> ddp.get_date_data(u'January 1st 2014')['date_obj']
datetime.datetime(2014, 1, 1, 0, 0)

What I'd like to be able to do:

>>> import dateparser
>>> dateparser.parse_date(u'24 de Janeiro de 2014')
datetime.date(2014, 1, 24)
>>> dateparser.parse_date(u'January 1st 2014')
datetime.date(2014, 1, 1)
>>> dateparser.parse_datetime(u'24 de Janeiro de 2014, 13:23')
datetime.datetime(2014, 1, 24, 13, 23)

What do you think, folks?

Invalid date getting parsed

Hi guys, the string u'Wed, 30 Nov -0001 00:00:00 +0000' is getting parsed to datetime.datetime(2001, 11, 30, 2, 0) which is wrong.

AttributeError: 'tuple' object has no attribute 'year'

Reporting error:
AttributeError: 'tuple' object has no attribute 'year'

Which can be reproduced with:

from dateparser.date import DateDataParser
DateDataParser().get_date_data('22nd July 2012')

Issue is present since merge of https://github.com/scrapinghub/dateparser/pull/50/files

Skipped tests should be converted

During the migration to the declarative languages approach we marked some tests as skipped, because the nature of the code in those places changed almost completely. Now that the behavior of working with languages is largely finalized, those skipped tests should be rewritten to check that the new code still works for the inputs from the old cases. Tests should be designed in the way described here.

File base configuration

The Settings object here must be able to take a settings file on initialization (a path string or an already opened file) and default to data/settings.yaml if none is given. All settings should be moved to this file instead of being class attributes. Here is how it is done for the LanguageDataLoader class.

There may be some additional notes on improving the existing code once the pull request is ready.

"Feb 2011" parsing fails but Jan, Mar-Dec works

>>> from dateutil import parser
>>> parser.parse('Jan 2011', fuzzy=True)
datetime.datetime(2011, 1, 30, 0, 0)
>>> parser.parse('Feb 2011', fuzzy=True)
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/umair/dev/venvs/nathanartz/local/lib/python2.7/site-packages/dateutil/parser.py", line 743, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/umair/dev/venvs/nathanartz/local/lib/python2.7/site-packages/dateutil/parser.py", line 310, in parse
    ret = default.replace(**repl)
ValueError: day is out of range for month
>>> parser.parse('Dec 2011', fuzzy=True)
datetime.datetime(2011, 12, 30, 0, 0)
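The traceback suggests that the missing day is filled in from a default date (here the 30th), which does not exist in February. One possible workaround, sketched with the standard library only, is to clamp the default day to the length of the target month. The helper name fill_missing_day is hypothetical and is not part of dateutil or dateparser.

```python
import calendar
from datetime import datetime


def fill_missing_day(year, month, default):
    """Combine a parsed year/month with a default day, clamping the day
    to the last valid day of the target month."""
    last_day = calendar.monthrange(year, month)[1]
    return default.replace(year=year, month=month,
                           day=min(default.day, last_day))


default = datetime(2015, 1, 30)  # stands in for "today" on the 30th
print(fill_missing_day(2011, 1, default))  # 2011-01-30 00:00:00
print(fill_missing_day(2011, 2, default))  # 2011-02-28 00:00:00
```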

For date parsing, the time component is being cached between calls.

Maybe this is a "feature", but it smells more like a bug. Reporting it just in case, since it has caused some headaches.
Using dateparser 0.3.1

Expected behavior:
The "time" part of the datetime object should be the current time when parsing a value with no time info, like today.

Current behavior:
When parsing today, if you call it again at a later time, the time from the first call is reused. This happens even when calling with a different value like hoy (today in Spanish).
This is a bit surprising, and I haven't tested what would happen when the time rolls over to a different date, but I assume that could be problematic too.

Python 2.7.10 (default, Sep 30 2015, 17:12:08)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import dateparser, time
>>> dateparser.parse('today')
datetime.datetime(2015, 11, 11, 17, 28, 50, 234704)
>>> time.sleep(2)
>>> dateparser.parse('today')
datetime.datetime(2015, 11, 11, 17, 28, 50, 234704)
>>> time.sleep(2)
>>> dateparser.parse('hoy')
datetime.datetime(2015, 11, 11, 17, 28, 50, 234704)

Thanks! 🍻

Inconsistent return types between calendars.

HijriCalendar and JalaliCalendar both implement BaseCalendar, but the return types of their get_date functions differ. This should be fixed by defining an interface for BaseCalendar.

Date strings with `year` in them are not parsing correctly

Dates like '19 February 2013 year 09:10' do not parse correctly.

DateDataParser().get_date_data('19 February 2013 year 09:10')

returns

datetime(2, 1, 8, 10, 30, 38, 116715)

while the correct date should be:

datetime(2013, 2, 19, 9, 10)

Adding support for Vietnamese date parsing

I am working on it, as we at sprinklr have a couple of websites on our critical list that need Vietnamese support.

parsed time zone different for fixed and relative dates

It appears that dateparser.parse converts a fixed date and time to the local time zone and a relative date and time to UTC? I was wondering if it might be possible to add an option to make them both return the same, or to set the time zone in the returned datetime object? I would like to convert the parsed date and time to a Unix time stamp, but I don't know whether the user will enter a fixed or relative date and time. This is with version 0.2.1 on Ubuntu 14.04.2 LTS. Thank you!

Support for particular chinese date format

This is on amazon.cn reviews section: http://www.amazon.cn/Logitech-%E7%BD%97%E6%8A%80-M185%E6%97%A0%E7%BA%BF%E5%85%89%E5%AD%A6%E9%BC%A0%E6%A0%87/product-reviews/B00776T3NA/ref=dpx_acr_txt?showViewpoints=1

In Unicode this is u'2012\u5e742\u670825\u65e5', and I'm not sure what the characters mean, but it looks like you can just remove them and parse the remaining numbers.
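The three characters are 年 (year), 月 (month), and 日 (day). A rough sketch of the "strip the markers and parse the numbers" idea, using only the standard library; the function name parse_cjk_numeric_date is hypothetical and this is not how dateparser itself handles the format.

```python
import re
from datetime import date

# Year, month and day, each followed by its marker character.
CJK_DATE = re.compile(u'(\\d{4})\u5e74(\\d{1,2})\u6708(\\d{1,2})\u65e5')


def parse_cjk_numeric_date(text):
    """Return a date for strings like u'2012\u5e742\u670825\u65e5', else None."""
    match = CJK_DATE.search(text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)


print(parse_cjk_numeric_date(u'2012\u5e742\u670825\u65e5'))  # 2012-02-25
```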

Iranian calendar

Iranian sites often use the Iranian calendar instead of the Gregorian one. We need to add support for this for the Persian language. Leonid pushed some work in commit 5c8611fd52752e8890681fd403f02de633f7f20d in another repository. He also did some research, so feel free to contact him on this matter.

Support for date strings with mixed languages

Old topic: "Date string like 'Marzo 2, 2015 at 8:56 pm' not being parsed"

In [7]: DateDataParser().get_date_data('Marzo 2, 2015 at 8:56 pm')
Out[7]: {'date_obj': None, 'period': 'day'}

Should have returned {'date_obj': datetime(2015, 3, 2, 20, 56), 'period': 'day'}

Special parser to break date string and passing identifiable chunks to relevant sub parsers

As of now, we've mixed time parsing logic in FreshnessDateParser. This is kind of breaking SRP. Ideally, we'd like to have a special parser which would break string into multiple parts, directing them to relevant sub parsers -- eventually consolidating the separate results to return one datetime object.

It's open to suggestions. Above is just a recommendation.

Cannot parse foreign (e.g. Arabic) dates in dateparser

DateParser seems to suffer from the same pains as Arrow in arrow-py/arrow#152... it seems unable to parse Arabic dates:

In [1]: import dateparser

In [2]: from dateparser.date import DateDataParser

In [3]: ddp = DateDataParser()

In [4]: ddp.get_date_data('२०१४-०४-२८')
Out[4]: {'date_obj': None, 'period': 'day'}

In [5]: ddp.get_date_data('۱۳۹۳-۰۲-۱۰')
Out[5]: {'date_obj': None, 'period': 'day'}
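The two strings above use Devanagari and Persian (Extended Arabic-Indic) digits respectively. As a first step, such digits can be normalized to ASCII with the standard library before any parsing is attempted. This is only a sketch: the helper name to_ascii_digits is hypothetical, and it does not convert between calendars (the second string is presumably a Jalali date).

```python
import unicodedata


def to_ascii_digits(text):
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        # unicodedata.decimal returns the default (None) for non-digits.
        value = unicodedata.decimal(ch, None)
        out.append(ch if value is None else str(value))
    return ''.join(out)


print(to_ascii_digits(u'२०१४-०४-२८'))  # 2014-04-28
print(to_ascii_digits(u'۱۳۹۳-۰۲-۱۰'))  # 1393-02-10
```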

Simple arithmetic for the words

We need to be able to transform token sequences like "seven hundred and sixty-five thousand, four hundred and thirty-two" into "765432". Different languages may need different handling of such tokens (for example, Roman numerals involve subtraction). So for now let's focus only on how English tokens are transformed into numbers. Let's call this approach "general" (later we would define which approach each language should use in the languages.yaml file).
The initial idea is to iterate through the list of tokens, skipping tokens that are in skip or match [\W_]+. Each remaining token should be present in the dictionary (the numbers section of the language).

So if the number represented by the current token is less than the previous one, we use addition. If it is greater than several of the nearby preceding numbers, then those smaller numbers describe the bigger one and we use multiplication. Be sure to use multiplication only with those preceding numbers that are 1) less than the current one and 2) directly chained with it.

This approach should of course be properly tested.
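The rule described above can be sketched with a small stack. The NUMBERS and SKIP tables below are hypothetical stand-ins for the numbers and skip sections of a language definition, and words_to_number is an illustrative name, not an existing dateparser function.

```python
import re

# Hypothetical stand-ins for a language's "numbers" and "skip" sections.
NUMBERS = {
    'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5, 'six': 6,
    'seven': 7, 'eight': 8, 'nine': 9, 'ten': 10, 'eleven': 11,
    'twelve': 12, 'twenty': 20, 'thirty': 30, 'forty': 40, 'fifty': 50,
    'sixty': 60, 'seventy': 70, 'eighty': 80, 'ninety': 90,
    'hundred': 100, 'thousand': 1000, 'million': 1000000,
}
SKIP = {'and'}


def words_to_number(text):
    """Convert English number words to an integer."""
    stack = []
    for token in re.split(r'[\W_]+', text.lower()):
        if not token or token in SKIP:
            continue
        value = NUMBERS[token]
        # Smaller numbers directly chained before a bigger one describe it:
        # sum them up and multiply. Otherwise the value is simply added.
        smaller = 0
        while stack and stack[-1] < value:
            smaller += stack.pop()
        stack.append(smaller * value if smaller else value)
    return sum(stack)


print(words_to_number('seven hundred and sixty-five thousand, '
                      'four hundred and thirty-two'))  # 765432
```

The sketch does no validation, so nonsensical sequences still produce a number.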

Dates detection

The current implementation of language detection works like this: "when we ask for the next language, that means the previous language did not work on the given date and should be dropped". We need to change this behavior to modify the working set of languages only if at least one language was applicable. If possible, we should also find a way to make language detection clearer for the reader.
This needs to be done without a significant increase in code complexity.
The current test case is:

In [1]: from dateparser import DateDataParser
In [2]: parser = DateDataParser()
In [3]: parser.get_date_data(u'01-01-15 06:47 AM')
Out[3]: {'date_obj': datetime.datetime(2015, 1, 1, 6, 47), 'period': 'day'}
In [4]: parser.get_date_data(u'foo')
Out[4]: {'date_obj': None, 'period': 'day'}
In [5]: parser.get_date_data(u'01-01-15 06:47 AM')
Out[5]: {'date_obj': None, 'period': 'day'}

Improper parsing of relative date with absolute time of exactly midnight

When parsing a relative date with an absolute time (e.g., 1 week ago at 12 am), the parser ignores the time portion if it is exactly midnight (00:00:00 or the equivalent). If the time portion is one minute later, it works properly. This is with Python 2.7.9 and dateparser 0.3.0. Thanks.

In [1]: import dateparser as dp

In [2]: dp.parse('1 week ago at 12:00 am')
Out[2]: datetime.datetime(2015, 8, 24, 18, 35, 6, 272800)

In [3]: dp.parse('1 week ago at 12:01 am')
Out[3]: datetime.datetime(2015, 8, 24, 0, 1)
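One plausible cause, which is an assumption and not confirmed by this report, is a truthiness check on the parsed time: before Python 3.5, datetime.time(0, 0) evaluated as false. The sketch below compares against None instead; the function name combine is hypothetical.

```python
from datetime import datetime, time


def combine(date_part, time_part):
    """Apply an explicitly parsed time to a date, if one was given."""
    # Compare against None rather than relying on truthiness, so that
    # midnight is treated like any other time.
    if time_part is not None:
        return datetime.combine(date_part.date(), time_part)
    return date_part


relative = datetime(2015, 8, 24, 18, 35, 6)  # result of "1 week ago"
print(combine(relative, time(0, 0)))  # 2015-08-24 00:00:00
print(combine(relative, None))        # 2015-08-24 18:35:06
```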

Should replace date_range with rrule?

>>> start = datetime.strptime('2015-01-07 04:00:00', "%Y-%m-%d %H:%M:%S")
>>> end = datetime.strptime('2015-01-17 04:00:00', "%Y-%m-%d %H:%M:%S")
>>>
>>> for x in dateutil.rrule.rrule(DAILY, dtstart=start, until=end):
...     print x
... 
2015-01-07 04:00:00
2015-01-08 04:00:00
2015-01-09 04:00:00
2015-01-10 04:00:00
2015-01-11 04:00:00
2015-01-12 04:00:00
2015-01-13 04:00:00
2015-01-14 04:00:00
2015-01-15 04:00:00
2015-01-16 04:00:00
2015-01-17 04:00:00
>>>
>>> for x in dateparser.date.date_range(start, end, days=1):
...     print x
... 
2015-01-07 04:00:00
2015-01-08 04:00:00
2015-01-09 04:00:00
2015-01-10 04:00:00
2015-01-11 04:00:00
2015-01-12 04:00:00
2015-01-13 04:00:00
2015-01-14 04:00:00
2015-01-15 04:00:00
2015-01-16 04:00:00

Which is the right implementation?
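The only difference between the two outputs is whether the end point is included. A small sketch that makes the choice explicit, using only the standard library; daily_range and its inclusive flag are hypothetical and not part of dateparser or dateutil.

```python
from datetime import datetime, timedelta


def daily_range(start, end, inclusive=True):
    """Yield one datetime per day from start up to end."""
    current = start
    while current < end or (inclusive and current == end):
        yield current
        current += timedelta(days=1)


start = datetime(2015, 1, 7, 4)
end = datetime(2015, 1, 17, 4)
print(len(list(daily_range(start, end))))                   # 11, like rrule
print(len(list(daily_range(start, end, inclusive=False))))  # 10, like date_range
```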

Support for mixed languages

Hi guys, dateparser isn't detecting dates like Diciembre 23, 2014 at 3:43 am, which is actually a mix between Spanish (Diciembre) and English (at). Which would be the best way to deal with it?

Search text for dates

It would be useful for Portia if it is possible to search text for dates, see scrapinghub/portia#192

Vietnamese month uncertainty (discussion)

We need to come up with a generic solution for this.

The Vietnamese language does not have names for months and simply uses "Month One", "Month Two", etc.
Some sites use the numeric form, like "Month 1", "Month 2", etc. So when we translate tokens from Vietnamese for dates like "1 Year 1 Month 1 Day", it is not quite clear whether it means "1 year 1 month 1 day" or "1 year 1 January day".

Intended PyPI update?

Current PyPI release (0.1.0) is from November 2014. With the latest changes on here, I've found much better accuracy with hours, minutes, and seconds. Do you have an intended date for the next release? This is a really great library.

French `moins de 21s` not getting parsed.

French dates like the one in the subject seem to translate fine, but they are not getting parsed.

>>> parse('moins de 21s')
>>>
>>> language_loader._data['fr'].translate('moins de 21s')
'21 s'
>>> parse('21 s')
datetime.datetime(2015, 7, 13, 9, 19, 43, 484810)

packaging issues

Hi,

I think there are several issues with setup.py:

1. It imports from dateparser to get __version__. This is not good, because if this import fails (e.g. because of a missing dateutil dependency), installation will fail. This means listing dateutil in install_requires won't work if dateutil is not already installed. It is better to either extract __version__ using a regex or even to duplicate it.
2. setup.py tries to use distutils if setuptools is not available, but it uses setuptools-specific options like include_package_data. If include_package_data is needed, then dateparser won't work with distutils. It is not needed, though. I think it is better to either remove the distutils fallback or make sure setup.py works with distutils. I'd also remove include_package_data.
3. setup.py reads install_requires from the requirements.txt file. I think this is the wrong approach: requirements.txt should specify package versions that are known to work, while install_requires should exclude versions that are known not to work. I.e. using foo==1.0 in requirements.txt is good (because it ensures users will get a working build when they follow requirements.txt), but foo==1.0 in install_requires is bad, because it prevents the package from being used with updated versions of a dependency, and it may cause an unintended package downgrade for the end user. The difference is that users can't opt out of install_requires, so we should be careful about what is put there; the less strict install_requires is, the better.
4. Because of (3), the wheel package ends up in install_requires. It is unnecessary: the wheel package is not needed to install Python wheels, it is only needed to create them. The wheel version is pinned, so by installing dateparser users could get their local wheel upgraded or downgraded, and they might need specific wheel versions for other software.
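For the first point, extracting __version__ with a regex could look roughly like this. It is a sketch that assumes the version is assigned as a plain string literal; read_version is a hypothetical helper name, and the file it would read in a real setup.py is an assumption as well.

```python
import re


def read_version(init_source):
    """Extract __version__ from module source text without importing it."""
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]",
                      init_source, re.M)
    if match is None:
        raise RuntimeError('Unable to find __version__ string')
    return match.group(1)


sample = "# -*- coding: utf-8 -*-\n__version__ = '0.3.0'\n"
print(read_version(sample))  # 0.3.0
```

In setup.py the source text would come from reading the package's __init__.py as a file, so no dependency has to be importable at install time.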

We need support for 'Today 01:56 AM'

Currently it returns something like datetime.datetime(2014, 12, 9, 15, 17, 21, 562654). The date is correct, but the time corresponds to import time because of the last line in freshness_date_parser.py. We should consider changing this, perhaps by reinitializing the freshness parser on each call.
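The effect of capturing the clock once at import time can be illustrated with two toy classes. Both classes are hypothetical and do not mirror the actual code in freshness_date_parser.py; they only contrast reading the clock at construction with reading it on every call.

```python
import time
from datetime import datetime


class StaleParser(object):
    """Captures the clock once, when the instance is created."""

    def __init__(self):
        self.now = datetime.now()

    def parse_today(self):
        return self.now


class FreshParser(object):
    """Reads the clock on every call."""

    def parse_today(self):
        return datetime.now()


stale = StaleParser()  # imagine this line at module level
fresh = FreshParser()
first_stale, first_fresh = stale.parse_today(), fresh.parse_today()
time.sleep(0.05)
print(stale.parse_today() == first_stale)  # True: the time is frozen
print(fresh.parse_today() > first_fresh)   # True: the time advances
```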

Add Support for CJK Languages

Hi,

I'm working on adding support for parsing CJK (Chinese, Japanese, Korean) languages and had a few questions.

I have defined a zh_parserinfo class and have been able to get a test case like this to work:

def test_zh_dates(self):
    date = DateParser(language='zh').parse(u'2014年10月4日', date_format='%Y年%m月%d日')   
    self.assertEqual(date.year, 2014)
    self.assertEqual(date.month, 10)
    self.assertEqual(date.day, 4)

However, this is not ideal. Dates are always written year-month-day, so it would be nice if parse handled it by default, and the above test case would pass without specifying this date_format. There are also two cases to support:

# 年 means year, 月 means month, 日 means day
# Dates are always written year first.
date = DateParser(language='zh').parse(u'2014年10月4日')   

# This is the same date, but it is written with Chinese numbers. 
# 二 = 2, 〇 = 0, 一 = 1, 四 =  4, 十 = 10, etc...
date = DateParser(language='zh').parse(u'二〇一四年十月四日')  # formal usage

Where would you suggest putting the code that handles the mapping of Chinese numbers to standard numbers? Any other suggestions for implementing this are appreciated. Thanks!

Edit: I'm not familiar with Arabic, but in #6 it looks like a similar mapping is needed. Something generic that could work for CJK languages, Arabic, and whatever other languages might need this would be best.
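As a starting point for discussion, here is a rough sketch of such a mapping for both cases, using only the standard library. The names chinese_to_int and parse_chinese_date are hypothetical, and the numeral handling only covers digit-by-digit numbers and values below 100 written with 十.

```python
import re
from datetime import date

DIGITS = {u'〇': 0, u'零': 0, u'一': 1, u'二': 2, u'三': 3, u'四': 4,
          u'五': 5, u'六': 6, u'七': 7, u'八': 8, u'九': 9}
TEN = u'十'


def chinese_to_int(text):
    """Convert ASCII digits, a digit-by-digit Chinese number, or a
    Chinese numeral below 100 to an int."""
    if text.isdigit():
        return int(text)
    if TEN in text:
        tens, _, units = text.partition(TEN)
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[units] if units else 0)
    # Positional form such as 二〇一四: read one digit at a time.
    return int(''.join(str(DIGITS[ch]) for ch in text))


def parse_chinese_date(text):
    """Parse year-month-day strings marked with 年, 月 and 日."""
    match = re.match(u'(.+?)年(.+?)月(.+?)日', text)
    if match is None:
        return None
    year, month, day = (chinese_to_int(part) for part in match.groups())
    return date(year, month, day)


print(parse_chinese_date(u'2014年10月4日'))      # 2014-10-04
print(parse_chinese_date(u'二〇一四年十月四日'))  # 2014-10-04
```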

support for words(noon, midnight, etc.) as time

Hi, it would be nice to add support for words as time, especially noon.

Here is an example I got:

ERROR: Unknown date format u'Oct. 26, 2012 at noon' in http://www.realbuzz.com/forums/gear/?&items=30

Thanks.

12 am/pm

According to this wiki page, noon and midnight can be written in different ways. We should check whether we are parsing 12 noon/noon correctly, and also add an option to choose how 12 am/pm should be treated.
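A minimal sketch of what such an option could look like. The function to_24_hour and its twelve_am_is_midnight flag are hypothetical names; the default follows the common convention that 12 am is midnight and 12 pm is noon.

```python
def to_24_hour(hour, meridiem, twelve_am_is_midnight=True):
    """Convert a 12-hour clock reading to a 24-hour hour value."""
    is_am = meridiem.lower() == 'am'
    if hour == 12:
        # The ambiguous case: which of 12 am / 12 pm is midnight?
        if twelve_am_is_midnight:
            return 0 if is_am else 12
        return 12 if is_am else 0
    return hour if is_am else hour + 12


print(to_24_hour(12, 'am'))                               # 0
print(to_24_hour(12, 'pm'))                               # 12
print(to_24_hour(12, 'am', twelve_am_is_midnight=False))  # 12
```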

Add support for timezone info in pytz

This means you don't have to duplicate that work or hardcode that information.

Such as

set(['PMDT',
'BAKT',
'CPT',
'KUYT',
'WAT',
'TKT',
'CHAST',
'NOVST',
'FJST',
'ALMST',
'SHEST',
'SCT',
'PDDT',
'BRST',
'VLAST',
'NPT',
'CVST',
'QYZT',
'PMMT',
'NEST',
'AQTST',
'LHDT',
'VUST',
'MDDT',
'CMT',
'SRET',
'zzz',
'HKST',
'LST',
'CHOST',
'EEST',
'MAGT',
'WGT',
'NFT',
'TJT',
'BEAUT',
'PLMT',
'SMT',
'RET',
'COST',
'FJT',
'BST',
'TASST',
'JST',
'UYST',
'TAST',
'MDT',
'VET',
'CLST',
'HST',
'FMT',
'TBIST',
'ORAST',
'SBT',
'PYST',
'MMT',
'LMT',
'YAKT',
'MART',
'EDT',
'MAWT',
'AHST',
'VOST',
'TOT',
'CAT',
'MADMT',
'VOLST',
'ROTT',
'ISST',
'PGT',
'KGST',
'CHOT',
'YEKST',
'YPT',
'AZOMT',
'FKST',
'FORT',
'NCT',
'PNT',
'WGST',
'ARST',
'KIZT',
'KWAT',
'SAMT',
'FNT',
'AKDT',
'LINT',
'EGT',
'DUST',
'WITA',
'JCST',
'NZDT',
'JWST',
'SHET',
'GBGT',
'PHST',
'UYT',
'HOVT',
'MALST',
'PYT',
'APT',
'PEST',
'WEMT',
'FRUST',
'KST',
'STAT',
'HDT',
'VLAT',
'YST',
'PKT',
'HMT',
'SJMT',
'MADT',
'CET',
'BMT',
'SAKST',
'ChST',
'AFT',
'CST',
'BTT',
'SST',
'AWDT',
'MUT',
'IRST',
'IST',
'SAST',
'SET',
'ORAT',
'RMT',
'AST',
'NUT',
'SWAT',
'ECT',
'AQTT',
'YERT',
'TLT',
'PDT',
'TOST',
'IMT',
'HAST',
'NOVT',
'YWT',
'AKST',
'GYT',
'CEST',
'BEAT',
'TBIT',
'WART',
'CWT',
'NEGT',
'TFT',
'FKT',
'PHOT',
'IHST',
'BDST',
'DDUT',
'EASST',
'NRT',
'URAT',
'BAKST',
'CKT',
'FRUT',
'MUST',
'AWT',
'PKST',
'AMST',
'SDMT',
'AHDT',
'BOST',
'BNT',
'WET',
'ADMT',
'NZST',
'ANAT',
'ADDT',
'CEMT',
'CANT',
'ALMT',
'CKHST',
'PHT',
'SVET',
'EMT',
'DMT',
'LHST',
'AZST',
'TRST',
'SAMST',
'GET',
'MALT',
'MHT',
'ASHST',
'MOT',
'ANT',
'TSAT',
'TBMT',
'GEST',
'PST',
'DAVT',
'TMT',
'COT',
'PET',
'AZOST',
'TAHT',
'VUT',
'KMT',
'IRKT',
'CAST',
'MAGST',
'KDT',
'GALT',
'OMST',
'KIZST',
'SRT',
'KOST',
'NDT',
'NMT',
'CDT',
'SAKT',
'DUSST',
'FNST',
'CVT',
'WAST',
'PPT',
'CGST',
'NST',
'UTC',
'MEST',
'VOLT',
'ACWST',
'CHADT',
'ULAT',
'IDDT',
'SDT',
'PWT',
'ART',
'HOVST',
'ULAST',
'MADST',
'GST',
'EPT',
'BORT',
'BOT',
'OMSST',
'XJT',
'URAST',
'PPMT',
'AWST',
'YERST',
'UYHST',
'IOT',
'MYT',
'HKT',
'SVEST',
'YDT',
'PMST',
'CAWT',
'WSDT',
'WMT',
'ACWDT',
'KRAT',
'ACDT',
'UZST',
'AKTT',
'IRKST',
'MDST',
'MWT',
'EET',
'BURT',
'EST',
'JDT',
'LKT',
'NWT',
'WSST',
'JMT',
'EGST',
'CDDT',
'AMT',
'CHDT',
'CAPT',
'BDT',
'MIST',
'TRT',
'EWT',
'BORTST',
'YDDT',
'MPT',
'LRT',
'HADT',
'GAMT',
'KUYST',
'IDT',
'IRDT',
'AEDT',
'YAKST',
'ACT',
'NET',
'PMT',
'NZMT',
'QMT',
'ANAST',
'YEKT',
'NDDT',
'EAST',
'CGT',
'EDDT',
'ADT',
'CUT',
'FET',
'GHST',
'SYOT',
'GMT',
'EHDT',
'WIB',
'BRT',
'QYZST',
'MET',
'WIT',
'AKTST',
'KRAST',
'KART',
'MST',
'MSM',
'AEST',
'MSK',
'GFT',
'MVT',
'MSD',
'AZT',
'ACST',
'SGT',
'CLT',
'PETT',
'UZT',
'DACT',
'EAT',
'FFMT',
'PETST',
'WARST',
'MOST',
'AZOT',
'ICT',
'KGT',
'NCST',
'WEST',
'JAVT',
'ASHT'])

Please at least indicate python 3 compatibility.

I am looking for a Python 3 date parsing library, and I actually wound up trying to install this library before realizing it's only Py2k compatible.

If you aren't planning on supporting python 3, can you please at least note that (hopefully rather prominently) somewhere in the readme?

Dates "08/17/14 17:00 PM (PDT)" not parsed.

Timezones recognition should be more generic

Provide a way to check supported languages

Suggested on #6

The idea is to have something like dateparser.languages that would allow one to check support for a given language.

Right now, there is language-specific code in both date_parser.py and freshness_date_parser.py.
Any thoughts on how we should do that?

Adding support for parsing Thai dates

Hey @asadurski -- I know you're working on support for Thai on branch feature-thai-support, just creating this issue to track it.

Upper limit for years and months

It looks like the upper limit for years is 19 and for months it's 12. It's quite common to have mentions like "25 years ago", "50 years ago" or "24 months ago" on webpages, and dateparser returns None for them :P .

ddp.get_date_data('19 years ago')
{'date_obj': datetime.datetime(1995, 11, 25, 6, 17, 17, 980574), 'period': u'years'}

ddp.get_date_data('20 years ago')
{'date_obj': None, 'period': 'day'}

ddp.get_date_data('12 months ago')
{'date_obj': datetime.datetime(2013, 11, 25, 6, 17, 17, 980574), 'period': u'months'}

ddp.get_date_data('13 months ago')
{'date_obj': None, 'period': 'day'}

This is quite interesting. I can get past the 19 years barrier with these queries:

ddp.get_date_data('19 years 12 months ago')
{'date_obj': datetime.datetime(1994, 11, 25, 6, 17, 17, 980574), 'period': u'months'}

ddp.get_date_data('19 years 12 months 1000 weeks ago')
{'date_obj': datetime.datetime(1975, 9, 26, 6, 17, 17, 980574), 'period': u'weeks'}

But then it's quite rare to have text like the above two examples on webpages.

Day/month/year order for different locales

We should specify which day/month/year order is preferable for different languages.
We can use this information as a starting point and define date_order = DMY for each language in our file (don't forget to add specific validation for this). Then, based on this info, we should pass the proper arguments to the dateutil parser.
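The dateutil parser accepts dayfirst and yearfirst keyword arguments, so a per-language date_order could be translated roughly as follows. The DATE_ORDER table, its example values, and the function name dateutil_kwargs are all hypothetical; the real setting would live in the languages file.

```python
# Hypothetical per-language setting; the codes and values are examples only.
DATE_ORDER = {'en': 'MDY', 'fr': 'DMY', 'zh': 'YMD'}
VALID_ORDERS = {'DMY', 'MDY', 'YMD', 'YDM'}


def dateutil_kwargs(language, default='MDY'):
    """Translate a language's date_order into dateutil parser arguments."""
    order = DATE_ORDER.get(language, default)
    if order not in VALID_ORDERS:
        raise ValueError('Unsupported date_order: %r' % order)
    return {
        'dayfirst': order in ('DMY', 'YDM'),
        'yearfirst': order.startswith('Y'),
    }


print(dateutil_kwargs('fr'))  # {'dayfirst': True, 'yearfirst': False}
```

The resulting dictionary could then be passed on as keyword arguments, for example parser.parse(date_string, **dateutil_kwargs('fr')).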

Better period extracting from dateutil parser

As we now have information about exactly which date units (year, month, hour) were parsed by the dateutil parser, we can guess the date period with more precision. For example, if any of the day, hour, minute, second, or microsecond units were parsed, then it is a day period; otherwise, if the month was parsed, the period is month; and finally the period is year if only the year unit was parsed. I am not sure whether a week period can apply to dates passed to the dateutil parser.

This period should be passed all the way back to call from _DateLanguageParser.
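The rule above can be written down directly. This is a sketch: guess_period is a hypothetical name, it assumes the parsed units arrive as a collection of unit names, and the fallback to 'day' when nothing was parsed is an assumption.

```python
DAY_UNITS = ('day', 'hour', 'minute', 'second', 'microsecond')


def guess_period(parsed_units):
    """Guess the period from the names of the units that were parsed."""
    units = set(parsed_units)
    if units.intersection(DAY_UNITS):
        return 'day'
    if 'month' in units:
        return 'month'
    if 'year' in units:
        return 'year'
    return 'day'  # assumption: fall back to 'day' when nothing was parsed


print(guess_period(['year', 'month', 'day']))  # day
print(guess_period(['year', 'month']))         # month
print(guess_period(['year']))                  # year
```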

Extend language redetection to every subparser used.

For now, language detection works only with the subparser that extends dateutil behavior. We should move it one level up to the main parser, so we can use the same detected language for all approaches, including the freshness subparser and formats. This way we can set default formats specific to some languages.

'wheel' dependency

Hi,

requirements.txt states that the project depends on the wheel library, but there are no imports in the source that use that library. I think that means this library is not actually required for dateparser to work.

Perhaps it could be removed from requirements.txt?

Missing date parts

Sometimes dates don't carry all the information needed to get the exact date, like "December 21" or "Friday". We can assume either the current year and week, or the latest one in the past.

To achieve that, instead of calling the parse method in the dateutil_parse function, we would need to call the _parse method of the dateutil parser. Then, when we have information on which parts were parsed (and depending on configuration), we either choose the date that falls in the current week/month/year or the last one seen.

Dateparser returning timezone aware datetime depending on existence of date_format arg (with any value)

Here are some self explanatory examples.

In [22]: ddp.get_date_data('2014-10-09T17:57:39+00:00')['date_obj']
Out[22]: datetime.datetime(2014, 10, 9, 17, 57, 39)

In [23]: ddp.get_date_data('2014-10-09T17:57:39+00:00', '')['date_obj']
Out[23]: datetime.datetime(2014, 10, 9, 17, 57, 39)

In [24]: ddp.get_date_data('2014-10-09T17:57:39+00:00', '%Y')['date_obj']
Out[24]: datetime.datetime(2014, 10, 9, 17, 57, 39, tzinfo=tzutc())

Incorrect Portuguese translation of the 'second' keyword in languages.yaml

dateparser v0.3.0 on Ubuntu 14.04

The languages.yaml file has an incorrect English-to-Portuguese translation of 'second'.
'segunda' is the feminine ordinal form of second (a position in an order), whereas 'segundo' is the desired term for the unit of time. The bug manifests itself as follows:

>>> parse(u'1 segundo atrás')
>>>
>>> parse(u'1 segunda atrás')
datetime.datetime(2015, 7, 13, 9, 19, 43, 484810)

Configuration

It looks like we need to parametrize parsing behavior. Instead of continuing to add parameters to the parse function, I suggest creating a Registry that can be changed with a configure(key=value) function.
Example settings could be NO_DATES_FROM_FUTURE (for parsing dates from the web, where pieces of the creation date are sometimes missing and we assume the past) or SUPPORT_BEFORE_COMMON_ERA (to use a custom datetime class that inherits from datetime but supports astronomical year numbering).
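A minimal sketch of such a registry, kept at module level. Only the two setting names come from the proposal above; get_setting, the default values, and the rejection of unknown keys are assumptions made for illustration.

```python
# Default values are assumptions; only the setting names come from the proposal.
_DEFAULTS = {
    'NO_DATES_FROM_FUTURE': False,
    'SUPPORT_BEFORE_COMMON_ERA': False,
}
_settings = dict(_DEFAULTS)


def configure(**options):
    """Override settings, rejecting keys that are not known."""
    unknown = set(options) - set(_DEFAULTS)
    if unknown:
        raise KeyError('Unknown settings: %s' % ', '.join(sorted(unknown)))
    _settings.update(options)


def get_setting(key):
    return _settings[key]


configure(NO_DATES_FROM_FUTURE=True)
print(get_setting('NO_DATES_FROM_FUTURE'))       # True
print(get_setting('SUPPORT_BEFORE_COMMON_ERA'))  # False
```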

Please add support for default UNIX "date" command format

Most Unix/Linux flavors use %a %b %e %T %Z %Y as the default date format. However dateparser does not support that format.

> date
Tue Oct 13 20:18:56 CDT 2015

> python
>>> import dateparser
>>> type(dateparser.parse('Tue Oct 13 20:18:56 CDT 2015'))
<class 'NoneType'>
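Until the format is supported, one workaround is to set the timezone abbreviation aside and parse the rest with strptime from the standard library. This is a sketch: parse_unix_date is a hypothetical helper, it assumes an English locale for the day and month names, and it returns the abbreviation as-is without resolving it to an offset.

```python
from datetime import datetime


def parse_unix_date(text):
    """Parse `date` output such as 'Tue Oct 13 20:18:56 CDT 2015'.

    Returns the naive datetime and the timezone abbreviation.
    """
    weekday, month, day, clock, zone, year = text.split()
    naive = datetime.strptime(' '.join([weekday, month, day, clock, year]),
                              '%a %b %d %H:%M:%S %Y')
    return naive, zone


print(parse_unix_date('Tue Oct 13 20:18:56 CDT 2015'))
# (datetime.datetime(2015, 10, 13, 20, 18, 56), 'CDT')
```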

"in 5 min" returns None

I just downloaded this module and it's fantastic. Thank you for it.

I noticed that "5 min ago" works. "in 5 min" returns None.

Uniform unit tests

It seems that we have plenty of methodologies used in unit tests.
Some of them are:

    date = DateParser(language='cz').parse('pon 16. čer 2014 10:07:43')
    self.assertEqual(date.year, 2014)
    self.assertEqual(date.month, 6)
    self.assertEqual(date.day, 16)
    self.assertEqual(date.hour, 10)
    self.assertEqual(date.minute, 7)
    self.assertEqual(date.second, 43)

    parser = DateParser()
    date_fixtures = [
        ('13 iunie 2013', datetime(2013, 6, 13)),
        ('14 aprilie 2014', datetime(2014, 4, 14)),
        ('18 martie 2012', datetime(2012, 3, 18)),
    ]

    for dt_string, correct_date in date_fixtures:
        parsed = parser.parse(dt_string)
        self.assertEqual(correct_date.date(), parsed.date())

    @parameterized.expand([
        param('Sep 03 2014 | 4:32 pm EDT', datetime(2014, 9, 3, 21, 32)),
        param('17th October, 2034 @ 01:08 am PDT', datetime(2034, 10, 17, 9, 8)),
        param('15 May 2004 23:24 EDT', datetime(2004, 5, 16, 4, 24)),
        param('15 May 2004', datetime(2004, 5, 15, 0, 0)),
        param('Nov 25 2014 10:17 pm EST', datetime(2014, 11, 26, 4, 17)),
    ])

    date = DateParser(language='pl').parse('Środa, 26 listopada 2014 10:11:12')
    self.assertEqual(date.timetuple()[:6], (2014, 11, 26, 10, 11, 12))

Maybe we could create simple method/function for asserting correct date? Eg.

    date = DateParser(language='en').parse('Tue, 25 Dec, 2012 12:00')
    self.assertDate(date, 2012, 12, 25, 12, 0)

Depending on the number of parameters given to assertDate, we check the date at the corresponding resolution.
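One way the proposed helper could be written is as a mixin for test cases. The class names DateAssertionsMixin and ExampleTest are hypothetical, and the sketch compares only as many leading fields (year, month, day, hour, minute, second) as are passed in.

```python
import unittest
from datetime import datetime


class DateAssertionsMixin(object):
    def assertDate(self, date, *expected):
        """Compare as many leading date fields as were passed in."""
        self.assertIsNotNone(date)
        self.assertEqual(date.timetuple()[:len(expected)], expected)


class ExampleTest(DateAssertionsMixin, unittest.TestCase):
    def test_minute_resolution(self):
        # Seconds are not passed, so they are not compared.
        self.assertDate(datetime(2012, 12, 25, 12, 0, 59), 2012, 12, 25, 12, 0)
```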

Stop parsing invalid dates

We should not really parse dates like this

>>> from dateparser import parse
>>> parse("2015-03-17T16:37:51+00:002015-03-17T15:24:37+00:002015-03-17T15:02:08+00:002015-03-17T13:09:31+00:002015-03-17T11:34:21+00:002015-03-16T17:49:15+00:002015-03-16T17:33:30+00:002015-03-16T16:49:46+00:002015-03-16T15:50:57+00:002015-03-16T13:26:50+00:00 ")
datetime.datetime(2015, 3, 17, 13, 26, 50)