wklumpen / gtfs-lite
Lightweight GTFS Analysis
License: MIT License
Allow the user to load only certain portions of the feed for ease of analysis.
We will have to validate that required files are loaded, and issue a warning if they are not.
There's an errant print statement in the load_zip() function that needs removing.
The function trips_at_stops(stop_ids, date, start_time=datetime.time(0, 0), end_time=datetime.time(23, 59)) doesn't actually do anything with the time slices: providing different start_time and end_time values has no effect on the result.
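If the time filtering were implemented, it might look something like this minimal sketch (the helper name and the string-comparison approach are assumptions, not the library's actual code):

```python
import datetime

import pandas as pd

def filter_by_time_window(stop_times, start_time, end_time):
    # Hypothetical helper: keep only rows whose departure falls within
    # [start_time, end_time]. Assumes departure_time is a zero-padded
    # HH:MM:SS string as in the GTFS spec, so lexicographic comparison
    # matches chronological order (for times below 24:00:00).
    start = start_time.strftime("%H:%M:%S")
    end = end_time.strftime("%H:%M:%S")
    in_window = stop_times["departure_time"].between(start, end)
    return stop_times[in_window]

stop_times = pd.DataFrame({
    "trip_id": ["a", "b", "c"],
    "departure_time": ["06:15:00", "12:30:00", "23:45:00"],
})
morning = filter_by_time_window(
    stop_times, datetime.time(6, 0), datetime.time(9, 0)
)
```

With the default full-day window every row should survive, which would make a natural regression test for this bug.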
Right now there are a few older functions which rely on geometric/spatial analysis. In order to keep GTFS-lite as lightweight as possible, these should be removed, unless there is demand for them later via a feature or pull request.
The idea here is that if you want to make a specific route_id (or set of route_ids) disappear, you can do so programmatically.
In order to do some specific date-and-time analysis such as creating a frequency grid (See #21), we need to effectively have our stop times be date-aware instead of just straight up-and-down times.
To solve this problem we need to differentiate between a "date-aware" GTFS object (which would effectively only include a subset of the entire schedule feed, filtered via the calendar and calendar_dates dataframes) and a "date-naive" or basic GTFS object.
I'm imagining that we would have something like a set_date function which allows the user to pass the analysis date in question and which is persisted in some way throughout the analysis. We could go about this in a couple of different ways:
Option 1: We apply the filtering every time we do something. This means effectively creating datetime columns for the arrival_time and departure_time columns in the stop_times frame and updating those columns whenever the date changes. We can then go about the analyses assuming those columns are up to date.
Option 2: We could have a person create a "copy" of the GTFS object which holds a subset of the data. This would be useful if we are planning to have functions that work in both a global context and a subset context (i.e. total trips or total service hours).
My opinion is that we should go the route of option 1, creating extra columns attached to the stop_times frame that update whenever a new date is set.
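A minimal sketch of option 1 (the function name and the "_dt" column suffix are assumptions, not the library's actual API):

```python
import datetime

import pandas as pd

def set_date(stop_times, date):
    # Attach full datetime columns to stop_times for the given analysis
    # date. GTFS times can exceed 24:00:00, so parse them as timedeltas
    # and add them to midnight of the analysis date; times past midnight
    # then correctly roll over to the next calendar day.
    midnight = pd.Timestamp(date)
    for col in ("arrival_time", "departure_time"):
        stop_times[col + "_dt"] = midnight + pd.to_timedelta(stop_times[col])
    return stop_times

stop_times = pd.DataFrame({
    "trip_id": ["a", "a"],
    "arrival_time": ["23:55:00", "24:10:00"],
    "departure_time": ["23:55:00", "24:10:00"],
})
set_date(stop_times, datetime.date(2023, 5, 15))
```

Calling set_date again with a new date would simply overwrite the derived columns, which is the "update whenever the date changes" behaviour described above.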
We need feeds that are useful for writing tests. We will want to generate a few GTFS feeds from scratch, or adapt simple agency feeds to suit our needs. Specifically, we need to test the basics, but also frequencies.txt.
The load_zip method throws a UnicodeDecodeError when loading certain GTFS feed packages, like this one:
f-9q9-bart.zip
The suggested fix is to allow the user to specify an encoding on load, otherwise defaulting to UTF-8. We may also want to consider adding an option for the user to specify a behaviour on decoding errors.
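A rough sketch of what that might look like (the function and parameter names here are assumptions, not gtfs-lite's actual API):

```python
import io
import zipfile

import pandas as pd

def load_feed_tables(filepath, encoding="utf-8", encoding_errors="strict"):
    # Decode each file with a caller-chosen encoding, defaulting to
    # UTF-8 with strict error handling, then hand the text to pandas.
    tables = {}
    with zipfile.ZipFile(filepath) as zip_file:
        for name in zip_file.namelist():
            if not name.endswith(".txt"):
                continue
            text = zip_file.read(name).decode(encoding, errors=encoding_errors)
            tables[name] = pd.read_csv(io.StringIO(text), skipinitialspace=True)
    return tables

# A toy feed with a non-UTF-8 (Latin-1) file:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("agency.txt", "agency_name\nCafé Métro\n".encode("latin-1"))
agencies = load_feed_tables(buf, encoding="latin-1")
```

With the defaults, the same archive raises UnicodeDecodeError, matching the reported behaviour; passing encoding_errors="replace" would be the lenient alternative.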
Often providers will throw all kinds of weird columns into their feeds. While it might be useful to have these in some instances, it can also lead to various loading warnings (mixed types) and slow the process down.
The proposed feature is to provide an option on load_zip() called enforce_spec (defaulting to False) which only loads columns specified in the official GTFS specification, using the usecols option from Pandas.
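The usecols mechanism can be shown in a few lines (the column set here is a small illustrative subset of the real trips.txt spec, not the full list):

```python
import io

import pandas as pd

# A subset of the spec-defined columns for trips.txt, for illustration.
TRIPS_SPEC_COLUMNS = {
    "route_id", "service_id", "trip_id", "trip_headsign",
    "direction_id", "block_id", "shape_id",
}

# A feed with an agency-specific extra column:
csv = "route_id,trip_id,service_id,agency_custom_field\nR1,T1,S1,whatever\n"

# The callable form of usecols drops any column not in the spec set,
# so the extra column is never parsed at all.
trips = pd.read_csv(io.StringIO(csv), usecols=lambda c: c in TRIPS_SPEC_COLUMNS)
```

Because unwanted columns are skipped at parse time rather than dropped afterwards, this also avoids the mixed-type warnings mentioned above.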
Hi! so I'm trying to read in a GTFS.zip as follows:
gtfs = GTFS.load_zip("SRTA GTFS-2020-06-29.zip")
but I'm getting the following error
File "/home/ja/miniconda3/envs/TC/lib/python3.8/site-packages/gtfslite/gtfs.py", line 101, in load_zip
trips = pd.read_csv(
File "/home/ja/miniconda3/envs/TC/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/ja/miniconda3/envs/TC/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/ja/miniconda3/envs/TC/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/ja/miniconda3/envs/TC/lib/python3.8/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 952, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1084, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1115, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1208, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Integer column has NA values in column 5
I think this pertains to the dtype specification 'direction_id': 'Int64'
Maybe the solution is to use float instead of int? https://stackoverflow.com/questions/21287624/convert-pandas-column-containing-nans-to-dtype-int
cheers,
When reading in a GTFS feed that uses the optional frequencies.txt file, there seems to be a date-parsing error:
>>> import gtfslite.gtfs
>>> test = gtfslite.gtfs.GTFS.load_zip('data/20230504_070233_Euskadi_Bizkaibus.zip')
C:\Users\carlh\miniconda3\lib\site-packages\gtfslite\gtfs.py:348: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
frequencies = pd.read_csv(
C:\Users\carlh\miniconda3\lib\site-packages\gtfslite\gtfs.py:348: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
frequencies = pd.read_csv(
The data still reads in more or less fine, although it plugs in today's date for the start_time field:
>>> test.frequencies
trip_id start_time end_time headway_secs exact_times
0 trp_A0651_737_OP43FIN 2023-05-15 08:35:00 22:35:00 3600 1
1 trp_A0651_737_OP43LIN 2023-05-15 07:35:00 22:05:00 1800 1
2 trp_A0651_737_OP43SIN 2023-05-15 07:35:00 22:35:00 3600 1
3 trp_A0651_738_OP43FIN 2023-05-15 07:35:00 21:35:00 3600 1
4 trp_A0651_738_OP43SIN 2023-05-15 06:35:00 21:35:00 3600 1
.. ... ... ... ... ...
532 trp_A3932_1053_OP41VIN 2023-05-15 05:30:00 22:30:00 3600 1
533 trp_A3932_1054_OP41FIN 2023-05-15 07:00:00 23:00:00 3600 1
534 trp_A3932_1054_OP41LJIN 2023-05-15 07:00:00 23:00:00 3600 1
535 trp_A3932_1054_OP41SIN 2023-05-15 07:00:00 23:00:00 3600 1
536 trp_A3932_1054_OP41VIN 2023-05-15 07:00:00 23:00:00 3600 1
[537 rows x 5 columns]
Presumably because of the error, end_time remains a string object:
>>> test.frequencies.dtypes
trip_id object
start_time datetime64[ns]
end_time object
headway_secs int32
exact_times int32
dtype: object
But things still seem to work, so it doesn't necessarily impact analysis:
>>> test.frequencies.loc[test.frequencies.end_time > '23:59:00']
trip_id start_time end_time headway_secs exact_times
119 trp_A3136_668_OP40LJIN2 2023-05-15 21:30:00 24:30:00 1800 1
120 trp_A3136_668_OP40SIN 2023-05-15 21:30:00 24:30:00 1800 1
121 trp_A3136_668_OP40VIN2 2023-05-15 21:30:00 24:30:00 1800 1
324 trp_A3516_872_OP42SPPV 2023-05-15 09:30:00 25:30:00 3600 1
I believe this may relate to times in GTFS feeds (HH:MM:SS) potentially going over 24 hours, with the result that times of 24:00:00 and later are not valid datetime objects.
Currently, times are explicitly parsed as dates when frequencies.txt is read in:
Lines 343 to 354 in 0bbdd2e
However, I think if they were treated as timedeltas this error would be avoided:
frequencies = self.read_clean_feed(
zip_file.open(filepaths["frequencies.txt"]),
dtype={
"trip_id": str,
"start_time": str,
"end_time": str,
"headway_secs": int,
"exact_times": int,
},
skipinitialspace=True,
)
frequencies["start_time"] = pd.to_timedelta(frequencies["start_time"])
frequencies["end_time"] = pd.to_timedelta(frequencies["end_time"])
Using this code means the error isn't raised, both start_time and end_time are parsed consistently, and things still work, without the awkward guesswork of plugging in today's date (which most often wouldn't be the right day, technically); relative time seems more appropriate:
>>> import gtfslite.gtfs
>>> test = gtfslite.gtfs.GTFS.load_zip('data/20230504_070233_Euskadi_Bizkaibus.zip')
>>> test.frequencies
trip_id start_time end_time headway_secs exact_times
0 trp_A0651_737_OP43FIN 0 days 08:35:00 0 days 22:35:00 3600 1
1 trp_A0651_737_OP43LIN 0 days 07:35:00 0 days 22:05:00 1800 1
2 trp_A0651_737_OP43SIN 0 days 07:35:00 0 days 22:35:00 3600 1
3 trp_A0651_738_OP43FIN 0 days 07:35:00 0 days 21:35:00 3600 1
4 trp_A0651_738_OP43SIN 0 days 06:35:00 0 days 21:35:00 3600 1
.. ... ... ... ... ...
532 trp_A3932_1053_OP41VIN 0 days 05:30:00 0 days 22:30:00 3600 1
533 trp_A3932_1054_OP41FIN 0 days 07:00:00 0 days 23:00:00 3600 1
534 trp_A3932_1054_OP41LJIN 0 days 07:00:00 0 days 23:00:00 3600 1
535 trp_A3932_1054_OP41SIN 0 days 07:00:00 0 days 23:00:00 3600 1
536 trp_A3932_1054_OP41VIN 0 days 07:00:00 0 days 23:00:00 3600 1
[537 rows x 5 columns]
>>> test.frequencies.loc[test.frequencies.end_time > '23:59:00']
trip_id start_time end_time headway_secs exact_times
119 trp_A3136_668_OP40LJIN2 0 days 21:30:00 1 days 00:30:00 1800 1
120 trp_A3136_668_OP40SIN 0 days 21:30:00 1 days 00:30:00 1800 1
121 trp_A3136_668_OP40VIN2 0 days 21:30:00 1 days 00:30:00 1800 1
324 trp_A3516_872_OP42SPPV 0 days 09:30:00 1 days 01:30:00 3600 1
>>> test.frequencies.dtypes
trip_id object
start_time timedelta64[ns]
end_time timedelta64[ns]
headway_secs int32
exact_times int32
dtype: object
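The premise that timedeltas handle over-24-hour GTFS times gracefully can be checked directly:

```python
import pandas as pd

# GTFS times past midnight (24:00:00 and later) are valid durations,
# so pd.to_timedelta parses them and orders them correctly, where
# pd.to_datetime would fail or fall back to guessing.
late = pd.to_timedelta(["23:59:00", "24:30:00", "25:30:00"])
```

The comparison in the session above (end_time > '23:59:00') keeps working because pandas coerces the string to a timedelta before comparing.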
As per https://github.com/global-healthy-liveable-cities/global-indicators/issues/338, when loading a feed with frequencies.txt, if the optional field exact_times has not been completed, this results in a ValueError exception, like:
home/ghsci/process/data/transit_feeds/test_gtfs/20230329_130123_Metro_Sevilla
Traceback (most recent call last):
File "/home/ghsci/process/subprocesses/_10_gtfs_analysis.py", line 291, in <module>
main()
File "/home/ghsci/process/subprocesses/_10_gtfs_analysis.py", line 287, in main
gtfs_analysis(codename)
File "/home/ghsci/process/subprocesses/_10_gtfs_analysis.py", line 78, in gtfs_analysis
loaded_feeds = gtfslite.GTFS.load_zip(f'{gtfsfeed_path}.zip')
File "/env/lib/python3.10/site-packages/gtfslite/gtfs.py", line 348, in load_zip
frequencies = pd.read_csv(
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
File "/env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 812, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 889, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1034, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1192, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Integer column has NA values in column 4
The reason is that, in gtfs.py, this field is specified to be interpreted as int:
Line 371 in e94cfa5
but according to the spec, exact_times is optional, so it should be "Int64", like:
"exact_times": "Int64",
I see that this dtype "Int64" has been used elsewhere in gtfs.py when reading in other files, so I have tested out this change on my fork of gtfslite and confirmed it resolves the issue when parsing this file.
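For reference, the difference between the two dtypes can be demonstrated standalone (this is a plain pandas illustration, not the gtfs.py code itself):

```python
import io

import pandas as pd

# frequencies.txt with a missing optional exact_times value:
csv = "trip_id,exact_times\nt1,1\nt2,\n"

# Nullable "Int64" accepts the missing value as pd.NA...
freq = pd.read_csv(io.StringIO(csv), dtype={"exact_times": "Int64"})

# ...while plain int raises "ValueError: Integer column has NA values".
try:
    pd.read_csv(io.StringIO(csv), dtype={"exact_times": int})
    raised = False
except ValueError:
    raised = True
```

Using "Int64" also avoids the float workaround mentioned in the earlier direction_id report, keeping the column genuinely integer-typed.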
In case it's useful, I'll lodge a pull request with this change to address this issue in a tic!
Seems like there is something stopping the official Swedish GTFS files from working with 0.2.0.
It works fine with 0.1.8 (and pandas 1.5.3), but after upgrading to gtfs-lite 0.2.0 no trips are found anymore.
E.g. the functions date_trips() and trips_at_stops() return empty datasets with 0.2.0, while with 0.1.8 they return correct information.
Example gtfs file: http://olal.se/gtfs/otraf.zip
Hello,
I ran into a problem when trying to load a GTFS dataset using the GTFS.load_zip() function. The GTFS dataset I'm using is from http://gtfs.ovapi.nl. Could you please check how this UnicodeDecodeError happens? Many thanks!
Motivated by the need to do some sanity checks on GTFS feeds in this r5py issue, add a feature which constructs a matrix of average headways in user-specified chunks of time throughout a given service day.
This is not an issue with gtfs-lite per se, but rather with a particular GTFS file I tried to load, resulting in the following error:
/home/ghsci/work/process/data/transit_feeds/bilbao_gtfs/20230509_010334_RENFE_AVLD
Traceback (most recent call last):
File "/env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3652, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'end_date'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ghsci/work/process/subprocesses/_10_gtfs_analysis.py", line 266, in <module>
main()
File "/home/ghsci/work/process/subprocesses/_10_gtfs_analysis.py", line 81, in main
loaded_feeds = gtfslite.GTFS.load_zip(f'{gtfsfeed_path}.zip')
File "/env/lib/python3.10/site-packages/gtfslite/gtfs.py", line 280, in load_zip
calendar["end_date"] = pd.to_datetime(calendar["end_date"]).dt.date
File "/env/lib/python3.10/site-packages/pandas/core/frame.py", line 3760, in __getitem__
indexer = self.columns.get_loc(key)
File "/env/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3654, in get_loc
raise KeyError(key) from err
KeyError: 'end_date'
The apparent cause of this is that this calendar.txt file contains spaces before the newline character in the header as well as the data rows (visible in a text editor showing all characters).
I don't believe the data rows are the issue, as read_csv uses skipinitialspace=True (and I confirmed this resolves the spaces-after-dates issue).
Line 274 in 0bbdd2e
However, the last column in this file ends up with spaces included in its name, such that it can't be matched as simply 'end_date' when the file is read into pandas.
One possibility, if you did want to handle these kinds of inconsistencies, would be to call str.strip() on the columns after loading each dataframe, as per https://stackoverflow.com/a/36082588/4636357, e.g.
calendar.columns = calendar.columns.str.strip()
I confirmed that the above code resolves the issue in this case:
without this addition:
>>> import gtfslite.gtfs
>>> test = gtfslite.gtfs.GTFS.load_zip('data/20230509_010334_RENFE_AVLD.zip')
Traceback (most recent call last):
File "C:\Users\carlh\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'end_date'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\gtfs-lite\gtfslite\gtfs.py", line 277, in load_zip
calendar["start_date"] = pd.to_datetime(calendar["start_date"]).dt.date
File "C:\Users\carlh\miniconda3\lib\site-packages\pandas\core\frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\carlh\miniconda3\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'end_date'
with the addition:
>>> import gtfslite.gtfs
>>> test = gtfslite.gtfs.GTFS.load_zip('data/20230509_010334_RENFE_AVLD.zip')
>>> test
<gtfslite.gtfs.GTFS object at 0x0000024EFD75FE50>
>>> test.calendar
service_id monday tuesday wednesday ... saturday sunday start_date end_date
0 2023-05-082023-06-09001651 True True True ... True True 2023-05-08 1970-01-01
1 2023-05-082023-06-09001653 True True True ... True True 2023-05-08 1970-01-01
2 2023-05-082023-06-30001901 True True True ... True True 2023-05-08 1970-01-01
3 2023-05-082023-06-30001902 True True True ... True True 2023-05-08 1970-01-01
4 2023-05-082023-06-30001931 True True True ... True True 2023-05-08 1970-01-01
... ... ... ... ... ... ... ... ... ...
3425 2023-05-082023-05-28389071 True True True ... True True 2023-05-08 1970-01-01
3426 2023-05-082023-05-28389081 True True True ... True True 2023-05-08 1970-01-01
3427 2023-05-082023-05-28389091 True True True ... True True 2023-05-08 1970-01-01
3428 2023-05-082023-12-09941841 True True True ... True True 2023-05-08 1970-01-01
3429 2023-05-082023-12-09942751 True True True ... True True 2023-05-08 1970-01-01
[3430 rows x 10 columns]
If you were to implement this, to help robustness in loading GTFS feeds with slight validity issues, it would probably be a good idea to do this for all loaded frames.
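A generic version of that idea might look like the following sketch (read_clean is a hypothetical wrapper, not an existing gtfs-lite function):

```python
import io

import pandas as pd

def read_clean(buffer, **kwargs):
    # Read a GTFS text file and strip stray whitespace from the column
    # names, so headers like "end_date " match the spec name "end_date".
    df = pd.read_csv(buffer, skipinitialspace=True, **kwargs)
    df.columns = df.columns.str.strip()
    return df

# A header with a trailing space, as in the problem feed:
csv = "service_id,start_date,end_date \nS1,20230508,20231209\n"
calendar = read_clean(io.StringIO(csv))
```

Routing every table read through one such wrapper would apply the fix to all loaded frames at once, as suggested above.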
I'm not sure if it's of interest for you to implement this feature, as it's a problem with some GTFS files, not the software. However, I suspect other GTFS readers must do something similar, as a colleague was able to read this GTFS feed using urbanaccess, as per this thread https://github.com/global-healthy-liveable-cities/global-indicators/issues/275 (focused on a different issue, which I'm scoping whether usage of GTFS-Lite can resolve).
In case it helps, I'll look into drafting a pull request implementing this change.
We compute transit service intensity as the number of unique trips that stop at stops within a specified area (usually a buffer around a block group).
This is done by finding all stops within a certain zone, and then finding all trips, and then counting the number of unique trips visiting that zone within a 24-hour period (or within the GTFS service schedule).
For GTFS-lite, we simply need to verify the function works as intended.
Right now, unique trips (and service hours, but that's another issue) don't account for the "frequencies.txt" trip definitions.
To do this, we will have to handle trip_ids that appear in frequencies.txt separately, as follows:
1. Use stop_times to get all trips with stops in the zone.
2. Check whether those trips appear in the frequencies dataset; if so, infer the total number of trips.
I encountered a few GTFS feeds with a __MACOSX folder inside the feed. For example:
zipfile.ZipFile(orgzipfile, 'r').namelist()
['GTFS Import/',
'__MACOSX/._GTFS Import',
'GTFS Import/agency.txt',
'__MACOSX/GTFS Import/._agency.txt',
'GTFS Import/calendar_dates.txt',
'__MACOSX/GTFS Import/._calendar_dates.txt',
'GTFS Import/stop_times.txt',
'__MACOSX/GTFS Import/._stop_times.txt',
'GTFS Import/shapes.txt',
'__MACOSX/GTFS Import/._shapes.txt',
'GTFS Import/trips.txt',
'__MACOSX/GTFS Import/._trips.txt',
'GTFS Import/stops.txt',
'__MACOSX/GTFS Import/._stops.txt',
'GTFS Import/calendar.txt',
'__MACOSX/GTFS Import/._calendar.txt',
'GTFS Import/routes.txt',
'__MACOSX/GTFS Import/._routes.txt']
gtfs-lite fails to read such feed indicating a Unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 37: invalid continuation byte
A possible solution could be a slight change when reading the nested files in
Line 188 in 52516ff
by adding an exclusion condition:
if req in file and not str(file).startswith('__MACOSX/'):
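As a self-contained illustration of that exclusion (gtfs_members is a hypothetical helper, not the library's code):

```python
import io
import zipfile

def gtfs_members(zip_file):
    # Ignore macOS resource-fork entries (__MACOSX/...) when scanning
    # the archive for GTFS text files.
    return [
        name for name in zip_file.namelist()
        if name.endswith(".txt") and not name.startswith("__MACOSX/")
    ]

# Reproduce the structure from the namelist above in miniature:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("GTFS Import/agency.txt", "agency_name\nTest\n")
    z.writestr("__MACOSX/GTFS Import/._agency.txt", "junk")
members = gtfs_members(zipfile.ZipFile(buf))
```

The resource-fork files are what trigger the UnicodeDecodeError, since their contents are binary AppleDouble data rather than UTF-8 CSV text.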
There are feeds where times in between major stops are not specified (as they are not required to be specified). In its current form, unique_trip_count_at_stops() fails to count these, and they pose a larger problem: when filtering by a time span, where do you count them?
We would have to add a function to "interpolate_all_trips" in some manner.
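One possible shape for that interpolation (a sketch only; it assumes linear interpolation in row order, not distance-weighted, and the column handling is illustrative):

```python
import pandas as pd

# A trip with an unspecified intermediate arrival time:
stop_times = pd.DataFrame({
    "trip_id": ["t1"] * 3,
    "stop_sequence": [1, 2, 3],
    "arrival_time": ["08:00:00", None, "08:20:00"],
})

# Convert to seconds so the gap can be filled numerically, then
# interpolate within each trip and convert back to timedeltas.
secs = pd.to_timedelta(stop_times["arrival_time"]).dt.total_seconds()
filled = secs.groupby(stop_times["trip_id"]).transform(
    lambda s: s.interpolate()
)
stop_times["arrival_time"] = pd.to_timedelta(filled, unit="s")
```

A real implementation would probably weight by shape_dist_traveled where available; this row-order version is just the simplest defensible default.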
This will allow older (or stale) GTFS feeds that still operate in the same way to be used for present-day or later analyses.
The route_frequency_matrix function returns an error when analyzing a GTFS file. The error seems to be in the concatenation process:
1132 # Assemble final matrix
-> 1133 mx = pd.concat(slices, axis="index")
1134 mx = mx.fillna(0)
1135 return mx.reset_index(drop=True)
ValueError: No objects to concatenate
If a GTFS feed has an empty text file for transfers.txt, then the load_zip() method results in an error, as in the below example:
/home/ghsci/process/data/transit_feeds/Marc issues/Malaga/20230519_130136_Metro_Malaga
Traceback (most recent call last):
File "/home/ghsci/process/subprocesses/_10_gtfs_analysis.py", line 311, in <module>
main()
File "/home/ghsci/process/subprocesses/_10_gtfs_analysis.py", line 307, in main
gtfs_analysis(codename)
File "/home/ghsci/process/subprocesses/_10_gtfs_analysis.py", line 92, in gtfs_analysis
loaded_feeds = gtfslite.GTFS.load_zip(f'{gtfsfeed_path}.zip')
File "/env/lib/python3.10/site-packages/gtfslite/gtfs.py", line 364, in load_zip
transfers = pd.read_csv(
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 577, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
self._engine = self._make_engine(f, self.engine)
File "/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
return mapping[engine](f, **self.options)
File "/env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 555, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Transfers.txt is optional, so it's possible that, rather than the file being absent, some agencies may leave it empty.
A possible solution is presented here: https://stackoverflow.com/a/42143354
Following that approach, perhaps a try/except clause in load_clean_feeds() could catch the EmptyDataError and return None in that case, perhaps gated on an argument optional=True (defaulting to False). That would seem consistent with the current load-if-present-else-None approach for optional GTFS feed files.
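A sketch of that idea (the optional argument is the proposal from this issue, not an existing parameter, and the function body is illustrative):

```python
import io

import pandas as pd

def load_clean_feed(buffer, optional=False, **kwargs):
    # Treat an empty optional file the same as an absent one.
    try:
        return pd.read_csv(buffer, skipinitialspace=True, **kwargs)
    except pd.errors.EmptyDataError:
        if optional:
            return None
        raise

# An empty transfers.txt no longer crashes the load:
transfers = load_clean_feed(io.StringIO(""), optional=True)
```

For required files (optional=False), the EmptyDataError still propagates, which keeps genuinely broken feeds loud.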
Update the namespace so that we can just have
from gtfslite import GTFS
Instead of
from gtfslite.gtfs import GTFS
When dates are checked for validity, both calendar and calendar_dates are checked separately.
Lines 539 to 544 in 52516ff
These need to be combined to take the minimum of both minimums and the maximum of both maximums.
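The combination itself is straightforward (the frames here are toy examples; column names follow the GTFS spec):

```python
import datetime

import pandas as pd

calendar = pd.DataFrame({
    "start_date": [datetime.date(2023, 1, 1)],
    "end_date": [datetime.date(2023, 6, 30)],
})
calendar_dates = pd.DataFrame({
    "date": [datetime.date(2022, 12, 25), datetime.date(2023, 7, 4)],
})

# Earliest of both minimums, latest of both maximums:
first = min(calendar["start_date"].min(), calendar_dates["date"].min())
last = max(calendar["end_date"].max(), calendar_dates["date"].max())
```

A feed whose only service on a date comes from a calendar_dates exception would then validate correctly, which is the failure mode the current separate checks miss.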
As a complement to implementing #21, add a start and end time option for the calculation of the route summary.
Following the advice of r5py/r5py#222
When attempting to load a zipped GTFS feed containing the file attributions.txt, a KeyError was raised, suggesting that this file wasn't found in the archive.
I'll paste this below -- I'm trialling using the GTFS-Lite library in an existing workflow, so there's a bit of extra stuff here:
/home/ghsci/work/process/data/transit_feeds/bilbao_gtfs/20230505_130305_Euskadi_Euskotren
Traceback (most recent call last):
File "/home/ghsci/work/process/subprocesses/_10_gtfs_analysis.py", line 266, in <module>
main()
File "/home/ghsci/work/process/subprocesses/_10_gtfs_analysis.py", line 81, in main
loaded_feeds = gtfslite.GTFS.load_zip(f'{gtfsfeed_path}.zip')
File "/env/lib/python3.10/site-packages/gtfslite/gtfs.py", line 460, in load_zip
zip_file.open("attributions.txt"),
File "/env/lib/python3.10/zipfile.py", line 1514, in open
zinfo = self.getinfo(name)
File "/env/lib/python3.10/zipfile.py", line 1441, in getinfo
raise KeyError(
KeyError: "There is no item named 'attributions.txt' in the archive"
I had a quick look at the gtfs-lite/gtfs.py file and I suspect what is happening is that the check for attributions.txt passes (the file was correctly identified as being present), but the call made to load the file from the zipped directory does not use the filepaths dictionary:
Line 455 in 0bbdd2e
elsewhere, the files are loaded using the filepaths dictionary, for example:
Line 434 in 0bbdd2e
I'll see if I can have a go at making this change and if it makes a difference, but this looks to be the source of the error. I believe I didn't notice this until now as the other feeds I am using happen to not have this file.
Thanks for your work on this package, it looks useful! Let me know if you need any more information.
Looks like a date is being compared with a method. Need to go back and verify this function is working as intended.
File ".../gtfs.py", line 569, in valid_date
if first_date > date_to_check or last_date < date_to_check:
TypeError: '>' not supported between instances of 'datetime.date' and 'builtin_function_or_method'
Need to follow in the footsteps of r5py and company and write a contribution guide.
Once loaded it's possible to manipulate and adjust the feeds in any number of ways. Those feeds, once changed, should be writeable back out to a standard GTFS package.