propershark / timetable_cpp

Purveyor of schedule information for transit agencies via GTFS feeds and WAMP procedures.

Makefile 2.70% C 27.97% C++ 68.90% Objective-C 0.43%

timetable gtfs-feed schedule-information wamp-procedures transit

timetable_cpp's Introduction

Timetable

Purveyor of schedule information for transit agencies via GTFS feeds and WAMP procedures. This time in C++.

Installation

This is the only compiled part of the server system. As such, it takes a bit more effort to install and set up. These are the steps I took to get a working application.

NOTE: This project is configured to be compiled with clang. To use a different compiler (untested), simply edit line 5 of Makefile to be CXX ?= compiler_name. There are currently no compiler-specific flags, so everything else should work as-is.

First, you'll need to download and compile wamped and its dependency, mpack. It's easiest to clone mpack into the mpack folder in wamped.

git clone https://github.com/propershark/wamped.git
cd wamped
git clone http://github.com/alvistar/mpack.git

Then, follow the build instructions from wamped to create the static libraries for your system. I've copied them here to make things simple.

mkdir build
cd build
cmake ..
make

After running make, the built libraries will be somewhat hidden away. It took me a while to find these, but we need to copy them into our lib/ folder to link them properly when compiling timetable. Assuming you are in the directory for this repo (i.e., ../timetable), and you cloned wamped to a sibling folder (i.e., ../wamped), these commands should copy the library files properly.

cd /path/to/timetable/clone
cp ../wamped/build/source/wamped/libwamped.a      lib/
cp ../wamped/build/source/mpackCPP/libmpackcpp.a  lib/
cp ../wamped/build/mpack/src/mpack/libmpack.a     lib/

Note that the extension of the lib* files may depend on your system. Use tab completion in your shell and you should be able to find them pretty easily.

Now, to compile timetable itself, you should be able to run make and get a timetable executable.

make
./timetable

Usage

Timetable is currently a dumb executable with no real capabilities. As such, it can be run with no arguments.

./timetable

As it develops, it will need configuration for finding GTFS files and setting up worker pools. This may be done at the command line:

./timetable --gtfs ./data/gtfs.zip --workers 8

However, this will likely be done through a configuration file instead. Only time will tell.

timetable_cpp's Issues

Date formatter bug in visits_after

I'm seeing malformed dates on calls to visits_after:

timetable.visits_after(BUS064, 20170330 17:18:55, 5) =>

[ [ '20170330 00:17344',
    '20170330 00:17792',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 00:17344',
    '20170330 00:17792',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 00:17344',
    '20170330 00:17792',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 00:17344',
    '20170330 00:17792',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 00:17344',
    '20170330 00:17792',
    '28',
    'to/from Purdue Village & Dorms & Campus' ] ]

A comparable call to visits_between, which returns the expected dates, is:

timetable.visits_between(BUS064, 20170330 17:18:55, 20170330 18:18:55, 5) =>

[ [ '20170330 17:24:00',
    '20170330 17:24:00',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 17:36:00',
    '20170330 17:36:00',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 17:48:00',
    '20170330 17:48:00',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 18:00:00',
    '20170330 18:00:00',
    '28',
    'to/from Purdue Village & Dorms & Campus' ],
  [ '20170330 18:12:00',
    '20170330 18:12:00',
    '28',
    'to/from Purdue Village & Dorms & Campus' ] ]

I should note that I'm switching to visits_after for the POI view — since the view is supposed to refresh itself with new arrivals as old ones depart, I now call visits_after periodically and filter what is displayed in the client to only show the next hour.

Stop times > 23:59:59 unsearchable after midnight

For example, this stop time:
75R10,24:02:00,24:02:00,PLZA,3,Warm Springs/South Fremont,,,,1

...cannot be found by this visits_between query:
timetable.visits_between("PLZA", "20170728 00:00:00", "20170728 00:30:00", 5)

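This is because the search starts at the day of the requested time. A stop time of 24:02:00 belongs to the previous service day (it is two minutes past midnight on the following calendar date), so it is never considered.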
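One possible way to handle this, sketched under the assumption that stop times can be expressed as seconds since the start of their service day: search the previous service day as well, with the query window shifted forward by 24 hours. The names gtfs_seconds, window and windows_for are hypothetical and not part of Timetable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: convert a GTFS "HH:MM:SS" time, where HH may be 24 or
// more, into seconds since the start of the service day.
int gtfs_seconds(const std::string& hms) {
  const auto first = hms.find(':');
  const auto second = hms.find(':', first + 1);
  const int h = std::stoi(hms.substr(0, first));
  const int m = std::stoi(hms.substr(first + 1, second - first - 1));
  const int s = std::stoi(hms.substr(second + 1));
  return h * 3600 + m * 60 + s;
}

// A search window expressed relative to one service day.
struct window {
  int day_offset;  // 0 = requested date, -1 = previous service day
  int start;       // seconds since the start of that service day
  int end;
};

// For a query [start, end] in seconds since midnight on the requested date,
// produce the windows to search: the previous service day shifted forward by
// 24 hours (so times like 24:02:00 can match), then the requested date itself.
std::vector<window> windows_for(int start, int end) {
  const int day = 24 * 3600;
  return {{-1, start + day, end + day}, {0, start, end}};
}

bool in_window(const window& w, int stop_seconds) {
  return stop_seconds >= w.start && stop_seconds <= w.end;
}

void example() {
  const int stop = gtfs_seconds("24:02:00");
  const auto ws = windows_for(0, 30 * 60);  // 00:00:00 to 00:30:00
  assert(in_window(ws[0], stop));           // found via the previous service day
  assert(!in_window(ws[1], stop));          // missed if only the requested date is searched
}
```

A real fix would also need to check that the trip's service actually runs on the previous date before reporting the visit.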

Improve `stop_time` indexing.

As I'm thinking about how #14 could be implemented efficiently, I'm realizing that a single-index solution as is currently implemented will be rather limiting in terms of either functionality or performance.

For example: the current key of (station, departure_time, route, trip) is great for iterating all visits to a station, but less so for iterating only visits for a particular route, where (route, station, departure_time) would be far more suitable. Another key - (route, trip, station, departure_time) - is more efficient for queries like those in #14, where

I think a nice implementation of this would be having some Timetable::index type template that takes a set of parameters to use for building an index. This way, adding and modifying indexes is as simple as adding/modifying a member definition on Timetable::Timetable. Off the top of my head, I'd like something like this:

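#include <functional>
#include <map>
#include <string>
#include <vector>

template<typename key_type, typename value_type>
class index {
  using indexer = std::function<key_type(value_type)>;
  indexer make_index;
  // Store pointers to avoid duplicating objects. This requires that
  // values be stored elsewhere, but the method of storage is irrelevant.
  std::map<key_type, value_type*> index_map;

  public:
    index(indexer func) : make_index(func) {}

    void add_entry(value_type& value) {
      index_map[make_index(value)] = &value;
    }

    // Expose the map's bound lookups so callers can iterate a key range.
    auto lower_bound(const key_type& key) const { return index_map.lower_bound(key); }
    auto upper_bound(const key_type& key) const { return index_map.upper_bound(key); }
};

class Timetable {
  public:
    // Default member initializers need braces (or =), not parentheses.
    index<std::string, gtfs::stop_time> visits_by_trip{[](auto st) {
      return st.trip_id + st.departure_time.to_string();
    }};

    std::vector<gtfs::stop_time> stop_times;

    void build_indices() {
      // Iterate by reference: a by-value loop variable would leave the index
      // pointing at a temporary copy instead of the element in stop_times.
      for(auto& st : stop_times) {
        visits_by_trip.add_entry(st);
      }
    }
};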

Using this implementation would then be as simple as looking up a lower and upper bound in the index and iterating between them as desired. A potential use case could look something like this:

Timetable tt;
auto lower_bound = tt.visits_by_trip.lower_bound(some_trip);
auto upper_bound = tt.visits_by_trip.upper_bound(some_trip);
for(auto it = lower_bound; it != upper_bound; it++) {
  // Each entry is a (key, pointer) pair. Since the indices only keep
  // pointers, dereference the pointer to get the actual visit.
  auto& visit = *it->second;
  // do work...
}

I'm not entirely sure this will work as written above, but I'd like to get some sort of generic indexing system so that we can easily add new indexes as necessary without too much difficulty.
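As a rough, self-contained sketch of the same idea: the version below swaps the concatenated string key for a (trip_id, departure_time) pair, because with a concatenated key, lower_bound(some_trip) and upper_bound(some_trip) would not bracket that trip's entries. ordered_index, trip_key, visits_for_trip and the trimmed-down stop_time struct are all hypothetical names for illustration, not existing Timetable types.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for gtfs::stop_time with only the fields used here.
struct stop_time {
  std::string trip_id;
  std::string departure_time;
  std::string stop_id;
};

template <typename Key, typename Value>
class ordered_index {
  std::function<Key(const Value&)> make_key;
  std::multimap<Key, const Value*> entries;  // pointers only; values live elsewhere

 public:
  explicit ordered_index(std::function<Key(const Value&)> fn) : make_key(std::move(fn)) {}

  void add_entry(const Value& value) { entries.emplace(make_key(value), &value); }

  // All entries with lo <= key < hi, in key order.
  std::vector<const Value*> range(const Key& lo, const Key& hi) const {
    std::vector<const Value*> out;
    for (auto it = entries.lower_bound(lo); it != entries.end() && it->first < hi; ++it)
      out.push_back(it->second);
    return out;
  }
};

using trip_key = std::pair<std::string, std::string>;  // (trip_id, departure_time)

// Visits for one trip, in departure order.
std::vector<const stop_time*> visits_for_trip(
    const ordered_index<trip_key, stop_time>& idx, const std::string& trip) {
  // (trip, "") sorts before every key of the trip; (trip + '\0', "") sorts
  // after all of them and no later than any other trip id.
  return idx.range(trip_key{trip, std::string()}, trip_key{trip + '\0', std::string()});
}

void example() {
  std::vector<stop_time> stop_times = {
      {"T2", "08:00:00", "A"}, {"T1", "09:00:00", "B"}, {"T1", "08:30:00", "A"}};
  ordered_index<trip_key, stop_time> by_trip(
      [](const stop_time& st) { return trip_key{st.trip_id, st.departure_time}; });
  for (const auto& st : stop_times) by_trip.add_entry(st);  // by reference: pointers stay valid
  const auto visits = visits_for_trip(by_trip, "T1");
  assert(visits.size() == 2);
  assert(visits[0]->stop_id == "A" && visits[1]->stop_id == "B");
}
```

Adding another index, for example one keyed by (route, station, departure_time), would then only need a different key type and key function.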

Add route information to RPC responses

With the implementation of #5, responses from route-agnostic RPCs have no indication of which route each arrival is from. In other words, the context of the arrival is lost.

It'd be nice to add some route information to the responses from the RPCs so that consumers can keep that context and map arrivals to different routes, even if the requests they make do not specify that information.

An example response with route information could look like:

# [ETA, ETD, Route, Headsign]
["12:45:00", "12:45:00", "1B", "to Walmart"]

More information can be added, but keeping route and headsign information should be sufficient for calls to create a full context about the arrival.

This information should be added to all responses for consistency and to avoid surprises in the API.

Responses limited to 1024 characters

mpackCPP statically sizes the data buffer for a node at 1024 characters. This means that responses longer than 1024 characters will be truncated and have an invalid representation.

The easiest (and likely viable) solution would be to expand the buffer to 4096 characters, but the real solution is to replace the implementation with std::string. Both of these require modification to mpackCPP itself, so the std::string solution is probably better to go with now rather than later.

Create visits RPCs for all routes at a station

As discussed on January 15:

emw - Last Sunday at 10:24 AM
Another thing I'm realizing is that there are some instances where I'd rather get all arrivals on a station, regardless of route, in order
I can multiplex different next_visits calls together, but it'd be simpler to just ask for the n next visits from Timetable. It'd also prevent me from having to know the associated routes for a station before making the call.
If either of these things seem reasonable to you, btw, I can create an issue

faulty - Last Sunday at 11:34 AM
I sort of bundled this into my response on the same issue. Might be worth making a separate one for the route-agnostic stuff

If it's not too complicated, I'd like to have visits_before/after/between RPCs that omit the route parameter, and return all visits to that station regardless of route.

Iterate dates on `visits_*` calls

stop_times only specify the time at which they arrive at a station, while the dates that each stop_time is active are defined by either calendar or calendar_dates for the service_id.

When iterating stop_times for the visits_* calls, there is a good chance that the result list can span multiple days. Because the stop_time map on which these queries are based does not understand the concept of dates, iterating over that map once will only yield visits for one date.

One way to generate visits indefinitely from a starting date is to iterate the stop_time map for a stop once for every date, incrementing/decrementing the date each time. Because GTFS archives do not cover an unbounded range of dates, it would make sense to determine the date bounds for which the archive defines services and limit iterations to those dates.
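A minimal sketch of that loop, assuming a simplified model in which dates are plain day numbers and a caller-supplied predicate answers whether a service runs on a given day (standing in for calendar and calendar_dates). None of these names come from Timetable itself.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified model: a stop_time has a time of day and a service.
struct stop_time {
  int seconds;             // seconds since the start of the service day
  std::string service_id;
};

struct visit {
  int day;      // day number, e.g. days since the first day of the feed
  int seconds;
};

// Collect up to `count` visits at or after (start_day, start_seconds), walking
// the stop's time-sorted stop_times once per day and stopping at `last_day`,
// the last date the feed defines service for.
std::vector<visit> visits_after(
    const std::vector<stop_time>& sorted_stop_times,
    const std::function<bool(const std::string&, int)>& runs_on,
    int start_day, int start_seconds, int last_day, std::size_t count) {
  std::vector<visit> out;
  for (int day = start_day; day <= last_day && out.size() < count; ++day) {
    for (const auto& st : sorted_stop_times) {
      if (out.size() == count) break;
      if (day == start_day && st.seconds < start_seconds) continue;  // too early on day one
      if (!runs_on(st.service_id, day)) continue;                    // service inactive that day
      out.push_back({day, st.seconds});
    }
  }
  return out;
}

void example() {
  const std::vector<stop_time> times = {{8 * 3600, "WKD"}, {17 * 3600, "WKD"}};
  // Assumed calendar: the "WKD" service runs on every day except day 1.
  const auto runs_on = [](const std::string&, int day) { return day != 1; };
  const auto visits = visits_after(times, runs_on, 0, 12 * 3600, 3, 3);
  assert(visits.size() == 3);
  assert(visits[0].day == 0 && visits[0].seconds == 17 * 3600);
  assert(visits[1].day == 2 && visits[1].seconds == 8 * 3600);
}
```

The last_day bound is what keeps the loop from running forever when fewer than count visits exist in the feed.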

Optional csv fields

Does timetable support nullability / default values for optional fields in the GTFS spec? If not, what needs to be done to support it?

For context, I'm playing with BART and SFMTA's GTFS feeds. Both feeds cause timetable to crash while parsing the pickup_type field from stop_times.txt:

Reading trips
Reading routes
Reading stops
Parsing stop times
libc++abi.dylib: terminating with uncaught exception of type std::invalid_argument: stoi: no conversion
Process 11532 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fffbbc6ed42 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fffbbc6ed42 <+10>: jae    0x7fffbbc6ed4c            ; <+20>
    0x7fffbbc6ed44 <+12>: movq   %rax, %rdi
    0x7fffbbc6ed47 <+15>: jmp    0x7fffbbc67caf            ; cerror_nocancel
    0x7fffbbc6ed4c <+20>: retq
(lldb) up 10
frame #10: 0x00000001000ac279 timetable`gtfs::csv_parser<gtfs::stop_time>::_parse_line(this=0x0000000100118270, line="01SFO10SUN,08:00:00,08:00:00,LAFY,1,Millbrae,,,,1\r") at csv_parser.h:77
   74             switch(mapper.type) {
   75               case tBOOL:   mapper.apply(inst, column == "1");     break;
   76               case tDOUBLE: mapper.apply(inst, std::stod(column)); break;
-> 77               case tINT:    mapper.apply(inst, std::stoi(column)); break;
   78               case tSTRING: mapper.apply(inst, column);            break;
   79               default:      break;
   80             }

Per the spec, pickup_type is optional and defaults to 0.
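A small sketch of one way to avoid the crash: treat an empty column as "use the default" before handing it to std::stoi. The helper name parse_int_or is hypothetical, and wiring a per-field default into the mapper is an assumption about how csv_parser could be extended.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: parse an optional integer column, falling back to the
// spec's default when the field is empty. A trailing '\r' (from CRLF line
// endings, as in the line shown in the debugger) is stripped first.
int parse_int_or(std::string column, int fallback) {
  if (!column.empty() && column.back() == '\r') column.pop_back();
  if (column.empty()) return fallback;
  return std::stoi(column);
}

void example() {
  assert(parse_int_or("", 0) == 0);     // empty pickup_type defaults to 0
  assert(parse_int_or("1\r", 0) == 1);  // last column of a CRLF-terminated line
  assert(parse_int_or("3", 0) == 3);
}
```

The same pattern would apply to the tDOUBLE case with std::stod.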

RPC to get hours of service for a station/route

It'd be useful to have an RPC to determine when a particular route makes stops at a station. This would allow clients to say something like:

"Route 13 stops here Mon-Fri, 7:00am-6:00pm, every 5 min"

So, the call would look something like

timetable.service_times(station, route)

and would return

a range of weekday-times (e.g. Mon 07:00...Mon 18:00 or Tue 19:00...Wed 02:00) for a particular level of service
the time interval between stop times

timetable.service_times(station, route) -> [ (Date Range, Visit Interval) ]

Re-arrange RPC arguments to allow route agnosticism

In many cases, queries for the next arrivals at a station are interested in all arrivals, not just those from a particular route. However, the current RPC argument structure for the visits_* family is (station, route, timestamp, count), meaning route is not a parameter that can be omitted.

To allow for both calls that request arrivals from a particular route and that don't, a better RPC definition is:

visits_after(station, timestamp, route="", count);

Note that the route parameter is optional, but is not the last parameter. The current WAMP library that timetable uses does not allow for optional parameters, nor does it allow registering multiple procedures under the same name. In other words, this structure is not possible with the current WAMP library, but is the ideal case and should be considered if the underlying WAMP library changes.

For now, however, a simple fix is to create two distinct procedures that take these different argument sets. They could be:

visits_after(station, timestamp, count);
visits_after_from_route(station, timestamp, route, count);

This should handle the structure for now, but should not be considered a final solution.
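To illustrate, both procedures could share one implementation in which an empty route means "any route". This is only a sketch: the visit_table type and the sample data are made up for the example, and registering the functions with the WAMP library is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct visit {
  std::string time;   // "%Y%m%d %H:%M:%S", which sorts chronologically as text
  std::string route;
};

// station -> visits sorted by time (a stand-in for the real index)
using visit_table = std::map<std::string, std::vector<visit>>;

// Shared implementation: an empty route matches every route.
std::vector<visit> find_visits_after(const visit_table& table, const std::string& station,
                                     const std::string& timestamp, const std::string& route,
                                     std::size_t count) {
  std::vector<visit> out;
  const auto found = table.find(station);
  if (found == table.end()) return out;
  for (const auto& v : found->second) {
    if (out.size() == count) break;
    if (v.time < timestamp) continue;
    if (!route.empty() && v.route != route) continue;
    out.push_back(v);
  }
  return out;
}

std::vector<visit> visits_after(const visit_table& table, const std::string& station,
                                const std::string& timestamp, std::size_t count) {
  return find_visits_after(table, station, timestamp, "", count);
}

std::vector<visit> visits_after_from_route(const visit_table& table, const std::string& station,
                                           const std::string& timestamp,
                                           const std::string& route, std::size_t count) {
  return find_visits_after(table, station, timestamp, route, count);
}

void example() {
  // Made-up sample data for illustration.
  const std::vector<visit> visits = {{"20170330 17:24:00", "28"},
                                     {"20170330 17:30:00", "1B"},
                                     {"20170330 17:36:00", "28"}};
  visit_table table;
  table["BUS064"] = visits;

  const auto any = visits_after(table, "BUS064", "20170330 17:18:55", 2);
  assert(any.size() == 2 && any[1].route == "1B");

  const auto only28 = visits_after_from_route(table, "BUS064", "20170330 17:18:55", "28", 2);
  assert(only28.size() == 2 && only28[1].time == "20170330 17:36:00");
}
```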

Data versioning and idempotency

From a comment on propershark/proto#1:

Versioning in Timetable: clients should be able to cache. Timetable should
(at some point) keep a monotonically increasing version number of sorts, based
on the GTFS data it holds.

Since Timetable is an idempotent service (making the same request twice will yield the same results), clients can easily cache calls with identical parameters.

However, when Timetable updates its GTFS information, any cached calls should be immediately invalidated, and there is currently no way that clients can safely cache calls and know when the cache is invalidated.

The proposal here is to add a parameter to every response from Timetable indicating the version of the data that was used to generate the response with the following constraints:

When the underlying GTFS source changes in Timetable, this version number will change to a new value such that clients know to invalidate their caches.
Within a reasonable timeframe, this version number is never repeated. This allows clients to passively check for cache invalidation while avoiding potential collisions.

How this version number will be best represented in responses is unclear to me. One option is to simply wrap responses in another Array and include the version number there:

timetable.visits_between(...) =>

[
  <version_number>,
  [
    ["20170402 06:50:00", ...],
    ...
  ]
]

Another option would be to use a map response instead. This has the benefit of showing semantics, but also takes up a sizable amount of space to include the key names:

timetable.visits_between(...) =>

{
  version: <version_number>,
  response: [
    ["20170402 06:50:00", ...],
    ...
  ]
}

I'm partial to using the map response, as it will also allow us to add arbitrary meta information later on without requiring clients to necessarily change their parsing logic.
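From the client's side, the invalidation rule could look roughly like the following. This is a sketch only: response and response_cache are hypothetical types, and the version is assumed to be an integer.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical shape of the map-style response.
struct response {
  std::uint64_t version;
  std::vector<std::string> visits;
};

// Client-side cache keyed by the request text. Entries are only valid for the
// data version they were generated from.
class response_cache {
  std::uint64_t version_ = 0;
  std::map<std::string, std::vector<std::string>> entries_;

 public:
  // Record a response; a new version number invalidates everything cached so far.
  void store(const std::string& request, const response& r) {
    if (r.version != version_) {
      entries_.clear();
      version_ = r.version;
    }
    entries_[request] = r.visits;
  }

  // Returns nullptr on a cache miss.
  const std::vector<std::string>* lookup(const std::string& request) const {
    const auto it = entries_.find(request);
    return it == entries_.end() ? nullptr : &it->second;
  }
};

void example() {
  response_cache cache;
  cache.store("visits_between A", response{1, {"20170402 06:50:00"}});
  assert(cache.lookup("visits_between A") != nullptr);
  cache.store("visits_between B", response{2, {"20170402 07:10:00"}});  // GTFS data changed
  assert(cache.lookup("visits_between A") == nullptr);                  // stale entry dropped
  assert(cache.lookup("visits_between B") != nullptr);
}
```

Because invalidation happens only when a response arrives, this is the passive check described above: a client learns about new data the next time it makes any uncached call.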

Calls to visits_between don't honor the time range

I'm seeing things like:

Received call to `visits_between`:
        stop:   BUS064
        start:  20170319 23:17:33
        end:    20170320 00:17:33
        count:  10
Responded in 1.838ms with:
[["20170319 09:11:00","20170319 09:11:00","1B","to CityBus Center"],["20170319 10:11:00","20170319 10:11:00","1B","to CityBus Center"],["20170319 11:11:00","20170319 11:11:00","1B","to CityBus Center"],["20170319 12:11:00","20170319 12:11:00","1B","to CityBus Center"],["20170319 13:11:00","20170319 13:11:00","1B","to CityBus Center"],["20170319 14:11:00","20170319 14:11:00","1B","to CityBus Center"],["20170319 15:11:00","20170319 15:11:00","1B","to CityBus Center"],["20170319 16:11:00","20170319 16:11:00","1B","to CityBus Center"],["20170319 17:11:00","20170319 17:11:00","1B","to CityBus Center"],["20170319 18:11:00","20170319 18:11:00","1B","to CityBus Center"]]

Note that the returned stop times begin on the morning of 20170319, long before the requested start of 23:17:33. Maybe this has to do with the requested range crossing midnight?

RPC calls for visits within a time interval

As Proper starts to consume Timetable data, I'm realizing that a more useful way to query arrival data is by asking for all the arrivals within a time interval. Proper's station views (by default) show all arrivals for the next hour; this RPC would suit that use case nicely. Here's how I envision the RPC looking:

timetable.visits_between(station, route, start_time, end_time, n) -> [(ETA, ETD)]

start_time and end_time are standard (%Y%m%d %H:%M:%S) timestamps. n allows a count limit to be specified, but it can be omitted.
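For reference, a sketch of handling timestamps in that format. parse_timestamp and within are hypothetical helper names; the second relies on the fact that zero-padded timestamps in this layout sort chronologically as plain strings.

```cpp
#include <cassert>
#include <cstdio>
#include <ctime>
#include <string>

// Parse a "%Y%m%d %H:%M:%S" timestamp such as "20170330 17:18:55" into a
// std::tm. Returns false if the text does not match the format.
bool parse_timestamp(const std::string& text, std::tm& out) {
  int year, month, day, hour, minute, second;
  if (std::sscanf(text.c_str(), "%4d%2d%2d %2d:%2d:%2d",
                  &year, &month, &day, &hour, &minute, &second) != 6) {
    return false;
  }
  out = std::tm{};
  out.tm_year = year - 1900;  // std::tm counts years from 1900
  out.tm_mon = month - 1;     // and months from 0
  out.tm_mday = day;
  out.tm_hour = hour;
  out.tm_min = minute;
  out.tm_sec = second;
  return true;
}

// Interval check on the raw strings; no parsing needed for this format.
bool within(const std::string& ts, const std::string& start, const std::string& end) {
  return start <= ts && ts <= end;
}

void example() {
  std::tm t{};
  assert(parse_timestamp("20170330 17:18:55", t));
  assert(t.tm_year == 117 && t.tm_mon == 2 && t.tm_mday == 30);
  assert(t.tm_hour == 17 && t.tm_min == 18 && t.tm_sec == 55);
  assert(!parse_timestamp("not a timestamp", t));
  assert(within("20170330 17:24:00", "20170330 17:18:55", "20170330 18:18:55"));
  assert(!within("20170330 09:11:00", "20170330 17:18:55", "20170330 18:18:55"));
}
```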

Handling agencies that don't have stop codes

I'm trying to get Timetable working with BART, and I've run into a conceptual issue I'd like your feedback on: How do we identify stations that don't have a stop code?

BART's stops.txt looks like this:

stop_id,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,location_type,parent_station,stop_timezone,wheelchair_boarding
12TH,12th St. Oakland City Center,,37.803768,-122.271450,12TH,http://www.bart.gov/stations/12TH/,0,,,1
16TH,16th St. Mission,,37.765062,-122.419694,16TH,http://www.bart.gov/stations/16TH/,0,,,1
19TH,19th St. Oakland,,37.808350,-122.268602,19TH,http://www.bart.gov/stations/19TH/,0,,,1
...

There's no stop code, only a stop_id. Looking at the spec for stops.txt, I find this:

stop_code - Optional
Contains short text or a number that uniquely identifies the stop for passengers. Stop codes are often used in phone-based transit information systems or printed on stop signage to make it easier for riders to get a stop schedule or real-time arrival information for a particular stop.

The stop_code field should only be used for stop codes that are displayed to passengers. For internal codes, use stop_id. This field should be left blank for stops without a code.

Makes sense. BART doesn't have this IRL: Stations are always identified by a full name. Since it's a subway rail system, there are relatively few stops, and they're easy to identify by the street(s) they span or their position in a city.

So why do we key our models with stop_code? Looking back at Citybus, I see why:

stop_id,stop_code,stop_name,stop_desc,stop_lat,stop_lon,location_type
73c3bfbe-e120-4cde-ae8a-dc1c661bbe75,"BUS215","CityBus Center: BUS215","CityBus Center",40.4206937802252,-86.8948909279639,0
84f8d586-2af9-498b-b0cd-7c594effe124,"BUS287","2nd St & Columbia St (NW Corner): BUS287","2nd St. and Columbia St. (NW Corner)",40.41832136621,-86.8954126176484,0

Presumably, that UUID is something internal to the GTFS file, and doesn't correspond to any IDs we get from Citybus's realtime sources.

What do you think the right approach is here? Should we fall back to stop_id when stop_code is unavailable, or is there a better approach / refactor you can think of?
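The fallback option could be as small as the sketch below. The stop struct and station_key are hypothetical names, and the sketch assumes a feed's stop_code and stop_id values never collide with each other.

```cpp
#include <cassert>
#include <string>

// Hypothetical, trimmed-down stop record.
struct stop {
  std::string stop_id;
  std::string stop_code;  // optional in GTFS; empty when the feed omits it
  std::string stop_name;
};

// Key used to identify a station: prefer the passenger-facing stop_code and
// fall back to stop_id when the feed does not provide one.
const std::string& station_key(const stop& s) {
  return s.stop_code.empty() ? s.stop_id : s.stop_code;
}

void example() {
  const stop bart{"12TH", "", "12th St. Oakland City Center"};
  const stop citybus{"73c3bfbe-e120-4cde-ae8a-dc1c661bbe75", "BUS215", "CityBus Center: BUS215"};
  assert(station_key(bart) == "12TH");
  assert(station_key(citybus) == "BUS215");
}
```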

Build on linux

For deployment, we need a way of building timetable on linux. We could insist on using clang; however, I've been playing with installing clang on our server and it's not as simple as an apt-get (currently having problems with the linker and libstdc++ vs libc++).

I think it'd make the most sense to just change a few clang-specific sections of the code, making timetable more portable. For example, I'm seeing this error when I try to compile:

src/datetime/resolve.cc: In member function ‘void DateTime::resolve()’:
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
   };
   ^
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
src/datetime/resolve.cc:23:3: sorry, unimplemented: non-trivial designated initializers not supported
src/datetime/resolve.cc:23:3: warning: missing initializer for member ‘tm::tm_yday’ [-Wmissing-field-initializers]
src/datetime/resolve.cc:23:3: warning: missing initializer for member ‘tm::tm_isdst’ [-Wmissing-field-initializers]
src/datetime/resolve.cc:23:3: warning: missing initializer for member ‘tm::tm_gmtoff’ [-Wmissing-field-initializers]
src/datetime/resolve.cc:23:3: warning: missing initializer for member ‘tm::tm_zone’ [-Wmissing-field-initializers]

As it would seem, gcc doesn't support non-trivial designated initializers in C++ (as the error above says), while clang does. I anticipate little adjustments like this will be necessary if you want portability.
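A portable alternative, sketched without knowing what DateTime::resolve actually contains: value-initialize the std::tm and assign the members individually, which also clears the missing-initializer warnings. make_tm is a hypothetical helper.

```cpp
#include <cassert>
#include <ctime>

// Build a std::tm without designated initializers: value-initialize every
// member to zero, then assign the fields that matter.
std::tm make_tm(int year, int month, int day, int hour, int minute, int second) {
  std::tm t{};
  t.tm_year = year - 1900;  // years since 1900
  t.tm_mon = month - 1;     // months since January
  t.tm_mday = day;
  t.tm_hour = hour;
  t.tm_min = minute;
  t.tm_sec = second;
  t.tm_isdst = -1;          // let mktime determine whether DST is in effect
  return t;
}

void example() {
  std::tm t = make_tm(2017, 3, 19, 23, 17, 33);
  std::mktime(&t);          // fills in tm_wday and tm_yday
  assert(t.tm_mday == 19);
  assert(t.tm_wday == 0);   // 2017-03-19 was a Sunday
}
```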

Rename `next/last_visits` to `visits_before/after`.

To be more consistent with the newly-added visits_between RPC, next_visits and last_visits should get renamed to visits_after and visits_before. The parameters and return values will be unchanged.

This is generally more semantic as well, as next_visits is actually just returning the visits that occur after the given timestamp; last_visits, before.