hasadna / open-bus Goto Github PK

:bus: Analysing Israel's public transport data

Python 3.15% Java 93.34% HTML 0.09% JavaScript 0.50% CSS 0.02% Dockerfile 0.02% Jupyter Notebook 2.86% Makefile 0.01% Batchfile 0.02%

transport open-data gtfs siri transit transit-data

open-bus's Introduction

Open Bus

Open Bus is a project of The Public Knowledge Workshop.

We use public data to improve bus service - and public transport in general - in Israel.

We're currently working on one main project, Real Delays, aggregating real-time bus data and comparing it to the planned bus schedules.

Where does the data come from?

Planned (static) data: The Ministry of Transport publishes a file called GTFS. This file contains planned trips data for the next 60 days. Alongside it, in the same FTP folder, there are a number of files with additional related data.
Online data: the MoT has a webservice that provides real-time data. The webservice is called [SIRI SM] (https://github.com/hasadna/open-bus/wiki/Bus-Real-Time-(SIRI)-Data-Documentation).

Want to help?

The project is currently focused on aggregating and analyzing data, so we need mainly Python developers and data scientists. We also have side tasks that are quite "stand-alone".

We are using Python 3 for all of our analysis, GTFS and "ETL" code, Java for the siri fetching code.

To get started, check our wiki and have a look at our task board to see what we're working on

We recommend contacting us by filling up the workshop's new volunteer form. There's sometimes, but not always, someone working on the project in the Public Knowledge Workshop Tel-Aviv development meetings (Monday evenings).

open-bus's People

Contributors

Stargazers

Watchers

open-bus's Issues

Data Analysis - Create Reports For 480,19,947

Create reports with the data we have regarding lines: 480, 19, 947.
The reports should include the following analysis:

Per Date - how many trips were planned (by GTFS) how many we actually received GPS coordinates (from SIRI). for example: 99 trips were planned but we saw only 91 in SIRI.
Which buses left their first stop late?
How late are the buses on other stops? (not just the first)
How long does each trip takes from start to end?

If you have other ideas you are more than welcome to add them! 🥇

Write to siri_real_time_arrivals table expected arival time

Definition Of Done:

Expected arrival time will be available in the column: "expected_arrival_time" in "siri_real_time_arrivals" table.

Tasks

Add new column to siri_real_time_arrivals table called "expected_arrival_time" (timestamp with time zone)
Change real-time-siri CRUD to handle the new column.
write expected arrival time to the new column

Importing stop locations to Open Street Map

This is a request that came up in a discussion in Tapuz Public Transport forum. It's tangent to our current tasks, but I thought I'll document as a "product backlog" item.

The idea is to use our daily GTFS download + DB insert script to also update Open Street Map with any changes to bus stations and stops. At the minimum we could just make sure that all the stops exist on the map and are in the correct location. We could probably also add information about which bus lines stop there.

The first task is to research how to technically do the import (see link below). It's also probably best to announce the intention of doing that in the Israel OSM forum and ask for advice\help. I suggest documenting the progress

Resources:

Help wanted:
This is a relatively independent task that doesn't require a lot of coordination with the rest of the project. The first step involves research rather than diving into code.

Upsert relevant GTFS record into open-bus DB

Definition Of Done:

Read relevant records from files/tables
update~insert records of tables in open-bus tables depends on the "upsert" policy of each table.

Route that contain the same coordinate twice or more leads to undefined behavior

Because the calculation of the route_offset made without the context of other records, if bus visit the same place twice or more in a trip the route_offset is undefined.
In the follow example the purple dot represent such case that route_offset of position 4 and 2 will be the same.

Add contribution guidelines

I feel the our project could use some order-making, and a boost to its accessibility to newcomers and sparsely-contributing members. My first offer in this direction is to come up with a good CONTRIBUTING.md file that will be suggested to every newcomer to read, and linked to on every issue, pull-request and commit creation.

Besides pointing to the basic resources for getting started, should include at least the following contribution type guidelines -

Documentation
New features
Bug reports
Use case / Analysis

For code contributions

Small - pull request
Large - disscussion issue -> design-> pull-request
How to submit a proper Pull Request
Test guidelines

A script/query to add stop_point field to gtfs_stops table [implement #30 first]

Note: this task is a little similar to #30, except it also includes indexing. This isn't as urgent though because we can use the shape_dist_travelled in gtfs_stop_times to start with, rather than calculating the route offset.

We are want to add geometry fields to our database in order to be able to run geoqueries using PostGIS.

The task is to create a script that:

Adds a geometry column named stop_point to the gtfs_stops table.
Updates the column to contain the data from stop_lat and stop_lon in the same table.
Index the new field

The script location should be under the /postgres folder in the source code.

Some additional pointers:

Use AddGeometryColumn rather than ALTER TABLE (see here why)
These are probably the correct parameters for add geometry column: srid = 4326 , type = "POINT" and dimension = 2
When inserting the data, stop_lon is the X value and stop_lat is the Y value
Information about geoindexing

What you need to know to implement this task

SQL, some PostGIS. If you don't know any PostGIS, these should help:
- Manual
- Tutorial
You'll be touching the gtfs_stops table, but you don't really need to understand much about it except for the stop_lat & lon fields
You can probably create the query using the re:dash interface. If you need a local dev copy of the database, see #32
Clone the repository and use pull requests to submit your code.

Optimize siri_arrivals data

siri_arrivals has a huge (and hopefully growing) number of records, so we need to be more careful about the size of each record.

We need to make some changes to the current table as well as to the code that creates the table:

Change trip_id to varchar(20)
Change vehicle_location_lat and vehicle_location_lon to varchar(8) (5 digit decimal precision is enough)
Drop columns confidence_level, arrival_boarding_activity
Review other fields and see if the type can be changes or the field can be dropped completely.

Write web page that display a sql query

write a web page that display a data from sql query.

create csv file that will represent the result of sql query.
write a static web page that show the data in a table.

Definition Of Done

A web page that shows the csv file in a table.

Please add me as project collaborator!

Please add a comment if you would like to join the repository as a collaborator. This will allow you to assign tasks to yourself, and also work with the Kanban boards.

Preferably talk to us first (e.g. by filling out the Public Knowledge Workshop's volunteers questionnaire, meeting us during the development meetings...)

Otherwise please explain in your comment what's your interest in the project and how you plan to contribute!

Remember - you can always contribute by cloning the repository and posting pull requests, as well as by commenting on issues.

A script/query for creating gtfs_shape_lines table

We are want to add geometry fields to our database in order to be able to run geoqueries using PostGIS.

The task is to create a script that:

Creates a new table, gtfs_shape_lines, with two fields: shape_id (INTEGER NOT NULL) and shape_line (geometry)
Insert data from gtfs_shapes to gtfs_shape_lines. All the records with the same shape_id should in gtfs_shape should be merged into a single record in gtfs_shape_lines. The line data should be a concatenation of the of the shape_pt_lon & shape_pt_lan fields, ordered by shape_pt_sequence.
Optional for extra credit: before loading the line strings, run shape simplification (https://github.com/hasadna/open-bus/blob/master/gtfs/parser/simplifyshapes.py). This should be useful because the data in gtfs_shapes is over sampled.

The script location should be under the /postgres folder in the source code. If you need Python code, put the script under /gtfs/parser

Additional pointers:

Use AddGeometryColumn rather than ALTER TABLE (see here why)
These are probably the correct parameters for add geometry column: srid = 4326 , type = "LINESTRING"
When inserting the data, shape_pt_lon is the X value and shape_pt_lat is the Y value

What's would you need to implement this?

SQL, Some PostGIS, probably also Python. If you don't know any PostGIS, these should help:
- Manual
- Tutorial
You'll need to understand the gtfs_shapes table. Read about the GTFS tables here and play with them using the re:dash interface.
If you need a local dev copy of the database, see #32
Clone the repository and submit the query using a pull request.

Bus2train: Analyse bus arrivals to train stations (data analysis)

Some time ago we were asked by Gil from 15 minutes to analyse the transfers between buses and trains in train stations. His request was to find a metric for how well the buses are coordinates with the trains in different stations, and different times of day. This can have value both for PR and for prioritising the work with the ministry of transport.

Here's a set of files containing arrival of buses and trains to train stations on Thursday, 2016-9-1. It was created from GTFS data using the calling_at_station module.

This task is rather open ended: can anyone look at these files and think of ways to analyse the data and create useful metrics (or even visualisations?) that could provide insights on where and when the coordination of trains and buses is especially problematic?

Map of bus stops that are connected to train stations

As part of checking bus\train connectivity for 15 minutes, we want to understand what parts of the country are easily accessible from train stations.

The idea is

find bus stops that are near train stations
find all routes that stop in those stops, and see which other stops their reach

I have initial results for this created with code in my old repository. There is a map of of bus stops near train stations. And there is a map of stops from which can reach train stations .

I am working now on:

Moving all the relevant code to this repository
Fixing some possible bugs

Next tasks:

15 minutes have asked for the following features

Color code the distance from the station (e.g. use saturation)
Control over the parameters of drawing the map (e.g. the current map only shows stops up to 30 minutes from the station)
Include all train stations (e..g in separate layers) . The current map doesn't contain the 10 most well served stations (mostly Tel-Aviv & Haifa).

Achieving all this seems to me to require a dynamic map (e.g. Google maps with some kind of backend). Will be happy for ideas about how best to do it (or help in doing it...)

Create a database dump for dev environment + docker?

Find an easy way for new volunteers to have write access to a copy of the database.

Preferably not the entire database but a subset (e.g. siri_arrivals from a single day + related gtfs data)

Can we put it into a docker container based on this?

Alternatively we need good instructions on how to set up a database.

Deploy script for fetching online data from SIRI

We finally have direct access to the SIRI interface! It's time to start using it!

I suggest we install a script that polls the service regularly (e.g. every minute) and dumps the data to database. As far as I understand, we already have the technology to do it.

This would require (I think):

Setting up the DB tables
Testing the script and fixing and bugs
We might want to make some adaptation, e.g. read the list of stops to poll from a file
Installing a cron job

Which stops to poll?
Ideally I would like to poll all the stops all the time. My thinking is, we should be build the biggest database we can, because we don't know what parts of the data would become useful.

However I think there's some limit on the amount of data you can poll. If we hit that, I suggest we start by focusing on bus routes that serve train stations, since this will be helpful for the Bus2Train task. I'll soon post a list of these stops here.

Find real time stop arrivals [depends on #24, #26][advanced]

The goal of this task is to build a table with at least the following fields:

trip_id
date
stop_code
actual_arrival_time

Write a script that outputs that to a CSV (no need to write to DB at this point). The script should go under /siri .

How to do this
Assuming

siri_arrivals table has a trip_id field (#24) and route offset (#26)
gtfs_stop_times table has shape_dist_ratio (#33)

You should be able to do something like:

For each trip_id in siri_arrivals, build:
- planned stops: a list of all planned stops in the trip, from gtfs_stop_times (stop_code & shape dist ratio)
- known locations: a list of know positions from siri_arrivals - route_offset & timestamp
For each stop in planned stops, find the last know location before it, and first known location after it. If the previous know location was at time t₁ and offset x₁, and the next known location is at time t₂ and offset x₂, and the stop offset is x_s, then

t_s = (x_s - x₁) / (x₂ - x₁) * (t₂ - t₁)

What do I need to know

Python, basic SQL
You'll need a good understanding of the GTFS tables and the siri_arrivals tables, including the extra fields added in #24, #26 & #33. If you know enough to understand this task description, you're probably all right.
You probably need a dev database, see #32
Clone the repository, and pull request your changes.

Write a roadmap

We could really use the help of recording what we are working on, and what we plan or wish to work on in the future. What is better than a good ol' roadmap for the job!

@AvivSela and I started working on this in the last Monday meetup, and I will open a pull request with a ROADMAP.md file on which we could all discuss

Found this short guide to be useful https://mozillascience.github.io/working-open-workshop/roadmapping/

Build incremental GTFS database [epic, needs breaking down into tasks]

The GTFS is published by MoT nightly. Each file contains data for the next 60 days, but changes to planning can occur: the GTFS files from the 1/1 and 2/1 may disagree on the trips planned for the 2/1. The data published on the 2/1 should be considered more accurate because it is more up to date.

We need a script that:

For every day, loads the data from the GTFS file of that day (or the most recent file we have) to a Postgres DB.
Can add one GTFS file to an existing database.
Can work on an archive of GTFS files and load data incrementally from them.

An outline for the incremental db schema and logic for implementing the task can be found here.

An archive of GTFS files to test with is kept by Open Train.

Execute Open-Bus-Bash App and Load Data for One Day

Talk with Michal and understand how to execute the bash code.
Execute the bash code for one day.

Definition of Done

The GTFS information of a day is in the DB
The SIRI information of a day is in the DB
The real time DB table is updated

Add siri request status log

We currently have code which queries the siri service but we don't keep log whether the request succeeded or failed.
The mission is to add a feature that will save in the db for each request whether it succeeded or failed.
For more information about siri see documentation here and issue #10

Guidelines:

Create Table siri_log with columns: request_id (the same one as in siri_arrivals), request_time, response_time, status.
Don't save the raw xml. It is too big.
Consider saving which lines we requested

[Transferred from GitLab] Implement a function to extract bus stops for a given route

Need to implement a function which gets a bus route + departure stop and returns a list of stops for that route.

Originally posted by @johananl

A script/query adding vehicle_location_point to siri_arrivals

(note: this task is a little similar to #28, but is slightly simpler because indexing isn't required, and also has higher priority)

We are want to add geometry fields to our database in order to be able to run geoqueries using PostGIS.

The task is to create a script that:

Adds a geometry column named vehicle_location_point to the siri_arrivals table.
Updates the column to contain the data from vehicle_location_lat and vehicke_location_lon in the same table.

The script location should be under the /postgres folder in the source code.

Some additional pointers:

Use AddGeometryColumn rather than ALTER TABLE (see here why)
These are probably the correct parameters for add geometry column: srid = 4326 , type = "POINT" and dimension = 2
When inserting the data, vehicle_location_lon is the X value and vehicle_location_lat is the Y value

What you need to know to implement this task

SQL, some PostGIS. If you don't know any PostGIS, these should help:
- Manual
- Tutorial
You'll be touching the siri_arrivals table, but you don't really need to understand much about it except for the vehicle_location_lon & lat fields
You can probably create the query using the re:dash interface. If you need a local dev copy of the database, see #32
Clone the repository and use pull requests to submit your code.

fetch_and_store_arrivals: add an option to dump results to file rather than DB

siri/fetch_and_store_arrivals.py is the script that fetches data from the SIRI server.

Currently it writes the results to a database. It would be helpful if it has an option to just write the results into a flat file, so it can be used for debugging with a database installed.

This would require:

Adding appropriate options to the command line parameters, e.g. --use_file (a boolean) and--output_filename. These parameters can be saved to the connection_details dictionary.
Add a function that receives an array of bus arrivals and a file name, and uses the csv module to write to the file. db.insert_arrivals could be used as reference.
Change the function fetch_and_store_arrivals to either call db.insert_arrivals or the new function, depending on the connection parameters.

Arrange with Yehuda to open an S3 bucket for the GTFS archive

trips file now includes trip_headsign field which causes parsing exception

At least since 2017-12-29, the trips file header is:
route_id,service_id,trip_id,trip_headsign,direction_id,shape_id

On 2017-11-26 it was:
route_id,service_id,trip_id,direction_id,shape_id

As a result we get a parsing error in insert_gtfs.sql:

[snip]
********** importing trips **********
CREATE TABLE
Time: 2.696 ms
ALTER TABLE
Time: 0.639 ms
ERROR:  extra data after last expected column
CONTEXT:  COPY gtfs_trips, line 2: "2946,54524100,7141464_040218,דרך הטייסים-נווה חן,0,72337"
Time: 115.895 ms
[snip]

and the trips table is not populated:

obus=> select count(*) from gtfs_trips;                                                                                                             count 
-------
     0
(1 row)

Upload open train public GTFS archive to S3

The archive is here:
http://gtfs.otrain.org/static/archive/

Need to rename files to YYYY-MM-DD.zip

Add SHAPE_DIST_RATIO field to gtfs_stop_times [easy]

gtfs_stop_times has a SHAPE_DIST_TRAVELED which gives the distance travelled by the bus from the route origin to the current stop.

The task is to write a query that:

Adds a shape_dist_ratio field to gtfs_stop_times table
Populate the field with the ratio between SHAPE_DIST_TRAVELED and the length of the route (values in 0-1 range)

Keep the query as an .sql file under /postgres

You can get the length of the route by querying the SHAPE_DIST_TRAVELED of the last stop (highest STOP_SEQUENCE for the same TRIP_ID).

What you will need to know?

SQL
Understand the gtfs_stop_times table. Read about the GTFS tables here and play with them using the re:dash interface.
You can access the database and write the query using the re:dash database.
Clone the repository and submit the query using a pull request.

Analyse train passenger count data: new dataset! never seen before!

We have received (sort of, see below) from Israel Railways data about the number of people starting and ending their journeys at each train station, for each date between April and October 2016, for each hour of the day.

We have asked for the data in order to understand which stations need better bus service, and when (this is for the bus2train project). However I think this is the first time this kind of data is made public, and there's probably more interesting stuff to do with it.

The original file is here: https://drive.google.com/open?id=0B_MFjF7hDGLcRGVKaklvSERWUmM

And there's a CSV of the data here https://drive.google.com/file/d/0ByIrzj3OFMnIMGxWQk5RaFM5Yjg

Unfortunately there are a few problems with the data:

We received only data for ends of journey (number of people leaving the station), and not starts of journey
We are missing data of many of the stations (there's only partial information for Segula & Tel-Aviv Centre, and everything between them in alphabetical order is missing)

We have written to Israel railway to ask for fixed data. When we get it I'll put everything in our DB.

GTFS Route Stories

I have written some code that creates "route stories".

Route stories can be calculated from stop_times in GTFS, but they are much more compressed (about 6MB of a text file instead for a few hundreds in the original file), and they are helpful for analysis. Here's a Google Doc that explains what they are and why they are helpful.

The next task would be to add them to the database. This requires

Adding an extra table, RouteStoryStops, with the following fields:

route_story_id     (integer)
arrival_offset        (integer, seconds since midnight, or varchar)
departure_offset     (integer, seconds since midnight, or varchar)
stop_id                 (integer, can be constrained with the stops table)
stop_sequence    (small integer)
pickup_type    (boolean)
drop_off_type    (boolean)

Adding the following fields to the trips table:

route_story_id     (integer)
start_time     (either integer, seconds since midnight, or varchar)

Adding the code that writes to the db...

Update nightly GTFS script to upload to S3

Server

Hi,
I am talking with Yehuda and he is preparing a server for us.
We must note that the prev server was diagnosed with a dangerous virus and was put to sleep by Yehuda.
So - no more viruses!
We are getting a ubuntu 16 server with nginx and postgres.
What are the first things we would like to do with it, other then creating the db?

When did the bus actually called at a stop?

This is the next task that we will need to tackle once we build up some data in the SIRI database.

SIRI data only gives estimate for the next arrival to each stop. What we should see (if we poll frequently enough), is a bus arrival time going down monotonously, then the bus at the stop, then the time jumping up.

In practice it is going to probably be more complicated than that because:

The algorithm that estimates next arrival is VERY often wrong
Our planned polling time is currently set to 5 minutes.

Once we accumulate enough data, we need to analyse it and see how best to estimate calling time.

Find trip id for SIRI arrival

We aggregate bus trips in real rime from the SIRI protocol. The results are in the siri_arrivals table.
We also have data on planned trips in our GTFS tables.

We want to link the data in siri_arrivals to the GTFS trips table, by adding a trip_id column to siri_arrivals.

This should be possible based using the following data from siri_arrival:

line_ref (which should be GTFS route_id)
day of week
origin_aimed_departure_time

There should be one and only one trip with the given route_id, that runs on the given day of week and departs from the first stop at the given aimed departure time.

This task is to create a script\query that:

adds a trip_id field to siri_arrivals
fills it with the correct trip_id

If you only require sql to do it, add the query file to the /postgres folder. If you need to write Python code, put it under /siri.

Issues you may run into

We aren't sure how reliable the origin_aimed_departure_time departure time is. Does it actually contain sensible data?
Our GTFS tables contain data from a single GTFS file (from 8/2/2016). GTFS is published nightly and may include changes in the planned schedule. It's possible that some of the siri_arrivals data, especially towards the end of the month, won't match the data in our GTFS tables. If you get low match data.

What's would you need to implement this?

SQL, maybe some Python.
Some understanding of GTFS, particularly the gtfs_trips, gtfs_calendar and gtfs_stop_times tables. Get started by reading this. As a starting point, this query returns trip ids and the start time for each trip, and also on which days it's running:

select route_id, trip_id, DEPARTURE_TIME, sunday, monday, tuesday, wednesday, thursday, friday, saturday
from gtfs_trips 
join gtfs_stop_times on gtfs_stop_times.trip_id = gtfs_trips.trip_id
join gtfs_calendar on gtfs_calendar.service_id = gtfs_trips.service_id
where route_id = 7020
and stop_sequence= 1

Basic understanding of the siri_arrivals table (see #31)
It's possible that this task can be implemented as an SQL query. You can use our re:dash to play with queries against our database.
If you need to write a script, you will probably need a local development database to play with (#32)
Clone the repository and use pull requests to submit your code.

re:dash access to our DB

Re:dash is a really useful tool that allows people to run queries, share the queries and also do visualisations and dashboards. Various other Sadna projects are using it heavily as far as I understand.

Ideally this could be a way to look at all our data (gtfs, siri & bus2train).

What seems to be required:

Creating a read-only user on our database. Something like:

create role UESRNAME with login password 'PASSWORD';
GRANT CONNECT ON DATABASE obus TO USERNAME;
GRANT USAGE ON SCHEMA public TO USERNAME;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO USERNAME;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO USERNAME;

Opening the database for access from remote addresses. Preferably not from the entire world wide internet though. See for example here: http://www.thegeekstuff.com/2014/02/enable-remote-postgresql-connection/
Sending the details to @akariv so he can set up re:dash for us.

fix DB connection

real time siri feature should reuse the siri db connection function from siri/db.py

Which bus stops serve train stations?

For the project with 15minute where we evaluate transfers between busses and trains, it's important to know which bus stops are within walking distance to train stations.

For now we just use a simplification, and use stops that are up to 300m in straight line. However of course straight line distance is often much shorter than walking distance, and we get many more stops that we should.

We had an idea to use Google Maps navigation queries to get walking distance, but this is problematic in the other direction. Google often doesn't recognise open spaces and footpaths that can be used by pedestrians. So their walking distance are to long. For example they give 350m for this bus stop near Tel-Aviv University Station

Where in fact the distance should be more like 230m:

Unless someone has a better idea, I think we need a volunteer to go over 56 train stations using maps and\or satellite imagery and make an informed opinion about which bus stations serve each one.

is origin_aimed_departure_time in SIRI the GTFS first stop time?

Write a SQL query for trips that did not departed

The query should compare the real time information of the first station with the GTFS time and return the trips that are 8 minute late or more.

Definition Of Done

Save the SQL query in the new re:dash. [https://app.redash.io/hasadna/]

Filter relevant GTFS data for tommorow

Definition Of Done:

Filter all record from GTFS files that relevant for tomorrow.
Save the relevant record in different files/table.

User Analysis Tool

The need - We were approached by Tony Weisbuch to create an app which will allow citizen to analyze by themselves transport data, specifically regarding Public Transit in the Suburbs. See Mr. Weisbuch full email here.
Before creating an app or UI we need to think what are the capabilities/ services we want to provide in such tool.
Here are some Possible analysis tools/ queries:

Public Transport from and to a certain area - what buses serve the area? In what frequency? Where can you get by using a bus? (one bus? exchanges?)
Public transport between two areas - What buses can you take? How long will it take you? Frequency?
Bus2Train - combine queries from bus2Train research and use them for specific areas and lines?
Specific lines analysis?

These are only initial ideas and this issue's purpose is to create a discussion what kind of tools we can give to the public.

Write down all idea and comments which come up 💡

bus2train: enhance the trains vs bus arrivals spreadsheet

A few months ago I prepared this spreadsheet for 15 minutes.

It gives for each 1 hour window, the number of train arrivals to the station, and the number of bus arrivals. The ratio gives a crude estimation of how good the train station is served by buses.

There are a few updates that this file needs:

The table is based on straight line distance between bus stops and stations. Now that we have walking distance data (available in the db), we should re-compute using only buses with reasonable (up to 300m) walking distance.
We were asked to take the "importance" (number of passengers) in each station into account. We have data on number of passengers here.
I think the first thing to do is just to export the average number of passengers as an extra sheet in the Excel output.

The code I used to create the file initially is here, but I think it would make more sense to re-write it from scratch using the GTFS the DB.

When headsign includes quotation mark we don't parse the trips file properly

For example:
3159,54524691,5036931_291217,נס ציונה _ בית ספר חב"ד,0,86320
Postgres COPY will parse the headsign field from the quote until the next quote it encounters. In many cases this is further down the file and all the trips in between will not be loaded.

Add route offset to siri_arrivals [depends on #29, #30]

siri_arrivals table contains the GPS locations of buses. As a step towards estimating the buses arrival time to stops, we need to find the route offset of the bus at that point:

The task is to:

create a query that adds a route_offset field to siri_arrivals
populates it with the correct values

Put the result query under the /postgres folder

How to do it?

Assuming tasks siri_arrivals has a vehicle_location_point field (issue #30) and gtfs_shape_lines table exists (issue #29), this should be pretty easy to achieve using PostGIS ST_LineLocatePoint query (see here).

What's would you need to implement this?

SQL, Some PostGIS (at least enough to run ST_LineLocatePoint). If you don't know any PostGIS these could help:
- Manual
- Tutorial
Some understanding of the siri_arrivals and gtfs_shape_lines table, though you really don't need to know much beyond the two geometry fields you're going to use.
You can probably create the query using the re:dash interface. If you need a local dev copy of the database, see #32
Clone the repository and use pull requests to submit your code.

Improve performance of the GTFS postgres insert script

The GTFS postgres insert script ps_insert works at about 1000 records / second on our server.

This is a bit slow considering that a single GTFS file can contains 10s of millions of records.

We should look at how to improve the performance. Is there some batch insert method that would be efficient? Could inserting directly from CSV work better?

Download GTFS data

Definition Of Done:

Download last GTFS zip file (Reuse the code that currently download the GTFS from MOT)
Check if the file is processed already (gtfs_files table).
unzip the zip file in given path.

[Transferred from GitLab] Creating GTFS DB

[Originally posted by @nitzangur ]

A basic scheme is now available under gtfs/.

Next steps:

Scheme level:

Validate index-creation syntax.
Deside which indexes are wanted. 3 Create enums for relevant fields (inline comments).
Make sure type-sizes are fine.
Consider using TinyInt instead of INT when relevant.
Check requested type-size for coordinates.
Check requested type-size for shape_dist_traveled.

IT level:

Create a Postgresql DB based on this scheme.
Make sure the DB is accessible for developing and querying.

Code level:

Change the code under gtfs/parser/ so it will parse GTFS data into the new DB instead of into SQLite DB.
Migrate some other GTFS parsing code (e.g. route stories).
Create a cron to automatically parse the GTFS data every day. (Yehuda - is there any old cron to work with?)

Feel free to add/change some steps. Please response to this issue if you intend to perform some of these steps. (I guess those remarks are relevant to all of the issues.)

Document the siri_arrivals table for beginners

Sample data does not include 'shape_dist_traveled' column

Because of the missing column, the insert_gtfs.sql script fails with the error:
ERROR: missing data for column "shape_dist_traveled" CONTEXT: COPY stop_times, line 2: "20132304_260516,09:11:35,09:11:35,1321,44,0,0"

This is can be fixed by using a newer data set which includes the missing column.
@avielb and I can submit a PR with a new sample if needed.

Simplify shapes

GTFS shape.txt contains a list of coordinates for transport routes.

These lists aren't efficient for drawing the shape on the map. They are over-specified (have too many points). See for example this map of line 28א in Jerusalem (picked at random).

This is a problem if want to to presents the lines on a map.

Ramer–Douglas–Peucker is a pretty straight forward algorithm to simplify polylines . There's a Python implementation here.

What needs doing is:

Add a new database table simplified_shape with the same table definition as shapes
Read all shapes from the shapes table that are missing from simplified_shape
Run RDP
Write to simplified_shapes

Since the data is still not in the DB, an intermediate task would be:

Read the shapes using https://github.com/hasadna/open-bus/blob/master/gtfs/parser/gtfs_reader.py
Run RDP
Dump results to a CSV file in the same format as the original shape.txt (shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence)

Help wanted:
This is an independent task that doesn't really require coordination with the rest of the project and is perfect for a new volunteer. I would estimate it 1-3 work days, depending on experience.

Create a script to archive GTFS nightly

We currently rely on the GTFS archive of Open Train.

Unfortunately they can delete files from there without consulting us (it's much less important to them currently than it is to us).

We need to write a script that downloads the files (Open train script is a good starting point) and deploy it to run nightly.

hasadna / open-bus Goto Github PK

open-bus's Introduction

Open Bus

Where does the data come from?

Want to help?

open-bus's People

Contributors

Stargazers

Watchers

Forkers

open-bus's Issues

Contents

For code contributions

Recommend Projects

Recommend Topics

Recommend Org