:bus: Analysing Israel's public transport data
The GTFS is published by MoT nightly. Each file contains data for the next 60 days, but planning changes can occur: the GTFS files from January 1st and January 2nd may disagree on the trips planned for January 2nd. The data published on January 2nd should be considered more accurate because it is more up to date.
We need a script that:
An outline for the incremental db schema and logic for implementing the task can be found here.
An archive of GTFS files to test with is kept by Open Train.
This is the next task that we will need to tackle once we build up some data in the SIRI database.
SIRI data only gives an estimate for the next arrival at each stop. What we should see (if we poll frequently enough) is the bus's arrival time decreasing monotonically, then the bus at the stop, then the time jumping up.
In practice it is probably going to be more complicated than that because:
Once we accumulate enough data, we need to analyse it and see how best to estimate calling time.
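As a starting point, the "decreasing, then jumping up" pattern described above can be detected with a simple sketch. The function name and the input shape are hypothetical, not existing project code:

```python
def estimate_calling_times(polls):
    """Crude calling-time detection from SIRI polls for one stop.

    polls: list of (poll_time, predicted_arrival_time) tuples,
    ordered by poll_time. A bus is assumed to have called at the
    stop whenever the predicted arrival jumps *up* (the service
    switched to predicting the next bus)."""
    calls = []
    for (t_prev, eta_prev), (_t, eta) in zip(polls, polls[1:]):
        if eta > eta_prev:
            # The prediction jumped to a later arrival: the previous
            # poll was the last one that saw the departing bus.
            calls.append(t_prev)
    return calls
```

Real data will need smoothing on top of this (GPS noise alone can make the ETA wobble upward without a bus actually calling).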
We currently have code which queries the SIRI service, but we don't log whether each request succeeded or failed.
The mission is to add a feature that saves to the DB, for each request, whether it succeeded or failed.
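A minimal sketch of what such logging could look like. The siri_request_log table and its columns are placeholders for illustration, not the project's actual schema:

```python
from datetime import datetime, timezone

def log_siri_request(cursor, request_id, succeeded, error=None):
    """Record the outcome of one SIRI request in a (hypothetical)
    siri_request_log table; call this from the request wrapper with
    succeeded=True/False and the exception text on failure."""
    cursor.execute(
        "INSERT INTO siri_request_log"
        " (request_id, requested_at, success, error)"
        " VALUES (%s, %s, %s, %s)",
        (request_id, datetime.now(timezone.utc), succeeded, error))
```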
For more information about siri see documentation here and issue #10
Guidelines:
Definition Of Done:
[Originally posted by @nitzangur ]
A basic scheme is now available under gtfs/.
Next steps:
Scheme level:
IT level:
Code level:
Feel free to add/change some steps. Please respond to this issue if you intend to perform some of these steps. (I guess those remarks are relevant to all of the issues.)
We want to add geometry fields to our database in order to be able to run geo-queries using PostGIS.
The task is to create a script that:
shape_id (INTEGER NOT NULL) and shape_line (geometry)
The script location should be under the /postgres folder in the source code. If you need Python code, put the script under /gtfs/parser.
Additional pointers:
ALTER TABLE (see here why)
What would you need to implement this?
siri/fetch_and_store_arrivals.py is the script that fetches data from the SIRI server.
Currently it writes the results to a database. It would be helpful if it had an option to write the results to a flat file instead, so it can be used for debugging without a database installed.
This would require:
--use_file (a boolean) and --output_filename. These parameters can be saved to the connection_details dictionary. db.insert_arrivals could be used as reference.
gtfs_stop_times has a SHAPE_DIST_TRAVELED column which gives the distance travelled by the bus from the route origin to the current stop.
The task is to write a query that:
Keep the query as an .sql file under /postgres
You can get the length of the route by querying the SHAPE_DIST_TRAVELED of the last stop (highest STOP_SEQUENCE for the same TRIP_ID).
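The "length = SHAPE_DIST_TRAVELED of the last stop" rule above can be sketched in plain Python. The function and the row shape are illustrative only; the actual task asks for an .sql file:

```python
def route_lengths(stop_times):
    """Compute each trip's route length from gtfs_stop_times-like rows.

    stop_times: iterable of (trip_id, stop_sequence, shape_dist_traveled).
    The length is the SHAPE_DIST_TRAVELED of the stop with the highest
    STOP_SEQUENCE for that trip."""
    last = {}
    for trip_id, seq, dist in stop_times:
        # Keep only the stop with the highest stop_sequence per trip.
        if trip_id not in last or seq > last[trip_id][0]:
            last[trip_id] = (seq, dist)
    return {trip_id: dist for trip_id, (_seq, dist) in last.items()}
```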
What will you need to know?
siri_arrivals table contains the GPS locations of buses. As a step towards estimating the buses arrival time to stops, we need to find the route offset of the bus at that point:
The task is to:
Put the result query under the /postgres folder
How to do it?
What would you need to implement this?
Definition Of Done:
Hi,
I am talking with Yehuda and he is preparing a server for us.
We must note that the previous server was diagnosed with a dangerous virus and was put to sleep by Yehuda.
So - no more viruses!
We are getting a ubuntu 16 server with nginx and postgres.
What are the first things we would like to do with it, other than creating the db?
For example:
3159,54524691,5036931_291217,נס ציונה _ בית ספר חב"ד,0,86320
Postgres COPY will parse the headsign field from the quote until the next quote it encounters. In many cases this is further down the file and all the trips in between will not be loaded.
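One possible fix is to re-quote each line before handing the file to COPY, so a stray quote is doubled and the field is wrapped properly. A sketch, assuming the affected fields themselves contain no commas (the helper name is hypothetical):

```python
import csv
import io

def sanitize_gtfs_line(line):
    """Re-quote one CSV line so stray double quotes inside fields
    (common in Hebrew headsigns) survive Postgres COPY ... CSV."""
    # QUOTE_NONE makes the reader treat the stray quote as a literal
    # character instead of a field delimiter.
    fields = next(csv.reader([line], quoting=csv.QUOTE_NONE))
    out = io.StringIO()
    # QUOTE_MINIMAL wraps any field containing a quote and doubles the
    # quote ("" style), which COPY ... CSV parses correctly.
    csv.writer(out, quoting=csv.QUOTE_MINIMAL,
               lineterminator="").writerow(fields)
    return out.getvalue()
```

Alternatively, COPY's QUOTE option can be pointed at a character that never appears in the data, side-stepping the issue entirely.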
I feel our project could use some order-making, and a boost to its accessibility for newcomers and sparsely-contributing members. My first offer in this direction is to come up with a good CONTRIBUTING.md
file that will be suggested to every newcomer to read, and linked to on every issue, pull request and commit creation.
Besides pointing to the basic resources for getting started, it should include at least the following contribution-type guidelines -
Definition Of Done:
GTFS shape.txt contains a list of coordinates for transport routes.
These lists aren't efficient for drawing the shape on a map: they are over-specified (have too many points). See for example this map of line 28א in Jerusalem (picked at random).
This is a problem if we want to present the lines on a map.
Ramer–Douglas–Peucker is a pretty straightforward algorithm for simplifying polylines. There's a Python implementation here.
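For reference, a minimal plain-Python sketch of the algorithm (the linked implementation is likely more robust; this recursive version is fine for illustration but may hit recursion limits on very long shapes):

```python
def rdp(points, epsilon):
    """Ramer-Douglas-Peucker polyline simplification.

    points: list of (x, y) tuples; epsilon: max allowed deviation."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]

    def dist(p):
        # Perpendicular distance of p from the chord (first, last).
        x0, y0 = p
        num = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        den = ((y2 - y1) ** 2 + (x2 - x1) ** 2) ** 0.5
        if den == 0:  # degenerate chord: fall back to point distance
            return ((x0 - x1) ** 2 + (y0 - y1) ** 2) ** 0.5
        return num / den

    index, dmax = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
                      key=lambda t: t[1])
    if dmax <= epsilon:
        # Everything between the endpoints is within tolerance: drop it.
        return [points[0], points[-1]]
    # Keep the farthest point and recurse on both halves.
    return rdp(points[:index + 1], epsilon)[:-1] + rdp(points[index:], epsilon)
```

Note that GTFS shapes are in lat/lon degrees, so epsilon must be chosen in degrees (or the points projected first).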
What needs doing is:
Since the data is still not in the DB, an intermediate task would be:
Help wanted:
This is an independent task that doesn't really require coordination with the rest of the project and is perfect for a new volunteer. I would estimate it 1-3 work days, depending on experience.
The archive is here:
http://gtfs.otrain.org/static/archive/
Need to rename files to YYYY-MM-DD.zip
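A sketch of the renaming rule, assuming the source names embed the date as an 8-digit YYYYMMDD run (the archive's actual naming may differ; adjust the pattern accordingly):

```python
import re

def normalized_name(filename):
    """Map an archive file name to 'YYYY-MM-DD.zip'.

    Assumes the date appears as YYYYMMDD somewhere in the name,
    e.g. 'gtfs_20180101.zip' -> '2018-01-01.zip'."""
    m = re.search(r"(\d{4})(\d{2})(\d{2})", filename)
    if not m:
        raise ValueError("no YYYYMMDD date found in %r" % filename)
    return "-".join(m.groups()) + ".zip"
```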
We have received (sort of, see below) data from Israel Railways about the number of people starting and ending their journeys at each train station, for each date between April and October 2016, for each hour of the day.
We have asked for the data in order to understand which stations need better bus service, and when (this is for the bus2train project). However, I think this is the first time this kind of data has been made public, and there's probably more interesting stuff to do with it.
The original file is here: https://drive.google.com/open?id=0B_MFjF7hDGLcRGVKaklvSERWUmM
And there's a CSV of the data here https://drive.google.com/file/d/0ByIrzj3OFMnIMGxWQk5RaFM5Yjg
Unfortunately there are a few problems with the data:
We have written to Israel Railways to ask for fixed data. When we get it, I'll put everything in our DB.
The real-time SIRI feature should reuse the SIRI DB connection function from siri/db.py.
As part of checking bus/train connectivity for 15 minutes, we want to understand what parts of the country are easily accessible from train stations.
The idea is
I have initial results for this, created with code in my old repository. There is a map of bus stops near train stations, and there is a map of stops from which one can reach train stations.
I am working now on:
Next tasks:
15 minutes has asked for the following features:
Achieving all this seems to me to require a dynamic map (e.g. Google maps with some kind of backend). Will be happy for ideas about how best to do it (or help in doing it...)
The query should compare the real-time information of the first station with the GTFS time and return the trips that are 8 minutes late or more.
Definition Of Done
Definition of Done
Because of the missing column, the insert_gtfs.sql
script fails with the error:
ERROR: missing data for column "shape_dist_traveled" CONTEXT: COPY stop_times, line 2: "20132304_260516,09:11:35,09:11:35,1321,44,0,0"
This can be fixed by using a newer data set which includes the missing column.
@avielb and I can submit a PR with a new sample if needed.
(note: this task is a little similar to #28, but is slightly simpler because indexing isn't required, and also has higher priority)
We want to add geometry fields to our database in order to be able to run geo-queries using PostGIS.
The task is to create a script that:
The script location should be under the /postgres folder in the source code.
Some additional pointers:
ALTER TABLE (see here why)
What you need to know to implement this task
At least since 2017-12-29, the trips file header is:
route_id,service_id,trip_id,trip_headsign,direction_id,shape_id
On 2017-11-26 it was:
route_id,service_id,trip_id,direction_id,shape_id
As a result we get a parsing error in insert_gtfs.sql:
[snip]
********** importing trips **********
CREATE TABLE
Time: 2.696 ms
ALTER TABLE
Time: 0.639 ms
ERROR: extra data after last expected column
CONTEXT: COPY gtfs_trips, line 2: "2946,54524100,7141464_040218,דרך הטייסים-נווה חן,0,72337"
Time: 115.895 ms
[snip]
and the trips table is not populated:
obus=> select count(*) from gtfs_trips; count
-------
0
(1 row)
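One way to make the load robust to such header changes is to derive the COPY column list from the file's own header. A sketch; only the gtfs_trips table name comes from the project, the statement shape would need to be merged into insert_gtfs.sql:

```python
def copy_trips_sql(header_line):
    """Build a COPY statement whose column list mirrors the file
    header, so newly added columns such as trip_headsign don't break
    the load (the table must already contain those columns)."""
    columns = [c.strip() for c in header_line.strip().split(",")]
    return ("COPY gtfs_trips (%s) FROM STDIN WITH (FORMAT csv, HEADER true)"
            % ", ".join(columns))
```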
Need to implement a function which gets a bus route + departure stop and returns a list of stops for that route.
Originally posted by @johananl
I have written some code that creates "route stories".
Route stories can be calculated from stop_times in GTFS, but they are much more compressed (about 6 MB of text instead of a few hundred MB for the original file), and they are helpful for analysis. Here's a Google Doc that explains what they are and why they are helpful.
The next task would be to add them to the database. This requires
route_story_id (integer)
arrival_offset (integer, seconds since midnight, or varchar)
departure_offset (integer, seconds since midnight, or varchar)
stop_id (integer, can be constrained with the stops table)
stop_sequence (small integer)
pickup_type (boolean)
drop_off_type (boolean)
route_story_id (integer)
start_time (either integer, seconds since midnight, or varchar)
We finally have direct access to the SIRI interface! It's time to start using it!
I suggest we install a script that polls the service regularly (e.g. every minute) and dumps the data to database. As far as I understand, we already have the technology to do it.
This would require (I think):
Which stops to poll?
Ideally I would like to poll all the stops all the time. My thinking is: we should build the biggest database we can, because we don't know which parts of the data will become useful.
However I think there's some limit on the amount of data you can poll. If we hit that, I suggest we start by focusing on bus routes that serve train stations, since this will be helpful for the Bus2Train task. I'll soon post a list of these stops here.
We could really use some help recording what we are working on, and what we plan or wish to work on in the future. What is better than a good ol' roadmap for the job!
@AvivSela and I started working on this at last Monday's meetup, and I will open a pull request with a ROADMAP.md file which we can all discuss.
Found this short guide useful: https://mozillascience.github.io/working-open-workshop/roadmapping/
Write a web page that displays data from an SQL query.
Definition Of Done
We aggregate bus trips in real time from the SIRI protocol. The results are in the siri_arrivals table.
We also have data on planned trips in our GTFS tables.
We want to link the data in siri_arrivals to the GTFS trips table, by adding a trip_id column to siri_arrivals.
This should be possible based on the following data from siri_arrivals:
There should be one and only one trip with the given route_id, that runs on the given day of week and departs from the first stop at the given aimed departure time.
This task is to create a script/query that:
If you only require sql to do it, add the query file to the /postgres folder. If you need to write Python code, put it under /siri.
Issues you may run into
What would you need to implement this?
```sql
select route_id, trip_id, DEPARTURE_TIME, sunday, monday, tuesday, wednesday, thursday, friday, saturday
from gtfs_trips
join gtfs_stop_times on gtfs_stop_times.trip_id = gtfs_trips.trip_id
join gtfs_calendar on gtfs_calendar.service_id = gtfs_trips.service_id
where route_id = 7020
and stop_sequence = 1
```
Re:dash is a really useful tool that allows people to run queries, share the queries and also do visualisations and dashboards. Various other Sadna projects are using it heavily as far as I understand.
Ideally this could be a way to look at all our data (gtfs, siri & bus2train).
What seems to be required:
```sql
create role USERNAME with login password 'PASSWORD';
GRANT CONNECT ON DATABASE obus TO USERNAME;
GRANT USAGE ON SCHEMA public TO USERNAME;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO USERNAME;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO USERNAME;
```
This is a request that came up in a discussion in the Tapuz Public Transport forum. It's tangent to our current tasks, but I thought I'd document it as a "product backlog" item.
The idea is to use our daily GTFS download + DB insert script to also update Open Street Map with any changes to bus stations and stops. At the minimum we could just make sure that all the stops exist on the map and are in the correct location. We could probably also add information about which bus lines stop there.
The first task is to research how to technically do the import (see link below). It's also probably best to announce the intention of doing that in the Israel OSM forum and ask for advice/help. I suggest documenting the progress.
Resources:
Help wanted:
This is a relatively independent task that doesn't require a lot of coordination with the rest of the project. The first step involves research rather than diving into code.
The need - We were approached by Tony Weisbuch to create an app which will allow citizens to analyze transport data by themselves, specifically regarding Public Transit in the Suburbs. See Mr. Weisbuch's full email here.
Before creating an app or UI we need to think what are the capabilities/ services we want to provide in such tool.
Here are some possible analysis tools/queries:
These are only initial ideas and this issue's purpose is to create a discussion what kind of tools we can give to the public.
Write down all ideas and comments which come up 💡
Note: this task is a little similar to #30, except it also includes indexing. This isn't as urgent though, because we can use the shape_dist_traveled column in gtfs_stop_times to start with, rather than calculating the route offset.
We want to add geometry fields to our database in order to be able to run geo-queries using PostGIS.
The task is to create a script that:
The script location should be under the /postgres folder in the source code.
Some additional pointers:
ALTER TABLE (see here why)
What you need to know to implement this task
Create reports with the data we have regarding lines: 480, 19, 947.
The reports should include the following analysis:
If you have other ideas you are more than welcome to add them! 🥇
siri_arrivals has a huge (and hopefully growing) number of records, so we need to be more careful about the size of each record.
We need to make some changes to the current table as well as to the code that creates the table:
Please add a comment if you would like to join the repository as a collaborator. This will allow you to assign tasks to yourself, and also work with the Kanban boards.
Preferably talk to us first (e.g. by filling out the Public Knowledge Workshop's volunteers questionnaire, meeting us during the development meetings...)
Otherwise please explain in your comment what's your interest in the project and how you plan to contribute!
Remember - you can always contribute by cloning the repository and posting pull requests, as well as by commenting on issues.
Definition Of Done:
Tasks
The goal of this task is to build a table with at least the following fields:
Write a script that outputs that to a CSV (no need to write to DB at this point). The script should go under /siri .
How to do this
Assuming
You should be able to do something like:
For each trip_id in siri_arrivals, build:
For each stop in the planned stops, find the last known location before it and the first known location after it. If the previous known location was at time t1 and offset x1, the next known location was at time t2 and offset x2, and the stop offset is xs, then
ts = t1 + (xs - x1) / (x2 - x1) * (t2 - t1)
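A sketch of this linear interpolation; note the estimated stop time is t1 plus the stop's proportional share of the interval between the two known locations:

```python
def interpolate_stop_time(t1, x1, t2, x2, xs):
    """Estimate when the bus passed route offset xs, given a known
    location (t1, x1) before the stop and (t2, x2) after it, by
    linear interpolation between the two known points."""
    return t1 + (xs - x1) / (x2 - x1) * (t2 - t1)
```

For example, a bus seen at offset 0 at t=0 and at offset 100 ten seconds later is estimated to have passed the stop at offset 50 at t=5.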
What do I need to know
A few months ago I prepared this spreadsheet for 15 minutes.
It gives, for each 1-hour window, the number of train arrivals at the station and the number of bus arrivals. The ratio gives a crude estimate of how well the train station is served by buses.
There are a few updates that this file needs:
The table is based on straight-line distance between bus stops and stations. Now that we have walking-distance data (available in the db), we should re-compute using only buses within a reasonable (up to 300m) walking distance.
We were asked to take the "importance" (number of passengers) in each station into account. We have data on number of passengers here.
I think the first thing to do is just to export the average number of passengers as an extra sheet in the Excel output.
The code I used to create the file initially is here, but I think it would make more sense to re-write it from scratch using the GTFS data in the DB.
The GTFS postgres insert script ps_insert works at about 1000 records / second on our server.
This is a bit slow considering that a single GTFS file can contain tens of millions of records.
We should look at how to improve the performance. Is there some batch insert method that would be efficient? Could inserting directly from CSV work better?
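One candidate worth benchmarking is PostgreSQL's COPY protocol via psycopg2's copy_from, which avoids per-row INSERT overhead. A sketch; the table and column names in the usage comment are illustrative:

```python
import io

def rows_to_copy_buffer(rows):
    """Serialize rows into the tab-separated text format expected by
    psycopg2's cursor.copy_from. NULLs become \\N, COPY's default
    null marker."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join("\\N" if v is None else str(v) for v in row))
        buf.write("\n")
    buf.seek(0)
    return buf

# Usage sketch (connection, table and columns are assumptions):
# cur.copy_from(rows_to_copy_buffer(rows), "gtfs_stop_times",
#               columns=("trip_id", "stop_id", "stop_sequence"))
```

String values containing tabs or newlines would need escaping on top of this; for trusted GTFS fields the sketch above is usually enough.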
Find an easy way for new volunteers to have write access to a copy of the database.
Preferably not the entire database but a subset (e.g. siri_arrivals from a single day + related gtfs data)
Can we put it into a docker container based on this?
Alternatively we need good instructions on how to set up a database.
For the project with 15minute where we evaluate transfers between busses and trains, it's important to know which bus stops are within walking distance to train stations.
For now we just use a simplification and take stops that are up to 300m away in a straight line. However, straight-line distance is of course often much shorter than walking distance, so we get many more stops than we should.
We had an idea to use Google Maps navigation queries to get walking distances, but this is problematic in the other direction: Google often doesn't recognise open spaces and footpaths that can be used by pedestrians, so their walking distances are too long. For example they give 350m for this bus stop near Tel-Aviv University Station
Where in fact the distance should be more like 230m:
Unless someone has a better idea, I think we need a volunteer to go over 56 train stations using maps and\or satellite imagery and make an informed opinion about which bus stations serve each one.
Some time ago we were asked by Gil from 15 minutes to analyse the transfers between buses and trains at train stations. His request was to find a metric for how well the buses are coordinated with the trains at different stations and at different times of day. This can have value both for PR and for prioritising the work with the ministry of transport.
Here's a set of files containing arrival of buses and trains to train stations on Thursday, 2016-9-1. It was created from GTFS data using the calling_at_station module.
This task is rather open ended: can anyone look at these files and think of ways to analyse the data and create useful metrics (or even visualisations?) that could provide insights on where and when the coordination of trains and buses is especially problematic?
We currently rely on the GTFS archive of Open Train.
Unfortunately they can delete files from there without consulting us (it's much less important to them currently than it is to us).
We need to write a script that downloads the files (the Open Train script is a good starting point) and deploy it to run nightly.
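A minimal sketch of such a script. The download URL is an assumption and should be checked against the actual MoT endpoint (the Open Train script linked above is the authoritative starting point):

```python
import datetime
import urllib.request

# Assumed MoT download location; verify before deploying.
GTFS_URL = "ftp://gtfs.mot.gov.il/israel-public-transportation.zip"

def target_filename(date=None):
    """Archive name in the agreed YYYY-MM-DD.zip format."""
    date = date or datetime.date.today()
    return date.strftime("%Y-%m-%d") + ".zip"

def download_daily_gtfs(archive_dir="."):
    """Fetch today's GTFS file into the archive directory.
    Intended to run nightly, e.g. from a cron job."""
    path = "%s/%s" % (archive_dir, target_filename())
    urllib.request.urlretrieve(GTFS_URL, path)
    return path
```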