
BaiduTraffic's Introduction

Deep Sequence Learning with Auxiliary Information for Traffic Prediction. KDD 2018. (Accepted)

Binbing Liao, Jingqing Zhang, Chao Wu, Douglas McIlwraith, Tong Chen, Shengwen Yang, Yike Guo, Fei Wu

Binbing Liao and Jingqing Zhang contributed equally to this article.

Paper Link: arXiv or KDD18

Contents

  1. Abstract
  2. Q-Traffic Dataset
  3. Code
  4. Citation
  5. Poster and Video

Abstract

Predicting traffic conditions from online route queries is a challenging task, as there are many complicated interactions over the roads and crowds involved. In this paper, we intend to improve traffic prediction by appropriate integration of three kinds of implicit but essential factors encoded in auxiliary information. We do this within an encoder-decoder sequence learning framework that integrates the following data: 1) offline geographical and social attributes, for example the geographical structure of roads or public social events such as national celebrations; 2) road intersection information, since traffic congestion generally occurs at major junctions; 3) online crowd queries, for example, when many online queries are issued for the same destination due to a public performance, the traffic around that destination will potentially become heavier after a while. Qualitative and quantitative experiments on a real-world dataset from Baidu have demonstrated the effectiveness of our framework.

Q-Traffic Dataset

We collected a large-scale traffic prediction dataset, Q-Traffic, which consists of three sub-datasets: the query sub-dataset, the traffic speed sub-dataset, and the road network sub-dataset. We also compare the released Q-Traffic dataset with other datasets used for traffic prediction.

Access to the Q-Traffic Dataset

The dataset has been updated and is now available at BaiduNetDisk (extraction code: umqd). A backup link is also provided.

For those who have downloaded the old dataset, we strongly suggest re-downloading the updated one. The old dataset at Baidu Research Open-Access Dataset (BROAD) contains some duplicated hashed_link_id values due to the hash function. The hashed_link_id has therefore been removed in the updated dataset; we now use link_id directly, which is consistent with the intermediate_files.

The intermediate data files (after pre-processing) are available at intermediate_files, so you can train the model directly.

Please feel free to raise an issue if you have any questions.

Query Sub-dataset

This sub-dataset was collected in Beijing, China, between April 1, 2017 and May 31, 2017, from Baidu Map. The detailed pre-processing of this sub-dataset is described in the paper. The query sub-dataset contains about 114 million user queries, each of which records the starting timestamp, the coordinates of the starting location, the coordinates of the destination, and the estimated travel time (in minutes). Some query samples are shown below:

2017-04-01 19:42:23, 116.88 37.88, 116.88 37.88, 33

2017-04-01 18:00:05, 116.88 37.88, 116.88 37.88, 33

2017-04-01 01:14:08, 116.88 37.88, 116.88 37.88, 33

..., ..., ..., ..., ...
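
As a quick illustration, the sketch below loads query records of this form into a pandas DataFrame. It is not part of the released code; the file name query_sub-dataset and the column names are assumptions based on the samples above.

    import pandas as pd

    # Hedged sketch, not the authors' code: file name and column names are assumptions.
    cols = ["start_time", "start_loc", "dest_loc", "travel_time_min"]
    queries = pd.read_csv("query_sub-dataset", header=None, names=cols,
                          skipinitialspace=True, parse_dates=["start_time"])

    # Split the "lon lat" strings into numeric columns.
    queries[["start_lon", "start_lat"]] = queries["start_loc"].str.split(expand=True).astype(float)
    queries[["dest_lon", "dest_lat"]] = queries["dest_loc"].str.split(expand=True).astype(float)

    print(queries.head())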

Traffic Speed Sub-dataset

We also collected the traffic speed data for the same area and during the same time period as the query sub-dataset. This sub-dataset contains 15,073 road segments covering approximately 738.91 km. Figure 1 shows the spatial distribution of these road segments.


Figure 1. Spatial distribution of the road segments in Beijing

They are all within the 6th Ring Road (bounded by the lon/lat box <116.10, 39.69, 116.71, 40.18>), which is the most crowded area of Beijing. The traffic speed of each road segment is recorded per minute. To make the traffic speed predictable, for each road segment we smooth the traffic speed with a simple moving average over a 15-minute time window and sample the traffic speed every 15 minutes. Thus, there are 5,856 ($61 \times 24 \times 4$) time steps in total, and each record is represented as road_segment_id, time_stamp (in [0, 5856)) and traffic_speed (km/h).

There are some traffic speed samples as follows:

15257588940, 0, 42.1175  

..., ..., ...  
  
15257588940, 5855, 33.6599  

1525758913, 0, 41.2719  

..., ..., ...  
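
For reference, here is a minimal sketch of the 15-minute moving-average smoothing and sampling described above, assuming per-minute raw speeds for a single road segment held in a pandas Series; it is illustrative only and not the released preprocessing code.

    import pandas as pd

    # Hedged sketch of the smoothing/sampling step, not the released preprocessing code.
    # Placeholder per-minute speeds (km/h) for one road segment over the 61-day period.
    minute_index = pd.date_range("2017-04-01", periods=61 * 24 * 60, freq="1min")
    raw_speed = pd.Series(40.0, index=minute_index)

    # Simple moving average with a 15-minute window, then sample every 15 minutes.
    smoothed = raw_speed.rolling(window=15, min_periods=1).mean()
    sampled = smoothed.resample("15min").first()

    # 61 days x 24 hours x 4 samples/hour = 5856 time steps, matching the sub-dataset.
    assert len(sampled) == 61 * 24 * 4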

Road Network Sub-dataset

Due to the spatio-temporal dependencies of traffic data, the topology of the road network can help predict traffic. Table 1 shows the fields of the road network sub-dataset.


Table 1. Examples of geographical attributes of each road segment.

For each road segment in the traffic speed sub-dataset, the road network sub-dataset provides the starting node (snode) and ending node (enode) of the road segment, from which the topology of the road network can be built (see the sketch below). In addition, the sub-dataset provides various geographical attributes of each road segment, such as width, length, speed limit and the number of lanes. Furthermore, we also provide social attributes such as weekdays, weekends, public holidays, peak hours and off-peak hours.
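
A minimal sketch of how the topology might be reconstructed from the snode/enode fields follows. It is not the released preprocessing code; the file name, the whitespace-separated layout, and the assumption that the first four columns are link_id, snode, enode and length are illustrative and should be adjusted to the actual schema.

    import pandas as pd
    import networkx as nx

    # Hedged sketch: column positions and file layout are assumptions, not the real schema.
    roads = pd.read_csv("road_network_sub-dataset", sep=r"\s+", header=None).iloc[:, :4]
    roads.columns = ["link_id", "snode", "enode", "length"]

    # Each road segment becomes a directed edge from its starting node to its ending node.
    g = nx.DiGraph()
    for row in roads.itertuples(index=False):
        g.add_edge(row.snode, row.enode, link_id=row.link_id, length=row.length)

    print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")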

Comparison with Other Datasets

Table 2 shows a comparison of different datasets for traffic speed prediction. In the past few years, researchers have performed experiments with small and/or private datasets. The release of Q-Traffic, a large-scale publicly available dataset with offline (geographical and social attributes, road network) and online (crowd map queries) information, should help advance research on traffic prediction.


Table 2. Comparison of different datasets for traffic speed prediction.

Code

The source code has been tested with:

  • Python 3.5
  • TensorFlow 1.3.0
  • TensorLayer 1.7.3
  • numpy 1.14.0
  • pandas 0.21.0
  • scikit-learn 0.19.1

The structure of the code:

  • model.py: Implementation of deep learning models
  • train.py: Implementation of controllers for training and testing
  • baselines.py: Implementation of baseline models including RF and SVR
  • dataloader.py: Data processing and loading; may require changes depending on the data format
  • preprocessing: Data preprocessing and cleaning
  • others: utilities, playground, logging, data preprocessing

Citation

If you use our dataset, please cite the following publication:

@inproceedings{bbliaojqZhangKDD18deep,  
  title = {Deep Sequence Learning with Auxiliary Information for Traffic Prediction},  
  author = {Binbing Liao and Jingqing Zhang and Chao Wu and Douglas McIlwraith and Tong Chen and Shengwen Yang and Yike Guo and Fei Wu},  
  booktitle = {Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},  
  pages = {537--546},
  year = {2018},  
  organization = {ACM}  
}  

Poster and Video

  • You can find our KDD 2018 poster here.
  • You can find our KDD 2018 video here (also available on YouTube).


BaiduTraffic's Issues

Can't figure out how to run the code

I'd like to thank you first for the code and the paper, but I'm running into missing-file issues even though I changed the repository paths. Could you write quick line-by-line instructions for what to do before running train.py? It would be really helpful, thank you so much!

How to obtain a traffic flow dataset

Hello,
I see that a comparison of traffic datasets is given here, and the data are mainly road speeds. Is there any road traffic flow (volume) data? If not, are there other public datasets that provide road traffic flow data? Thanks!

Can't find intermediate file: query_distribution_beijing_1km_k_150_filtfilt.pkl

I can't find the intermediate file query_distribution_beijing_1km_k_150_filtfilt.pkl referenced in the dataloader:

    data = pickle.load(open(config.data_path + "query_distribution_beijing_1km_k_%d_filtfilt.pkl" % config.impact_k, "rb"),
                       encoding='latin1')

I tried to run get_query_distribution_feature_beijing_1km_sqe.py to generate it, but it takes a lot of time.
Is the file provided directly anywhere?

How to run

Hello, I looked at readme.md and found that it does not say how to run this code. What is the command to run it?
python train.py?
Thank you!

No node GPS in Road Network Sub-dataset

Hi, I have downloaded the dataset. In the road_network_sub-dataset, I cannot find the snodegps and enodegps mentioned in the ReadMe, nor any other file that maps a node id to its corresponding GPS coordinates.
Could you upload this part?

Traffic data missing for some road_segment_id

I read the road_segment_id values from neighbour_1km.txt and used them as keys to fetch speed values from the traffic_speed_sub-dataset, but some road_segment_id values cannot be found in the traffic_speed_sub-dataset. For example, reading one line in neighbour_1km.txt (all ids have already been mapped through link_id_hash_map):
['1597566463414', '1462215565312', '1503121983912', '1770727782763', '1462073565097', '1503110983921', '1597550463401', '1682917053527', '1687870507191', '1597560463393', '1770727782763']
1503110983921 is missing from the traffic_speed_sub-dataset.
We assume that every id in neighbour_1km.txt should be contained in the traffic_speed_sub-dataset. Could you please explain this?
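
One way to check this kind of mismatch is sketched below. It is not from the repository, and the assumed file formats (comma-separated speed records, whitespace-separated neighbour ids) are taken from the samples quoted in this README and this thread.

    # Hedged sketch: file names and formats are assumptions, not confirmed by the authors.
    with open("traffic_speed_sub-dataset") as f:
        speed_ids = {line.split(",")[0].strip() for line in f}

    with open("neighbour_1km.txt") as f:
        for line_no, line in enumerate(f, 1):
            neighbour_ids = line.split()
            missing = [i for i in neighbour_ids if i not in speed_ids]
            if missing:
                print("line %d: ids missing from the speed sub-dataset: %s" % (line_no, missing))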

Questions about Event Discovery Algorithm

Hello, I've read your paper "Deep Sequence Learning with Auxiliary Information for Traffic Prediction" and found that the code of the Event Discovery Algorithm may not be in accordance with its description in the paper.
@ /src/preprocessing/new_anomalty1109.py line 38-43
The code's rule to discover an event is:
d(x,y,t) - d(x,y,t-7d) > 300 && d(x,y,t) / d(x,y,t-7d) > 0.2 (the second condition makes no sense)
In the paper the rule to discover an event is:
d(x,y,t) - d(x,y,t-7d) > 300 && (d(x,y,t) - d(x,y,t-7d)) / d(x,y,t-7d) > 0.2
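
For clarity, a minimal sketch of the rule as stated in the paper (not the code in new_anomalty1109.py) could look like the following; the function name and the thresholds-as-parameters are illustrative.

    # Hedged sketch of the paper's event-discovery condition; not code from the repository.
    def is_event(d_t, d_t_minus_7d, abs_threshold=300.0, rel_threshold=0.2):
        """d_t: query count d(x, y, t); d_t_minus_7d: query count d(x, y, t - 7 days)."""
        absolute_increase = d_t - d_t_minus_7d
        if d_t_minus_7d <= 0:
            # Relative increase is unbounded when there were no earlier queries.
            return absolute_increase > abs_threshold
        relative_increase = absolute_increase / d_t_minus_7d
        return absolute_increase > abs_threshold and relative_increase > rel_threshold

    # Example: 400 queries now vs. 250 a week ago -> increase of 150 (< 300), so no event.
    print(is_event(400, 250))  # False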

What do snode and enode represent?

Hi. My questions may seem a bit trivial, but I could not find an explicit explanation in the paper.

  1. The entire network is composed of road segments, and snode and enode represent the endpoints of these segments? Are these the endpoints of the edges of the graph?
  2. In the paper, it is written that we are given the snode and enode GPS, but in issue #6 it is said that the GPS represents the middle point of a road segment.
  3. Where is the extra information on social attributes such as weekdays, weekends, public holidays, peak hours and off-peak hours described in the paper?

What does Link GPS represent?

Hi, I have downloaded the dataset and have a similar problem to the closed issue #5.
In the closed issue #5, you replied: "Yes, the gps of the link_id, snode_id, and enode_id are all in the link_gps file since they are all link."

But the link_id in the file 'link_gps' has 13 digits, whereas the node_id has only 10 digits. So it seems that the GPS of snode_id and enode_id is not contained in the link_gps file.

I have the following questions.
1) What does the link_gps represent? The middle point of a road segment?
2) How can I get the snodegps and enodegps information of the road segments?
3) According to your paper, "The origin traffic speed dataset contains the traffic speed of ∼450k road segments", but I only see 44,172, not ~450k, unique link_id items in both 'road_network_sub-dataset' and 'link_gps'.

Duplicate links in Road_network_sub-dataset but with different properties

I want to process your data to recover the graph structure, but I found that there are duplicate link_ids in road_network_sub-dataset, and the duplicates have completely different values of width, snodeid, enodeid or length. I am quite confused about how this could happen.
The following are some link_ids I found duplicated in road_network_sub-dataset:
[1014574024344, 1163808777119, 1573596569624, 1490171293608, 1144042225780, 1934286704917]
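
A small sketch of how such duplicates can be located is given below; it is not from the repository, and the whitespace-separated layout with link_id in the first column is an assumption.

    import pandas as pd

    # Hedged sketch: file name, separator and column position are assumptions.
    roads = pd.read_csv("road_network_sub-dataset", sep=r"\s+", header=None)
    link_id = roads.iloc[:, 0]

    duplicated_ids = link_id[link_id.duplicated(keep=False)].unique()
    print(len(duplicated_ids), "link_ids appear more than once")
    print(roads[link_id.isin(duplicated_ids)].sort_values(0).head(20))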

Some intermediate files cannot be found

Hi! I find that many of the data files loaded in [WideDeep_Controller] cannot be found...

    coarse_file = open(config.data_path + "wide_features/event_link_set_all_poi_type_feature_coarse_beijing_1km.pkl", "rb")
    fine_file = open(config.data_path + "wide_features/event_link_set_all_poi_type_feature_fine_beijing_1km.pkl", "rb")
    info_file = open(config.data_path + "wide_features/event_link_set_all_beijing_1km_link_info_feature.pkl", "rb")
    time_file = open(config.data_path + "wide_features/time_feature_15min.pkl", "rb")

How can I get these files? Thank you!

Still questions about enode and snode

#20
Hi, thank you for the answers in the earlier issues, but I still have some questions.
I am trying to recover the topology of the dataset. I want to construct a graph G(V,E) representing the spatial information, where V is the set of vertices and E the set of edges. At first I considered mapping snode and enode to vertices and links to edges, like this:
[attached sketch]
The outlines represent the true road, the thin lines can be seen as links, and the big dots can be seen as nodes (snode or enode). But it is said in issue #16 that snode and enode are also links, which confuses me: if snode and enode are also links, it may lead to ambiguous representations in situations like the one below:
[attached sketch]
The link in the middle is connected with 6 links, and I need 2 of them to represent the enode and snode of that link; I am not sure which links I should choose.
Using the illustration in issue #16 below: if we add link2, link3 and link4, then for link2 one of (link1, enode) could be the snode of link2, and one of (link3, link4) could be the enode of link2. If so, it will be really hard to recover the topology of the dataset in G(V,E) form, because I can only get edge information but not vertex information. And if all those snodes and enodes are links, I am not sure what should be considered as vertices in the graph.
[attached illustration]

Hoping for your reply.
Thank you.

Can't find data files

As written in dataloader.py (shown below), these data files, such as event_link_set_beijing_1km, event_traffic_beijing_1km_mv_avg_15min_completion.pkl, etc., cannot be found in BaiduTraffic or in the dataset (obtained from https://ai.baidu.com/broad/download?dataset=traffic).

    event_set_file = open(config.data_path + "event_link_set_beijing_1km", "r")
    traffic_data_file = open(config.data_path + "event_traffic_beijing_1km_mv_avg_15min_completion.pkl", "rb")

  1. Could you give suggestions on how to get these files?
  2. Is there a brief guide to the usage steps?

Thanks.

Question regarding baseline models

In the baseline models that you implemented (specifically RF and SVR), were they trained solely on the temporal data or on the entire spatio-temporal data?

Dataset

Hello, may I ask whether the dataset contains data from all road sections in Beijing, or has it been processed to include only data adjacent to special-event road sections?

No snode/enode coordinates in link_gps

Hi! While using the data today, I was surprised to find that the longitude/latitude coordinates of the snodeid and enodeid in the road_network_sub-dataset.v2 data do not exist in link_gps... What is the reason for this? Thanks!

Could not find GPS of node?

Hi, I have downloaded the dataset and have a similar problem to the closed issue #6.
I have mapped snode_id and enode_id to the 13-digit new link_id, but they are not contained in the link_gps file.

So how can I get the GPS of these node_ids?

Road segment information is not complete

Hello,

Thanks for the impressive work. I am working on a similar project and am interested in using your great dataset, but in your road-segment dataset there is only the start node of the street. In your paper you mention a few items, like roadGps, length, etc. So here are my questions; I would appreciate it if you could help me. I will definitely cite your paper and acknowledge you.
1. How can I get the main dataset for roads? I really need the street start and end points. I see you have some intermediate files, but I am outside of China and cannot get those files; also, in your intermediate files there is nothing about the GPS coordinates of the streets (both start and end).
2. I already downloaded your files from the backup link, but in that road-network sub-dataset a sample looks like this: "1144134225930 116.391026 39.922581". It is not as complete as described in the paper. Is there any way I can get the end points of the streets?

Highly appreciated in advance.

Questions about the GPS coordinates

I am a little confused about the GPS coordinates provided in the file 'link_gps'.
I've tried to plot the road locations on Google Maps, but there is some offset, and the plotted points often do not look like a road. So I'm wondering:
1. What coordinate system are they in? WGS-84, GCJ-02, or Baidu's BD-09?
2. Do the GPS coordinates in the dataset correspond to the true roads in reality? How can I get the specific road for a pair of GPS coordinates, or does this need to be kept secret?

Thank you in advance.

One file does not exist

The file "pagerank_1km.txt" does not exist in the directory. Could you tell me where I can find it?

Issues downloading the dataset from Baidu

Hello, I'm having an issue while downloading the dataset. I need to create an account on Baidu, and it does not seem to work because I do not have a Chinese phone number. Could someone help me with an easier way to download the dataset? Thank you so much.

What does the 'event_link_set_all_poi_type_feature_coarse_beijing_1km.pkl' mean?

I checked your code: in the function load_data you open the files "event_traffic_beijing_1km_mv_avg_15min_completion.pkl", "event_link_set_all_poi_type_feature_fine_beijing_1km.pkl" and "event_link_set_all_beijing_1km_link_info_feature.pkl", but where do these data files come from? In other words, how can I get these data files?
Thanks!
