A set of Python scripts for parsing large DataSift streaming corpora drawn from several sources. Example deployment: the UN Global Pulse climate monitor.
- `python process_ds_files.py` — Cycles through all files matching `DataSift*json` and places messages in daily files. Also counts mentions, hashtags and other attributes. Serialises data in `counters.dat`. Deleted tweets that appear in the stream are written to `deletions.csv`. (A sketch of the daily-bucketing step follows this list.)
- `python process_deletions.py` — Takes the tweet IDs in `deletions.csv`, removes the corresponding messages from the daily files and adjusts the counters.
- `python get_top_tweets.py` — Looks through the daily files produced as output from `process_ds_files.py` and counts tweets from the last n days. Writes the IDs out to a file for embedding.
- `query_dump.ipynb` — Reads in the serialised data from `counters.dat` and produces plots interactively.
- `make_plots.py` — Reads in the serialised data from `counters.dat` and produces plots and data files for web pages. Pass in the directory holding the pickle file with `-d <dataDirectory>`.
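The daily-file step might look roughly like the sketch below. It is a minimal illustration only: the line-delimited JSON input, the `interaction.created_at` field and its RFC 2822 timestamp format are assumptions about the DataSift payload, and `bucket_daily` is a hypothetical helper rather than the actual function in `process_ds_files.py`.

```python
import glob
import json
import os
import re
from email.utils import parsedate_to_datetime

DATE_DIR = re.compile(r"20[0-9]{2}-[0-9]{2}$")  # directories such as 2014-03

def bucket_daily(data_directory):
    """Append each message to a YYYY_MM_DD.json file in data_directory."""
    for entry in sorted(os.listdir(data_directory)):
        subdir = os.path.join(data_directory, entry)
        if not (DATE_DIR.search(entry) and os.path.isdir(subdir)):
            continue
        for path in glob.glob(os.path.join(subdir, "DataSift*json")):
            with open(path) as f:
                for line in f:
                    try:
                        msg = json.loads(line)
                    except ValueError:
                        continue  # skip truncated or malformed lines
                    created = msg.get("interaction", {}).get("created_at")
                    if not created:
                        continue
                    day = parsedate_to_datetime(created).strftime("%Y_%m_%d")
                    with open(os.path.join(data_directory, day + ".json"), "a") as out:
                        out.write(json.dumps(msg) + "\n")
```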
# Usage
- The location of the data is specified by `dataDirectory`, a compulsory argument to `process_ds_files.py` (the implied argument handling is sketched after this list). If several corpora exist (typically corresponding to different languages), invoke the script separately for each one; the only overhead of running it multiple times is re-parsing the geolocation world pickle file. All directories matching `20[0-9]{2,2}-[0-9]{2,2}` will be examined and all files within them matching `DataSift*json` will be considered. All output files are produced in `dataDirectory`.
- To restrict to content geolocated to particular countries, pass in a list of two-letter ISO country codes: `python process_ds_files.py -C GB ID FR`
- To produce a map of content snapped to particular cities for use in the DC.js dashboard, supply a list of cities: `python process_ds_files.py -c cities.csv`
- To clean existing output files, pass the `--clean` flag.
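Putting those options together, the command-line interface could be wired up as below. The flags `-C`, `-c` and `--clean` and the positional `dataDirectory` come from the usage above; the long option names and help strings are illustrative assumptions, not the script's actual definitions.

```python
import argparse

parser = argparse.ArgumentParser(
    description="Parse DataSift corpora into daily files and counters")
parser.add_argument("dataDirectory",
                    help="directory holding the 20YY-MM subdirectories")
parser.add_argument("-C", "--countries", nargs="*", default=[],
                    help="two-letter ISO country codes to keep, e.g. GB ID FR")
parser.add_argument("-c", "--cities",
                    help="CSV of cities to snap content to for the DC.js map")
parser.add_argument("--clean", action="store_true",
                    help="remove existing output files before processing")
args = parser.parse_args()
```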
# Output Files
`process_ds_files.py` produces a set of files for each corpus (in `dataDirectory`); the `[_languages][_countries]` suffixing is illustrated in the sketch after this list.

- Daily files of the form `YYYY_MM_DD[_languages][_countries].json`, holding all messages from that day
- A pickle file holding all counters and time series: `counters[_languages][_countries].dat`
- An input file for CartoDB: `carto[_languages][_countries].txt`
- An input file for DC.js: `dc[_languages][_countries].csv`
- A deletions file of all streamed messages later deleted: `deletions[_languages][_countries].csv` (only non-empty for streaming data, not historical queries)
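One plausible reading of the naming scheme is sketched below: the optional suffixes are appended only when the run was restricted by language or country. `output_name` is a hypothetical helper, and the structure of the pickled counters is an assumption.

```python
import pickle
from collections import Counter

def output_name(stem, ext, languages=None, countries=None):
    """Build e.g. counters_pt_BR.dat from a stem, extension and restrictions."""
    suffix = ""
    if languages:
        suffix += "_" + "_".join(languages)
    if countries:
        suffix += "_" + "_".join(countries)
    return "{}{}.{}".format(stem, suffix, ext)

counters = {"mentions": Counter(), "hashtags": Counter()}  # assumed layout
with open(output_name("counters", "dat", languages=["pt"], countries=["BR"]), "wb") as f:
    pickle.dump(counters, f)  # writes counters_pt_BR.dat
```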
`make_plots.py` produces all input plots for the dashboard in png/mpld3 format; a minimal plotting sketch follows.
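For illustration, rendering one dashboard plot from the pickle might look like this. The `daily_volume` key and its date-to-count mapping are assumptions about the layout of `counters.dat`.

```python
import pickle
import matplotlib
matplotlib.use("Agg")  # render headlessly for web output
import matplotlib.pyplot as plt

with open("data/counters.dat", "rb") as f:
    counters = pickle.load(f)

series = counters["daily_volume"]  # hypothetical key: date -> message count
dates, values = zip(*sorted(series.items()))
fig, ax = plt.subplots()
ax.plot(dates, values)
ax.set_ylabel("messages per day")
fig.autofmt_xdate()
fig.savefig("data/daily_volume.png")
```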
`get_top_tweets.py` reads the daily files from the last N days and produces a list of the top tweets by ID, ready for embedding; a ranking sketch follows.
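A rough version of that ranking is sketched below. Counting retweets via a `twitter.retweeted.id` field is an assumption about the DataSift payload, and `top_tweets` is a hypothetical stand-in for the script's real logic.

```python
import datetime
import json
import os
from collections import Counter

def top_tweets(data_directory, n_days, top_k=10):
    """Return the IDs of the most retweeted tweets from the last n_days files."""
    counts = Counter()
    today = datetime.date.today()
    for offset in range(n_days):
        day = today - datetime.timedelta(days=offset)
        path = os.path.join(data_directory, day.strftime("%Y_%m_%d") + ".json")
        if not os.path.exists(path):
            continue  # no corpus for that day
        with open(path) as f:
            for line in f:
                try:
                    msg = json.loads(line)
                except ValueError:
                    continue
                tweet_id = msg.get("twitter", {}).get("retweeted", {}).get("id")
                if tweet_id:
                    counts[tweet_id] += 1
    return [tweet_id for tweet_id, _ in counts.most_common(top_k)]
```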
# Example Usage
```
python process_ds_files.py data/ -C BR PT -c cities.csv
python process_deletions.py -d data/ -C BR -L pt
python make_plots.py -d data/ -C BR --clean
python get_top_tweets.py -d data -n 7 -C BR --clean
```
# Dependencies
# TODOs
- Find a more consistent way to count over topics (and sub topics)
- Add in gender time series for each topic
- Count ngram time series in a separate process
- Convert topic counting (and resampling) to use a temporary Series rather than a DataFrame
- Convert topic collocations to time series rather than counts