Gazouilloire

A command line tool for long-term tweet collection. Gazouilloire combines two methods to collect tweets from the Twitter API ("search" and "filter") in order to maximize the number of collected tweets, and automatically fills the gaps in the collection in case of connection errors or reboots. It handles various configuration options, detailed in the advanced parameters section below.

Python >= 3.7 compatible.

Your Twitter API keys must have been created before April 29, 2022 in order to fully use the tool. If your keys were created after that date, Gazouilloire will only work with the "search" endpoint, not the "filter" one. See Twitter's documentation about this issue.


Installation

  • Install gazouilloire

    pip install gazouilloire
  • Install ElasticSearch, version 7.X (you can also use Docker for this)

  • Init gazouilloire collection in a specific directory...

    gazou init path/to/collection/directory
  • ...or in the current directory

    gazou init

A config.json file is created. Open it to configure the collection parameters.

Quick start

  • Set your Twitter API key and generate the related Access Token

    "twitter": {
        "key": "<Consumer Key (API Key)>xxxxxxxxxxxxxxxxxxxxx",
        "secret": "<Consumer Secret (API Secret)>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "oauth_token": "<Access Token>xxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        "oauth_secret": "<Access Token Secret>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
    
  • Set your ElasticSearch connection (host & port) within the database section and choose a database name that will host your corpus' index:

    "database": {
        "host": "localhost",
        "port": 9200,
        "db_name": "medialab-tweets"
    }

Note that ElasticSearch database names must be lowercase, without spaces or accented characters.

  • Write down the list of desired keywords and @users and/or the list of desired url_pieces as json arrays:

    "keywords": [
        "amour",
        "\"mots successifs\"",
        "@medialab_scpo"
    ],
    "url_pieces": [
        "medialab.sciencespo.fr/fr"
    ],

    Read the advanced parameters section below to set up more filters and options, or to learn how to properly write your queries within keywords.

  • Start the collection by typing the following command in your terminal:

    gazou run

    or, if the config file is located in another directory than the current one:

    gazou run path/to/collection/directory

    Read the daemon mode section below to learn how to let gazouilloire run continuously on a server and set up automatic restarts.

Disk space

Before starting the collection, you should make sure that you will have enough disk space. It takes about 1GB per million tweets collected (without images and other media contents).

You should also consider starting gazouilloire in multi-index mode if the collection is planned to exceed 100 million tweets, or simply restart your collection in a new folder with a new db_name (i.e. open another ElasticSearch index) if the current collection exceeds 150 million tweets.

As a point of comparison, here is the number of tweets sent during the whole year 2021 containing certain keywords (the values were obtained with the API V2 tweets count endpoint):

Query                Number of tweets in 2021
lemondefr lang:fr    3 million
macron lang:fr       21 million
vaccine              176 million
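
For reference, such figures can be obtained directly from Twitter's API v2 tweet counts endpoint. Below is a minimal sketch using curl; it assumes you have a bearer token with full-archive ("all") access stored in a BEARER_TOKEN environment variable, and the endpoint returns counts per granularity step rather than a single total:

# count tweets matching "macron lang:fr" over 2021, day by day (requires full-archive access)
curl -H "Authorization: Bearer $BEARER_TOKEN" \
  "https://api.twitter.com/2/tweets/counts/all?query=macron%20lang%3Afr&start_time=2021-01-01T00:00:00Z&end_time=2022-01-01T00:00:00Z&granularity=day"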

Export the tweets in CSV format

Data is stored in your ElasticSearch, which you can directly query. But you can also export it easily in CSV format:

# Export all fields from all tweets, sorted in chronological order:
gazou export

Sort tweets

By default, tweets are sorted in chronological order, using the "timestamp_utc" field. However, you can speed up the export by specifying that you do not need any sort order:

gazou export --sort no

You can also sort tweets using one or several other sorting keys:

gazou export --sort collection_time

gazou export --sort user_id,user_screen_name

Please note that:

  • Sorting by "id" is not possible.
  • Sorting by long textual fields (links, place_name, proper_links, text, url, user_description, user_image, user_location, user_url) is not possible.
  • Sorting by other id fields such as "user_id" or "retweeted_id" will sort them in alphabetical order (100, 1000, 101, 99), not numerical order.
  • Sorting by plural fields (e.g. mentions, hashtags, domains) may produce unexpected results.
  • Sorting by several fields may strongly increase export time.

Write into a file

By default, the export command writes in stdout. You can also use the -o option to write into a file:

gazou export > my_tweets_file.csv
# is equivalent to
gazou export -o my_tweets_file.csv

However, if you interrupt the export and need to resume it later to complete it in multiple passes, the --resume option only works in combination with -o.
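
For instance, a long export written to a file can be completed in several passes (a minimal sketch combining the two options mentioned above):

gazou export -o my_tweets_file.csv
# if the export gets interrupted, finish it with:
gazou export -o my_tweets_file.csv --resume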

Query specific keywords

Export all tweets containing "medialab" in the text field:

gazou export medialab

The search engine is not case sensitive and ignores # and @ characters: gazou export sciencespo will export tweets containing "@sciencespo" or "#SciencesPo". However, it is sensitive to accents: gazou export medialab will not return tweets containing "médialab".

Use lucene query syntax with the --lucene option in order to write more complex queries:

  • Use AND / OR:
    gazou export --lucene '(medialab OR médialab) AND ("Sciences Po" OR sciencespo)'

(note that queries containing AND or OR will be considered in lucene style even if you do not use the --lucene option)

  • Query other fields than the text of the tweets:
    gazou export --lucene user_location:paris
  • Query tweets containing non-empty fields:
    gazou export --lucene place_country_code:*
  • Query tweets containing empty fields:
    gazou export --lucene 'NOT retweeted_id:*'
    # (this is equivalent to:)
    gazou export --exclude-retweets
  • Note that single quotes will not match exact phrases:
    gazou export --lucene "NewYork OR \"New York\"" #match tweets containing "New York" or "NewYork"
    gazou export --lucene "NewYork OR 'New York'" #match tweets containing "New" or "York" or "NewYork"

Other available options:

# Get documentation for all options of gazou export (-h or --help)
gazou export -h

# By default, the export will show a progressbar, which you can disable like this:
gazou export --quiet

# Export a csv of all tweets between 2 dates or datetimes (--since is inclusive and --until exclusive):
gazou export --since 2021-03-24 --until 2021-03-25
# or
gazou export --since 2021-03-24T12:00:00 --until 2021-03-24T13:00:00

# List all available fields for each tweet:
gazou export --list-fields

# Export only a selection of fields (-c / --columns or -s / --select the xsv way):
gazou export -c id,user_screen_name,local_time,links
# or for example to export only the text of the tweets:
gazou export --select text

# Exclude tweets collected via conversations or quotes (i.e. which do not match the keywords defined in config.json)
gazou export --exclude-threads

# Exclude retweets from the export
gazou export --exclude-retweets

# Export all tweets matching a specific ElasticSearch term query, for instance by user name:
gazou export '{"user_screen_name": "medialab_ScPo"}'

# Take a csv file with an "id" column and export only the tweets whose ids are included in this file:
gazou export --export-tweets-from-file list_of_ids.csv

# You can of course combine all of these options, for instance:
gazou export medialab --since 2021-03-24 --until 2021-03-25 -c text --exclude-threads --exclude-retweets -o medialab_tweets_210324_no_threads_no_rts.csv

Count collected tweets

The Gazouilloire query system is also available for the count command. For example, you can count the number of tweets that are retweets:

gazou count --lucene retweeted_id:*

You can also use the --step parameter to count the number of tweets per seconds/minutes/hours/days/months/years:

gazou count medialab --step months --since 2018-01-01 --until 2022-01-01

The result is written in CSV format.

Export/Import data dumps directly with ElasticSearch

In order to make and restore backups, you can also export or import data by interacting directly with ElasticSearch, using one of the many tools of the ecosystem built for this purpose.

We recommend using elasticdump, which requires installing Node.js:

# Install the package
npm install -g elasticdump

Then you can use it directly, or via the elasticdump.sh script shipped with gazouilloire, to run simple exports/imports of your gazouilloire collection indices:

gazou scripts elasticdump.sh
# and to read its documentation:
gazou scripts --info elasticdump.sh
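
For reference, a direct call looks roughly like the sketch below. The index name medialab-tweets_tweets is only an assumption derived from the db_name set in the Quick start section: check your actual index names first, for instance at http://localhost:9200/_cat/indices.

# dump one index into a local JSON file (adapt host, port and index name to your config)
elasticdump --input=http://localhost:9200/medialab-tweets_tweets --output=tweets_backup.json --type=data
# and reimport it later into the same or another ElasticSearch
elasticdump --input=tweets_backup.json --output=http://localhost:9200/medialab-tweets_tweets --type=data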

Advanced parameters

Many advanced settings can be used to better filter the tweets collected and complete the corpus. They can all be modified within the config.json file.

- keywords

Keyword syntax follows Twitter's search engine rules. You can forge your queries by typing them within the website's search bar. You can input a single word, or a combination of words separated by spaces (which will query for tweets matching all of those words). You can also write complex boolean queries such as (medialab OR (media lab)) (Sciences Po OR SciencesPo), but note that only the Search API will be used for these, not the Streaming API, resulting in less exhaustive results.

Some advanced filters can be used in combination with the keywords, such as -undesiredkeyword, filter:links, -filter:media, -filter:retweets, etc. See Twitter API's documentation for more details. Queries including these will also only run on the Search API and not the Streaming API.

When adding a Twitter user as a keyword, such as "@medialab_ScPo", Gazouilloire will specifically query "from:medialab_ScPo OR to:medialab_ScPo OR @medialab_ScPo" so that all tweets mentioning the user will also be collected.

Using upper or lower case characters in keywords won't change anything.

You can leave accents in queries: through the search API, Twitter automatically returns both accented and unaccented matches (for instance, searching "héros" will find tweets containing either "heros" or "héros"). The streaming API only returns exact matches, but it mostly complements the search results.

Regarding hashtags, note that querying a word without the # character will return both tweets with the regular word and tweets with the hashtag. Adding a hashtag with the # character inside keywords will only collect tweets with the hashtag.
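
To illustrate the rules above, here is a sketch of a keywords list combining a plain multi-word query, a boolean query, a query with an advanced filter and a user query (remember that the boolean and filtered ones will only be collected through the Search API):

"keywords": [
    "medialab tools",
    "(medialab OR (media lab)) (Sciences Po OR SciencesPo)",
    "sciencespo -filter:retweets",
    "@medialab_ScPo"
]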

Note that three additional settings allow filtering further:

- language

To collect only tweets written in a specific language, just add "language": "fr" to the config (the language should be written as an ISO 639-1 code).

- geolocation

Just add "geolocation": "Paris, France" field to the config with the desired geographical boundaries or give in coordinates of the desired box (for instance [48.70908786918211, 2.1533203125, 49.00274483644453, 2.610626220703125])

- time_limited_keywords

In order to filter on specific keywords during planned time periods, for instance:

"time_limited_keywords": {
      "#fosdem": [
          ["2021-01-27 04:30", "2021-01-28 23:30"]
      ]
  }

- url_pieces

To search for specific parts of websites, one can input pieces of urls as keywords in this field. For instance:

"url_pieces": [
    "medialab.sciencespo.fr",
    "github.com/medialab"
]

- resolve_redirected_links

Set to true or false to enable or disable automatic resolution of all links found in tweets (t.co links are always handled, but this also resolves links from all other shorteners such as bit.ly).

The resolving_delay (set to 30 by default) defines how many days URLs returning errors will be retried before being left as they are.
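
For instance, with the default delay mentioned above, the relevant part of config.json would look like:

"resolve_redirected_links": true,
"resolving_delay": 30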

- grab_conversations

Set to true to activate automatic recursive retrieval within the corpus of all tweets to which collected tweets are answering (warning: one should account for the presence of these when processing data, as this often results in collecting tweets which do not contain the queried keywords and/or which fall way outside the collection time period).

- catchup_past_week

Twitter's free API allows collecting tweets up to 7 days in the past, which gazouilloire does by default when starting a new corpus. Set this option to false to disable this behaviour and only collect tweets posted after the collection was started.
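
As a sketch, a configuration that retrieves full conversations but does not catch up on the previous week would contain:

"grab_conversations": true,
"catchup_past_week": false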

- download_media

Configure this option to activate automatic downloading, within media_directory, of photos and/or videos posted by users within the collected tweets (this does not include images from social cards). For instance, the following configuration will only download photos, not videos or gifs:

"download_media": {
    "photo": true,
    "video": false,
    "animated_gif": false,
    "media_directory": "path/to/media/directory"
}

All fields can also be set to true to download everything. media_directory is the folder where Gazouilloire stores the images & videos. It should either be an absolute path ("/home/user/gazouilloire/my_collection/my_images"), or a path relative to the directory where config.json is located ("my_images").

- timezone

Adjust the timezone in which tweet timestamps should be computed. Allowed values are suggested at Gazouilloire's startup whenever an invalid one is set.
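
For instance, assuming standard timezone names such as the one below are accepted (the startup check will tell you if the value is not):

"timezone": "Europe/Paris"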

- verbose

When set to true, logs will be way more explicit regarding Gazouilloire's interactions with Twitter's API.

Daemon mode

For production use and long term data collection, Gazouilloire can run as a daemon (which means that it executes in the background, and you can safely close the window within which you started it).

  • Start the collection in daemon mode with:

    gazou start
  • Stop the daemon with:

    gazou stop
  • Restart the daemon with:

    gazou restart
  • Access the current collection status (running/not running, number of collected tweets, disk usage, etc.) with:

    gazou status
  • Gazouilloire should normally restart on its own in case of temporary internet access outages, but it might occasionally fail for various reasons, such as ElasticSearch having crashed. To ensure a long term collection remains up and running without constantly checking it, we recommend programming automatic restarts of Gazouilloire at least once a week using cronjobs (missing tweets will be completed up to 7 days after a crash; a sample cronjob is sketched after this list). To do so, a restart.sh script is provided that handles restarting ElasticSearch whenever necessary. You can install it within your corpus directory by doing:

    gazou scripts restart.sh

    Use cases and cronjob examples are proposed as comments at the top of the script. You can also consult them by doing:

    gazou scripts --info restart.sh
  • An example script daily_mail_export.sh is also proposed to perform daily tweet exports and receive them by e-mail. Feel free to reuse and tailor it to your own needs in the same way:

    gazou scripts daily_mail_export.sh
    # and to read its documentation:
    gazou scripts --info daily_mail_export.sh
  • More similar practical scripts are available for diverse use cases:

    # You can list them all using --list or -l:
    gazou scripts --list
    # Read each script's documentation with --info or -i (for instance for "backup_corpus_ids.sh"):
    gazou scripts --info backup_corpus_ids.sh
    # And install it in the current directory with:
    gazou scripts backup_corpus_ids.sh
    # Or within a specific different directory using --path or -p:
    gazou scripts backup_corpus_ids.sh -p PATH_TO_MY_GAZOUILLOIRE_COLLECTION_DIRECTORY
    # Or even install all scripts at once using --all or -a (--path applicable as well)
    gazou scripts --all
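
As mentioned above, a weekly automatic restart can be scheduled with a cronjob along these lines (a purely hypothetical example with a made-up schedule and path; the comments at the top of restart.sh document the recommended usage):

# restart the collection every Monday at 5am
0 5 * * 1 bash /path/to/collection/directory/restart.sh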

Reset

  • Gazouilloire stores its current search state in the collection directory. This means that if you restart Gazouilloire in the same directory, it will not search again for tweets that were already collected. If you want a fresh start, you can reset the search state, as well as everything that was saved on disk, using:

    gazou reset
  • You can also choose to delete only some elements, e.g. the tweets stored in ElasticSearch and the media files:

    gazou reset --only tweets,media

    Possible values for the --only argument: tweets,links,logs,piles,search_state,media

Development

To install Gazouilloire's latest development version or to help develop it, clone the repository and install your local version using the setup.py file:

git clone https://github.com/medialab/gazouilloire
cd gazouilloire
python setup.py install
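
Alternatively, a standard editable install with pip (a generic pip feature, not specific to gazouilloire) makes your local changes immediately effective without reinstalling:

# from the root of the cloned repository:
pip install -e .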

Gazouilloire's main code resides in gazouilloire/run.py, in which the whole multiprocess architecture is orchestrated. Below is a diagram of all processes and queues.

  • The searcher collects tweets by querying Twitter's search API v1.1 for all keywords sequentially, as much as the API rate limits allow
  • The streamer collects realtime tweets using Twitter's streaming API v1.1, as well as info on deleted tweets from users explicitly followed as keywords
  • The depiler processes and reformats tweets and deleted tweets using twitwi before indexing them into ElasticSearch. It also extracts media urls and parent tweets to feed the downloader and the catchupper
  • The downloader requests all media urls and stores them on the filesystem (if the download_media option is enabled)
  • The catchupper recursively collects, via Twitter's lookup API v1.1, the parent tweets of all collected tweets that are part of a thread, and feeds them back to the depiler (if the grab_conversations option is enabled)
  • The resolver runs multithreaded queries on all urls found as links within the collected tweets and tries to resolve them, thanks to minet, to get unshortened and harmonized urls (if the resolve_redirected_links option is enabled)

Whenever Gazouilloire is shut down, all three queues are backed up on the filesystem in pile_***.json files, so that they can be reloaded at the next restart.

[Diagram of Gazouilloire's multiprocess architecture]

Troubleshooting

ElasticSearch

  • Remember to set the heap size (1GB by default) when moving to production. 1GB is fine for indices under 15-20 million tweets, but be sure to set a higher value for heavier corpora.

    Set these values in /etc/elasticsearch/jvm.options (if you use ElasticSearch as a service) or in your_installation_folder/config/jvm.options (if you have a custom installation folder):

    -Xms2g
    -Xmx2g
    

    Here the heap size is set to 2GB (use -Xms5g -Xmx5g if you need 5GB, etc.).

  • If you encounter this ElasticSearch error message: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]:

    ➡️ Increase the max_map_count value:

    sudo sysctl -w vm.max_map_count=262144

    (source)

  • If you get a ClusterBlockException [SERVICE_UNAVAILABLE/1/state not recovered / initialized] when starting ElasticSearch:

    ➡️ Check the value of gateway.recover_after_nodes in /etc/elasticsearch/elasticsearch.yml:

    sudo [YOUR TEXT EDITOR] /etc/elasticsearch/elasticsearch.yml

    Edit the value of gateway.recover_after_nodes to match your number of nodes (usually 1 - easily checked at http://host:port/_nodes).
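
    For a typical single-node setup, the line in elasticsearch.yml would then simply read:

    gateway.recover_after_nodes: 1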


Credits & License

Benjamin Ooghe-Tabanou, Béatrice Mazoyer, Jules Farjas & al @ Sciences Po médialab

Read more about Gazouilloire's migration from Python2 & Mongo to Python3 & ElasticSearch in Jules' report.

Discover more of our projects at médialab tools.

This work has been supported by DIME-Web, part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Gazouilloire is free open source software released under the GPL 3.0 license.


gazouilloire's Issues

Improve logging globally

Currently we're using print pretty much everywhere, most of the time to stdout.
Shifting to a logging module would be better and would separate warnings and errors from the rest by outputting them to stderr.

Unify logs

Currently we're using print pretty much everywhere, most of the time to stdout.
Shifting to a logging module would be better and would separate warnings and errors from the rest by outputting them to stderr.

config.json line 24 error

Hi,
thanks for great software!

In config.json.example on line 24 there is a colon (:) instead of a comma (,) which gives a delimiter error.

Best regards
Christopher

Add command gazouilloire reset to CLI

Example in the script bin/delete_es_index.py, to be moved into the CLI

  • Include options to include or not both indexes for tweets and links (default removes both?)

  • Include options to remove also .search_state.json (default includes it?)

  • Add human confirmation by default for each operation (Are you sure you want to delete index ?) - and -y/--yes option to bypass the confirmation

  • Delete the script bin/delete_es_index.py when it's done

  • Include option to keep media files (default delete)

  • Reset logs

  • check if indices/searchstate actually exist before asking the user to erase them

  • Reset pile

  • media folder can have an alternative name

Filter by user_screen_name

Hi,

Is there a way to filter (in the config file) by the screen_name in order to have only keywords on specific accounts?

Sincerely

Minor issues

  • source < 1024 not useful any more

  • catch multithread resolving errors (cf fa5c58a?branch=fa5c58a0e8c4cd6f385c708ec1f638909937172d&diff=unified#diff-9e61678f6d25ee73ad5ba89d46aae3a3R93 )

  • replace "db_conf" argument by "host", "port", "db_name" in complete_link_resolving_v2

  • "coordinates" is LONG:LAT while "geo" is LAT:LONG

  • rename update_tweets_with_links --> update_retweets_with_links

  • normalize_url called twice on same url

  • use os.path.join("..", "..", ...) instead of os.path.dirname(os.path.dirname(os.path.dirname in elasticmanager

  • in resolve_thread, the todo = line can probably be moved within the while and used only once, since todo is not used in the while condition (this probably requires initializing done to 0 beforehand though)

  • remove old resolver code in run.py and rename resolve_thread

  • update reply_count only if the value exists in the new tweet

  • remove references to analyzer in doc and in config.json

  • --verbose and --silent have same behavior in gazou resolve

  • store date of tweet in 2 formats : timestamp (UTC time) and iso (local time)

MongoDB/ES migration scripts

There is a script bin/mongo_to_es.py which was doing this in Jules' version. It probably needs to be updated quite a bit.

What to do with accents in keywords?

It now looks like Twitter search handles accents properly as regular letters and returns identical results (e.g. cédille & çedille), so recommending not to use them is no longer relevant. However, we should probably:

  • advertise this in the readme
  • ensure to handle these properly (cf 9fd83d3#r31601621 where we strangely only need to handle encoding on the search side but not the stream one)

Reproduce behavior of previous script bin/backup_corpus_ids.sh

The feature should already be handled via "gazou export -s id", although I fear it might be a bit slow on big corpora, and I'm wondering whether there might be a faster way to do such exports, maybe using ES native tools instead of python's port.

To illustrate, in v0.1 we had 2 scripts for this: one in Python writing ids from a MongoDB query (bin/backup_corpus_ids.py), which was quite slow, and one very fast using mongoexport in shell (bin/backup_corpus_ids.sh).

I don't know if such a thing exists as well with ES, but if it does we might want to use it for this.

Reuse elements from search_state which haven't changed

Whenever the list of followed keywords changes, we reset the whole stored search state to adapt to the new queries, although some combinations are often the same and could be reused instead of losing precious queries to recatch the last 7 days of all queries.

Indexing tests

Features to test:

  • [ ] deleted tweets are marked as deleted (procedure: add a user in keywords, post then delete a tweet with this user, and check that the tweet was updated)
  • "stream", "search",... tags are there

Handle batch size in depiler?

Currently the depiler empties the whole pile at each pass.
With big collections, it means each pass can sometimes work on batches of hundreds of thousands of tweets; it might make sense to set a max batch size for performance reasons.

This also raises the question of storing the queue elsewhere than in RAM, to avoid data loss due to crashes or forced kills, maybe using RabbitMQ?
cc @Yomguithereal

Adapt export scripts to elasticsearch database

  • Exports should be sorted by timestamp

  • remove selected_field but keep extra_fields

  • when possible, shorten field names

  • perform all formatting steps before indexing

  • database fields and csv exported fields should have the same names

  • remove "withheld_*" fields

  • Check that all collected fields are exported

  • Take advantage of the new indexing (collected_via) to simplify thread export

  • Document tweet fields that appeared/disappeared due to Twitter changes

  • export script should : convert booleans to 0/1, separate lists with "|", export links only if proper_links is not populated

  • Include export with the other gazou CLI commands (start, resolve) with the syntax gazou export - write to stdout by default, with -o filename as an option

  • separate build_query logic into a function that can be unit tested

  • test & benchmark whether using csv.writer instead of DictWriter would speed things up

  • sort out which export scripts should be kept in the last version of gazouilloire

Document CLI in README

  • gazou export (and include examples, such as: get all text of a corpus using gazou export -s text, etc.)
  • gazou reset
  • document .search_state.json

Add Import/Export scripts or CLI for resolved links

(to do after having set up the shared linksstore option)

We need to be able to dump a csv of all resolved links from the index/redis/... and to reimport it into another one (to make backups or to merge existing resolved dbs for instance)

I guess these could well be specific CLI commands such as importlinks and exportlinks for instance (there are probably better name options)

these actions corresponded to the scripts bin/export_resolved_links.py and bin/import_resolved_links.py in v0.1

Cannot start multiprocess, pickleError: NonType

Hello Medialab,

I have adapted the multiprocessing part of your code into mine for the twitterstream and search API. I am unable to start any process and I have already tested that my search and stream methods work correctly.

I would really appreciate it if you could help me resolve this issue!

Thank you very much for your time!

Here is the full traceback of the error:

Traceback (most recent call last):
  File "C:\Python27\Code\tweets\Stream.py", line 620, in <module>
    tweetParser.start()
  File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "C:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "C:\Python27\lib\multiprocessing\forking.py", line 199, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Python27\lib\pickle.py", line 224, in dump
    self.save(obj)
  File "C:\Python27\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python27\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
    save(v)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 568, in save_tuple
    save(element)
  File "C:\Python27\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python27\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
    save(v)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Python27\lib\pickle.py", line 686, in _batch_setitems
    save(k)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 754, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <type 'NoneType'>: it's not found as __builtin__.NoneType
[INFO/MainProcess] process shutting down
[DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[DEBUG/MainProcess] running the remaining "atexit" finalizers
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 381, in main
    self = load(from_parent)
  File "C:\Python27\lib\pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "C:\Python27\lib\pickle.py", line 864, in load
    dispatch[key](self)
  File "C:\Python27\lib\pickle.py", line 886, in load_eof
    raise EOFError
EOFError
[INFO/Process-1] process shutting down
[DEBUG/Process-1] running all "atexit" finalizers with priority >= 0
[DEBUG/Process-1] running the remaining "atexit" finalizers

Create twitwi package

Externalize:

  • create new repo and PyPI lib

  • include gazouilloire/API_wrapper.py:

    • merge codes from minet/gazouilloire
    • switch to using the lib in minet/gazouilloire
    • add functional tests with calls ?

  • include tweets/users metadata formatters from gazouilloire/tweets.py:
    • add options to flatten the parsed data (mainly pipe join arrays)
    • add unit tests using hardcached json from API
    • switch to using the lib in minet/gazouilloire (beware: minet uses a simpler version that only resolves entities)

  • include list of fields from gazouilloire/web/export.py:
    • add other useful csv helpers if needed ?

Spring cleanup

In the new version a number of old things can be progressively removed:

  • the "collect_*" dirs at the root of the repo
  • Remove meta.json / test_stream.py / restart.sh
  • mongo_manager.py
  • move schemas in a doc directory

[resolver] Retry urls in error

Currently we detect and log urls for which the resolver failed with an error that is not a redirect one, but we do not stop there and save them in the db as if they had just worked without redirection. Those should rather be retried at least a number of times.
So before we set up a smart way (using the future redis store) to retry them for a few days before giving up, we should just retry them in a loop.

Declare module with setup.py file

  • create setup.py

  • config.json cannot be part of the module egg --> create gazouilloire init command to create file

  • create separate config_format.py file to load and check content of config.json

  • change README

  • adapt scripts (in particular complete_link_resolving_v2.py)

  • create CLI interface ?

  • Give a relevant version number to the version on PyPi (something like 1.0.0.beta)
