serenata-toolbox's Introduction

Serenata de Amor Toolbox

A pip-installable package to support the development of Serenata de Amor and Rosie.

Serenata_toolbox is compatible with Python 3.6+

Installation

$ pip install -U serenata-toolbox

If you are a regular user, you are ready to get started right after the pip install.

If you are a core developer who needs to upload datasets to the cloud, configure the AMAZON_ACCESS_KEY and AMAZON_SECRET_KEY environment variables before running the toolbox.

Usage

We have plenty of datasets ready for you to download from our servers, and this toolbox helps you get them. Here are some examples:

Example 1: Using the command line wrapper

# without any arguments, downloads our pre-processed datasets into the data/ folder
$ serenata-toolbox

# downloads these specific datasets into the /tmp/serenata-data folder
$ serenata-toolbox /tmp/serenata-data --module federal_senate chamber_of_deputies

# you can specify a dataset and a year
$ serenata-toolbox --module chamber_of_deputies --year 2009

# or specify all options simultaneously
$ serenata-toolbox /tmp/serenata-data --module federal_senate --year 2017

# getting help
$ serenata-toolbox --help

Example 2: How do I download the datasets?

Another option is creating your own Python script:

from serenata_toolbox.datasets import Datasets
datasets = Datasets('data/')

# now let's see the latest datasets available
for dataset in datasets.downloader.LATEST:
    print(dataset)  # and you'll see a long list of datasets!

# and let's download one of them
datasets.downloader.download('2018-01-05-reimbursements.xz')  # yay, you've just downloaded this dataset to data/

# you can also get the most recent version of all datasets:
latest = list(datasets.downloader.LATEST)
datasets.downloader.download(latest)

Example 3: Using shortcuts

If the last example doesn't look that simple, there are some fancy shortcuts available:

from serenata_toolbox.datasets import fetch, fetch_latest_backup
fetch('2018-01-05-reimbursements.xz', 'data/')
fetch_latest_backup('data/')  # yep, we've just done exactly the same thing

Example 4: Generating datasets

If you ever wonder how we generated these datasets, this toolbox can help you too (at least with the most used ones — the others are generated in our main repo):

from serenata_toolbox.federal_senate.dataset import Dataset as SenateDataset
from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements as ChamberDataset

chamber = ChamberDataset('2018', 'data/')
chamber()

senate = SenateDataset('data/')
senate.fetch()
senate.translate()
senate.clean()

Documentation (WIP)

The full documentation is still a work in progress. If you want to give us a hand, you will need Sphinx:

$ cd docs
$ make clean; make rst; rm source/modules.rst; make html

Contributing

Firstly, you should create a development environment with Python's venv module to isolate your development. Then clone the repository and build the package by running:

$ git clone https://github.com/okfn-brasil/serenata-toolbox.git
$ cd serenata-toolbox
$ python setup.py develop

Always add tests to your contribution — if you want to test it locally before opening the PR:

$ pip install tox
$ tox

When the tests are passing, also check for coverage of the modules you edited or added — if you want to check it before opening the PR:

$ tox
$ open htmlcov/index.html

Follow PEP8 and best practices implemented by Landscape in the veryhigh strictness level — if you want to check them locally before opening the PR:

$ pip install prospector
$ prospector -s veryhigh serenata_toolbox

If this report includes issues related to the import sections of your files, isort can help you:

$ pip install isort
$ isort **/*.py --diff

Always suggest a version bump. We use Semantic Versioning – or, in the Elm community's words:

  • MICRO: the API is the same, no risk of breaking code
  • MINOR: values have been added, existing values are unchanged
  • MAJOR: existing values have been changed or removed

This is really important because every new piece of code merged into master triggers the CI, and the CI then triggers a new release to PyPI. The attempt to roll out a new version of the toolbox will fail without a version bump. So we do encourage adding a version bump even if all you have changed is the README.rst — this is the way to keep the README.rst updated on PyPI.

If you are not changing the API or README.rst in any sense and if you really do not want a version bump, you need to add [skip ci] to your commit message.

And finally take The Zen of Python into account:

$ python -m this

serenata-toolbox's People

Contributors

alexandrebarbaruiva, anaschwendler, arioston, cabral, cuducos, fgrehm, giovanisleite, humrochagf, irio, jondel, jtemporal, lipemorais, luizcavalcanti, luzfcb, marius92mc, rennerocha, sergiomario, subkrish, tarsisazevedo, thadeuluz, tuliocasagrande, turicas, vbernardes, viniciusartur, vitorkusiaki, vmesel, willianpaixao, yrachid

serenata-toolbox's Issues

The AWS Access Key Id you provided does not exist in our records

I'm trying to run the very first example given in your main page: https://github.com/datasciencebr/serenata-toolbox

from serenata_toolbox.datasets import Datasets
datasets = Datasets('/tmp/serenata-data/')

# now lets see what datasets are available
for dataset in datasets.remote.all:
    print(dataset)  # and you'll see a long list of datasets!

I changed the datasets folder to an existing folder on my computer and everything works. However, when I try to run the above loop, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/marcus/Documents/Research/Big Data/Serenata de Amor/serenata-toolbox/serenata_toolbox/datasets/remote.py", line 74, in all
    response =  self.s3.list_objects(Bucket=self.bucket)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 253, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 557, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.

It seems I need an AWS Access Key. However, I am using the config.ini file you provided:

[Amazon]
Bucket: serenata-de-amor-data
AccessKey: YOUR_ACCESS_KEY
Region: sa-east-1
SecretKey: YOUR_SECRET_KEY

Since you said "If you don't plan to upload anything to S3 please don't bother about keys and secrets in this file.", I didn't look for these credentials.

Anyway, how can I download data to my computer using this toolbox? What am I doing wrong? I am not very familiar with Python, so maybe this is a very simple question.

Automate PyPI release

Check whether it is possible, upon merge, to generate the files for a PyPI release and do the actual release.

The steps to generate and upload the files to PyPI are the following:

$ python setup.py sdist
$ python setup.py bdist_wheel --plat-name='any'
$ twine upload dist/*

P.S.: requires twine to be installed

Deployment ideas

With #134 almost good to merge, I would like to raise a question already raised here and described below.

Today our deploys to PyPI are made whenever a PR is merged into master. And that's pretty good!

Though there's a catch: PRs that do not have a version bump break our build, because PyPI doesn't accept uploading a file with a version that already exists, hence breaking our Travis build.

There are a few approaches I would like to get some thoughts on:

1. create a release branch:

  • Pros: We would only release those pull requests that really have code changes
  • Cons: A branch to update whenever we have meaningful changes, which generates double work (update the branch, open a PR, merge it)

2. everything is at least a patch bump:

  • Pros: No build breaking; only one commit away from updating things
  • Cons: Rapid increase in patch bumps (but considering that patches do not directly affect the way people use the toolbox, that might not be such a big con)

3. use the condition tag on deploy:

  • Pros: Would put a condition to only deploy when setup had a bump
  • Cons: Don't know how to do this or if it is even possible to do so

Remove pandas dependency

Talking to @cuducos we found out that people just wanting to use serenata-toolbox to convert data (and use it without Rosie) would need pandas (and all numpy-related dependencies). This is not a good thing if you're converting this data to use on a website or in a restricted deployment environment (such as Heroku), so I propose removing pandas as a dependency if there's a library that does not depend on numpy/scipy/etc. and can do the same thing pandas is doing here.
Probably the rows library will do the job (still need to check the whole code base here). A side effect of replacing pandas with rows is lower memory consumption, since the new version of rows (currently under development) can import/export data in a lazy way and so doesn't need a huge amount of RAM to run.
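
For illustration only, here is a minimal sketch of the kind of pandas-free, streaming processing being discussed, using the standard library's csv module (file and column names are made up, and this is not the rows API):

import csv
import lzma

# Stream rows from a compressed CSV and write a cleaned copy without loading
# everything into memory.
with lzma.open('data/reimbursements.xz', 'rt', encoding='utf-8') as source, \
        open('data/reimbursements-clean.csv', 'w', newline='', encoding='utf-8') as target:
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row['document_value'] = row['document_value'].replace(',', '.')
        writer.writerow(row)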

Weekly update for reimbursement files

I have been updating reimbursements.xz, current-year.xz, last-year.xz and previous-years.xz every week. The objective is to use updated data when running analysis locally.

The latest ones are:

  • 2017-05-22-reimbursements.xz
  • 2017-05-22-current-year.xz
  • 2017-05-22-last-year.xz
  • 2017-05-22-previous-years.xz

Considering they are too big to be attached here (> 10MB), those files are saved in a Google Drive folder. You can access them here.

I will keep updating every Sunday and sharing the Google Drive folder. Feel free to download the files and upload them to the server or save them locally whenever you feel like.

Simplify public API

IMHO our public API is quite redundant — surely redundancy can be good sometimes… so I ask: is it worth simplifying it in these terms?

Turn stuff like:

from serenata_toolbox.federal_senate.federal_senate_dataset import FederalSenateDataset

Into:

from serenata_toolbox.federal_senate.dataset import Dataset

Or, at least:

from serenata_toolbox.federal_senate.dataset import FederalSenateDataset

(And the same for serenata_toolbox.chamber_of_deputies).

Deploy only after unit tests pass

Right now Travis tries to deploy to PyPI two times, hence breaking the build the second time.

Maybe adding a condition so that the deploy only happens if the unit tests pass would be a smart choice.

Better config Coveralls for accuracy

Hello, I hope it is not a problem to open an issue out of the blue like this.

Lately, @lipemorais was showing me this project and, as a way to start contributing to it, I thought it would be interesting to replace Coveralls with another quality measuring tool. The reason behind this suggestion is that the Coveralls UI is very confusing: I can't quickly get a picture of what is actually going on in the project when looking at its dashboard. Besides, it is showing a quite imprecise metric at the moment, since it also considers third-party libraries in its analysis, lowering the coverage to only 25%.

I believe that CodeClimate would be a better option to solve this problem. I've forked this project and configured its forked copy to be analyzed by CodeClimate. Below, it's possible to see a result of the first analysis:

(screenshot: result of the first CodeClimate analysis)

Besides the code styling, it's also possible to harvest some metrics about testing coverage:

(screenshots: CodeClimate test coverage metrics)

I've also noticed that Jarbas makes use of both Coveralls and CodeClimate. Maybe CodeClimate could be configured to be used as a test coverage reporter, replacing Coveralls there as well.

I would like to apologize in advance if this is not the right way to open a new issue.

"Both a converter and dtype were specified" warning with chamber_of_deputies

When using pandas.read_csv, you should provide either a converter or a dtype for a given column, not both.

Log

$ python rosie.py run chamber_of_deputies
/home/travis/virtualenv/python3.6.1/lib/python3.6/site-packages/serenata_toolbox/chamber_of_deputies/dataset.py:74: ParserWarning: Both a converter and dtype were specified for column vlrDocumento - only the converter will be used
  'vlrRestituicao': lambda x: float(x.replace(',','.'))})

/home/travis/virtualenv/python3.6.1/lib/python3.6/site-packages/serenata_toolbox/chamber_of_deputies/dataset.py:74: ParserWarning: Both a converter and dtype were specified for column vlrGlosa - only the converter will be used
  'vlrRestituicao': lambda x: float(x.replace(',','.'))})

/home/travis/virtualenv/python3.6.1/lib/python3.6/site-packages/serenata_toolbox/chamber_of_deputies/dataset.py:74: ParserWarning: Both a converter and dtype were specified for column vlrLiquido - only the converter will be used
  'vlrRestituicao': lambda x: float(x.replace(',','.'))})

/home/travis/virtualenv/python3.6.1/lib/python3.6/site-packages/serenata_toolbox/chamber_of_deputies/dataset.py:74: ParserWarning: Both a converter and dtype were specified for column vlrRestituicao - only the converter will be used
  'vlrRestituicao': lambda x: float(x.replace(',','.'))})

/home/travis/build/datasciencebr/rosie/rosie/chamber_of_deputies/adapter.py:25: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
  self.update_datasets()

Merging all datasets…
Loading reimbursements-2009.xz…
Loading reimbursements-2010.xz…
Loading reimbursements-2011.xz…
Loading reimbursements-2012.xz…
Loading reimbursements-2013.xz…
Loading reimbursements-2014.xz…
Loading reimbursements-2015.xz…
Loading reimbursements-2016.xz…
Loading reimbursements-2017.xz…
Dropping rows without document_value or reimbursement_number…
Grouping dataset by applicant_id, document_id and year…
Gathering all reimbursement numbers together…
Summing all net values together…
Summing all reimbursement values together…
Generating the new dataset…
Casting changes to a new DataFrame…
Writing it to file…
Done.

https://travis-ci.org/datasciencebr/rosie/builds/244041874

Must be changed in
https://github.com/datasciencebr/serenata-toolbox/blob/bced7b332c486f6a20f5bbc1285cf843040c76ed/serenata_toolbox/chamber_of_deputies/dataset.py#L56-L74
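
For reference, a minimal sketch of the kind of change being suggested (this is not the actual code from dataset.py; the file path and the ideDocumento dtype entry are only illustrative): give each value column a converter and leave it out of the dtype mapping, so pandas receives a single instruction per column and stops warning.

import pandas as pd


def to_float(value):
    return float(value.replace(',', '.'))


df = pd.read_csv(
    'path/to/chamber-of-deputies.csv',   # illustrative path
    dtype={'ideDocumento': str},         # plain dtypes only for columns without converters
    converters={
        'vlrDocumento': to_float,
        'vlrGlosa': to_float,
        'vlrLiquido': to_float,
        'vlrRestituicao': to_float,
    },
)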

Use logging module

Use the built-in logging module in order to get a structured and controllable way to output messages, instead of just using the print function.
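
A minimal sketch of what that could look like (the message and the dataset name are only illustrative):

import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger(__name__)

# instead of print('Downloading 2018-01-05-reimbursements.xz'):
logger.info('Downloading %s', '2018-01-05-reimbursements.xz')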

Fix namespace of the package

There are 3 ways of setting up a Python package, as the Python Packaging User Guide gracefully shows, and I copy them here:

  1. Use native namespace packages. This type of namespace package is defined in PEP 420 and is available in Python 3.3 and later. This is recommended if packages in your namespace only ever need to support Python 3 and installation via pip.
  2. Use pkgutil-style namespace packages. This is recommended for new packages that need to support Python 2 and 3 and installation via both pip and python setup.py install.
  3. Use pkg_resources-style namespace packages. This method is recommended if you need compatibility with packages already using this method or if your package needs to be zip-safe.

The toolbox is structured according to the second namespacing type, which is as follows:

setup.py
mynamespace/
    __init__.py  # Namespace package __init__.py
    subpackage_a/
        __init__.py  # Sub-package __init__.py
        module.py

So for imports to work properly we need to fix the serenata_toolbox/__init__.py file to contain

__path__ = __import__('pkgutil').extend_path(__path__, __name__)

as it is explained here.

Fix example

As pointed out by @mnunes (thank you!) on #85, you can only list all files from S3 if you have the AWS keys. A little fix is required in our examples on how to use the toolbox.

Refactoring tests to its specific tasks

This issue aims at refactoring two existing tests that are working well but need enhancements to keep them independent of the network connection and of writing to/reading from the filesystem.

Tests that need to be refactored:
tests/test_federal_senate_dataset.py
tests/test_chamber_of_deputies_dataset.py

@cuducos would like to help in it :)

Treat xml2csv.py as a script

The xml2csv.py module is a script (since there is embedded code that is executed whenever the module is called) and must be treated properly, i.e., the setup file must specify it as a script (after the package installation, the xml2csv script will be added properly to the system path).
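
A sketch of one way to declare it in setup.py, assuming a console_scripts entry point and assuming xml2csv.py exposes (or is refactored to expose) a main() callable:

from setuptools import setup, find_packages

setup(
    name='serenata-toolbox',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'xml2csv=serenata_toolbox.xml2csv:main',  # main() is an assumption here
        ],
    },
)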

Add the script to get the Federal Senate datasets.

Ok guys, it's time.

Now that the toolbox is ready to receive a new dataset, I am working on a script to get the Federal Senate datasets.
They are pretty different from Chamber of Deputies, but we can work on it.

For now I will be fixing the final things to make Rosie work fine with this new toolbox, and then I will work on bringing the new script into the toolbox.

Missing version bump

We missed version bumps when merging #48 and #49.

Suggestions for version bump:

#48 - micro - typo fix, not much trouble there
#49 - micro - few renaming conventions based on the changes from CEAP to Chamber of Deputies

XML parsing error while running Rosie

@cuducos I was running Rosie and got this error below, could you review it please?!

(serenata_rosie) 19:03:52 at rosie (master)$ python rosie.py run
2017-01-11 19:04:23 Creating the CSV file
2017-01-11 19:04:24 Reading the XML file
2017-01-11 19:04:24 Writing record #346 to the CSV
2017-01-11 19:04:24 Done!
2017-01-11 19:04:24 Creating the CSV file
2017-01-11 19:04:24 Reading the XML file
2017-01-11 19:06:19 Writing record #337,740 to the CSV
2017-01-11 19:06:19 Done!
2017-01-11 19:06:19 Creating the CSV file
2017-01-11 19:06:19 Reading the XML file
Traceback (most recent call last): #114,024 to the CSV
  File "rosie.py", line 36, in <module>
    command()
  File "rosie.py", line 23, in run
    rosie.main(target_directory)
  File "/home/temporal/Documents/Serenata/rosie/rosie/__init__.py", line 64, in main
    dataset = Dataset(target_directory).get()
  File "/home/temporal/Documents/Serenata/rosie/rosie/dataset.py", line 16, in get
    self.update_datasets()
  File "/home/temporal/Documents/Serenata/rosie/rosie/dataset.py", line 28, in update_datasets
    ceap.convert_to_csv()
  File "/home/temporal/anaconda3/envs/serenata_rosie/lib/python3.5/site-packages/serenata_toolbox/ceap_dataset.py", line 36, in convert_to_csv
    convert_xml_to_csv(xml_path, csv_path)
  File "/home/temporal/anaconda3/envs/serenata_rosie/lib/python3.5/site-packages/serenata_toolbox/xml2csv.py", line 70, in convert_xml_to_csv
    for json_io in xml_parser(xml_file_path):
  File "/home/temporal/anaconda3/envs/serenata_rosie/lib/python3.5/site-packages/serenata_toolbox/xml2csv.py", line 23, in xml_parser
    for event, element in iterparse(xml_path, tag=tag):
  File "src/lxml/iterparse.pxi", line 208, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:148582)
  File "src/lxml/iterparse.pxi", line 193, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:148280)
  File "src/lxml/iterparse.pxi", line 224, in lxml.etree.iterparse._read_more_events (src/lxml/lxml.etree.c:148818)
  File "src/lxml/parser.pxi", line 1374, in lxml.etree._FeedParser.close (src/lxml/lxml.etree.c:114116)
  File "src/lxml/parser.pxi", line 586, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:104990)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "/tmp/serenata-data/AnosAnteriores.xml", line 2
lxml.etree.XMLSyntaxError: Couldn't find end of Start Tag numEspecificacao, line 2, column 1

Problem when fetching data from CEAP

When running ceap = serenata_toolbox.CEAPDataset('data/') I got the error "File is not a zip file".

The same thing happened when running Rosie. I may be wrong, but it is possible that something is wrong with the data.

The traceback from running Rosie with Docker is here,

and the error from running the fetch with the toolbox in IPython is here.

Any thoughts?

Create unittests

Create unit tests for this package. Suggestions:

  • Use Travis for online triggered testing at each commit;
  • Use coverage and Coveralls for test coverage metrics.

Provide proper documentation

Provide proper documentation for this package. Suggestion:

  • Use sphinx and readthedocs.io for automatic documentation building online at each commit.

Script to Download and include Supervised Learning

After the contribution of many people, we built a gold standard as a reference to indicate whether a reimbursement is a generalization or not.
Example of a generalization:
5635048.pdf

Not a generalization:
5506259.pdf
Our reference sample consists of 1691 suspicious and 1691 non-suspicious reimbursements (link). It was manually curated, as explained in this video made by Felipe Cabral (apoia.se).

The goal of this dataset is to deal with this part of CEAP:

O documento que comprova o pagamento não pode ter rasura, acréscimos, emendas ou entrelinhas, deve conter data e deve conter os serviços ou materiais descritos item por item, sem generalizações ou abreviaturas, podendo ser:

The document proving the payment cannot contain erasures, additions, amendments, or interlineations; it must contain a date and must describe the services or materials item by item, without generalizations or abbreviations, and may be:

Thus, this issue aims at the following:

  1. Transfer the files I have in Google Drive to Amazon S3
  2. Create a script in the toolbox to download the above files. It could be a new category of datasets, e.g., Supervised Learning
  3. Create a script to download pre-built machine learning models for Rosie

For the first objective, find hereafter the files I have:
PNG images
CSV reference

Regarding the CSV files, we have to include the direct link to the Chamber of Deputies; right now they only have the link to Jarbas.
To do that easily, you have to take the document id from the CSV file and build the full link using a method like this:

def document_url(record):
    return 'http://www.camara.gov.br/cota-parlamentar/documentos/publ/%s/%s/%s.pdf' % \
        (record['applicant_id'], record['year'], record['document_id'])
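
For example, with a made-up applicant_id and year (document 5506259 is the "not a generalization" example above):

record = {'applicant_id': '1234', 'year': '2016', 'document_id': '5506259'}
print(document_url(record))
# http://www.camara.gov.br/cota-parlamentar/documentos/publ/1234/2016/5506259.pdf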

The dataset I used was this one:

import numpy as np
import pandas as pd

data = pd.read_csv('../data/2016-11-19-last-year.xz',
                   parse_dates=[16],
                   dtype={'document_id': np.str,
                          'congressperson_id': np.str,
                          'congressperson_document': np.str,
                          'term_id': np.str,
                          'cnpj_cpf': np.str,
                          'reimbursement_number': np.str})

data = data[data['subquota_description'] == 'Congressperson meal']
data = data[data['document_id'].isin(doc_ids)]
# The doc_ids are retrieved from the CSV file

The first objective will allow more people to have access to these curated files in order to replicate and create new experiments!

Second objective: The goal is to call some method like:

from serenata_toolbox.chamber_of_deputies.dataset import Dataset
chamber_of_deputies = Dataset(self.path)
chamber_of_deputies.fetch_supervised_learning()

It makes it easy to integrate the mentioned files in other parts of the project, e.g., a classifier using these files or an analysis using these data.

For the third objective, as you can see in the mentioned link: Classifier using these data (okfn-brasil/rosie#66)

Uploading big files to git is not a good practice. Therefore, to facilitate the contribution of new models to Rosie, we have to create a method to specify which model we would like to retrieve.
Example right now:

if classifier.__name__ == 'MealGeneralizationClassifier':
    model = classifier()
    model.fit('rosie/chamber_of_deputies/classifiers/keras/model/weights.hdf5')

Proposed:

if classifier.__name__ == 'MealGeneralizationClassifier':
    model = classifier()
    model.fit(self._model('generalization'))

It will allow us in the future to include more models and to re-train the existing ones to be more robust.
For this task, find hereafter my model:
Meal Generalization

PS: To upload files, we have this method in the toolbox's remote.py

Fix init file

Fix the package init file by removing the dummy helloworld function and including the proper import statements, making calls to functions/classes more straightforward, without the need to refer to the submodules.
Example:

# python3
>>> from serenata_toolbox import CEAPDataset
>>> from serenata_toolbox import fetch_latest_backup
>>> from serenata_toolbox import Reimbursements
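
A minimal sketch of what serenata_toolbox/__init__.py could re-export to enable those imports (the module paths below should be double-checked against the current layout):

# serenata_toolbox/__init__.py (sketch)
from serenata_toolbox.ceap_dataset import CEAPDataset
from serenata_toolbox.datasets import fetch, fetch_latest_backup
from serenata_toolbox.chamber_of_deputies.reimbursements import Reimbursements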

Add `if` statement to avoid dropping 'Flight ticket issue' expenses

@jtemporal figured out why the dataset is missing subquota 999, 'Flight ticket issue' (see #106). According to her findings:

What happens is, there is a filter that cuts out receipts with reimbursement_value equal to 0, because this means that the document was not reimbursed.

It is indeed not a bug. The reason: subquota 999, 'Flight ticket issue', does not generate a reimbursement value. According to the Chamber of Deputies:

Os gastos com bilhete aéreo (...) também não são objeto de reembolso e, por isso, não há emissão individual de nota fiscal. O valor gasto é debitado automaticamente do valor da cota do respectivo parlamentar.

Flight ticket expenses (...) are also not subject to reimbursement; therefore, there is no individual invoice issued. The amount spent is automatically deducted from the amount of the respective member's subquota.

I understand the mission of this project regarding reimbursements and how this work flows around reimbursement values. But taking it strictly, we disregard expenses for which the congressperson does not get reimbursed; we disregard subquotas from which the congressperson has a monthly value to deduct.

In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent. And a lot of it: over R$ 100 million during the current term, putting Flight ticket issue in second place among subquotas with most expenses.

As an example of the relevance of having this subquota in our dataset, a few years ago there was a public scandal called "Farra das passagens", about congresspersons using this specific subquota to issue tickets for their family members and friends.

So I ask you guys: although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?
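
In code terms, the suggestion in the title amounts to something like the sketch below (the column names are assumptions based on this discussion, and the actual filter lives in the reimbursements-generation step):

# dataset: a pandas DataFrame with the merged reimbursements, assumed already loaded
is_flight_ticket = dataset['subquota_description'] == 'Flight ticket issue'
reimbursed = dataset['reimbursement_value'] != 0
dataset = dataset[reimbursed | is_flight_ticket]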

Big refactor of the public API to generate the datasets

This issue is proposed as a roadmap for a big refactor of the public API. It might also work as a wishlist for those who use this toolbox and believe its API for generating the datasets could be improved. I'll suggest a to-do list in this opening post and try to keep it updated as the discussion goes on.

The main problems with the current one have been discussed by @lipemorais and myself in several other issues and PRs. For example:

  • Problems with semantics: we only have fetch, translate and clean methods, when what really happens in these three methods is fetching data, translating it to en_US, cleaning it, converting it from .csv to .xz, and merging datasets by year into a single file (see #53 and #68)
  • These methods have a lot of embedded logic, which makes tests more complex than they could be (see #68)
  • We lack a single method to handle all the tasks required to generate a dataset from scratch (see #59)
  • Integration tests still depend on an external server — we could work with fixtures instead (also #59)

Therefore what I propose here is to:

  • map the impact of changing the API
  • rewrite fetch, translate, clean into more atomic methods (reducing side effects), with really simple logic, adding more methods if needed (e.g. convert_to_xz, translate_to_en, etc.)
  • add a method (e.g. generate) to handle all internal tasks, from downloading data from the original source to having a dataset ready for the Serenata pipeline (i.e. make all the methods from the previous task internal methods used by this main one; see the sketch after this list)
  • write unit tests for each of these methods that do not depend on external downloads (using mocks)
  • write integration tests for this class that do not depend on external downloads (using fixtures)
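
A very rough sketch of the shape this could take, with hypothetical method names (only fetch, translate and clean exist today; the rest merely illustrate the "atomic steps plus one orchestrator" idea):

class Dataset:
    """Sketch of the proposed API: small atomic steps plus one orchestrator."""

    def __init__(self, path):
        self.path = path

    def fetch(self): ...             # download from the original source
    def translate_to_en(self): ...   # translate columns/values to en_US
    def clean(self): ...             # clean the data
    def convert_to_xz(self): ...     # convert from .csv to .xz
    def merge_by_year(self): ...     # merge datasets by year into a single file

    def generate(self):
        """Run the whole pipeline, from the original source to a Serenata-ready dataset."""
        self.fetch()
        self.translate_to_en()
        self.clean()
        self.convert_to_xz()
        self.merge_by_year()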

I think that this refactor will enhance our code quality and architecture and can pave the way to more overarching changes such as:

  • adopting Dask as a default to handle any dataset (if we have barely any side effects, this is quite easy)
  • changing the test suite to something more robust such as pytest, or even using tox
  • opening the API for new user customization, embracing DRY and avoiding this

Toolbox error

I had to uninstall Anaconda and delete my env. I installed everything back and recreated the env. But during setup, I had a weird error. Apparently it is related to the toolbox and its datasets.

Take a look:

(screenshot of the error)

(I guess Windows users must change "data/" to "data". I will try that and get back here to say whether it worked.)

Update Brazilian Cities CSV on the latest_backup list.

We are expanding our project to cities; in order to do that, we must update our dataset list.

We need to add 2017-05-22-brazilian-cities.csv to the latest-backup script, in order to make the serenata-de-amor notebooks work.

Test async part of Datasets module

If #38 is merged, the coverage report points to a non-tested part, which is basically the coroutines:

Name                                          Stmts   Miss  Cover
-----------------------------------------------------------------
serenata_toolbox/datasets/__init__.py            26      0   100%
serenata_toolbox/datasets/contextmanager.py       5      0   100%
serenata_toolbox/datasets/downloader.py          58     25    57%
serenata_toolbox/datasets/local.py               15      0   100%
serenata_toolbox/datasets/remote.py              21      0   100%
-----------------------------------------------------------------
TOTAL                                           125     25    80%

I opened the PR as it is because I've never written unit tests for asyncio and studying that was postponing the new module further than I could handle. Sorry about that, guys.
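
For the record, a generic pattern for unit testing a coroutine (this is not the downloader's actual API, just the general shape such a test could take):

import asyncio
import unittest


class CoroutineTestCase(unittest.TestCase):

    def test_coroutine_result(self):
        async def fake_download():  # stand-in for a downloader coroutine
            await asyncio.sleep(0)
            return 'ok'

        loop = asyncio.new_event_loop()
        try:
            result = loop.run_until_complete(fake_download())
        finally:
            loop.close()

        self.assertEqual(result, 'ok')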

Update package versioning

Serenata Toolbox has 19 merged PRs, and most of them should have generated a new package version if it were a released PyPI package. We should consider updating the package version at least when new datasets are added or updated.

Up-to-date reimbursements file

Hey everyone.

I just ran group_receipts.py and got a fresh new set of data (2017-04-11-reimbusements.xz).
Should I use it only locally, or do you guys believe it may be useful to upload it to Amazon?

Love,

Amazon credentials path

Error when running Jupyter notebooks in serenata-de-amor/develop saying it could not find config.ini file (Amazon credentials).

Could not find config.ini file.
using fetch from serenata_toolbox.datasets
Could not find config.ini file.
You need Amzon section in it to interact with S3
(Check config.ini.example if you need a reference.)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-b724014e1ca7> in <module>()
----> 1 fetch('2017-04-21-sex-place-distances.xz', '../data')

/Users/renangohe/anaconda/envs/serenata_de_amor/lib/python3.6/site-packages/serenata_toolbox/datasets/__init__.py in fetch(filename, destination_path)
     72 
     73 def fetch(filename, destination_path):
---> 74     datasets = Datasets(destination_path)
     75     return datasets.downloader.download(filename)
     76 

/Users/renangohe/anaconda/envs/serenata_de_amor/lib/python3.6/site-packages/serenata_toolbox/datasets/__init__.py in __init__(self, local_directory)
     53             local_directory,
     54             bucket=self.remote.bucket,
---> 55             **self.remote.credentials
     56         )
     57 

TypeError: type object argument after ** must be a mapping, not NoneType

We are working on implementing a simple search for the config file in the project folder.

Clean/convert all the data before exporting

I think it's serenata-toolbox's responsibility to convert and clean data, such as replacing , with . in float values, converting dates from the %d/%m/%y format to %Y-%m-%d, and so on, so Jarbas, Rosie, and other tools don't need to bother with this kind of task.

If that's true, then we need to move code like to_number and to_date from Jarbas to here, remove it from the other repositories (and maybe convert some files that were already exported and are hosted on S3).
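
For context, those helpers are roughly of this shape (a sketch, not the actual Jarbas code):

from datetime import datetime


def to_number(value):
    """Turn a decimal written with a comma (e.g. '1234,56') into a float."""
    return float(value.replace(',', '.'))


def to_date(value):
    """Turn a %d/%m/%y date (e.g. '31/12/16') into %Y-%m-%d ('2016-12-31')."""
    return datetime.strptime(value, '%d/%m/%y').strftime('%Y-%m-%d')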

This issue is kind of related to #87.

@cuducos could you please help me validate the issue requirements and add more details, if possible? I can work on this.

Error updating reimbursement files

@rodolfo-viana has asked:

I tried to update reimbursement files last Sunday, but found a weird error when fetching.

In [1]: from serenata_toolbox.chamber_of_deputies.chamber_of_deputies_dataset import ChamberOfDeputiesDataset

In [2]: chamber_of_deputies = ChamberOfDeputiesDataset('data/')

In [3]: chamber_of_deputies.fetch()
---------------------------------------------------------------------------
BadZipFile Traceback (most recent call last)
<ipython-input-3-032b8190ff36> in <module>()
----> 1 chamber_of_deputies.fetch()

c:\users\rodolfoviana\documents\serenata-de-amor\serenata-toolbox\serenata_toolbox\chamber_of_deputies\chamber_of_deputies_dataset.py in fetch(self)
21 zip_file_path = os.path.join(self.path, filename)
22 urlretrieve(url, zip_file_path)
---> 23 zip_file = ZipFile(zip_file_path, 'r')
24 zip_file.extractall(self.path)
25 zip_file.close()

c:\users\rodolfoviana\appdata\local\conda\conda\envs\serenata_de_amor\lib\zipfile.py in __init__(self, file, mode, compression, allowZip64)
1098 try:
1099 if mode == 'r':
-> 1100 self._RealGetContents()
1101 elif mode in ('w', 'x'):
1102 # set the modified flag so central directory gets written

c:\users\rodolfoviana\appdata\local\conda\conda\envs\serenata_de_amor\lib\zipfile.py in _RealGetContents(self)
1166 raise BadZipFile("File is not a zip file")
1167 if not endrec:
-> 1168 raise BadZipFile("File is not a zip file")
1169 if self.debug > 1:
1170 print(endrec)

BadZipFile: File is not a zip file

Do you guys know how to solve this?


@jtemporal has replied:

Hi @rodolfo-viana which version of the toolbox are you using?
