
Comments (14)

jtemporal commented on August 30, 2024

I took the liberty of renaming this issue since we agreed on an approach to have the 999 sub quota in the final dataset ;)

from serenata-toolbox.

jtemporal commented on August 30, 2024

Is the data available in the new version of the API or only in the XML version?

Yes it is, I checked to find out if we were cutting it out or by any chance the chamber was. It is the filter we have that was dropping lines with 0 reimbursement value.

If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999.

That was my initial idea!

jtemporal commented on August 30, 2024

Is that right, @jtemporal?

I guess so, but I'll test it today with @rodolfo-viana ;)

cuducos commented on August 30, 2024

Very good and important point, @rodolfo-viana! I'm almost sure this detail was unnoticed until now… Surely we need to take that into account.

jtemporal commented on August 30, 2024

although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?

I believe it was never our intention to cut out an entire sub quota.

this detail was unnoticed until now

And I agree with this.

In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent.

You are completely right! I believe the way to go here is to find a way to cut out receipts that weren't reimbursed and still have the 999 sub quota expenses in our dataset.

@cuducos any ideas?

rodolfo-viana commented on August 30, 2024

I believe this subquota was cut out -- not by Serenata team, of course -- during the time Chamber was setting up its second version of open data. I say that because I had read a notebook that covered Flight ticket issue: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2016-08-13-irio-descriptive-analysis.ipynb

I guess that when they changed their dataset something went wrong, although another analysis had led to a positive result: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2017-05-21-luizcavalcanti-chamber-ceap-api-version-comparison.ipynb

Anyway, if I can help somehow, just let me know.

cuducos commented on August 30, 2024

@cuducos any ideas?

Is the data available in the new version of the API or only in the XML version?

If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999. And surely mention that in the documentation because people will ask about that.
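A minimal sketch of that idea in pandas, not the toolbox's actual filter. The column names `subquota_number` and `net_value` and the helper `drop_unreimbursed` are assumptions made for illustration; the real names in the codebase may differ.

```python
import pandas as pd

FLIGHT_TICKET_ISSUE = 999  # subquota discussed in this thread


def drop_unreimbursed(receipts, value_column="net_value",
                      subquota_column="subquota_number"):
    """Drop rows with a zero value, except rows in subquota 999.

    Both column names are hypothetical defaults, not confirmed names.
    """
    has_value = receipts[value_column] != 0
    is_flight_ticket = receipts[subquota_column] == FLIGHT_TICKET_ISSUE
    return receipts[has_value | is_flight_ticket]


# Toy data, invented for the example
receipts = pd.DataFrame({
    "subquota_number": [1, 1, 999, 999],
    "net_value": [10.0, 0.0, 0.0, 250.0],
})
kept = drop_unreimbursed(receipts)
```

Only the zero-value row outside subquota 999 is removed here; the zero-value 999 row survives the filter.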

rodolfo-viana commented on August 30, 2024

I checked the .csv files and compared 999 to other subquotas. It lacks three columns:

  • batch_number (that we hardly use),
  • reimbursement_number (ditto), and
  • document_id

I believe rows without a document_id are being dropped in reimbursements.py:

    def group(self, receipts):
        print('Dropping rows without document_value or reimbursement_number…')
        subset = ('document_value', 'reimbursement_number')
        receipts = receipts.dropna(subset=subset)

        groupby_keys = ('year', 'applicant_id', 'document_id')
        receipts = receipts.dropna(subset=subset + groupby_keys)

Is it the issue?
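To make the suspicion concrete, here is a small sketch on toy data (invented for the example, not taken from the Chamber's files) that applies the same kind of `dropna` call as the excerpt above. The `subquota_number` column name is an assumption; the other column names come from the excerpt.

```python
import pandas as pd

# Toy rows: one regular receipt and one 999 row with no identifiers
receipts = pd.DataFrame({
    "year": [2017, 2017],
    "applicant_id": [1, 2],
    "subquota_number": [3, 999],
    "document_id": [1001.0, None],
    "document_value": [50.0, 300.0],
    "reimbursement_number": [5001.0, None],
})

subset = ["document_value", "reimbursement_number"]
groupby_keys = ["year", "applicant_id", "document_id"]

# A row is dropped if ANY of the listed columns is missing
remaining = receipts.dropna(subset=subset + groupby_keys)
dropped = receipts.loc[receipts.index.difference(remaining.index)]
```

If 999 rows really arrive without `document_id` or `reimbursement_number`, they would end up in `dropped`, which is the behaviour being questioned here.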

cuducos commented on August 30, 2024

I wouldn't be so sure they are being dropped. Can anyone confirm that, in the source, 999 subquota rows have document_ids?

The thing is that document_id is not documented anywhere in their material. We guess it's a kind of unique identifier for the reimbursement. As flight tickets are not actually reimbursed, maybe they never receive this identifier at all…

rodolfo-viana commented on August 30, 2024

It seems this function drops rows lacking reimbursement_number or document_id, and both are missing in 999.

Anyway, if I can help somehow, let me know. :)

cuducos commented on August 30, 2024

It seems this function drops rows lacking reimbursement_number or document_id, and both are missing in 999.

So this is the problem — they don't come with a document_id. That's awful. Anyway… pandas can catch that quite easily I guess. Is that right, @jtemporal?

We're gonna have to discuss this along with Jarbas architecture too — the whole API is based on the uniqueness of document_id.

rodolfo-viana commented on August 30, 2024

Just to explain the process I went through to come to this idea: I downloaded the .csv file with this year's expenses, opened it in Excel, picked 999 and other subquotas, looked up which columns these other subquotas have and 999 does not, and found the three mentioned above.

I am not sure if it is different in .xml. I believe it is not.
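The same comparison could be scripted instead of done by hand in Excel. This is only a sketch: the helper `columns_missing_only_in` and the `subquota_number` column name are assumptions, and the data frame below is toy data standing in for the downloaded file.

```python
import pandas as pd


def columns_missing_only_in(receipts, subquota,
                            subquota_column="subquota_number"):
    """List columns that are entirely empty for `subquota` but filled elsewhere."""
    target = receipts[receipts[subquota_column] == subquota]
    others = receipts[receipts[subquota_column] != subquota]
    empty_in_target = target.isna().all()       # per column: no value at all
    present_in_others = others.notna().any()    # per column: at least one value
    mask = empty_in_target & present_in_others
    return sorted(mask[mask].index)


# Toy data, invented for the example
receipts = pd.DataFrame({
    "subquota_number": [1, 3, 999],
    "document_id": [10.0, 11.0, None],
    "batch_number": [7.0, 8.0, None],
    "reimbursement_number": [70.0, 80.0, None],
    "document_value": [5.0, 6.0, 7.0],
})
missing = columns_missing_only_in(receipts, 999)
```

On a real file, `receipts` would come from `pd.read_csv` on the downloaded .csv, and the same function could be run against data parsed from the .xml to check whether it differs.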

jtemporal commented on August 30, 2024

Yep! A talk about Jarbas architecture is required. To have the 999 sub quota (and also 10 and 11, for that matter) back in our dataset, we need to study the implications and maybe revisit Jarbas' whole structure. Maybe generating a separate dataset for these quotas could be a way for now.

cuducos commented on August 30, 2024

Jarbas used to have a composite unique ID made of year, applicant_id and document_id. No problem in recreating something like that. I think this is a heads-up, but the first thing is to generate the data, bring it in and see what crashes (on our local machines). The main question is not about Jarbas itself, it is about the data (what is the unique ID for each row? just the sequential index?). Even if there's none we can work around it (e.g. no detail view, only list views).
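One way to picture that, as a rough sketch only: build the composite key where document_id exists and fall back to the row index where it does not. The `row_key` column and the `add_row_key` helper are hypothetical, and this is not how Jarbas builds its IDs.

```python
import pandas as pd


def add_row_key(receipts):
    """Add a hypothetical `row_key`: composite ID, or the index as a fallback."""
    def make_key(row):
        if pd.isna(row["document_id"]):
            # No document_id (e.g. subquota 999): fall back to the row index
            return f"idx-{row.name}"
        return (f"{int(row['year'])}-{int(row['applicant_id'])}"
                f"-{int(row['document_id'])}")

    receipts = receipts.copy()
    receipts["row_key"] = receipts.apply(make_key, axis=1)
    return receipts


# Toy data, invented for the example
receipts = pd.DataFrame({
    "year": [2017, 2017],
    "applicant_id": [1, 2],
    "document_id": [1001.0, None],
})
keyed = add_row_key(receipts)
```

The index fallback is only unique within one generated file and would change whenever the dataset is regenerated, which is why a detail view keyed on it may not be viable.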
