Add `if` statement to avoid dropping 'Flight ticket issue' expenses about serenata-toolbox HOT 14 CLOSED

rodolfo-viana commented on August 30, 2024 2

Add `if` statement to avoid dropping 'Flight ticket issue' expenses

from serenata-toolbox.

Comments (14)

jtemporal commented on August 30, 2024 2

I took the liberty to rename this issue since we agreed on an approach to have the 999 sub quota in the final dataset ;)

jtemporal commented on August 30, 2024 1

Is the data available in the new version of the API or only in the XML version?

Yes it is, I checked to find out if we were cutting it out or by any chance the chamber was. It is the filter we have that was dropping lines with 0 reimbursement value.

If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999.

That was my initial idea!

jtemporal commented on August 30, 2024 1

Is that right, @jtemporal?

I guess so, but I'll test it today with @rodolfo-viana ;)

cuducos commented on August 30, 2024

Very good and important point, @rodolfo-viana! I'm almost sure this detail was unnoticed until now… Surely we need to take that into account.

jtemporal commented on August 30, 2024

although dropping Flight ticket issue from our dataset is not a bug, shouldn't we reconsider having it back?

I believe it was never our intention to cut out an entire sub quota.

this detail was unnoticed until now

And I agree with this.

In this category, although congresspersons do not have to pay first and get the value reimbursed later, there is public money being spent.

You are completely right! I believe the way to go here is to find a way to cut out receipts that weren't reimbursed while still keeping the 999 sub quota expenses in our dataset.

@cuducos any ideas?

rodolfo-viana commented on August 30, 2024

I believe this subquota was cut out -- not by Serenata team, of course -- during the time Chamber was setting up its second version of open data. I say that because I had read a notebook that covered Flight ticket issue: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2016-08-13-irio-descriptive-analysis.ipynb

I guess that when they changed their dataset something went wrong, although another analysis had led to a positive result: https://github.com/datasciencebr/serenata-de-amor/blob/master/develop/2017-05-21-luizcavalcanti-chamber-ceap-api-version-comparison.ipynb

Anyway, if I can help somehow, just let me know.

cuducos commented on August 30, 2024

@cuducos any ideas?

Is the data available in the new version of the API or only in the XML version?

If it is, I think we simply enhance the if statement that excludes reimbursements with zero values to keep them if subquota is 999. And surely mention that in the documentation because people will ask about that.
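A minimal sketch of that enhanced filter, assuming the receipts are in a pandas DataFrame. The column names `subquota_number` and `reimbursement_value` and the helper `drop_unreimbursed` are hypothetical and only illustrate the idea; they are not taken from the toolbox code.

```python
import pandas as pd

FLIGHT_TICKET_ISSUE = 999  # subquota number discussed in this thread


def drop_unreimbursed(receipts: pd.DataFrame) -> pd.DataFrame:
    """Drop zero-value rows, except those in the 999 subquota.

    Column names are assumptions made for this sketch.
    """
    has_value = receipts["reimbursement_value"] != 0
    is_flight_ticket = receipts["subquota_number"] == FLIGHT_TICKET_ISSUE
    # Keep a row if it was reimbursed OR if it belongs to subquota 999.
    return receipts[has_value | is_flight_ticket]


receipts = pd.DataFrame({
    "subquota_number": [1, 1, 999, 999],
    "reimbursement_value": [120.0, 0.0, 0.0, 0.0],
})
kept = drop_unreimbursed(receipts)
print(kept)
```

In this toy data, only the zero-value row outside subquota 999 is removed.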

rodolfo-viana commented on August 30, 2024

I checked the .csv files and compared 999 to other subquotas. It lacks values in three columns:

batch_number (that we hardly use),
reimbursement_number (ditto), and
document_id

I believe rows without a document_id are being dropped in reimbursements.py:

    def group(self, receipts):
        print('Dropping rows without document_value or reimbursement_number…')
        subset = ('document_value', 'reimbursement_number')
        receipts = receipts.dropna(subset=subset)

        groupby_keys = ('year', 'applicant_id', 'document_id')
        receipts = receipts.dropna(subset=subset + groupby_keys)

Is it the issue?
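One way to check this hypothesis is to run the same two `dropna` calls on a tiny made-up DataFrame. The data below is invented, and `subquota_number` is an assumed column name; only the `subset` and `groupby_keys` tuples come from the snippet above.

```python
import pandas as pd

# Invented sample: one regular receipt and two 999 rows that lack
# document_id and reimbursement_number, as observed in the .csv files.
receipts = pd.DataFrame({
    "year": [2017, 2017, 2017],
    "applicant_id": [1, 2, 3],
    "subquota_number": [1, 999, 999],  # assumed column name
    "document_id": [10.0, None, None],
    "document_value": [50.0, 300.0, 450.0],
    "reimbursement_number": [5000.0, None, None],
})

subset = ("document_value", "reimbursement_number")
groupby_keys = ("year", "applicant_id", "document_id")

# Same calls as in group(), with the tuples converted to lists.
after_first = receipts.dropna(subset=list(subset))
after_second = receipts.dropna(subset=list(subset + groupby_keys))

print(len(receipts), len(after_first), len(after_second))
```

If the real data looks like this sample, the 999 rows are already gone after the first `dropna`, because `reimbursement_number` is empty for them.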

cuducos commented on August 30, 2024

I wouldn't be so sure they are being dropped. Can anyone confirm that the 999 subquota has document_ids in the source?

The thing is that document_id is not documented anywhere in their material. We guess it's a kind of unique identifier for the reimbursement. As flight tickets are not actually reimbursed, maybe they never receive this identifier at all…

rodolfo-viana commented on August 30, 2024

It seems this function drops rows missing reimbursement_number or document_id, both of which are absent in 999.

Anyway, if I can help somehow, let me know. :)

cuducos commented on August 30, 2024

It seems this function drops rows missing reimbursement_number or document_id, both of which are absent in 999.

So this is the problem — they don't come with a document_id. That's awful. Anyway… pandas can catch that quite easily I guess. Is that right, @jtemporal?

We're gonna have to discuss this along with Jarbas architecture too — the whole API is based on the uniqueness of document_id.

rodolfo-viana commented on August 30, 2024

Just to explain the process I went through to come to this idea: I downloaded the .csv file with this year's expenses, opened it in Excel, picked 999 and other subquotas, looked up which columns are filled for the other subquotas and empty for 999, and found the three mentioned above.

I am not sure if it is different in .xml. I believe it is not.
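The same comparison can be sketched in pandas instead of Excel. The helper name `empty_columns_for`, the `subquota_number` column name, and the sample values are assumptions made for illustration; in practice the DataFrame would come from the downloaded .csv file.

```python
import pandas as pd


def empty_columns_for(receipts, subquota, column="subquota_number"):
    """List columns that are entirely empty for one subquota
    but filled at least once for the remaining subquotas."""
    is_target = receipts[column] == subquota
    empty_in_target = receipts[is_target].isna().all()
    empty_elsewhere = receipts[~is_target].isna().all()
    only_in_target = empty_in_target & ~empty_elsewhere
    return sorted(only_in_target[only_in_target].index)


# Invented sample mirroring what was observed in the spreadsheet.
receipts = pd.DataFrame({
    "subquota_number": [1, 3, 999, 999],
    "batch_number": [11.0, 12.0, None, None],
    "reimbursement_number": [21.0, 22.0, None, None],
    "document_id": [31.0, 32.0, None, None],
    "document_value": [10.0, 20.0, 30.0, 40.0],
})
missing = empty_columns_for(receipts, 999)
print(missing)
```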

jtemporal commented on August 30, 2024

Yep! A talk about Jarbas architecture is required. To bring the 999 sub quota (and also 10 and 11, for that matter) back into our dataset, we need to study the implications and maybe revisit Jarbas' whole structure. Maybe generating a separate dataset for these quotas could be a way for now.

cuducos commented on August 30, 2024

Jarbas used to have a composite unique ID made of year, applicant_id and document_id. No problem in recreating something like that. I think this is a heads-up, but the first thing is to generate the data, bring it in and see what crashes (on our local machines). The main question is not about Jarbas itself, it is about the data (what is the unique ID for each row? Just the sequential index?). Even if there's none we can work around it (e.g. no detail view, only list views).
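A rough sketch of such an identifier, assuming a pandas DataFrame with the three key columns. The `row_id` column, the `add_row_id` helper and the `row-N` fallback format are hypothetical; this is not how Jarbas builds its IDs.

```python
import pandas as pd


def add_row_id(receipts: pd.DataFrame) -> pd.DataFrame:
    """Add a row_id built from year, applicant_id and document_id,
    falling back to the sequential index when document_id is missing."""
    receipts = receipts.reset_index(drop=True)
    ids = []
    for position, row in receipts.iterrows():
        if pd.isna(row["document_id"]):
            # Fallback for rows such as subquota 999: sequential index.
            ids.append(f"row-{position}")
        else:
            ids.append(
                f"{int(row['year'])}-{int(row['applicant_id'])}"
                f"-{int(row['document_id'])}"
            )
    return receipts.assign(row_id=ids)


receipts = pd.DataFrame({
    "year": [2017, 2017],
    "applicant_id": [1, 2],
    "document_id": [10.0, None],
})
with_ids = add_row_id(receipts)
print(with_ids)
```

Note that the fallback is only stable as long as row order does not change between dataset versions, which is part of the question raised above.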
