The FeatureGenerator should have some way of dynamically computing choices for categor


Allow dynamic categorical lists about triage HOT 6 CLOSED

dssg commented on June 14, 2024

Allow dynamic categorical lists

from triage.

Comments (6)

thcrock commented on June 14, 2024

@jzanzig what do you think of an interface like this?

        categoricals:
            -
                column: 'color_id'
                choice_query: 'select distinct(color_id) from myschema.colors'
                metrics: ['sum']

Temporal leakage is worth considering, but it may be less of a concern than I first thought. Even if that query returns choices that didn't exist yet as of the requested as_of_dates, they will get columns but just won't end up with any data. So maybe that's fine?
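To make the proposal concrete, here is a minimal sketch of how a choice_query could be resolved into one column per choice. It is not triage's or collate's actual implementation: the helper names, the `color_id_<n>_sum` naming scheme, and the in-memory SQLite tables (without the `myschema` prefix) are all assumptions for illustration, and 'sum' is taken to mean the sum of a 0/1 indicator for each choice.

```python
import sqlite3


def resolve_choices(conn, choice_query):
    """Run the choice query and return its distinct values in a stable order."""
    rows = conn.execute(choice_query).fetchall()
    return sorted({row[0] for row in rows if row[0] is not None})


def categorical_sum_columns(column, choices):
    """Build one SUM expression per choice (hypothetical naming scheme)."""
    # Choices are assumed to be integer ids, so int() is used before inlining.
    return [
        f"SUM(CASE WHEN {column} = {int(c)} THEN 1 ELSE 0 END) "
        f"AS {column}_{int(c)}_sum"
        for c in choices
    ]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colors (color_id INTEGER);
    INSERT INTO colors VALUES (1), (2), (3), (2);
    CREATE TABLE events (entity_id INTEGER, color_id INTEGER);
    INSERT INTO events VALUES (10, 1), (10, 2), (11, 2);
""")

choices = resolve_choices(conn, "select distinct(color_id) from colors")
columns = categorical_sum_columns("color_id", choices)
query = (
    f"SELECT entity_id, {', '.join(columns)} "
    "FROM events GROUP BY entity_id ORDER BY entity_id"
)
result = conn.execute(query).fetchall()
print(choices)
print(result)
```

In this toy data, color 3 is returned by the choice query but never appears in `events`, so it still gets a column and that column is 0 for every entity.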

thcrock commented on June 14, 2024

@andreanr also mentioned supporting getting the feature names from lookup tables as well. For instance, using the above interface, you could either:

- Get feature names that have color ids in them instead of the more helpful color names, or
- Use a join query in your spacetime aggregation's from_obj, and use the joined name instead of the id. A bunch of joins may make the queries take longer.

Neither of these is optimal, so solving this in a better way could be worth it. @andreanr do you think the work you've done on this problem could be used here?
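One possible shape for the lookup-table idea, purely as a sketch: resolve an id-to-name mapping once, up front, and use it only to label the columns, so the aggregation itself can keep filtering on the id without a join. The function name, the `(id, name)` query shape, and the naming scheme below are hypothetical and not part of triage or collate.

```python
import re
import sqlite3


def feature_names_from_lookup(conn, lookup_query, column, metric):
    """Map each id to a readable feature name using an (id, name) query."""
    names = {}
    for choice_id, label in conn.execute(lookup_query):
        # Normalize the label into something safe to use as a column name.
        slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
        names[choice_id] = f"{column}_{slug}_{metric}"
    return names


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colors (color_id INTEGER, color_name TEXT);
    INSERT INTO colors VALUES (1, 'Red'), (2, 'Dark Blue');
""")

names = feature_names_from_lookup(
    conn, "select color_id, color_name from colors", "color", "sum"
)
print(names)
```

The lookup query runs once per feature definition rather than once per aggregation, which is the trade-off this sketch is meant to illustrate.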

thcrock commented on June 14, 2024

Created dssg/collate#67 for the problem I outlined in the previous comment. It seems more appropriate to solve that problem there than here.

For the other use cases (i.e. denormalized tables, categoricals which actually are ids, and cases where joining to look up a value is not too bad), I think we can move ahead here with a simpler interface like the one outlined above. I'm still unsure whether there are any leakage problems with creating a list of choices this way.

jzanzig commented on June 14, 2024

I'm not sure I completely understand what difference between the way collate deals with categoricals and this approach would introduce information leakage. Are you saying that the select distinct query could return values of color_id that aren't present in a specific training or test set, so that the generated feature would be 0 for all observations? If so, that's fine: a feature that is 0 for all observations isn't leaking information if that's the correct value. I think the interface (specifying a choice_query) is good.
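A small sketch of that scenario, under assumed table and column names (the `events` table, `event_date`, and the `color_id_<n>_sum` naming are hypothetical): the choice list is computed over all time, while the feature query only sees rows before the as_of_date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Color 3 first appears in 2017, after the as_of_date used below.
conn.executescript("""
    CREATE TABLE events (entity_id INTEGER, color_id INTEGER, event_date TEXT);
    INSERT INTO events VALUES
        (10, 1, '2015-03-01'),
        (10, 2, '2015-06-01'),
        (11, 3, '2017-01-01');
""")

as_of_date = "2016-01-01"

# The choice list is not restricted by date, so it includes color 3.
choices = [
    row[0]
    for row in conn.execute(
        "select distinct(color_id) from events order by color_id"
    )
]

columns = ", ".join(
    f"SUM(CASE WHEN color_id = {int(c)} THEN 1 ELSE 0 END) AS color_id_{int(c)}_sum"
    for c in choices
)

# The feature query only uses rows strictly before the as_of_date.
rows = conn.execute(
    f"SELECT entity_id, {columns} FROM events "
    "WHERE event_date < ? GROUP BY entity_id ORDER BY entity_id",
    (as_of_date,),
).fetchall()
print(choices)
print(rows)
```

The column for color 3 exists but is 0 for every row returned. Its presence does reveal that the value exists at some point, but the feature values themselves are computed only from data before the as_of_date.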

thcrock commented on June 14, 2024

Looks like we agree on both points:

- The current workarounds for dynamic categoricals are no different from this proposal with regard to information leakage; they just take more work.
- This doesn't actually introduce information leakage (I was pretty sure of this but wanted to confirm).

Furthermore, based on Matt's comments above, it looks like this feature may make it into collate, so we could piggyback on that here.

thcrock commented on June 14, 2024

Going to implement v1 of this here instead of waiting for collate.
