
Comments (6)

thcrock commented on June 14, 2024

@jzanzig what do you think of an interface like this?

        categoricals:
            -
                column: 'color_id'
                choice_query: 'select distinct(color_id) from myschema.colors'
                metrics: ['sum']

Temporal leakage is worth considering, but maybe not as much as I'm thinking. Even if that query returns choices that didn't exist yet as of the requested as_of_dates, those choices will get columns that just won't end up with any data. So maybe that's fine?
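A minimal sketch of that behavior, assuming an in-memory SQLite database in place of the real one. The tables, the helpers expand_choices and categorical_sums, and the feature naming are all hypothetical illustrations, not how triage or collate implement it. The choice color_id = 3 only shows up after the as_of_date, so it gets a column that is 0 for every entity:

```python
import sqlite3

# Hypothetical toy data: a lookup table of colors and some dated events.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table colors (color_id integer);
    insert into colors values (1), (2), (3);
    create table events (entity_id integer, color_id integer, event_date text);
    insert into events values
        (10, 1, '2015-01-01'),
        (10, 2, '2015-06-01'),
        (11, 1, '2015-03-01'),
        (11, 3, '2017-01-01');
""")


def expand_choices(conn, choice_query):
    """Run the configured choice_query and return the distinct choices."""
    return sorted(row[0] for row in conn.execute(choice_query))


def categorical_sums(conn, column, choices, as_of_date):
    """Count events per entity and choice, using only rows before as_of_date.

    `column` is interpolated into the SQL, so it is assumed to come from
    trusted configuration; the choices and the date are bound as parameters.
    """
    selects = ", ".join(
        f"sum(case when {column} = ? then 1 else 0 end) as {column}_{choice}_sum"
        for choice in choices
    )
    query = (
        f"select entity_id, {selects} from events "
        "where event_date < ? group by entity_id order by entity_id"
    )
    cursor = conn.execute(query, [*choices, as_of_date])
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in cursor]


choices = expand_choices(conn, "select distinct(color_id) from colors")
features = categorical_sums(conn, "color_id", choices, "2016-01-01")
for row in features:
    print(row)
```

The choice list is not filtered by date, but the aggregation itself is, so the extra column carries no information from after the as_of_date.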

from triage.

thcrock commented on June 14, 2024

@andreanr also mentioned supporting getting the feature names from lookup tables as well. For instance, using the above interface, you could either

  1. Get feature names that have color ids in them instead of the more helpful color names
  2. Use a join query in your spacetime aggregation's from_obj, and use the joined name instead of the id. A bunch of joins may make the queries take longer.

Neither of these is optimal, so solving this in a better way could be worth it. @andreanr do you think the work you've done on this problem could be used here?
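To make the two options concrete, here is a rough sketch, again assuming an in-memory SQLite database. The schema, the feature_names helper, and the naming pattern are hypothetical and only illustrate the trade-off; they are not collate's actual behavior. Option 1 builds names from raw ids, option 2 passes a join as the from_obj and builds names from the looked-up value:

```python
import sqlite3

# Hypothetical toy data: events reference colors by id; names live in a lookup table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table colors (color_id integer, color_name text);
    insert into colors values (1, 'red'), (2, 'blue');
    create table events (entity_id integer, color_id integer);
    insert into events values (10, 1), (10, 2), (11, 1);
""")


def feature_names(conn, from_obj, column):
    """Build one feature name per distinct value of `column` in `from_obj`.

    Both arguments are interpolated into the SQL, so they are assumed to
    come from trusted configuration.
    """
    rows = conn.execute(f"select distinct {column} from {from_obj} order by 1")
    return [f"{column}_{row[0]}_sum" for row in rows]


# Option 1: names carry the ids, which are hard to interpret.
id_names = feature_names(conn, "events", "color_id")

# Option 2: join to the lookup table in from_obj and use the readable name.
joined_from_obj = "events join colors using (color_id)"
name_names = feature_names(conn, joined_from_obj, "color_name")

print(id_names)
print(name_names)
```

The second option gives readable names, at the cost of a join in every aggregation query that uses this from_obj.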


thcrock commented on June 14, 2024

Created dssg/collate#67 for the problem I outlined in the previous comment. It seems more appropriate to solve that problem there than here.

For the other use cases (i.e., denormalized tables, categoricals which actually are ids, and cases where joining to look up a value is not too bad), I think we can move ahead here with a simpler interface like the one outlined above. I'm still unsure whether creating a list of choices this way has any leakage problems.


jzanzig commented on June 14, 2024

I'm not sure I completely understand how this differs from the way collate already deals with categoricals, or why it would introduce information leakage. Are you saying there's a possibility that the select distinct query would return some values of color_id that aren't present in a specific training or test set, so that when the features themselves are generated the feature will be 0 for all observations? If so, that's fine (it's not leaking information to have a feature that is 0 for all observations, if that's the correct value), and I think the interface (specifying a choice_query) is good.


thcrock commented on June 14, 2024

Looks like we agree on both points:

  1. That the current workarounds for dynamic categoricals behave no differently with regard to information leakage than this proposal; they just take more work
  2. That this doesn't actually introduce information leakage (I was pretty sure of this but wanted to confirm)

Furthermore, based on Matt's comments above it looks like this feature may make it into collate, so we may be able to piggyback on that here.


thcrock commented on June 14, 2024

Going to implement v1 of this here instead of waiting for collate.

