Comments (6)
@jzanzig what do you think of an interface like this?
categoricals:
    -
        column: 'color_id'
        choice_query: 'select distinct(color_id) from myschema.colors'
        metrics: ['sum']
Temporal leakage is worth considering, but maybe it matters less than I'm thinking. Even if that query returns choices that didn't exist yet as of the requested as_of_dates, those choices will get columns that just won't end up with any data. So maybe that's fine?
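To make the proposal concrete, here is a minimal sketch of how a `choice_query` entry could be expanded into one aggregate expression per choice. It is only an illustration under stated assumptions: `expand_categorical` is a hypothetical helper, not part of triage or collate, the feature naming scheme is made up, and an in-memory SQLite table named `colors` stands in for `myschema.colors`.

```python
import sqlite3


def expand_categorical(conn, column, choice_query, metrics):
    """Hypothetical helper: build one SQL expression per (choice, metric).

    Values are inlined with repr() for readability only; real code should
    quote them safely for the target database.
    """
    choices = sorted(row[0] for row in conn.execute(choice_query))
    expressions = {}
    for choice in choices:
        for metric in metrics:
            name = f"{column}_{choice}_{metric}"  # assumed naming scheme
            expressions[name] = (
                f"{metric}(CASE WHEN {column} = {choice!r} THEN 1 ELSE 0 END)"
            )
    return expressions


# Stand-in for myschema.colors (SQLite has no schemas without ATTACH).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE colors (color_id INTEGER)")
conn.executemany("INSERT INTO colors VALUES (?)", [(1,), (2,), (2,), (3,)])

config = {
    "column": "color_id",
    "choice_query": "select distinct(color_id) from colors",
    "metrics": ["sum"],
}
expressions = expand_categorical(conn, **config)
for name, sql in expressions.items():
    print(name, "->", sql)
```

The point of the sketch is that the set of columns is fixed by whatever the choice query returns when it runs, independently of any as_of_date.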
from triage.
@andreanr also mentioned supporting getting the feature names from lookup tables as well. For instance, using the above interface, you could either
- Get feature names that have color ids in them instead of the more helpful color names
- Use a join query in your spacetime aggregation's from_obj, and use the joined name instead of the id. A bunch of joins may make the queries take longer.
Neither of these is optimal, so solving this in a better way could be worth it. @andreanr do you think the work you've done on this problem could be used here?
from triage.
Created dssg/collate#67 for the problem I outlined in the previous comment. It seems more appropriate to solve that problem there than here.
For the other use cases (i.e., denormalized tables, categoricals that actually are ids, and cases where joining to look up a value is not too bad), I think we can move ahead here with a simpler interface like the one outlined above. I'm still unsure whether there are any leakage problems with creating a list of choices this way.
from triage.
I'm not sure I completely understand the difference between the way collate currently deals with categoricals and this way, or how this way would introduce information leakage. Are you saying that there's a possibility that the select distinct query would return some possible values of color_id that aren't present in that specific training or test set, so that when the features themselves are generated, the feature will be 0 for all observations? If so, that's fine (it's not leaking information to have a feature that is 0 for all observations, if that's the correct value), and I think the interface (specifying a choice_query) is good.
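The all-zero case described above can be checked with a small sketch. Everything in it is assumed for illustration: the `events` table, its columns, the dates, and the feature names are hypothetical, and SQLite stands in for the real database. The choice query sees every `color_id`, including one that first appears after the as_of_date, while the aggregation only sees rows before that date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (entity_id INTEGER, color_id INTEGER, event_date TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, 1, "2016-01-10"),
        (1, 2, "2016-02-01"),
        (2, 1, "2016-03-15"),
        (2, 3, "2017-06-01"),  # color 3 first appears after the as_of_date
    ],
)

# The choice query is not date-restricted, so it returns color 3 as well.
choices = [
    row[0]
    for row in conn.execute("select distinct(color_id) from events order by 1")
]

as_of_date = "2017-01-01"
select_list = ", ".join(
    f"sum(CASE WHEN color_id = {int(c)} THEN 1 ELSE 0 END) AS color_id_{int(c)}_sum"
    for c in choices
)
# Only events strictly before the as_of_date feed the aggregates.
rows = conn.execute(
    f"SELECT entity_id, {select_list} FROM events "
    "WHERE event_date < ? GROUP BY entity_id ORDER BY entity_id",
    (as_of_date,),
).fetchall()

for row in rows:
    print(row)
```

In this toy data the column for color 3 exists but is 0 for every entity. The only thing the feature matrix carries from the future is the existence of the column, not any value attached to an observation.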
from triage.
Looks like we agree on both points:
- That the current workarounds for dynamic categoricals behave no differently with regard to information leakage than this proposal; they just take more work
- That this doesn't actually introduce information leakage (I was pretty sure of this but wanted to confirm)
Furthermore, based on Matt's comments above it looks like this feature may make it into collate, so we could get to piggyback on that here.
from triage.
Going to implement v1 of this here instead of waiting for collate.
from triage.