Comments (4)
I agree with your assessment @DerWeh that we should be able to parallelize boosting on multiple features at a time. I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this. I've heard from someone who tried doing all features simultaneously that it didn't work as well as expected, but I think doing a subset is still a viable possibility. This is something that we can prototype using the existing python interface (in the boost function) before trying to implement it for real with all the parallelism baggage that it would entail.
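The several-features-at-a-time idea can be prototyped without touching the C++. Below is a minimal, self-contained sketch in plain NumPy (not the interpret API; all names are made up) of boosting k features per round, where every selected feature's update is computed against the *same* residuals, which is roughly what simultaneous boosting would do:

```python
import numpy as np

def boost_round(X_binned, residual, feature_idx, n_bins, lr):
    """Fit one piecewise-constant update per selected feature, all against
    the same residuals, as simultaneous boosting would."""
    updates = []
    for j in feature_idx:
        bins = X_binned[:, j]
        sums = np.bincount(bins, weights=residual, minlength=n_bins)
        counts = np.bincount(bins, minlength=n_bins)
        updates.append((j, lr * sums / np.maximum(counts, 1)))  # mean residual per bin
    return updates

rng = np.random.default_rng(0)
n, p, n_bins = 1000, 10, 8
X = rng.integers(0, n_bins, size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

pred = np.zeros(n)
for _ in range(50):
    chosen = rng.choice(p, size=4, replace=False)  # boost 4 features per round
    for j, step in boost_round(X, y - pred, chosen, n_bins, lr=0.1):
        pred += step[X[:, j]]
```

With uncorrelated features this converges; the interesting experiment is how far k can be pushed when features are correlated.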
There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
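The merge step here is just an elementwise sum, so it is exact. A hedged sketch of the idea (illustrative names, not libebm's actual histogram layout):

```python
import numpy as np

def subset_histogram(bins, gradients, n_bins):
    """Per-bin sums for one data subset; subsets are fully independent."""
    return np.stack([
        np.bincount(bins, minlength=n_bins),                     # sample counts
        np.bincount(bins, weights=gradients, minlength=n_bins),  # gradient sums
    ])

rng = np.random.default_rng(1)
bins = rng.integers(0, 1024, size=100_000)   # one binned feature, 1024 bins
grads = rng.normal(size=100_000)

# Histogram each subset independently (these could run on separate workers).
parts = [subset_histogram(b, g, 1024)
         for b, g in zip(np.array_split(bins, 4), np.array_split(grads, 4))]
merged = np.sum(parts, axis=0)               # the reduce step: elementwise sum

full = subset_histogram(bins, grads, 1024)   # same result in one pass
```

Because the reduce is a plain sum, it composes across threads, machines, or GPU blocks.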
I think ultimately someday when EBMs are built on clusters and in GPUs we'll use both approaches.
from interpret.
And we already have the concept of data subsets in C++ to prepare for this eventuality:
And we can then merge the histograms from the subsets after constructing them with this function:
https://github.com/interpretml/interpret/blob/develop/shared/libebm/ConvertAddBin.cpp
> I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this.
Very interesting point. Of course, even if the increment (learning rate) is small, the contribution is significant if I do many steps (one for each feature). I've never tried this regime, as with so many features most of the (global) interpretability is probably lost.
Was it also reported that feature order matters in this case?
> This is something that we can prototype using the existing python interface (in the boost function) before trying to implement it for real with all the parallelism baggage that it would entail.
Thanks for pointing that out; I haven't made my way to the code of the boost function yet. When I get there, I'll give it a try.
> There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
This is indeed a very interesting avenue, though I am not quite sure yet how to go about it. Maybe I need to first take a deeper look at the boosting routine. Could you elaborate a bit? Do you plan to naively (randomly) split the dataset into subsets, use the same binning, and then merge the final EBMs, weighting each bin by the number of samples in that subset? Or is there some sophisticated way to split the dataset, resulting in disjoint bins? That seems quite hard due to the high dimensionality. Or have I missed the point completely?
Parallelism would give a massive boost to EBMs, as we have many (cheap) CPUs nowadays. With GPU programming I have too little experience, but if the algorithm is really suitable, that would of course be a tremendous speedup.
> I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this.
> Very interesting point. Of course, even if the increment (learning rate) is small, the contribution is significant if I do many steps (one for each feature). I've never tried this regime, as with so many features most of the (global) interpretability is probably lost. Was it also reported that feature order matters in this case?
I believe the person who tried it didn't care about feature order since they boosted all features together independently at the same time. There were other differences between their implementation and EBMs, and I think they used a higher learning rate, so I don't know how much of this will be valid in the domain of EBMs. Since it's easy enough to check, I figured that should be the plan when the time came.
There are some other issues that I didn't mention previously for brevity. For simplicity we (currently) start from a default score of 0 when we boost the mains. The first few boost steps essentially move the intercept closer to its ultimate value. If we had 1000 features being boosted together then they'll all overshoot this intercept by a large amount in total. Now, this is something that can be corrected for by initializing to a reasonable initial intercept, but hopefully it highlights a separate point that boosting can become unstable if you attempt to boost too many features at once. Even with initialization of the intercept this instability could still happen with correlated features where all the features would be positive in some sections of their graphs, etc. Despite all this, I do think the idea is valuable up to some reasonable point that is yet to be determined.
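A minimal illustration of the intercept-initialization fix described above, assuming squared-error boosting (the mean for regression, the log-odds of the base rate for classification); names here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, size=1000)     # regression target with a large mean

# Start boosting from the base value instead of 0, so the first rounds
# don't spend their (many, simultaneous) updates chasing the intercept.
intercept = y.mean()
residual = y - intercept               # residuals now start near zero mean

# Classification analogue: initialize to the log-odds of the base rate.
y_cls = (rng.random(1000) < 0.8).astype(float)
base_rate = y_cls.mean()
intercept_cls = np.log(base_rate / (1.0 - base_rate))
```

This removes the shared-intercept overshoot, though as noted it does not remove instability from correlated features pushing in the same direction.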
> There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
> This is indeed a very interesting avenue, though I am not quite sure yet how to go about it. Maybe I need to first take a deeper look at the boosting routine. Could you elaborate a bit? Do you plan to naively (randomly) split the dataset into subsets, use the same binning, and then merge the final EBMs, weighting each bin by the number of samples in that subset? Or is there some sophisticated way to split the dataset, resulting in disjoint bins? That seems quite hard due to the high dimensionality. Or have I missed the point completely?
> Parallelism would give a massive boost to EBMs, as we have many (cheap) CPUs nowadays. With GPU programming I have too little experience, but if the algorithm is really suitable, that would of course be a tremendous speedup.
We already split the data into subsets within the C++ in order to use SIMD effectively. We need the SIMDable data to be a multiple of the SIMD packing size. We also set a maximum number of samples that can go into each subset with this parameter:
interpret/shared/libebm/bridge/bridge.hpp, line 29 (commit 0cf3f2f)
If we wanted to parallelize this more at the thread level (and we do!), we might want to drop this number to something like 50k samples per subset. Samples can be placed into the subsets without any restrictions based on their feature contents.
There are two phases during training that are bottlenecks. The first phase builds histograms and the second applies a model update. During the histogram construction phase, for each bin of each feature we need to know the following values: sum_samples, sum_weights, sum_gradients, sum_hessians. Since all of these are just sums, we can build them in parallel across all subsets. Once we have the final sums (which we'd combine with a reduce operation), we do tree building and select cuts, which is not a speed bottleneck. Then we need to apply the score updates that we determined, which is the second compute bottleneck. We have an array that we need to index into by the feature values (interactions are a tensor), adding the value in the array to the existing per-sample scores. Since the samples are subdivided into subsets, we can again do this update in parallel across the subsets.
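The two bottleneck phases might be sketched like this (a toy NumPy version for one subset and one feature, not libebm's actual data layout; the function names are made up):

```python
import numpy as np

def histogram_phase(bins, weights, gradients, hessians, n_bins):
    """Bottleneck 1: per-bin sums; pure sums, so subsets merge by addition."""
    return {
        "sum_samples":   np.bincount(bins, minlength=n_bins),
        "sum_weights":   np.bincount(bins, weights=weights,   minlength=n_bins),
        "sum_gradients": np.bincount(bins, weights=gradients, minlength=n_bins),
        "sum_hessians":  np.bincount(bins, weights=hessians,  minlength=n_bins),
    }

def apply_update(scores, bins, update_per_bin):
    """Bottleneck 2: gather the update by each sample's bin and add it in."""
    scores += update_per_bin[bins]
    return scores

rng = np.random.default_rng(3)
n, n_bins = 50_000, 16
bins = rng.integers(0, n_bins, size=n)   # one subset's binned feature values
hist = histogram_phase(bins, np.ones(n), rng.normal(size=n), np.ones(n), n_bins)
scores = apply_update(np.zeros(n), bins, rng.normal(size=n_bins))
```

Both phases touch each sample exactly once and have no cross-subset dependencies, which is why they map well to threads or GPU blocks.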
In summary, EBMs should parallelize fairly well for most datasets, and this is something we could do on a GPU. If we have a dataset with a limited number of samples but many features, this won't work as well though, so we're back to looking at the multiple-features-at-a-time parallelism that we started this discussion with.