Comments (4)
I agree with your assessment @DerWeh that we should be able to parallelize boosting on multiple features at a time. I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this. I've heard from someone who tried doing all features simultaneously that it didn't work as well as expected, but I think doing a subset is still a viable possibility. This is something that we can prototype using the existing python interface (in the boost function) before trying to implement it for real with all the parallelism baggage that it would entail.
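The several-features-at-a-time idea can be prototyped without touching the C++. Below is a minimal, self-contained sketch in plain NumPy (not the interpret API; all names are made up) of boosting k features per round, where every selected feature's update is computed against the *same* residuals, which is roughly what simultaneous boosting would do:

```python
import numpy as np

def boost_round(X_binned, residual, feature_idx, n_bins, lr):
    """Fit one piecewise-constant update per selected feature, all against
    the same residuals, as simultaneous boosting would."""
    updates = []
    for j in feature_idx:
        bins = X_binned[:, j]
        sums = np.bincount(bins, weights=residual, minlength=n_bins)
        counts = np.bincount(bins, minlength=n_bins)
        updates.append((j, lr * sums / np.maximum(counts, 1)))  # mean residual per bin
    return updates

rng = np.random.default_rng(0)
n, p, n_bins = 1000, 10, 8
X = rng.integers(0, n_bins, size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

pred = np.zeros(n)
for _ in range(50):
    chosen = rng.choice(p, size=4, replace=False)  # boost 4 features per round
    for j, step in boost_round(X, y - pred, chosen, n_bins, lr=0.1):
        pred += step[X[:, j]]
```

With uncorrelated features this converges; the interesting experiment is how far k can be pushed when features are correlated.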
There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
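The merge step here is just an elementwise sum, so it is exact. A hedged sketch of the idea (illustrative names, not libebm's actual histogram layout):

```python
import numpy as np

def subset_histogram(bins, gradients, n_bins):
    """Per-bin sums for one data subset; subsets are fully independent."""
    return np.stack([
        np.bincount(bins, minlength=n_bins),                     # sample counts
        np.bincount(bins, weights=gradients, minlength=n_bins),  # gradient sums
    ])

rng = np.random.default_rng(1)
bins = rng.integers(0, 1024, size=100_000)   # one binned feature, 1024 bins
grads = rng.normal(size=100_000)

# Histogram each subset independently (these could run on separate workers).
parts = [subset_histogram(b, g, 1024)
         for b, g in zip(np.array_split(bins, 4), np.array_split(grads, 4))]
merged = np.sum(parts, axis=0)               # the reduce step: elementwise sum

full = subset_histogram(bins, grads, 1024)   # same result in one pass
```

Because the reduce is a plain sum, it composes across threads, machines, or GPU blocks.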
I think ultimately someday when EBMs are built on clusters and in GPUs we'll use both approaches.
from interpret.
And we already have the concept of data subsets in C++ to prepare for this eventuality:
And we can then merge the histograms from the subsets after constructing them with this function:
https://github.com/interpretml/interpret/blob/develop/shared/libebm/ConvertAddBin.cpp
> I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this.
Very interesting point. Of course, even if the increment (learning rate) is small, the contribution is significant if I do many steps (one for each feature). I've never tried this regime, as with so many features most of the (global) interpretability is probably lost.
Was it also reported that feature order matters in this case?
> This is something that we can prototype using the existing python interface (in the boost function) before trying to implement it for real with all the parallelism baggage that it would entail.
Thanks for pointing that out; I haven't made my way to the code of the boost function yet. When I get there, I'll give it a try.
> There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
This is indeed a very interesting avenue, though I am not quite sure yet how to go about it. Maybe I need to first take a deeper look at the boosting routine. Could you elaborate a bit? Do you plan to naively (randomly) split the dataset into subsets, use the same binning, and then merge the final EBMs, weighting each bin by the number of samples in that subset? Or is there some sophisticated way to split the dataset, resulting in disjoint bins? That seems quite hard due to the high dimensionality. Or have I missed the point completely?
Parallelism would give a massive boost to EBMs, as we have many (cheap) CPUs nowadays. With GPU programming I have too little experience, but if the algorithm is really suitable, that would of course be a tremendous speedup.
> I think there is a reasonable limit though where if there were for example 1000 features, we might be able to boost 20-ish in parallel. If all features were truly uncorrelated then we could probably boost more at a time, but in practice there's often quite a bit of correlation between features, so we probably want to be careful about how we handle this.
> Very interesting point. Of course, even if the increment (learning rate) is small, the contribution is significant if I do many steps (one for each feature). I've never tried this regime, as with so many features most of the (global) interpretability is probably lost. Was it also reported that feature order matters in this case?
I believe the person who tried it didn't care about feature order since they boosted all features together independently at the same time. There were other differences between their implementation and EBMs, and I think they used a higher learning rate, so I don't know how much of this will be valid in the domain of EBMs. Since it's easy enough to check, I figured that should be the plan when the time came.
There are some other issues that I didn't mention previously for brevity. For simplicity we (currently) start from a default score of 0 when we boost the mains. The first few boost steps essentially move the intercept closer to its ultimate value. If we had 1000 features being boosted together then they'll all overshoot this intercept by a large amount in total. Now, this is something that can be corrected for by initializing to a reasonable initial intercept, but hopefully it highlights a separate point that boosting can become unstable if you attempt to boost too many features at once. Even with initialization of the intercept this instability could still happen with correlated features where all the features would be positive in some sections of their graphs, etc. Despite all this, I do think the idea is valuable up to some reasonable point that is yet to be determined.
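A minimal illustration of the intercept-initialization fix described above, assuming squared-error boosting (the mean for regression, the log-odds of the base rate for classification); names here are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=5.0, size=1000)     # regression target with a large mean

# Start boosting from the base value instead of 0, so the first rounds
# don't spend their (many, simultaneous) updates chasing the intercept.
intercept = y.mean()
residual = y - intercept               # residuals now start near zero mean

# Classification analogue: initialize to the log-odds of the base rate.
y_cls = (rng.random(1000) < 0.8).astype(float)
base_rate = y_cls.mean()
intercept_cls = np.log(base_rate / (1.0 - base_rate))
```

This removes the shared-intercept overshoot, though as noted it does not remove instability from correlated features pushing in the same direction.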
> There's another avenue to parallelization though that comes with potentially fewer complications. We first bin the features into 1024 bins (by default), so if we subdivide the dataset into subsets of several tens of thousands, we can build the histograms for each subset independently, and then use a reduce operation to merge the histograms afterwards. In theory this should allow EBMs to be massively parallelized in clusters, etc.
> This is indeed a very interesting avenue, though I am not quite sure yet how to go about it. Maybe I need to first take a deeper look at the boosting routine. Could you elaborate a bit? Do you plan to naively (randomly) split the dataset into subsets, use the same binning, and then merge the final EBMs, weighting each bin by the number of samples in that subset? Or is there some sophisticated way to split the dataset, resulting in disjoint bins? That seems quite hard due to the high dimensionality. Or have I missed the point completely?
> Parallelism would give a massive boost to EBMs, as we have many (cheap) CPUs nowadays. With GPU programming I have too little experience, but if the algorithm is really suitable, that would of course be a tremendous speedup.
We already split the data into subsets within the C++ in order to use SIMD effectively. We need the SIMDable data to be a multiple of the SIMD packing size. We also set a maximum number of samples that can go into each subset with this parameter:
interpret/shared/libebm/bridge/bridge.hpp, line 29 (commit 0cf3f2f)
If we wanted to parallelize this more at the thread level (and we do!), we might want to drop this number to something like 50k samples per subset. Samples can be placed into the subsets without any restrictions based on their feature contents.
There are two phases during training that are bottlenecks. The first phase builds histograms and the second applies a model update. During the histogram construction phase, for each bin of each feature we need to know the following values: sum_samples, sum_weights, sum_gradients, sum_hessians. Since all of these are just sums, we can build them in parallel across all subsets. Once we have the final sums (which we'd combine with a reduce operation), we do tree building and select cuts, which is not a speed bottleneck. Then we need to apply the score updates that we determined, which is the second compute bottleneck. We have an array that we need to index into by the feature values (interactions are a tensor), adding the value in the array to the existing per-sample scores. Since the samples are subdivided into subsets, we can again do this update in parallel across the subsets.
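The two bottleneck phases might be sketched like this (a toy NumPy version for one subset and one feature, not libebm's actual data layout; the function names are made up):

```python
import numpy as np

def histogram_phase(bins, weights, gradients, hessians, n_bins):
    """Bottleneck 1: per-bin sums; pure sums, so subsets merge by addition."""
    return {
        "sum_samples":   np.bincount(bins, minlength=n_bins),
        "sum_weights":   np.bincount(bins, weights=weights,   minlength=n_bins),
        "sum_gradients": np.bincount(bins, weights=gradients, minlength=n_bins),
        "sum_hessians":  np.bincount(bins, weights=hessians,  minlength=n_bins),
    }

def apply_update(scores, bins, update_per_bin):
    """Bottleneck 2: gather the update by each sample's bin and add it in."""
    scores += update_per_bin[bins]
    return scores

rng = np.random.default_rng(3)
n, n_bins = 50_000, 16
bins = rng.integers(0, n_bins, size=n)   # one subset's binned feature values
hist = histogram_phase(bins, np.ones(n), rng.normal(size=n), np.ones(n), n_bins)
scores = apply_update(np.zeros(n), bins, rng.normal(size=n_bins))
```

Both phases touch each sample exactly once and have no cross-subset dependencies, which is why they map well to threads or GPU blocks.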
In summary, EBMs should parallelize fairly well for most datasets, and this is something we could do on a GPU. If we have a dataset with a limited number of samples but many features, this won't work as well though, so we're back to looking at the multiple-features-at-a-time parallelism that we started this discussion with.