Cool, thanks for the clear info! Yes, diving a bit deeper may be helpful. Keep in mind: we will need fast access mainly during the training loop, i.e., directly before returning some tensor/ndarray (in the usual case) that will be passed to the deep network. So for `preload=True`, accessing `_data` seems fine to me. The question is more about the `preload=False` case: whether that one can be fast enough in MNE as well. The relatively small gap for `get_data` there is encouraging for sure.
To get a better idea of the times we need to reach in the end, you could additionally do the following on a reasonable GPU: forward one dummy batch of size (64, 22, 1000) through the deep and shallow networks, compute a classification loss with dummy targets, do the backward pass, and measure the wall-clock time (don't use profilers here for now; they may not work well with GPUs). Then we have a rough time we want to reach...
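A minimal sketch of such a measurement, using a stand-in conv net rather than braindecode's actual shallow/deep models (the layer sizes and the two-class targets are arbitrary assumptions):

```python
# Rough wall-clock timing of one training step on a dummy (64, 22, 1000) batch.
# The model here is a placeholder, not braindecode's shallow/deep architecture.
import time
import torch
from torch import nn

device = "cuda"
model = nn.Sequential(
    nn.Conv1d(22, 40, kernel_size=25), nn.ELU(),
    nn.AvgPool1d(kernel_size=75, stride=15),
    nn.Flatten(), nn.LazyLinear(2),
).to(device)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(64, 22, 1000, device=device)   # dummy minibatch
y = torch.randint(0, 2, (64,), device=device)  # dummy targets

for _ in range(3):                             # warm-up (allocator, cuDNN autotune)
    loss_fn(model(X), y).backward()

torch.cuda.synchronize()                       # flush pending GPU work before timing
start = time.time()
loss = loss_fn(model(X), y)
loss.backward()
torch.cuda.synchronize()                       # wait until the backward pass finishes
print(f"forward + loss + backward: {(time.time() - start) * 1e3:.1f} ms")
```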
---
Great @hubertjb. Seems we are getting into a reasonable training-time range. It would also be interesting to see how big the difference is for Deep4. And as you said, maybe `num_workers` would already close the gap enough to consider it finished. I would say a gap of 1.5x for Deep4 is acceptable to me.
---
Yes, I agree that the interesting case here is `preload=False`! With some profiling I realized that the bottleneck is the indexing, not `get_data()`. I came up with a super basic method `get_single_epoch()` that directly calls the function that reads from the file, instead of first creating a new `Epochs` object as the standard indexing does. The results are pretty encouraging so far (sorry, not the same colours as above):
It looks like we could get something pretty close to HDF5 lazy loading. Here's what the method looks like:
```python
def get_single_epoch(self, idx, postprocess=False):
    """Get a single epoch.

    Parameters
    ----------
    idx : int
        Index of the epoch to extract.
    postprocess : bool
        If True, apply detrending + offset + decim when loading a new epoch
        from raw; also, apply projection if configured to do so.

    Returns
    -------
    epoch : array of shape (n_channels, n_times)
        The specific window that was extracted.
    """
    assert isinstance(idx, int)
    if self.preload:
        epoch = self._data[idx]
    else:
        epoch = self._get_epoch_from_raw(idx)
        if postprocess:
            epoch = self._detrend_offset_decim(epoch)
    if postprocess and not self._do_delayed_proj:
        epoch = self._project_epoch(epoch)
    return epoch
```
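A hypothetical way to try the method out: define it at module level, attach it to `mne.Epochs`, and time it against standard indexing on a synthetic lazily-loaded file (the file name, sampling rate, and window length below are arbitrary choices):

```python
# Hypothetical micro-benchmark: monkey-patch the sketch above onto mne.Epochs
# and compare against epochs[i].get_data(), which first copies the Epochs object.
import timeit
import numpy as np
import mne

mne.Epochs.get_single_epoch = get_single_epoch  # the function defined above

# Round-trip a synthetic recording through disk so preload=False is possible.
info = mne.create_info(22, sfreq=100.0, ch_types="eeg")
mne.io.RawArray(np.random.randn(22, 60_000), info).save("tmp_raw.fif", overwrite=True)
raw = mne.io.read_raw_fif("tmp_raw.fif", preload=False)

events = mne.make_fixed_length_events(raw, duration=1.0)
epochs = mne.Epochs(raw, events, tmin=0.0, tmax=0.99, baseline=None, preload=False)
epochs.drop_bad()

t_idx = timeit.timeit(lambda: epochs[0].get_data(), number=100) / 100
t_new = timeit.timeit(lambda: epochs.get_single_epoch(0), number=100) / 100
print(f"indexing: {t_idx * 1e3:.2f} ms, direct read: {t_new * 1e3:.2f} ms")
```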
---
Also, I'm looking at the test you suggested, @robintibor; I'll give an update soon.
---
@hubertjb would you prefer it if we fixed this at the MNE end? I enjoy (at least making an attempt at) solving these problems.
---
@larsoner that'd be amazing! :) Out of curiosity, what kind of optimization do you have in mind? Would it go through a special method as above, or something else?
---
@robintibor I've made a script to test what you suggested: https://github.com/hubertjb/braindecode/blob/profiling-mne-epochs/test/others/time_training_iteration.py
I haven't tested the Deep4 architecture yet, but with the shallow one I get about 35 ms per minibatch of 256 examples on a Tesla V100.
From the previous test, we know a conservative lower bound for getting windows is about 1 ms per window; if we need to apply some transforms (e.g., filtering, normalization, augmentation, etc.) on the fly, this will likely go up quite a bit. At ~1 ms per window, fetching a whole minibatch sequentially already takes longer than the GPU step, so with the shallow architecture we'd be too slow to maximize GPU usage at this batch size.
Does this fit with the numbers you had previously?
---
Why 256 examples? In the script it is 64, right?
In any case, right now for me it is about getting a rough estimate of whether an approach may be fast enough and is therefore worth spending time on. These numbers may still be OK, I think. Keep in mind that in the end PyTorch may use multithreading to fetch multiple windows at the same time, and there may be additional overhead besides the train step, like storing losses etc. So proceeding in this direction seems promising to me at the moment.
---
Sorry, I meant 64! And yep, good point concerning multithreading.
I agree with you: if we can get that close to HDF5 performance with a few modifications to MNE, then we're on a good path.
---
> Out of curiosity, what kind of optimization do you have in mind? Would it go through a special method as above, or something else?

No, I will try to optimize `epochs.get_data` itself to avoid whatever the bottleneck is under the hood in MNE.
---
> No, I will try to optimize `epochs.get_data` itself to avoid whatever the bottleneck is under the hood in MNE.

Ok, cool. In case it's useful: on my end, the `epochs.copy()` in `_getitem()` seemed to be the culprit.
---
Ahh yes, that would make sense. Then it would scale with the number of epochs, etc. I'll see how bad it will be to fix this.
---
One possible API would be a new argument `item` to `epochs.get_data`, as in:

```python
out = epochs.get_data(item=0)
```

to get epoch index 0. You could pass to `item` whatever you currently pass to `epochs[item]`. It makes the code much simpler if we require that bad epochs be dropped before allowing this, which at least makes some sense as it ensures you get back `out.shape[0] == np.size(item)` (for appropriate inputs).
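As a hypothetical usage sketch (reusing the synthetic lazy `epochs` and the `numpy` import from the earlier benchmark snippet), the proposed call would look like this:

```python
epochs.drop_bad()                            # bad epochs must be dropped beforehand
out = epochs.get_data(item=0)                # one window, shape (1, n_channels, n_times)
assert out.shape[0] == np.size(0)            # the guarantee mentioned above

batch = epochs.get_data(item=slice(0, 64))   # anything epochs[item] accepts works too
print(batch.shape)                           # e.g. (64, 22, 100) for the synthetic data
```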
If I

- use mne-tools/mne-python#7273,
- change `n_loops=20` (for speed),
- add a call to `epochs.drop_bad` before each loop, and
- change case 1 to do `epochs.get_data(item=0)` instead of the `get_single_epoch`,

I get this:
---
Thanks @larsoner, very elegant solution! I think this should satisfy our requirements. I have one question: I understand that you need to call `epochs.drop_bad()` before calling `epochs.get_data(item=idx)`. However, if I understand correctly, `drop_bad` starts by loading the entire data into memory. If that's the case, then this defeats our purpose here, as we want to avoid eager loading. If `drop_bad()` only drops epochs based on signal-quality criteria, could we just "accept" any epoch, and in doing so bypass the use of `drop_bad()`?
---
> drop_bad starts by loading the entire data in memory

Not quite. It loads each epoch one at a time and then discards it from memory (does not keep it around). Is that okay?
---
> If drop_bad() only drops epochs based on signal quality criteria, could we just "accept" any epoch, and in doing so bypass the use of drop_bad()?

It's also based on things like whether or not the entire epoch is actually extractable from the raw instance (e.g., a 10-second epoch from an event occurring one second from the end of the raw instance would be discarded as TOO_SHORT or so).
If it's desirable to just check these limits without actually loading the data, we might be able to do that, but it would again be a larger refactoring.
---
> Not quite. It loads each epoch one at a time and then discards it from memory (does not keep it around). Is that okay?

Ok, got it. So in our case, I guess we would run `drop_bad()` only once, before training. This would load all the data, so it might take some time on very large datasets; however, since it's done sequentially, memory would not be an issue.
I think this should be fine for now. If calling `drop_bad()` once before training ends up taking too much time, we might want to look into other options, though.
---
Yes, I agree with @hubertjb that this case (loading into and dropping from memory once at the start of training) may be fine for us for now; we will see whether it causes any practically relevant problems once we have implemented more full examples.
---
Okay great, let us know if it does indeed end up being a bottleneck.
---
I started working on an example that shows how to do lazy vs. eager loading on real data and compares the running times of both approaches (https://github.com/hubertjb/braindecode/blob/lazy-loading-example/examples/plot_lazy_vs_eager_loading.py). It's still not a perfectly realistic scenario (for instance, no preprocessing/transforms are applied), but it should be better than my previous tests. Here are some preliminary results (they are subject to change)!

So far, using the first 10 subjects of the TUH Abnormal dataset and training a shallow net for 5 epochs on a normal/abnormal classification task, the breakdown is (average of 15 runs, in seconds):
```
             data_preparation  training_setup  model_training
loading_type
eager                6.539236        0.003837       18.746278
lazy                 9.040427        0.136949       59.257529
```
Observations:

- Data preparation is longer for lazy loading, most likely because of `drop_bad`, which loads each window one at a time. This is OK for now, but we might want to optimize it eventually.
- I'm not sure why training setup time is larger for lazy loading, but in the end it is pretty short anyway. To investigate.
- Model training time is >3x longer with lazy loading.
Next steps:

- Look into the `num_workers` argument of `DataLoader` to see whether it can help mitigate the overhead of lazy loading (see the sketch after this list).
- Once we agree on an API for on-the-fly transforms (#72), we can add e.g. a filtering step and see what impact it has on running times.
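As a rough sketch of what the `num_workers` experiment involves, here is a minimal lazy `Dataset` that reads one window at a time through MNE. The class name and the label-from-event-code convention are illustrative choices, not braindecode's actual implementation, and the `Epochs` object must be picklable for worker processes:

```python
# Minimal lazy Dataset: each __getitem__ reads a single window from disk via
# MNE, so DataLoader workers can overlap file I/O with GPU compute.
import torch
from torch.utils.data import Dataset, DataLoader

class LazyEpochsDataset(Dataset):
    """Wraps a lazily-loaded mne.Epochs (with bad epochs already dropped)."""

    def __init__(self, epochs):
        self.epochs = epochs

    def __len__(self):
        return len(self.epochs.events)

    def __getitem__(self, idx):
        X = self.epochs.get_data(item=idx)[0]  # (n_channels, n_times)
        y = int(self.epochs.events[idx, -1])   # event code as a dummy label
        return torch.as_tensor(X, dtype=torch.float32), y

# num_workers > 0 spawns worker processes that fetch windows in parallel.
loader = DataLoader(LazyEpochsDataset(epochs), batch_size=64,
                    shuffle=True, num_workers=8)
```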
---
Some good news: using `num_workers` makes lazy loading a lot more competitive (https://github.com/hubertjb/braindecode/blob/lazy-loading-example/examples/plot_lazy_vs_eager_loading.py).

Without (`num_workers=0`):

```
             data_preparation  training_setup  model_training
loading_type
eager                8.866640        0.003532       25.106193
lazy                12.615696        0.418142       72.939527
```

With (`num_workers=8`):

```
             data_preparation  training_setup  model_training
loading_type
eager                 8.82214        0.004220       28.750993
lazy                 23.60280        0.422488       30.715090
```

The two types of loading are much closer now. Also, I noticed GPU usage stayed at 100% during training, which means the workers did what they are supposed to do. I'm not sure why data preparation time is so much longer for lazy loading with `num_workers=8`, though; I'll have to investigate.
Also, @robintibor, I tried with Deep4, keeping all other parameters equal:

```
             data_preparation  training_setup  model_training
loading_type
eager                8.902757        0.022913       11.306941
lazy                22.113035        0.456256       17.751904
```

Surprisingly, it seems more efficient than the shallow net... Is that expected? I haven't played with the arguments much; maybe I did something weird.
---
No, the behavior is very unexpected: time spent on the GPU forward/backward pass should be longer for the deep net, so the difference should be smaller. It is also very strange that deep is faster than shallow for you; that should not be.
Before investigating further, please add the following line somewhere before your main loop:

```python
torch.backends.cudnn.benchmark = True
```

(see https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936 for why it should always be set to True for our cases). Numbers should improve overall. However, it does not explain any of the current behavior; as said, deep should be slower than shallow, not ~3 times faster (in eager mode).
---
Latest results can be found in #75.
---
@hubertjb I have rerun your perf code after the fix in #89 (which improves `preload=True` performance quite a lot) for `batch_size=64` and `cuda=True`. These are the results (averaged over the 3 repetitions):
|    | preload | len | model   | n_work | data_prep | setup | training |
|----|---------|-----|---------|--------|-----------|-------|----------|
| 0  | False   | 2   | deep    | 0      | 11.9      | 0.1   | 21.7     |
| 12 | True    | 2   | deep    | 0      | 7.5       | 0.1   | 2.6      |
| 1  | False   | 2   | deep    | 8      | 11        | 0.1   | 11.4     |
| 13 | True    | 2   | deep    | 8      | 7.6       | 0     | 3.5      |
| 4  | False   | 4   | deep    | 0      | 16.8      | 0.2   | 29.5     |
| 16 | True    | 4   | deep    | 0      | 7.8       | 0.1   | 3.3      |
| 5  | False   | 4   | deep    | 8      | 15.4      | 0.1   | 13.2     |
| 17 | True    | 4   | deep    | 8      | 7.9       | 0.1   | 4.6      |
| 8  | False   | 15  | deep    | 0      | 28.2      | 0.2   | 57.1     |
| 20 | True    | 15  | deep    | 0      | 10        | 0.1   | 8.9      |
| 9  | False   | 15  | deep    | 8      | 31.5      | 0.2   | 25.8     |
| 21 | True    | 15  | deep    | 8      | 9.8       | 0.1   | 11.4     |
| 2  | False   | 2   | shallow | 0      | 10.5      | 0.1   | 24       |
| 14 | True    | 2   | shallow | 0      | 7.5       | 0     | 2.2      |
| 3  | False   | 2   | shallow | 8      | 12.8      | 0.1   | 10.8     |
| 15 | True    | 2   | shallow | 8      | 7.5       | 0.1   | 3.6      |
| 6  | False   | 4   | shallow | 0      | 10.5      | 0.1   | 29.2     |
| 18 | True    | 4   | shallow | 0      | 7.8       | 0     | 3.3      |
| 7  | False   | 4   | shallow | 8      | 14.8      | 0.1   | 13.3     |
| 19 | True    | 4   | shallow | 8      | 7.9       | 0.1   | 5.3      |
| 10 | False   | 15  | shallow | 0      | 31.2      | 0.2   | 53.9     |
| 22 | True    | 15  | shallow | 0      | 10        | 0.1   | 12.3     |
| 11 | False   | 15  | shallow | 8      | 27.2      | 0.2   | 26.6     |
| 23 | True    | 15  | shallow | 8      | 9.9       | 0.1   | 15.1     |
As can be seen, for eager loading `num_workers=0` is better, whereas for lazy loading `num_workers=8` is better. Therefore I made a comparison table for these two settings:
|   | win_len_s | model_kind | train_lazy | train_eager | ratio |
|---|-----------|------------|------------|-------------|-------|
| 0 | 2         | deep       | 11.4       | 2.6         | 4.4   |
| 1 | 4         | deep       | 13.2       | 3.3         | 4     |
| 2 | 15        | deep       | 25.8       | 8.9         | 2.9   |
| 3 | 2         | shallow    | 10.8       | 2.2         | 4.9   |
| 4 | 4         | shallow    | 13.3       | 3.3         | 4     |
| 5 | 15        | shallow    | 26.6       | 12.3       | 2.2   |
I don't know which of these win_len settings is realistic; what do they correspond to? It would be great if we could get all of the ratios below 2 at some point, but it's good that we have a running implementation in any case.