
brain-age-benchmark-paper's Introduction

M/EEG brain age prediction benchmark paper

This repository presents the code, tools, and resources developed in the course of [1]. To reuse the code, please follow the instructions and recommendations below. When publishing results based on this code, please cite [1].

[1] D. Engemann, A. Mellot, R. Höchenberger, H. Banville, D. Sabbagh, L. Gemein, T. Ball, and A. Gramfort. A reusable benchmark of brain-age prediction from M/EEG resting-state signals (2022). NeuroImage, 262, 119521. https://doi.org/10.1016/j.neuroimage.2022.119521


Exploring the aggregated results using the plotting scripts

For convenience, we provide aggregated group-level results facilitating exploration.

  1. Aggregate information on demographics is presented in: ./outputs/demog_summary_table.csv
  2. Aggregate cross-validation results can be found for every dataset and benchmark in: ./results/. Filenames indicate the benchmark and the dataset as in ./results/benchmark-deep_dataset-lemon.csv for the deep learning (Deep4Net) benchmark on the LEMON dataset.

The scripts used for generating the figures and tables presented in the paper can be a good starting point. All plots and tables were generated using plot_benchmark_age_prediction.r.

The R code uses few dependencies and relies on base-R idioms, supporting newer as well as older versions of R.

If needed, dependencies can be installed as follows:

install.packages(c("ggplot2", "scales", "ggthemes", "patchwork", "kableExtra"))

The demographic data can be plotted using plot_demography.r. Note, however, that the input file contains individual-specific data and cannot readily be shared. The input tables can be computed using gather_demographics_info.py, provided that all input datasets have been downloaded and stored in BIDS format.
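
Assuming R is installed with Rscript on the PATH, either plotting script can then be run from the shell, e.g.:

Rscript plot_benchmark_age_prediction.r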


General workflow for computing intermediate outputs

Here we considered four datasets that can be programmatically downloaded from their respective websites:

  1. Cam-CAN
  2. LEMON
  3. CHBP
  4. TUAB

Some of these datasets already come in BIDS format; others have to be actively converted. In some cases, modifications and fixes are needed to make things work. Please consider the notes on dataset-specific peculiarities below.

Datasets are then preprocessed using the MNE-BIDS pipeline. To make this work, you must edit the respective config files to point to the input and derivative folders on your machine. The variables to modify in each config file are bids_root (input data path), deriv_root (intermediate BIDS outputs), and subjects_dir (FreeSurfer path).
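
For example, the relevant lines in a config file could look like this (all paths below are placeholders to adapt to your machine):

bids_root = "/data/camcan/bids"              # input BIDS data
deriv_root = "/data/camcan/derivatives"      # intermediate BIDS outputs
subjects_dir = "/data/freesurfer/subjects"   # FreeSurfer subjects directory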

Note: For the Cam-CAN dataset, the filterbank-source model requires a reduced-size template head model. After setting your FreeSurfer subjects_dir, you can obtain the scaled MRI called fsaverage_small using:

import mne

# create a copy of the fsaverage template scaled down to 90%, named fsaverage_small
# (subjects_dir: your FreeSurfer subjects directory, as set in the config)
mne.coreg.scale_mri("fsaverage", "fsaverage_small", scale=0.9, subjects_dir=subjects_dir, annot=True, overwrite=True)

The four config files for the datasets are:

  1. config_camcan_meg.py
  2. config_lemon_eeg.py
  3. config_chbp_eeg.py
  4. config_tuab_eeg.py

Once all data is downloaded and the configs are updated, the MNE-BIDS pipeline can be used for preprocessing. We recommend downloading the MNE-BIDS-pipeline repository and placing it in the same folder as this repository, such that its relative path is ../mne-bids-pipeline. For help with installing the dependencies, please consider the dedicated section below.

Note: the MNE-BIDS pipeline is a bit different from other packages. Rather than being installed as a library, it is a collection of scripts: installing it means getting the Python files and making sure the dependencies are met. See its installation instructions.

If all is good to go, preprocessing can be conducted using the following shell commands:

python ../mne-bids-pipeline/run.py --config config_camcan_meg.py --n_jobs 40 --steps=preprocessing
python ../mne-bids-pipeline/run.py --config config_lemon_eeg.py --n_jobs 40 --steps=preprocessing
python ../mne-bids-pipeline/run.py --config config_chbp_eeg.py --n_jobs 40 --steps=preprocessing
python ../mne-bids-pipeline/run.py --config config_tuab_eeg.py --n_jobs 40 --steps=preprocessing

Note: Make sure to choose an appropriate number of jobs for your computer. Above, 40 jobs are used, but this assumes access to a machine with more than 40 CPUs and a lot of RAM.

Note: It can be convenient to run these commands from within IPython, e.g. to benefit from a nicer terminal experience during debugging. Start IPython and use run instead of python. The preprocessing step applies filtering and epoching according to the settings in the config files.
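
For instance, the first preprocessing command above becomes the following inside IPython (run is IPython's script-execution magic and forwards the arguments to the script):

run ../mne-bids-pipeline/run.py --config config_camcan_meg.py --n_jobs 40 --steps=preprocessing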

Then a custom preprocessing step has to be performed involving artifact rejection and re-referencing:

python compute_autoreject.py --n_jobs 40

Note: This will run the computation for all datasets. To perform this step on specific datasets only, check out the -d argument.
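
For example, to run it for a single dataset only (the dataset key below is an assumption; check the script's --help for the accepted values):

python compute_autoreject.py --n_jobs 40 -d lemon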

Once this is done, some additional processing steps are needed for the filterbank-source models:

python ../mne-bids-pipeline/run.py --config config_camcan_meg.py --n_jobs 40 --steps=source
python ../mne-bids-pipeline/run.py --config config_lemon_eeg.py --n_jobs 40 --steps=source
python ../mne-bids-pipeline/run.py --config config_chbp_eeg.py --n_jobs 40 --steps=source
python ../mne-bids-pipeline/run.py --config config_tuab_eeg.py --n_jobs 40 --steps=source

Potential errors can be inspected in the autoreject_log.csv that is written in the dataset-specific derivative directories.

Now feature computation can be launched for the three non-deep-learning benchmarks:

  1. handcrafted
  2. filterbank-riemann
  3. filterbank-source

python compute_features.py --n_jobs 40

Note: This will run the computation for all datasets and all feature types. To visit specific datasets or feature types, check out the -d and -f arguments.
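
For example, restricting the run to one dataset and one feature type might look like this (both values below are assumptions; check the script's --help):

python compute_features.py --n_jobs 40 -d lemon -f filterbank-riemann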

Potential errors can be inspected in the benchmark-specific log files that are written in the dataset-specific derivative directories, e.g. feature_fb_covs_pooled-log.csv for the filterbank features.

If everything went fine up to this point, the following machine-learning benchmarks can finally be run:

  1. dummy
  2. handcrafted
  3. filterbank-riemann
  4. filterbank-source
  5. shallow
  6. deep

python compute_benchmark_age_prediction.py --n_jobs 10

Note: This will run the computation for all datasets and all benchmarks. To visit specific datasets or benchmarks, check out the -d and -b arguments.
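
For example (both values below are assumptions; check the script's --help):

python compute_benchmark_age_prediction.py --n_jobs 10 -d lemon -b filterbank-riemann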

If everything worked out until now, you should find the fold-wise scores for every benchmark on every dataset in ./results.
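
A quick way to inspect one of these files is with pandas (a minimal sketch; the exact columns depend on what the benchmark script writes):

import pandas as pd

# fold-wise cross-validation scores for one benchmark/dataset pair
scores = pd.read_csv("./results/benchmark-deep_dataset-lemon.csv")
print(scores.head())                    # inspect the available columns
print(scores.mean(numeric_only=True))   # aggregate scores across CV folds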


Handling dataset-specific peculiarities prior to computation

For some of the datasets, custom processing of the input data was necessary.

Cam-CAN

The dataset was provided in BIDS format by the curators of the Cam-CAN project, and computation worked out of the box. Note that previous releases were not provided in BIDS format. Moreover, Maxwell filtering was applied to mitigate strong environmental magnetic artifacts; this step only applies to MEG, not EEG.

LEMON

Downloading the data

The data provided by LEMON can be conveniently downloaded using our custom script:

download_data_lemon.py

Make sure to adjust the paths to make this work on your machine. Also note that the script presupposes that the META_File_IDs_Age_Gender_Education_Drug_Smoke_SKID_LEMON.csv file has been downloaded to this repository.

Finishing BIDS conversion

Further steps are necessary to obtain a fully operable BIDS dataset. That effort is summarized in convert_lemon_to_bids.py.

Manual fixes

We noticed that, for the following subjects, the header files pointed to data files with an old naming scheme, leading to errors when reading the files:

  • sub-010193
  • sub-010044
  • sub-010219
  • sub-010020
  • sub-010203

For these subjects, we had to manually edit the *.vhdr files to point to the BIDS names of the marker and data files, e.g. sub-010193.eeg and sub-010193.vmrk. This error may be fixed in a future release of the LEMON data.
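
A sketch of how this fix could be scripted (the directory layout is a placeholder; DataFile and MarkerFile are the BrainVision header fields that point to the data and marker files):

from pathlib import Path

subjects = ["sub-010193", "sub-010044", "sub-010219", "sub-010020", "sub-010203"]
for sub in subjects:
    vhdr = Path("/data/lemon/bids") / sub / "eeg" / f"{sub}.vhdr"  # placeholder layout
    lines = vhdr.read_text().splitlines()
    for i, line in enumerate(lines):
        # repoint the header at the BIDS-named data and marker files
        if line.startswith("DataFile="):
            lines[i] = f"DataFile={sub}.eeg"
        elif line.startswith("MarkerFile="):
            lines[i] = f"MarkerFile={sub}.vmrk"
    vhdr.write_text("\n".join(lines) + "\n")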

CHBP

Downloading the data

The data can be downloaded from synapse: https://doi.org/10.7303/syn22324937

It can be handy to use the command-line client for programmatic download, which can be installed using pip:

pip install synapseclient

Then one can log in using one's credentials ...

synapse login -u "username" -p "password"

... and download specific folders recursively:

synapse get -r syn22324937

Finishing BIDS conversion

Further steps were needed to make the CHBP data work using the MNE-BIDS package. That effort is summarized in convert_chbp_to_bids.py.

Note that future versions of the dataset may require modifications to this approach or render some of these measures unnecessary.

The current work is based on the dataset as it was available in July 2021.

Manual fixes

We found a bug in the participants.tsv file, leading to issues with the BIDS validator. In the input data (July 2021), lines up to line 251 carry trailing whitespace; thereafter, each line terminates at the last character of the "sex" column (F/M). We removed the trailing whitespace to ensure proper file parsing.
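
A minimal sketch of such a cleanup, assuming a placeholder path; it simply strips trailing whitespace from every line:

from pathlib import Path

tsv = Path("/data/chbp/bids/participants.tsv")  # placeholder path
cleaned = "\n".join(line.rstrip() for line in tsv.read_text().splitlines())
tsv.write_text(cleaned + "\n")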


TUAB

BIDS conversion

After downloading the TUAB data, we first needed to create a BIDS dataset. That effort is summarized in convert_tuh_to_bids.py.


Installation of packages and dependencies

The developments initiated by this work have been stabilized and released in the latest versions of the packages we list as dependencies below. You can install these packages using pip. For their respective dependencies, consider the package websites:

  1. MNE

  2. MNE-BIDS

  3. AutoReject

  4. coffeine

  5. Braindecode
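
If needed, all five can be installed in one go (the names below are the package names as published on PyPI):

pip install mne mne-bids autoreject coffeine braindecode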

The MNE-BIDS pipeline repository is not a package in the classical sense. We recommend using the latest version from GitHub. Please consider the installation instructions.

brain-age-benchmark-paper's People

Contributors

agramfort, apmellot, dengemann, gemeinl, hubertjb


brain-age-benchmark-paper's Issues

compute_autoreject.py on TUAB erroring at epochs.pick_channels (lines 62-68)

Hey all,

What is meant by the comment on line 62 in compute_autoreject.py?

# XXX Seems to be necessary for TUAB - figure out why
if 'eeg' in epochs:
    montage = mne.channels.make_standard_montage('standard_1005')
    epochs.set_montage(montage)
if analyze_channels:
    epochs.pick_channels(analyze_channels)

I'm getting an error while running the TUAB data at line 68, so I thought maybe the comment on line 62 is related / the answer?

Thanks
Anthony

Code TODO before submitting

This is not a list of what would be nice to have, but of what is crucial for the reviewers to inspect our code / software contribution.
We should make sure all points are addressed by next Monday.

  • write README with clear instructions on how to use the repo
  • release coffeine
  • release (?) mne-bids-pipeline
  • release mne-features
  • release braindecode ?
  • refactor some of the code (put functions from scripts into utils.py / a library)

What else? @agramfort @gemeinl @hubertjb @apmellot ...

Dataset sources

I want to download all the datasets used for the benchmark. Maybe we can gather the links in this issue.

TUAB

LEMON

Cam-CAN

CHBP

@dengemann can you confirm that these are correct / update the links if they are incorrect? Thanks!

Issues with preprocessing CHBP

Conversion of CHBP fails due to the path definition in convert_chbp_to_bids.py: https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/a2305966a30f8e2f6bed979c7152fac6c7253112/convert_chbp_to_bids.py#L14-L15. To my understanding, paths should only be defined in config_chbp_eeg.py.

compute_autoreject.py fails for CHBP due to the setting of the montage in https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/a2305966a30f8e2f6bed979c7152fac6c7253112/compute_autoreject.py#L62-L64.
Apparently, this is something that should only be done for TUAB.

complete deep learning (shallow, deep) benchmarks

Now that the benchmark script seems battle-tested, we still need to compute the results. In the figure below, a few deep boxes are missing :)

[Screenshot: intermediate benchmark results figure, 2021-10-27]

I will take care of the missing handcrafted box.

The idea would be that @gemeinl and @hubertjb share a screen and fight / debug together with our Inria server.

I'm only one call / message away.

Incorrect handling of n_jobs > 1 in shallow / deep benchmark

I just want to point out that this message (https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/main/compute_benchmark_age_prediction.py#L257-L259) is not what I intended.
It is true that we do not use n_jobs to run several folds of the CV in parallel.
However, this does not mean that you should set n_jobs to 1 when calling the script.
n_jobs is handed to the data loaders for the train and validation sets (https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/main/compute_benchmark_age_prediction.py#L228, https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/main/X_y_model.py#L340-L341) and can speed up computations, since the next batch of data is lazily loaded for the GPU while it computes on the previous batch. See @hubertjb's work here: braindecode/braindecode#75.

Adaptive Average Pooling for Cropped Decoding?

I have one suggestion for how cropped decoding may be implementable in an easy way, and also some changes to the model hyperparameters to match the version I used in pathology detection.

https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/1204bfda96c8f65c7067f705e5de1c844dea8b87/deep_learning_utils.py#L279-L284

model = ShallowFBCSPNet(
    in_chans=n_channels,
    n_classes=1,
    input_window_samples=None,
    final_conv_length=35,
)

https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/1204bfda96c8f65c7067f705e5de1c844dea8b87/deep_learning_utils.py#L289-L293

model = Deep4Net(
    in_chans=n_channels,
    n_classes=1, 
    input_window_samples=None,
    final_conv_length=1,
    stride_before_pool=True,
)

https://github.com/dengemann/meeg-brain-age-benchmark-paper/blob/1204bfda96c8f65c7067f705e5de1c844dea8b87/deep_learning_utils.py#L299-L303

new_model.add_module("global_pool", nn.AdaptiveAvgPool1d(1))
new_model.add_module("squeeze2", Expression(squeeze_final_output))

Something like this may work without any further changes

Preprocessing required MNE 0.24.dev

The installation instructions include mne stable (0.23.4).
Running python ../mne-bids-pipeline/run.py config_tuab.py --steps=preprocessing will fail because mne.concatenate_epochs does not have the argument 'on_mismatch'; apparently it was introduced in mne 0.24.dev.

filterbank-riemann on TUAB cross_validate array shape error

Hey brain-age team,

Everything has been going great with the TUAB analysis: the dummy, shallow, and deep analyses have all run fine with expected results. However, when running the filterbank-riemann analysis I'm getting a cross_validate error, ValueError: could not broadcast input array from shape (16,16) into shape (20,20), which results in 10 out of 10 fits failing.

Here's the full error:

[Screenshots of the error traceback, 2022-04-26]

Best,
Anthony

Request for Python Environment Details

Dear contributors,

Thanks for your valuable work and your code release. I am interested in reproducing your results; however, the code does not seem to be working with the recent changes in the dependencies.

Could you please share in your repo a venv / conda environment that has been used in your paper?

Best,
Leo.

writing TODO before submitting

Here is a small progress tracker, mostly focusing on the manuscript text (all of this must happen by the end of Tuesday, November 30th; I will then send what we have to Roche's internal communications reviewers).

  • update main figures and table after changes from #43 and #44
  • take into account open suggestions / comments by Tonio on wording
  • software section: list all packages, perhaps highlight the ones that are partly our contributions
  • conclusion
  • research ideas
  • Author contributions
  • Acknowledgment section: Cuba and Leipzig and TUAB
  • write discussion, focus on outline indicated in text
  • complete / write methods section on validation / scoring strategy
  • spell out details / rationale of hand-crafted features + add citations
  • complete main text in results section, focus on dividing the labor nicely between the captions and main text
  • write abstract
  • integrate data flow table in main text
  • make results table

Questions about data splits and training details

Hello -

Thank you for putting this benchmark together and releasing all the code! This is certainly an aspirational level of research and code transparency! :) I don't have any issue/bug to report with the paper or the released code.

It'd be great if the authors could comment on the following:

  1. Would cropped decoding make a significant difference in training stability/dynamics compared to trialwise decoding?
  2. Were the subject splits done using some form of stratification by age group?
  3. What is the rationale when normalizing EEG across multiple subjects? From what I understand, the data mean remained untouched during normalization/scaling.
  4. Are the models trained using the 10s-level MAE or the subject-level MAE?
  5. During prediction, would there be two levels of averaging to compute the reported MAE? 1) across multiple crops to get epoch-level predictions; 2) across multiple epochs to get subject-level predictions, then compute final MAE?
  6. Would training on one dataset and testing on another as additional model validation make sense? In this case, recovering unscaled predictions made on an unseen dataset based on train set age mean/stdev may give incorrect/negative ages due to age distribution differences. Is there an alternative way to normalize age targets for cross-dataset evaluations?

Thank you for your time!

--Neeraj

compute_benchmark_age_prediction.py line 315 error

Hey brain-age benchmark team,

I've encountered an error that I'm having trouble solving. Running the shallow benchmark analysis on the chbp data, I'm getting the error in the picture below after cross-validation is done.

[Screenshot of the error traceback]

line 315: ys.loc[cv_splits[:, 1], 'cv_split'] = cv_splits[:, 0].astype(int)

ys and cv_splits have different lengths: cv_splits[:, 0] has many more rows than ys, so the code fails to find appropriate places for many of the cv_splits entries.

ys length equals the number of subjects
cv_splits length is ~3221

Handcrafted, dummy, filterbank-riemann have all worked well. Haven't tried deep yet.

Thanks,
Anthony

Source steps for TUH data

Hi all,
I have been trying to replicate the results for the filterbank-source method on TUH data; however, running the preprocessing and source steps using mne-bids-pipeline does not work for me.
If anyone has replicated these results, it would be nice if you could help me with this. Could you let me know which version of mne-bids-pipeline you used, and whether I need to make any further changes to the configuration file other than the changes listed in this repo's README?
