toad's Introduction

TOAD 🐸

AI-based Pathology Predicts Origins for Cancers of Unknown Primary

Nature

Read Link | Journal Link | Interactive Demo | Cite

TL;DR: In this work we propose to use weakly-supervised multi-task computational pathology to aid the differential diagnosis for cancers of unknown primary (CUP). CUPs represent 1-2% of all cancers and have poor prognosis because modern cancer treatment is specific to the primary. We present TOAD (Tumor Origin Assessment via Deep-learning) for predicting the primary origin of these tumors from H&E images without using immunohistochemistry, molecular testing or clinical correlation. Our model is trained on 22,833 gigapixel diagnostic whole slide images (WSIs) from 18 different primary cancer origins and tested on a held-out set of 6,499 WSIs and an external set of 682 WSIs from 200+ institutions. Furthermore, we curated a large multi-institutional dataset of 743 CUP cases originating in 150+ different medical centers and validated our model against a subset of 317 cases for which a primary differential was assigned based on evidence from extensive IHC testing, radiologic and/or clinical correlation.

© This code is made available for non-commercial academic purposes.

TOAD: Tumor Origin Assessment via Deep-learning

Pre-requisites:

  • Linux (Tested on Ubuntu 18.04)
  • NVIDIA GPU (Tested on Nvidia GeForce RTX 2080 Ti x 16)
  • Python (3.7.7), h5py (2.10.0), matplotlib (3.1.1), numpy (1.18.1), opencv-python (4.1.1), openslide-python (1.1.1), openslide (3.4.1), pandas (1.0.3), pillow (7.0.0), PyTorch (1.5.1), scikit-learn (0.22.1), scipy (1.3.1), tensorflow (1.14.0), tensorboardx (1.9), torchvision (0.6).

Installation Guide for Linux (using anaconda)

Installation Guide

Data Preparation

We chose to encode each tissue patch with a 1024-dim feature vector using a truncated, pretrained ResNet50. For each WSI, these features are expected to be saved as matrices of torch tensors of size N x 1024, where N is the number of patches from each WSI (varies from slide to slide). The following folder structure is assumed:

DATA_ROOT_DIR/
    └──DATASET_DIR/
         ├── h5_files
                ├── slide_1.h5
                ├── slide_2.h5
                └── ...
         └── pt_files
                ├── slide_1.pt
                ├── slide_2.pt
                └── ...

DATA_ROOT_DIR is the base directory of all datasets (e.g. the directory to your SSD). DATASET_DIR is the name of the folder containing data specific to one experiment, and the features from each slide are stored as .pt files.

Please refer to CLAM for examples of how to perform this feature extraction step.
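Before training, it can help to confirm that the folder layout above is complete. The following is a minimal sketch, not part of the TOAD codebase: the helper name find_unmatched_slides is hypothetical, and it assumes that every slide is expected to have both an .h5 file and a .pt file with the same stem.

```python
from pathlib import Path
import tempfile


def find_unmatched_slides(dataset_dir):
    """Return (ids missing a .pt file, ids missing an .h5 file).

    Hypothetical helper: assumes each slide should appear in both
    h5_files/ and pt_files/ under the same file stem.
    """
    dataset_dir = Path(dataset_dir)
    h5_ids = {p.stem for p in (dataset_dir / "h5_files").glob("*.h5")}
    pt_ids = {p.stem for p in (dataset_dir / "pt_files").glob("*.pt")}
    return sorted(h5_ids - pt_ids), sorted(pt_ids - h5_ids)


# Build a throwaway layout with empty placeholder files to demonstrate.
with tempfile.TemporaryDirectory() as root:
    dataset_dir = Path(root) / "DATASET_DIR"
    (dataset_dir / "h5_files").mkdir(parents=True)
    (dataset_dir / "pt_files").mkdir(parents=True)
    for name in ("slide_1.h5", "slide_2.h5"):
        (dataset_dir / "h5_files" / name).touch()
    (dataset_dir / "pt_files" / "slide_1.pt").touch()

    missing_pt, missing_h5 = find_unmatched_slides(dataset_dir)
    print("missing .pt:", missing_pt)
    print("missing .h5:", missing_h5)
```

In practice you would call the helper on os.path.join(DATA_ROOT_DIR, 'DATASET_DIR') rather than on a temporary directory.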

Datasets

Datasets are expected to be prepared in a csv format containing at least 5 columns: case_id, slide_id, sex, and two columns for the slide-level labels, label and site. Each case_id is a unique identifier for a patient, while the slide_id is a unique identifier for a slide that corresponds to the name of an extracted feature .pt file. This is necessary because one patient often has multiple slides, which might also have different labels. When train/val/test splits are created, we also make sure that slides from the same patient do not go to different splits. The slide ids should be consistent with what was used during the feature extraction step. We provide a dummy example of a dataset csv file in the dataset_csv folder, named dummy_dataset.csv. You are free to input the labels for your data in any way as long as you specify the appropriate dictionary maps under the label_dicts argument of the dataset object's constructor (see below). For demonstration purposes, we used 'M' and 'F' for sex and 'Primary' and 'Metastatic' for the site. Our 18 classes of tumor origins are labeled 'Lung', 'Breast', 'Colorectal', 'Ovarian', 'Pancreatobiliary', 'Adrenal', 'Skin', 'Prostate', 'Renal', 'Bladder', 'Esophagogastric', 'Thyroid', 'Head Neck', 'Glioma', 'Germ Cell', 'Endometrial', 'Cervix', and 'Liver'.
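Because every label string in the csv must appear as a key in the corresponding dictionary map, a misspelled label will surface as a KeyError when the dataset object is built. The sketch below shows one way to check a csv up front. It is not part of the TOAD codebase: check_dataset_csv is a hypothetical helper, the rows are made up, and the label dictionaries are shortened for brevity.

```python
import csv
import io

REQUIRED_COLUMNS = ["case_id", "slide_id", "sex", "label", "site"]


def check_dataset_csv(csv_file, label_dicts, label_cols):
    """Return a list of human-readable problems found in a dataset csv.

    Hypothetical helper: label_dicts and label_cols are paired by position,
    mirroring the constructor arguments described in this README.
    """
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for col, mapping in zip(label_cols, label_dicts):
            if row[col] not in mapping:
                problems.append(
                    f"row {row_number}: unknown {col} value {row[col]!r}"
                )
    return problems


# Tiny in-memory example with one deliberately misspelled label.
example = io.StringIO(
    "case_id,slide_id,sex,label,site\n"
    "patient_0,slide_0,F,Lung,Primary\n"
    "patient_1,slide_1,M,Lungg,Metastatic\n"
)
label_dicts = [
    {"Lung": 0, "Breast": 1},           # shortened; the full map has 18 classes
    {"Primary": 0, "Metastatic": 1},
    {"F": 0, "M": 1},
]
label_cols = ["label", "site", "sex"]

problems = check_dataset_csv(example, label_dicts, label_cols)
print(problems)
```

To check a real file, pass an open file handle, for example open('dataset_csv/dummy_dataset.csv', newline=''), in place of the in-memory example.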

Dataset objects used for actual training/validation/testing can be constructed using the Generic_MIL_MTL_Dataset Class (defined in datasets/dataset_mtl_concat.py). Examples of such dataset objects passed to the models can be found in both main_mtl_concat.py and eval_mtl_concat.py.

For training, look under main_mtl_concat.py:

if args.task == 'dummy_mtl_concat':
    args.n_classes=18
    dataset = Generic_MIL_MTL_Dataset(csv_path = 'dataset_csv/dummy_dataset.csv',
                            data_dir= os.path.join(args.data_root_dir,'DATASET_DIR'),
                            shuffle = False, 
                            seed = args.seed, 
                            print_info = True,
                            label_dicts = [{'Lung':0, 'Breast':1, 'Colorectal':2, 'Ovarian':3, 
                                            'Pancreatobiliary':4, 'Adrenal':5, 
                                             'Skin':6, 'Prostate':7, 'Renal':8, 'Bladder':9, 
                                             'Esophagogastric':10,  'Thyroid':11,
                                             'Head Neck':12,  'Glioma':13, 
                                             'Germ Cell':14, 'Endometrial': 15, 
                                             'Cervix': 16, 'Liver': 17},
                                            {'Primary':0,  'Metastatic':1},
                                            {'F':0, 'M':1}],
                            label_cols = ['label', 'site', 'sex'],
                            patient_strat= False)

In addition to the number of classes (args.n_classes), the following arguments need to be specified:

  • csv_path (str): Path to the dataset csv file
  • data_dir (str): Path to saved .pt features for the dataset
  • label_dicts (list of dict): List of dictionaries with key, value pairs for converting str labels to int for each label column
  • label_cols (list of str): List of column headings to use as labels and map with label_dicts

Finally, the user should add this specific 'task' specified by this dataset object to be one of the choices in the --task arguments as shown below:

parser.add_argument('--task', type=str, choices=['dummy_mtl_concat'])

Training Splits

For evaluating the algorithm's performance, we randomly partitioned our dataset into training, validation and test splits. An example 70/10/20 split for the dummy dataset can be found in splits/dummy_mtl_concat. These splits can be automatically generated using the create_splits.py script with minimal modification, just like with main_mtl_concat.py. For example, the dummy splits were created by calling:

python create_splits.py --task dummy_mtl_concat --seed 1 --k 1

The script uses the Generic_WSI_MTL_Dataset Class, for which the constructor expects the same arguments as Generic_MIL_MTL_Dataset (without the data_dir argument). For details, please refer to the dataset definition in datasets/dataset_mtl_concat.py.
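To illustrate the idea of keeping all slides from one patient in the same split, here is a minimal sketch of a patient-level 70/10/20 partition. It does not reproduce create_splits.py: patient_level_split is a hypothetical helper, the case and slide ids are made up, and the sketch ignores class balance, which a real split script may take into account.

```python
import random
from collections import defaultdict


def patient_level_split(rows, fractions=(0.7, 0.1, 0.2), seed=1):
    """Split (case_id, slide_id) pairs into train/val/test by patient.

    Patients, not slides, are shuffled and partitioned, so all slides
    from one case_id end up in the same split.
    """
    slides_by_case = defaultdict(list)
    for case_id, slide_id in rows:
        slides_by_case[case_id].append(slide_id)

    case_ids = sorted(slides_by_case)          # sort first for reproducibility
    random.Random(seed).shuffle(case_ids)

    n_train = round(fractions[0] * len(case_ids))
    n_val = round(fractions[1] * len(case_ids))
    groups = {
        "train": case_ids[:n_train],
        "val": case_ids[n_train:n_train + n_val],
        "test": case_ids[n_train + n_val:],    # remainder goes to test
    }
    return {
        name: [slide for case in cases for slide in slides_by_case[case]]
        for name, cases in groups.items()
    }


# Ten hypothetical patients with two slides each.
rows = [(f"patient_{i}", f"slide_{i}_{j}") for i in range(10) for j in range(2)]
splits = patient_level_split(rows)
print({name: len(slides) for name, slides in splits.items()})
```

With ten patients and two slides each, the split sizes are 14, 2 and 4 slides, and no patient contributes slides to more than one split.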

Training

CUDA_VISIBLE_DEVICES=0 python main_mtl_concat.py --drop_out --early_stopping --lr 2e-4 --k 1 --exp_code dummy_mtl_sex  --task dummy_mtl_concat  --log_data  --data_root_dir DATA_ROOT_DIR

The GPU to use for training can be specified using CUDA_VISIBLE_DEVICES; in the example command, GPU 0 is used. Other arguments such as --drop_out, --early_stopping, --lr, --reg, and --max_epochs can be specified to customize your experiments.

For information on each argument, see:

python main_mtl_concat.py -h

By default, results will be saved to results/exp_code, corresponding to the exp_code input argument from the user. If tensorboard logging is enabled (with the argument toggle --log_data), the user can go into the results folder for the particular experiment and run:

tensorboard --logdir=.

This should open a browser window and show the logged training/validation statistics in real time.

Evaluation

Users also have the option of using the evaluation script to test the performance of trained models. Examples corresponding to the models trained above are provided below:

CUDA_VISIBLE_DEVICES=0 python eval_mtl_concat.py --drop_out --k 1 --models_exp_code dummy_mtl_sex_s1 --save_exp_code dummy_mtl_sex_s1_eval --task dummy_mtl_concat  --results_dir results --data_root_dir DATA_ROOT_DIR

For information on each commandline argument, see:

python eval_mtl_concat.py -h

To test trained models on your own custom datasets, you can add them to eval_mtl_concat.py the same way as for main_mtl_concat.py.

Issues

  • Please report all issues on the public forum.

License

© Mahmood Lab - This code is made available under the GPLv3 License and is available for non-commercial academic purposes.

Reference

If you find our work useful in your research or if you use parts of this code please consider citing our paper:

Lu, M.Y., Chen, T.Y., Williamson, D.F.K. et al. AI-based pathology predicts origins for cancers of unknown primary. Nature 594, 106–110 (2021). https://doi.org/10.1038/s41586-021-03512-4

@article{lu2021ai,
  title={AI-based pathology predicts origins for cancers of unknown primary},
  author={Lu, Ming Y and Chen, Tiffany Y and Williamson, Drew FK and Zhao, Melissa and Shady, Maha and Lipkova, Jana and Mahmood, Faisal},
  journal={Nature},
  volume={594},
  number={7861},
  pages={106--110},
  year={2021},
  publisher={Nature Publishing Group}
}

toad's People

Contributors

faisalml, fedshyvana

toad's Issues

How are TCGA Labels Acquired?

First of all, I would like to express my appreciation for all the hard work that has been put into this research article.

Moving to the problem: I was going through the GDC tool for downloading Diagnostic Slides for patients, but I was not able to find the site label, i.e. primary or metastatic. I have read your paper multiple times to understand how to extract the labels, but the instructions are not very clear. I also went through the documentation of the TCGA GDC portal but was unable to find it.

Therefore, could you point out how you were able to extract the labels, specifically the site, i.e. primary or metastatic?

I would really appreciate the help. Thanks

Demo video

Hi,
I have read the documentation and it is a really great project, but could you add a video that walks through the steps in order?
Thank you

The origin and label of metastatic samples

Could you indicate whether the label of the metastatic sample you used was consistent with the primary site or with the section location? If the primary colorectal cancer has metastasized to the liver, do you use images from the colorectal or liver? What is the label when using slices from the liver? I would appreciate it if you could answer me!

Fail to eval

Hi, I tried running the evaluation command

CUDA_VISIBLE_DEVICES=0 python eval_mtl_concat.py --drop_out --k 1 --models_exp_code dummy_mtl_sex_s1 --save_exp_code dummy_mtl_sex_s1_eval --task dummy_mtl_concat  --results_dir results --data_root_dir DATA_ROOT_DIR

I changed the task in the command from
--task study_v2_mtl_sex
to
--task dummy_mtl_concat
because the study_v2_mtl_sex task could not be found. But then I ran into this error:

Traceback (most recent call last):
  File "eval_mtl_concat.py", line 122, in <module>
    model, results_dict = eval(split_dataset, args, ckpt_paths[ckpt_idx])
  File "/media/chingwei/bf764154-3611-460c-9bfd-4efed447a219/chingwei/Nabila/TOAD-NGS/utils/eval_utils_mtl_concat.py", line 40, in eval
    results_dict = summary(model, loader, args)
  File "/media/chingwei/bf764154-3611-460c-9bfd-4efed447a219/chingwei/Nabila/TOAD-NGS/utils/eval_utils_mtl_concat.py", line 120, in summary
    cls_test_error /= len(loader)
ZeroDivisionError: float division by zero

Could you please provide any guidance on how to resolve this error?
OS : ubuntu
torch : 1.10.1+cu111

pretrained model weights

Hello and thank you for sharing your work!
Could you please provide the last checkpoint for the pretrained model?
Thank you in advance, Lucia

Model interpretability

Hello, thanks for your great work!
Could you tell me from which layer to extract the feature vectors for visualization?

Heatmap visualization

Hi, thank you for creating such amazing work. I'm trying to create heatmap visualizations using the model from TOAD. As I understand it, you created them using the heatmap code in the CLAM repository. I have already tried that, but it didn't work. Can you help me figure out why, or do you have a procedure for doing it?

My problem when directly applying the create heatmaps code in CLAM is the "sex" input, which is not used or available in the CLAM package.

Patient data

Where can I find the patient data for the histology slides?

KeyError in Training Splits Part

Hi, I tried running the command on Colab

!python create_splits.py --task dummy_mtl_concat --seed 1 --k 1

And I am getting this error:

Traceback (most recent call last):
  File "create_splits.py", line 25, in <module>
    dataset = Generic_WSI_MTL_Dataset(csv_path = '/content/drive/MyDrive/Colab Notebooks/MTDL/TOAD/dataset_csv/dummy_dataset.csv',
  File "/content/drive/My Drive/Colab Notebooks/MTDL/TOAD/datasets/dataset_mtl_concat.py", line 71, in __init__
    slide_data = self.df_prep(slide_data, self.label_dicts, self.label_cols)
  File "/content/drive/My Drive/Colab Notebooks/MTDL/TOAD/datasets/dataset_mtl_concat.py", line 133, in df_prep
    data.at[i, 'label'] = label_dicts[0][key]
KeyError: 'Esophagogogastric'

Could anyone please explain why this is happening and how to resolve this error?
