medmnist's Introduction

MedMNIST: medmnist.com

Data (Zenodo) | Publication (Nature Scientific Data'23 / ISBI'21) | Preprint (arXiv)

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni

18x Standardized Datasets for 2D and 3D Biomedical Image Classification

Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224

We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools.

Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!

MedMNISTv2_overview

For more details, please refer to our paper:

MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification (Nature Scientific Data'23)

or its conference version:

MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis (ISBI'21)

Key Features

  • Diverse: It covers diverse data modalities, dataset scales (from 100 to 100,000), and tasks (binary/multi-class, multi-label, and ordinal regression). It is as diverse as the VDD and MSD to fairly evaluate the generalizable performance of machine learning algorithms in different settings, but both 2D and 3D biomedical images are provided.
  • Standardized: Each sub-dataset is pre-processed into the same format, which requires no background knowledge for users. As an MNIST-like dataset collection to perform classification tasks on small images, it primarily focuses on the machine learning part rather than the end-to-end system. Furthermore, we provide standard train-validation-test splits for all datasets in MedMNIST, so algorithms can be easily compared.
  • User-Friendly: The small size of 28x28 (2D) or 28x28x28 (3D) is lightweight and ideal for evaluating machine learning algorithms. We also offer a larger-size version, MedMNIST+: 64x64 (2D), 128x128 (2D), 224x224 (2D), and 64x64x64 (3D). Serving as a complement to the 28-size MedMNIST, this could be a standardized resource for developing medical foundation models. All these datasets are accessible via the same API.
  • Educational: As an interdisciplinary research area, biomedical image analysis is difficult to get started with for researchers from other communities, as it requires background knowledge from computer vision, machine learning, biomedical imaging, and clinical science. Our data, released under Creative Commons (CC) licenses, is easy to use for educational purposes.

Please note that this dataset is NOT intended for clinical use.

Code Structure

  • medmnist/:
    • dataset.py: PyTorch datasets and dataloaders of MedMNIST.
    • evaluator.py: Standardized evaluation functions.
    • info.py: Dataset information dict for each subset of MedMNIST.
  • examples/:
    • getting_started.ipynb: To explore the MedMNIST dataset with jupyter notebook. It is ONLY intended for a quick exploration, i.e., it does not provide full training and evaluation functionalities.
    • getting_started_without_PyTorch.ipynb: This notebook provides snippets about how to use MedMNIST data (the .npz files) without PyTorch.
  • setup.py: To install medmnist as a module.
  • [EXTERNAL] MedMNIST/experiments: training and evaluation scripts to reproduce both 2D and 3D experiments in our paper, including PyTorch, auto-sklearn, AutoKeras and Google AutoML Vision together with their weights ;)

Installation and Requirements

Set up the required environment and install medmnist as a standard Python package from PyPI:

pip install medmnist

Or install from source:

pip install --upgrade git+https://github.com/MedMNIST/MedMNIST.git

Check whether you have installed the latest code version:

>>> import medmnist
>>> print(medmnist.__version__)

The code requires only common Python environments for machine learning. Basically, it was tested with

  • Python 3 (>=3.6)
  • PyTorch==1.3.1
  • numpy==1.18.5, pandas==0.25.3, scikit-learn==0.22.2, Pillow==8.0.1
  • fire, scikit-image

Higher (or lower) versions should also work (perhaps with minor modifications).

Quick Start

To use the standard 28-size (MNIST-like) version utilizing the downloaded files:

>>> from medmnist import PathMNIST
>>> train_dataset = PathMNIST(split="train")

To enable automatic downloading, set download=True:

>>> from medmnist import NoduleMNIST3D
>>> val_dataset = NoduleMNIST3D(split="val", download=True)

Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:

>>> from medmnist import ChestMNIST
>>> test_dataset = ChestMNIST(split="test", download=True, size=224)

If you use PyTorch...

  • Great! Our code is designed to work with PyTorch.

  • Explore the MedMNIST dataset with jupyter notebook (getting_started.ipynb), and train basic neural networks in PyTorch.

If you do not use PyTorch...

  • Although our code is tested with PyTorch, you are free to parse the data files with your own code (without PyTorch or even without Python!), as they are only standard NumPy serialization files. It is simple to create a dataset without PyTorch.
  • Go to getting_started_without_PyTorch.ipynb, which provides snippets about how to use MedMNIST data (the .npz files) without PyTorch.
  • Simply change the superclass of MedMNIST from torch.utils.data.Dataset to collections.abc.Sequence (collections.Sequence on older Python versions), and you will get a standard dataset without PyTorch. Check dataset_without_pytorch.py for more details.
  • You still have most functionality of our MedMNIST code ;)
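
As a rough illustration of the superclass swap described above, the following sketch wraps two in-memory arrays in a collections.abc.Sequence subclass. The class name ArrayDataset and its arguments are hypothetical and are not part of the medmnist package; see dataset_without_pytorch.py for the project's own version.

```python
from collections.abc import Sequence

import numpy as np


class ArrayDataset(Sequence):
    """Hypothetical minimal dataset over in-memory image and label arrays."""

    def __init__(self, images, labels, transform=None):
        if len(images) != len(labels):
            raise ValueError("images and labels must have the same length")
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]


# Synthetic stand-in data: 5 gray-scale images of size 28 x 28 with one label each.
dataset = ArrayDataset(
    np.zeros((5, 28, 28), dtype=np.uint8),
    np.arange(5, dtype=np.uint8).reshape(5, 1),
)
image, label = dataset[2]
print(len(dataset), image.shape, label)
```

Because Sequence supplies iteration on top of __len__ and __getitem__, such an object can be looped over directly without any PyTorch dependency.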

Dataset

Please download the dataset(s) via Zenodo. You could also use our code to download automatically by setting download=True in dataset.py.

The MedMNIST dataset contains several subsets. Each subset (e.g., pathmnist.npz) comprises 6 keys: train_images, train_labels, val_images, val_labels, test_images and test_labels.

  • train_images / val_images / test_images: N × 28 × 28 for 2D gray-scale datasets, N × 28 × 28 × 3 for 2D RGB datasets, N × 28 × 28 × 28 for 3D datasets. N denotes the number of samples.
  • train_labels / val_labels / test_labels: N × L. N denotes the number of samples. L denotes the number of task labels; for single-label (binary/multi-class) classification, L=1, and {0,1,2,3,..,C} denotes the category labels (C=1 for binary); for multi-label classification L!=1, e.g., L=14 for chestmnist.npz.
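
The layout above can be explored with NumPy alone. The sketch below does not download anything: it writes a small synthetic file with the same six keys (the file name toy.npz and the sample counts are made up for illustration) and reads it back the way a real subset file could be read.

```python
import os
import tempfile

import numpy as np

# Build a synthetic stand-in for a 2D RGB subset (hypothetical "toy.npz").
rng = np.random.default_rng(0)
arrays = {}
for split, n in [("train", 8), ("val", 2), ("test", 4)]:
    arrays[f"{split}_images"] = rng.integers(0, 256, size=(n, 28, 28, 3), dtype=np.uint8)
    arrays[f"{split}_labels"] = rng.integers(0, 9, size=(n, 1), dtype=np.uint8)

path = os.path.join(tempfile.mkdtemp(), "toy.npz")
np.savez(path, **arrays)

# Read the file back: each key maps to one array.
with np.load(path) as data:
    keys = sorted(data.files)
    train_images = data["train_images"]
    train_labels = data["train_labels"]

print(keys)
print(train_images.shape, train_labels.shape)
```

For a downloaded subset, only the path would change; the keys and the N-first array layout are as described in the list above.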

Additionally, we provide a CSV file for each MedMNIST subset here, which maps the "image_id" in the subset to the corresponding image in the source dataset. For each entry, it details the specific "split" and "index" within the MedMNIST subset, along with the corresponding image name from the official source dataset.

Command Line Tools

  • List all available datasets:

      python -m medmnist available
    
  • Download available datasets of a specific size (size=None (28) by default):

      python -m medmnist download --size=28
    

    To download all available sizes:

      python -m medmnist download --size=all
    
  • Delete all downloaded npz from root:

      python -m medmnist clean
    
  • Print the dataset details given a subset flag:

      python -m medmnist info --flag=xxxmnist
    
  • Save the dataset as standard figure and csv files, which could be used for AutoML tools, e.g., Google AutoML Vision:

    for 2D datasets:

      python -m medmnist save --flag=xxxmnist --folder=tmp/ --postfix=png --download=True --size=28
    

    for 3D datasets:

      python -m medmnist save --flag=xxxmnist3d --folder=tmp/ --postfix=gif --download=True --size=28
    

    By default, download=False and size=None (28).

  • Parse and evaluate a standard result file, refer to Evaluator.parse_and_evaluate for details.

      python -m medmnist evaluate --path=folder/{flag}{size_flag}_{split}@{run}.csv
    

    Here, size_flag is empty for size-28 images and _{size} for larger images (e.g., "_64"). For example:

      python -m medmnist evaluate --path=bloodmnist_64_val_[AUC]0.486_[ACC][email protected]
    

    or

      python -m medmnist evaluate --path=chestmnist_test_[AUC]0.500_[ACC][email protected]
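
    The naming pattern {flag}{size_flag}_{split}@{run}.csv can be assembled programmatically. The helper below is a hypothetical sketch (result_path is not a medmnist function) that follows only the pattern as stated; the example file names above also carry metric fields, which it does not model.

```python
def result_path(flag, split, run, size=None, folder="folder"):
    """Build a result file path following {flag}{size_flag}_{split}@{run}.csv."""
    # size_flag is empty for the default 28-size images, "_<size>" otherwise.
    size_flag = "" if size in (None, 28) else f"_{size}"
    return f"{folder}/{flag}{size_flag}_{split}@{run}.csv"


print(result_path("chestmnist", "test", "run1"))
print(result_path("bloodmnist", "val", "run1", size=64))
```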
    

License and Citation

The code is under Apache-2.0 License.

The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

If you find this project useful in your research, please cite the following papers:

Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.

Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

or using the bibtex:

@article{medmnistv2,
    title={MedMNIST v2-A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
    author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
    journal={Scientific Data},
    volume={10},
    number={1},
    pages={41},
    year={2023},
    publisher={Nature Publishing Group UK London}
}
 
@inproceedings{medmnistv1,
    title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
    author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
    booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
    pages={191--195},
    year={2021}
}

Please also cite source data paper(s) of the MedMNIST subset(s) as per the description on the project page.

Release Notes

  • v3.0.1: Updated the downloading error message to make it more instructive.
  • v3.0.0: MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.
  • v2.2.4: Removed a small number of blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.
  • v2.2.3: DermaMNIST license changed to CC BY-NC 4.0
  • v2.2.2: Python 3.11 Sequence from collections.abc supported
  • v2.2.1: PyPI info updated
  • v2.2.0: montage method supported for scikit-image>=0.20.0
  • v2.1.0: NoduleMNIST3D data error fixed
  • v2.0.0: MedMNIST v2 release (on PyPI)
  • v1.0.0: MedMNIST v1 release
  • v0.2.0: MedMNIST beta release

medmnist's People

Contributors

duducheng, orientnab, threesrr

medmnist's Issues

Mean and Standard Deviation for the datasets while normalizing

Dear Authors,
Thank you for the dataset.
I am looking at the getting_started.ipynb, for pathmnist it is said that the normalization transform is the following - data_transform = transforms.Compose([ transforms.ToTensor(), transforms.Normalize(mean=[.5], std=[.5])])
The values 0.5, 0.5 are being used. I have the following questions.

  1. Does this value work for all the datasets in medmnist?
  2. Is 0.5, 0.5 the correct mean and standard deviations, or are they just approximate numbers?
  3. Is there a place where I can find datasets and their corresponding mean and standard deviation values so I can use them in my method?

Thanks for your time and help,
Megh
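
For readers with the same question: mean=.5 and std=.5 simply rescale [0, 1] inputs to [-1, 1] and are not dataset statistics. One way to obtain per-dataset values is to compute them from the training images. The sketch below assumes 2D uint8 arrays in the N × 28 × 28 or N × 28 × 28 × 3 layout; the function name channel_stats is hypothetical, and the data is synthetic.

```python
import numpy as np


def channel_stats(images):
    """Per-channel mean and std of uint8 images after scaling to [0, 1]."""
    x = images.astype(np.float64) / 255.0
    if x.ndim == 3:
        # Gray-scale input (N x H x W): add a trailing channel axis.
        x = x[..., np.newaxis]
    # Reduce over samples, height and width; keep the channel axis.
    return x.mean(axis=(0, 1, 2)), x.std(axis=(0, 1, 2))


# Synthetic RGB batch: red and blue fully on, green fully off.
images = np.full((4, 28, 28, 3), 255, dtype=np.uint8)
images[..., 1] = 0
mean, std = channel_stats(images)
print(mean, std)
```

The resulting arrays could then be passed as the mean and std arguments of a normalization transform, computed on the training split only.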

separation of concern and publication on PyPI

I just found this project by chance. I think it is a wonderful idea to have this many different modalities of data formatted like the MNIST dataset. This may give rise to a lot of opportunities during teaching or during sandboxing of methods.

I suggest splitting off the dataset.py part completely and publishing it on PyPI. This way, users don't have to depend on everything the full project currently requires. In addition, people can easily adopt the datasets by including a relevant statement in their requirements.txt or environment.yml.

What do you think?

all images download as .npz

I can't find .csv or pngs, even if I use the command:
python -m medmnist save --flag=xxxmnist --folder=tmp/ --postfix=png

Not able to download dataset

Dear Authors,
Thank you for making the dataset public.
When I go to this link https://zenodo.org/record/5208230#.YluEcy-B0UE , and go to one of the datasets and click on download, nothing happens and the webpage simply hangs.
I also tried using the command line to download - 'python -m medmnist download' - and the download fails.
Thanks and please let me know at the earliest.
Megh

Is it too small to process medical image data into a size of 28*28*28

Hi, thanks for sharing your work.
I have a question: is it too small to process 3D medical image data into a size of 28*28*28, especially when classifying based on some detailed features in medical images?

How do you process medical images with an original size such as 256*256*64 into a size of 28*28*28? Have you considered the loss of detail caused by downsizing? I noticed that your work exhibits high performance metrics such as AUC.

The temporal dimension of the 3D dataset

The 3D datasets have dimensions (N, 28, 28, 28), where N corresponds to the number of samples. I would just like to confirm that axis=1 stands for the temporal dimension here (number of frames of images).

I have also noticed in the following function, the frames are taken from axis=1

def montage3d(imgs, n_channels, sel):

Any help would be greatly appreciated.
TIA!

download by Command Line Tools | Something went wrong when downloading

hello, thank u so much for this amazing job!
I was trying to download one of the datasets by command line tool, I tried this command:
python -m medmnist save --flag=organsmnist --folder=tmp/ --postfix=png --download=True --size=224
but I've got this output which seems to be running to some wrong URL.

Traceback (most recent call last):
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 1348, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1286, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1332, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1281, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1041, in _send_output
    self.send(msg)
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 979, in send
    self.connect()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 1451, in connect
    super().connect()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/http/client.py", line 945, in connect
    self.sock = self._create_connection(
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/socket.py", line 851, in create_connection
    raise exceptions[0]
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/socket.py", line 836, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/dataset.py", line 106, in download
    download_url(
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 134, in download_url
    url = _get_redirect_url(url, max_hops=max_redirect_hops)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/torchvision/datasets/utils.py", line 82, in _get_redirect_url
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as response:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 519, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 536, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 496, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 1391, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/urllib/request.py", line 1351, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/__main__.py", line 184, in <module>
    fire.Fire()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/__main__.py", line 71, in save
    dataset = getattr(medmnist, INFO[flag]["python_class"])(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/dataset.py", line 56, in __init__
    self.download()
  File "/home/katie/miniconda3/envs/dreamnew/lib/python3.11/site-packages/medmnist/dataset.py", line 113, in download
    raise RuntimeError(
RuntimeError: Something went wrong when downloading! Go to the homepage to download manually. https://github.com/MedMNIST/MedMNIST/

Am I using the wrong command, or did something else go wrong? I would really appreciate it if you could help me with this.

Encountering `BadZipFile` Bug When Loading `pathmnist.npz` Locally

My Problem

Traceback (most recent call last):
  File "/home/21009290012/Projects/DRLProjects/CNNLIME/train.py", line 303, in <module>
    main(data_flag, output_root, num_epochs, gpu_ids, batch_size, download, model_flag, resize, as_rgb, model_path, run)
  File "/home/21009290012/Projects/DRLProjects/CNNLIME/train.py", line 63, in main
    train_dataset = DataClass(split='train', transform=data_transform, download=False, as_rgb=as_rgb, root='dataset/')
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/medmnist/dataset.py", line 43, in __init__
    npz_file = np.load(os.path.join(self.root, "{}.npz".format(self.flag)))
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/numpy/lib/npyio.py", line 444, in load
    ret = NpzFile(fid, own_fid=own_fid, allow_pickle=allow_pickle,
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/numpy/lib/npyio.py", line 190, in __init__
    _zip = zipfile_factory(fid)
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/site-packages/numpy/lib/npyio.py", line 103, in zipfile_factory
    return zipfile.ZipFile(file, *args, **kwargs)
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/zipfile.py", line 1269, in __init__
    self._RealGetContents()
  File "/home/21009290012/.conda/envs/DL_gpu/lib/python3.10/zipfile.py", line 1336, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

My Configuration

My Environment

  • Platform: Linux
  • Torch 1.13.0+cu116
  • Python: 3.10.12

My Code

train_dataset = DataClass(split='train', transform=data_transform, download=False, as_rgb=as_rgb, root='dataset/')

My Project Structure

CNNLIME
|---checkpoints
|---dataset
|        |---pathmnist.npz
|---model.py
|---train.py
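
An .npz file is a zip archive, so a BadZipFile error of this kind usually means the file on disk is not a complete archive (for example, an interrupted download or a saved error page). A quick check before loading is sketched below; the helper name looks_like_npz is hypothetical and the two files are created only for the demonstration.

```python
import os
import tempfile
import zipfile


def looks_like_npz(path):
    """Return True if the path exists and is a readable zip archive."""
    return os.path.isfile(path) and zipfile.is_zipfile(path)


tmp = tempfile.mkdtemp()

# A file that has the right name but is not an archive.
bad = os.path.join(tmp, "broken.npz")
with open(bad, "wb") as f:
    f.write(b"<html>not an archive</html>")

# A minimal valid zip archive.
good = os.path.join(tmp, "ok.npz")
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("train_images.npy", b"")

print(looks_like_npz(bad), looks_like_npz(good))  # False True
```

If the check fails for dataset/pathmnist.npz, re-downloading the file is the likely remedy.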

How to understand the label array

Hi thank you for your work and repo!

I would like to know the semantic meaning of the label array. For example, the chestmnist has the test label array (22433, 14), and the first label is array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8). What is the label it is associated with and how to convert the array to the corresponding label(s)?

"label": {
    "0": "atelectasis",
    "1": "cardiomegaly",
    "2": "effusion",
    "3": "infiltration",
    "4": "mass",
    "5": "nodule",
    "6": "pneumonia",
    "7": "pneumothorax",
    "8": "consolidation",
    "9": "edema",
    "10": "emphysema",
    "11": "fibrosis",
    "12": "pleural",
    "13": "hernia"
},

I appreciate your help in advance. Thanks!
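
One way to read such a row, assuming each of the 14 positions is an independent 0/1 flag for the label with the same index (which matches the multi-label description of chestmnist), is sketched below. The function name decode_row is hypothetical; the label names are copied from the dictionary above. Under this reading, an all-zero row would mean that none of the 14 findings is marked.

```python
CHEST_LABELS = {
    "0": "atelectasis", "1": "cardiomegaly", "2": "effusion",
    "3": "infiltration", "4": "mass", "5": "nodule", "6": "pneumonia",
    "7": "pneumothorax", "8": "consolidation", "9": "edema",
    "10": "emphysema", "11": "fibrosis", "12": "pleural", "13": "hernia",
}


def decode_row(row):
    """Map a multi-hot label row to the names of the positions set to 1."""
    return [CHEST_LABELS[str(i)] for i, value in enumerate(row) if value == 1]


empty = decode_row([0] * 14)
two = decode_row([0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
print(empty)  # []
print(two)    # ['cardiomegaly', 'edema']
```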

Paired multi-modal data?

Hi there,

Thanks for the wonderful dataset!

I was wondering if there are any paired images in this dataset. What I mean by paired images (x_i, y_i) is that they belong to 2 different modalities (in this case Modality X and Modality Y) and they come from the same patient and hence mapped to the same class labels.

I see in the paper that OrganMNIST Axial, Coronal, and Sagittal come from the same source and have the same set of labels. I was wondering if these 3 modalities have paired images in them and if it includes the pairing data (which axial image is paired with which coronal and sagittal images).

Thank you.

Query related to AUC and ACC score

Dear Sir,
I noticed that in your experimental results the AUC is greater than the accuracy. Is it normal for AUC to be greater than accuracy? Could you please explain this? Thanks
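
AUC and accuracy measure different things: AUC depends only on how the scores rank positives against negatives, while accuracy depends on a decision threshold. The toy example below (made-up labels and scores, plain Python, not the project's evaluator) shows a case where the ranking is perfect but thresholding at 0.5 misclassifies every positive.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4]
acc_value = accuracy(labels, scores)
auc_value = auc(labels, scores)
print(acc_value, auc_value)  # 0.6 1.0
```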

Visualization of MedMNIST Images

Dear Authors,
Thank you again for making the dataset public. I have a question regarding the dermamnist dataset, and I am having some issues while visualizing it. I am using dermamnist with pytorch for a classification task, and my data loader is the following -

class MedMNISTDatasetProxy(Dataset):
    def __init__(self, tensors, transform=None):
        assert tensors[0].shape[0] == tensors[1].shape[0]
        self.tensors = tensors
        self.transform = transform

    def __getitem__(self, index):
        x = self.tensors[0][index]

        if self.transform:
            x = self.transform(x)

        y = torch.tensor(self.tensors[1][index])
        
        return x, y

    def __len__(self):
        return self.tensors[0].shape[0]

The transform list which I am passing is the following -

data_transform_proxy = transforms.Compose([transforms.ToTensor()])

I am making a data loader from this dataset (because I need that in my application), and I save the data loader and then load it again for the purpose of visualization. I am trying to visualize the images by using transforms.ToPILImage() in PyTorch.
However, when I visualize the images, I get a green shaded color for the dermamnist images, and I'm not sure why this is happening. A few of the image visualizations are attached below -
dermafullentropy_unnorm

The same issue happens with pathmnist also. The histopath images are usually pinkish color but the visualization, using the same procedure as above results in the following visualization -
pathfull_unnorm

If needed, my code for making the image grid is as follows -

def image_grid(imgs, rows, cols, original = False):
    print(rows, cols, len(imgs))
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))

    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        if original:
            img = img.convert("RGB")
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid

Thanks and please let me know if I am missing something.
Best Regards,
Megh

getting_started.ipynb notebook scoring issue

Your getting_started.ipynb is a great addition to the repo, but should it use train/val/test sets like your command line version does, where it picks the epoch with the highest val score as the best model, and then shows the test score for that model?

At the command line you do it like this:

==> Building and training model...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:06<00:00, 2.39s/it]
epoch 0 is the best model
==> Testing model...
train AUC: 0.57632 ACC: 0.26923
val AUC: 0.49373 ACC: 0.26923
test AUC: 0.57728 ACC: 0.26923

Custom Dataset Usage

Thank you for the repository and the code you provided. Is it possible to use my own dataset?

Where can I find sample IDs?

Hi there, I was wondering how can I extract the IDs of a dataset's scans. For example, if I'd like to go back to the original scan (in the original dataset). I skimmed through the medmnist.dataset class (e.g. for ChestMNIST or NoduleMNIST3D) but it doesn't look like there's any relevant mention. Is the sample ID traceable? Thanks!

Larger image options- 64*64 or 128*128?

I believe that the intention of this work is to provide medical datasets for quick prototyping of ML algorithms. But since medical imaging classification generally relies on micro features and textures, 28*28 might be too small to learn anything meaningful.

I am curious if there is any way to access larger versions of these datasets directly from your repo, say of size 64*64 or 128*128.

How to concatenate the train and val datasets?

If I need to concatenate the two datasets (training set and validation set) into one whole training set, how do I do it? When I use ConcatDataset provided by PyTorch, the concatenated data can't return "imgs" and "labels". For example:

train_data = data.ConcatDataset([train_dataset,val_dataset])

Here, train_data can't directly provide "imgs" and "labels", e.g., via "train_data.imgs" and "train_data.labels".
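
If the underlying arrays are what is needed, one option is to concatenate them directly with NumPy instead of wrapping the dataset objects. The sketch below uses synthetic arrays as stand-ins for the "imgs" and "labels" attributes mentioned above; the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for train_dataset.imgs / .labels and val_dataset.imgs / .labels.
train_imgs = np.zeros((6, 28, 28), dtype=np.uint8)
train_labels = np.zeros((6, 1), dtype=np.uint8)
val_imgs = np.ones((2, 28, 28), dtype=np.uint8)
val_labels = np.ones((2, 1), dtype=np.uint8)

# Stack along the sample axis so image i still matches label i.
all_imgs = np.concatenate([train_imgs, val_imgs], axis=0)
all_labels = np.concatenate([train_labels, val_labels], axis=0)

print(all_imgs.shape, all_labels.shape)
```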

Running the getting_started_without_PyTorch notebook reports an error

I get the error below when running the getting_started_without_PyTorch notebook. After searching around, this appears to be a Python version issue (I am using 3.11); see wireservice/agate#737.
After I downgraded Python to 3.9.0, the error was gone.
Please consider upgrading the code.

File ~/prj/medmnist/MedMNIST/examples/dataset_without_pytorch.py:4
      2 import random
      3 import numpy as np
----> 4 from collections import Sequence
      5 from PIL import Image
      6 from medmnist.info import INFO, HOMEPAGE, DEFAULT_ROOT

ImportError: cannot import name 'Sequence' from 'collections' (/home/xlz/miniconda3/envs/medical/lib/python3.11/collections/__init__.py)
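
The abstract base classes live in collections.abc, and the old aliases in collections were removed in Python 3.10, which is why the import fails on 3.11. A version-tolerant import could look like the following sketch.

```python
try:
    # Preferred location of the abstract base classes.
    from collections.abc import Sequence
except ImportError:
    # Fallback for very old interpreters that lack collections.abc.
    from collections import Sequence

print(issubclass(list, Sequence))  # True
```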

Normalization config

May I know if you have a normalization parameter setup for each sub-dataset? Thank you!

Visualize 28x28x28 data

Dear repo,

This is not a bug report. I am trying to visualize the 28x28x28 MNIST data, as the montage is not very clear.
Is any example available? Thanks,
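
One simple alternative to a montage is to lay a few evenly spaced slices of one volume side by side and display the resulting 2D array with any image viewer. The sketch below uses a synthetic volume; the function name slice_strip and the choice of every fourth slice are arbitrary, and which axis is most informative depends on the dataset.

```python
import numpy as np


def slice_strip(volume, axis=0, step=4):
    """Return selected slices of a 3D volume placed side by side."""
    # Bring the slicing axis to the front, then keep every `step`-th slice.
    slices = np.moveaxis(volume, axis, 0)[::step]
    # Join the 2D slices horizontally into one wide image.
    return np.concatenate(list(slices), axis=1)


volume = np.arange(28 * 28 * 28, dtype=np.float32).reshape(28, 28, 28)
strip = slice_strip(volume, axis=0, step=4)
print(strip.shape)  # (28, 196)
```

The strip can be saved or shown as a single gray-scale image, for example with Pillow or matplotlib.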

AssertionError

Hello, when I run "python -m medmnist save --flag=organmnist3d --folder=tmp/",
the terminal shows:
Saving organmnist3d train...
Using downloaded and verified file: /home/islab/.medmnist/organmnist3d.npz
Traceback (most recent call last):
  File "/home/islab/anaconda3/envs/covid/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/islab/anaconda3/envs/covid/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/islab/MedMNIST-main/medmnist/__main__.py", line 123, in <module>
    fire.Fire()
  File "/home/islab/anaconda3/envs/covid/lib/python3.6/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/islab/anaconda3/envs/covid/lib/python3.6/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/home/islab/anaconda3/envs/covid/lib/python3.6/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/islab/MedMNIST-main/medmnist/__main__.py", line 45, in save
    dataset.save(folder, postfix)
  File "/home/islab/MedMNIST-main/medmnist/dataset.py", line 169, in save
    assert postfix == "gif"
AssertionError

I don't know how to solve it and hope you can help.

Thank you in advance!

Citation to PneumoniaMNIST original source

This seems to be an issue with the paper itself and with the website too. I wanted to access the original source of the PneumoniaMNIST dataset, but the references are copied from the OCTMNIST paper instead. Can you provide a link to original source and paper for the PneumoniaMNIST dataset?

Thanks in advance!

Labelling vs ground truth

Hi, Just a quick question.

Where can I get confirmation of ground truth labels for your datasets?

Specifically BreastMNIST and AdrenalMNIST3D.

Request for preprocessing code

Could you please share the code you used to preprocess the datasets? MedMNIST is good work, but some extra information contained in the original datasets is dropped. For example, for BreastMNIST, I would like to know the labels of normal and benign images, although they have been merged into the positive class. In addition, other information, such as gender and age, is important but cannot be obtained directly from MedMNIST. Thanks a lot.

Evaluation about AutoML Methods

Thanks very much for your nice work!

I have read your source code and paper. I found both your code and AutoKeras chose the best model based on the highest AUC score on the validation set.

However, how Google AutoML Vision and auto-sklearn are evaluated does not seem to be described. If they use the best test AUC/ACC during the search, is the comparison unfair?

Question about chestmnist dataset

When I use the chestmnist dataset, I found:

class c = 0: 70472 real images
class c = 1: 7996 real images
class c = 2: 0 real images
class c = 3: 0 real images
class c = 4: 0 real images
class c = 5: 0 real images
class c = 6: 0 real images
class c = 7: 0 real images
class c = 8: 0 real images
class c = 9: 0 real images
class c = 10: 0 real images
class c = 11: 0 real images
class c = 12: 0 real images
class c = 13: 0 real images

However, it seems that the chestmnist dataset is multi-label:

Dataset ChestMNIST of size 28 (chestmnist)
    Number of datapoints: 78468
    Root location: /home/user3/.medmnist
    Split: train
    Task: multi-label, binary-class
    Number of channels: 1
    Meaning of labels: {'0': 'atelectasis', '1': 'cardiomegaly', '2': 'effusion', '3': 'infiltration', '4': 'mass', '5': 'nodule', '6': 'pneumonia', '7': 'pneumothorax', '8': 'consolidation', '9': 'edema', '10': 'emphysema', '11': 'fibrosis', '12': 'pleural', '13': 'hernia'}
    Number of samples: {'train': 78468, 'val': 11219, 'test': 22433}
    Description: The ChestMNIST is based on the NIH-ChestXray14 dataset, a dataset comprising 112,120 frontal-view X-Ray images of 30,805 unique patients with the text-mined 14 disease labels, which could be formulized as a multi-label binary-class classification task. We use the official data split, and resize the source images of 1×1024×1024 into 1×28×28.
    License: CC BY 4.0
===================
Dataset ChestMNIST of size 28 (chestmnist)
    Number of datapoints: 22433
    Root location: /home/user3/.medmnist
    Split: test
    Task: multi-label, binary-class
    Number of channels: 1
    Meaning of labels: {'0': 'atelectasis', '1': 'cardiomegaly', '2': 'effusion', '3': 'infiltration', '4': 'mass', '5': 'nodule', '6': 'pneumonia', '7': 'pneumothorax', '8': 'consolidation', '9': 'edema', '10': 'emphysema', '11': 'fibrosis', '12': 'pleural', '13': 'hernia'}
    Number of samples: {'train': 78468, 'val': 11219, 'test': 22433}
    Description: The ChestMNIST is based on the NIH-ChestXray14 dataset, a dataset comprising 112,120 frontal-view X-Ray images of 30,805 unique patients with the text-mined 14 disease labels, which could be formulized as a multi-label binary-class classification task. We use the official data split, and resize the source images of 1×1024×1024 into 1×28×28.
    License: CC BY 4.0

How can I use the multi-label instead of just binary-class?
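The two counts above add up to the number of training images (70472 + 7996 = 78468), which would be consistent with only one label column being read. The following is a minimal sketch of counting positives per label instead. It uses a synthetic array in place of the real labels, and the `(N, 14)` layout with one 0/1 column per finding is an assumption based on the 14 labels listed above, not a confirmed description of the stored file.

```python
import numpy as np

# Synthetic stand-in for a multi-label array: one row per image,
# one 0/1 column per finding. The (N, 14) layout is an assumption.
rng = np.random.default_rng(0)
labels = (rng.random((1000, 14)) < 0.1).astype(np.int64)

# Counting unique values treats 0 and 1 as two "classes" and hides the columns.
values, value_counts = np.unique(labels, return_counts=True)

# Summing each column gives the number of positive images per finding.
positives_per_label = labels.sum(axis=0)

# Images with no finding at all are the all-zero rows.
no_finding = int((labels.sum(axis=1) == 0).sum())

for index, count in enumerate(positives_per_label):
    print(f"label {index}: {int(count)} positive images")
print(f"no finding: {no_finding} images")
```

With labels in this form, a multi-label model is usually trained with one sigmoid output per label and a binary cross-entropy loss, rather than a softmax over 14 classes.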

[feature request] the 3d dataset convert from npz to dicom

Hello,
Regarding converting the dataset from npz to another format: for 3d dataset, the current implementation only provides the gif format:

assert postfix == "gif"

I would need the DICOM format (dcm series data). Do you plan to include that feature, or do you have suggestions for reference code that I could adapt myself?

[BUG] DataClass montage method not working with scikit-image==0.20.0

Bug description

When calling the method montage from DataClass the following error appears:

TypeError: montage() got an unexpected keyword argument 'multichannel'

Last week skimage was updated to version 0.20.0 and the method montage from DataClass is no longer working. In the tutorial notebook, this method is used to plot images (cell number 8), and skimage already displays this warning:

/usr/local/lib/python3.9/dist-packages/medmnist/utils.py:25: FutureWarning: multichannel is a deprecated argument name for montage. It will be removed in version 1.0. Please use channel_axis instead. montage_arr = skimage_montage(sel_img, multichannel=(n_channels == 3))

So, now with the new skimage version the argument multichannel is deprecated.

How to reproduce this error?

  1. Update skimage to the latest version (pip install scikit-image==0.20.0)
  2. Run the following snippet
import medmnist
from medmnist import INFO
import torchvision.transforms as transforms
import skimage
print(f"Skimage v{skimage.__version__}")
print(f"MedMNIST v{medmnist.__version__} @ {medmnist.HOMEPAGE}")

data_flag = 'pathmnist'
info = INFO[data_flag]
download = True

DataClass = getattr(medmnist, info['python_class'])
data_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[.5], std=[.5])
])
# load the data
train_dataset = DataClass(split='train', transform=data_transform, download=download)
train_dataset.montage(length=1)

Additional context
A temporary workaround to bypass this error is to modify the requirements.txt file enforcing scikit-image==0.19.0
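As an alternative to pinning the version, a small compatibility wrapper could try the newer `channel_axis` keyword named in the warning and fall back to `multichannel` on older releases. This is only a sketch: `montage_compat` is a hypothetical helper, not part of MedMNIST or scikit-image, and the montage function is passed in as an argument so the wrapper itself has no dependencies.

```python
def montage_compat(montage_fn, images, n_channels):
    """Call a montage function with whichever channel keyword it accepts.

    montage_fn is expected to behave like skimage.util.montage (an assumption).
    """
    is_color = (n_channels == 3)
    try:
        # Newer keyword: index of the channel axis, or None for grayscale.
        return montage_fn(images, channel_axis=-1 if is_color else None)
    except TypeError:
        # Older keyword, rejected by newer releases.
        return montage_fn(images, multichannel=is_color)
```

It would be called as `montage_compat(skimage_montage, sel_img, n_channels)` in place of the direct call shown in the warning. Note that catching `TypeError` also hides unrelated type errors raised inside the montage function, so this is a stopgap rather than a proper fix.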

Easy way to combine datasets?

Is there any code snippet to combine multiple datasets?

For example, Organ(A/C/S)MNIST.

Ideally as a 33-class problem instead of 3 separate 11-class ones?
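One way to build such a combined problem is to concatenate the image and label arrays and shift each dataset's labels by a multiple of the class count, so that class k of the second dataset becomes class 11 + k, and so on. The sketch below uses synthetic arrays in place of the real datasets; `combine_with_offsets` is a hypothetical helper, and the array shapes are assumptions.

```python
import numpy as np

def combine_with_offsets(parts, n_classes_each):
    """Concatenate (images, labels) pairs, shifting labels so classes do not collide."""
    images, labels = [], []
    for i, (x, y) in enumerate(parts):
        images.append(x)
        # Flatten labels to 1D and move dataset i into its own label range.
        labels.append(np.asarray(y).reshape(-1) + i * n_classes_each)
    return np.concatenate(images, axis=0), np.concatenate(labels, axis=0)

# Synthetic stand-ins for three 11-class datasets of 28x28 grayscale images.
rng = np.random.default_rng(0)
parts = [
    (rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8),
     rng.integers(0, 11, size=(5, 1)))
    for _ in range(3)
]
x_all, y_all = combine_with_offsets(parts, n_classes_each=11)
print(x_all.shape, y_all.shape)
```

If the datasets share source scans, it is probably safest to combine them split by split (train with train, test with test) so that the original data split is preserved.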

Can you provide the code for other models?

I am following your article, but your GitHub repository only provides the models for the baseline methods.
Can you also provide the code for the other methods,
such as auto-sklearn, AutoKeras, and Google AutoML Vision?

Possible error in getting_started.ipynb?

Hello,

I was looking at the source code and attached notebooks in the folder examples. In the evaluation cell of the getting_started.ipynb notebook, we can find:

print('%s  acc: %.3f  auc:%.3f' % (split, *metrics)) 

The notebook shows this printing train acc: 0.983 auc:0.834 when running the statement test('train'). However, judging from the evaluator.py file in MedMNIST, the evaluator object appears to output the AUC first and then the accuracy. If so, the print statements in your notebook(s) may be swapping the two metrics.

Let me know if this is right.

Best regards,

qlero
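A mix-up like this can be avoided by reading the metrics by name instead of unpacking them by position. The following sketch assumes the evaluator returns a named tuple; the `Metrics` type, its field names, and the example values are illustrative assumptions, not the confirmed MedMNIST API.

```python
from collections import namedtuple

# Hypothetical result type; field names and order are assumptions for illustration.
Metrics = namedtuple("Metrics", ["AUC", "ACC"])

def format_metrics(split, metrics):
    # Accessing fields by name keeps the labels correct regardless of tuple order.
    return "%s  auc: %.3f  acc: %.3f" % (split, metrics.AUC, metrics.ACC)

line = format_metrics("train", Metrics(AUC=0.983, ACC=0.834))
print(line)
```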

Generation of OrganMNIST {Axial,Coronal,Sagittal}

Hi, in the MedMNIST v1 paper, you say that

" We use bounding-box annotations of 11 body organs from another study [17] to obtain the organ labels. Hounsfield-Unit (HU) of the 3D images are transformed into grey scale with a abdominal window; we then crop 2D images from the center slices of the 3D bounding boxes in axial / coronal / sagittal views (planes)."

However, I found that OrganAMNIST is significantly larger than OrganMNIST3D. Does this mean that you crop multiple slices from a single 3D bounding box for OrganAMNIST? I would appreciate it if you could provide further details regarding the generation of OrganAMNIST.

install via conda

Hi, are you planning to make the package available for installation via conda?
That would be great, thanks!

License problem and use of this dataset?

Hey,

I see that your README seems to explicitly state that the dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), which allows for commercial use.

However, if I've understood your paper correctly, at least the DermaMNIST part of the dataset is derived from the HAM10000 dataset, which as I understand is explicitly licensed CC-BY-NC (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T&version=4.0&selectTab=termsTab)

If the DermaMNIST part of this dataset is indeed derived from HAM10000, and if this dataset is hosted under CC BY 4.0, then does this not constitute a license problem?

Looking forward to hearing back

Cheers,
Jumperkables

Project dependencies may have API risk issues

Hi, In MedMNIST, inappropriate dependency versioning constraints can cause risks.

Below are the dependencies and version constraints that the project is using

numpy
pandas
scikit-learn
scikit-image
tqdm
Pillow
fire
torch
torchvision

The version constraint == will introduce the risk of dependency conflicts because the scope of dependencies is too strict.
The version constraint No Upper Bound and * will introduce the risk of the missing API Error because the latest version of the dependencies may remove some APIs.

After further analysis, in this project,
The version constraint of dependency pandas can be changed to >=0.4.0,<=1.2.5.
The version constraint of dependency scikit-learn can be changed to >=0.14,<=0.21.3.
The version constraint of dependency tqdm can be changed to >=4.36.0,<=4.64.0.
The version constraint of dependency Pillow can be changed to ==9.2.0.
The version constraint of dependency Pillow can be changed to >=2.0.0,<=9.1.1.

The above modification suggestions can reduce the dependency conflicts as much as possible,
and introduce the latest version as much as possible without calling Error in the projects.

The invocation of the current project includes all the following methods.

The calling methods from the pandas
pandas.read_csv
The calling methods from the scikit-learn
sklearn.metrics.accuracy_score
sklearn.metrics.roc_auc_score
The calling methods from the tqdm
tqdm.trange
The calling methods from the Pillow
PIL.Image.fromarray
The calling methods from the all methods
RuntimeError
numpy.random.rand.sum
fire.Fire
next
format
numpy.stack
ys.append
save_fn
setuptools.setup
numpy.random.rand
list
filename.split
available
medmnist.Evaluator.get_dummy_prediction
f.read
os.path.join
zip
time.time
download
os.path.exists
self.download
medmnist.utils.montage3d
df.append.sort_index
medmnist.utils.montage2d
frames.append
filename.split.split
save
split_.startswith
join
cls.evaluate
self.labels.max
key.INFO.medmnist.getattr
shuffle_iterator
self.get_standard_evaluation_filename
map
warnings.DeprecationWarning
medmnist.info.INFO.keys
pandas.DataFrame
index.self.labels.astype
get_default_root
y_score.pd.DataFrame.to_csv
key.INFO.medmnist.getattr.montage
numpy.argmax
key.INFO.medmnist.getattr.save
flag.INFO.medmnist.getattr
key.endswith
y_true.squeeze.squeeze
os.path.split
glob.glob
Metrics
pandas.read_csv
medmnist.utils.montage2d.save
self.__len__
pprint.pprint
open.close
df.append.append
medmnist.utils.save2d
medmnist.Evaluator.parse_and_evaluate
self.transform.convert
medmnist.Evaluator
os.path.expanduser
getAUC
xs.append
readme
range
setuptools.find_packages
dataset._collate_fn
open
info
self.__len__.append
path.endswith
sklearn.metrics.accuracy_score
y_score.squeeze.squeeze
sklearn.metrics.roc_auc_score
medmnist.utils.save_frames_as_gif
data.append
open.write
montage2d
os.makedirs
cls
getACC
numpy.load
random.shuffle
tqdm.trange
torchvision.datasets.utils.download_url
load_fn.save
os.remove
print
getattr
load_fn
medmnist.utils.save3d
skimage.util.montage
montage_frames.append
self.transform
self.target_transform
medmnist.Evaluator.evaluate
df.append.to_csv
len
numpy.random.choice
frames.save
PIL.Image.fromarray
numpy.array
collections.namedtuple
i.y_true.astype
warnings.warn
idx.append

@developer
Could you please help me check this issue?
May I open a pull request to fix it?
Thank you very much.
