HistoPrep

Preprocessing large medical images for machine learning made easy!

Description

HistoPrep makes it easy to prepare your histological slide images for deep learning models. You can easily cut large slide images into smaller tiles and then preprocess those tiles (remove tiles with shitty tissue, finger marks, etc.).

Installation

Install OpenSlide on your system and then install histoprep with pip!

pip install histoprep

Usage

A typical workflow for training deep learning models with histological images is the following:

  1. Cut each slide image into smaller tile images.
  2. Preprocess the smaller tile images by removing tiles with bad tissue or staining artifacts.
  3. Overfit a pretrained ResNet50 model, report 100% validation accuracy and publish it in Nature like everyone else.

With HistoPrep, steps 1 and 2 are as easy as accidentally drinking too much at the research group Christmas party and proceeding to work remotely until June.

Let's start by cutting a slide from the PANDA Kaggle challenge into small tiles.

from histoprep import SlideReader

# Read slide image.
reader = SlideReader("./slides/slide_with_ink.jpeg")
# Detect tissue.
threshold, tissue_mask = reader.get_tissue_mask(level=-1)
# Extract overlapping tile coordinates with less than 50% background.
tile_coordinates = reader.get_tile_coordinates(
    tissue_mask, width=512, overlap=0.5, max_background=0.5
)
# Save tile images with image metrics for preprocessing.
tile_metadata = reader.save_regions(
    "./train_tiles/", tile_coordinates, threshold=threshold, save_metrics=True
)
slide_with_ink: 100%|██████████| 390/390 [00:01<00:00, 295.90it/s]
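
To make the `width`, `overlap` and `max_background` arguments more concrete, here is a minimal sketch of how overlapping tile coordinates could be derived from a binary tissue mask. It is not HistoPrep's implementation: the function name `tile_coordinates`, the list-of-rows mask format and the decision to skip partial tiles at the image edge are all assumptions made for illustration.

```python
def tile_coordinates(mask, width, overlap=0.0, max_background=0.5):
    """Return (x, y, w, h) tiles whose background fraction is at most max_background.

    `mask` is a list of rows where 1 marks tissue and 0 marks background.
    Hypothetical helper for illustration, not part of HistoPrep.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    mask_height, mask_width = len(mask), len(mask[0])
    # Neighbouring tiles share `overlap` of their width, so the step shrinks.
    stride = max(1, round(width * (1 - overlap)))
    coordinates = []
    # Partial tiles at the right and bottom edges are skipped in this sketch.
    for y in range(0, mask_height - width + 1, stride):
        for x in range(0, mask_width - width + 1, stride):
            tissue = sum(sum(row[x:x + width]) for row in mask[y:y + width])
            background = 1 - tissue / (width * width)
            if background <= max_background:
                coordinates.append((x, y, width, width))
    return coordinates


# Toy 8x8 mask with tissue only in the left half.
mask = [[1, 1, 1, 1, 0, 0, 0, 0] for _ in range(8)]
tiles = tile_coordinates(mask, width=4, overlap=0.5, max_background=0.5)
print(tiles)
```

With `overlap=0.5` the stride is half the tile width, so every tile shares half of its area with its neighbour. Tiles that fall entirely on the background side of the toy mask are dropped.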

Let's take a look at the output and visualise the thumbnails.

jopo666@~$ tree train_tiles
train_tiles
└── slide_with_ink
    ├── metadata.parquet       # tile metadata
    ├── properties.json        # tile properties
    ├── thumbnail.jpeg         # thumbnail image
    ├── thumbnail_tiles.jpeg   # thumbnail with tiles
    ├── thumbnail_tissue.jpeg  # thumbnail of the tissue mask
    └── tiles [390 entries exceeds filelimit, not opening dir]

[Images: prostate biopsy sample, tissue mask, thumbnail with tiles]

That was easy, but it can be annoying to whip up a new Python script every time you want to cut slides, so it is recommended to use the HistoPrep CLI program!

# Repeat the above code for all images in the PANDA dataset!
jopo666@~$ HistoPrep --input './train_images/*.tiff' --output ./tiles --width 512 --overlap 0.5 --max-background 0.5

As we can see from the above images, histological slide images often contain areas that we would not like to include in our training data. Removing them might seem like a daunting task, but let's try it out!

from histoprep.utils import OutlierDetector

# Let's wrap the tile metadata with a helper class.
detector = OutlierDetector(tile_metadata)
# Cluster tiles based on image metrics.
clusters = detector.cluster_kmeans(num_clusters=4, random_state=666)
# Visualise first cluster.
reader.get_annotated_thumbnail(
    image=reader.read_level(-1), coordinates=detector.coordinates[clusters == 0]
)

[Image: tiles in cluster 0]

I said it was gonna be easy! Now we can mark tiles in cluster 0 as outliers and start overfitting our neural network! This was a simple example but the same code can be used to cluster all several million tiles extracted from the PANDA dataset and discard outliers simultaneously!
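
For intuition about what clustering tiles on image metrics does, here is a small, self-contained k-means sketch in plain Python. It is not how `OutlierDetector` is implemented. The metric values are made up, and treating the smallest cluster as the outliers is an assumption for this toy data only; in practice you would look at each cluster's thumbnail first, as above.

```python
import random


def kmeans(points, num_clusters, num_iterations=20, random_state=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(random_state)
    centroids = rng.sample(points, num_clusters)
    labels = [0] * len(points)
    for _ in range(num_iterations):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        labels = [
            min(
                range(num_clusters),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Hypothetical per-tile metrics: (mean brightness, sharpness).
metrics = [(0.82, 0.91), (0.80, 0.88), (0.79, 0.93), (0.21, 0.12), (0.18, 0.15)]
labels = kmeans(metrics, num_clusters=2, random_state=666)

# Toy rule: treat the smallest cluster as the outliers and keep the rest.
outlier_label = min(set(labels), key=labels.count)
keep = [i for i, label in enumerate(labels) if label != outlier_label]
print(labels, keep)
```

The cluster indices themselves are arbitrary and depend on the random initialisation, which is why the example selects the outlier cluster by size rather than by a fixed index.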

Citation

If you use HistoPrep to process the images for your publication, please cite the GitHub repository.

@misc{histoprep,
  author = {Pohjonen, Joona and Ariotta, Valeria},
  title = {HistoPrep: Preprocessing large medical images for machine learning made easy!},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {https://github.com/jopo666/HistoPrep},
}

histoprep's Issues

Is the slide resolution taken into account when extracting tiles?

Hello,
Thank you for this nice package!

I used histoprep on two slides with different resolutions.
One was scanned at 0.25 mpp (microns per pixel), the other at 0.5 mpp. It seems that when extracting tiles with the following code, it reads images of size 512x512 at the lowest OpenSlide level without taking the resolution into account. This implies that the cell scale etc. differs between the images extracted from the two slides (see images below; the first is mpp 0.25, the second is mpp 0.5).

from histoprep import SlideReader

# Read slide image.
reader = SlideReader("./slides/slide_with_ink.jpeg")
# Detect tissue.
threshold, tissue_mask = reader.get_tissue_mask(level=-1)
# Extract overlapping tile coordinates with less than 50% background.
tile_coordinates = reader.get_tile_coordinates(
    tissue_mask, width=512, overlap=0.5, max_background=0.5
)
# Save tile images with image metrics for preprocessing.
tile_metadata = reader.save_regions(
    "./train_tiles/", tile_coordinates, threshold=threshold, save_metrics=True
)
[Images: tile from the 0.25 mpp slide, tile from the 0.5 mpp slide]

Is this an intended behaviour or am I missing something?
Best
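
The arithmetic behind this question can be sketched as follows: a 512 px tile covers 128 µm at 0.25 mpp but 256 µm at 0.5 mpp, so matching the physical scale means reading a different number of pixels per slide and resizing afterwards. The helper `native_tile_width` below is hypothetical and says nothing about what HistoPrep does internally.

```python
def native_tile_width(target_width, target_mpp, slide_mpp):
    """Pixels to read at the slide's native resolution so that, once resized
    to `target_width`, each pixel covers `target_mpp` microns.

    Hypothetical helper for illustration, not part of HistoPrep.
    """
    if target_mpp <= 0 or slide_mpp <= 0:
        raise ValueError("mpp values must be positive")
    return round(target_width * target_mpp / slide_mpp)


for slide_mpp in (0.25, 0.5):
    width = native_tile_width(512, target_mpp=0.5, slide_mpp=slide_mpp)
    print(
        f"mpp={slide_mpp}: read {width} px "
        f"({width * slide_mpp:.0f} um), then resize to 512 px"
    )
```

Under this scheme both slides yield tiles covering the same physical area, at the cost of downsampling the higher-resolution slide.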

AttributeError: module 'openslide' has no attribute 'OpenSlide'

Hi there! I was trying to open a histology WSI in .tif and bumped into the following error:

File ~/.local/lib/python3.11/site-packages/histoprep/_backend.py:235 in __init__
self.__reader = openslide.OpenSlide(path)

AttributeError: module 'openslide' has no attribute 'OpenSlide'

I wonder if you have any thoughts or suggestions to fix this bug? I had both libraries, openslide and openslide_python (v 1.3.1), installed on Linux.

Thank you!
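
One possible cause of an error like this is that the name `openslide` resolves to something other than the intended package, for example a local file or directory with the same name. The sketch below only uses the standard library to report where a module would be imported from and whether it exposes a given attribute. The helper `describe_module` is hypothetical, and the diagnosis is an assumption, not a confirmed fix for this issue.

```python
import importlib
import importlib.util


def describe_module(name, attribute):
    """Report where `name` would be imported from and whether it exposes `attribute`."""
    report = {"found": False, "origin": None, "has_attribute": False, "error": None}
    spec = importlib.util.find_spec(name)
    if spec is None:
        return report
    report["found"] = True
    # `origin` is the file the import system would load (None for namespace packages).
    report["origin"] = spec.origin
    try:
        module = importlib.import_module(name)
    except Exception as error:  # e.g. a missing shared library at import time
        report["error"] = repr(error)
        return report
    report["has_attribute"] = hasattr(module, attribute)
    return report


print(describe_module("openslide", "OpenSlide"))
```

If `origin` points somewhere unexpected, or is `None`, the import is probably picking up the wrong `openslide`.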

Add labeler to HistoPrep

A class for easily adding labels for tiles would be awesome! Something like

metadata = histoprep.Labeler(metadata, prefix='cancer', mask=mask)
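
As a rough illustration of what such a labeler might compute, the sketch below adds a label to each tile based on how much of it is covered by an annotation mask. Everything here is hypothetical: `label_tiles`, the `min_fraction` parameter, the dict-based metadata and the assumption that the mask has the same resolution as the tile coordinates are illustrative choices, not an existing HistoPrep API.

```python
def label_tiles(metadata, mask, prefix, min_fraction=0.5):
    """Add `<prefix>_fraction` and `<prefix>` keys to each tile record.

    `metadata` is a list of dicts with x, y, w, h keys; `mask` is a list of
    rows where 1 marks annotated pixels. Hypothetical helper for illustration.
    """
    labelled = []
    for tile in metadata:
        rows = mask[tile["y"]:tile["y"] + tile["h"]]
        covered = sum(sum(row[tile["x"]:tile["x"] + tile["w"]]) for row in rows)
        fraction = covered / (tile["w"] * tile["h"])
        labelled.append(
            {**tile, f"{prefix}_fraction": fraction, prefix: fraction >= min_fraction}
        )
    return labelled


# Toy annotation mask: the right half of a 4x4 image is marked.
mask = [[0, 0, 1, 1] for _ in range(4)]
metadata = [{"x": 0, "y": 0, "w": 2, "h": 4}, {"x": 2, "y": 0, "w": 2, "h": 4}]
labelled = label_tiles(metadata, mask, prefix="cancer")
print(labelled)
```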

AttributeError: 'SlideReader' object has no attribute 'get_tissue_mask'

Hi!

Thank you for providing HistoPrep to the community.

I'm trying to run the functions on the README page, using the H2_1.jpg image that can be downloaded here: https://data.mendeley.com/datasets/svw96g68dv/4.

I encounter the following error:

reader.get_tissue_mask(level=-1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 reader.get_tissue_mask(level=-1)

AttributeError: 'SlideReader' object has no attribute 'get_tissue_mask'

Windows 11 Home, 64-bit, Anaconda Python, running in a Jupyter notebook.

Happy to provide more details if needed, would love to be able to run HistoPrep on some images!
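
One possible explanation for an error like this is that the installed release does not match the README being followed. A quick way to check, using only the standard library, is to print the installed version and list the attributes whose names resemble the missing one. Both helpers below are hypothetical, and the version-mismatch diagnosis is an assumption rather than a confirmed cause.

```python
import difflib
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def similar_attributes(obj, name, cutoff=0.5):
    """List public attributes of `obj` whose names resemble `name`."""
    public = [attr for attr in dir(obj) if not attr.startswith("_")]
    return difflib.get_close_matches(name, public, n=5, cutoff=cutoff)


print("histoprep version:", installed_version("histoprep"))
# With a real reader object: similar_attributes(reader, "get_tissue_mask")
# Demonstrated here on a plain string so the snippet runs without HistoPrep.
print(similar_attributes("some text", "starts_with"))
```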
