scalablecytometryimageprocessing / scip

Scalable Cytometry Image Processing (SCIP) is an open-source tool that implements an image processing pipeline on top of Dask, a distributed computing framework written in Python. SCIP performs projection, illumination correction, image segmentation and masking, and feature extraction.

Home Page: https://scalable-cytometry-image-processing.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 99.85% Shell 0.15%
distributed-computing dask-distributed dask bioimage-analysis cytometry-analysis-pipeline

scip's People

Contributors: maximlippeveld, sanderthierens

scip's Issues

Segmentation of optical microscopy images

After loading and fusing, we end up with a bag of (C, X, Y) tiles where every tile contains many cells. We have to identify these cells to profile them. This means we have to map a bag with m (= number of tiles) entries to a bag of n (= number of cells) entries.

This mapping could be achieved as follows:
Array of tiles (one tile = one chunk) -(to_delayed)-> array of Delayed objects of tiles -(segment)-> array of Bags of cells -(concat)-> Bag of cells
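A minimal sketch of this mapping with Dask; `segment` is a hypothetical stand-in that crops fixed-size windows from each tile, the shapes and chunking are illustrative:

```python
import dask.array as da
import dask.bag as db
from dask import delayed

def segment(tile):
    # hypothetical segmentation: crop fixed 8x8 windows from a (1, C, X, Y) chunk
    tile = tile[0]  # drop the leading chunk axis -> (C, X, Y)
    c, x, y = tile.shape
    return [tile[:, i:i + 8, j:j + 8]
            for i in range(0, x, 8) for j in range(0, y, 8)]

# 4 tiles of shape (C=3, X=16, Y=16), one tile per chunk
tiles = da.random.random((4, 3, 16, 16), chunks=(1, 3, 16, 16))
delayed_tiles = tiles.to_delayed().ravel()              # one Delayed per tile
bags = [db.from_delayed(delayed(segment)(t)) for t in delayed_tiles]
cells = db.concat(bags)                                 # one Bag of all cells
```

Each tile becomes one Bag partition, so the m-tiles-to-n-cells mapping stays lazy until the final compute.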

Alternative segmentation for noisy channels

Brightfield segmentation currently does not work because the brightfield channel is much noisier than the other channels. We need to allow the user to indicate which channels are expected to be very noisy.

For large datasets, SCIP hangs when exporting features

Most likely due to features being collected onto one node in the delayed call to final. Fix by indexing the features, joining them with the metadata, and calling to_parquet directly on the Dask DataFrame, rather than on a pandas DataFrame in final.

Rewrite reporting and exporting flow

Currently, reports and data exports are produced by calling intermediate computes. This interrupts the Dask task graph, preventing further optimization. It might be better to gather the delayed objects and produce all output at the end.

Support for CZI images

To be able to handle CZI images we can load them using aicsimageio. The CZI images contain 'scenes', where each scene corresponds to a treatment. The scenes have dimensions MTCZXY equal to (56, 1, 6, 3, 1000, 1000). The user specifies which scene to process for each run through a command line argument or config entry.

The planes are stored as tiles. Each tile contains many cells, which we want to identify through segmentation. The aicsimageio library provides access to the image data as a Dask Array.

Benchmarking scalability of data loading + masking

To prove that horizontal scaling is useful, we want to measure runtime on a dataset for increasing parallelization. Concretely, we want to measure runtime in seconds as a function of the number of executors on the PBSCluster. The hypothesis is that runtime initially decreases as more executors are used, but starts increasing again once overhead becomes significant.

The number of executors is governed by two parameters: n_workers and processes. The former defines how many jobs are spawned (one job = one prism node), the latter defines how many processes each job is split into. The number of executors then equals n_workers * processes.

We want to write a script that launches the sip command for varying configurations and records the runtime.
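A sketch of such a driver, timing one CLI invocation per executor count; the `benchmark` wrapper and its interface are hypothetical, and the `--n-workers` flag follows the CLI example in this document:

```python
import subprocess
import time

def benchmark(executor_counts, cmd="sip"):
    """Run the CLI once per executor count and record wall-clock runtime.

    Hypothetical wrapper: assumes the command accepts an --n-workers flag.
    """
    runtimes = {}
    for n in executor_counts:
        start = time.perf_counter()
        subprocess.run([cmd, "--n-workers", str(n)], check=True)
        runtimes[n] = time.perf_counter() - start
    return runtimes
```

The resulting runtime-per-executor-count dictionary can then be plotted to check where overhead starts to dominate.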

Add pipeline and run configuration

There is an important distinction between pipeline configuration and run configuration. The pipeline configuration concerns things like what channels to load, which segmentation algorithm to use on what channel, which features to compute... The run configuration concerns things like how many workers to use, how much memory each worker receives, what logging level to use...

Pipeline configuration should be implemented with a YAML config file that can be passed when launching the command line interface (CLI) on the command line. Example:

data_loading:
  file_format: multiframe_tiff
  channels:
    - 0
    - 1
    - 2
    - 4
    - 5
segmentation:
  noisy_channels:
    - 0
    - 5
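Such a pipeline config could be parsed with PyYAML; a minimal sketch, inlining an equivalent config as a string:

```python
import yaml

# equivalent to the YAML file shown above, inlined for illustration
config_text = """
data_loading:
  file_format: multiframe_tiff
  channels: [0, 1, 2, 4, 5]
segmentation:
  noisy_channels: [0, 5]
"""

config = yaml.safe_load(config_text)
channels = config["data_loading"]["channels"]
noisy = config["segmentation"]["noisy_channels"]
```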

The run configuration will be passed to the program through CLI arguments. Example:

sip --n-workers 4 --worker-mem 20G

Argument handling is done with the click Python library.

Compute quantiles with random undersampling

Implement with a reduction and apply: the per-partition step samples from each partition, the binop simply concatenates the samples, and an apply on the resulting Item computes quantiles from the pooled samples.
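A possible sketch using dask.bag's `reduction` and `Item.apply`; the function names and per-partition sample size are illustrative:

```python
import random
import numpy as np
import dask.bag as db

def sample_partition(partition, k=50):
    # per-partition step: randomly undersample at most k values
    values = list(partition)
    return random.sample(values, min(k, len(values)))

def concat(samples):
    # combine step: simply pool the per-partition samples
    return [v for s in samples for v in s]

bag = db.from_sequence(range(10_000), npartitions=8)
pooled = bag.reduction(sample_partition, concat)   # -> Item of pooled samples
quantiles = pooled.apply(lambda s: np.quantile(s, [0.05, 0.5, 0.95]))
result = quantiles.compute()
```

Because each partition is undersampled before pooling, the quantiles are approximate but the reduction never materializes the full bag on one worker.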

Add unit tests

We will use pytest to write unit tests. Some core functionality should already be tested in the alpha release.

  • Create dataset with 5 multiframe tiff images that can be used for testing (can be pushed to git repo)
  • Write data loading test for multiframe tiff loading
    • Number of loaded images, image shapes and channels should be correct
    • Precompute average intensity value and hard code in test. Write a test that compares average intensity of loaded images to hardcoded values
  • Write segmentation test that checks if masks are non empty
  • Write feature extraction tests
  • Write test for minmax normalization

We will use the coverage package to monitor code coverage.

Masking quality control report

After masking a QC report should be written out which contains:

  • Pixel distribution per channel, pre- and post-segmentation
  • Message stating what percentage of images have an empty mask per channel
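The empty-mask statistic could be computed with a small NumPy helper; a hypothetical sketch, assuming the masks are stacked as an (N, C, X, Y) boolean array:

```python
import numpy as np

def empty_mask_percentage(masks):
    """Percentage of images with an all-empty mask, per channel.

    masks: boolean array of shape (N images, C channels, X, Y).
    Hypothetical helper for the QC report described above.
    """
    empty = ~masks.any(axis=(2, 3))        # (N, C): True where mask is empty
    return 100.0 * empty.mean(axis=0)      # one percentage per channel

masks = np.zeros((4, 2, 8, 8), dtype=bool)
masks[:2, 0, 3, 3] = True                  # channel 0: 2 of 4 masks non-empty
pct = empty_mask_percentage(masks)         # -> [50., 100.]
```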

Move UMAP to separate module

UMAP (and other dimensionality reduction methods) should be added to a separate analysis module. UMAP results should still be added to a report. Components should be concatenated to other features so that dimensionality reductions can be exported alongside features.

Smart indexing on Dask Dataframe

Setting the index on our Dask DataFrame can be useful. For instance, samples from different patients could be collected at many timepoints during follow-up (e.g. a blood test every week). Setting the DataFrame index to this timepoint column allows us to quickly select timepoints for downstream analysis.

Setting the index also repartitions the data. If the index is set to patient id, for instance, we can compute analyses on all data per patient using map_partitions.

The index column should be set by the user in a config setting.

Implement feature extraction that combines channels

In #2 we only implemented features that are computed independently per channel. We also want to expand the feature set with features computed on combinations of channels, for example the similarity between two fluorescent channels.

Option to persist feature dataframes to disk (data format yet to be decided; likely SQLite and Feather)

Calling to_parquet or to_csv on Dask DataFrame creates a file per partition. This is good for intermediate checkpointing of the pipeline state, but not for exporting the data. The data should be exported to one large file. If the features dataframe fits in a single nodes' memory, we can just collect the dataframe to pandas and export from there. If it doesn't fit, we have to append to the output file batch per batch. Need to look into how this works.

Image normalization after masking

To avoid working with very small floats in the feature extraction phase, we want to rescale the pixel values to the [0, 1] range. This can be achieved with a simple min-max normalization. Potentially, we can use quantile normalization, where instead of the min and max we use the 5th and 95th percentile.
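A minimal sketch of both variants; the helper name is illustrative:

```python
import numpy as np

def minmax_normalize(img, q=None):
    """Rescale pixel values to [0, 1].

    With q set (e.g. q=0.05), the q-th and (1-q)-th quantiles replace the
    min and max, and values outside them are clipped, which is more robust
    against outlier pixels.
    """
    if q is None:
        lo, hi = img.min(), img.max()
    else:
        lo, hi = np.quantile(img, [q, 1 - q])
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)
```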

Feature extraction from pretrained neural nets

Neural networks which are pretrained on bioimaging data can contain meaningful intermediate representations. This feature would allow users to supply a pretrained network and specify which intermediate representation they would like to extract.

Support for other microscopy formats through Bio-Formats

Bio-Formats is a library for reading and writing microscopy image data. It supports many different (proprietary) formats. If we can use Bio-Formats for our data loading, our tool becomes applicable to many more datasets.

Bio-Formats can be used from Python with (A) the python-bioformats package or (B) through ImageJ with the pyimagej package.

Given this discussion on the image.sc forum it is not clear what the future is for option (A), so (B) is likely the best option to explore first.

Focus selection or focus stacking for multi-focal images

The CZI images have a Z-dimension. The planes in the Z-dimension have to be converted into one plane. This can be done by selecting one plane, either fixed in config or dynamically using a focus metric. Another option is to fuse the planes with focus stacking.

  • Selecting a fixed plane with a config entry
  • Dynamically selecting a plane per instance using a focus metric
  • Fusing the stack with focus stacking
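The dynamic-selection option could look like this minimal sketch, using per-plane intensity variance as a stand-in focus metric; a gradient- or Laplacian-based metric would likely be more robust in practice:

```python
import numpy as np

def select_focused_plane(stack):
    """Pick the sharpest plane from a (Z, X, Y) stack.

    Uses per-plane intensity variance as a simple, hypothetical focus
    metric; returns the chosen index and the plane itself.
    """
    scores = stack.reshape(stack.shape[0], -1).var(axis=1)
    best = int(np.argmax(scores))
    return best, stack[best]
```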

Add demo dataset for new users

It is nice for new users to install the package and be able to test their installation on a demo dataset.

The dataset should not be shipped with the package, but should be downloaded when a user wants to run the demo.

Add info and debug logging statements

Throughout the program, logging statements should be added which indicate what part of the program is executing: for example, which module is currently running or which file is being loaded. These can be info-level logs.

Debug logs are for more detailed information that is not needed to know which part of the pipeline is executing.
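A minimal sketch of the two log levels; the logger name and messages are illustrative:

```python
import logging

logger = logging.getLogger("scip")  # module-level logger, name is illustrative

def load_file(path):
    logger.info("Loading file %s", path)             # coarse progress: info
    logger.debug("Reader options: %r", {"lazy": True})  # internals: debug
    return path

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    load_file("example.tiff")
```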

Update masking for IFC data

Currently, a mask is computed for every channel, which is used for computing features. If any of these channels contain multiple objects, the event is discarded. The bounding box is computed by taking the largest bounding box from all channels.

In the updated version, one channel is designated as the primary channel, which is used for computing the bounding box. For feature extraction, one mask per channel is still computed. This mask need not be a single connected component, which poses no problem for texture and intensity features.

For shape features, regionprops returns one collection of properties per separate region, so the measurements for all regions have to be aggregated. For example, if a stained channel of an event has three components, regionprops is computed for all three, and we can take the average of the three measurements.
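A sketch of the aggregation step, assuming the per-region measurements have already been collected into dicts (e.g. from skimage.measure.regionprops); the keys are illustrative:

```python
import numpy as np

def aggregate_regions(region_measurements):
    """Average shape measurements over all regions of one event/channel.

    region_measurements: list of dicts, one per connected component, all
    with the same keys (hypothetical shape of per-region output).
    """
    keys = region_measurements[0].keys()
    return {k: float(np.mean([m[k] for m in region_measurements]))
            for k in keys}

regions = [{"area": 10, "eccentricity": 0.2},
           {"area": 20, "eccentricity": 0.4},
           {"area": 30, "eccentricity": 0.6}]
agg = aggregate_regions(regions)  # averages each measurement over regions
```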

Start new module for feature extraction

Features should be extracted from masked and unmasked images.
The feature extraction module will take a Dask Bag as input (the output from segmentation) and produce a Dask DataFrame as output.

  • Compose list of features to be implemented (what libraries to use)
  • Implement an extraction module per feature type (e.g. intensity, shape, texture ...)
  • Embed feature extraction in main pipeline

Feature extraction quality control report

After feature extraction a quality control report should be saved, which should contain:

  • A table of extracted features containing a column with the feature name and columns with some basic statistics (mean, variance, median)
  • Check for zero variance features and add list to report
  • Plot of dimensionality reduction (see #19), coloured according to metadata (per sample or per acquisition day)

Use CellProfiler to extract features

CellProfiler is a software tool for cell profiling that is widely used in the community.

It has a Python API, which can be used to construct and run pipelines programmatically. Their wiki includes a good example of how to use CellProfiler as a package.

If we can run this package distributed within Dask, the number of features we can compute increases a lot.

Add per-sample min-max normalization

Minima and maxima can be computed exactly over all partitions per file. This approach works if no outlier pixel values are expected (or the data has been cleaned beforehand).
