scalablecytometryimageprocessing / scip

Scalable Cytometry Image Processing (SCIP) is an open-source tool that implements an image processing pipeline on top of Dask, a distributed computing framework written in Python. SCIP performs projection, illumination correction, image segmentation and masking, and feature extraction.

Home Page: https://scalable-cytometry-image-processing.readthedocs.io/en/latest/

License: GNU General Public License v3.0

Python 99.85% Shell 0.15%
distributed-computing dask-distributed dask bioimage-analysis cytometry-analysis-pipeline

scip's People

Contributors: maximlippeveld, sanderthierens

scip's Issues

Segmentation of optical microscopy images

After loading and fusing, we end up with a bag of (C, X, Y) tiles where every tile contains many cells. We have to identify these cells to profile them. This means we have to map a bag with m (= number of tiles) entries to a bag of n (= number of cells) entries.

This mapping could be achieved as follows:
Array of tiles (one tile = one chunk) -(to_delayed)-> array of Delayed objects of tiles -(segment)-> array of Bags of cells -(concat)-> Bag of cells
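A minimal sketch of this mapping with Dask; `segment` is a hypothetical stand-in that crops fixed-size windows from each tile, the shapes and chunking are illustrative:

```python
import dask.array as da
import dask.bag as db
from dask import delayed

def segment(tile):
    # hypothetical segmentation: crop fixed 8x8 windows from a (1, C, X, Y) chunk
    tile = tile[0]  # drop the leading chunk axis -> (C, X, Y)
    c, x, y = tile.shape
    return [tile[:, i:i + 8, j:j + 8]
            for i in range(0, x, 8) for j in range(0, y, 8)]

# 4 tiles of shape (C=3, X=16, Y=16), one tile per chunk
tiles = da.random.random((4, 3, 16, 16), chunks=(1, 3, 16, 16))
delayed_tiles = tiles.to_delayed().ravel()              # one Delayed per tile
bags = [db.from_delayed(delayed(segment)(t)) for t in delayed_tiles]
cells = db.concat(bags)                                 # one Bag of all cells
```

Each tile becomes one Bag partition, so the m-tiles-to-n-cells mapping stays lazy until the final compute.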

Alternative segmentation for noisy channels

Brightfield segmentation currently does not work because the brightfield channel is much noisier than the other channels. We need to allow the user to indicate which channels are expected to be very noisy.

For large datasets, SCIP hangs when exporting features

Most likely due to features being collected onto one node in the delayed call to final. Fix by indexing the features, joining them with the metadata, and calling to_parquet directly on the Dask DataFrame, rather than on a pandas DataFrame in final.

Rewrite reporting and exporting flow

Currently, reports and data exports are produced by calling intermediate computes. This interrupts the Dask task graph, preventing further optimization. It might be better to gather the delayed objects and produce all output at the end.

Support for CZI images

To be able to handle CZI images we can load them using aicsimageio. The CZI images contain 'scenes', where each scene corresponds to a treatment. The scenes have dimensions MTCZXY equal to (56, 1, 6, 3, 1000, 1000). The user specifies which scene to process for each run through a command line argument or config entry.

The planes are stored as tiles. Each tile contains many cells, which we want to identify through segmentation. The aicsimageio library provides access to the image data as a Dask Array.

Benchmarking scalability of data loading + masking

To prove that horizontal scaling is useful, we want to measure runtime on a dataset for increasing parallelization. Concretely, we want to measure runtime in seconds as a function of the number of executors on the PBSCluster. The hypothesis is that runtime initially decreases as more executors are used, but starts increasing again once overhead becomes significant.

The number of executors is governed by two parameters: n_workers and processes. The former defines how many jobs are spawned (one job = one prism node), the latter defines how many processes each job is split into. The number of executors then equals n_workers * processes.

We want to write a script that launches the sip command for varying configurations and records the runtime.
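A sketch of such a driver, timing one CLI invocation per executor count; the `benchmark` wrapper and its interface are hypothetical, and the `--n-workers` flag follows the CLI example in this document:

```python
import subprocess
import time

def benchmark(executor_counts, cmd="sip"):
    """Run the CLI once per executor count and record wall-clock runtime.

    Hypothetical wrapper: assumes the command accepts an --n-workers flag.
    """
    runtimes = {}
    for n in executor_counts:
        start = time.perf_counter()
        subprocess.run([cmd, "--n-workers", str(n)], check=True)
        runtimes[n] = time.perf_counter() - start
    return runtimes
```

The resulting runtime-per-executor-count dictionary can then be plotted to check where overhead starts to dominate.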

Add pipeline and run configuration

There is an important distinction between pipeline configuration and run configuration. The pipeline configuration concerns things like what channels to load, which segmentation algorithm to use on what channel, which features to compute... The run configuration concerns things like how many workers to use, how much memory each worker receives, what logging level to use...

Pipeline configuration should be implemented with a YAML config file that can be passed when launching the command line interface (CLI) on the command line. Example:

data_loading:
  file_format: multiframe_tiff
  channels:
    - 0
    - 1
    - 2
    - 4
    - 5
segmentation:
  noisy_channels:
    - 0
    - 5
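Such a pipeline config could be parsed with PyYAML; a minimal sketch, inlining an equivalent config as a string:

```python
import yaml

# equivalent to the YAML file shown above, inlined for illustration
config_text = """
data_loading:
  file_format: multiframe_tiff
  channels: [0, 1, 2, 4, 5]
segmentation:
  noisy_channels: [0, 5]
"""

config = yaml.safe_load(config_text)
channels = config["data_loading"]["channels"]
noisy = config["segmentation"]["noisy_channels"]
```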

The run configuration will be passed to the program through CLI arguments. Example:

sip --n-workers 4 --worker-mem 20G

Argument handling is done with the click Python library.

Compute quantiles with random undersampling

Implement with a reduction and apply: the per-partition step samples from each partition, the binop simply concatenates the samples, and an apply on the resulting Item computes quantiles from the pooled samples.
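A possible sketch using dask.bag's `reduction` and `Item.apply`; the function names and per-partition sample size are illustrative:

```python
import random
import numpy as np
import dask.bag as db

def sample_partition(partition, k=50):
    # per-partition step: randomly undersample at most k values
    values = list(partition)
    return random.sample(values, min(k, len(values)))

def concat(samples):
    # combine step: simply pool the per-partition samples
    return [v for s in samples for v in s]

bag = db.from_sequence(range(10_000), npartitions=8)
pooled = bag.reduction(sample_partition, concat)   # -> Item of pooled samples
quantiles = pooled.apply(lambda s: np.quantile(s, [0.05, 0.5, 0.95]))
result = quantiles.compute()
```

Because each partition is undersampled before pooling, the quantiles are approximate but the reduction never materializes the full bag on one worker.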

Add unit tests

We will use pytest to write unit tests. Some core functionality should already be tested in the alpha release.

  • Create dataset with 5 multiframe tiff images that can be used for testing (can be pushed to git repo)
  • Write data loading test for multiframe tiff loading
    • Number of loaded images, image shapes and channels should be correct
    • Precompute average intensity value and hard code in test. Write a test that compares average intensity of loaded images to hardcoded values
  • Write segmentation test that checks if masks are non empty
  • Write feature extraction tests
  • Write test for minmax normalization

We will use the coverage package to monitor code coverage.

Masking quality control report

After masking a QC report should be written out which contains:

  • Pixel distribution per channel, pre- and post-segmentation
  • Message stating what percentage of images have an empty mask per channel
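The empty-mask statistic could be computed with a small NumPy helper; a hypothetical sketch, assuming the masks are stacked as an (N, C, X, Y) boolean array:

```python
import numpy as np

def empty_mask_percentage(masks):
    """Percentage of images with an all-empty mask, per channel.

    masks: boolean array of shape (N images, C channels, X, Y).
    Hypothetical helper for the QC report described above.
    """
    empty = ~masks.any(axis=(2, 3))        # (N, C): True where mask is empty
    return 100.0 * empty.mean(axis=0)      # one percentage per channel

masks = np.zeros((4, 2, 8, 8), dtype=bool)
masks[:2, 0, 3, 3] = True                  # channel 0: 2 of 4 masks non-empty
pct = empty_mask_percentage(masks)         # -> [50., 100.]
```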

Move UMAP to separate module

UMAP (and other dimensionality reduction methods) should be added to a separate analysis module. UMAP results should still be added to a report. Components should be concatenated to other features so that dimensionality reductions can be exported alongside features.

Smart indexing on Dask Dataframe

Setting the index on our Dask DataFrame can be useful. For instance, samples from different patients could be collected at many timepoints during follow-up (e.g. a blood test every week). Setting the DataFrame index to this timepoint column allows us to quickly select timepoints for downstream analysis.

Setting the index also repartitions the data. If the index is set to patient id, for instance, we can compute analyses on all data per patient using map_partitions.

The index column should be set by the user in a config setting.

Implement feature extraction that combines channels

In #2 we only implemented features that are computed independently per channel. We also want to expand the feature set with features computed on combinations of channels, for example the similarity between two fluorescent channels.

Option to persist feature dataframes to disk (data format yet to be decided; likely SQLite and Feather)

Calling to_parquet or to_csv on Dask DataFrame creates a file per partition. This is good for intermediate checkpointing of the pipeline state, but not for exporting the data. The data should be exported to one large file. If the features dataframe fits in a single nodes' memory, we can just collect the dataframe to pandas and export from there. If it doesn't fit, we have to append to the output file batch per batch. Need to look into how this works.

Image normalization after masking

To avoid working with very small floats in the feature extraction phase, we want to rescale the pixel values to the [0, 1] range. This can be achieved with a simple min-max normalization. Potentially, we can use quantile normalization, where instead of the min and max we use the 5th and 95th percentile.
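A minimal sketch of both variants; the helper name is illustrative:

```python
import numpy as np

def minmax_normalize(img, q=None):
    """Rescale pixel values to [0, 1].

    With q set (e.g. q=0.05), the q-th and (1-q)-th quantiles replace the
    min and max, and values outside them are clipped, which is more robust
    against outlier pixels.
    """
    if q is None:
        lo, hi = img.min(), img.max()
    else:
        lo, hi = np.quantile(img, [q, 1 - q])
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)
```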

Feature extraction from pretrained neural nets

Neural networks which are pretrained on bioimaging data can contain meaningful intermediate representations. This feature would allow users to supply a pretrained network and specify which intermediate representation they would like to extract.

Support for other microscopy formats through Bio-Formats

Bio-Formats is a library for reading and writing microscopy image data. It supports many different (proprietary) formats. If we can use Bio-Formats for our data loading, our tool becomes applicable to many more datasets.

Bio-Formats can be used from Python with (A) the python-bioformats package or (B) through ImageJ with the pyimagej package.

Given this discussion on the image.sc forum it is not clear what the future is for option (A), so (B) is likely the best option to explore first.

Focus selection or focus stacking for multi-focal images

The CZI images have a Z-dimension. The planes in the Z-dimension have to be converted into one plane. This can be done by selecting one plane, either fixed in config or dynamically using a focus metric. Another option is to fuse the planes with focus stacking.

  • Selecting a fixed plane with a config entry
  • Dynamically selecting a plane per instance using a focus metric
  • Fusing the stack with focus stacking
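The dynamic-selection option could look like this minimal sketch, using per-plane intensity variance as a stand-in focus metric; a gradient- or Laplacian-based metric would likely be more robust in practice:

```python
import numpy as np

def select_focused_plane(stack):
    """Pick the sharpest plane from a (Z, X, Y) stack.

    Uses per-plane intensity variance as a simple, hypothetical focus
    metric; returns the chosen index and the plane itself.
    """
    scores = stack.reshape(stack.shape[0], -1).var(axis=1)
    best = int(np.argmax(scores))
    return best, stack[best]
```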

Add demo dataset for new users

It is nice for new users to install the package and be able to test their installation on a demo dataset.

The dataset should not be shipped with the package, but should be downloaded when a user wants to run the demo.

Add info and debug logging statements

Throughout the program, logging statements should be added which indicate what part of the program is executing: for example, which module is currently running or which file is being loaded. These can be info-level logs.

Debug logs are for more detailed information that is not needed to know which part of the pipeline is executing.
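A minimal sketch of the two log levels; the logger name and messages are illustrative:

```python
import logging

logger = logging.getLogger("scip")  # module-level logger, name is illustrative

def load_file(path):
    logger.info("Loading file %s", path)             # coarse progress: info
    logger.debug("Reader options: %r", {"lazy": True})  # internals: debug
    return path

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    load_file("example.tiff")
```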

Update masking for IFC data

Currently, a mask is computed for every channel, which is used for computing features. If any of these channels contain multiple objects, the event is discarded. The bounding box is computed by taking the largest bounding box from all channels.

In the updated version, one channel is designated as the primary channel, which is used for computing the bounding box. For feature extraction, one mask per channel is still computed. This mask need not be a single connected component, which poses no problem for texture and intensity features.

For shape features, regionprops returns one collection of properties per separate region, so the measurements for all regions have to be aggregated. For example, if a stained channel of an event has three components, regionprops is computed for all three, and we can take the average of the three measurements.
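A sketch of the aggregation step, assuming the per-region measurements have already been collected into dicts (e.g. from skimage.measure.regionprops); the keys are illustrative:

```python
import numpy as np

def aggregate_regions(region_measurements):
    """Average shape measurements over all regions of one event/channel.

    region_measurements: list of dicts, one per connected component, all
    with the same keys (hypothetical shape of per-region output).
    """
    keys = region_measurements[0].keys()
    return {k: float(np.mean([m[k] for m in region_measurements]))
            for k in keys}

regions = [{"area": 10, "eccentricity": 0.2},
           {"area": 20, "eccentricity": 0.4},
           {"area": 30, "eccentricity": 0.6}]
agg = aggregate_regions(regions)  # averages each measurement over regions
```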

Start new module for feature extraction

Features should be extracted from masked and unmasked images.
The feature extraction module will take a Dask Bag as input (the output from segmentation) and produce a Dask DataFrame as output.

  • Compose list of features to be implemented (what libraries to use)
  • Implement an extraction module per feature type (e.g. intensity, shape, texture ...)
  • Embed feature extraction in main pipeline

Feature extraction quality control report

After feature extraction a quality control report should be saved, which should contain:

  • A table of extracted features containing a column with the feature name and columns with some basic statistics (mean, variance, median)
  • Check for zero variance features and add list to report
  • Plot of dimensionality reduction (see #19), coloured according to metadata (per sample or per acquisition day)

Use CellProfiler to extract features

CellProfiler is a software tool for cell profiling that is widely used in the community.

It has a Python API, which can be used to construct and run pipelines programmatically. Their wiki includes a good example of how to use CellProfiler as a package.

If we can run this package distributed within Dask, the number of features we can compute increases a lot.

Add per-sample min-max normalization

Minima and maxima can be computed exactly over all partitions per file. This approach works if no outlier pixel values are expected (or the data has been cleaned beforehand).
