spacetx / starfish
starfish: unified pipelines for image-based transcriptomics
Home Page: https://spacetx-starfish.readthedocs.io/en/latest/
License: MIT License
Upload worked example of processed in-situ sequencing data (RCA w/ Padlock probes) in human breast tissue
Right now, the API for displaying an image stack looks like:
from starfish.io import Stack
s = Stack()
s.read(in_json)
tile(s.squeeze());
It would be nice if the last line in the above snippet looked like:
s.show()
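A minimal sketch of what that convenience method could look like, assuming tile() stays the underlying display helper (the stub here stands in for the real one, and the class is a toy, not the actual starfish.io.Stack):

```python
def tile(array, **kwargs):
    """Stand-in for starfish's tile() display helper."""
    return array  # the real helper renders the tiles with matplotlib

class Stack:
    """Toy sketch of the proposed Stack.show() convenience method."""
    def __init__(self, data):
        self.data = data

    def squeeze(self):
        # placeholder for the real squeeze(), which drops singleton axes
        return self.data

    def show(self, **kwargs):
        # s.show() collapses tile(s.squeeze()) into one call
        return tile(self.squeeze(), **kwargs)
```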
Upload worked example of processed DARTFISH data
Upload worked example of processed MERFISH data
Filter methods should have an option to dump PNGs for easy feedback and viewing.
Watershed needs a stain image. Previously this was produced in the filter pipeline stage. The new API should produce the stain as a separate ImageSet.

Many collaborators are on Windows machines. We should make sure pip installation works for them with no usability issues.
Dependencies: #834
For now, we've been using the Python API to process 1 FOV at a time to ensure the API makes sense and works as expected. Work on this will continue through #54 #55 #56 #57 etc. However, what does it look like to use the CLI to process an entire dataset comprising multiple FOVs? How fast is the process? Did it work? How good is our WDL approach?
We have an early demo (https://github.com/chanzuckerberg/starfish/blob/master/starfish.wdl) of how Starfish can be run on 'green box' architecture for the HCA DCP. We should ensure that this demo stays up to date and plays well with changes to the CLI.
Each pipeline component should inherit from a single AlgorithmBase object, instead of each separate pipeline component implementing a new one.
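One possible shape for the shared base, with FourierShiftRegistration as an illustrative subclass (the lookup-by-name via __subclasses__ is an assumption for the sketch, not the current implementation):

```python
class AlgorithmBase:
    """Single shared base class for all pipeline-component algorithms."""

    @classmethod
    def get_algorithm_name(cls):
        return cls.__name__

    @classmethod
    def find(cls, name):
        # look up a concrete algorithm by name among direct subclasses
        for sub in cls.__subclasses__():
            if sub.get_algorithm_name() == name:
                return sub
        raise ValueError(f"no algorithm named {name!r}")

class FourierShiftRegistration(AlgorithmBase):
    def __init__(self, upsample=1):
        self.upsample = upsample
```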
For the examples below, it should be easy for users of the API and the CLI to quickly get help through the docstrings and argparser respectively
s_reg = Registration.fourier_shift(s, upsample)
s_reg_filt = Filter.gaussian_low_pass(s_reg, sigma)
s_reg_filt2 = Filter.gaussian_high_pass(s_reg_filt, sigma_2)
A good example can be found in the Atom project:
https://github.com/atom/atom/blob/master/CONTRIBUTING.md
And the Jupyter project:
https://github.com/jupyter/notebook/blob/master/CONTRIBUTING.rst
This can make the .ipynb portable (or .py if we have a better tool for converting ipynb <-> py).
Currently, starfish defines both API and CLI parameters, both of which can declare default parameters and help strings.
We'd like to harmonize this, ideally by programming the information into the API and automatically generating the CLI.
Tony thought it might be possible to do this with a decorator, but wants to experiment with how that could affect usability.
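One way such a decorator could work, sketched with inspect and argparse (the names and structure here are assumptions for illustration, not Tony's design): required API parameters become positional CLI arguments, and parameters with defaults become flags whose type and default are read from the signature.

```python
import argparse
import inspect

def cli_command(subparsers):
    """Derive a CLI subcommand from an API function's signature, so
    defaults and help strings are declared once, in the API."""
    def decorator(func):
        sub = subparsers.add_parser(func.__name__, help=func.__doc__)
        for name, param in inspect.signature(func).parameters.items():
            if param.default is inspect.Parameter.empty:
                sub.add_argument(name)  # required -> positional
            else:
                sub.add_argument(f"--{name}", default=param.default,
                                 type=type(param.default))
        sub.set_defaults(_func=func)
        return func
    return decorator

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers()

@cli_command(subparsers)
def gaussian_low_pass(in_json, sigma=1.0):
    """Apply a Gaussian low-pass filter."""
    return in_json, sigma
```

With that in place, parsing ["gaussian_low_pass", "org.json", "--sigma", "2.0"] yields a namespace carrying both the converted arguments and the API function itself.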
There currently exists a prototype interactive visualization of processed Starfish data. The code lives here: https://github.com/chanzuckerberg/starfish/tree/master/viz
There are several interesting directions to take this in:
There are two options:
1. Decode each pixel into a corresponding gene, then use a connected-components labeler to call spots as contiguous same-gene regions of a certain size.
2. Find spots, pool the intensity, then decode each spot into a gene.
It should be possible, through simulations, to determine under which SNR noise regimes one method is better than another.
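Option 1 can be sketched with a connected-components labeler on toy data (scipy.ndimage.label stands in here for whatever labeler starfish would actually use; the array and size threshold are made up):

```python
import numpy as np
from scipy import ndimage

# each pixel already decoded to a gene id (0 = background)
decoded = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 2, 2],
])

min_size = 2  # drop single-pixel "spots"
spots = []
for gene in np.unique(decoded[decoded > 0]):
    # label contiguous regions of this gene (4-connectivity by default)
    labels, n_components = ndimage.label(decoded == gene)
    for component in range(1, n_components + 1):
        if (labels == component).sum() >= min_size:
            spots.append(int(gene))
```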
I have been working on a similar application, Skopy. It can be used for feature extraction (i.e. similar to CellProfiler's measurement modules), but I plan on adding image segmentation and object detection in the near future. It has nearly the same mission, providing a ready-made pipeline for image analysis, but it has a simple, portable, and lightweight architecture, and people have already started using it. Would you be interested in contributing? It could be fun!
Hi. I am having difficulty passing this step in the notebook (Reproducing Marco's results with Starfish):
s.read('ISS/fov_001/org.json')
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/site-packages/starfish/io.py", line 48, in read
self._read_stack()
File "/usr/lib/python2.7/site-packages/starfish/io.py", line 77, in _read_stack
im = self.read_fn(os.path.join(self.path, fname))
File "/usr/lib64/python2.7/site-packages/numpy/lib/npyio.py", line 431, in load
"Failed to interpret file %s as a pickle" % repr(file))
IOError: Failed to interpret file u'/newdata/data2/homes/joshr/ISS/fov_001/1_1st_Cy3 5.TIF' as a pickle
We generate a bunch of "dots" images that represent bright spots in the image. These are typically max projections over some subset of the data. Brian Long suggested we might want to store some information on how such projections were created (e.g. max projection over z? over z, h, and c?).
Determine how to store this data in starfish and the starfish spec.
This allows us to take advantage of refactoring tools and keep the notebooks up to date.
This may be useful to have as part of the registration module, as opposed to what's currently there, which is simply a Fourier-based translation algorithm.
The current way to update a data stack is:
# this is a vector that needs to be a tensor; s knows what the shape is
s.set_stack(s.un_squeeze(stack_filt))
It would be preferable to have an API for s.set_stack() that takes an arbitrary list and implicitly knows the correct shape. Alternatively, s.set_stack() could be a classmethod that generates a new stack, thus removing side-effecting code from the code-base.
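A sketch of the classmethod variant, which returns a fresh stack instead of mutating one in place (from_flat is a hypothetical name, and the class is a toy, not the real Stack):

```python
import numpy as np

class Stack:
    """Toy sketch: the stack knows its tensor shape, so callers can hand
    it a flat vector and let it do the reshaping."""
    def __init__(self, data):
        self.data = data

    @classmethod
    def from_flat(cls, flat, shape):
        # classmethod construction: no side effects on an existing stack
        return cls(np.asarray(list(flat)).reshape(shape))
```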
Currently, the segmentation module in Starfish implements a single seeded watershed algorithm. While this is a good start, several labs have also used simple Voronoi tessellation. We should implement this as an option for pipeline research.
We want Starfish to natively and seamlessly support 3D images. There are several TODOs in the code where this transition has not been made yet.
To the uninitiated (aka me), it can take a while to understand whether the x-axis is the hybridization rounds and the y-axis is the channels, or vice versa.
Manually adding small, unobtrusive text to the figure would greatly help:
import matplotlib.pyplot as plt
tile(s.squeeze(), size=20);
fig = plt.gcf()
# fig.text(x, y, text)
# x (y) = number from 0 to 1, where 0 is the left (bottom) of the plot and 1 is the right (top) of the plot.
# The numbers were found by manually playing around
fig.text(.5, .8, "Channels")
fig.text(.11, .53, "Hybridization rounds", rotation=90)
In the contributing guidelines, it is suggested to use nbencdec to save Jupyter notebooks as Python files side-by-side. Can you help me understand the advantages of this package over the internal nbconvert or the nbdime package? Or over a git hook, e.g. *.ipynb filter=ipython nbconvert --to python?
This project needs some documentation style guidelines, preliminary documentation, auto-generation of documentation, and a statement in the contribution.md about how to build new documentation.
An example of a project that does this well: https://dask.pydata.org/en/latest/
We want to know when things crash and why. We want to know what's slow and why. We need a sensible logging infrastructure to keep track of these issues. This is particularly important for the CLI but probably not the API.
The gather step of munge.py is giving a ValueError: setting an array element with a sequence.
The error arises during the pd.melt call in the function below. I haven't tried to chase it down further; I just wanted to flag it.
def gather(df, key, value, cols):
    # melt the wide-format columns in `cols` into long (key, value) pairs
    id_vars = [col for col in df.columns if col not in cols]
    return pd.melt(df, id_vars=id_vars, value_vars=cols,
                   var_name=key, value_name=value)
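For reference, a minimal melt call with explicit keyword arguments (synthetic data, not the failing input) behaves like this:

```python
import pandas as pd

# toy wide-format table: one column per imaging round
df = pd.DataFrame({"gene": ["a", "b"], "r1": [1, 2], "r2": [3, 4]})

# melt the round columns into long format: one row per (gene, round) pair
long_df = pd.melt(df, id_vars=["gene"], value_vars=["r1", "r2"],
                  var_name="round", value_name="intensity")
```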
This is no longer needed since we entirely use skimage. OpenCV should also be removed from setup.py and requirements.txt.
(argparse.Namespace, argparse.ArgumentParser)

Some parameters should be required, optional, or positional. We should make sure we get this right. For example, org_json and output_dir should be required parameters, not positional parameters.
Currently, Starfish processes one field of view at a time. At some point, these processed (and potentially unprocessed) images/results need to be stitched together for visualization. This can either be done implicitly, e.g., for each FOV we record an offset and position in an overall grid with necessary overlap information, or explicitly, e.g., the former information is used to create one large image / table representing results.
What are the right algorithms for stitching, handling boundary artifacts, and de-duping tabular results data?
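The implicit option amounts to simple bookkeeping: each FOV records the offset of its corner in a global grid, and per-FOV coordinates are translated into the full-image frame on demand. A toy sketch (the names, tile layout, and overlap values are made up):

```python
# offset of each field of view's top-left corner in the global image,
# in (row, col) pixels; assumes 1000-px tiles with 100 px of overlap
fov_offsets = {
    "fov_000": (0, 0),
    "fov_001": (0, 900),
    "fov_002": (900, 0),
}

def to_global(fov, row, col):
    """Map a coordinate within one FOV into the stitched image frame."""
    dr, dc = fov_offsets[fov]
    return (row + dr, col + dc)
```

De-duping tabular results would then reduce to detecting spots whose global coordinates fall inside another FOV's overlap region.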
Positional parameters make it too inflexible.
The ISS notebook currently does not work because the registration module has moved.
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-8-7b1a2d078906> in <module>()
----> 1 from starfish.registration._fourier_shift import FourierShiftRegistration
2 from starfish.registration import Registration
3
4 upsample = 1000
5 s = Registration.run("FourierShiftRegistration", s, upsample)
ModuleNotFoundError: No module named 'starfish.registration'
When I fix that, the pipeline appears to have changed the run API, so either @ttung can take a stab at it or I'll review the commits that changed this when I've had more sleep.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-19c5ac120260> in <module>()
3
4 upsample = 1000
----> 5 s = Registration.run("FourierShiftRegistration", s, upsample)
/usr/local/lib/python3.6/site-packages/starfish/pipeline/pipelinecomponent.py in run(cls, algorithm_name, stack, *args, **kwargs)
20 def run(cls, algorithm_name, stack, *args, **kwargs):
21 """Runs the registration component using the algorithm name, stack, and arguments for the specific algorithm."""
---> 22 algorithm_cls = cls._class_for_algorithm(algorithm_name)
23 instance = algorithm_cls(*args, **kwargs)
24 return instance.register(stack)
AttributeError: type object 'Registration' has no attribute '_class_for_algorithm'
How do we know our results of processed data are accurate? Is there a set of QC metrics that one can compute, that generalize across assays, that can answer this question? What QC metrics do methods developers want? What QC metrics do computational biologists want? What QC metrics do software engineers want?
The current Stack abstraction in io.stack.py conflates 'hybridization' images and 'auxiliary' images. It's probably a good idea to separate these two under a common base class. This will:
Eliminate having to load (and potentially write) all 'hybridization' and 'auxiliary' images when you only want to operate on, say, a single auxiliary image
Eliminate the propagation of an 'org.json' file at each step of computation
Make the CLI more flexible and less tied to specific intermediary file formats
deep says this issue sucks. badly documented.
Right now starfish loads float. We should do one of: load the data as uint16, convert it to uint16, or complain if we can't deduce how to do this from the data.

This link in the main readme is broken: https://github.com/spacetx/starfish/blob/master/notebooks/Starfish%20Simple%20ISS%20tutorial%20%7C%20Mouse%20vs.%20Human%20Fibroblasts.ipynb
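A sketch of the uint16 loading policy described above (the function name and the [0, 1]-range rescaling rule are assumptions):

```python
import numpy as np

def coerce_to_uint16(array):
    """Keep uint16 as-is, rescale unit-range floats, otherwise complain."""
    if array.dtype == np.uint16:
        return array
    if (np.issubdtype(array.dtype, np.floating)
            and array.min() >= 0 and array.max() <= 1):
        # scale [0, 1] floats onto the full uint16 range
        return (array * np.iinfo(np.uint16).max).astype(np.uint16)
    raise TypeError(f"cannot deduce a uint16 conversion for dtype {array.dtype}")
```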
How do I see a worked example? I'm interested to see examples of input and output.
Thanks
Matt
See #97 as an example.
The notebook repository will contain several examples of how to use Starfish to analyze data from several assays, e.g., MERFISH, DARTFISH, Padlock Probes, sequential smFISH, etc.
While these notebooks provide direct examples of how to use the Starfish API, they can also be refactored into unit tests so that developers can make sure they're not making breaking changes.
Developers will also need to make sure that the Jupyter notebooks are sufficiently updated if the corresponding unit tests need to be updated.