
iohub's Issues

support multi-pos datasets that are not complete multi-well plate acquisitions

There is an apparent gap between the data formats specified by OME-NGFF and supported by the napari-ome-zarr plugin: they specify/support single-FOV ND arrays or multi-FOV ND arrays acquired in a multi-well plate, but not simple multi-position acquisitions.

We can represent this data as a multi-well dataset to leverage the napari-ome-zarr plugin for visualization.

This topic requires a discussion among @ziw-liu , @Soorya19Pradeep , @jrbyrum13 , @Christianfoley , @ieivanov, @talonchandler who either acquire or analyze multi-pos datasets.

I see two options:

  1. Have a single column, with row names = position names in the Micro-Manager acquisition.
  2. Map the data to a square grid to make it easy to visualize with the napari plugin, e.g., if we have 100 FOVs, sort them into 10 rows and 10 columns (see the sketch below).
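
A minimal sketch of option 2, assuming positions are named in acquisition order (the row/column naming scheme here is illustrative):

import math
import string


def grid_layout(position_names):
    """Map N position names onto a near-square (row, column) grid."""
    ncols = math.ceil(math.sqrt(len(position_names)))
    return {
        name: (string.ascii_uppercase[i // ncols], str(i % ncols + 1))
        for i, name in enumerate(position_names)
    }


# 100 FOVs -> rows A-J, columns 1-10
layout = grid_layout([f"Pos{i}" for i in range(100)])
assert layout["Pos99"] == ("J", "10")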

Votes & Ideas?

Enforcing uniform data shape in a single OME-Zarr HCS dataset

When implementing the OME-Zarr HCS features, it became evident that there is a gap between what the specification allows (or does not explicitly prohibit) and what is often convenient to interact with: the non-uniformity of the ZYX shape (in pseudo-code: array.shape[-3:]) in a single dataset.

In most use cases, it is convenient to have a uniform ZYX shape, and this opens up possibilities for straightforward implementations of data summarization and parallel computation. As far as the official implementation goes, shape uniformity is implied by how the reader attempts to gather all the FOVs into a single Dask array. However, enforcing it would disable support for experiments where images of the same physical object come off multiple acquisition arms (e.g. bright field and light sheet).

During an offline meeting, @mattersoflight, @ieivanov and others agreed that we can store non-uniformly shaped images in separate Zarr stores before resampling to the same coordinate system.

@JoOkuma @AhmetCanSolak What are your thoughts?

iohub needs a `convert` CLI command

iohub needs a simple CLI interface for converting files into the spec-adherent formats that it writes.

recOrder has a simple convert command that depends on ZarrConverter, which in turn depends on WaveorderReader and WaveorderWriter. I think that many parts of the WaveorderReader can be reused for iohub's converter, but we have decided not to depend on the WaveorderWriter. Therefore, I think the following steps are needed before we're ready for an iohub convert PR:

  • We've decided which parts of WaveorderReader we're going to reuse
  • The ZarrConverter class has been modified for compatibility with the new writer (and possibly modified readers)

@ziw-liu and others, I would appreciate your thoughts on both of these steps.

Unable to apply CoordinateTransformations with napari-ome-zarr

I am trying to use the coordinateTransformation metadata from iohub.ngff.create_image() and am seeing a couple of issues that are possibly related, so I'm keeping them in the same issue.

The first issue, which might be more of a napari issue, is that we cannot apply individual coordinate transforms to multiple images within a position. Running the code below throws the error that follows, which suggests napari expects some sort of pyramidal (multiscale) dataset.

import numpy as np

from iohub import open_ome_zarr
from iohub.ngff_meta import TransformationMeta  # import paths may vary by iohub version

store_path = "/hpc/projects/comp_micro/sandbox/Ed/tmp/"
store_path = store_path + "test_translate" + ".zarr"

tczyx_1 = np.random.randint(
    0, np.iinfo(np.uint16).max, size=(1, 3, 3, 32, 32), dtype=np.uint16
)
tczyx_2 = np.random.randint(
    0, np.iinfo(np.uint16).max, size=(1, 3, 3, 32, 32), dtype=np.uint16
)
coords_shift = [1.0, 1.0, 1.0, 100.0, 100.0]
coords_shift2 = [1.0, 1.0, 1.0, -100.0, 1.0]
scale_val = [1.0, 1.0, 1.0, 0.5, 0.5]
translation = TransformationMeta(type="translation", translation=coords_shift)
scaling = TransformationMeta(type="scale", scale=scale_val)
translation2 = TransformationMeta(type="translation", translation=coords_shift2)

with open_ome_zarr(
    store_path,
    layout="hcs",
    mode="w-",
    channel_names=["DAPI", "GFP", "Brightfield"],
) as dataset:
    # Create and write to positions.
    # This affects the tile arrangement in visualization.
    position = dataset.create_position(0, 0, 0)
    position.create_image("0", tczyx_1, transform=[translation])
    position = dataset.create_position(0, 1, 0)
    position.create_image("0", tczyx_2, transform=[scaling])
    # Print dataset summary
    dataset.print_tree()

Error:

File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/bin/napari", line 8, in <module>
    sys.exit(main())
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/__main__.py", line 561, in main
    _run()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/__main__.py", line 341, in _run
    viewer._window._qt_viewer._qt_open(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/_qt/qt_viewer.py", line 830, in _qt_open
    self.viewer.open(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 1014, in open
    self._add_layers_with_plugins(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 1242, in _add_layers_with_plugins
    added.extend(self._add_layer_from_data(*_data))
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 1316, in _add_layer_from_data
    layer = add_method(data, **(meta or {}))
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/utils/migrations.py", line 44, in _update_from_dict
    return func(*args, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 823, in add_image
    layerdata_list = split_channels(data, channel_axis, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/layers/utils/stack_utils.py", line 79, in split_channels
    multiscale, data = guess_multiscale(data)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/layers/image/_image_utils.py", line 76, in guess_multiscale
    raise ValueError(
ValueError: Input data should be an array-like object, or a sequence of arrays of decreasing size. Got arrays of single shape: (1, 3, 3, 32, 64)

Now, if we remove the second image (position.create_image("1", tczyx_2)) and open the positions in napari (napari --plugin napari-ome-zarr test_translate.zarr), we see the two 32x32 test images as expected, but without the transformations applied to them.

(screenshot: both positions rendered without the transforms applied)

If we drag and drop the different positions separately into napari, we can see that the scaling and the translation are each applied, but now we have to use the position slider to move between them.

(screenshots: each position loaded individually, with its scale or translation applied)

tensorstore support

Tensorstore could be used as the Zarr backend.

A few advantages:

  • It's faster.
  • It supports transposes and other transformations of the coordinate space without loading the data.
  • It loads the data lazily, reading voxels only when requested (e.g. through np.asarray); see the sketch below.
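
A minimal sketch of lazy reading with TensorStore, assuming a local Zarr array (the path is illustrative):

import numpy as np
import tensorstore as ts

# Open one Zarr array; this reads metadata only, not voxels.
dataset = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": "plate.zarr/A/1/0/0"},
    },
    read=True,
).result()

# Indexing produces a lazy view over the coordinate space.
view = dataset[0, 0]

# Voxels are only read when the view is materialized.
volume = np.asarray(view)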

Performance comparison between tifffile and iohub's custom OME-TIFF implementation

A custom OME-TIFF reader (iohub.multipagetiff.MicromanagerOmeTiffReader) was implemented because, historically, tifffile and AICSImageIO were slow when reading large OME-TIFF series generated by Micro-Manager acquisitions.

While debugging #65 I found that this implementation does not guarantee data integrity during reading. Before investing more time in fixing it, I think it is worth revisiting whether maintaining a custom OME-TIFF reader is worthwhile, given that the more widely adopted solutions have evolved since waveorder.io was designed.

Here is a simple read speed benchmark of tifffile and iohub's custom reader:

(benchmark plot: read times for tifffile vs. iohub's custom reader)

The test was done on a 123 GB dataset with PTCZYX = (8, 9, 3, 81, 2048, 2048) dimensions. Voxels from 2 non-sequential positions were read into RAM in each iteration (N=5).

Test script:

Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC [email protected])

# %%
import os
from timeit import timeit
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f


# %%
# 123 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Soorya/"
    "2022_06_27_A549cellMembraneStained/"
    "A549_CellMaskDye_Well1_deltz0.25_63X_30s_2framemin/"
    "A549_CellMaskdye_Well1_30s_2framemin_1"
)

POSITIONS = (2, 0)


# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()


# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)


# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in range(n):
        tf_times.append(
            timeit(
                "read_tifffile()",
                number=1,
                setup="from __main__ import read_tifffile",
            )
        )
        wo_times.append(
            timeit(
                "read_custom()",
                number=1,
                setup="from __main__ import read_custom",
            )
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})


# %%
timings = repeat()

At least in this test, the latest tifffile consistently outperforms the iohub implementation. While a comprehensive benchmark will take more time (#57), as long as a widely used library is not significantly slower, the reduced maintenance overhead and broader user testing make a strong case for reconsidering whether to keep the custom code in iohub.

Chunking option for writing dataset

I was trying to write an array into a newly created position and got an error during the write due to the size of the np.array(), which is (1, 1, 201, 2048, 2048) with dtype=np.float64. This error is related to the chunk size: we don't currently have an option to select the desired chunk size, and it defaults to the ZYX shape. A single ZYX chunk of this array is 201 x 2048 x 2048 x 8 bytes ≈ 6.7 GB, which exceeds the codec's 2147483647-byte (2 GiB) buffer limit.

merged_dataset = open_ome_zarr(store_path, mode="r+")
position = merged_dataset.create_position("position", str(p), "0")
position["0"] = img_normalized
merged_dataset.close()

ERROR:

ValueError                                Traceback (most recent call last)
/home/eduardo.hirata/Documents/compmicro-sandbox/zebrafishInfection/ed/20230202_casper_intoto_short1_QLIPP.py in line 3
      179 #%%
      180 position = merged_dataset.create_position("position",str(p),"0")
----> 181 position["0"] = img_normalized

File ~/Documents/iohub/iohub/ngff.py:423, in Position.__setitem__(self, key, value)
    419 if not isinstance(value, np.ndarray):
    420     raise TypeError(
    421         f"Value must be a NumPy array. Got type {type(value)}."
    422     )
--> 423 self.create_image(key, value)

File ~/Documents/iohub/iohub/ngff.py:473, in Position.create_image(self, name, data, chunks, transform, check_shape)
    470 if check_shape:
    471     self._check_shape(data.shape)
    472 img_arr = ImageArray(
--> 473     self._group.array(
    474         name, data, chunks=chunks, **self._storage_options
    475     )
    476 )
    477 self._create_image_meta(img_arr.basename, transform=transform)
    478 return img_arr
...
    120     msg = "Codec does not support buffers of > {} bytes".format(max_buffer_size)
--> 121     raise ValueError(msg)
    123 return arr

ValueError: Codec does not support buffers of > 2147483647 bytes

The next thing I tried was iohub.ngff.Position.create_image() with chunks=(1, 1, 1, 2048, 2048); it took a while to write but succeeded.

position = merged_dataset.create_position("position", str(p), "0")
chunks = (1, 1, 1, 2048, 2048)
position.create_image("0", data=img_normalized, chunks=chunks)

Array-like API for NGFF I/O

In contrast to the readers/writers design inherited from waveorder.io, a 'dataset' container that offers an array-like API is more ergonomic and Pythonic for NGFF datasets. It is also more maintainable, since the data and metadata models can stay in one place.

Pseudo code aggregated from our offline discussion:

from iohub.ngff import HCSZarr

from my_analysis import fancy_transformation_plot
from my_microscope import camera

with HCSZarr.open("./very_important.zarr", mode="r") as read_only_dataset:
    # Get a copy of data in RAM and prototype analysis
    # This syntax should be familiar for PyTorch users
    tczyx_example = read_only_dataset["A/1/0/0"].numpy()
    fancy_transformation_plot(tczyx_example)

with HCSZarr.open("./acquisition.zarr", mode="a") as write_and_create_ok_dataset:
    # Writing to a slice of the Zarr array
    new_position = write_and_create_ok_dataset["H"]["12"][5 + 1]
    new_position[0][0, 1, 2, :, :] = camera.next_frame()

`README` example does not run; opens in read-only by default

In iohub's README we have:

with open_ome_zarr("20200812-CardiomyocyteDifferentiation14-Cycle1.zarr") as dataset:
    ...
    new_fov = dataset.create_position("A", "1", "0")  # creates a new fov
    ...

which fails because by default open_ome_zarr opens in read-only mode.

Adding mode='r+' works, but it then modified the dataset I downloaded from the web and failed on subsequent runs (since the position already existed after the first run).

Maybe a more user-friendly README example should read the example dataset, write to a new file, then read+append to that new file?
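
For example (a sketch only; the demo store name and channel name are illustrative):

import numpy as np

from iohub import open_ome_zarr

# Read the downloaded example dataset without modifying it.
with open_ome_zarr(
    "20200812-CardiomyocyteDifferentiation14-Cycle1.zarr", mode="r"
) as dataset:
    dataset.print_tree()

# Write to a new store; mode="w-" fails loudly if it already exists.
with open_ome_zarr(
    "readme_demo.zarr", layout="hcs", mode="w-", channel_names=["DAPI"]
) as dataset:
    new_fov = dataset.create_position("A", "1", "0")
    new_fov.create_image(
        "0",
        np.random.randint(
            0, np.iinfo(np.uint16).max, size=(1, 1, 1, 32, 32), dtype=np.uint16
        ),
    )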

Benchmarking

We need to set up benchmarking infrastructure that we can run in different contexts, so we can make more educated decisions about performance. I'm inclined to use asv: https://github.com/airspeed-velocity/asv, but first I want to hear your opinions on both the benchmarking framework and what aspects of iohub you wish to see benchmarked.

Update README

Branching off from #1.

We should include additional information about the repository in the README file, including:

  • A more detailed explanation of the scope and purpose of iohub
  • Quick start guide
  • Contact information

Please suggest any additional content here.

`get_zarr` returns incorrect data from 3D + 2D MM datasets

Here's a minimal demonstration of the issue:

>>> from iohub.reader import imread
>>> reader = imread('/hpc/projects/compmicro/rawdata/other/NIC QLIPP/2022_12_01 PtK cells/FOV1_1')

>>> reader.get_zarr(0)[0,-1,:,500,500] # get the Z profile over the last channel
array([28992, 48564, 22054,     0,     0,     0,     0,     0, 11040,
       19283, 20553,  8202,  8224, 21536, 30821, 10356, 11824, 11317,
       12320,  8236,   631, 10535,  8202,  8224, 15904, 15934, 28704,
       29804, 29486, 28520, 10359,  8233,  8992, 25632, 25455, 25972,
       29811,  8250, 21291, 18763,   224], dtype=uint16)

>>> reader.get_zarr(0)[0,-1,:,500,500] # repeated call gives a different result!
array([21184, 48542, 22054,     0,     0,     0,     0,     0,  2570,
       25939,  8293, 27713, 28531, 11530, 11565, 11565, 11565,  2605,
       27750, 24943,   631, 28528, 25975,  8306,  8250, 28528, 25975,
        8306, 30054, 25454, 26996, 28271, 29728, 24936,  8308, 29296,
       28015, 29807, 29541, 26912,   224], dtype=uint16)

>>> reader.get_array(0)[0,-1,:,500,500] # get_array gives the expected result
array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0, 631,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0], dtype=uint16)

Notes

  • This dataset is a 3D + 2D dataset from MM...only the 2D channel fails on repeated get_zarr calls
  • To see the correct result, use get_array, open the dataset in MM, or open with napari's builtin reader
  • I tested this issue on the converter branch with the new imread function, and I also confirmed that it was inherited from the WaveorderReader.

Symptoms:

  • recOrder's view command uses get_zarr (currently the WaveorderReader version, with the same behaviour) and behaves erratically when you scroll through Z in napari.
  • Tagging @ieivanov because this issue affects the spindle data from the NIC. This issue is not blocking (scripts can use get_array for now), but it is a strong inconvenience since it blocks many of the napari viewing conveniences that we're used to.

Improve TIFF I/O stack

For now we use a combination of tifffile, custom code (multi-page OME-TIFF), and ndtiff (pycromanager NDTIFF) to read TIFF-based formats. Given that tifffile also supports NDTIFF, there may be room to simplify the reading stack.

Also, we should investigate whether tifffile and AICSImageIO have improved their OME-TIFF reading performance.

@nclack also mentioned that there are additional strategies we can explore to accelerate TIFF I/O.

zarr store written by iohub.writer doesn't seem to be compliant with OME-HCS format

Using the napari-ome-zarr plugin to view the dataset written by the writer module causes an error that suggests missing metadata:
napari --plugin napari-ome-zarr /hpc/projects/CompMicro/rawdata/hummingbird/Janie/2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3.zarr

causes the following error:

ERROR Failed to load Row_0/Col_271/0/0
Traceback (most recent call last):
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/mapping.py", line 137, in __getitem__
    result = self.fs.cat(k)
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/spec.py", line 811, in cat
    return self.cat_file(paths[0], **kwargs)
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/spec.py", line 710, in cat_file
    with self.open(path, "rb", **kwargs) as f:
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/spec.py", line 1094, in open
    f = self._open(
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/implementations/local.py", line 175, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/implementations/local.py", line 273, in __init__
    self._open()
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/implementations/local.py", line 278, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/projects/CompMicro/rawdata/hummingbird/Janie/2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3.zarr/Row_0/Col_271/0/0/.zarray'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/zarr/storage.py", line 1376, in __getitem__
    return self.map[key]
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/mapping.py", line 141, in __getitem__
    raise KeyError(key)
KeyError: 'Row_0/Col_271/0/0/.zarray'

This .zarray json file is found in the reference data. It can be viewed using:
napari --plugin napari-ome-zarr /local/scratch/groups/cmanalysis.grp/zarr_ref_datasets/2551.zarr/

I copied the reference dataset with the following aws cli command:
aws --no-sign-request --endpoint-url=https://uk1s3.embassy.ebi.ac.uk/ s3 cp --recursive --exclude=".git/*" s3://idr/zarr/v0.1/plates/2551.zarr /local/scratch/groups/cmanalysis.grp/zarr_ref_datasets/2551.zarr

Viewing multiple positions per well

Per our discussion @ziw-liu @ieivanov @talonchandler:

Our experience and thoughts

  • There is a need to properly display multiple positions within a well in napari. Currently, the napari-ome-zarr plugin does not display multiple positions within a well; it only loads position 0.
  • Additionally, dragging and dropping a well with multiple positions triggers the behavior in #75, which duplicates the layer names once per position in the well. This yields a single slider for the position, making it difficult to see or scan through the whole plate.
  • One solution is to write tiled datasets as in #73, where we stitch multiple positions into a single tiled array. This can be written to Data.zarr/Row0/Col0/position/0 and will display properly in napari-ome-zarr. We need to store in the metadata where each chunked (Y, X) tile came from, to preserve the original position from the acquisition.

The expected behavior we want

  • Easy method to drag and drop a well with multiple positions
  • Implement a pyramid structure for viewing the whole plate
  • We discussed the potential to add a custom viewer to get us rolling.

Please feel free to add more to this if I missed any points.

Specify zarr-python version

> Thanks for the help @ziw-liu. My `zarr` was `2.12` (not sure why? I installed that environment fairly recently). Should we specify `zarr >=2.12`?

Everything's running now and looking good. I'll continue to test.

Originally posted by @talonchandler in #76 (comment)

Package metadata

The current package metadata in setup.cfg is inherited from waveorder and has seen only minor revisions. As iohub approaches a PyPI release, we need to agree on updates to this information. (See the comments in the code blocks below.)

Author and email:

author = Computational Microscopy Platform, CZ Biohub  # change to CZ Biohub?
author_email = [email protected].  # does this need to change?

Classifiers:

classifiers =
    Development Status :: 4 - Beta  # Alpha?
    Intended Audience :: Science/Research
    License :: OSI Approved :: BSD License  # is 3-clause the original BSD?
    Programming Language :: Python :: 3 :: Only
    Topic :: Scientific/Engineering
    Topic :: Scientific/Engineering :: Image Processing
    Topic :: Scientific/Engineering :: Visualization  # remove
    Topic :: Scientific/Engineering :: Information Analysis  # remove? this is too generic
    Topic :: Scientific/Engineering :: Bio-Informatics  # remove
    Operating System :: Microsoft :: Windows
    Operating System :: POSIX
    Operating System :: Unix
    Operating System :: MacOS
    # any additional classifiers? 

@mattersoflight @AhmetCanSolak @JoOkuma would love to have your input on this.

Test workflow is outdated

Currently the GH Action only tests the package on Linux (Ubuntu) and Python 3.7 (which is EOL). We should test on more platforms (Windows and macOS) and newer Python versions (3.8+).

We can also add coverage check and linting to the PR-triggered workflows.

Wrong page indexing for an incomplete dataset

For certain incomplete datasets (mehta-lab/recOrder#320), the current iohub (or waveorder.io) gathers the wrong page number (the first page gets a page number of 511):

0    /hpc/projects/comp_micro/rawdata/falcon/zebraf...
1                                                  511
2                                                  162
Name: (0, 0, 0, 0), dtype: object

The byte offset is also wrong because MicromanagerOmeTiffReader uses hard-coded magic numbers to determine page offsets:

https://github.com/czbiohub/iohub/blob/0fdca8bad97c8a4c6a1b69e54923c576a7e6b44c/iohub/multipagetiff.py#L122-L142

This results in reading the wrong image for the first frame:

>>> reader.get_image(0,0,0,0).mean()
30643.720915794373

Meanwhile, TiffFile('/path/to/first/image/').asarray()[0].mean() returns an average intensity of 194, which is consistent with the MM GUI readout.

identify optimal chunk size for computational imaging, DL, and visualization workflows

When data is stored in locations with significant I/O latency (an on-premise storage server or the cloud), a large number of small files creates excessive overhead during nightly backups, transfers across nodes, and transfers over the internet. We need to decide on a default chunk size for the data that will be read and written by DL workflows such as virtual staining, and test how chunk size affects I/O speed on HPC.

@Christianfoley can you write a test for i/o performance (e.g., reading a batch of 2.5D stacks and writing the batch to a zarr store) using iohub reader and writer, and run the tests on HPC?

@ziw-liu can then evaluate if the chunk size that is optimal for DL workflow works fine for recOrder/waveOrder.

Choices to evaluate (assuming a camera that acquires 2048x2048 images; a timing sketch follows the list):

  • Chunk by image (chunk size = 1x1x2048x2048).
  • Chunk by channel (chunk size = Cx1x2048x2048).
  • Chunk by stack (chunk size = 1xZx2048x2048).
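
As a starting point, a rough timing sketch with plain zarr-python (paths, shapes, and repeat counts are placeholders):

from timeit import timeit

import numpy as np
import zarr

# One 2.5D batch: (C, Z, Y, X)
batch = np.random.randint(0, 2**16, size=(4, 16, 2048, 2048), dtype=np.uint16)

for chunks in [(1, 1, 2048, 2048), (4, 1, 2048, 2048), (1, 16, 2048, 2048)]:

    def write_batch():
        z = zarr.open(
            f"/tmp/chunk_test_{'x'.join(map(str, chunks))}.zarr",
            mode="w",
            shape=batch.shape,
            chunks=chunks,
            dtype=batch.dtype,
        )
        z[:] = batch

    print(chunks, timeit(write_batch, number=3) / 3, "s per write")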

@JoOkuma how sensitive is performance of dexp processing steps to chunk size?

Contribution guide

Branching out from #1:

To help iohub engage our co-workers and the community as a public repo, we need a contribution guide (e.g. a top-level CONTRIBUTION.md document). This document should include guidelines for:

  • What can be contributed (bug reports, feature requests, code, etc)
  • How to set up a dev environment
  • How to open a PR
    • Does every PR need to be from a fork, even if the contributor has push permission?
    • Does every PR need to be associated with an issue?
  • Test coverage and code style

Please suggest other topics to cover, policy choices, and potential templates.

CLI for summarizing and converting between data formats

recOrder provides a CLI for summarizing ND data without loading it, and for converting from OME-TIFF to OME-Zarr. These commands are more broadly useful and should therefore be moved to iohub.

The consensus was to use the click library to implement the CLI; see the sketch below.
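
A minimal sketch of what the click-based CLI could look like (command and argument names are illustrative, not decided):

import click


@click.group()
def cli():
    """N-dimensional image I/O."""


@cli.command()
@click.argument("path", type=click.Path(exists=True))
def info(path):
    """Summarize an ND dataset without loading the pixel data."""
    click.echo(f"Summarizing {path}")


@cli.command()
@click.argument("input_path", type=click.Path(exists=True))
@click.argument("output_path", type=click.Path())
def convert(input_path, output_path):
    """Convert an OME-TIFF dataset to OME-Zarr."""
    click.echo(f"Converting {input_path} -> {output_path}")


if __name__ == "__main__":
    cli()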

test that the size of data in the channel dimension matches the metadata

The HCSZarr dataset class should verify that the size of each dimension matches between the data and the metadata. I was able to write an invalid Zarr dataset that would not open with napari-ome-zarr.

import numpy as np

from iohub.ngff import HCSZarr

# e.g. position_list = [("A", "1", "0"), ("B", "1", "0")]
with HCSZarr.open(
    "hcs_wrong.zarr", mode="a", channel_names=["DAPI", "GFP"]
) as dataset:
    for row, col, fov in position_list:
        position = dataset.create_position(row, col, fov)
        # 3 channels of data, but only 2 channel names in the metadata
        position[0] = np.random.randint(
            0, np.iinfo(np.uint16).max, size=(5, 3, 3, 32, 32), dtype=np.uint16
        )

Originally posted by @mattersoflight in #31 (comment)
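
A sketch of the kind of check this implies, assuming the channel axis sits at index -4 of a TCZYX array:

def check_channel_count(data_shape, channel_names):
    """Raise if the channel dimension disagrees with the channel metadata."""
    n_data = data_shape[-4]
    n_meta = len(channel_names)
    if n_data != n_meta:
        raise ValueError(
            f"Data has {n_data} channels, but metadata lists {n_meta}: "
            f"{channel_names}"
        )

In the example above, check_channel_count((5, 3, 3, 32, 32), ["DAPI", "GFP"]) would raise, since the data has 3 channels while the metadata lists 2.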

Open source iohub

At this stage I think the iohub repository mostly meets our organization's requirements, as well as the general expectations for an early-stage open source project. I propose that we make this repository public once #46 is merged.

Tagging @mattersoflight @AhmetCanSolak @JoOkuma to see if you have any concerns.

Universal entry points

TASKLIST:

  • ABC #132
  • OME-Zarr
  • OME-TIFF
  • ...



> At this point I don't think a universal entry point for all the formats is strictly necessary.

I think that a limited version of a "universal entry point" for internal Biohub formats is one of the core value propositions of iohub. If we require a user to know their data type and its corresponding API to access data/metadata, then I think adoption of iohub will be limited.

The current version of WaveorderReader in waveorder has been valuable to me because it serves as a universal entry point for compmicro's work. For example, when we started reading pycromanager datasets, @ieivanov added the PycromanagerReader class and adapted WaveorderReader to read these datasets with the existing API. I could continue to run WaveorderReader.shape and WaveorderReader.get_array with predictable results, and any unpredictable results were considered bugs and fixed.

I understand that there may be difficulty in the details. For example, PycromanagerReader.get_zarr might be incorrectly/poorly named, and some dataset types might support different lazy-loading operations (maybe .get_zarr is not one of the universal entry points?). But I still think that all of our datasets share some common properties and operations that users will want to perform regularly. I think this list of operations consists of:

  • what is the shape of the data in this dataset? (.shape)
  • what is its basic metadata, if available? (.channel_names, .z_step_size, .dtype, etc.)
  • what are the actual values in this dataset? (.get_array)

Can we have a universal entry point that supports these operations?

Originally posted by @talonchandler in #31 (comment)

This deserves a dedicated discussion IMHO, which is why I'm creating a separate issue here.

Moving forward I think we should find ways to engineer a solution that provides a more unified API to the end user. I think this is crucial, particularly for the adoption of iohub across the Biohub. Providing the list of operations above (by @talonchandler) on each dataset is a great start, but I think there is more we can do:

  • Currently we are implementing readers and writers, but perhaps we can engineer a single dataset class that handles reading and writing appropriately, and helps us remove redundancies like creating writers from readers.
  • We can provide functional APIs like imread/imwrite in skimage.io to help scripting users adopt iohub more quickly.
  • We can provide a separate metadata API, enabling iohub to be used to inspect imaging datasets before reading them.

This is merely a starting point for the universal entry point discussion; looking forward to reading everyone's input.
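
As a concrete starting point, the shared operations could be expressed as a structural protocol that every format-specific reader satisfies. This is only a sketch; the names mirror the operation list above and are not a settled API.

from typing import Protocol

import numpy as np


class DatasetReader(Protocol):
    """Minimal surface shared by all format-specific readers."""

    @property
    def shape(self) -> tuple:
        """Shape of the data, e.g. (T, C, Z, Y, X)."""
        ...

    @property
    def channel_names(self) -> list:
        """Basic metadata, where available."""
        ...

    @property
    def dtype(self) -> np.dtype:
        ...

    def get_array(self, position: int) -> np.ndarray:
        """Load the actual values for one position into RAM."""
        ...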

Attempting to read old dataset with legacy ZarrReader

I was attempting to use iohub's ImageReader and ZarrReader in place of WaveorderReader, and the output data comes out as NaN or the read fails. The dataset linked below can be read properly using waveorder.io.WaveorderReader(), but fails with both iohub readers.

from iohub.reader import ImageReader

data_folder = "/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs/intoto_casper_short_1_2023_02_08_110049.zarr"

reader = ImageReader(data_folder)

Subsequently @ziw-liu and I looked at the dataset's reader.reader.root.store._dimension_separator, determining that I needed to use the legacy ZarrReader, and ran the following code:

from iohub.zarrfile import ZarrReader
reader = ZarrReader(data_folder, version="0.1")

This returned the following error:

[WARNING: zarrfile:  33 2023-02-17 16:14:36,685] `iohub.zarrfile.ZarrReader` is deprecated and will be removed in the future. For OME-NGFF (OME-Zarr) v0.4 datasets please use `iohub.ngff.open_ome_zarr` instead.
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
File ~/Documents/iohub/iohub/zarrfile.py:49, in ZarrReader.__init__(self, store_path, version)
     47     dimension_separator = "/"
     48 self.store = zarr.DirectoryStore(
---> 49     store_path, dimension_separator=dimension_separator
     50 )
     51 self.root = zarr.open(self.store, "r")

UnboundLocalError: local variable 'dimension_separator' referenced before assignment

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
Cell In[20], line 1
----> 1 reader3 = ZarrReader(dataset_folder, version="0.1")

File ~/Documents/iohub/iohub/zarrfile.py:53, in ZarrReader.__init__(self, store_path, version)
     51     self.root = zarr.open(self.store, "r")
     52 except Exception:
---> 53     raise FileNotFoundError("Supplies path is not a valid zarr root")
     54 try:
     55     row = self.root[list(self.root.group_keys())[0]]

FileNotFoundError: Supplies path is not a valid zarr root

Initial setup

After migrating the io module of waveorder (at mehta-lab/waveorder@5f60f0a) to this new repository, the next steps may include:

  • Reach a consensus on the scope and modularity of this project, and draft the README and CONTRIBUTION documents accordingly. #39 #32 #36
  • Fix references and the directory structure so that the existing features work and the tests pass. #4 #46
  • Discuss and implement the abstraction that suits the needs of our users (e.g. an iohub.Dataset class that offers an array-like interface), and make sure the existing feature set works under it. #31 #40 #132
  • Package the library and publish it on PyPI so other packages can depend on it. #55
  • Expand the readers and writers to support more data formats. #99 #114

Each of these can be elaborated/debated in spin-off issues.

@mattersoflight @JoOkuma @royerloic @talonchandler @Christianfoley please feel free to add to or modify these objectives.

Should `ngff.ImageArray` have a different name?

From #45:

@mattersoflight suggested that we should rename ngff.ImageArray:

Rename the ImageArray object to ZarrArray to make it obvious that the object is a zarr array.

In naming it ImageArray I wanted to imply that this is not just a zarr.Array object: it has additional methods/properties that are specific to our image data (e.g. numpy(), channels). Also, we may add more backends in the future (tensorstore), so it may deviate further from the baseline zarr-python object.

Inviting more discussion on this topic so that we can agree on a nomenclature that offers more clarity.

Clarify project name style

When we refer to the project/library in documentation and other forms of writing, a unified visual style is ideal.

Some options:

  1. iohub (plain lowercase)
  2. `iohub` (lowercase, code style)
  3. *iohub* (lowercase, italic)
  4. IOHub

Improve API for the NGFF module

  • Repurpose the nomenclature from our dependencies:
    • Accessing the 5D arrays (zarr arrays) via position['key'] syntax is not intuitive. Let's implement position.data (set or get property) syntax, which mimics napari's syntax. #61
    • Rename the ImageArray object to ZarrArray to make it obvious that the object is a zarr array. #60
  • Make it intuitive to read and write tiled acquisitions: #59
    • Introduce a tiles or tiled layout. I picked these words in favor of multiPos or multiFOV, but I think either of them is a decent choice.
  • Make it intuitive to over-write channels: #56 #70
    • __setitem__ (position[array_name][:, channel_index] = array) is too complex a usage; let's provide a delete_channel or overwrite_channel wrapper. delete_channel is more broadly useful.

Originally posted by @mattersoflight in #31 (review)

Update repo description (About)

The current description is:

Library for reading and writing ND imaging data with metadata in ome-zarr and ome-tiff formats

And I would like to propose a rewording:

N-dimensional bioimaging data I/O with OME metadata in Python

And we should consider adding several 'Topics' (tags) for keyword queries. Some candidates are:
python microscopy bioimaging imageio image-processing image-metadata scientific-computing

writing to single OME-zarr store from multiple processes

We need to be able to read & write a single OME-zarr store or chunked-TIFF store from multiple processes.

The ND image data supported by iohub is chunked and therefore amenable to multiprocessing, but metadata is in a serial format and is best written from a single process. The consensus was that iohub should support creating an empty dataset to which metadata can be written; image data can be written afterward, as in the sketch below.
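
A sketch of that workflow with multiprocessing; create_zeros is a hypothetical API for pre-allocating an empty array, and the store layout follows the examples above:

from multiprocessing import Pool

import numpy as np

from iohub import open_ome_zarr

STORE = "mp_demo.zarr"
SHAPE = (1, 1, 8, 256, 256)  # TCZYX


def write_slice(z):
    # Each worker opens the existing store and writes a disjoint chunk.
    with open_ome_zarr(STORE, mode="r+") as dataset:
        dataset["A"]["1"]["0"]["0"][0, 0, z] = np.full(
            SHAPE[-2:], z, dtype=np.uint16
        )


if __name__ == "__main__":
    # Single process: create the store, all metadata, and an empty array.
    with open_ome_zarr(
        STORE, layout="hcs", mode="w-", channel_names=["BF"]
    ) as dataset:
        fov = dataset.create_position("A", "1", "0")
        fov.create_zeros("0", shape=SHAPE, dtype=np.uint16)  # hypothetical

    # Workers write image data afterward.
    with Pool(4) as pool:
        pool.map(write_slice, range(SHAPE[2]))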

Dev pre-release on PyPI

A pre-release on PyPI can ease testing for downstream projects such as recOrder and microDL, and help us navigate PyPI issues before the 'stable' release. Once we have most of the 0.1.0 features ready we can tag a v0.1.0dev0 and use this issue to track the process.

add support for writing RGB data

We often need to export ND color overlays. @edyoshikun and I ran into this when trying to analyze zebrafish data.
I experimented and found the following:

  • RGB data can be written like any other channel (ome/napari-ome-zarr#57); see the sketch after this list.
  • The napari-ome-zarr plugin requires appropriate channel metadata to render the RGB overlays.
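
On the writing side, a sketch using the same API as the examples above (store path and sizes are illustrative):

import numpy as np

from iohub import open_ome_zarr

with open_ome_zarr(
    "rgb_demo.zarr",
    layout="hcs",
    mode="w-",
    channel_names=["Red", "Green", "Blue"],
) as dataset:
    fov = dataset.create_position("0", "0", "0")
    # R, G, and B stored as three ordinary channels of a TCZYX array
    rgb_tczyx = np.random.randint(
        0, np.iinfo(np.uint16).max, size=(1, 3, 1, 256, 256), dtype=np.uint16
    )
    fov.create_image("0", rgb_tczyx)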

Example FOVs:
/Volumes/comp_micro/projects/automation/dataFormats/ome_zarr_demo/test_astro_RGB

/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs/hsv_pos7.zarr/positions/0/0/

Metadata needed in .zattrs (under the omero key):

 "channels": [
            {
                "active": true,
                "color": "FF0000",
                "label": "Red",
                "window": {
                    "end": 0.5,
                    "start": 0.0
                }
            },
            {
                "active": true,
                "color": "00FF00",
                "label": "Green",
                "window": {
                    "end": 0.5,
                    "start": 0.0
                }
            },
            {
                "active": true,
                "color": "0000FF",
                "label": "Blue",
                "window": {
                    "end": 0.5,
                    "start": 0.0
                }
            }
        ],

The image appears as follows with napari-ome-zarr:

(screenshot: RGB overlay rendered by napari-ome-zarr)
