
iohub's Issues

support multi-pos datasets that are not complete multi-well plate acquisitions

There is an apparent gap between the data formats specified by OME-NGFF and supported by the napari-ome-zarr plugin: they specify/support single-FOV ND arrays or multi-FOV ND arrays acquired in a multi-well plate, but not simple multi-position acquisitions.

We can represent this data as a multi-well dataset to leverage the napari-ome-zarr plugin for visualization.

This topic requires a discussion among @ziw-liu , @Soorya19Pradeep , @jrbyrum13 , @Christianfoley , @ieivanov, @talonchandler who either acquire or analyze multi-pos datasets.

I see two options:

  1. Have a single column, with row names = position names in the Micro-Manager acquisition.
  2. Map the data to a square grid to make it easy to visualize with the napari plugin, e.g., if we have 100 FOVs, sort them into 10 rows and 10 columns (see the sketch below).
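
A minimal sketch of option 2, assuming positions are named in acquisition order (the row/column naming scheme here is illustrative):

import math
import string


def grid_layout(position_names):
    """Map N position names onto a near-square (row, column) grid."""
    ncols = math.ceil(math.sqrt(len(position_names)))
    return {
        name: (string.ascii_uppercase[i // ncols], str(i % ncols + 1))
        for i, name in enumerate(position_names)
    }


# 100 FOVs -> rows A-J, columns 1-10
layout = grid_layout([f"Pos{i}" for i in range(100)])
assert layout["Pos99"] == ("J", "10")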

Votes & Ideas?

Enforcing uniform data shape in a single OME-Zarr HCS dataset

When implementing the OME-Zarr HCS features, it became evident that there is a gap between what the specification allows (or does not explicitly prohibit) and what is often convenient to interact with: the non-uniformity of the ZYX shape (in pseudo-code: array.shape[-3:]) in a single dataset.

In most use cases, it is convenient to have a uniform ZYX shape, and this opens up possibilities for straightforward implementations of data summarization and parallel computation. As far as the official implementation goes, shape uniformity is implied by how the reader attempts to gather all the FOVs into a single Dask array. However, enforcing it would disable support for experiments where images of the same physical object come off multiple acquisition arms (e.g. bright field and light sheet).

During an offline meeting, @mattersoflight, @ieivanov and others agreed that we can store non-uniformly shaped images in separate Zarr stores before resampling to the same coordinate system.

@JoOkuma @AhmetCanSolak What are your thoughts?

iohub needs a `convert` CLI command

iohub needs a simple CLI interface for converting files into the spec-adherent formats that it writes.

recOrder has a simple convert command that depends on ZarrConverter, which in turn depends on WaveorderReader and WaveorderWriter. I think that many parts of the WaveorderReader can be reused for iohub's converter, but we have decided not to depend on the WaveorderWriter. Therefore, I think the following steps are needed before we're ready for an iohub convert PR:

  • We've decided which parts of WaveorderReader we're going to reuse
  • The ZarrConverter class has been modified for compatibility with the new writer (and possibly modified readers)

@ziw-liu and others, I would appreciate your thoughts on both of these steps.

Unable to apply CoordinateTransformations with napari-ome-zarr

I am trying to use the coordinateTransformation metadata from iohub.ngff.create_image() and am seeing a couple of issues that are possibly related, so I'm keeping them in the same issue.

The first issue, which might be more of a napari issue, is that we cannot apply individual coordinate transforms to multiple images within a position. Running the code below throws the error that follows, which suggests napari expects some sort of pyramidal (multiscale) dataset.

import numpy as np

from iohub import open_ome_zarr
from iohub.ngff_meta import TransformationMeta  # import paths may vary by iohub version

store_path = "/hpc/projects/comp_micro/sandbox/Ed/tmp/"
store_path = store_path + "test_translate" + ".zarr"

tczyx_1 = np.random.randint(
    0, np.iinfo(np.uint16).max, size=(1, 3, 3, 32, 32), dtype=np.uint16
)
tczyx_2 = np.random.randint(
    0, np.iinfo(np.uint16).max, size=(1, 3, 3, 32, 32), dtype=np.uint16
)
coords_shift = [1.0, 1.0, 1.0, 100.0, 100.0]
coords_shift2 = [1.0, 1.0, 1.0, -100.0, 1.0]
scale_val = [1.0, 1.0, 1.0, 0.5, 0.5]
translation = TransformationMeta(type="translation", translation=coords_shift)
scaling = TransformationMeta(type="scale", scale=scale_val)
translation2 = TransformationMeta(type="translation", translation=coords_shift2)

with open_ome_zarr(
    store_path,
    layout="hcs",
    mode="w-",
    channel_names=["DAPI", "GFP", "Brightfield"],
) as dataset:
    # Create and write to positions.
    # This affects the tile arrangement in visualization.
    position = dataset.create_position(0, 0, 0)
    position.create_image("0", tczyx_1, transform=[translation])
    position = dataset.create_position(0, 1, 0)
    position.create_image("0", tczyx_2, transform=[scaling])
    # Print dataset summary
    dataset.print_tree()

Error:

File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/bin/napari", line 8, in <module>
    sys.exit(main())
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/__main__.py", line 561, in main
    _run()
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/__main__.py", line 341, in _run
    viewer._window._qt_viewer._qt_open(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/_qt/qt_viewer.py", line 830, in _qt_open
    self.viewer.open(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 1014, in open
    self._add_layers_with_plugins(
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 1242, in _add_layers_with_plugins
    added.extend(self._add_layer_from_data(*_data))
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 1316, in _add_layer_from_data
    layer = add_method(data, **(meta or {}))
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/utils/migrations.py", line 44, in _update_from_dict
    return func(*args, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/components/viewer_model.py", line 823, in add_image
    layerdata_list = split_channels(data, channel_axis, **kwargs)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/layers/utils/stack_utils.py", line 79, in split_channels
    multiscale, data = guess_multiscale(data)
  File "/hpc/mydata/eduardo.hirata/.conda/envs/pyplay/lib/python3.10/site-packages/napari/layers/image/_image_utils.py", line 76, in guess_multiscale
    raise ValueError(
ValueError: Input data should be an array-like object, or a sequence of arrays of decreasing size. Got arrays of single shape: (1, 3, 3, 32, 64)

Now, if we remove the second image (position.create_image("1", tczyx_2)) and open the positions in napari (napari --plugin napari-ome-zarr test_translate.zarr), we see the two 32x32 test images as expected, but without the transformations applied to them.

(screenshot: both positions rendered without the transforms applied)

If we drag and drop the different positions separately into napari, we can see that the scaling and the translation are each applied, but now we have to use the position slider to move between them.

(screenshots: each position loaded individually, with its scale or translation applied)

tensorstore support

Tensorstore could be used as the Zarr backend.

A few advantages:

  • It's faster.
  • It supports transposes and other transformations of the coordinate space without loading the data.
  • It loads the data lazily, reading voxels only when requested (e.g. through np.asarray); see the sketch below.
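
A minimal sketch of lazy reading with TensorStore, assuming a local Zarr array (the path is illustrative):

import numpy as np
import tensorstore as ts

# Open one Zarr array; this reads metadata only, not voxels.
dataset = ts.open(
    {
        "driver": "zarr",
        "kvstore": {"driver": "file", "path": "plate.zarr/A/1/0/0"},
    },
    read=True,
).result()

# Indexing produces a lazy view over the coordinate space.
view = dataset[0, 0]

# Voxels are only read when the view is materialized.
volume = np.asarray(view)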

Performance comparison between tifffile and iohub's custom OME-TIFF implementation

A custom OME-TIFF reader (iohub.multipagetiff.MicromanagerOmeTiffReader) was implemented because, historically, tifffile and AICSImageIO were slow when reading large OME-TIFF series generated by Micro-Manager acquisitions.

While debugging #65 I found that this implementation does not guarantee data integrity during reading. Before investing more time in fixing it, I think it is worth revisiting whether maintaining a custom OME-TIFF reader is worthwhile, given that the more widely adopted solutions have evolved since waveorder.io was designed.

Here is a simple read speed benchmark of tifffile and iohub's custom reader:

(benchmark plot: read times for tifffile vs. iohub's custom reader)

The test was done on a 123 GB dataset with PTCZYX = (8, 9, 3, 81, 2048, 2048) dimensions. Voxels from 2 non-sequential positions were read into RAM in each iteration (N=5).

Test script:

Environment: Python 3.10.8, Linux 4.18 (x86_64, AMD EPYC [email protected])

# %%
import os
from timeit import timeit
import zarr
import pandas as pd

# readers tested
from tifffile import TiffSequence  # 2023.2.3
from iohub.multipagetiff import MicromanagerOmeTiffReader  # 0.1.dev368+g3d62e6f


# %%
# 123 GB total
DATASET = (
    "/hpc/projects/comp_micro/rawdata/hummingbird/Soorya/"
    "2022_06_27_A549cellMembraneStained/"
    "A549_CellMaskDye_Well1_deltz0.25_63X_30s_2framemin/"
    "A549_CellMaskdye_Well1_30s_2framemin_1"
)

POSITIONS = (2, 0)


# %%
def read_tifffile():
    sequence = TiffSequence(os.scandir(DATASET))
    data = zarr.open(sequence.aszarr(), mode="r")
    for p in POSITIONS:
        _ = data[p]
    sequence.close()


# %%
def read_custom():
    reader = MicromanagerOmeTiffReader(DATASET)
    for p in POSITIONS:
        _ = reader.get_array(p)


# %%
def repeat(n=5):
    tf_times = []
    wo_times = []
    for _ in range(n):
        tf_times.append(
            timeit(
                "read_tifffile()",
                number=1,
                setup="from __main__ import read_tifffile",
            )
        )
        wo_times.append(
            timeit(
                "read_custom()",
                number=1,
                setup="from __main__ import read_custom",
            )
        )
    return pd.DataFrame({"tifffile": tf_times, "waveorder": wo_times})


# %%
timings = repeat()

At least in this test, the latest tifffile consistently outperforms the iohub implementation. While a comprehensive benchmark will take more time (#57), as long as a widely used library is not significantly slower, the reduced maintenance overhead and broader user testing make a strong case for reconsidering whether to keep the custom code in iohub.

Chunking option for writing dataset

I was trying to write an array into a newly created position and got an error during the write due to the size of the np.array(), which is (1, 1, 201, 2048, 2048) with dtype=np.float64. This error is related to the chunk size: we don't currently have an option to select the desired chunk size, and it defaults to the ZYX shape. A single ZYX chunk of this array is 201 x 2048 x 2048 x 8 bytes ≈ 6.7 GB, which exceeds the codec's 2147483647-byte (2 GiB) buffer limit.

merged_dataset = open_ome_zarr(store_path, mode="r+")
position = merged_dataset.create_position("position", str(p), "0")
position["0"] = img_normalized
merged_dataset.close()

ERROR:

ValueError                                Traceback (most recent call last)
/home/eduardo.hirata/Documents/compmicro-sandbox/zebrafishInfection/ed/20230202_casper_intoto_short1_QLIPP.py in line 3
      179 #%%
      180 position = merged_dataset.create_position("position",str(p),"0")
----> 181 position["0"] = img_normalized

File ~/Documents/iohub/iohub/ngff.py:423, in Position.__setitem__(self, key, value)
    419 if not isinstance(value, np.ndarray):
    420     raise TypeError(
    421         f"Value must be a NumPy array. Got type {type(value)}."
    422     )
--> 423 self.create_image(key, value)

File ~/Documents/iohub/iohub/ngff.py:473, in Position.create_image(self, name, data, chunks, transform, check_shape)
    470 if check_shape:
    471     self._check_shape(data.shape)
    472 img_arr = ImageArray(
--> 473     self._group.array(
    474         name, data, chunks=chunks, **self._storage_options
    475     )
    476 )
    477 self._create_image_meta(img_arr.basename, transform=transform)
    478 return img_arr
...
    120     msg = "Codec does not support buffers of > {} bytes".format(max_buffer_size)
--> 121     raise ValueError(msg)
    123 return arr

ValueError: Codec does not support buffers of > 2147483647 bytes

The next thing I tried was iohub.ngff.Position.create_image() with chunks=(1, 1, 1, 2048, 2048); it took a while to write but succeeded.

position = merged_dataset.create_position("position", str(p), "0")
chunks = (1, 1, 1, 2048, 2048)
position.create_image("0", data=img_normalized, chunks=chunks)

Array-like API for NGFF I/O

In contrast to the readers/writers design inherited from waveorder.io, a 'dataset' container that offers an array-like API is more ergonomic and Pythonic for NGFF datasets. It is also more maintainable, since the data and metadata models can stay in one place.

Pseudo code aggregated from our offline discussion:

from iohub.ngff import HCSZarr

from my_analysis import fancy_transformation_plot
from my_microscope import camera

with HCSZarr.open("./very_important.zarr", mode="r") as read_only_dataset:
    # Get a copy of data in RAM and prototype analysis
    # This syntax should be familiar for PyTorch users
    tczyx_example = read_only_dataset["A/1/0/0"].numpy()
    fancy_transformation_plot(tczyx_example)

with HCSZarr.open("./acquisition.zarr", mode="a") as write_and_create_ok_dataset:
    # Writing to a slice of the Zarr array
    new_position = write_and_create_ok_dataset["H"]["12"][5 + 1]
    new_position[0][0, 1, 2, :, :] = camera.next_frame()

`README` example does not run; opens in read-only by default

In iohub's README we have:

with open_ome_zarr("20200812-CardiomyocyteDifferentiation14-Cycle1.zarr") as dataset:
    ...
    new_fov = dataset.create_position("A", "1", "0")  # creates a new fov
    ...

which fails because by default open_ome_zarr opens in read-only mode.

Adding mode='r+' works, but it then modified the dataset I downloaded from the web and failed on subsequent runs (since the position already existed after the first run).

Maybe a more user-friendly README example should read the example dataset, write to a new file, then read+append to that new file?
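
For example (a sketch only; the demo store name and channel name are illustrative):

import numpy as np

from iohub import open_ome_zarr

# Read the downloaded example dataset without modifying it.
with open_ome_zarr(
    "20200812-CardiomyocyteDifferentiation14-Cycle1.zarr", mode="r"
) as dataset:
    dataset.print_tree()

# Write to a new store; mode="w-" fails loudly if it already exists.
with open_ome_zarr(
    "readme_demo.zarr", layout="hcs", mode="w-", channel_names=["DAPI"]
) as dataset:
    new_fov = dataset.create_position("A", "1", "0")
    new_fov.create_image(
        "0",
        np.random.randint(
            0, np.iinfo(np.uint16).max, size=(1, 1, 1, 32, 32), dtype=np.uint16
        ),
    )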

Benchmarking

We need to set up benchmarking infrastructure that we can run in different contexts, so we can make more educated decisions about performance. I'm inclined to use asv: https://github.com/airspeed-velocity/asv, but first I want to hear your opinions on both the benchmarking framework and what aspects of iohub you wish to see benchmarked.

Update README

Branching off from #1.

We should include additional information about the repository in the README file, including:

  • A more detailed explanation of the scope and purpose of iohub
  • Quick start guide
  • Contact information

Please suggest any additional content here.

`get_zarr` returns incorrect data from 3D + 2D MM datasets

Here's a minimal demonstration of the issue:

>>> from iohub.reader import imread
>>> reader = imread('/hpc/projects/compmicro/rawdata/other/NIC QLIPP/2022_12_01 PtK cells/FOV1_1')

>>> reader.get_zarr(0)[0,-1,:,500,500] # get the Z profile over the last channel
array([28992, 48564, 22054,     0,     0,     0,     0,     0, 11040,
       19283, 20553,  8202,  8224, 21536, 30821, 10356, 11824, 11317,
       12320,  8236,   631, 10535,  8202,  8224, 15904, 15934, 28704,
       29804, 29486, 28520, 10359,  8233,  8992, 25632, 25455, 25972,
       29811,  8250, 21291, 18763,   224], dtype=uint16)

>>> reader.get_zarr(0)[0,-1,:,500,500] # repeated call gives a different result!
array([21184, 48542, 22054,     0,     0,     0,     0,     0,  2570,
       25939,  8293, 27713, 28531, 11530, 11565, 11565, 11565,  2605,
       27750, 24943,   631, 28528, 25975,  8306,  8250, 28528, 25975,
        8306, 30054, 25454, 26996, 28271, 29728, 24936,  8308, 29296,
       28015, 29807, 29541, 26912,   224], dtype=uint16)

>>> reader.get_array(0)[0,-1,:,500,500] # get_array gives the expected result
array([  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0, 631,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0], dtype=uint16)

Notes

  • This dataset is a 3D + 2D dataset from MM...only the 2D channel fails on repeated get_zarr calls
  • To see the correct result, use get_array, open the dataset in MM, or open with napari's builtin reader
  • I tested this issue on the converter branch with the new imread function, and I also confirmed that it was inherited from the WaveorderReader.

Symptoms:

  • recOrder's view command uses get_zarr (currently the WaveorderReader version, with the same behaviour) and behaves erratically when you scroll through Z in napari.
  • Tagging @ieivanov because this issue affects the spindle data from the NIC. This issue is not blocking (scripts can use get_array for now), but it is a strong inconvenience since it blocks many of the napari viewing conveniences that we're used to.

Improve TIFF I/O stack

For now we use a combination of tifffile, custom code (multi-page OME-TIFF), and ndtiff (pycromanager NDTIFF) to read TIFF-based formats. Given that tifffile also supports NDTIFF, there may be room to simplify the reading stack.

Also, we should investigate whether tifffile and AICSImageIO have improved their OME-TIFF reading performance.

@nclack also mentioned that there are additional strategies we can explore to accelerate TIFF I/O.

zarr store written by iohub.writer doesn't seem to be compliant with OME-HCS format

Using the napari-ome-zarr plugin to view the dataset written by the writer module causes an error that suggests missing metadata:
napari --plugin napari-ome-zarr /hpc/projects/CompMicro/rawdata/hummingbird/Janie/2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3.zarr

causes the following error:

ERROR Failed to load Row_0/Col_271/0/0
Traceback (most recent call last):
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/mapping.py", line 137, in __getitem__
    result = self.fs.cat(k)
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/spec.py", line 811, in cat
    return self.cat_file(paths[0], **kwargs)
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/spec.py", line 710, in cat_file
    with self.open(path, "rb", **kwargs) as f:
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/spec.py", line 1094, in open
    f = self._open(
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/implementations/local.py", line 175, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/implementations/local.py", line 273, in __init__
    self._open()
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/implementations/local.py", line 278, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/hpc/projects/CompMicro/rawdata/hummingbird/Janie/2022_03_15_orgs_nuc_mem_63x_04NA/all_21_3.zarr/Row_0/Col_271/0/0/.zarray'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/zarr/storage.py", line 1376, in __getitem__
    return self.map[key]
  File "/hpc/user_apps/comp_micro/conda_envs/recorder/lib/python3.9/site-packages/fsspec/mapping.py", line 141, in __getitem__
    raise KeyError(key)
KeyError: 'Row_0/Col_271/0/0/.zarray'

This .zarray json file is found in the reference data. It can be viewed using:
napari --plugin napari-ome-zarr /local/scratch/groups/cmanalysis.grp/zarr_ref_datasets/2551.zarr/

I copied the reference dataset with the following aws cli command:
aws --no-sign-request --endpoint-url=https://uk1s3.embassy.ebi.ac.uk/ s3 cp --recursive --exclude=".git/*" s3://idr/zarr/v0.1/plates/2551.zarr /local/scratch/groups/cmanalysis.grp/zarr_ref_datasets/2551.zarr

Viewing multiple positions per well

Per our discussion @ziw-liu @ieivanov @talonchandler:

Our experience and thoughts

  • There is a need to properly display multiple positions within a well in napari. Currently, the napari-ome-zarr plugin does not display multiple positions within a well; it only loads position 0.
  • Additionally, dragging and dropping a well with multiple positions triggers the behavior in #75, which duplicates the layer names once per position in the well. This yields a single slider for the position, making it difficult to see or scan through the whole plate.
  • One solution is to write tiled datasets as in #73, where we stitch multiple positions into a single tiled array. This can be written to Data.zarr/Row0/Col0/position/0 and will display properly in napari-ome-zarr. We need to store in the metadata where each chunked (Y, X) tile came from, to preserve the original position from the acquisition.

The expected behavior we want

  • Easy method to drag and drop a well with multiple positions
  • Implement a pyramid structure for viewing the whole plate
  • We discussed the potential to add a custom viewer to get us rolling.

Please feel free to add more to this if I missed any points.

Specify zarr-python version

> Thanks for the help @ziw-liu. My `zarr` was `2.12` (not sure why? I installed that environment fairly recently). Should we specify `zarr >=2.12`?

Everything's running now and looking good. I'll continue to test.

Originally posted by @talonchandler in #76 (comment)

Package metadata

The current package metadata in setup.cfg is inherited from waveorder and has seen only minor revisions. As iohub approaches a PyPI release, we need to agree on updates to this information. (See the comments in the code blocks below.)

Author and email:

author = Computational Microscopy Platform, CZ Biohub  # change to CZ Biohub?
author_email = [email protected].  # does this need to change?

Classifiers:

classifiers =
    Development Status :: 4 - Beta  # Alpha?
    Intended Audience :: Science/Research
    License :: OSI Approved :: BSD License  # is 3-clause the original BSD?
    Programming Language :: Python :: 3 :: Only
    Topic :: Scientific/Engineering
    Topic :: Scientific/Engineering :: Image Processing
    Topic :: Scientific/Engineering :: Visualization  # remove
    Topic :: Scientific/Engineering :: Information Analysis  # remove? this is too generic
    Topic :: Scientific/Engineering :: Bio-Informatics  # remove
    Operating System :: Microsoft :: Windows
    Operating System :: POSIX
    Operating System :: Unix
    Operating System :: MacOS
    # any additional classifiers? 

@mattersoflight @AhmetCanSolak @JoOkuma would love to have your input on this.

Test workflow is outdated

Currently the GH Action only tests the package on Linux (Ubuntu) and Python 3.7 (which is EOL). We should test on more platforms (Windows and macOS) and newer Python versions (3.8+).

We can also add coverage check and linting to the PR-triggered workflows.

Wrong page indexing for an incomplete dataset

For certain incomplete datasets (mehta-lab/recOrder#320), the current iohub (or waveorder.io) gathers the wrong page number (the first page gets a page number of 511):

0    /hpc/projects/comp_micro/rawdata/falcon/zebraf...
1                                                  511
2                                                  162
Name: (0, 0, 0, 0), dtype: object

The byte offset is also wrong because MicromanagerOmeTiffReader uses hard-coded magic numbers to determine page offsets:

https://github.com/czbiohub/iohub/blob/0fdca8bad97c8a4c6a1b69e54923c576a7e6b44c/iohub/multipagetiff.py#L122-L142

This results in reading the wrong image for the first frame:

>>> reader.get_image(0,0,0,0).mean()
30643.720915794373

Meanwhile, TiffFile('/path/to/first/image/').asarray()[0].mean() returns an average intensity of 194, which is consistent with the MM GUI readout.

identify optimal chunk size for computational imaging, DL, and visualization workflows

When data is stored in locations with significant I/O latency (an on-premise storage server or the cloud), a large number of small files creates excessive overhead during nightly backups, transfers across nodes, and transfers over the internet. We need to decide on a default chunk size for the data that will be read and written by DL workflows such as virtual staining, and test how chunk size affects I/O speed on HPC.

@Christianfoley can you write a test for i/o performance (e.g., reading a batch of 2.5D stacks and writing the batch to a zarr store) using iohub reader and writer, and run the tests on HPC?

@ziw-liu can then evaluate if the chunk size that is optimal for DL workflow works fine for recOrder/waveOrder.

Choices to evaluate (assuming a camera that acquires 2048x2048 images; a timing sketch follows the list):

  • Chunk by image (chunk size = 1x1x2048x2048).
  • Chunk by channel (chunk size = Cx1x2048x2048).
  • Chunk by stack (chunk size = 1xZx2048x2048).
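
As a starting point, a rough timing sketch with plain zarr-python (paths, shapes, and repeat counts are placeholders):

from timeit import timeit

import numpy as np
import zarr

# One 2.5D batch: (C, Z, Y, X)
batch = np.random.randint(0, 2**16, size=(4, 16, 2048, 2048), dtype=np.uint16)

for chunks in [(1, 1, 2048, 2048), (4, 1, 2048, 2048), (1, 16, 2048, 2048)]:

    def write_batch():
        z = zarr.open(
            f"/tmp/chunk_test_{'x'.join(map(str, chunks))}.zarr",
            mode="w",
            shape=batch.shape,
            chunks=chunks,
            dtype=batch.dtype,
        )
        z[:] = batch

    print(chunks, timeit(write_batch, number=3) / 3, "s per write")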

@JoOkuma how sensitive is performance of dexp processing steps to chunk size?

Contribution guide

Branching out from #1:

To help iohub engage our co-workers and the community as a public repo, we need a contribution guide (e.g. a top-level CONTRIBUTION.md document). This document should include guidelines for:

  • What can be contributed (bug reports, feature requests, code, etc)
  • How to set up a dev environment
  • How to open a PR
    • Does every PR need to be from a fork, even if the contributor has push permission?
    • Does every PR need to be associated with an issue?
  • Test coverage and code style

Please suggest other topics to cover, policy choices, and potential templates.

CLI for summarizing and converting between data formats

recOrder provides a CLI for summarizing ND data without loading it, and for converting from OME-TIFF to OME-Zarr. These commands are more broadly useful and should therefore be moved to iohub.

The consensus was to use the click library to implement the CLI; see the sketch below.
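
A minimal sketch of what the click-based CLI could look like (command and argument names are illustrative, not decided):

import click


@click.group()
def cli():
    """N-dimensional image I/O."""


@cli.command()
@click.argument("path", type=click.Path(exists=True))
def info(path):
    """Summarize an ND dataset without loading the pixel data."""
    click.echo(f"Summarizing {path}")


@cli.command()
@click.argument("input_path", type=click.Path(exists=True))
@click.argument("output_path", type=click.Path())
def convert(input_path, output_path):
    """Convert an OME-TIFF dataset to OME-Zarr."""
    click.echo(f"Converting {input_path} -> {output_path}")


if __name__ == "__main__":
    cli()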

test that the size of data in the channel dimension matches the metadata

The HCSZarr dataset class should verify that the size of each dimension matches between the data and the metadata. I was able to write an invalid Zarr dataset that would not open with napari-ome-zarr.

import numpy as np

from iohub.ngff import HCSZarr

# e.g. position_list = [("A", "1", "0"), ("B", "1", "0")]
with HCSZarr.open(
    "hcs_wrong.zarr", mode="a", channel_names=["DAPI", "GFP"]
) as dataset:
    for row, col, fov in position_list:
        position = dataset.create_position(row, col, fov)
        # 3 channels of data, but only 2 channel names in the metadata
        position[0] = np.random.randint(
            0, np.iinfo(np.uint16).max, size=(5, 3, 3, 32, 32), dtype=np.uint16
        )

Originally posted by @mattersoflight in #31 (comment)
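
A sketch of the kind of check this implies, assuming the channel axis sits at index -4 of a TCZYX array:

def check_channel_count(data_shape, channel_names):
    """Raise if the channel dimension disagrees with the channel metadata."""
    n_data = data_shape[-4]
    n_meta = len(channel_names)
    if n_data != n_meta:
        raise ValueError(
            f"Data has {n_data} channels, but metadata lists {n_meta}: "
            f"{channel_names}"
        )

In the example above, check_channel_count((5, 3, 3, 32, 32), ["DAPI", "GFP"]) would raise, since the data has 3 channels while the metadata lists 2.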

Open source iohub

At this stage I think the iohub repository mostly meets our organization's requirements, as well as the general expectations for an early-stage open source project. I propose that we make this repository public once #46 is merged.

Tagging @mattersoflight @AhmetCanSolak @JoOkuma to see if you have any concerns.

Universal entry points

TASKLIST:

  • ABC #132
  • OME-Zarr
  • OME-TIFF
  • ...



> At this point I don't think a universal entry point for all the formats is strictly necessary.

I think that a limited version of a "universal entry point" for internal Biohub formats is one of the core value propositions of iohub. If we require a user to know their data type and its corresponding API to access data/metadata, then I think adoption of iohub will be limited.

The current version of WaveorderReader in waveorder has been valuable to me because it serves as a universal entry point for compmicro's work. For example, when we started reading pycromanager datasets, @ieivanov added the PycromanagerReader class and adapted WaveorderReader to read these datasets with the existing API. I could continue to run WaveorderReader.shape and WaveorderReader.get_array with predictable results, and any unpredictable results were considered bugs and fixed.

I understand that there may be difficulty in the details. For example, PycromanagerReader.get_zarr might be incorrectly/poorly named, and some dataset types might support different lazy-loading operations (maybe .get_zarr is not one of the universal entry points?). But I still think that all of our datasets share some common properties and operations that users will want to perform regularly. I think this list of operations consists of:

  • what is the shape of the data in this dataset? (.shape)
  • what is its basic metadata, if available? (.channel_names, .z_step_size, .dtype, etc.)
  • what are the actual values in this dataset? (.get_array)

Can we have a universal entry point that supports these operations?

Originally posted by @talonchandler in #31 (comment)

This deserves a dedicated discussion IMHO, which is why I'm creating a separate issue here.

Moving forward I think we should find ways to engineer a solution that provides a more unified API to the end user. I think this is crucial, particularly for the adoption of iohub across the Biohub. Providing the list of operations above (by @talonchandler) on each dataset is a great start, but I think there is more we can do:

  • Currently we are implementing readers and writers, but perhaps we can engineer a single dataset class that handles reading and writing appropriately, and helps us remove redundancies like creating writers from readers.
  • We can provide functional APIs like imread/imwrite in skimage.io to help scripting users adopt iohub more quickly.
  • We can provide a separate metadata API, enabling iohub to be used to inspect imaging datasets before reading them.

This is merely a starting point for the universal entry point discussion; looking forward to reading everyone's input.
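
As a concrete starting point, the shared operations could be expressed as a structural protocol that every format-specific reader satisfies. This is only a sketch; the names mirror the operation list above and are not a settled API.

from typing import Protocol

import numpy as np


class DatasetReader(Protocol):
    """Minimal surface shared by all format-specific readers."""

    @property
    def shape(self) -> tuple:
        """Shape of the data, e.g. (T, C, Z, Y, X)."""
        ...

    @property
    def channel_names(self) -> list:
        """Basic metadata, where available."""
        ...

    @property
    def dtype(self) -> np.dtype:
        ...

    def get_array(self, position: int) -> np.ndarray:
        """Load the actual values for one position into RAM."""
        ...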

Attempting to read old dataset with legacy ZarrReader

I was attempting to use iohub's ImageReader and ZarrReader in place of WaveorderReader, and the output data comes out as NaN or the read fails. The dataset linked below can be read properly using waveorder.io.WaveorderReader(), but fails with both iohub readers.

from iohub.reader import ImageReader

data_folder = "/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs/intoto_casper_short_1_2023_02_08_110049.zarr"

reader = ImageReader(data_folder)

Subsequently @ziw-liu and I looked at the dataset's reader.reader.root.store._dimension_separator, determining that I needed to use the legacy ZarrReader, and ran the following code:

from iohub.zarrfile import ZarrReader
reader = ZarrReader(data_folder, version="0.1")

This returned the following error:

[WARNING: zarrfile:  33 2023-02-17 16:14:36,685] `iohub.zarrfile.ZarrReader` is deprecated and will be removed in the future. For OME-NGFF (OME-Zarr) v0.4 datasets please use `iohub.ngff.open_ome_zarr` instead.
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
File ~/Documents/iohub/iohub/zarrfile.py:49, in ZarrReader.__init__(self, store_path, version)
     47     dimension_separator = "/"
     48 self.store = zarr.DirectoryStore(
---> 49     store_path, dimension_separator=dimension_separator
     50 )
     51 self.root = zarr.open(self.store, "r")

UnboundLocalError: local variable 'dimension_separator' referenced before assignment

During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)
Cell In[20], line 1
----> 1 reader3 = ZarrReader(dataset_folder, version="0.1")

File ~/Documents/iohub/iohub/zarrfile.py:53, in ZarrReader.__init__(self, store_path, version)
     51     self.root = zarr.open(self.store, "r")
     52 except Exception:
---> 53     raise FileNotFoundError("Supplies path is not a valid zarr root")
     54 try:
     55     row = self.root[list(self.root.group_keys())[0]]

FileNotFoundError: Supplies path is not a valid zarr root

Initial setup

After migrating the io module of waveorder (at mehta-lab/waveorder@5f60f0a) to this new repository, the next steps may include:

  • Reach a consensus on the scope and modularity of this project, and draft the README and CONTRIBUTION documents accordingly. #39 #32 #36
  • Fix references and the directory structure so that the existing features work and the tests pass. #4 #46
  • Discuss and implement the abstraction that suits the needs of our users (e.g. an iohub.Dataset class that offers an array-like interface), and make sure the existing feature set works under it. #31 #40 #132
  • Package the library and publish it on PyPI so other packages can depend on it. #55
  • Expand the readers and writers to support more data formats. #99 #114

Each of these can be elaborated/debated in spin-off issues.

@mattersoflight @JoOkuma @royerloic @talonchandler @Christianfoley please feel free to add to or modify these objectives.

Should `ngff.ImageArray` have a different name?

From #45:

@mattersoflight suggested that we should rename ngff.ImageArray:

Rename the ImageArray object to ZarrArray to make it obvious that the object is a zarr array.

In naming it ImageArray I wanted to imply that this is not just a zarr.Array object: it has additional methods/properties that are specific to our image data (e.g. numpy(), channels). Also, we may add more backends in the future (tensorstore), so it may deviate further from the baseline zarr-python object.

Inviting more discussion on this topic so that we can agree on a nomenclature that offers more clarity.

Clarify project name style

When we refer to the project/library in documentation and other forms of writing, a unified visual style is ideal.

Some options:

  1. iohub (plain lowercase)
  2. `iohub` (lowercase, code style)
  3. *iohub* (lowercase, italic)
  4. IOHub

Improve API for the NGFF module

  • Repurpose the nomenclature from our dependencies:
    • Accessing the 5D arrays (zarr arrays) via position['key'] syntax is not intuitive. Let's implement position.data (set or get property) syntax, which mimics napari's syntax. #61
    • Rename the ImageArray object to ZarrArray to make it obvious that the object is a zarr array. #60
  • Make it intuitive to read and write tiled acquisitions: #59
    • Introduce a tiles or tiled layout. I picked these words in favor of multiPos or multiFOV, but I think either of them is a decent choice.
  • Make it intuitive to over-write channels: #56 #70
    • __setitem__ (position[array_name][:, channel_index] = array) is too complex a usage; let's provide a delete_channel or overwrite_channel wrapper. delete_channel is more broadly useful.

Originally posted by @mattersoflight in #31 (review)

Update repo description (About)

The current description is:

Library for reading and writing ND imaging data with metadata in ome-zarr and ome-tiff formats

And I would like to propose a rewording:

N-dimensional bioimaging data I/O with OME metadata in Python

And we should consider adding several 'Topics' (tags) for keyword queries. Some candidates are:
python microscopy bioimaging imageio image-processing image-metadata scientific-computing

writing to single OME-zarr store from multiple processes

We need to be able to read & write a single OME-zarr store or chunked-TIFF store from multiple processes.

The ND image data supported by iohub is chunked and therefore amenable to multiprocessing, but metadata is in a serial format and is best written from a single process. The consensus was that iohub should support creating an empty dataset to which metadata can be written; image data can be written afterward, as in the sketch below.
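
A sketch of that workflow with multiprocessing; create_zeros is a hypothetical API for pre-allocating an empty array, and the store layout follows the examples above:

from multiprocessing import Pool

import numpy as np

from iohub import open_ome_zarr

STORE = "mp_demo.zarr"
SHAPE = (1, 1, 8, 256, 256)  # TCZYX


def write_slice(z):
    # Each worker opens the existing store and writes a disjoint chunk.
    with open_ome_zarr(STORE, mode="r+") as dataset:
        dataset["A"]["1"]["0"]["0"][0, 0, z] = np.full(
            SHAPE[-2:], z, dtype=np.uint16
        )


if __name__ == "__main__":
    # Single process: create the store, all metadata, and an empty array.
    with open_ome_zarr(
        STORE, layout="hcs", mode="w-", channel_names=["BF"]
    ) as dataset:
        fov = dataset.create_position("A", "1", "0")
        fov.create_zeros("0", shape=SHAPE, dtype=np.uint16)  # hypothetical

    # Workers write image data afterward.
    with Pool(4) as pool:
        pool.map(write_slice, range(SHAPE[2]))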

Dev pre-release on PyPI

A pre-release on PyPI can ease testing for downstream projects such as recOrder and microDL, and help us navigate PyPI issues before the 'stable' release. Once we have most of the 0.1.0 features ready we can tag a v0.1.0dev0 and use this issue to track the process.

add support for writing RGB data

We often need to export ND color overlays. @edyoshikun and I ran into this when trying to analyze zebrafish data.
I experimented and found the following:

  • RGB data can be written like any other channel (ome/napari-ome-zarr#57); see the sketch after this list.
  • The napari-ome-zarr plugin requires appropriate channel metadata to render the RGB overlays.
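
On the writing side, a sketch using the same API as the examples above (store path and sizes are illustrative):

import numpy as np

from iohub import open_ome_zarr

with open_ome_zarr(
    "rgb_demo.zarr",
    layout="hcs",
    mode="w-",
    channel_names=["Red", "Green", "Blue"],
) as dataset:
    fov = dataset.create_position("0", "0", "0")
    # R, G, and B stored as three ordinary channels of a TCZYX array
    rgb_tczyx = np.random.randint(
        0, np.iinfo(np.uint16).max, size=(1, 3, 1, 256, 256), dtype=np.uint16
    )
    fov.create_image("0", rgb_tczyx)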

Example FOVs:
/Volumes/comp_micro/projects/automation/dataFormats/ome_zarr_demo/test_astro_RGB

/hpc/projects/comp_micro/projects/zebrafish-infection/2023_02_02_hummingbird_casper_GFPmacs/hsv_pos7.zarr/positions/0/0/

Metadata needed in .zattrs (under the omero key):

 "channels": [
            {
                "active": true,
                "color": "FF0000",
                "label": "Red",
                "window": {
                    "end": 0.5,
                    "start": 0.0
                }
            },
            {
                "active": true,
                "color": "00FF00",
                "label": "Green",
                "window": {
                    "end": 0.5,
                    "start": 0.0
                }
            },
            {
                "active": true,
                "color": "0000FF",
                "label": "Blue",
                "window": {
                    "end": 0.5,
                    "start": 0.0
                }
            }
        ],

The image appears as follows with napari-ome-zarr:

(screenshot: RGB overlay rendered by napari-ome-zarr)
