seung-lab / connected-components-3d Goto Github PK

Connected components on discrete and continuous multilabel 3D & 2D images. Handles 26, 18, and 6 connected variants; periodic boundaries (4, 8, & 6)

License: GNU Lesser General Public License v3.0

Makefile 0.14% C++ 40.27% Python 30.16% Dockerfile 4.09% Shell 0.44% Cython 24.90%

connected-components python numpy biomedical-image-processing cpp union-find image-processing cython 3d decision-tree

connected-components-3d's Introduction

cc3d: Connected Components on Multilabel 3D Images

Fig. 1. Binary and Multilabel Connected Components Labeling (CCL) 2D images are shown for simplicity. Black is the background color (zero). (a) A binary image (foreground white, background black) (b) 4-connected CCL of binary image (c) 8-connected CCL of binary image (d) A multilabel image (e) 4-connected CCL of multilabel image (f) 8-connected CCL of multilabel image.

Fig. 2. Continuous Value Connected Components Labeling (CCL) (top) A three tone grayscale image with signed additive low magnitude noise (bottom) Extracted components using continuous value CCL with a delta value greater than the noise magnitude but smaller than the difference between tones

cc3d is an implementation of connected components in three dimensions using a 26, 18, or 6-connected neighborhood in 3D or 4 and 8-connected in 2D. This package uses a 3D variant of the two pass method by Rosenfeld and Pflatz augmented with Union-Find and a decision tree based on the 2D 8-connected work of Wu, Otoo, and Suzuki. This implementation is compatible with images containing many different labels, not just binary images. It also supports continuously valued images such as grayscale microscope images with an algorithm that joins together nearby values.

I wrote this package because I was working on densely labeled 3D biomedical images of brain tissue (e.g. 512x512x512 voxels). Other off the shelf implementations I reviewed were limited to binary images. This rendered these other packages too slow for my use case as it required masking each label and running the connected components algorithm once each time. For reference, there are often between hundreds to thousands of labels in a given volume. The benefit of this package is that it labels all connected components in one shot, improving performance by one or more orders of magnitude.

In general, binary images are much more common (usually resulting from image thresholding), but multi-label images crop up in instance segmentation and semantic labeling as a classifier may label touching clusters of adjacent pixels differently. If a gap between different labels is guaranteed, then the problem degenerates into the binary version.

Check out benchmarks to see a comparison with SciPy on a few different tasks.

Python `pip` Installaction

If compatible binaries are available for your platform, installation is particularly simple.

pip install connected-components-3d

If compatible binaries are not available, you can install from source as follows.

Requires a C++ compiler.

pip install numpy
pip install connected-components-3d --no-binary :all:

Occasionally, you may appear to successfully install cc3d, but on import you'll see an error that includes: numpy.ufunc size changed, may indicate binary incompatibility. You can either try upgrading numpy or compiling cc3d from source in this case.

Python Manual Installation

Requires a C++ compiler.

pip install -r requirements.txt
python setup.py develop

Python Use

The following functions are available with examples below:

Connected Component Labeling (CCL)
Removal of small objects ("dust")
Extraction of k largest objects
Fast extraction of all objects one-by-one
Calculation of contact surface area and contact network
Extraction of a per voxel connectivity graph

import cc3d
import numpy as np

labels_in = np.ones((512, 512, 512), dtype=np.int32)
labels_out = cc3d.connected_components(labels_in) # 26-connected

connectivity = 6 # only 4,8 (2D) and 26, 18, and 6 (3D) are allowed
labels_out = cc3d.connected_components(labels_in, connectivity=connectivity)

# If you need the borders to wrap around (e.g. for simulations, world maps)
# specify periodic_boundary=True, currently only supported for
# 4 and 8 (2d) and 6 (3d) connectivities.
labels_out = cc3d.connected_components(
  labels_in, connectivity=connectivity, periodic_boundary=True
)

# If you need a particular dtype you can specify np.uint16, np.uint32, or np.uint64
# You can go bigger, not smaller, than the default which is selected
# to be the smallest that can be safely used. This can save you the copy
# operation needed by labels_out.astype(...).
labels_out = cc3d.connected_components(labels_in, out_dtype=np.uint64)

# If you're working with continuously valued images like microscopy
# images you can use cc3d to perform a very rough segmentation. 
# If delta = 0, standard high speed processing. If delta > 0, then
# neighbor voxel values <= delta are considered the same component.
# The algorithm can be 2-10x slower though. Zero is considered
# background and will not join to any other voxel.
labels_out = cc3d.connected_components(labels_in, delta=10)

# If you're working with an image that's larger than memory you can
# use mmapped files. The input and output files can be used independently.
# In this case an array labels.bin that is 5000x5000x2000 voxels and uint32_t
# in Fortran order is computed and the results are written to out.bin in Fortran
# order. You can find the properties of the file (shape, dtype, order) by inspecting
# labels_out.
labels_in = np.memmap("labels.bin", order="F", dtype=np.uint32, shape=(5000, 5000, 2000))
labels_out = cc3d.connected_components(labels_in, out_file="out.bin")

# You can extract the number of labels (which is also the maximum 
# label value) like so:
labels_out, N = cc3d.connected_components(labels_in, return_N=True) # free
# -- OR -- 
labels_out = cc3d.connected_components(labels_in) 
N = np.max(labels_out) # costs a full read

# You can extract individual components using numpy operators
# This approach is slow, but makes a mutable copy.
for segid in range(1, N+1):
  extracted_image = labels_out * (labels_out == segid)
  process(extracted_image) # stand in for whatever you'd like to do

# If a read-only image is ok, this approach is MUCH faster
# if the image has many contiguous regions. A random image 
# can be slower. binary=True yields binary images instead
# of numbered images.
for label, image in cc3d.each(labels_out, binary=False, in_place=True):
  process(image) # stand in for whatever you'd like to do

# Image statistics like voxel counts, bounding boxes, and centroids.
stats = cc3d.statistics(labels_out)

# Remove dust from the input image. Removes objects with
# fewer than `threshold` voxels.
labels_out = cc3d.dust(
  labels_in, threshold=100, 
  connectivity=26, in_place=False
)

# Get a labeling of the k largest objects in the image.
# The output will be relabeled from 1 to N.
labels_out, N = cc3d.largest_k(
  labels_in, k=10, 
  connectivity=26, delta=0,
  return_N=True,
)
labels_in *= (labels_out > 0) # to get original labels

# Compute the contact surface area between all labels.
# Only face contacts are counted as edges and corners
# have zero area. To get a simple count of all contacting
# voxels, set `surface_area=False`. 
# { (1,2): 16 } aka { (label_1, label_2): contact surface area }
surface_per_contact = cc3d.contacts(
  labels_out, connectivity=connectivity,
  surface_area=True, anisotropy=(4,4,40)
)
# same as set(surface_per_contact.keys())
edges = cc3d.region_graph(labels_out, connectivity=connectivity)

# You can also generate a voxel connectivty graph that encodes
# which directions are passable from a given voxel as a bitfield.
# This could also be seen as a method of eroding voxels fractionally
# based on their label adjacencies.
# See help(cc3d.voxel_connectivity_graph) for details.
graph = cc3d.voxel_connectivity_graph(labels, connectivity=connectivity)

Note: C and Fortran order arrays will be processed in row major and column major order respectively, so the numbering of labels will be "transposed". The scare quotes are there because the dimensions of the array will not change.

C++ Use

#include "cc3d.hpp"

// 3d array represented as 1d array
int* labels = new int[512*512*512](); 

uint32_t* cc_labels = cc3d::connected_components3d<int>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512
);

// The default template parameter for output type is uint32_t
uint64_t* cc_labels = cc3d::connected_components3d<int, uint64_t>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512
);

uint16_t* cc_labels = cc3d::connected_components3d<int, uint16_t>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512, 
  /*connectivity=*/18 // default is 26 connected
);

size_t N = 0;
uint16_t* cc_labels = cc3d::connected_components3d<int, uint16_t>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512, 
  /*connectivity=*/26, /*N=*/N // writes number of labels to N
);

#include "cc3d_continuous.hpp"

// For handling grayscale images. Note that the difference
// is the addition of the "delta" argument.
uint16_t* cc_labels = cc3d::connected_components3d<int, uint16_t>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512, 
  /*delta=*/10, /*connectivity=*/6 // default is 26 connected
);

#include "cc3d_graphs.hpp"

// edges is [ e11, e12, e21, e22, ... ]
std::vector<uint64_t> edges = cc3d::extract_region_graph<uint64_t>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512, 
  /*connectivity=*/18 // default is 26 connected
);

// graph is a series of bitfields that describe inter-voxel
// connectivity based on adjacent labels. See "cc3d_graphs.hpp"
// for details on the bitfield. 
uint32_t* graph = extract_voxel_connectivity_graph<T>(
  labels, /*sx=*/512, /*sy=*/512, /*sz=*/512, 
  /*connectivity=*/6 // default is 26 connected
);

26-Connected CCL Algorithm

The algorithm contained in this package is an elaboration into 3D images of the 2D image connected components algorithm described by Rosenfeld and Pflatz (RP) in 1968 [1] (which is well illustrated by this youtube video) using an equivalency list implemented as Tarjan's Union-Find disjoint set with path compression and balancing [2] and augmented with a decision tree based on work by Wu, Otoo, and Suzuki (WOS), an approach commonly known as Scan plus Array-based Union-Find (SAUF). [3] The description below describes the 26-connected algorithm, but once you understand it, deriving 18 and 6 are simple. However, we recently made some changes that warrant further discursion on 6-connected.

First Principles in 2D

In RP's 4-connected two-pass method for binary 2D images, the algorithm raster scans and every time it first encounters a foreground pixel (the pixels to its top and left are background), it marks it with a new label. If there is a preexisting label in its neighborhood, it uses that label instead. Whenever two labels are adjacent, it records they are equivalent so that they can be relabeled consistently in the second pass. This equivalency table can be constructed in several ways, but some popular approaches are Union-Find with path compression with balancing by rank and Selkow's algorithm (which can avoid pipeline stalls). [4] However, Selkow's algorithm is designed for two trees of depth two, appropriate for binary images. We would like to process multiple labels at the same time, making Union-Find preferable.

In the second pass, the pixels are relabeled using the equivalency table. Union-Find establishes one label as the root label of a tree, and the root is considered the representative label. Each pixel is then labeled with the representative label. Union-Find is therefore appropriate for representing disjoint sets. Path compression with balancing radically reduces the height of the tree, which accelerates the second pass.

WOS approached the problem of accelerating 8-connected 2D connected components on binary images. 8-connected labeling is achieved by extending RP's forward pass mask to the top left and top right corner pixels. In Union-Find based connected components algorithms, the unify step in the first pass is the most expensive step. WOS showed how to optimize away a large fraction of these calls using a decision tree that takes advantage of local topology. For example, since the top-center neighbor of the current pixel is also adjacent to the other mask elements, all of which have already been processed by virtue of the raster scan direction, if it is present it is sufficient to copy its value and move on. If it is absent, pick one of the remaining foreground pixels, copy their value, and use unify for the mask element on the right as it is now known to be non-neighboring with the left hand side. WOS's algorithm continues in this fashion until a match is found or all mask elements are processed at which point a new label is created.

For several years, this algorithm was the world's fastest, though it has been superceded by a newer work that exchanges the static decision tree for a dynamic one or precalculated generated one amongst other improvements. However, WOS's work is significant for both its simplicity and speed and thus serves as the inspiration for this library. For 2D 8-connected images, we provide a specialization using Wu et al's original decision tree for a slight performance boost.

We're interested in exploring the block based approaches of Grana, Borghesani, and Cucchiara ([5],[7]), however their approach appears to critically rely on binary images. We'll continue to think about ways to incorporate it. We also considered the approach of He et al [8] which is also supposed to modestly faster than than WOS. However, it substitutes the Union-Find data structure (one array) with three arrays, which imposes a memory requirement that is at odds with our goal of processing large images.

Extending to 3D

The approach presented below is very similar to that of Sutheebanjard [6]. To move to a 3D 26-connected neighborhood, the mask must be extended into three dimensions in order to connect neighboring planes. Observe that the 8-connected mask covers the trailing half of the neighborhood (the part that will have been already processed) such that the current pixel can rely on those labels. Thus the mask for the 26-connected neighborhood covers only two out of three potential planes: the entire lower plane (nine voxels), and a mask identical to WOS's (four voxels) on the current plane. While some further optimizations are possible, to begin, the problem can be conceptually decomposed into two parts: establishing a 9-connected link to the bottom plane and then an 8-connected link to the current plane. This works because the current pixel functions as a hub that transmits the connection information from the 9-connected step to the 8-connected step.

Fig. 1: Mask for an 8-connected plane. If J,K,L, and M are all eliminated, only N remains and a new label is assigned.

j	k	l
m	n	.
.	.	.

The very first Z plane (Z=0) the algorithm runs against is special: the edge effect omits the bottom plane of the mask. Therefore, as the remaining mask is only comprosed of the 8-connected 2D mask, after this pass, the bottom of the image is 8-connected. At Z=1, the 9-connected part of the mask kicks in, forming connections to Z=0, making the current plane now (8 + 9) 17-connected. At Z=2, the 9-connected bottom mask now forms connections from Z=1 to Z=2 on the top, making Z=1 (17 + 9) 26-connected. By induction, when this process proceeds to completion it results in a 26-connected labeling of the volume.

Following inspiration from WOS, we construct a decision tree on the densely labeled bottom plane that minimizes the number of unifications we need to perform.

Fig 2. The mask for the lower plane in 3D.

a	b	c
d	e	f
g	h	i

As e is connected to all other voxels, if present, it can simply be copied. If e is absent, b and h fully cover the mask. If b is absent, h, a, c comprise a covering. If h is absent, b, g, i are one. Below is a list of coverings such that each proceeding entry in the list assumes the first letters in the entries above are background.

e
k, (h | g, i)
b, (h | g, i)
h, a, c
m, (f | c, i)
d, (f | c, i)
f, g, a
a, c, g, i
c, g, i
g, i
i

The decision tree is then constructed such that each of these coverings will be evaluated using the fewest unifications possible. It's possible to further optimize this by noting that e and b are both fully connected to the upper 2D mask. Therefore, if either of them are present, we can skip the 8-connected unification step. It's also possible to try the DF covering first if B is background, which would save one unification versus HAC given even statistics, but it seems to be slightly slower on the dataset I attempted. To move from binary data to multilabel data, I simply replaced tests for foreground and background with tests for matching labels.

In order to make a reasonably fast implementation, I implemented union-find with path compression. I conservatively used an IDs array qual to the size of the image for the union-find data structure instead of a sparse map. The union-find data structure plus the output labels means the memory consumption will be input + output + rank + equivalences. If your input labels are 32-bit, the memory usage will be 4x the input size. This becomes more problematic when 64-bit labels are used, but if you know something about your data, you can decrease the size of the union-find data structure. I previously used union-by-size but for some reason it merely reduced performance and increased memory usage so it was removed.

For more information on the history of connected components algorithms, and an even faster approach for 2D 8-connected components, consult Grana et al's paper on Block Based Decision Trees. [5,7]

Phantom Labels

In the course of thinking of improvements to several algorithms, we developed a technique we term "Phantom Labeling" for improving the SAUF method directly.

Definition: Phantom Labels are elements of a CCL mask that 
transmit connectivity information between other elements of the 
mask but cannot directly pass their value to the current pixel 
during the first pass of a SAUF derived algorithm.

Reproducing Fig. 1 again, but with new letters for the more limited problem, the standard SAUF mask appears like so:

Fig. 3: Mask for an 8-connected plane.

a	b	c
d	x	.
.	.	.

This results in a decision tree like so assuming x is a foreground pixel.

if b:
    x := b
elif a:
    x := a 
    if c:
        unify(a,c)
elif d:
    x := d
    if c: 
        unify(c,d)
elif c:
    x := c
else:
    x := new label

There is an opportunity here for eliminating up to half of the unify calls, one of the more expensive operations in modern CCL by slightly modifying the mask:

Fig. 4: 8-connected mask modified to include phantom label P.

.	P	.
a	b	c
d	x	.
.	.	.

This results in a modified decision tree.

if b:
    x := b
elif a:
    x := a 
    if c and not P: <--- change here
        unify(a,c)
elif d:
    x := d
    if c: 
        unify(c,d)
elif c:
    x := c
else:
    x := new label

The novelty of this technique is unclear, but it is very simple to apply and results in substantial speed ups for the 4 and 6 connected problems, a minor improvement for 8-connected, and is readily compatible with the multi-label approach unlike block based approaches.

4 and 6-Connected CCL Algorithm

Here is where the phantom label technique shines. It's a bit harder to find 4 and 6 connected algorithms in the literature, I assume because many of the techniques invented for the 8-way problem, such as the Union-Find data structure for the equivalency table and run-based approaches, are applicable to the simpler problem. However, the SAUF decision tree approach was lacking as every pixel required a unify call in the 4-way problem and two in the 6-way problem.

Fig. 5: 4-connected mask modified to include phantom label P.

P	b	.
a	x	.

if a:
    x := a
    if b and not P:
        unify(a,b)
elif b:
    x := b
else:
    x := new label

This gives a decent improvement on the order of 10-20%. If you're lucky, you might not incur even a single label merge operation. In the 6-way problem, there are three phantom labels that can be exploited and the improvement is closer to 50% on our data, a fairly substantial amount. Again, with luck you might avoid any unify operations at all.

Fig. 6: Mask for the 6-way problem with phantom labels P, Q, and R added.

P	b
a	x

.	Q
R	c

You can even use multiple routes to propagate information if a label is missing. For example, if path (a,P,b) is unavailable due to a missing P, you could potentially transmit information using path (a,R,c,Q,b).

Four Pass Algorithm

We introduce two additional passes over the image label prior to running the two-pass SAUF algorithm. These additional passes are used to collect statistcs for optimizing the SAUF passes.

Estimating Provisional Labels

The first additional pass is used to over-estimate the number of provisional labels generated by the first SAUF pass. A better estimation allows a smaller allocation for the Union-Find datastructure. For some operating systems, the reduced size of the allocation and improved caching recovers more time than is spent collecting statistics.

This can be computed by counting the number of transitions between labels along each row of the image. This scan is easily written such that the instructions can be vectorized to minimize the cost of the scan. The number of transitions is guaranteed to be larger than or equal to the number of provisional labels as all provisional labels are generated in this fashion and then reduced by stealing a label from a neighboring voxel.

A hierarchy of estimators can be written as:

0 <= provisional labels <= X transitions <= static estimate <= voxels

Binary images can also be estimated statically as voxels / 2 for 4 and 6-way, voxels / 4 for 8 and 18 way, and voxels / 8 for 26 connected. For multi-label images, the best static estimate is voxels as no assumptions can be made about how labels connect to each other (in the worst case all eight voxels in a cube have different labels).

It is also possible to check XY and XYZ transitions to get a tighter bound, but in experiments, the amount of time spent checking those directions exceeded the benefit obtained by checking the X pass. Often the X pass alone results in factors as high as voxels / 100.

Estimation of the number of labels also allows aborting processing before the first SAUF pass in the case of an all background cube.

Estimating Foreground Location

The second additional pass is estimating the location of the foreground. In the literature, this strategy is sometimes referred to as a "one-and-a-half pass" where the foreground location is computed during the first SAUF pass and then used to skip processing of background voxels during the relabeling pass.

Here we perform this check up front so that it can be performed minimally. Instead of integrating the calculation into the first pass which could force some computation on every voxel, we scan each row from the left to find the first foreground voxel and then scan from the right to the find the foreground voxel at the end. The results are tabulated in a uint32 table of starts and ends to each row of size 2 * sy * sz. This ensures that the volume is scanned at most once, and most likely much less if the shapes fill the space reasonably well. Then, both passes of the SAUF method scan only the part of each row indicated by this table.

Certain shapes and distributions defeat the efficiency of scanning only the starts and ends of the row (such as random images or an image with foreground on the start and end of each row and nowhere else). However, for a great many shapes, this provides substantial efficiencies and minimal downside for a dense multi-label image as only two YZ slices of the images are scanned before the table is completed.

Early Abortion Points

There are three locations in the algorithm at which further processing can be aborted early without changing the result.

After estimating provisional labels if zero transitions are detected (an all zeros volume). A black image is returned.
After the first SAUF pass if the number of provisional labels is zero or one. In this case, the provisional labels are guaranteed to be identical to final labels.
After assigning final labels to each provisional label in a translation array. If the number of final labels equals the number of provisional labels, the provisional labels were accurately assigned and the relabeling scan can be skipped.

Papers Using cc3d

A number of papers are using cc3d now. Many of them seem to be deep learning applications as instance segmentation is liable to generate touching non-binary labels. Some are in geoscience, neuroscience, and medical fields. If cc3d is helpful to you, please feel free to email us and let us know. We might be able to offer some tips if its performance critical (though we can't guarantee timeliness of response). There are so many variations of the CCL problem, you might be surprised at what you can do.

https://scholar.google.com/scholar?as_ylo=2019&q=connected-components-3d&hl=en&as_sdt=0,31

References

A. Rosenfeld and J. Pfaltz. "Sequential Operations in Digital Picture Processing". Journal of the ACM. Vol. 13, Issue 4, Oct. 1966, Pg. 471-494. doi: 10.1145/321356.321357 (link)
R. E. Tarjan. "Efficiency of a good but not linear set union algorithm". Journal of the ACM, 22:215-225, 1975. (link)
K. Wu, E. Otoo, K. Suzuki. "Two Strategies to Speed up Connected Component Labeling Algorithms". Lawrence Berkeley National Laboratory. LBNL-29102, 2005. (link)
S. Selkow. "The Tree-to-Tree Editing Problem". Information Processing Letters. Vol. 6, No. 6. June 1977. doi: 10.1016/0020-0190(77)90064-3 (link)
C. Grana, D. Borghesani, R. Cucchiara. "Optimized Block-based Connected Components Labeling with Decision Trees". IEEE Transactions on Image Processing. Vol. 19, Iss. 6. June 2010. doi: 10.1109/TIP.2010.2044963 (link)
P. Sutheebanjard. "Decision Tree for 3-D Connected Components Labeling". Proc. 2012 International Symposium on Information Technology in Medicine and Education. doi: 10.1109/ITiME.2012.6291402 (link)
C. Grana, D. Borghesani, R. Cucchiara. "Fast Block Based Connected Components Labeling". Proc. 16th IEEE Intl. Conf. on Image Processing. 2009. doi: 10.1109/ICIP.2009.5413731 (link)
L. He, Y. Chao and K. Suzuki, "A Linear-Time Two-Scan Labeling Algorithm", IEEE International Conference on Image Processing, vol. 5, pp. 241-244, 2007.

connected-components-3d's People

Stargazers

Watchers

connected-components-3d's Issues

14 Connected

I'm not sure who would use this, but 14 connected is faces + corners but no edges. Someone mentioned it to me in conversation, but in such a way that it sounded like it's something people use. It was unclear to me why it would be useful.

Not actually 26 connected?

I may have made an oversight when I restricted the forward mask to 9 neighbors. It seems I neglected to add loc + 1 - sx to the mask.

This is being treated in #8 which will be faster anyway...

Parallel Operation

For large volumes, it would be nice to process it in parallel.

ValueError: numpy.ufunc size changed, may indicate binary incompatibility

Hi!

I have successfully installed connected-components-3d using pip.
But when I import cc3d it gives following error:

Traceback (most recent call last):
File "", line 1, in
File "init.pxd", line 918, in init cc3d
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

Can you please help?

What's the meaning of index list returned?

Hi, thank you for your great contribution first!
But I wonder about the index list the function 'connected_components(data)' returns, why is the index list not continuous, such as[0,1,2,3], but in fact it returns list like [0,210,213,220].

Another, I want to know for the multi_label case, such as [0,1,2], then it returns connected components labels like [0,210,213,220]. Then how can we tell the different connected components belonging to original label?

Last, does this repository has any attribute to find if the connected components neighbored?

warning from numpy

this is not a big issue, but it might create problems in the future.
I just would like to keep a mark here, nothing urgent to fix.

Processing connected-components-3d-3.2.0.tar.gz
Writing /var/folders/gc/b6s140td6xdbfxdrsmkjtww80001lg/T/easy_install-wc89qksb/connected-components-3d-3.2.0/setup.cfg
Running connected-components-3d-3.2.0/setup.py -q bdist_egg --dist-dir /var/folders/gc/b6s140td6xdbfxdrsmkjtww80001lg/T/easy_install-wc89qksb/connected-components-3d-3.2.0/egg-di
st-tmp-c5t0i0lp
In file included from cc3d.cpp:645:
In file included from /Users/jwu/opt/anaconda3/envs/wasp/lib/python3.7/site-packages/numpy/core/include/numpy/arrayobject.h:4:
In file included from /Users/jwu/opt/anaconda3/envs/wasp/lib/python3.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
In file included from /Users/jwu/opt/anaconda3/envs/wasp/lib/python3.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1944:
/Users/jwu/opt/anaconda3/envs/wasp/lib/python3.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "
      "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
#warning "Using deprecated NumPy API, disable it with " \
 ^
1 warning generated.

Build Equivalence Table Two Z Slices at a Time

For very large volumes, it might be helpful to be able to provide a facility for processing Z slices in sequential order. This would allow the user to manage memory efficiently on their end. The interface would look something like:

import cc3d
import numpy as np

builder = cc3d.connected_components_builder()
for z in range(128):
     builder.add_z_slice(img[:,:,z])
for z in range(128):
     img[:,:,z] = builder.relabel(img[:,:,z])

Better Algorithm

Parallel operation (#6) is the rich dumb person's way of getting more performance. At minimum it would be good to apply Wu's or Grana's "World's Fastest" algorithms in this case.

Add support to release linux aarch64 wheels

Problem

On aarch64, pip install connected-components-3d builds the wheels from source code and then installs it. It requires the user to have a development environment installed on his system. Also, it takes more time to build the wheels than downloading and extracting the wheels from PyPi.

Resolution

On aarch64, pip install connected-components-3d should download the wheels from PyPi

@william-silversmith, Please let me know your interest in releasing aarch64 wheels. I can help with this.

Will it work with pointcloud data?

What is process?

The README.md file says to call the process function:


# You can extract individual components using numpy operators
# This approach is slow, but makes a mutable copy.
for segid in range(1, N+1):
  extracted_image = labels_out * (labels_out == segid)
  process(extracted_image)

# If a read-only image is ok, this approach is MUCH faster
# if the image has many contiguous regions. A random image 
# can be slower. binary=True yields binary images instead
# of numbered images.
for label, image in cc3d.each(labels_out, binary=False, in_place=True):
  process(image)

What is this process function? I can't find it in default python or in cc3d.

Massive memory Leak

Unfortunately, this fantastic package has a massive memory leak

i suggest wrapping it in the following command to avoid memory leaks.


import cc3d

import concurrent.futures


def connected_components(bianry_array, return_N=True):
    with concurrent.futures.ProcessPoolExecutor() as executor:
        f = executor.submit(cc3d.connected_components, (bianry_array), return_N=return_N)
        ret = f.result()

    return ret

How to extract 3D components from Binary 3D using cc3d package?

Not Compiling on Windows 10

Hi Thank you for providing this essential package, but I couldn't install this in my machine. I am using python 3.6 version through anaconda in Windows 10 OS. While I am installing the package it is saying that " Could not find a version that satisfies the requirement connected-component-3d (from versions: )
No matching distribution found for connected-component-3d". Is that this package not supporting Windows OS ? Please let me know about this issue.

Thank you

How to remove small object [3d volume] using cc3d.connected_components ?

largest_k fails for transposed arrays

Hi!
I faced a problem when using largest_k with transposed array - the result is nonsense. And during experiments with other functions (I tried connected_components) I did not face that problem. My cc3d version is 3.12.1

My minimal code:

import numpy as np
import cc3d
import matplotlib.pyplot as plt

strange = np.load('strange.npy')

plt.imshow(cc3d.largest_k(strange.transpose(2, 1, 0), 1)[100])
plt.imshow(cc3d.largest_k(strange, 1)[:, :, 100])
plt.imshow(cc3d.connected_components(strange)[:, :, 100])

Pictured Example

Will,
Thank you for this amazing work. I am new to python and the idea of CCL. The 'python use' code snippet helps to understand how the code is to be used but I would appreciate if you could please provide an example of a 3D object with multiple labels being run on your code..
The reason I ask is that I am currently working on a segmentation process for 3D objects and my binary matrix is (64x64x64) obtained using binvox. I tried to read the object and then pass the (64x64x64) binary matrix into your code but I just received 1 output label. I want to verify if I am doing something wrong and if there is some more pre-processing that I need to do.
Also the region adjacency graph is not working, there seems to be no output at all.
With so many issues that I am facing, I am sure there is something that I am doing wrong. You help would be appreciated.

How to access/extract the connected components using this package ?

Thank you for making this package windows compatible. I segmented the region of interest from Lung CT scans. My segmented output size 512x512x130, where 130 is the total number of cross sectional images in the Lung CT scan and 512x512 is the size of each cross sectional image.
I use the following code to find the connected components (connected tissue clusters) across the cross section of the CT scan.
import numpy as np
from cc3d import connected_components
nod_arr=np.load('nodule_arr.npy')
nod_3d = connected_components(nod_arr)

It doesn't throw any error, but I can't able to extract any label information from nod_3d. How can I extract the tissues clusters which are connected across the cross section of CT scan ? For example, if one tissue cluster is exist from slice number 40 to 43 (4 cross section images), I need to extract that tissue cluster alone separately.

List of label indices?

Is there a way to return a list of tuples/arrays that contains the indices of each label within the 3D array? Other approaches I've implemented using numpy or pandas can get quite memory intensive and are slow even in vectorized form. I was hoping there might already be such a list accessible through this module that can be created while the labelling algorithm is being executed. You cc3d labelling has no issue labelling a volume on the order of 3.5 billion voxels with ~40 million unique labels, but trying to get the indices of each label to perform subsequent operations has proved challenging.

IndexError is generated when using cc3d.connected_components

The follow index error is generated when using connected components:

IndexError: index 3 is out of bounds for axis 0 with size 3
Exception ignored in: 'cc3d.epl_special_row'
Traceback (most recent call last):
File "****", line 9, in
new = cc3d.connected_components(test)
IndexError: index 3 is out of bounds for axis 0 with size 3

This can be reproduced with the following code:

import cc3d
import numpy as np

c = 5

test = np.zeros((3, 4, 4,), dtype='int')
test[0, 0, 1] = c
test[0, 0, 2] = c
new = cc3d.connected_components(test)
print(np.count_nonzero(new) == 0)

In addition the array returned is all zeros, which I don't think should be the case. Surprisingly if we init to ones all runs as expected as in:

test = np.ones((3, 4, 4,), dtype='int')

I may be missing something, but I don't think this expected behavior.

Does cc3d also work with memmory-mapped numpy arrays and array-like data?

Hey,

Thanks for this really cool package. By now, I am using it quite frequently and it is simply awesome!

I often work with larger-than-memory 3D images and the only solution is often to memory-map them as a numpy/zarr array in order to process them.

Is cc3d capable of working with memory-mapped numpy arrays or even array-like data (Zarr, Dask, Tensor, Xarray, ...)? Or will it simply throw an exception or convert it internally to an in-memory numpy array?

If it is not possible are you aware of other libraries or approaches that could perform operations such as connected component analysis on larger-than-memory data (either memory-mapped or via patchification)?

Best,
Karol

Using cc3d with multiple channels?

hello,
I am using 3D images ,so for that how can i apply connected component library?

cc3d.statistics['bounding_boxes'] is contains floats instead of int as slice positions

Hi, the slices returned by the stats method of cc3 contain float entries but should be int. Here is a reproducible minimal example:

tt = np.zeros((3,3,3))
tt[0,1]= 2 
tt[1,1]=2

cc_dusted , N= cc3d.connected_components(tt,return_N=True)
stats_dusted = cc3d.statistics(cc_dusted)
bboxes = stats_dusted['bounding_boxes']

print(bboxes)

Above snippet returns:

[(slice(0, 541.0, None), slice(0, 530.0, None), slice(0, 530.0, None)), (slice(116, 272.0, None), slice(125, 215.0, None), slice(247, 336.0, None)), (slice(130, 270.0, None), slice(32
7, 421.0, None), slice(230, 344.0, None))]

As you can see there are float entries

cc3d.dust fails

Working in Ubuntu 22.04
python 3.8
connected-components-3d==3.10.2
numpy==1.23.2

Trying

labels_out = cc3d.dust(
            source_cube[i_range, x_range, s_slice], threshold=100,
            connectivity=26, in_place=False
        )

results to error

    labels_out = cc3d.dust(
  File "cc3d.pyx", line 1070, in cc3d.dust
  File "cc3d.pyx", line 969, in cc3d.erase
  File "cc3d.pyx", line 937, in cc3d.draw
  File "cc3d.pyx", line 939, in cc3d.__pyx_fused_cpdef
TypeError: No matching signature found

Any idea

Add Remove Dust Function

This is probably one of the most common uses of cc3d and it's easy to screw it up and make it run slow.

Probably need to support three modes:

remove fewer than this number of voxels
remove smaller than this percent of the maximum size shape.
remove smaller than this percent of the image size (can be done using option 1 and some math)

Reduce Peak Memory Consumption

This package is currently several times more memory intensive than SciPy. There are a few avenues to reducing memory consumption: .

Strip out union-by-size from union-find. Not sure what the theoretical justification is (it's supposed to be faster!), but it seems to be more performant and lower memory. It's possible that the union-by-size feature is more useful for arbitrary graphs but not the structure of graphs that are implied in images.
Allow use of uint16_t or uint8_t for output images. It's pretty rare that more than 65k labels are used in typical images. We would need a good way to estimate when this is okay or allow users to specify uint16.
As in #11, we can use std::unordered_map in union-find which for many images which would sparsely utilize the union-find array would result in large memory reductions. However, for images which densely use it, it would use more. It also supports labels larger than the maximum index of the image. However, it is also slower than the array implementation. We should allow the user to choose which implementation is right for them. (whenever I try this it's embarrassingly slow) this one would be too slow
Allocate the output array in the pyx file and pass it by reference to avoid a copy.
~~Is it possible to do this in-place? Might be restricted to data-types uint16 or bigger.~~ (No, you need to be able to check the original labels.)
Allow binary images as input and represent them as bit-packed using vector<bool>
limit memory used for binary images based on maximum possible number of prospective labels
estimate the number of provisional labels before allocating

6-connected for DNS

Is it possible to calculate 6-connectivity, or 18 instead of 26? A bit of a departure from brain imaging, but this feature would be very useful for finite volume solvers.

Applying Dust and largest_k dtype output option

Hi,
I apply dust before largest_k; is this the right order? Or performance wise largest_k should be applied first?
My input is a boolean array.

Do you consider casting of largest_k output to the relevant dtype based on k value?
If k in largest_k is less than 65535 no need to for label_out to be uint32, uint16 will be sufficient. So for k<255 label_out can be uint8.
Can this be considered to reduce memory requirements?

Dimitris

Visualization

Hello
I am trying to use your component, and i want to visualize the 3d plot. Sample from my code:

# 3D Plot
fig = plt.figure()
ax = plt.axes(projection="3d")

# depth_image = skimage.io.imread("/host/datasets/dataset1/depth_20201030T171046.png")
image = depth_image
# filter out points that are too far
idx = np.where((image != 0) & (image <= 230))

# visualize
x_points = idx[1]
y_points = 10*image[idx]
z_points = -idx[0]
col = np.arange(30)

ax.scatter3D(x_points, y_points, z_points, c=z_points, cmap='viridis');
plt.show()

Currently what I have is a depth image, with a separated rgb.
I want to do some 3d clustering and I have been reading a lot until I found your repo for 3d connected components.

Is there a way for me to visualize the results from code you have on your main README?

Better Algorithm Mk. II

Discussed this some in #6

There are several faster algorithms than Wu et al's 2005 decision tree.

He et al 2007 describes a way of using a simpler data structure than union-find, but it requires 3x as much memory (3 arrays) for what is probably a 10-20% improvement. They don't compare with Wu et al in their paper.
Chang's contour tracing algorithm is very fast and probably shouldn't be ignored. It might be the best for our number of labels (see figure below from Grana).
Grana's 2009 paper on 2x2 block based evaluation is promising. It's complex in 2D, and extending it to 3D requires a very complicated tree, though if the chip can handle it efficiently, there's a lot of efficiencies to be gained, maybe even more than in 2D.

Figure from Grana, Borghesani, Cucchiara. "FAST BLOCK BASED CONNECTED COMPONENTS LABELING". IEEE 2009. doi: 10.1109/ICIP.2009.5413731

dust sugnature

dust signature does not contain connectivity parameter although is an acceptable one.
Consider adding it to allow pre-commit working
Thanks!

def dust(img, threshold, in_place=False): # real signature unknown; restored from __doc__
    """
    dust(img, threshold, in_place=False) -> np.ndarray
    
      Remove from the input image connected components
      smaller than threshold ("dust"). The name of the function
      can be read as a verb "to dust" the image.
    
      img: 2D or 3D image
      threshold: discard components smaller than this in voxels
      connectivity: cc3d connectivity to use
      in_place: whether to modify the input image or perform
        dust 
    
      Returns: dusted image
    """
    pass

About the lastest_k function

Hello, William
when i use lastest_k , i got some error
this is code

import numpy as np
import cc3d
import cv2
labels_in = np.ones((512, 512, 512), dtype=np.int32)
labels_out, N = cc3d.largest_k(
labels_in, k=10,
connectivity=26,
return_N=True,
)

and I got TypeError: Argument 'delta' has incorrect type (expected int, got float)
so what can I do to reduce this error
looking forward to your answer

the segmentation number is 0 when the segmentation array is the same object

fragments_chunk, seg_num = cc3d.connected_components(
                    np.ones(3,3,dtype=np.uint32), connectivity=6, return_N=True
                )

the seg_num is 0, which should be 1!

How differentiate small 3d objects from 3D Volume ?using How to remove small object [3d volume] using cc3d.connected_components?

Statistics output

In this example, I expect the just one centroid near [149, 149, 149]. Could you let me know where the [64, 64, 64] comes from? Also why are the pixel indices (~149) not the same in all three dimensions with this input?

import numpy as np
import cc3d

labels_in = np.zeros((512, 512, 512), dtype=np.int32) 
labels_in[100:200, 100:200, 100:200] = 1
labels_out, N = cc3d.largest_k(labels_in, k = 1, connectivity = 26, delta=0, return_N = True)

a = cc3d.statistics(labels_out)
a['centroids']

This is what I get:

array([[ 64.480415,  64.480415,  64.480415],
       [149.26369 , 149.49011 , 149.0991  ]], dtype=float32)

Thank you for sharing this software.

Question on comparing individual lesions between two masks based on the cc3d.statistics output.

Hello,

First of all thank you for cc3d, I am very new to the field and I found it much easier to use compared to other implementations of connected components for 3D images.

I used the 'stats' function to get the voxel sizes of individual lesions and their bounding boxes from a mask, which if I understand correctly represent the position of each lesion in the mask.

What I would like to do, is to compare the number of correctly identified lesions that my model predicted in the ground truth regardless of whether their volumes match (or regardless of whether they are precisely delineated). I guess this would be done by comparing the bounding boxes between the mask my model predicted and the ground truth mask. Is there a straightforward way to do this using cc3d or another package?

Best and thank you for your time!

Cannot find reference 'dust' in 'cc3d.py'

I installed cc3d on windows 10 for python 3.6.12:

Requirement already satisfied: connected-components-3d in c:\users\max\miniconda3\lib\site-packages (3.10.3)
Requirement already satisfied: numpy in c:\users\max\miniconda3\lib\site-packages (from connected-components-3d) (1.18.5)

When I try to use dust() as shown in the Readme, I get the following error:
AttributeError: module 'cc3d' has no attribute 'dust'

Am I missing something?

Additional metrics support

Hi,
I am benefiting from this amazing library in my deep learning research, namely detecting and measuring liver tumours in CT scans. It would be great if we could have a features allowing unitless longest dimension, surface area metrics to be computed as well. Getting volume is easy since the functions already return voxel counts for labels.

1D Array of 4 Elements Incorrect

An error occurs when using an array of 4 elements

import numpy as np
import cc3d
labels, seg_count = cc3d.connected_components(np.array([[[1,1,1,1]]]), return_N=True)
labels

output:

[1,0,0,0]

expected output:

[1,1,1,1]

Memory issue when i working with connected-components-3d ? volume images shape are 1000 X 1500 X 7000 ?

Apply mask prior to segmentation?

Hi, I'm a new user of the package and was wondering if there is functionality to apply a mask before running connected components, or if masks must be applied after processing the whole volume? If possible, this could be helpful to speed up processing for large data volumes (1000s of pixels on edge) where the regions of interest are a small subset of the sample volume.

I don't know the details of your implementation, so could envision this actually slowing down the process instead of speeding it up depending on the specifics of the algorithm, especially if this means you need to keep a whole duplicate array for the mask in memory at the same time.

cc3d.statistics["bounding_boxes"] are wrong

I found slice.start is wrong in your latest version.
It always start with 0 not real start point.
Plz, look at below pictures.

Thank you.

the limit of union-find array is 65535?

I have a big data(102410241024) to use connected-components-3d and got the error : Label 65535 cannot be mapped to union-find array of length 65535.
Is this because the limit of union-find array?
And will you update a feature to cancel the limit of union-find?

max_labels questions

I'm a bit confused by the max_labels argument. If I run a connected_component call without the argument and then do a np.max(labels_out), I should get the number of components in the version after the recent 1.2.0 release. However, if now use this number with some margin to set max_labels, the procedure fails with exception:

Connected Components Error: Label 60000 cannot be mapped to union-find array of length 60000.
terminate called after throwing an instance of 'char const*'

It seems that internally, the union-find algorithm requires a higher number, but it is not clear to me how to estimate this number.

It would be great to find a way to reduce the peak memory footprint of this very nice package. :)

3d small object cannot removed when use cc3d. connected-compoenents?

3d small object cannot removed when use cc3d. connected-compoenents?:

from reference : #21

its not workin @william-silversmith

Make a PyPI Package

Do you think it's worthwhile? Upvote if yes. It'd be interesting to hear your use case. Please contribute it below.

connected_components does not labels single pixels in 2D

Dear all,

Using cc3d for a while in python I just noticed single pixels are not labeled. At the beginning I though cc3d did not catch isolated pixels but the behavior is a little bit stranger :

    import cc3d
    import numpy as np

    # NOT WORKING
    binary_img = np.zeros((3, 3), dtype=np.uint8)
    binary_img[1, 1] = 1
    labels = cc3d.connected_components(binary_img, connectivity=4)
    print(f'{binary_img}\n => \n', labels)
    
    binary_img = np.zeros((5, 5), dtype=np.uint8)
    binary_img[1, 1] = 1
    labels = cc3d.connected_components(binary_img, connectivity=4)
    print(f'{binary_img}\n => \n', labels)
    
    # WORKING
    binary_img = np.zeros((5, 5), dtype=np.uint8)
    binary_img[1, 1] = 1
    binary_img[3, 3] = 1
    labels = cc3d.connected_components(binary_img, connectivity=4)
    print(f'{binary_img}\n => \n', labels)

Corresponding outputs

    [[0 0 0]
     [0 1 0]
     [0 0 0]]
     => 
     [[0 0 0]
     [0 0 0]
     [0 0 0]]
    
    [[0 0 0 0 0]
     [0 1 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]]
     => 
     [[0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]
     [0 0 0 0 0]]
    
    [[0 0 0 0 0]
     [0 1 0 0 0]
     [0 0 0 0 0]
     [0 0 0 1 0]
     [0 0 0 0 0]]
     => 
     [[0 0 0 0 0]
     [0 1 0 0 0]
     [0 0 0 0 0]
     [0 0 0 2 0]
     [0 0 0 0 0]]

This issue may looks minor in most of the case but I use sometimes cc3d with very small 2D patches and this situation is happening.

Notice this issue seams not reproducible in 3D with third dimension bigger than 1

Best regards

Bounding box for each label
Centroids
Voxel Count / Volume (Vol = res * Ct)

It would be even more interesting if this could be done as a one-pass algorithm too. However, it's pretty easy to add this as an additional pass.

Voxel count is pretty easily handled by https://github.com/seung-lab/fastremap but because we know the image statistics here, we can get some minor efficiencies. Centroids are also easy to compute (a 2d version is here: https://github.com/seung-lab/kimimaro/blob/master/ext/skeletontricks/skeletontricks.pyx#L463).