advancedphotonsource / tike Goto Github PK

View Code? Open in Web Editor NEW

28.0 9.0 14.0 81.53 MB

Repository for ptychography software

Home Page: http://tike.readthedocs.io

License: Other

Python 96.13% TeX 0.10% Cuda 3.77%

ptychography tomography computational-imaging image-processing

tike's Issues

Use statistical model selection methods to determine the best number of probes

Use real probes in modeling data for jupyter notebook examples

Can we use real probes from a ptychography experiment to model data in Jupiter notebook examples? Flat-top gaussian is not a good approximation of real probes. So we cannot correctly test the developed functionality.

Only return / plot the cost function intermittently or when requested

Computing the full cost function requires communications and reductions overhead which may not be needed.

Use complete norm to compute common gradients in lstsq-grad method

According to the manuscript, the least-squares grad common gradients denominator should be the norm across all batches not just the current batch. This is probably implemented incorrectly and needs to be fixed.

Check input diffraction patterns invalid values

i.e. the inputs need to be checked for NaNs, negative numbers, INF, etc.

Memory leak when restarting reconstruction?

I seems that restarting a reconstruction has a memory overhead that increases with each restart. This causes out of memory error. For example, restarting the reconstruction 10 times with 10 iterations each time consumes more memory than doing one reconstruction with 100 iterations.

sparse vs. compact group in LSQ-ML reconstruction

The least-squares solver for maximum-likelihood ptychography proposed two ways to partition batches in the reconstruction: sparse and compact.

Which method is currently in tike? Is there a plan to support both of them?

Reorder input parameters

Doga wants to reorder the input parameters for some of the functions.

Test reconstructions with probe updates disabled.

Replace 1e-32 with actual minimum float

I hope numpy should have some predefined constant for the minimal float32 value greater than 0. So we can replace 1e-32 everywhere

https://numpy.org/doc/stable/reference/generated/numpy.finfo.html?highlight=finfo#numpy.finfo

Allow arbitrary grid size for USFFT/Lamino Operators

Currently, the Laminography operator has a fixes size ratio between projections sampling and field of view sampling. i.e. for projections sized 32x32, the reconstructed volume is 32x32x32 and is sampled in a 64x64x64 frequency space. We should break this fixed ratio so that we can reconstruct downsampled fields of view i.e. 32x32 projections through a 8x8x8 field of view or 32x32 projections in a 32x32x32 frequency space for 32x32x32 field of view.

Utility Function to convert ptycho data to low rez scanning transmission tomography

What regridding algorithm to use though?

Separate cost functions into own module

I would suggest having a separate module for cost functions and for gradients. Any cost function and gradient only need fwd and adj operators to be computed. So users can define their own cost functions and gradients. It would be nice to have def cost_gaussian(fwd,adj,...), def gradient_gaussian(fwd,adj,..),etc. I think the same concept is used in ODL.https://odlgroup.github.io/odl/ @nikitinvv

Cuda kernels are written inefficiently

Cuda kernels in usfft.cu, convolution.cu are not optimal.

They are heavy and use a huge number of registers which may significantly slowdown the code. They should be split into smaller ones.

If there exist 'if,else' statement like for fwd/adj operator then it is better to split it into 2, like it was at the beginning.
Loops with many iterations inside the kernels should be avoided by adding new threads, this way allowing the scheduler to switch threads more efficiently, like it was at the beginning
Non-sequential and uncoallesced memory access in the loops make the code slower, the following loop structure is unacceptable from cuda optimization point of view. It also looks unreadable.

 // for each image
  for (int ti = blockIdx.z; ti < nimage; ti += gridDim.z) {
    // for each scan position
    for (int ts = blockIdx.y; ts < nscan; ts += gridDim.y) {
      ....
      // for x,y coords in patch
      for (int py = blockIdx.x; py < patch_shape; py += gridDim.x) {
        for (int px = threadIdx.x; px < patch_shape; px += blockDim.x) {

Use 3d thread blocks and grids associated with each array dimension, like it was at the beginning. This will give you natural coalesced memory access.

Block size should be a power of 2, optimal combinations (1024,1,1), (256,4,4), (32,32,1), (16,16,4).. What we have in the code:

block = (min(self.scatter_kernel.max_threads_per_block, (2 * m)**3),)

Say m=3, then block = 216 - not optimal.

Would It be better to return to my initial kernels implementations?

Optimize the line search

Reduce the number of cost function calls?
Tweak the step size parameters?

Use sphinx "substitutions" to remove redundant text in documentation

http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#substitutions
http://docutils.sourceforge.net/docs/ref/rst/restructuredtext.html#substitution-definitions

We can use substitutions to only write docstrings once for parameters that are used for multiple functions:

Example:

"""Do something module

.. |theta_docstring| replace:: This text describes theta for all functions. 
"""
import numpy

def func_a(theta):
"""Something

Parameters
---------------
theta :
    |theta_docstring|

"""
def func_b(theta):
"""Something

Parameters
---------------
theta :
    |theta_docstring|

"""

This approach is beneficial because it can reduce source code length and makes maintenance easier because updating one string updates many strings.

Dynamic addition / removal of probes and OPR modes

Tests for bad initial probe guess

Add some tests for a poor initial probe guess.

Using logging statements for development

We want to set up a system for logging information about the progress of reconstruction algorithms such as convergence criteria and which number of iterations have been completed. This system should probably use the standard logger of python and print statements for storing information optionally.

Instead of changing the API of tike to accommodate development, we can use the following tools to capture data in text files and display messages to stout.

Example

Here's an example that demonstrates these concepts.

dev_logging.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Define an example of the logging system to be used in tike."""

import numpy as np
import logging
# First set the logging level to logging.INFO.
logging.basicConfig(level=logging.INFO)
# Then we need to grab a reference to the existing logger or make a new one.
logger = logging.getLogger(__name__)

if __name__ == '__main__':
    # Go to https://pyformat.info/ for a summary of how to use format()
    print("{} is a random number.\nThis string will go to the standard output "
          "stream".format(np.random.randint(0, 100)))
    logger.info("This is an info level statement.\nLogging statements go to "
                "standard error.")
    logger.debug("This is a debugging level statement.\nIt appears when the "
                 "logging level is det to DEBUG or higher.")
    if __debug__:
        print("This statement appears if Python was unoptimized.\n"
              "i.e. if it was started without the -O option.")
    logger.info("Good bye!")

Run the script in a bash terminal:

$ python -O dev_logging.py > some_data.txt
INFO:__main__:This is an info level statement.
INFO:__main__:Good bye!

some_data.txt:

46 is a random number.
This string will go to the standard outputstream

DOC: Fixup example jupyter notebook with latest version

There is a bug in the example where the cost function printed is empty.

Unifying communicator API

There are a lot of these types of switches in the solver implementations. The Comm API should be reworked to avoid these switches.

if comm.use_mpi:
    cost = comm.Allreduce_reduce(cost, 'cpu')
else:
    cost = comm.reduce(cost, 'cpu')

Reduce memory footprint of USFFT and Lamino Operators

It looks like the largest memory allocation events are for fftshift, fftplan, and pad.

fftshift

While cupy.fft.fftn has an overwrite_x parameter, fftshift does not. This is because implementation of fftshift is completely written in Python and relies on another Python-only implementation of numpy.roll.

We should write our own in-place fftshift, ifftshift, and roll functions using a CUDA Kernel because I don't think the NumPy API was designed to handle in-place reordering of dimensions.

fftplan

There is a CuPy option to avoid plans and use a series of 1D plans. These plans appear to be 686MiB, which seems quite large, but maybe they are worth it.

pad

This padding operation in eq2us may be avoidable if we do some modular arithmetic in the gather function? Likewise, there is an _unpad function which could be avoided after scatter if this function used modular arithmetic.

    Fe = xp.pad(Fe0, m, mode='wrap')
    F = gather(xp, Fe, x, n, m, mu)

BUG: ptycho.fast_pad() fails if npadv and npadh are 0

If the detector and probe sizes are set the same, ptycho.simulate fails.

Compare linear interpolation with Fourier interpolation in Ptycho operator

Currently, linear interpolation in the near plane domain is performed in order to handle floating-point scan positions. Another and accurate way of doing so is to perform shifts in the frequency domain. Say a scan position is given as s = si+sf where si is integer part and sf is floating part. We need to compute the Fourier transform of Q\psi(x+si+sf). By using Fourier properties, F(Q\psi(x+si+sf))(\xi)=F(Q\psi(x+si))(\xi) * exp(-2\pi i \xi sf), and the floating-point position problem boils down to multiplication in the frequency domain.
It would be good to compare two approaches (linear interpolation and multiplication in the frequency domain) in terms of accuracy and performance on CPU and GPU.

Normalize the object array to 1 on average

In order to avoid amplitude ambiguity in psi and keep it 1 on average, on each iteration we can compute |FQψ| and compare its norm with the norm of data. Then compute the constant c = norm(|FQψ) / norm(sqrt(data)) and do Q/=c. So this way we simply replace prb by prb/c and ψ by ψc, which does not affect the functional |FQ/c ψc| = |FQψ|). @nikitinvv

multislice ptychography

A lot of ptycho-tomography/laminography data we acquired recently are from thick objects for which the single-slice ptychography model no longer holds. We need to incorporate the multislice model to get better reconstruction.

Modify the ptychopy API to accept Python types such as arrays, dictionaries, integers, so that Tike can pass these parameters to ptychopy

An ideal interface would be something like this:

import ptychopy

# Function accepts inputs such as integers, tuples, and numpy arrays
ptychopy.epieinit(
    jobID='sim512',
    beamSize=110e-9,
    scanDims=(30, 30),
    step=(50e-9, 50e-9),
    i=60,
    size=512,
    lamda=2.4796837508399954e-10,
    dx_d=172e-6,
    z=1,
    simulate=1,
    blind=0,
)
ptychopy.epiestep(its=3)
data = ptychopy.epieresobj()
ptychopy.epiepost()


def epieinit(psi, probe, data, scan, ....):
    """Initializes a reconstruction for the ePIE algorithm.

    Parameters
    ----------
    psi : (..., nz, n) complex64
        The complex wavefront modulation of the object.
    probe (..., 1, 1, nprobe, probe_shape, probe_shape) : complex64
        The complex illuminatio function.
    data (..., nframe, detector_shape, detector_shape) : float32
        The square of the absolute value of `farplane` summed over `fly` and
        `modes`.
    scan : (..., nscan, 2) float32
        Coordinates of the minimum corner of the probe grid for each
        measurement in the coordinate system of psi. Vertical coordinates
        first, horizontal coordinates second.

    """
    pass


def epiestep(num_iter=1):
    """Execute one step of ePIe"""
    pass

Assertion in compact cluster method is tripped when running on 8 GPUs

https://github.com/AdvancedPhotonSource/tike/blob/56508940c4b2d41318734b4a90dfcbd24be59fd4/src/tike/cluster.py#LL310C13-L310C19

v0.4.0 profiling benchmark

35.249 <module>  profile_admm.py:3
└─ 35.225 admm_profile_workload  profile_admm.py:59
   ├─ 29.160 admm  tike/tike.py:88
x   │  ├─ 15.025 forward  tike/tomo.py:256
   │  ├─ 12.937 reconstruct  tike/ptycho.py:433
   │  │  └─ 12.929 grad  tike/ptycho.py:345
   │  │     ├─ 1.873 combine_grids  tike/ptycho.py:204
   │  │     │  └─ 1.616 shift  scipy/ndimage/interpolation.py:474
x   │  │     ├─ 1.768 angle  numpy/lib/function_base.py:1914
   │  │     ├─ 1.688 uncombine_grids  tike/ptycho.py:274
   │  │     │  └─ 1.535 shift  scipy/ndimage/interpolation.py:474
   │  │     ├─ 1.069 ifft2  numpy/fft_intel/fftpack.py:880
   │  │     │  └─ 1.069 ifftn  numpy/fft_intel/fftpack.py:692
   │  │     ├─ 1.062 fft2  numpy/fft_intel/fftpack.py:791
   │  │     │  └─ 1.062 fftn  numpy/fft_intel/fftpack.py:594
x   │  │     └─ 0.956 pad  numpy/lib/arraypad.py:1094
   │  │        ├─ 0.525 _append_const  numpy/lib/arraypad.py:111
   │  │        └─ 0.371 _prepend_const  numpy/lib/arraypad.py:77
   │  └─ 0.798 reconstruct  tike/tomo.py:161
   │     └─ 0.793 recon  tomopy/recon/algorithm.py:75
   │        └─ 0.777 _dist_recon  tomopy/recon/algorithm.py:362
   │           └─ 0.665 __exit__  concurrent/futures/_base.py:580
   │              └─ 0.665 shutdown  concurrent/futures/thread.py:133
   │                 └─ 0.665 join  threading.py:1022
   │                    └─ 0.665 _wait_for_tstate_lock  threading.py:1060

Replace tike.forward with tomopy.forward
Replace np.angle with np.arctan2
Replace np.pad with tike.fast_pad

There are still 6 seconds of time inside tike.ptycho.grad which need to be investigated line-by-line.

Consider using SNAPVX for formulating ADMM as an arbitrary graph

http://snap.stanford.edu/snapvx/

momentum acceleration for object and probe updates in the LSQ-ML reconstruction

Recently I've been using more and more compact batch + momentum accelerations for object and probe updates during the LSQ-ML reconstruction. For some datasets, this combo can significantly improve reconstruction quality.

Probe power utility function

Add a utility function which computes the probe power from an array of probe modes.

Reorder scanning coordinates

theta, h, v coordinates convention should be changed to theta, v, h coordinates because of the following:

This is a more natural order for data collection where h is a slower axis than v
This matches the ordering for tomopy input arrays

Unique h,v, positions at each angle

This change will remove the ability of ptycho.reconstruct to process more than one view at a time, and require adding a for loop around the call to ptycho.reconstruct in tike.admm.

This is because our language for describing the probe geometry is not unified between the tomography and ptychography problems. Each is described by only part of the parameters: theta or (v and h).

I think it is strange that our tomography solver accepts 3D input, but our ptychography solver accepts only 2D input. The mismatch between the two is a constraint that I do not like.

num_batch default cannot be None

Let's do lazy imports

tike has the potential for having a lot of optional dependencies, so instead of importing all external solvers at the module level, let's do lazy imports, where they are only imported if actually needed.

import numpy  # importing here requires it to be installed even if never used

def recon(x, algorithm='default'):
    """Reconstruct some stuff."""
    if algorithm is 'default':
        return numpy.default_method(x)
    elif algorithm is 'optional':
        from option import optional_method
        return option.optional_method(x)  # this method is only imported if chosen

Reduce the number of arguments by packing params into dictionaries

We want to reduce the number of parameters for outward facing functions or maybe all functions in general by packing the long lists of parameters into topical dictionaries. For examples:

Parameters theta, v, h could become trajectory by defining

trajectory = {
    'theta' : None,
    'v' : None,
    'h' : None,
}

This kind of packing will definitely be employed for the reconstruction parameters of the tomo.reconstruct and pytcho.reconstruct because those need to be passed through tike.admm.

Expose the randomness API

Allow setting the random seed for repeatability.

Use decorators to indicate which parts of API are experimental

New ePIE, DM, MLs, API based on the new needed signature in ptychopy.

1 Need to add more functionality in ptychopy.c.
2 The new calling sequence would be different from the original one, keeping the original one to keep the ability for doing reconstruction from passing a file path.Ex, ePIE_File(), ePIE_Array();
3 Two sets of API, one set is for step mode to check the middle results. Another set could be for the whole reconstruction process.

Default data types

Enforce the use of 32bit data types in order to make the creation plugins and extensions easier. Also, reduce the the complexity of tike.utils; I think there is some redundancy that could be removed.

Utilize the NumPy array function protocol.

Writing functions that specifically CuPy is not necessary anymore after NumPy 1.17 because of the array function protocol. Due to this API feature, NumPy functions that are passed CuPy arrays will automatically call the corresponding CuPy functions instead of trying to convert them to NumPy arrays.

https://numpy.org/neps/nep-0018-array-function-protocol.html

Clean all references to num_batch from code

The number of batches, is a much more natural way of expressing how to divide something in the context of multiple-devices. Given the size of a batch, you have to do some math to divide the size between all the devices.

However, the real problem is that these two methods of expressing the same thing: num_batch vs batch_size are both used which is confusing for users.

Handling of Jupyter Notebooks in the repository

This issue is to discuss if we need a version control on Notebooks in the repository.

The Notebooks are meant to be interactive, which means we encourage users to play with them. So it is a question if we need versioning for these files? I'd rather have static Python scripts for tutorials that devs have control over and optimize in time, and have ipynb extensions listed in the .gitignore file.

advancedphotonsource / tike Goto Github PK

tike's Issues

Example

fftshift

fftplan

pad

Recommend Projects

Recommend Topics

Recommend Org