
dl1-data-handler's Introduction

DL1 Data Handler


A package of utilities for reading and applying image processing to Cherenkov Telescope Array (CTA) R0/R1/DL0/DL1 data in a standardized format. Created primarily for testing machine learning image analysis techniques on IACT data.

Currently supports the ctapipe v6.0.0 data format.

Previously named image-extractor (v0.1.0 - v0.6.0). Currently under development, intended for internal use only.

Installation

The following installation method (for Linux) is recommended:

Installing as a conda package

To install dl1-data-handler as a conda package, first install Anaconda by following the instructions here: https://www.anaconda.com/distribution/.

The following commands will set up a conda virtual environment, add the necessary package channels, and install the specified version of dl1-data-handler and its dependencies:

DL1DH_VER=0.12.0
wget https://raw.githubusercontent.com/cta-observatory/dl1-data-handler/v$DL1DH_VER/environment.yml
conda env create -n [ENVIRONMENT_NAME] -f environment.yml
conda activate [ENVIRONMENT_NAME]
conda install -c ctlearn-project dl1_data_handler=$DL1DH_VER

This should automatically install all dependencies (NOTE: this may take some time, as by default MKL is included as a dependency of NumPy and it is very large).

Dependencies

The main dependencies are:

  • PyTables >= 3.8
  • NumPy >= 1.20.0
  • ctapipe == 0.21.2

Also see setup.py.

Usage

ImageMapper

The ImageMapper class transforms the hexagonal input pixels into a 2D Cartesian output image. The basic usage is demonstrated in the ImageMapper tutorial. In addition to dl1-data-handler, it requires the ctapipe-extra package. See this publication for a detailed description: arXiv:1912.09898
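
For orientation, here is a minimal usage sketch, assuming the constructor and map_image signatures that appear in the tracebacks further down this page; exact argument names vary between dl1-data-handler versions:

import numpy as np
from dl1_data_handler.image_mapper import ImageMapper

# Build a mapper for the LST camera; 'oversampling' is one of the
# supported mapping methods (see the image mapping issue below).
mapper = ImageMapper(camera_types=['LSTCam'],
                     mapping_method={'LSTCam': 'oversampling'})

# One value per camera pixel plus a channel axis; 1855 is the LSTCam
# pixel count, used here purely for illustration.
pixels = np.random.rand(1855, 1).astype(np.float32)
image = mapper.map_image(pixels, 'LSTCam')  # 2D Cartesian output image
print(image.shape)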

Links

  • Cherenkov Telescope Array (CTA) - Homepage of the CTA Observatory
  • CTLearn and GammaLearn - Repositories of code for studies on applying deep learning to IACT analysis tasks. Maintained by groups at Columbia University, Universidad Complutense de Madrid, and Barnard College (CTLearn), and at LAPP (GammaLearn).
  • ctapipe - Official documentation for the ctapipe analysis package (in development)


dl1-data-handler's Issues

Overflow in tel_id column of Array_Info table

The tel_ids for SST1 telescopes go from 220-255, followed by 0-33, in what appears to be an overflow error. This results in tel_ids being repeated, so they can't be used as unique keys.

I suppose the relevant line in image_extractor.py is line 240, arr_row["tel_id"] = tel_id, but it's not clear to me why that would cause this error.
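
The observed pattern (IDs 220-255 followed by 0-33) is exactly what an unsigned 8-bit column would produce; a quick NumPy illustration (hypothetical — the actual column type used in image_extractor.py would need to be checked):

import numpy as np

tel_ids = np.arange(220, 290)          # intended IDs 220..289
as_uint8 = tel_ids.astype(np.uint8)    # wraps around after 255
print(as_uint8[:3])   # [220 221 222]
print(as_uint8[-3:])  # [31 32 33] -- repeated, no longer unique keys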

repository size

Hi
I am just wondering why the repo is already ~30 MB in size?

calibration and pixel saturation

Hi

As discussed with @nietootein last week, we realized that the calibration is not currently properly done in the code.

Q1: Are both channels saved in the HDF5 files (I think not, but I don't have a data file here to double-check)?

If so, we could in principle select the right channel afterward (when loading the data), but I would highly recommend option B:

The channel selection should be done during the calibration, so that we write only calibrated images (with combined channels) to disk.

However:

  • to my knowledge, there is nothing in ctapipe that does this.
  • the range of pixel values is camera dependent.

This means we need to code the pixel selection ourselves and get the saturation values for all cameras used in the simulations (these could differ between productions).

Even though saturated pixels represent only a small minority, this could really affect performance. In particular, it could explain the rather poor performance we observe at higher energies (TBC, of course).
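
A minimal sketch of what such a channel selection could look like, assuming a two-gain waveform array of shape (n_gain, n_pix, n_samples) and a per-camera saturation threshold (the threshold value below is a placeholder, not a real calibration constant):

import numpy as np

def combine_channels(waveform, hg_saturation=4000.0):
    # Use the high-gain channel by default; fall back to low gain for
    # pixels whose high-gain samples saturate.
    high_gain, low_gain = waveform[0], waveform[1]
    saturated = high_gain.max(axis=-1) > hg_saturation  # per-pixel mask
    return np.where(saturated[:, None], low_gain, high_gain)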

Optimize chunk size

Optimizing the chunk size may offer a significant I/O improvement. See, e.g., the information here.
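
For reference, PyTables exposes the chunk size through the chunkshape argument at table-creation time; a minimal sketch (the table layout and chunk size are illustrative):

import tables

class Event(tables.IsDescription):
    event_id = tables.Int64Col()
    mc_energy = tables.Float32Col()

with tables.open_file('events.h5', 'w') as f:
    # Larger chunks generally improve sequential-read throughput at the
    # cost of slower random access; the value below is only an example.
    f.create_table('/', 'Events', Event, chunkshape=(4096,))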

Refactoring to separate event/data loop from dumping to file

In addition to the bugfixes required by other issues and updating image-extractor to agree with the current version of ctapipe, it may be helpful to reorganize the code to better support different input data formats (i.e. VERITAS, FACT, etc.) and possibly different output data formats.

One possible structure is as follows:

from abc import ABC, abstractmethod


# Abstract class for loading events sequentially.
# Event loaders for different input data formats to be implemented as subclasses.
# For the ctapipe/simtel input pipeline this would be mostly a wrapper around an
# event_source with some additional processing.
class EventLoader(ABC):
    def __init__(self, files):
        self.files = files
        self.num_events = 0  # compute number of events (after cuts)

    def __iter__(self):
        return self

    def __next__(self):
        return self.get_next_event()

    @abstractmethod
    def get_next_event(self):
        # Implement event loading logic here.
        # Should return some standardized format, like a dict or nested
        # objects with attributes, that can be handled uniformly by DataDumper.
        # Selection cuts can probably be implemented here (apply a check
        # before returning). Any image processing should also be implemented
        # here, although common processing steps could be delegated to a
        # separate class.
        pass

    @abstractmethod
    def get_metadata(self):
        # Get dataset-related metadata (non-eventwise).
        pass


# Abstract class defining the output PyTables format.
# Should implement outputting to multiple files, shuffling, splitting into
# train/test sets, and any other output file formatting.
# Only one implementation should be needed (for the cta-ml data format), but
# in theory others could be implemented.
class DataDumper(ABC):
    def __init__(self, output_filenames, event_loader):
        self.output_filenames = output_filenames
        self.event_loader = event_loader

    @abstractmethod
    def dump(self):
        # Do all necessary PyTables file setup, formatting, etc.
        # Should be able to handle dumping to multiple files and possibly
        # multithreading (?).
        for event in self.event_loader:
            pass  # process and dump the event
        # Do any post-processing like shuffling or splitting into train/test.
This should make it possible to modify the event loading logic and the output data format separately. New input data formats (besides simtel/ctapipe) can be supported by implementing new EventLoaders. This should make the code more maintainable, hopefully reduce the need for hard-coding, and make the data format clearer directly from the code (as the cta-ml data format-specific code will be isolated in a CTAMLDataDumper class).

Update travis to use recommended installation method

The installation method was simplified in 1a7f685, but Travis wasn't updated to reflect this and fails with the following error:

CondaFileIOError: 'requirements.txt'. [Errno 2] No such file or directory: 'requirements.txt'

The command "conda install -q --yes --file requirements.txt -c cta-observatory -c conda-forge -c openastronomy" failed and exited with 1 during .

Update it to use one of the recommended installation methods (virtualenv or anaconda).

Add index to commonly sorted columns

PyTables allows you to add an index on a particular column, which can make queries/selections involving that column noticeably faster. Maybe we should consider implementing this (for example on MC energy in the EventInfo table)?
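
A minimal sketch of what this would look like with PyTables (the file and column names are illustrative):

import tables

with tables.open_file('events.h5', 'a') as f:
    table = f.root.Events
    # A completely sorted index (CSI) speeds up range queries such as
    # table.where('mc_energy > 1.0').
    table.cols.mc_energy.create_csindex()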

Number of loaded images differs for DL1Reader in 'mono' and 'stereo' mode

With the LST dataset (produced by @vuillaut), I'm getting the following numbers when running CTLearn in load_only mode with DL1Reader in 'mono' and 'stereo' mode. Is this expected, given that it's an LST mono reduction?

Mono mode:

INFO:Logging has been correctly set up
INFO:Loading data:
INFO:For a large dataset, this may take a while...
INFO:Batch size: 64
INFO:Number of training steps per epoch: 32046
INFO:Number of training steps between validations: 2500
INFO:Examples for total dataset:
INFO:  Total number: 2278845
INFO:  Breakdown by particletype:
INFO:    proton: 1136452
INFO:    gamma: 1142393
INFO:
INFO:Examples for training:
INFO:  Total number: 2050960
INFO:  Breakdown by particletype:
INFO:    proton: 1023014
INFO:    gamma: 1027946
INFO:
INFO:Examples for validation:
INFO:  Total number: 227885
INFO:  Breakdown by particletype:
INFO:    gamma: 114447
INFO:    proton: 113438
INFO:

Stereo mode:

INFO:Logging has been correctly set up
INFO:Loading data:
INFO:For a large dataset, this may take a while...
INFO:Batch size: 64
INFO:Number of training steps per epoch: 17411
INFO:Number of training steps between validations: 2500
INFO:Examples for total dataset:
INFO:  Total number: 1238154
INFO:  Breakdown by particletype:
INFO:    gamma: 574983
INFO:    proton: 663171
INFO:
INFO:Examples for training:
INFO:  Total number: 1114338
INFO:  Breakdown by particletype:
INFO:    gamma: 517461
INFO:    proton: 596877
INFO:
INFO:Examples for validation:
INFO:  Total number: 123816
INFO:  Breakdown by particletype:
INFO:    proton: 66294
INFO:    gamma: 57522
INFO:

IndexError: Cannot choose from an empty sequence

I'm having trouble extracting images from simtel files.
One of the files I'm looking at is
proton_20deg_0deg_run20089___cta-prod3-merged_desert-2150m-Paranal-3HB89-NGFD.simtel.gz

I get the following stacktrace:

(fact) [kbruegge@vollmond image-extractor]$ python image-extractor/image_extractor.py run_list.txt ./test.h5 
WARNING: AstropyDeprecationWarning: prefer the use of an EventSource or EventSourceFactory [__main__]
WARNING:astropy:AstropyDeprecationWarning: prefer the use of an EventSource or EventSourceFactory

gzip: stdout: Broken pipe
WARNING: AstropyDeprecationWarning: prefer the use of an EventSource or EventSourceFactory [__main__]
WARNING:astropy:AstropyDeprecationWarning: prefer the use of an EventSource or EventSourceFactory
WARNING:ctapipe.io.hessioeventsource.HESSIOEventSource:Only one pyhessio event_source allowed at a time. Previous hessio file will be closed.

gzip: stdout: Broken pipe
WARNING: AstropyDeprecationWarning: prefer the use of an EventSource or EventSourceFactory [__main__]
WARNING:astropy:AstropyDeprecationWarning: prefer the use of an EventSource or EventSourceFactory
WARNING:ctapipe.io.hessioeventsource.HESSIOEventSource:Only one pyhessio event_source allowed at a time. Previous hessio file will be closed.

gzip: stdout: Broken pipe
Traceback (most recent call last):
  File "image-extractor/image_extractor.py", line 641, in <module>
    extractor.process_data(data_file, args.max_events)
  File "image-extractor/image_extractor.py", line 273, in process_data
    random_tel_id = random.choice(selected_tels[tel_type])
  File "/home/kbruegge/.conda/envs/fact/lib/python3.6/random.py", line 258, in choice
    raise IndexError('Cannot choose from an empty sequence') from None
IndexError: Cannot choose from an empty sequence
Closing remaining open files:./test.h5...done

Update to ctapipe v0.7.0

I tried to update to ctapipe v0.7.0 (and/or the current master of ctapipe). So far, I'm able to produce the h5 files without any crashes. However, my configuration of the ctapipe CameraCalibrator or of the ctapipe GainSelector is not right. Can anyone (more familiar with ctapipe) please have a look into this? You can find the code here. Thanks in advance! The 'charge' channel seems to be working fine, while the 'pulse_time' channel is messed up.
[Images: LSTCam_gamma_66_charge and LSTCam_gamma_66_time]

ImageMapper error: weird shape for LSTCam [shift_imagemapper]

Trying to create a reader for LSTCam I get a ValueError:
dataset = DL1DataReader(files, selected_telescope_type='LST_LSTCam', selected_telescope_ids=[1], event_info=['mc_energy', 'core_x', 'core_y', 'alt', 'az', 'mc_particle'], )


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
1 dataset = DL1DataReader(files, selected_telescope_type='LST_LSTCam', selected_telescope_ids=[1],
----> 2 event_info=['mc_energy', 'core_x', 'core_y', 'alt', 'az'],
3 # mapping_settings={'mapping_method': {'LST_LSTCam': 'axial_addressing'}}
4 )

~/dl1-data-handler/dl1_data_handler/reader.py in __init__(self, file_list, mode, selected_telescope_type, selected_telescope_ids, selection_string, intensity_selection, shuffle, seed, image_channels, mapping_method, mapping_settings, array_info, event_info, transforms, validate_processor)
173 self.image_channels = image_channels
174 self.image_mapper = ImageMapper(channels=self.image_channels,
--> 175 **mapping_settings)
176
177 if array_info is None:

~/dl1-data-handler/dl1_data_handler/image_mapper.py in __init__(self, camera_types, mapping_method, padding, interpolation_image_shape, rotate_camera, mask_interpolation, channels)
127 self.index_matrixes[camtype] = None
128 # Calculating the mapping tables for the selected camera types
--> 129 self.mapping_tables[camtype] = self.generate_table(camtype)
130
131 def map_image(self, pixels, camera_type):

~/dl1-data-handler/dl1_data_handler/image_mapper.py in generate_table(self, camera_type)
189 # Finding the nearest point in the hexagonal grid for each point in the square grid
190 tree = spatial.cKDTree(hex_grid)
--> 191 nn_index = np.reshape(tree.query(table_grid)[1], (output_dim, output_dim))
192 # Store the nn_index array in the index_matrix. Replace virtual pixel indexes with -1.
193 if map_method == 'axial_addressing':

~/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py in reshape(a, newshape, order)
255 [5, 6]])
256 """
--> 257 return _wrapfunc(a, 'reshape', newshape, order=order)
258
259

~/anaconda3/lib/python3.6/site-packages/numpy/core/fromnumeric.py in _wrapfunc(obj, method, *args, **kwds)
50 def _wrapfunc(obj, method, *args, **kwds):
51 try:
---> 52 return getattr(obj, method)(*args, **kwds)
53
54 # An AttributeError occurs if the object does not have

ValueError: cannot reshape array of size 9800 into shape (98,98)

Add run information in DL1DataReader

To fully benefit from Gammaboard functionalities, we need to keep track of:

  • num_showers
  • shower_reuse
  • spectral_index
  • max_scatter_range
  • energy_range_max
  • energy_range_min
  • min_alt
  • max_alt
  • mc_energy of triggered events

of every run in the DL1DataReader.

MAGIC software school

For the MAGIC software school, we want to show a small demonstration of the current DL pipeline with the MAGIC telescopes. Therefore, the demo notebook has to be updated and extended.

Add bins argument to reader.num_examples() method

The method reader.num_examples() calculates the number of examples in the dataset, grouped by an arbitrary set of features (typically the class label). It would not work if continuous features (e.g. mc_energy) are used as labels, since those features are more or less unique for a given event. Instead, it would be useful to group these features by bins.

To enable this, add an argument to this method which is a dictionary of feature names to bins and use it when grouping. If bins are specified for a feature, they would override the default grouping behavior when grouping the examples.
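
A hypothetical sketch of the proposed interface, here using pandas for the grouping (the function name and the bins-dict shape follow the issue text; the implementation is illustrative):

import numpy as np
import pandas as pd

def num_examples(df, group_by, bins=None):
    # Count examples grouped by the given features; any feature listed in
    # `bins` (feature name -> bin edges) is binned first, overriding the
    # default grouping behavior.
    bins = bins or {}
    keys = [pd.cut(df[f], bins[f]) if f in bins else df[f] for f in group_by]
    return df.groupby(keys).size()

# Example: count events in logarithmic mc_energy bins.
df = pd.DataFrame({'mc_energy': 10 ** np.random.uniform(-2, 2, 1000)})
print(num_examples(df, ['mc_energy'], bins={'mc_energy': np.logspace(-2, 2, 5)}))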

Update to use refactored ctapipe event routines

In the latest version of ctapipe, we have quite massively overhauled the routines that read event data, so it would be good to start to think about supporting them to maintain compatibility. For the most part this is easy:

Replace hessio_event_source

This is now replaced with the ctapipe.io.EventSourceFactory class (if you use the ctapipe configuration system; see the docs), or you can use the simple wrapper event_source as so:

from ctapipe.io import event_source
...

with event_source("input.simtel.gz") as source:
    for event in source:
        do_stuff(event)

That will then support all data file formats that we support, not just hessio. Note that EventSources can now be looped over multiple times, or provide random access using an EventSeeker.

Use the instrument module correctly, particularly the SubarrayDescription class:

It seems you have your own names for each telescope, which is bad and will certainly cause confusion! Please use the automatic naming given by the SubarrayDescription hierarchy, which has been there for many months now. Simply converting a TelescopeDescription (or CameraGeometry or OpticsDescription, etc.) to a string gives you the standardized identifier of that part of the instrument:

sub = event.inst.subarray 

print("This subarray is called", str(sub))
for tel_id, tel in sub.tel.items():
    print("telescope with id {} has identifier '{}'".format(tel_id, tel))

Would output:

This subarray is called MonteCarloArray
telescope with id 1 has identifier 'LST:LSTCam'
telescope with id 2 has identifier 'LST:LSTCam'
telescope with id 3 has identifier 'LST:LSTCam'
telescope with id 4 has identifier 'LST:LSTCam'
telescope with id 5 has identifier 'MST:NectarCam'
telescope with id 6 has identifier 'MST:NectarCam'
telescope with id 7 has identifier 'MST:NectarCam'
telescope with id 8 has identifier 'MST:NectarCam'
telescope with id 9 has identifier 'MST:NectarCam'
telescope with id 10 has identifier 'MST:NectarCam'
telescope with id 11 has identifier 'MST:NectarCam'
telescope with id 12 has identifier 'MST:NectarCam'
telescope with id 13 has identifier 'MST:NectarCam'
telescope with id 14 has identifier 'MST:NectarCam'
telescope with id 15 has identifier 'MST:NectarCam'
telescope with id 16 has identifier 'MST:NectarCam'
telescope with id 17 has identifier 'MST:NectarCam'
telescope with id 18 has identifier 'MST:NectarCam'
telescope with id 19 has identifier 'MST:NectarCam'

I'm not sure if colons are allowed in e.g. HDF5 dataset names, so we can also think about changing this to a better character, but it's best to do it in ctapipe, so all dependent tools then use the same naming.

Note also that the use of event.inst.tel_pos and other quantities have been removed. All of that information is in the SubarrayDescription, e.g. sub.tel[13].camera gives you the CameraGeometry for telescope 13, or sub.tel[13].optics.equivalent_focal_length is the focal length.

Switch to use ctapipe-stage1?

Hi,

As the ctapipe stage1 tool is now available and writes images and parameters in a standard way,
would you consider using that instead of duplicating efforts?

Is there something missing from the stage1 tool you need?

Do you need some support that would enable you to make the switch?

Investigating gain selection issue

I'm opening this thread to follow the gain selection issue raised by PR #72.

It seems I could reproduce an (the?) issue in the case of waveforms of shape (n_gain, n_pix, 1) - that would correspond to ASTRICam.
In this case the gain_mask from _, gain_mask = gain_selector.select_gains(cam_id, waveform) is of shape (n_pix,).

Maybe @riwim or @mackaiver can tell me if that's the issue they encountered.

Starting work on pre-processing and data reader

Hi.
I'm opening a thread to discuss the developments to be done for the reading and pre-processing of the produced files.
As shown in an ASWG conf call, I have a prototype reader for LST that I use with the standard analysis chain.
Also, as discussed in Madrid, @mikael10j has done a lot of work on the pre-processing of images.

I think for a start, we should try to list the developments that we want to see done and order them by priorities.
I will start with:

  • develop a reader for produced HDF5 files into ctapipe containers
  • this reader should be set as a plug-in for ctapipe
  • this has to be related to #31 from early stages to decide where we go

dl1_data_handler_demo.ipynb: "ValueError: ndarray is not C-contiguous"

Hi all,

I'm trying to run the DL1 Data Handler Demo notebook. I have installed ctapipe as described there, and dl1dh via pip as described here. When I execute data_writer.process_data(runlist) from the demo notebook, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-47301b29a123> in <module>
----> 1 data_writer.process_data(runlist)

/home/user/dl1-data-handler_0.8.0/dl1_data_handler/writer.py in process_data(self, run_list)
    812             for run in run_list:
    813                 logger.info("Starting run for target: {}...".format(run['target']))
--> 814                 self._process_data(run['inputs'], run['target'])
    815 
    816         logger.info("Done!")

/home/user/dl1-data-handler_0.8.0/dl1_data_handler/writer.py in _process_data(self, file_list, output_filename)
    886             # Write all events sequentially
    887             for event in event_source:
--> 888                 self.calibrator.calibrate(event)
    889                 self.combine_channels(event)
    890                 if (self.preselection_cut_function is not None and not

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/calib/camera/calibrator.py in calibrate(self, event)
    120         self.r1.calibrate(event)
    121         self.dl0.reduce(event)
--> 122         self.dl1.calibrate(event)

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/calib/camera/dl1.py in calibrate(self, event)
    211                         e = self.extractor
    212                         g = event.inst.subarray.tel[telid].camera
--> 213                         e.neighbours = g.neighbor_matrix_where
    214                     extract = self.extractor.extract_charge
    215                     charge, peakpos, window = extract(cleaned)

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/astropy/utils/decorators.py in __get__(self, obj, owner)
    742                 return val
    743             else:
--> 744                 val = self.fget(obj)
    745                 obj.__dict__[self._key] = val
    746                 return val

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/instrument/camera.py in neighbor_matrix_where(self)
    392         ndarray
    393         """
--> 394         return np.ascontiguousarray(np.array(np.where(self.neighbor_matrix)).T)
    395 
    396     @lazyproperty

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/astropy/utils/decorators.py in __get__(self, obj, owner)
    742                 return val
    743             else:
--> 744                 val = self.fget(obj)
    745                 obj.__dict__[self._key] = val
    746                 return val

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/instrument/camera.py in neighbor_matrix(self)
    376     @lazyproperty
    377     def neighbor_matrix(self):
--> 378         return _neighbor_list_to_matrix(self.neighbors)
    379 
    380     @lazyproperty

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/astropy/utils/decorators.py in __get__(self, obj, owner)
    742                 return val
    743             else:
--> 744                 val = self.fget(obj)
    745                 obj.__dict__[self._key] = val
    746                 return val

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/instrument/camera.py in neighbors(self)
    369             self.pix_x.value,
    370             self.pix_y.value,
--> 371             rad=1.4 * dist.value
    372         )
    373 

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/instrument/camera.py in _find_neighbor_pixels(pix_x, pix_y, rad)
    685     indices = np.arange(len(pix_x))
    686     kdtree = KDTree(points)
--> 687     neighbors = [kdtree.query_ball_point(p, r=rad) for p in points]
    688     for nn, ii in zip(neighbors, indices):
    689         nn.remove(ii)  # get rid of the pixel itself

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/ctapipe/instrument/camera.py in <listcomp>(.0)
    685     indices = np.arange(len(pix_x))
    686     kdtree = KDTree(points)
--> 687     neighbors = [kdtree.query_ball_point(p, r=rad) for p in points]
    688     for nn, ii in zip(neighbors, indices):
    689         nn.remove(ii)  # get rid of the pixel itself

ckdtree.pyx in scipy.spatial.ckdtree.cKDTree.query_ball_point()

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/scipy/spatial/ckdtree.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview_cwrapper()

/scratch/user/envs/ctapipe_dl1dh/lib/python3.7/site-packages/scipy/spatial/ckdtree.cpython-37m-x86_64-linux-gnu.so in View.MemoryView.memoryview.__cinit__()

ValueError: ndarray is not C-contiguous

Any ideas what I could do?

Edit: I also tried to install dl1_data_handler with the described conda method. I get the same error.

Simplify requirements file

The requirements for DL1DataHandler are currently listed in two separate files, requirements.txt and environment.yml. Is there a reason for this, or is it possible to provide just environment.yml? Also, the environment file should only specify the top-level dependencies (currently setuptools, numpy, pytables, and pyyaml). Lower-level dependencies will be installed automatically and don't need to be specified.

Filling variables whose input comes from simtel and corsika

Some of the metadata fields that are stored as attributes are currently not being filled, specifically:

  • CORSIKA_ver
  • E_max
  • E_min
  • prod_site_B_field
  • prod_site_alt
  • prod_site_array
  • prod_site_coord
  • prod_site_subarray
  • simtel_ver
  • spectral_index

These data can be found in the simtel history block, but retrieval of such a block through pyhessio/ctapipe has not been implemented (see this issue cta-observatory/pyhessio#52).

Pixel positions for Astri are wrong in h5 files

The pixel positions in the h5 file for SSTA are such that, going from low to high pixel number, the camera is filled row by row from the bottom left to the top right. However, it seems that the ASTRI camera should be filled from the top left to the bottom right in 8x8 sub-module sectors.

DL1DataReader doesn't support multiprocessing data loading

Trying to load the data with a pytorch Dataloader using several processes (num_workers) makes pytables crash.

dataset = DL1DataReader(files, 
                        selected_telescope_type='LST_LSTCam', selected_telescope_ids=[1],
                        image_channels=['charge', 'peakpos'],
                        event_info=['mc_energy', 'core_x', 'core_y', 'alt', 'az', 'shower_primary_id'],
                        array_info=['x', 'y', 'z'],
                        mapping_settings={'camera_types': ['LSTCam'],
                                          'mapping_method': {'LSTCam': 'axial_addressing'}},
                        transforms=[ConvertShowerPrimaryIDToClassLabel(), 
                                    ImpactToKm(), 
                                    DataForGammaLearn()]
                        )
dl = DataLoader(dataset, batch_size=256, shuffle=True, drop_last=True, num_workers=4)
for i, batch in enumerate(dl):
    print('batch {}'.format(i))
    print(type(batch['image']))
    print(batch['label'].shape)

---------------------------------------------------------------------------
HDF5ExtError                              Traceback (most recent call last)
<ipython-input-6-3293362984ef> in <module>()
----> 1 for i, batch in enumerate(dl):
      2     print('batch {}'.format(i))
      3     print(type(batch['image']))
      4     print(batch['label'].shape)

~/anaconda3/envs/torch10/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    635                 self.reorder_dict[idx] = batch
    636                 continue
--> 637             return self._process_next_batch(batch)
    638 
    639     next = __next__  # Python 2 compatibility

~/anaconda3/envs/torch10/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_next_batch(self, batch)
    656         self._put_indices()
    657         if isinstance(batch, ExceptionWrapper):
--> 658             raise batch.exc_type(batch.exc_msg)
    659         return batch
    660 

HDF5ExtError: Traceback (most recent call last):
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/mikael/dl1-data-handler/dl1_data_handler/reader.py", line 395, in __getitem__
    image = self._get_image(filename, self.tel_type, image_index)
  File "/home/mikael/dl1-data-handler/dl1_data_handler/reader.py", line 341, in _get_image
    in f.root.Telescope_Type_Information.where(query)][0]
  File "/home/mikael/dl1-data-handler/dl1_data_handler/reader.py", line 340, in <listcomp>
    length = [x['num_pixels'] for x
  File "tables/tableextension.pyx", line 903, in tables.tableextension.Row.__next__
  File "tables/tableextension.pyx", line 1084, in tables.tableextension.Row.__next__inkernel
  File "tables/tableextension.pyx", line 577, in tables.tableextension.Table._read_records
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 216, in H5Dread
    can't read data
  File "H5Dio.c", line 587, in H5D__read
    can't read data
  File "H5Dchunk.c", line 2276, in H5D__chunk_read
    error looking up chunk address
  File "H5Dchunk.c", line 3022, in H5D__chunk_lookup
    can't query chunk address
  File "H5Dbtree.c", line 1047, in H5D__btree_idx_get_addr
    can't get chunk info
  File "H5B.c", line 341, in H5B_find
    unable to load B-tree node
  File "H5AC.c", line 1763, in H5AC_protect
    H5C_protect() failed
  File "H5C.c", line 2565, in H5C_protect
    can't load entry
  File "H5C.c", line 6890, in H5C_load_entry
    Can't deserialize image
  File "H5Bcache.c", line 181, in H5B__cache_deserialize
    wrong B-tree signature

End of HDF5 error back trace

Problems reading records.

The following minimal example also generates errors:

import numpy as np
import multiprocessing as mp

def read_dataset(num, dataset):
    print(dataset)
    for k in range(1000):
        j = int(np.random.rand() * len(dataset))
        data = dataset[j]
        print('worker {} reads data, iteration {}'.format(num, k))

workers = [mp.Process(target=read_dataset, args=(i, dataset,)) for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

worker 3 reads data, iteration 64
worker 0 reads data, iteration 79
worker 1 reads data, iteration 72

Process Process-9:

worker 2 reads data, iteration 66

Process Process-10:
Traceback (most recent call last):

worker 3 reads data, iteration 65

  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
Traceback (most recent call last):

worker 2 reads data, iteration 67

  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)

worker 3 reads data, iteration 66

  File "<ipython-input-26-dae7f6b79d79>", line 6, in read_dataset
    data = dataset[j]
  File "/home/mikael/dl1-data-handler/dl1_data_handler/reader.py", line 393, in __getitem__
    nrow = f.root._f_get_child(self.tel_type)[image_index]['event_index']
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/tables/table.py", line 2079, in __getitem__
    return self.read(start, stop, step)[0]
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/tables/table.py", line 1934, in read
    arr = self._read(start, stop, step, field, out)

worker 3 reads data, iteration 67

  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/tables/table.py", line 1848, in _read
    self._read_records(start, stop - start, result)
  File "<ipython-input-26-dae7f6b79d79>", line 6, in read_dataset
    data = dataset[j]
  File "tables/tableextension.pyx", line 577, in tables.tableextension.Table._read_records
  File "/home/mikael/dl1-data-handler/dl1_data_handler/reader.py", line 393, in __getitem__
    nrow = f.root._f_get_child(self.tel_type)[image_index]['event_index']

worker 3 reads data, iteration 68

tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 216, in H5Dread
    can't read data
  File "H5Dio.c", line 587, in H5D__read
    can't read data
  File "H5Dchunk.c", line 2304, in H5D__chunk_read
    unable to read raw data chunk
  File "H5Dchunk.c", line 3659, in H5D__chunk_lock
    data pipeline read failed
  File "H5Z.c", line 1279, in H5Z_pipeline
    filter returned failure during read

End of HDF5 error back trace

Problems reading records.
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/tables/table.py", line 2079, in __getitem__
    return self.read(start, stop, step)[0]
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/tables/table.py", line 1934, in read
    arr = self._read(start, stop, step, field, out)
  File "/home/mikael/anaconda3/envs/torch10/lib/python3.6/site-packages/tables/table.py", line 1848, in _read
    self._read_records(start, stop - start, result)
  File "tables/tableextension.pyx", line 577, in tables.tableextension.Table._read_records
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 216, in H5Dread
    can't read data
  File "H5Dio.c", line 587, in H5D__read
    can't read data
  File "H5Dchunk.c", line 2304, in H5D__chunk_read
    unable to read raw data chunk
  File "H5Dchunk.c", line 3659, in H5D__chunk_lock
    data pipeline read failed
  File "H5Z.c", line 1279, in H5Z_pipeline
    filter returned failure during read

End of HDF5 error back trace

Problems reading records.

While the following example using Threading works:

import threading

def worker(num, dataset):
    print(dataset)
    for k in range(1000):
        j = int(np.random.rand() * len(dataset))
        data = dataset[j]
        print('worker {} reads data, iteration {}'.format(num, k))

threads = []
for i in range(4):
    t = threading.Thread(target=worker, args=(i, dataset,))
    threads.append(t)
    t.start()

worker 0 reads data, iteration 999

worker 1 reads data, iteration 991
worker 3 reads data, iteration 987
worker 2 reads data, iteration 987
worker 1 reads data, iteration 992worker 3 reads data, iteration 988
worker 2 reads data, iteration 988

worker 3 reads data, iteration 989
worker 1 reads data, iteration 993worker 2 reads data, iteration 989

worker 1 reads data, iteration 994
worker 3 reads data, iteration 990
worker 1 reads data, iteration 995
worker 3 reads data, iteration 991
worker 2 reads data, iteration 990
worker 3 reads data, iteration 992
worker 1 reads data, iteration 996
worker 3 reads data, iteration 993
worker 1 reads data, iteration 997
worker 3 reads data, iteration 994
worker 2 reads data, iteration 991
worker 1 reads data, iteration 998
worker 3 reads data, iteration 995
worker 1 reads data, iteration 999worker 2 reads data, iteration 992

worker 3 reads data, iteration 996
worker 2 reads data, iteration 993
worker 3 reads data, iteration 997
worker 2 reads data, iteration 994
worker 3 reads data, iteration 998
worker 3 reads data, iteration 999
worker 2 reads data, iteration 995
worker 2 reads data, iteration 996
worker 2 reads data, iteration 997
worker 2 reads data, iteration 998
worker 2 reads data, iteration 999

Threading in Python doesn't seem to allow true concurrency though, because of the GIL (Global Interpreter Lock).

Loading batches in a single process may slow down the training.

Loading HDF5 files in ctapipe container

Hi,
Is there a method to load HDF5 files in ctapipe containers?
(same way we open simtel files)
So that we can start processing in ctapipe from calibrated images if desired.

Issues reducing SCT data

I tried to reduce some SCT from prod3b data with write_data.py (v0.7.4) and I'm getting the following error:

Process Process-1:
Traceback (most recent call last):
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/dl1_data_handler/writer.py", line 858, in _process_data
    event_source = io.event_source(filename)
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/ctapipe/io/eventsourcefactory.py", line 102, in event_source
    **kwargs)
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/ctapipe/core/factory.py", line 254, in produce
    instance = factory._instance
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/ctapipe/core/factory.py", line 226, in _instance
    instance = product(config, parent, **kwargs)
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/ctapipe/io/simteleventsource.py", line 22, in __init__
    self.file_ = SimTelFile(self.input_url)
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/eventio/simtel/simtelfile.py", line 136, in __init__
    self.telescope_descriptions[o.telescope_id][key] = o.parse()
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/eventio/simtel/objects.py", line 243, in parse
    assert_exact_version(self, supported_version=1)
  File "/home/nieto/Software/anaconda3/envs/dl1dh-v074/lib/python3.7/site-packages/eventio/version_handling.py", line 11, in assert_exact_version
    given_version=self.header.version,
NotImplementedError: Unsupported version of CameraOrganization: only supports version 1, got 2

This error seems to be related to #135 in pyeventio, fixed in v0.16.1. The latest ctapipe release, used by dl1-data-handler, includes pyeventio v0.11.0. It seems we either need to wait for a new ctapipe release that includes an updated pyeventio version, or add pyeventio > v0.16.1 as a dl1-data-handler dependency.

image-extractor as a conda package

Building image-extractor as a conda package may be helpful to users. In particular, adding this feature has been suggested by people working with image-extractor on the GRID (@bregeon).

Wrong mapping for FlashCam

There is a mapping bug for FlashCam when using the current ctapipe and dl1-data-handler master.
Example gamma event:
[Image: FlashCam_oversampling]
This is due to a discrepancy in the pixel positions (in the following images, the pixel intensity encodes the pixel index) between ctapipe (ctapipe.instrument.camera.CameraGeometry)
[Image: ctapipe_FlashCam]
and the dl1 h5 files (subarray.tel[tel_id].camera.pix_x.value):
[Image: h5_pixposFlashCam]
The ImageMapper class is currently reading the pixel positions from ctapipe, which leads to the scattered events above.

I would suggest reading the pixel positions from the h5 files in the reader and passing them to the ImageMapper. If no pixel positions are passed to the ImageMapper, it should get them from ctapipe. This would also make it possible to use the dl1 reader and ImageMapper without a ctapipe dependency.

No module named dl1_data_handler.writer

Hey, I'm unsure of what the installation process is, but I have done the following:

  1. Created a conda environment with Python 3.7 (Python 3.6 gave me a compatibility conflict because of CUDA 10.2).
  2. Installed as per the instructions, checked out v0.7.3.
  3. Typed the import command, which failed with an import error (screenshot not reproduced).

Add necessary documentation

In order to make the code more useful, the following documentation would be very helpful:

  • An up-to-date and detailed description of the current cta-ml format (all tables, their purposes, the datatypes of each column, etc.). This is the most critical.
  • The version of ctapipe it is written for (as this is the key dependency and the most likely interface to change) This should be done after #15.
  • Instructions for extending/modifying the code to handle new data formats (this will depend strongly on whether/when the rewrite in #25 occurs).
  • Instructions on performance and mass processing. Given that this code is likely to be used to process large amounts of data, it would be useful to give instructions on how best to handle the procedure (multiple jobs/multithreading). Performance benchmarks could also be provided if available to give an idea of how long processing will take (per event).
  • Benchmarks on the effects of compression/chunking/indexing on write/read/search speed.

@nietootein @aribrill Anything else you can think of?

reader: Data selection

Realizing the data selection via f.root.Events.where and a cut_condition string is very nice and simple. However, if the selection doesn't apply to the event info (like the event intensity or the result of the cleaning), or if the selection needs more complex operations than the ones listed here, we can't use it. And that's exactly why _select_event_intensity is implemented. What do you think of a more generic _select_event function which takes as argument a dictionary of {filtering.function: filtering.parameters}? That way we could easily add filtering functions without modifying the core code of the reader. If you agree, I can take care of coding it.
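
A hypothetical sketch of that idea (the names below are illustrative, not the actual reader API):

def _select_event(event, filters):
    # `filters` maps a callable(event, **params) -> bool to its parameters;
    # an event is kept only if every filter passes.
    return all(f(event, **params) for f, params in filters.items())

def intensity_above(event, threshold):
    return event['intensity'] > threshold

filters = {intensity_above: {'threshold': 50.0}}
print(_select_event({'intensity': 120.0}, filters))  # True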

'channels' argument in the image_mapper has no functionality

I noticed that the 'channels' argument in the image_mapper and the third dimension in self.image_shapes don't have any functionality, since the different channels are dealt with in the map_image function. Therefore, we can remove this argument and the third dimension of self.image_shapes.

Mapped images turned into double

Images returned by the image mapper (except the vector in the case of axial addressing) are np.float64, while the original data are np.float32.
Note that using float as the dtype for a NumPy array is equivalent to np.float64.

Add processing functions from CTLearn and GammaLearn

Update processor.py with Transforms that implement desired processing functions from CTLearn and GammaLearn. The most efficient approach might be to implement first the highest-priority functions that we plan to use, and any others only later as needed. This is Task 1 in #35.

The functions currently implemented in CTLearn are:

  • Cropping
    • This consists of three parts: cleaning the image, calculating and appending the image centroid coordinates, and cropping each input about the centroid, so these could be implemented as separate transforms. The cleaning is needed as an intermediate step to calculate accurate centroids.
    • Arguments
      • image cleaning method (two-level or None)
      • bounding box size for each telescope type
      • two-level cleaning thresholds for each telescope type
      • whether to return the cleaned or uncleaned images
    • Problem: with cleaning and cropping as separate Transforms, it's not clear to me how to implement the option to use cleaning for calculating the centroids while still returning the uncleaned, but cropped, images.
  • Normalization
    • The only option is log normalization, which consists of subtracting from each pixel the minimum value over all selected images (probably best to have the users calculate this themselves), then taking the natural log
  • Sorting
    • trigger: order images by all triggered followed by all blank images
    • size: order images by largest to smallest total sum of pixel values

To me, the highest priority functions of these are sorting (required for our CNN-RNN implementation) and perhaps image cleaning. As I recall, we ended up not using cropping or log normalization in our benchmarks, as we found them to have little effect on classification performance.

@mikael10j, what are the processing functions implemented in GammaLearn?

I think the next steps are to decide which processing functions we want to have in DL1DH and then split up the work of implementing them. Are there any other functions we need but are in neither package?

use more ctapipe functionality

I haven't been following the development of this code much, but looking quickly at the code, it seems to re-implement a lot of what is available in ctapipe, specifically the functionality of the HDF5TableWriter (which turns a ctapipe Container class into a row of a HDF5 table automatically). The ctapipe io system was designed exactly for the use case of writing DL1 data, so shouldn't be forgotten. We've been busy making it more useful recently as well, and there are still a few issues out for increased functionality.

I suspect much of this code could be simplified by using it (probably by >50%), and it gives you the advantage of keeping physical unit info, metadata, etc, which will be required in CTA output files.

I'd suggest:

  • use ctapipe.io.HDF5TableWriter to simplify most of the tables you create, and get rid of your custom table definitions (the Container classes should be sufficient, and if not we can define new ones). You can basically throw away all of the code in dump_event this way (see the sketch after this list).
  • for the header info, I would suggest dumping the info from the Provenance() system, which contains all of the input files, output files, and the machine info, python version, etc. instead of using a custom solution. It gives you all this info automatically as a JSON object or dict.
  • I think you are misusing the EventSource classes - you never need to specify the subclass, just use a EventSourceFactory (or the event_source() helper function) , and it will construct the correct one based on the file contents (e.g. if you open a simtel file, it checks the magic number and return a SimTelEventSource), so the user just needs to specify a filename
  • You mix your own argument parsing with the ctapipe Tool argument parsing, which is somewhat confusing. We should stick to one configuration system (though I agree the current one in ctapipe is not so friendly; we are working to make our own rather than use traitlets.config as we do now). If you use the ctapipe.core.Tool base class, you get provenance handling and configuration for free.
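
A minimal sketch of the HDF5TableWriter suggestion above, assuming a ctapipe version with this interface (the container and table names are illustrative):

from ctapipe.core import Container, Field
from ctapipe.io import HDF5TableWriter

class EventInfo(Container):
    event_id = Field(-1, 'event identifier')
    mc_energy = Field(0.0, 'simulated energy')

with HDF5TableWriter('events.h5', group_name='dl1') as writer:
    # Each Container instance becomes one row of the HDF5 table, with
    # column definitions derived automatically from the Fields.
    writer.write('event_info', EventInfo(event_id=1, mc_energy=0.5))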

Add data quality tests

Possible data quality tests:

  • conversion ok (based on file opening, data retrieval and file size)

  • randomly check image quality (test TBD)

Wrong attributes access in 'R0Container'

Hi,

When running image-extractor over a simtel file: gamma_20deg_0deg_run8___cta-prod3_desert-2150m-Paranal-HB9_cone10.simtel I have this error:

Traceback (most recent call last):
  File "image_extractor.py", line 641, in <module>
    extractor.process_data(data_file, args.max_events)
  File "image_extractor.py", line 462, in process_data
    event_row['run_number'] = event.r0.run_id
AttributeError: 'R0Container' object has no attribute 'run_id'
Closing remaining open files:./dataset.hdf5...done

I temporarily bypassed it by commenting out line 462. However, I'm not sure whether this might cause trouble when reading/using the hdf5 produced.

ImageMapper: nearest interpolation normalization

Hi,

I was wondering whether the nearest-neighbor interpolation method implemented in ImageMapper should normalize pixels to preserve the overall intensity?

I might be doing something wrong, but I found out that it does not:

[Image not reproduced]

Any clue?

Implement compression

As far as I know we have not implemented any compression so far. Having compression would be of great help when reducing large datasets.
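
For reference, PyTables enables compression through a Filters object; a minimal sketch (the compression level and library are illustrative choices):

import tables

filters = tables.Filters(complevel=5, complib='zlib')
with tables.open_file('events.h5', 'w', filters=filters) as f:
    # All leaves created in this file inherit the compression filters.
    f.create_carray('/', 'images', tables.Float32Atom(), shape=(1000, 100, 100))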

Use signed instead of unsigned types

Some information is currently being stored as unsigned integers (e.g. event_id). I've discovered that TensorFlow is unable to automatically convert unsigned NumPy types (except uint8) to tensors. I've added a workaround in CTLearn here to cast these unsigned types to signed tensors. It would be more convenient for them to be stored as signed types directly, unless there's an important reason for them to be stored as unsigned types.

See this Stack Overflow for more information.
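
For illustration, the workaround amounts to a cast like the following (a sketch, not the actual CTLearn code):

import numpy as np

event_ids = np.array([1, 2, 3], dtype=np.uint64)  # as stored in the file
event_ids_signed = event_ids.astype(np.int64)     # TensorFlow-friendly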

Build failing due to inability to find ctapipe

The Travis build is currently failing in master on the following step:

$ pip install .
Processing /home/travis/build/cta-observatory/dl1-data-handler
Collecting numpy>=1.15.0 (from dl1-data-handler==0.7.5)
  Downloading https://files.pythonhosted.org/packages/c1/e2/4db8df8f6cddc98e7d7c537245ef2f4e41a1ed17bf0c3177ab3cc6beac7f/numpy-1.16.3-cp36-cp36m-manylinux1_x86_64.whl (17.3MB)
Collecting tables>=3.4.4 (from dl1-data-handler==0.7.5)
  Downloading https://files.pythonhosted.org/packages/ab/79/4e1301a87f3b7f27aa6c9cb1aeba4875ff3edb62a6fe3872dc8f04983db4/tables-3.5.1-cp36-cp36m-manylinux1_x86_64.whl (4.3MB)
Collecting ctapipe>=0.6.2 (from dl1-data-handler==0.7.5)
  ERROR: Could not find a version that satisfies the requirement ctapipe>=0.6.2 (from dl1-data-handler==0.7.5) (from versions: none)
ERROR: No matching distribution found for ctapipe>=0.6.2 (from dl1-data-handler==0.7.5)
The command "pip install ." failed and exited with 1 during .

The problem seems to be that pip cannot find ctapipe, which makes sense as it's not available on PyPI. Previously, when the build was succeeding, we were installing ctapipe directly from GitHub (last successful commit).

I see at least two approaches. We could revert to installing directly from GitHub. Alternatively, we could switch to using conda for installation, so we can use ctapipe's recommended installation method using conda. A concern is that the most recent ctapipe conda package may not be compatible with the version of eventio we need (see #52). Any opinions or other ideas?

Add image mapping methods from CTLearn

Implement Tasks 2 and 3 from #35.

The image mapping methods to merge are bilinear, bicubic, and nearest-neighbor interpolation; rebinning; oversampling; axial addressing; and image shifting.
