wipacrepo / decotools

Toolset for analyzing DECO data

Home Page: https://wipacrepo.github.io/decotools/

License: MIT License

Python 99.78% Shell 0.22%

image-recognition python physics deep-learning

decotools's Introduction

decotools

A set of tools to help make DECO life easier ✨

Installation

The decotools Python package can be installed directly from GitHub. decotools is built on Google's Tensorflow, which must be installed to use decotools. If tensorflow is already installed, then decotools can be installed from GitHub via

$ pip install git+https://github.com/WIPACrepo/decotools#egg=decotools

Alternatively, if tensorflow is not installed, then the following commands can be used to install tensorflow along with decotools.

For installing the CPU version of tensorflow:

$ pip install git+https://github.com/WIPACrepo/decotools#egg=decotools[tf]

For installing the GPU version of tensorflow:

$ pip install git+https://github.com/WIPACrepo/decotools#egg=decotools[tf-gpu]

Documentation

The documentation for decotools is available at https://wipacrepo.github.io/decotools/.

Contributing

Contributions to decotools are welcome! Please see the contributing guide for more information.

decotools's People

jrbourbeau milesjwinter mattmeehan apizzuto

decotools's Issues

Add image metrics tool to help with blob extraction

It would be nice to have a function that takes an image path and returns several metrics to help us determine if it is a "good" image that we should pass to decotools.extract_blobs.

Metrics I can think of:

Average pixel intensity (over the entire image) — this can help filter out super noisy images
Maximum pixel intensity (over the entire image)
Number of pixels above a specified intensity threshold.

@mattmeehan @cschneider6 @milesjwinter any other potential metrics you can think of?
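
A minimal sketch of the three metrics listed above, assuming the image has already been loaded into a numpy array (the issue proposes passing an image path instead). The function name, the dictionary keys, and the threshold parameter are hypothetical and not part of decotools.

```python
import numpy as np


def get_image_metrics(image, threshold=0.5):
    """Return simple intensity metrics for an image array.

    `image` is an array of pixel intensities (2-D grayscale, or 3-D with a
    trailing channel axis). `threshold` is a hypothetical parameter given
    in the same units as the pixel values.
    """
    image = np.asarray(image, dtype=float)
    return {
        # Average over every pixel (and channel) in the image
        'mean_intensity': float(image.mean()),
        # Brightest single value in the image
        'max_intensity': float(image.max()),
        # Number of values strictly above the threshold
        'n_above_threshold': int(np.count_nonzero(image > threshold)),
    }


image = np.array([[0.0, 0.2], [0.9, 0.7]])
metrics = get_image_metrics(image, threshold=0.5)
```

An image could then be skipped before blob extraction if, for example, its mean intensity is above some noise cut.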

Another bug in get_iOS_files()

I tried calling get_iOS_files(start_date='2017-07-26',device_id='D8D8E48D-7D3F-4693-A927-A402CF127D25') and I got this error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Tried testing with a few different device IDs and it's the same problem. Basically, any start_date before 2017-07-27 seemed to be causing errors. And when start_date='2017-07-28' (today's date), no error would occur but I would get back an empty list (even when I manually check and the files do exist).

Add module with CNN image classifications

I'd like to add a new module to decotools that adds image classification using our trained CNN.

Add travis CI integration

I'd like to have Travis CI run the tests automatically.

Need to finish image_file_to_xml_file function

Need to complete image_file_to_xml_file function in decotools/io.py

get_android_files returns image file paths that don't exist

So it appears that there are some entries in the database that don't have a corresponding image file in the file system (specifically, this happens for 30 entries in the DB). Given that get_android_files constructs the image file path from the "path" column in the DB, it makes sense that get_android_files should also check if the image file exists. If the file doesn't exist, then it can be dropped from the output of get_android_files.
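
The existence check described above could look roughly like the following sketch. The helper name is hypothetical; it only assumes that the file paths have already been built from the DB "path" column.

```python
import os


def drop_missing_files(image_files):
    """Split paths into those that exist on disk and those that do not."""
    existing, missing = [], []
    for path in image_files:
        if os.path.isfile(path):
            existing.append(path)
        else:
            missing.append(path)
    return existing, missing
```

Returning the missing paths as well makes it easy to report how many DB entries have no image file.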

Add documentation page

It would be really great if we could add a documentation page hosted on GitHub pages. I'll work on getting an MkDocs directory together.

io.py causes an issue with ipython

When running ipython within the decotools directory, the following error occurs:

  File "/home/mrmeehan/.local/lib/python2.7/site-packages/IPython/utils/openpy.py", line 9, in <module>
    import io
  File "/home/mrmeehan/software/decotools/decotools/io.py", line 7, in <module>
    import pandas as pd

Add testing utility to generate test image arrays

Currently, when we want to create test (fake) image files in the decotools tests, we save a random numpy array to a temporary file. For example,

import numpy as np
from skimage.io import imsave

tmpfile_1 = tmpdir.join('temp_image_1.png')
imsave(str(tmpfile_1), np.random.random((5, 5, 4)))
tmpfile_2 = tmpdir.join('temp_image_2.png')
imsave(str(tmpfile_2), np.random.random((200, 100, 4)))

files = [str(tmpfile_1), str(tmpfile_2)]

I'd prefer to have a function that generates test images, so we don't have to repeat code.
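
One possible shape for such a helper is sketched below. The name make_test_images and its parameters are hypothetical. It only generates the arrays; writing them to temporary files with imsave would stay as in the snippet above.

```python
import numpy as np


def make_test_images(shapes, seed=None):
    """Return one random image array per requested shape.

    Values are uniform in [0, 1). Passing a seed makes the images
    reproducible between test runs.
    """
    rng = np.random.RandomState(seed)
    return [rng.random_sample(shape) for shape in shapes]


images = make_test_images([(5, 5, 4), (200, 100, 4)], seed=2)
```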

get_iOS_files by deviceID

In the get_iOS_files function, I think it would be useful to include a parameter to select a specific device ID. This way we could easily separate images taken by each device. Also, it might be useful to be able to exclude a specific device ID. There is one device that has ~15,000 events, but I think this was because the camera wasn't properly covered, so these events wouldn't be useful.

Add parallelized histogramming of pixel intensities

We often histogram pixel intensities when analyzing DECO data. While this is a simple task on its own, it can be time-consuming to histogram many images in serial. I think it would be useful to have a function that parallelizes this using dask. The function would just take a list of image files as input and return the corresponding histograms, e.g.:

def get_intensity_histograms(files, rgb_sum=False, cumulative=False, n_jobs=1):
    ...

We'd have parameters to determine the rgb conversion, cumulative vs differential distribution, and number of processes for parallelization.

@jrbourbeau What do you think? And any other parameters or features that would be useful?
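
As a rough sketch of the proposed interface, the version below uses the standard library's ThreadPoolExecutor as a stand-in for dask, and takes already-loaded image arrays rather than file paths. Both are simplifications made for this example, and the bins parameter is an added assumption.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def intensity_histogram(image, bins, rgb_sum=False, cumulative=False):
    """Histogram the pixel intensities of a single image array."""
    image = np.asarray(image, dtype=float)
    if rgb_sum and image.ndim == 3:
        # Sum the first three channels (R, G, B) of each pixel, ignoring alpha
        image = image[..., :3].sum(axis=-1)
    # np.histogram works on the flattened array
    counts, _ = np.histogram(image, bins=bins)
    return np.cumsum(counts) if cumulative else counts


def get_intensity_histograms(images, bins, rgb_sum=False, cumulative=False,
                             n_jobs=1):
    """Histogram several images using n_jobs worker threads."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(intensity_histogram, image, bins,
                               rgb_sum, cumulative)
                   for image in images]
        # Collect results in the same order as the input images
        return [future.result() for future in futures]
```

With dask, the per-image function could be wrapped in delayed calls instead of being submitted to a thread pool.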

Add a contribution guide for project

Add a contribution guide to the documentation

Move CV validation into fit method

Currently the CNN class has both a fit and fit_with_kfold method. The content of these methods is similar. fit_with_kfold is just fit with sklearn.model_selection.StratifiedKFold built into it.

I'd like to unify these two methods by adding a cv parameter to the fit method.

@mattmeehan @milesjwinter any objections?

Add image as a BlobGroup class attribute

Right now several BlobGroup class methods are defined like below:

def get_sub_image(self, image):
def get_raw_moment(self, image, p, q):
def get_max_intensity(self, image):

I think that this image input should be upgraded to a BlobGroup class attribute. This makes sense logically because each blob group is only defined for a given image.

@mattmeehan can you foresee any issue with this?
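
To illustrate the proposed change, here is a toy stand-in class. BlobGroupSketch, its constructor arguments, and the rectangular row/column ranges are all hypothetical; the real BlobGroup is not reproduced here. The point is only that the methods no longer take an image argument.

```python
import numpy as np


class BlobGroupSketch(object):
    """Toy stand-in for BlobGroup that stores its image as an attribute."""

    def __init__(self, image, rows, cols):
        self.image = np.asarray(image)
        self.rows = rows  # (start, stop) row range covered by the group
        self.cols = cols  # (start, stop) column range covered by the group

    def get_sub_image(self):
        # Slice the stored image instead of taking it as an argument
        return self.image[self.rows[0]:self.rows[1],
                          self.cols[0]:self.cols[1]]

    def get_max_intensity(self):
        return self.get_sub_image().max()


group = BlobGroupSketch(np.arange(16).reshape(4, 4), rows=(0, 2), cols=(0, 2))
```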

get_iOS_files breaks if no xml file exists

It looks like some of the png image files don't have a corresponding xml file. Right now this is causing get_iOS_files to break

Update CNN fit default parameter values

I'd like to update the default parameter values for CNN.fit method to their currently used values.

Addition of image collection statistics

I think it would be useful to have some built-in image collection statistics. I'm not sure exactly what we would want to include, but things along these lines (for a given date range):

Number of images taken
Number of images per day
Number of images that pass some specified cuts
etc.

We might even consider adding some interactive plotting capabilities.

Add flake8 support

I'd like to add flake8 support to enforce a consistent coding style.

TODO:

Edit decotools to be flake8 compliant
Add flake8 . to .travis.yml so that it is always run automatically

@zdgriffith @mattmeehan anything else you guys can think of related to coding style?

Switch to tensorflow for keras backend

Currently, decotools uses Theano as the backend for keras. However, Theano will stop being supported. Its final release, Theano 1.0.0, was recently released. I'd like to switch to using tensorflow as the keras backend.

Add a get_android_files function

It'd be great to have a get_android_files function analogous to the current get_iOS_files function.

Grayscale conversion options

Currently, in blob_extraction.py images are converted to grayscale using a simple RGB sum. However, we have older code that converts to grayscale using a weighted RGB sum in PIL. It would be nice to have an option to use either of these conventions for backwards compatibility.
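
A small sketch of what such an option could look like, assuming images are numpy arrays with a trailing RGB(A) axis. The function and its method parameter are hypothetical. The weights are the ITU-R 601-2 luma coefficients that PIL uses for Image.convert('L'). PIL returns integer pixel values, while this sketch keeps floats, so results will not be bit-identical to the older code.

```python
import numpy as np

# Luma weights used by PIL's Image.convert('L') (ITU-R 601-2)
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114])


def to_grayscale(image, method='sum'):
    """Convert an RGB(A) image array to grayscale.

    method='sum' adds the R, G and B channels.
    method='weighted' applies the PIL-style luma weights.
    """
    rgb = np.asarray(image, dtype=float)[..., :3]  # drop alpha if present
    if method == 'sum':
        return rgb.sum(axis=-1)
    if method == 'weighted':
        return np.dot(rgb, LUMA_WEIGHTS)
    raise ValueError("method must be 'sum' or 'weighted'")
```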

Bug in get_iOS_files

I tried running get_iOS_files to see if there were any new devices in the data set. It worked just fine with the default settings, but then I tried setting include_min_bias=True in case any new phones haven't seen events yet, and I'm getting a weird error when I do this:

Traceback (most recent call last):
  File "getAllDevices.py", line 6, in <module>
    all_files = dt.get_iOS_files(include_min_bias=True)
  File "/home/cschneider/.virtualenvs/deco/lib/python2.7/site-packages/decotools/fileio.py", line 193, in get_iOS_files
    df = get_metadata_dataframe(file_list)
  File "/home/cschneider/.virtualenvs/deco/lib/python2.7/site-packages/decotools/fileio.py", line 57, in get_metadata_dataframe
    xml_dict = xml_to_dict(xml_file)
  File "/home/cschneider/.virtualenvs/deco/lib/python2.7/site-packages/decotools/fileio.py", line 19, in xml_to_dict
    tree = ET.parse(xmlfile) # Initiates the tree Ex:
  File "/cvmfs/icecube.opensciencegrid.org/py2-v3/RHEL_6_x86_64/lib64/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/cvmfs/icecube.opensciencegrid.org/py2-v3/RHEL_6_x86_64/lib64/python2.7/xml/etree/ElementTree.py", line 657, in parse
    self._root = parser.close()
  File "/cvmfs/icecube.opensciencegrid.org/py2-v3/RHEL_6_x86_64/lib64/python2.7/xml/etree/ElementTree.py", line 1665, in close
    self._raiseerror(v)
  File "/cvmfs/icecube.opensciencegrid.org/py2-v3/RHEL_6_x86_64/lib64/python2.7/xml/etree/ElementTree.py", line 1517, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0

Any ideas on how to fix this?
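
"no element found: line 1, column 0" is the error ElementTree raises for an empty file, so one possible workaround is to catch ParseError and skip that file. The sketch below is not the decotools xml_to_dict; the helper name and the flat tag-to-text dictionary layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def xml_to_dict_or_none(xml_file):
    """Parse a flat XML metadata file.

    Returns a {tag: text} dictionary built from the root's children, or
    None if the file is empty or otherwise not well-formed XML.
    """
    try:
        root = ET.parse(xml_file).getroot()
    except ET.ParseError:
        return None
    return {child.tag: child.text for child in root}
```

The caller can then drop files for which None is returned before building the metadata DataFrame.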

Add blob extraction unit tests

I'd like to have some unit tests for decotools.extract_blobs. We could use a publicly available Android test image to check that things like the blob area, length, etc. are the expected values.

Add tests

Adding tests for decotools would be a nice addition

Switch to sphinx docs

I'd like to switch to using sphinx instead of mkdocs for the decotools documentation. Mkdocs doesn't auto-generate API documentation.

String format error in get_iOS_files()

I tried running decotools.get_iOS_files(device_id='F216114B-8710-4790-A05D-D645C9C79C27',end_date='2017-07-20',n_jobs=20) and this returns an error message that reads "ValueError: Unknown string format."

I've run this same format of code before so I'm not sure what's been changed to now cause this error.

Problem with missing metadata files in fileio.py

I think the KeyError issue when specifying a device ID or phone model is coming from the fact that there are actually no .xml files in the 2017.07.29, 2017.07.30, or 2017.07.31 directories in /net/deco. I manually did a check for anything ending in .xml in these directories and nothing is being returned/listed. So adding the if-else statement to look for files starting with 'metadata-' vs. files that just start with the device ID does not fix this error. It must be a problem when the program attempts to open the non-existent file. I can take a look at this some more and see if I can fix it.

Bug in fileio.py

When I try calling get_iOS_files() on a specific phone model or device ID, I get this long error message that I think ultimately has to do with the pandas.concat function. The final line of the error message says:

  File "/home/cschneider/.virtualenvs/deco/lib/python2.7/site-packages/pandas/core/reshape/concat.py", line 239, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

Add blob finding support

I would like to add some blob-finding support to decotools. Something along the lines of

blob_list = decotools.get_image_features(image_file)

Add bounding box option to metrics module

It would be useful to be able to calculate local image metrics in addition to global ones. We could include a bounding box keyword argument in get_intensity_metrics and get_rgb_hists, which would allow the user to evaluate those functions on a local box of pixels surrounding a blob, rather than the entire image. This way we can reject noise on both a global and local level.
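
A sketch of how a bounding box keyword could work, assuming the box is given as (row_min, row_max, col_min, col_max) in pixel indices with exclusive upper bounds. That layout and the function name are assumptions for illustration, not the signature of get_intensity_metrics.

```python
import numpy as np


def local_intensity_metrics(image, bounding_box=None):
    """Mean and max intensity over the whole image or over a box of pixels.

    bounding_box is a hypothetical (row_min, row_max, col_min, col_max)
    tuple; None means the entire image is used.
    """
    image = np.asarray(image, dtype=float)
    if bounding_box is not None:
        row_min, row_max, col_min, col_max = bounding_box
        image = image[row_min:row_max, col_min:col_max]
    return {'mean': float(image.mean()), 'max': float(image.max())}


image = np.arange(16).reshape(4, 4)
global_metrics = local_intensity_metrics(image)
local_metrics = local_intensity_metrics(image, bounding_box=(1, 3, 1, 3))
```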

PyTables not a requirement

PyTables should be in requirements/default.txt so DataFrames can be saved to hdf files.

convnet.convert_images raises ValueError if edge check fails

convnet.convert_images raises a ValueError if any of the images passed to it failed the edge check. Here's some example code with the full error message:

import decotools as dt
# Get some image files
image_files = dt.get_android_files(start_date='2016.10.30', end_date='2016.11.07')
# process_image_files calls the convert_images function
images = dt.process_image_files(image_files)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-59e1aa4ad428> in <module>()
----> 1 images = dt.process_image_files(image_files)

/home/mrmeehan/software/decotools/decotools/convnet.py in process_image_files(image_files, size)
     74             images.append(None)
     75
---> 76     scaled_images = convert_images(images)
     77
     78     return scaled_images

/home/mrmeehan/software/decotools/decotools/convnet.py in convert_images(images)
     54         shape:(n_images, n_rows, n_cols, 1)
     55     """
---> 56     images = np.array(images, dtype='float32')
     57     images = np.mean(images/255., axis=-1, keepdims=True)
     58     if len(images.shape) == 3:

ValueError: setting an array element with a sequence.

This happens because images that fail the edge check in process_image_files are still added to the image list as None:

decotools/decotools/convnet.py

Lines 68 to 74 in 35fe00c

if pass_edge_check(maxX, maxY, image.size, crop_size=2*size):
    x0, x1, y0, y1 = get_crop_range(maxX, maxY, size=size)
    cropped_img = image.crop((x0, y0, x1, y1))
    cropped_img = np.asarray(cropped_img)
    images.append(cropped_img)
else:
    images.append(None)

Because of these None entries, the images list is no longer regularly shaped, so numpy can't convert it to an array. An easy fix would be to ignore images that fail the edge check altogether, i.e. don't add anything to the image list. It might be good to have some sort of logging or print message to let the user know which, or at least how many, images failed the edge check. Many images end up failing, so it might be easier to just report the number.
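
The suggested fix might look something like this sketch, which filters out the None entries before the array conversion and reports how many were dropped. The helper name is hypothetical, and it assumes all remaining crops have the same shape.

```python
import numpy as np


def stack_valid_images(images):
    """Drop None entries, then stack the remaining equally shaped crops."""
    valid = [img for img in images if img is not None]
    n_failed = len(images) - len(valid)
    if n_failed:
        print('{} of {} images failed the edge check'.format(
            n_failed, len(images)))
    return np.array(valid, dtype='float32'), n_failed


crops = [np.ones((4, 4, 3)), None, np.zeros((4, 4, 3))]
stacked, n_failed = stack_valid_images(crops)
```

The same effect could be had by simply not appending anything in the else branch and counting the failures there.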

Improve test coverage

I'd like to start adding more tests to decotools