The current setup of DICOM files plus two .csv files is a little messy and cumbersome to work with. To extract lesions, we first have to load all DICOM series of interest from disk, then match them against the first .csv file (prostateX-images-train.csv) to get the lesion info (ijk, spacing, etc.). After that we have to load prostateX-findings-train.csv to obtain the zone and clinSig information.
We have to put all this information together in order to extract the right lesion, with the right truth label and zone information from the right DICOM series, before we can start training. That's why I decided to restructure the data and combine the DICOM pixel data and the two .csv files into one hdf5 dataset.
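To make the bookkeeping described above concrete, here is a minimal sketch of joining the two .csv files per lesion. It is not the code from the h5_converter branch. The column names (ProxID, fid, ijk, VoxelSpacing, DCMSerDescr, zone, ClinSig), the choice of (ProxID, fid) as join key and the sample rows are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical miniature stand-ins for the two .csv files.
# Column names and values are assumptions, not copied from the real files.
IMAGES_CSV = """ProxID,fid,ijk,VoxelSpacing,DCMSerDescr
ProstateX-0000,1,36 72 9,"0.5,0.5,3",t2_tse_tra
ProstateX-0001,1,40 60 7,"0.5,0.5,3",t2_tse_tra
"""

FINDINGS_CSV = """ProxID,fid,zone,ClinSig
ProstateX-0000,1,PZ,TRUE
ProstateX-0001,1,AS,FALSE
"""


def join_lesion_info(images_file, findings_file):
    """Combine image rows and finding rows that describe the same lesion."""
    # Index the findings by (patient id, finding id) for constant-time lookup.
    findings = {
        (row["ProxID"], row["fid"]): row
        for row in csv.DictReader(findings_file)
    }

    lesions = []
    for row in csv.DictReader(images_file):
        finding = findings.get((row["ProxID"], row["fid"]))
        if finding is None:
            continue  # No matching finding, so no label for this row.
        lesions.append({
            "patient": row["ProxID"],
            "finding_id": int(row["fid"]),
            "series": row["DCMSerDescr"],
            # Assumed formats: "i j k" and "dx,dy,dz".
            "ijk": tuple(int(v) for v in row["ijk"].split()),
            "spacing": tuple(float(v) for v in row["VoxelSpacing"].split(",")),
            "zone": finding["zone"],
            "clin_sig": finding["ClinSig"].strip().upper() == "TRUE",
        })
    return lesions


lesions = join_lesion_info(io.StringIO(IMAGES_CSV), io.StringIO(FINDINGS_CSV))
```

With the old setup, a join like this would have to be repeated (and matched against the loaded DICOM series) every time we want to train, which is the step the hdf5 set is meant to do once.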
The code for this can be found in the h5_converter branch. There are, as of now, three files: csv_fix.py, h5_converter.py and h5_query.py. csv_fix.py and h5_converter.py only have to be run once to build the hdf5 set (which I have already done). The structure of the set is described in h5_converter.py.
To actually retrieve something from the set we can use h5_query.py. It contains a class that lets us pull DICOM images and their lesion information almost instantly, which is much faster than our old way of reading DICOM files from disk and then loading their pixel data.
Note that there is no actual lesion pixel data in the hdf5 set, just the lesion attributes from the .csv files and the DICOM pixel data. Extracting the lesion pixel data from the DICOM pixel data should be much more straightforward with the query result from h5_query.py.
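As a rough illustration of that extraction step, the sketch below cuts a square 2D patch around a lesion centre out of a pixel volume. The function name extract_patch, the volume[k][j][i] index order (slice, row, column) and the reading of ijk as (column, row, slice) are assumptions, and the actual shape of the query result from h5_query.py may differ.

```python
def extract_patch(volume, ijk, half_size):
    """Return a 2D patch centred on the lesion, clipped at the image border.

    Assumes volume is indexed as volume[k][j][i] (slice, row, column) and
    ijk is (column, row, slice). Both are assumptions for this sketch.
    """
    i, j, k = ijk
    rows = len(volume[k])
    cols = len(volume[k][0])

    # Clamp the window so lesions near the edge do not index out of bounds.
    r0, r1 = max(j - half_size, 0), min(j + half_size + 1, rows)
    c0, c1 = max(i - half_size, 0), min(i + half_size + 1, cols)

    return [list(row[c0:c1]) for row in volume[k][r0:r1]]


# Toy volume: 3 slices of 5 rows x 6 columns, value encodes its own position.
volume = [
    [[100 * k + 10 * j + i for i in range(6)] for j in range(5)]
    for k in range(3)
]

patch = extract_patch(volume, (2, 3, 1), half_size=1)
corner_patch = extract_patch(volume, (0, 0, 0), half_size=1)
```

Plain nested lists keep the sketch dependency-free. The same indexing also works on a NumPy array, apart from the final list conversion.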
The new HDF5 dataset can be found at https://jspunda.stackstorage.com/s/0Zy95CMqQzwVaAq
The password for the file is: ismi2017
Whether or not we are actually going to be using this new set of course depends on what everyone thinks, but in my opinion it will simplify and speed things up a lot in the future.