
prostatex's Issues

How to handle multiple similar DICOM series per patient?

As can be seen from the data set overview page on the wiki, it's
possible that we have multiple, similar DICOM series per patient. Some of these
carry the exact same DICOM series description, while others have almost identical
descriptions.

My question is: how do we want to handle this? Currently, for simplicity's sake, we
either dismiss any duplicate DICOM series or just take the series with the
highest series number.

However, we are throwing away possibly valuable data. On the other hand,
maybe these duplicate DICOM series are too blurry or too distorted to use
and it would only negatively impact our performance.

Ideally, you would want to go through the entire set and for each duplicate,
hand pick the ones that you want to keep, but I don't think this is feasible,
given the amount of DICOM series and perhaps our lack of expert knowledge.

As for the test set, if we encounter multiple similar DICOM series for a lesion
in a certain patient, we could for example classify all of them and somehow average the predictions. The same trade-off between possible data loss and negatively impacted
performance applies here, though.
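As a rough idea of what "somehow average them" could look like, a minimal sketch (the model, the list of duplicate series per finding, and the extract_lesion helper are all hypothetical placeholders):

    import numpy as np

    def predict_finding(model, duplicate_series, lesion_info):
        # One probability per duplicate DICOM series that contains this finding.
        # extract_lesion() is a hypothetical helper returning a (1, rows, cols, channels) batch.
        probs = [model.predict(extract_lesion(series, lesion_info))[0, 0]
                 for series in duplicate_series]
        # Plain mean over the duplicates; a weighted mean or max would also be an option.
        return float(np.mean(probs))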

Note that if we do want to take into account every DICOM series we find for
a lesion in a patient, our .hdf5 representation and lesion extraction
requires some refactoring to support this properly.

transfer from real millimetre range to pixel range

Make a function with:
input: the real area we want to extract, e.g. a 16 mm x 16 mm square
output: the pixel range, e.g. [8, 8]

The example above means:
if we want to extract a 16 mm x 16 mm square,
we need an (8 x 8) pixel square, given VoxelSpacing [2, 2, 3].
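A minimal sketch of such a function (assuming the spacing comes in as (x, y, z) millimetres per voxel, like the VoxelSpacing column):

    import numpy as np

    def mm_to_pixels(size_mm, voxel_spacing):
        # Convert an in-plane size in mm to a pixel range,
        # e.g. (16, 16) mm with spacing [2, 2, 3] -> [8, 8] pixels.
        spacing_xy = np.asarray(voxel_spacing[:2], dtype=float)
        return np.round(np.asarray(size_mm, dtype=float) / spacing_xy).astype(int)

    # mm_to_pixels((16, 16), [2, 2, 3]) -> array([8, 8])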

Other team edits

Hi y'all,
Those of us on the other team are a bit apprehensive about just editing your code, because we don't want to mess things up for you. Our question is how you would like us to treat your repo.

Would it be an idea for us to create our own 'master branch' that we merge our branches into, and then ask whether you want to merge our code into yours? Or would you rather we fork the repo into our own? Any preferences?

Thank you for your input :)

Experiment with the network

To find out which combination of architecture and parameters works best for the network, we should first reduce the variation in the results when we run the algorithm multiple times.

Tasks

Once the results are roughly the same on every run, it becomes much easier to compare them.
We can try the following (a small Keras sketch follows the list).

  • Use weight decay.
  • Add dropout.
  • See if more layers change the results, for example: replace a pool layer with a conv layer, or add another 1x1 conv at the end of the network.
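A minimal Keras sketch of the first two points (the layer sizes, the 16x16 input shape and the 1e-4 / 0.5 values are placeholders, not tuned choices):

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
    from keras.regularizers import l2

    model = Sequential([
        # weight decay via an L2 kernel regularizer on the conv and dense layers
        Conv2D(32, (3, 3), activation='relu', kernel_regularizer=l2(1e-4),
               input_shape=(16, 16, 1)),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(64, activation='relu', kernel_regularizer=l2(1e-4)),
        Dropout(0.5),                     # dropout before the classification layer
        Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])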

2D vs. 3D

Let's discuss if we are going to use 2D or 3D data as input for the initial model.
Both probably have advantages and disadvantages, and having an overview hopefully helps us make a better decision.

I've not included the "3D ~= more data" point in the list below. The reason is that we don't know whether more data means more information and thus better performance, or just means more data and thus slower training, or possibly a combination of the two.

Neither did I include the point "lesions are 3-dimensional, therefore using 3D makes more sense". I'm not sure this argument is valid.

Do comment or add more points that I forgot.

2D

pros

  • Easier to visualize.
  • There is existing augmentation we can use (e.g. Keras ImageDataGenerator).
  • Transfer learning: enough pretrained weights available online.

cons

3D

pros

cons
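As a side note (neither a pro nor a con): in Keras the practical difference is mainly the input rank and the layer type. A minimal illustration with made-up sizes:

    from keras.layers import Conv2D, Conv3D

    # 2D: one slice per lesion, shape (rows, cols, channels)
    conv_2d = Conv2D(32, (3, 3), activation='relu', input_shape=(16, 16, 1))

    # 3D: a small stack of slices per lesion, shape (slices, rows, cols, channels)
    conv_3d = Conv3D(32, (3, 3, 3), activation='relu', input_shape=(8, 16, 16, 1))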

Experimental results consultation

Hi,
I'm working on ProstateX-2017 as a machine learning exercise and haven't reached good performance yet, so I want to use your code as a reference. Because I don't have the labels of the test set, I would like to ask you how your network performs.

Patient-Level Train Validation Split

Currently we are splitting the set of lesions. We should split the set of patients instead, to avoid having one lesion of a patient in the training set and another in the validation set.
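A sketch of a patient-level split with scikit-learn's GroupShuffleSplit (X, y and patient_ids are placeholders; patient_ids would be a numpy array with one ProxID per lesion):

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # X: lesion arrays, y: clinSig labels, patient_ids: one patient id per lesion
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
    train_idx, val_idx = next(splitter.split(X, y, groups=patient_ids))

    # All lesions of a given patient now end up on the same side of the split.
    assert not set(patient_ids[train_idx]) & set(patient_ids[val_idx])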

Utilize K-trans

We have to find out if the K-trans maps can be used and, if so, how.

  • Make Ktrans hdf5 file available.

  • Open K-trans files

  • Visualize K-trans files

  • Put K-trans files in an input array

  • Get the exact lesion area.

  • Normalise the data.

  • Train a network using K-trans files

To solve now!

  • Fix 'Ktrans' test set

Improvements that could be tried:

  • Try to clip in the whole image.

  • Check the 0.0 case
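For the "Open K-trans files" and "Visualize K-trans files" tasks above, a minimal SimpleITK sketch (assuming the Ktrans maps are the .mhd MetaImage files from the challenge; the filename is a placeholder):

    import SimpleITK as sitk
    import matplotlib.pyplot as plt

    ktrans = sitk.ReadImage('ProstateX-0000-Ktrans.mhd')   # placeholder filename
    array = sitk.GetArrayFromImage(ktrans)                  # shape: (slices, rows, cols)
    print(ktrans.GetSpacing(), array.shape)

    plt.imshow(array[array.shape[0] // 2], cmap='gray')     # show the middle slice
    plt.show()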

Python2 or Python3

Just to avoid some future annoyances:
Are you using Python 2 or Python 3?

Load images

Using PyDicom/SimpleITK to read image data.
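A minimal sketch of reading a whole DICOM series with SimpleITK (the directory path is a placeholder; pydicom's per-file reading would work as well):

    import SimpleITK as sitk

    series_dir = '/path/to/ProstateX-0000/series'           # placeholder path
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(reader.GetGDCMSeriesFileNames(series_dir))
    image = reader.Execute()

    pixels = sitk.GetArrayFromImage(image)                   # numpy array, (slices, rows, cols)
    print(image.GetSpacing(), pixels.shape)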

ProstateX-Images-Train.csv clarification

Something I've noticed in ProstateX-Images-Train.csv is that, for example, for ProstateX-0000 there are three entries for 'ep2d_diff_tra_DYNDIST' (rows 3, 4 and 5) in the spreadsheet. Those three rows all come from the same DICOM series, as can be seen from the 'DCMSerDescr' and 'DCMSerNum' columns. Other than the 'Name' column, these three rows contain exactly the same information.

Furthermore, there is just one 'ep2d_diff_tra_DYNDIST' series with series number 6 for ProstateX-0000. So why do we need three entries that all point to the same centroid location, with only their names differing? Is it safe to say that we can keep only one and ignore the other two? Maybe @henkjanhuisman / @jonasteuwen knows more?
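A quick pandas check of that observation (the 'ProxID' patient-id column is an assumption on my part; it only verifies that duplicate rows differ in nothing but 'Name'):

    import pandas as pd

    df = pd.read_csv('ProstateX-Images-Train.csv')

    # Rows that are identical once the 'Name' column is ignored
    dupes = df[df.drop('Name', axis=1).duplicated(keep=False)]
    print(dupes[['ProxID', 'DCMSerDescr', 'DCMSerNum', 'Name']])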

Initial convnet not learning anything

Just pushed new code to keras_net/train.py which contains a simple convnet (conv32, conv32, maxpool, conv64, conv64, maxpool, dense512). Good news is that it at least runs the net for our loaded lesions and the labels, but it doesn't seem to be learning anything. The predictions before and after training are pretty much the same. After the first epoch the validation accuracy doesn't change anymore. For each of the lesions it has a very high probability for one class and almost zero for the other.

I've already tried different optimizers, like SGD and Adam, but right now it's using rmsprop and I've also tweaked the learning rate ranging from 0.01 to 0.00001, but no real change. Not sure what's causing this. I would have expected to see at least some learning with a network like this.

Just for testing/debugging purposes I tried to train a network with just two layers and only a few neurons in it. This produces the same type of output as the convnet; everything classified as the same class.

Currently it's splitting the data in a 70-30 split using train_test_split from scikit-learn. Perhaps it generates a very unbalanced split?
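One quick way to rule that out (assuming y is an integer 0/1 numpy array of clinSig labels):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y)   # stratify keeps the class ratio in both splits

    print('train:', np.bincount(y_train), 'val:', np.bincount(y_val))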

Maybe someone else has an idea.

Augmentation of Data

Hi,
The purpose of this issue is to discuss the parameters of the data augmentation and find the best combination.
Today @kbasten and I looked at all the settings of ImageDataGenerator and discussed what configuration seemed reasonable to us (we called it baseline).
Here I will put some plots of the AUCs during training with different settings.
We can use those plots to decide on further improvements.

This is the baseline configuration.

[baseline]
rotation_range=360
width_shift_range=0.1
height_shift_range=0.1
shear_range=0.2
zoom_range=0.2
channel_shift_range=2.0
fill_mode=nearest
horizontal_flip=True
vertical_flip=True

The other configurations can be found in augmentation.ini
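For reference, the same baseline expressed as a Keras ImageDataGenerator (values copied from the block above; the fit_generator call at the end is just an illustration of how it would be used):

    from keras.preprocessing.image import ImageDataGenerator

    baseline_datagen = ImageDataGenerator(
        rotation_range=360,
        width_shift_range=0.1,
        height_shift_range=0.1,
        shear_range=0.2,
        zoom_range=0.2,
        channel_shift_range=2.0,
        fill_mode='nearest',
        horizontal_flip=True,
        vertical_flip=True,
    )
    # Typical use: model.fit_generator(baseline_datagen.flow(X_train, y_train, batch_size=32), ...)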

Old plots without patient level data split

baseline ~0.78 https://cloud.githubusercontent.com/assets/13403863/26149891/62a5fbcc-3afc-11e7-80e0-d14a62cdd3c5.png

channel_shift_10 ~0.80 topscore! https://cloud.githubusercontent.com/assets/13403863/26149867/401853b6-3afc-11e7-8654-f12a6332b244.png

channel_shift_50 ~0.78 https://cloud.githubusercontent.com/assets/13403863/26151127/a6d97602-3b01-11e7-9e18-4ce8ff1f3e89.png

no_rotation ~0.68 https://cloud.githubusercontent.com/assets/13403863/26148096/d9c0c220-3af5-11e7-81e3-537a7712b6f3.png

Intermediate results presentation

So, guys, I've checked the slides and the mid-project presentation is indeed next Friday, 19 May, from 14:00 to 17:00.

I haven't found anywhere how long each presentation should be, but looking at the time slot and the number of groups, I'm guessing each group will have around 15 minutes for the presentation plus questions.

Lesion visualization and understanding data better

Now that we are able to load lesion cutouts and we have started training, it might be a good idea to understand our data a little better. At the moment it feels like we are just 'throwing the lesions into a network' and hoping that the network picks up on what features are important to determine whether or not a lesion is clinically significant.

While that may eventually provide some results after lots of trial and error with parameter tweaking, it would still feel as if we don't understand the data that well. I think that if we try to get a better idea of what exactly makes a lesion clinically significant, we could tweak our network in a more targeted way, as opposed to 'random' trial and error.

Yesterday, during the meeting it was mentioned that for example in ADC images lower values (400-600) are indicative of a clinically significant lesion. That's why I decided to start there. In the data_visualization branch you can find an initial visualization function that plots lesions and their histograms. It supports saving to disk, but that takes a few minutes, so if you don't want to bother with that you can download the ADC plots from here: https://jspunda.stackstorage.com/s/6UN2Ds6s7kwJtqE (password: ismi2017)

I've looked at the plots myself and already noticed that it is indeed true that a lesion with more pixels in the 400-600 intensity range is more likely to be clinically significant. But there are of course a lot of 'strange' cases as well: some lesions that are not clinically significant do have peaks at lower values like 500, and some clinically significant lesions don't have any pixels with values below 600.

So I'm not sure how useful it is to look at the histogram of the lesion, but I think it's a first step towards understanding the data a little better. Maybe someone else has some other ideas for plotting things in our data. I think we could also use some of the plots in our presentation next week.
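A minimal sketch of the lesion-plus-histogram plot described above (the random array is just a stand-in for a real ADC lesion cutout):

    import numpy as np
    import matplotlib.pyplot as plt

    lesion = np.random.randint(300, 1500, size=(16, 16))      # placeholder for a real ADC cutout

    fig, (ax_img, ax_hist) = plt.subplots(1, 2, figsize=(8, 4))
    ax_img.imshow(lesion, cmap='gray')
    ax_img.set_title('lesion cutout')

    ax_hist.hist(lesion.ravel(), bins=50)
    ax_hist.axvspan(400, 600, alpha=0.3, label='400-600 range')  # the range mentioned in the meeting
    ax_hist.legend()
    plt.show()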

find out how to make lesion images isotropic

The T2TRA images have different mm/pixel values. To train a network, this mm/pixel value has to be the same for all input images.
Additionally, once we can change the mm/pixel value, we can use it to combine the lesion images of different scan types (e.g. ADC, t2_tra).
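A sketch of resampling an image to isotropic spacing with SimpleITK (the 0.5 mm target spacing and the linear interpolator are assumptions, not decided values):

    import SimpleITK as sitk

    def resample_isotropic(image, new_spacing=(0.5, 0.5, 0.5)):
        # Resample a SimpleITK image to a fixed spacing so every input has the same mm/pixel.
        old_spacing = image.GetSpacing()
        old_size = image.GetSize()
        new_size = [int(round(osz * ospc / nspc))
                    for osz, ospc, nspc in zip(old_size, old_spacing, new_spacing)]
        return sitk.Resample(image, new_size, sitk.Transform(), sitk.sitkLinear,
                             image.GetOrigin(), new_spacing, image.GetDirection(),
                             0.0, image.GetPixelID())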

New HDF5 dataset

The current setup of DICOM files plus two .csv files is a little messy and cumbersome to work with. To extract lesions, we would first have to load all DICOM series of interest from disk, then compare this with the information in the first .csv file (prostateX-images-train.csv) to get the lesion info (ijk, spacing, etc.). After that we have to load prostateX-findings-train.csv to obtain the zone and clinSig information.

We have to put all this information together in order to extract the right lesion, with the right truth label and zone information from the right DICOM series, before we can start training. That's why I decided to restructure the data and combine the DICOM pixel data and the two .csv files into one hdf5 dataset.

The code for this can be found in the h5_converter branch. There are, as of now, three files: csv_fix.py, h5_converter.py and h5_query.py. Csv_fix and h5_converter only have to be run once in order to actually build the hdf5 set (which I have already done). The way the set is structured can be found in h5_converter.py.

To actually retrieve something from the set we can use h5_query.py. It contains a class that lets us pull up DICOM images and their lesion information very quickly. It's almost instant, and much faster than our old way of reading DICOM files from disk and then loading their pixel data.

Note that there is no actual lesion pixel data in the hdf5 set. Just the lesion attributes from the .csv files and the DICOM pixel data. Actually extracting the lesion pixel data from the DICOM pixel data should be much more straightforward with the query result from h5_query.py.

The new HDF5 dataset can be found at https://jspunda.stackstorage.com/s/0Zy95CMqQzwVaAq
The password for the file is: ismi2017

Whether or not we are actually going to be using this new set of course depends on what everyone thinks, but in my opinion it will simplify and speed things up a lot in the future.
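A generic h5py sketch for poking around in the new file (the filename and the commented-out paths are hypothetical; h5_converter.py defines the actual layout):

    import h5py

    with h5py.File('prostatex-train.hdf5', 'r') as f:         # placeholder filename
        # Print every group/dataset path so the structure from h5_converter.py is visible
        f.visit(print)

        # Example of reading one dataset and its attributes, once the real paths are known:
        # pixels = f['ProstateX-0000/some_series/pixel_array'][()]   # hypothetical path
        # attrs = dict(f['ProstateX-0000/some_series'].attrs)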

Spy on the other team

We have to keep tabs on the other team and see if they code something nifty we can use.
