HAM 10000 Dataset Tools

Creative Commons License

This repository gives access to the tools created and used for assembling the training dataset for the proposed HAM-10000 (Human Against Machine with 10000 training images) study, which extends part 3 of the ISIC 2018 challenge. The dataset itself is available for download at the Harvard Dataverse or the ISIC archive.


Extract

The following technique was used to leverage image data from PowerPoint slides by extracting the images and ordering them with unique identifiers:
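
As a minimal sketch of such an extraction (not the repository's own script), the example below assumes the slides are stored as .pptx files, which are ZIP archives that keep embedded images under ppt/media/. The function name extract_pptx_images and the identifier scheme (deck name, running number, short content hash) are hypothetical choices made for illustration.

```python
import hashlib
import zipfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def extract_pptx_images(pptx_path, out_dir):
    """Copy every image embedded in a .pptx file into out_dir.

    Returns the written paths, ordered by their (lexically sorted) name
    inside the archive. This is not necessarily the slide order.
    """
    pptx_path, out_dir = Path(pptx_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(pptx_path) as archive:
        media = sorted(
            name for name in archive.namelist()
            if name.startswith("ppt/media/")
            and Path(name).suffix.lower() in IMAGE_SUFFIXES
        )
        for index, name in enumerate(media, start=1):
            data = archive.read(name)
            # Hypothetical ID scheme: deck name, running number, content hash.
            digest = hashlib.sha1(data).hexdigest()[:8]
            suffix = Path(name).suffix.lower()
            target = out_dir / f"{pptx_path.stem}_{index:04d}_{digest}{suffix}"
            target.write_bytes(data)
            written.append(target)
    return written
```

Including a content hash in the file name makes accidental duplicates across decks easy to spot, because identical images end in the same eight characters.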


Filter

To order large image sets containing non-annotated overview (clinic), close-up (macro) and dermatoscopic (dsc) images more efficiently, we fine-tuned a neural network to distinguish between those types automatically.

1. Annotation

  • filter/filter_annotation.py: An OpenCV-based script to quickly annotate images within a subfolder into different image types. Results are stored in a CSV file, with the option to abort and later resume annotation.
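
The abort-and-resume behaviour can be sketched with the standard library alone. This is an illustration of the idea, not the code of filter_annotation.py: the function names, the two-column CSV layout (file name, label) and the ask_label callback are assumptions. In an OpenCV tool, the callback could display the image and return the label for the pressed key.

```python
import csv
from pathlib import Path


def load_done(csv_path):
    """Return the set of file names that already have a label in the CSV."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        return set()
    with csv_path.open(newline="") as handle:
        return {row[0] for row in csv.reader(handle) if row}


def annotate(image_names, csv_path, ask_label):
    """Label the not-yet-annotated images, appending one CSV row per image.

    ask_label(name) returns a label string, or None to abort the session.
    """
    done = load_done(csv_path)
    with Path(csv_path).open("a", newline="") as handle:
        writer = csv.writer(handle)
        for name in image_names:
            if name in done:
                continue  # labelled in an earlier session
            label = ask_label(name)
            if label is None:
                break  # abort; rows written so far are kept
            writer.writerow([name, label])
            handle.flush()  # keep the file current if the process is killed
```

Because rows are appended and flushed one at a time, rerunning annotate on the same folder continues with the first image that has no row yet.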

2. Training

Training was performed in Caffe / DIGITS, which abstracts away many training variables. We obtained 1501 annotated images with the tool above and proceeded to training: GoogLeNet pretrained on ImageNet (taken from the NVIDIA DIGITS 5 Model Store) was fine-tuned on three classes for 20 epochs, reaching a final top-1 accuracy of 98.68% on the test set (one dermatoscopic image classified as macro). The trained model files are provided in ./classify/caffe_model/*
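
The test-set size is not stated above, but the reported accuracy and the single misclassification constrain it. The following check is a small worked calculation, not part of the repository; the function name is hypothetical, and it assumes the accuracy was rounded to two decimals.

```python
def candidate_test_set_sizes(accuracy_percent, errors, max_size=1501):
    """Test-set sizes n for which (n - errors) / n rounds to accuracy_percent."""
    return [
        n for n in range(errors + 1, max_size + 1)
        if round(100 * (n - errors) / n, 2) == accuracy_percent
    ]


# One error at 98.68% accuracy is only consistent with 76 test images
# (75 / 76 = 98.684...%), roughly 5% of the 1501 annotated images.
print(candidate_test_set_sizes(98.68, errors=1))  # prints [76]
```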

3. Inference


Unify

Pathologic diagnoses in clinical practice are often non-standardized and verbose. The notebook below depicts our boilerplate used on different datasets to merge raw string data into a clean set of classes.

  • unify/unify_diagnoses.ipynb uses the pandas library to clean and unify diagnosis texts of dermatologic lesions into a confined set of diagnoses, plus the other and ambiguous classes.
    Note: The notebook contains only a subset of example terms for display purposes, as the regular expressions are optimized to fit a given dataset. In most cases, the ones given will therefore not work on a new dataset out of the box. Importantly, the order in which diagnoses are relabeled also matters, so we highly recommend manually checking relabeled diagnoses and iterating stepwise when applying the notebook to a new dataset.
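
Why the order of relabeling matters can be shown with a small rule list. The notebook uses pandas; this sketch uses only the standard re module so that it runs without dependencies. The patterns, the rule order and the function name unify are illustrative assumptions and are not taken from the notebook.

```python
import re

# Illustrative rules only. The first matching rule wins, so the more
# specific or more important patterns have to come first.
RULES = [
    (re.compile(r"\b(collision|versus|vs)\b", re.I), "ambiguous"),
    (re.compile(r"melanoma", re.I), "mel"),
    (re.compile(r"\bna?evus\b", re.I), "nv"),
    (re.compile(r"basal cell carcinoma|\bbcc\b", re.I), "bcc"),
]


def unify(text, rules=RULES, default="other"):
    """Map a free-text diagnosis to the label of the first matching rule."""
    for pattern, label in rules:
        if pattern.search(text):
            return label
    return default


report = "Superficial spreading melanoma arising in a naevus"
print(unify(report))                           # melanoma rule fires first
print(unify(report, rules=RULES[::-1]))        # reversed order: naevus rule fires first
```

With the rules as listed, the report is labeled mel; with the list reversed, the same text is labeled nv. This is the kind of silent relabeling error that a manual check is meant to catch.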

Standardise

To normalise the image format without squeezing (that is, without distorting the aspect ratio), one Bash/ImageMagick command was applied to the final images before data submission to the archive:

find . -type f \( -iname \*.jpg -o -iname \*.jpeg -o -iname \*.tiff -o -iname \*.tif \) -print0 | xargs -0 -n1 mogrify -strip -rotate "90<" -resize "600x450^" -gravity center -crop 600x450+0+0 -density 72 -units PixelsPerInch -format jpg -quality 100
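
The geometry steps of this command can be traced for a single image size. The sketch below is an approximation under stated assumptions: it mirrors -rotate "90<" (rotate only if the width is less than the height), -resize "600x450^" (scale until the image covers the box) and the centered crop, but it does not call ImageMagick, and ImageMagick's own rounding may differ by a pixel. The function name is hypothetical.

```python
def standardised_geometry(width, height, target=(600, 450)):
    """Trace the rotate, resize and crop steps for one image size.

    Returns (resized_size, crop_offset); the cropped image has size `target`.
    """
    # -rotate "90<": rotate by 90 degrees only when width < height (portrait).
    if width < height:
        width, height = height, width
    # -resize "600x450^": scale so that the image covers the target box.
    scale = max(target[0] / width, target[1] / height)
    resized = (
        max(target[0], round(width * scale)),
        max(target[1], round(height * scale)),
    )
    # -gravity center -crop 600x450+0+0: cut the box out of the middle.
    offset = (
        (resized[0] - target[0]) // 2,
        (resized[1] - target[1]) // 2,
    )
    return resized, offset


print(standardised_geometry(3000, 4000))  # portrait 4:3 image
print(standardised_geometry(1000, 1000))  # square image
```

A 4:3 image, in either orientation, is scaled to exactly 600x450 and loses nothing in the crop. A square image is scaled to 600x600 and loses 75 pixels at the top and at the bottom, which is how the command avoids squeezing.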


Cite

If tools or data helped your research, please cite:

  • Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data 5, 180161 doi:10.1038/sdata.2018.161 (2018).
@article{Tschandl2018_HAM10000,
  author    = {Philipp Tschandl and
               Cliff Rosendahl and
               Harald Kittler},
  title     = {The {HAM10000} dataset, a large collection of multi-source dermatoscopic
               images of common pigmented skin lesions},
  journal   = {Sci. Data},
  volume    = {5},
  year      = {2018},
  pages     = {180161},
  doi       = {10.1038/sdata.2018.161}
}
