
prostate-pathology-parser

Natural Language Processing Systems for Pathology Parsing in Limited Data Environments with Uncertainty Estimation

Anobel Y. Odisho MD MPH (1), Briton Park, BS (2), Nicholas Altieri, BA (2), John DeNero, PhD (3), Matthew R. Cooperberg MD MPH (1,5), Peter R. Carroll MD MPH (1), Bin Yu PhD (2,3,4)

  • 1 Department of Urology, UCSF Helen Diller Family Comprehensive Cancer Center, San Francisco
  • 2 Department of Statistics, University of California, Berkeley
  • 3 Department of Electrical Engineering and Computer Science, University of California, Berkeley
  • 4 Chan-Zuckerberg Biohub, San Francisco, California
  • 5 Department of Epidemiology & Biostatistics, University of California, San Francisco

Objective

Cancer is a leading cause of death, but much of the diagnostic information is stored as unstructured data in pathology reports. We aim to improve the uncertainty estimates of machine-learning-based pathology parsers and to evaluate their performance in low-data settings.

Materials and Methods

Our data come from the Urologic Outcomes Database at UCSF, which includes 3,232 annotated prostate cancer pathology reports from 2001 to 2018. We approach 17 separate information extraction tasks involving a wide range of pathologic features. To handle this diverse range of fields, we required two statistical models: a document classification method for pathologic features with a small set of possible values, and a token extraction method for pathologic features with a large set of possible values. For each model, we used isotonic calibration to improve the model's estimates of its likelihood of being correct.

Results

Our best document classification method, a convolutional neural network, achieves a weighted F1 score of 0.97 averaged over 12 fields, and our best extraction method achieves an accuracy of 0.93 averaged over 5 fields. Performance saturates as a function of dataset size with as few as 128 data points. Furthermore, our document classification methods have reliable uncertainty estimates, whereas our extraction-based methods do not; after isotonic calibration, however, expected calibration error drops below 0.03 for all extraction fields.
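
Expected calibration error can be read as the gap between a model's stated confidence and its observed accuracy, averaged over confidence bins and weighted by bin size. The following is a minimal sketch, assuming equal-width bins and a default of 10 bins (the binning used in the study is not stated here). The function name and its arguments are illustrative and are not taken from the repository.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities of being correct, each in [0, 1].
    correct: matching truthy/falsy flags saying whether each prediction was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A model that reports 0.9 confidence on four predictions and gets three of them right has an error of about 0.15 under this definition, while a model that reports 0.5 and is right half the time has an error of 0.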

Conclusions

We find that when applying machine learning to pathology parsing, large datasets may not always be needed, and that calibration methods can improve the reliability of uncertainty estimates.

Technical Details

This repository contains the codebase for extracting data from prostate pathology reports. There are two high-level approaches. The token extraction approach, which should be used to extract continuous data fields or categorical data fields with a large set of possible values, is exemplified in the main_pipelines/token_extraction/RandomForest.ipynb notebook. The classification approach, which should be used for categorical data with a small number of possible values, is demonstrated in main_pipelines/classification/ConvolutionalNetwork.ipynb and main_pipelines/classification/LogisticRegressionWithCalibration.ipynb. The latter notebook also contains code for calibrating probabilities. For small training sets (in the hundreds of examples or fewer), the non-deep-learning approaches should be used.
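
To illustrate what calibrating probabilities involves, here is a from-scratch sketch of isotonic calibration using the pool-adjacent-violators algorithm. It is not the code from the notebook, which may rely on a library implementation such as scikit-learn's IsotonicRegression. The names fit_isotonic and calibrate are hypothetical, and the sketch assumes you have raw model scores plus 0/1 flags indicating whether each prediction was correct on a held-out split.

```python
import bisect


def fit_isotonic(scores, outcomes):
    """Fit a non-decreasing step function from raw score to P(correct).

    scores: raw model confidences on held-out examples.
    outcomes: 1 if the corresponding prediction was correct, else 0.
    Returns (thresholds, values): the upper score and fitted value of each step.
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal length")

    # Each block holds [sum of outcomes, count, largest score in the block].
    blocks = []
    for score, outcome in sorted(zip(scores, outcomes)):
        if blocks and blocks[-1][2] == score:
            # Tied scores must share one fitted value, so pool them directly.
            blocks[-1][0] += outcome
            blocks[-1][1] += 1
        else:
            blocks.append([float(outcome), 1, score])
        # Pool adjacent violators: merge while the previous mean exceeds this one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    thresholds = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Map a raw score to its calibrated probability using the fitted steps."""
    idx = bisect.bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]  # clamp scores above the last step
```

In this sketch the mapping would be fitted on a split that was not used for training (for example, a validation split) and then applied to the scores of new predictions. Because the fitted function is non-decreasing, the ranking of predictions by confidence is preserved, up to ties.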

Because the data used to train these models are protected, neither the data nor the trained models can be made public. The data structures were preprocessed from the raw data and are Python dictionaries. The 'train', 'val', and 'test' keys denote the splits of the data. Each corresponding value is a Python list of dictionaries, one per patient. The labels and text for each patient are accessed through that patient's dictionary.

For example, data['train'][i]['document'] contains the pathology report as a string for the ith patient in the training split, while data['train'][i]['labels']['TumorType'] contains the label for the data field "Tumor Type" for the ith patient in the training split.
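
The layout described above can be sketched as follows. Because the real data are protected, the report text and label values here are invented placeholders, and the helper field_examples is a hypothetical convenience function, not part of the repository.

```python
# Placeholder data in the described layout; the strings are not real reports or labels.
data = {
    "train": [
        {"document": "placeholder report text A", "labels": {"TumorType": "label_a"}},
        {"document": "placeholder report text B", "labels": {"TumorType": "label_b"}},
    ],
    "val": [
        {"document": "placeholder report text C", "labels": {"TumorType": "label_a"}},
    ],
    "test": [],
}


def field_examples(data, split, field):
    """Collect (report texts, labels) for one data field in one split."""
    texts, labels = [], []
    for patient in data[split]:
        texts.append(patient["document"])
        labels.append(patient["labels"][field])
    return texts, labels


train_texts, train_labels = field_examples(data, "train", "TumorType")
```

Pairing the texts and labels for a single field in this way gives the per-field inputs that a classifier or token extractor would be trained on.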

Corresponding Author

Bin Yu, PhD
