MedICaT

MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Instructions for access are provided here.

Figures and captions are extracted from open access articles in PubMed Central, and the corresponding reference text is derived from S2ORC.

The dataset consists of:

  • 217,060 figures from 131,410 open access papers
  • 7,507 subcaption and subfigure annotations for 2,069 compound figures
  • Inline references for ~25K figures in the ROCO dataset

A sample of the data is available in sample/.

An example data entry:

{
  "pdf_hash": "57c9ad0f4aab133f96d40992c46926fabc901ffa",
  "fig_key": "Figure1",
  "fig_uri": "2-Figure1-1.png",
  "s2_caption": "Figure 1. (A) Barium enema and (B) endoscopic image of the high-grade distal colonic obstruction caused by a 5-cm anastomotic stricture.",
  "s2orc_caption": "Figure 1. (A) Barium enema and (B) endoscopic image of the high-grade distal colonic obstruction caused by a 5-cm anastomotic stricture.",
  "s2orc_references": [
    "Computed tomography (CT) showed a distal large bowel obstruction, and a barium enema revealed a high-grade stenosis proximal to the anastomotic site in the recto-sigmoid region (Figure 1 ).",
    "Flexible sigmoidoscopy revealed a tight, fibrotic, benign-appearing anastomotic stricture 15 cm from the anal verge ( Figure 1) ."
  ],
  "radiology": false,
  "scope": true,
  "predicted_type": "Medical images",
  "oa_info": {
    "doi": "10.14309/crj.2014.54",
    "doi_url": "https://doi.org/10.14309/crj.2014.54",
    "oa": {
      "is_oa": true,
      "oa_status": "gold",
      "journal_is_oa": true,
      "journal_is_in_doaj": true,
      "license": "cc-by-nc-nd",
      "provenance": "unpaywall"
    }
  }
}

The corresponding figure is located at figures/57c9ad0f4aab133f96d40992c46926fabc901ffa_2-Figure1-1.png ({pdf_hash}_{fig_uri}).
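
The naming rule above can be sketched in Python. This is a minimal illustration, not part of the released code: the helper name `figure_path` and the default `figures` directory argument are assumptions made for the example, and only the `pdf_hash` and `fig_uri` fields from the sample entry are used.

```python
from pathlib import PurePosixPath


def figure_path(entry: dict, figures_dir: str = "figures") -> str:
    """Build the relative figure path as {figures_dir}/{pdf_hash}_{fig_uri}.

    `figure_path` is a hypothetical helper written for this example.
    """
    filename = f"{entry['pdf_hash']}_{entry['fig_uri']}"
    # PurePosixPath keeps forward slashes regardless of the operating system.
    return str(PurePosixPath(figures_dir) / filename)


# Fields copied from the example data entry above.
entry = {
    "pdf_hash": "57c9ad0f4aab133f96d40992c46926fabc901ffa",
    "fig_uri": "2-Figure1-1.png",
}

print(figure_path(entry))
# figures/57c9ad0f4aab133f96d40992c46926fabc901ffa_2-Figure1-1.png
```

If the figures are stored elsewhere after download, pass that location as `figures_dir`.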

To download:

Please fill out this form for access. We will respond to your request within 48 hours.

Code

Please see the code directory for the code associated with our paper. The code/README.md includes additional information about how you can use this code.

To cite:

If using this dataset, please cite:

@inproceedings{subramanian-2020-medicat,
    title={{MedICaT: A Dataset of Medical Images, Captions, and Textual References}},
    author={Sanjay Subramanian and Lucy Lu Wang and Sachin Mehta and Ben Bogin and Madeleine van Zuylen and Sravanthi Parasa and Sameer Singh and Matt Gardner and Hannaneh Hajishirzi},
    year={2020},
    booktitle={Findings of EMNLP},
}

License

Each source document in MedICaT is licensed differently. Articles included in MedICaT have open access licenses (see CC and UPW) or are in the public domain. The license for each article is provided in the associated entry in the dataset. Please abide by these licenses when using the dataset. The MedICaT dataset is available for non-commercial use only.
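
As a rough sketch of how the per-article license might be read from an entry, the snippet below assumes every entry nests the license under `oa_info` → `oa` → `license`, as the example entry does. The helper name `entry_license` is hypothetical, and returning `None` when a field is absent is a choice made for this example, not documented dataset behavior.

```python
from typing import Optional


def entry_license(entry: dict) -> Optional[str]:
    """Return the license string stored in an entry, or None if it is missing.

    Assumes the nesting shown in the example entry: oa_info -> oa -> license.
    """
    oa = (entry.get("oa_info") or {}).get("oa") or {}
    return oa.get("license")


# Trimmed copy of the example data entry, keeping only the license-related fields.
sample = {
    "pdf_hash": "57c9ad0f4aab133f96d40992c46926fabc901ffa",
    "oa_info": {
        "doi": "10.14309/crj.2014.54",
        "oa": {"is_oa": True, "oa_status": "gold", "license": "cc-by-nc-nd"},
    },
}

print(entry_license(sample))
# cc-by-nc-nd
```

Checking this value for each entry before reuse is one way to confirm that a planned use fits the article's license.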

Contact us

Email: {sanjays, lucyw}@allenai.org
