The medquad from abachaa

medquad's Introduction

-------------------------------------------
MedQuAD: Medical Question Answering Dataset  
-------------------------------------------

MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.  

We included additional annotations in the XML files, that could be used for diverse IR and NLP tasks, such as the question type, the question focus, its syonyms, its UMLS Concept Unique Identifier (CUI) and Semantic Type. 
We  added the category of the question focus (Disease, Drug or Other) in the 4 MedlinePlus collections. All other collections are about diseases.  
 
The paper cited below describes the collection, the construction method as well as its use and evaluation within a medical question answering system.   

N.B. We removed the answers from 3 subsets to respect the MedlinePlus copyright (https://medlineplus.gov/copyright.html):  
(1) A.D.A.M. Medical Encyclopedia, (2) MedlinePlus Drug information, and (3) MedlinePlus Herbal medicine and supplement information. 
-- We kept all the other information including the URLs in case you want to crawl the answers. Please contact me if you have any questions.  
 
-------------------
QA Test Collection  
-------------------

We used the test questions of the TREC-2017 LiveQA medical task:  https://github.com/abachaa/LiveQA_MedicalTask_TREC2017/tree/master/TestDataset. 

As described in our BMC paper, we have manually judged the answers retrieved by the IR and QA systems from the MedQuAD collection. 
We used the same judgment scores as the LiveQA Track: 1-Incorrect, 2-Related, 3-Incomplete, and 4-Excellent. 
-- Format of the qrels file: Question_ID judgment Answer_ID 

The QA test collection contains 2,479 judged answers that can be used to evaluate the performance of IR & QA systems on the LiveQA-Med test questions:  https://github.com/abachaa/MedQuAD/blob/master/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip

----------
Reference  
---------- 

If you use the MedQuAD dataset and/or the collection of 2,479 judged answers, please cite the following paper: "A Question-Entailment Approach to Question Answering". Asma Ben Abacha and Dina Demner-Fushman. BMC Bioinformatics, 2019.    

	@ARTICLE{BenAbacha-BMC-2019,    
		  author    = {Asma {Ben Abacha} and Dina Demner{-}Fushman},
		  title     = {A Question-Entailment Approach to Question Answering},
		  journal = {{BMC} Bioinform.}, 
		  volume    = {20},
  		  number    = {1},
     		  pages     = {511:1--511:23},
  		  year      = {2019},
  	url       = {https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4}
		   }     
		   
License
- The MedQuAD dataset is published under a Creative Commons Attribution 4.0 International Licence (CC BY). https://creativecommons.org/licenses/by/4.0/

Contact
-  Asma Ben abacha (abenabacha at microsoft dot com)

medquad's Issues

Any future plan to release the knowledge source of those question answer pairs?

Hi,

Thank you for releasing this amazing dataset which is a great contribution to ML in healthcare! I wonder if you have any future plan to release the knowledge source of all those question answer pairs as well to make it a great reading comprehension benchmark like SQuAD. I noticed that many questions can be answered using the webpages whose urls are mentioned in your dataset. I wonder if you are planning to clean up those webpages and extract relevant passages to support the reading comprehension task.
Thanks!

How can you actually crawl the answer for MedLineDrugs?

I see that we only have questions pairs but there is no real method to crawl the answer from MedLineDrugs. Is it crawlable? Thank you so much!

License for using the data

Hi Abachaa,

I don't see a common license as part of this github repo. Are you planning to add a license to this github to make it easier to use the datasets?

Recommend Projects

abachaa / medquad Goto Github PK

medquad's Introduction

medquad's People

Contributors

Stargazers

Watchers

Forkers

medquad's Issues

Any future plan to release the knowledge source of those question answer pairs?

How can you actually crawl the answer for MedLineDrugs?

License for using the data

Code for scraping answers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent