Motivation

We generated this dataset to train a machine learning model that automatically generates psychiatric case notes from doctor-patient conversations. Since we did not have access to real doctor-patient conversations, we used transcripts from two different sources to generate audio recordings of enacted conversations between a doctor and a patient. We employed eight students who worked in pairs to generate these recordings. Six of the transcripts that we used to produce these recordings were hand-written by Cheryl Bristow, and the rest were adapted from Alexander Street transcripts, which were generated from real doctor-patient conversations. Our study requires recording the doctor and the patient(s) in separate channels, which is the primary reason for generating our own audio recordings of the conversations.

We used the Google Cloud Speech-To-Text API to transcribe the enacted recordings. These newly generated transcripts are produced entirely by AI-powered automatic speech recognition, whereas the source transcripts are either hand-written or fine-tuned by human transcribers (transcripts from Alexander Street).

We provided the generated transcripts back to the students and asked them to write case notes. The students worked independently using software that we had developed earlier for this purpose. The students had prior experience writing case notes, and we let them write the case notes as they normally would, without any training or instructions from us.

Authors

Kazi, Nazmul
Kuntz, Matt
Kanewala, Upulee
Kahanda, Indika

Contributors

Bristow, Cheryl
Arzubi, Eric

Description of the dataset

Case note categories:

Index	Category	Abbr.
0	Client Details	CD
1	Chief Complaint	CC
2	History of Present Illness	HPI
3	Past Psychiatric History	PPH
4	History of Substance Use	HSU
5	Social History	SH
6	Family History	FH
7	Review of Systems	RS

Directories:

Directory	Description
`transcripts/source`	Source transcripts that are used to generate the audio recordings.
`recordings`	Audio recordings of the enacted doctor-patient conversations.
`transcripts/transcribed`	Transcripts generated from the audio recordings using Google Cloud Speech-To-Text API.
`casenotes`	Casenotes written by the students, i.e. annotators.

Transcript file structure:

[
	{
		"speaker"  : 1,
		"dialogue" : ["sentence 1", "sentence 2", ...]
	},
	...
]

Definitions

Term	Definition
`speaker`	Speaker of the current dialogue turn.
`dialogue`	Sentence(s) spoken by the speaker in current dialogue turn.
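As a minimal sketch of reading this structure, the snippet below parses a transcript and counts the sentences contributed by each speaker. The sample text and the helper name `sentences_per_speaker` are hypothetical and used only for illustration; real files would be read from `transcripts/source` or `transcripts/transcribed`.

```python
import json
from collections import Counter

# Hypothetical sample in the transcript format described above
# (not taken from the dataset).
SAMPLE = """
[
  {"speaker": 1, "dialogue": ["How are you feeling today?", "Any changes since last week?"]},
  {"speaker": 2, "dialogue": ["Not great."]}
]
"""


def sentences_per_speaker(turns):
    """Count how many sentences each speaker contributes across all turns."""
    counts = Counter()
    for turn in turns:
        counts[turn["speaker"]] += len(turn["dialogue"])
    return counts


turns = json.loads(SAMPLE)
counts = sentences_per_speaker(turns)
print(dict(counts))
```

To apply the same function to a file, replace `json.loads(SAMPLE)` with `json.load(...)` on an opened transcript file.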

Case note file structure:

[
	{
		"categoryId" : "0",
		"sourceId"   : "0",
		"formalText" : "formal text"
	},
	...
]

Definitions

Term	Definition
`categoryId`	Index of the case note category, e.g. 5 = Social History, in which this sentence is used. This property is zero-indexed.
`sourceId`	Index of the source sentence in the transcript. This property is zero-indexed.
`formalText`	Modified version of the sentence as it is used in the case note.
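The records can be grouped by category with a few lines of Python. This is a sketch only: the `CATEGORIES` list mirrors the category table above, while the sample records and the helper name `group_by_category` are hypothetical. Note that `categoryId` is stored as a string in the file structure shown, so it is converted to an integer before indexing.

```python
# Category names in index order, copied from the category table above.
CATEGORIES = [
    "Client Details",
    "Chief Complaint",
    "History of Present Illness",
    "Past Psychiatric History",
    "History of Substance Use",
    "Social History",
    "Family History",
    "Review of Systems",
]

# Hypothetical case note records (not taken from the dataset).
records = [
    {"categoryId": "1", "sourceId": "4", "formalText": "Patient reports low mood."},
    {"categoryId": "5", "sourceId": "9", "formalText": "Patient lives alone."},
    {"categoryId": "1", "sourceId": "6", "formalText": "Patient reports poor sleep."},
]


def group_by_category(records):
    """Map each category name to the list of formal texts filed under it."""
    grouped = {}
    for record in records:
        name = CATEGORIES[int(record["categoryId"])]  # IDs are zero-indexed strings
        grouped.setdefault(name, []).append(record["formalText"])
    return grouped


grouped = group_by_category(records)
for name, texts in grouped.items():
    print(name, texts)
```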

NOTE: If a sentence is used in multiple case note categories, a record appears for each use. "sourceId":"n" refers to the sentence whose index is n in the whole transcript, counted across dialogue turns, since multiple sentences can belong to the same dialogue turn. In the following transcript, "sourceId":"3" refers to sentence_d:

[
	{
		"speaker"  : 1,
		"dialogue" : ["sentence_a", "sentence_b"]
	},
	{
		"speaker"  : 2,
		"dialogue" : ["sentence_c", "sentence_d",  "sentence_e"]
	}
]
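The lookup described in the note can be sketched in Python by flattening the dialogue turns into one list of sentences and indexing into it. The function names `flatten_sentences` and `resolve_source` are hypothetical helpers, not part of the dataset's tooling.

```python
def flatten_sentences(turns):
    """Return (speaker, sentence) pairs in transcript order, across all turns."""
    return [(turn["speaker"], sentence)
            for turn in turns
            for sentence in turn["dialogue"]]


def resolve_source(turns, source_id):
    """Look up the sentence a case note record points to via its sourceId."""
    return flatten_sentences(turns)[int(source_id)]  # sourceId is a zero-indexed string


# The example transcript from the note above.
transcript = [
    {"speaker": 1, "dialogue": ["sentence_a", "sentence_b"]},
    {"speaker": 2, "dialogue": ["sentence_c", "sentence_d", "sentence_e"]},
]

speaker, sentence = resolve_source(transcript, "3")
print(speaker, sentence)  # prints: 2 sentence_d
```

When resolving many records against the same transcript, it would be cheaper to call `flatten_sentences` once and reuse the list.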

dataset.pickle

This is a pickle file (protocol version 4) containing all the transcribed transcripts and the case notes, for quick and easy access to the data from Python.
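A minimal loading sketch follows. The internal layout of the pickled object is not documented here, so the code makes no assumption about it and only reports the top-level type; `load_dataset` and `summarize` are hypothetical helper names. Protocol 4 requires Python 3.4 or later.

```python
import pickle


def load_dataset(path):
    """Load and return the object stored in a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(obj):
    """Describe a loaded object without assuming its internal layout."""
    if isinstance(obj, dict):
        return f"dict with keys: {sorted(map(str, obj))}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} with {len(obj)} items"
    return type(obj).__name__
```

Typical use would be `data = load_dataset("dataset.pickle")` followed by `print(summarize(data))` to see what the file contains before digging further. As with any pickle file, only load it from a source you trust.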

Funding

This project is funded by CATalyst Gap fund, Fall 2019.

Transcripts adapted from Alexander Street

D0420-S2-T01 D0420-S3-T02 D0420-S3-T03 D0420-S4-T01 D0420-S4-T02 D0421-S1-T01 D0421-S1-T02 D0421-S1-T03 D0421-S1-T04 D0421-S1-T05 D0421-S2-T01 D0421-S2-T02 D0421-S3-T01 D0421-S3-T02 D0421-S3-T03 D0421-S3-T04 D0421-S3-T05 D0422-S1-T01 D0422-S1-T02 D0422-S1-T03 D0422-S1-T04 D0422-S2-T01 D0422-S2-T02 D0422-S3-T01 D0422-S3-T02 D0422-S3-T03 D0422-S3-T04 D0422-S3-T05 D0422-S3-T06 D0422-S4-T01 D0422-S4-T02 D0422-S4-T03 D0422-S4-T04 D0422-S4-T05 D0423-S1-T01 D0423-S1-T02 D0423-S1-T03 D0423-S2-T01 D0423-S2-T02 D0423-S2-T03 D0424-S1-T01 D0424-S1-T02 D0424-S1-T03 D0424-S2-T01 D0424-S2-T02 D0424-S2-T03 D0424-S2-T04 D0424-S2-T05 D0424-S2-T06 D0424-S3-T01 D0424-S3-T02 D0424-S3-T03 D0424-S3-T04 D0425-S1-T01 D0425-S1-T02 D0425-S1-T03 D0425-S2-T01 D0425-S2-T02 D0425-S2-T03 D0425-S2-T04 D0425-S3-T01 D0425-S3-T02 D0425-S3-T03 D0425-S3-T04 D0425-S3-T05

Disclaimer

This dataset is provided "As Is" without warranty of any kind. In no event shall the authors or copyright holders be liable for any claim, damages or other liability. The source transcripts may not have been enacted word for word as they appear in the transcript. Similarly, we used automatic speech recognition to transcribe the recordings, so the transcribed transcripts may not exactly match the words spoken in the audio recordings.

This work is licensed under a Creative Commons Attribution 4.0 International License.
