ParaMed: A Parallel Corpus for Chinese-English Translation in the Medical Domain

Description

NEJM is a Chinese-English parallel corpus crawled from the New England Journal of Medicine website. English articles are distributed through https://www.nejm.org/ and Chinese articles are distributed through http://nejmqianyan.cn/. The corpus contains all article pairs (around 2000 pairs) since 2011.

This project was motivated by the fact that the Biomedical translation shared task in WMT19 did not provide training data for Chinese/English. In fact, we did not find any publically available parallel corpus between English and Chinese. We collected the NEJM corpus to faciliate machine translation between English and Chinese in the medical domain. We found that a remarkable boost in BLEU score can be achieved by pre-training on WMT18 and fine-tuning on the NEJM corpus.

Data Download

You can download ~ 70% of data here. Read on if you would like the entire dataset.

The New England Journal of Medicine is a pay-for-access journal. We are therefore prohibited by their copyright policy to freely distribute the entire dataset. However, Journal Watch and Original Research articles are open access six months after the initial publication. These articles make up about ~ 70% of the entire dataset and you can access them immediately using the link above.

If you are a NEJM subscriber through an institution or a personal account, as we belive most biomedical researchers are, you are entitled to access the full data. Please email us at [email protected] for access to the entire dataset.

Installation & Prerequisite

The code in this repository was written in python 3.7.

First you need to clone the repository:

git clone https://github.com/boxiangliu/med_translation.git

Then install the following python packages if they are not installed yet.

selenium
numpy
pandas
matplotlib
seaborn
nltk

Also install these packages outside of python:

Reproducing the paper

WARNING: Those without access to NEJM will likely not be able to run all steps in this repo.

1. Crawl the NEJM website

In crawl/crawl.py, replace the value of nejm_username and nejm_password with your credentials. Then run:

python3 crawler/crawl.py

This step crawls thousands of articles from NEJM and will take a number of hours.

[Optional] Plot summary statistics of crawled articles

To plot the distribution of articles by year and by type, run:

python3 crawler/url_stat.py

Filter crawled articles

The crawled articles are peppered with bits and pieces of noisy text. To remove them, run:

python3 crawler/filter.py

[Optional] Compare pre- and post-filtered articles

To see the effect of filtering on article lengths, run:

python3 crawler/article_stat.py

2. Preprocessing

We will normalize, break up paragraphs into sentences, tokenize and truecase.

Normalize punctuations

We will standardize English and Chinese punctuations. Run:

bash preprocess/normalize.sh

Split paragraphs into sentences

We will split English and Chinese paragraphs into sentences using eserix:

bash preprocess/detect_sentences/eserix.sh

[Optional] Compare eserix with punkt

We can also split paragraphs with another popular python module called punkt, and compare the performance of the two algorithms.

python3 preprocess/detect_sentences/punkt.py
python3 preprocess/detect_sentences/sent_stat.py

Tokenize and truecase

The final preprocessing steps will be tokenization and truecasing

bash preprocess/tokenize.sh
bash preprocess/truecase.sh

3. Sentence alignment

Manual alignment

In order to compare automatic sentence alignment algorithms, we need to establish a set of ground truth alignment. Lucky for you, we have done the dirty work of aligning sentences. You can download the raw sentences (unaligned) here. Create a directory and untar into this directory. These will be used as input to the sentence alignment algorithm below.

mkdir ../processed_data/preprocess/manual_align/input/
cd ../processed_data/preprocess/manual_align/input/
tar xzvf manual_align_input.tar.gz

Next download the manual alignment result here. Place it into the following directory.

mkdir ../processed_data/preprocess/manual_align/alignment/

The alignment result will be used to assess the quality of alignment algorithms.

Assessment of alignment algorithms

We assess the performance of Gale-Church (length-based), Microsoft aligner (Lexicon-based), and Bleualign (translation-based). Install them on your system.

Next run the following commands to align sentences using all algorithms.

Align with Moore's algorithm:

bash evaluation/nejm/align/moore/input.sh
bash evaluation/nejm/align/moore/align.sh

Align with Bleualign and Gale-Church algorithm:

bash evaluation/nejm/align/bleualign/input.sh
bash evaluation/nejm/align/bleualign/translate.sh
bash evaluation/nejm/align/bleualign/align.sh

Evaluate the performance of all algorithms:

bash evaluation/nejm/evaluate.sh
python3 evaluation/nejm/vis_pr_curve.py

Align the entire corpus

In the manuscript, we found that the Microsoft aligner gave the best performance. We use it to align the entire corpus.

bash alignment/moore/input.sh
bash alignment/moore/align.sh

Clean the `NEJM` corpus

Some sentences such as paragraph headings will be duplicated many times. To remove them, run the following command:

bash clean/concat.sh
bash clean/clean.sh

Split the data into train, dev and test:

Run the following to split data into train (~ 93000), development (~ 2000), and test (~ 2000):

bash split_data/split_train_test.py

Translation

To determine whether NEJM helps improving machine translation in the biomedical domain, we first train a baseline model using WMT18 English/Chinese dataset and fine-tune the model on NEJM.

WMT18 preprocessed en/zh data can be downloaded here.

Train the baseline model: bash translation/wmt18/train.sh

Subset the dataset to see translation performance improvement at various corpus sizes.

python3 subset/subset.py

Fine-tune on NEJM dataset and test the fine-tune performance:

python3 translation/nejm/finetune.sh
python3 translation/nejm/test_finetune.sh

Train on NEJM from scratch (without using WMT18):

python3 translation/nejm/train_denovo.sh
python3 translation/nejm/test_denovo.sh

Plot bleu score:

python3 translation/nejm/plot_bleu.py

Questions?

If you have any questions, please email us at [email protected]

boxiangliu / paramed Goto Github PK

paramed's Introduction