
Historical Text Normalization

Compiled tools, datasets, and other resources for historical text normalization.

The resources provided here have originally been published along with the following publication:

@inproceedings{bollmann2019-largescale,
  author = {Bollmann, Marcel},
  title = {A Large-Scale Comparison of Historical Text Normalization Systems},
  booktitle = {Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  location = {Minneapolis, Minnesota},
  publisher = {Association for Computational Linguistics},
  year = {2019},
  pages = {3885--3898},
  url = {http://www.aclweb.org/anthology/N19-1389},
}

If you use an original part of this repository (such as the provided scripts or the previously unpublished dataset splits), I would appreciate it if you cited the publication above. If you use one of the referenced datasets and/or tools, please remember to (also) cite those accordingly.

For further reading, a lot of additional details and background information can also be found in the author's doctoral dissertation: Marcel Bollmann (2018), Normalization of Historical Texts with Neural Network Models, Ruhr-Universität Bochum.

Datasets

| Language   | Source Corpus | Time Period  | Genre            | Tokens (total) | Source of Splits  |
|------------|---------------|--------------|------------------|----------------|-------------------|
| English¹   | ICAMET        | 1386-1698    | Letters          | 188,158        | HistCorp          |
| German     | Anselm        | 14th-16th c. | Religion         | 71,570         | prev. unpublished |
| German     | RIDGES        | 1482-1652    | Science          | 71,570         | prev. unpublished |
| Hungarian  | HGDS          | 1440-1541    | Religion         | 172,064        | HistCorp          |
| Icelandic  | IcePaHC       | 15th c.      | Religion         | 65,267         | HistCorp          |
| Portuguese | Post Scriptum | 15th-19th c. | Letters          | 306,946        | prev. unpublished |
| Slovene    | goo300k       | 1750-1899    | Mixed            | 326,538        | KonvNormSl 1.0    |
| Spanish    | Post Scriptum | 15th-19th c. | Letters          | 132,248        | prev. unpublished |
| Swedish    | GaW           | 1527-1812    | Official Records | 65,571         | HistCorp          |

¹Due to licensing restrictions, the ICAMET dataset may not be distributed further, but the HistCorp website contains instructions on how to obtain the same dataset splits.

Helpful Scripts

The scripts/ directory contains a collection of scripts used for the normalization experiments in Bollmann (2019): preprocessing scripts, evaluation and significance-testing scripts, and more. For details, please see the README file in the scripts/ folder.
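
The main evaluation metric used in Bollmann (2019) is word-level accuracy, i.e., the fraction of tokens whose predicted normalization exactly matches the gold normalization. As a rough illustration (a minimal Python sketch, not the repository's actual evaluation script):

    # Word-level accuracy: fraction of predictions that exactly match the
    # gold normalizations. Illustration only; see scripts/ for the real
    # evaluation tools.
    def word_accuracy(gold, predicted):
        assert len(gold) == len(predicted)
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)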

TL;DR: The Recommended Normalization Approach

In most cases, you want to combine a naive memorization baseline (for in-vocabulary tokens) with a good, trained model (for out-of-vocabulary tokens).

  • If you have little training data (<500 tokens), you probably want to use the Norma tool (which already includes a naive memorization component); see "Using Norma" for details.

  • Otherwise, the results from Bollmann (2019) suggest using cSMTiser as the trained model in this combined setup; see below under "Using cSMTiser".

The naive memorization component can be trained as follows:

scripts/memorizer.py train german-lexicon.txt german-anselm.train.txt

Apply it via:

scripts/memorizer.py apply german-lexicon.txt german-anselm.dev.txt > dev.memo.pred

To combine naive memorization with a trained model (for out-of-vocabulary tokens), first train and apply one of the normalizers discussed below. If the predictions of that trained model are in dev.model.pred, you can then apply this combined strategy via:

scripts/memorizer.py combine german-lexicon.txt dev.model.pred german-anselm.dev.txt

This will output a new prediction file that contains the memorized normalization wherever one is available, and the corresponding line from dev.model.pred otherwise.
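
The memorization strategy itself is simple enough to sketch in a few lines of Python. The following is a hypothetical reimplementation for illustration only, not the actual scripts/memorizer.py:

    # Hypothetical sketch of the naive memorization strategy; the real
    # scripts/memorizer.py may differ in its details.
    from collections import Counter, defaultdict

    def train_memorizer(pairs):
        """Map each historical form to its most frequent normalization."""
        counts = defaultdict(Counter)
        for hist, norm in pairs:
            counts[hist][norm] += 1
        return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}

    def combine(lexicon, hist_tokens, model_predictions):
        """Memorized normalization for known tokens; the trained model's
        prediction for out-of-vocabulary tokens."""
        return [lexicon.get(hist, pred)
                for hist, pred in zip(hist_tokens, model_predictions)]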

Tools

The following tools are evaluated in Bollmann (2019): Norma, Marian, XNMT, and cSMTiser.

The detailed instructions below assume that the data files are provided in the same format as contained in this repository; i.e., as tab-separated text files where the first column contains a historical word form and the second column contains its normalization.
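
For illustration, such a file can be read in Python as follows (a minimal sketch; the file name is an example):

    # Read a two-column, tab-separated normalization file into a list of
    # (historical, normalized) token pairs.
    pairs = []
    with open("german-anselm.train.txt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            hist, norm = line.split("\t")[:2]
            pairs.append((hist, norm))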

Using Norma

Norma (and at least one of its dependencies) needs to be compiled manually on your system before it can be used. Detailed instructions for this can be found in the Norma repository.

To use Norma, you need to:

  1. Prepare a configuration file; you can use the recommended configuration file, but should adjust the filenames given inside.

  2. Prepare a lexicon of contemporary word forms. You can use the contemporary datasets provided here for this purpose, and create a lexicon file with the following command (example given for German):

    norma_lexicon -w datasets/modern/combined.de.uniq -a lexicon.de.fsm -l lexicon.de.sym -c
    

    Make sure that the names of the lexicon files match what is given in your norma.cfg before you start training.
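
If you want to build such a word list from your own contemporary corpus instead, a plain list of unique word forms (one per line) should suffice. For example (a minimal Python sketch, assuming a whitespace-tokenized input file; all file names are examples):

    # Collect unique word forms (one per line) from a whitespace-tokenized
    # contemporary corpus.
    words = set()
    with open("contemporary-corpus.txt", encoding="utf-8") as f:
        for line in f:
            words.update(line.split())
    with open("lexicon.de.uniq", "w", encoding="utf-8") as f:
        for word in sorted(words):
            f.write(word + "\n")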

Data files for Norma need to be in two-column, tab-separated format. To train a new model, use:

normalize -c norma.cfg -f german-anselm.train.txt -s -t --saveonexit

The names of the saved model files are defined in norma.cfg. Generating normalizations is done via:

normalize -c norma.cfg -f german-anselm.dev.txt -s > german-anselm.predictions

Using Marian

You need to install the Marian framework and clone the normalization-NMT repository on your local machine. You then need to:

  1. Preprocess the input into separate source/target files with whitespace-separated characters. This format can be easily generated as follows:

    mkdir preprocessed
    scripts/convert_to_charseq.py german-anselm.{train,test,dev}.txt --to preprocessed

    This will create the preprocessed input files (named train.src, train.trg, etc.) in the preprocessed/ subdirectory; a sketch of what this conversion does is shown after this list.

  2. Edit the train_seq2seq.sh script that comes with normalization-NMT to point to the correct paths (for Marian and the preprocessed input), as well as adjust the GPU memory settings and device ID to the correct values for your system. As an example, check out the modified script used for the experiments in Bollmann (2019).
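
A rough Python sketch of the character-sequence conversion referenced in step 1 above (a hypothetical reimplementation for illustration, not the actual scripts/convert_to_charseq.py; file names are examples):

    # Hypothetical sketch of the conversion performed by
    # scripts/convert_to_charseq.py: historical forms go to *.src files and
    # normalizations to *.trg files, each word as whitespace-separated
    # characters.
    def convert(infile, src_out, trg_out):
        with open(infile, encoding="utf-8") as fin, \
             open(src_out, "w", encoding="utf-8") as fsrc, \
             open(trg_out, "w", encoding="utf-8") as ftrg:
            for line in fin:
                if not line.strip():
                    continue  # skip empty lines
                hist, norm = line.rstrip("\n").split("\t")[:2]
                fsrc.write(" ".join(hist) + "\n")  # e.g. "vnd" -> "v n d"
                ftrg.write(" ".join(norm) + "\n")  # e.g. "und" -> "u n d"

    convert("german-anselm.train.txt",
            "preprocessed/train.src", "preprocessed/train.trg")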

Then, training the model is as simple as calling:

bash train_seq2seq.sh

Generating normalizations is best done by calling marian-decoder directly, like this:

cat preprocessed/dev.src | $MARIAN_PATH/marian-decoder -c $MODELDIR/model.npz.best-perplexity.npz.decoder.yaml -m $MODELDIR/model.npz.best-perplexity.npz --quiet-translation --device 0 --mini-batch 16 --maxi-batch 100 --maxi-batch-sort src -w 10000 --beam-size 5 | sed 's/ //g' > german-anselm.predictions

Marian outputs predictions in the same format as the input files, i.e. with whitespace-separated characters; piping the output through sed 's/ //g' removes these spaces and restores the regular word forms. You can skip this step, of course, but it is required if you want to use the evaluation scripts supplied here and/or compare with the other normalization methods.

Using XNMT

XNMT is based on Python 3.6 and DyNet. You can find detailed instructions on how to install it in the "Getting Started" section of the documentation.

Since XNMT is new software that is changing quickly, and there was no tagged release at the time of my experiments, it is possible that the newest version is not compatible with the exact scripts and instructions provided here. For reference, the experiments performed in Bollmann (2019) are based on XNMT commit 6557ee8. You can obtain this exact version of the code by cloning the XNMT repository and then issuing git checkout 6557ee8.

To use XNMT, you need to:

  1. Preprocess the input to be in separate source/target files with whitespace-separated characters (the same as for Marian):

    scripts/convert_to_charseq.py german-anselm.{train,test,dev}.txt --to preprocessed
  2. Edit the example configuration file by replacing the <<TMPDIR>> string with the path to your preprocessed input files; for example:

    sed -i 's|<<TMPDIR>>|./preprocessed|g' examples/xnmt-config.yaml

    For very small datasets, you might also want to increase the patience value (find the line that says patience: 5 and adjust it).

Afterwards, you can train the model by calling:

PYTHONHASHSEED=0 python3 -m xnmt.xnmt_run_experiments xnmt-config.yaml --dynet-seed 0 --dynet-gpu

This both trains and evaluates; the final predictions will be stored as dev.predictions in the given directory.

Using cSMTiser

cSMTiser requires an installation of Moses and MGIZA. Detailed instructions can be found in the cSMTiser repository. To use it, you need to:

  1. Preprocess the input into separate orig/norm files. There is a bash script for this with the same argument structure as the conversion script for XNMT and Marian above (a sketch of what this conversion does is shown after this list):

    mkdir preprocessed
    scripts/convert_to_orignorm.sh german-anselm.{train,test,dev}.txt --to preprocessed
  2. Edit the example configuration file by replacing the <<TMPDIR>> string with the path to your preprocessed input files; for example:

    sed -i 's|<<TMPDIR>>|preprocessed|g' examples/csmtiser-config.yaml

    Likewise, you should replace <<MODELDIR>> with the desired output directory (absolute path!) for your trained model, and <<MOSESDIR>> with the path to your local Moses installation.

    To add the contemporary data for language modelling (optional), find the line in the configuration file that says lms: [] and replace it with (e.g.) lms: [datasets/modern/combined.de.uniq].
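
The orig/norm split referenced in step 1 is trivial; the following Python sketch illustrates the idea (a hypothetical reimplementation, not the actual scripts/convert_to_orignorm.sh; file names are examples):

    # Hypothetical sketch of the split performed by
    # scripts/convert_to_orignorm.sh: the two columns of a data file are
    # written to separate files, one token per line.
    with open("german-anselm.train.txt", encoding="utf-8") as fin, \
         open("preprocessed/train.orig", "w", encoding="utf-8") as forig, \
         open("preprocessed/train.norm", "w", encoding="utf-8") as fnorm:
        for line in fin:
            if not line.strip():
                continue  # skip empty lines
            hist, norm = line.rstrip("\n").split("\t")[:2]
            forig.write(hist + "\n")
            fnorm.write(norm + "\n")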

Afterwards, training the model requires the following two commands (from the cSMTiser directory):

python preprocess.py csmtiser-config.yaml
python train.py csmtiser-config.yaml

Generating normalizations is achieved by calling:

python normalise.py csmtiser-config.yaml preprocessed/test.orig

The predicted normalizations will, in this case, be written to preprocessed/test.orig.norm.

License

All software (in the scripts/ directory) is provided under the MIT License.

Licenses for the datasets differ. The German Anselm data is licensed under CC BY-SA 3.0. The German RIDGES data is licensed under CC BY 3.0. The Icelandic data is licensed under GNU LGPL v3. The Slovene data is licensed under CC BY-SA 4.0. The other datasets unfortunately do not indicate a license, but their rights holders have indicated that they are "free" for research purposes. Please see the READMEs included in each dataset subdirectory for more details.

Contact

For questions or problems, feel free to file a GitHub issue or contact me directly.
