punkProse

Punctuation generation for speech transcripts using lexical, syntactic and prosodic features.

This is a modification of the forked repository: training is reduced to a single stage and more word-level prosodic features are added. This version lets you use any combination of word-aligned features.

Prosodically annotated files are in proscript format (https://github.com/alpoktem/proscript). For example data and extraction scripts see: https://github.com/alpoktem/ted_preprocess

How does it perform?

The English punctuation model was trained on a prosodically annotated TED corpus consisting of 1038 talks (155174 sentences). Link to dataset: http://hdl.handle.net/10230/33981

Punctuation generation accuracy with respect to human transcription:

PUNCTUATION         PRECISION   RECALL   F-SCORE
Comma (,)           61.3        48.9     54.4
Question Mark (?)   71.8        70.6     71.2
Period (.)          82.6        83.5     83.0
Overall             73.7        67.3     70.3

These scores were obtained with a model trained on leveled pause duration and mean f0 features together with word and POS tags.
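
The F-score column is consistent with the usual harmonic mean of precision and recall (F1). That reading is an assumption, since the formula is not stated here. A minimal sketch that recomputes the per-mark values from the precision and recall columns:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1).

    Assumed formula; the source does not state how F-SCORE is computed.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs copied from the table above.
table = {
    "Comma": (61.3, 48.9),
    "Question Mark": (71.8, 70.6),
    "Period": (82.6, 83.5),
}

# Round to one decimal place, as in the table.
recomputed = {mark: round(f_score(p, r), 1) for mark, (p, r) in table.items()}
print(recomputed)
```

The Overall row is left out on purpose: recomputing it from the rounded precision and recall may differ in the last digit from a value computed on unrounded counts.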

Example Run

  • Requirements:
    • Python 3.x
    • NumPy
    • Theano
    • yaml

The data directory (path $datadir) should look like the output folder (data/corpus) in https://github.com/alpoktem/ted_preprocess. Vocabularies and the sampled training, testing and development sets are stored here.

The sample run explained here is provided in run.sh.

Training

Training is done on sequenced data stored in train_samples under $datadir.

The dataset features to train with are given with the -f flag, repeated once per feature. Other training parameters are specified through the parameters.yaml file. To train with word, pause, POS and mean f0:

modelId="mod_word-pause-pos-mf0"

python main.py -m $modelId -f word -f pause_before -f pos -f f0_mean -p parameters.yaml
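
How main.py actually parses its arguments is not shown here. As a rough illustration of how a repeated -f flag can be collected into a feature list, here is a minimal sketch using Python's argparse. The parser, the long option names and the destination names are hypothetical stand-ins, not the repository's code:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for main.py's command line; only the short
    # flags -m, -f and -p come from the README.
    parser = argparse.ArgumentParser(description="Sketch of a training CLI")
    parser.add_argument("-m", "--model-id", required=True)
    # action="append" gathers every occurrence of -f into one list.
    parser.add_argument("-f", "--feature", action="append", default=[],
                        dest="features")
    parser.add_argument("-p", "--params", required=True)
    return parser


# Same arguments as the training command above, passed as a list.
args = build_parser().parse_args([
    "-m", "mod_word-pause-pos-mf0",
    "-f", "word", "-f", "pause_before", "-f", "pos", "-f", "f0_mean",
    "-p", "parameters.yaml",
])
print(args.features)  # ['word', 'pause_before', 'pos', 'f0_mean']
```

With this pattern the order of the -f flags is preserved, so any combination of word-aligned features can be passed without changing the script.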

Testing

Testing is done on proscript data using punctuator.py. Either a single <input-file> or an <input-directory> is given as input, using -i or -d respectively. Any punctuation information already present in this data is ignored. Predictions for each file in the $test_samples directory are written to the $out_predictions directory. Input files should contain the features that the model was trained with.

model_name="Model_single-stage_${modelId}_h100_lr0.05.pcl"

python punctuator.py -m "$model_name" -d $test_samples -o $out_predictions
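
The model file name used above appears to combine the model ID with two training settings. Reading h100 as a hidden size of 100 and lr0.05 as a learning rate of 0.05 is an assumption, and the helper below is hypothetical rather than part of the repository. A minimal sketch of composing the name under that assumption:

```python
def model_file_name(model_id: str, hidden_size: int = 100,
                    learning_rate: float = 0.05) -> str:
    """Compose a model file name following the pattern shown in the README.

    hidden_size and learning_rate are assumed meanings of "h100" and
    "lr0.05"; this helper is illustrative only.
    """
    return f"Model_single-stage_{model_id}_h{hidden_size}_lr{learning_rate}.pcl"


name = model_file_name("mod_word-pause-pos-mf0")
print(name)
```

If the values in parameters.yaml change, the file name written by training would presumably change with them, so the name passed to -m should be checked against the file that was actually produced.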

Scoring the testing output:

Predictions are compared with ground-truth data using error_calculator.py. It takes either two files to compare or two directories containing ground-truth and prediction files. Use -r to reduce punctuation marks.

python error_calculator.py -g $groundtruthData -p $out_predictions -r
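
The internals of error_calculator.py are not described here. As a rough illustration of what a per-mark comparison involves, the sketch below counts matches between two word-aligned label sequences and derives precision and recall for one punctuation mark. The function name and the toy data are hypothetical:

```python
from collections import Counter


def precision_recall(truth, predicted, mark):
    """Precision and recall for one mark over two aligned label sequences."""
    if len(truth) != len(predicted):
        raise ValueError("sequences must be aligned word by word")
    counts = Counter()
    for t, p in zip(truth, predicted):
        if p == mark and t == mark:
            counts["tp"] += 1   # predicted and correct
        elif p == mark:
            counts["fp"] += 1   # predicted but not in the ground truth
        elif t == mark:
            counts["fn"] += 1   # in the ground truth but missed
    predicted_total = counts["tp"] + counts["fp"]
    actual_total = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_total if predicted_total else 0.0
    recall = counts["tp"] / actual_total if actual_total else 0.0
    return precision, recall


# Toy labels: the punctuation following each word ("" means none).
truth     = [",", "",  ".", "", ",", "."]
predicted = [",", ",", ".", "", "",  "."]
print(precision_recall(truth, predicted, ","))
print(precision_recall(truth, predicted, "."))
```

Scoring each mark separately, as in the results table, keeps a frequent class such as the period from hiding weaker performance on commas.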

Citing

More details can be found in the publication: https://link.springer.com/chapter/10.1007/978-3-319-68456-7_11

This work can be cited as:

@inproceedings{punkProse,
	author = {Alp Oktem and Mireia Farrus and Leo Wanner},
	title = {Attentional Parallel RNNs for Generating Punctuation in Transcribed Speech},
	booktitle = {5th International Conference on Statistical Language and Speech Processing SLSP 2017},
	year = {2017},
	address = {Le Mans, France}
}
