europarl_uds

A multilingual and multitask adaptation of the Europarl Corpus.

Requirements

codecs
glob
langdetect
lxml

Generating the corpus

Run parallel_from_xml.py to extract parallel text from the XML files.

Requirements

tqdm

parameters

-i -> input directory: path to proceedings xml
-o -> output file path
-s -> source language(s)
-p -> parallel language(s)
-n -> native/non-native speaker filter
-d -> direct/pivot translation(s) filter
-f -> start date (optional)
-t -> end date (optional)
-h -> help

Sample command

python parallel_from_xml.py -i proceedings/xml/ -o extracted/parallels_ns.tsv -s de en es fr it nl pt -p de en es fr it nl pt -n ns -d 0 -f 19990721 -t 19990723
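
To illustrate how the flags in the sample command fit together, here is a minimal sketch of an argument parser and a date filter. It is not the code of parallel_from_xml.py: the `dest` names, the `build_parser` and `in_date_range` helpers, the YYYYMMDD date format (inferred from the sample values) and the inclusive range are all assumptions made for illustration.

```python
import argparse
from datetime import datetime


def build_parser():
    # Hypothetical parser mirroring the documented flags; the real script may differ.
    parser = argparse.ArgumentParser(description="Sketch of the extraction CLI")
    parser.add_argument("-i", dest="input_dir", required=True)
    parser.add_argument("-o", dest="output_file", required=True)
    parser.add_argument("-s", dest="source_langs", nargs="+")
    parser.add_argument("-p", dest="parallel_langs", nargs="+")
    parser.add_argument("-n", dest="speaker_filter")
    parser.add_argument("-d", dest="translation_filter")
    parser.add_argument("-f", dest="start_date", default=None)
    parser.add_argument("-t", dest="end_date", default=None)
    return parser


def in_date_range(day, start=None, end=None):
    # Assumes YYYYMMDD strings and an inclusive range; both bounds are optional.
    fmt = "%Y%m%d"
    current = datetime.strptime(day, fmt)
    if start is not None and current < datetime.strptime(start, fmt):
        return False
    if end is not None and current > datetime.strptime(end, fmt):
        return False
    return True


sample = (
    "-i proceedings/xml/ -o extracted/parallels_ns.tsv "
    "-s de en es fr it nl pt -p de en es fr it nl pt "
    "-n ns -d 0 -f 19990721 -t 19990723"
)
args = build_parser().parse_args(sample.split())
print(args.source_langs)
print(in_date_range("19990722", args.start_date, args.end_date))
```

Using `nargs="+"` is one way to accept several language codes after a single `-s` or `-p` flag, as the sample command does.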

Create Translationese Dataset

This dataset is created from the output of the parallel corpus extraction. Run create_translationese_splits.py to create train, dev and test splits.

Requirements

tqdm
numpy
pandas

parameters

-i -> input file
-o -> output directory
-s -> source language(s)
-p -> parallel language(s)
-n -> native/non-native speaker filter
-d -> direct/pivot translation(s) filter
-t -> fraction of data for train split
-v -> fraction of data for dev split
-r -> flag controlling whether original texts are kept as part of the dataset
-h -> help

Sample command

python create_translationese_splits.py -i extracted/parallels_ns_pivot.tsv -o tr_splits_ns_pivot/ -s de en es fr it pt nl -p en es de fr it pt nl -n 1 -d 2 -t 0.7 -v 0.15
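
As a rough illustration of what the `-t` and `-v` fractions mean, the sketch below shuffles a table and cuts it into train, dev and test parts. It assumes the test split is whatever remains after train and dev (0.15 in the sample command); the `split_frame` helper, the fixed seed and the toy `text` column are hypothetical and not taken from the script.

```python
import numpy as np
import pandas as pd


def split_frame(df, train_frac, dev_frac, seed=0):
    # Assumption: the test split receives the rows left over after train and dev.
    if train_frac < 0 or dev_frac < 0 or train_frac + dev_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))  # shuffled row positions
    n_train = round(len(df) * train_frac)
    n_dev = round(len(df) * dev_frac)
    train = df.iloc[order[:n_train]]
    dev = df.iloc[order[n_train:n_train + n_dev]]
    test = df.iloc[order[n_train + n_dev:]]
    return train, dev, test


# Toy data standing in for the extracted TSV; the column name is made up.
toy = pd.DataFrame({"text": [f"sentence {i}" for i in range(100)]})
train, dev, test = split_frame(toy, train_frac=0.7, dev_frac=0.15)
print(len(train), len(dev), len(test))
```

Shuffling positions once and slicing the permutation keeps the three splits disjoint without any extra bookkeeping.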

Create Machine Translation Dataset

This dataset is created from the output of the parallel corpus extraction. Run create_parallel_splits.py to create train, dev and test splits.

Requirements

tqdm
numpy
pandas

parameters

-i -> input file
-o -> output directory
-s -> source language(s)
-p -> parallel language(s)
-n -> native/non-native speaker filter
-d -> direct/pivot translation(s) filter
-t -> fraction of data for train split
-v -> fraction of data for dev split
-r -> flag controlling whether original texts are kept as part of the dataset
-h -> help

Sample command

python create_parallel_splits.py -i extracted/parallels_direct.tsv -o mt_splits_direct/ -s de en es fr it pt nl -p en es de fr it pt nl -n 0 -d 1 -t 0.7 -v 0.15
