mALIGNA-wrapper

A set of scripts to build parallel corpora using mALIGNa


Usage

File system structure:

${corpus_title}
|-- source
|   |-- ${title}_${source_lang}.snt
|   |-- ${title}_${target_lang}.snt
|-- work
|   |-- ${source_lang}-${target_lang}
|       |-- ${title}_${source_lang}.snt
|       |-- ${title}_${target_lang}.snt
|       |-- ${title}_${source_lang}.snt.aligned
|       |-- ${title}_${target_lang}.snt.aligned
|-- aligned_idx
|   |-- ${source_lang}-${target_lang}
|       |-- ${title}.${source_lang}.idx
|       |-- ${title}.${target_lang}.idx
|-- result
    |-- ${corpus_title}.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.${source_lang}-${target_lang}.${target_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${source_lang}
    |-- ${corpus_title}.unique.${source_lang}-${target_lang}.${target_lang}
  • Additional Python dependency: PyYAML. Install it using the python -m pip install PyYAML command if necessary.
  • mALIGNa must be present on your machine (see References below).
  • Before running the shell script, put your source files in the ${corpus_title}/source/snt directory.
  • The content of source files must be segmented in sentences (one sentence per line).
  • Filenames of input files must have the following pattern: ${title}_${lang}.snt (e.g. document_en.snt).
  • Parallel files must have identical titles (e.g. article_001_en.snt, article_001_fr.snt).
  • There are two source data directories, both specified in the YAML file: 'original_source_data_directory' and 'preprocessed_source_data_directory'. The 'original_source_data_directory' holds files containing unmodified natural-language sentences. The 'preprocessed_source_data_directory' holds additionally preprocessed files derived from the 'original_source_data_directory' (e.g. stemmed files, additionally tokenized files etc.). Sentence alignment is computed on the content of the 'preprocessed_source_data_directory', whereas the parallel corpora are built from the content of the 'original_source_data_directory'. If the source files have not been additionally preprocessed, both paths must be equal.
  • The 'work', 'aligned_idx' and 'result' directories are created automatically.
  • Aligned corpora are placed in the 'result' directory.
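
The naming rules above can be checked before the aligner is run. The following is a minimal sketch and not part of the wrapper itself; the function name pair_parallel_files and the regular expression are assumptions made for illustration. It groups file names of the form ${title}_${lang}.snt by title and keeps only the titles that exist in both languages.

```python
import re
from collections import defaultdict

# ${title}_${lang}.snt; the title may itself contain underscores, so the
# language code is assumed to be the last "_"-separated part of the name.
SNT_NAME = re.compile(r"^(?P<title>.+)_(?P<lang>[^_.]+)\.snt$")


def pair_parallel_files(filenames, source_lang, target_lang):
    """Return {title: (source_file, target_file)} for titles found in both languages."""
    by_title = defaultdict(dict)
    for name in filenames:
        match = SNT_NAME.match(name)
        if match:  # names that do not follow the pattern are ignored
            by_title[match["title"]][match["lang"]] = name
    return {
        title: (langs[source_lang], langs[target_lang])
        for title, langs in sorted(by_title.items())
        if source_lang in langs and target_lang in langs
    }


names = ["article_001_en.snt", "article_001_fr.snt", "document_en.snt", "notes.txt"]
pairs = pair_parallel_files(names, "en", "fr")
print(pairs)
```

In this example only article_001 is paired: document_en.snt has no French counterpart and notes.txt does not match the pattern.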

Note: It is not necessary to keep all automatically created subdirectories (work, aligned_idx, result) under the same root but it is much easier to track the alignment process in this way.
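
The names of the files in the 'result' directory follow directly from the layout shown above. As a rough illustration only (the helper result_file_names is hypothetical and is not taken from the wrapper's scripts), they can be derived from the corpus title and the language pair like this:

```python
def result_file_names(corpus_title, source_lang, target_lang):
    """Build the four file names listed under 'result' in the directory layout."""
    pair = f"{source_lang}-{target_lang}"
    return [
        f"{corpus_title}.{pair}.{source_lang}",
        f"{corpus_title}.{pair}.{target_lang}",
        # the "unique" variants appear in the layout next to the plain ones
        f"{corpus_title}.unique.{pair}.{source_lang}",
        f"{corpus_title}.unique.{pair}.{target_lang}",
    ]


result_names = result_file_names("aligned_corpora", "en", "fr")
for name in result_names:
    print(name)
```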

An example of a configuration file (YAML):

(for running on Windows OS; replace values in square brackets with actual paths; see also io_args.yml.sample)


source_language: en
target_language: fr

corpus_title: aligned_corpora

maligna:
  root: E:\tools\maligna
  main_class: net.loomchild.maligna.ui.console.Maligna

original_source_data_directory: [...]\aligned_corpora\source
preprocessed_source_data_directory: [...]\aligned_corpora\source
work_directory: [...]\aligned_corpora\work
alignment_index_directory: [...]\aligned_corpora\aligned_idx
output_data_directory: [...]\aligned_corpora\result
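
Before running the scripts, it can help to confirm that the configuration contains every key used in the example above. The sketch below is an assumption about how such a check might look, not the wrapper's own validation; missing_keys is a hypothetical helper. It works on an already loaded dictionary, which in practice would come from PyYAML's yaml.safe_load.

```python
# Top-level keys taken from the example configuration above.
REQUIRED_KEYS = (
    "source_language",
    "target_language",
    "corpus_title",
    "original_source_data_directory",
    "preprocessed_source_data_directory",
    "work_directory",
    "alignment_index_directory",
    "output_data_directory",
)
# Keys nested under "maligna" in the example configuration.
MALIGNA_KEYS = ("root", "main_class")


def missing_keys(config):
    """Return the names of required settings that are absent or empty."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    maligna = config.get("maligna") or {}
    missing += [f"maligna.{key}" for key in MALIGNA_KEYS if not maligna.get(key)]
    return missing


# Deliberately incomplete configuration, as yaml.safe_load might return it.
config = {
    "source_language": "en",
    "target_language": "fr",
    "corpus_title": "aligned_corpora",
    "maligna": {"root": r"E:\tools\maligna"},
}
problems = missing_keys(config)
print(problems)
```

An empty list means that all settings from the example are present; the check does not verify that the paths exist.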

Running the shell script

  • Enter the actual values for parameters in the configuration YAML file (see above).
  • Specify the name of the configuration (YAML) file in the run_maligna.bat file (the value of config_file). The YAML file must reside in the script directory.
  • Execute the following command (on Windows):
    .\run_maligna.bat

Note: The current set of scripts can also be run under UNIX/Linux. For this purpose, a Bash script similar to run_maligna.bat must be executed.


References:
