inuktitut-morpheme-segment

Statistical morpheme segmentation of Inuktitut

This project runs the Inuktitut Morphological Analyzer, developed by the National Research Council of Canada, on the Nunavut Hansard corpus and the Inuktitut Bible. The analyzer output serves both as a baseline and as training data consisting of Inuktitut words and their morphemes.

The project then uses the Morfessor package (which implements HMMs specifically for segmentation), trained with the recursive algorithm, to try to find the best segmentations. Decoding is done with Viterbi.
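As a rough illustration of what Viterbi decoding does for segmentation, here is a minimal pure-Python sketch. It is not Morfessor's implementation and not the code in run_hmm.py: the function name `segment_viterbi`, the toy lexicon, its probabilities, and the unknown-character penalty are all assumptions invented for the example.

```python
import math

def segment_viterbi(word, morph_logprob, unknown_char_logprob=math.log(0.001), max_len=10):
    """Return (morphs, score) for the highest-scoring split of `word`.

    morph_logprob maps known morphs to log-probabilities (toy values here).
    Unknown substrings are only allowed as single characters, at a flat penalty,
    so every word has at least one valid segmentation.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last morph in that prefix
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = word[start:end]
            if piece in morph_logprob:
                score = morph_logprob[piece]
            elif len(piece) == 1:
                score = unknown_char_logprob
            else:
                continue
            candidate = best[start] + score
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Follow the back-pointers from the end of the word to recover the morphs.
    morphs = []
    i = n
    while i > 0:
        morphs.append(word[back[i]:i])
        i = back[i]
    morphs.reverse()
    return morphs, best[n]

# Hypothetical lexicon with made-up probabilities, for illustration only.
toy_lexicon = {
    "illu": math.log(0.4),
    "mi": math.log(0.3),
    "il": math.log(0.05),
    "lumi": math.log(0.05),
}

morphs, score = segment_viterbi("illumi", toy_lexicon)
print(morphs)  # ['illu', 'mi']
```

The dynamic program keeps only the best-scoring split for each prefix, so the cost is at most the word length times `max_len` lexicon lookups instead of an enumeration of every possible split.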

The test set is a portion of the Inuktitut Bible.

Project setup

  • README, .gitignore, etc
  • pipeline.sh: bash script that runs everything
  • Uqailaut.jar: the morphological analyzer
  • corpus/
    • NunavutHansard.txt: version 2.0 of the Nunavut Hansard (too large for GitHub, download from link above)
    • NunavutHansard-sm.txt: first 100k lines of the Nunavut Hansard
    • bible/
      • genesis.txt
  • scripts/
    • Corpus.java
    • run_hmm.py
    • make_small.py
  • data/
    • train/
      • genesis-annotation
      • genesis-text
      • NunavutHansard-sm-annotation
      • NunavutHansard-sm-text
      • NunavutHansard-text
    • test/
      • genesis-gold
      • genesis-text
      • NunavutHansard-sm-text
      • NunavutHansard-text
  • models/
    • bible.bin
    • bible-segmentation
    • hansard.bin
  • results/: decoding using the various models on the test set (genesis-gold)
  • writeup/
    • finalproject.tex: LaTeX file
    • finalproject.pdf: PDF of the final report
    • bib.bib: BibTeX bibliography

How to run

bash ./pipeline.sh

Java 1.8 and Python 3 are recommended, although Python 2.7 should also work.

Results

The data is too sparse for strong results. In general, increasing the weight assigned to the annotated corpus raises recall and F-measure relative to the unsupervised method, at the cost of precision.
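For readers unfamiliar with how precision, recall, and F-measure apply to segmentation, the sketch below scores predicted morph boundaries against gold ones. It assumes boundary-level, micro-averaged scoring, which may differ from how the numbers in the writeup were computed. The function names and the toy segmentations are hypothetical.

```python
def boundaries(morphs):
    """Return the set of internal boundary offsets for one segmented word."""
    positions = set()
    offset = 0
    for morph in morphs[:-1]:  # the end of the word is not counted as a boundary
        offset += len(morph)
        positions.add(offset)
    return positions

def boundary_prf(predicted, gold):
    """Micro-averaged boundary precision, recall, and F1 over paired segmentations."""
    true_pos = n_pred = n_gold = 0
    for pred_morphs, gold_morphs in zip(predicted, gold):
        pred_b = boundaries(pred_morphs)
        gold_b = boundaries(gold_morphs)
        true_pos += len(pred_b & gold_b)
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy segmentations (invented): the first word is over-segmented,
# the second is under-segmented.
predicted = [["ab", "c", "d"], ["ef", "gh"]]
gold = [["ab", "cd"], ["e", "f", "gh"]]

precision, recall, f1 = boundary_prf(predicted, gold)
print(precision, recall, f1)
```

Under this kind of scoring, a model that splits more aggressively tends to recover more gold boundaries (higher recall) while also proposing more spurious ones (lower precision).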

In the future I would leave enough time to train and test on the Nunavut Hansard, to try online and online+batch training, and to train with the Viterbi algorithm in addition to the recursive one.

Given these results, at least with a dataset of this size, I wouldn't recommend statistical methods over a rule-based one, especially since Inuktitut morphology is so regular.
