MorphoChain

Author: Karthik Narasimhan ([email protected])

Unsupervised Discovery of Morphological Chains (TACL 2015)

  • A model for unsupervised morphological analysis that integrates orthographic and semantic views of words.
  • The model consistently outperforms three state-of-the-art baselines on morphological segmentation for Arabic, English and Turkish.

Download

You can clone the repository and use the production2 branch (default) for the latest code.

Dependencies (before Compiling)

  1. This project uses the L-BFGS-B algorithm for optimization (the jar files for the library are included in lib/). However, we recommend that you download and install the lbfgsb_wrapper for Java yourself, since installing on Mac OSX may require additional steps. At the end of the install, move the files lbfgsb_wrapper-.jar and liblbfgsb_wrapper.so (or liblbfgsb_wrapper.dylib on OSX) into the lib/ directory.
  2. External library: commons-lang3-3.3.2.jar (included in lib/)
  3. Install the JUnit framework by following the instructions at http://junit.org/ or by using Maven.
  4. Replace the path for jdk.home.1.7 in the build.properties file with the path to your local JDK install.
  5. (Optional) Change path.variable.maven_repository in build.properties to your local Maven repository if you wish to use your Maven installs.
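
As a quick sanity check on the steps above, the following shell sketch reports which of the files named in the list are absent from lib/. The function name check_lib_dir is hypothetical and not part of MorphoChain, and the glob for the wrapper jar assumes only that its file name starts with lbfgsb_wrapper- and ends in .jar.

```shell
# check_lib_dir is a hypothetical helper, not part of MorphoChain.
# It prints one line per missing dependency file and returns 1 if
# anything is missing, 0 otherwise. The directory defaults to lib.
check_lib_dir() {
    dir="${1:-lib}"
    missing=0

    # External library that ships with the repository.
    if [ ! -f "$dir/commons-lang3-3.3.2.jar" ]; then
        echo "missing: $dir/commons-lang3-3.3.2.jar"
        missing=1
    fi

    # The wrapper jar carries a version suffix, so match it with a glob.
    # If nothing matches, the pattern stays literal and the -f test fails.
    found_jar=0
    for f in "$dir"/lbfgsb_wrapper-*.jar; do
        if [ -f "$f" ]; then
            found_jar=1
        fi
    done
    if [ "$found_jar" -eq 0 ]; then
        echo "missing: $dir/lbfgsb_wrapper-*.jar"
        missing=1
    fi

    # Native library: .so on Linux, .dylib on OSX. Either one is accepted.
    if [ ! -f "$dir/liblbfgsb_wrapper.so" ] && [ ! -f "$dir/liblbfgsb_wrapper.dylib" ]; then
        echo "missing: $dir/liblbfgsb_wrapper.so (or .dylib on OSX)"
        missing=1
    fi

    return "$missing"
}
```

After sourcing the function, running check_lib_dir lib from the project root lists any files that still need to be moved into lib/. It only checks that the files exist, not that they load correctly.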

Compile

Run 'ant all' in a terminal to compile (requires Ant version > 1.6). You can also import the entire directory into IntelliJ or Eclipse and compile from the GUI.

Sample Usage

Here is an example of how to run the code from the root directory of the project. The output will contain the predicted segmentations for all the words in the test file. If you do not have gold segmentations to test against, you can input a file that lists each word as its own segmentation (that is, in each line of the file the word itself takes the place of its gold segmentation after the colon; see FORMATS.txt for details).

PARAMS_FILE=params.properties
OUT_FILE=output.txt
java -ea -Djava.library.path=lib/ -classpath "./lib/*:./out/production/Morphology" Main "$PARAMS_FILE" > "$OUT_FILE"
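
For the case without gold segmentations, a test file can be generated from a plain word list. The sketch below assumes the layout is one word:word pair per line, as the description above suggests; FORMATS.txt remains the authority on the exact format. The function name make_self_segmented and the file names in the comment are hypothetical.

```shell
# make_self_segmented is a hypothetical helper, not part of MorphoChain.
# It reads one word per line on stdin and prints "word:word" for each,
# skipping blank lines. Assumed layout; check FORMATS.txt before use.
make_self_segmented() {
    awk 'NF > 0 { print $1 ":" $1 }'
}

# Example invocation (file names are placeholders):
#   make_self_segmented < words.txt > test_words.txt
```

Only the first whitespace-separated token on each line is used, so a word list with extra columns (for example, frequency counts) would still yield one pair per line.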

Configuration

Most parameters of the model can be changed in the file params.properties.

Word Vectors

A good tool for producing your own vectors from a raw corpus is word2vec. You can also use any pre-existing vectors as long as they follow the format specified in FORMATS.txt.

Contact

Please use the issue tracker or email me if you have any questions/suggestions.
