Mimick

Code for Mimicking Word Embeddings using Subword RNNs (EMNLP 2017) and subsequent experiments.

tl;dr

Given a word embedding dictionary (with vectors from, e.g., FastText, Polyglot, or GloVe), Mimick trains a character-level neural net that learns to approximate the embeddings. It can then be applied to infer embeddings in the same space for words that were not available in the original set (i.e., OOVs: out-of-vocabulary words).
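
The following is a minimal sketch, in Python with the project's DyNet dependency, of this training objective: a character-level BiLSTM whose final states are projected into the embedding space and fit to the pre-trained vectors with a squared-distance loss. All sizes, the toy character inventory, and the helper names (predict, train) are illustrative assumptions, not the repository's API.

import random
import dynet as dy

CHAR_DIM, HIDDEN_DIM = 20, 50          # hypothetical sizes
WORD_DIM = 64                          # must match the pre-trained embedding dimension
char_vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}  # toy inventory

pc = dy.ParameterCollection()
char_lookup = pc.add_lookup_parameters((len(char_vocab), CHAR_DIM))
fwd = dy.LSTMBuilder(1, CHAR_DIM, HIDDEN_DIM, pc)   # forward character LSTM
bwd = dy.LSTMBuilder(1, CHAR_DIM, HIDDEN_DIM, pc)   # backward character LSTM
W = pc.add_parameters((WORD_DIM, 2 * HIDDEN_DIM))
b = pc.add_parameters((WORD_DIM,))
trainer = dy.AdamTrainer(pc)

def predict(word):
    # Run both LSTMs over the word's characters and project to the embedding space.
    embs = [char_lookup[char_vocab[c]] for c in word if c in char_vocab]
    f_out = fwd.initial_state().transduce(embs)[-1]
    b_out = bwd.initial_state().transduce(list(reversed(embs)))[-1]
    return dy.parameter(W) * dy.concatenate([f_out, b_out]) + dy.parameter(b)

def train(embedding_dict, epochs=10):
    # embedding_dict: {word: list of floats} taken from e.g. Polyglot, FastText, or GloVe.
    items = list(embedding_dict.items())
    for _ in range(epochs):
        random.shuffle(items)
        for word, vec in items:
            dy.renew_cg()
            loss = dy.squared_distance(predict(word), dy.inputVector(vec))
            loss.value()       # forward pass
            loss.backward()
            trainer.update()

After training, calling predict on an unseen word (inside a fresh computation graph) yields a vector in the same space as the original dictionary.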

Citation

Please cite our paper if you use this code.

@inproceedings{pinter2017mimicking,
  title={Mimicking Word Embeddings using Subword RNNs},
  author={Pinter, Yuval and Guthrie, Robert and Eisenstein, Jacob},
  booktitle={Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  pages={102--112},
  year={2017}
}

Dependencies

The main dependency for this project is DyNet; see the DyNet documentation for installation instructions.

  • As of November 22, 2017, the code is compatible with DyNet 2.0. You may access the DyNet 1.0-compatible code via the commit log.

Create Mimick models

The mimick directory contains the scripts relevant to the Mimick model itself: dataset creation, model training, and intrinsic analysis (see the README within). Its models subdirectory contains models trained for all 23 languages mentioned in the paper. If you're using the pre-trained models, you don't need anything else from the mimick directory in order to run the tagging model. If you train new models, please add them there via pull request!

  • December 12, 2017 note: the pre-trained models are now in DyNet 2.0 format (and were trained with early stopping). The DyNet 1.0-compatible models are still available in a subdirectory.

CNN Version (November 2017)

As of the November 22 PR, there is a CNN version of Mimick available for training. It is currently a single-layer convolutional net (conv -> ReLU -> max-k-pool -> fully-connected -> tanh -> fully-connected) that performs the same function as the LSTM version.
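
As a rough illustration (not the repository's implementation), the forward pass of such a single-layer CNN could be written as follows in DyNet. The filter sizes are hypothetical, and plain max pooling over character positions stands in for max-k pooling to keep the sketch short.

import dynet as dy

NUM_CHARS, CHAR_DIM = 128, 20       # hypothetical character inventory / embedding size
WINDOW, NUM_FILTERS = 3, 60         # hypothetical convolution window and filter count
HIDDEN_DIM, WORD_DIM = 100, 64      # hypothetical MLP width / target embedding size

pc = dy.ParameterCollection()
char_lookup = pc.add_lookup_parameters((NUM_CHARS, CHAR_DIM))
conv_W = pc.add_parameters((CHAR_DIM, WINDOW, 1, NUM_FILTERS))  # (rows, cols, in-chan, out-chan)
conv_b = pc.add_parameters((NUM_FILTERS,))
W1 = pc.add_parameters((HIDDEN_DIM, NUM_FILTERS))
b1 = pc.add_parameters((HIDDEN_DIM,))
W2 = pc.add_parameters((WORD_DIM, HIDDEN_DIM))
b2 = pc.add_parameters((WORD_DIM,))

def cnn_predict(char_ids):
    # conv -> ReLU -> max-pool over positions -> fully-connected -> tanh -> fully-connected.
    # Characters are stacked into a CHAR_DIM x word_length x 1 "image"; the word is
    # assumed to have at least WINDOW characters (a real model would pad shorter words).
    chars = dy.concatenate_cols([char_lookup[c] for c in char_ids])
    x = dy.reshape(chars, (CHAR_DIM, len(char_ids), 1))
    conv = dy.rectify(dy.conv2d_bias(x, dy.parameter(conv_W), dy.parameter(conv_b), [1, 1]))
    pooled = dy.max_dim(dy.reshape(conv, (len(char_ids) - WINDOW + 1, NUM_FILTERS)), 0)
    hidden = dy.tanh(dy.parameter(W1) * pooled + dy.parameter(b1))
    return dy.parameter(W2) * hidden + dy.parameter(b2)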

Tag parts-of-speech and morphosyntactic attributes using trained models

The root directory of this repository contains the code required to perform extrinsic analysis on Universal Dependencies data. Vocabulary files are supplied in the vocabs directory.

The entry point is model.py, which consumes tagging datasets created with the make_dataset.py script. Note that model.py accepts pre-trained word embedding models as text files with no header. For Mimick models, this exact format is written to the path given by the --output argument of the mimick/model.py script. For Word2Vec, FastText, or Polyglot models, such a file can be created with the scripts/output_word_vectors.py script, which accepts a model (.pkl or .bin) and the desired output vocabulary (.txt).
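
For reference, such a headerless embedding file conventionally holds one word per line followed by its space-separated vector components; the exact layout expected by model.py should be confirmed against the repository. A small sketch of writing an embedding dictionary in that assumed format:

def write_embeddings(embedding_dict, path):
    # Assumed layout: "word v1 v2 ... vn" on each line, with no header line.
    with open(path, "w", encoding="utf-8") as out:
        for word, vec in embedding_dict.items():
            out.write(word + " " + " ".join("%.6f" % v for v in vec) + "\n")

write_embeddings({"cat": [0.1, -0.2, 0.3], "dog": [0.0, 0.4, -0.1]}, "vectors.txt")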
