Segmental Language Models

Introduction

A PyTorch Implementation of Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling

Implemented features

Models:

  • Unsupervised Learning with Segmental Language Models
  • Supervised Learning with Segmental Language Models

Usage

Chinese Corpus:

  • segmented.txt: segmented data set for supervised training
  • unsegmented.txt: unsegmented data set. You can use both this data set and test.txt for unsupervised training
  • test.txt: unsegmented data set for evaluation
  • test_gold.txt: gold segmented test data set
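The relationship between the segmented and unsegmented files can be illustrated with a short sketch. It assumes one sentence per line and whitespace between words in the segmented file; this layout is an assumption, not something the README states, and the helper names are hypothetical.

```python
def strip_segmentation(line: str) -> str:
    """Remove whitespace word boundaries from one segmented line.

    Assumption: words in the segmented file are separated by whitespace.
    """
    return "".join(line.split())


def segmented_to_unsegmented(lines):
    """Yield unsegmented sentences, skipping lines that are empty."""
    for line in lines:
        text = strip_segmentation(line)
        if text:
            yield text


# Toy data standing in for the contents of a segmented file.
sample = ["我 爱 北京 天安门", "", "今天 天气 很 好"]
unsegmented = list(segmented_to_unsegmented(sample))
print(unsegmented)
```

The same idea gives a quick sanity check that test.txt and test_gold.txt contain the same characters once the word boundaries are removed.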

Train

For example, this command trains an unsupervised SLM on the PKU dataset with a maximal segment length of 4 on GPU 0.

bash run.sh train unsupervised pku 4 0

See run.sh and the argparse configuration in codes/run.py for the remaining arguments and further details.
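To see what the maximal segment length controls, consider the quantity a segmental language model optimizes during unsupervised training: the likelihood of a sentence summed over every segmentation whose segments are at most k characters long. The sketch below computes that sum with a forward recursion in log space. It is a minimal illustration, not the code in this repository; `segment_log_prob` is a hypothetical stand-in for the neural segment scorer.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of finite values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginal_log_likelihood(chars, segment_log_prob, max_len):
    """Log-likelihood of `chars` summed over all segmentations.

    alpha[t] is the log-probability of the first t characters, summed over
    every way of splitting them into segments of length <= max_len.
    """
    n = len(chars)
    alpha = [0.0] + [-math.inf] * n
    for end in range(1, n + 1):
        terms = []
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            terms.append(alpha[start] + segment_log_prob(chars[start:end]))
        alpha[end] = logsumexp(terms)
    return alpha[n]


# Hypothetical scorer: every segment has probability 0.5, whatever its content.
def toy_scorer(segment):
    return math.log(0.5)


ll_k3 = marginal_log_likelihood("abc", toy_scorer, max_len=3)
ll_k1 = marginal_log_likelihood("abc", toy_scorer, max_len=1)
```

With the toy scorer, a three-character input has four segmentations when k = 3 and only one when k = 1, so a larger k widens the set of segmentations the recursion sums over and adds work per position.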

Predict

bash run.sh predict unsupervised pku 4 0
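Conceptually, prediction picks the single best segmentation under the same length limit. The following sketch uses Viterbi-style dynamic programming over segment boundaries. It is an illustration under stated assumptions rather than the repository's implementation: the scoring function and the toy lexicon are hypothetical, standing in for the trained model.

```python
import math


def viterbi_segment(chars, segment_score, max_len):
    """Return the highest-scoring segmentation with segments of length <= max_len."""
    n = len(chars)
    best = [0.0] + [-math.inf] * n   # best[t]: best score for the first t characters
    back = [0] * (n + 1)             # back[t]: start index of the last segment
    for end in range(1, n + 1):
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            candidate = best[start] + segment_score(chars[start:end])
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Follow the back-pointers from the end of the sentence to recover segments.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(chars[start:end])
        end = start
    return segments[::-1]


# Hypothetical scorer: known words get a fixed log-score, anything else is
# penalized per character.
TOY_LEXICON = {"我": -1.0, "爱": -1.0, "北京": -1.0, "天安门": -1.5}


def toy_score(segment):
    return TOY_LEXICON.get(segment, -5.0 * len(segment))


words_k4 = viterbi_segment("我爱北京天安门", toy_score, max_len=4)
words_k2 = viterbi_segment("我爱北京天安门", toy_score, max_len=2)
print(" ".join(words_k4))
```

With max_len=2 the three-character word can no longer be produced as one segment, which shows how the length limit constrains the output.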

Evaluation

bash run.sh eval unsupervised pku 4
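Word segmentation is commonly scored with word-level precision, recall, and F1, where a predicted word counts as correct only if both of its boundaries match the gold segmentation. The sketch below shows that computation on already-tokenized sentences. Whether the evaluation script in this repository uses exactly this definition is an assumption, and the function names are hypothetical.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold_sentences, pred_sentences):
    """Compute corpus-level word precision, recall, and F1."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: the prediction splits the last gold word in two.
gold = [["我", "爱", "北京", "天安门"]]
pred = [["我", "爱", "北京", "天安", "门"]]
p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Comparing spans rather than word strings keeps repeated words at different positions from being confused with each other.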

Speed

Training usually converges in about 30 to 50 minutes, depending on the maximal segment length (2 to 4).

Unsupervised results of the SLM model (Maximal Segment Length = k)

Dataset   PKU             MSR             AS              CityU
k = 2     0.797 (0.802)   0.776 (0.785)   0.794 (0.794)   0.786 (0.782)
k = 3     0.803 (0.798)   0.784 (0.794)   0.800 (0.803)   0.803 (0.805)
k = 4     0.797 (0.792)   0.782 (0.790)   0.798 (0.804)   0.798 (0.797)

Note that this is a re-implementation of the SLM model. Because detailed settings such as the data loader configuration, dropout rate, and learning rate differ, its results differ slightly from those reported in the paper.

Using the library

The Python library is organized around four objects:

  • InputDataset (dataloader.py): prepares the data stream for training and evaluation
  • CWSTokenizer (tokenization.py): works with InputDataset for data pre-processing
  • SegmentalLM (model.py): builds the model and provides the train/test API for SLM
  • SLMConfig (model.py): manages configurations for SLM

The run.py file contains the main function, which parses arguments, reads data, initializes the model, and provides the training loop.

Citation

If you use this code, please cite the following paper:

@inproceedings{sun2018unsupervised,
  title={Unsupervised Neural Word Segmentation for Chinese via Segmental Language Modeling},
  author={Sun, Zhiqing and Deng, Zhi-Hong},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  pages={4915--4920},
  year={2018}
}
