Giter Club home page Giter Club logo

problm's Introduction

ProbLM

probabilistic counting for language modeling.

Description

A language model computes the probability of a sentence: $$ P(w_1, \dots , w_m)=\prod_{i=1}^m P(w_i \mid w_1, \dots , w_{i-1}) \approx \prod_{i=1}^m P(w_i \mid w_{i-(n-1)}, \dots , w_{i-1}) $$ where $P(w_i \mid w_{i-(n-1)}, \dots , w_{i-1})$ can be estimated by counting the occurance of n-grams: $$ P(w_i \mid w_{i-(n-1)}, \dots , w_{i-1})\approx \frac{\text{count}( w_{i-(n-1)}, \dots , w_{i-1}, w_i)}{\text{count}(w_{i-(n-1)}, \dots , w_{i-1})} $$ and with smoothing (so that model will not output zero probability for an unseen n-gram): $$ P(w_i \mid w_{i-(n-1)}, \dots , w_{i-1})\approx \frac{\text{count}( w_{i-(n-1)}, \dots , w_{i-1}, w_i)+1}{\text{count}(w_{i-(n-1)}, \dots , w_{i-1})+|V|} $$ where $|V|$ is the vocabulary size.

Traditional method maintain a huge table to store the counts of every n-gram, which is memory-inefficient, since there are $|V|^n$ possible n-grams and $|V|$ is already large. Our implementation use memory-efficient CountMinSketch algorithm to estimate n-gram counts and HyperLogLog to estimate vocabulary size.

Requirements

  1. python3
  2. numpy

Usage

0. Prepare the corpus (optional)

  1. Download a English wikipedia corpus:
>>> mkdir corpus
>>> wget https://dumps.wikimedia.org/enwiki/20180801/enwiki-20180801-pages-articles-multistream.xml.bz2 -P corpus/
  1. Convert the corpus to plain text (require gensim):
>>> python make_corpus.py corpus/enwiki-20180801-pages-articles-multistream.xml.bz2 corpus/enwiki.txt

1. Train the model

>>> ./train.sh

2018-08-15 00:00:11,798: INFO: processed 100 lines
2018-08-15 00:00:27,022: INFO: processed 200 lines
2018-08-15 00:00:39,387: INFO: processed 300 lines
2018-08-15 00:00:51,641: INFO: processed 400 lines
2018-08-15 00:01:00,908: INFO: processed 500 lines
2018-08-15 00:01:07,962: INFO: model saved to models/enwiki_ns3

2. Evaluate the model

>>> ./human_eval.sh

INFO:root:type = 'count_min_sketch'
INFO:root:counter = <frequency_estimation.CountMinSketch object at 0x000001A3DF80C8D0>
INFO:root:vocab_size = 82346
INFO:root:hash_size = 1048576
INFO:root:hash_num = 32
INFO:root:ngram_size = 3
Enter a sentence (EXIT to break):

>>> it is important

INFO:root:('<BOS>', '<BOS>')     count = 500
INFO:root:('<BOS>', '<BOS>', 'it')       count = 2
INFO:root:('<BOS>', 'it')        count = 2
INFO:root:('<BOS>', 'it', 'is')  count = 2
INFO:root:('it', 'is')   count = 1817
INFO:root:('it', 'is', 'important')      count = 10
INFO:root:('is', 'important')    count = 33
INFO:root:('is', 'important', '<EOS>')   count = 0

------------------------------------------
Probability: 0.0000000000000000020930175548966076439696

problm's People

Contributors

yanlinf avatar

Stargazers

 avatar  avatar

Forkers

hugozhl

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.