
Download Persian Wiki-Dump, Train KenLM & Spell Checker

In this project, I download the Persian wiki-dump dataset from Wikipedia, preprocess it, and finally train a spell checker and a KenLM language model.

Download & Preprocess the Persian Wiki-Dump

Download the Persian Wiki-Dump

Download the Persian wiki dump using the following bash script. The dataset is about 1 GB, so be patient!

Note: If you live in Iran (you most likely do, since this repo is for the Persian language), turn on your VPN!

language=fa
bash download_wiki_dump.sh $language
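
If you prefer to do this step from Python instead of bash, a minimal sketch follows. It assumes the standard Wikimedia dump URL pattern; it is not the repo's own script:

import urllib.request

language = "fa"
# standard Wikimedia dump URL pattern (assumption: the bash script fetches the same file)
url = f"https://dumps.wikimedia.org/{language}wiki/latest/{language}wiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(url, f"{language}wiki-latest-pages-articles.xml.bz2")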

Extract TXT

Extract the .bz2 archive and convert it to .txt. Using wikiextractor, the dump is cleaned and converted to a .txt file. This may take some time as well!

n_processors=16
bash extract_and_clean_wiki_dump.sh ${language}wiki-latest-pages-articles.xml.bz2 $n_processors

Note: In case of a pdb error, change the expand_templates=True variable to expand_templates=False; it is an input argument to the clean_text function located around line 948 of wikiextractor/wikiextractor/extract.py.
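
For reference, the extraction step can also be driven from Python. A minimal sketch, assuming wikiextractor is installed as a package; the output directory name is hypothetical:

import subprocess

# run wikiextractor as a module; "extracted" is a hypothetical output directory
subprocess.run([
    "python", "-m", "wikiextractor.WikiExtractor",
    "fawiki-latest-pages-articles.xml.bz2",
    "--processes", "16",
    "-o", "extracted",
], check=True)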

Preprocessing and normalization

The output text should be preprocessed and normalized: unnecessary text such as "[doc]" tags is removed, and the text is normalized using the hazm and nltk libraries.

Install Python requirements

pip install -r requirements.txt

Preprocess and normalize

Run the main processing. It may take some time!

python preprocess_wiki_dump.py fawiki-latest-pages-articles.txt
python cleaner.py
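
Roughly, these scripts apply the following kind of cleaning. This is a minimal sketch assuming hazm's Normalizer (the preprocess helper below is illustrative, not the repo's actual function; the real scripts also use nltk):

from hazm import Normalizer

normalizer = Normalizer()

def preprocess(line: str) -> str:  # hypothetical helper, for illustration only
    # strip wikiextractor document markers such as "[doc]" tags
    for tag in ("[doc]", "[/doc]"):
        line = line.replace(tag, "")
    # hazm normalization: unify Arabic/Persian characters, fix spacing, etc.
    return normalizer.normalize(line.strip())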

Get the word-count of the corpus

The following script counts the words of the corpus. Before counting, some extra normalization and cleaning are applied to the words as well.

sudo apt-get install pv
bash get_counts.sh 
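
In pure Python, the counting step boils down to the following sketch. The input filename is an assumption (use your preprocessed corpus); the output name matches the fa_wiki.counts file used below:

from collections import Counter

counts = Counter()
# input filename is an assumption; point it at your preprocessed corpus
with open("fawiki-latest-pages-articles.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

# write space-separated "word count" lines, most frequent first
with open("fa_wiki.counts", "w", encoding="utf-8") as out:
    for word, count in counts.most_common():
        out.write(f"{word} {count}\n")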

Get top frequent vocabs for SymSpell [Spell-Checker]

SymSpell needs a text file that contains vocabs and their occurrence counts. The fa_wiki.counts file created in the Get the word-count of the corpus section should be trimmed to contain only the 80k most frequent words, dropping those that occur fewer than 25 times.

python get_spellchecker_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_80k.txt 
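
In essence, the trimming does the following. A minimal sketch, assuming fa_wiki.counts holds space-separated "word count" lines:

# read "word count" pairs and keep only the most frequent ones
pairs = []
with open("fa_wiki.counts", encoding="utf-8") as f:
    for line in f:
        word, count = line.split()
        pairs.append((word, int(count)))

pairs.sort(key=lambda p: p[1], reverse=True)

with open("wiki_fa_80k.txt", "w", encoding="utf-8") as out:
    for word, count in pairs[:80000]:
        if count >= 25:  # mirrors --ignore-less 25
            out.write(f"{word} {count}\n")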

SymSpell

SymSpell is a simple spell checker. First, install it from PyPI using the following command:

pip install symspellpy

To use it, instantiate it with the vocab dictionary we created in the Get top frequent vocabs for SymSpell section:

# import symspell
from symspellpy import SymSpell, Verbosity

# instantiate it
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = "wiki_fa_80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# input sample:
input_term = "اهوار"  # misspelling of "اهواز" It's a city name!

# lookup the dictionary
suggestions = sym_spell.lookup(input_term, Verbosity.ALL, max_edit_distance=2)
# display the suggestion term, edit distance, and term frequency
for suggestion in suggestions[:5]:
    print(suggestion)

The output is as follows. As you can see, اهواز is correctly chosen!

اهواز, 1, 4692
ادوار, 1, 1350
الوار, 1, 651
انوار, 1, 305
اهورا, 1, 225
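
Continuing with the same sym_spell instance, SymSpell can also correct a whole phrase at once via lookup_compound. A short example with a made-up misspelled phrase:

# correct a multi-word input in one call
phrase = "کشور ایران شهر اهوار"  # made-up example containing the misspelled city name
for suggestion in sym_spell.lookup_compound(phrase, max_edit_distance=2):
    print(suggestion)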

Get top frequent vocabs for KenLM

Using the following command, the 80K most frequent vocabs are written to wiki_fa_kenlm_vocabs.txt. To make it faster, vocabs with fewer than 25 occurrences are discarded!

python get_kenlm_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_kenlm_vocabs.txt

Train KenLM model

First install the KenLM requirements using the following commands:

sudo apt-get update
sudo apt-get install cmake build-essential libssl-dev libeigen3-dev libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev -y

Then clone and build the C++ modules:

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

If everything goes fine, you can find lmplz and build_binary under the ./kenlm/build/bin directory. Finally, train the KenLM language model using the following bash script:

bash train_kenlm.sh -o 4 -l fa

Note: a binarized model is also created, because it is much faster to load and query than the plain ARPA one.
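
Under the hood, a training script of this kind boils down to two calls. A minimal sketch, assuming the preprocessed corpus filename and assuming the vocab file is passed via lmplz's --limit_vocab_file option:

import subprocess

corpus = "fawiki-latest-pages-articles.txt"  # assumption: your preprocessed corpus

with open(corpus, encoding="utf-8") as f:
    # train a 4-gram model, restricted to the KenLM vocab list, and write the ARPA file
    subprocess.run(
        ["kenlm/build/bin/lmplz", "-o", "4",
         "--limit_vocab_file", "wiki_fa_kenlm_vocabs.txt",
         "--arpa", "fa_wiki.arpa"],
        stdin=f, check=True,
    )

# binarize the ARPA model for much faster loading
subprocess.run(["kenlm/build/bin/build_binary", "fa_wiki.arpa", "fa_wiki.binary"], check=True)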

KenLM inference in Python

Install the KenLM Python package:

pip install https://github.com/kpu/kenlm/archive/master.zip

How to use it:

import kenlm

model = kenlm.Model('fa_wiki.binary')
# scores are log10 probabilities, so higher (less negative) is better;
# the well-formed sentence outscores its shuffled counterpart
print("score: ", model.score('کشور ایران شهر تهران', bos=True, eos=True))
print("score: ", model.score('کشور تهران شهر ایران', bos=True, eos=True))
# score:  -11.683658599853516
# score:  -15.572178840637207

For more examples, check out the following link: https://github.com/kpu/kenlm/blob/master/python/example.py
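
To put the two components together, here is a minimal sketch that picks, for each word, the SymSpell candidate that maximizes the KenLM sentence score. The correct helper below is illustrative, not from the repo:

import kenlm
from symspellpy import SymSpell, Verbosity

# load the artifacts built in the previous sections
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
sym_spell.load_dictionary("wiki_fa_80k.txt", term_index=0, count_index=1)
model = kenlm.Model("fa_wiki.binary")

def correct(sentence: str) -> str:  # hypothetical helper, for illustration only
    words = sentence.split()
    for i, word in enumerate(words):
        suggestions = sym_spell.lookup(word, Verbosity.ALL, max_edit_distance=2)
        if not suggestions:
            continue
        # substitute each candidate and keep the one with the best sentence score
        words[i] = max(
            (s.term for s in suggestions),
            key=lambda t: model.score(" ".join(words[:i] + [t] + words[i + 1:]), bos=True, eos=True),
        )
    return " ".join(words)

print(correct("کشور ایران شهر اهوار"))  # the misspelled اهوار should become اهواز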

References

  1. https://github.com/tiefenauer/wiki-lm
  2. https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67
  3. https://github.com/kpu/kenlm


