This repository contains a simple Python module for doing language classification with some basic algorithms.
It is not meant to be used in any production environment; it is just an example to show how language detection works.
The classes defined in the langdet module implement a few algorithms that perform language detection using cosine similarity.
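As a rough illustration of the idea (this is not the module's actual API; the names below are made up for the example), each language is represented by a bag of character trigrams, and an unknown text is assigned to the language whose profile has the highest cosine similarity with it:

from collections import Counter
from math import sqrt

def trigrams(text):
    # Bag of character trigrams, e.g. 'dog ' -> {'dog': 1, 'og ': 1}
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    # Cosine similarity between two trigram count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy profiles; the real detector builds them from the training dataset
profiles = {
    'en': trigrams('the quick brown fox jumps over the lazy dog'),
    'it': trigrams('la volpe veloce salta sopra il cane pigro'),
}
text = 'the dog jumps over the fox'
print(max(profiles, key=lambda lang: cosine(profiles[lang], trigrams(text))))  # en

The detectors in langdet follow the same idea, building their profiles from the training data via train().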
If you want to use one of these classes in your code, it is probably better to serialize a trained model and use it:
>>> from langdet import TrigramCosineLanguageDetector as LD, stream_sample
>>> ld = LD()
>>> ld.train(stream_sample('datasets/train.txt'))
>>> import pickle
>>> with open('model.ldm', 'wb') as fout:
... pickle.dump(ld, fout)
...
And then load it in your code:
import pickle

with open('model.ldm', 'rb') as fin:  # load the serialized model
    ld = pickle.load(fin)
lang, score = ld.detect('some text to test that it works')
The datasets included in the datasets directory are extracted from Wikipedia abstracts; you can find the tools used to generate them in the tools directory.
To dump the data from Wikipedia, use the dump_wiki.sh script:
$ ./tools/dump_wiki.sh LANG1 LANG2 ...
Where LANGX is a two-letter language code. The script requires curl (installed by default on most *nix and BSD-derived systems such as OS X or Ubuntu); it downloads the abstract dumps and creates the files LANGX-abstract.xml.gz.
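For example, to fetch the English and Italian dumps, producing en-abstract.xml.gz and it-abstract.xml.gz:

$ ./tools/dump_wiki.sh en it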
After you've dumped the abstracts, you can extract the text of each abstract with the wiki_extract_docs.py script and generate two random samples for training and testing the language classifiers using gen_datasets.sh:
$ for lang in de en es fr it pt; do
> echo computing ${lang}...;
> ./tools/wiki_extract_docs.py ${lang}-abstract.xml.gz | ./tools/gen_datasets.sh 10000;
> done
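The pipeline above is all you need to build the datasets. Purely as a sketch of what the random sampling step does, splitting the extracted abstracts into two disjoint samples could look roughly like this in Python (the file names and exact strategy are assumptions, not what gen_datasets.sh actually does):

# Illustrative only: split documents (one per line on stdin) into two
# disjoint random samples; gen_datasets.sh's real logic may differ.
import random
import sys

lines = sys.stdin.read().splitlines()
random.shuffle(lines)
n = 10000  # same sample size passed to gen_datasets.sh above

with open('train.txt', 'w') as train, open('test.txt', 'w') as test:
    train.write('\n'.join(lines[:n]) + '\n')
    test.write('\n'.join(lines[n:2 * n]) + '\n')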