This repository contains a simple Python module for doing language classification with some basic algorithms.
It is not meant to be used in any production environment; it is just an example to show how language detection works.
The classes defined in the langdet module implement a few algorithms that perform language detection using cosine similarity.
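As a rough illustration of the idea (this is not the module's actual API; the names below are made up for the example), each language is represented by a bag of character trigrams, and an unknown text is assigned to the language whose profile has the highest cosine similarity with it:

from collections import Counter
from math import sqrt

def trigrams(text):
    # Bag of character trigrams, e.g. 'dog ' -> {'dog': 1, 'og ': 1}
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    # Cosine similarity between two trigram count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy profiles; the real detector builds them from the training dataset
profiles = {
    'en': trigrams('the quick brown fox jumps over the lazy dog'),
    'it': trigrams('la volpe veloce salta sopra il cane pigro'),
}
text = 'the dog jumps over the fox'
print(max(profiles, key=lambda lang: cosine(profiles[lang], trigrams(text))))  # en

The detectors in langdet follow the same idea, building their profiles from the training data via train().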
If you want to use one of these classes in your code, it is probably better to serialize a trained model and use it:
>>> from langdet import TrigramCosineLanguageDetector as LD, stream_sample
>>> ld = LD()
>>> ld.train(stream_sample('datasets/train.txt'))
>>> import pickle
>>> with open('model.ldm', 'wb') as fout:
... pickle.dump(ld, fout)
...
And then load it in your code:
import pickle

with open('model.ldm', 'rb') as fin:  # load the serialized model
    ld = pickle.load(fin)
lang, score = ld.detect('some text to test that it works')
The datasets included in the datasets directory are extracted from Wikipedia abstracts; you can find the tools used to generate them in the tools directory.
To dump the data from Wikipedia, use the dump_wiki.sh script:
$ ./tools/dump_wiki.sh LANG1 LANG2 ...
Where LANGX is a two-letter language code. The script requires curl (installed by default on most *nix and BSD-derived systems such as OS X or Ubuntu); it downloads the abstract dumps and creates the files LANGX-abstract.xml.gz.
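For example, to fetch the English and Italian dumps, producing en-abstract.xml.gz and it-abstract.xml.gz:

$ ./tools/dump_wiki.sh en it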
After you've dumped the abstracts, you can extract the text of each abstract with the wiki_extract_docs.py script and generate two random samples for training and testing the language classifiers using gen_datasets.sh:
$ for lang in de en es fr it pt; do
> echo computing ${lang}...;
> ./tools/wiki_extract_docs.py ${lang}-abstract.xml.gz | ./tools/gen_datasets.sh 10000;
> done
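The pipeline above is all you need to build the datasets. Purely as a sketch of what the random sampling step does, splitting the extracted abstracts into two disjoint samples could look roughly like this in Python (the file names and exact strategy are assumptions, not what gen_datasets.sh actually does):

# Illustrative only: split documents (one per line on stdin) into two
# disjoint random samples; gen_datasets.sh's real logic may differ.
import random
import sys

lines = sys.stdin.read().splitlines()
random.shuffle(lines)
n = 10000  # same sample size passed to gen_datasets.sh above

with open('train.txt', 'w') as train, open('test.txt', 'w') as test:
    train.write('\n'.join(lines[:n]) + '\n')
    test.write('\n'.join(lines[n:2 * n]) + '\n')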