langdet's Introduction

Simple language classifier

This repository contains a simple Python module for language classification using some basic algorithms.

It is not meant for use in a production environment; it is just an example to show how language detection works.

Classes

The classes defined in the langdet module implement some algorithms that perform language detection using cosine similarity.

If you want to use one of these classes in your code, it is probably better to train a model once, serialize it, and reuse it:
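To make the idea concrete, here is a minimal sketch of cosine similarity over character trigram counts. It is not the code used by the langdet classes; the function names (trigram_profile, cosine_similarity, detect) and the toy training sentences are assumptions made up for illustration.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is similar to nothing
    return dot / (norm_a * norm_b)


def detect(text, profiles):
    """Return (language, score) for the best-matching profile."""
    query = trigram_profile(text)
    return max(
        ((lang, cosine_similarity(query, prof)) for lang, prof in profiles.items()),
        key=lambda pair: pair[1],
    )


# Toy training data, made up for illustration only.
profiles = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "it": trigram_profile("la volpe veloce salta sopra il cane pigro"),
}
lang, score = detect("the dog jumps", profiles)
```

Each language is represented by one vector of trigram counts, and the detected language is the one whose vector points in the direction closest to the query text's vector. A real model would be trained on far more text per language than these single sentences.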

>>> from langdet import TrigramCosineLanguageDetector as LD, stream_sample
>>> ld = LD()
>>> ld.train(stream_sample('datasets/train.txt'))
>>> import pickle
>>> with open('model.ldm', 'wb') as fout:
...     pickle.dump(ld, fout)
... 

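Then load it in your code:

import pickle

# Only unpickle model files you created yourself: pickle can execute arbitrary code.
with open('model.ldm', 'rb') as fin:
    ld = pickle.load(fin)
lang, score = ld.detect('some text to test that it works')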

Datasets

The datasets included in the datasets directory are extracted from Wikipedia abstracts. The tools used to generate them are in the tools directory.

To dump the data from Wikipedia, use the dump_wiki.sh script:

$ ./tools/dump_wiki.sh LANG1 LANG2 ...

Here LANGX is a two-letter language code. The script requires curl (installed by default on most Unix-like systems, such as OS X and Ubuntu). It downloads the abstract dumps and creates the files LANGX-abstract.xml.gz.

After dumping the abstracts, you can extract the text of each abstract with the wiki_extract_docs.py script and draw two random samples, one for training and one for testing the language classifiers, using gen_datasets.sh:

$ for lang in de en es fr it pt; do
>   echo computing ${lang}...;
>   ./tools/wiki_extract_docs.py ${lang}-abstract.xml.gz | ./tools/gen_datasets.sh 10000;
> done
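For readers who want to see the two steps in Python, the following is a rough sketch and not the code of the scripts in the tools directory. It assumes that each abstract in the dump sits in an abstract element nested inside a doc element, and the names iter_abstracts and split_samples are hypothetical.

```python
import gzip
import random
import xml.etree.ElementTree as ET


def iter_abstracts(path):
    """Yield the text of each <abstract> element in a gzipped XML dump.

    Assumes abstracts are nested inside <doc> elements.
    """
    with gzip.open(path, "rb") as fin:
        for _event, elem in ET.iterparse(fin, events=("end",)):
            if elem.tag == "abstract" and elem.text:
                yield elem.text.strip()
            elif elem.tag == "doc":
                elem.clear()  # free memory once a document is finished


def split_samples(items, size, seed=0):
    """Draw two disjoint random samples of `size` items each."""
    pool = list(items)
    if len(pool) < 2 * size:
        raise ValueError("not enough items for two disjoint samples")
    picked = random.Random(seed).sample(pool, 2 * size)
    return picked[:size], picked[size:]
```

Under these assumptions, `split_samples(iter_abstracts('en-abstract.xml.gz'), 10000)` would return a training list and a testing list with no abstract in common. Keeping the two samples disjoint matters because testing a classifier on text it was trained on overstates its accuracy.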

langdet's People

Contributors

duilio
