
language-detector's Introduction

This package aids the detection of the language of a given text.

The end goal is to detect the language of any text, no matter how short or obscure (think messages from Twitter, WhatsApp, Instagram, SMS, etc.), and return an object describing the language that best matches it.

{
  language: 'en',
  country: 'gb'
}

This is achieved with a combination of "reducing" and "matching". Given a piece of text, we first reduce it to a set of candidate languages by checking for common patterns (see src/utils/reducers.js); we then match the n-grams of said text against a set of pre-compiled language profiles generated through "learning" (processing known samples).
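The "matching" half can be pictured as scoring a text's n-grams against per-language frequency profiles. A minimal sketch of that idea (the helper names and profile shape here are invented for illustration, not the library's actual internals):

```javascript
// Build trigram counts for a piece of text.
function trigrams(text) {
  const counts = {}
  const padded = ` ${text.toLowerCase()} `
  for (let i = 0; i <= padded.length - 3; i += 1) {
    const gram = padded.slice(i, i + 3)
    counts[gram] = (counts[gram] || 0) + 1
  }
  return counts
}

// Score a text against one language profile: sum the profile weights
// of every trigram the text and the profile share.
function score(text, profile) {
  const grams = trigrams(text)
  return Object.keys(grams).reduce(
    (total, gram) => total + (profile[gram] || 0) * grams[gram],
    0
  )
}

// Pick the best-scoring language among the remaining candidates.
function match(text, languageProfiles, candidates) {
  return candidates
    .map(language => ({ language, score: score(text, languageProfiles[language]) }))
    .sort((a, b) => b.score - a.score)[0].language
}
```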

Usage

With built-in language detection

const detect = require('@chattylabs/language-detection')
const result = detect('some text to detect')
const language = result.language

With custom language profiles

const detect = require('@chattylabs/language-detection')
const customLanguageProfiles = require('../path/to/data/languageProfiles.json')

const result = detect(text, {
  languageProfiles: customLanguageProfiles
})
const language = result.language

NOTE: only the language profiles you provide will be used. You can additionally merge them with the base profiles:

const combinedProfiles = {
  ...require('@chattylabs/language-detection').languageProfiles,
  ...customLanguageProfiles
}
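Object spread merges by top-level key, with later entries winning, so a custom profile for a language already in the base set replaces that base profile entirely. A self-contained illustration (the profile contents below are made up for demonstration):

```javascript
// Hypothetical base and custom profiles (shapes invented for illustration).
const baseProfiles = { en: { 'the': 118 }, fr: { ' de': 130 } }
const customProfiles = { en: { 'ing': 99 }, de: { 'ein': 140 } }

// Later spread wins on key collision: customProfiles.en replaces baseProfiles.en.
const combinedProfiles = { ...baseProfiles, ...customProfiles }
```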

Generating your own language profiles

You will need to build a "training" script, which analyses all your sample data and generates the language profiles object.

Your sample data should be a set of .txt files, one per locale or language, each containing as much text as possible and as similar as possible to the text you will be detecting. e.g. data/samples/en.txt, data/samples/fr.txt, data/samples/cn.txt or data/samples/en_GB.txt (to include a country identifier, the locale code must use the underscore _ separator).

// bin/train.js
const train = require('@chattylabs/language-detection').train
train('./path/to/custom/samples/*.txt', './path/to/custom/export/languageProfiles.json')

Then execute it via the CLI (node bin/train.js) or via an npm script.
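For example, an npm script entry in package.json (assuming the training script lives at bin/train.js, as above) could look like:

```json
{
  "scripts": {
    "train": "node bin/train.js"
  }
}
```

which you would then run with npm run train.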

NOTE: filenames determine the language; a filename such as en_GB will result in the response splitting it into language and country.

With custom reducers

const detect = require('@chattylabs/language-detection')
const customLanguageProfiles = require('../path/to/data/languageProfiles.json')
const customReducers = require('../path/to/your/reducers')

const result = detect(text, {
  languageProfiles: customLanguageProfiles,
  reducers: customReducers
})
const language = result.language

Writing reducers

Reducers are a collection of objects, each mapping a regex to an array of languages. They reduce the number of languages the n-gram matching needs to run on by finding the intersection of the language sets of the patterns that match.

So for example, imagine we provide the following reducers:

// /path/to/your/reducers.js
module.exports = [
  {
    regex: /[ñ]+/i,
    languages: ['es', 'gn', 'gl']
  },
  {
    regex: /[áéíóú]+/i,
    languages: ['fr', 'es', 'it', 'cn', 'nl', 'fo', 'is', 'pt', 'vi', 'cy', 'el', 'gl']
  }
]

From the above, the text "Alimentación de niño" would be reduced to the languages ['es', 'gl'] (the intersection of the two matching reducers), and n-gram matching would only run on those. If the reducers narrowed the set down to a single language, that would be our result.
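That intersection step could be sketched as follows (a hypothetical helper, not the library's internals):

```javascript
const reducers = [
  {
    regex: /[ñ]+/i,
    languages: ['es', 'gn', 'gl']
  },
  {
    regex: /[áéíóú]+/i,
    languages: ['fr', 'es', 'it', 'cn', 'nl', 'fo', 'is', 'pt', 'vi', 'cy', 'el', 'gl']
  }
]

// Intersect the language lists of every reducer whose regex matches;
// return null when nothing matches, meaning all languages stay in play.
function candidateLanguages(text) {
  const matched = reducers
    .filter(({ regex }) => regex.test(text))
    .map(({ languages }) => languages)
  if (matched.length === 0) return null
  return matched.reduce((acc, langs) => acc.filter(lang => langs.includes(lang)))
}
```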

NOTE: providing your own reducers will override the base ones. If you choose not to use them, but do use your own language profiles, languages not in your profiles will not be taken into account.

You can also combine your own reducers with the base ones:

const combinedReducers = [
  ...require('@chattylabs/language-detection').reducers,
  ...customReducers
]


