pyspeak

The aim is to do rough language classification of a continuous stream of documents, using keyword matching.

Note: This is an old project that has been bumped to use the latest ahocorasick library.

For this we use the pyahocorasick library, which implements Aho-Corasick automata. These are commonly used for fast multi-pattern matching in intrusion detection systems (such as Snort), antivirus software, and many other applications that need fast matching against a predefined set of string keys.
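To make the idea concrete, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is for illustration only: it is not the pyahocorasick API and not the code used in this project, and the names `build_automaton` and `find_keywords` are made up for this example.

```python
from collections import deque


def build_automaton(keywords):
    """Build a keyword trie with failure links (illustrative helper)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # keywords recognised when a state is reached
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Breadth-first pass: depth-1 states keep fail = 0, deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Return (start_index, keyword) for every match, in a single pass."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches
```

The text is scanned once regardless of how many keywords are loaded, which is what makes this approach attractive for large keyword lists.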

The program is given a set of languages and, for each language, a list of keywords. For each document it reports the language of the text (or "I don't know"), based on which of the known keywords it finds. The idea is to support many rules such as "if the text contains rijkswaterstaat, it is Dutch" or "if it contains Deutsche, it is German".

Dependencies

Usage

  • Simple run with default values: python main.py. For each tweet, it writes the detected language, the actual language, and the tweet content to standard output.
  • List all command line options with python main.py -h.

Sample output

$ python main.py
en (en) Madonna's New Song Could Drop Any Second! Be Still, Our Rebel Heart! h
en (en) Warner Bros. to preview The Hobbit: There and Back Again at CinemaCon
...
nl (nl) Ik wil nu heel dicht bij jou zijn<3
nl (nl) Gezond eten is echt moeilijk.
...
en (de) Vielen Dank an unsere Fans für die großartige Atmosphäre und einen
de (de) Coverdownload nur nach Login. Haben manche Verlage Angst davor, dass j 

Command-line configuration

You can override the maximum number of keywords to load with --max_kwords, as well as the minimum keyword length to consider with --min_kword_len.
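As a rough sketch of how these two options could be wired up with argparse: the flag names come from this README, but the default values, the help texts, and the `filter_keywords` helper are assumptions made for illustration and may differ from what main.py actually does.

```python
import argparse


def build_parser():
    """Parser exposing the two documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Rough language classification by keyword matching."
    )
    parser.add_argument("--max_kwords", type=int, default=1000,
                        help="maximum number of keywords to load per language")
    parser.add_argument("--min_kword_len", type=int, default=3,
                        help="minimum keyword length to consider")
    return parser


def filter_keywords(words, max_kwords, min_kword_len):
    """Hypothetical helper: drop short words first, then cap the list."""
    kept = [w for w in words if len(w) >= min_kword_len]
    return kept[:max_kwords]
```

Applying the length filter before the cap is a design choice in this sketch; it means short words do not use up slots in the keyword budget.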

Additional configuration

Check out pyspeak/settings.py. Also see below.

Input data

  • Keywords obtained from University of Leipzig Frequency Lists: nl en de fr.
  • Tweets obtained from https://twitter.com/search using the search terms "lang:nl", "lang:en", etc. The first 20 tweets in each language were selected, so that we "know" which language they are in (vague tweets like "OK" are ignored for now and may be used for testing and experiments later). Documents are assumed to be short for now, so analysing the whole "document" (a tweet in this case) is fine.

Solution

  • Each document has a "score" for each language we are looking up. This score could simply be a count of how many keywords were found in the document, or something that also takes into account that a word was found in language A but not in language B, etc. For now, the maximum score determines the inferred language.
  • Cases where one sentence mixes multiple languages are disregarded: there is one classification result per input text.
  • Makes use of an implementation of the Aho–Corasick string matching algorithm for fast multiple keyword/phrase search in texts. It runs once per keyword set for each incoming document, and the scores are obtained that way.
  • Solution implemented as a single-run script for now.
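The scoring idea above can be sketched as follows. This is a simplified illustration, not the project's code: it counts whole-word matches with plain set lookups instead of running an Aho-Corasick automaton, and the function names and the tiny keyword sets in the usage example are invented for the sketch.

```python
import re
from collections import Counter


def score_document(text, keywords_by_lang):
    """Count, per language, how many tokens of the text are known keywords."""
    tokens = re.findall(r"\w+", text.lower())
    scores = Counter()
    for lang, keywords in keywords_by_lang.items():
        scores[lang] = sum(1 for token in tokens if token in keywords)
    return scores


def classify(text, keywords_by_lang, unknown="I don't know"):
    """Pick the language with the maximum score, or `unknown` if none match."""
    scores = score_document(text, keywords_by_lang)
    if not scores:
        return unknown
    best_lang, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_lang if best_score > 0 else unknown
```

A possible use, with hand-picked keyword sets standing in for the frequency lists:

```python
keywords = {
    "nl": {"ik", "wil", "nu", "heel", "zijn"},
    "en": {"the", "to", "at", "and"},
    "de": {"und", "die", "für"},
}
print(classify("Ik wil nu heel dicht bij jou zijn", keywords))
```

In this sketch, ties go to whichever language comes first in the dictionary; a real scorer would probably need an explicit tie-breaking rule.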
