Gramify

Pre-calculates letter and n-gram frequency data for text corpora, intended to be used for keyboard layout analysis.

Usage: ./target/release/gramify FILE [options]

Options:
    -h, --help          Show usage instructions, then exit
    -o, --output-format json|msgpack
                        Output format
    -i, --input-format json|msgpack|raw
                        Input format
        --letter-threshold NUM
                        Threshold of significance for letters. Letters that
                        appear fewer than NUM times per million will not
                        appear in the output.
        --letter-pattern REGEX
                        Regex pattern for letters. Letters that don't match
                        REGEX will be excluded from output.
        --bigram-threshold NUM
                        Threshold of significance for bigrams. Bigrams that
                        appear fewer than NUM times per million will not
                        appear in the output.
        --bigram-pattern REGEX
                        Regex pattern for bigrams. Bigrams that don't match
                        REGEX will be excluded from output.
        --skipgram-threshold NUM
                        Threshold of significance for skipgrams. Skipgrams
                        that appear fewer than NUM times per million will not
                        appear in the output.
        --skipgram-pattern REGEX
                        Regex pattern for skipgrams. Skipgrams that don't
                        match REGEX will be excluded from output.
        --trigram-threshold NUM
                        Threshold of significance for trigrams. Trigrams that
                        appear fewer than NUM times per million will not
                        appear in the output.
        --trigram-pattern REGEX
                        Regex pattern for trigrams. Trigrams that don't match
                        REGEX will be excluded from output.
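
The threshold options above are described in terms of occurrences per million. As a rough illustration of what such a cutoff means, here is a minimal sketch that assumes raw counts are available and that the rate is computed against the total count of the same n-gram kind. The function name `filter_by_threshold` and that choice of denominator are assumptions made for illustration; Gramify's actual implementation may differ.

```rust
use std::collections::HashMap;

/// Keep only the n-grams whose rate is at least `threshold_per_million`.
/// Hypothetical helper, not part of Gramify's code.
pub fn filter_by_threshold(
    counts: &HashMap<String, u64>,
    threshold_per_million: f64,
) -> HashMap<String, u64> {
    let total: u64 = counts.values().sum();
    let mut kept = HashMap::new();
    if total == 0 {
        return kept; // nothing to filter in an empty corpus
    }
    for (gram, &count) in counts {
        // Scale the raw count to "occurrences per million n-grams".
        let per_million = count as f64 * 1_000_000.0 / total as f64;
        if per_million >= threshold_per_million {
            kept.insert(gram.clone(), count);
        }
    }
    kept
}
```

With a threshold of 100, a bigram seen 10 times in a corpus of one million bigrams would be dropped, while one seen 999,990 times would be kept.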

Terminology

  • Bigram: two consecutive characters
  • Skipgram: two characters separated by one character (this is useful for keyboard layout analysis)
  • Trigram: three consecutive characters
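
The three definitions above can be sketched with sliding windows over the characters of a string. This is only an illustration: the function name `count_ngrams` is hypothetical, and how Gramify itself handles case, whitespace, or line boundaries is not covered here.

```rust
use std::collections::HashMap;

/// Count bigrams, skipgrams, and trigrams in `text`.
/// Returns the three count tables in that order.
/// Hypothetical helper, not part of Gramify's code.
pub fn count_ngrams(
    text: &str,
) -> (
    HashMap<String, u64>,
    HashMap<String, u64>,
    HashMap<String, u64>,
) {
    let chars: Vec<char> = text.chars().collect();
    let mut bigrams = HashMap::new();
    let mut skipgrams = HashMap::new();
    let mut trigrams = HashMap::new();

    // Bigram: two consecutive characters.
    for w in chars.windows(2) {
        *bigrams.entry(w.iter().collect::<String>()).or_insert(0) += 1;
    }

    for w in chars.windows(3) {
        // Trigram: three consecutive characters.
        *trigrams.entry(w.iter().collect::<String>()).or_insert(0) += 1;
        // Skipgram: first and third character, skipping the middle one.
        let skip: String = [w[0], w[2]].iter().collect();
        *skipgrams.entry(skip).or_insert(0) += 1;
    }

    (bigrams, skipgrams, trigrams)
}
```

For the input "abab", this yields the bigrams "ab" (twice) and "ba" (once), the skipgrams "aa" and "bb", and the trigrams "aba" and "bab".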

Format

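Gramify accepts raw, JSON, or MessagePack input and writes JSON or MessagePack output (see --input-format and --output-format above). In the output, the value associated with each n-gram key is that n-gram's relative frequency: its number of occurrences divided by the total number of n-grams in the corpus.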
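
To make the ratio concrete, here is a minimal sketch that turns raw counts into relative frequencies. The function name `to_ratios` is hypothetical, and the sketch assumes the denominator is the sum of all counts in the same table.

```rust
use std::collections::HashMap;

/// Convert raw n-gram counts into ratios that sum to 1.
/// Hypothetical helper, not part of Gramify's code.
pub fn to_ratios(counts: &HashMap<String, u64>) -> HashMap<String, f64> {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return HashMap::new(); // avoid dividing by zero on empty input
    }
    counts
        .iter()
        .map(|(gram, &count)| (gram.clone(), count as f64 / total as f64))
        .collect()
}
```

For example, counts of 3 for "ab" and 1 for "ba" give ratios of 0.75 and 0.25.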

Ready-to-use Corpora

You can find ready-made frequency data in the corpora folder. To keep file sizes down, the data is filtered to "significant" letters and n-grams: for keyboard layout analysis, the long tail of infrequent n-grams barely affects a layout's overall score but takes a long time to evaluate.

  • iweb: From Shai Coleman's sanitized corpus used for Colemak, 527 MB of source text (unfiltered n-gram data also available)
  • xsznix-fb-messages: Personal Facebook messages sent by the author, 15 MB of source text
  • carpalx-books: From Martin Krzywinski's corpus used for Carpalx, 12 MB of source text
  • akl-messages: Messages from the Alt Keyboard Layout Discord server, 9 MB of source text
  • typeracer: Quotes from Typeracer, 2 MB of source text

License

The frequency analyses in the corpora folder are for research and education purposes only.

I don't care what you do with the code in this repo.
