Sally - A Tool for Embedding Strings in Vector Spaces

Introduction

Sally is a small tool for mapping a set of strings to a set of vectors. This mapping is referred to as embedding and makes it possible to apply machine learning and data mining techniques to the analysis of string data. Sally can be applied to several types of strings, such as text documents, DNA sequences or log files, and it reads string data from common sources such as directories, archives and text files.

Sally implements a standard technique for mapping strings to a vector space, which can be referred to as a generalized bag-of-words model. The strings are characterized by a set of features, where each feature is associated with one dimension of the vector space. The following types of features are supported by Sally: bytes, tokens (words), n-grams of bytes and n-grams of tokens.
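As a rough illustration of these feature types, the following Python sketch extracts n-grams of bytes and n-grams of tokens from a string. It is not Sally's implementation: the function names `byte_ngrams` and `token_ngrams` are hypothetical, and tokens are assumed here to be separated by whitespace.

```python
def byte_ngrams(data: bytes, n: int) -> list:
    """Return all contiguous n-grams of bytes in data (empty if data is shorter than n)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def token_ngrams(text: str, n: int) -> list:
    """Split text at whitespace (an assumption) and return all n-grams of tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(byte_ngrams(b"abcd", 2))
print(token_ngrams("the quick brown fox", 2))
```

With n = 1 the two functions reduce to plain bytes and plain tokens, so all four feature types fit the same scheme.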

Sally proceeds by counting the occurrences of the specified features in each string and generating a sparse vector of count values. Alternatively, binary or TF-IDF values can be computed and stored in the vectors. Sally then normalizes the vector, for example using the L1 or L2 norm, and outputs it in a specified format, such as plain text, LibSVM or Matlab format.
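The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Sally's code: the names `embed` and `to_libsvm_line` and the explicit `vocabulary` dictionary are hypothetical, TF-IDF weighting is left out, and the exact text Sally writes may differ from the line produced here.

```python
import math
from collections import Counter


def embed(features, vocabulary, norm="l2", binary=False):
    """Map a list of features to a sparse, normalized vector (dict: index -> value)."""
    counts = Counter(features)
    vector = {}
    for feature, count in counts.items():
        index = vocabulary[feature]          # dimension assigned to this feature
        vector[index] = 1.0 if binary else float(count)

    if norm == "l1":
        length = sum(abs(v) for v in vector.values())
    elif norm == "l2":
        length = math.sqrt(sum(v * v for v in vector.values()))
    else:
        raise ValueError(f"unknown norm: {norm}")

    if length == 0.0:
        return vector                        # empty string: nothing to normalize
    return {i: v / length for i, v in vector.items()}


def to_libsvm_line(label, vector):
    """Format a sparse vector as one line of LibSVM text: 'label index:value ...'."""
    entries = " ".join(f"{i}:{v:.6g}" for i, v in sorted(vector.items()))
    return f"{label} {entries}".rstrip()


vocabulary = {"ab": 1, "bc": 2, "cd": 3}     # hypothetical feature-to-dimension map
vector = embed(["ab", "bc", "ab"], vocabulary, norm="l1")
print(to_libsvm_line(0, vector))
```

Only non-zero dimensions are stored, which keeps the vectors sparse even when the feature space is very large.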

Consult the manual page of Sally for more information.

Dependencies

Debian & Ubuntu Linux

The following packages need to be installed for compiling Sally on Debian and Ubuntu Linux:

gcc 
libz-dev
libconfig8-dev
libarchive-dev 

For bootstrapping Sally from the GIT repository or manipulating the automake/autoconf configuration, the following additional packages are necessary.

automake 
autoconf 
libtool

Mac OS X

Compiling Sally on Mac OS X requires a working installation of Xcode, including gcc. Additionally, the following packages need to be installed via Homebrew:

libconfig   
libarchive (from homebrew-alt) 

OpenBSD

For compiling Sally on OpenBSD the following packages are required. Note that you need to use gmake instead of make for building Sally.

gmake
libconfig
libarchive

For bootstrapping Sally from the GIT repository, the following packages need to be installed additionally:

autoconf
automake
libtool

Compilation & Installation

From the GIT repository, first run

$ ./bootstrap

From the tarball, run

$ ./configure [options]
$ make
$ make check
$ make install

Options for configure

--prefix=PATH           Set directory prefix for installation

This feature enables support for OpenMP in Sally and is still experimental. With it, Sally executes certain parts of the processing in parallel, making use of multi-core architectures where possible.

--enable-md5hash        Enable MD5 as alternative hash

Sally uses a hash function for mapping different features to different dimensions in the vector space. By default, the very efficient Murmur hash is used for this task. In certain critical cases it may be useful to use a cryptographic hash such as MD5 instead.
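The idea of hashing features to dimensions can be sketched in Python as follows. This is an illustration only: it uses MD5 from the standard library because Murmur is not part of it, and the function name `feature_dimension`, the `bits` parameter and the way the digest is truncated are assumptions rather than details taken from Sally.

```python
import hashlib


def feature_dimension(feature: bytes, bits: int = 16) -> int:
    """Map a feature to a dimension in [0, 2**bits) using an MD5 digest."""
    digest = hashlib.md5(feature).digest()        # 16-byte cryptographic digest
    value = int.from_bytes(digest[:8], "big")     # use the first 8 bytes as an integer
    return value & ((1 << bits) - 1)              # keep the lowest `bits` bits


for feature in (b"ab", b"bc", b"ab"):
    print(feature, feature_dimension(feature))
```

The same feature always lands in the same dimension, so no dictionary of features has to be kept. Two different features can still collide in one dimension; a larger number of bits makes this less likely at the cost of a larger vector space.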

Copyright (C) 2010-2015 Konrad Rieck, Christian Wressnegger, Alexander Bikadorov
