stringdist

  • Approximate matching and string distance calculations for R.
  • All distance and matching operations are system- and encoding-independent.

The package offers the following main functions:

  • stringdist computes pairwise distances between two input character vectors (shorter one is recycled)
  • stringdistmatrix computes the distance matrix for one or two vectors
  • stringsim computes a string similarity between 0 and 1, based on stringdist
  • amatch is a fuzzy matching equivalent of R's native match function
  • ain is a fuzzy matching equivalent of R's native %in% operator
  • seq_dist, seq_distmatrix, seq_amatch and seq_ain compute distances between, and match, integer sequences (see also the hashr package).
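
The functions listed above might be called as in the following minimal sketch. It assumes the package is installed; the input strings and variable names are made up for illustration.

```r
library(stringdist)

# Pairwise Levenshtein distances; the shorter argument is recycled.
d <- stringdist("kitten", c("sitting", "mitten"), method = "lv")

# Distance matrix: rows follow the first vector, columns the second.
m <- stringdistmatrix(c("foo", "bar"), c("foo", "baz", "boo"), method = "lv")

# Similarity scaled to [0, 1]; identical strings score 1.
s <- stringsim("kitten", "kitten")

# Fuzzy counterparts of match() and %in%.
# maxDist sets how far apart a string may be and still count as a match.
idx   <- amatch("leia", c("uhura", "leela"), maxDist = 5)
found <- ain("leia", c("uhura", "leela"), maxDist = 5)

# The same idea applied to integer sequences.
seq_d <- seq_dist(c(1L, 2L, 3L), c(1L, 2L, 4L), method = "lv")
```

Here `idx` points at "leela", the closest entry within `maxDist`, and `found` reports whether any such entry exists.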

These functions are built upon C-code that re-implements some common (weighted) string distance functions. Distance functions include:

  • Hamming distance;
  • Levenshtein distance (weighted);
  • Restricted Damerau-Levenshtein distance (weighted, a.k.a. Optimal String Alignment);
  • Full Damerau-Levenshtein distance (weighted);
  • Longest Common Substring distance;
  • Q-gram distance;
  • Cosine distance for q-gram count vectors (= 1 - cosine similarity);
  • Jaccard distance for q-gram count vectors (= 1 - Jaccard similarity);
  • Jaro and Jaro-Winkler distance;
  • Soundex-based string distance.
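
To make the differences between some of these distances concrete, here is a small sketch that selects them through the method argument of stringdist. The example strings are illustrative only.

```r
library(stringdist)

# Hamming distance is defined for equal-length strings only; otherwise Inf.
h_equal   <- stringdist("abc", "abd", method = "hamming")
h_unequal <- stringdist("ab", "abc", method = "hamming")

# Swapping two adjacent characters costs two edits under Levenshtein,
# but a single transposition under Optimal String Alignment.
lv_swap  <- stringdist("ab", "ba", method = "lv")
osa_swap <- stringdist("ab", "ba", method = "osa")

# Restricted vs. full Damerau-Levenshtein: the full version may edit
# a substring again after transposing it (ca -> ac -> abc).
osa_ca <- stringdist("ca", "abc", method = "osa")
dl_ca  <- stringdist("ca", "abc", method = "dl")

# q-gram based distances with q = 1 (single characters).
qg  <- stringdist("abc", "abd", method = "qgram", q = 1)
jac <- stringdist("abc", "abd", method = "jaccard", q = 1)

# Jaro distance: method "jw" with the default prefix weight p = 0.
jaro <- stringdist("martha", "marhta", method = "jw")

# Soundex-based distance: 0 when the phonetic codes coincide.
sx <- stringdist("Robert", "Rupert", method = "soundex")
```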

Also, there are some utility functions:

  • qgrams() tabulates the qgrams in one or more character vectors.
  • seq_qgrams() tabulates the qgrams (sometimes called ngrams) in one or more integer vectors.
  • phonetic() computes phonetic codes of strings (currently only soundex)
  • printable_ascii() is a utility function that detects non-printable ascii or non-ascii characters.
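
A brief sketch of the utility functions, again with made-up inputs:

```r
library(stringdist)

# Tabulate all 2-grams of a string; the columns are named after the q-grams.
tab <- qgrams("hello world", q = 2)

# Soundex codes; similar-sounding names receive the same code.
codes <- phonetic(c("Robert", "Rupert"))

# TRUE for elements made up of printable ASCII characters only.
ok <- printable_ascii(c("abc", "caf\u00e9"))
```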

Installation

To install the latest release from CRAN, open an R terminal and type

install.packages('stringdist')

Beta versions are released through my drat repository. These versions build and pass all current tests on Linux, but the builds have not been tested on all architectures that CRAN supports. Windows users will also need to have Rtools installed.

drat::addRepo("markvanderloo")
install.packages("stringdist")

To obtain the package from the very latest source code, open a bash terminal (or Git Bash if you work under Windows with msysgit) and type

git clone https://github.com/markvanderloo/stringdist.git
cd stringdist
bash ./build.bash
R CMD INSTALL output/stringdist_*.tar.gz

Warning: the GitHub version can change at any time and may not even build properly. As most of the code is written in C, the development version may crash your R session.

Resources

  • A paper on stringdist has been published in The R Journal.
  • Slides of the useR! 2014 conference.

Note to users: arguments deprecated as of versions 0.9.0 and 0.9.2

Parallelization used to be based on R's parallel package, which works by spawning several R sessions in the background. As of version 0.9.0, stringdist uses the more efficient OpenMP interface to parallelize everything under the hood.

The following arguments have become obsolete and will be removed sometime in 2016:

  • Argument cluster for function stringdistmatrix.
  • Argument maxDist for functions stringdist and stringdistmatrix (not amatch).
  • Argument ncores for function stringdistmatrix.
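
The list above does not name a replacement for ncores. As an assumption beyond this text, releases from 0.9.0 onward accept an nthread argument that sets the number of OpenMP threads. The sketch below illustrates that assumption; the result should not depend on the thread count.

```r
library(stringdist)

a <- c("foo", "bar", "baz")

# Same computation with one and with two threads (nthread is assumed here).
m1 <- stringdistmatrix(a, a, nthread = 1)
m2 <- stringdistmatrix(a, a, nthread = 2)
```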
