topsim's Introduction

Featured in ImportPython Issue 171. Thank you so much for the support!

Search for the strings most similar to a query in Python 3. State-of-the-art algorithms and data structures are used for best efficiency.

  • For both flexibility and efficiency, only set-based similarities are supported right now, namely Jaccard and Tversky.

  • For simpler code, some general-purpose functions have been moved into a new library, extratools.

  • TopEmoji is an interesting application of this library that searches for the emojis most similar to the query.

topemoji-cli "baby" -k 5
👶	baby	1.0
👼	baby angel	0.666
🐤	baby chick	0.666
🍼	baby bottle	0.6659
🚼	baby symbol	0.6659

Reference

This library was originally part of the implementation for our research paper.

Preference-driven similarity join.
Chuancong Gao, Jiannan Wang, Jian Pei, Rui Li, Yi Chang.
Proceedings of the International Conference on Web Intelligence, 2017.

Installation

This package is available on PyPI. Just use pip3 install -U TopSim to install it.

CLI Usage

You can use the algorithm directly from the terminal.

Usage:
    topsim-cli <query> [options] [<file>]


Options:
    -I                     Case-sensitive matching.
    -k <k>                 Maximum number of search results. [default: 1]
    --tie                  Include all the results with the same similarity of the "k"-th result. May return more than "k" results.

    -s, --search           Search the query within each line rather than against the whole line, by preferring partial matching of the line.
                           Tversky similarity is used instead of Jaccard similarity.
    -e <e>                 Parameter for "tversky" similarity. [default: 0.001]

    --mapping=<mapping>    Map each string to a set of either "gram"s or "word"s. [default: gram]
    --numgrams=<numgrams>  Number of characters for each gram when mapping by "gram". [default: 2]

    --quiet                Do not print additional information to standard error.
  • The query is matched against each line of the input file (or standard input).
  • Each line and its similarity are separated by a tab character (\t).
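
The --mapping and --numgrams options decide how a string becomes a set before any similarity is computed. A minimal sketch of both mappings follows. The function names and the boundary markers "^" and "$" are hypothetical, chosen for illustration only, and the library's own mapping may differ in detail.

```python
def to_grams(text, n=2, start="^", end="$"):
    """Map a string to its set of character n-grams.

    Assumption: the string is padded with n - 1 boundary markers on each
    side so that prefixes and suffixes produce grams of their own.
    """
    padded = start * (n - 1) + text + end * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def to_words(text):
    """Map a string to its set of whitespace-separated words."""
    return set(text.split())


print(sorted(to_grams("top")))       # ['^t', 'op', 'p$', 'to']
print(len(to_grams("top", n=3)))     # 5
print(sorted(to_words("baby angel")))  # ['angel', 'baby']
```

Because Python strings are indexed by code point, the same slicing works unchanged for Chinese, Japanese, or Korean text.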

API Usage

Alternatively, you can use the algorithm via the Python API.

from topsim import TopSim

ts = TopSim([
    "python2",
    "python2.7",
    "python3",
    "python3.6",
])

print(ts.search("python", k=3)) # Return each similarity and the respective line numbers.
  • Please check topsim.py for more optional parameters, like similarity function, etc.
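
To make the meaning of k and of tie handling concrete, here is a brute-force baseline that scores every line and keeps the best k. It is not the library's algorithm, which avoids scanning every candidate. The names brute_force_search, grams, and jaccard are hypothetical, as are the padded 2-gram mapping and the (similarity, line number) return shape.

```python
import heapq


def grams(text, n=2):
    # Assumed mapping: padded character n-grams (illustrative only).
    padded = "^" * (n - 1) + text + "$" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def brute_force_search(lines, query, k=1, tie=False):
    """Return up to k (similarity, line number) pairs, best first.

    With tie=True, every line scoring at least as high as the k-th
    result is kept, so more than k pairs may be returned.
    """
    q = grams(query)
    scored = [(jaccard(q, grams(line)), i) for i, line in enumerate(lines)]
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    if tie and top:
        cutoff = top[-1][0]
        top = sorted((p for p in scored if p[0] >= cutoff),
                     key=lambda pair: -pair[0])
    return top


lines = ["python2", "python2.7", "python3", "python3.6"]
print(brute_force_search(lines, "python", k=3))
print(brute_force_search(lines, "python", k=3, tie=True))
```

In this data set "python2.7" and "python3.6" score the same, so the second call returns four pairs instead of three.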

Examples

  • Search the most similar line.

ls /usr/bin | topsim-cli "top"

top	1.0
  • Search the three most similar lines.

ls /usr/bin | topsim-cli "top" -k 3

top	1.0
tops	0.5
iotop	0.4286
  • Use Jaccard similarity by default, which puts the same weight on matching the query and matching the line.

ls /usr/bin | topsim-cli "git" -k 5

git	1.0
wait	0.2857
git-shell	0.2727
pluginkit	0.2727
kinit	0.25
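
The Jaccard scores above can be checked by hand. The sketch below assumes that each string is lowercased and mapped to 2-grams with one boundary marker on each side. The markers "^" and "$" and the function names are hypothetical. Under that assumption the numbers match the output above.

```python
def bigrams(text):
    # Assumed mapping: lowercase, then pad with one marker on each side.
    padded = "^" + text.lower() + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


query = bigrams("git")
for line in ["git", "wait", "git-shell", "pluginkit", "kinit"]:
    print(line, round(jaccard(query, bigrams(line)), 4))
# git 1.0
# wait 0.2857
# git-shell 0.2727
# pluginkit 0.2727
# kinit 0.25
```

For example, "git" and "wait" share the grams "it" and "t$" out of 7 distinct grams in total, giving 2/7.
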
  • Use Tversky similarity, which puts most weight on matching the query. Ideal when searching within long lines.

ls /usr/bin | topsim-cli "git" -k 5 -s

git	1.0
git-shell	0.7489
pluginkit	0.7489
git-cvsserver	0.7481
git-upload-pack	0.7478
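
The standard Tversky index shows why long lines are penalized so little here. The sketch below assumes a weight of 1 on grams found only in the query and a small weight on grams found only in the line. How the -e parameter maps onto these weights is not stated in this document, so alpha, beta, and the padded 2-gram mapping are assumptions, and the printed values may differ slightly from the CLI output above.

```python
def bigrams(text):
    # Assumed mapping: lowercase, padded character 2-grams.
    padded = "^" + text.lower() + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def tversky(query, line, alpha=1.0, beta=0.001):
    """Tversky index: common / (common + alpha*query_only + beta*line_only)."""
    common = len(query & line)
    denom = common + alpha * len(query - line) + beta * len(line - query)
    return common / denom if denom else 1.0


query = bigrams("git")
for line in ["git", "git-shell", "git-upload-pack"]:
    grams = bigrams(line)
    # alpha = beta = 1 reduces the Tversky index to Jaccard similarity.
    print(line,
          round(tversky(query, grams), 4),
          round(tversky(query, grams, alpha=1.0, beta=1.0), 4))
```

With a small beta, extra grams in a long line barely lower the score, so the ranking is driven almost entirely by how much of the query is covered.
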
  • For n-gram mapping, a higher n can result in better accuracy but fewer matches.

ls /usr/bin | topsim-cli "git" -k 5 -s --numgrams=3

git	1.0
git-shell	0.5993
git-cvsserver	0.5988
git-upload-pack	0.5986
git-receive-pack	0.5984
  • Full support for Chinese/Japanese/Korean.

cat test

地三鲜
红烧肉
烤全牛
木须肉
土豆炖牛肉

cat test | topsim-cli "牛肉" -k 3 -s

土豆炖牛肉	0.666
红烧肉	0.3332
木须肉	0.3332

Tip

I strongly encourage using PyPy instead of CPython to run the script for best performance.
