simba 🦁

Similarity measures from Babylon Health.

Installation

$ pip install simba

You can also check out this repository and install from the root folder:

$ pip install .

Many of the similarity measures in simba rely on pre-trained embeddings. If you don't already have your own encoding logic, you can register embedding files with simba, provided they are in the standard text format for word vectors (as described here). For example, to use fastText vectors that you've saved to /path/to/fasttext, run

$ simba embs register --name fasttext --path /path/to/fasttext

and simba will recognise them under the name fasttext.
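
For orientation, the text format for word vectors commonly puts one token per line followed by its space-separated components, sometimes preceded by a header line giving the vocabulary size and dimension. The following is a minimal sketch of reading such a file under that assumption. read_text_vectors is a hypothetical helper for illustration only and is not part of simba.

```python
import io

import numpy as np


def read_text_vectors(handle):
    """Read word vectors from a text handle into a dict of NumPy arrays.

    Assumes one "<token> <v1> <v2> ..." entry per line, optionally
    preceded by a "<count> <dim>" header line (as in fastText .vec files).
    """
    vectors = {}
    for i, line in enumerate(handle):
        parts = line.rstrip().split(" ")
        # Skip the optional header; note this heuristic would misread
        # a headerless file of 1-dimensional vectors.
        if i == 0 and len(parts) == 2:
            continue
        vectors[parts[0]] = np.array([float(p) for p in parts[1:]])
    return vectors


# Tiny in-memory example standing in for a real embeddings file
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
vecs = read_text_vectors(sample)
```

With a real file you would pass an open file handle instead of the in-memory sample. Checking that every vector has the same length is a cheap way to catch a malformed file before registering it.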

You can do something similar for frequency files (like these):

$ simba freqs register --name wiki --path /path/to/wiki/counts

Usage

from simba.similarities import dynamax_jaccard
from simba.core import embed

sentences = ('The king has returned', 'Change is good')

# Assuming you've registered fasttext embeddings as described above
x, y = embed([s.split() for s in sentences], embedding='fasttext')
sim = dynamax_jaccard(x, y)

There are more examples, including comparing different similarity metrics on a dataset of pairs, in the examples directory.

Similarity Measures

The simba.similarities module implements the following methods. Please consider citing the corresponding papers in your work if you find them useful.

| Method | Description | Paper |
| --- | --- | --- |
| avg_cosine | Average vector compared with cosine similarity | - |
| batch_avg_pca | Average vector with principal component removal | [1] |
| fbow_jaccard_factory | Factory method for general fuzzy bag-of-words given a universe matrix | [2] |
| max_jaccard | Max-pooled vectors compared with Jaccard coefficient | [2] |
| dynamax_{jaccard, otsuka, dice} | DynaMax using Jaccard, Otsuka-Ochiai, and Dice coefficients | [2] |
| gaussian_correction_{tic, aic} | Takeuchi and Akaike Information Criteria (TIC and AIC) for Gaussian likelihood | [3] |
| spherical_gaussian_correction_{tic, aic} | TIC and AIC for spherical Gaussian likelihood | [3] |
| von_mises_correction_{tic, aic} | TIC and AIC for von Mises Fisher likelihood | [3] |
| avg_{pearson, spearman, kendall} | Average vector compared with Pearson, Spearman, and Kendall correlation | [4] |
| max_spearman | Max-pooled vectors compared with Spearman correlation | [5] |
| cka_factory | Factory method for general Centered Kernel Alignment (CKA) | [5] |
| cka_{linear, gaussian} | CKA with linear and Gaussian kernels | [5] |
| dcorr | CKA with distance kernel (distance correlation) | [5] |
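
To make the two simplest descriptions above concrete, here is a rough NumPy sketch of "average vector compared with cosine similarity" and "max-pooled vectors compared with Jaccard coefficient". It is written from those one-line descriptions, not from simba's source. The function names are hypothetical, each sentence is assumed to be an array of shape (words, dimensions), and clipping negative components to zero before taking the Jaccard coefficient is an assumption made so the values behave like fuzzy memberships.

```python
import numpy as np


def avg_cosine_sketch(x, y):
    """Cosine similarity between the mean word vectors of two sentences."""
    a, b = x.mean(axis=0), y.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def max_jaccard_sketch(x, y):
    """Fuzzy Jaccard coefficient between max-pooled word vectors."""
    # Max-pool over words, then clip at zero (assumption, see lead-in)
    m_x = np.maximum(x.max(axis=0), 0.0)
    m_y = np.maximum(y.max(axis=0), 0.0)
    # Fuzzy intersection over fuzzy union
    return float(np.minimum(m_x, m_y).sum() / np.maximum(m_x, m_y).sum())


# Toy "sentences" of two 2-dimensional word vectors each
x = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])

cos_sim = avg_cosine_sketch(x, y)
jac_sim = max_jaccard_sketch(x, y)
```

Neither function guards against all-zero inputs, which would divide by zero. A production implementation would need to decide how to handle that case.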

Papers:

  1. Arora et al., ICLR 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings
  2. Zhelezniak et al., ICLR 2019. Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors
  3. Vargas et al., ICML 2019. Model Comparison for Semantic Grouping
  4. Zhelezniak et al., NAACL-HLT 2019. Correlation Coefficients and Semantic Textual Similarity
  5. Zhelezniak et al., EMNLP-IJCNLP 2019. Correlations between Word Vector Sets
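
As a further illustration, the DynaMax-Jaccard measure from [2] can be sketched in a few lines of NumPy. The sketch follows the algorithm as described in that paper: the word vectors of both sentences are stacked into a universe matrix, each sentence is max-pooled against it, and the results are compared with the fuzzy Jaccard coefficient. The function names are hypothetical, and this is not necessarily how simba implements dynamax_jaccard.

```python
import numpy as np


def fuzzify(s, u):
    """Max-pooled, non-negative membership of sentence s in universe u."""
    # Dot product of every word in s with every row of the universe
    f_s = s @ u.T
    # Max-pool over the words of the sentence, then clip at zero
    return np.maximum(f_s.max(axis=0), 0.0)


def dynamax_jaccard_sketch(x, y):
    """Fuzzy Jaccard similarity with a universe built from x and y."""
    u = np.vstack((x, y))
    m_x, m_y = fuzzify(x, u), fuzzify(y, u)
    return float(np.minimum(m_x, m_y).sum() / np.maximum(m_x, m_y).sum())


# One-word toy sentences: identical, then orthogonal
same = dynamax_jaccard_sketch(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
orth = dynamax_jaccard_sketch(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```

Because the universe is rebuilt for every pair, the pooled vectors have one component per word in the two sentences rather than one per embedding dimension.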
