This project provides a fast Python implementation of several KNN (K-Nearest Neighbors) similarity algorithms using sparse matrices, useful in collaborative filtering recommender systems and in other domains.
The package also includes some normalization functions that can be useful in the pre-processing phase before the similarity computation.
Base similarity models:
- Dot Product
- Cosine
- Asymmetric Cosine
- Jaccard
- Dice
- Tversky
Graph-based similarity models:
- P3α
- RP3β
Advanced similarity model:
- S-Plus
All models have multi-threaded routines that use Cython and OpenMP to fit the models in parallel across all available CPU cores.
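To make the idea of a KNN similarity model concrete, here is a minimal single-threaded sketch in plain NumPy/SciPy of what a top-k cosine model computes: the cosine similarity between every pair of rows, keeping only the k largest values per row. This is not the library's Cython/OpenMP code, and the function name `cosine_topk` is hypothetical. It builds the full similarity matrix first, so it is only suitable for small inputs.

```python
import numpy as np
import scipy.sparse as sps


def cosine_topk(matrix, k):
    """Illustrative helper (not part of similaripy): for each row of
    `matrix`, keep its k most cosine-similar other rows."""
    m = sps.csr_matrix(matrix, dtype=np.float64)
    # L2 norm of every row; all-zero rows get norm 1 to avoid dividing by zero
    norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    unit = sps.diags(1.0 / norms) @ m
    # full (sparse) row-by-row cosine matrix; fine for small inputs only
    full = sps.csr_matrix(unit @ unit.T)
    rows, cols, vals = [], [], []
    for i in range(full.shape[0]):
        start, end = full.indptr[i], full.indptr[i + 1]
        idx = full.indices[start:end]
        val = full.data[start:end]
        keep = idx != i  # drop the self-similarity on the diagonal
        idx, val = idx[keep], val[keep]
        top = np.argsort(-val)[:k]  # positions of the k largest similarities
        rows.extend([i] * len(top))
        cols.extend(idx[top].tolist())
        vals.extend(val[top].tolist())
    return sps.csr_matrix((vals, (rows, cols)), shape=full.shape)


# item-item model from a small random user-rating matrix
urm = sps.random(100, 200, density=0.05, format="csr")
item_model = cosine_topk(urm.T, k=10)
```

The per-row loop here runs sequentially, which is the part that becomes too slow on large matrices and that a parallel implementation spreads across CPU cores.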
The package contains the following normalization functions: l1, l2, max, tf-idf, bm25 and bm25+.
All the functions are compiled to low-level code and can operate in place on CSR matrices if you need to save memory.
For tf-idf, bm25 and bm25+ you can choose the log base and how the term frequency (TF) and the inverse document frequency (IDF) are computed.
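As an illustration of what operating in place on a CSR matrix means, the sketch below rescales every row to unit L2 norm by writing directly into the matrix's `data` array, so no second copy of the values is allocated. It uses only NumPy/SciPy and is not the library's compiled routine. The function name `l2_normalize_rows_inplace` is hypothetical, and the code assumes a floating-point CSR matrix without duplicate entries.

```python
import numpy as np
import scipy.sparse as sps


def l2_normalize_rows_inplace(csr):
    """Illustrative helper (not part of similaripy): scale each row of a
    float CSR matrix to unit L2 norm, overwriting csr.data."""
    if not (sps.issparse(csr) and csr.format == "csr"):
        raise TypeError("expected a scipy.sparse CSR matrix")
    if not np.issubdtype(csr.data.dtype, np.floating):
        raise TypeError("in-place scaling needs a floating-point matrix")
    n_rows = csr.shape[0]
    # row index of every stored value, derived from the CSR row pointer
    row_ids = np.repeat(np.arange(n_rows), np.diff(csr.indptr))
    # per-row sum of squares (assumes no duplicate entries in the matrix)
    norms = np.sqrt(np.bincount(row_ids, weights=csr.data ** 2, minlength=n_rows))
    norms[norms == 0] = 1.0  # leave empty rows untouched
    data = csr.data
    data /= norms[row_ids]  # in-place division: same buffer, new values
    return csr


X = sps.csr_matrix(np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0], [0.0, 5.0, 12.0]]))
l2_normalize_rows_inplace(X)
```

The trade-off is the usual one for in-place operations: memory is saved, but the original values are lost, so keep a copy if you still need the raw matrix.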
To install:
pip install similaripy
Basic usage:
import similaripy as sim
import scipy.sparse as sps
# create a random user-rating matrix (URM)
urm = sps.random(1000, 2000, density=0.025)
# normalize matrix with bm25
urm = sim.normalization.bm25(urm)
# train the model with 50 knn per item
model = sim.cosine(urm.T, k=50)
# recommend 100 items to users 1, 14 and 8, filtering out the items already seen by each user
user_recommendations = sim.dot_product(urm, model.T, k=100, target_rows=[1,14,8], filter_cols=urm)
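For readers who want to see what the last step amounts to, here is a rough pure NumPy/SciPy sketch of the same idea: score the items for each target user as the product of the user's row with the transposed item model, mask the items the user has already seen, and keep the k best. The helper `recommend_topk` is hypothetical and much slower than the library call above. It only mirrors the logic and makes no claim about how `sim.dot_product` is implemented.

```python
import numpy as np
import scipy.sparse as sps


def recommend_topk(urm, item_model, k, target_rows):
    """Illustrative helper (not part of similaripy): top-k unseen items
    for each user in target_rows, returned as {user: [item, ...]}."""
    urm = sps.csr_matrix(urm, dtype=np.float64)
    item_model = sps.csr_matrix(item_model, dtype=np.float64)
    recommendations = {}
    for user in target_rows:
        user_row = urm[user]
        # score of item i = sum over items j of urm[user, j] * item_model[i, j]
        scores = (user_row @ item_model.T).toarray().ravel()
        # every stored entry of the user's row counts as "already seen"
        scores[user_row.indices] = -np.inf
        best = np.argsort(-scores, kind="stable")[:k]
        # keep only items with a positive score
        recommendations[user] = [int(i) for i in best if scores[i] > 0]
    return recommendations


# tiny example: 2 users, 3 items, and a hand-written item-item model
toy_urm = [[1, 0, 0], [0, 1, 0]]
toy_model = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.5], [0.1, 0.5, 0.0]]
toy_recs = recommend_topk(toy_urm, toy_model, k=2, target_rows=[0, 1])
```
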
Package | Version |
---|---|
numpy | >= 1.14 |
scipy | >= 1.0.0 |
tqdm | >= 4.19.6 |
cython | >= 0.28.1 |
NOTE: Compiling the Cython code requires a GCC compiler with OpenMP support (on OSX it can be installed with Homebrew: brew install gcc).
This library has been tested with Python 3.6 on Ubuntu, OSX and Windows.
(Note: on Windows there are problems with the flag format_output='csr'; leave it at its default value 'coo'.)
I recommend configuring SciPy/NumPy to use Intel's MKL matrix libraries. The easiest way of doing this is by installing the Anaconda Python distribution.
I plan to release some utilities in the near future:
- Utilities for sparse matrices
- New similarity functions ( good ideas are welcome :) )
The idea of building this library comes from the RecSys Challenge 2018 organized by Spotify.
My team, the Creamy Fireflies, had trouble computing very large similarity models in a reasonable time (66 million interactions in the user-rating matrix); plain Python and NumPy were not suitable, since a full day was required to compute a single model.
As a member of the team, I spent many hours developing these high-performance similarities in Cython to overcome the problem. At the end of the competition, encouraged by my teammates, I decided to release my work to help anyone who may one day run into the same problem.
Thanks to my Creamy Fireflies friends for supporting me.
Released under the MIT License
@misc{boglio_simone_similaripy,
author = {Boglio Simone},
title = {bogliosimone/similaripy},
doi = {10.5281/zenodo.2583851},
url = {https://doi.org/10.5281/zenodo.2583851}
}