Giter Club home page Giter Club logo

ldeluigi / mercury-settrie Goto Github PK

View Code? Open in Web Editor NEW

This project forked from bbva/mercury-settrie

0.0 0.0 0.0 3.77 MB

A Python 3 library developed in C++ that enables efficient storage and querying of sets of sets. It can be used to perform fast document search. Uses the Settrie algorithm: https://osebje.famnit.upr.si/~savnik/papers/cdares13.pdf

License: Other

Shell 0.01% C++ 83.77% Python 3.04% C 0.02% Makefile 0.33% Jupyter Notebook 12.60% SWIG 0.23%

mercury-settrie's Introduction

Mercury Settrie

Python 3.8 Python 3.9 Python 3.10 Python 3.11 Apache 2 license Ask Me Anything !

What is this?

TLDR: A SetTrie is a container of sets that performs efficient subset and superset queries.

Settrie is a Python library implemented in C++ to create, update and query SetTrie objects.

  • Highly efficient C++ implementation.
  • 100% pythonic interface: objects are serializable, use iterators and support native Python sets.

What problems is Settrie good for?

Settrie was born from the need of a better implementation of the algorithm for our recommender system. It has direct application to text indexing for search in logarithmic time of documents containing a set of words. Think of a collection of documents as a container of the set of words that each document has. Finding all the documents that contain a set of words is finding the superset of the set of words in the query. A SetTrie will give you the answer --the list of document names-- in logarithmic time. The data structure is also used in auto-complete applications.

What data size can Settrie tackle?

Settrie is a C++ implementation with a Python interface. It is single threaded and can seamlessly operate over really large collections of sets. Note that the main structure is a tree and a tree node is exactly 20 bytes, a billion nodes is 20 Gb, of course plus some structures to store identifiers, etc. Note that the tree is compressing the documents by sharing the common parts and documents are already compressed by considering them a set of words. An of-the-shelf computer can store in RAM a representation of terabytes of documents and query result in much less than typing speed.

How does this implementation compare to other Python implementations?

It is about 200 times faster and 20 times more memory efficient that a pure Python implementation.

Try it without any installation in Google Colab

The API is very easy to use. You can see this benchmark notebook for reference.

  • Benchmark: Comparing two Settrie implementations Open In Colab

Install

pip install mercury-settrie

Usage

from settrie import SetTrie

# Create a SetTrie object
stt = SetTrie()

# Insert some sets
stt.insert({2, 3}, 'id1')
stt.insert({2, 3, 4.4}, 'id2')
stt.insert({'Mon', 'Tue'}, 'days')

# Find id by set
print(stt.find({2, 3}))

# Find ids of all supersets
for id in stt.supersets({2, 3}):
    print(id)

# Find ids of all subsets
for id in stt.subsets({2, 3}):
    print(id)

# Nested iteration over the sets and elements
for st in stt:
    print(st.id)
    for e in st.elements:
        print('  ', e)

# Store as a pickle file file
import pickle
with open('my_settrie.pickle', 'wb') as f:
    pickle.dump(stt, f)

# Load from a pickle file
with open('my_settrie.pickle', 'rb') as f:
    tt = pickle.load(f)

# Check that they are identical
for t, st in zip(tt, stt):
    assert t.id == st.id
    for et, est in zip(t.elements, st.elements):
        assert et == est

# Remove sets by id
stt.remove('id2')
stt.remove('days')

# After many .remove() calls, the tree has nodes marked as dirty,
# calling .purge() removes them completely and frees RAM.
stt.purge()

Clone and set up a development environment to work with it

To work with Settrie command line or develop Settrie, you can set up an environment with git, gcc, make and the following tools:

  • catch2 (Already included in source code)
  • doxygen 1.9.5 or better (to render C++ documentation)
  • mkdocs 1.4.2 or better (to render Python documentation)
  • swig 4.0.2
  • python 3.x with appropriate paths to python.h (see Makefile)
git clone https://github.com/BBVA/mercury-settrie.git
cd mercury-settrie/src

make

Make without arguments gives help. Try all the options. Everything should work assuming the tools are installed.

Documentation

License

                         Apache License
                   Version 2.0, January 2004
                http://www.apache.org/licenses/

     Copyright 2021-23, Banco de Bilbao Vizcaya Argentaria, S.A.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

mercury-settrie's People

Contributors

sbasaldua avatar kaalam avatar rogerpasky avatar ldeluigi avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.