Giter Club home page Giter Club logo

dynamic_ecp's Introduction

Extended Cluster-Pruning

The extended cluster-pruning algorithm is implemented in C++ and provides a python wrapper for usage within the ANN-benchmarking tool to compare with other state of the art ANN algorithms.

Initially proposed as a bachelors project at the IT-University of Copenhagen in spring 2020. This repository continues the work by building a dynamic extension of the algorithm ontop of the original implementation.

This project was an implementation done during my master studies at the IT University of CPH. It is no longer maintained and has therefore been archived.

Project dependencies

gcc, c++, cmake, python3.6, swig, hdf5, docker

How-to's

The following descriptions assumes some sort of Linux distribution even if the eCP source code will compile on e.g. Windows. It is also assumed that ./ is the root of this repository.

How to compile eCP

  1. Install dependencies outlined above.

  2. Afterwards, to configure and compile project load the CMakeLists.txt file in the root dir using your favorite IDE, OR go to ./eCP/scripts and execute:

configure.sh
build.sh

This will build the project in Release mode. To change this see configure.sh script. Currently only GNU/GCC is known to be supported.

How to run eCP in ANN-Benchmarks

Running it manually

  1. Go to ./ann_benchmarks/install/Dockerfile.ecp and change the line describing what repository is cloned when the eCP docker container is build by Ann-Benchmarks (AB). Make it point to the current working branch e.g. git clone -b <branch_name> <url_to_repo> eCP, where 'eCP will be local name of repo dir to help build scripts.

  2. Push your work to the designated branch.

  3. Execute the script ./scipts/ecp_install.sh which will pull down AB repository and copy the needed files into the newly cloned AB repository. Here the Python requirements will be installed and finally an AB install script will be called to setup the eCP algorithm with AB, i.e. build a Docker image containing a compiled executable of the code from the designated branch from earlier.

  4. Goto AB directory and follow the guide to run tests found at https://github.com/erikbern/ann-benchmarks.

Using the provided run script

From within the eCP/scripts folder run the ann-benchmark.sh script, it takes arbitrarily many arguments in a array-style format depending on the amount of experiments the user wants to run. An example configuration could be sudo ./ann-benchmark.sh [abc123,foo,random-xs-20-euclidean,false] [def456,bar,random-xs-20-euclidean,true] corresponding to running the code at commit id abc123 on the random-xs-20-euclidean dataset without early halting and naming it foo whereafter commit id def456 would be run on the same dataset with early halting enabled and under the name bar. Observe that there cannot be any spaces after each comma.

It supports the following flags:

  • -d (--delete) delete the current local ann-benchmarks repository if it exists
  • -l (--local) run ann-benchmarks in local mode (https://github.com/erikbern/ann-benchmarks)

Running local changes

TODO

Documentation

Wrapper interface

SWIG is used to construct the Python-based wrapper around the C++ code. SWIG is handled by CMake which will recompile the wrapper every time the code base is recompiled. This will also happen inside the docker container, when the code is send to the Ann-Benchmarks framework for testing.

The wrapper interface is described in ./eCP/swig/eCP.i and exposes two functions from the C++ source code as outlined below. The interface file is used to describe exactly what part of the C++ code is exposed through the Python API:

eCP_Index(S, L, m)

Builds an index with the provided dataset. Accepts three arguments:

  • A dataset (nested list of data points)
  • Integer determining how many levels the index should have
  • Metric for comparing distance (1 - Angular distance, 0 - Euclidean distance)

query(I, q, k, b)

Queries the given index. Accepts four arguments:

  • Index to be queried
  • Query point
  • Amount of nearest neighbors to return
  • Amount of clusters to search

Python code example of using the wrapper

import eCP_wrapper as e
index = e.eCP_Index([[41, 67, 34, 0, 69, 24],[78, 58, 62, 64, 5, 45],[81, 27, 61, 91, 95, 42],[27, 36, 91, 4, 2, 53],[92, 82, 21, 16, 18, 95],[47, 26, 71, 38, 69, 12],[67, 99, 35, 94, 3, 11],[22, 33, 73, 64, 41, 11], [53, 68, 47, 44, 62, 57], [37, 59, 23, 41, 29, 78], [16, 35, 90, 42, 88, 6], [40, 42, 64, 48, 46, 5], [90, 29, 70, 50, 6, 1], [93, 48, 29, 23, 84, 54], [56, 40, 66, 76, 31, 8]], 2, 1)
result = e.query(index, [5,5,5], 5, 1)

Datasets

A little bit of information about the datasets that have been used for testing.

Dataset Dimensions Train size Test size Neighbors Distance
random-xs-20-euclidean 20 9000 1000 10 Euclidean
random-xs-20-angular 20 9000 1000 10 Angular
random-s-100-euclidean 100 9000 1000 10 Euclidean
MNIST 784 60,000 10,000 100 Euclidean
Fashion-MNIST 784 60,000 10,000 100 Euclidean
SIFT 128 1,000,000 10,000 100 Euclidean
GloVe 25 1,183,514 10,000 100 Angular

dynamic_ecp's People

Contributors

mortenskoett avatar nimertz avatar fremartini avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.