
puck's Introduction

Description

This project is a library for approximate nearest neighbor (ANN) search named Puck. In industrial deployment scenarios, limited memory, expensive compute resources, and ever-growing database sizes matter as much as the recall-vs-latency tradeoff for search applications. With the rapid growth of retrieval services, there is strong demand for high recall at low latency on precious but finite resources; Puck was created precisely to meet this need.

It contains two algorithms, Puck and Tinker. This project is written in C++ with wrappers for python3.
Puck is an efficient approach for large-scale datasets; it achieved the best performance on multiple billion-scale datasets in the NeurIPS'21 competition track, and its performance has since improved by a further 70%. Puck combines a two-layer inverted-index architecture with multi-level quantization of the dataset. If memory is going to be a bottleneck, Puck can solve your problem.
Tinker is an efficient approach for smaller datasets (around 10M to 100M vectors) and outperforms Nmslib in big-ann-benchmarks. Tinker carefully models the relationships among similar points and needs extra memory to store them, so it costs more memory than Puck but delivers better search performance. If you want better search performance and are not concerned about memory usage, Tinker is the better choice.

Introduction

This project supports cosine similarity, L2 (Euclidean) distance, and, with a conversion, IP (inner product). When two vectors are normalized, the squared L2 distance equals 2 - 2 * cos. IP2COS is a transform that converts IP distance to cosine distance. The distance value in search results is always L2.
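The identity above is easy to check numerically. The sketch below uses two arbitrary example vectors (not from the project) and verifies that, once normalized, their squared L2 distance equals 2 - 2 * cos:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

# Squared L2 distance between the unit vectors.
l2_sq = sum((x - y) ** 2 for x, y in zip(a, b))

# Cosine similarity; for unit vectors this is just the inner product.
cos = sum(x * y for x, y in zip(a, b))

# For normalized vectors: ||a - b||^2 = |a|^2 + |b|^2 - 2*a.b = 2 - 2*cos
assert abs(l2_sq - (2 - 2 * cos)) < 1e-9
```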

Puck uses compressed (product-quantized) vectors instead of the original vectors, so by default its memory cost is just over 1/4 that of the original vectors. As the dataset grows, Puck's advantage becomes more obvious.
Tinker must store the relationships among similar points, so by default its memory cost exceeds that of the original vectors (though it remains below Nmslib's). More performance details are in the benchmarks. Please see this readme for more details.
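As a rough illustration of the memory tradeoff, here is a back-of-the-envelope sketch. The numbers are hypothetical and this is not Puck's exact index layout; it simply assumes float32 originals and roughly one byte of quantized code per original dimension, which reproduces the ~1/4 ratio mentioned above:

```python
# Hypothetical dataset: 100M vectors of dimension 128.
dim = 128
num_vectors = 100_000_000

raw_bytes = num_vectors * dim * 4   # float32 originals: 4 bytes per dim
pq_bytes = num_vectors * dim * 1    # assumed ~1 byte of code per dim

print(f"raw:   {raw_bytes / 2**30:.1f} GiB")   # 47.7 GiB
print(f"pq:    {pq_bytes / 2**30:.1f} GiB")    # 11.9 GiB
print(f"ratio: {pq_bytes / raw_bytes:.2f}")    # 0.25
```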

Linux install

1. The prerequisites are MKL, Python, and CMake.

MKL: MKL must be installed to compile Puck. Download the MKL installation package for your operating system from the official website and configure the installation path after installation completes. Then source the MKL environment script, e.g. source ${INSTALL_PATH}/mkl/latest/env/vars.sh. This sets several environment variables, such as MKLROOT.

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-download.html

python: Version higher than 3.6.0.

cmake: Version higher than 3.21.

2.Clone this project.

git clone https://github.com/baidu/puck.git
cd puck

3.Use cmake to build this project.

3.1 Build this project
cmake -DCMAKE_BUILD_TYPE=Release \
    -DMKLROOT=${MKLROOT} \
    -DBLA_VENDOR=Intel10_64lp_seq \
    -DBLA_STATIC=ON  \
    -B build .

cd build && make && make install
3.2 Build with GTEST

Enable the conditional compilation variable WITH_TESTING.

cmake -DCMAKE_BUILD_TYPE=Release \
    -DMKLROOT=${MKLROOT} \
    -DBLA_VENDOR=Intel10_64lp_seq \
    -DBLA_STATIC=ON  \
    -DWITH_TESTING=ON \
    -B build .

cd build && make && make install
3.3 Build with Python

Refer to the Dockerfile

python3 setup.py install 

Output files are saved in the build/output subdirectory by default.

How to use

The output includes demo tools for train, build, and search.
Train and build tools are in the build/output/build_tools subdirectory.
Search demo tools are in the build/output/bin subdirectory.

1. Format the vector dataset for train and build

Vectors are stored raw in little-endian order. In the .fvecs format each vector takes 4 + d * 4 bytes, where d is the dimensionality of the vector: a 4-byte integer d followed by d 4-byte floats.
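A minimal sketch of writing and reading this layout (hypothetical helper names; the file path is just an example) using only the standard library:

```python
import struct

def write_fvecs(path, vectors):
    """Write each vector as int32 d (little-endian) + d float32 values."""
    with open(path, "wb") as f:
        for vec in vectors:
            f.write(struct.pack("<i", len(vec)))
            f.write(struct.pack(f"<{len(vec)}f", *vec))

def read_fvecs(path):
    """Read vectors back: each record occupies 4 + d*4 bytes."""
    vectors = []
    with open(path, "rb") as f:
        while header := f.read(4):
            (d,) = struct.unpack("<i", header)
            vectors.append(list(struct.unpack(f"<{d}f", f.read(4 * d))))
    return vectors

write_fvecs("sample.fvecs", [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(read_fvecs("sample.fvecs"))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```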

2. Train & build

The default train configuration file is "build/output/build_tools/conf/puck_train.conf". The dimensionality of the feature vectors must be set in the train configuration file (feature_dim).

cd output/build_tools
cp YOUR_FEATURE_FILE puck_index/all_data.feat.bin
sh script/puck_train_control.sh -t -b

Index files are saved in the puck_index subdirectory by default.

3. Search

During search, the default index files path is './puck_index'.
For the query file format, refer to the demo.
Search parameters can be modified via a configuration file; refer to the demo.

cd output/
ln -s build_tools/puck_index .
./bin/search_client YOUR_QUERY_FEATURE_FILE RECALL_FILE_NAME --flagfile=conf/puck.conf

Recall results are stored in the file RECALL_FILE_NAME.

More Details

More details for Puck

Benchmark

Please see this readme for details.

This ann-benchmark is forked from the 2021 version of https://github.com/harsha-simhadri/big-ann-benchmarks.

It is run the same way as the upstream benchmark. We added T1-track support for faiss (IVF, IVF-Flat, HNSW), nmslib (HNSW), Puck, and Tinker, and updated algos.yaml for these methods with recommended parameters for 4 datasets (bigann-10M, bigann-100M, deep-10M, deep-100M).

Discussion

Join our QQ group if you are interested in this project.

QQ Group

puck's People

Contributors

jy02414216, nk2014yj


puck's Issues

Recall quality question

Hello, how is Puck's recall on the billion-scale datasets, and for what top-k is it measured?

Filtered search: any plans to open-source it?

Filtered search: supports filter conditions during retrieval, dropping non-matching results already inside the low-level index scan. This avoids the truncation problem common when merging multi-channel recall results and better supports combined retrieval (not yet open-sourced).

Are there plans to open-source filtered search later?

MKL cannot be found in the Docker environment

The error is as follows:
-- The C compiler identification is GNU 7.5.0
-- The CXX compiler identification is GNU 7.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
CMake Error at /home/app/cmake/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find MKL (missing: MKL_LIBRARIES)
Call Stack (most recent call first):
/home/app/cmake/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
cmake/FindMKL.cmake:360 (find_package_handle_standard_args)
CMakeLists.txt:28 (find_package)

-- Configuring incomplete, errors occurred!
See also "/home/app/puck/build/CMakeFiles/CMakeOutput.log".
See also "/home/app/puck/build/CMakeFiles/CMakeError.log".

However, when I run
find / -name mkl, MKL can be found.
The results are:
/opt/intel/oneapi/mkl
/opt/intel/oneapi/mkl/2023.2.0/lib/cmake/mkl
/opt/intel/oneapi/mkl/2023.2.0/modulefiles/mkl
/opt/intel/oneapi/mkl/2023.2.0/include/oneapi/mkl
/usr/include/boost/numeric/odeint/external/mkl

How can this problem be solved?

A series of questions about running the ann-benchmarks tests for Puck

I recently tried running the ann-benchmarks tests with Puck and hit the following problems:

  • Before building the Dockerfile, pip install of the requirements frequently fails with "could not find a version that satisfies the requirement". I am not sure whether this is a version issue; I skipped the packages that failed to install, which may cause the later problems.

  • The Dockerfile also installs the requirements and hit the same "could not find a version" errors. The image built successfully only after I commented out the following lines in the Dockerfile:

RUN pip3 config set global.index-url http://pip.baidu.com/root/baidu/+simple/
RUN pip3 config set global.index http://pip.baidu.com/root/baidu/
RUN pip3 config set global.trusted-host pip.baidu.com

I am not sure whether this is due to my institution's network restrictions or a broken mirror link.

  • After downloading the dataset with the create-database script, an assert fails. The code is around line 315 of ann-benchmarks/benchmark/datasets.py: an assert followed by an assignment.

  • When running run.py to benchmark via the image, a permission problem appears. It did not crash the program.

  • Finally the program crashes at a logging call, reporting that a dict was passed where a real number was expected.

Also, to change the number of search threads I needed to modify CPU_LIMIT, but I could not find it in the parameter files, so I edited the code directly. In short, I was unable to run the benchmark end to end.

about puck

Is Puck a two-layer VQ, i.e., a simple two-level k-means, with vectors then retrieved via the cluster centroids?

Question about running ann-benchmark

After successfully building the Docker image, running the deep-100M benchmark produces a stat error:

E0218 09:37:10.278347     1 hierarchical_cluster_index.cpp:641] model file data/deep- 
100M.C3000_F3000_FN16_Flat.puckindex/index.dat stat error

followed by a file-not-found error:

E0218 09:37:10.278366     1 py_api_wrapper.cpp:97] load index Faild
Traceback (most recent call last):
  File "run_algorithm.py", line 3, in <module>
    run_from_cmdline()
  File "/home/app/benchmark/runner.py", line 245, in run_from_cmdline
    args.private_query)
  File "/home/app/benchmark/runner.py", line 105, in run
    algo.fit(dataset)
  File "/home/app/benchmark/algorithms/puck_inmem.py", line 79, in fit
    for xblock in ds.get_dataset_iterator(bs=add_part):
  File "/home/app/benchmark/datasets.py", line 332, in get_dataset_iterator
    x = xbin_mmap(filename, dtype=self.dtype, maxn=self.nb)
  File "/home/app/benchmark/datasets.py", line 96, in xbin_mmap
    n, d = map(int, np.fromfile(fname, dtype="uint32", count=2))
FileNotFoundError: [Errno 2] No such file or directory: 'data/deep1b/base.1B.fbin.crop_nb_100000000'

Strangely, the file does exist in the corresponding directory:

/puck/ann-benchmarks> ls data/deep1b/base.1B.fbin.crop_nb_100000000
data/deep1b/base.1B.fbin.crop_nb_100000000

Also, could you publish the deep-100M numbers (or the data behind the charts on the benchmark page)? Right now there are only line charts, which makes detailed comparison impossible.

CMAKE_LIBRARY_OUTPUT_DIRECTORY is empty

While building the project I found that CMAKE_LIBRARY_OUTPUT_DIRECTORY is empty. Which path should this variable point to, or can I set it to anything?

IDE or CMake with many errors? Please refine the readme

Error

-- Found Python3: /usr/bin/python3.8 (found suitable version "3.8.10", minimum required is "3.6") found components: Interpreter Development Development.Module Development.Embed
-- Found Python: 3.8.10
-- site-packages: /usr/lib/python3/dist-packages
CMake Error at /usr/local/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find SWIG (missing: SWIG_EXECUTABLE SWIG_DIR)
Call Stack (most recent call first):
  /usr/local/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:600 (_FPHSA_FAILURE_MESSAGE)
  /usr/local/share/cmake-3.27/Modules/FindSWIG.cmake:153 (find_package_handle_standard_args)
  pyapi_wrapper/CMakeLists.txt:3 (find_package)


-- Configuring incomplete, errors occurred!

Solution
sudo apt-get install swig
