This is a C++ implementation of the "Topkapi" algorithm from the following work:
Ankush Mandal, Cary Jiang, Anshumali Shrivastava, and Vivek Sarkar. Topkapi: Parallel and Fast Sketches for Finding Top-K Frequent Elements. In Neural Information Processing Systems (NIPS), Montreal, Canada, 2018.
This implementation finds the Top-K most frequent words in text data. It supports multi-threaded execution using OpenMP and distributed computation using MPI.
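To give a rough feel for the kind of data structure involved, here is a minimal, single-threaded sketch. It is not the code in this repository. It assumes a grid of cells in which each cell keeps one candidate word and a counter updated with a majority-vote (Frequent-style) rule. The class name `TopkSketch`, the salted `std::hash` per row, and the "largest counter wins" ranking are all illustrative assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified sketch: rows x buckets cells, each holding one
// candidate word and a counter. Not the repository's implementation.
class TopkSketch {
public:
    TopkSketch(std::size_t rows, std::size_t buckets)
        : rows_(rows), buckets_(buckets), cells_(rows * buckets) {}

    // Majority-vote update in one cell per row: adopt the word if the cell
    // is empty, increment on a match, decrement on a mismatch.
    void add(const std::string& word) {
        for (std::size_t r = 0; r < rows_; ++r) {
            Cell& c = cells_[r * buckets_ + index(word, r)];
            if (c.count == 0) { c.key = word; c.count = 1; }
            else if (c.key == word) { ++c.count; }
            else { --c.count; }
        }
    }

    // Cell-wise merge of two sketches with identical shape and hashing:
    // matching candidates add up, otherwise the larger counter survives
    // and is reduced by the smaller one.
    void merge(const TopkSketch& other) {
        assert(rows_ == other.rows_ && buckets_ == other.buckets_);
        for (std::size_t i = 0; i < cells_.size(); ++i) {
            Cell& a = cells_[i];
            const Cell& b = other.cells_[i];
            if (a.key == b.key) { a.count += b.count; }
            else if (a.count >= b.count) { a.count -= b.count; }
            else { a.key = b.key; a.count = b.count - a.count; }
        }
    }

    // Rank candidates by the largest counter seen for them in any row
    // (a simplification chosen for this example).
    std::vector<std::pair<std::string, long>> top_k(std::size_t k) const {
        std::unordered_map<std::string, long> best;
        for (const Cell& c : cells_) {
            if (c.count > 0) {
                long& slot = best[c.key];
                slot = std::max(slot, c.count);
            }
        }
        std::vector<std::pair<std::string, long>> out(best.begin(), best.end());
        std::sort(out.begin(), out.end(), [](const auto& a, const auto& b) {
            return a.second != b.second ? a.second > b.second : a.first < b.first;
        });
        if (out.size() > k) out.resize(k);
        return out;
    }

private:
    struct Cell { std::string key; long count = 0; };

    // Per-row bucket index: the word is salted with the row number
    // (an illustrative choice of hash family).
    std::size_t index(const std::string& word, std::size_t row) const {
        return std::hash<std::string>{}(word + '#' + std::to_string(row)) % buckets_;
    }

    std::size_t rows_;
    std::size_t buckets_;
    std::vector<Cell> cells_;
};
```

Because `merge` combines two sketches cell by cell, one sketch could in principle be built per thread or per process and the results combined afterwards. That is the property that makes this style of summary a natural fit for OpenMP and MPI. The counters are approximate, so reported frequencies may undercount the true values.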
Requirements:
- GNU Compiler Collection (GCC)
- OpenMPI
- Download the code from GitHub:
git clone https://github.com/ankushmandal/topkapi.git
- Go to the src directory and create the executable using the make command
- For details on usage, type ./topkapi -h or ./topkapi --help
- A sample Slurm script for running the program on a cluster is given in the sample_slurm_script directory under src
Instructions on preprocessing any text data are given in the utils directory. The experiments in the paper were carried out using two data sets:
- Gutenberg dataset from Project Gutenberg. Useful instructions on downloading the data set can be found at Nico's Blog.
- Puma datasets under the "Wikipedia" section from here.