Big Data Made Easy

A list of frameworks, libraries, resources, and shiny things, inspired by the awesome-... lists. The most frequently used or well-known items are not listed here; for those, refer to the awesome series: Awesome Big Data by Onur Akpolat and The Big-Data Ecosystem Table by Andrea Mostosi.

Projects

### Storage Design and Data Structures

  • Db-readings - Readings in Databases .
  • Bitvector - A C++ container-like data structure for storing a vector of bits with fast appending on both sides and fast insertion in the middle, all in succinct space .
  • BitSliceIndex - Experiments on bit-slice indexing .
  • RoaringBitmap - Roaring Bitmap .
  • Cpp-btree - C++ in-memory containers based on a B-tree data structure.
  • Graphillion - Fast, lightweight graphset operation library .
  • Emphf - An efficient external-memory algorithm for the construction of minimal perfect hash functions .
  • Splay Map - STL map implemented with splay tree .
  • Cedar - C++ implementation of efficiently-updatable double-array trie .
  • WikiSort - Fast and stable sort algorithm that uses O(1) memory. Public domain .
  • Annoy - Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk .
  • Expgram - An ngram toolkit with succinct storage .
  • Cuckoofilter - A Bloom filter replacement for approximated set-membership queries .
  • PackedArray - Random access array of tightly packed unsigned integers .
  • FFBF - Feed-forward Bloom filters .
  • Concurrent Trees - C++ implementation of concurrent Binary Search Trees .
  • Concurrent B-Tree - A working project for High-concurrency B-tree source code in C .
  • Block-graph - A succinct implementation of a block-graph data structure .
  • RePair-WaveletTree-Graph - Graph Implementation with repair bitmap compressed WaveletTree .
  • RLZ - Contains the RLZ compression and self-index source code .
  • Serangequerying - Space-Efficient Structures for Range Querying .
  • Succinct - Experimentation with various succinct data structures. Combines previous doc-counter and wavelet-tree repos .
  • Sdsl-lite - Succinct Data Structure Library 2.0 .
  • Relative-FMIndex - Relative FM-index which is smaller but slower than plain FMIndex.
  • GCSA - Generalized Compressed Suffix Array.
  • Succinct - A collection of succinct data structures .
  • Rmq - Implementations of LCA and RMQ data structures from "The LCA Problem Revisited" .
  • YuNomi - Compressed Array Library .
  • DACs - Directly Addressable Codes (DACs) are a variable-length encoding scheme for integers that enables direct access to any element of the encoded sequence while keeping the representation compact .
  • Cpi00 - The compressed permuterm index .
  • Smbt - Succinct Multibit Tree for similarity search .
  • Gwt - Graph-indexing wavelet tree for graph similarity search .
  • Webgraphs - Fast and Compact Web Graph Representations .
  • Erika-trie - Succinct trie library .
  • Path_decomposed_tries - Implementation of the data structures described in the paper "Fast Compressed Tries using Path Decomposition" .
  • Sumire-tries - A variety of succinct tries .
  • Trie4j - (Succinct) trie implementation in Java .
  • SuDS - Succinct Data Structures (SuDS) www.cs.helsinki.fi .
  • Marisa-trie - Marisa succinct trie .
  • LibCDS - Compact Data Structures Library .
  • HSDS - Succinct Data Structure Library Collection. Includes bit-vector/wavelet-matrix/trie .
  • BWTIL - BWT Text Indexing Library: a set of tools to work with BWT-based text indexes .
  • Hip-hyperloglog - C++ implementation of an approximate distinct counter by HIP estimator on HyperLogLog .
  • Gonzalo Navarro - Publications of Gonzalo Navarro .
  • Kvtx - Transaction over CAS, see https://docs.google.com/open?id=0B04zCRiCIQGGZDcyNTEwZGQtODk4Yy00NjEwLWI1MjQtYjc3NzJhN2RlNzk0 .
  • Fatcache - Memcache on SSD .
  • WiredTiger - WiredTiger's source tree http://source.wiredtiger.com/ .
  • FD-Tree - FD-Tree: a Tree Index on Solid State Drives .
  • Silo - Multicore in-memory storage engine .
  • MemC3 - An in-memory key-value cache based on concurrent cuckoo hashing.
  • Libart - Adaptive Radix Trees implemented in C .
  • Masstree - Masstree, a fast, multi-core key-value store .
  • NVMKV - NVM key-value store API library repository. http://opennvm.github.io/nvmkv-documents/ .
  • HYRISE - In-Memory Hybrid Storage Engine .
  • HyPer - A hybrid online transactional processing (OLTP) and online analytical processing (OLAP) high-performance main memory database system that is optimized for modern hardware .
  • NoVoHT - NoVoHT: a Lightweight Dynamic Persistent NoSQL Key/Value Store on NVRAM .
  • HERD - A Highly Efficient key-value system for RDMA .
  • Cayley - An open-source graph database .
  • Forestdb - A Fast Key-Value Storage Engine Based on Hierarchical B+-Tree Trie .
  • Mdbm - A very fast memory-mapped key/value store by Yahoo .
  • Nldb - Nanolat Database supporting 1M transactions per second .
  • FOEDUS - Transactional fast optimistic engine optimized for a large number of CPU cores and NVRAM storage (or fast SSD) .
  • Weaver - A scalable, fast, consistent graph store http://weaver.systems .
  • FastBit_UDF - MySQL UDF for creating, manipulating and querying FastBit indexes .
  • Jump Consistent Hash - A Go implementation of the jump consistent hash .
  • Content Defined Chunking - High Performance Content Defined Chunking .
  • SSD optimizations - Optimizing SSD random IOPS: noop/tpps scheduler, rotational=0, add_random=0 .
  • Article-SSD - Coding for SSDs - What every programmer should know about solid-state drives .
  • Article-Key-Value - Implementing a Key-Value Store .
  • Article-MVCC - Implementation of MVCC Transactions for Key-Value Stores .
  • Article-SSD - Solid-state revolution: in-depth on how SSDs really work .
  • Dexter - Dexter database research group .
  • Streaminer - A collection of algorithms for mining data streams http://mayconbordin.github.io/streaminer/ .
  • Article-Art of Approximating - The Art of Approximating Distributions: Histograms and Quantiles at Scale .
  • Article-Sketch of the Day - Sketch of the Day: Frugal Streaming .
  • Article-Sketch of the Day - Sketch of the Day: K-Minimum Values .
  • Article-Sketch of the Day - Sketch of the Day: K-Minimum Values: Sketching Error, Hash Functions, and You .
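
To give a feel for one of the smaller entries above, here is a minimal Python sketch of jump consistent hashing. It is a transcription of the commonly published form of the algorithm, not the code of the Go project listed; the function name and argument names are illustrative assumptions.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map an integer key to a bucket in [0, num_buckets)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    key &= 0xFFFFFFFFFFFFFFFF  # treat the key as an unsigned 64-bit value
    bucket, jump = -1, 0
    while jump < num_buckets:
        bucket = jump
        # 64-bit linear congruential step, used as a cheap pseudo-random generator
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        # Jump ahead to the next bucket count at which this key would move
        jump = int((bucket + 1) * ((1 << 31) / ((key >> 33) + 1)))
    return bucket


if __name__ == "__main__":
    # Growing from 10 to 11 buckets: each key either stays put or moves to bucket 10
    for k in range(5):
        print(k, jump_consistent_hash(k, 10), jump_consistent_hash(k, 11))
```

The appeal of the design is that it needs no lookup table: adding a bucket only relocates keys into the new bucket, at the cost of buckets having to be numbered 0..n-1.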

### Distributed System

  • Pequod - A distributed key-value cache with builtin materialized views, see "Easy Freshness with Pequod Cache Joins" .
  • Crate - CRATE: Your Elastic Data Store .
  • Elliptics - Distributed hashtable storage .
  • Mcrouter - Mcrouter is a memcached protocol router for scaling memcached deployments .
  • Codis - Yet another fast distributed solution for Redis .
  • zBase - A high-performance, elastic, distributed key-value store .
  • Dynomite - A generic dynamo implementation for different k-v storage engines .
  • AsterixDB - Full-function BDMS (Big Data Management System) .
  • RAMCloud - A new class of storage for large-scale datacenter applications. It is a key-value store that keeps all data in DRAM at all times .
  • Cockroach - A Scalable, Geo-Replicated, Transactional Datastore .
  • Seaweed-FS - A simple and highly scalable distributed file system with two objectives: to store billions of files and to serve the files fast .
  • InfiniSQL - InfiniSQL is the database for always on, rapid growth applications that need to collect and analyze in real time--even for complex transactions .
  • Druid - Real-time Exploratory Analytics on Large Datasets http://druid.io .
  • Wasp - A megastore-like system http://alibaba.github.io/wasp/ .
  • Haeinsa - Linearly scalable multi-row, multi-table transaction library for HBase based on Percolator .
  • Yrmcds - Memcached compatible KVS with master/slave replication. http://cybozu.github.io/yrmcds/ .
  • 3levelmemcache - Memcache improvements by Data.com .
  • Vitess - Vitess provides servers and tools which facilitate scaling of MySQL databases for large scale web services .
  • Replicant - A system for maintaining replicated state machines .
  • Skipgraph - Implementation of skipgraph on messagepack-rpc .
  • Kylin - BigQuery based on Hadoop .
  • Cubert - A fast and efficient batch computation engine for complex analysis and reporting of massive datasets on Hadoop .
  • REEF - The Retainable Evaluator Execution Framework .
  • Phat - An implementation of the Chubby lock service protocol in Msgpack RPC .
  • Hydra - A distributed data processing and storage system originally developed at AddThis .
  • Hystrix - A latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable .
  • Nativetask - A high performance C++ API & runtime for Hadoop MapReduce .
  • Taskgraph - A fault tolerant, distributed task driven framework written in Go.
  • Summingbird - Streaming MapReduce with Scalding and Storm https://twitter.com/summingbird .
  • Hustle - A column oriented, embarrassingly distributed relational event database .
  • Embulk - A plugin-based parallel bulk data loader that makes painful data integration work relaxed .
  • Chronos - Chronos: A Replacement for Cron, see http://nerds.airbnb.com/introducing-chronos/ .
  • Tyrant - Go job scheduler based on Mesos.
  • Cocaine - An open-source PaaS (platform as a service) system for creating custom cloud hosting apps from Yandex .
  • Weave - The Docker Network .
  • Course-CS6452 - Datacenter Networks and Services .
  • Article-Service Discovery - Open-Source Service Discovery - Jason Wilder's Blog jasonwilder.com .
  • Article-Replication and Latency consistency tradeoff - Replication and the latency-consistency tradeoff .

### Concurrency

  • Concurrent Queue - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11 .
  • CAF - An Open Source Implementation of the Actor Model in C++ .
  • TAMER - C++ extensions for readable event-driven programming .
  • C++React - A reactive programming library for C++11 .
  • Libslock - Cross-platform atomic operations and lock algorithm library http://lpd.epfl.ch/site/ssync .
  • CDS - Header only C++ Concurrent Data Structures library .
  • Libcds - A C++ template library of lock-free and fine-grained algorithms .
  • Locksmith - A library for debugging locking in C, C++, or Objective C programs .
  • Concurrency-concepts - A guide to concurrency, multi-threading and parallel programming concepts. Explains the differences between every concept, their advantages and disadvantages in detail .
  • Concurrency Kit - Concurrency primitives, safe memory reclamation mechanisms and non-blocking data structures for the research, design and implementation of high performance concurrent systems .
  • Nanahan - An implementation of Hopscotch hashing for single thread .
  • Scalex - Code snippets for the workshop on concurrent data structure implementation .
  • CBB - Provides a set of concurrent building blocks (Java & C/C++) that can be used to develop parallel/multi-threaded applications .
  • Thrust - A parallel algorithms library which resembles the C++ Standard Template Library (STL) .
  • Varon-t - A C implementation of Disruptor queues http://varon-t.readthedocs.org/ .
  • disruptor-- - Disruptor concurrency pattern in C++ .
  • Lockfree Queue - Lock-free Condition Wait for Lock-free Multi-producer Multi-consumer Queue, see http://natsys-lab.blogspot.ru/2013/08/lock-free-condition-wait-for-lock-free.html .
  • Ssmem - A simple object-based memory allocator with epoch-based garbage collection; see the publication "Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures" .
  • CLHT - A very fast and scalable (lock-based and lock-free) hash table that uses cache-line sized buckets .
  • Comsat - Comsat lets your application enjoy the scalability of asynchronous web-frameworks, serving many thousands of concurrent long-lived connections, or issuing hundreds of web-service calls for each request, all while maintaining the simple “thread per request” model .
  • Article-TM - Transactional Memory: History and Development .

### Compression

### System Performance And Profiling

### Search Engine and Information Retrieval

  • SF1R - A distributed massive data engine for enterprise/vertical search written in C++ .
  • Partitioned_elias_fano - Code used for the experiments in the paper "Partitioned Elias-Fano Indexes" .
  • Data Structures for Inverted Indexes - Optimal Space-Time Tradeoffs for Inverted Indexes .
  • Surf - SUccinct Retrieval Framework .
  • FastPFor - Fast integer compression .
  • Simdcomp - A simple C library for compressing lists of integers .
  • SIMDCompressionAndIntersection - A C++ library to compress and intersect sorted lists of integers using SIMD instructions .
  • TurboPFor - Fastest Integer Compression .
  • Pos-cmp - Comparison framework for positional inverted indexes and self-index supporting phrase queries .
  • MaskedVByte - SIMD-accelerated VByte Compression, Publication "Vectorized VByte Decoding" .
  • Wavelet - Information Retrieval based on Wavelet Tree .
  • Shuffla - Search engine using kd-tree .
  • RoSA - Large-Scale Pattern Search Using Reduced-Space On-Disk Suffix Arrays .
  • Dualsorted - Dual sorted inverted index based on Wavelet Tree .
  • Treap - Faster and Smaller Inverted Indices with Treaps .
  • Gigablast - A distributed open source search engine and spider written in C/C++ for Linux .
  • Libface - Fastest auto-complete in the east .
  • SIMD-Based-Posting-lists - Implementation of Alexander A. Stepanov inverted Index Compression algorithms .
  • Groonga - Open-source fulltext search engine and column store .
  • Pastec - An open source index and search engine for image recognition .
  • Enterprise-search - An open source search engine for corporate data and websites. http://www.searchdaimon.com/ .
  • Verticut - Image search engine on Infiniband .
  • Atire - A search engine built using the most effective recent research techniques discovered by Information Retrieval researchers around the world .
  • Mg4j - Academic search engine with a succinct design (e.g. quasi-succinct indices) .
  • Argos - A structural data search engine .
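
Several entries above (FastPFor, Simdcomp, MaskedVByte, TurboPFor) deal with compressing sorted integer lists such as posting lists. As a rough illustration of the baseline idea, here is a scalar Python sketch of delta encoding followed by VByte. The byte layout (7 data bits per byte, high bit set when more bytes follow) is one common convention and an assumption here; the listed libraries use SIMD and may use different layouts, and the function names are hypothetical.

```python
from typing import Iterable, List


def vbyte_encode(sorted_ids: Iterable[int]) -> bytes:
    """Delta-encode a sorted list of non-negative integers, then VByte each gap."""
    out = bytearray()
    prev = 0
    for doc_id in sorted_ids:
        gap = doc_id - prev
        if gap < 0:
            raise ValueError("input must be non-negative and sorted in non-decreasing order")
        prev = doc_id
        # Emit 7 bits at a time, lowest bits first; high bit marks "more bytes follow"
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Inverse of vbyte_encode: rebuild the original sorted list."""
    ids: List[int] = []
    prev = gap = shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte: keep accumulating this gap
        else:
            prev += gap         # last byte of the gap: undo the delta
            ids.append(prev)
            gap = shift = 0
    return ids


if __name__ == "__main__":
    postings = [3, 7, 130, 131, 100000]
    packed = vbyte_encode(postings)
    print(len(packed), vbyte_decode(packed) == postings)
```

Small gaps between neighbouring IDs fit in a single byte, which is why delta encoding is usually applied before the byte-level code.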

### Large Scale Machine Learning

  • LASER - A Scalable Response Prediction Platform For Online Advertising .
  • Parameter Server - A distributed machine learning framework. http://parameterserver.org .
  • Petuum - A distributed machine learning framework implementing parameter server model .
  • Paracel - Parameter server by Douban Inc .
  • H2O - Fastest in-memory platform for machine learning and predictive analytics on big data .
  • Oryx - Simple real-time large-scale machine learning infrastructure implementing Lambda Architecture .
  • Admm_Allreduce - ADMM optimizer on Apache Hadoop with allReduce .
  • Hivemall - Scalable machine learning library for Hive/Hadoop .
  • Ml-ease - ADMM based large scale logistic regression .
  • douban_pGBRT - Parallel GBRT from Douban Inc .
  • Parlearn - Parallel SGD implementation .
  • Xgboost - eXtreme Gradient Boosting (Tree) Library .
  • AcroMUSASHI Stream-ML - Machine Learning Library .
  • DIMSUM - All-pairs similarity via DIMSUM .
  • StreamSVM - StreamSVM is the fastest implementation for learning linear SVMs on large datasets that cannot fit in your computer's memory .
  • Distributed-liblinear - Libraries for Large-scale Linear Classification on Distributed Environments .
  • SparkADMM - ADMM implementation on Spark Cluster .
  • NOMAD - Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix completion .
  • Stream-ml - Streaming SGD inspired by http://blog.smola.org/post/977927287/parallel-stochastic-gradient-descent .
  • LIBPMF - A Library for Large-scale Parallel Matrix Factorization .
  • LIBMF - A Matrix-factorization Library for Recommender Systems .
  • KnittingBoar - Parallel Iterative Algorithm (SGD) on Hadoop's YARN framework .
  • Trident-ml - Trident-ML: a real-time online machine learning library .
  • Mlpack - A scalable C++ machine learning library .
  • LASSO - A parallel regression model learning system based on MRML.
  • Jubatus - Distributed Online Machine Learning Framework .
  • Vowpal_Wabbit - A fast online learning algorithm http://hunch.net/~vw/ .
  • DeepDist - Lightning-Fast Deep Learning on Spark via parallel stochastic gradient updates (compared with MLlib) .
  • DMLC - Distributed (Deep) Machine Learning Common .
  • SINGA - A General Distributed Deep Learning Platform .
  • BIDMach - CPU and GPU-accelerated Machine Learning Library in Scala .
  • Spark-Multiboost - An implementation of a multi-class/multi-label classifier trained using AdaBoost.MH on Apache Spark .
