mkmh's Introduction

mkmh

Make kmers, minimizers, hashes, and MinHash sketches (with multiple k), and compare them.

Usage

To use mkmh functions in your code:

  1. Include the header file in your code
    #include "mkmh.hpp"
  2. Compile the library:
    cd mkmh && make lib
  3. Make sure the library and header are on the compiler's library and include search paths (e.g. in your makefile):
    g++ -o my_code my_code.cpp -L/path/to/mkmh -I/path/to/mkmh -lmkmh
  4. That's it!

Available functionality

Convenience functions:
- Reverse complement a string
- Reverse a string
- Capitalize the characters of a string
- Check if a string contains only canonical DNA letters ("A", "a", "C", "c", "T", "t", "G", "g")
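
As a rough illustration of two of the convenience functions listed above, here is a minimal standalone sketch of reverse complementing and canonical-letter checking. The function names are hypothetical and are not taken from mkmh.hpp; mapping non-ACGT characters to 'N' is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helpers for illustration only; not mkmh's actual API.

// Complement a single base, preserving case.
// Assumption: anything outside ACGT/acgt becomes 'N'.
inline char complement_base(char c) {
    switch (c) {
        case 'A': return 'T'; case 'a': return 't';
        case 'C': return 'G'; case 'c': return 'g';
        case 'G': return 'C'; case 'g': return 'c';
        case 'T': return 'A'; case 't': return 'a';
        default:  return 'N';
    }
}

// Reverse the sequence, then complement every base.
inline std::string reverse_complement_sketch(const std::string& seq) {
    std::string out(seq.rbegin(), seq.rend());
    std::transform(out.begin(), out.end(), out.begin(), complement_base);
    return out;
}

// True if the string contains only A, C, G, T in either case.
inline bool is_canonical_dna(const std::string& seq) {
    return seq.find_first_not_of("ACGTacgt") == std::string::npos;
}
```

For example, `reverse_complement_sketch("AACG")` yields `"CGTT"`, and `is_canonical_dna("ACGN")` is false.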

Substrings and transforms:
- Get the forward shingles of a string
- Get the kmers of size k of a string
- Get the kmers of a string for multiple values of k
- Get the (w, k) minimizers of a string
- Calculate the 64-bit hashes of the kmers of a string (with either single or multiple k values)
- Get the MinHash sketch of a string (from either single or multiple k values), using either the top s hashes or the bottom s hashes.
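
To make the kmer, hash, and sketch steps above concrete, here is a minimal standalone sketch of kmer extraction and a bottom-s MinHash. It is not mkmh's implementation: the function names are hypothetical, FNV-1a stands in for the MurmurHash3 function that mkmh uses, and removing duplicate hashes before truncating is an assumption. Reverse-complement canonicalization of kmers is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in hash (64-bit FNV-1a). mkmh itself uses MurmurHash3.
inline std::uint64_t fnv1a_64(const char* data, std::size_t len) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// All overlapping substrings of length k, in order of occurrence.
// Returns an empty vector if k == 0 or k > seq.size().
inline std::vector<std::string> kmers_of(const std::string& seq, std::size_t k) {
    std::vector<std::string> out;
    if (k == 0 || k > seq.size()) return out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        out.push_back(seq.substr(i, k));
    }
    return out;
}

// Bottom-s sketch: the s smallest distinct kmer hashes, sorted ascending.
inline std::vector<std::uint64_t> bottom_sketch(const std::string& seq,
                                                std::size_t k, std::size_t s) {
    std::vector<std::uint64_t> hashes;
    for (const std::string& km : kmers_of(seq, k)) {
        hashes.push_back(fnv1a_64(km.data(), km.size()));
    }
    std::sort(hashes.begin(), hashes.end());
    hashes.erase(std::unique(hashes.begin(), hashes.end()), hashes.end());
    if (hashes.size() > s) hashes.resize(s);
    return hashes;
}
```

A top-s sketch would keep the largest s hashes instead, and a multi-k sketch could pool the hashes from several values of k before sorting.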

Compare sets of shingles / kmers / minimizers / hashes:
- Take the union of two sets of kmers or hashes.
- Take the intersection of two sets of kmers or hashes.
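
One way to picture the set operations above, assuming the hashes are held in sorted, duplicate-free vectors, is to lean on the standard library's sorted-range algorithms. The names below are hypothetical, not mkmh's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

using hash_vec = std::vector<std::uint64_t>;

// Both inputs must be sorted ascending and free of duplicates.
inline hash_vec hash_union(const hash_vec& a, const hash_vec& b) {
    hash_vec out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// Both inputs must be sorted ascending and free of duplicates.
inline hash_vec hash_intersection(const hash_vec& a, const hash_vec& b) {
    hash_vec out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

Both functions run in time linear in the combined input size, which is why keeping sketches sorted is convenient.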

Fun extras:
- Given a string and a set of query strings, sort the queries in order of percent similarity.
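
The ranking described above could be sketched as follows. This is an illustration only: the names are hypothetical, and "percent similarity" is assumed here to mean the percentage of a query's distinct kmers that also occur in the reference string, which may differ from the definition mkmh uses.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Distinct kmers of length k (empty if k == 0 or k > seq.size()).
inline std::set<std::string> kmer_set(const std::string& seq, std::size_t k) {
    std::set<std::string> out;
    if (k == 0 || k > seq.size()) return out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) out.insert(seq.substr(i, k));
    return out;
}

// Assumed definition: percent of the query's distinct kmers found in ref.
inline double percent_shared(const std::string& ref, const std::string& query,
                             std::size_t k) {
    const std::set<std::string> r = kmer_set(ref, k);
    const std::set<std::string> q = kmer_set(query, k);
    if (q.empty()) return 0.0;
    std::size_t shared = 0;
    for (const std::string& km : q) shared += r.count(km);
    return 100.0 * static_cast<double>(shared) / static_cast<double>(q.size());
}

// Queries paired with their score, most similar first; ties keep input order.
inline std::vector<std::pair<std::string, double>>
rank_queries(const std::string& ref, const std::vector<std::string>& queries,
             std::size_t k) {
    std::vector<std::pair<std::string, double>> scored;
    for (const std::string& q : queries) {
        scored.emplace_back(q, percent_shared(ref, q, k));
    }
    std::stable_sort(scored.begin(), scored.end(),
                     [](const std::pair<std::string, double>& a,
                        const std::pair<std::string, double>& b) {
                         return a.second > b.second;
                     });
    return scored;
}
```

With the reference "ACGTACGT", k = 3, and the queries "TTTTTT", "ACGTAC", and "ACGTTT", the ranked order is "ACGTAC" (100%), "ACGTTT" (50%), "TTTTTT" (0%).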

Getting help

Please reach out through GitHub by posting an issue (even if it's just feedback). Email is acceptable as a secondary medium.

mkmh's People

Contributors

andreaguarracino, edawson

mkmh's Issues

Scope creep = messy code + no tests

Unfortunately, RKMH has created a bunch of new requirements since the initial bits of this library came into existence. There is a lot of duplicated functionality and some spaghetti coding (functions call functions that call functions...). The total test coverage has also dropped from ~80% to roughly 8%. I think it's useful to have both string and char* implementations, so I'd like to have everything duplicated to some degree, but I'd also like to clean things up:

  1. Maintain a single hash function wrapper - right now there are a bunch of places in the code that MurmurHash3_x64_128 gets called, and there should really be only one to ensure consistency.
  2. New requirement: set hash seed from CLI
  3. New requirement: set hash bits (32, 64) from CLI.
  4. New requirement: allow setting the alphabet (which means extending to amino acids).
  5. Remove top_minhash and bottom_minhash helper functions, because they are silly convenience functions that are actually less convenient.
  6. Increase test coverage toward 100%, perhaps using catch.hpp

Common API for hashing functions

Writing this down here so I can remember it one day, when I've submitted my thesis...

Right now we have an API for hashing kmers (calc_hash(**)) and reads (calc_hashes(**)). However, these have fixed hash functions (currently MurmurHash64). Ideally, we could drop in new hash functions using a common API rather than the spaghetti code / infinite-if-elif-elses the design currently uses.

I'm imagining a function that takes a string, a start position, an end position, and a std::function that wraps a given hash function with the necessary parameters. The signature might look something like:

mkmh::hash_t calc_hash(char* seq, const size_t& start, const size_t& end, std::function<mkmh::hash_t(char*, int, int, int)>& hashfunc)

where hashfunc is a wrapper around a hash function that might look like:

mkmh::hash_t murmurhash64(const char* seq, int offset, int end, int seed);

It's a little strange passing a nearly identical function to another one, but it means we could use multiple hash functions within mkmh without modifying the API (or even adding to the source code).
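
A minimal sketch of how this proposal might look in practice is below. Everything in it is an assumption for illustration: the `sketch` namespace, the two stand-in hash functions (a seeded FNV-1a and a simple polynomial hash, neither of which is MurmurHash), and the choice to pass the seed to `calc_hash` explicitly, since the signature above does not say where the seed comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>

namespace sketch {  // hypothetical namespace, not mkmh's

using hash_t = std::uint64_t;
using hash_fn = std::function<hash_t(const char* seq, int offset, int end, int seed)>;

// Stand-in wrapper #1: seeded 64-bit FNV-1a over seq[offset, end).
inline hash_t fnv1a_wrapper(const char* seq, int offset, int end, int seed) {
    hash_t h = 14695981039346656037ULL ^
               static_cast<hash_t>(static_cast<std::uint32_t>(seed));
    for (int i = offset; i < end; ++i) {
        h ^= static_cast<unsigned char>(seq[i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// Stand-in wrapper #2: simple polynomial hash, to show a drop-in swap.
inline hash_t poly_wrapper(const char* seq, int offset, int end, int seed) {
    hash_t h = static_cast<hash_t>(static_cast<std::uint32_t>(seed));
    for (int i = offset; i < end; ++i) {
        h = h * 31 + static_cast<unsigned char>(seq[i]);
    }
    return h;
}

// Common entry point: the caller chooses the hash function at the call site.
// Assumption: the seed is passed through explicitly.
inline hash_t calc_hash(const char* seq, std::size_t start, std::size_t end,
                        int seed, const hash_fn& hashfunc) {
    return hashfunc(seq, static_cast<int>(start), static_cast<int>(end), seed);
}

}  // namespace sketch
```

Swapping hash functions is then a one-argument change, e.g. `sketch::calc_hash(seq, 0, 4, 42, sketch::poly_wrapper)` in place of `sketch::fnv1a_wrapper`. A lambda that captures the seed would be another way to keep the original four-argument `calc_hash` signature.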

Refactor utility functions out of library

I've made a lot of improvements to pliib, which holds a lot of the string utility functions used in mkmh. At some point, I'd like to refactor mkmh to rely on pliib, rather than carrying multiple versions of the same functions around. This will mean users will have to install pliib (which is also a single-header lib), but it should make the process of updating string manipulation functions a lot easier.

Incorporate xxHash and ntHash as optional hashing algorithms

MurmurHash, while relatively fast / dispersive / backwards compatible with Mash, is slower than some newer algorithms. Moving to xxHash should yield a ~2X speed improvement in the hashing portions of the code, and ntHash should go even faster.
