slowikj / seqr Goto Github PK

View Code? Open in Web Editor NEW

16.0 3.0 1.0 1.76 MB

fast and comprehensive k-mer counting package

Home Page: https://slowikj.github.io/seqR/

R 40.56% C++ 56.78% CMake 1.01% Shell 0.03% M4 1.56% C 0.06%

k-mer-counting kmer kmer-counting kmer-frequency-count kmers ngrams ngram rcpp rcppparallel rpackage

seqr's Introduction

seqR - fast and comprehensive k-mer counting package

About

seqR is an R package for fast k-mer counting. It provides

highly optimized (the core algorithm is written in C++)
in-memory
probabilistic (with configurable dimensionality of a hash value used for storing k-mers internally),
multi-threaded (with a configurable size of the batch of sequences (batch_size) to process in a single step. If batch_size equals 1, the multi-threaded mode is disabled, which potentially causes a longer computation time)

implementation that supports

various variants of k-mers (contiguous, gapped, and positional counterparts)
all biological sequences (e.g., nucleic acids and proteins)

Moreover, the result optimizes memory consumption by the application of sparse matrices (see package Matrix), compatible with machine learning packages such as ranger and xgboost.

How to…

How to install

To install seqR from CRAN:

install.packages("seqR")

Alternatively, if you want to use the latest development version:

# install.packages("devtools")
devtools::install_github("slowikj/seqR")

How to use

The package provides two functions that facilitate k-mer counting

count_kmers (used for counting k-mers of one type)
count_multimers (a wrapper of count_kmers, used for counting k-mers of many types in a single invocation of the function)

and one function used for custom processing of k-mer matrices:

rbind_columnwise (a helper function used for merging several k-mer matrices that do not have same sets of columns)

To learn more, see features overview vignette and reference.

Examples

counting 5-mers

count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
            k=5)
#> 2 x 12 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 12 column names 'A.A.A.A.A_0.0.0.0', 'A.V.V.A.V_0.0.0.0', 'V.V.A.V.F_0.0.0.0' ... ]]
#>                             
#> [1,] 1 1 1 1 1 1 1 . . . . .
#> [2,] . . . . . . . 1 1 1 2 1

counting gapped 5-mers with gaps (0, 1, 0, 2) (XX_XX__X)

count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
            kmer_gaps=c(0, 1, 0, 2))
#> 2 x 7 sparse Matrix of class "dgCMatrix"
#>      A.A.A.A.A_0.1.0.2 A.A.V.V.F_0.1.0.2 A.A.V.A.F_0.1.0.2 A.A.A.V.V_0.1.0.2
#> [1,]                 1                 1                 1                 1
#> [2,]                 .                 .                 .                 .
#>      G.S.D.F.A_0.1.0.2 F.G.A.D.S_0.1.0.2 D.F.S.A.G_0.1.0.2
#> [1,]                 .                 .                 .
#> [2,]                 1                 1                 1

counting 1-mers and 2-mers

data(CsgA)

CsgA[1L:2]
#> $`sp|P28307|CSGA_ECOLI Major curlin subunit OS=Escherichia coli (strain K12) OX=83333 GN=csgA PE=1 SV=3`
#>   [1] "M" "K" "L" "L" "K" "V" "A" "A" "I" "A" "A" "I" "V" "F" "S" "G" "S" "A"
#>  [19] "L" "A" "G" "V" "V" "P" "Q" "Y" "G" "G" "G" "G" "N" "H" "G" "G" "G" "G"
#>  [37] "N" "N" "S" "G" "P" "N" "S" "E" "L" "N" "I" "Y" "Q" "Y" "G" "G" "G" "N"
#>  [55] "S" "A" "L" "A" "L" "Q" "T" "D" "A" "R" "N" "S" "D" "L" "T" "I" "T" "Q"
#>  [73] "H" "G" "G" "G" "N" "G" "A" "D" "V" "G" "Q" "G" "S" "D" "D" "S" "S" "I"
#>  [91] "D" "L" "T" "Q" "R" "G" "F" "G" "N" "S" "A" "T" "L" "D" "Q" "W" "N" "G"
#> [109] "K" "N" "S" "E" "M" "T" "V" "K" "Q" "F" "G" "G" "G" "N" "G" "A" "A" "V"
#> [127] "D" "Q" "T" "A" "S" "N" "S" "S" "V" "N" "V" "T" "Q" "V" "G" "F" "G" "N"
#> [145] "N" "A" "T" "A" "H" "Q" "Y"
#> 
#> $`sp|P0A1E7|CSGA_SALEN Major curlin subunit OS=Salmonella enteritidis OX=149539 GN=csgA PE=1 SV=1`
#>   [1] "M" "K" "L" "L" "K" "V" "A" "A" "F" "A" "A" "I" "V" "V" "S" "G" "S" "A"
#>  [19] "L" "A" "G" "V" "V" "P" "Q" "W" "G" "G" "G" "G" "N" "H" "N" "G" "G" "G"
#>  [37] "N" "S" "S" "G" "P" "D" "S" "T" "L" "S" "I" "Y" "Q" "Y" "G" "S" "A" "N"
#>  [55] "A" "A" "L" "A" "L" "Q" "S" "D" "A" "R" "K" "S" "E" "T" "T" "I" "T" "Q"
#>  [73] "S" "G" "Y" "G" "N" "G" "A" "D" "V" "G" "Q" "G" "A" "D" "N" "S" "T" "I"
#>  [91] "E" "L" "T" "Q" "N" "G" "F" "R" "N" "N" "A" "T" "I" "D" "Q" "W" "N" "A"
#> [109] "K" "N" "S" "D" "I" "T" "V" "G" "Q" "Y" "G" "G" "N" "N" "A" "A" "L" "V"
#> [127] "N" "Q" "T" "A" "S" "D" "S" "S" "V" "M" "V" "R" "Q" "V" "G" "F" "G" "N"
#> [145] "N" "A" "T" "A" "N" "Q" "Y"

count_multimers(sequences=CsgA,
                k_vector = c(1, 2))
#> 5 x 144 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 144 column names 'R', 'L', 'Y' ... ]]
#>                                                                                
#> [1,] 2 9 4 1 8 5 10 2 2 4 16 29 4 11  9 2 16 3 14 1 2 1 1 1 1 2 1 1 1 1 1 3 1 2
#> [2,] 3 8 5 2 7 6 11 2 2 4 17 22 3 11 10 2 20 1 15 1 3 1 . . 3 1 1 1 2 5 2 3 . 2
#> [3,] 3 8 5 2 7 6 11 2 2 4 17 22 3 11 10 2 20 1 15 1 3 1 . . 3 1 1 1 2 5 2 3 . 2
#> [4,] 2 9 4 1 9 5 10 2 1 4 15 30 4 11  9 2 17 4 13 1 2 1 1 1 1 2 1 1 1 1 1 3 1 2
#> [5,] 3 8 5 2 7 6 11 2 2 4 17 22 3 11 10 2 20 1 15 1 3 1 . . 3 1 1 1 2 5 2 3 . 2
#>                                                                                
#> [1,] 1 3 1 3 1 2 7 1 1 1 1 3 1 1 2 2 1 1 12 3 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2
#> [2,] . 2 . 2 1 2 3 . 1 2 . 3 1 . 3 2 2 1  6 4 2 3 1 2 1 . 1 . 1 1 . 1 . 1 1 2 .
#> [3,] . 2 . 2 1 2 3 . 1 2 . 3 1 . 3 2 2 1  6 4 2 3 1 2 1 . 1 . 1 1 . 1 . 1 1 2 .
#> [4,] 1 3 1 3 1 2 6 1 1 2 1 3 2 1 2 2 1 . 13 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
#> [5,] . 2 . 2 1 2 3 . 1 2 . 3 1 . 3 2 2 1  6 4 2 3 1 2 1 . 1 . 1 1 . 1 . 1 1 2 .
#>                                                                               
#> [1,] 1 1 1 1 1 1 2 7 1 1 2 1 1 1 2 1 1 1 1 1 1 2 2 3 1 1 1 1 1 1 1 2 2 1 1 3 1
#> [2,] . 1 . 1 1 1 1 5 . . 2 . . 1 1 1 . . . 1 1 2 1 4 1 1 1 1 . 1 . 2 3 1 1 1 .
#> [3,] . 1 . 1 1 1 1 5 . . 2 . . 1 1 1 . . . 1 1 2 1 4 1 1 1 1 . 1 . 2 3 1 1 1 .
#> [4,] 1 1 . 1 1 1 1 7 1 1 2 1 1 1 2 1 . 1 1 1 1 2 2 3 1 . 1 1 1 1 1 1 2 1 1 3 1
#> [5,] . 1 . 1 1 1 1 5 . . 2 . . 1 1 1 . . . 1 1 2 1 4 1 1 1 1 . 1 . 2 3 1 1 1 .
#>                                                                             
#> [1,] 1 1 2 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#> [2,] . 1 . 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 . . . .
#> [3,] . 1 . 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 . . . .
#> [4,] 1 1 2 1 2 . . . . . . . . . . . . . 1 . . . . . . . . . . . . . 1 1 1 1
#> [5,] . 1 . 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 . . . .

How to cite

For citation type:

citation("seqR")

or use:

Jadwiga Słowik and Michał Burdukiewicz (2021). seqR: fast and comprehensive k-mer counting package. R package version 1.0.0.

Benchmarks

The seqR package has been compared with other existing k-mer counting R packages: biogram, kmer, seqinr, and biostrings.

All benchmark experiments have been performed using Intel Core i7-6700HQ 2.60GHz 8 cores, using the microbenchmark R package.

Contiguous k-mers

Changing k

The input consists of one DNA sequence of length 3 000.

Changing the number of sequences

Each DNA sequence has 3 000 elements, contiguous 5-mer counting.

Gapped k-mers

Changing the first contiguous part of a k-mer

The input consists of one DNA sequence of length 1 000 000. Gapped 5-mers counting with base gaps (1, 0, 0, 1).

Changing the first gap size

The input consists of one DNA sequence of length 100 000. Gapped 5-mers counting with base gaps (1, 0, 0, 1).

seqr's People

Contributors

Stargazers

Watchers

Forkers

minghao2016

seqr's Issues

Improve generic iterator implementation in the dictionary module

In particular, check whethether any redundant copy is performed
and why LinearListDictionary is not compatible with AlphabetEncoding.

Handle the case when gapped k-mer is longer than a sequence

Add user interruption check

In order to stop computation when a user pressed stop.

Set an optimized dictionary for each Alphabet Encoding depending on the number of elements

Depending on a specific input alphabet structure, a different dictionary may turn out to be more efficient.

For example, taking into account tidysq case, a simple array would be better because the alphabet is encoded using small integers.

Secondly, for alphabet with relatively small number of elements, using a simple linear list dictionary would turn out to be more efficient than a more complex hash map.

Memory not mapped error

An error occurs when the garbage collector is invoked at the same time as the computation of the algorithm - it releases the memory that is going to be used soon.

It turns out that if we change methods which allocate Rcpp objects and do not return them back to R, everything works as expected.

The goal: C_unpack_raws should return a pure C++ data structure instead of Rcpp structure.

Add specialized versions of input wrappers for sequential (not parallel) mode

If the parallel mode is not enabled, there is no need to create c++ objects

Make columns in a result matrix compliant with biogram output

Add a suffix that determines a k-mer type

Tidysq: make sure that rowGetter create no copy of Rcpp::RawVector

Use a data structure to store common function params

Use a data structure to store common function params,
for example, named Rcpp::List.
Prepare this structure on the R side.
In order to add a new param easier.

Count only k-mers that are in the provided input k-mer set

Add parallel_mode param

In order to determine whether the computations should be done in parallel or sequential mode.

std::unordered_map is thought not to be perfectly optimized data structure`.
It is possible to make it better in some cases.
a very valuable cpp conference talk: https://www.youtube.com/watch?v=M2fKMP47slQ

For example: https://github.com/sparsehash/sparsehash

Change the algorithm for positional k-mer space

An integer representing a position of a k-mer in a sequence tends to be larger than relatively small P constants used in hash function formula. Therefore, it is not recommended to use it during the hashing of a sequence.

How to fix it?
There are two solutions:

Use a large P
Different positional k-mer hashing approach - use (d + 1)-dimensional representations of k-mer (one extra integer indicates not transformed k-mer position)
Change the dictionary structure and use a specialized version for positional variant; for example, 2-level approach: the first level is the position, the second level -- the hash

Add window slide size param

Add window slide size param in order to count k-mers using a fixed, configurable window slide step. Currently, after each step, the window moves by one position.

C++ tests

In order to prevent exposing methods that are not necessary for a user and currently they are exposed just because of testing.

Check out: https://github.com/mi2-warsaw/FSelectorRcpp

Process sequences in batches

Therefore, unpacking all tidysq sequences at once would be unnecessary => memory optimization

Add verbose param

In particular, let a user know about the current progress of batch computations.

Reduce the package size

Fix the following issue:

installed size is 15.4Mb
    sub-directories of 1Mb or more:
      libs        13.2Mb
      thirdparty   1.3Mb

Extract hash type

Add a max_threads_num_param

Add a new param max_threads_num_param to the algorithm method.

Performance of std::string

Perhaps, an additional helper buffer for k-mer formatting would improve performance and prevent from the heap lock in multithreaded k-mer string generation.

https://www.drdobbs.com/tools/memory-constraints-on-thread-performance/231300494

Add param to find k-mers without counts

tidySq integration

parallel workers should take a lambda function that access i-th row:

std::function<input_vector_t(int rowNum)>

This function will be used in parallel k-mer counting procedure (kmer_counting_common.h).

For example for Rcpp StringMatrix the lambda is:

[](int rowNum) -> Rcpp::StringMatrix::Row { return seqMatrix(rowNum, Rcpp::_); }

Add BloomFilter to find k-mers

Bloom filter would be an extra layer between KMerManager and a custom dictionary.

Wrap related input fields into structures in order to reduce the number of function's params

Determine the number and params (P, M) for hashing functions

Prepare optimized and specialized variants of AlphabetEncoding for each input sequence type

AlphabetEncoding as a template instead of a class
Move the responsibility of checking if a sequence element is allowed to AlphabetEncoding from the algorithm
(for encode method: return a notAllowedElementCode if the input element is invalid)
Each sequence type has custom specialization of AlphabetEncoding

Tidysq: make unpacking sequences parallel

Finish README

Add more examples
Add benchmarks

Add count_multimers

Add fill or complete parameter to count_kmers

Add fill or complete parameter to count_kmers to add columns with k-mers that weren't present in the data. It will streamline the binding of the k-mer matrices.

Add param with_kmer_names

A param with_kmer_names would determine whether the output matrix should contain named columns or not. A user would be able to make computations faster if the exact column names are not used.

Clean up package namespace: refactor C++ exported function names

Typed find_kmers functions should have the form .count... (a param in Rpp::export attribute can be added to rename it)
Remove @export and move corresponding R tests into C++ form

Final k-mer frequency result format

Currently, an Rcpp::IntegerMatrix is returned, although the matrix would be sparse in the most of cases.

simple_triplet_matrix is an alternative solution, however, it also requires the normal matrix generation step, which wastes both CPU time and RAM resources.

Another solution is connected with Rcpp::Modules and creating a custom sparse data structure.

Better hash computation of a complex_hash (std::vector<int>)

Currently, it is computed using a simple polynomial formula (hash/complex_hash.h)
The recommended way is to use a dedicated function from the boost lib.

Extra param to find_kmers to specify a dictionary to store k-mers

As the total algorithm efficiency may depend on the perfomance of accessing k-mer data, it is important to choose an appropriate dictionary. The performance might depend on the size of data that is to be stored. Surprisingly, a linear list performs better that a hash map when the total number of elements is relatively small. Thus, it may be useful to make it possible for users to specify a dictionary type on their own because they know better the input data that is provided to the algorithm.

Nullable handling + delegation to a specialized method

Handle the case when params: alphabet, sequences are nulls.
Unfortunately, in R: c() produces null in Rcpp

Add a proxy method count_kmers in R code that dispatches further processing to a specialized Rcpp method

Fix CI

Make the package work on macOS
Add ubuntu 20.04
Remove Travis CI
Fix build on windows R 3.6

Better hashing functions

Try to get rid of modulo that is pretty costly.

Try to use: https://github.com/lemire/fastmod ; https://arxiv.org/pdf/1902.01961.pdf

Add merging k-mer output matrices

Add default alphabet value "all"

Output matrix does not contain rows for the last input sequences without k-mers

If the last input sequences do not contain any k-mer, then the corresponding output matrix raws are not generated.
The problem does not occur when such input sequences are not in the end.

A problematic example:

count_kmers(list("aaaaacbb", "aa"), 5, letters)

The output is:

A 1x4 simple triplet matrix.

However, it should be:

A 2x4 simple triplet matrix.

The aforementioned problem does not occur for the following example:

count_kmers(list("aaaaacbb", "aa", "aaaaaaa"), 5, letters)

Document API

Improve the performance for string matrix input

Currently, the input memory has to be copied due to the fact that Rcpp structures aren't thread-safe and RcppParallel does not provide any thread-safe wrapper for Rcpp::StringMatrix.
This is very time consuming and inefficient. Particularly, creating std::string structures seems to be very inefficient.

Actually, the memory is copied sequentially (in other words, the action is not parallel)
and after that, we take advantage of parallel algorithm.

In fact, it seems more efficient just not to use parallel version.
This hypothesis should be tested.