Giter Club home page Giter Club logo

seqr's Introduction

seqR - fast and comprehensive k-mer counting package

CRAN_Status_Badge R build status Lifecycle: stable License: GPL v3 codecov.io Code Quality Status Code Quality Score

About

seqR is an R package for fast k-mer counting. It provides

  • highly optimized (the core algorithm is written in C++)
  • in-memory
  • probabilistic (with configurable dimensionality of a hash value used for storing k-mers internally),
  • multi-threaded (with a configurable size of the batch of sequences (batch_size) to process in a single step. If batch_size equals 1, the multi-threaded mode is disabled, which potentially causes a longer computation time)

implementation that supports

  • various variants of k-mers (contiguous, gapped, and positional counterparts)
  • all biological sequences (e.g., nucleic acids and proteins)

Moreover, the result optimizes memory consumption by the application of sparse matrices (see package Matrix), compatible with machine learning packages such as ranger and xgboost.

How to…

How to install

To install seqR from CRAN:

install.packages("seqR")

Alternatively, if you want to use the latest development version:

# install.packages("devtools")
devtools::install_github("slowikj/seqR")

How to use

The package provides two functions that facilitate k-mer counting

  • count_kmers (used for counting k-mers of one type)
  • count_multimers (a wrapper of count_kmers, used for counting k-mers of many types in a single invocation of the function)

and one function used for custom processing of k-mer matrices:

  • rbind_columnwise (a helper function used for merging several k-mer matrices that do not have same sets of columns)

To learn more, see features overview vignette and reference.

Examples

counting 5-mers
count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
            k=5)
#> 2 x 12 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 12 column names 'A.A.A.A.A_0.0.0.0', 'A.V.V.A.V_0.0.0.0', 'V.V.A.V.F_0.0.0.0' ... ]]
#>                             
#> [1,] 1 1 1 1 1 1 1 . . . . .
#> [2,] . . . . . . . 1 1 1 2 1
counting gapped 5-mers with gaps (0, 1, 0, 2) (XX_XX__X)
count_kmers(sequences=c("AAAAAVVAVFF", "DFGSADFGSA"),
            kmer_gaps=c(0, 1, 0, 2))
#> 2 x 7 sparse Matrix of class "dgCMatrix"
#>      A.A.A.A.A_0.1.0.2 A.A.V.V.F_0.1.0.2 A.A.V.A.F_0.1.0.2 A.A.A.V.V_0.1.0.2
#> [1,]                 1                 1                 1                 1
#> [2,]                 .                 .                 .                 .
#>      G.S.D.F.A_0.1.0.2 F.G.A.D.S_0.1.0.2 D.F.S.A.G_0.1.0.2
#> [1,]                 .                 .                 .
#> [2,]                 1                 1                 1
counting 1-mers and 2-mers
data(CsgA)

CsgA[1L:2]
#> $`sp|P28307|CSGA_ECOLI Major curlin subunit OS=Escherichia coli (strain K12) OX=83333 GN=csgA PE=1 SV=3`
#>   [1] "M" "K" "L" "L" "K" "V" "A" "A" "I" "A" "A" "I" "V" "F" "S" "G" "S" "A"
#>  [19] "L" "A" "G" "V" "V" "P" "Q" "Y" "G" "G" "G" "G" "N" "H" "G" "G" "G" "G"
#>  [37] "N" "N" "S" "G" "P" "N" "S" "E" "L" "N" "I" "Y" "Q" "Y" "G" "G" "G" "N"
#>  [55] "S" "A" "L" "A" "L" "Q" "T" "D" "A" "R" "N" "S" "D" "L" "T" "I" "T" "Q"
#>  [73] "H" "G" "G" "G" "N" "G" "A" "D" "V" "G" "Q" "G" "S" "D" "D" "S" "S" "I"
#>  [91] "D" "L" "T" "Q" "R" "G" "F" "G" "N" "S" "A" "T" "L" "D" "Q" "W" "N" "G"
#> [109] "K" "N" "S" "E" "M" "T" "V" "K" "Q" "F" "G" "G" "G" "N" "G" "A" "A" "V"
#> [127] "D" "Q" "T" "A" "S" "N" "S" "S" "V" "N" "V" "T" "Q" "V" "G" "F" "G" "N"
#> [145] "N" "A" "T" "A" "H" "Q" "Y"
#> 
#> $`sp|P0A1E7|CSGA_SALEN Major curlin subunit OS=Salmonella enteritidis OX=149539 GN=csgA PE=1 SV=1`
#>   [1] "M" "K" "L" "L" "K" "V" "A" "A" "F" "A" "A" "I" "V" "V" "S" "G" "S" "A"
#>  [19] "L" "A" "G" "V" "V" "P" "Q" "W" "G" "G" "G" "G" "N" "H" "N" "G" "G" "G"
#>  [37] "N" "S" "S" "G" "P" "D" "S" "T" "L" "S" "I" "Y" "Q" "Y" "G" "S" "A" "N"
#>  [55] "A" "A" "L" "A" "L" "Q" "S" "D" "A" "R" "K" "S" "E" "T" "T" "I" "T" "Q"
#>  [73] "S" "G" "Y" "G" "N" "G" "A" "D" "V" "G" "Q" "G" "A" "D" "N" "S" "T" "I"
#>  [91] "E" "L" "T" "Q" "N" "G" "F" "R" "N" "N" "A" "T" "I" "D" "Q" "W" "N" "A"
#> [109] "K" "N" "S" "D" "I" "T" "V" "G" "Q" "Y" "G" "G" "N" "N" "A" "A" "L" "V"
#> [127] "N" "Q" "T" "A" "S" "D" "S" "S" "V" "M" "V" "R" "Q" "V" "G" "F" "G" "N"
#> [145] "N" "A" "T" "A" "N" "Q" "Y"

count_multimers(sequences=CsgA,
                k_vector = c(1, 2))
#> 5 x 144 sparse Matrix of class "dgCMatrix"
#>    [[ suppressing 144 column names 'R', 'L', 'Y' ... ]]
#>                                                                                
#> [1,] 2 9 4 1 8 5 10 2 2 4 16 29 4 11  9 2 16 3 14 1 2 1 1 1 1 2 1 1 1 1 1 3 1 2
#> [2,] 3 8 5 2 7 6 11 2 2 4 17 22 3 11 10 2 20 1 15 1 3 1 . . 3 1 1 1 2 5 2 3 . 2
#> [3,] 3 8 5 2 7 6 11 2 2 4 17 22 3 11 10 2 20 1 15 1 3 1 . . 3 1 1 1 2 5 2 3 . 2
#> [4,] 2 9 4 1 9 5 10 2 1 4 15 30 4 11  9 2 17 4 13 1 2 1 1 1 1 2 1 1 1 1 1 3 1 2
#> [5,] 3 8 5 2 7 6 11 2 2 4 17 22 3 11 10 2 20 1 15 1 3 1 . . 3 1 1 1 2 5 2 3 . 2
#>                                                                                
#> [1,] 1 3 1 3 1 2 7 1 1 1 1 3 1 1 2 2 1 1 12 3 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2
#> [2,] . 2 . 2 1 2 3 . 1 2 . 3 1 . 3 2 2 1  6 4 2 3 1 2 1 . 1 . 1 1 . 1 . 1 1 2 .
#> [3,] . 2 . 2 1 2 3 . 1 2 . 3 1 . 3 2 2 1  6 4 2 3 1 2 1 . 1 . 1 1 . 1 . 1 1 2 .
#> [4,] 1 3 1 3 1 2 6 1 1 2 1 3 2 1 2 2 1 . 13 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2
#> [5,] . 2 . 2 1 2 3 . 1 2 . 3 1 . 3 2 2 1  6 4 2 3 1 2 1 . 1 . 1 1 . 1 . 1 1 2 .
#>                                                                               
#> [1,] 1 1 1 1 1 1 2 7 1 1 2 1 1 1 2 1 1 1 1 1 1 2 2 3 1 1 1 1 1 1 1 2 2 1 1 3 1
#> [2,] . 1 . 1 1 1 1 5 . . 2 . . 1 1 1 . . . 1 1 2 1 4 1 1 1 1 . 1 . 2 3 1 1 1 .
#> [3,] . 1 . 1 1 1 1 5 . . 2 . . 1 1 1 . . . 1 1 2 1 4 1 1 1 1 . 1 . 2 3 1 1 1 .
#> [4,] 1 1 . 1 1 1 1 7 1 1 2 1 1 1 2 1 . 1 1 1 1 2 2 3 1 . 1 1 1 1 1 1 2 1 1 3 1
#> [5,] . 1 . 1 1 1 1 5 . . 2 . . 1 1 1 . . . 1 1 2 1 4 1 1 1 1 . 1 . 2 3 1 1 1 .
#>                                                                             
#> [1,] 1 1 2 1 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
#> [2,] . 1 . 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 . . . .
#> [3,] . 1 . 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 . . . .
#> [4,] 1 1 2 1 2 . . . . . . . . . . . . . 1 . . . . . . . . . . . . . 1 1 1 1
#> [5,] . 1 . 1 3 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1 . . . .

How to cite

For citation type:

citation("seqR")

or use:

Jadwiga Słowik and Michał Burdukiewicz (2021). seqR: fast and comprehensive k-mer counting package. R package version 1.0.0.

Benchmarks

The seqR package has been compared with other existing k-mer counting R packages: biogram, kmer, seqinr, and biostrings.

All benchmark experiments have been performed using Intel Core i7-6700HQ 2.60GHz 8 cores, using the microbenchmark R package.

Contiguous k-mers

Changing k

The input consists of one DNA sequence of length 3 000.

Changing the number of sequences

Each DNA sequence has 3 000 elements, contiguous 5-mer counting.

Gapped k-mers

Changing the first contiguous part of a k-mer

The input consists of one DNA sequence of length 1 000 000. Gapped 5-mers counting with base gaps (1, 0, 0, 1).

Changing the first gap size

The input consists of one DNA sequence of length 100 000. Gapped 5-mers counting with base gaps (1, 0, 0, 1).

seqr's People

Contributors

michbur avatar piotr-ole avatar slowikj avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar

Forkers

minghao2016

seqr's Issues

Set an optimized dictionary for each Alphabet Encoding depending on the number of elements

Depending on a specific input alphabet structure, a different dictionary may turn out to be more efficient.

For example, taking into account tidysq case, a simple array would be better because the alphabet is encoded using small integers.

Secondly, for alphabet with relatively small number of elements, using a simple linear list dictionary would turn out to be more efficient than a more complex hash map.

Memory not mapped error

An error occurs when the garbage collector is invoked at the same time as the computation of the algorithm - it releases the memory that is going to be used soon.

It turns out that if we change methods which allocate Rcpp objects and do not return them back to R, everything works as expected.

The goal: C_unpack_raws should return a pure C++ data structure instead of Rcpp structure.

Add parallel_mode param

In order to determine whether the computations should be done in parallel or sequential mode.

Change the algorithm for positional k-mer space

An integer representing a position of a k-mer in a sequence tends to be larger than relatively small P constants used in hash function formula. Therefore, it is not recommended to use it during the hashing of a sequence.

How to fix it?
There are two solutions:

  1. Use a large P
  2. Different positional k-mer hashing approach - use (d + 1)-dimensional representations of k-mer (one extra integer indicates not transformed k-mer position)
  3. Change the dictionary structure and use a specialized version for positional variant; for example, 2-level approach: the first level is the position, the second level -- the hash

Add window slide size param

Add window slide size param in order to count k-mers using a fixed, configurable window slide step. Currently, after each step, the window moves by one position.

Add verbose param

In particular, let a user know about the current progress of batch computations.

Reduce the package size

Fix the following issue:

installed size is 15.4Mb
    sub-directories of 1Mb or more:
      libs        13.2Mb
      thirdparty   1.3Mb

tidySq integration

parallel workers should take a lambda function that access i-th row:

std::function<input_vector_t(int rowNum)>

This function will be used in parallel k-mer counting procedure (kmer_counting_common.h).

For example for Rcpp StringMatrix the lambda is:

[](int rowNum) -> Rcpp::StringMatrix::Row { return seqMatrix(rowNum, Rcpp::_); }

Add param with_kmer_names

A param with_kmer_names would determine whether the output matrix should contain named columns or not. A user would be able to make computations faster if the exact column names are not used.

Final k-mer frequency result format

Currently, an Rcpp::IntegerMatrix is returned, although the matrix would be sparse in the most of cases.

simple_triplet_matrix is an alternative solution, however, it also requires the normal matrix generation step, which wastes both CPU time and RAM resources.

Another solution is connected with Rcpp::Modules and creating a custom sparse data structure.

Extra param to find_kmers to specify a dictionary to store k-mers

As the total algorithm efficiency may depend on the perfomance of accessing k-mer data, it is important to choose an appropriate dictionary. The performance might depend on the size of data that is to be stored. Surprisingly, a linear list performs better that a hash map when the total number of elements is relatively small. Thus, it may be useful to make it possible for users to specify a dictionary type on their own because they know better the input data that is provided to the algorithm.

Fix CI

  1. Make the package work on macOS
  2. Add ubuntu 20.04
  3. Remove Travis CI
  4. Fix build on windows R 3.6

Output matrix does not contain rows for the last input sequences without k-mers

If the last input sequences do not contain any k-mer, then the corresponding output matrix raws are not generated.
The problem does not occur when such input sequences are not in the end.

A problematic example:

count_kmers(list("aaaaacbb", "aa"), 5, letters)

The output is:

A 1x4 simple triplet matrix.

However, it should be:

A 2x4 simple triplet matrix.

The aforementioned problem does not occur for the following example:

count_kmers(list("aaaaacbb", "aa", "aaaaaaa"), 5, letters)

Improve the performance for string matrix input

Currently, the input memory has to be copied due to the fact that Rcpp structures aren't thread-safe and RcppParallel does not provide any thread-safe wrapper for Rcpp::StringMatrix.
This is very time consuming and inefficient. Particularly, creating std::string structures seems to be very inefficient.

Actually, the memory is copied sequentially (in other words, the action is not parallel)
and after that, we take advantage of parallel algorithm.

In fact, it seems more efficient just not to use parallel version.
This hypothesis should be tested
.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.