seqan / needle

A fast and space-efficient pre-filter for estimating the quantification of very large collections of nucleotide sequences

License: BSD 3-Clause "New" or "Revised" License

CMake 5.65% C++ 86.05% R 3.77% Python 1.16% Shell 3.37%

ibf minimizer ngs-data bloom-filters seqan3

needle's Introduction

SeqAn - The Library for Sequence Analysis

NOTE SeqAn3 is out and hosted in a different repository
We recommend using SeqAn3 for new applications.

What Is SeqAn?

SeqAn is an open source C++ library of efficient algorithms and data structures for the analysis of sequences with the focus on biological data. Our library applies a unique generic design that guarantees high performance, generality, extensibility, and integration with other libraries. SeqAn is easy to use and simplifies the development of new software tools with a minimal loss of performance.

License

The SeqAn library itself, the tests and demos are licensed under the very permissive 3-clause BSD License. The licenses for the applications themselves can be found in the LICENSE files.

Prerequisites

Older compiler versions might work but are neither supported nor tested.

Linux, macOS, FreeBSD

GCC ≥ 11
Clang/LLVM ≥ 15
Intel oneAPI C++ Compiler 2024.0.2 (IntelLLVM)

Windows

Visual C++ ≥ 17.0 / Visual Studio ≥ 2022

Architecture support

Intel/AMD platforms, including optimisations for modern instruction sets (POPCNT, SSE4, AVX2, AVX512)
All Debian release architectures supported, including most ARM and all PowerPC platforms.

Build system

To build tests, demos, and official SeqAn applications you also need CMake ≥ 3.12.

Some official applications might have additional requirements or only work on a subset of platforms.

Documentation Resources

Contact

needle's People

rrahn mitradarja eseiler baajarmeh mr-c sgssgene

needle's Issues

[Feature] Add Update possibilities

Add a possibility to add a new experiment and a possibility to add an IBF with a new expression level. For adding a new expression level, the IBFs "cornering" the new expression value need to be recalculated.

Add hibf_lib and chopper

[ ] hibf_lib
[ ] chopper

Add Cli tests

Add cli tests as proposed by https://github.com/joergi-w/app-template/wiki

Recalculate normalized expression value after preprocessing

It might make sense to use a different normalization method after preprocessing. This should be possible for all methods based on a genome mask, where new sequences are given. It remains impossible to calculate anything on the sequences of the experiment, because these are not given.

[TODO] Change genome mask to include and exclude option

Rename genome mask to include.
Add option to exclude certain minimisers.

Add insert function

Inserting new experiments needs to be implemented

Remove Underscore from Testnames

See here for why: https://github.com/google/googletest/blob/master/googletest/docs/faq.md#why-should-test-suite-names-and-test-names-not-contain-underscore

Needle 2.0.0

Add HIBF #100
Add Merge option #38
Add update option? #38
Make Preprocessing better with multiple threads
Make count competitive with kallisto & Co.?
Add argument verbose #32

Normalization method Genome Mean does not work

Even when a genome file is given, the mean normalization is still computed over all minimizers.
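
A minimal sketch of what a mask-aware mean could look like, assuming minimiser counts are held in a hash map and the genome mask in a hash set. The type aliases and the function `mean_count` are hypothetical and are not taken from Needle's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Hypothetical types: minimiser hash -> occurrence count, and a set of genome minimisers.
using minimiser_counts = std::unordered_map<uint64_t, uint16_t>;
using genome_mask = std::unordered_set<uint64_t>;

// Mean occurrence over the minimisers of one experiment.
// If `mask` is non-null, only minimisers contained in the mask contribute.
inline double mean_count(minimiser_counts const & counts, genome_mask const * mask = nullptr)
{
    double sum{0.0};
    std::size_t used{0};
    for (auto const & [hash, count] : counts)
    {
        if (mask != nullptr && mask->count(hash) == 0)
            continue; // minimiser is not part of the genome mask
        sum += count;
        ++used;
    }
    return used == 0 ? 0.0 : sum / static_cast<double>(used);
}
```

Passing the mask as an optional pointer keeps one code path for both the plain mean and the genome-restricted mean.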

Make more use of header files or use cereal

Currently, the header file information is not used at all when using ibf with minimiser files, so a user would have to type in the cutoffs, k, w, and shape again. This does not make sense.
On the other hand, the expression level information should not be used; there is no reason why a user would want to keep those levels. They should only be used for statistics.

Maybe the information about k, w, and shape can be stored in a better way, so that no extra file is necessary, for example in the file name? experiment_k_w_shape_cutoff.minimiser could work. Then the header would only be used for the statistic function.
Or store them in a data structure.
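
The file-name idea could be sketched as follows. This assumes the pattern experiment_k_w_shape_cutoff.minimiser with all four parameters written as decimal integers (the shape as its integer encoding). The struct `minimiser_file_params` and the function `parse_minimiser_filename` are hypothetical names for illustration, not part of Needle.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical container for the parameters encoded in the file name.
struct minimiser_file_params
{
    std::string experiment;
    uint8_t k{};
    uint32_t w{};
    uint64_t shape{};
    uint32_t cutoff{};
};

// Parses "<experiment>_<k>_<w>_<shape>_<cutoff>.minimiser".
// Returns std::nullopt if the name does not follow this (assumed) pattern.
inline std::optional<minimiser_file_params> parse_minimiser_filename(std::string const & filename)
{
    std::string const extension{".minimiser"};
    if (filename.size() <= extension.size()
        || filename.compare(filename.size() - extension.size(), extension.size(), extension) != 0)
        return std::nullopt;

    std::string stem = filename.substr(0, filename.size() - extension.size());

    // Split the four numeric fields off from the right,
    // because the experiment name itself may contain underscores.
    std::vector<uint64_t> numbers(4);
    for (int i = 3; i >= 0; --i)
    {
        auto const pos = stem.rfind('_');
        if (pos == std::string::npos || pos + 1 == stem.size())
            return std::nullopt;
        std::string const field = stem.substr(pos + 1);
        if (field.size() > 18 || field.find_first_not_of("0123456789") != std::string::npos)
            return std::nullopt;
        numbers[i] = std::stoull(field);
        stem.erase(pos);
    }
    if (stem.empty())
        return std::nullopt;

    return minimiser_file_params{stem,
                                 static_cast<uint8_t>(numbers[0]),
                                 static_cast<uint32_t>(numbers[1]),
                                 numbers[2],
                                 static_cast<uint32_t>(numbers[3])};
}
```

Parsing from the right is a design choice that tolerates experiment names such as sample_01. A stricter variant could also range-check k and w before narrowing them.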

Add Threshold to search

Add threshold option to search, to make it more comparable to other tools like Mantis.
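
A threshold-based search could look roughly like the sketch below: a query is reported for a bin when at least the given fraction of its minimisers is found there. An exact `std::unordered_set` stands in for an IBF bin, and `search_with_threshold` is a hypothetical name; neither reflects Needle's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Stand-in for one IBF bin: an exact set of minimiser hashes (assumption for illustration).
using bin_t = std::unordered_set<uint64_t>;

// Returns, per bin, whether at least `threshold` (in [0, 1]) of the query's minimisers were found.
inline std::vector<bool> search_with_threshold(std::vector<uint64_t> const & query_minimisers,
                                               std::vector<bin_t> const & bins,
                                               double const threshold)
{
    assert(threshold >= 0.0 && threshold <= 1.0);
    std::vector<bool> hits(bins.size(), false);
    if (query_minimisers.empty())
        return hits;

    double const required = threshold * static_cast<double>(query_minimisers.size());
    for (std::size_t b = 0; b < bins.size(); ++b)
    {
        std::size_t found{0};
        for (uint64_t const hash : query_minimisers)
            found += bins[b].count(hash); // 0 or 1 per minimiser
        hits[b] = static_cast<double>(found) >= required;
    }
    return hits;
}
```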

Calculate FPR for every file on every level

Instead of having one FPR for all files on a level, estimate the FPR for every file, store it, and then use it in the estimations. This is important if the amount of content differs a lot between files, because then a single FPR for all files is not correct.
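
As an illustration of why one FPR per level can be misleading, the following sketch applies the standard Bloom filter approximation FPR ≈ (1 - e^(-h·n/m))^h to each file separately, where n is the number of stored minimisers, m the bin size in bits, and h the number of hash functions. The function names are hypothetical, and it is an assumption that every file on a level shares the same bin size.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard Bloom filter approximation of the false positive rate.
// n: stored elements, m: bin size in bits, h: number of hash functions.
inline double bloom_filter_fpr(std::size_t const n, std::size_t const m, std::size_t const h)
{
    assert(m > 0 && h > 0);
    double const exponent = -static_cast<double>(h) * static_cast<double>(n) / static_cast<double>(m);
    return std::pow(1.0 - std::exp(exponent), static_cast<double>(h));
}

// One FPR per file, assuming all files on a level use the same bin size.
inline std::vector<double> fpr_per_file(std::vector<std::size_t> const & minimisers_per_file,
                                        std::size_t const bin_size_bits,
                                        std::size_t const num_hash_functions)
{
    std::vector<double> result;
    result.reserve(minimisers_per_file.size());
    for (std::size_t const n : minimisers_per_file)
        result.push_back(bloom_filter_fpr(n, bin_size_bits, num_hash_functions));
    return result;
}
```

With a shared bin size, a file holding ten times as many minimisers ends up with a much higher FPR than its neighbours, which is the effect described above.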

Use biggest file to estimate size of ibf?

In order to guarantee an FPR for every file, I could switch from using the mean to using the maximum. This would increase the size, but the compressed size should be even smaller.
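
The trade-off can be sketched by inverting the standard Bloom filter approximation: for n elements, h hash functions, and a target FPR p, the required bin size is m = -h·n / ln(1 - p^(1/h)). The helper names below are hypothetical and only contrast sizing by the mean with sizing by the maximum; they do not describe how Needle sizes its IBFs.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Bin size in bits so that n elements with h hash functions stay at or below the target FPR.
inline std::size_t bin_size_for_fpr(std::size_t const n, double const fpr, std::size_t const h)
{
    assert(fpr > 0.0 && fpr < 1.0 && h > 0);
    double const denominator = std::log(1.0 - std::pow(fpr, 1.0 / static_cast<double>(h)));
    return static_cast<std::size_t>(std::ceil(-static_cast<double>(h) * static_cast<double>(n) / denominator));
}

// Sizing by the mean: larger-than-average files exceed the target FPR.
inline std::size_t bin_size_from_mean(std::vector<std::size_t> const & counts, double const fpr, std::size_t const h)
{
    assert(!counts.empty());
    std::size_t const total = std::accumulate(counts.begin(), counts.end(), std::size_t{0});
    return bin_size_for_fpr(total / counts.size(), fpr, h);
}

// Sizing by the maximum: every file stays at or below the target FPR.
inline std::size_t bin_size_from_max(std::vector<std::size_t> const & counts, double const fpr, std::size_t const h)
{
    assert(!counts.empty());
    return bin_size_for_fpr(*std::max_element(counts.begin(), counts.end()), fpr, h);
}
```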

Improve documentation

Add link to apptemplate in Readme
Explain how doc can be built
Check documentation for errors or missing paragraphs
Add more information, so that built documentation is easier to use. Maybe add a small tutorial?

Reading in expression file is broken

There is a problem with the splitting function

Add preprocessing step

Add an optional preprocessing step that counts all minimizers and their occurrences for an experiment and saves them in a binary file. The IBF should then be built from these binary files.
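
A minimal sketch of such a preprocessing step, assuming the minimiser hashes of an experiment are already available as 64-bit integers. The binary layout (entry count followed by hash/count pairs in host byte order) and all function names are assumptions made for illustration; they are not Needle's actual file format.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

using minimiser_counts = std::unordered_map<uint64_t, uint16_t>;

// Counts how often each minimiser hash occurs; counts saturate at the uint16_t maximum.
inline minimiser_counts count_minimisers(std::vector<uint64_t> const & hashes)
{
    minimiser_counts counts;
    for (uint64_t const hash : hashes)
    {
        uint16_t & count = counts[hash];
        if (count < std::numeric_limits<uint16_t>::max())
            ++count;
    }
    return counts;
}

// Writes the number of entries, then (hash, count) pairs, in host byte order.
inline bool write_counts(minimiser_counts const & counts, std::string const & path)
{
    std::ofstream out{path, std::ios::binary};
    if (!out)
        return false;
    uint64_t const n = counts.size();
    out.write(reinterpret_cast<char const *>(&n), sizeof(n));
    for (auto const & [hash, count] : counts)
    {
        out.write(reinterpret_cast<char const *>(&hash), sizeof(hash));
        out.write(reinterpret_cast<char const *>(&count), sizeof(count));
    }
    return static_cast<bool>(out);
}

// Reads a file written by write_counts; returns what could be read (empty on failure).
inline minimiser_counts read_counts(std::string const & path)
{
    minimiser_counts counts;
    std::ifstream in{path, std::ios::binary};
    uint64_t n{0};
    if (!in.read(reinterpret_cast<char *>(&n), sizeof(n)))
        return counts;
    for (uint64_t i = 0; i < n; ++i)
    {
        uint64_t hash{};
        uint16_t count{};
        if (!in.read(reinterpret_cast<char *>(&hash), sizeof(hash))
            || !in.read(reinterpret_cast<char *>(&count), sizeof(count)))
            break;
        counts.emplace(hash, count);
    }
    return counts;
}
```

Building the IBF from such a file avoids recomputing minimisers from the raw sequences each time the expression levels or cutoffs change.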

test failing with SeqAn version 3.3.0

https://buildd.debian.org/status/fetch.php?pkg=seqan-needle&arch=amd64&ver=1.0.2%2Bds-2&stamp=1704746033&raw=0

"needle" name conflict

I put together a basic package of Needle for Debian and discovered that there is another program with the same name.

/usr/bin/needle is already provided by EMBOSS
https://packages.debian.org/bullseye/emboss a.k.a https://bio.tools/needle-ebi or https://bio.tools/needle

This causes a problem with Debian policy

Two different packages must not install programs with different functionality but with the same filenames. […] If this case happens, one of the programs must be renamed.

https://www.debian.org/doc/debian-policy/ch-files.html#binaries

And I would rather that the Debian package of Needle has the same program name as in other packaging systems. Otherwise that causes problems for users' scripts and workflows.

Perhaps this Needle could be renamed? Sorry to ask it, but it would be best to not have the confusion.

Make Mantis cutoffs automatic for ibf not based on minimiser function

At the moment the mantis cutoffs are only used when the minimiser function is called beforehand.

[BUG] Multithread results sometimes in error

Sometimes the API test test_needle fails for the multithread option.

[ RUN      ] ibfmin.no_given_expression_levels_multiple_threads
/home/mitradarvish/Dokumente/develop/needle/test/api/test_needle.cpp:337: Failure
Expected equality of these values:
  expected_result2
    Which is: [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
  res2
    Which is: [1,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1]
[  FAILED  ] ibfmin.no_given_expression_levels_multiple_threads (1 ms)

Replace std::unordered_set

Use robin_hood::unordered_set instead of std::unordered_set

robin_hood::unordered_node_map instead of robin_hood::unordered_map?

Add argument verbose

Add command line argument verbose

[BUG] Threshold in search does not work

Using a threshold other than 0.5 in the search returns a cereal error.

Make Path to files in example working

At the moment the test only works when built in the same directory as the code, because the test uses the path ./example.

Update submodules

I'll do a PR that:

Updates SeqAn to 3.2.0 (#126)
Updates robin-hood (#126)
~~Integrates sharg-parser~~
Updates CI (#126)

Add DESeq normalization

For estimating differentially expressed genes, add a normalization method inspired by the DESeq algorithm.
(see: https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html)
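
The median-of-ratios idea behind DESeq can be sketched as follows: for every gene, take the geometric mean of its counts across samples; for every sample, the size factor is the median of the ratios between its counts and those geometric means. The function name and the matrix layout (`counts[gene][sample]`) are assumptions for this sketch, not an existing Needle interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// counts[g][s]: raw count of gene g in sample s.
// Returns one size factor per sample; dividing raw counts by it yields normalized counts.
inline std::vector<double> deseq_size_factors(std::vector<std::vector<double>> const & counts)
{
    assert(!counts.empty() && !counts.front().empty());
    std::size_t const num_samples = counts.front().size();
    std::vector<std::vector<double>> log_ratios(num_samples);

    for (auto const & gene : counts)
    {
        assert(gene.size() == num_samples);
        // Genes with a zero count in any sample have geometric mean 0 and are skipped.
        if (std::any_of(gene.begin(), gene.end(), [](double const c) { return c <= 0.0; }))
            continue;

        double log_geometric_mean{0.0};
        for (double const c : gene)
            log_geometric_mean += std::log(c);
        log_geometric_mean /= static_cast<double>(num_samples);

        for (std::size_t s = 0; s < num_samples; ++s)
            log_ratios[s].push_back(std::log(gene[s]) - log_geometric_mean);
    }

    std::vector<double> size_factors(num_samples, 1.0);
    for (std::size_t s = 0; s < num_samples; ++s)
    {
        auto & ratios = log_ratios[s];
        if (ratios.empty())
            continue; // no usable gene: leave the factor at 1
        std::sort(ratios.begin(), ratios.end());
        std::size_t const mid = ratios.size() / 2;
        double const median = (ratios.size() % 2 == 1) ? ratios[mid] : 0.5 * (ratios[mid - 1] + ratios[mid]);
        size_factors[s] = std::exp(median);
    }
    return size_factors;
}
```

Working in log space avoids overflow when multiplying many counts, and taking the median makes the factor robust against a few strongly expressed genes.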

[TODO] Entangle api tests and check validity of cli tests

lib/robin-hood-hashing missing from source tarballs

Probably because it is a Git submodule.
