lh3 / fermi-lite

Standalone C library for assembling Illumina short reads in small regions

License: MIT License

Makefile 0.59% C 94.07% C++ 5.33%

bioinformatics genomics denovo-assembly

fermi-lite's People

walaj edawson martindrab voutcn frankzelph lixiangchun frank-y-liu biocodings taoliu nh13 tillea scalavision davisem mr-c lishengkang biosharp-dotnet-labs gerbenvoshol julianhess ggraham

fermi-lite's Issues

Can fermi-lite be made strand-specific?

Is there a parameter, or an easy code change, that could make fermi-lite strand-specific? I am happy to make the changes myself if you think this is straightforward and can give me some pointers. I tried removing some of the revcomp lines in the source code, but the resulting assembly was worse.

In my use case, I know that all my input sequences come from the same strand, so I do not want their reverse complements to be tried for overlaps.

Thank you

parameter determination

We are interested in using fermi-lite for local assembly. In the readme you mention the need to automatically determine parameters for the assembly. What parameters need to be determined?

Please tag a stable release

when you're ready. I've been keen to have a library like this for a long time. Thanks, Heng!

Possible Error in bfc_ec1dir?

Hello,

I am using the error correction code of fermi-lite in my thesis, and it works pretty well. I have noticed that it counts k-mer occurrences with the help of a table built on top of a set of khash tables (enhanced with locking support). The lowest 14 bits of each table key are used to count occurrences of the corresponding k-mer: bits 0-7 count low-quality instances, and bits 8-13 count high-quality ones.

So, to extract the low-quality count, you just AND the key with the mask 0xff. To get the high-quality count, you shift right by 8 and then AND with the mask 0x3f.

On line 450 of bfc.c (in the bfc_ec1dir function), the wrong mask is probably applied:
pen.absent_high = ((s>>8&0xff) < e->opt->min_cov);

Can you look into it, please? I think I have a fairly deep understanding of the code now, but I am probably still missing a few details.
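
The difference between the two masks can be illustrated with a minimal sketch. It assumes the bit layout exactly as described in this issue (bits 0-7 low-quality count, bits 8-13 high-quality count) and has not been checked against bfc.c; the 64-bit key type and every function name below are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Layout ASSUMED from the issue text, not verified against bfc.c:
 *   bits 0-7   low-quality occurrence count
 *   bits 8-13  high-quality occurrence count
 *   bits 14+   remaining key bits */
#define LOW_MASK   0xffu
#define HIGH_SHIFT 8
#define HIGH_MASK  0x3fu

/* Low-quality count: the lowest 8 bits of the key. */
unsigned kmer_low_count(uint64_t key)
{
    return (unsigned)(key & LOW_MASK);
}

/* High-quality count: shift out the low field, keep 6 bits. */
unsigned kmer_high_count(uint64_t key)
{
    return (unsigned)((key >> HIGH_SHIFT) & HIGH_MASK);
}

/* Same shift with an 8-bit mask, as in the quoted line:
 * this also picks up bits 14 and 15 of the key. */
unsigned kmer_high_count_8bit_mask(uint64_t key)
{
    return (unsigned)((key >> HIGH_SHIFT) & 0xffu);
}

/* Hypothetical helper that packs a key for experimenting. */
uint64_t pack_key(uint64_t upper_bits, unsigned high, unsigned low)
{
    assert(high <= HIGH_MASK && low <= LOW_MASK);
    return (upper_bits << 14) | ((uint64_t)high << HIGH_SHIFT) | low;
}
```

With `pack_key(3, 5, 200)`, the 6-bit mask recovers a high-quality count of 5, while the 8-bit mask returns 197 because the two set bits above the counter leak into the result. Under the assumed layout, a comparison against `min_cov` would therefore only be affected when bits 14 or 15 of the key are set.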

considerations for local assembly

We would like to deploy fermi-lite in vg for local assembly and homogenization. Is there any particular consideration that we should take when doing this?

It may be helpful to assemble the data from many genomes in a small region (1kb-100kb for instance). What parameters might we use in that case?

fml-asm: mrope.c:230: mr_insert_multi: Assertion `len > 0 && s[len-1] == 0' failed.

This assertion failure occurs when short sequences are encountered, for example:

@0:1    4281    .    .
CCCACAGAACTAAAACAGAAGAATTCTC
@0:2    4281    .    .
CCTAGACAGAACCCATCTAAGAAACGAC

I have seen this on occasion when a read is truncated.

Thanks.
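
One possible workaround on the caller's side is to drop very short reads before they reach the assembler. The sketch below is an assumption-laden illustration, not a fix for the assertion itself: the function name and the length threshold are hypothetical, the reads are modeled as plain C strings rather than fermi-lite's own record type, and whether filtering actually avoids the failure is untested here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compact reads[] in place, keeping only sequences whose length is at
 * least min_len. Order is preserved. Returns the number of reads kept.
 * Dropped pointers are not freed; the caller still owns them. */
size_t keep_reads_at_least(const char **reads, size_t n_reads, size_t min_len)
{
    size_t kept = 0;
    size_t i;

    assert(reads != NULL || n_reads == 0);
    for (i = 0; i < n_reads; ++i) {
        if (reads[i] != NULL && strlen(reads[i]) >= min_len)
            reads[kept++] = reads[i];
    }
    return kept;
}
```

Applied to the two 28 bp reads shown above with a threshold of, say, 30, both would be removed and only longer reads would be passed on.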

bseq1_t --> fseq1_t

Hi Heng,
In trying to link bwa and fml in the same executable, I ran into an issue where bseq1_t was defined differently in each library. I ended up making a fork that fixed this issue for SeqLib, but am getting some feedback that it would be better to avoid having multiple fml / bwa clones out there and instead just have SeqLib link to the official fml.

Would you be willing to consider a PR that does the minimal amount of re-naming within fml to be able to link to bwa without multiple definition errors?

Take over patches from libSeqLib which needs separate bfc.h

Hi,
I'm just opening this issue to link to a pull request that adds an enhancement from libSeqLib. If this patch were applied, libSeqLib could drop its incompatible copy of the fermi-lite code.
Kind regards, Andreas.

Remove need for SSE2 support

Currently the code in ksw.c seems to need SSE2 to compile. It would be nice to have some kind of -- probably slower -- fallback implementation to improve support on architectures without these instructions.

cc1: fatal error: prog.c: No such file or directory

While compiling, I got the following error:

gcc -Wall -O2 prog.c -o prog -L /usr/local/bin -lfml -lz -lm -lpthread
cc1: fatal error: prog.c: No such file or directory

gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
Fedora 36.

How to solve this?

No assembly reported for 100 reads with the same sequence

@lh3 I was playing around with this tool but I couldn't get it to work on a "simple" case. I duplicated a read 100 times and would expect it to output the duplicated read. Any thoughts?

``` @M50205:20:000000000-B82KM:1:1108:8421:4217/2 CTAAGGTGGACATGTTGGCTTCTCTCTGTTCTTAACATGTTAAAATTAAAATTAACTTCTCTGGTGTGTGGAGATGTCTTACAATAACAGTTGCTACTATTTCTTTTCTTTTTCTCTTTCTTTCCTCTCTCTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTAGACAAGGTCTCAATTTGTCACTCAGAGTGAAGTGCATTGGCATGAACATTGCTCACTTCATCCTTAACCTTCTTGGCCAAAGAACTCCTCCTGCCTCACCCCC + 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 ```

Fermi-lite aggressive trimming

Hi,
When aggressive trimming is enabled in fermi-lite to pop bubbles in heterozygous regions, what strategy is employed?
Is the longer path in the bubble kept, or the shorter one? Or the one with the highest average coverage?
I am using fermi-lite to do local assembly of ~2kb regions.

Thanks,
Cristian

Cannot assemble a simple example

Consider the following 8 reads.

>seq1
ATCCTGAGAATCAATCTGTGAAAATTATGTCTTGGGAGGAGGGGAAGGAAACCAAAAATTTTTAGAAAAGCTGGAACTCTTAGCTATCTAGAAGCAGGTC
>seq2
GGGAAGGAAACCAAAAATTTTTAGAAAAGCTGGAACTCTTAGCTATCTAGAAGCAGGTCTTGAATCTCACAGAATCGCAAAGGAAGAAAATCAGGGCCTA
>seq3
TTTAGAAAAGCTGGAACTCTTAGCTATCTAGAAGCAGGTCTTGAATATCACAGAATCGCAAAGGAAGAAAATCAGGGCCTACCTATCTAAATTTAAAATT
>seq4
GAAATTTTAAATTTAGATATGTAGGCCCTGATTTTCTTCCTTTGCGATTCTGTGATATTCAAGACCTGCTTCTAGATAGCTAAGAGTTCCAGCTTTTCTA
>seq5
TGAGAAAATTATGTCTTGGGAGGAGGGGAAGGAAACCAAAAATTTTTAGAAAAGCTGGAACTCTTAGCTATCTAGAAGCAGGTCTTGAATATCACAGAAT
>seq6
TGAAAATTATGTCTTGGGAGGAGGGGAAGGAAACCAAAAATTTTTAGAAAAGCTGGAACTCTTAGCTATCTAGAAGCAGGTCTTGAATATCACAGAATCG
>seq7
TTTTTAGAAAAGCTGGAACTCTTAGCTATCTAGAAGCAGGTCTTGAATATCACAGAATCGCAAAGGAAGAAAATCAGGGCCTACATATCTAAATTTAAAA
>seq8
ATAGCTAAGAGTTCCAGCTTTTCTAAAAATTTTTGGTTTCCTTCCCCTCCTCCCAAGACATAATTTTCACAGATTGATTCTCAGGATTGGCAATCATGCA

A quick multiple sequence alignment shows that there is very good consensus among these 8 reads for most of the alignment.

seq1            -------------atcctgagaatcaatctgtgaaaattatgtcttgggaggaggggaag
_R_seq8         tgcatgattgccaatcctgagaatcaatctgtgaaaattatgtcttgggaggaggggaag
seq5            -----------------------------tgagaaaattatgtcttgggaggaggggaag
seq6            -------------------------------tgaaaattatgtcttgggaggaggggaag
seq2            ------------------------------------------------------gggaag
seq3            ------------------------------------------------------------
_R_seq4         ------------------------------------------------------------
seq7            ------------------------------------------------------------
                                                                            

seq1            gaaaccaaaaatttttagaaaagctggaactcttagctatctagaagcaggtc-------
_R_seq8         gaaaccaaaaatttttagaaaagctggaactcttagctat--------------------
seq5            gaaaccaaaaatttttagaaaagctggaactcttagctatctagaagcaggtcttgaata
seq6            gaaaccaaaaatttttagaaaagctggaactcttagctatctagaagcaggtcttgaata
seq2            gaaaccaaaaatttttagaaaagctggaactcttagctatctagaagcaggtcttgaatc
seq3            -------------tttagaaaagctggaactcttagctatctagaagcaggtcttgaata
_R_seq4         ---------------tagaaaagctggaactcttagctatctagaagcaggtcttgaata
seq7            -----------tttttagaaaagctggaactcttagctatctagaagcaggtcttgaata
                               *************************                    

seq1            -------------------------------------------------------
_R_seq8         -------------------------------------------------------
seq5            tcacagaat----------------------------------------------
seq6            tcacagaatcg--------------------------------------------
seq2            tcacagaatcgcaaaggaagaaaatcagggccta---------------------
seq3            tcacagaatcgcaaaggaagaaaatcagggcctacctatctaaatttaaaatt--
_R_seq4         tcacagaatcgcaaaggaagaaaatcagggcctacatatctaaatttaaaatttc
seq7            tcacagaatcgcaaaggaagaaaatcagggcctacatatctaaatttaaaa----

However, I cannot get fml-asm to produce any assembly from these reads. I've tried relaxing parameters in various ways but with no success. Are there any parameter settings that will assemble these reads, or is this a particularly challenging case that can't easily be solved?

Thanks!
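
In the alignment above, the rows labeled `_R_seq4` and `_R_seq8` are the reverse complements of seq4 and seq8: reverse-complementing the end of seq8 (`...GGATTGGCAATCATGCA`) gives the start of the `_R_seq8` row (`tgcatgattgccaatcc...`). A minimal sketch of that operation follows; it is an illustration only and is not taken from fermi-lite. The function names are hypothetical, and mapping every non-ACGT character to `N` is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Complement of a single nucleotide, preserving case.
 * Anything other than A/C/G/T is mapped to 'N' (a simplification). */
char complement_base(char b)
{
    switch (b) {
    case 'A': return 'T';
    case 'C': return 'G';
    case 'G': return 'C';
    case 'T': return 'A';
    case 'a': return 't';
    case 'c': return 'g';
    case 'g': return 'c';
    case 't': return 'a';
    default:  return 'N';
    }
}

/* Reverse-complement a NUL-terminated sequence in place. */
void revcomp_inplace(char *seq)
{
    size_t n, i;

    assert(seq != NULL);
    n = strlen(seq);
    /* Swap complemented bases from both ends toward the middle. */
    for (i = 0; i < n / 2; ++i) {
        char left = complement_base(seq[i]);
        seq[i] = complement_base(seq[n - 1 - i]);
        seq[n - 1 - i] = left;
    }
    /* An odd-length sequence has a middle base that only needs complementing. */
    if (n % 2 == 1)
        seq[n / 2] = complement_base(seq[n / 2]);
}
```

Calling `revcomp_inplace` on a buffer holding `GGATTGGCAATCATGCA` leaves `TGCATGATTGCCAATCC` in the buffer, matching the first 17 bases of the `_R_seq8` row.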