chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

License: Other

C++ 97.58% CMake 0.27% Python 2.06% Shell 0.08%

dna sequencing assembly nanopore pacbio long-read de-novo

shasta's Introduction

Shasta long read assembler

This repository is no longer in use

The project has moved to github.com/paoloshasta/shasta.

Shasta development continues in the new repository.

New Shasta releases will appear in the Releases page of the new repository.

Old Shasta releases (up to 0.10.0) continue to be available in the Releases page of this repository.

For questions, issues, and discussion on any version of Shasta, please use the Issues page of the new repository.

Here is the old README file for this repository. It is now obsolete.

shasta's Issues

Results availability

Hello,

Is the assembly file of GM12878 available (results)?

Thank you,
Eleni

The first character in the Fasta file is not ">".

Dear shasta developers,
I am using Shasta to assemble my Nanopore data, which is in FASTQ format, on a Linux server. However, the Shasta v0.3.0 binary (https://github.com/chanzuckerberg/shasta/releases/download/0.3.0/shasta-Linux-0.3.0) does not recognize my FASTQ file. It treats it as a FASTA file and reports the error: The first character in the Fasta file is not ">".
Do you have any ideas what I should do?
Thanks in advance!
Jianjun
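
FASTA records begin with `>` and FASTQ records begin with `@`, so the first character of a file is a quick way to see which format it is actually in before passing it to the assembler. A minimal sketch of such a check; the function name `sniff_read_format` is hypothetical and is not part of Shasta.

```python
def sniff_read_format(path):
    """Guess the read file format from its first non-empty line."""
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "empty"
```

If this reports `fastq` for a file that the assembler parses as FASTA, converting the file to FASTA first is one way around the problem.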

limiting core usage

Hi,

I did not find a parameter to set the number of cores. Is there one?
shasta seems to try to use all the available cores:
Using 1 threads for reading and 240 threads for processing.

Why are SeqAn libraries unsupported on macOS?

The SeqAn libraries appear to be unused on macOS, according to AssemblerAlign1.cpp.

// Alternative alignment functions with 1 suffix.
#include "Assembler.hpp"
using namespace shasta;

// Standard library.
#include "chrono.hpp"

// Seqan.
#ifdef __linux__
#include <seqan/align.h>
#endif



#ifndef __linux__

// For macOS we don't have SeqAn, so we can't do any of this.
void Assembler::alignOrientedReads1(
    ReadId readId0, Strand strand0,
    ReadId readId1, Strand strand1,
    int matchScore,
    int mismatchScore,
    int gapScore)
{
    throw runtime_error("alignOrientedReads1 is not available on macOS.");
}

#else

But, SeqAn's libraries and header files are available via homebrew.

$ brew search seqan
==> Formulae
brewsci/bio/seqan

Are there any other problems for binary distribution?

How to choose the GPU devices for executing shastaGpu?

My workstation has a GeForce GTX 750 Ti and a Tesla K40c.
I would like to use only the Tesla K40c for shastaGpu, but the --useGpu option activated both GPU devices.
Is it possible to select a single device?

Spending a very long time "Processing marker graph vertices"

I believe I mentioned this before, but I couldn't find the specific issue.

I find that in some configurations of assemblies of phage genomes, virtually all the runtime is consumed in the marker graph processing step. Only one thread is running when it appears to be stuck in this step.

Do you have any idea if there would be a way to improve the runtime or parallelism of this step? Happy to help, also I can provide data to reproduce this.

When will the next release be cut?

Hi,

I am trying out Shasta on a CCS data set and following the parameters suggested in #56.

However, the option --MinHash.minBucketSize is available only after the 0.2.0 release.
Building from source on my OS seems to be a non-trivial process as indicated in docs/BuildingFromSource.html.

Hence I am wondering when the next release will be, so that I can plan what to do next.

Thank you!

Steve

Allow multi-line reads in input fasta files

Currently Shasta requires each read in an input Fasta file to be on a single line. It would be good to remove this restriction, as the fasta format does allow reads to span multiple lines, and some utilities do create fasta files with reads over multiple lines.
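
Until the restriction is lifted, a multi-line Fasta file can be rewritten with one sequence line per read. A minimal sketch of that conversion; the function names `linearize_fasta` and `write_single_line_fasta` are hypothetical and are not part of Shasta.

```python
def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining sequences that span several lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_single_line_fasta(in_path, out_path):
    """Rewrite a Fasta file so that every read occupies exactly two lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, sequence in linearize_fasta(src):
            dst.write(f"{header}\n{sequence}\n")
```

The generator processes one read at a time, so memory use stays proportional to the longest read rather than to the file size.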

URL for human ONT data

Hello,
I wonder where I can download the eleven human ONT datasets mentioned in your bioRxiv paper.
Could you provide the relevant URL?
Thanks!

Best

A standard exception occurred in thread

Hi,

I tried to run Shasta with:

#!/bin/bash
set -e
SCRIPT='/home/raymond/devel/shasta/shasta-Linux-0.1.0'
inputFile='canu_corr150x.correctedReads.nonOtherChar.fasta'
outputFile='correct_1kb'

$SCRIPT \
    --input $inputFile \
    --output $outputFile

But it raised this error:

Shasta Release 0.1.0
2019-Aug-15 00:52:11.143336
This is the static executable for the Shasta assembler. It provides limited Shasta functionality at but has no dependencies and requires no installation.

Default values of assembly parameters are optimized for an assembly at coverage 60x. If your data have significantly different coverage, some changes in assembly parameters may be necessary to get good results.

For more information about the Shasta assembler, see
https://github.com/chanzuckerberg/shasta

Complete documentation for the latest version of Shasta is available here:
https://chanzuckerberg.github.io/shasta

Options in use:
Input FASTA files: canu_corr150x.correctedReads.nonOtherChar.fasta
outputDirectory = correct_1kb
memoryMode = anonymous
memoryBacking = 4K

[Reads]
minReadLength = 10000
palindromicReads.maxSkip = 100
palindromicReads.maxMarkerFrequency = 10
palindromicReads.alignedFractionThreshold = 0.1
palindromicReads.nearDiagonalFractionThreshold = 0.1
palindromicReads.deltaThreshold = 100

[Kmers]
k = 10
probability = 0.1

[MinHash]
m = 4
hashFraction = 0.01
minHashIterationCount = 10
maxBucketSize = 10
minFrequency = 2

[Align]
maxSkip = 30
maxMarkerFrequency = 10
minAlignedMarkerCount = 100
maxTrim = 30

[ReadGraph]
maxAlignmentCount = 6
minComponentSize = 100
maxChimericReadDistance = 2

[MarkerGraph]
minCoverage = 10
maxCoverage = 100
lowCoverageThreshold = 0
highCoverageThreshold = 256
maxDistance = 30
edgeMarkerSkipThreshold = 100
pruneIterationCount = 6
simplifyMaxLength = 10,100,1000

[Assembly]
markerGraphEdgeLengthThresholdForConsensus = 1000
consensusCaller = SimpleConsensusCaller
useMarginPhase = False
storeCoverageData = False

Shasta Release 0.1.0
2019-Aug-15 00:52:11.143998 Loading reads from /data/raymond/work/Eucalyptus_pauciflora/genome/bin/genome/assembly/shasta/canu_corr150x.correctedReads.nonOtherChar.fasta.
Input file block size: 2147483648 bytes.
Using 1 threads for reading and 56 threads for processing.
Input file size is 74766867552 bytes.
2019-Aug-15 00:52:11.156517 Reading block 0 2147483648, 2147483648 bytes.
Block read in 2.12567 s at 1.01026e+09 bytes/s.
Processing 2147471993 input characters.
A standard exception occurred in thread 42: Invalid base character 78
./run.sh: line 12: 42463 Aborted                 (core dumped) $SCRIPT --input $inputFile --output $outputFile

Do you know what caused this issue?

Many thanks,
Raymond
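
The number in `Invalid base character 78` reads like a character code, and 78 is the ASCII code for `N`. A minimal sketch for tallying such characters in a Fasta file, assuming (this is not confirmed in the thread) that the reader accepts only A, C, G, and T; `ALLOWED_BASES` and `count_unexpected_characters` are hypothetical names.

```python
from collections import Counter

# Assumption: the only characters the read loader accepts.
ALLOWED_BASES = frozenset("ACGT")


def count_unexpected_characters(lines, allowed=ALLOWED_BASES):
    """Count sequence characters outside `allowed`, skipping '>' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue
        counts.update(ch for ch in line.strip() if ch not in allowed)
    return counts


print(chr(78))  # N
```

Running the counter over the input, for example `count_unexpected_characters(open(path))`, shows whether `N` or lowercase bases are present and how often.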

How to build the executable that is not static

Hello!

I tried to build shasta from source to use these options

--Assembly.useMarginPhase
--Assembly.storeCoverageData

But I was not able to use them because the executable is static.

How should I build shasta so that I can use these options?

Here are more details.

I built shasta according to the instructions in https://chanzuckerberg.github.io/shasta/BuildingFromSource.html

git clone https://github.com/chanzuckerberg/shasta.git
sudo shasta/scripts/InstallPrerequisites-Ubuntu.sh
mkdir shasta-build
cd shasta-build
cmake ../shasta
make all
make install

However, I was not able to use those options with the executable 'shasta-build/shasta-install/bin/shasta', even though "shasta.so" was present in the same directory.

The error messages looked like this:

Shasta development build. This is not a released version.
2019-Aug-29 02:05:20.780316 Terminated after catching a runtime error exception:
Assembly.useMarginPhase is not supported by the Shasta static executable.

with regards

Jinyoung

viral quasispecies

It seems that 95% of my runtime on my (viral) application is in this step. The number of vertices is pretty small usually (thousands). But it can take tens of minutes. Coverage is high, so perhaps this is a problem.

Is there any way to speed it up? It runs in serial in my case. Is that expected? Happy to share my config files if you need to see what I'm doing.

Optimization for other species

Hello,

I study maize genetics and am interested in Shasta for speeding up our genome assembly pipeline. An initial test run of Shasta with 40-45x coverage of nanopore reads with an N50 of ~30kb returned a total number of assembled bases (1.5Gb) and N50 contig size (100kb) that was much smaller than what we're used to seeing from other assemblers (2-2.3Gb genome, 1.5-2.5Mb contigs). I'm guessing that this may be partially due to the complexity of the maize genome - do you have parameter tuning recommendations that may help?
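
Comparisons like the one above depend on coverage and N50 being computed the same way for every run. A minimal sketch of both calculations from a list of sequence lengths; the function names are hypothetical, and the genome size is an input the user has to supply.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 needs at least one length")


def coverage(total_read_bases, genome_size):
    """Return the average depth: sequenced bases divided by the estimated genome size."""
    return total_read_bases / genome_size
```

The same `n50` function applies to read lengths and to assembled contig lengths, which keeps the two numbers in a report comparable.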

Support for reads in fastq + compressed formats

Some users would like support for input reads in the following formats:

Fastq.
Compressed fasta (one or more compressed formats)
Compressed fastq (one or more compressed formats).

For now, Shasta provides two scripts (available in shasta/scripts in the source code tree,
or in shasta-install/bin in the build):

FastqToFasta.py to convert fastq to fasta.
FastqGzToFasta.py to convert fastq.gz to fasta in a single pass (that is, without going to disk twice).

There are also standard utilities to convert back and forth between fasta and fastq format, and to decompress.

Some considerations:

A Fastq file is roughly twice the size of the equivalent Fasta file, and reading it adds to the cost of an assembly.
Decompressing compressed formats also adds cost, and could be particularly problematic for compressed formats for which decompression is hard or impossible to parallelize.
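
The single-pass conversion from fastq.gz to fasta can be sketched as follows. This is not the FastqGzToFasta.py script shipped with Shasta, only an illustration of the idea under the assumption of four-line fastq records; the function names are hypothetical.

```python
import gzip


def fastq_records(handle):
    """Yield (name, sequence) from a four-line-per-record FASTQ text stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline()
        plus = handle.readline()
        quality = handle.readline()
        if not header.startswith("@") or not plus.startswith("+") or not quality:
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].rstrip("\r\n"), sequence.rstrip("\r\n")


def fastq_gz_to_fasta(in_path, out_path):
    """Stream a gzip-compressed FASTQ file into an uncompressed FASTA file."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for name, sequence in fastq_records(src):
            dst.write(f">{name}\n{sequence}\n")
```

Because the compressed stream is decoded and written record by record, no intermediate uncompressed fastq file is ever written to disk.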

Assertion failed: readCount > 0 at void ChanZuckerberg::shasta::Assembler::findAlignmentCandidatesLowHash(std::size_t, double, std::size_t, std::size_t, std::size_t, std::size_t, std::size_t) in /home/paolo/shasta/src

Assertion failed: readCount > 0 at void ChanZuckerberg::shasta::Assembler::findAlignmentCandidatesLowHash(std::size_t, double, std::size_t, std::size_t, std::size_t, std::size_t, std::size_t) in /home/paolo/shasta/src

Can you help figure out this issue?

Error in assembly: exception occurred in processing input characters

Hi! I have been trying Shasta for the assembly of both simulated and experimental nanopore reads. While reading the input files, in some cases, I have been getting the following error:

A standard exception occurred in thread 3: Assertion failed: buffer[bufferIndex++] == '>' at void ChanZuckerberg::shasta::ReadLoader::processThreadFunction(std::size_t) in /home/paolo/shasta/src/ReadLoader.cpp line 274

What does this error mean?

I have been running shasta with the default parameters, with the following command:
./shasta-Linux-0.1.0 --input inputfile.fasta

The complete log obtained in the terminal is as shown below:

Shasta Release 0.1.0
2019-Jun-03 15:27:41.808391 Loading reads from /g/steinmetz/vijayram/output/dazz_db/JS734_only_circ.fasta.
Input file block size: 2147483648 bytes.
Using 1 threads for reading and 80 threads for processing.
Input file size is 1016091 bytes.
2019-Jun-03 15:27:42.964100 Reading block 0 1016091, 1016091 bytes.
Block read in 0.00115359 s at 8.80809e+08 bytes/s.
Processing 1016091 input characters.
A standard exception occurred in thread 3: Assertion failed: buffer[bufferIndex++] == '>' at void ChanZuckerberg::shasta::ReadLoader::processThreadFunction(std::size_t) in /home/paolo/shasta/src/ReadLoader.cpp line 274
Aborted

Thank you!
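
The assertion text indicates that the read loader expected a `>` at a position where it found something else. One plausible cause, given the single-line restriction described under "Allow multi-line reads in input fasta files", is a Fasta file with wrapped sequences or blank lines. A minimal sketch of a checker for the strict two-lines-per-read layout; `check_single_line_fasta` is a hypothetical name, and the rules it enforces are an assumption about what the loader needs.

```python
def check_single_line_fasta(lines):
    """Return a list of (line_number, problem) for a strict two-lines-per-read FASTA."""
    problems = []
    expect_header = True
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if expect_header:
            if line.startswith(">"):
                expect_header = False
            else:
                problems.append((number, "expected a header line starting with '>'"))
        elif line.startswith(">"):
            problems.append((number, "header follows a header with no sequence"))
        elif not line:
            problems.append((number, "blank line where a sequence was expected"))
        else:
            expect_header = True
    if not expect_header:
        problems.append((number, "last header has no sequence line"))
    return problems
```

An empty result means every read is a header line followed by exactly one sequence line.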

Parameters for Plant Genomes

Hi there,

This is somewhat of a follow up to #54. I tried assembling some tomato data with default parameters and it didn't quite work out. The assembly should be between 750M and 800M, but it was closer to 500M.

I bring it up as a potentially good test case, if you are interested, since the genome is relatively simple (especially since it is very homozygous) and there is test data on SRA for 3 different accessions. The data is associated with my preprint.

The "BGV" (BGV006775) accession probably has the highest quality data, but the "M82" accession would be most convenient for my testing purposes.

Thanks

unable to build from source

Hi,
I am not sure if this is a bug or me just stupid. I also couldn't find documentation for building from source.

Steps performed:

Download the code from git
mkdir build
cd build
cmake ..
make

Expected result:
Build should complete successfully

Actual Result:
Fails with error:
[ 0%] Building CXX object src/CMakeFiles/shasta.dir/AlignmentGraph.cpp.o
[ 1%] Building CXX object src/CMakeFiles/shasta.dir/AssembledSegment.cpp.o
[ 2%] Building CXX object src/CMakeFiles/shasta.dir/Assembler.cpp.o
In file included from /soe/smittal2/shasta/source/src/Assembler.cpp:1:0:
<hidden_folder_path>/src/Assembler.hpp:24:10: fatal error: marginPhase/callConsensus.h: No such file or directory
#include "marginPhase/callConsensus.h"
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

Is there any reason not to use SHASTA for assembling PacBio data?

SHASTA is designed to assemble Oxford Nanopore long-read data; however, there is (naively) no reason I can see why it should not work equally well on PacBio long-read data (for the assembly step; for subsequent polishing of the assembly, non-SHASTA-associated tools will definitely be needed).

Am I missing something? Is there some reason why SHASTA should be used exclusively to assemble Oxford Nanopore reads, but not PacBio reads?

Error 122

Encountered the following error when running this command on a 2 TB, 32 core Linux machine. Any suggestions?

Thanks,
KF

shasta-Linux-0.1.0 --input subreads.fasta --output SHASTA

2019-Aug-21 23:15:16.174763 LowHash iteration 5 begins.
2019-Aug-22 02:36:28.718041 Terminated after catching a runtime error exception:
Error 122 during ftruncate to size 1987878912
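
The number in `Error 122 during ftruncate` looks like a system errno value. On Linux, errno 122 is normally `EDQUOT` (disk quota exceeded), which would point at a quota on the file system holding the output rather than at total memory. A minimal sketch for decoding an errno value on the machine where the run failed; `describe_errno` is a hypothetical helper, and the result depends on the platform.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on the current platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


print(describe_errno(122))
```

The same helper can be used to look up other numeric codes that appear in runtime error messages.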

Threads option not available in 0.1.0 release?

I note that the --threads option is mentioned in the documentation, but the prebuilt binary release 0.1.0 does not seem to recognize this as a command option. This is a problem for us, as we use shared compute resources, and the program appears to look at the total number of cores available and uses all of them.

What are the output files of the assembler?

I would like to try your assembler but don't see any specification of what filetypes your assembler outputs in the documentation. Is it fasta or a graph?

valgrind for binary built from source

Hi,
I am not sure if this is an issue or I am doing something wrong.
I am trying to profile shasta.
To do this I need to build from source.
I am able to successfully build from source and run it successfully.

But when I try to run it with valgrind, I get the error:

==46381== vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x8 0x6F 0x5 0xF6 0xC0 0x3E 0x0
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==46381== valgrind: Unrecognised instruction at address 0x404250.
==46381== at 0x404250: _GLOBAL__sub_I__ZN14ChanZuckerberg6shasta30TrainedBayesianConsensusCallerC2Ev (char_traits.h:350)
==46381== by 0x6E57CB: __libc_csu_init (in /soe/smittal2/shasta/shasta/build/src-static-executable/shasta)
==46381== by 0x6E4F96: (below main) (in /soe/smittal2/shasta/shasta/build/src-static-executable/shasta)
==46381== Your program just tried to execute an instruction that Valgrind
==46381== did not recognise. There are two possible reasons for this.
==46381== 1. Your program has a bug and erroneously jumped to a non-code
==46381== location. If you are running Memcheck and you just saw a
==46381== warning about a bad jump, it's probably your program's fault.
==46381== 2. The instruction is legitimate but Valgrind doesn't handle it,
==46381== i.e. it's Valgrind's fault. If you think this is the case or
==46381== you are not sure, please let us know and we'll try to fix it.
==46381== Either way, Valgrind will now raise a SIGILL signal which will
==46381== probably kill your program.
==46381==
==46381== Process terminating with default action of signal 4 (SIGILL)
==46381== Illegal opcode at address 0x404250
==46381== at 0x404250: _GLOBAL__sub_I__ZN14ChanZuckerberg6shasta30TrainedBayesianConsensusCallerC2Ev (char_traits.h:350)
==46381== by 0x6E57CB: __libc_csu_init (in /soe/smittal2/shasta/shasta/build/src-static-executable/shasta)
==46381== by 0x6E4F96: (below main) (in /soe/smittal2/shasta/shasta/build/src-static-executable/shasta)
==46381==

Strangely, I don't get this error when I use the pre-built binary. Is there something I am missing?

Any pointers to solve this issue are appreciated.

segmentation faults when beginning "simplifyMarkerGraph"

Using both release 0.4.0 and a locally compiled version of Shasta, I get a segmentation fault.
I am on a CentOS server with kernel version 2.6.32-754.18.2.el6.x86_64.
Almost all packages were installed through Linuxbrew, and for the locally compiled version only the static library and executable were built.
The command that was run was:
shasta --command=assemble --memoryBacking 2M --threads 8 --input foo.fasta --assemblyDirectory out_dir

The trailing lines of output are consistently:

2020-Apr-22 10:13:59.791654 Flagged as weak 2 edges with coverage 69 out of 572 total.
2020-Apr-22 10:13:59.792702 Flagged as weak 2 edges with coverage 70 out of 550 total.
2020-Apr-22 10:13:59.793527 Flagged as weak 2 edges with coverage 71 out of 566 total.
2020-Apr-22 10:13:59.795658 Flagged as weak 2 edges with coverage 74 out of 554 total.
2020-Apr-22 10:13:59.796736 Flagged as weak 2 edges with coverage 76 out of 570 total.
2020-Apr-22 10:13:59.800192 Flagged as weak 2 edges with coverage 84 out of 484 total.
Transitive reduction removed 2946484 marker graph edges out of 3566124 total.
The marker graph has 480952 vertices and 619640 strong edges.
2020-Apr-22 10:13:59.833666 Transitive reduction of the marker graph ends.
2020-Apr-22 10:13:59.870307 Begin prune iteration 0
Pruned 588 edges at prune iteration 0.
2020-Apr-22 10:14:00.396775 Begin prune iteration 1
Pruned 486 edges at prune iteration 1.
2020-Apr-22 10:14:00.927001 Begin prune iteration 2
Pruned 46 edges at prune iteration 2.
2020-Apr-22 10:14:01.486068 Begin prune iteration 3
Pruned 4 edges at prune iteration 3.
2020-Apr-22 10:14:02.010430 Begin prune iteration 4
Pruned 0 edges at prune iteration 4.
2020-Apr-22 10:14:02.554399 Begin prune iteration 5
Pruned 0 edges at prune iteration 5.
The original marker graph had 480952 vertices and 3566124 edges.
The number of surviving edges is 618516.
2020-Apr-22 10:14:03.294165 Begin simplifyMarkerGraph iteration 0 with maxLength = 10

High coverage assembly (was: SimpleBayesianConsensusCaller cleanup)

Only the following low priority clean up tasks in the SimpleBayesianConsensusCaller code remain:

Harmonize naming conventions with the rest of the Shasta code.
Use braces even for single-line code blocks in if/else (see the rest of the Shasta code for examples).

Release 0.2.0 is slower than Release 0.1.0

Release 0.2.0 is slower than Release 0.1.0 in the disjoint set computation used to merge marker graph vertices. This can substantially slow down assemblies on machines with large numbers of virtual processors.

This is caused by the fact that release 0.2.0 was built on Ubuntu 18.04, which has a version of the C++ standard library in which std::atomic<uint128_t> is not lock free, even when compiling with -mcx16.

Until the disjoint set data structure is modified to stay lock free on Ubuntu 18.04, release builds should be done on Ubuntu 16.04.
Workarounds: use one of the following:
- Use this temporary version of the Shasta executable.
- If you are building from source, make sure to do the build on Ubuntu 16.04 for full performance (the Linux version of the run machine is not important).
- Use Release 0.1.0.

question on the Markers section of documentation

Dear Shasta authors,

I was glad to see your London Calling presentation on twitter and the performance of the assembler looks impressive. I just had a small question concerning the section "Markers" at page https://chanzuckerberg.github.io/shasta/ComputationalMethods.html

In the first marker example given, CGC appears 3 times in the read (CGACACGTATGCGCACGCTGCGCTCTGCAGC) yet it is used only once as a marker, is this intentional or a mistake? This is also the case for TGC.

Best,

Rayan
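
The counts quoted in the question can be checked directly against the read from the documentation. A minimal sketch that lists every occurrence, overlapping ones included; it only counts substrings and makes no claim about how Shasta selects markers. `kmer_positions` is a hypothetical helper.

```python
def kmer_positions(read, kmer):
    """Return the 0-based start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = read.find(kmer)
    while start != -1:
        positions.append(start)
        start = read.find(kmer, start + 1)
    return positions


read = "CGACACGTATGCGCACGCTGCGCTCTGCAGC"
for kmer in ("CGC", "TGC"):
    print(kmer, kmer_positions(read, kmer))
# CGC [11, 15, 20]
# TGC [9, 18, 25]
```

Both k-mers do occur three times in the read, as the question states.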

'Resource temporarily unavailable'

I am running what should be a perfectly reasonable shasta run (assembling long reads of a 320-Mb genome at 100x coverage) on a major computer cluster (NSF XSEDE Bridges). Everything apparently went fine until, with no warning, the assembly died with this error message:

2019-Sep-16 20:16:17.364582 Terminated after catching a runtime error exception:
Resource temporarily unavailable

Is there some obvious reason why I should have gotten this error? And, if so, what am I supposed to do about it? (I cannot have sudo status on a massively shared cluster.)

low coverage data parameters

Do you have any hints for optimizing SHASTA assembly on lower coverage data (20-30X)?

considerations for running the shasta server / how the intermediate graphs is stored on disk

I'd like to run the shasta server to look at intermediate graphs. I've not been keeping these around or working with them. I have just been using the default mode for memory management, and I've only tested --memoryBacking 2M once. (I actually didn't use it because it didn't seem to help my performance. My assembly jobs have only required tens of GB of RAM, so perhaps this is why?)

Can I copy or mount graphs from a remote server to a local directory and then load them in the shasta server? I noticed that I was unable to remove files that were generated with --memoryBacking 2M until running shasta --command cleanupBinaryData.

I am generally curious about how the disk backed memory management works in practice. The reason is that I'd like to improve on what I'm doing in seqwish. Would you provide some pointers to code to read to understand how you're doing this?

telomere and nanopore chimeric sequences

Hi there,

We are wondering how Shasta assembles reads containing telomere sequences. We have an assembly in which short telomere repeats (less than 1 kb long) appear in the middle of some scaffolds. Nanopore data also contains chimeric reads, some of which may contain telomere-like repeats. These problems make it difficult to produce a chromosome-level assembly. How does the assembler deal with them?

Thanks!
Chen

Optimizing Shasta for short read lengths (<10kb)

Hi!

I'm super impressed with the speed of Shasta - and I'm taking advantage of its speed to explore parameters to optimize/maximize the size of genome assembly and longest assembled segment for ONT and PacBio data sets having read lengths skewed to <10kb in different marine invertebrates.

For ONT:

I'm wondering if you could recommend Shasta parameters to explore for working with ONT data sets that are skewed to shorter read lengths of 500-10000 bp.

I am assembling ~60x PromethION (~15 million) reads of an estimated 2-2.2 Gb pygmy squid (Idiosepius paradoxus) genome. Unfortunately, sequencing ran into pore clogging issues unless the DNA was sheared and size selected - so our reads are mostly smaller than 10 kb (~65% reads and ~33% bases) and even under 1 kb (~20% reads and ~0.5% bases). I read that default settings in Shasta are optimized for read lengths >10kb, so I'm interested in exploring different parameters to optimize Shasta for my data set.

To begin with I tested minimum read length cutoffs (--Reads.minReadLength) from 500-10000 bps. The results were interesting, in that having more short reads didn't always help - and max lengths in genome assembly and in longest assembled segment occurred at a cutoff of 6000 bp (see below). This is promising but at a 6000 bp cutoff over 45% of the reads are rejected - so I'd like to optimize other parameters to see if I can improve things and use more of the reads.

Also - here is the Assembly Summary for a 1000 bp cutoff, as an example.

Given my ONT read lengths skewed to smaller sizes, I'm unsure what additional Shasta parameters I should be exploring and was wondering if you have any suggestions.

Also, are there reasons to flat out reject short ONT reads - for instance, I'm wondering if they generally have even higher rates of sequencing error - or for other reasons are generally viewed as poor quality and to be avoided in assembly?

For PacBio:

I saw that use of Shasta with PacBio was discussed and closed for now in Issue #56. I can move my PacBio questions to that thread, no problem - just let me know.

I'm preparing to do Shasta assembly of deep PacBio sequencing of a ctenophore (Bolinopsis species) with an estimated genome size of 200-300 Mb. Like my ONT data set above, the PacBio data skew to under 10 kb - at least in comparison to a typical human ONT dataset that Shasta is optimized for.

I plan to follow Issue #56 recommendations to explore read length and kmer length - and use the Modal consensus caller for repeat counts. Are there other parameters in Shasta that might be useful to explore, given PacBio datasets - and/or do you now have recommended Shasta settings for PacBio?

I'm happy to update on parameter exploration as things progress on both the ONT and PacBio data sets above, if there is interest.
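
The percentages of reads and bases lost at each `--Reads.minReadLength` cutoff, as quoted in the ONT part of this post, can be recomputed from the read lengths alone. A minimal sketch, assuming that reads strictly shorter than the cutoff are the ones discarded; `rejected_fractions` is a hypothetical helper and the numbers in the usage lines are made up for illustration.

```python
def rejected_fractions(read_lengths, cutoff):
    """Return (fraction of reads, fraction of bases) shorter than `cutoff`."""
    total_reads = len(read_lengths)
    total_bases = sum(read_lengths)
    if total_reads == 0 or total_bases == 0:
        raise ValueError("need at least one read with non-zero length")
    short = [length for length in read_lengths if length < cutoff]
    return len(short) / total_reads, sum(short) / total_bases


# Illustrative lengths only, not data from the post.
lengths = [500, 1500, 8000, 10000]
for cutoff in (1000, 6000, 10000):
    reads, bases = rejected_fractions(lengths, cutoff)
    print(cutoff, f"{reads:.1%} of reads", f"{bases:.1%} of bases")
```

Tabulating both fractions side by side makes it easier to see when a cutoff removes many reads but few bases.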

error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.

Hi,
We are trying your software, but when we build it with GPU support, an error occurs:

[ 32%] Building CUDA object staticExecutableGpu/CMakeFiles/gpuLib.dir/__/src/gpu/GPU.cu.o
In file included from /usr/include/c++/5/mutex:35:0,
from /mnt/data1/lixinyue/docker/test_xfy/shasta/shasta/shasta-0.4.0/src/gpu/GPU.cu:8:
/usr/include/c++/5/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
#error This file requires compiler and library support
^
staticExecutableGpu/CMakeFiles/gpuLib.dir/build.make:62: recipe for target 'staticExecutableGpu/CMakeFiles/gpuLib.dir/__/src/gpu/GPU.cu.o' failed
make[2]: *** [staticExecutableGpu/CMakeFiles/gpuLib.dir/__/src/gpu/GPU.cu.o] Error 1
CMakeFiles/Makefile2:150: recipe for target 'staticExecutableGpu/CMakeFiles/gpuLib.dir/all' failed
make[1]: *** [staticExecutableGpu/CMakeFiles/gpuLib.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

Config:
g++ 5.4 or 4.9
cuda 10.0
ubuntu16.04

best,
XFY

Job killed at graph vertices

Given 256MB of RAM for an expected ~600MB genome and PacBio reads, Shasta keeps dying at "Begin computing marker graph vertices". Could you give me some insight into why that might be? Does it need more memory, or more processors (just 1 right now)?

shasta --input milfoil1line.fa --Reads.minReadLength 2000 --Assembly.consensusCaller Modal --assemblyDirectory milfoil

Support for reads in compressed format (.gz and/or others)

Shasta can load reads from uncompressed FASTA and FASTQ files. In some pipelines it is desirable to read directly from compressed versions of these files, without having to decompress them first. Shasta should support this.

Compressed formats of interest include .gz, .zip, .bz, and .bz2.
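
Until such support exists, a wrapper can choose the decompressor before the reads are converted or piped onward. A minimal sketch that picks it from the leading bytes of the file rather than from the extension; it covers only gzip and bzip2, and `open_reads` is a hypothetical helper that is not part of Shasta.

```python
import bz2
import gzip


def open_reads(path):
    """Open a plain, gzip, or bzip2 file as text, chosen by its leading magic bytes."""
    with open(path, "rb") as probe:
        magic = probe.read(3)
    if magic[:2] == b"\x1f\x8b":  # gzip signature
        return gzip.open(path, "rt")
    if magic == b"BZh":  # bzip2 signature
        return bz2.open(path, "rt")
    return open(path, "r")
```

Checking the content instead of the file name avoids surprises with files that were renamed or compressed without the usual suffix.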

Clean up of SimpleBayesianConsensusCaller code

Some low priority clean up tasks in the SimpleBayesianConsensusCaller code:

Remove compile warnings (some explicit conversions between integer types needed).
Harmonize naming conventions with the rest of the Shasta code.
Remove Python-style this-> in member functions.
Pass vectors by reference.
Make function split a static member of SimpleBayesianConsensusCaller.
When errors occur, throw an exception (std::runtime_error) instead of calling exit.
Use const for variables that don't change (e. g. int length = ...).
Don't use BASE, BASE_INDEXES, base_keys. Use class Base or AlignedBase instead.
Use braces even for single-line code blocks in if/else (see the rest of the Shasta code for examples).

Support newer versions of boost?

This header file was removed in boost 1.68, which is the version currently used in conda. Would it be possible to support boost >=1.68?

obtaining GFA output directly from the shasta executable

This is a feature request. The prebuilt binaries are great, but there doesn't seem to be any way to use them to get GFA output. To do that, we need to build shasta on our target system to get the shared library that works as a python module, and this is somewhat involved without root. Would it be simple to add an option to shasta itself to provide GFA output of the raw assembly graph? Making the build simpler would be another way to make sure this is a feature that users in weird HPC environments can rely on.

where is shasta?

Hello,

Trying to find an assembler for my reads, flipping around, tried the shasta installation...seems there is no release directory.

What do I do? Because I did this: https://chanzuckerberg.github.io/shasta/QuickStart.html

curl -O -L https://github.com/chanzuckerberg/shasta/releases/download/X.Y.Z/shasta-macOS-X.Y.Z

Then the chmod command didn't work. I looked in the file and, lo and behold, inside shasta is:

Not Found

I went back to GitHub and I do not see a .../shasta/releases/download, which makes sense given the text reading Not Found.

Please help.

Thank you.

Insight into aborted run

During a recent assembly of prokaryotic MinION data on Shasta v0.4.0, using the following code:

shasta --command=assemble --memoryBacking 2M --threads 8 --input ../../data/Diaz11.ONT.fasta --assemblyDirectory Diaz11_shastaV2 > shasta.log

Got the following error:

terminate called after throwing an instance of 'std::runtime_error'
what(): Error unmapping.
Aborted

However, the log file reported this at the end:
Using 8 threads.
Assembly begins for 2 edges of the assembly graph.
2020-Apr-16 22:48:03.540986 Assembled a total 6907818 bases for 2 assembly graph edges of which 1 where assembled.
The assembly graph has 2 vertices and 2 edges of which 1 were assembled.
Total length of assembled sequence is 6907818
N50 for assembly segments is 6907818
...
2020-Apr-16 22:48:03.545419 writeGfa1 begins
2020-Apr-16 22:48:03.911467 writeGfa1 ends
2020-Apr-16 22:48:03.915309 writeGfa1BothStrands begins
2020-Apr-16 22:48:04.596424 writeGfa1BothStrands ends
2020-Apr-16 22:48:04.600124 writeFasta begins
2020-Apr-16 22:48:04.962609 writeFasta ends
Assembly time statistics:
Elapsed seconds: 648.023
Elapsed minutes: 10.8004
Elapsed hours: 0.180006
Average CPU utilization: 0.211944
Shasta Release 0.4.0

Did this run abort before finishing or run to completion?

Ambiguous runtime error; error during read

I'm attempting to run a human genome assembly using Shasta on MacOS with the default parameters. However, I quickly run into the error below. My FASTA file has one-line sequences and no blank lines, so I know that's not the issue. Any help is appreciated! Thanks!

Shasta Release 0.1.0

2019-Aug-21 08:37:31.773937 

This is the static executable for the Shasta assembler. It provides limited Shasta functionality at but has no dependencies and requires no installation.


Default values of assembly parameters are optimized for an assembly at coverage 60x. If your data have significantly different coverage, some changes in assembly parameters may be necessary to get good results.


For more information about the Shasta assembler, see

https://github.com/chanzuckerberg/shasta


Complete documentation for the latest version of Shasta is available here:

https://chanzuckerberg.github.io/shasta


Options in use:

Input FASTA files: linearized.fasta 

outputDirectory = ShastaRun

[Reads]

minReadLength = 10000

palindromicReads.maxSkip = 100

palindromicReads.maxMarkerFrequency = 10

palindromicReads.alignedFractionThreshold = 0.1

palindromicReads.nearDiagonalFractionThreshold = 0.1

palindromicReads.deltaThreshold = 100


[Kmers]

k = 10

probability = 0.1


[MinHash]

m = 4

hashFraction = 0.01

minHashIterationCount = 10

maxBucketSize = 10

minFrequency = 2


[Align]

maxSkip = 30

maxMarkerFrequency = 10

minAlignedMarkerCount = 100

maxTrim = 30


[ReadGraph]

maxAlignmentCount = 6

minComponentSize = 100

maxChimericReadDistance = 2


[MarkerGraph]

minCoverage = 10

maxCoverage = 100

lowCoverageThreshold = 0

highCoverageThreshold = 256

maxDistance = 30

edgeMarkerSkipThreshold = 100

pruneIterationCount = 6

simplifyMaxLength = 10,100,1000


[Assembly]

markerGraphEdgeLengthThresholdForConsensus = 1000

consensusCaller = SimpleConsensusCaller

useMarginPhase = False

storeCoverageData = False


Shasta Release 0.1.0

2019-Aug-21 08:37:31.775828 Loading reads from /Users/Desktop/linearized.fasta.

Input file block size: 2147483648 bytes.

Using 1 threads for reading and 36 threads for processing.

Input file size is 11368700965 bytes.

2019-Aug-21 08:37:31.807769 Reading block 0 2147483648, 2147483648 bytes.

2019-Aug-21 08:37:36.502823 Terminated after catching a runtime error exception:

Error during read.
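
The format claim in the post (one-line sequences, no blank lines) can be double-checked with a small script. This is a minimal sketch: the function name `first_format_problem` is hypothetical, and the rules it enforces are the ones stated in the post, not a specification taken from the Shasta documentation. It reads the file line by line, so it should also work on an 11 GB input.

```python
# Hypothetical checker for the format described in the post: each record is one
# header line starting with ">" followed by exactly one sequence line, and the
# file contains no blank lines.
from typing import Iterable, Optional


def first_format_problem(lines: Iterable[str]) -> Optional[str]:
    """Return a description of the first format problem, or None if none is found."""
    expect_header = True
    line_number = 0
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line == "":
            return f"line {line_number}: blank line"
        if expect_header:
            if not line.startswith(">"):
                return f"line {line_number}: expected a header starting with '>'"
        elif line.startswith(">"):
            return f"line {line_number}: header without a preceding sequence line"
        # Headers and sequence lines must strictly alternate.
        expect_header = not expect_header
    if line_number == 0:
        return "file is empty"
    if not expect_header:
        return f"line {line_number}: last header has no sequence line"
    return None
```

A file object can be passed directly, for example `with open("linearized.fasta") as f: print(first_format_problem(f))`. A result of `None` only rules out these particular format problems; it does not explain the "Error during read" message.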

thread setting is not respected during flagCrossStrandReadGraphEdges

I find that in this phase, shasta tries to use all the threads on the system. I've set both 16 and 32 threads and in each case it attempts to use all available cores (the system has 64, but other users are running jobs so I don't want to use them all).

SimpleBayesianConsensusCaller cleanup

Only the following low priority clean up tasks in the SimpleBayesianConsensusCaller code remain:

Harmonize naming conventions with the rest of the Shasta code.
Use braces even for single-line code blocks in if/else (see the rest of the Shasta code for examples).

Replaces issue 7, which has changed topic.

Assertion failed: readCount > 0 at void ChanZuckerberg::shasta::Assembler::findAlignmentCandidatesLowHash(std::size_t, double, std::size_t, std::size_t, std::size_t, std::size_t, std::size_t) in /home/paolo/shasta/src

Can you help figure out this issue?

Extremely low CPU utilization

Dear shasta maintainers,

I have shasta (binary release, didn't compile it) running on an AWS instance on ONT data. Although the server has 96 cores, the CPU utilization is extremely low: it hovers at 0% for LONG periods of time, occasionally uses 1-10%, and I've seen it go up to 100% once. The memory usage is pretty high (hundreds of gigs), but the CPUs just aren't doing anything. I didn't see any specific parameters to tell it to run multi-threaded. Any ideas?

0.2.0 executable does not work on some old Linux kernels

The Shasta 0.2.0 executable does not work on some old Linux kernels on which 0.1.0 was fine.
The reason is the following:

0.1.0 was built on Ubuntu 16.04 which has glibc 2.23, which supports kernel versions 2.6.32 and newer.
0.2.0 was built on Ubuntu 18.04 which has glibc 2.27, which supports kernel versions 3.2 and newer.

For future releases, consider releasing two executables:

shasta-Linux built on Ubuntu 18.04.
shasta-OldLinux built on Ubuntu 16.04.
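
The kernel requirement described above can be checked before downloading an executable. The following is a sketch only: the minimum versions are copied from the issue text, while the function names and the `MIN_KERNEL` table are made up for illustration.

```python
# Minimum kernel versions as stated in the issue text above.
import re
from typing import Tuple

MIN_KERNEL = {
    "0.1.0": (2, 6, 32),  # built on Ubuntu 16.04, glibc 2.23
    "0.2.0": (3, 2, 0),   # built on Ubuntu 18.04, glibc 2.27
}


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '4.18.0-80.el8.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def kernel_supports(release: str, shasta_version: str) -> bool:
    """Return True if the kernel release meets the minimum for that executable."""
    return parse_kernel_release(release) >= MIN_KERNEL[shasta_version]
```

On the machine in question, the release string can be obtained with `platform.release()` from the Python standard library, for example `kernel_supports(platform.release(), "0.2.0")`.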

Unfair comparisons in the 'shasta' paper ?

Hello,
in your paper "Efficient de novo assembly of eleven human genomes using PromethION sequencing and a novel nanopore toolkit", you mention that the Canu assembly of CHM13 is 85.8M and compare this result to those generated by other tools using only ONT reads.

This is unfair, since the 85.8M assembly was generated using many other approaches:

"The current assembly draft (v0.6) is generated with Canu v1.7.1 including rel1 data up to 2018/11/15 and incorporating the previously released PacBio data. Two gaps on the X plus the centromere were manually resolved. Contigs with low coverage support were split and the assembly was scaffolded with BioNano. The assembly was polished with two rounds of nanopolish and two rounds of arrow. The X polishing was done using unique markers matched between the assembly and the raw read data, the rest of the genome used traditional polishing. Finally, the assembly was polished with 10X Genomics data. We validated the assembly using independent BACs. The overall QV is Q37 (Q42 in unique regions) and the assembly resolves over 80% of the bacs (280/341).

The assembly is 2.94 Gbp in size with 503 scaffolds (593 contigs) and an NG50 of 83 Mbp (70 Mbp)"

https://github.com/nanopore-wgs-consortium/CHM13

Best

ld ERROR when compiled

Hello, when I compiled from source, some errors occurred and the compilation aborted.
Here are the messages on STDERR:

[ 49%] Built target shastaStaticLibrary
[ 50%] Linking CXX executable shasta
/usr/bin/ld: cannot find -latomic
/usr/bin/ld: cannot find -lboost_system
/usr/bin/ld: cannot find -lboost_program_options
/usr/bin/ld: cannot find -lboost_chrono
/usr/bin/ld: cannot find -lpng
/usr/bin/ld: cannot find -lz
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lstdc++
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
make[2]: *** [staticExecutable/CMakeFiles/shastaStaticExecutable.dir/build.make:85:staticExecutable/shasta] Error 1
make[1]: *** [CMakeFiles/Makefile2:146:staticExecutable/CMakeFiles/shastaStaticExecutable.dir/all] Error 2
make: *** [Makefile:130: all] Error 2

After checking, I am sure that all the needed library files are installed in /lib64, and I cannot change the value of LD_LIBRARY_PATH, possibly because of how the administrator has set things up. How can I solve this problem?

The system information for the computer is:
Linux version 4.18.0-80.11.2.el8_0.x86_64 ([email protected]) (gcc version 8.2.1 20180905 (Red Hat 8.2.1-3) (GCC)) #1 SMP Tue Sep 24 11:32:19 UTC 2019
with CentOS Linux release 8.0.1905 (Core)
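
The failing target is `staticExecutable`, so one possibility worth checking (an assumption, not something confirmed in this thread) is that the link is static, in which case `ld` looks for `lib<name>.a` archives and ignores the shared `.so` files that may be present in /lib64. The sketch below lists which static archives exist in a set of directories. The function name and the directory list are hypothetical; some archives, such as `libstdc++.a`, often live in a compiler-specific directory that would need to be added.

```python
# Hypothetical diagnostic, assuming the link is static so ld needs lib<name>.a.
from pathlib import Path
from typing import Dict, Iterable, Optional

# Library names taken from the "cannot find -l..." lines above.
LIBRARIES = ("atomic", "boost_system", "boost_program_options", "boost_chrono",
             "png", "z", "pthread", "stdc++", "m", "c")


def find_static_archives(names: Iterable[str],
                         directories: Iterable[str]) -> Dict[str, Optional[Path]]:
    """Map each library name to the first lib<name>.a found, or None."""
    dirs = [Path(d) for d in directories]
    found: Dict[str, Optional[Path]] = {}
    for name in names:
        candidates = (d / f"lib{name}.a" for d in dirs)
        found[name] = next((p for p in candidates if p.is_file()), None)
    return found
```

For example, `find_static_archives(LIBRARIES, ["/lib64", "/usr/lib64"])` returns `None` for every library whose static archive is missing from those two directories. This check does not involve LD_LIBRARY_PATH, which affects the loader at run time rather than the search done by `ld` at link time.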

errors for 1g genome assembly

Hi, I used the following command:
shasta-Linux-0.1.0 --input /filt_adapter.fasta --Reads.minReadLength 20000

and got the following error messages:

Processing 2147475201 input characters.
Block processed in 2.08425 s.
Reads for this block stored in 2.01474 s.
2019-May-24 15:05:39.949279 Reading block 109521666048 111669149696, 2147483648 bytes.
Block read in 8.90425 s at 2.41175e+08 bytes/s.
Processing 2147472802 input characters.
Block processed in 2.35328 s.
2019-May-24 15:06:19.564782 Terminated after catching a runtime error exception:
Error 14 during mremap call for MemoryMapped::Vector: Bad address

Best,
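
To see how far into the input the run had progressed, the numbers in the log excerpt can be worked through directly. This sketch assumes that the two numbers after "Reading block" are the begin and end byte offsets of the block, which matches the reported block size of 2147483648 bytes but is not stated in the log; the helper names are made up for illustration.

```python
# Values copied from the log excerpt above.
BLOCK_SIZE = 2147483648  # bytes per input block (2**31)


def block_index(begin_offset: int, block_size: int = BLOCK_SIZE) -> int:
    """Zero-based index of the block that starts at begin_offset."""
    if begin_offset % block_size != 0:
        raise ValueError("offset is not aligned to the block size")
    return begin_offset // block_size


def read_rate(bytes_read: int, seconds: float) -> float:
    """Average read throughput in bytes per second."""
    return bytes_read / seconds
```

Under that assumption, `block_index(109521666048)` is 51, so 51 full blocks (102 GiB) precede the last block reported, and `read_rate(BLOCK_SIZE, 8.90425)` agrees with the logged 2.41175e+08 bytes/s. The mremap error therefore appeared after roughly 104 GiB of input had been read.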
