ibest / arc
Assembly by Reduced Complexity (ARC)
License: Apache License 2.0
Allow one to install into bin and run from anywhere
Indexing is very slow. Currently only one file is indexed at any given time (limiting ARC to using only a single processor during indexing). Further tests need to be done to determine whether indexing multiple files at the same time will overwhelm disk I/O and/or result in overall improvements to indexing speed.
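One way to test whether concurrent indexing helps would be a small process pool; a minimal sketch (index_sample is a hypothetical stand-in for ARC's per-sample SeqIO.index_db() call, not its actual API):

```python
from multiprocessing import Pool

def index_sample(sample):
    # Hypothetical stand-in for ARC's per-sample indexing step; the real
    # work is a SeqIO.index_db() call over that sample's read files.
    return "%s.idx" % sample

def index_all(samples, procs=2, pool_cls=Pool):
    # Index several samples concurrently instead of one at a time, then
    # benchmark against the serial loop to see whether disk I/O saturates
    # before the extra processors help.
    pool = pool_cls(procs)
    try:
        return pool.map(index_sample, samples)
    finally:
        pool.close()
        pool.join()
```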
Ideas:
Bowtie2 is being run in default mapping mode, which means it only reports one hit. Add the ability to report multiple hits (the -k switch).
This has been implemented and is in need of testing.
In some cases there are no reads mapped at all for an entire sample. When this happens, the Sample should be treated as finished and no further jobs added to the queue.
This is implemented, but hasn't been tested.
Blat and bowtie2 appear to support gzipped files, this should be trivial to implement.
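A transparent-open helper would cover the ARC side (indexing/splitting); a hedged sketch, keying off the file extension:

```python
import gzip

def open_reads(path):
    # Open a reads file transparently whether or not it is gzipped. Blat
    # and bowtie2 appear to accept .gz input directly, so a helper like
    # this would only be needed where ARC itself reads the files.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```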
In the finished_sample folders, create a folder for each sample, in this folder keep a reads and assembly folder for each target. This is the final set of reads and final set of assembly folders.
Due to a hack used to stop ARC from recruiting reads which were already incorporated into a target, there is a bug where ARC will refuse to recruit reads on the first iteration if the targets are named in a certain way.
The code for this is in the mapper:
if len(target.split(":")) == 3:
    target, status = target.split(":")[1:]
    # This keeps ARC from writing reads which mapped to finished contigs
    if status.startswith("Contig") or status.startswith("isogroup"):
        continue
ARC does not fail to install when Biopython is missing; the install should abort if required dependencies are absent.
In some cases it is useful to have quals for assembled contigs. These are available from Newbler (determine whether Spades produces them).
With the recent implementation of repeat masking, it is now possible to get a contig which looks like this:
masked
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
When bowtie2-build is called on a file containing a contig like this, it will crash:
*** glibc detected *** bowtie2-build: double free or corruption (out): 0x000000000481b210 ***
Solution: Avoid writing out any contigs which are all 'n' (these won't recruit reads anyway).
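A guard at the point where contigs are written out could look like this (write_contig is a hypothetical helper, not ARC's actual writer):

```python
import io

def write_contig(outf, name, seq):
    # Skip contigs that are entirely masked ('n'/'N'); they can't recruit
    # reads anyway and crash bowtie2-build with a glibc double-free.
    if set(seq.lower()) == {"n"}:
        return False
    outf.write(">%s\n%s\n" % (name, seq))
    return True
```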
Often times it is advantageous to be able to restart ARC, either because it became obvious that a few more iterations were necessary, or because a different set of targets could be used with the same reads.
For very large projects, it can take minutes/hours to index the massive reads files, so rather than do this, check whether working_dir and idx files exist already, and don't create them if they do.
Currently these folders/files are not deleted when ARC exits (it is up to the user to clean them up), so this change should be easy to implement.
It might be possible to reduce space requirements when using Bowtie2 by only writing out mapped reads (the --no-unal flag). Before this is done it is necessary to double-check that when only one member of a pair maps, both reads of the pair are still written.
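Whether pairs survive could be verified by scanning the SAM flags in the output; a small sketch using the flag bits defined in the SAM specification:

```python
def half_mapped_pairs(flags):
    # SAM flag bits (per the SAM spec): 0x1 = paired, 0x4 = this read
    # unmapped, 0x8 = mate unmapped. Count records where this read is
    # unmapped but its mate mapped; if --no-unal drops these records,
    # ARC would lose one member of recruited pairs.
    return sum(1 for f in flags
               if (f & 0x1) and (f & 0x4) and not (f & 0x8))
```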
In the future, using a pipe instead of a file to get output from bowtie2 into the parser would be an even better option. This would require re-writing the mapper + splitter however, and it doesn't appear that Blat can output to stdout. So some other strategy will need to be developed (e.g. creating another Blat patch to enable output to stdout).
In some cases it may be useful to allow ARC to run for a few more iterations or to resume a run which was terminated for some reason. This could be controlled by adding something to ARC_config.txt (e.g. restart = True).
Do something like the following:
Hi there,
Thanks for writing and making ARC available - I am beginning to use it and finding it VERY useful. I have an enhancement suggestion I'd love you to consider, if it's easy to implement. If it's not easy, not to worry!
I've been using ARC on some datasets I downloaded from NCBI's SRA using their SRA toolkit. Single-end data was working well in ARC, but I was having some trouble with paired-end data until I realized that read naming in those SRA files is not like standard Illumina format. I cooked up a script that fixes the read names to be suitable for ARC, but it'd be nice to avoid that step if possible.
When I download paired-end reads from SRA, pairs of reads get named like this:
SRR505874.1.1 and SRR505874.1.2 (first pair in the set)
SRR505874.51.1 and SRR505874.51.2 (51st pair in the set)
etc
ARC doesn't seem to like the .1 and .2 extensions (understandable), but if I fix the read names so that both members of the pair have the same names (i.e. SRR505874.1 for the first pair, and SRR505874.51 for the 51st pair) then ARC works fine. It'd be nice to be able to specify some option to ARC so that it strips off the .1/.2 extensions from the names itself. I've pasted at the bottom the error ARC gives when I leave the .1/.2 extensions on.
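The requested stripping step could be sketched like this (a hypothetical helper, not an existing ARC option):

```python
import re

# Strip a trailing ".1" or ".2" mate suffix from an SRA-style read name,
# e.g. "SRR505874.51.1" -> "SRR505874.51", so that both members of a pair
# share the same ID as ARC expects. Assumes input names carry the suffix;
# names without it are returned unchanged.
_MATE_SUFFIX = re.compile(r"\.[12]$")

def strip_mate_suffix(read_id):
    return _MATE_SUFFIX.sub("", read_id)
```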
If you want some test datasets, please let me know and I should be able to cook something up.
all the best,
Janet Young
Dr. Janet Young
Malik lab
http://research.fhcrc.org/malik/en.html
Fred Hutchinson Cancer Research Center
1100 Fairview Avenue N., A2-025,
P.O. Box 19024, Seattle, WA 98109-1024, USA.
tel: (206) 667 4512
email: jayoung ...at... fhcrc.org
ARC Version: v1.1.3 2014-09-02
[2015-04-20 18:32:19,484 INFO 20261] Reading config file...
[2015-04-20 18:32:19,608 INFO 20261] max_incorporation not specified in ARC_config.txt, defaulting to 10
[2015-04-20 18:32:19,608 INFO 20261] workingdirectory not specified in ARC_config.txt, defaulting to ./
[2015-04-20 18:32:19,608 INFO 20261] fastmap not specified in ARC_config.txt, defaulting to False
[2015-04-20 18:32:19,608 INFO 20261] keepassemblies not specified in ARC_config.txt, defaulting to False
[2015-04-20 18:32:20,111 INFO 20261] Setting up working directories and building indexes...
/home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_1.fastq
/home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_2.fastq
[2015-04-20 20:21:46,540 INFO 20261] Sample: T_malaccensis, indexed reads in 6566.36648893 seconds.
[2015-04-20 20:21:48,152 INFO 20261] allocating a new mmap of length 4096
[2015-04-20 20:21:48,153 INFO 20261] Running ARC.
[2015-04-20 20:21:48,153 INFO 20261] Submitting initial mapping runs.
[2015-04-20 20:21:48,153 INFO 20261] Starting...
[2015-04-20 20:21:48,156 INFO 12215] child process calling self.run()
[2015-04-20 20:21:48,156 INFO 12216] child process calling self.run()
[2015-04-20 20:21:48,158 INFO 12217] child process calling self.run()
[2015-04-20 20:21:48,158 INFO 12215] Sample: T_malaccensis Running bowtie2.
[2015-04-20 20:21:48,159 INFO 12218] child process calling self.run()
[2015-04-20 20:21:48,201 INFO 12215] Sample: T_malaccensis Calling bowtie2-build.
[2015-04-20 20:21:48,201 INFO 12215] bowtie2-build -f /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/I000_contigs.fasta /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/idx/idx
[2015-04-20 20:21:53,044 INFO 12215] Sample: T_malaccensis Calling bowtie2 mapper
[2015-04-20 20:21:53,044 INFO 12215] bowtie2 -I 0 -X 1500 --very-fast-local --mp 12 --rdg 12,6 --rfg 12,6 -p 4 -x /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/idx/idx -k 3 -1 /home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_1.fastq -2 /home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_2.fastq -S /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/mapping.sam
[2015-04-20 22:38:52,969 INFO 12215] Sample: T_malaccensis, Processed 126931493 lines from SAM in 6139.76893306 seconds.
[2015-04-20 22:38:54,263 INFO 12215] Sample: T_malaccensis Running splitreads.
[2015-04-20 22:52:42,095 ERROR 12215] Traceback (most recent call last):
File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/process_runner.py", line 62, in run
self.launch()
File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/process_runner.py", line 43, in launch
job.runner()
File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/base.py", line 58, in runner
self.start()
File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/mapper.py", line 55, in start
self.splitreads()
File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/mapper.py", line 392, in splitreads
read2 = idx_PE2[readID]
File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/biopython-1.60-py2.7-linux-x86_64.egg/Bio/SeqIO/_index.py", line 423, in __getitem__
if not row: raise KeyError
KeyError
[2015-04-20 22:52:42,095 ERROR 12215] An unhandled exception occurred
[2015-04-20 22:52:42,095 ERROR 20261] Terminating processes
[2015-04-20 22:52:42,166 ERROR 20261] ARC.app unexpectedly terminated
[2015-04-20 22:52:42,167 INFO 20261] process shutting down
Alessandro asks:
"One option which could be really useful in ARC config file is to disable the hammer steps during spades assembler (read error correction, option --only-assembler) for each iteration."
It would probably be good to only disable the read error correction on all but the final assembly, this way the final assembly would (in theory) have been generated from error corrected reads.
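A sketch of how the command construction might look (spades_cmd and last_assembly are assumed names; --only-assembler is SPAdes's real flag for skipping read error correction):

```python
def spades_cmd(pe1, pe2, out_dir, last_assembly=False):
    # Skip BayesHammer read error correction (--only-assembler) on
    # intermediate iterations, but let the final assembly run with
    # error-corrected reads as suggested above.
    cmd = ["spades.py", "-t", "1", "-1", pe1, "-2", pe2, "-o", out_dir]
    if not last_assembly:
        cmd.append("--only-assembler")
    return cmd
```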
I'm running into an issue where I get a python KeyError for targets that fulfill the following conditions:
1) Target assembly fails in iteration 1; then "Writing reads as contigs."
2) Target assembly is killed (SPAdes times out) in iteration 2; then "Writing contigs from previous iteration."
3) Target recruits fewer reads in iteration 3 than in iteration 2, and target assembly fails in iteration 3.
evan@maven:/mnt/Data1/Gary/ARC$ grep Contig40947 ARC--try2.log
[2015-02-05 08:54:19,745 INFO 21734] Sample: ALL5 target: 36948|Contig40947 iteration: 1 Split 180 reads in 0.0773031711578 seconds
[2015-02-05 08:54:56,345 INFO 21744] Sample: ALL5 target: 36948|Contig40947 iteration: 1 Assembly failed after 7.53394508362 seconds
[2015-02-05 08:58:46,348 INFO 21761] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-05 08:58:46,348 INFO 21761] Sample: ALL5 target: 36948|Contig40947 iteration: 1 Assembly reports status: assembly_failed.
[2015-02-05 08:58:46,348 INFO 21761] Sample ALL5 target 36948|Contig40947: Writing reads as contigs.
[2015-02-05 10:47:54,027 INFO 21740] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Split 411178 reads in 240.437613964 seconds
[2015-02-05 12:56:23,086 WARNING 21751] Sample: ALL5 target: 36948|Contig40947 Assembly killed after 7200.12375808 seconds.
[2015-02-05 12:56:23,109 INFO 21751] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Assembly killed after 7200.14657593 seconds
[2015-02-06 11:39:30,387 INFO 21740] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-06 11:39:30,387 INFO 21740] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Assembly reports status: assembly_killed.
[2015-02-06 11:40:16,442 INFO 21740] Sample: ALL5 target: 36948|Contig40947 iteration: 2 Writing contigs from previous iteration.
[2015-02-08 10:39:37,060 INFO 21736] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Setting last_assembly to True
[2015-02-08 10:39:37,060 INFO 21736] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Split 2511 reads in 2.27638602257 seconds
[2015-02-08 10:40:18,630 INFO 21753] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Assembly failed after 41.5684621334 seconds
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Assembly reports status: assembly_failed.
[2015-02-09 08:02:30,358 INFO 21752] Sample ALL5 target 36948|Contig40947 did not incorporate any more reads, no more mapping will be done
KeyError: '36948|Contig40947'
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 finishing target..
[2015-02-09 08:02:30,358 INFO 21752] Sample: ALL5 target: 36948|Contig40947 iteration: 3 Assembly reports status: assembly_failed.
[2015-02-09 08:02:30,358 INFO 21752] Sample ALL5 target 36948|Contig40947 did not incorporate any more reads, no more mapping will be done
[2015-02-09 08:02:30,584 ERROR 21752] Traceback (most recent call last):
File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/process_runner.py", line 62, in run
self.launch()
File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/process_runner.py", line 43, in launch
job.runner()
File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/runners/base.py", line 58, in runner
self.start()
File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/runners/finisher.py", line 135, in start
self.write_target(target, target_folder, outf=fin_outf, finished=True)
File "/home/evan/anaconda/lib/python2.7/site-packages/ARC/runners/finisher.py", line 291, in write_target
targetLength=self.params['summary_stats'][target]['targetLength'],
KeyError: '36948|Contig40947'
[2015-02-09 08:02:30,584 ERROR 21752] An unhandled exception occurred
[2015-02-09 08:02:30,585 ERROR 21684] Terminating processes
[2015-02-09 08:02:30,728 ERROR 21684] ARC.app unexpectedly terminated
[2015-02-09 08:02:30,732 INFO 21684] process shutting down
Any thoughts? Thanks very much!
Evan
Instead of passing a list of certain params from class to class, pass all params (and delete those which aren't necessary, e.g. read_dict for the assemblers). This will make future additions/enhancements to the code much easier to implement and also require less maintenance.
Currently ARC will install even if dependencies are not satisfied:
running install
running build
running build_py
running build_scripts
running install_lib
running install_scripts
changing mode of /usr/bin/ARC to 775
running install_egg_info
Removing /usr/lib/python2.6/site-packages/ARC-1.1.0-py2.6.egg-info
Writing /usr/lib/python2.6/site-packages/ARC-1.1.0-py2.6.egg-info
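A pre-flight check run at the top of setup.py could abort instead; a minimal sketch (the REQUIRED list is an assumption, only Biopython is confirmed as required here):

```python
import importlib
import sys

REQUIRED = ["Bio"]  # Biopython; extend as needed

def missing_dependencies(required=REQUIRED):
    # Return the subset of required modules that cannot be imported.
    missing = []
    for name in required:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# In setup.py, before calling setup():
# if missing_dependencies():
#     sys.exit("ERROR: missing required modules: %s" % missing_dependencies())
```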
Currently ARC can be run in a folder where it has been run before and it will skip re-indexing reads. If ARC previously crashed or was terminated prematurely it won't have cleaned up a variety of intermediary files however (bowtie idx, assembly folder, etc).
If ARC detects that it is being re-run clean up all of these files so that it will run successfully.
If a user kills ARC during read indexing (e.g. with ctrl+c), the read index (a SQLite database) will contain only a subset of the reads in the fastq file. At this point the user should delete the partial indexes before running ARC again. However, if ARC is run a second time without deleting the partial indexes, it improperly detects that the fastq has already been indexed and skips re-indexing to save time. When read splitting takes place, reads exist in the fastq which are not in the index, causing a crash.
This problem has come up frequently enough that it would be nice to fix it. One option that immediately comes to mind is to modify app.py so that each indexing step is wrapped in a try/except/finally block which explicitly deletes the index file.
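A hedged sketch of that wrapper (index_fn stands in for the actual SeqIO.index_db() call; the cleanup also has to catch KeyboardInterrupt, which is a BaseException):

```python
import os

def build_index(fastq, idx_path, index_fn):
    # Wrap an indexing step so that an interrupted run (ctrl+c, crash)
    # never leaves a partial SQLite index behind for the next run to
    # mistake for a complete one.
    try:
        return index_fn(idx_path, fastq)
    except BaseException:
        # Covers KeyboardInterrupt as well as ordinary exceptions.
        if os.path.exists(idx_path):
            os.remove(idx_path)
        raise
```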
Currently the "urt" (use read tips) switch for Newbler can be enabled so that it is used during all assemblies except for the last one. This doesn't function properly however if assemblies are terminated early because no more reads will map. In this case the final assembly will be made with the "urt" switch enabled, causing the final contigs to have low-quality tips.
Currently when an assembly is killed for a target, any contigs assembled in a previous iteration for that target will not be written out to the finished folder. It would instead be nice to get the contigs assembled at a previous iteration.
Alternatively mapping coverage could be calculated at each step and used to mask bases/contigs which have higher than expected coverage (indicating repeats) however these calculations would likely be costly and slow ARC down.
Modify map_against_reads behavior so that it includes the contigs AND reads on the 2nd iteration. In other words, it does mapping, attempts to do assemblies, and then includes contigs if any, as well as all reads (assembled or not) as targets for the second round of mapping.
Add in Stopping criteria, target ends when no additional progress is being made.
Build a table during the final finishing stage (or write to it at each iteration) which is formatted like:
iteration, target1, target2, ..., targetn
1, 12, 15, ..., N
and contains counts for the number of reads mapped at that iteration.
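Appending a row per iteration with the csv module would be straightforward; a sketch (the counts_by_iteration structure is assumed, not ARC's internal representation):

```python
import csv

def write_iteration_counts(path, targets, counts_by_iteration):
    # counts_by_iteration: one dict per iteration mapping target -> number
    # of reads mapped at that iteration; missing targets are recorded as 0.
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["iteration"] + list(targets))
        for i, counts in enumerate(counts_by_iteration, start=1):
            w.writerow([i] + [counts.get(t, 0) for t in targets])
```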
It would be nice to get a final set of summary tables or datasets with details like:
Sample | Target | Status | Iteration | Reads | Contigs | ContigLength |
---|---|---|---|---|---|---|
S1 | T1 | Finished | 5 | 2300 | 1 | 2000 |
S1 | T2 | NoContigs | 1 | 5 | 0 | 0 |
S1 | T3 | Killed | 3 | 15232 | 12 | 9400 |
S1 | T4 | Repeat | 6 | 12000 | 11 | 15000 |
... | ... | ... | ... | ... | ... | ... |
SN | TM | Finished | 12 | 300 | 1 | 1500 |
This would make it much easier to generate a final set of summary statistics, and facilitate many other comparisons without the need for any log-file parsing.
In config.py, line 203 (if not (pe or se)) there is a small logic error which can cause ARC to fail to identify incorrectly formatted config files which contain a SE and one of two PE files. This also causes a weird edge case when the config file is formatted with comment characters in front of the column headers e.g.
Sample1 ./reads/Sample1_R1.fasta PE1
instead of:
Sample_ID FileName FileType
Sample1 ./reads/Sample1_R1.fasta PE1
Fix this logic.
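A corrected check might look like this (validate_file_types is a hypothetical helper; the point is that PE1/PE2 must come as a pair, which a bare `if not (pe or se)` misses):

```python
def validate_file_types(file_types):
    # file_types: the set of FileType column values seen for one sample,
    # e.g. {"PE1", "PE2"}. A sample is valid only with a complete PE pair
    # and/or an SE file; "SE plus only one of PE1/PE2" must be rejected.
    pe1 = "PE1" in file_types
    pe2 = "PE2" in file_types
    se = "SE" in file_types
    if pe1 != pe2:
        return False  # half a PE pair is always a config error
    return (pe1 and pe2) or se
```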
Finished contigs output file is being overwritten each time, so the final set of contigs is incomplete.
Bowtie2 does not work with fasta files as input; in this case we must either use Blat, or produce bogus qualities and generate a fastq. Should check, however, that if format=fasta is set, Blat is being used.
Because of the "sloppy mapping" approach, targets sometimes pull in a few repetitive regions which then pull in a few more etc, causing big problems for assembly speed and slowing down the whole process. Currently this is partially handled by repeat detection and removal based on % difference in read incorporation from iteration to iteration.
Some alternative, smarter approaches to dealing with this might include:
The final set of reads may be necessary for mapping & variant discovery as well as a number of other analysis steps. Instead of discarding the last set of reads, keep them (maybe gzipped) in the final_Sample folder.
Update the install requirements to include Python modules which are required by ARC but are not part of the standard install. Biopython for sure, but what about subprocess and logger?
It turns out that the original version of the log handler wasn't thread-safe and was either garbling entries, or failing to write them entirely. A modified version is now in testing which should fix this.
This will involve re-writing the index_db functionality in Biopython.
Currently a SQLite database is being generated by SeqIO.index_db() to provide fast random access to reads in the fasta/fastq input files. If this database were modified to include an additional column indicating the target against which the read has been mapped in previous iterations, only newly mapped reads would need to be split out on each iteration. This would drastically improve speed because lookups in the SQLite database are very quick. The current major bottleneck in ARC is the splitting step, so it is the obvious target for further optimization.
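A minimal sketch of the idea (this is not Biopython's actual index_db schema; the table and column names here are invented for illustration):

```python
import sqlite3

# Toy version of a read index carrying an extra "target" column recording
# which target each read was last split out to.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (name TEXT PRIMARY KEY, target TEXT)")
conn.execute("INSERT INTO reads VALUES ('read1', NULL)")

def needs_split(conn, name, target):
    # Split a read out only if it exists in the index and wasn't already
    # assigned to this target on a previous iteration.
    row = conn.execute("SELECT target FROM reads WHERE name=?", (name,)).fetchone()
    if row is None or row[0] == target:
        return False
    conn.execute("UPDATE reads SET target=? WHERE name=?", (target, name))
    return True
```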
When very large datasets are used (a full HiSeq lane, for example), ARC is incredibly slow at splitting reads, e.g.:
[2013-06-20 11:35:09,603 INFO 21595] Split 3 reads for sample Sample1 target HWI-ST522_0060:7:2108:11503:138410#0/1_Cluster-3254_M072 in 3139.90650702 seconds
It might be necessary to re-think the current indexing scheme, perhaps going back to a simpler approach where the splitter runs through the whole file, pulling out every read that was hit and either writing it to memory or a temporary folder on the disk. This would make it so that all assemblies couldn't be kicked off until all reads had been processed.
Alternatively, we could dump support for BLAT and pull the reads directly from the SAM file, making it unnecessary to go to the original reads files entirely. This would also require that the entire SAM file was parsed before any assemblies could be started.
As part of the read/coverage tracking code, only include the PE/SE files in the call to the assembler if they actually have reads in them.
Maintain information about the length of contigs and calculated expected coverage, using this, add a -e switch to Newbler (Spades doesn't seem to support expected coverage, but research this further).
Hi there,
Excellent work so far on ARC - I'm really looking forward to using this! In playing around with the approach, I've run into an issue...
When assembling with SPAdes (v.2.4.0), there appears to be a problem where SPAdes/ARC choke on very low-coverage contigs. The console log from ARC is:
[2013-07-02 14:57:33,035 INFO 6413] Reading config file...
[2013-07-02 14:57:33,043 INFO 6413] Setting up working directories and building indexes...
[2013-07-02 14:58:50,929 INFO 6413] Setting up multiprocessing...
[2013-07-02 14:58:50,929 INFO 6413] Starting...
[2013-07-02 14:58:51,465 INFO 6420] Running bowtie2 for Sample1
[2013-07-02 14:58:51,471 INFO 6420] Calling bowtie2-build for sample: Sample1
[2013-07-02 14:58:51,472 INFO 6420] bowtie2-build -f /path/to/uce-probes-arc-format.fasta /path/to/working_Sample1/idx/idx
[2013-07-02 14:58:52,066 INFO 6420] Calling bowtie2 for sample: Sample1
[2013-07-02 14:58:52,066 INFO 6420] nice -n 19 bowtie2 -I 0 -X 1500 --local -p 12 -x /path/to/working_Sample1/idx/idx -1 /path/to/reads/gallus-gallus-READ1.fastq -2 /path/to/reads/gallus-gallus-READ2.fastq -U /path/to/reads/gallus-gallus-READ-singleton.fastq -S /path/to/working_Sample1/mapping.sam
[2013-07-02 14:59:25,197 INFO 6420] Sample: Sample1, Processed 2643335 lines in 5.86183905602 seconds.
[2013-07-02 14:59:25,637 INFO 6420] Running splitreads for Sample1
[2013-07-02 14:59:28,473 INFO 6420] Split 84 reads for sample Sample1 target uce1119 in 2.83518695831 seconds
[2013-07-02 14:59:29,083 INFO 6421] Running Spades for sample: Sample1 target: uce1119
[2013-07-02 14:59:29,094 INFO 6421] Calling spades for sample: Sample1 target uce1119
[2013-07-02 14:59:29,094 INFO 6421] spades.py -t 1 -1 /path/to/working_Sample1/t__001379/PE1.fastq -2 /path/to/working_Sample1/t__001379/PE2.fastq -s /path/to/working_Sample1/t__001379/SE.fastq -o /path/to/working_Sample1/t__001379/assembly
[2013-07-02 14:59:30,401 INFO 6420] Split 4 reads for sample Sample1 target uce4473 in 4.76373004913 seconds
[2013-07-02 14:59:30,554 INFO 6423] Running Spades for sample: Sample1 target: uce4473
[2013-07-02 14:59:30,555 ERROR 6423] An unhandled exception occured
Process ProcessRunner-4:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/bcf/git/ARC/bin/../ARC/process_runner.py", line 66, in run
raise e
KeyError: 'assembly_SE'
[2013-07-02 14:59:30,590 ERROR 6413] Fatal error returned from ProcessRunner-4
[2013-07-02 14:59:30,590 ERROR 6413] Terminating processes
[2013-07-02 14:59:30,590 ERROR 6413] A fatal error was encountered.
'Unrecoverable error'
I figured this might be related to the 4 reads split into the PE files for the target, so I ran SPAdes against just those reads split into t__002555 (the directory containing the PE files for the 4 reads) with:
spades.py -t 1 -1 /path/to/working_Sample1/t__002555/PE1.fastq -2 /path/to/working_Sample1/t__002555/PE2.fastq -o /path/to/working_Sample1/t__002555/assembly
The error returned from SPAdes was (snipped):
Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 192. Message 'Invalid kmer coverage histogram'.
Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 192. Message 'Invalid kmer coverage histogram'.
Exception caught std::exception
I looked around for an easy way to turn off the coverage limitation for the assembly to see if that would allow the process to continue, but SPAdes seems to have no CLI flags to do that (at least not readily apparent).
Thanks very much and keep up the excellent work!
best,
b
This is partially implemented, but the assembler is still run, even if only the reads will be written out.
Fix the call (use str(my_seq) in place of my_seq.tostring()) to get rid of this warning:
/opt/modules/devel/python/2.7.5/lib/python2.7/site-packages/Bio/Seq.py:302: BiopythonDeprecationWarning: This method is obsolete; please use str(my_seq) instead of my_seq.tostring().
There are some instances where no PE or SE reads will be mapped. Handle this gracefully.
Basic instructions for how to install and run.
In the MultiMite test, nine target × sample combinations were flagged as hitting a repeat, and further assembly was stopped at iteration 2. In actuality this occurred because a small number of reads were recruited on the first iteration followed by a large number on the second. In 8 of 9 cases fewer contigs were produced on iteration 2 than on iteration 1, and in the 9th case the number was equal.
Based on these results:
Set up a new criterion for repeat detection which includes the number of contigs. For example:
if NumReads > lastNumReads * multiplier and NumContigs > lastNumContigs:
    isRepeat = True
This should guard against most cases of false repeat detection.
The speed of ARC is heavily dependent on disk storage speed. Putting the working directory on a flash-based drive speeds ARC up tremendously, as does putting the reads on a flash-based drive. Even better on high memory systems is to store everything in RAM (i.e. /dev/shm on CentOS).
Currently some tricks can be used to make this happen, for example creating a set of "working" directories and then symlinking them to the location where ARC is running. A better alternative would be to just tell ARC where to put the working folder with a parameter which defaults to './'.
For example, making a folder name with a "|" or other shell-special characters will seriously screw up pathing and command-line operations like starting the assemblers.
Some users have very high identity targets and don't need sloppy mapping on the first iteration. Being able to use only high-specificity mapping parameters on the first iteration could allow these users to avoid getting off-target contigs.