ksahlin / isocon
Derives consensus sequences from a set of long noisy reads by clustering and error correction.
License: GNU General Public License v3.0
When running IsoCon with the dependency networkx v2.4, the following error occurs:
isolated: 0
Number of edges: 84000
Total edit distance: 12122653
Avg ed (ed/edges): 144.31729761904762
Traceback (most recent call last):
File "/home/alice.namias/.local/bin/IsoCon", line 292, in <module>
run_pipeline(params)
File "/home/alice.namias/.local/bin/IsoCon", line 159, in run_pipeline
candidate_file, read_partition, to_realign = isocon_get_candidates.find_candidate_transcripts(params.read_file, params)
File "/home/alice.namias/.local/lib/python3.6/site-packages/modules/isocon_get_candidates.py", line 129, in find_candidate_transcripts
G_star, graph_partition, M, converged = partitions.partition_strings(S, params)
File "/home/alice.namias/.local/lib/python3.6/site-packages/modules/partitions.py", line 420, in partition_strings
G_star, converged = graphs.construct_exact_nearest_neighbor_graph(S, params)
File "/home/alice.namias/.local/lib/python3.6/site-packages/modules/graphs.py", line 63, in construct_exact_nearest_neighbor_graph
if G.node[s1]["degree"] > 1:
AttributeError: 'DiGraph' object has no attribute 'node'
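For context: networkx 2.4 removed the `Graph.node` attribute (deprecated since 2.1) in favor of `Graph.nodes`, which is exactly what `G.node[s1]["degree"]` in the traceback trips over. Pinning the dependency (`pip install 'networkx<2.4'`) works around it; a version-agnostic access helper is another option. The helper below is a hypothetical sketch, not part of IsoCon:

```python
def node_attrs(G, n):
    """Return the attribute dict of node n on both the old (G.node)
    and new (G.nodes) networkx graph APIs."""
    view = G.nodes if hasattr(G, "nodes") else G.node
    return view[n]
```

With this helper, `G.node[s1]["degree"]` becomes `node_attrs(G, s1)["degree"]` and runs on either networkx version.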
In general: no. IsoCon is designed for targeted sequencing where the CCS flnc reads are cut at relatively precise positions (i.e., at the start and stop primer sites). If this is not the case, it may affect both the runtime and the quality of the output.
However, if a non-targeted Iso-Seq dataset is processed such that the flnc reads from a particular gene are extracted (e.g., by using the pre-cluster module from TuFU, or by aligning ccs reads to a genome/transcriptome and separating them by region) and these reads are cut at the same start and end positions, IsoCon should work well. Keep in mind, though, that if reads are "cut", the quality values associated with the ccs reads will also have to be cut the same way, so that each base keeps its corresponding quality value. This could be done relatively easily from the bam file.
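A minimal sketch of that last point (a hypothetical helper, not IsoCon code): when a read is trimmed to the primer sites, its per-base quality string must be sliced with the same coordinates so each base keeps its own quality value.

```python
def trim_read(seq, qual, start, end):
    """Trim a read and its per-base quality string with identical
    coordinates, so base i in the output still pairs with quality i."""
    assert len(seq) == len(qual), "sequence/quality length mismatch"
    return seq[start:end], qual[start:end]

# Trimming to the region between hypothetical primer sites at 2 and 8:
seq, qual = trim_read("AACGTTACGT", "IIIIHHHHGG", 2, 8)  # -> "CGTTAC", "IIHHHH"
```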
Dear Kristoffer,
Hello,
I am trying to use IsoCon for my transcriptome.
We have 30 cells to analyze, and my flnc file was generated from all 30 cells.
Is it okay to use a bam file merged with samtools, or bax2bam?
Thank you.
Won
Hi,
I tried to use IsoCon for the sugarcane transcriptome (10 Gb genome, highly polyploid, 100-130 chromosomes, aneuploid with varying ploidy levels).
After 7 days on 24 CPUs, I got the error below: the allocated memory (120 GB) was exceeded.
########################### Execution Started #############################
/var/spool/pbs/mom_priv/jobs/100062.tinmgmr1.SC: line 17: : command not found
/var/spool/pbs/mom_priv/jobs/100062.tinmgmr1.SC: line 20: : command not found
=>> PBS: job killed: mem 127935680kb exceeded limit 125829120kb
########################### Job Execution History #############################
JobName:IsoConSugar
SessionId:24450
ResourcesRequested:mem=120gb,ncpus=24,place=free,walltime=326:00:00
ResourcesUsed:cpupercent=2400,cput=1596:37:06,mem=127935680kb,ncpus=24,vmem=211531212kb,walltime=136:17:10
QueueUsed:Long
ExitStatus:271
The files generated are:
4096 Apr 3 18:31 alignments ## empty directory
78263932 Apr 6 15:08 candidates_step_10.fa
78310448 Apr 6 20:41 candidates_step_11.fa
78330217 Apr 7 02:16 candidates_step_12.fa
78345513 Apr 7 07:50 candidates_step_13.fa
78355500 Apr 7 13:27 candidates_step_14.fa
78359544 Apr 7 19:00 candidates_step_15.fa
78364148 Apr 8 00:35 candidates_step_16.fa
78365950 Apr 8 06:10 candidates_step_17.fa
78366454 Apr 8 11:43 candidates_step_18.fa
78366454 Apr 8 17:17 candidates_step_19.fa
78366454 Apr 8 22:53 candidates_step_20.fa
78366454 Apr 9 04:27 candidates_step_21.fa
78366454 Apr 9 10:01 candidates_step_22.fa
77594624 Apr 9 10:01 candidates_step_23.fa
62809976 Apr 4 16:32 candidates_step_2.fa
67331407 Apr 4 23:02 candidates_step_3.fa
72215051 Apr 5 05:12 candidates_step_4.fa
75426509 Apr 5 11:05 candidates_step_5.fa
77103366 Apr 5 16:46 candidates_step_6.fa
77782120 Apr 5 22:22 candidates_step_7.fa
78103952 Apr 6 03:59 candidates_step_8.fa
78212789 Apr 6 09:35 candidates_step_9.fa
0 Apr 3 18:31 filtered_reads.fa
0 Apr 3 18:31 logfile.txt
So, can we use IsoCon for organisms with large polyploid genomes?
If so, is there a way, or a program we can use, to finish the process without having to re-run everything and regenerate these 23 .fa files?
Thank you,
Virgg
Hi @ksahlin. Thanks a ton for all of your tools for noisy reads. I'm looking for a solution for de novo clustering of ONT amplicon reads from environmental sequencing, i.e., fungal rRNA amplicons. The data I'm trying this on is from a mock community of mixed species. The region is the ITS-LSU region of rRNA in fungi; we typically define species with a 97% pident cutoff on this region. The data has been pre-processed by re-orienting reads into the same direction and finding/trimming both forward and reverse primer sequences.
I've tried isONclust, and at first it seemed to work great (and quite fast), but on further inspection it was a little too liberal in clustering the data I have access to at the moment, effectively combining too many reads into the same "gene family". I ran a parameter search varying k and w to see if I could get it to give me the proper results, but essentially never found a set of parameters that delineated the clusters properly. My goal is to find a method to identify the "centroid", since it is then relatively straightforward to use spoa and racon/medaka for error correction. I tried to clean up the clustering a little by "sub-clustering": plotting read lengths (fungal ITS-LSU sequences are variable in length) and pulling out "peaks" from the read-length distribution. This seemed to work okay, but is still not quite what I'm looking for.
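The length-based sub-clustering idea above can be sketched roughly as follows (a toy sketch; `length_peaks` is a hypothetical helper, and `bin_size`/`min_frac` are made-up thresholds, not parameters of any of these tools):

```python
from collections import Counter

def length_peaks(read_lengths, bin_size=50, min_frac=0.05):
    """Bin read lengths and return the left edges of bins that are
    local maxima holding at least min_frac of all reads."""
    bins = Counter(length // bin_size for length in read_lengths)
    total = len(read_lengths)
    peaks = []
    for b, count in bins.items():
        if count / total < min_frac:
            continue  # too small a fraction of reads to call a peak
        if count >= bins.get(b - 1, 0) and count >= bins.get(b + 1, 0):
            peaks.append(b * bin_size)
    return sorted(peaks)

# Two dominant length modes plus a few stray reads:
lengths = [600] * 40 + [1500] * 55 + [980] * 4
```

Reads could then be partitioned by nearest peak before building a consensus per sub-cluster.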
Based on some of the other issues in your tool repositories, I then tried IsoCon, which you had indicated seemed to be a more general approach. IsoCon has a much, much longer runtime and eventually crashed with the error below (note that I initially ran it without --prefilter_candidates --min_candidate_support 2 and it crashed with the same error).
If you have any other suggestions on an appropriate workflow I'd be grateful to hear your opinions.
Thanks,
Jon
$ IsoCon pipeline -fl_reads reads.oriented.proper-primers.yacrd.fastq -outfolder isocon_test2 --verbose --prefilter_candidates --min_candidate_support 8 --nr_cores 7
fl_reads: reads.oriented.proper-primers.yacrd.fastq
outfolder: isocon_test2
ccs: None
nr_cores: 7
verbose: True
neighbor_search_depth: 4294967296
min_exon_diff: 20
min_candidate_support: 8
p_value_threshold: 0.01
min_test_ratio: 5
max_phred_q_trusted: 43
ignore_ends_len: 15
cleanup: False
prefilter_candidates: True
which: pipeline
is_fastq: True
ITERATION: 1
Max transcript length:2694, Min transcript length:806
Non-converged (unique) sequences left: 67501
[0, 964, 1928, 2892, 3856, 4820, 5784, 6748, 7712, 8676, 9640, 10604, 11568, 12532, 13496, 14460, 15424, 16388, 17352, 18316, 19280, 20244, 21208, 22172, 23136, 24100, 25064, 26028, 26992, 27956, 28920, 29884, 30848, 31812, 32776, 33740, 34704, 35668, 36632, 37596, 38560, 39524, 40488, 41452, 42416, 43380, 44344, 45308, 46272, 47236, 48200, 49164, 50128, 51092, 52056, 53020, 53984, 54948, 55912, 56876, 57840, 58804, 59768, 60732, 61696, 62660, 63624, 64588, 65552, 66516, 67480]
query chunks: [964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 21]
processing 0
processing 14500
processing 3000
[... further "processing N" progress lines from the parallel workers omitted ...]
processing 66500
processing 63500
isolated: 0
Number of edges: 76499
Total edit distance: 14654968
Avg ed (ed/edges): 191.57071334265808
Traceback (most recent call last):
File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 292, in <module>
run_pipeline(params)
File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 159, in run_pipeline
candidate_file, read_partition, to_realign = isocon_get_candidates.find_candidate_transcripts(params.read_file, params)
File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/isocon_get_candidates.py", line 129, in find_candidate_transcripts
G_star, graph_partition, M, converged = partitions.partition_strings(S, params)
File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/partitions.py", line 420, in partition_strings
G_star, converged = graphs.construct_exact_nearest_neighbor_graph(S, params)
File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/graphs.py", line 63, in construct_exact_nearest_neighbor_graph
if G.node[s1]["degree"] > 1:
AttributeError: 'DiGraph' object has no attribute 'node'
Dear Kristoffer,
We amplified two markers (the nuclear ribosomal cistron, ~6000 bp, and a low-copy nuclear gene, ~2500 bp) for 12 samples. Then we combined both markers for each sample and barcoded them for sequencing using Nanopore's RPB004 kit, which fragments the amplicons to attach the adapters (we will eventually switch to a kit that sequences the full length of the amplicons).
After demultiplexing and removing chimeric reads, I was testing IsoCon on one of the samples, but it recovered 600 candidates (for two markers). I think it is because our data violates the assumption that "reads are cut at relatively precise positions (i.e., at the start and stop primer sites)", since the fragmentation breaks up the amplicon. However, there are reads with lengths close to the full length of the amplicon, and the shorter ones are contained within that length, as in this picture where I mapped the nanopore reads to one of the longest reads:
I wonder if there is any parameter I can alter in IsoCon to use sub-length reads in the consensus, or any other software that can accomplish what I need. For example, I tried isONclust, but I read that it doesn't reverse-complement the reads during clustering, so it wouldn't be completely ideal for this application. I also tried several de novo assemblers, but the coverage is too high and uneven, and the targets are too short, for these to work correctly.
Thanks in advance,
Edgardo
The header is in the following format:
> acc + "_" + support + "_" + p_value + "_" + N_t + "_" + delta_size
Here,
Example
Suppose there are two clusters, c1 and c2, with 10 and 5 reads respectively, and their total distance is 3 (e.g., one SNP and one indel of length 2). Let's say you are testing whether the SNP and deletion that c2 has with respect to c1 are statistically supported, and the hypothesis test comes back significant (i.e., they are statistically supported); then you will have in c2's header:
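A minimal sketch of pulling those fields back out of a header line (a hypothetical helper, not IsoCon code; it assumes the last four underscore-separated tokens are support, p_value, N_t and delta_size, so splitting from the right protects accessions that themselves contain underscores):

```python
def parse_candidate_header(header):
    """Split a header of the form acc_support_p_value_N_t_delta_size,
    splitting from the right so underscores inside acc are preserved."""
    acc, support, p_value, n_t, delta_size = header.lstrip("> ").rsplit("_", 4)
    return acc, int(support), float(p_value), int(n_t), int(delta_size)

# Made-up example values:
parse_candidate_header(">read_42_5_0.001_15_3")  # -> ("read_42", 5, 0.001, 15, 3)
```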
IsoCon with default parameters is fast if
If 1 is not satisfied, it has a bad effect on the runtime and, in addition, we cannot guarantee the quality of the algorithm's output. If 1 is satisfied but not 2, the algorithm is designed to give good-quality output; it might just take a long time. The algorithm can be sped up by setting --neighbor_search_depth
to a low value, e.g., 500 or 1000 (the default is no limit); read below for why. Regarding "relatively precise positions" in 1: this means that most reads are within, say, 5-15 bases of the primer cut point. It doesn't matter if some of the reads are inaccurately cut.
For targeted datasets such as the ones on FMR1, IsoCon's runtime on a Mac laptop using 3 cores was about 2 hours for the 3-cell patient datasets and less than 30 minutes for the 1-cell control datasets. In our simulated data we generated 12,500 reads from transcripts (from DAZ) of length 6,000 bp with a median of 276 errors per read (corresponding to most reads having 1-3 passes; see the simulated error profile in Table 1). The algorithm takes ~15-20 hours to run on 4 cores. This can be compared with the other two datasets in that table: HSFY takes ~2 hours and TSPY ~5 minutes on 4 cores. So the error rate plays a key role.
We are using an exact approach to find the nearest-neighbor graph (defined in the manuscript). We sort sequences by length and place them in an array L indexed by i. For a given sequence L[i], we calculate the edit distance between L[i] and L[i +/- j], for j = 1, 2, 3, ... . We keep track of the best edit distance we encounter for L[i] (call this value a_i), and we stop computing edit distances whenever the sequence length difference between L[i] and L[i +/- j] is larger than a_i. Since the edit distance is at least the length difference, the nearest neighbor is then guaranteed to have been found; this is how we guarantee the exact NN-graph is built. If conditions 1-2 are generally met, we don't have to explore many alignments (we find the best edit distance at small j, as nearest neighbours are located close to each other in L). The algorithm can be sped up by setting a limit on j via --neighbor_search_depth [low number, e.g., 500 or 1000]. But then the guarantee that the nearest neighbours are found is lost, and it may affect the quality of the output. We are exploring the possibility of using minimap2 for the alignments.
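The steps above can be sketched roughly as follows (an illustrative sketch under the stated length-sorted assumption, not IsoCon's actual implementation; function names are made up):

```python
def edit_distance(a, b):
    """Plain Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # match/substitution
        prev = cur
    return prev[-1]

def nearest_neighbor(L, i, search_depth=None):
    """Nearest neighbour of L[i] in the length-sorted list L.
    Stops once the length difference alone exceeds the best edit
    distance seen so far, which keeps the search exact because the
    edit distance is at least the length difference. search_depth
    caps j, trading the guarantee for speed, in the spirit of
    --neighbor_search_depth."""
    best_d, best_k = None, None
    j = 1
    while True:
        candidates = [k for k in (i - j, i + j) if 0 <= k < len(L)]
        if not candidates:
            break  # ran off both ends of L
        checked_any = False
        for k in candidates:
            if best_d is not None and abs(len(L[k]) - len(L[i])) > best_d:
                continue  # length gap already exceeds best distance
            checked_any = True
            d = edit_distance(L[i], L[k])
            if best_d is None or d < best_d:
                best_d, best_k = d, k
        if not checked_any and best_d is not None:
            break  # all remaining candidates are even further in length
        if search_depth is not None and j >= search_depth:
            break  # depth cap reached
        j += 1
    return best_k, best_d

# For L[0] = "AAA", the nearest neighbour is L[1] = "AAAT" at distance 1:
nearest_neighbor(["AAA", "AAAT", "AAATT", "GGGGGGGG"], 0)  # -> (1, 1)
```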
Hi, when I run IsoCon, I find the results vary greatly with different --min_candidate_support settings. How should I set this parameter?