isocon's Issues

IsoCon breaks for networkx v2.4

When installing IsoCon with the dependency networkx v2.4, the following error occurs:

isolated: 0
Number of edges: 84000
Total edit distance: 12122653
Avg ed (ed/edges): 144.31729761904762
Traceback (most recent call last):
  File "/home/alice.namias/.local/bin/IsoCon", line 292, in <module>
    run_pipeline(params)
  File "/home/alice.namias/.local/bin/IsoCon", line 159, in run_pipeline
    candidate_file, read_partition, to_realign = isocon_get_candidates.find_candidate_transcripts(params.read_file, params)
  File "/home/alice.namias/.local/lib/python3.6/site-packages/modules/isocon_get_candidates.py", line 129, in find_candidate_transcripts
    G_star, graph_partition, M, converged = partitions.partition_strings(S, params)
  File "/home/alice.namias/.local/lib/python3.6/site-packages/modules/partitions.py", line 420, in partition_strings
    G_star, converged = graphs.construct_exact_nearest_neighbor_graph(S, params)
  File "/home/alice.namias/.local/lib/python3.6/site-packages/modules/graphs.py", line 63, in construct_exact_nearest_neighbor_graph
    if G.node[s1]["degree"] > 1:
AttributeError: 'DiGraph' object has no attribute 'node'
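The traceback above comes from an API change in networkx 2.4, which removed the deprecated `Graph.node` attribute view in favor of `Graph.nodes`. A minimal sketch of the two workarounds, assuming no other networkx incompatibilities in the code:

```python
import networkx as nx

# Reproduce the failing access pattern with the current API:
G = nx.DiGraph()
G.add_node("s1", degree=2)

# networkx >= 2.4 removed `G.node`; `G.nodes` works on 2.0 through 3.x,
# so changing `G.node[s1]` to `G.nodes[s1]` in graphs.py fixes the crash.
if G.nodes["s1"]["degree"] > 1:
    print("node s1 has degree > 1")

# Alternatively, pin the dependency before installing IsoCon:
#   pip install 'networkx<2.4'
```

Pinning avoids patching installed files; the `G.nodes` rename is the forward-compatible fix.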

Can IsoCon be used on nontargeted Iso-Seq data sets?

In general: No. IsoCon is designed for targeted sequencing where the CCS flnc reads are cut at relatively precise positions (i.e., at the start and stop primer sites). If this is not the case, it may affect both the runtime and the quality of the output.

However, if a nontargeted Iso-Seq dataset is processed such that the flnc reads from a particular gene are extracted (e.g., by using the pre-cluster module from TuFU, or by aligning ccs reads to a genome/transcriptome and separating them by region), and these reads are cut at the same start and end positions, IsoCon should work well. Keep in mind, though, that if reads are "cut", the quality values associated with the ccs reads have to be cut the same way, so that each base keeps its own quality value. This can be done relatively easily from the bam file.
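Cutting sequence and quality in lockstep can be sketched with a small helper (hypothetical, not part of IsoCon; with a bam file you would apply the same slice to a read's sequence and its per-base qualities, e.g., via pysam):

```python
def trim_read(seq, qual, start, end):
    """Cut a read to [start, end) and apply the identical cut to its
    per-base quality string, so each base keeps its own quality value."""
    assert len(seq) == len(qual), "sequence/quality length mismatch"
    return seq[start:end], qual[start:end]

# Example: drop 2 bases from each end of an 8 bp read.
seq, qual = trim_read("ACGTACGT", "IIIIFFFF", 2, 6)
# seq == "GTAC", qual == "IIFF"
```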

Multiple CCS.bam files

Dear Kristoffer,

Hello,
I am trying to use IsoCon for my transcriptome.
We have 30 cells to analyze, and my flnc file was generated from 30 cells.
Is it okay to use a bam file merged with Samtools, or bax2bam?

Thank you.

Won

Can we use IsoCon for an organism with a large polyploid genome? / PBS: job killed: mem 127935680kb exceeded limit 125829120kb

Hi,
I tried to use IsoCon for the sugarcane transcriptome (10 Gb genome, highly polyploid, 100-130 chromosomes, aneuploid with varying ploidy levels).

After 7 days with 24 CPUs, I had this error: the allocated memory (120 GB) was exceeded.

########################### Execution Started #############################
/var/spool/pbs/mom_priv/jobs/100062.tinmgmr1.SC: line 17:  : command not found
/var/spool/pbs/mom_priv/jobs/100062.tinmgmr1.SC: line 20:  : command not found
=>> PBS: job killed: mem 127935680kb exceeded limit 125829120kb
########################### Job Execution History #############################
JobName:IsoConSugar
SessionId:24450
ResourcesRequested:mem=120gb,ncpus=24,place=free,walltime=326:00:00
ResourcesUsed:cpupercent=2400,cput=1596:37:06,mem=127935680kb,ncpus=24,vmem=211531212kb,walltime=136:17:10
QueueUsed:Long
ExitStatus:271

The files generated are:
4096 Apr 3 18:31 alignments ## empty directory
78263932 Apr 6 15:08 candidates_step_10.fa
78310448 Apr 6 20:41 candidates_step_11.fa
78330217 Apr 7 02:16 candidates_step_12.fa
78345513 Apr 7 07:50 candidates_step_13.fa
78355500 Apr 7 13:27 candidates_step_14.fa
78359544 Apr 7 19:00 candidates_step_15.fa
78364148 Apr 8 00:35 candidates_step_16.fa
78365950 Apr 8 06:10 candidates_step_17.fa
78366454 Apr 8 11:43 candidates_step_18.fa
78366454 Apr 8 17:17 candidates_step_19.fa
78366454 Apr 8 22:53 candidates_step_20.fa
78366454 Apr 9 04:27 candidates_step_21.fa
78366454 Apr 9 10:01 candidates_step_22.fa
77594624 Apr 9 10:01 candidates_step_23.fa
62809976 Apr 4 16:32 candidates_step_2.fa
67331407 Apr 4 23:02 candidates_step_3.fa
72215051 Apr 5 05:12 candidates_step_4.fa
75426509 Apr 5 11:05 candidates_step_5.fa
77103366 Apr 5 16:46 candidates_step_6.fa
77782120 Apr 5 22:22 candidates_step_7.fa
78103952 Apr 6 03:59 candidates_step_8.fa
78212789 Apr 6 09:35 candidates_step_9.fa
0 Apr 3 18:31 filtered_reads.fa
0 Apr 3 18:31 logfile.txt

So, can we use IsoCon for organisms with large polyploid genomes?
If so, is there a way, or a program, to finish the process without having to re-run everything and regenerate these 23 .fa files?

Thank you,
Virgg

Error: AttributeError: 'DiGraph' object has no attribute 'node'

Hi @ksahlin. Thanks a ton for all of your tools for noisy reads. I'm looking for a solution for de novo clustering of ONT amplicon reads from environmental sequencing, i.e., fungal rRNA amplicons. The data I'm trying this on is from a mock community of mixed species. The region is the ITS-LSU region of rRNA in fungi -- we typically define species with a 97% pident cutoff on this region. The data has been pre-processed by re-orienting reads into the same direction and finding/trimming both the forward and reverse primer sequences.

I've tried isONclust, and at first it seemed like it might be working great (and quite fast), but on further inspection it was a little too liberal in clustering the data I have access to at the moment, effectively combining too many reads into the same "gene family". I ran a parameter search varying k and w to see if I could get the proper results, but essentially never found a set of parameters that delineated the clusters properly. My goal is to find a method to identify the "centroid", as it is then relatively straightforward to use spoa and racon/medaka for error correction. I tried to clean up the clustering a little by invoking a "sub-clustering": plotting read lengths (as fungal ITS-LSU sequences are variable in length) and then pulling out "peaks" from the read lengths -- this seemed to work okay, but is still not quite what I'm looking for.

Based on some of the other issues in your tool repositories, I then tried IsoCon, which you had indicated is a more general approach. IsoCon has a much, much longer runtime and eventually crashed with the error below (note I initially ran it without --prefilter_candidates --min_candidate_support 2 and it crashed with the same error).

If you have any other suggestions on an appropriate workflow I'd be grateful to hear your opinions.

Thanks,
Jon

$ IsoCon pipeline -fl_reads reads.oriented.proper-primers.yacrd.fastq -outfolder isocon_test2 --verbose --prefilter_candidates --min_candidate_support 8 --nr_cores 7
fl_reads: reads.oriented.proper-primers.yacrd.fastq
outfolder: isocon_test2
ccs: None
nr_cores: 7
verbose: True
neighbor_search_depth: 4294967296
min_exon_diff: 20
min_candidate_support: 8
p_value_threshold: 0.01
min_test_ratio: 5
max_phred_q_trusted: 43
ignore_ends_len: 15
cleanup: False
prefilter_candidates: True
which: pipeline
is_fastq: True

ITERATION: 1

Max transcript length:2694, Min transcript length:806
Non-converged (unique) sequences left: 67501
[0, 964, 1928, 2892, 3856, 4820, 5784, 6748, 7712, 8676, 9640, 10604, 11568, 12532, 13496, 14460, 15424, 16388, 17352, 18316, 19280, 20244, 21208, 22172, 23136, 24100, 25064, 26028, 26992, 27956, 28920, 29884, 30848, 31812, 32776, 33740, 34704, 35668, 36632, 37596, 38560, 39524, 40488, 41452, 42416, 43380, 44344, 45308, 46272, 47236, 48200, 49164, 50128, 51092, 52056, 53020, 53984, 54948, 55912, 56876, 57840, 58804, 59768, 60732, 61696, 62660, 63624, 64588, 65552, 66516, 67480]
query chunks: [964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 964, 21]
processing  0
processing  14500
processing  3000
processing  17500
processing  6000
processing  500
processing  9000
processing  12000
processing  15000
processing  3500
processing  18000
processing  6500
processing  1000
processing  9500
processing  12500
processing  15500
processing  4000
processing  7000
processing  18500
processing  10000
processing  1500
processing  13000
processing  4500
processing  16000
processing  7500
processing  10500
processing  2000
processing  19000
processing  13500
processing  5000
processing  16500
processing  8000
processing  11000
processing  2500
processing  19500
processing  14000
processing  5500
processing  17000
processing  8500
processing  11500
processing  29000
processing  20000
processing  20500
processing  32000
processing  23500
processing  35000
processing  26500
processing  21000
processing  29500
processing  32500
processing  38000
processing  24000
processing  35500
processing  27000
processing  21500
processing  30000
processing  33000
processing  24500
processing  38500
processing  27500
processing  36000
processing  22000
processing  30500
processing  33500
processing  25000
processing  39000
processing  28000
processing  36500
processing  22500
processing  31000
processing  25500
processing  34000
processing  28500
processing  39500
processing  37000
processing  23000
processing  31500
processing  40500
processing  26000
processing  34500
processing  43500
processing  40000
processing  46500
processing  37500
processing  41000
processing  55000
processing  49500
processing  52500
processing  44000
processing  47000
processing  55500
processing  58000
processing  50000
processing  41500
processing  53000
processing  44500
processing  58500
processing  56000
processing  47500
processing  50500
processing  42000
processing  53500
processing  59000
processing  56500
processing  45000
processing  48000
processing  51000
processing  54000
processing  42500
processing  59500
processing  57000
processing  45500
processing  51500
processing  48500
processing  54500
processing  60000
processing  57500
processing  43000
processing  60500
processing  52000
processing  46000
processing  49000
processing  67000
processing  67500
processing  61000
processing  64000
processing  61500
processing  64500
processing  62000
processing  65000
processing  65500
processing  62500
processing  66000
processing  63000
processing  66500
processing  63500
isolated: 0
Number of edges: 76499
Total edit distance: 14654968
Avg ed (ed/edges): 191.57071334265808
Traceback (most recent call last):
  File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 292, in <module>
    run_pipeline(params)
  File "/Users/jon/miniconda3/envs/amptk_dev/bin/IsoCon", line 159, in run_pipeline
    candidate_file, read_partition, to_realign = isocon_get_candidates.find_candidate_transcripts(params.read_file, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/isocon_get_candidates.py", line 129, in find_candidate_transcripts
    G_star, graph_partition, M, converged = partitions.partition_strings(S, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/partitions.py", line 420, in partition_strings
    G_star, converged = graphs.construct_exact_nearest_neighbor_graph(S, params)
  File "/Users/jon/miniconda3/envs/amptk_dev/lib/python3.6/site-packages/modules/graphs.py", line 63, in construct_exact_nearest_neighbor_graph
    if G.node[s1]["degree"] > 1:
AttributeError: 'DiGraph' object has no attribute 'node'

Assembling fragmented amplicons

Dear Kristoffer,

We amplified two markers (the nuclear ribosomal cistron, ~6000 bp, and a low-copy nuclear gene, ~2500 bp) for 12 samples. Then we combined both markers for each sample and barcoded them for sequencing using Nanopore's RPB004 kit, which fragments the amplicons to attach the adapters (we will eventually switch to a kit that sequences the full length of the amplicons).

After demultiplexing and removing chimeric reads, I tested IsoCon on one of the samples, but it recovered 600 candidates (for two markers). I think this is because our data violates the assumption that "reads are cut at relatively precise positions (i.e., at the start and stop primer sites)", since the fragmentation breaks down the amplicon. However, there are reads with lengths close to the full length of the amplicon, and the shorter ones are contained within that length, as in this picture where I mapped the nanopore reads to one of the longest reads:
[screenshot: nanopore reads mapped against one of the longest reads]

I wonder if there is any parameter I can alter in IsoCon to use sub-length reads in the consensus, or any software that can accomplish what I need. For example, I tried isONclust, but I read that it doesn't reverse-complement the reads during clustering, so it wouldn't be completely ideal for this application. I also tried several de novo assemblers, but the coverage is too high and uneven, and the targets are too short for these to work correctly.

Thanks in advance,

Edgardo

Final candidates header explanation

The header is in the following format:

> acc + "_" + support + "_" + p_value + "_" + N_t + "_" + delta_size

Here,

  • "acc" is the consensus accession.
  • "support" is the number of reads in the final consensus cluster that support all the variants unique to this cluster. Note that this is not the total number of reads assigned to the cluster, as some reads may not support all the variants when there are several (see the example below).
  • "p_value" is the least significant p-value of the statistical tests against other candidates.
  • "N_t" is the number of reads used in the statistical test that provides the p-value. This is usually the number of reads in the cluster itself plus the number of reads in the cluster it is tested against for similarity.
  • "delta_size" is the number of positions that differ from the closest other consensus (edit distance), i.e., the combination of SNPs and indels. An indel of length 3 contributes 3 towards delta_size.

Example

Suppose there are two clusters c1 and c2 with 10 and 5 reads respectively, and their total distance is 3 (e.g., one SNP and one indel of length 2). Let's say you are testing whether the SNP and deletion that c2 has with respect to c1 are statistically supported, and the hypothesis test comes back significant (i.e., they are statistically supported). Then you will have in c2's header:

  • N_t = 15 (the reads used to check whether the clusters should be merged or whether they are statistically distinct)
  • delta_size = 3
  • p_value: something lower than the threshold for not merging.
  • "support" would be however many reads supported both the SNP and the deletion in c2 in the statistical test. This is somewhere between 1 and 5 (though typically close to the cluster size, e.g., 3, 4, or 5).
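Given the format above, the header fields can be recovered with a right-to-left split, so that an accession which itself contains underscores stays intact (a small hypothetical parser; the accession and values below are made up for illustration):

```python
def parse_candidate_header(header):
    """Split '>acc_support_pvalue_Nt_deltasize' from the right, leaving
    an accession that contains underscores intact."""
    acc, support, p_value, n_t, delta_size = header.lstrip(">").rsplit("_", 4)
    return {
        "acc": acc,
        "support": int(support),
        "p_value": float(p_value),
        "N_t": int(n_t),
        "delta_size": int(delta_size),
    }

rec = parse_candidate_header(">transcript_12_5_0.0001_15_3")
# rec["acc"] == "transcript_12"; rec["support"] == 5;
# rec["N_t"] == 15; rec["delta_size"] == 3
```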

Runtime on my dataset

IsoCon with default parameters is fast if

  1. The CCS flnc reads are cut at relatively precise positions (i.e., at the start and stop primer sites).
  2. The CCS flnc reads, in general, do not have many errors. (The number of errors depends on both the number of passes and the transcript length. A rough estimate of "a lot of errors" would be the majority of CCS reads having over 50 errors.)

If 1 is not satisfied, it has a bad effect on the runtime and, in addition, we cannot guarantee the quality of the algorithm's output. If 1 is satisfied but not 2, the algorithm is designed to give good-quality output; it might just take a long time. The algorithm can be sped up by setting --neighbor_search_depth to a low value, e.g., 500 or 1000 (the default is no limit); see below for why. Regarding "relatively precise positions" in 1, this means that most reads are within, say, 5-15 bases of the primer cut point. It doesn't matter if some of the reads are inaccurately cut.

What runtime can I expect?

For targeted datasets such as the ones on FMR1, IsoCon's runtime on a Mac laptop using 3 cores was about 2 hours for the 3-cell patient datasets and less than 30 minutes for the 1-cell control datasets. In our simulated data we generated 12,500 reads from transcripts (from DAZ) of length 6000 bp with a median of 276 errors per read (corresponding to most reads having 1-3 passes; see Table 1, simulated error profile). The algorithm takes ~15-20 h to run on 4 cores. This can be compared with the other two datasets in that table: HSFY takes ~2 h and TSPY takes ~5 minutes on 4 cores. So the error rate plays a key role.

Why does IsoCon scale badly when 1-2 are not met?

We use an exact approach to build the nearest neighbor graph (defined in the manuscript). We sort sequences by length and place them in an array L indexed by i. For a given sequence L[i], we calculate the edit distance between L[i] and L[i +/- j], for j = 1, 2, 3, ... . We keep track of the best edit distance encountered for L[i] (call this value a_i) and stop computing edit distances whenever the sequence length difference between L[i] and L[i +/- j] is larger than a_i. We have then guaranteed that the nearest neighbor is found; this is how we guarantee the exact NN-graph is built. If 1-2 are in general met, we don't have to explore many alignments (we find the best edit distance with a small j, as nearest neighbors will be located close to each other in L). The algorithm can be sped up by setting a limit on j via --neighbor_search_depth [low number, e.g., 500 or 1000], but the guarantee that nearest neighbors are found is then lost, and this may affect the quality of the output. We are exploring the possibility of using minimap2 for the alignments.
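The search described above can be sketched as follows (a simplified, single-threaded illustration with a plain Levenshtein DP standing in for IsoCon's actual alignment routine, and ignoring ties; `search_depth` mimics --neighbor_search_depth):

```python
def edit_distance(a, b):
    # Classic O(len(a) * len(b)) Levenshtein dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

def nearest_neighbor(seqs, search_depth=None):
    """For each sequence, find its nearest neighbor among the others.
    Sequences are sorted by length into L; for L[i] we scan L[i +/- j]
    for j = 1, 2, ... and stop once the length difference alone exceeds
    the best edit distance found so far (a_i), or once j > search_depth."""
    L = sorted(seqs, key=len)
    nn = {}
    for i, s in enumerate(L):
        best, best_seq = None, None
        j = 1
        while search_depth is None or j <= search_depth:
            progressed = False
            for k in (i - j, i + j):
                if 0 <= k < len(L):
                    t = L[k]
                    # Pruning: length difference lower-bounds edit distance.
                    if best is None or abs(len(s) - len(t)) <= best:
                        progressed = True
                        d = edit_distance(s, t)
                        if best is None or d < best:
                            best, best_seq = d, t
            if not progressed:  # both directions pruned or exhausted
                break
            j += 1
        nn[s] = (best_seq, best)
    return nn
```

With well-cut, low-error reads, near neighbors sit close together in L and the prune fires at small j; with ragged cut points or high error rates, a_i stays large and many alignments must be computed, which is the scaling behavior described above.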
