I'm running a test on some TCGA data. 4 small groups, 5 members each. I've rerun the a

Aggregate step segmentation faults about imoka HOT 6 CLOSED

ritchielabigh commented on June 20, 2024

Aggregate step segmentation faults

from imoka.

Comments (6)

CloXD commented on June 20, 2024

Hello Jake,
I think it's a memory problem caused by the large number of results (I have never tried iMOKA with WGS, but I expected it would produce a lot of results).
Try increasing the general threshold (-T) to 90 and the source threshold (-t) to 95 (or even 95 and 99) to keep only the best results.
With larger cohorts, the accuracy values should be more reliable: if you kept the default values in the reduction step, you used 1/4 of the samples as the test set, which means 1 per group. Take a look at the reduced matrix: if the accuracies are all 100, it would be better to increase the number of samples in each group to 10 or increase the test-set fraction (-t) to 0.4 (so with 5 samples, it will use 2 as test and 3 as training).
I hope this will help.
Cheers,
Claudio
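
As a rough illustration of the split arithmetic described above: a minimal sketch, assuming the per-group test count is the group size times the test fraction, rounded down. The rounding rule and the helper name `split_counts` are assumptions made for illustration, not taken from the iMOKA source.

```python
import math
from typing import Tuple


def split_counts(group_size: int, test_fraction: float) -> Tuple[int, int]:
    """Return (n_test, n_train) for one group, assuming the test count is rounded down."""
    # The small epsilon guards against a floating-point product landing just below an integer.
    n_test = math.floor(group_size * test_fraction + 1e-9)
    return n_test, group_size - n_test


# Five samples per group, as in this thread.
for fraction in (0.25, 0.4, 0.5):
    n_test, n_train = split_counts(5, fraction)
    print(f"fraction={fraction}: {n_test} test, {n_train} training per group")
```

Under that assumption, 0.25 leaves a single test sample per group, while 0.4 gives the 2 test / 3 training split mentioned above.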

jakewendt commented on June 20, 2024

Thanks again Claudio.

Initially, this was just a test of principle, so the accuracy of the results wasn't really that important. Once functioning, I am planning to run all available samples.

Not sure where to check for 100 as you suggested.

The reduced matrix did keep half a billion kmers which is quite a bit.

head 15/reduced.matrix
#{"adjustments":[0.25,0.05],"cross_validation":100,"file_in":"/francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json","file_out":"/francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix","kept":538537323,"min_acc":65.0,"minimum_count":5,"perc_test":0.25,"processed":864984338,"standard_error":0.5}
kmer	nMutant_x_nWT	nMutant_x_tMutant	nMutant_x_tWT	nWT_x_tMutant	nWT_x_tWT	tMutant_x_tWT	nMutant	nWT	tMutant	tWT
AAAAAAAAAAAAAAA	79.500	81.500	93.500	66.000	62.000	22.000	462462.289	504177.301	527753.348	534412.340
AAAAAAAAAAAAAAC	77.000	49.000	83.500	68.000	35.000	58.500	18289.031	23344.986	20507.831	22575.272
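
The first line of the output above is a `#` followed by a JSON object, so the kept/processed ratio can be read straight from it. A minimal sketch, assuming that header layout holds for every reduced matrix; `parse_reduced_header` is a hypothetical helper, and the two path fields are left out of the sample string for brevity.

```python
import json


def parse_reduced_header(first_line: str) -> dict:
    """Parse the '#'-prefixed JSON header on the first line of a reduced matrix."""
    if not first_line.startswith("#"):
        raise ValueError("expected the header line to start with '#'")
    return json.loads(first_line[1:])


# Header fields copied from the output above (file_in and file_out omitted).
header_line = (
    '#{"adjustments":[0.25,0.05],"cross_validation":100,"kept":538537323,'
    '"min_acc":65.0,"minimum_count":5,"perc_test":0.25,'
    '"processed":864984338,"standard_error":0.5}'
)

header = parse_reduced_header(header_line)
kept_fraction = header["kept"] / header["processed"]
print(f"kept {header['kept']:,} of {header['processed']:,} k-mers ({kept_fraction:.1%})")
```

With these numbers, a bit over 60% of the processed k-mers survive the reduction, which is consistent with the aggregate step having a very large input.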

I also just noticed a new quirk with WGS, at least with paired data. The kmer counts aren't canonical, as the reduced.matrix includes reverse complements. I'm guessing they probably should be, given that half the reads are forward and half are reverse complement. That would mean going back to the preprocessing step, I think, and changing the library type. I'm assuming that the default library type is effectively ff. I'm gonna try fr. Suggestions there?

I'll make the mods to the aggregate that you suggested and rerun.

Thanks again,
Jake
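
For reference, "canonical" here means representing a k-mer and its reverse complement by a single key, conventionally the lexicographically smaller of the two. A minimal sketch of that convention; this is the common definition, not code from iMOKA, and the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement each base, then reverse the string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


# A k-mer from the matrix above and its reverse complement map to the same key.
forward = "AAAAAAAAAAAAAAC"
reverse = reverse_complement(forward)
print(forward, reverse, canonical(forward) == canonical(reverse))
```

With canonical counting, the two rows for a k-mer and its reverse complement would collapse into one.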

CloXD commented on June 20, 2024

No problem.
The accuracies in the reduced matrix (second through seventh columns) are not all 100, so that's fine.
The k-mers are non-canonical on purpose, to handle stranded RNA-seq.
Optimizing iMOKA for WGS would involve using canonical k-mers, adapting the aggregation step to them (all the steps that consider the k-mer sequence, such as graph generation, mapping, etc.), and eventually discretizing the k-mer counts.
Those changes require a lot of work (and a test dataset), but unfortunately my contract just ended and I don't know yet whether I'll continue to develop iMOKA or whether someone else will.
Cheers,
Claudio

jakewendt commented on June 20, 2024

Will passing --library-type fr to preprocess correctly orient the extracted kmers for paired reads when the pair is listed in the source file like this?

sample	group	FILE_R1.fastq.gz;FILE_R2.fastq.gz

CloXD commented on June 20, 2024

Yes. It will take the file matching the RE /[]?[R]2[.]/ and convert it to its reverse complement (file 1 is the one matched by []?[R_]1[._]).
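
To illustrate why reverse-complementing the second mate helps: in an fr library, R2 is read from the opposite strand, so flipping it puts its k-mers on the same strand as R1. A toy sketch with a made-up fragment; the sequence, read length, and k are hypothetical, and this does not reproduce iMOKA's own preprocessing code.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> set:
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


# Hypothetical 30-base fragment, written on the forward strand.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
read_len, k = 12, 5

r1 = fragment[:read_len]                        # first mate: forward strand
r2 = reverse_complement(fragment[-read_len:])   # second mate: opposite strand

fragment_kmers = kmers(fragment, k)
r2_flipped_kmers = kmers(reverse_complement(r2), k)

# After flipping R2, all of its k-mers are forward-strand k-mers of the fragment.
print(kmers(r1, k) <= fragment_kmers, r2_flipped_kmers <= fragment_kmers)
```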

jakewendt commented on June 20, 2024

Just to close this off, I reran from preprocessing with --library-type fr, reduce with --test-percentage 0.5 and aggregate with --global-threshold 95 --origin-threshold 99 and the problem went away. The change in aggregate parameters is likely what stopped the seg fault.

Thanks again Claudio
