ritchielabigh / imoka

interactive Multi Objective K-mer Analysis

License: Other

JavaScript 61.64% TypeScript 4.99% HTML 3.32% CSS 0.31% Dockerfile 0.07% Makefile 0.32% Python 1.76% Shell 0.54% C++ 23.75% C 3.12% R 0.19%

angular data-analysis gui k-mer machine-learning rna-seq-analysis singularity

imoka's Issues

Plotting legend issue with PCA

We have run a set of 4 groups through iMOKA and when using the GUI to view results, we notice that the legend seems to have an error. Where there should be 4 unique shapes and colors, we see only 3 unique shapes, but 4 colors. Instead of circle, diamond, diamond, square, it should be circle, square, diamond, cross (+).

The results themselves all make sense, hovering over each datapoint, but the legend had us confused for a while.

Attached is a screenshot of the PC plot.

Preprocessing not producing the "sorted.bin" files

Preprocessing does not produce the "sorted.bin" files when using the v1.1 images, but it does with v1.0.

I tried running the "iMOKA_core create" step, which is part of the preprocessing step, to see what was going wrong.

I see that the final failing code is in iMOKA/iMOKA_core/src/Matrix/BinaryMatrix.cpp but I can't find where the "Warning! File --- is empty and will be ignored." message is coming from. I don't really get it since the file is very much not empty. Any help would be appreciated.

cat kma.input 
/francislab/data1/working/20210428-EV/20210706-iMoka/SFHH005a.tsv	SFHH005a	testing


singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.img iMOKA_core create -i /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input -o kma.out -r 1
Creating a binary file from /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input
Rescale Factor: 1
Prefix size: -1
Warning! File /francislab/data1/working/20210428-EV/20210706-iMoka/SFHH005a.tsv is empty and will be ignored.
Error! Empty file /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input


singularity exec /francislab/data2/refs/singularity/iMOKA-1.1.img iMOKA_core create -i /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input -o kma.out -r 1
Creating a binary file from /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input
Rescale Factor: 1
Prefix size: -1
Warning! File /francislab/data1/working/20210428-EV/20210706-iMoka/SFHH005a.tsv is empty and will be ignored.
Error! Empty file /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input


singularity exec /francislab/data2/refs/singularity/iMOKA-1.0.img iMOKA_core create -i /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input -o kma.out -r 1
Creating a binary file from /francislab/data1/working/20210428-EV/20210706-iMoka/kma.input
Rescale Factor: 1
Prefix size: -1
0 - /francislab/data1/working/20210428-EV/20210706-iMoka/SFHH005a.tsv
Reading the sorted tsv kmer count text file.
The prefix size is fix to 9
done
Saving the matrix header to kma.out
Completed in 02 seconds .
Good luck!

Sample Variation Normalization

Claudio,

What are your thoughts on how iMOKA performs on a dataset with differing sample sequencing depths and read lengths?

For example, the TCGA dataset contains samples of a few X to 50X as well as read lengths from 50 to 150bp.

Does iMOKA do enough normalization to analyze all samples together, so long as k is less than the read length?

Does iMOKA normalize simply based on the number of k-mers accounting for both read count and read length in one go?

Jake
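As a rough illustration of why the total number of k-mers reflects both read count and read length: a read of length L contributes L - k + 1 k-mers, so dividing a count by the sample's total k-mer count scales for depth and length at once. The sketch below is generic arithmetic, not iMOKA's actual normalization code; the function names, the per-million scale, and the sample sizes are assumptions made for the example.

```python
def kmers_per_read(read_length: int, k: int) -> int:
    """Number of k-mers in one read; zero when the read is shorter than k."""
    return max(read_length - k + 1, 0)


def total_kmers(n_reads: int, read_length: int, k: int) -> int:
    """Total k-mers in a sample of equal-length reads."""
    return n_reads * kmers_per_read(read_length, k)


def per_million(count: int, total: int) -> float:
    """Scale a raw k-mer count to occurrences per million k-mers in the sample."""
    if total <= 0:
        raise ValueError("sample has no k-mers")
    return count * 1_000_000 / total


if __name__ == "__main__":
    k = 31
    # Hypothetical samples: same number of reads, different read lengths.
    short = total_kmers(1_000_000, 50, k)    # 20 k-mers per read
    long_ = total_kmers(1_000_000, 150, k)   # 120 k-mers per read
    print(short, long_)
    # A raw count that is 6x higher in the longer-read sample
    # normalizes to the same value as in the shorter-read sample.
    print(per_million(200, short), per_million(1200, long_))
```

Note that reads shorter than k contribute nothing, which is why k has to stay below the shortest read length for every sample to be represented.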

aggregate produces many lines of "stdtr domain error"

Running the aggregate step with ...

threads=32
mem=7
kdir=${PWD}/${k}
img=/francislab/data2/refs/singularity/iMOKA_extended-1.1.1.img
sbatch="sbatch [email protected] --mail-type=FAIL "
export SINGULARITY_BINDPATH=/francislab
export OMP_NUM_THREADS=${threads}
export IMOKA_MAX_MEM_GB=$((threads*(mem-1)))
export sbatch_mem=$((threads*mem+4))G

sbatch --export=SINGULARITY_BINDPATH,OMP_NUM_THREADS,IMOKA_MAX_MEM_GB --job-name=${k}iMOKAaggregate --time=60 --ntasks=${threads} --mem=${sbatch_mem} --output=${kdir}/iMOKA.aggregate.${date}.txt --wrap="singularity exec ${img} iMOKA_core aggregate --input ${kdir}/reduced.matrix --count-matrix ${kdir}/matrix.json --output ${kdir}/aggregated" )

This appears to produce the appropriate aggregate output files, but the output includes "stdtr domain error" many times. Is it supposed to? Does it matter?

Starting the analysis with the following arguments: 
	- Input file= /francislab/data1/working/20210428-EV/20210706-iMoka/21/reduced.matrix
	- Output file= /francislab/data1/working/20210428-EV/20210706-iMoka/21/aggregated
	- Count matrix file= /francislab/data1/working/20210428-EV/20210706-iMoka/21/matrix.json
	- Configuration file= nomap
	- Shift= 1
	- General threshold= 70
	- Source threshold= 80
	- Coverage limit= 50
	- Correlation thr= 1
	- Consistency= 2

Step 0 : Reading /francislab/data1/working/20210428-EV/20210706-iMoka/21/reduced.matrix...
	Total lines: 320746
done.             
	Read 90049 kmers.
	Space occupied: 47.691Mb
	Max values: 
	 - Astro_x_GBMWT:88.25
	 - Astro_x_GBMmut:92.5
	 - Astro_x_Oligo:92
	 - GBMWT_x_GBMmut:93.75
	 - GBMWT_x_Oligo:91
	 - GBMmut_x_Oligo:100

Step 1 : Computing edges... done.
Step 2 : Building the groups...done.
	Found 6992 graphs : 
	  - complex: 13,260
	  - linear: 6976,43929
	  - linear_circular: 3,10
	 Nodes used: 44199/90049 ( 45850 discarded )
	Space occupied: 54.062Mb
Step 3 : Extracting the sequences... done. 
	Found  2437 sequences.
	[ Average sequence lenght: 33 ]
Step 4 : processing the best k-mers for each graph...done.
Step 5 : Recovering winners and counts...done.
	Current memory: "142.133Mb"
Step 6 : Writing output files...
stdtr domain error

stdtr domain error

stdtr domain error

stdtr domain error

stdtr domain error

stdtr domain error

"stdtr domain error" is present 516 times.

stdtr domain error

stdtr domain error

stdtr domain error
done.

Completed in 02 seconds .
Good luck!

It produces the aggregate files ...

-rw-r-----  1 gwendt francislab   486816 Jul 14 07:52 aggregated.kmers.matrix
-rw-r-----  1 gwendt francislab  2508822 Jul 14 07:52 aggregated.json
-rw-r-----  1 gwendt francislab   798518 Jul 14 07:52 aggregated.tsv
-rw-r-----  1 gwendt francislab      584 Jul 14 07:52 aggregated.info.json
-rw-r-----  1 gwendt francislab    13456 Jul 14 07:52 iMOKA.aggregate.20210714073343.txt

The random forest step appears to work with the output ...

Starting the creation of 1 models.
Original data have 2426 dimensions with 40 samples, divided in 4 groups:
	0 - Astro 	 10 samples
	1 - GBMWT 	 10 samples
	2 - GBMmut 	 10 samples
	3 - Oligo 	 10 samples


Round 0
Training RF 
Grid search found the following values:
{'min_samples_split': 0.05, 'n_estimators': 1000}
Model RF processed. Accuracy: 0.59 
[[8 2 0 0]
 [2 6 1 1]
 [0 1 7 2]
 [2 0 1 7]]

Am I concerned about nothing?
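For anyone wanting to quantify repeated messages like this in a job log, a small helper can tally identical lines. This is only a sketch for inspecting log text; it says nothing about what causes the message inside iMOKA, and the function names are made up for the example.

```python
from collections import Counter


def count_messages(log_text: str) -> Counter:
    """Count identical non-empty lines so repeated warnings stand out."""
    return Counter(
        line.strip() for line in log_text.splitlines() if line.strip()
    )


def repeated(log_text: str, min_count: int = 2) -> list:
    """Return (message, count) pairs seen at least min_count times."""
    counts = count_messages(log_text)
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Shortened stand-in for the aggregate log quoted above.
    sample = (
        "Step 6 : Writing output files...\n"
        "stdtr domain error\n\n"
        "stdtr domain error\n\n"
        "stdtr domain error\n"
        "done.\n"
    )
    for message, n in repeated(sample):
        print(f"{n:5d}  {message}")
```

Reading the real log file with `open(path).read()` and passing the text to `repeated` gives the same kind of tally as the 516 reported above.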

How to run it on my Mac?

Hi. I downloaded iMOKA-darwin-x64.zip and tried to run it on my Mac, but it said it only supports the Linux environment. Is it possible to run your software on a Mac?

Thanks.
Matthew

Aggregate step segmentation faults

I'm running a test on some TCGA data. 4 small groups, 5 members each. I've rerun the aggregate step several times, always with the same result. A segmentation fault.

Starting the analysis with the following arguments: 
	- Input file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix
	- Output file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated
	- Count matrix file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json
	- Configuration file= /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json
	- Shift= 1
	- General threshold= 70
	- Source threshold= 80
	- Coverage limit= 50
	- Consistency= 2

Step 0 : Reading /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix...
	Total lines: 538537325
done.             
	Read 450319809 kmers.
	Space occupied: 211.375Gb
	Max values: 
	 - nMutant_x_nWT:100
	 - nMutant_x_tMutant:100
	 - nMutant_x_tWT:100
	 - nWT_x_tMutant:100
	 - nWT_x_tWT:100
	 - tMutant_x_tWT:100

Step 1 : Computing edges... done.
Step 2 : Building the groups.../var/spool/slurm/d/job263422/slurm_script: line 4: 251271 Segmentation fault      singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.6.img iMOKA_core aggregate --input /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/reduced.matrix --count-matrix /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/matrix.json --mapper-config /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/config.json --output /francislab/data1/working/20200603-TCGA-GBMLGG-WGS/20210923-iMOKA-tumor-normal-test/15/aggregated

Unfortunately, it leaves no indication of what the problem is.

Any questions or suggestions?

"{} Message" when opening K-mer list

Claudio,

I just had a successful run on a dataset that produced large aggregated k-mer lists with k=51 (~800MB) and k=31 (1GB).

When I attempt to load it into the GUI, the little modal window simply shows "{} Message" and it doesn't load. The file does not appear to be corrupted. I will attempt to rerun.

The random forest output file loads nicely.

Are there limitations on the size of file the GUI can handle? Is there another reason that this might occur?

Thanks
Jake

the singularity image

Hello Geno,
I am interested in iMoka and could not find the singularity image at the link. What is the name tag for the image?
thanks Yizhou

No error handling when no k-mers are above the thresholds at the "aggregate" step

Hello ! At the aggregating step, I had the following error:

Starting the analysis with the following arguments: 
        - Input file= /mnt/projects_tn04/KMER_analysis/DATA/imoka_files/reduced_matrix.matrix
        - Output file= /mnt/projects_tn04/KMER_analysis/DATA/imoka_files/aggregated
        - Count matrix file= /mnt/projects_tn04/KMER_analysis/DATA/imoka_files/matrix_output.json
        - Configuration file= nomap
        - Shift= 1
        - General threshold= 70
        - Source threshold= 80
        - Coverage limit= 50
        - Correlation thr= 1
Reading...Read 0 kmers.
Space occupied: 7.449Mb
Computing edges... Segmentation fault (core dumped)

using the command:

singularity exec --bind "/mnt/projects_tn04/" /mnt/software/singularity/iMOKA.sif iMOKA_core aggregate -i reduced_matrix.matrix -c matrix_output.json -o aggregated

I checked my input files; they were not empty:

TTTGTGCCTTTTGCACTATCTGTGCTCGTAA 66.700
TTTTACCCACCTGTTTACGATGGTTGCGTTA 66.050
TTTTGAGCTTTATGATTTCTAGTTCACAAAT 66.100
TTTTGCACTATCTGTGCTCGTAATGGCGAAT 65.600

I then tried:

singularity exec --bind "/mnt/projects_tn04/" /mnt/software/singularity/iMOKA.sif iMOKA_core aggregate -i reduced_matrix.matrix -c matrix_output.json -o aggregated -t 60 -T 60

and then got:

Starting the analysis with the following arguments: 
        - Input file= reduced_matrix.matrix
        - Output file= aggregated
        - Count matrix file= matrix_output.json
        - Configuration file= nomap
        - Shift= 1
        - General threshold= 60
        - Source threshold= 60
        - Coverage limit= 50
        - Correlation thr= 1
Reading...Read 327 kmers.
Space occupied: 7.172Mb
Computing edges... Done.
Building the groups
Found 52 graphs : 
        - linear: 52,145
Nodes used: 145/327 ( 182 discarded )
Extracting the sequences... Found 4 sequences.
Skipping the mapping step.Found 4 nodes.

I guess this means that there is no check on the gg object to see if it is empty before gg.makeEdges() and that causes a segfault.
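Until such a check exists, a pre-flight count of how many k-mers would survive a given threshold can be done outside iMOKA. The sketch below assumes the layout seen in the excerpt above (a k-mer followed by whitespace-separated numeric values) and assumes a k-mer passes when its largest value reaches the threshold; how iMOKA actually combines the -t and -T thresholds is not something this example claims to reproduce.

```python
def kmers_passing(lines, threshold):
    """Count rows whose largest numeric value reaches the threshold."""
    passing = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or k-mer-only row
        try:
            values = [float(v) for v in fields[1:]]
        except ValueError:
            continue  # header or malformed row
        if max(values) >= threshold:
            passing += 1
    return passing


# The four rows quoted in the issue above.
EXCERPT = [
    "TTTGTGCCTTTTGCACTATCTGTGCTCGTAA 66.700",
    "TTTTACCCACCTGTTTACGATGGTTGCGTTA 66.050",
    "TTTTGAGCTTTATGATTTCTAGTTCACAAAT 66.100",
    "TTTTGCACTATCTGTGCTCGTAATGGCGAAT 65.600",
]

if __name__ == "__main__":
    for threshold in (70, 60):
        print(threshold, kmers_passing(EXCERPT, threshold))
```

On these four rows nothing reaches 70 while all four reach 60, which matches the observation that the default thresholds read 0 k-mers and the lowered ones did not.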

File not found error in iMOKA aggregate step

Hello,

I am using iMOKA for k-mer analysis. When I run iMOKA's aggregate step, the first line of the output is "Error: file reduced.matrix doesn't exists!", but execution is not interrupted and the results are not empty. I also checked that the file reduced.matrix does exist.
Is this error message normal? Are the results generated correctly despite this error line?
Thanks in advance.

Best wishes.

iMOKA GUI

Thank you for your help to this point. We have processed our dataset with iMOKA (preprocess, create, reduce, aggregate and random forest) from the command line on our cluster for a variety of k and both v1.0 and the latest v1.1 singularity images.

We would now like to use the iMOKA GUI to import and explore it on our Macs.

When we load the aggregated.json file, the GUI does not behave as yours did in the demo video. When loading the file, it never prompts to "save in library as", but a little pop up says that the file of type kmers loaded properly. Then nothing shows up in the top right window. No "experiments" load. And the Dashboard remains at the default Welcome and Videos. No pie charts. A "K-mer list" does appear in the left column, but when selected, the only thing that shows is "Results" and a blank page. It does not show a list of kmers. No file information or alignments with IGV.

I can load the output.json and it shows some PCA and TSNE plots which is nice.

I understand that there are some feature limitations when not running on Linux, so I tried running on Linux with X11, but it did the same thing.

I also no longer see any mention of alignment or annotation in the logs even though I used the extended image.

Is the GUI meant to be used as a stand-alone tool on non-Linux machines like this? Can you help?

Wonky reduce thread

I'm running a large dataset on 64 threads. The reduce step seems to be going well and has not completed yet.

I noticed that 1 of the threads is "wonky". Most of the files are fine. All except 45, which is empty.

-rw-r----- 1 gwendt francislab 42198265 Jul 31 21:06 reduced.matrix0
-rw-r----- 1 gwendt francislab 45089596 Jul 31 21:06 reduced.matrix1
-rw-r----- 1 gwendt francislab 40672218 Jul 31 21:07 reduced.matrix2
-rw-r----- 1 gwendt francislab 42294357 Jul 31 21:06 reduced.matrix3
-rw-r----- 1 gwendt francislab 43622962 Jul 31 21:06 reduced.matrix4
-rw-r----- 1 gwendt francislab 40426897 Jul 31 21:06 reduced.matrix5
-rw-r----- 1 gwendt francislab 42336317 Jul 31 21:06 reduced.matrix6
-rw-r----- 1 gwendt francislab 40500780 Jul 31 21:06 reduced.matrix7
-rw-r----- 1 gwendt francislab 44213195 Jul 31 21:07 reduced.matrix8
-rw-r----- 1 gwendt francislab 40647645 Jul 31 21:07 reduced.matrix9
-rw-r----- 1 gwendt francislab 43499920 Jul 31 21:07 reduced.matrix10
-rw-r----- 1 gwendt francislab 44770248 Jul 31 21:06 reduced.matrix11
-rw-r----- 1 gwendt francislab 43467325 Jul 31 21:07 reduced.matrix12
-rw-r----- 1 gwendt francislab 40894598 Jul 31 21:06 reduced.matrix13
-rw-r----- 1 gwendt francislab 44720210 Jul 31 21:06 reduced.matrix14
-rw-r----- 1 gwendt francislab 43278675 Jul 31 21:06 reduced.matrix15
-rw-r----- 1 gwendt francislab 44459912 Jul 31 21:07 reduced.matrix16
-rw-r----- 1 gwendt francislab 43900975 Jul 31 21:07 reduced.matrix17
-rw-r----- 1 gwendt francislab 36582878 Jul 31 21:06 reduced.matrix18
-rw-r----- 1 gwendt francislab 45358887 Jul 31 21:07 reduced.matrix19
-rw-r----- 1 gwendt francislab 40419121 Jul 31 21:06 reduced.matrix20
-rw-r----- 1 gwendt francislab 43360170 Jul 31 21:06 reduced.matrix21
-rw-r----- 1 gwendt francislab 42187445 Jul 31 21:07 reduced.matrix22
-rw-r----- 1 gwendt francislab 42139376 Jul 31 21:06 reduced.matrix23
-rw-r----- 1 gwendt francislab 40385381 Jul 31 21:07 reduced.matrix24
-rw-r----- 1 gwendt francislab 43056370 Jul 31 21:07 reduced.matrix25
-rw-r----- 1 gwendt francislab 40072249 Jul 31 21:06 reduced.matrix26
-rw-r----- 1 gwendt francislab 41983797 Jul 31 21:06 reduced.matrix27
-rw-r----- 1 gwendt francislab 41565199 Jul 31 21:06 reduced.matrix28
-rw-r----- 1 gwendt francislab 43163191 Jul 31 21:06 reduced.matrix29
-rw-r----- 1 gwendt francislab 43630429 Jul 31 21:06 reduced.matrix30
-rw-r----- 1 gwendt francislab 43579982 Jul 31 21:07 reduced.matrix31
-rw-r----- 1 gwendt francislab 39352865 Jul 31 21:06 reduced.matrix32
-rw-r----- 1 gwendt francislab 45720059 Jul 31 21:07 reduced.matrix33
-rw-r----- 1 gwendt francislab 41770569 Jul 31 21:07 reduced.matrix34
-rw-r----- 1 gwendt francislab 40696043 Jul 31 21:06 reduced.matrix35
-rw-r----- 1 gwendt francislab 38705579 Jul 31 21:07 reduced.matrix36
-rw-r----- 1 gwendt francislab 41573206 Jul 31 21:06 reduced.matrix37
-rw-r----- 1 gwendt francislab 42975079 Jul 31 21:07 reduced.matrix38
-rw-r----- 1 gwendt francislab 45269788 Jul 31 21:07 reduced.matrix39
-rw-r----- 1 gwendt francislab 43655470 Jul 31 21:06 reduced.matrix40
-rw-r----- 1 gwendt francislab 44032364 Jul 31 21:07 reduced.matrix41
-rw-r----- 1 gwendt francislab 39238044 Jul 31 21:06 reduced.matrix42
-rw-r----- 1 gwendt francislab 42170614 Jul 31 21:06 reduced.matrix43
-rw-r----- 1 gwendt francislab 41549272 Jul 31 21:06 reduced.matrix44
-rw-r----- 1 gwendt francislab        0 Jul 30 23:59 reduced.matrix45
-rw-r----- 1 gwendt francislab 42163913 Jul 31 21:06 reduced.matrix46
-rw-r----- 1 gwendt francislab 41597309 Jul 31 21:07 reduced.matrix47
-rw-r----- 1 gwendt francislab 41337375 Jul 31 21:06 reduced.matrix48
-rw-r----- 1 gwendt francislab 44680255 Jul 31 21:06 reduced.matrix49
-rw-r----- 1 gwendt francislab 43853187 Jul 31 21:07 reduced.matrix50
-rw-r----- 1 gwendt francislab 42507805 Jul 31 21:07 reduced.matrix51
-rw-r----- 1 gwendt francislab 43123127 Jul 31 21:06 reduced.matrix52
-rw-r----- 1 gwendt francislab 45203103 Jul 31 21:07 reduced.matrix53
-rw-r----- 1 gwendt francislab 44950654 Jul 31 21:06 reduced.matrix54
-rw-r----- 1 gwendt francislab 43074147 Jul 31 21:06 reduced.matrix55
-rw-r----- 1 gwendt francislab 44744756 Jul 31 21:07 reduced.matrix56
-rw-r----- 1 gwendt francislab 45211901 Jul 31 21:06 reduced.matrix57
-rw-r----- 1 gwendt francislab 42131109 Jul 31 21:06 reduced.matrix58
-rw-r----- 1 gwendt francislab 43679939 Jul 31 21:06 reduced.matrix59
-rw-r----- 1 gwendt francislab 43335459 Jul 31 21:06 reduced.matrix60
-rw-r----- 1 gwendt francislab 44835238 Jul 31 21:06 reduced.matrix61
-rw-r----- 1 gwendt francislab 43155599 Jul 31 21:07 reduced.matrix62
-rw-r----- 1 gwendt francislab 42294623 Jul 31 21:07 reduced.matrix63

Most of the logs are progressing well and look something like ...

Perc	Total	Kept	MinEntropy	RunningTime
0.242438	100000	9754	2.53888	00:50:01
0.818095	200000	19326	1.79968	01:41:37
1.30261	300000	29819	1.63002	02:36:52
1.74865	400000	40330	1.43562	03:34:31
2.38123	500000	51991	1.43809	04:36:45
2.77666	600000	63545	0.757176	05:40:15
3.24291	700000	75983	1.81971	06:47:48
3.61461	800000	85103	1.71512	07:37:41
4.0653	900000	95349	2.77771	08:32:17
5.08751	1000000	104958	1.70238	09:23:03
5.80341	1100000	116967	1.63045	10:27:06

Some are progressing much faster than others, currently between 5 and 30%. But the log for 45 looks very different. Perc is not getting larger like the rest and Kept and MinEntropy are stuck at 0.

Perc	Total	Kept	MinEntropy	RunningTime
0.00117842	100000	0	0	00:31:40
0.000357119	200000	0	0	01:03:26
0.00112724	300000	0	0	01:35:13
0.000706139	400000	0	0	02:07:02
0.000766148	500000	0	0	02:38:48
7.37509e-05	600000	0	0	03:10:31
5.59012e-05	700000	0	0	03:42:14
0.000536956	800000	0	0	04:13:59
0.00117401	900000	0	0	04:45:39
0.000965567	1000000	0	0	05:17:24
0.000191923	1100000	0	0	05:49:07
0.00111759	1200000	0	0	06:20:51
0.00109619	1300000	0	0	06:52:32
0.00044954	1400000	0	0	07:24:16
0.000901621	1500000	0	0	07:55:59
0.00031934	1600000	0	0	08:27:42
0.000215662	1700000	0	0	08:59:27
0.000275124	1800000	0	0	09:31:10
0.000163167	1900000	0	0	10:02:53
...
0.000984228	7200000	0	0	38:04:43
0.00107386	7300000	0	0	38:36:25
0.000275069	7400000	0	0	39:08:08
0.000379675	7500000	0	0	39:39:51
0.000610525	7600000	0	0	40:11:37
0.000398798	7700000	0	0	40:43:20
0.000580537	7800000	0	0	41:15:00
5.7413e-05	7900000	0	0	41:46:43
1.14371e-06	8000000	0	0	42:18:27
0.00115464	8100000	0	0	42:50:09

It doesn't make sense to me. It has not completed yet and probably won't for about 60 hours.

Thoughts?
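Spotting an empty per-thread file in a long listing like the one above is easier with a short script. This is a minimal sketch that assumes the per-thread outputs follow the `reduced.matrixN` naming shown in the listing; the function name is made up for the example, and an empty file may simply mean that thread has not kept any k-mer yet.

```python
from pathlib import Path


def empty_shards(directory, pattern="reduced.matrix*"):
    """Return the names of zero-byte files matching pattern, sorted by name."""
    return sorted(
        path.name
        for path in Path(directory).glob(pattern)
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__":
    # Check the current working directory.
    for name in empty_shards("."):
        print("empty:", name)
```

Running it periodically during a long reduce step shows whether the same shard stays at zero bytes while the others keep growing.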

Memory control

Hello again Claudio.

I am running this pipeline on our cluster and I continue to increase k to see how far I can go. Initially, I set IMOKA_MAX_MEM_GB to the full amount of memory requested for the job minus about 10GB for singularity headroom. However, the higher I go with k, the lower I have to set IMOKA_MAX_MEM_GB because the job would crash with OUT_OF_MEMORY. For example, if I submit a job that requests 450GB of memory, I have to set IMOKA_MAX_MEM_GB=256 for k=81. If I increase k to 91, I have to lower IMOKA_MAX_MEM_GB to 192.

I thought that I'd have a look at memory usage off of the cluster. When I run ...

export OMP_NUM_THREADS=20
export IMOKA_MAX_MEM_GB=100
singularity exec  iMOKA_extended-1.1.2.img iMOKA_core reduce --input matrix.json --output testreduce.matrix > testreduce.out 2> testreduce.err &

... for my dataset with k=31, it actually peaks at about 90GB.

For k=41 it actually peaks at about 115GB.

For k=61 it actually peaks at about 155GB.

Is there a better way to more accurately and more stably set the memory limit?

Thanks
Jake
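One way to turn the three peaks reported above into a planning estimate is a straight-line fit of peak memory against k. This is only an extrapolation from three observations on one dataset with IMOKA_MAX_MEM_GB=100 and 20 threads; it is not a description of how iMOKA allocates memory, and it says nothing about how the peak changes with a different limit.

```python
def fit_line(points):
    """Ordinary least-squares line through (x, y) points.

    Returns (slope, intercept).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# (k, observed peak in GB) from the runs described above.
OBSERVED = [(31, 90), (41, 115), (61, 155)]

if __name__ == "__main__":
    slope, intercept = fit_line(OBSERVED)
    print(f"peak_gb ~= {intercept:.1f} + {slope:.2f} * k")
    for k in (81, 91):
        print(k, round(intercept + slope * k))
```

The fit suggests the overshoot above the configured limit grows by roughly 2 GB per unit of k on this dataset, which would be consistent with having to lower the limit as k increases.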

Error when starting singularity exec iMOKA preprocess.sh -i test

Hello ,
I am having difficulty testing your software. When I try to start the pipeline in the singularity image:

singularity exec iMOKA preprocess.sh -i test

It returns

ERROR! Give a valid input file using -i

But I followed the recommendation and used a tab-delimited file with the 3 columns (I attached a screenshot of the file with the metacharacters shown).

Can you help me please ?
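A quick way to rule out delimiter problems is to check the file programmatically rather than by eye. The sketch below only verifies that every non-blank line has exactly three tab-separated, non-empty fields; it does not know what iMOKA expects in each column, and it cannot tell whether the file is visible from inside the container. The function name is made up for the example.

```python
def check_source_lines(lines, n_columns=3):
    """Return (line_number, problem) pairs for rows that are not
    exactly n_columns tab-separated, non-empty fields."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != n_columns:
            if len(line.split()) == n_columns:
                hint = "spaces instead of tabs?"
            else:
                hint = "wrong column count"
            problems.append(
                (number, f"{len(fields)} tab-separated fields ({hint})")
            )
        elif any(not field.strip() for field in fields):
            problems.append((number, "empty field"))
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for number, problem in check_source_lines(handle):
            print(f"line {number}: {problem}")
```

Saved under a name of your choice, it takes the source file path as its only argument and prints nothing when every row looks well formed.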

can't get singularity to run

iMOKA won't run on my Mac. I've tried running it on Ubuntu in a virtual machine, but when I give it the singularity address, e.g.:
https://github.com/RitchieLabIGH/iMOKA/releases/download/v1.0/iMOKA
or my local address I get the error:

Options given are empty
check options
/bin/sh: 1: singularity: not found
"Command failed: singularity --version\n/bin/sh 1: singularity: not found\n"
COMPLETED

any ideas?
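The "/bin/sh: 1: singularity: not found" line means the shell could not find a `singularity` executable on the PATH of that machine, independent of which image address is given. A minimal sketch for checking this before launching anything; the function name is made up for the example.

```python
import shutil


def find_tool(name="singularity"):
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path:
        print("singularity found at", path)
    else:
        print("singularity not found on PATH; install it before starting iMOKA")
```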

How to incorporate paired end reads?

Hi,

I have forward and reverse reads for each sample in fastq.gz format. I would like to know how to create the source.tsv file for the first step?
Does iMOKA take two fastq files as input (as follows)? OR should I merge the forward and reverse reads in one fastq file?

sample_1_name sample_1_group sample_1_forward_read_file sample_1_reverseComplement_read_file
sample_2_name sample_2_group sample_2_forward_read_file sample_2_reverseComplement_read_file

This is great software for k-mer-based analysis, but I am stuck on how to incorporate paired-end read files.
Looking forward to your suggestion.

-Archie

--threads does not work?

Dear RitchieLabIGH,

thank you for developing a great tool! I'm using this tool to generate some k-mer features for my current research.

I notice that when I have specified the option -t 128 to use 128 threads, it seems like the option does not work and does not affect the performance of the preprocess.sh function. Please see attached the two screenshots of my ongoing analysis.

This is the command I'm using:
singularity exec --bind /home/hieunguyen/iMOKA/:/input iMOKA.img preprocess.sh -i source.tsv -t 128 -r 256

I would like to ask whether this behaviour is normal, or whether I set something up wrong.

Thanks again and I'm looking forward to your reply.
All the best
Hieu Nguyen

Confidence Interval for ROC AUC metric (Random Forest)

Hi,

I know this extends a bit beyond the iMOKA novelty, but am curious about your opinion on a 95% CI for the measured accuracy of the best performing RF models? My thought is to use a large number of bootstraps with replacement, each with a train/test split, retraining the RF with the optimal parameters selected at the earlier stage (min_samples_split and n_estimators), and computing a bootstrap CI from the resulting distribution of accuracy metrics.

Does this seem like a valid way to estimate the error?
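The idea can be sketched without committing to a particular model. In the version below, each bootstrap round trains on a resample drawn with replacement and scores on the samples left out of that resample (out-of-bag), rather than splitting the resample itself; this is a deliberate substitution, because splitting after resampling can put copies of the same sample in both train and test. `fit_and_score` is a hypothetical callback that would wrap the random forest with the fixed min_samples_split and n_estimators; it is not part of iMOKA.

```python
import random


def percentile(sorted_values, q):
    """Linearly interpolated quantile of an ascending list, q in [0, 1]."""
    if not sorted_values:
        raise ValueError("no values")
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def bootstrap_ci(X, y, fit_and_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a model's score.

    fit_and_score(train_X, train_y, test_X, test_y) must return a float.
    """
    rng = random.Random(seed)
    n = len(X)
    scores = []
    for _ in range(n_boot):
        train = [rng.randrange(n) for _ in range(n)]  # with replacement
        in_bag = set(train)
        test = [i for i in range(n) if i not in in_bag]  # unseen samples
        if not test:
            continue  # nothing left to score on in this round
        scores.append(
            fit_and_score(
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in test], [y[i] for i in test],
            )
        )
    scores.sort()
    return percentile(scores, alpha / 2), percentile(scores, 1 - alpha / 2)
```

With 40 samples, as in the run reported earlier, each round leaves only around a third of the samples out of bag, so the resulting interval will be wide; that width is itself informative about how uncertain the accuracy estimate is.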

Empty aggregated.sequences.bed.norep.bed causes seg fault

Sorry to report another bug without a solution.

During the aggregate step, it looks like all sequences intersect and then the norep file is empty. IMOKA then crashes with ...

***** ERROR: Requested column 4, but database file /francislab/data1/working/20191008_Stanford71/20210714-iMOKA/81/aggregated.sequences.bed.norep.bed.sorted only has fields 1 - 0.
/var/spool/slurm/d/job203747/slurm_script: line 4: 189892 Segmentation fault      singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.1.img iMOKA_core aggregate --input reduced.matrix --count-matrix matrix.json --mapper-config config.json --output aggregated

I suspect that it will run if I remove --mapper-config config.json but our cluster is rather full at the moment.

Aggregate step crash because "Kept 0 alignments"

I've been using a subset of my dataset with read lengths less than or equal to 30bp. And then I've been investigating shorter and shorter k. Eventually it crashes, I presume because Step 4 kept 0 alignments.

The following are tails of the aggregate step logs, which succeeds for k=25 and then fails for k=23, 21 and 19.

tail -18 25.cutadapt2.lte30/iMOKA.aggregate.*.txt 
Step 3 : Extracting the sequences... done. 
	Found  16 sequences.
	[ Average sequence lenght: 29 ]
Step 4 : Mapping the sequences... done.
	Found 62 alignments.
        Kept 58 alignments.
done.
Step 5 : Annotating...using repeat annotation.
	All the sequences mapped to repetitive elementsdone.
	Space occupied: 6.121Mb
	Processing unannotated...done.
done.
Step 6 : Recovering winners and counts...done.
	Current memory: "10.781Mb"
Step 7 : Writing output files...done.

Completed in 01 minute and 12 seconds .
Good luck!

tail -9 23.cutadapt2.lte30/iMOKA.aggregate.*.txt 
Step 3 : Extracting the sequences... done. 
	Found  43 sequences.
	[ Average sequence lenght: 27 ]
Step 4 : Mapping the sequences... done.
	Found 4 alignments.
        Kept 0 alignments.
done.
Step 5 : Annotating...terminate called after throwing an instance of 'char const*'
/var/spool/slurm/d/job204014/slurm_script: line 4: 151177 Aborted                 singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.2.img iMOKA_core aggregate --input /francislab/data1/working/20210428-EV/20210706-iMoka/23.cutadapt2.lte30/reduced.matrix --count-matrix /francislab/data1/working/20210428-EV/20210706-iMoka/23.cutadapt2.lte30/matrix.json --mapper-config /francislab/data1/working/20210428-EV/20210706-iMoka/23.cutadapt2.lte30/config.json --output /francislab/data1/working/20210428-EV/20210706-iMoka/23.cutadapt2.lte30/aggregated

tail -9 21.cutadapt2.lte30/iMOKA.aggregate.*.txt 
Step 3 : Extracting the sequences... done. 
	Found  85 sequences.
	[ Average sequence lenght: 26 ]
Step 4 : Mapping the sequences... done.
	Found 4 alignments.
        Kept 0 alignments.
done.
Step 5 : Annotating...terminate called after throwing an instance of 'char const*'
/var/spool/slurm/d/job203968/slurm_script: line 4: 13743 Aborted                 singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.2.img iMOKA_core aggregate --input /francislab/data1/working/20210428-EV/20210706-iMoka/21.cutadapt2.lte30/reduced.matrix --count-matrix /francislab/data1/working/20210428-EV/20210706-iMoka/21.cutadapt2.lte30/matrix.json --mapper-config /francislab/data1/working/20210428-EV/20210706-iMoka/21.cutadapt2.lte30/config.json --output /francislab/data1/working/20210428-EV/20210706-iMoka/21.cutadapt2.lte30/aggregated

tail -9 19.cutadapt2.lte30/iMOKA.aggregate.*.txt 
Step 3 : Extracting the sequences... done. 
	Found  100 sequences.
	[ Average sequence lenght: 24 ]
Step 4 : Mapping the sequences... done.
	Found 4 alignments.
        Kept 0 alignments.
done.
Step 5 : Annotating...terminate called after throwing an instance of 'char const*'
/var/spool/slurm/d/job203963/slurm_script: line 4: 282382 Aborted                 singularity exec /francislab/data2/refs/singularity/iMOKA_extended-1.1.2.img iMOKA_core aggregate --input /francislab/data1/working/20210428-EV/20210706-iMoka/19.cutadapt2.lte30/reduced.matrix --count-matrix /francislab/data1/working/20210428-EV/20210706-iMoka/19.cutadapt2.lte30/matrix.json --mapper-config /francislab/data1/working/20210428-EV/20210706-iMoka/19.cutadapt2.lte30/config.json --output /francislab/data1/working/20210428-EV/20210706-iMoka/19.cutadapt2.lte30/aggregated