al-mcintyre / mcaller Goto Github PK
View Code? Open in Web Editor NEWA python program to call methylation (m6A in DNA) from nanopore signal data
License: MIT License
A python program to call methylation (m6A in DNA) from nanopore signal data
License: MIT License
Hello,
https://salsa.debian.org/med-team/mcaller/-/blob/master/debian/patches/pickl3latin.patch
and
https://salsa.debian.org/med-team/mcaller/-/blob/master/debian/patches/python2to3.patch
do the trick - plus the change in the shebang (#!) line. At least so I hoped. Following your README.md instructions for testing, all the invocations to mCaller run, but not so
$ ./make_bed.py -f testdata/masonread1.eventalign.diffs.6 -d 1 -t 0.5
testdata/masonread1.eventalign.diffs.6
Traceback (most recent call last):
File "./make_bed.py", line 80, in aggregate_by_pos
csome,read,pos,context,values,strand,label,prob = tuple(line.split('\t'))
ValueError: not enough values to unpack (expected 8, got 6)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./make_bed.py", line 205, in <module>
main()
File "./make_bed.py", line 202, in main
aggregate_by_pos(args.mCaller_file,output_file,args.min_read_depth,args.mod_threshold,args.positions,args.control,args.vo,args.gff,args.ref,args.plot,args.plotdir,args.plotsummary)
File "./make_bed.py", line 82, in aggregate_by_pos
csome,read,pos,context,values,strand,label = tuple(line.split('\t'))
ValueError: not enough values to unpack (expected 7, got 6)
This maybe something associated with Python 3 differences or something else - maybe this rings an immediate bell on your end.
There are ways to have the code compatible to Python 2 and 3 at the same time. Is that something you would aim for? Please kindly direct me.
Many thanks
Steffen
mCaller.py -m GATC -r reference.fa -e human.0.eventalign.tsv -f astq_0.fastq -b C
Traceback (most recent call last):
File "mCaller.py", line 187, in <module>
main()
File "mCaller.py", line 157, in main
assert os.path.isfile(modelfile), 'model file not found at '+modelfile
AssertionError: model file not found at /software/mCaller/model_NN_6_m5C.pkl
Hi, I'm very excited to try your algorithm.
I'm just wondering whether you have a model for R9.4 flow cells. "model_NN_6_m6A.pkl" or "mod_NN_6_m6A.pkl" is the one?
Hi @al-mcintyre,
I tried to use mCaller for 6mA methylation calling in E. coli. Meanwhile, I employed other software in methylation collection, including Deepmod and Deepsignal. The correlation among these three methods poses a confusing problem for detaild research. I don't know how to balance Quantity and accuracy.
Here, the parameters of mCaller were set as coverage > 3 and percent >0.5.
Looking forware to your reply.
Bin
Hi @ralowe, @al-mcintyre Could mCaller detect methylation from other species, such as human?
Best.
Hi,
I´m trying to train my own model via mCaller. The methylation sites obtained after training look promising. However, when I try to apply the new model to a dataset I get an error.
Here is the full error message based on the test data provided following the commands described in the README:
testdata/
1 contigs
1 threads
26dd376e-9d82-41fc-921e-71e559c8e8d1_Basecall_2D_template 13673 CATTAMCAATC 1.1,-0.23666666666666666,0.28,1.17,-0.02,0.6699999999999999,7.055265349382997 - - Index or Key Error
['general'] ['AA', 'AC', 'AG', 'AM', 'AT', 'MG', 'MA', 'MC', 'MM', 'MH', 'MT'] MC
'MH'
Traceback (most recent call last):
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 188, in
main()
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 185, in main
args.train,modelfile,args.skip_thresh,args.qual_thresh,args.classifier,training_tsv,args.plot_training)
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 59, in distribute_threads
extract_features(tsvname,refname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,0,endline=bytesize,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list)
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/extract_contexts.py", line 218, in extract_features
print model[twobase_model].predict_proba([diffs])
KeyError: 'MH'
Many thanks for any clarification!
Christian
Hello,
I just wanted to let you know that nanopolish extract cannot extract fast5 file pathway information to incorporate into read headers. The authors of the program have basically noted its obsolete for current versions of nanopolish. I have a set of four isolates where two are complementary mutants that have a methylase knocked out. I first de-multiplexed fast5 previously with Deepbinner since I was concerned about the clonal nature of these isolates. However, after thinking about the pathway more clearly, I'm thinking besides adding more computational time, omitting demultiplexing shouldn't be a concern for downstream results. Am I correct in thinking that demultiplexing fast5 files is not necessary for downstream analysis?
Thanks!
Will
hi @ralowe @al-mcintyre Thanks for share the project , nice work.
I use mCaller detect m6A meth susses, but when I compared the result with tombo, they are so big big big big big difference,the same sample detect result just like this:
So, my question is why are these two soft's results so differentiations, and could you give me the answer about these results, and which one close to the reality?
(my opinion : I suppose to tombo, cause that the methy lever most lower, the higher methy lever gets higher false positive rate)
i have 20x nanopore datasets, and i use this software to call m6A genome loci. however, the result show that the m6A loci significantly focus on some genome region. i wonder whether the nanopre datasets suit for m6A analysis and the result of this software was convincing.
Because i have other 10x PacBio datasets. Although the reference genome is different, the m6A distribution result of PB was better than nanopre.
I tried running
mCaller.py <-m GATC or -p positions.txt> -r <reference>.fasta -d r95_twobase_model_NN_6_m6A.pkl -e <filename>.eventalign.tsv -f <filename>.fastq -b A
with the variables set for my data, however mCaller didn't find r95_twobase_model_NN_6_m6A.pkl. I found the absolute path of the file and use that as an argument, but was it meant to be able to locate the file within its own installed folder, or did the pickle need to be in the local path?
For detecting 6mA modifications, I am currently working on mCaller and following the instructions under 'Pipeline for methylation detection from R9 data'. The extracting and aligning commands generated the required files, but the nanopolish eventalign command couldn't be executed.
I gave the following command: nanopolish eventalign -t 1 --scale-events -n -r trytry.fastq -b trytry.sorted.bam -g Desktop/Thesis/example/reference.fasta trytry.eventalign.tsv
But the following error occurred:
contig position reference_kmer read_name strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level
[bam process] iterating over region: trytry.eventalign.tsv
[E::sam_itr_next] Null iterator
I would be greatly pleased if you could help me out.
Thankyou.
Hello,
We are in the last minutes of this month's Debian Covid-19 Hackathon and would very much like to redistribute mCaller with the Linux distribution and other software that depends on it. Could you specify one, please, so we know we are allowed to redistribute your work?
Many thanks and regards,
Steffen
Hi,
I was wondering which in which sequencing contexts (e.g. GATC) mCaller can predict methylated adenine?
Thanks,
Vahid
I been trying to run mCaller on promethION data and I have some questions.
1. On nanopolish index :
When indexing reads with nanopolish index, argument '-q' is called invalid option, yet it manage to find almost all path to fast5. I don't see any '-q' option in the nanopolish index help.
Would it be recommanded/possible to provide sequencing summary files to make the indexing faster ? Then where in the command line should I add the albacore_output.fastq path ?
2. On make_bed.py :
'-m' argument is called invaling option. I guessed that this value is a threshold for a minimum fraction of methylated reads at a genomic position. To run this script I have to put out the 'm' option and therefore I have to keep the default value of 0.5.
3. On .eventalign.diffs.6 :
For this file you give the first seven column names, but I can't see anywhere the meaning the 8th column. Would it be the methylation fraction ? Yet it seems to me that it is a per-reads file (I find duplicated genomic positions), so I thought the fraction would be either 0 or 1..
I see from the REAMDE that the columns are "chromosome, read name, genomic position, position k-mer context, features, strand, label, and probability of methylation"
Here is some output from some PCR E. Coli samples which should have no methylation.
U00096.3 4a2c2f99-da0d-4d53-96f3-1573d778d0ae 782 ACCGGMTCGAT 0.8,-0.195,-1.1199999999999999,1.28,-1.99,-0.9225,18.939263322884013 - A 0.28
U00096.3 4a2c2f99-da0d-4d53-96f3-1573d778d0ae 881 ATGTGMTCAGC -0.36,1.07,-2.35,-3.3,-1.47,-2.5,18.939263322884013 - A 0.25
U00096.3 4a2c2f99-da0d-4d53-96f3-1573d778d0ae 1168 AAGGGMTCTGG 0.25428571428571434,-0.63,0.26333333333333336,-1.115,-6.17,3.17,18.939263322884013 - A 0.27
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 619 CGTCGMTCAGG 2.1666666666666665,2.33,-1.5050000000000001,-3.07,2.306666666666666,-3.6350000000000002,18.848024316109424 + m6A 0.91
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 726 TGTCGMTCGCC 4.38,-1.5699999999999998,0.34,6.92,3.455,1.95,18.848024316109424 + m6A 0.94
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 880 GGCTGMTCACA 2.13,-2.78,-0.97,2.0974999999999997,3.86,0.35,18.848024316109424 + m6A 0.88
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 1167 TCCAGMTCCCT -1.935,-3.34,-1.86,-2.11,-0.54,0.04500000000000015,18.848024316109424 + A 0.03
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 1569 GTGGGMTCTCG 2.145,1.86,-1.205,1.3249999999999997,-0.16666666666666666,-2.7,18.848024316109424 + m6A 0.77
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 1634 CAGGGMTCTTC -1.805,-1.165,-0.06,3.34,-3.6100000000000003,2.1550000000000002,18.848024316109424 + A 0.38
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 1723 TACCGMTCAGG 2.31,1.35,-0.78,6.26,2.85,-0.8480000000000001,18.848024316109424 + m6A 0.98
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 2020 GGCGGMTCAAT 0.5333333333333333,-1.4,-0.99,-0.6549999999999999,-3.17,0.01,18.848024316109424 + A 0.12
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 2353 AGATGMTCTTT 2.0,2.4,0.71,0.32,1.9949999999999999,-2.805,18.848024316109424 + m6A 0.8
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 2632 TAATGMTCCGC 1.52,0.21,-2.655,2.34,1.27,-1.0416666666666667,18.848024316109424 + m6A 0.86
U00096.3 d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1 2761 TGCTGMTCTGC -0.53,-2.285,-1.33,0.69,-2.12,-4.57,18.848024316109424 + A 0.35
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 619 CGTCGMTCAGG 0.6549999999999999,-1.52,-1.3175000000000001,-0.84,1.95,0.08,19.64565043894653 + A 0.3
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 726 TGTCGMTCGCC 4.96,-1.9975,-0.4666666666666666,2.4,-2.02,0.365,19.64565043894653 + A 0.35
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 781 TATCGMTCCGG 0.79,-3.1133333333333333,-2.47,1.2,0.66,-1.4800000000000002,19.64565043894653 + m6A 0.63
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 880 GGCTGMTCACA 2.4066666666666667,0.51,2.76,1.13,0.685,-2.15,19.64565043894653 + m6A 0.57
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 1167 TCCAGMTCCCT 2.16,6.13,1.23,-0.96,5.97,1.405,19.64565043894653 + m6A 0.89
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 1569 GTGGGMTCTCG -3.07,-1.074,-2.525,-0.895,3.19,-2.54,19.64565043894653 + m6A 0.73
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 1634 CAGGGMTCTTC -1.86,-2.5,-2.4000000000000004,-1.73,-11.36,-0.69,19.64565043894653 + A 0.38
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 1723 TACCGMTCAGG 0.47,-2.85,-1.46,3.79,3.255,0.44,19.64565043894653 + m6A 0.93
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 2020 GGCGGMTCAAT 0.5466666666666666,-0.04500000000000004,-2.0533333333333332,1.37,-0.82,0.57,19.64565043894653 + A 0.2
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 2353 AGATGMTCTTT 1.495,1.99,2.758,1.2366666666666666,0.4950000000000002,-0.79,19.64565043894653 + A 0.32
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 2515 CGACGMTCTCT 0.81,3.085,-0.07,4.31,-10.54,-2.85,19.64565043894653 + m6A 0.96
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 2632 TAATGMTCCGC 1.29,1.8,-2.55,0.39749999999999996,1.17,0.45999999999999996,19.64565043894653 + A 0.43
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 3073 TGCCGMTCGGT 2.37,1.096666666666667,-0.5199999999999999,-1.865,6.8149999999999995,0.5,19.64565043894653 + m6A 0.94
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 3262 TGATGMTCGAA 4.39,-2.05,6.45,1.72,0.88,-3.21,19.64565043894653 + m6A 0.79
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 3755 GAAAGMTCACA 0.8333333333333334,0.25,0.8700000000000001,2.57,3.44,-1.01,19.64565043894653 + m6A 0.88
U00096.3 143b58d8-6651-4192-9592-1026c2a75cc3 3904 CGAAGMTCCTC -0.965,0.46,0.10666666666666662,2.275,-0.36,-2.42,19.64565043894653 + A 0.45
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 619 CGTCGMTCAGG 0.67,-0.3500000000000001,0.9,-0.51,2.58,-0.55,20.60535117056856 + A 0.5
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 726 TGTCGMTCGCC 2.6966666666666668,0.28,-0.695,1.12,3.16,1.75,20.60535117056856 + A 0.47
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 781 TATCGMTCCGG 0.4366666666666667,-0.83,0.64,-0.5700000000000001,0.05333333333333338,-2.19,20.60535117056856 + A 0.27
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 880 GGCTGMTCACA -0.17,0.895,-2.55,0.86,0.40500000000000014,-3.47,20.60535117056856 + m6A 0.84
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 1569 GTGGGMTCTCG -2.1125,-0.49,-1.05,10.19,2.01,-0.14,20.60535117056856 + m6A 0.93
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 1634 CAGGGMTCTTC -0.83,-1.415,-0.57,0.24,-2.025,-2.955,20.60535117056856 + A 0.18
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 1723 TACCGMTCAGG -0.14,-3.75,-2.19,-2.89,-4.63,-0.8799999999999999,20.60535117056856 + A 0.19
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 2020 GGCGGMTCAAT 1.09,-1.08,-0.72,-3.1466666666666665,-2.88,-4.333333333333333,20.60535117056856 + A 0.33
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 2353 AGATGMTCTTT 0.39999999999999997,-0.13,2.24,4.525,1.13,-0.3,20.60535117056856 + m6A 0.59
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 2632 TAATGMTCCGC 1.0075,0.11,-1.11,0.17833333333333334,0.8833333333333333,1.24,20.60535117056856 + A 0.09
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 2761 TGCTGMTCTGC -0.08,0.31,0.52,-1.46,-0.33799999999999997,1.95,20.60535117056856 + A 0.02
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 3073 TGCCGMTCGGT 2.3266666666666667,0.75,-0.19666666666666663,2.96,3.81,1.78,20.60535117056856 + m6A 0.76
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 3755 GAAAGMTCACA 0.04999999999999999,-1.4333333333333333,1.1199999999999999,1.11,0.985,-3.58,20.60535117056856 + m6A 0.65
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 3904 CGAAGMTCCTC 1.06,1.89,-0.48500000000000004,-0.02,1.54,1.86,20.60535117056856 + A 0.13
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 4820 GCGTGMTCAGT 3.035,-4.26,1.47,0.030000000000000027,-0.24,1.32,20.60535117056856 + A 0.12
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 4916 GTTGGMTCTGC -0.11000000000000001,0.41,-1.44,-0.62,0.39999999999999997,-1.89,20.60535117056856 + A 0.27
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 5345 AGGCGMTCGTG -0.71,1.35,1.53,0.29,0.8450000000000002,5.33,20.60535117056856 + A 0.05
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 5489 TCATGMTCATC 2.57,2.265,0.6399999999999999,-0.14500000000000002,2.09,-0.12,20.60535117056856 + A 0.44
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 5851 TGATGMTCTTA -1.8133333333333335,-1.7349999999999999,4.48,-5.14,-8.296666666666667,-4.5,20.60535117056856 + m6A 0.62
U00096.3 17fdb5ab-3500-4825-9112-67dc62ea11e4 6061 GACGGMTCCCC -4.71,-0.94,-1.1366666666666667,-4.17,-0.21500000000000002,-1.85,20.60535117056856 + A 0.09
On all 6mA contexts, is mCaller reporting a high probability of methylation?
It is in the code, but the required arguments bed1 and bed2 are not required for the version argument.
$ ./compare_genomes.py --version
usage: compare_genomes.py [-h] --bed1 BED1 --bed2 BED2 [-g GENOME_ALIGNMENT] [-v]
compare_genomes.py: error: the following arguments are required: --bed1, --bed2
$ ./make_bed.py --version
usage: make_bed.py [-h] [-d MIN_READ_DEPTH] [-t MOD_THRESHOLD] -f MCALLER_FILE [-p POSITIONS] [--control] [--gff] [--ref REF] [--plot] [--plotsummary] [--plotdir PLOTDIR] [--vo] [-v]
make_bed.py: error: the following arguments are required: -f/--mCaller_file
$ ./mCaller.py --version
mCaller v0.3
Dear authors,
when we run the command line:
./make_bed.py -f testdata/masonread1.eventalign.diffs.6 -d 1 -m 0.5
the error says that programm 'make_bed.py' cannot recognize the argument "-m 0.5"
we don't understand why '-m 0.5' cannot be recognized as an argument when running this file.
Hello,
I am trying to run the m6a model with a reference assembly I generated using flye. However, I'm getting this error:
python3 /data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py -t 24 -m CRAANNNNNNNTGC -r ./6180WT/6180WT_flye_assembly.fasta -d /data/opt/programs/etc/mCaller/mCaller-1.0.3/r95_twobase_model_NN_6_m6A.pkl -e 6180WT.eventalign.tsv -b A -f barcode04_6180WT.fastq
1 contigs
24 threads
Process Process-2:
Error: could not find sequence for reference contig g_1
Error: could not find sequence for reference contig contig
Process Process-11:
Error: could not find sequence for reference contig 14632
Error: could not find sequence for reference contig ntig_1
Process Process-8:
Traceback (most recent call last):
File "/data/opt/programs/etc/python/v3.7.0/Python-3.7.0/make_location/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
Any insights into how to resolve this would be appreciated.
Hi,
I'm having some problems with mCaller.py
../00_biosoft/mCaller/mCaller.py -m A -r ../04_mCaller/PT12_sm.fa -d ../00_biosoft/mCaller/r94_model_NN_6_m6A.pkl -e ../04_mCaller/02_wa.eventalign.tsv -f ../03_Guppy/02_WA-fast5_out/02_WA_all.fastq -b A &
I have seen this issue, but I don't think that's consistent reasons with my problem.
I used the -n parameter when I output the eventalign.tsv file with nanopolish.
../00_biosoft/nanopolish/nanopolish eventalign -t 12 --scale-events -n -r ../03_Guppy/02_WA-fast5_out/02_WA_all.fastq -b 02_wa_all.sorted.bam -g PT12_sm.fa > 02_wa.eventalign.tsv
Here is part of my fasq and eventalign.tsv fail, could you do me a favor please?
Hello, I ran into
$ ./make_bed.py -f testdata/masonread1.eventalign.diffs.6 -d 1 -m 0.5
usage: make_bed.py [-h] [-d MIN_READ_DEPTH] [-t MOD_THRESHOLD] -f MCALLER_FILE [-p POSITIONS] [--control] [--gff] [--ref REF] [--plot] [--plotsummary] [--plotdir PLOTDIR] [--vo] [-v]
make_bed.py: error: unrecognized arguments: -m 0.5
but it was just copy'n'pasted from the README.md.
Hi, @al-mcintyre
I tried to use your script to trianning my m6A R9.4 model.But I have noticed that I must provide a file which contain m6A pos and A pos in ref.
Unlucky,I can't provide such completely methylated or canonical sites.
Can I trainning the model by providing a set of methylated reads and non-methylated reads instead of input locations?
Would you give me some tips about rebuilding your script if you like?
thanks:)
Hello, Dr McIntyre,
i have noticed that the output of mCaller.py is like this:
...
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722827 GMTCMMCCTMT 3.4875,-2.31,-0.5075,-0.67,-1.16,-0.07,19.519852053140095 + A 0.02
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722831 MMCCTMTMTCM -1.16,-0.07,-1.68,2.46,8.355,1.555,19.519852053140095 + m6A 0.79
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722833 CCTMTMTCMCC -1.68,2.46,8.355,1.555,-6.78,1.2333333333333334,19.519852053140095 + A 0.18
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722836 MTMTCMCCCCM 1.555,-6.78,1.2333333333333334,0.57,0.6699999999999999,1.05,19.519852053140095 + A 0.08
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722841 MCCCCMMTTMM 1.05,0.5525,0.79,-0.16500000000000004,0.8,-4.52,19.519852053140095 + A 0.4
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722842 CCCCMMTTMMC 0.5525,0.79,-0.16500000000000004,0.8,-4.52,0.81,19.519852053140095 + A 0.02
...
So, the first question is : what's the meaning of last column ( such as 0.02, 0.79, 0.18) ? Does it represent "probability of methylation score" ?
The corresponding output of make_bed.py is like this:
...
R93_chr01 11722823 11722824 TGCMGMTCMMC 0.2631578947368421 + 19
R93_chr01 11722826 11722827 MGMTCMMCCTM 0.11764705882352941 + 17
R93_chr01 11722831 11722832 MMCCTMTMTCM 0.17391304347826086 + 23
R93_chr01 11722833 11722834 CCTMTMTCMCC 0.16 + 25
R93_chr01 11722865 11722866 GCTMCMMGTGG 0.10714285714285714 - 28
...
So, the 2nd question is: what's meaning of 5th column (such as 0.2631578947368421, 0.11764705882352941) ? i have noticed that in README file, this 5th column may represents "% methylated" ? if true, what's the difference between "probability of methylation score" and "% methylated" ?
Thank you.
Hi,
I try to run mCaller on Pseudomonas aeruginosa and it works good for known motifs. Just for reasons of curiosity I wanted to check for all As to see if it is possible to detect methylated As that are not in a motif. I tried the -m A option to state the motif to be only A and in a second try I tried a 0-based position file with all As of the reference genome. Unfortunately both time the same error occurs:
61c7c0c1-6316-41d3-92cc-a98fa60e0d9a 1128158 GTNNNMNCNGG 2.35,2.21,-2.36,-1.9133333333333333,-3.805,-3.745,18.353837395288313 + - Index or Key Error
['MG', 'MH'] ['AA', 'AC', 'AG', 'AM', 'AT', 'MG', 'MA', 'MC', 'MM', 'MH', 'MT'] MN
'MN'
[[0.32129482 0.67870518]]
I think this is some python error. Do you have an idea how to fix this problem? Thank you for your help in advance!
Cheers
Sebastian
Hi,
when aligning ont data with reference, I use minimap2 in align process, and I found that when using bwa ,the result bam file is 1.5G, but when I use minimap2, the result bam size is only 12M,
the referencce is Pseudomonas, and genome size is 6.1M and ont data size is 2.1G, extracted using "nanopolish extract ", the command line is as below:
minimap: minimap2 -a -c -x map-ont micro.fastq reference.fasta| samtools view -Sb - | samtools sort -T /tmp/tmp.sorted -o micro.sort.bam
bwa: bwa mem -x ont2d -t 4 reference.fasta micro.fastq | samtools view -Sb - | samtools sort -T /tmp/tmp.sorted -o micro1.sort.bam
Hi, I needed to:
pip install "scikit_learn==0.22.2.post1"
to make it run.
Since in the README there are no specifications for the versions, here are the versions that work on python=3.8.8
:
biopython==1.79
cycler==0.11.0
fonttools==4.28.5
h5py==3.6.0
joblib==1.1.0
kiwisolver==1.3.2
matplotlib==3.5.1
numpy==1.22.0
packaging==21.3
pandas==1.3.5
Pillow==9.0.0
pyparsing==3.0.6
pysam==0.18.0
python-dateutil==2.8.2
pytz==2021.3
scikit-learn==0.22.2.post1
scipy==1.7.3
seaborn==0.11.2
six==1.16.0
threadpoolctl==3.0.0
Hello, Dr McIntyre,
Althrough some research have used PacBio data to detect 6mA in the plant which l' m interested, using nanopore data to detect 6mA in this plant species has not been available so far. So, i want to first detect genome-wide " true" 6mA sites in this plant species using my own experimental nanopore data with extremely strict parameters (such as : make_bed.py -d 30 -t 0.9 ), and then using the outputs to train the original r94_model_NN_6_m6A.pkl to get a new model, and finally return to use this new model to detect genome-wide 6mA site in this plant species using my experimental nanopore data with suggested parameters (such as : make_bed.py -d 15 -t 0.5 ). Is this idea feasible ? Could you give some suggestions ?
Thank you.
restructure directories
Hi,
Maybe my data size is a little big. I failed in the sorting step when I run the mCaller.py. The error information is as follows:
sort: write failed: /tmp/sort7fkfH5: No space left on device
And my command is as follows:
mCaller.py -m A -r Ref.fa -d ~/software/mCaller-master/r94_model_NN_6_m6A.pkl -e eventalign_try/$name.eventalign.tsv -f raw_data/links/$name.fastq.gz -b A -t 64
I would like to ask if I can split the chromosomes and try again ? And does it make a difference ? Thanks very much !
Hello,
Thank you for making mCaller! I think it's a nice alternative to using tombo.
I have a position file for all the 'A' sites I want to check for m6A, however when I use the -p option with my positions file, mCaller returns an empty file. My positions file is formatted as tab separated exactly how is described in the "testdata" (with chromosome, position and strand). I left out the label column because I am not re-training. Adding a label column gives the same result. Is there any troubleshooting I can do to get the positions file working?
In the meantime, instead of using a positions file, I tried mCaller by motif. I'm working with a motif that includes an ambiguous nucleotide 'N'. The motif is 'GANTC'. Running mCaller with 'GANTC' returns no result. I then tried running mCaller on each possible motif individually (i.e. GAATC, GAGTC, GATTC, GACTC) and it worked! However in the results for GAATC, I noticed mCaller is labeling both A positions as modified (ex. 'TAATGMMTCCT'). Only the first A position can be modified for this particular motif (i.e. 'GMATC'). How can I tell mCaller to only consider the first 'A' in this situation?
Thank you!
The 5th step under 'Pipeline for methylation detection from R9 data' is to run mCaller to detect 6mA modifications
The following command was given: mCaller/mCaller.py -m GATC -r Desktop/Thesis/example/reference.fasta -d mCaller/r95_twobase_model_NN_6_m6A.pkl -e trytry.eventalign.tsv -f trytry.fastq -b A
But the following error occured:
17 contigs
1 threads
Traceback (most recent call last):
File "mCaller/mCaller.py", line 187, in
main()
File "mCaller/mCaller.py", line 184, in main
args.train,modelfile,args.skip_thresh,args.qual_thresh,args.classifier,training_tsv,args.plot_training)
File "mCaller/mCaller.py", line 58, in distribute_threads
extract_features(tsvname,refname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,0,endline=bytesize,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list)
File "/home/arm/mCaller/extract_contexts.py", line 143, in extract_features
while tsv.tell() <= endline-500: #TODO: why 500?
OSError: telling position disabled by next() call
I would be greatly obliged if you could help me out.
Thankyou.
Unable to run the mCaller.py -m GATC script with the following error thrown. Please assist.
/home/adas/mCaller-master/mCaller.py -m GATC -r Pv_wholegenome_Seq.fasta -d r94_model_NN_6_m6A -e mCal.eventalign.tsv -f barcode1_merged.fastq -b A
Traceback (most recent call last):
File "/home/adas/mCaller-master/mCaller.py", line 187, in
main()
File "/home/adas/mCaller-master/mCaller.py", line 157, in main
assert os.path.isfile(modelfile), 'model file not found at '+modelfile
AssertionError: model file not found at r94_model_NN_6_m6A
Hi,
I've just completed a first run using mCaller. The process has completed and I have output although the following error is repeatedly shown in the log output as well:
Process Process-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/mnt/home1/gurdon/cep46/mCaller/mCaller_nanopolish.py", line 46, in worker
outtup = extract_features(tsvname,fastaname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,startline,endline=endline,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list)
File "/mnt/b2/home1/gurdon/cep46/mCaller/extract_contexts.py", line 138, in extract_features
while tsv.tell() <= endline-500:
OSError: telling position disabled by next() call
Is this error something I should be worried about?
Best,
Clare
Would be really helpful to have - to mention in papers (looks better than git hashes) and to trigger updates to the Debian etc packages.
Thank you!
Steffen
Hello Team,
I have successfully extracted the methylation calls and have generated the bed file, but I am trying to find out the differentially methylated regions for the same.Can you suggest me how can I proceed with the bed file generated. It would be really nice if you can provide some insights for the same.
Thanks & Regards
Priyanka Roy
Our nanopore DNA sequencing data comes from R9.41 flowcell. I wonder wheter the newest models such as CAAYNNNNNRTAC_model_6_m6A.pkl and CRAANNNNNNNTGC_model_6_m6A.pkl, are suitable to detech 6mA modification in fast5 data from R9.41 flowcell.
Hi, @al-mcintyre
The positons.txt that you mention in the "README",which "with a list of positions at which to classify
bases (must be formatted as space- or tab-separated file with chromosome, position, strand, and label if
training) , can't be obtain in simple way,would you please show method to get POSITIONS.txt in detail?
And in testdata folder,there are three positons file(e,g, test_positions.txt,test_positions_A.txt, test_positions_m6A.txt), do we need so many positions file to methylation call? And what they do in our test?
Under the latest version of sklearn (0.24.2) it appears that the MLP modules shifted around a bit. As a result under the latest version, I got the following error when trying to run mCaller.
ModuleNotFoundError: No module named 'sklearn.neural_network.multilayer_perceptron'
Since you don't import this directly in the Python code, I think it comes from the pickle file like this.
My solution for this was to downgrade sklearn to 0.22, so maybe that should be specified as a requirement or otherwise a higher version should be specified with an updated pickle.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.