al-mcintyre / mcaller Goto Github PK

View Code? Open in Web Editor NEW

43.0 43.0 16.0 135.98 MB

A python program to call methylation (m6A in DNA) from nanopore signal data

License: MIT License

Python 98.29% Shell 1.71%

mcaller's People

Contributors

Stargazers

Watchers

Forkers

qgouil xiuying ralowe anjunipeiris wangdi2014 hygine tw7649116 loneknightz skerker lqsae pandawyh smoe cmason-lab hegu2692 wook2013 flzh628

mcaller's Issues

python3 transition - I have patches locally, how can I help?

Hello,
https://salsa.debian.org/med-team/mcaller/-/blob/master/debian/patches/pickl3latin.patch
and
https://salsa.debian.org/med-team/mcaller/-/blob/master/debian/patches/python2to3.patch
do the trick - plus the change in the shebang (#!) line. At least so I hoped. Following your README.md instructions for testing, all the invocations to mCaller run, but not so

$         ./make_bed.py -f testdata/masonread1.eventalign.diffs.6 -d 1 -t 0.5     
testdata/masonread1.eventalign.diffs.6
Traceback (most recent call last):
  File "./make_bed.py", line 80, in aggregate_by_pos
    csome,read,pos,context,values,strand,label,prob = tuple(line.split('\t'))
ValueError: not enough values to unpack (expected 8, got 6)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./make_bed.py", line 205, in <module>
    main()
  File "./make_bed.py", line 202, in main
    aggregate_by_pos(args.mCaller_file,output_file,args.min_read_depth,args.mod_threshold,args.positions,args.control,args.vo,args.gff,args.ref,args.plot,args.plotdir,args.plotsummary)
  File "./make_bed.py", line 82, in aggregate_by_pos
    csome,read,pos,context,values,strand,label = tuple(line.split('\t'))
ValueError: not enough values to unpack (expected 7, got 6)

This maybe something associated with Python 3 differences or something else - maybe this rings an immediate bell on your end.

There are ways to have the code compatible to Python 2 and 3 at the same time. Is that something you would aim for? Please kindly direct me.

Many thanks

Steffen

Do mCaller have 5mC model ?

Question1: when I use mCaller call 5mC meth but I haven't 5mC model file, do you have 5mC train model to use ? the code like this :
mCaller.py -m GATC -r reference.fa -e human.0.eventalign.tsv -f astq_0.fastq -b C
got Error like this:

Traceback (most recent call last):
  File "mCaller.py", line 187, in <module>
    main()
  File "mCaller.py", line 157, in main
    assert os.path.isfile(modelfile), 'model file not found at '+modelfile
AssertionError: model file not found at /software/mCaller/model_NN_6_m5C.pkl

Question2: dose those model file suit for any species ？
@al-mcintyre @ralowe

Model for R9.4?

Hi, I'm very excited to try your algorithm.

I'm just wondering whether you have a model for R9.4 flow cells. "model_NN_6_m6A.pkl" or "mod_NN_6_m6A.pkl" is the one?

Hi @al-mcintyre,
I tried to use mCaller for 6mA methylation calling in E. coli. Meanwhile, I employed other software in methylation collection, including Deepmod and Deepsignal. The correlation among these three methods poses a confusing problem for detaild research. I don't know how to balance Quantity and accuracy.

Here, the parameters of mCaller were set as coverage > 3 and percent >0.5.
Looking forware to your reply.

Bin

Could mCaller detect methylation from other species

Hi @ralowe, @al-mcintyre Could mCaller detect methylation from other species, such as human?
Best.

Training and using a new model fails

Hi,

I´m trying to train my own model via mCaller. The methylation sites obtained after training look promising. However, when I try to apply the new model to a dataset I get an error.

Here is the full error message based on the test data provided following the commands described in the README:

testdata/
1 contigs
1 threads
26dd376e-9d82-41fc-921e-71e559c8e8d1_Basecall_2D_template 13673 CATTAMCAATC 1.1,-0.23666666666666666,0.28,1.17,-0.02,0.6699999999999999,7.055265349382997 - - Index or Key Error
['general'] ['AA', 'AC', 'AG', 'AM', 'AT', 'MG', 'MA', 'MC', 'MM', 'MH', 'MT'] MC
'MH'
Traceback (most recent call last):
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 188, in
main()
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 185, in main
args.train,modelfile,args.skip_thresh,args.qual_thresh,args.classifier,training_tsv,args.plot_training)
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/mCaller.py", line 59, in distribute_threads
extract_features(tsvname,refname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,0,endline=bytesize,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list)
File "/opt/share/software/packages/mCaller-2020-03-24/pyvirt/bin/extract_contexts.py", line 218, in extract_features
print model[twobase_model].predict_proba([diffs])
KeyError: 'MH'

Many thanks for any clarification!
Christian

nanopolish extract non-functional

Hello,

I just wanted to let you know that nanopolish extract cannot extract fast5 file pathway information to incorporate into read headers. The authors of the program have basically noted its obsolete for current versions of nanopolish. I have a set of four isolates where two are complementary mutants that have a methylase knocked out. I first de-multiplexed fast5 previously with Deepbinner since I was concerned about the clonal nature of these isolates. However, after thinking about the pathway more clearly, I'm thinking besides adding more computational time, omitting demultiplexing shouldn't be a concern for downstream results. Am I correct in thinking that demultiplexing fast5 files is not necessary for downstream analysis?

Thanks!

Will

Question about m6A result compare with tombo

hi @ralowe @al-mcintyre Thanks for share the project , nice work.
I use mCaller detect m6A meth susses, but when I compared the result with tombo, they are so big big big big big difference，the same sample detect result just like this：

mCaller result：total meth position ：5954091，overlap with tombo 21876 (0.3%)
tombo: total methy position：10532279，overlap with mCaller 21876 (0.2%)
overlap methy position relation ：-0.1
and i found that the mCaller methy lever higher than tombo two or more

So, my question is why are these two soft's results so differentiations, and could you give me the answer about these results, and which one close to the reality?
(my opinion : I suppose to tombo, cause that the methy lever most lower, the higher methy lever gets higher false positive rate)

why the m6A result significantly focus on some genome region by using this software

i have 20x nanopore datasets, and i use this software to call m6A genome loci. however, the result show that the m6A loci significantly focus on some genome region. i wonder whether the nanopre datasets suit for m6A analysis and the result of this software was convincing.
Because i have other 10x PacBio datasets. Although the reference genome is different, the m6A distribution result of PB was better than nanopre.

Pickle file search path not clear

I tried running

mCaller.py <-m GATC or -p positions.txt> -r <reference>.fasta -d r95_twobase_model_NN_6_m6A.pkl -e <filename>.eventalign.tsv -f <filename>.fastq -b A

with the variables set for my data, however mCaller didn't find r95_twobase_model_NN_6_m6A.pkl. I found the absolute path of the file and use that as an argument, but was it meant to be able to locate the file within its own installed folder, or did the pickle need to be in the local path?

Facing an issue for the 'nanopolish eventalign' command

For detecting 6mA modifications, I am currently working on mCaller and following the instructions under 'Pipeline for methylation detection from R9 data'. The extracting and aligning commands generated the required files, but the nanopolish eventalign command couldn't be executed.
I gave the following command: nanopolish eventalign -t 1 --scale-events -n -r trytry.fastq -b trytry.sorted.bam -g Desktop/Thesis/example/reference.fasta trytry.eventalign.tsv
But the following error occurred:
contig position reference_kmer read_name strand event_index event_level_mean event_stdv event_length model_kmer model_mean model_stdv standardized_level
[bam process] iterating over region: trytry.eventalign.tsv
[E::sam_itr_next] Null iterator

I would be greatly pleased if you could help me out.
Thankyou.

License info missing

Hello,
We are in the last minutes of this month's Debian Covid-19 Hackathon and would very much like to redistribute mCaller with the Linux distribution and other software that depends on it. Could you specify one, please, so we know we are allowed to redistribute your work?
Many thanks and regards,
Steffen

Motif context

Hi,

I was wondering which in which sequencing contexts (e.g. GATC) mCaller can predict methylated adenine?

Thanks,
Vahid

Invalid options for nanopolish index & make_bed.py

I been trying to run mCaller on promethION data and I have some questions.

1. On nanopolish index :
When indexing reads with nanopolish index, argument '-q' is called invalid option, yet it manage to find almost all path to fast5. I don't see any '-q' option in the nanopolish index help.
Would it be recommanded/possible to provide sequencing summary files to make the indexing faster ? Then where in the command line should I add the albacore_output.fastq path ?

2. On make_bed.py :
'-m' argument is called invaling option. I guessed that this value is a threshold for a minimum fraction of methylated reads at a genomic position. To run this script I have to put out the 'm' option and therefore I have to keep the default value of 0.5.

3. On .eventalign.diffs.6 :
For this file you give the first seven column names, but I can't see anywhere the meaning the 8th column. Would it be the methylation fraction ? Yet it seems to me that it is a per-reads file (I find duplicated genomic positions), so I thought the fraction would be either 0 or 1..

How to interpret diff.6 output?

I see from the REAMDE that the columns are "chromosome, read name, genomic position, position k-mer context, features, strand, label, and probability of methylation"

Here is some output from some PCR E. Coli samples which should have no methylation.

U00096.3	4a2c2f99-da0d-4d53-96f3-1573d778d0ae	782	ACCGGMTCGAT	0.8,-0.195,-1.1199999999999999,1.28,-1.99,-0.9225,18.939263322884013	-	A	0.28
U00096.3	4a2c2f99-da0d-4d53-96f3-1573d778d0ae	881	ATGTGMTCAGC	-0.36,1.07,-2.35,-3.3,-1.47,-2.5,18.939263322884013	-	A	0.25
U00096.3	4a2c2f99-da0d-4d53-96f3-1573d778d0ae	1168	AAGGGMTCTGG	0.25428571428571434,-0.63,0.26333333333333336,-1.115,-6.17,3.17,18.939263322884013	-	A	0.27
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	619	CGTCGMTCAGG	2.1666666666666665,2.33,-1.5050000000000001,-3.07,2.306666666666666,-3.6350000000000002,18.848024316109424	+	m6A	0.91
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	726	TGTCGMTCGCC	4.38,-1.5699999999999998,0.34,6.92,3.455,1.95,18.848024316109424	+	m6A	0.94
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	880	GGCTGMTCACA	2.13,-2.78,-0.97,2.0974999999999997,3.86,0.35,18.848024316109424	+	m6A	0.88
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	1167	TCCAGMTCCCT	-1.935,-3.34,-1.86,-2.11,-0.54,0.04500000000000015,18.848024316109424	+	A	0.03
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	1569	GTGGGMTCTCG	2.145,1.86,-1.205,1.3249999999999997,-0.16666666666666666,-2.7,18.848024316109424	+	m6A	0.77
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	1634	CAGGGMTCTTC	-1.805,-1.165,-0.06,3.34,-3.6100000000000003,2.1550000000000002,18.848024316109424	+	A	0.38
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	1723	TACCGMTCAGG	2.31,1.35,-0.78,6.26,2.85,-0.8480000000000001,18.848024316109424	+	m6A	0.98
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	2020	GGCGGMTCAAT	0.5333333333333333,-1.4,-0.99,-0.6549999999999999,-3.17,0.01,18.848024316109424	+	A	0.12
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	2353	AGATGMTCTTT	2.0,2.4,0.71,0.32,1.9949999999999999,-2.805,18.848024316109424	+	m6A	0.8
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	2632	TAATGMTCCGC	1.52,0.21,-2.655,2.34,1.27,-1.0416666666666667,18.848024316109424	+	m6A	0.86
U00096.3	d1e7a6bc-92ed-422f-a7b8-b9ebeb39bac1	2761	TGCTGMTCTGC	-0.53,-2.285,-1.33,0.69,-2.12,-4.57,18.848024316109424	+	A	0.35
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	619	CGTCGMTCAGG	0.6549999999999999,-1.52,-1.3175000000000001,-0.84,1.95,0.08,19.64565043894653	+	A	0.3
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	726	TGTCGMTCGCC	4.96,-1.9975,-0.4666666666666666,2.4,-2.02,0.365,19.64565043894653	+	A	0.35
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	781	TATCGMTCCGG	0.79,-3.1133333333333333,-2.47,1.2,0.66,-1.4800000000000002,19.64565043894653	+	m6A	0.63
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	880	GGCTGMTCACA	2.4066666666666667,0.51,2.76,1.13,0.685,-2.15,19.64565043894653	+	m6A	0.57
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	1167	TCCAGMTCCCT	2.16,6.13,1.23,-0.96,5.97,1.405,19.64565043894653	+	m6A	0.89
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	1569	GTGGGMTCTCG	-3.07,-1.074,-2.525,-0.895,3.19,-2.54,19.64565043894653	+	m6A	0.73
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	1634	CAGGGMTCTTC	-1.86,-2.5,-2.4000000000000004,-1.73,-11.36,-0.69,19.64565043894653	+	A	0.38
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	1723	TACCGMTCAGG	0.47,-2.85,-1.46,3.79,3.255,0.44,19.64565043894653	+	m6A	0.93
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	2020	GGCGGMTCAAT	0.5466666666666666,-0.04500000000000004,-2.0533333333333332,1.37,-0.82,0.57,19.64565043894653	+	A	0.2
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	2353	AGATGMTCTTT	1.495,1.99,2.758,1.2366666666666666,0.4950000000000002,-0.79,19.64565043894653	+	A	0.32
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	2515	CGACGMTCTCT	0.81,3.085,-0.07,4.31,-10.54,-2.85,19.64565043894653	+	m6A	0.96
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	2632	TAATGMTCCGC	1.29,1.8,-2.55,0.39749999999999996,1.17,0.45999999999999996,19.64565043894653	+	A	0.43
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	3073	TGCCGMTCGGT	2.37,1.096666666666667,-0.5199999999999999,-1.865,6.8149999999999995,0.5,19.64565043894653	+	m6A	0.94
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	3262	TGATGMTCGAA	4.39,-2.05,6.45,1.72,0.88,-3.21,19.64565043894653	+	m6A	0.79
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	3755	GAAAGMTCACA	0.8333333333333334,0.25,0.8700000000000001,2.57,3.44,-1.01,19.64565043894653	+	m6A	0.88
U00096.3	143b58d8-6651-4192-9592-1026c2a75cc3	3904	CGAAGMTCCTC	-0.965,0.46,0.10666666666666662,2.275,-0.36,-2.42,19.64565043894653	+	A	0.45
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	619	CGTCGMTCAGG	0.67,-0.3500000000000001,0.9,-0.51,2.58,-0.55,20.60535117056856	+	A	0.5
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	726	TGTCGMTCGCC	2.6966666666666668,0.28,-0.695,1.12,3.16,1.75,20.60535117056856	+	A	0.47
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	781	TATCGMTCCGG	0.4366666666666667,-0.83,0.64,-0.5700000000000001,0.05333333333333338,-2.19,20.60535117056856	+	A	0.27
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	880	GGCTGMTCACA	-0.17,0.895,-2.55,0.86,0.40500000000000014,-3.47,20.60535117056856	+	m6A	0.84
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	1569	GTGGGMTCTCG	-2.1125,-0.49,-1.05,10.19,2.01,-0.14,20.60535117056856	+	m6A	0.93
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	1634	CAGGGMTCTTC	-0.83,-1.415,-0.57,0.24,-2.025,-2.955,20.60535117056856	+	A	0.18
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	1723	TACCGMTCAGG	-0.14,-3.75,-2.19,-2.89,-4.63,-0.8799999999999999,20.60535117056856	+	A	0.19
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	2020	GGCGGMTCAAT	1.09,-1.08,-0.72,-3.1466666666666665,-2.88,-4.333333333333333,20.60535117056856	+	A	0.33
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	2353	AGATGMTCTTT	0.39999999999999997,-0.13,2.24,4.525,1.13,-0.3,20.60535117056856	+	m6A	0.59
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	2632	TAATGMTCCGC	1.0075,0.11,-1.11,0.17833333333333334,0.8833333333333333,1.24,20.60535117056856	+	A	0.09
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	2761	TGCTGMTCTGC	-0.08,0.31,0.52,-1.46,-0.33799999999999997,1.95,20.60535117056856	+	A	0.02
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	3073	TGCCGMTCGGT	2.3266666666666667,0.75,-0.19666666666666663,2.96,3.81,1.78,20.60535117056856	+	m6A	0.76
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	3755	GAAAGMTCACA	0.04999999999999999,-1.4333333333333333,1.1199999999999999,1.11,0.985,-3.58,20.60535117056856	+	m6A	0.65
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	3904	CGAAGMTCCTC	1.06,1.89,-0.48500000000000004,-0.02,1.54,1.86,20.60535117056856	+	A	0.13
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	4820	GCGTGMTCAGT	3.035,-4.26,1.47,0.030000000000000027,-0.24,1.32,20.60535117056856	+	A	0.12
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	4916	GTTGGMTCTGC	-0.11000000000000001,0.41,-1.44,-0.62,0.39999999999999997,-1.89,20.60535117056856	+	A	0.27
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	5345	AGGCGMTCGTG	-0.71,1.35,1.53,0.29,0.8450000000000002,5.33,20.60535117056856	+	A	0.05
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	5489	TCATGMTCATC	2.57,2.265,0.6399999999999999,-0.14500000000000002,2.09,-0.12,20.60535117056856	+	A	0.44
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	5851	TGATGMTCTTA	-1.8133333333333335,-1.7349999999999999,4.48,-5.14,-8.296666666666667,-4.5,20.60535117056856	+	m6A	0.62
U00096.3	17fdb5ab-3500-4825-9112-67dc62ea11e4	6061	GACGGMTCCCC	-4.71,-0.94,-1.1366666666666667,-4.17,-0.21500000000000002,-1.85,20.60535117056856	+	A	0.09

On all 6mA contexts, is mCaller reporting a high probability of methylation?

--version bails out when otherwise required args are not given

It is in the code, but the required arguments bed1 and bed2 are not required for the version argument.

$ ./compare_genomes.py --version
usage: compare_genomes.py [-h] --bed1 BED1 --bed2 BED2 [-g GENOME_ALIGNMENT] [-v]
compare_genomes.py: error: the following arguments are required: --bed1, --bed2
$ ./make_bed.py --version
usage: make_bed.py [-h] [-d MIN_READ_DEPTH] [-t MOD_THRESHOLD] -f MCALLER_FILE [-p POSITIONS] [--control] [--gff] [--ref REF] [--plot] [--plotsummary] [--plotdir PLOTDIR] [--vo] [-v]
make_bed.py: error: the following arguments are required: -f/--mCaller_file
$ ./mCaller.py --version
mCaller v0.3

Problems about run of Test Data

Dear authors,
when we run the command line:

./make_bed.py -f testdata/masonread1.eventalign.diffs.6 -d 1 -m 0.5

the error says that programm 'make_bed.py' cannot recognize the argument "-m 0.5"
we don't understand why '-m 0.5' cannot be recognized as an argument when running this file.

model is not recognizing reference

Hello,

I am trying to run the m6a model with a reference assembly I generated using flye. However, I'm getting this error:

python3 /data/opt/programs/etc/mCaller/mCaller-1.0.3/mCaller.py -t 24 -m CRAANNNNNNNTGC -r ./6180WT/6180WT_flye_assembly.fasta -d /data/opt/programs/etc/mCaller/mCaller-1.0.3/r95_twobase_model_NN_6_m6A.pkl -e 6180WT.eventalign.tsv -b A -f barcode04_6180WT.fastq

1 contigs
24 threads
Process Process-2:
Error: could not find sequence for reference contig g_1
Error: could not find sequence for reference contig contig
Process Process-11:
Error: could not find sequence for reference contig 14632
Error: could not find sequence for reference contig ntig_1
Process Process-8:
Traceback (most recent call last):
File "/data/opt/programs/etc/python/v3.7.0/Python-3.7.0/make_location/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()

Any insights into how to resolve this would be appreciated.

Error: could not find sequence for reference contig contig

Hi,
I'm having some problems with mCaller.py

../00_biosoft/mCaller/mCaller.py -m A -r ../04_mCaller/PT12_sm.fa -d ../00_biosoft/mCaller/r94_model_NN_6_m6A.pkl -e ../04_mCaller/02_wa.eventalign.tsv -f ../03_Guppy/02_WA-fast5_out/02_WA_all.fastq -b A &

I have seen this issue, but I don't think that's consistent reasons with my problem.

I used the -n parameter when I output the eventalign.tsv file with nanopolish.

../00_biosoft/nanopolish/nanopolish eventalign -t 12 --scale-events -n -r ../03_Guppy/02_WA-fast5_out/02_WA_all.fastq -b 02_wa_all.sorted.bam -g PT12_sm.fa > 02_wa.eventalign.tsv

Here is part of my fasq and eventalign.tsv fail, could you do me a favor please?

README.md instructions on make_bed.py: -m argument unknown

Hello, I ran into

$ ./make_bed.py -f testdata/masonread1.eventalign.diffs.6 -d 1 -m 0.5
usage: make_bed.py [-h] [-d MIN_READ_DEPTH] [-t MOD_THRESHOLD] -f MCALLER_FILE [-p POSITIONS] [--control] [--gff] [--ref REF] [--plot] [--plotsummary] [--plotdir PLOTDIR] [--vo] [-v]
make_bed.py: error: unrecognized arguments: -m 0.5

but it was just copy'n'pasted from the README.md.

Question about trainning custom m6A model

Hi, @al-mcintyre
I tried to use your script to trianning my m6A R9.4 model.But I have noticed that I must provide a file which contain m6A pos and A pos in ref.
Unlucky,I can't provide such completely methylated or canonical sites.
Can I trainning the model by providing a set of methylated reads and non-methylated reads instead of input locations?
Would you give me some tips about rebuilding your script if you like?
thanks:)

how to interpret the output of mCaller.py and make_bed.py ?

Hello, Dr McIntyre,
i have noticed that the output of mCaller.py is like this:
...
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722827 GMTCMMCCTMT 3.4875,-2.31,-0.5075,-0.67,-1.16,-0.07,19.519852053140095 + A 0.02
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722831 MMCCTMTMTCM -1.16,-0.07,-1.68,2.46,8.355,1.555,19.519852053140095 + m6A 0.79
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722833 CCTMTMTCMCC -1.68,2.46,8.355,1.555,-6.78,1.2333333333333334,19.519852053140095 + A 0.18
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722836 MTMTCMCCCCM 1.555,-6.78,1.2333333333333334,0.57,0.6699999999999999,1.05,19.519852053140095 + A 0.08
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722841 MCCCCMMTTMM 1.05,0.5525,0.79,-0.16500000000000004,0.8,-4.52,19.519852053140095 + A 0.4
R93_chr01 0000a989-a60a-4bbf-9dea-d7338c442966 11722842 CCCCMMTTMMC 0.5525,0.79,-0.16500000000000004,0.8,-4.52,0.81,19.519852053140095 + A 0.02
...
So, the first question is : what's the meaning of last column ( such as 0.02, 0.79, 0.18) ? Does it represent "probability of methylation score" ?

The corresponding output of make_bed.py is like this:
...
R93_chr01 11722823 11722824 TGCMGMTCMMC 0.2631578947368421 + 19
R93_chr01 11722826 11722827 MGMTCMMCCTM 0.11764705882352941 + 17
R93_chr01 11722831 11722832 MMCCTMTMTCM 0.17391304347826086 + 23
R93_chr01 11722833 11722834 CCTMTMTCMCC 0.16 + 25
R93_chr01 11722865 11722866 GCTMCMMGTGG 0.10714285714285714 - 28
...

So, the 2nd question is: what's meaning of 5th column (such as 0.2631578947368421, 0.11764705882352941) ? i have noticed that in README file, this 5th column may represents "% methylated" ? if true, what's the difference between "probability of methylation score" and "% methylated" ?

Thank you.

probable mCaller python error

Hi,

I try to run mCaller on Pseudomonas aeruginosa and it works good for known motifs. Just for reasons of curiosity I wanted to check for all As to see if it is possible to detect methylated As that are not in a motif. I tried the -m A option to state the motif to be only A and in a second try I tried a 0-based position file with all As of the reference genome. Unfortunately both time the same error occurs:

61c7c0c1-6316-41d3-92cc-a98fa60e0d9a 1128158 GTNNNMNCNGG 2.35,2.21,-2.36,-1.9133333333333333,-3.805,-3.745,18.353837395288313 + - Index or Key Error
['MG', 'MH'] ['AA', 'AC', 'AG', 'AM', 'AT', 'MG', 'MA', 'MC', 'MM', 'MH', 'MT'] MN
'MN'
[[0.32129482 0.67870518]]

I think this is some python error. Do you have an idea how to fix this problem? Thank you for your help in advance!

Cheers
Sebastian

can I replace bwa with minimap2 for ont data?

Hi，
when aligning ont data with reference, I use minimap2 in align process, and I found that when using bwa ,the result bam file is 1.5G, but when I use minimap2, the result bam size is only 12M,
the referencce is Pseudomonas, and genome size is 6.1M and ont data size is 2.1G, extracted using "nanopolish extract ", the command line is as below:
minimap: minimap2 -a -c -x map-ont micro.fastq reference.fasta| samtools view -Sb - | samtools sort -T /tmp/tmp.sorted -o micro.sort.bam
bwa: bwa mem -x ont2d -t 4 reference.fasta micro.fastq | samtools view -Sb - | samtools sort -T /tmp/tmp.sorted -o micro1.sort.bam

update scikit_learn version to make it work

Hi, I needed to:

pip install "scikit_learn==0.22.2.post1"

to make it run.
Since in the README there are no specifications for the versions, here are the versions that work on python=3.8.8:

biopython==1.79
cycler==0.11.0
fonttools==4.28.5
h5py==3.6.0
joblib==1.1.0
kiwisolver==1.3.2
matplotlib==3.5.1
numpy==1.22.0
packaging==21.3
pandas==1.3.5
Pillow==9.0.0
pyparsing==3.0.6
pysam==0.18.0
python-dateutil==2.8.2
pytz==2021.3
scikit-learn==0.22.2.post1
scipy==1.7.3
seaborn==0.11.2
six==1.16.0
threadpoolctl==3.0.0

model training with my own nanopore data

Hello, Dr McIntyre,
Althrough some research have used PacBio data to detect 6mA in the plant which l' m interested, using nanopore data to detect 6mA in this plant species has not been available so far. So, i want to first detect genome-wide " true" 6mA sites in this plant species using my own experimental nanopore data with extremely strict parameters (such as : make_bed.py -d 30 -t 0.9 ), and then using the outputs to train the original r94_model_NN_6_m6A.pkl to get a new model, and finally return to use this new model to detect genome-wide 6mA site in this plant species using my experimental nanopore data with suggested parameters (such as : make_bed.py -d 15 -t 0.5 ). Is this idea feasible ? Could you give some suggestions ?

Thank you.

fix python package structure

restructure directories

No space left on device

Hi,

Maybe my data size is a little big. I failed in the sorting step when I run the mCaller.py. The error information is as follows:

sort: write failed: /tmp/sort7fkfH5: No space left on device

And my command is as follows:

mCaller.py -m A -r Ref.fa -d ~/software/mCaller-master/r94_model_NN_6_m6A.pkl -e eventalign_try/$name.eventalign.tsv -f raw_data/links/$name.fastq.gz -b A -t 64

I would like to ask if I can split the chromosomes and try again ? And does it make a difference ? Thanks very much !

mCaller with position file returns no results

Hello,

Thank you for making mCaller! I think it's a nice alternative to using tombo.

I have a position file for all the 'A' sites I want to check for m6A, however when I use the -p option with my positions file, mCaller returns an empty file. My positions file is formatted as tab separated exactly how is described in the "testdata" (with chromosome, position and strand). I left out the label column because I am not re-training. Adding a label column gives the same result. Is there any troubleshooting I can do to get the positions file working?

In the meantime, instead of using a positions file, I tried mCaller by motif. I'm working with a motif that includes an ambiguous nucleotide 'N'. The motif is 'GANTC'. Running mCaller with 'GANTC' returns no result. I then tried running mCaller on each possible motif individually (i.e. GAATC, GAGTC, GATTC, GACTC) and it worked! However in the results for GAATC, I noticed mCaller is labeling both A positions as modified (ex. 'TAATGMMTCCT'). Only the first A position can be modified for this particular motif (i.e. 'GMATC'). How can I tell mCaller to only consider the first 'A' in this situation?

Thank you!

Facing an issue with the command to detect 6mA

The 5th step under 'Pipeline for methylation detection from R9 data' is to run mCaller to detect 6mA modifications
The following command was given: mCaller/mCaller.py -m GATC -r Desktop/Thesis/example/reference.fasta -d mCaller/r95_twobase_model_NN_6_m6A.pkl -e trytry.eventalign.tsv -f trytry.fastq -b A

But the following error occured:
17 contigs
1 threads
Traceback (most recent call last):
File "mCaller/mCaller.py", line 187, in
main()
File "mCaller/mCaller.py", line 184, in main
args.train,modelfile,args.skip_thresh,args.qual_thresh,args.classifier,training_tsv,args.plot_training)
File "mCaller/mCaller.py", line 58, in distribute_threads
extract_features(tsvname,refname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,0,endline=bytesize,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list)
File "/home/arm/mCaller/extract_contexts.py", line 143, in extract_features
while tsv.tell() <= endline-500: #TODO: why 500?
OSError: telling position disabled by next() call

I would be greatly obliged if you could help me out.
Thankyou.

Can't run the methylation detection script

Unable to run the mCaller.py -m GATC script with the following error thrown. Please assist.

/home/adas/mCaller-master/mCaller.py -m GATC -r Pv_wholegenome_Seq.fasta -d r94_model_NN_6_m6A -e mCal.eventalign.tsv -f barcode1_merged.fastq -b A
Traceback (most recent call last):
File "/home/adas/mCaller-master/mCaller.py", line 187, in
main()
File "/home/adas/mCaller-master/mCaller.py", line 157, in main
assert os.path.isfile(modelfile), 'model file not found at '+modelfile
AssertionError: model file not found at r94_model_NN_6_m6A

Errors on running mCaller

Hi,
I've just completed a first run using mCaller. The process has completed and I have output although the following error is repeatedly shown in the log output as well:
Process Process-4:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/mnt/home1/gurdon/cep46/mCaller/mCaller_nanopolish.py", line 46, in worker
outtup = extract_features(tsvname,fastaname,read2qual,nvariables,skip_thresh,qual_thresh,modelfile,classifier,startline,endline=endline,train=train,pos_label=training_pos_dict,base=base,motif=motif,positions_list=positions_list)
File "/mnt/b2/home1/gurdon/cep46/mCaller/extract_contexts.py", line 138, in extract_features
while tsv.tell() <= endline-500:
OSError: telling position disabled by next() call

Is this error something I should be worried about?

Best,

Clare

Release tags?

Would be really helpful to have - to mention in papers (looks better than git hashes) and to trigger updates to the Debian etc packages.
Thank you!
Steffen

Fix tsv header issue (contig contig not recognized) and add ambiguous base handling

Regarding bed file generated

Hello Team,

I have successfully extracted the methylation calls and have generated the bed file, but I am trying to find out the differentially methylated regions for the same.Can you suggest me how can I proceed with the bed file generated. It would be really nice if you can provide some insights for the same.

Thanks & Regards
Priyanka Roy

whether the newest models are suited for R9.41 flowcell data

Our nanopore DNA sequencing data comes from R9.41 flowcell. I wonder wheter the newest models such as CAAYNNNNNRTAC_model_6_m6A.pkl and CRAANNNNNNNTGC_model_6_m6A.pkl, are suitable to detech 6mA modification in fast5 data from R9.41 flowcell.

method to get the POSITIONS.txt

Hi， @al-mcintyre
The positons.txt that you mention in the "README",which "with a list of positions at which to classify
bases (must be formatted as space- or tab-separated file with chromosome, position, strand, and label if
training) , can't be obtain in simple way,would you please show method to get POSITIONS.txt in detail?
And in testdata folder,there are three positons file(e,g, test_positions.txt,test_positions_A.txt, test_positions_m6A.txt), do we need so many positions file to methylation call? And what they do in our test?

No module named 'sklearn.neural_network.multilayer_perceptron'

Under the latest version of sklearn (0.24.2) it appears that the MLP modules shifted around a bit. As a result under the latest version, I got the following error when trying to run mCaller.

ModuleNotFoundError: No module named 'sklearn.neural_network.multilayer_perceptron'

Since you don't import this directly in the Python code, I think it comes from the pickle file like this.

My solution for this was to downgrade sklearn to 0.22, so maybe that should be specified as a requirement or otherwise a higher version should be specified with an updated pickle.

'could not find sequence for reference contig'

Hi all,
Thank you for the amazing tool. I applied mcaller to detect GATC methylation on my nanopore data, but an error showed up. I can share the fastq, tsv, and ref file if needed. Thanks for your help!
Kewei Xu

al-mcintyre / mcaller Goto Github PK

mcaller's People

Contributors

Stargazers

Watchers

Forkers

mcaller's Issues

Recommend Projects

Recommend Topics

Recommend Org