dpryan79 / methyldackel Goto Github PK

A (mostly) universal methylation extractor for BS-seq experiments.

License: MIT License

C 96.50% Python 3.01% Makefile 0.48%

methylation bioinformatics methylation-extraction bisulfite

methyldackel's Introduction

MethylDackel (formerly named PileOMeth, which was a temporary name derived due to it using a PILEup to extract METHylation metrics) will process a coordinate-sorted and indexed BAM or CRAM file containing some form of BS-seq alignments and extract per-base methylation metrics from them. MethylDackel requires an indexed fasta file containing the reference genome as well.

Prerequisites

A C compiler, such as gcc, htslib (at least versions 1.11, earlier versions are not compatible) and libBigWig are required. For libBigWig, the static library is used.

Compilation

Compilation and installation can be performed via:

git clone https://github.com/dpryan79/MethylDackel.git
cd MethylDackel
make LIBBIGWIG="/some/path/to/libBigWig.a"
make install prefix=/some/installation/path

If the linker has issues finding the htslib headers and library, then specify them using CFLAGS and LIBS:

make install CFLAGS="-O3 -Wall -I/some/path/include " LIBS="-L/some/path/lib" prefix=/some/installation/path LIBBIGWIG="/some/path/to/libBigWig.a"

License

This software is licensed under the terms of the MIT license (see the LICENSE file). Please feel free to redistribute this according to those terms.

Methylation Context

MethylDackel groups all Cytosines into one of three sequence contexts: CpG, CHG, and CHH. Here, H is the IUPAC ambiguity code for any nucleotide other than G. If an N is encountered in the reference sequence, then the context will be assigned to CHG or CHH, as appropriate (e.g., CNG would be categorized as in a CHG context and CNC as in a CHH context). If a Cytosine is close enough to the end of a chromosome/contig such that its context can't be inferred, then it is categorized as CHH (e.g., a Cytosine as the last base of a chromosome is considered as being in a CHH context).

Usage

The most basic usage of MethylDackel is as follows:

MethylDackel extract reference_genome.fa alignments.bam

This will calculate per-base CpG metrics and output them to alignments_CpG.bedGraph, which is a standard bedGraph file with column 4 being the number of reads/read pairs with evidence for a methylated C at a given position and column 5 the equivalent for an unmethylated C. An alternate output filename prefix can be specified with the -o some_new_prefix option.

By default, MethylDackel will only calculate metrics for Cytosines in a CpG context, but metrics for those in CHG and CHH contexts are supported as well (see the --CHH and --CHG options). If you would like to ignore Cytosines in CpGs, simply specify --noCpG. Each type of metric is output to a different file. For per-CpG and/or per-CHG rather than per-Cytosine metrics, see "Per-CpG/CHG metrics", below.

MethylDackel can filter reads and bases according to MAPQ and Phred score, respectively. The default minimums are MAPQ >= 10 and Phred >= 5, though these can be changed with the -q and -p options. MethylDackel can also account for methylation bias (described below) with the --OT, --OB, --CTOT, and --CTOB options.

Given a bigWig of Bismap mappability data (using the -M option), MethylDackel can also filter out reads with a mappability score which is too low to support methylation calling. With default settings, a read would be rejected if less than 15 bases have mappability ≥ 0.01, but this can be configured with the -t and -b options.

This mappability data can also be read in from a BBM file. A BBM file (short for "Binary BisMap") is a custom compressed version of the data from the bigWig which is used by MethylDackel. It can be read in much faster than a bigWig and takes up less disk space. These files can be created by specifying the -O or -N options when running MethylDackel extract, and can be read in using the -B option (instead of -M).

A note on overlapping reads

A commonly asked and important question is how paired-end reads that overlap are treated. If two mates from a paired-end experiment happen to both overlap the same CpG then one could get skewed results if the methylation status is taken from both reads, rather than only one of them (i.e., double counting paired-end reads is a bad idea). MethylDackel was written with this in mind, so by default paired-end reads that overlap will never be double counted.

It's useful to note how this is implemented internally, since it relates to the -p option. For the sake of an example, suppose you have two mates that overlap a given position. In this example, we'll specify that mate #1 has a base call with a phred score of 20 at that position and that mate #2 has a base all with a phred score of 15.

If both base calls agree, then the one with the lower score (here, mate #2 with a score of 15) is set to 0. Further, the other read's base score (here, mate #1 with a score of 20) is increased by 20%. Since the -p option defaults to 10, it will skip mate #2 and only count read #1.
If the base calls instead disagree, then the process is more complicated. There are a total of four possibilities for how this could work:
Mate #1 has a higher phred score and the base call is not an N. Then its phred score is decremented by that of the other mate (in the example above, 20-15=5, so 5 would be the new phred score). The phred score for the mate is then set to 0.
Mate #2 has a higher phred score and the base call is not an N. In that case the same procedure as that above is followed, with the mate #2 phred score decremented and the mate #1 phred score set to 0.
The mate with the higher phred score has a base call of N. In this case, the phred scores for both mates are set to 0.
The mates have identical phred scores. In this case, the phred scores are both set to 0.

It's important to think about the above possibilities in conjunction with the -p option. If two mates agree but individually have low quality base calls, they may combine to produce a base call of acceptable quality. Similarly, if the two mates have high quality base calls but these disagree, the result could be a combined base call of low quality (due to the disagreement). This is actually the same prodcedure that samtools mpileup uses internally, so if you're familiar with variant calling with samtools then the methodology and reasons are identical.

Single Cytosine methylation metrics extraction

MethylDackel extract produces a variant of bedGraph that's similar to the "coverage" file produced by Bismark and Bison. In short, each line consists of 6 tab separated columns:

The chromosome/contig/scaffold name
The start coordinate
The end coordinate
The methylation percentage rounded to an integer
The number of alignments/pairs reporting methylated bases
The number of alignments/pairs reporting unmethylated bases

All coordinates are 0-based half open, which conforms to the bedGraph definition. When paired-end reads are aligned, it can often occur that their alignments overlap. In such cases, MethylDackel will not count both reads of the pair in its output, as doing so would lead to incorrect downstream statistical results.

An example of the output is below:

track type="bedGraph" description="SRR1182519.sorted CpG methylation levels"
1	25115	25116	100	3	0
1	29336	29337	50	1	1

Note the header line, which starts with "track". The "description" field is used as a label in programs such as IGV. Each of the subsequent lines describe single Cytosines, the 25116th and 29337th base on chromosome 1, respectively. The first position has 3 alignments (or pairs of alignments) indicating methylation and 0 indicating unmethylation (100% methylation) and the second position has 1 alignment each supporting methylation and unmethylation (50% methylation).

Per-CpG/CHG metrics

In many circumstances, it's desireable for metrics from individual Cytosines in a CpG to be merged, producing per-CpG metrics rather than per-Cytosine metrics. This can be accomplished with the --mergeContext option in MethylDackel extract. If this is used, then this output:

track type="bedGraph" description="SRR1182519.sorted CpG methylation levels"
1	25114	25115	66	2	1
1	25115	25116	100	3	0

is changed to this:

track type="bedGraph" description="SRR1182519.sorted merged CpG methylation levels"
1	25114	25116	83	5	1

This also works for CHG-level metrics. If bedGraph files containing per-Cytosine metrics already exist, they can be converted to instead contain per-CpG/CHG metrics with MethylDackel mergeContext.

Excluding low-coverage regions

If your downstream analysis requires an absolute minimum coverage (here, defined as the number of methylation calls kept after filtering for MAPQ, phred score, etc.), you can use the --minDepth option to achieve this. By default, MethylDackel extract will output all methylation metrics as long as the coverage is at least 1. If you use --minDepth 10, then only sites covered at least 10x will be output. This works in conjunction with the --mergeContext option, above. So if you request per-CpG context output (i.e., with --mergeContext) and --minDepth 10 then only CpGs with a minimum coverage of 10 will be output.

Excluding partially converted reads

Some users wish to filter out reads that have evidence of incomplete bisulfite conversion. This can be determined by looking for CHH and CHG-context Cytosines in each read and determining their methylation state. Such filtering should generally be avoided, since there are regions in most genomes with at least some CHH and CHG-context Cytosine methylation, which would cause excess filtering in those regions. However, if you absolutely need to filter out only partially-converted alignments, you can do so with the --minConversionEfficiency option. The default is 0, meaning that CHH and CHG-context Cytosine conversion status is ignored. The maximum value is 1.0, meaning that 100% of the CHH and CHG-context Cytosines in an alignment must be converted to T. Note that an alignment with no CHH or CHG-context Cytosines will be given a conversion efficiency of 1.0.

Logit, fraction, and counts only output

The standard output described above can be modified if you supply the --fraction, --counts, or --logit options to MethylDackel extract.

The --fraction option essentially produces the first 4 columns of the standard output described above. The only other difference is that the range of the 4th column is now between 0 and 1, instead of 0 and 100. Instead of producing a file ending simply in .bedGraph, one ending in .meth.bedGraph will instead be produced.

The --counts option produces the first three columns of the standard output followed by a column of total coverage counts. This last column is equivalent to the sum of the 5th and 6th columns of the standard output. The resulting file ends in .counts.bedGraph rather than simply .bedGraph.

The --logit option produces the first three columns of the standard output followed by the logit transformed methylation fraction. The logit transformation is log(Methylation fraction/(1-Methylation fraction)). Note that log uses base e. Logit transformed methylation values range between +/- infinity, rather than [0,1]. The resulting file ends in .logit.bedGraph rather than simply .bedGraph.

Note that these options may be combined with --mergeContext. However, MethylDackel mergeContext can not be used after the fact to combine these.

methylKit-compatible output

methylKit has its own format, which can be produced with the --methylKit option. Merging Cs into CpGs or CHGs is forbidden in this format. Likewise, this option is mutually exclusive with --logit et al.

Excluding likely variant sites

If your samples are not genetically homogenous, it can sometimes be advantageous to exclude likely variant sites from methylation extraction. As an example, since unmethylated Cs are read as Ts, extracting methylation from a position with a C->T mutation will cause incorrect results. In such a case, the opposite strand will have an A rather than a G (in the non-variant case, there would be a G regardless of methylation status). MethylDackel tracks the number of non-Gs on the strand opposite of Cs in the reference sequence. If the fraction of these exceeds the --maxVariantFrac option, then that position will be excluded from output. To exclude cases where the --maxVariantFrac value is exceeded only due to low coverage, the opposite strand must have a depth of coverage of at least --minOppositeDepth. Note that the default value for --minOppositeDepth is 0, indicating that the variant site exclusion process is skipped.

Note that if one additionally specifies --mergeContext, that a given CpG or CHG will be excluded from output if either of its individual Cs would be excluded given the specified --minOppositeDepth and --maxVariantFrac`.

Methylation bias plotting and correction

In an ideal experiment, we expect that the probability of observing a methylated C is constant across the length of any given read. In practice, however, there are often increases/decreases in observed methylation rate at the ends of reads and/or more global changes. These are termed methylation bias and including such regions in the extracted methylation metrics will result in noisier and less accurate data. For this reason, users are strongly encouraged to make a methylation bias plot. MethylDackel comes with a function for just this purpose:

MethylDackel mbias reference_genome.fa alignments.sorted.bam output_prefix

That command will create a methylation bias (mbias for short) plot for each of the strands for which there are valid alignments. The command can take almost all of the same options as MethylDackel extract, so if you're interested in looking at only a single region or only CHH and CHG metrics then you can do that (run MethylDackel mbias -h for the full list of options). The resulting mbias graphs are in SVG format and can be viewed in most modern web browsers:

If you have paired-end data, both reads in the pair will be shown separately, as is the case above. The program will suggest regions for inclusion ("--OT 2,0,0,98" above) and mark them on the plot, if applicable. The format of this output is described in MethylDackel extract -h. These suggestions should not be accepted blindly; users are strongly encouraged to have a look for themselves and tweak the actual bounds as appropriate. The lines indicate the average methylation percentage at a given position and the shaded regions the 99.9% confidence interval around it. This is useful in gauging how many methylation calls a given position has relative to its neighbors. Note the spike in methylation at the end of read #2 and the corresponding dip at the beginning of read #1. This is common and these regions can be ignored with the suggested trimming bounds. Note also that the numbers refer to the first and last base that should be included during methylation extraction, not the last and first base to ignore!.

Alignment trimming

In some protocols it is useful to trim X bases from one or both ends of the alignments regardless of their length. To do this, one can use the --nOT, --nOB, --nCTOT, and -nCTOB options, which have a format similar to that mentioned for --OT et al. above (the only difference being that one specifies a number of bases, rather than the position within each alignment). Thus, to ignore the 5 bases on either end of alignments on the original top (OT) strand, one would specify --nOT 5,5,5,5. Thus, it is irrelevant how long each alignment is, the outermost 5 bases will always be ignored. Note that specifying bounds longer than a given alignment (e.g., --nOT 100,200,300,400 for a pair of 50 base long alignments) will not cause an error.

Ignored alignments

By default, any alignment marked as being secondary (bit 256), having failed QC (bit 512), being a PCR/optical duplicate (bit 1024), being supplemental (bit 2048), or having a mapQ < 10 is ignored. This is a reasonable default and should only be changed by expert users. For those needing to change this behaviour, please see the -F (--ignoreFlags) or -q options to both MethylDackel mbias and MethylDackel extract.

Note that PCR (but not optical!) duplicates should be included for enrichment-based library preparations, like RRBS. In such cases it is vital to use the --keepDupes option!

Reads that do not map to a unique location in the genome may contain methylation information about multiple loci and can add noise to methylation calls. Using --mappability <bigwig> eliminates reads that cannot be uniquely placed (note: MapQ alone is insufficient to identify such reads). Specify a bigWig file containing low-mappabilty regions (e.g. bismap) to use this feature. --minMappableBases and --mappabilityThreshold can be used to fine-tune filtering. For faster execution, bigWig files can be compacted to bbm files using --outputBBMFile and used via --mappabilityBB.

Per-read metrics

As of version 0.4.0, MethyDackel provides a perRead subcommand that will produce the per-read CpG methylation level. This has been used by Stamenova et al. in their Hi-Culfite method. The output is a tab-separated file with the following fields:

read name
chromosome
position
CpG methylation (%)
number of informative bases.

Fragments longer than 10kb are currently not handled correctly.

Citing MethylDackel

There are no immediate plans for a MethylDackel publication. If you use MethylDackel (or PileOMeth, as it was formerly known) in your research, please simply cite the URL for this repository on github.

methyldackel's People

Contributors

Stargazers

Watchers

methyldackel's Issues

--keepDupes and -F 0x400 gives me different results

Hi,

I have some single end RRBS-seq data. I aligned with bwa-meth and marked the duplicates with samblaster.

one expects to see a lot of duplicated read for RRBS data, because the enzyme cutting site is the same.

MethylDackel (0.3.0-3-g084d926) gives me quite different results specifying --keepDupes or -F 0x400. Are these two options the same?

thanks,
Tommy

Add option to require flags

Essentially the inverse of the -F option.

Usage example:

PileOMeth extract -f 3 ...

To require all reads to be mapped in a proper pair

dot in the 4th column of

I’m wondering what meaning I should make of ‘.’ that is positioned on the 4th column of the resulting bedGraph file that is generated by MethylDackel extract.

Please see some examples below. Those are selected from the one of the resulting bedGraph files.

1 3408772 3408773 . 10 1
1 3408803 3408804 . 0 9
1 3408804 3408805 . 0 10

Thanks a lot.

allow to specify bai manually

By default MethylDackel tries to search for bai inside the bam folder, in some cases it is not always the case and bai can be in a different place. I would like to be able to configure the path to .bai manually in case it is not in the same folder

what is format of the inputfile ?

How to get the BAM file as the input file for this software ?
Can I get that bam file from bismark, BSMAP or other soft ?
Thank U !

methylKit

I'm trying to use the methylKit output with methylKit, following their vignette https://bioconductor.org/packages/release/bioc/vignettes/methylKit/inst/doc/methylKit.html#2_basics

and I'm getting an error:

Error in fread(filepath, ...) : 
  Input is either empty or fully whitespace after the skip or autostart. Run again with verbose=TRUE.

Have you seen this before?

Failed reads

Does PileOMeth ignore the QC-failed reads flag?

About the sam flag for methylation extract

Hello, I am trying to read the source code of MethylDackel to understand the methylation extract method.

I find the function int getStrand(bam1_t *b) in common.c said, "Read2, reverse comp. == OT" and "Read2, forward == OB". I cannot understand that. I guess the read2 mapped to rc strand maybe from OB, while the read2 mapped to forward strand maybe from OT.

Could you please help me to solve this problem? It will be greatly helpful to me.

Non-discordant overlapping reads don't get their phred scores tweaked

This is sort of an htslib issue, but in the examples we always double-count bases. This is because discordant pairs are ignored in overlap_push so we don't subsequently filter them out. My fear is that bwa always marks nested alignments as not being discordant (I don't know), which is going to be a big issue as read lengths increase (since we'll more frequently run into issues where both mates in a pair sequence the entire insert).

If this is a common thing with BWA, then the simplest solution is to change htslib (I can just make a pull request).

PileOMeth extract option to filter low coverage reads and bases

Usually, when we call methylation frequency in some region or specific CpG site, the depth of the coverage should be higher than > 5 or > 10 so that the methylation level was creditable. I think PileOMeth extract should add this option.

best bisulfite treated reads aligner to choose

--Hi,

which bisulfite reads aligner do you recommend to then use MethylDackel, several aligners exist:

bwa-meth: https://github.com/brentp/bwa-meth
ERNE-BS5: http://erne.sourceforge.net/
RMAPBS used in MethPipe (http://smithlabresearch.org/software/methpipe/)
Bismark (http://www.bioinformatics.babraham.ac.uk/projects/bismark/)
Bison (https://sourceforge.net/projects/dna-bison/)

thank you --

Min Depth of Coverage

I want to set a minimum depth of coverage to be 0, such that all CpG sites are produced in the results, even if the values after the position are all 0. Is there an easy way to make that possible? Currently, you can't set a min depth less than one.

Confidence intervals shouldn't b <0 or > 1

It looks like with very low coverage the confidence intervals in the mbias output can exceed the biologically possible bounds.

Multithreaded processing

In part, htslib-1.4 should help speed things up a bit, though I expect there are options that will need to be passed. Additionally, MethylDackel should itself be made multithreaded. In essence, the genome (or requested region) could get segmented and each segment processed independently. The same strategy is used in deepTools. This wouldn't be a nightmare I hope.

Is there a description on how methylation is called in MethylDackel

Hi @dpryan79,
Thank you for this software, it has been really useful in my research. Since there is no publication on MethylDackel, are there any description on how methylation is called from a BAM file?

error: [fai_fetch_seq] The sequence "chr1_CT_converted" not found

Hello, I am trying to run MethylDackel extract and I keep getting this error:

[fai_fetch_seq] The sequence "chr1_CT_converted" not found
[fai_fetch_seq] The sequence "chr1_CT_converted" not found
faidx_fetch_seq returned -2 while trying to fetch the sequence for tid chr1_CT_converted:0-1000000!
Note that the output will be truncated!

The bam file is sorted and indexed and the reference genome file is also indexed.

Do you have any suggestions to how I can overcome that?

Thank you very much.

re: Todo: BisSNP VCF format?

In the "TODO" section, a VCF style format is mentioned. Yaping did something like this for Bis-SNP (obviously, since it calls SNPs while calling methylation fraction). Have you looked at that?

best,
--t

segfault when i tried to run this out on the output of bwameth

I tried to run on this small input file:

dataset_948627.dat.zip

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402d3a in isCpG (seq=0x7fffeca41010 'N' <repeats 200 times>..., pos=190780450, seqlen=<value optimized out>) at common.c:11
11	    if(*(seq+pos) == 'C' || *(seq+pos) == 'c') {
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  0x0000000000402d3a in isCpG (seq=0x7fffeca41010 'N' <repeats 200 times>..., pos=190780450, seqlen=<value optimized out>) at common.c:11
#1  0x0000000000404cb1 in extractCalls (config=0x7fffffffdb30) at extract.c:170
#2  0x0000000000405f00 in extract_main (argc=<value optimized out>, argv=<value optimized out>) at extract.c:674
#3  0x00000035eec1ecdd in __libc_start_main () from /lib64/libc.so.6
#4  0x0000000000402a99 in _start ()

methylKit-compatible output

Hi,
I am pretty new in the RRBS world. I have a basic question. In the past I have been using Bs seeker to map and call the methylation and then I use methylKit. A collaborator told me that it would be more accurate to use bwa-meth for mapping followed by methylDackel and then methylkit for the downstream.
In bs seeker methylation call, I have been doing the call in the CpG context. I just wonder why it is not possible to have an methylkit output with the cpg context. If I use the methylkit output, does that mean that I have all the contexts included? would you rather recommend to merge context in Cpg and then transform with a bash script to a methylkit format? Many thanks for your suggestions!
Best regards
Alice

runtime error

Hi,
I have compiled successfully in Centos 6.7, but get a runtime error when analyzing data.
During the first run, fasta index is built ok, then it fails. Next runs fails directly.

error-message.txt

I have compiled it also on OSX, and the same problem. The error message is different, and mentions malloc problems:

PileOMeth(27055,0x7fff71bc7000) malloc: *** error for object 0x7ffe21d03188: incorrect checksum for freed object - object was probably modified after being freed.
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

Destranding?

HI,
I´m new working with Methyldackel. I would lik to know if the output of the tool (default or fraction) is already destranded?

Thanks for helping

remove/exclude reads with densely un-converted non-CpG sites

This is a feature request that comes from a wet lab member dealing with data that seems to suffer from a subpopulation of un-converted reads. Digging a bit, it seems like BS-Seeker2 first, and Bismark later, implemented a --filter_non_conversion flag. See:

FelixKrueger/Bismark#76

Given the work done on issue #56 , which seems related to what's needed to implement the feature described here, I wonder if this could be added to methyldackel?

Thanks in advance

Adding pileometh to bioconda

Hi @bgruening and @dpryan79 ,

I hope you don't mind if I add pileometh to bioconda.

Best regards,
Hassan

Specific non CpG context

Hi Devon,

Forgive me if I am missing this in the documentation, but I don't seem to find an option to extract non-CpG methylation in specific context (ex: CpA, CpT rather than --CHH/--CHG). Is there any way to accomplish this with MethylDackel?

Thanks for the clarification in the documentation.

Segfault occasionally in MethylDackel mbias

I've observed this error a few times... (we are routinely assessing mbias in each chr for thousands of libraries)
Retrying the command usually succeeds

Core was generated by `MethylDackel mbias --noSVG -@ 8 -r chrY /mnt/galaxy/data/genome/grch38_core+bs_'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000407d17 in mergeStrandMeth ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64
(gdb) bt
#0  0x0000000000407d17 in mergeStrandMeth ()
#1  0x000000000040849b in mbias_main ()
#2  0x00000032f761ecdd in __libc_start_main () from /lib64/libc.so.6
#3  0x0000000000402f8d in _start ()

MethylKit format not compatible with methRead

I am using the latest versions of MethylDackel (0.2.0) and MethKit (0.99.2). With the --methylKit option, MethylDackel produces files with freqC and freqT columns between [0-1]. It seems like MethylKit acutally wants these values between [0-100]. When I load the methylKit files produced by methylDackel, using methRead, the numC and numT columns of the resulting methylKitRaw objects are wrong. Here's an example.

test<-methRead(list('test'), list('test'), treatment=0, assembly='hg19', context='CpG')
print(test)
> test
methylRawList object with 1 methylRaw objects

methylRaw object with 9 rows
--------------
   chr start end strand coverage numCs numTs
1 chrM    33  33      +      524     3     2
2 chrM    34  34      -     1490     6     9
3 chrM    61  61      +      476     3     2
4 chrM    62  62      -      787     1     7
5 chrM    78  78      +      544     3     2
6 chrM    79  79      -     1225     3    10
...

And test looks like this (it's just the head from a MethylDackel methylKit file):

chrBase	chr	base	strand	coverage	freqC	freqT
chrM.33	chrM	33	F	524	0.549618	0.450382
chrM.34	chrM	34	R	1490	0.408054	0.591946
chrM.61	chrM	61	F	476	0.556723	0.443277
chrM.62	chrM	62	R	787	0.111817	0.888183
chrM.78	chrM	78	F	544	0.612132	0.387868
chrM.79	chrM	79	R	1225	0.215510	0.784490
chrM.80	chrM	80	F	538	0.630112	0.369888
chrM.81	chrM	81	R	1253	0.222666	0.777334
chrM.91	chrM	91	F	482	0.647303	0.352697

When I manually load 'test' into R and multiple the freqC and freqT columns by 100, then save it and read the new file using methRead, the numCs and numTs columns are now correct.

paired-end mode calling

Hi,

There is a new sample prep that generates read pairs where the CpG
modification needs to be read for both reads in a pair. E.g.

ff = forward/forward read pairs
rr = reverse/reverse read pairs

RefC:Base1Base2 RefG:Base1Base2

C:..  G:A.  ff  =>  ModA
C:,t  G:,,  rr  =>  ModA
C:.T  G:A.  ff  =>  C
C:t,  G:,a  rr  =>  C
C:..  G:..  ff  =>  ModB
C:,,  G:,,  rr  =>  ModB

I couldn't find any way to obtain this currently in MethylDackel,
maybe there is...

Since it seems that MethylDackel is being adopted more and more in the
community, I thought I'd propose to have this new sample prep
implemented in it (I wouldn't want to hack this together in the
"other" methylation caller, with the name starting with a b and ends
with a k, and be stuck with not being able to align the reads with
bwa-meth).

So far I scripted this in a very slow way:

Take coordinate sorted bam and sort it by read name, so pairs of
reads are consecutive.
Create a mini-sam entry for each pair of reads, run samtools
mpileup, intersect it with the list of CpGs, and record what type is
each CpG for this read pair in a tabular file.
Collate the tabular file for summary statistics, per-CpG
percentages, etc.

On a scale of 1 to 10, how much different would this be from the
current codebase and how feasible to implement?

Support for relative inclusion regions

In data from some non-classical protocols the existing options for defining the considered regions (--OT,--OB,--CTOT and --CTOB) are suboptimal since the length of the reads can vary. In the mbias plots this is usually seen as a gradual decrease (increase) of the methylation level, instead of a usual sharp border. An option for excluding N bases from the start/end of the read would guarantee more adequate calling.

Help understanding what methyldackel is doing with this bam file

I’m eyeballing the methylation calls from methyldackel (Version: 0.3.0 (using HTSlib version 1.2.1) ) on my bam file (bisulfite_test_bwameth.bam.zip).
When I count “by eye” in IGV or samtools tview(below), I find 12 A and 49 G for one specific position (I:21312) in Neuorspora crassa.

Methyldackel produces counts of 11 methylated and 5 unmethylated
I 21311 21312 68 11 5

I thought maybe this was due to me counting both reads by eye, but I tried to avoid doing that (and the difference is too big anyway).

I also tried with some crazy options to see if things were being filtered that I didn’t appreciate:
MethylDackel extract -D 200000 -F 0 -p 1 -q 0 --keepDupes --keepDiscordant /mnt/galaxy/data/genome/nc12/sam_indexes/nc12/nc12.fa test.bam
I 21311 21312 57 20 15
still not enough reads being considered...

Tweaking IGV to flag reads with mapq < 5 yields 9 A and 28 G in the histogram… still way different from methyldackel...

Can you help me understand what's going on?

here's the samtools tview output in the relevant location:

21311
CGAT
.R..
.A..
.A..
....
....
.A..
....
A...
....
....
....
T...
....
....
.A..
....
....
.A..
....
....
....
.A..
,,,,
....
....
t,,,
....
....
.A..
,,,,
,,,,
.A..
....
....
T...
.A..
.A..
T...
....
....
....
....
....
....
T...
....
,,,,
,,,,
.A..
,a,,
,,,,
....
....
,a,,
,,,,
T...
....
,a,,
....
....
....
....
....
....
.A..
....
....
....
—

sorted bedGraph ouput

In the present, the output of PileOMeth extract is non-sorted bedGraph output with the track header. Why not sort the bedgraph before put the header? I don't know whether it is harder to do?

mbias --keepDupes and -F argument parsing

Maybe this is related to #58 ... but for mbias.

To get duplicates to be included I had to include both -F and --keepDupes:

MethylDackel mbias --noSVG --keepDupes -F 2816 -@ 1 -r plasmid_puc19c grch38_core+bs_controls.fa EM-seq-2hr-4ul-2.md.bam | head -n 2
Strand	Read	Position	nMethylated	nUnmethylated
OT	1	1	4211	14

I thought the --keepDupes was not necessary, but it seems to be:

MethylDackel mbias --noSVG  -F 2816 -@ 1 -r plasmid_puc19c grch38_core+bs_controls.fa EM-seq-2hr-4ul-2.md.bam | head -n 2
Strand	Read	Position	nMethylated	nUnmethylated
OT	1	1	20	1

however --keepDupes is not sufficient:

MethylDackel mbias --noSVG  --keepDupes -@ 1 -r plasmid_puc19c grch38_core+bs_controls.fa EM-seq-2hr-4ul-2.md.bam | head -n 2
Strand	Read	Position	nMethylated	nUnmethylated
OT	1	1	20	1

Perhaps related... octal notation seems strange too: I thought -F would override --keepDupe, but it does not seem to.

MethylDackel mbias --noSVG --keepDupes  -F 0xF00 -@ 1 -r plasmid_puc19c grch38_core+bs_controls.fa EM-seq-2hr-4ul-2.md.bam | head -n 2
Strand	Read	Position	nMethylated	nUnmethylated
OT	1	1	4211	14

note: I expect a lot of dups in this library.

empty output

Hi,
I am very interested in using MethylDackel for my ongoing project looking at CpG in HIV.
I tried to run it on one sample (sorted aln bam file) again its own reference but the result is empty
I guess I am missing something obvious....
Would you mind taking a look at one sample? I can share via dropbox (private email)?
a

Using another method, I extracted the cytosine frequency at each pos so I was expecting to get non-empty output...
C9_CytosineFreq.pdf

question about double-counting C's in overlap region

Thank you for this incredibly useful tool!

I have a question about how PileOMeth extract handles methylation metrics in regions where read1 and read2 from the same paired-end fragment overlap.

Because this is a question, I posted it on BioStars:

https://www.biostars.org/p/215908/

I am looking forward to hearing what you think!

log with different loses

Hi, is there a log in MethylDackel that records how many reads/bases are lost to issues like reads being discordant, singletons, dupes, or bases being below Q-score threshold, etc.?

Would it be possible to have an optional log with these?

Methylation per read

Thank you for this tool - it's been really nice to work with (runs very fast and the code is very readable).

We are working on a project for which we need to know the methylation per read. In Bismark, this is given via the XM tag, where there is essentially a string that represents (for every cytosine covered), the context and whether or not it is methylated.

I'm currently getting this info by print out statements in lines 424-426 in extract.c and then processing the resulting text files but obviously this is less than ideal (multiple lines printed per read name that then need to be combined, files are big, etc). Is there any way to make this an option in MethylDackel? Alternatively, do you have suggestions for a better way to gather this info on a per read basis?

Thanks!

Add a --methylKit option

It was requested to directly add the ability to output stranded information, a la that required for methylKit. This should prove easy enough, since the strand is already processed internally.

threading related miscounting in mbias

for core in 1 8; do for i in `seq 1 10`; do MethylDackel mbias  --noSVG  -r contig1636 -@ $core contig1636.fa 1636_only.bam > ${core}core_rep${i};done;done

Every execution of the tool should produce the same output.
With 1 core it does:

cat 1core_* |sort |uniq | wc -l = 305

however with 8 cores:

cat 8core_* |sort |uniq | wc -l = 331

Interestingly - it only seems to occur near the end of reads. Here are some of the interesting lines
I don't see any problem before position 69.

OT	2	69	1006633219	63
OT	2	69	139	63
OT	2	69	1610612995	63
OT	2	69	1811939587	63
OT	2	69	3355443459	63
OT	2	69	4027096907	63
OT	2	70	160	57
OT	2	70	32734	57
OT	2	70	32741	57
OT	2	70	32825	57
OT	2	70	32845	57
OT	2	70	32879	57
OT	2	71	1007254735	51
OT	2	71	159	51
OT	2	71	1611234511	51
OT	2	71	1812561103	51
OT	2	71	3356064975	51
OT	2	72	139	61
OT	2	72	32713	61
OT	2	72	32720	61
OT	2	72	32824	61
OT	2	72	32858	61
OT	2	73	126	67
OT	2	73	1296579251	67
OT	2	74	1111901116	63
OT	2	74	111	63
OT	2	75	63	167
OT	2	75	63	168
OT	2	75	976705670	168
OT	2	76	11	359
OT	2	76	825309759	359

I'm including example data here in case that's helpful:
contig1636_example.zip

Error with floating point exception

Hi,

I am getting this error while running.
PileOMeth extract --CHG --CHH ~/whole_genome_TAIR10/TAIR10_all.fa 32338_CTTGTA_C89J6ANXX_3_20160105B_20160105_2.modified.bam

writing to prefix:'32338_CTTGTA_C89J6ANXX_3_20160105B_20160105_2.modified'
Floating point exception

How can I fix this?

methylation calls from spliced alignments

thank you for the nice tool! can MethylDackel deal with spliced alignments and call methylation ?

save the output to working folder or let the user configure the output folder

I run bs-seq pipelines with Cromwell and it causes me a lot of pain that methyldackel saves the results to the same folder as the input bam. I would like to configure the output folder

add Galaxy integration

The aim is to integrate PileOMeth into Galaxy and the Galaxy ToolShed.
@dpryan79 feel free to assign me to this issue.

How to cite PileOMeth in our manuscript?

BED files make everything sloooooooooooooooooow

Something with the BED file handling seriously slows things down. It's almost as if it's scanning over the list of regions for every covered base in a BAM file. With a large number of regions, it is vastly faster to extract everything and then bedtools intersect...which seems silly.

[fai_fetch_seq] The sequence "KI270728.1" not found

When I tried to run mbias in one of the bam files, I got this error message and the .svg files are not generated.

[fai_fetch_seq] The sequence "KI270728.1" not found
faidx_fetch_seq returned -2 while trying to fetch the sequence for tid 25 (KI270728.1)!
Note that the output will be truncated!

Is this due to nonstandard chromosomes in the bam file ? How do I get around this problem?

per-read and per-CpG metrics for bs-seeker2

Hi,
I'm trying to get per-read and per-CpG metrics using bam files generated by bs-seeker2, in a similar way to the output generated by bismark ('CpG_context_xxx.txt' from bismark_methylation_extractor). It looks like this:

Bismark methylation extractor version v0.17.0
LIN6K:00291:01397	-	chr1	1160677	z
LIN6K:00291:01397	+	chr1	1160714	Z
LIN6K:00291:01397	+	chr1	1160728	Z
LIN6K:00291:01397	+	chr1	1160730	Z
LIN6K:00291:01397	+	chr1	1160736	Z
LIN6K:00291:01397	+	chr1	1160761	Z
LIN6K:00291:01397	+	chr1	1160804	Z
LIN6K:00291:01397	+	chr1	1160819	Z
LIN6K:00291:01397	+	chr1	1160827	Z
LIN6K:00291:01397	+	chr1	1160876	Z
LIN6K:00291:01397	+	chr1	1160884	Z
LIN6K:00291:01397	+	chr1	1160886	Z
LIN6K:00291:01397	+	chr1	1160904	Z
LIN6K:00291:01397	+	chr1	1160924	Z
LIN6K:01879:02897	+	chr1	1160714	Z
LIN6K:01879:02897	+	chr1	1160728	Z
LIN6K:01879:02897	+	chr1	1160730	Z
LIN6K:01879:02897	+	chr1	1160736	Z
LIN6K:01879:02897	+	chr1	1160761	Z

In other word, each read or fragment has multiple lines with the methylation status of each CpG. Is there an easy way to extract this information from the .bam files using MethylDackel ?
Thanks !

discrepancy between MethylDackel and IGV PBAT-style data aligned with bwameth

Hi,

I tried to used MethylDackel extract on a dataset aligned with bwameth which was prepped with a PBAT-style method, in the sense that the data aligns to the CTOT and CTOB strands.

We used to do this by aligning with bismark --pbat:

Option '--pbat': alignments to original strands (OT and OB) strands were ignored (i.e. not performed)

And then extract the methylation counts with bismark_methylation_extractor.

Currently, we are evaluating the use of bwa-meth followed by MethylDackel. But because of the way bwa-meth labels the strands and the PBAT nature of the data, MethylDackel seems to be counting on the basis of a directional OT/OB alignment, rather than a PBAT CTOT/CTOB alignment. I spotted this by the discrepancy between the IGV colouring and the MethylDackel counts, see below:

Long story short, the question is:

As a replacement of the bismark --pbat + bismark_methylation_extractor combo, how can I run bwa-meth + MethylDackel extract to obtain the counts as a PBAT CTOT/CTOB method?

Unable to determine strand error

I receive the following error when attempting to run PileOMeth:

PileOMeth: common.c:78: updateMetrics: Assertion strand != 0' failed.
/sge/middle_earth/spool/n26vm1/job_scripts/1089854: line 1: 11686 Aborted (core dumped) PileOMeth extract --CHG --CHH genome.fa x.bam`

Sample lines from my BAM file (the file is valid according to Picard's ValidateSamFile):

SRR299053.60865334  16  chr10   3002236 38  100M    *   0   0   TAAATCTAAAAATATTCATTCACTACCGCCACCTAACTACCTCACATTTAATATTAACTAAAACAAATACTATTATAAACATCTACGATACACAAAAACT    ########B?CC-CC9:<=9:CD:@EDEEEBEECECAEBEF?DEDGBGEFGEGAFGGGDGEFFFFGGGEGGGFGGEEEEDGDGFGGGGGGGGGGGGGGGG    NM:i:0  MD:Z:100    AS:i:100    XS:i:85 RG:Z:G1 XA:Z:rchr10,-3044656,100M,5;    YC:Z:CT YD:Z:r
SRR299053.61911728  0   chr10   3002249 60  100M    *   0   0   GTTTATTTATTATCGTTATTTGGTTATTTTATATTTAGTGTTGATTGAAGTAGATGTTGTTGTAGGTATCTGCGGTGTATAGAGGTTTAATTTATATAAG    EEEEEEFGEGGEGGGGGAGDEEEEEEEEEC>FAFFEEEEEFFFDFGGFGGFFGFGGFGGFGGBEGEEEFEEGGDGFDGBG?E=EACA?:5?BB:B>?###    NM:i:0  MD:Z:100    AS:i:100    XS:i:23 RG:Z:G1 YC:Z:CT YD:Z:f
SRR299053.22449369  16  chr10   3002253 31  100M    *   0   0   ATTCACTACCGCCACCTAACTACCTCACATTTAATATTAACTAAAACAAATACTATTATAAACATCTACGATACACAAAAACTCAATCCACACAAAAAAA    B=46>C@7ACCCCCCEDEEEECEEEFCDFEECEECEEBEEF;EEFECFFFEDDFFGGGEGEEDEEGGEFGGGGFGFFFFFFFFFFGEGGGGGGGGGGGGG    NM:i:0  MD:Z:100    AS:i:100    XS:i:88 RG:Z:G1 XA:Z:rchr10,-3044673,100M,4;    YC:Z:CT YD:Z:r
SRR299053.9560181   16  chr10   3002286 23  100M    *   0   0   ATATTAACTAAAACAAATACTATTATAAACATCTACAATACACAAAAACTCAATCCACACAAAAAAAACAAACACAATTATTATCAACATTTAAAACATA    B:==:EEB8BABA4CCA?>ACDB=DABBAAB@BA@BEEEBA?EEEFFEFDEEEBEFDE?EGGGGGCFFFFEEECEFFF?FFFGBGGGFGGGGGGGGGGGG    NM:i:0  MD:Z:100    AS:i:100    XS:i:91 RG:Z:G1 XA:Z:rchr10,-3044706,100M,3;    YC:Z:CT YD:Z:r

My best guess is that this is from line 51 of common.c returning 0. It is not clear to me why that line does not return 4 instead (i.e. default to single-end forward in this instance).

I'd appreciate any insight into this that you might be able to provide.

Third column in bedGraph is percentage?

Dear Devon,
On Brent's advice, I have used PileOMeth on bwameth's BAM file.

In the output of the extract module, I see that the 4th column (chromosome counted as 1st column) contains values like "1000" or "666".
Given that the last two columns contain the methylated and unmethylated calls, it is obvious that the 4th column represents the ratio. I simply believe that you multiplied by 1000 instead of 100 to obtain the percentage.

Kind regards
Kevin

C>T SNPs

Hi Devon,

Just out of curiosity, how does MethylDackel/PileOMeth handles C>T SNPs in the methylation calling? BisSNP and the newer biscuit have the ability to distinguish the SNP from the opposite strand in directional sequencing and exclude those SNPs from the methylation calling. Is there a similar way to exclude those SNPs which are unreliable for methylation calling with MethylDackel?

Thanks!

bismark equivalent files

Hi, I am testing MethylDackel extract on bwameth alignments. Is there an equivalent file to the .CpG_report.txt.gz file produced by bismark_methylation_extractor?

dpryan79 / methyldackel Goto Github PK

methyldackel's Introduction

Prerequisites

Compilation

License

Methylation Context

Usage

A note on overlapping reads

Single Cytosine methylation metrics extraction

Per-CpG/CHG metrics

Excluding low-coverage regions

Excluding partially converted reads

Logit, fraction, and counts only output

methylKit-compatible output

Excluding likely variant sites

Methylation bias plotting and correction

Alignment trimming

Ignored alignments

Per-read metrics

Citing MethylDackel

methyldackel's People

Contributors

Stargazers

Watchers

Forkers

methyldackel's Issues

Recommend Projects

Recommend Topics

Recommend Org