methhaplo's Introduction

MethHaplo: Combining Allele-specific DNA Methylation and SNPs for Haplotype Region Identification

DNA methylation is an important epigenetic modification that plays a critical role in most eukaryotic organisms. Parental alleles in haploid genomes may exhibit different methylation patterns, which can lead to different phenotypes and even different therapeutic and drug responses to diseases. However, to our knowledge, no software is available for the identification of DNA methylation haplotype regions. In this paper, we developed a new method, MethHaplo, that identifies DNA methylation haplotype regions with allele-specific DNA methylation and single nucleotide polymorphisms (SNPs) from whole-genome bisulfite sequencing (WGBS) data. Our results showed that methylation haplotype regions were ten times longer than haplotypes with SNPs only. When we integrate WGBS and high-throughput chromosome conformation capture (Hi-C) data, MethHaplo could call even longer haplotypes. By constructing methylation haplotypes for various cell lines, we provide a clearer picture of the effect of DNA methylation on gene expression, histone modification and three-dimensional chromosome structure at the haplotype level. Our method could benefit the study of parental inheritance-related disease and heterosis in agriculture.

This is a README file for the usage of MethHaplo.


REQUIREMENTS

  1. gcc (v4.8), GSL library
  2. SAMtools
  3. Python3
  4. Perl

INSTALL


a) Download: git clone https://github.com/ZhouQiangwei/MethHaplo.git

b) Change directory into the top directory of MethHaplo: cd MethHaplo

c) Type

  • make
  • make install

d) The MethHaplo binary will be created in the current folder

USAGE of MethHaplo


Example data

You can find the test data in the ./test directory.

Citation:

Zhou, Q., Wang, Z., Li, J. et al. MethHaplo: combining allele-specific DNA methylation and SNPs for haplotype region identification. BMC Bioinformatics 21, 451 (2020).

Usage

1. MethHaplo command

        MethHaplo: Combining Allele-specific DNA Methylation and SNPs for Haplotype Region Identification
        Usage: methHaplo -M [mode] -a Y/N -m methfile -s <sam>/-b <bam> -o outputprefix
        Options:
                -M <string> [hap|asm]         methHaplo analysis mode
                                                hap: iterative approach, prefer longer haplotype results;
                                                asm: hypergeometric approach, prefer accurate asm results.(default: hap);
                -m, --methfile <file>         methratio file (required)
                                                format: chr  pos  strand  context methC  coverage  methlevel
                -o, --out <string>            output file prefix
                -s, --sam <samfile>           sam file from batmeth2-align.  This file should be coordinate sorted, 
                                                using the <samtools sort> command, and must contain methylstatus[MD:Z:].
                -b, --bam <bamfile>           bam file, should be coordinate sorted. (use this option or -s option but not both)
                -a <Y/N>                      Whether the bam/sam file contains the MD state added by the batmeth2 calmeth scripts.
                                                If not, please define the genome location with the -g parameter.
                -g, --genome <genome>         Genome file, needed if the bam/sam file does not contain MD.
                -q <int>                      only process reads with mapping quality >= INT [default >= 20].
                -c, --context                 methylation context process for methHaplo. CG, CHG, CHH, ALL[default].
                -C, --NMETH                   Number of methylated reads cover cytosine site. default: 2 [m>=2]
                -N, --NCOVER                  Number of coverage reads in cytosine site. default: 6 [n >= 6]
                -f, --MFloat                  Cutoff of methratio. default: 0.1 [f <= meth <= 1-f]
                --minIS <INT>                 Minimum insert size for a paired-end read to be considered as single fragment for phasing, default 0
                --maxIS <INT>                 Maximum insert size for a paired-end read to be considered as a single fragment for phasing, default 1000
                --DBtmpsize <INT>             Maximum size of temp read store, default 12000. (only useful in asm mode)
                --PE                          Paired-end reads.
                -v, --vcffile <file>          snp file (optional)
                -r, --chromosomal-order       Use natural ordering (1,2,10,MT,X) rather than the default (1,10,2,MT,X). 
                                                This requires a new version of the unix "sort" command which supports the --version-sort option.
                -p, --parallel <int>          Change the number of sorts run concurrently to <int>
                -t, --temporary-directory     Use a directory other than /tmp as the temporary directory for sorting.
                -h, -?, --help                This help message.
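
The -C, -N and -f cutoffs above can be read as a per-site filter on the methratio file. The following is a minimal sketch of that reading, not MethHaplo's actual code: it assumes the seven columns listed for -m, that methlevel is a fraction between 0 and 1, and that all three cutoffs must hold at once. The names MethSite, parse_methratio_line and passes_filter are hypothetical.

```python
from typing import NamedTuple, Optional


class MethSite(NamedTuple):
    """One row of a methratio file (column order assumed from the -m help text)."""
    chrom: str
    pos: int
    strand: str
    context: str
    meth_c: int       # methC: reads supporting methylation at this cytosine
    coverage: int     # reads covering this cytosine
    methlevel: float  # assumed to be a fraction in [0, 1]


def parse_methratio_line(line: str) -> Optional[MethSite]:
    """Parse the first seven whitespace-separated columns of one line.

    Returns None for comments, short lines, and header lines whose numeric
    columns cannot be converted. Extra trailing columns are ignored.
    """
    if line.startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        return MethSite(fields[0], int(fields[1]), fields[2], fields[3],
                        int(fields[4]), int(fields[5]), float(fields[6]))
    except ValueError:
        return None


def passes_filter(site: MethSite, nmeth: int = 2, ncover: int = 6,
                  mfloat: float = 0.1, context: str = "ALL") -> bool:
    """Apply the documented defaults: m >= 2, n >= 6, f <= meth <= 1-f.

    Combining the three cutoffs with a logical AND is an assumption.
    """
    if context != "ALL" and site.context != context:
        return False
    return (site.meth_c >= nmeth
            and site.coverage >= ncover
            and mfloat <= site.methlevel <= 1 - mfloat)
```

Under these assumptions, a site with 5 methylated reads out of 10 passes, while fully methylated or fully unmethylated sites are dropped because they carry no allele-specific signal.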

2. Allele-specific DNA methylation region visualization

python methpoint.py align.md.sort.bam chrom:start-end strand outputprefix visulsort

        [align.md.sort.bam] BS-Seq alignment file for visualization.
        [chrom:start-end] The region in chromosome:start-end for visualization.
        [strand] visualization strand. [+/-/.]
        [outputprefix] output file prefix
        [visulsort] Methylation and Unmethylation position in the figure. [0/1]
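
As an illustration of the argument formats listed above, the sketch below checks a chrom:start-end region string and the strand value before a call to methpoint.py. It is a hypothetical helper, not code from methpoint.py; in particular, accepting thousands separators and rejecting start > end are assumptions.

```python
import re
from typing import Tuple

# chrom:start-end, e.g. chr1:2000037-2001013 (commas in numbers tolerated here)
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")

VALID_STRANDS = ("+", "-", ".")


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = _REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"expected chrom:start-end, got {region!r}")
    start = int(match.group("start").replace(",", ""))
    end = int(match.group("end").replace(",", ""))
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return match.group("chrom"), start, end


def check_strand(strand: str) -> str:
    """Return the strand unchanged if it is one of +, - or '.'."""
    if strand not in VALID_STRANDS:
        raise ValueError(f"strand must be one of {VALID_STRANDS}, got {strand!r}")
    return strand
```

For example, parse_region("chr1:2000037-2001013") returns ("chr1", 2000037, 2001013).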

[Figure: asmexample]

The figure above shows the distribution of methylation sites in raw reads: orange marks methylated sites, green marks unmethylated sites, and blue marks mutated bases. The bottom panel shows DNA methylation sites and methylation levels.

3. Allele-specific DNA methylation site distribution across TSS/TES etc.

3.1 Calculate coverage across TSS/TES sites.
ASManno [options] -o <OUT_PREFIX> -G GENOME -gff <GFF file>/-gtf <GTF file>/-b <bed file> -ap <asm plus file> -an <asm neg file>
Usage:
	-o|--out         Output file prefix
	-G|--genome      Genome
	-ap|--asmplus    ASM plus file.
	-an|--asmneg     ASM neg file.
	-p|--pvale       Pvalue cutoff. default: 0.01
	-gtf|-gff        Gtf/gff file
	-b|--BED         Bed file, chrom start end (strand, .bed4 format)
	--ped            loci file, chrom start (strand, .ped3 format)
	-d|--distance    ASM distributions in body and <INT>-bp flanking sequences. The distance of upstream and downstream. default:2000
	-B|--body        For different analysis input format, gene/TEs body methylation level. [Different Methylation Gene(DMG/DMT...)]
	-P|--promoter    For different analysis input format.[Different Methylation Promoter(DMP)]
	-s|--step        Gene body and their flanking sequences using an overlapping sliding window of 5% of the sequence length at a step of 0.8% of the sequence length. So default step: 0.008 (0.8%)
	-h|--help
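
To make the -s|--step description concrete, the sketch below lays out overlapping windows of 5% of a sequence's length at a step of 0.8% of that length. It only illustrates the arithmetic in the help text; how ASManno rounds coordinates, treats flanking sequences, or handles the final partial window is not stated above, so those choices are assumptions, and sliding_windows is a hypothetical name.

```python
from typing import List, Tuple


def sliding_windows(length_bp: int, window_frac: float = 0.05,
                    step_frac: float = 0.008) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of overlapping windows along a sequence.

    Window size and step are fractions of the sequence length. Only windows
    that fit entirely inside the sequence are produced (an assumption).
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    if not (0 < window_frac <= 1) or step_frac <= 0:
        raise ValueError("fractions must satisfy 0 < window_frac <= 1 and step_frac > 0")

    # Number of steps that still leave room for a full window; the small
    # epsilon guards against floating-point error at exact boundaries.
    n_windows = int((1.0 - window_frac) / step_frac + 1e-9) + 1
    window_bp = max(1, round(window_frac * length_bp))

    windows = []
    for i in range(n_windows):
        start = round(i * step_frac * length_bp)
        end = min(start + window_bp, length_bp)
        windows.append((start, end))
    return windows
```

With the defaults, a 10,000 bp gene body is covered by 500 bp windows that advance 80 bp at a time, so neighbouring windows share most of their bases.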
3.2 Visualization
python methylevel.py Num Input1.Methylevel.1.txt [Input2 ...] label outprefix

[Figure: asmsite]

methhaplo's Issues

hap output questions

Hi, thank you for the nice tool. I'm just starting to use it and am a little unclear about the output.

I've run

${methHaplo} \
    -M hap \
    -m asmtestdata.mr \
    -o ${outPrefix} \
    -b samtestdata.bam \
    -a Y \
    -t ${tmp_dir} \
    --PE

on the provided asm data in /test/. It has generated three output files:

  • haplo
#Block	chr1	2000037	2001013
chr1	2000037	A	G	2
chr1	2000047	C	T	1
chr1	2000118	C	T	1
chr1	2000142	T	C	1
  • n.haplo
BLOCK: offset: 1 len: 11 phased: 11 SPAN: 563 fragments 30
1	1	0	chr1	2000037	G	A	0/1	0	.	30.81
2	0	1	chr1	2000230	G	A	0/1	0	.	100.00
3	0	1	chr1	2000301	G	A	0/1	0	.	42.78
4	1	0	chr1	2000328	G	A	0/1	0	.	100.00
5	0	1	chr1	2000335	G	A	0/1	0	.	14.61
  • p.haplo
BLOCK: offset: 1 len: 23 phased: 23 SPAN: 1072 fragments 58
1	1	0	chr1	2000047	C	T	0/1	0	.	100.00
2	1	0	chr1	2000118	C	T	0/1	0	.	100.00
3	0	1	chr1	2000142	C	T	0/1	0	.	100.00
4	1	0	chr1	2000150	C	T	0/1	0	.	100.00

Q1: Are these results generated using both methylation and SNP information? Is it possible to specify to just use methylation or just use SNP information with MethHaplo?

Q2: Please could you explain the meaning of "fragments" in the .n and .p files, and the meaning of the last 3 columns in those files?

makefile is missing flags

I believe that $(LDFLAGS) are missing from lines 56-88 of the makefile, meaning that the optimisation is missing, and that compilation errors will occur if gsl is installed in non-default locations.

$(LDFLAGS) should be added to these lines.

--PE flag

Hello,
Please could you explain the effect of the --PE flag further. I understand that it should be used to indicate whether the reads are paired-end or single end, but please could you explain how these are treated differently by the algorithm?

In my hands the phase blocks output in the .haplo file by the command with the PE flag are identical to the results without the PE flag (in both cases using paired end read data). I'm not sure if this is due to a conflict with another parameter, the flag's not being passed properly, or whether the PE flag isn't implemented fully (I see in the src readme that it's not recommended to use PE)

1. Is there a way to diagnose whether the argument's been passed correctly from the methHaplo output? I see that line 434 of methyhaplo.cpp looks like it should log this (Show_log("Paired-end mode");), but I can't see "Paired-end mode" or "single-end mode" in the output files - where should this be written to?
2. Are the output blocks using the paired information, or treating the reads as single end? The commands used are below:

methHaplo \
  -M hap \
  -m METHRATIO.mr \
  -o OUTPUT \
  -b BAM_FILE.bam \
  -a Y \
  --PE \
  --vcffile VCF_FILE.vcf

methHaplo \
  -M hap \
  -m METHRATIO.mr \
  -o OUTPUT \
  -b BAM_FILE.bam \
  -a Y \
  --vcffile VCF_FILE.vcf

Thanks for your help, please let me know if anything's unclear or I should provide more information.

Can't find methpoint.py

Hi, Zhou,
Thank you for developing this useful tool. I don't know if I made a mistake when installing MethHaplo, but I can't find the Python scripts for visualization, including methpoint.py and methylevel.py. Could you please tell me where to find them?
Thanks in advance.

Huang Yue

".tmp" being added to .neg.txt and .plus.txt files

For some reason when I use the following command

methHaplo -M asm -m Fairchild_meth.methratio.txt -o Meth_haplo -s Fairchild_meth_state.sort.sam -a Y -v 2021-11-24_linkedread_SNPs_only.vcf

It produces the files Meth_haplo.tmp.neg.txt and Meth_haplo.tmp.plus.txt which are both empty. I then receive the error awk: fatal: cannot open file `Meth_haplo.neg.txt' for reading (No such file or directory).

Why does the program keep appending .tmp to these files? It does not do this for the .bed, .plus, .neg files.

ERROR: Invalid Option "--strand" specified.

methHaplo -M hap -a Y -m asmtestdata.mr -b asmtestdata.bam -o haptest.myoutput

Program dir: /home/yuxin/Software/MethHaplo/bin/
[Methyhaplo] process methfile!
[MethyHaplo] Processing methratio file ...
Done!
[Methyhaplo] split bamfile by alignment strand!
[Methyhaplo] Haplotype assembly with plus strand DNA methylation information!
/home/yuxin/Software/MethHaplo/bin//extracthairs: /lib/x86_64-linux-gnu/libhts.so.3: no version information available (required by /home/yuxin/Software/MethHaplo/bin//extracthairs)
/home/yuxin/Software/MethHaplo/bin//extracthairs: /lib/x86_64-linux-gnu/libhts.so.3: no version information available (required by /home/yuxin/Software/MethHaplo/bin//extracthairs)
/home/yuxin/Software/MethHaplo/bin//extracthairs: /lib/x86_64-linux-gnu/libhts.so.3: no version information available (required by /home/yuxin/Software/MethHaplo/bin//extracthairs)
/home/yuxin/Software/MethHaplo/bin//extracthairs: /lib/x86_64-linux-gnu/libhts.so.3: no version information available (required by /home/yuxin/Software/MethHaplo/bin//extracthairs)

ERROR: Invalid Option "--strand" specified.
failed: at /home/yuxin/Software/MethHaplo/bin/methHaplo line 250.

Columns with .mr file

Hi:

Thanks for developing this nice tool. I want to use it for ASM detection. However, I am not sure what the columns in the .mr file mean. The README file indicates that the last three columns are:

[image]

However, in the example file in the test folder, the .mr file seems to have more columns, and the first 7 columns do not seem to be entirely the same as the README indicates.

Best
Tian

Column explanation of the output

I'm not sure what { #MM #MU #UM #UU }/{#MV Ref|Var} means. I understand M is Methylated and U is Unmethylated, but what do MV and Ref|Var mean, and why is {#MV Ref|Var} the denominator? Is this the 8th column?

#chrom	coordinate1	coordinate2	{ #MM	#MU	#UM	#UU }/{#MV	Ref|Var}	pvalue
chr4	10207	10233	2	2	0	2	0.466667	0.995731
chr4	10233	10279	0	7	1	5	0.461538	0.995731
chr4	10279	10311	2	1	5	11	0.523220	1.000000

My purpose is to use MethHaplo to find ASM regions, so I ran the asm mode. Should I merge asm.plus.bed and asm.neg.bed to get the complete set of ASM regions?

Thanks.
