hhg7 / defiant Goto Github PK
View Code? Open in Web Editor NEWDifferential methylation: Easy, Fast, Identification and ANnoTation
License: GNU General Public License v2.0
Differential methylation: Easy, Fast, Identification and ANnoTation
License: GNU General Public License v2.0
You write in the README.md that you support the following formats:
Input Type 5 Bismark coverage2cytosine format:
//Bismark coverage2cytosine format Example: chr1 762 763 + 17 64 CG CGA
Column1: chromosome, which is a string. Column2: nucleotide/start position, an unsigned integer [0,4294967295]. Column3: strand. Column4: methylated C count, an unsigned integer in [0,4294967295]. Column5: C count, an unsigned integer in [0,4294967295]. Column6: C-context, e.g. CG, CH, CHH. Column7: C-context, e.g. CGA, CGT, etc.
Input Type 6 Bismark coverage2cytosine format:
Example: chr1 762 763 0.265625 17 76
Column1: chromosome, which is a string. Column2: nucleotide/start position, an unsigned integer in [0,4294967295]. Column3: nucleotide/end position, an unsigned integer in [0,4294967295]. Column4: methylation percentage, which is calculated by Defiant. Column5: methylated C count, an unsigned integer in [0,4294967295]. Column6: C count, an unsigned integer in [0,4294967295].
However, these don’t entirely match what is described in the bismark_methylation_extractor help:
The genome-wide cytosine methylation output file is tab-delimited in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C-context> <trinucleotide context>
and
The coverage output looks like this (tab-delimited, 1-based genomic coords; zero-based half-open coordinates available with
--zero_based
):
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count non-methylated>
You call both "coverage2cytosine" format. The "coverage2cytosine" Bismark module can create a "genome-wide cytosine methylation output file" (which looks ALMOST like Input Type 5) from the coverage output (which looks ALMOST like your Input Type 6), but can also be created from bismark_methylation_extractor directly.
In Input Type 5 example you show start and end position (and 8 columns in total), but describe below only start position and 7 columns in total. I assume it's just a typo in the example?
You write the start/end position for all are in [0,4294967295], Bismark by default uses 1-based, unless --zero-based
is explicitly specified, and only then it becomes half-open. So, by default it's all 1-based and start position == end position, in your example it says '762 763', so should indeed --zero-based
be specified?
Bismark clearly states "count methylated" and "count non-methylated" rather than "methylated C count" and "C count". "C count" sounds like total count (methylated + non-methylated). What is actually expected here?
Input Type 6 "Column4: methylation percentage, which is calculated by Defiant." - Why is this calculated by Defiant? And how? Shouldn't this be input to Defiant? It is part of the Bismark coverage output. However your example... "chr1 762 763 0.265625 17 76 "
How would you get to 0.265625? It's neither 17/76, nor 17/(76+17), depending on what you actually mean in no 4... (17/64=0.265625 , assuming the 64 that you mention in input type 5 example )
However, from an Bismark run, I got e.g. in coverage output (test.deduplicated.bismark.cov.gz):
chr3 3008646 3008646 33.3333333333333 1 2
chr3 5620584 5620584 75 3 1
So, the methylation percentage is (100*col5/(col5+col6)) and not (col5/col6)
(Also, the start and end position are same (as stated in 3), unless --zero-based is used, but then it would not be valid input to the coverage2cytosine script.)
Please consider to provide an example call for the bismark_methylation_extractor, that will produce files of the type that defiant will read and process as expected.
Looking forward to test the program once this is clarified.
Dear David,
My input file looks like this (it is a Bismark coverage2cytosine format originating from a non-model organism):
Fvb1 4 + 1 0 CG CGA
Defiant is not recognizing this file format because defiant requires that chromosome names start with "chr".
I changed read_bisulfite_data.h (and recompiled) to get Defiant going on my input file:
if ((number_of_columns == 1) && (strstr(temporary_char_string,"chr") != NULL)) {
into
if (number_of_columns == 1) {
This workaround works and Defiant is generating relevant results in no time!
However, to make the annotation work using a gtf file that contains chromosome names not starting with "chr", more elaborate changes are needed.
It would be great if Defiant could be used on non-model organisms.
Thanks!
Kind regards,
Filip
Hi,
I am using defiant to do DMR analysis for my two WGBS samples. I used Bismark Coverage2cytosine output as input for defiant. The command used to run defiant is:
defiant -L case1,case2 -l SK2_vs_SK6 -a annotation.gtf -cpu 6 -fdr bh -i case1.CpG_report.txt case2.CpG_report.txt
But I am getting the following error upon running:
script.sh: line 2: 159754 Segmentation fault (core dumped)
I have sufficient memory in my server (256 GB). Please Help me to solve this error.
Thank you.
Dear David,
Could you please let me know what are the acceptable annotation formats usable with -a option.
I gave a gff and later a gtf. Both of them were not recognized.?
Regards,
Jeffin
Hi @hhg7 ,
I am a contributor to "bioconda", https://bioconda.github.io/, that packages bioinformatics tools for easier installation and distribution. I would like to create a conda package for "defiant". However, one of the requirements for this, is to have a tagged release. Therefore, can you please tag a release and create a tarball for it so I can proceed?
Thank you in advance, and please let me know if you have any questions regarding this request.
--
Natasha
Hi,
can defiant analysis Difference Methylation on ROI?For example, I want to analysis difference methylation of a set of immune response genes between two groups, the difference methylation is on whole genebody level, not DMR.
Thanks!
Yuanzhi
Hi David,
Hope you are doing well!
I have a question which has left me a bit stumped. We are studying the Indian variety of rice whose annotations are available in ENSEMBL.
I have provided the GTF, as converting to refFlat wasn't really working, however it seems that the number of features that are being read are a bit too much?
There are 255844 genes and 2140 chromosomes read from ../ref/Oryza_indica.ASM465v1.43.gtf
(ftp://ftp.ensemblgenomes.org/pub/release-43/plants/gtf/oryza_indica/Oryza_indica.ASM465v1.43.chr.gtf.gz)
I believe that there should be 40320-ish genes.
I'm converting it to refFlat using gtfToGenePred, where in it only seems to detect 2262 genes. I'm a bit confused about this disparity. I'm using the following to convert:
/apps/kent_utils/gtfToGenePred -genePredExt -geneNameAsName2 Oryza_indica.ASM465v1.43.gtf reflat.txt
This seems to throw an error:
Couldn't recognize annotation file format with strand_definition = 3 @ annotation.h line 112
which is kinda confusing. Do you have any ideas?
Hi,
I'm running the tool on bismark coverage reports.
This is a draft genome assembly annotated using augustus. I have tried giving the gtf, gff and converted the gff to genePred format as well. All the three cases the error is being raised.
The features annotated in the gff file are as follows, over approximately 5000 scaffolds.
CDS
gene
intron
start_codon
stop_codon
transcript
I tried download with multiple browsers and computers, but I can't unzip defiant.zip.
Thanks for taking a look at that.
Hi
I got this error when i tried to run defiant. Input file is "Input Type 2 known for MethylKit input"
Thank You
########
Set 1 will be labeled "BS1"
Set 2 will be labeled "BS2"
There are 2 sets to be read
There will be 1 runs.
Run information will be in defiant_run_info.txt
Now reading file ss_BS1_L005.ATCGmap.filter.CG.txt into memory...
Chr1.1192 Chr1 1192 F 15 100.00 0.00
:
could not identify file type with number_of_columns = 7, integers = 2, floats = 2
failed at read_bisulfite_data.h line 205
strand is defined @ column 4.
column1_is_int = false.
column1_is_chr = true.
Hi,
I want to use defiant to identify DMR in my WGBS data (non-model organism). I used the following command but can not identify any DMR. Please let me know if and how I could relax the conditions more. Am I doing something wrong?
Installation was done with the bioconda package (https://anaconda.org/bioconda/defiant). The input file are the bismark2coverage.report.txt files. Those files only contain positions of CpG sites. I checked in the run_info_comparison.txt file, and 100% of my nucleotide pairs passed the coverage condition. What I noticed is that my chromosome names don't start with "chr," but I saw that you fixed that in the issue "Chromosome names need to start with "chr". Additionally, I compared the "file_type5.txt" from the issue "Clarify input format (for Bismark output)" with my file and I have the same column. I only noticed that the numbers are much lower in columns 4 and 5. My numbers are between 0 and 5, whereas yours are up to 30. You only included the + strand in your file, whereas I have both the + and - strands.
Could that be why it does not find any DMR in my dataset?
Command used:
defiant -c 10 -p 0.05,0.10 -L Normoxia,Anoxia -l comparison -i N1-D14-1.CpG_report.txt,N2-D15-1.CpG_report.txt,N3-D16-2.CpG_report.txt,N7-D36-1.CpG_report.txt A1-D17-1.CpG_report.txt,A2-D18-1.CpG_report.txt,A4-D26-2.CpG_report.txt,A7-D37-1.CpG_report.txt
Please let me know if you need additional information and thank you for your help!
Hi Dave!
I have recently come across cgmaptools which is a host of tools that can be used to call methylation and other analysis like allele specific methylation, snv calling etc.
The cgmap output can be derived from either post-processing the aligned bams or is one of the outputs from BS-Seeker.
The format as such looks like this:
chr1 G 3000851 CHH CC 0.1 1 10
chr1 C 3001624 CHG CA 0.0 0 9
chr1 C 3001631 CG CG 1.0 5 5
chr1 G 3001632 CG CG 0.9 9 10
Where the headers are:
Chromosome Reference_Nucleotide Position Context_Type Context Methylation_Status Methylated_Cyto_Counts Total_Cyto_Counts
The format as such seems to be similar to Bismark's format 5 as per the Readme.
Would it be possible for you to support it?
Hello and thank you for this great tool!
When I use one annotation file, defiant runs fine with no problems, e.g.
head ../reference.gtf
chr1 Gnomon gene 8414 14345 . + . gene_id "LOC115964938"; transcript_id ""; db_xref "GeneID:115964938"; gbkey "Gene"; gene "LOC115964938"; gene_biotype "lncRNA";
chr1 Gnomon transcript 8414 14345 . + . gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gbkey "ncRNA"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA";
chr1 Gnomon exon 8414 8803 . + . gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "1";
chr1 Gnomon exon 10872 10973 . + . gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "2";
chr1 Gnomon exon 11087 11255 . + . gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "3";
chr1 Gnomon exon 11408 11506 . + . gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "4";
chr1 Gnomon exon 11731 14345 . + . gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "5";
chr1 Gnomon transcript 9687 14345 . + . gene_id "LOC115964938"; transcript_id "XR_004086011.1"; db_xref "GeneID:115964938"; gbkey "ncRNA"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X3"; transcript_biotype "lnc_RNA";
chr1 Gnomon exon 9687 10048 . + . gene_id "LOC115964938"; transcript_id "XR_004086011.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X3"; transcript_biotype "lnc_RNA"; exon_number "1";
chr1 Gnomon exon 10872 10973 . + . gene_id "LOC115964938"; transcript_id "XR_004086011.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X3"; transcript_biotype "lnc_RNA"; exon_number "2";
../defiant -a ../reference.gtf -b -N -c 10 -cpn 5 -q 3000 -cpu 4 -L "bottom,top" -l out -i bottom1.txt,bottom2.txt,bottom3.txt,bottom4.txt,bottom5.txt,bottom6.txt,bottom7.txt,bottom8.txt,bottom9.txt,bottom10.txt,bottom11.txt top1.txt,top2.txt,top3.txt,top4.txt,top5.txt,top6.txt,top7.txt,top8.txt,top9.txt,top10.txt,top11.txt
____ ___________________ _ ________
/ __ \/ ____/ ____/ _/ | / | / /_ __/
/ / / / __/ / /_ / // /| | / |/ / / /
/ /_/ / /___/ __/ _/ // ___ |/ /| / / /
/_____/_____/_/ /___/_/ |_/_/ |_/ /_/
Differential methylation: Easy, Fast, Identification and ANnoTation
by David E. Condon, University of Pennsylvania, 2015-2020
Reading gene annotation ../reference.gtf
There are 45548 genes and 20 chromosomes read from ../reference.gtf.
You specified the '-b' option, so there will be an output bed file.
You specified the '-N' option, so Defiant will print a list of individually differentially methylated nucleotides in the answers file.
The minimum coverage used will be 10
However, when I use an annotation file with a different annotation name, I am getting a segmentation fault
error, e.g.
head ../FilteringA2.UCSCbed12ExportFINALgeneNamesONLY.delChrCMandChr1Mins.gtf
chr1 FilteringA2 transcript 9756 14465 . + . gene_id "QL01p000012"; transcript_id "QL01p000012";
chr1 FilteringA2 exon 9756 10164 . + . gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "1"; exon_id "QL01p000012.1";
chr1 FilteringA2 exon 10872 10973 . + . gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "2"; exon_id "QL01p000012.2";
chr1 FilteringA2 CDS 10913 10973 . + 0 gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "2"; exon_id "QL01p000012.2";
chr1 FilteringA2 exon 11177 11255 . + . gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "3"; exon_id "QL01p000012.3";
chr1 FilteringA2 CDS 11177 11255 . + 2 gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "3"; exon_id "QL01p000012.3";
chr1 FilteringA2 exon 11408 11506 . + . gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "4"; exon_id "QL01p000012.4";
chr1 FilteringA2 CDS 11408 11506 . + 1 gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "4"; exon_id "QL01p000012.4";
chr1 FilteringA2 exon 11731 11880 . + . gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "5"; exon_id "QL01p000012.5";
chr1 FilteringA2 CDS 11731 11880 . + 1 gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "5"; exon_id "QL01p000012.5";
../defiant -a ../FilteringA2.UCSCbed12ExportFINALgeneNamesONLY.delChrCMandChr1Mins.gtf \
> -b -N \
> -c 10 \
> -cpn 5 \
> -q 3000 \
> -cpu 4 \
> -L "bottom,top" \
> -l out \
> -i bottom1.txt,bottom2.txt,bottom3.txt,bottom4.txt,bottom5.txt,bottom6.txt,bottom7.txt,bottom8.txt,bottom9.txt,bottom10.txt,bottom11.txt top1.txt,top2.txt,top3.txt,top4.txt,top5.txt,top6.txt,top7.txt,top8.txt,top9.txt,top10.txt,top11.txt
____ ___________________ _ ________
/ __ \/ ____/ ____/ _/ | / | / /_ __/
/ / / / __/ / /_ / // /| | / |/ / / /
/ /_/ / /___/ __/ _/ // ___ |/ /| / / /
/_____/_____/_/ /___/_/ |_/_/ |_/ /_/
Differential methylation: Easy, Fast, Identification and ANnoTation
by David E. Condon, University of Pennsylvania, 2015-2020
Reading gene annotation ../FilteringA2.UCSCbed12ExportFINALgeneNamesONLY.delChrCMandChr1Mins.gtf
Segmentation fault
I have tried changing transcript
to gene
in the second annotation file, but no luck. Do you know why this error might be happening?
Thank you
Lily
I was running defiant and can't find any dmr areas. I don't know why I can't find the dmr area, I really want to know why.
columns are chromosome,cpgstart,cpgend,methylation percent,coverage
/data/home/hlkong/defiant-master/defiant -p 0.5 -v -FDR -l aaaa -L p.Recurrence,n.No_Recurrence -i control1,control2,control3 case1,case2,case3
### log file :
defiant_run_info_aaaa.txt.txt
I add the -E
parameter when I run the Defiant.
Generate a file named after ### nucleotide_p.ptc_n.btn_GSE_original.tsv
, I want to know what these Positive integer mean, as indicated in the box.
@hhg7 Many thanks.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.