hhg7 / defiant Goto Github PK

View Code? Open in Web Editor NEW

17.0 3.0 9.0 1.4 MB

Differential methylation: Easy, Fast, Identification and ANnoTation

License: GNU General Public License v2.0

Shell 64.34% R 35.66%

bisulfite-sequencing differentially-methylated-elements methylation dmr rrbs wgbs

defiant's People

Contributors

Stargazers

Watchers

Forkers

matthewpcondon rmauswon wangdi2014 ktang zzygyx9119 khl0798 inambioinfo zhangyongzlm romariosales

defiant's Issues

Clarify input format (for Bismark output)

You write in the README.md that you support the following formats:

Input Type 5 Bismark coverage2cytosine format:

//Bismark coverage2cytosine format Example: chr1 762 763 + 17 64 CG CGA

Column1: chromosome, which is a string. Column2: nucleotide/start position, an unsigned integer [0,4294967295]. Column3: strand. Column4: methylated C count, an unsigned integer in [0,4294967295]. Column5: C count, an unsigned integer in [0,4294967295]. Column6: C-context, e.g. CG, CH, CHH. Column7: C-context, e.g. CGA, CGT, etc.

Input Type 6 Bismark coverage2cytosine format:

Example: chr1 762 763 0.265625 17 76

Column1: chromosome, which is a string. Column2: nucleotide/start position, an unsigned integer in [0,4294967295]. Column3: nucleotide/end position, an unsigned integer in [0,4294967295]. Column4: methylation percentage, which is calculated by Defiant. Column5: methylated C count, an unsigned integer in [0,4294967295]. Column6: C count, an unsigned integer in [0,4294967295].

However, these don’t entirely match what is described in the bismark_methylation_extractor help:

The genome-wide cytosine methylation output file is tab-delimited in the following format:
<chromosome> <position> <strand> <count methylated> <count non-methylated> <C-context> <trinucleotide context>

and

The coverage output looks like this (tab-delimited, 1-based genomic coords; zero-based half-open coordinates available with --zero_based):
<chromosome> <start position> <end position> <methylation percentage> <count methylated> <count non-methylated>

You call both "coverage2cytosine" format. The "coverage2cytosine" Bismark module can create a "genome-wide cytosine methylation output file" (which looks ALMOST like Input Type 5) from the coverage output (which looks ALMOST like your Input Type 6), but can also be created from bismark_methylation_extractor directly.
In Input Type 5 example you show start and end position (and 8 columns in total), but describe below only start position and 7 columns in total. I assume it's just a typo in the example?
You write the start/end position for all are in [0,4294967295], Bismark by default uses 1-based, unless --zero-based is explicitly specified, and only then it becomes half-open. So, by default it's all 1-based and start position == end position, in your example it says '762 763', so should indeed --zero-based be specified?
Bismark clearly states "count methylated" and "count non-methylated" rather than "methylated C count" and "C count". "C count" sounds like total count (methylated + non-methylated). What is actually expected here?
Input Type 6 "Column4: methylation percentage, which is calculated by Defiant." - Why is this calculated by Defiant? And how? Shouldn't this be input to Defiant? It is part of the Bismark coverage output. However your example... "chr1 762 763 0.265625 17 76 "
How would you get to 0.265625? It's neither 17/76, nor 17/(76+17), depending on what you actually mean in no 4... (17/64=0.265625 , assuming the 64 that you mention in input type 5 example )
However, from an Bismark run, I got e.g. in coverage output (test.deduplicated.bismark.cov.gz):
chr3 3008646 3008646 33.3333333333333 1 2
chr3 5620584 5620584 75 3 1
So, the methylation percentage is (100*col5/(col5+col6)) and not (col5/col6)
(Also, the start and end position are same (as stated in 3), unless --zero-based is used, but then it would not be valid input to the coverage2cytosine script.)

Please consider to provide an example call for the bismark_methylation_extractor, that will produce files of the type that defiant will read and process as expected.

Looking forward to test the program once this is clarified.

Chromosomes names need to start whith "chr"

Dear David,

My input file looks like this (it is a Bismark coverage2cytosine format originating from a non-model organism):
Fvb1 4 + 1 0 CG CGA
Defiant is not recognizing this file format because defiant requires that chromosome names start with "chr".

I changed read_bisulfite_data.h (and recompiled) to get Defiant going on my input file:
if ((number_of_columns == 1) && (strstr(temporary_char_string,"chr") != NULL)) {
into
if (number_of_columns == 1) {
This workaround works and Defiant is generating relevant results in no time!

However, to make the annotation work using a gtf file that contains chromosome names not starting with "chr", more elaborate changes are needed.

It would be great if Defiant could be used on non-model organisms.

Thanks!

Kind regards,
Filip

Error: Segmentation fault (core dumped)

Hi,

I am using defiant to do DMR analysis for my two WGBS samples. I used Bismark Coverage2cytosine output as input for defiant. The command used to run defiant is:

defiant -L case1,case2 -l SK2_vs_SK6 -a annotation.gtf -cpu 6 -fdr bh -i case1.CpG_report.txt case2.CpG_report.txt

But I am getting the following error upon running:

script.sh: line 2: 159754 Segmentation fault (core dumped)

I have sufficient memory in my server (256 GB). Please Help me to solve this error.

Thank you.

annotation files

Dear David,

Could you please let me know what are the acceptable annotation formats usable with -a option.
I gave a gff and later a gtf. Both of them were not recognized.?

Regards,
Jeffin

Tagged release

Hi @hhg7 ,

I am a contributor to "bioconda", https://bioconda.github.io/, that packages bioinformatics tools for easier installation and distribution. I would like to create a conda package for "defiant". However, one of the requirements for this, is to have a tagged release. Therefore, can you please tag a release and create a tarball for it so I can proceed?

Thank you in advance, and please let me know if you have any questions regarding this request.

--
Natasha

Difference Methylation analysis on Regions Of Interesting (ROI)

Hi,

can defiant analysis Difference Methylation on ROI？For example, I want to analysis difference methylation of a set of immune response genes between two groups,  the difference methylation is on whole genebody level, not DMR.

Thanks!

Yuanzhi

Usage with Ensembl GTF or converted refFlat

Hi David,

Hope you are doing well!

I have a question which has left me a bit stumped. We are studying the Indian variety of rice whose annotations are available in ENSEMBL.

I have provided the GTF, as converting to refFlat wasn't really working, however it seems that the number of features that are being read are a bit too much?

There are 255844 genes and 2140 chromosomes read from ../ref/Oryza_indica.ASM465v1.43.gtf (ftp://ftp.ensemblgenomes.org/pub/release-43/plants/gtf/oryza_indica/Oryza_indica.ASM465v1.43.chr.gtf.gz)

I believe that there should be 40320-ish genes.

I'm converting it to refFlat using gtfToGenePred, where in it only seems to detect 2262 genes. I'm a bit confused about this disparity. I'm using the following to convert:

/apps/kent_utils/gtfToGenePred -genePredExt -geneNameAsName2 Oryza_indica.ASM465v1.43.gtf reflat.txt

This seems to throw an error:
Couldn't recognize annotation file format with strand_definition = 3 @ annotation.h line 112

which is kinda confusing. Do you have any ideas?

chromosome broke my pigeonholes @ defiant.c line 1381.

Hi,

I'm running the tool on bismark coverage reports.

This is a draft genome assembly annotated using augustus. I have tried giving the gtf, gff and converted the gff to genePred format as well. All the three cases the error is being raised.

The features annotated in the gff file are as follows, over approximately 5000 scaffolds.

CDS
gene
intron
start_codon
stop_codon
transcript

Can not unzip defiant.zip

I tried download with multiple browsers and computers, but I can't unzip defiant.zip.

Thanks for taking a look at that.

Methyl Kit Input

I got this error when i tried to run defiant. Input file is "Input Type 2 known for MethylKit input"

Thank You

########

Set 1 will be labeled "BS1"
Set 2 will be labeled "BS2"
There are 2 sets to be read
There will be 1 runs.
Run information will be in defiant_run_info.txt
Now reading file ss_BS1_L005.ATCGmap.filter.CG.txt into memory...
Chr1.1192 Chr1 1192 F 15 100.00 0.00
:
could not identify file type with number_of_columns = 7, integers = 2, floats = 2
failed at read_bisulfite_data.h line 205
strand is defined @ column 4.
column1_is_int = false.
column1_is_chr = true.

Finding DMR

Hi,
I want to use defiant to identify DMR in my WGBS data (non-model organism). I used the following command but can not identify any DMR. Please let me know if and how I could relax the conditions more. Am I doing something wrong?

Installation was done with the bioconda package (https://anaconda.org/bioconda/defiant). The input file are the bismark2coverage.report.txt files. Those files only contain positions of CpG sites. I checked in the run_info_comparison.txt file, and 100% of my nucleotide pairs passed the coverage condition. What I noticed is that my chromosome names don't start with "chr," but I saw that you fixed that in the issue "Chromosome names need to start with "chr". Additionally, I compared the "file_type5.txt" from the issue "Clarify input format (for Bismark output)" with my file and I have the same column. I only noticed that the numbers are much lower in columns 4 and 5. My numbers are between 0 and 5, whereas yours are up to 30. You only included the + strand in your file, whereas I have both the + and - strands.

Could that be why it does not find any DMR in my dataset?

Command used:
defiant -c 10 -p 0.05,0.10 -L Normoxia,Anoxia -l comparison -i N1-D14-1.CpG_report.txt,N2-D15-1.CpG_report.txt,N3-D16-2.CpG_report.txt,N7-D36-1.CpG_report.txt A1-D17-1.CpG_report.txt,A2-D18-1.CpG_report.txt,A4-D26-2.CpG_report.txt,A7-D37-1.CpG_report.txt

Please let me know if you need additional information and thank you for your help!

Support for CGmap inputs

Hi Dave!

I have recently come across cgmaptools which is a host of tools that can be used to call methylation and other analysis like allele specific methylation, snv calling etc.

The cgmap output can be derived from either post-processing the aligned bams or is one of the outputs from BS-Seeker.

The format as such looks like this:

chr1 G 3000851 CHH CC 0.1 1 10
chr1 C 3001624 CHG CA 0.0 0 9
chr1 C 3001631 CG CG 1.0 5 5
chr1 G 3001632 CG CG 0.9 9 10

Where the headers are:

Chromosome Reference_Nucleotide Position Context_Type Context Methylation_Status Methylated_Cyto_Counts Total_Cyto_Counts

The format as such seems to be similar to Bismark's format 5 as per the Readme.

Would it be possible for you to support it?

Annotation file error - segmentation fault

Hello and thank you for this great tool!

When I use one annotation file, defiant runs fine with no problems, e.g.

head ../reference.gtf 
chr1	Gnomon	gene	8414	14345	.	+	.	gene_id "LOC115964938"; transcript_id ""; db_xref "GeneID:115964938"; gbkey "Gene"; gene "LOC115964938"; gene_biotype "lncRNA"; 
chr1	Gnomon	transcript	8414	14345	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gbkey "ncRNA"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; 
chr1	Gnomon	exon	8414	8803	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "1"; 
chr1	Gnomon	exon	10872	10973	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "2"; 
chr1	Gnomon	exon	11087	11255	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "3"; 
chr1	Gnomon	exon	11408	11506	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "4"; 
chr1	Gnomon	exon	11731	14345	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086010.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 5 samples with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X2"; transcript_biotype "lnc_RNA"; exon_number "5"; 
chr1	Gnomon	transcript	9687	14345	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086011.1"; db_xref "GeneID:115964938"; gbkey "ncRNA"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X3"; transcript_biotype "lnc_RNA"; 
chr1	Gnomon	exon	9687	10048	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086011.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X3"; transcript_biotype "lnc_RNA"; exon_number "1"; 
chr1	Gnomon	exon	10872	10973	.	+	.	gene_id "LOC115964938"; transcript_id "XR_004086011.1"; db_xref "GeneID:115964938"; gene "LOC115964938"; model_evidence "Supporting evidence includes similarity to: 100% coverage of the annotated genomic feature by RNAseq alignments, including 1 sample with support for all annotated introns"; product "uncharacterized LOC115964938, transcript variant X3"; transcript_biotype "lnc_RNA"; exon_number "2";

../defiant -a ../reference.gtf -b -N -c 10 -cpn 5 -q 3000 -cpu 4 -L "bottom,top" -l out -i bottom1.txt,bottom2.txt,bottom3.txt,bottom4.txt,bottom5.txt,bottom6.txt,bottom7.txt,bottom8.txt,bottom9.txt,bottom10.txt,bottom11.txt top1.txt,top2.txt,top3.txt,top4.txt,top5.txt,top6.txt,top7.txt,top8.txt,top9.txt,top10.txt,top11.txt

    ____  ___________________    _   ________
   / __ \/ ____/ ____/  _/   |  / | / /_  __/
  / / / / __/ / /_   / // /| | /  |/ / / /   
 / /_/ / /___/ __/ _/ // ___ |/ /|  / / /    
/_____/_____/_/   /___/_/  |_/_/ |_/ /_/

Differential methylation: Easy, Fast, Identification and ANnoTation

 by David E. Condon, University of Pennsylvania, 2015-2020


Reading gene annotation ../reference.gtf
There are 45548 genes and 20 chromosomes read from ../reference.gtf.
You specified the '-b' option, so there will be an output bed file.

You specified the '-N' option, so Defiant will print a list of individually differentially methylated nucleotides in the answers file.

The minimum coverage used will be 10

However, when I use an annotation file with a different annotation name, I am getting a segmentation fault error, e.g.

 head ../FilteringA2.UCSCbed12ExportFINALgeneNamesONLY.delChrCMandChr1Mins.gtf 
chr1	FilteringA2	transcript	9756	14465	.	+	.	gene_id "QL01p000012"; transcript_id "QL01p000012"; 
chr1	FilteringA2	exon	9756	10164	.	+	.	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "1"; exon_id "QL01p000012.1";
chr1	FilteringA2	exon	10872	10973	.	+	.	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "2"; exon_id "QL01p000012.2";
chr1	FilteringA2	CDS	10913	10973	.	+	0	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "2"; exon_id "QL01p000012.2";
chr1	FilteringA2	exon	11177	11255	.	+	.	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "3"; exon_id "QL01p000012.3";
chr1	FilteringA2	CDS	11177	11255	.	+	2	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "3"; exon_id "QL01p000012.3";
chr1	FilteringA2	exon	11408	11506	.	+	.	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "4"; exon_id "QL01p000012.4";
chr1	FilteringA2	CDS	11408	11506	.	+	1	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "4"; exon_id "QL01p000012.4";
chr1	FilteringA2	exon	11731	11880	.	+	.	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "5"; exon_id "QL01p000012.5";
chr1	FilteringA2	CDS	11731	11880	.	+	1	gene_id "QL01p000012"; transcript_id "QL01p000012"; exon_number "5"; exon_id "QL01p000012.5";

../defiant -a ../FilteringA2.UCSCbed12ExportFINALgeneNamesONLY.delChrCMandChr1Mins.gtf \
> -b -N \
> -c 10 \
> -cpn 5 \
> -q 3000 \
> -cpu 4 \
> -L "bottom,top" \
> -l out \
> -i bottom1.txt,bottom2.txt,bottom3.txt,bottom4.txt,bottom5.txt,bottom6.txt,bottom7.txt,bottom8.txt,bottom9.txt,bottom10.txt,bottom11.txt top1.txt,top2.txt,top3.txt,top4.txt,top5.txt,top6.txt,top7.txt,top8.txt,top9.txt,top10.txt,top11.txt

    ____  ___________________    _   ________
   / __ \/ ____/ ____/  _/   |  / | / /_  __/
  / / / / __/ / /_   / // /| | /  |/ / / /   
 / /_/ / /___/ __/ _/ // ___ |/ /|  / / /    
/_____/_____/_/   /___/_/  |_/_/ |_/ /_/

Differential methylation: Easy, Fast, Identification and ANnoTation

 by David E. Condon, University of Pennsylvania, 2015-2020


Reading gene annotation ../FilteringA2.UCSCbed12ExportFINALgeneNamesONLY.delChrCMandChr1Mins.gtf
Segmentation fault

I have tried changing transcript to gene in the second annotation file, but no luck. Do you know why this error might be happening?

Thank you

Lily