slowkow / centipede.tutorial Goto Github PK

:bug: How to use CENTIPEDE to determine if a transcription factor is bound.

Home Page: https://slowkow.github.io/CENTIPEDE.tutorial

R 100.00%

tutorial bioinformatics enrichment dnase-seq transcription-factors rstats

centipede.tutorial's Introduction

CENTIPEDE Tutorial

CENTIPEDE fits a bayesian hierarchical mixture model to learn TF-specific distribution of experimental data on a particular cell-type for a set of candidate binding sites described by a motif.

This is a practical tutorial for running CENTIPEDE with DNase-Seq data. It explains how to prepare the data and how to run the analysis. The goal is to predict if a putative transcription factor binding site is actually bound or not. For details about the statistical models underlying the methods, please see (Pique-Regi, et al. 2011).

Read the tutorial online or download the PDF:

This repository has functions to ease the use of CENTIPEDE:

centipede_data() converts data to the format required for CENTIPEDE.
parse_region() parses a string like "chr1:123-456".
read_bedGraph() reads a bedGraph file with 4 columns: chrom, start, end, score.
read_fimo() reads a text file output by FIMO and selects sites that meet a significance threshold.

I also provide example data that you can use to follow the tutorial:

cen is a list with two items:
- cen$mat is a matrix of read-start counts for 3,337 genomic regions.
- cen$regions is a dataframe describing those regions.
site_cons is a vector with mean conservation scores for the 3,337 regions, computed across 100 vertebrates.

Installation

Install CENTIPEDE by running this in your shell (not within an R session):

wget http://download.r-forge.r-project.org/src/contrib/CENTIPEDE_1.2.tar.gz
R CMD INSTALL CENTIPEDE_1.2.tar.gz

Next, install the tutorial package:

# This command didn't work for me.
# install.packages("CENTIPEDE", repos="http://R-Forge.R-project.org")

install.packages("devtools")
devtools::install_github("slowkow/CENTIPEDE.tutorial")

centipede.tutorial's People

Contributors

Stargazers

Watchers

Forkers

crazyhottommy rescalante-lilly alexg9010 cyang-2014 yixf-self danrlu learning-jusue404 fanying2015 mperalc heejungshim lichenbiostat kvandeynze liu5796796

centipede.tutorial's Issues

Shifted forward and reverse dip

Hi,

First of all this introduction you wrote helped me a lot to understand the data required for use with centipede!

I currently using your package for my master's thesis and now i have a problem where you might be able to help me.

As you might see in the plot below the significant dip in dnase-cut probability does not overlap in forward and reverse strand. And i think this shift happened in the centipede_data function in your package.

My read length is 35 and that seems to have something to do with the shift.

Maybe you could help me to resolve this problem.

Thanks in advance
Alex

Center the input matrix around the motif

Hi,

Thank you for the tutorial and package, it's extremely helpful!! I was troubleshooting with my dataset and came across this issue. (I'm very sorry for moving the post around, thought it might be different from the previously reported issue.)

Here's an example with the original code:

library(CENTIPEDE.tutorial)

region=parse_region("2L:10-20")
region_len <- abs(region$end - region$start) + 1

item = list(
  strand = as.factor(c("+","+","-","-")),  
  pos = as.integer(c(15,20,2,12)),
  qwidth = as.integer(c(10,10,10,10))
)

################### copied from functions.R line 152-173 ###############

# take care of negative reads starting at end position
neg_shift <- item$qwidth * as.numeric(item$strand == "-")
item$pos <- item$pos + neg_shift

idx <- item$pos >= region$start & item$pos <= region$end
if (sum(idx) == 0) {
  return(rep(0, 2 * region_len))
}
strand <- item$strand[idx]
position <- item$pos[idx]

# Create a row that represents the flanking region surrounding a site,
# each column is a nucleotide position relative to the center of the motif
# match.
# The values in this matrix are the number of read start sites that occur
# at that position.
# Since we have two strands, the first half of the matrix represents the
# positive strand and the second have represents the negative strand.
is.neg <- as.numeric(strand == "-")
j <- 1 + position - min(position) + (region_len * is.neg)

as.numeric(table(factor(j, levels = seq(1, 2 * region_len))))

I wonder whether this is what is intended. This does get rid of the peak in the left edge of the cut probability plot (where position==min(position) and on "+" strand will put a 1 in the matrix).

j <- 1 + position - region$start + (region_len * is.neg)
as.numeric(table(factor(j, levels = seq(1, 2 * region_len))))

Thanks,
Dan

centipede fitting error

Dear Kamil,

I would like to ask your help. I'm trying to use the CENTIPEDE package for creating ATAC footprint plots with bed a file. I transformed the file to the format that the fimo file has. I got the following error:

fit <- fitCentipede(
Xlist = list(DNase = cen$mat),
Y = as.matrix(data.frame(
Intercept = rep(1, nrow(cen$mat))
))
)
Initialization of the parameters:
Error in quantile.default(x, c(TrimP, 1 - TrimP)) :
missing values and NaN's not allowed if 'na.rm' is FALSE

Can you help me?

Thanks in advance.
Best Regards,
Attila

error if needed with Mca and categorical variables

Hello!

I'm running an MCA with categorical variables ( transformed in factors) there are no missing values but I got this error:

# Error in if (sum(tabF[, 2] <= proba) > 0) resF <- tabF[tabF[, 2] <= proba,  : 
  missing value where TRUE/FALSE needed

Here is the script that I am using

col_names<- names (sololikert1)
sololikert1[,col_names] <- lapply(sololikert1[,c(col_names)], as.factor)
str(sololikert1[,col_names])
str(sololikert1)  
summary(sololikert1)
library(FactoMineR)
res.mca = MCA(sololikert1, quali.sup=c(5:13))
res = dimdesc(res.mca, axes=1:2, proba=0.05)

Here you have little more details:

The database is 304 obs. Variable 23 is a place of living; 17 is sex; 18 is age clusters; var from 8.4 to 14.3 are likert results ranging from 1 to 5

tibble [304 × 13] (S3: tbl_df/tbl/data.frame)
 $ id                   : Factor w/ 304 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ 23-Valle di residenza: Factor w/ 5 levels "1","2","3","4",..: 3 4 3 5 3 2 2 2 3 2 ...
 $ 17-Sesso             : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 1 1 2 1 ...
 $ 18-Fascia d'età      : Factor w/ 4 levels "1","2","3","4": 1 2 1 4 2 2 4 2 4 4 ...
 $ 8.4-                 : Factor w/ 5 levels "1","2","3","4",..: 4 3 5 5 1 3 5 3 3 4 ...
 $ 8.5-                 : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 4 1 3 5 3 2 4 ...
 $ 8.6-                 : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 5 4 5 2 3 5 ...
 $ 8.7-                 : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 5 4 5 2 4 5 ...
 $ 11.1-                : Factor w/ 5 levels "1","2","3","4",..: 3 1 3 4 1 2 3 1 1 1 ...
 $ 11.2-                : Factor w/ 5 levels "1","2","3","4",..: 2 4 3 5 1 3 4 1 1 1 ...
 $ 14.1-[               : Factor w/ 5 levels "1","2","3","4",..: 1 1 3 4 1 4 3 4 1 3 ...
 $ 14.2-[               : Factor w/ 5 levels "1","2","3","4",..: 1 1 4 5 1 3 5 4 1 3 ...
 $ 14.3-                : Factor w/ 5 levels "1","2","3","4",..: 1 3 2 5 1 5 5 4 4 4 ...
> str(sololikert1)  #ti dice quali sono le viariabili qualitat. e quantitaitve
tibble [304 × 13] (S3: tbl_df/tbl/data.frame)
 $ id                   : Factor w/ 304 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ 23-Valle di residenza: Factor w/ 5 levels "1","2","3","4",..: 3 4 3 5 3 2 2 2 3 2 ...
 $ 17-Sesso             : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 1 1 2 1 ...
 $ 18-Fascia d'età      : Factor w/ 4 levels "1","2","3","4": 1 2 1 4 2 2 4 2 4 4 ...
 $ 8.4-                 : Factor w/ 5 levels "1","2","3","4",..: 4 3 5 5 1 3 5 3 3 4 ...
 $ 8.5-                 : Factor w/ 5 levels "1","2","3","4",..: 3 4 3 4 1 3 5 3 2 4 ...
 $ 8.6-                 : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 5 4 5 2 3 5 ...
 $ 8.7-                 : Factor w/ 5 levels "1","2","3","4",..: 4 4 5 5 5 4 5 2 4 5 ...
 $ 11.1-                : Factor w/ 5 levels "1","2","3","4",..: 3 1 3 4 1 2 3 1 1 1 ...
 $ 11.2-                : Factor w/ 5 levels "1","2","3","4",..: 2 4 3 5 1 3 4 1 1 1 ...
 $ 14.1-[               : Factor w/ 5 levels "1","2","3","4",..: 1 1 3 4 1 4 3 4 1 3 ...
 $ 14.2-[               : Factor w/ 5 levels "1","2","3","4",..: 1 1 4 5 1 3 5 4 1 3 ...
 $ 14.3-                : Factor w/ 5 levels "1","2","3","4",..: 1 3 2 5 1 5 5 4 4 4 ...

> res.mca = MCA(sololikert1, quali.sup=c(5:13))
Warning message:
ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
> res = dimdesc(res.mca, axes=1:2, proba=0.05)
Error in if (sum(tabF[, 2] <= proba) > 0) resF <- tabF[tabF[, 2] <= proba,  : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
2: ggrepel: 9 unlabeled data points (too many overlaps). Consider increasing max.overlaps 
> res = dimdesc(res.mca, axes=1:3, proba=0.05)
**Error in if (sum(tabF[, 2] <= proba) > 0) resF <- tabF[tabF[, 2] <= proba,  : 
  missing value where TRUE/FALSE needed**

Error in if (sum(idx) == 0) { : missing value where TRUE/FALSE needed

Hi,
When I run the centipede_data , the result show :
Error in if (sum(idx) == 0) { : missing value where TRUE/FALSE needed.
Maybe it is some problem with my bam file?

Thanks
Zhen

MOODS

Instead of using FIMO from the MEME Suite, try using MOODS: https://github.com/jhkorhonen/MOODS

gunzip -c chr*.fa.masked > hg19.fa

Hello,

When I run

gunzip -c chr*.fa.masked > hg19.fa

in CENTIPEDE.tutorial in Genomic sequence Section, I got the following error

gzip: chr10.fa.masked: not in gzip format
gzip: chr11.fa.masked: not in gzip format
gzip: chr11_gl000202_random.fa.masked: not in gzip format
...

Am I doing it wrong? Thanks.

About FIMO finding putative TFBS in the tutorial.

I read the CENTIPEDE manuscript. I think the author scanned all putative TFBS across the genome.

However, in your tutorial, you suggested using FIMO only to obtain TFBS in peak regions.

Is it proper to that? Do you have any comment about this?

Thank you.

Error in quantile.default(x, c(TrimP, 1 - TrimP))

Hi,

I use my own data to test the procedure, in the final steps, when I run:

idx <- site_cons > 0.999
fit3 <- fitCentipede(
  Xlist = list(
    DNase = cen$mat[idx, ]
  ),
  Y = as.matrix(data.frame(
    Intercept = rep(1, nrow(cen$mat[idx, ])),
    Conservation = site_cons[idx]
  ))
)

It gives an error:

Initialization of the parameters:
Error in quantile.default(x, c(TrimP, 1 - TrimP)) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE

but when I run:

fit2 <- fitCentipede(
  Xlist = list(
    DNase = cen$mat
  ),
  Y = as.matrix(data.frame(
    Intercept = rep(1, nrow(cen$mat)),
    Conservation = site_cons
  ))
)

It goes right, I don't know what's the problem, please give me some help!

Thanks.
Hu

Conservation analysis (site_cons) fails

Hi, I tried running your tutorial, but ran into the following issue when running the conservation analysis. Any suggestions on getting this to work would be appreciated!

Note, that the object can be viewed as it is already in the tutorial, however, using the listed commands it fails.

Thanks,
Kris

> xs <- findOverlaps(sites, cons)
> site_cons <- sapply(1:length(sites), function(i) {
+   # Conservation scores for each positions in a PWM match.
+   ys <- cons[subjectHits(xs[queryHits(xs) == i])]
+   vals <- rep(ys$score, width(ys))
+   idx <- seq(
+     from = start(sites[i]) - min(start(ys)) + 1,
+     length.out = width(sites[i])
+   )
+   vals <- vals[idx]
+   mean(vals)
+ })
Error in seq.default(from = start(sites[i]) - min(start(ys)) + 1, length.out = width(sites[i])) : 
  'from' must be a finite number
In addition: Warning message:
In min(start(ys)) : no non-missing arguments to min; returning Inf
Called from: seq.default(from = start(sites[i]) - min(start(ys)) + 1, length.out = width(sites[i]))

Error at FIMO --parse-genomic-coord

Hi,

I was attempting the tutorial and got all the way through to parsing with FIMO the sequences within the peaks that match the PWM. However when I entered the command:

fimo --text --parse-genomic-coord $meme $dnase_fasta | gzip > $sites

I received back the following message many times over, seemingly for all of the contents:

Warning: ? is not a valid character in DNA alphabet.
Converting ? to N.

Once I entered zcat $sites | head to look at the sites it just gave back the header,
#pattern name sequence name start stop strand score p-value q-value matched sequence

The only other issue I encountered is at an earlier step where I was informed that hg19.fa did not have an index file associated with it so I used samtools faidx to create one. I am not sure if this led to the later issues.

I followed the tutorial exactly, using all of the same files. If there are any thoughts on what may have happened it would really be appreciated, thank you.

Rob.

Cut-site matrix generation

Error at matrix2meme step

Dear Kamil,

Thank you for making this tutorial. I'm trying to figure out how to use CENTIPEDE and your explanation about the steps are quite clear. I have a small issue with test example you give. When I execute the matrix2meme line you give in the tutorial, I run into the following error:

Can't use an undefined value as a HASH reference at
/home/sander/Programs/meme/lib/perl/MotifUtils.pm line 428, <STDIN> line 13.

Do you have any idea what I could have done wrong? I (think I) have installed meme correctly, including all it's (perl) dependencies. I've tried the version of meme you're using in the tutorial, and also the newest version. Running perl 5.18.2 from Ubuntu 14.04.

Kind regards,
Sander

Creygthon Lab, Hubrecht Institute

object 'sequence.name' not found

I did try your tutorial with the meme-suite version4.12, I received this error while I wanted to create centipede data

cen <- centipede_data(
  bam_file   = "./mait_plus_rep1.bam",
  fimo_file  = "./tr.sites.gz",
  pvalue     = 1e-4,
  flank_size = 1000
)

Error in order(sequence.name, start, stop, -score) : 
  object 'sequence.name' not found

Meaning of "score" column in the output

Hello slowkow,

Apologies if this isn't the right place to ask this. But have the authors of this tool documented the meaning of "score" column in the CENTIPEDE output? You cover posterior probability in your tutorial, but not score. Do you happen to know the meaning of it and it's relevance in deciding which motif sites are more important than others?

Thanks a lot!

Adapting FIMO file to the correct format for centipede_data

FIMO from the MEME suite website outputs data in the following format:

motif_id motif_alt_id sequence_name start stop strand score p-value q-value matched_sequence
ZNF528 MA1597.1 Peak_31367#chr12#10213230#10213429 54 70 - 27.9633 1.45e-10 2.88e-05 CCCAGGGAAGCCATCTC
ZNF528 MA1597.1 Peak_31367#chr12#10213177#10213376 107 123 - 27.9633 1.45e-10 2.88e-05 CCCAGGGAAGCCATCTC
SP4 MA0685.1 Peak_73465#chr19#45001886#45002085 50 66 - 25.5488 3.14e-10 3.97e-05 CAGGCCACGCCCCCTTC
SP4 MA0685.1 Peak_73465#chr19#45001835#45002034 101 117 - 25.5488 3.14e-10 3.97e-05 CAGGCCACGCCCCCTTC
SP4 MA0685.1 Peak_73465#chr19#45001828#45002027 108 124 - 25.5488 3.14e-10 3.97e-05 CAGGCCACGCCCCCTTC
THAP11 MA1573.1 Peak_110384#chr3#141370283#141370482 140 158 - 27.4944 3.59e-10 6.36e-05 AGGACTACATTTCCCAGAC
CTCF MA0139.1 Peak_71057#chr19#2474615#2474814 96 114 + 25.2247 4.23e-10 0.000166 CGGCCACCAGATGGCGCCA
ZNF16 MA1654.1 Peak_181996#chr9#129485761#129485960 1 23 + 27.5244 5.42e-10 0.000109 AATGGGGAGCCATCGAAGGCCTT
ZNF16 MA1654.1 Peak_181996#chr9#129485656#129485855 106 128 + 27.5244 5.42e-10 0.000109 AATGGGGAGCCATCGAAGGCCTT

In your tutorial it seems that I need to adapt FIMO output:

sequence.name start stop X.pattern.name strand score p.value

307 chr1 753016 753228 1 + 13.53 1.14e-05

315 chr1 876197 876409 1 - 12.07 3.73e-05

29 chr1 1365483 1365695 1 - 11.88 4.24e-05

30 chr1 1365877 1366089 1 - 12.72 2.24e-05

31 chr1 1406705 1406917 1 - 11.20 6.73e-05

64 chr1 1566358 1566570 1 + 13.99 7.75e-06

q.value matched.sequence

307 NA TTTCCCAGAAGGA

315 NA CTTCCCCGAAGGG

29 NA TTTCCAAGAAAGT

30 NA CTTCCCAGGAGAG

31 NA CTTCACAGAATTA

64 NA TTTCCAAGAACCG

I am getting the following error:
-- Column specification ------------------------------------------------------------------ cols( sequence_name = col_character(), chr = col_character(), start = col_double(), stop = col_double(), strand = col_character(), score = col_double(), p-value= col_double(),q-value` = col_double(),
matched_sequence = col_character(),
motif_id = col_character(),
motif_alt_id = col_character()
)

Error in h(simpleError(msg, call)) :
error in evaluating the argument 'which' in selecting a method for function 'ScanBamParam': In range 4685: at least two out of 'start', 'end', and 'width', must
be supplied.`

How do I need to adapt my FIMO output?