Giter Club home page Giter Club logo

chip_seq_training's Introduction

Introduction to ChIP-seq analysis

Welcome in this tutorial, you will find here a detailed ChIP-seq workflow, starting from sequencing read to the final coverage tracks and differentially accessible genomic regions.

This tutorial is using the Galaxy platform to perform the data download, quality control, mapping and peak calling. We will then explore the result via IGV and RSAT.

A small guide for this course :

  • โšก๏ธ : time to shine, this is your hands-on objective
  • ๐Ÿ‘€ : unscroll a help note if you're stuck
  • โ“ : quiz time!
  • ๐Ÿชฉ : feeling on fire? try this optional exercise
  • ๐Ÿช : this is referring to Galaxy
  • ๐Ÿงฌ : this is referring to IGV
  • ๐Ÿ”ฎ : this is referring to RSAT

Useful links :

๐Ÿ“ 1. A brief note on ChIP-seq

ChIP-seq stands for Chromatin Immuno Precipitation followed by sequencing.

We can split the analysis as follow :

  • a. We use antibody to target our DNA-binding protein of interest and fragment the DNA. We then use immunoprecipitation protocol to retrieve the DNA that was binding the targeted Transcription Factor.

  • b We sequence these fragments using Next Generation Sequencing

      1. We map the reads to a reference genome
      1. We detect regions/peaks with enriched IP signal
      1. We detect regulatory pattern in the peaks

๐Ÿ“ 2. A brief note on Galaxy

Today, we will work on the Galaxy platform. It's simple, free and open-source. It already has all the tools we need installed with proper versions. It's a robust way to design your analysis pipeline while exploring the different available tools.

๐Ÿ“ 3. Let's start the analysis : loading the raw data

๐Ÿ”ธ 3.a Find the identifier

We will work on the study from Myers et al..

โšก๏ธ Open the article : Find the model organism and the targeted proteins in their ChIP-seq experiments.

Upon publication of their work, authors should deposit their raw data on a publicly available repositories. You can access and download these archives via two main platforms, the Sequence Read Archive (SRA) from NCBI (US) and the European Nucleotide Archive (ENA) from EBI (EU). Both platforms regularly cross-update each other.

To find the correct accession ID in a study, you should look for the following :

  • A BioProject accession, starting with PRJ (e.g. PRJNA176146), that will link to the complete project archive
  • A GEO identifier, starting with GSE (e.g. GSE41186), that will link to a specific experiments, in our case ChIP-seq.

โšก๏ธ Your turn : Find the accession identifiers for raw sequencing data from Myers et al..

Tips ๐Ÿ‘€

You are looking for a code starting with GSE. You usually find it in the Data accessibility section of an article, else you can try to Ctrl+F for GSE in the paper.


๐Ÿ”ธ 3.b Load the raw data to Galaxy

You can see in this project that multiple experiments were performed (ChIP-on-ChIP, ChIP-seq, RNA-seq). For the sake of time and simplicity, we will focus our tutorial on the following two ChIP-seq samples :

  • ChIP-seq of the FNR protein in anaerobic condition, sample A
  • Input DNA in anaerobic condition

โ“ Can you guess why we are selecting this pair of dataset?

โ“ Are the sequencing data single-end or paired-end?

โšก๏ธ Find the SRA identifier (starts with SRR) of these two samples and upload them to Galaxy. Assign them a clear name (e.g. FNR and Input).

  • ๐Ÿช Get Data : Download and Extract Reads in FASTA/Q
  • Assign a new name to a sample : edit the Name attribute via the โœ๏ธEdit attributes link and save.
Tips ๐Ÿ‘€

The two sample's identifier are SRR576933 (FNR ChIP) & SRR576938 (Input).

Paste the SSR identifier in ๐Ÿช Galaxy's tool and click `Execute`. The job will start running and turn green once finished.
Once finished, edit the name for both and group them as a collection (see below).

๐Ÿ”ธ 3.c Group into a Dataset Collection

We will group both sample into a Dataset Collection. Working on a collection allows to perform the same type of command on both samples in parrallel in a single go.

โšก๏ธ Make a Dataset Collection with both samples

  • ๐Ÿช Click the Operations on multiple datasets button in the History and select the two renamed input files
  • ๐Ÿช Select the Build Dataset List command and name the collection
  • ๐Ÿช Now you can run all tools on the two files in batch using the Dataset collection input ๐ŸŽ‰

๐Ÿ“ 4. Sweep up the dust : cleaning raw data

We need to assess the quality of the sequencing reads. We need to know the confidence in the sequence call and the potential presence of contaminants (index or other organism).

๐Ÿ”ธ 4.a Quality Check of raw data

FastQC in the most common tool used to get an overview of a fastq file's quality. We will run it with default settings.

โšก๏ธ Run FASTQC on the Dataset Collection with both samples

  • ๐Ÿช FASTQC : Reads Quality Report
  • Select the Dataset Collection as input
  • Explore the result with the webpage output (view data)

โ“ Do you see any problem with the quality of the dataset?

Tips ๐Ÿ‘€

FastQC output comprises several parts, you can refer to the tool's website for detailed information. Take a close look to the Overrepresented sequences in FNR sample.


๐Ÿ”ธ 4.b Trimming contaminants

We have seen in the previous chapter that 29% of our FNR raw data correspond an Illumina Adapter sequence.

โ“ What's an adapter sequence again ?

From here we have two choices :

  • Leave the contaminant sequence in the dataset and assuming correct read mapping will filter them out
  • Remove the contaminant sequence from out dataset set prior to mapping

We will hop for the 2nd choice, as it will provide more acurrate mapping statitics that could help us detect other potential problem in the dataset that prevent a high fraction of mapped reads.

โšก๏ธ Trim the contaminant Index sequence on both samples with Trimmomatic

  • Copy the Index sequence from the FASTQ output
  • ๐Ÿช Trimmomatic : flexible read trimming tool for Illumina NGS data
  • Use the correct data type (single/paired end)
  • Set the accuracy of the match between adapter to 8.
  • Save the log output

High accuracy thresholds for short reads will remove adapter dimers but adapter contamination at the 3'end of the reads will remain undetected, this is why we lower the accuracy to 8. A threshold of 8 corresponds to 12 perfect matches between read and adapter, so adapter contamination of 12bp and longer will be detected.

Tips ๐Ÿ‘€

The overrepresented Adapter Sequence is an Illumina Index 5 :

>TruSeq Adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACA

We can specifically select it as a custom fasta filter :


๐Ÿ”ธ 4.c Checking trimming

You can access the trimming statistics in the output log file. Take a look.

โ“ How many reads have been discared for each dataset ? How many reads do we have left?

An important question to ask ourself is whehter we have sequenced our sample deep enough. This will depend on two modalities :

  • Have we saturated the sequencing library?
  • Do we expect to cover the majority of the genome?

โ“ How can we solve these two above questions?

Tips ๐Ÿ‘€

Library is saturated when you have sequenced (almost) all the spot in the lane and you mostly increase duplication level.

We must compare the number of good quality reads with the genome size of our organism (e.g. for the 3 Gb human genome, 10 million reads are considered sufficient). To do so, take a look at Escherichia coli's genome information on NCBI.


Everything is almost ready, make sure your data now looks all right with a new FASTQC run on trimmed data.

โšก๏ธ Run FASTQC on the Dataset Collection after trimming

โ“ Is the Adapter sequenced properly filtered out?

๐Ÿชฉ You can merge multiple FASTQC output in a fancy way with MultiQC, try it out!

๐Ÿ“ 5. Gotta map them all : read alignment

Mapping is a step where you can explore a very large parameter space. We need to be careful about our selected settings.

๐Ÿ”ธ 5.a Loading a reference genome

First thing first, what map do we use? We can check that in the article's text.

โ“ What is the model organism used?

โ“ What is the reference genome of the study?

๐Ÿชฉ Try to fetch its information back from the NCBI website

Tips ๐Ÿ‘€

You can see the sentence Resulting reads were aligned to the published E. coli K-12 MG1655 genome (U00096.2) in the Materials and Methods section


This reference comes from NCBI Nucleotide database, which is not ideal (large amount of errors in sequences and annotations). It is now highly recommended to work with NCBI Reference Sequence (RefSeq) data, the curated subset of NCBIโ€™s Nucleotide database. Similar to sample identifier, reference genome also have a unique ID.

โšก๏ธ Find E.Coli RefSeq genome ID and upload it to Galaxy

  • ๐Ÿช NCBI Accession Download : Download sequences from GenBank/RefSeq by accession through the NCBI ENTREZ API
  • Paste the RefSeq ID as a direct entry, keep output as fasta
Tips ๐Ÿ‘€

The RefSeq ID for the E.Coli strain K-12 MG1655 is NC_000913.3. You can find it by searching this strain on NCBI's nucleotide database (1), filtering for RefSeq match only (2) and looking at the 1st entry (3).

You can then paste it to Galaxy genome loader

๐Ÿ”ธ 5.b Selecting mapping parameters

There's plenty of mappers available, you probably have already heard of STAR, Bowtie and BWA for example. We need to select the appropriate one for our data.

We have short (36bp) and single-end reads, this is best suited for Bowtie.

Now the tricly part : understanding and playing with the parameters.

โšก๏ธ Take a look at the Bowtie -v and -n parameters in Galaxy parameter list.

  • ๐Ÿช Map with Bowtie for Illumina

โ“ What is the -v 2 parameter doing?

โ“ What are the -n 2 -l 35 parameters doing?

โ“ Make a guess : which of the two options would you pick here?

๐Ÿชฉ Do you see any other parameter that would have been useful to us?

Tips ๐Ÿ‘€

The Bowtie v-mode performs end-to-end alignment of the read and filter out the ones with more than n mismatches.
The Bowtie n-mode aligns a seed of l basepairs from the high quality 5'end and remove the ones with more than n mismatches.
Remember your FASTQC reports, does the Per base sequence quality graph hint a something you would want to exclude?


These two parameters modify the alignment strategy, we also need to take care of the reads reporting strategy.

โšก๏ธ Take a look at the Bowtie -m and --un parameters in Galaxy parameter list.

โ“ What are the -m 1 or -m 3 parameters doing?

โ“ What is the --un parameter doing?

โ“ Make a guess : which parameter set would you pick here?

Tips ๐Ÿ‘€

-m discards reads that map to more than n reported genome locations
--un write the reads that failed to align in a separate file, think about why this would be useful.


๐Ÿ”ธ 5.c Mapping the reads with Bowtie

Now that we have explored the Bowtie parameter space, let's map our reads to the E.Coli reference genome.

โšก๏ธ Map both trimmed samples with Bowtie on E.Coli RefSeq genome

  • ๐Ÿช Map with Bowtie for Illumina
  • Genome index is done before mapping, keep default settings
  • Expand Bowtie full parameter list
  • Perform a seed-based alignment with a seed of 35bp and a maximum of 2 mismatches.
  • Discard all the reads that map to multiple positions
  • Save the bowtie mapping statistics to the history
Tips ๐Ÿ‘€

Select the tool Map with Bowtie for Illumina and fill the parameter form. Select the RefSeq genome as a single file and not a collection.

Expand Bowtie parameter form
Select n-mode mapping with a 35bp seed and max 2 mismatches
Remove multi-mapping reads
Save the mapping output to Galaxy history

โšก๏ธ After reads have finished mapping, check the mapping statistics output.

โ“ How many alignment are reported per sample?

โ“ Does the mapping ratio seem good enough?

๐Ÿชฉ Check the mapping statistics with Samtools flagstat : tabulate descriptive stats for BAM datset

๐Ÿชฉ Run another mapping job with other Bowtie parameters or the untrimmed samples, how does is affect the mapping performances?

Tips ๐Ÿ‘€

Take a look at the mapping stats output of Bowtie, scroll down the short read warning down to the global stats. You should obtain the following result :

  • Input sample has 6217172 alignments with 94.67% mapped reads.
  • FNR sample has 2173747 alignments with 89.98% mapped reads

Everything good on the mapping statistics side, let's take a closer look a the mapped reads output now.

โ“ What is the alignment file format?

โ“ What is the SO (sorting order)?

Tips ๐Ÿ‘€

The output mapped reads are stored in a SAM file format. You can see this in the Attributes section of the data. For more detail on this format, check the GATK help page and the image below (credit : D. Caetano-Anolles).

The SAM files are Unsorted (i.e. reads are unordered), you can see this is the SAM header.


โšก๏ธ Sort the mapping output by genomic coordinates and save it as a compressed BAM output

  • ๐Ÿช Samtools sort : order of storing aligned sequences
  • Use the coordinate sorting key
  • Make sure the output is a coordinate-sorted BAM file

๐Ÿ“ 6. Reaching the summits : Peak calling

๐Ÿ”ธ 6.a A brief note on Peak Calling algorithm

Here we are!

We have our data ready, we can now get to the detection of biological signal by detecting genomic position with enriched ChIP signal.

The most widely used tools for this is the good old MACS (Model-based Analysis for ChIP-Seq) tool. It is used for ChIP-seq as well as for ATAC-seq. It can robustly detect peaks of different profiles (e.g. broad or narrow). Current version is MACS2, beta MACS3 is on its way. Check its GitHub repo for more details.

โ“ What's behind a peak calling algorithm?

The goal of MACS2 is to detect region significantly enriched for chromatin immunoprecipitation signal (our FNR IP) compared to a given background. The input sample does exactly that : it gives a measure of the background signal obtain without a targetted IP. In order to pecisely detect the position of the binding protein, MACS2 performs a modelling approach to estimate the initial DNA fragment length that was obtained from IP and shift/extend the reads accordingly.

The enrichment score is then computed by comparing the enrichment in a sliding window between the IP and the Input signal based on a Poisson distribution. To avoid bias in background noise, MACS2 compute the Poisson lambda values from the input signal with severeal windows sizes (1kb, 10kb, whole genome).

๐Ÿ”ธ 6.b Extracting FNR and Input sample from the Collection

Now that you know what MACS2 need to do, select the proper input files. We need to specifically assign both samples as treatment and control files, so we need to extract them separately from the collection.

โšก๏ธ Extract FNR and Input sorted BAM files from the Dataset collection

  • ๐Ÿช Extract Dataset from a list
  • Select the element identifier (i.e the name you assigned)
  • Run once per sample and check the output file attributes

๐Ÿ”ธ 6.c Running MACS2 callpeak

We have our BAM files ready, now let's take a look at the required parameters of MACS2 on the Galaxy platform.

  • ๐Ÿช MACS2 callpeak : Call peaks from alignment results

โ“ What is the effective genome size and why is it needed?

โ“ Should we build a shifting model?

โ“ What are duplicate tags and how MACS2 treat them? Remember you also have duplicate informatiom from the FastQC output.

โ“ How a change in the p-value cutoff impact the result?

Tips ๐Ÿ‘€

The effective genome size is the portion of the genome that is mappable. There is no predefined value for E.Coli, we will use the complete genome size of the reference.
The shifting model is doable when we have enough reads and enough detected peaks, or ideally paired-end data. This is not the case in our sample. We will not use a model.
Duplicates are read the uniquely align to the exact same position that might come from PCR duplicates. We can ignore them with the default parameter --keep-dup 1 in the Advanced Options.
If you want to remove your duplicates rather than ignoring them, use the Picard tool ๐Ÿช MarkDuplicates : examine aligned records in BAM datasets to locate duplicate molecules. ๐Ÿชฉ You can give it a try on the sorted BAM files.
The p-value (or p-adjusted) threshold set the limit for calling a peak, the lower the threshold, the higher the number of reported peaks. This is a critical parameter to take into account.


All onboard, we're ready to launch MACS2 run!

โšก๏ธ Run MACS2 callpeak on FNR ChIP-seq with input as control.

  • ๐Ÿช MACS2 callpeak : Call peaks from alignment results
  • Set E.Coli effective genome size to 4641652bp
  • Set the MACS2 peak calling to run without model, with a 0 shift and a 200bp extsize
  • Ignore duplicates
  • Save a BedGraph output

โ“ What does MACS2 when given --no-model --shift 0 --extsize 200?

โ“ How many region have been reported, what is the output file format?

Tips ๐Ÿ‘€

Select FNR IP as treatment and input as control

Do not build a model
Select an additional BedGraph output

The output file is a BED file, we report 213 peaks enriched in FNR IP.


๐Ÿชฉ Explore the different output files on your own, can you guess what they refer to? Expand the Tool Standard Output section in Galaxy output info, do you understand the log message?

๐Ÿชฉ Run another MACS callpeak with a different p-value or with the modelling on, what happens?

๐Ÿ“ 7. Enjoying the landscape : Data visualization

๐Ÿ”ธ 7.a Converting tracks to BigWig

We have a BAM output for read alignment signal and a BED output for peaks. For signal browsing and large dataset, it is more efficient to work with BigWig and BigBed file format. BigWig fill will summarise the individual read information from BAM into a sum of read mapping on each position of the genome. BigBed is the indexed binary equivalent of BED. For a detail explanation of the BigWig formats, see the UCSC Genome Browser help on bigWig and bigBed. As our BED file is small, we will not convert it to BigBed here.

โšก๏ธ Convert both FNR IP and input BAM file to BigWig with DeepTools.

  • ๐Ÿช bamCoverage : generates a coverage bigWig file from a given BAM or CRAM file
  • Use the sorted BAM files as input.
  • Keep the default 50bp granularity
  • Use the 1x coverage normalisation
  • Set E.Coli effective genome size to 4641652bp
  • We are aiming to generate the track visualisation that most closely match our MACS2 analysis, so we need to also ignore the Duplicates and extend our single-end reads to 200bp (see Advanced Options).

โ“ What is the 1x normalisation? Why do we need it?

Tips ๐Ÿ‘€

Select FNR IP and Input sorted BAM as input

Extend single-end reads to 200bp and ignore duplicates

๐Ÿ”ธ 7.b Assessing the quality of the peaks and FNR IP

We now have the two required inputs for DeepTools graphs :

  • A set of regions (ours MACS2 peaks as BED file)
  • A set of signal tracks (our FNR and input BigWig files)

DeepTools is a very efficient tool for signal exploration, and it comes with an easy-to-use online manual. It is usually done in two steps :

  1. pre-compute the signal for the region of interest
  2. plot the signal in your requested format

In this tutorial, we will do a Profile plot, but you can explore other possibilities there (e.g. Heatmaps)!

โšก๏ธ Compute the signal matrix using DeepTools.

  • ๐Ÿช computeMatrix : prepares data for plotting a heatmap or a profile of given regions
  • Select our MACS2 peaks as regions to plot
  • Select the BamCoverge collection output as score files. Do not use the BAM file directl, as they are not 1x normalised it will make them harder to contrast.
  • Use the reference-point mode centered as region's center and 1kb window up/downstream.

โ“ Why don't we use the scale-region mode? See DeepTools manual for help.

Tips ๐Ÿ‘€

Select regions and signal track inputs

Use the reference-point mode

Now the visualization part!

โšก๏ธ Generate the Profile plot using DeepTools.

  • ๐Ÿช plotProfile : creates a profile plot for score distributions across genomic regions
  • Use our freshly-computed matrix as input
  • Our tracks are 1x normalised, we can group them in the same plot
Tips ๐Ÿ‘€

Select matrix as input

Group the two signal (FNR and input) in the same plot

โ“ Is the output graph consistent with our biological hypothesis?

Ok, the signal in our detected MACS2 peaks seems good! Yet, this plot is only focusing on our selected regions, how can we make sure our ChIP-seq worked? To answer that, we should look at the general distribution of ChIP-seq signal across the genome :

  • If the FNR ChIP worked, we expect a small number of peaks with high signal and the rest with low signal
  • If the FNR ChIP failed, we expect it to ressemble the non-specific Input, meaning a uniform low signal in every region

To assess where we are standing, we need to compute the Fingerprint of our data (aka the Lorenz curve), it's a quality-check plot that is also supported in DeepTools (told you this tool is great). You also have a nice explanatory section about it on the DeepTools manual.

โšก๏ธ Generate the Lorenz curve for our FNR IP and Input data

  • ๐Ÿช plotFingerprint : plots profiles of BAM files; useful for assessing ChIP signal strength
  • Select our Collection of sorted BAM files as input
  • For time sake (and using E.Coli), reduce the sampling rate to 1000
  • Same as the other plot, extend reads to 200bp and ignore duplicates
Tips ๐Ÿ‘€

Select matrix as input

Group the two signal (FNR and input) in the same plot

โ“ On a scale of Challenger Deep to Mt Everest, how peaky is our IP?

๐Ÿ”ธ 7.c Browsing the signal in the Genome

We have our data, we have visualised it's overall quality at full genome scale and at MASC2 peak levels. Now, we want to go look a specific examples of regions. To do so, we will use a Genome Browser that will serves as a search tool.

We will use the ๐Ÿงฌ Integrative Genome Viewer (IGV) that you should have installed locally on your machine. Another widely used browser is the UCSC Genome Browser, which can be used online.

To get IGV up and running, we need to download locally some of our Galaxy output files :

  • The E.Coli genome sequence as a FASTA files and its gene annotation as a GFF3 file.
  • The MACS peaks of FNR signal as BED file
  • The normalised FNR and input signal as BigWig files

โšก๏ธ Save the FASTA, BED and BigWig files locally from Galaxy history

  • ๐Ÿช Scroll down your history and simply use the download button, unzip the bigwig collection

We're just missing the GFF3 file of gene annotations. We haven't loaded to Galaxy. To retrieve it, we must go back to where we found the E.Coli fasta sequence.

โšก๏ธ Find the correct gene annotation file for our refrence genome

  • Remember that our E.Coli genome is from the RefSeq database (chapter 5.a)
Tips ๐Ÿ‘€

The RefSeq ID for the E.Coli strain K-12 MG1655 is NC_000913.3. You can find it by searching this strain on NCBI's nucleotide database (1), filtering for RefSeq match only (2) and looking at the 1st entry (3).

On the RefSeq page, you can ask for a local data drop via the Send to tab.


Ready to go? Let's turn IGV on and load all our files!

โšก๏ธ Get IGV ready to explore

  • ๐Ÿงฌ Genomes : Load Genome from file...
  • Select the RefSeq fasta sequence
  • ๐Ÿงฌ File : Load from file...
  • Select the BED, GFF3 and BigWig files
  • For better visualisation, comment (add ## at the beginning of the line) the third line of the GFF3 in a raw text editor (i.e. not word) before loading.

To be able to compare our FNR IP with input, we need to look at them with the same y-axis and comparable normalisation strategies! Both tracks are 1x-normalised, so we should just make sure the y-axis range is always the same for the two tracks.

  • ๐Ÿงฌ Select BigWig tracks : Right-clik and Group Autoscale
  • ๐Ÿงฌ Select BigWig tracks : Right-clik and Show Data Range
  • ๐Ÿงฌ Select BED tracks : Right-clik and Expanded
  • Start browsing, zoom in, zoom out, customise your tracks, ..
  • Go to the following genes : pepT, ycfP

โ“ What would have happen if we did not took care of group auto-scaling?

โ“ Anything interesting to see ?

๐Ÿชฉ IGV also support BAM file as input, load the FNR sorted BAM from Galaxy to IGV (careful, large file).

๐Ÿชฉ Sort the MACS2 BED file on Galaxy by the Score column to get highest and lowest scoring reported peaks. Which one do you trust most?

Tips ๐Ÿ‘€

Here is a view of pepT location

Here is a close-up of the BAM track for FNR IP, it automatically computes a global coverage track upon loading. Can you understand what the grey lines and the small colored lines represent?

๐Ÿช Sort data in ascending or descending order to find FNR_peak_64 and FNR_peak_122 as top and bottom scoring peak respectively


๐Ÿ“ 8. Finding motifs in the sequences

๐Ÿ”ธ 8.a A brief note on RSAT and motif discovery

Now that we have the predicted binding sites of FNR, we can explore further the binding interaction. Based on the region sequences, we can look for enriched DNA motif that would have a high affinity with FNR. This motifs are usually call Transcription Factor Binding Motifs (TFBS) and are defined as a Position Weight Matrix (PWM or sequence logo).

Several PWM databases exist, usually focusing on a subset of model organisms (e.g. RegulonDB for prokaryotes, TRANSFAC and JASPAR for eukaryotes) and curated for redundancy and motif quality.

The snail PWM seen from the JASPAR database :

To find these motifs, we will use the ๐Ÿ”ฎ Regulatory Sequence Analaysis Tool (RSAT). It's an easy-to-use online tool that support a large number of organisms. Other tools exist such as i-cis target, the MEME suite and HOMER. Each of them use their own set of discovery algorithms and parameters, feel free to explore them and pick the best-suited one for your own analysis.

๐Ÿ”ธ 8.b Running motif discovery

Time to look for motifs!

If you follow the small questionnaire on the welcome page on which tool to use, you will see that peak-motif is our way to go. Yet, this tool needs the fasta sequence of our MACS2 peaks.

โšก๏ธ Use Galaxy to export our MASC2 peak set as fasta

  • ๐Ÿช bedtools GetFastaBed : use intervals to extract sequences from a FASTA file
  • Use MASC2 narrowpeak output as input
  • Download the resulting fasta locally
Tips ๐Ÿ‘€

All set for peak-motif!

โšก๏ธ Use RSAT Prokaryotes server to find enriched motifs in our peak set

  • ๐Ÿ”ฎ peak-motif
  • Find a relevan title and select your fasta file as Peak sequences
  • Do not cut the peak sequence
  • Look 6bp and 7bp oligomer search, as well as dyad analysis
  • Compare your resulting enriched motifs with prokaryote database
Tips ๐Ÿ‘€
To avoid cutting peaks, set the cutting length to 0. Tick the *Spaced word pairs* analysis.
The prokaryote database is RegulonDB, select the 2015 collection.

โ“ Do you understand all the resut panels?

โ“ Do you see evidence of FNR binding?

โ“ What would have happened if we did not select dyad analysis?

๐Ÿ”ธ 8.c Going further

๐Ÿ Congrats, you made it to the final section!

Through this tutorial you have seen the following :

  • Finding a study of interest and extracting raw NGS data
  • Quality-filtering and mapping the reads to a reference genome
  • Calling peaks with significant IP enrichment with MACS2
  • Exporting data as explorable BigWig and BED tracks
  • Visualizing our result with DeepTools, at region and genome scale
  • Finding motifs of enriched TFBS in our peak set using RSAT

๐Ÿšจ Note that this analysis was performed on single-end, prokaryote, FNR IP data. If you want to apply this workflow to other data types, you should consider some additional tunings.

๐Ÿ”น ATAC-seq : ATAC-seq also target chromatin, but contrary to ChIP, it does not use targetted antibody. As a result, it is not particularly advised to run MACS for ATAC peak with the model option (instead, use --no-model). Otherwise, you can also perfrom the same analysis as ChIP-seq (DeepTools, motif enrichment)

๐Ÿ”น mapping : As always, make sure you use the correct reference genome. Here we used Bowtie as we are dealing with short single-end sequence. For longer (~150/200bp) paired-end reads, you can use other better-suited mapper such as BWA-MEM (used in CellRangerATAC). If you decide to use the famous STAR mapper, don't forget it's originally designed for RNA-seq data and you should enforce the --alignIntronMax 1 --alignEndsType EndToEnd options to forbid spliced alignment.

๐Ÿ”น paired-end data : These reads do not need to be manually extended, peak-calling algorithm will simply join the respective mates. To do so, specify this information in MACS with -f BAMPE.

๐Ÿ”น motif database : Do not use RegulonDB for eukaryote data, also change the RefSeq genome accordingly (and make sure you adapt parameters linked to genome size too)

๐Ÿ”น TF/Histone peaks : Accessibility peaks can have various shape of peaks (e.g. H3K27ac vs CTCF). To take this into account, you can explore MACS2 options for broad peak calling.

๐Ÿ”น In general, always scan the parameter options again to make sure your data match your analysis.

Hope you enjoyed the ride! Below, you can find some extra ressource for further ChIP-seq and ATAC-seq analysis :

Peak annotation : You can use the R package ChIPseeker to annotate your peaks based on nearest annotated features (e.g. gene TSS)

Replicates : If you have multiple IP for the same protein and you want to integrate them into a common peak calling analysis, you can use the Irreproducible Discovery Rate (IDR) methodology. For a detailed overview of this method, check M. Love's example.

No Input : If you do not have an input, you can still run MACS2, but keep in mind it is not advised as you will obtain false positive peaks that arise from non-specific enrichment.

Differential enrichment & ATAC : If you are looking for a quantitative analysis of different signal enrichment between conditions, several options exist. You can try the R packages DiffBind and csaw. They are based on DESeq2 and EdgeR models respectively. Good news, although these tools have been initially designed for ChIP-seq, they nicely expand to ATAC-seq analysis! They also both have a very large online documentation.

In case of doubt : Old but gold, you can read the article on ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. It's a robust pipeline used for creating the ENCODE database.

chip_seq_training's People

Contributors

dagousket avatar

Watchers

 avatar

Forkers

abotzki

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.