DELTA: a Distal Enhancer Locating Tool based on AdaBoost and shape features of chromatin modifications
Accurate identification of DNA regulatory elements has become an urgent need in the post-genomic era. Recent genome-wide chromatin state mapping efforts revealed that DNA elements are associated with characteristic chromatin modification signatures, based on which several approaches have been developed to predict transcriptional enhancers. However, their practical application is limited by incomplete extraction of chromatin features and by model inconsistency when predicting enhancers across different cell types. To address these issues, we define a set of non-redundant shape features of histone modifications, which show high consistency across cell types and greatly reduce the feature dimension. Integrating shape features with the machine-learning algorithm AdaBoost, we developed an enhancer prediction method, DELTA (Distal Enhancer Locating Tool based on AdaBoost). We show that DELTA significantly outperforms current enhancer prediction methods in prediction accuracy on different datasets and can predict enhancers in one cell type using models trained in other cell types without loss of accuracy. Overall, our study presents a novel framework for accurately identifying enhancers from epigenetic data across multiple cell types.
Please check the file 'INSTALL' in the distribution.
Usage: delta.py [-c chip_files] [-P promoter_loci] [-E enhancer_loci] [options]
Example: delta.py -c H3K4me1.bed,H3K4me3.bed,H3K27ac.bed -E p300.bed -P tss.bed -g hg19
--version
Show program's version number and exit
-h, --help
Show this help message and exit
-c CHIP_BEDS, --chip_bed=CHIP_BEDS
ChIP-seq bed file of histone modifications
-E ENHANCER, --enhancer=ENHANCER
BED file containing the enhancer loci
-P PROMOTER, --promoter=PROMOTER
BED file containing the promoter loci
-R, --read
Read existing training and prediction data instead of
generating them from ChIP-seq (default: False)
-g GENOME, --genome=GENOME
Genome assembly; should be one of the following: dm3,
mm9, hg17, hg18, hg19
-b BIN_SIZE, --bin_size=BIN_SIZE
Length of dividing bins (default: 100)
-w WIN_SIZE, --window_size=WIN_SIZE
Length of sliding window; should be an integer multiple
of the bin size (default: 2000)
--iteration_number=ITER_NUM
Number of iterations for AdaBoost (default: 100)
--pvalue_threshold=P_THRES
P-value threshold for enhancer prediction (default:
0.5)
-o OUTPUT, --output=OUTPUT
Output file name (default output file is
"predicted_enhancer.bed")
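The constraint that the window size (-w) be an integer multiple of the bin size (-b) can be checked before starting a run. The following is a minimal sketch and not part of DELTA: the function name bins_per_window is hypothetical, and the default values are the ones listed in the option summary.

```python
def bins_per_window(win_size=2000, bin_size=100):
    """Return how many bins fit in one sliding window.

    Hypothetical helper, not part of delta.py. Raises ValueError when the
    window is not a whole number of bins.
    """
    if win_size <= 0 or bin_size <= 0:
        raise ValueError("window size and bin size must be positive")
    if win_size % bin_size != 0:
        raise ValueError(
            "window size %d is not an integer multiple of bin size %d"
            % (win_size, bin_size)
        )
    return win_size // bin_size


if __name__ == "__main__":
    # Documented defaults: -w 2000 and -b 100.
    print(bins_per_window())
```

With the documented defaults, each window spans 20 bins.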
-c / --chip_bed
The ChIP-seq files contain the chromatin modification mapping data. Provide the files as a comma-separated list, e.g. H3K4me1.bed,H3K4me3.bed,H3K27ac.bed.
The BED format is defined at http://genome.ucsc.edu/FAQ/FAQformat#format1.
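When DELTA is driven from another script, the comma-separated -c argument can be assembled programmatically. The sketch below is only an illustration: build_delta_command is a hypothetical helper that is not shipped with DELTA, and it uses only the options documented above (-c, -E, -P, -g, -o).

```python
def build_delta_command(chip_beds, enhancer_bed, promoter_bed, genome, output=None):
    """Assemble an argument list for delta.py from the documented options.

    Hypothetical helper; it only builds the list and does not run anything.
    """
    if not chip_beds:
        raise ValueError("at least one ChIP-seq BED file is required")
    for path in chip_beds:
        # A comma inside a file name would be ambiguous in the -c list.
        if "," in path:
            raise ValueError("file name contains a comma: %r" % path)
    cmd = [
        "delta.py",
        "-c", ",".join(chip_beds),
        "-E", enhancer_bed,
        "-P", promoter_bed,
        "-g", genome,
    ]
    if output is not None:
        cmd += ["-o", output]
    return cmd


if __name__ == "__main__":
    cmd = build_delta_command(
        ["H3K4me1.bed", "H3K4me3.bed", "H3K27ac.bed"],
        "p300.bed", "tss.bed", "hg19",
    )
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run, assuming delta.py is executable and on the PATH.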
-R / --read
The "-R" option lets the user read existing training and prediction data instead of generating them from the ChIP-seq files, which can be a time-consuming process. WARNING: Use with care! The wrong training and prediction data could be loaded.
--pvalue_threshold
P-value threshold for enhancer prediction. The number of predictions can be adjusted by tuning this parameter.
1. predicted_enhancer.bed is a BED-format file containing the predicted enhancers. Note that if the step size is smaller than the window size, the predicted enhancers may be redundant. In this situation, the uniq command should be used to remove repeated predictions.
2. adaboost.R is an R script generated by delta.py for executing the AdaBoost algorithm.
3. tmp_dir is a directory containing temporary files created by delta.py. It should not be removed until the entire training and prediction run is done.
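Item 1 recommends uniq for removing repeated predictions. Because uniq only collapses adjacent identical lines, the file generally has to be sorted first. A rough Python alternative is sketched below, assuming that redundant predictions appear as exactly identical lines; unique_bed_lines is a hypothetical name, not a DELTA function.

```python
def unique_bed_lines(lines):
    """Drop exact duplicate BED records, keeping the first occurrence of each.

    Hypothetical helper. Overlapping but non-identical intervals are kept,
    and blank lines are skipped.
    """
    seen = set()
    unique = []
    for line in lines:
        record = line.rstrip("\n")
        if not record or record in seen:
            continue
        seen.add(record)
        unique.append(record)
    return unique


if __name__ == "__main__":
    # Made-up records for illustration only, not real DELTA output.
    demo = [
        "chr1\t100\t2100\n",
        "chr1\t100\t2100\n",
        "chr1\t600\t2600\n",
    ]
    for record in unique_bed_lines(demo):
        print(record)
```

Unlike uniq, this keeps the original order of the file and does not require sorting. An open file handle on predicted_enhancer.bed can be passed in place of the demo list.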
The source code of DELTA is freely available for academic use. For a commercial license, please contact Dr. Chenggang Zhang ([email protected]).