Giter Club home page Giter Club logo

sv-hotspot's Introduction

SV-HotSpot: detection of hotspots targeted by recurrent structural variants associated with gene expression

tool logo

SV-HotSpot is a tool for detection and visualization of genome hotspots targeted by recurrent structural variants (SVs) associated with gene expression. To so so, SV-HotSpot seamlessly integrates SV calls, gene expression, genome annotations (inluding genes and other functional elements such as regulatory elements) to identify and annotate recurrent SVs, and assess their potential consequences on the expression of nearby genes.

SV-HotSpot is developed at Christopher Maher Lab at Washington University in St. Louis.

SV-HotSpot Docker Instructions

To use SV-HotSpot, a docker image has been created and tested on Linux and Mac. To run SV-HotSpot, you need to have Docker installed on your machine.

Docker Installation

To obtain the latest docker image, run the following in your command line:

docker pull chrismaherlab/sv-hotspot

To test the image, run the following command which shows the usage of this tool:

docker run chrismaherlab/sv-hotspot sv-hotspot
USAGE:
      sv-hotspot.pl [OPTIONS] -g/--genome <genomeName> --sv <structuralVariants>

      NOTE:
	(1) Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). 
            - Built-in Genomes: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
            - Please refer to the documentation in case your genome is not listed above. 
	(2) Structutal variants file should be in "BEDPE" format.
	(3) Gene expression data and copy number segments are required to visualize hotspot regions.
	(4) Region of interest file(s) (e.g. promoters, enhancers, chip-seq, etc.) should be in "BED" format

OPTIONS:
	-w/--sliding-win-size		sliding window size	<int>		[ sliding window size. default: 100kb ]
	-s/--sliding-win-step		sliding window step 	<int>		[ sliding window step. default: 1kb ]
	-a/--annot			annotation file 	<filename>	[ an annotation file in "BED" format ]
	-e/--exp			expression file 	<filename>	[ expression file in a "matrix" format ]
	-c/--cn				Copy number file 	<filename>	[ copy number segments in BED format ]
	-W/--peakPick-window-size	peak calling window 	<int>		[ peakPick window size. default: 100bp ]
	-m/--peakPick-min-sd		peak calling min. SD	<int>		[ peakPick standard deviation cutoff. default: 5 ]
	-t/--pct-samples		percentage of samples	<int>		[ percentage of samples cutoff to call peaks. default: 10 ]
	-o/--output			output directory	<string>	[ results output directory. default: ./ ]
	-p/--pval			pvalue cuttoff		<float>		[ pvalue threshold used for significance. default: 0.05 ]
	-G/--genes-of-int		list of genes		<filename>	[ list of genes of interest to be used for visualization ]
	-r/--region-of-int		region(s) of interest	<filename>	[ region of interest file(s) in "BED" format separated by comma ]
	-C/--chrom			chromosome name 	<string>	[ chromosome name used to detect hotspots. default: ALL ]
	-t/--sv-type			structural variant type	<string>	[ SV type used to detect hotspots. default: ALL ]
	-d/--merge-dist-size		distance size		<int>		[ distance cutoff used to merge adjacent peaks. default: 10kb ]
	--merge-pct-samples		percentage of samples	<int>		[ percentage of samples cutoff to merge similar peaks. default: 5 ]
	--stop-merge-num-peaks		number of peaks		<int>		[ number of peaks cutoff to stop merging adjacent peaks. default: 0 ]
	-k/--num-nearby-genes		Number nearby genes	<int>		[ number of up/downstream genes to the peak. default: 1 ]
	--t-amp				amplification threshold	<float/int>	[ amplification cutoff. default: 2.8 ]
	--t-del				deletion threshold	<float/int>	[ deletion cutoff. default: 0.5 ]
	--stat-test			statistical test	<string>	[ statistical test used for comparison (wilcox.test or t.test). default: wilcox.test ]
	--chip-cov			chip-seq coverage	<filename>	[ chip-seq coverage file in "BED" foramt ]
	--chip-cov-lbl			chip-seq coverage label	<string>	[ label used for chip-seq coverage ]
	--plot-top-peaks		plot top # peaks	<int>		[ number of top peaks to plot. default: top 10 ]
	--left-ext			size of left extension	<int>		[ size of the left extension of the peak. default: 0bp ]
	--right-ext			size of right extension	<int>		[ size of the right extension of the peak. default: 0bp ]
	--genes-to-plot			genes names		<string>	[ names of genes to show on the plot. default: none.
										  If no genes were provided, all genes in the peak will be plotted ]
	--plot-layout			plot orientation	<string>	[ orientation of the peak plot (wide or narrow). default: narrow ]

How to run SV-HotSpot using Docker?

We have provided an example data available in "test_data" folder specifically for identifying SV hotspots affecting androgen receptor (AR) gene. To read more about this study, please refer to this Cell paper.

To run SV-HotSpot on this test data, use the following command:

docker run -v /local/folder:/data chrismaherlab/sv-hotspot sv-hotspot -g hg38 -C chrX \
              --sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
              -c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
              -r /data/test_data/enhancers.bed -o /data/SV-HOTSPOT-TEST \
              -w 100000 -s 30000 -d 50000 --merge-pct-samples 5 \
              -t-amp 2.99 --t-del 1.35 --stat-test wilcox.test --pval 0.05 \
              --chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000 \
              --plot-top-peaks 1 --genes-to-plot AR
  • Note, the -v flags map your local filesystem locations to a “location” within the Docker image. Therefore, you need to change /local/folderto your local folder on your machine. This folder must contain the "test_data" folder. The final output will be sent to this folder as well. Use "$PWD" in case you want to use the current directroy.
  • Please note that you need to provide the absolute path for /local/folder.
  • /data is a folder on the image container used to receive the input data that was mapped by -v flag. The SV-HotSpot pipeline is configured with this folder and it has to be provided the same way as in the command above.

Post-processing and filtering

Once hotspots of SVs are identified, users should perform post-processing and filtering. SV-HotSpot provides a script to perform post-filtering of the results based on multiple criteria including filtering of genes with low expression and/or weak expression association (p-value, logFC, mean group expression), removal of wide peaks not associated with known cancer genes (eg. using COSMIC census genes), removal of peaks with low number of samples harboring hotspot SVs. To performed filtering, run the filter.r command as follows:

Usage:  Rscript filter.r
            <SV-HotSpot_result_dir>
            <max.p-value-to-infer-expression-association>
            <min.logfc>
            <min.group.expr>
            <max.number.of.associated.gene.per.peak, peak w/ more genes associated are ignored>
            <max.peak.length, peak wider than this w/o assoc. w/ known genes are ignored>
            <min.peak.percent.samples, 0-100, peaks with percent of samples smaller than this are ignored>
            <tsv_file_of_cosmic_census_genes>
            <csv_other_known_genes_to_keep>

        Example: 
           Rscript filter.r /path/to/sv-hotspot-output 0.05 0.22 10 9 500000 15 \
                            data/cosmic_census_genes.tsv AR,ERG,PTEN,TP53

Output will be written to SV-HotSpot result directory which includes the following two files:

1) annotated_peaks_summary.filtered.tsv
2) genes.associated.with.SVs.filtered.tsv

Visualization of hotpots (peaks)

The top hotspots are automatically visualized when running SV-Hotspot detection (parameter --plot-top-peaks=# when running SV-HotSpot detection). Additionally, users can plot hotspots identified by SV-HotSpot independently of SV-Hotspot detection. To plot hotspot sites (or peaks), run the following command:

To show the usage page of this script, run the following command:

docker run chrismaherlab/sv-hotspot plot-peak

To plot peaks, use the following command which plots pX.172 and pX.173 peaks (peak name(s) taken from "annotated_peaks_summary.tsv" output file). You will also need to provide the SV calls file used to run SV-Hotspot, path to SV-HotSpot results directory, expression and copy number data, and the remaining parameters shown in SV-HotSpot detection command above. Peak names must be separated by comma and no space between them.

docker run -v "$PWD":/data chrismaherlab/sv-hotspot plot-peak \
           -p pX.55.1 --genes-to-plot AR \
           --res-dir /data/SV-HOTSPOT-TEST/sv-hotspot-output \
           -o /data/SV-HOTSPOT-TEST/sv-hotspot-output \
           --sv /data/test_data/sv.bedpe -e /data/test_data/exp.tsv \
           -c /data/test_data/cna.tsv --chip-cov /data/test_data/H3K27ac.bg \
           -r /data/test_data/enhancers.bed --t-amp 2.99 --t-del 1.35 \
           --chip-cov-lbl H3K27ac --left-ext 100000 --right-ext 100000
  • res-dir must refer to the folder containing the results generated from running sv-hospot command with -o option. Please note that you need to include "sv-hotspot-output" at the end of res-dir path since SV-HotSpot always creates this folder which is used to write all output results.

Visualization of whole genome and individual chromsomes

SV-HotSpot also provides whole genome level visualization of hotspot sites. The script plot_whole_genome.r can be used to generate the histogram aggregation of the counts of samples harboring structural variants at whole genome level (circos plot) or individual chromosome level. This script can be run as follows.

   Usage: Rscript plot_whole_genome.r
             <SV-Hotspot_result_dir>
             <annotation_dir>
             <genome_assembly_version, eg. hg38>
             <plot_circos, eg. TRUE, FALSE>
             <chromosomes_to_plot, eg. "ALL", "chr1,chrX"> 
             <genes_to_show, eg. "ERG,PTEN,ETV1">
             <color_genes_by_association_direction_with_SV, eg. "TRUE", "FALSE">
             <output_figure_format, eg. "png", "pdf">
             <output_figure_dir>

   Example:
           Rscript plot_whole_genome.r /path/to/sv-hotspot-output /path/to/annotations \
                                        hg38 TRUE chr1,chrX,chr21 ERG,AR,PTEN TRUE png out

SV-HotSpot Manual

Input

The tool requires as an input the following:

  1. Genome assembly name (e.g. hg38, hg19, mm9, mm10, etc.) which is used to extract chromosomes sizes file (a tab-delimited file with two columns, chrom and size). Genome name should be one of the UCSC genome releases (https://genome.ucsc.edu/FAQ/FAQreleases.html#release1). SV-HotSpot built-in genomes are: hg18, hg19, hg38, mm9, mm10, dm3, dm6, rn5, rn6.
  1. Structural variants file in BEDPE format (Example of how to prepare this file is available in test_data folder). In case only VCF files are available, please try to convert them to BEDPE format using existing tools such vcftobedpe form svtools. Instructions on how to install it and use it can be found here.

Optional files:

  1. Expression data in a matrix format where the first column represents the feature (e.g. gene) and columns represents samples. See the example provided at test_data folder. Please note that features in this file have to be unique.

  2. Copy number segments in BED format. See the example at test_data on how to prepare this file.

  3. An annotation file in BED format. If the user didn't provide this file, a built-in annotation file based on the genome name will be used. Please note that the feature in the expression file has to match the one in this file.

  4. Region of interest file(s) (e.g. enhancers, transcription binding sites, etc) in BED format. An example of this file is given in test_data folder. Multiple files can be provided but have to be separated by comma (make sure no space between the file names).

  5. ChIP-Seq coverage in BED format. See the example at test_data on how to prepare this file.

All other parameters are optional and a default value was assigned to each (run sv-hotspot.pl --help for more details).


IMPORTANT NOTES:

  • To prepare all your files, please look at the examples provided in test_data folder and do accordingly.
  • Structural variant types must be in the format of THREE letters and should only include the following types: BND, DEL, DUP, INS, INV.
  • All files headers should be the same as the ones in the examples files.
  • The chromosome names in all files should be consistent and in the format of chr#.
  • The feature name in the annotation file and the expression file should be the same.
  • The "name" column in the SVs file should be in the format of sample/type (e.g. Sample1/INV)

ChIP-seq coverage data

Given the large quantities of data generated in ChIP-Seq, most computational analyses face significant challenges processing this large magnitude of data. Therefore, it is recommended that you average ChIP-seq data using a window of size range from 1-10k and provided in the format mentioned above.

Output

There are two main output files:

  1. annotated_peaks_summary.tsv: this file has all information about identified peaks.

  2. genes.associated.with.SVs.tsv: this file contains statisitcal information for all genes affected by SVs.

  3. In addition, SV-HotSpot provides various visualizations composed of overlaying tracks representing copy number aggregation, SV aggregation, and gene/regulatory/region of interest annotation tracks.

Additional files:

  1. Peaks files for each chromosome and their corresponding figures. These files are located in the peaks folder.
  2. UCSC custom track files. These files are located in peaks/ucsc_custom_track_files. These files can be viewed on the UCSC Genome Browser.

Example of SV-HotSpot visualization

(1) Recurrent SVs targeting a non-coding region located upstream of AR gene:

visualization example

(2) Circos plot of recurrent SV hotspots identified by SV-HotSpot across the genome:

visualization example

References

  1. Quigley, D. A., Dang, H. X., Zhao, S. G., Lloyd, P., Aggarwal, R., Alumkal, J. J., ... & Playdle, D. (2018). Genomic hallmarks and structural variation in metastatic prostate cancer. Cell, 174(3), 758-769.

How to cite SV-HotSpot?

Coming soon ....

Contact

Abdallah Eteleeb: [email protected]

sv-hotspot's People

Contributors

eteleeb avatar hdng avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.