wassermanlab / jaspar-ucsc-tracks

Code and data used to create the JASPAR UCSC Genome Browser tracks data hub

License: MIT License

Python 67.52% Perl 22.67% Shell 9.82%

jaspar-ucsc-tracks's Introduction

JASPAR UCSC tracks

This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.

News

01/07/2018 To speed up TFBS predictions, we switched from MEME and the Perl TFBS package to PWMScan.

Content

The genomes folder contains scripts to download and process different genome assemblies
The profiles folder contains the output from the script get-profiles.py, which downloads the JASPAR CORE profiles for different taxons
The file environment.yml, within the conda folder, contains the conda environment used to generate the genomic tracks for JASPAR 2022 (see installation)
The script install-pwmscan.sh downloads and installs PWMScan and places its binaries in the bin folder
The script scan-sequence.py takes as its input the profiles folder and a nucleotide sequence in FASTA format
(e.g. a genome), and outputs TFBS predictions
The script scans2bigBed creates a bigBed track file from TFBS predictions

The original scripts used for the publication of JASPAR 2018 have been placed in the folder version-1.0.

Dependencies

Python 3.7 with the following libraries: Biopython (<1.74), NumPy, pyfaidx and tqdm
PWMScan
UCSC binaries for standalone command-line use

Note that running scan-sequence.py requires only the Python dependencies and PWMScan.
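
Before running the pipeline, the dependency list above can be checked from Python. The following is a minimal sketch and not part of the repository: the helper names are hypothetical, the module names are the usual import names for the listed libraries (Bio for Biopython), and the binary names (matrix_scan for PWMScan, bedToBigBed for the UCSC tools) are assumptions about what the scripts call.

```python
import importlib.util
import shutil

# Import names for the Python libraries listed above (assumed mapping).
PYTHON_MODULES = ["Bio", "numpy", "pyfaidx", "tqdm"]
# Executables assumed to be required; adjust to your installation.
BINARIES = ["matrix_scan", "bedToBigBed"]


def missing_modules(modules):
    """Return the module names that cannot be found for import."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def missing_binaries(binaries, path=None):
    """Return the executables not found on PATH (or on `path`, if given)."""
    return [b for b in binaries if shutil.which(b, path=path) is None]


if __name__ == "__main__":
    print("Missing Python modules:", missing_modules(PYTHON_MODULES))
    print("Missing binaries:", missing_binaries(BINARIES))
```

If the PWMScan binaries live in the repository's bin folder rather than on PATH, pass that folder as the path argument, for example missing_binaries(BINARIES, path="./bin").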

Installation

To install PWMScan, execute the script install-pwmscan.sh.

The remaining dependencies can be installed through the conda package manager:

conda env create -f ./conda/environment.yml

Availability

Genomic tracks and TFBS predictions for human and seven other model organisms, covering 11 genome assemblies, are available online:

http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/

Usage

To illustrate how the genomic tracks are generated, we provide an example for the baker's yeast genome:

Download the genome sequence and chromosome sizes (automated in this script)
Scan the genome sequence using all fungi profiles from the JASPAR CORE

./scan-sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
    --output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi

For this example, the scanning step should take no longer than a minute. For human and other similar genomes, this step is usually finished within a few hours (the final amount of time will depend on the number of --threads specified).

Create the genomic track

./scans2bigBed -c ./genomes/sacCer3/sacCer3.fa.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4

TFBS predictions from the previous step are merged into a bigBed track file. In column five, we use as scores the p-values from PWMScan (scaled between 0 and 1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10^-10). This allows for comparison of prediction confidence across TFBSs. Again, for this example, this step should be completed within a few minutes, while for larger genomes it can take a few hours.
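
The text above gives only the two endpoints of the score scale. One mapping consistent with those endpoints is a -log10 scaling capped at 1000, sketched below. This is an assumption for illustration and is not confirmed here as the exact formula used by scans2bigBed; the function names are hypothetical.

```python
import math


def p_value_to_score(p_value):
    """Map a p-value to a 0-1000 score, assuming score = -log10(p) * 100.

    Under this assumption p = 1 gives 0 and any p <= 1e-10 gives 1000.
    """
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in the interval (0, 1]")
    score = round(-math.log10(p_value) * 100)
    return max(0, min(1000, score))


def score_to_p_value(score):
    """Invert the assumed mapping; a score of 1000 means p <= 1e-10."""
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    return 10 ** (-score / 100)
```

Under this assumed log scale, a score threshold corresponds to the same p-value cutoff for every profile; for instance, keeping scores of at least 500 would correspond to p-values of at most 10^-5.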

Important note: disk space requirements for large genomes (i.e. danRer11, hg19, hg38, mm10, and mm39) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space.
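
A free-space check before scanning a large genome can catch this early. The sketch below uses Python's standard shutil.disk_usage; the helper name and the decimal definition of one terabyte are assumptions made for illustration, and the check is not part of the repository's scripts.

```python
import shutil

ONE_TB = 10**12  # one terabyte in bytes (decimal definition, assumed)


def has_enough_space(path, required_bytes=ONE_TB):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # Check the filesystem of the current directory; point this at the
    # directory where the tracks will be written.
    if not has_enough_space("."):
        print("Warning: less than 1 TB free on this filesystem")
```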

jaspar-ucsc-tracks's Issues

About installation

Hi,
Thank you for the development of JASPAR-UCSC-tracks

I am very interested in using this program. However, when I try to install it by running the script install-pwmscan.sh with bash, the following error appears every time:

gcc -fPIC -O3 -std=gnu99 -W -Wall -o hashtable.o -c hashtable.c
make: gcc: Command not found
make: *** [Makefile:42: hashtable.o] Error 127

Could you help me?

Thank you!

fetch -p parameter

@oriolfornes

Shouldn't the help text in this part of the fetch* script say "from jaspar2pfm.py" rather than "from jaspar2meme.py"?

parser.add_option("-p", action="store", type="string", dest="profiles_dir", help="Profiles directory (from jaspar2meme.py)", metavar="<profiles_dir>")

mismatched taxon

Hi, thank you for the great resources.

I'm looking at hg38 http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/
and found some motifs that seem unrelated to Vertebrata, for example, MA2020 is from Arabidopsis thaliana. MA1879 is from Ciona intestinalis (these are just examples and there are more).

I see the option --taxon vertebrate for hg38 in your code (https://github.com/wassermanlab/JASPAR-UCSC-tracks/blob/master/scan-sequences.sh), so I assumed that only motifs linked to vertebrates are included.

Since the tracks for other assemblies, such as hg19 or mm10, also include non-vertebrate motif annotations, I wondered whether this is intended or a mistake.

Thank you!
Nana

mm10 build?

It would be great if you could also provide TFBS prediction tracks for mm10.

Provided binaries do not work

The provided binaries of matrix_scan and matrix_probe do not work on any of my Linux systems (RH7 and Ubuntu 16.0x-20.0x).

Do you have a working Linux binary available?

p_value clarification needed on their scaling and interpretation

In JASPAR UCSC tracks I read that the scores in the bigBed files are p-values which have been

(scaled between 0-1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10^-10)

Can I use R's rescale function to recover the p-values from the scores? For instance, a score of 950 comes from a p-value of .05

library(scales)
rescale(950,c(1,10**-10),c(0,1000))
.05

In any case, I am having trouble effectively interpreting the p_value.

In PWMScan: a fast tool for scanning entire genomes with a position-specific weight matrix I read

The P-value of a PWM score x is defined as the probability that a random k-mer sequence of the length of the PWM has a binding score ≥ x given the base composition of the genome.

I would hope to find some measure of how well the sequence at a candidate (or putative) binding site identified by PWMScan matches the motif PWM. This does not seem to provide that. Or am I mistaken?

I would like to be able to select candidates from this track more stringently by setting a threshold on the score. However, I am hesitant to adopt this approach, as thresholding on the scaled P_value could introduce a bias toward a subset of the universe of motifs. Is my reasoning suspect here?

edit: Perhaps another way of getting at this is to ask: do the motifs have the same distribution of P_values as each other? If they do, then thresholding across the board at any given P_value should remove an equal fraction of each motif's hits. Do you know if they do?

2022 v 2020 danRer11 tracks changed from using TF name to matrix identifier possibly in error

While comparing

http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2020/JASPAR2020_danRer11.bb
http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/JASPAR2022_danRer11.bb

I find that the 2022 bigbed files display the matrix identifiers while those for 2020 display the TF name, at least as displayed in IGV, as pictured below.

I expect this is in error.

Error cannot convert float NaN to integer

Hello! Thank you for always answering the Issues.

I have had some problems with the installation of JASPAR, but I thought I had finally resolved them. However, when I tried to run the example command:
./scan-sequence.py genomes/sacCer3/sacCer3.fa profiles/ -o tracks/sacCer3/ --threads 4 --taxon fungi

I got the error ValueError: cannot convert float NaN to an integer.

My process of installation was:
git clone https://github.com/wassermanlab/JASPAR-UCSC-tracks
bash install-pwmscan.sh
conda env create -f ./conda/environment.yml
mv pwmscan/* JASPAR-UCSC-tracks/ (I did this because there was an error about the location of matrix_scan)

run ./scan-sequence.py genomes/sacCer3/sacCer3.fa profiles/ -o tracks/sacCer3/ --threads 4 --taxon fungi

I'm using JASPAR version 1.0

Thank you so much!