JASPAR UCSC tracks

This repository contains the data and code used to generate the JASPAR UCSC Genome Browser track data hub.
For more information visit the JASPAR website.

News

01/07/2018 To speed up TFBS predictions, we switched from MEME and the Perl TFBS package to PWMScan.

Content

  • The genomes folder contains scripts to download and process different genome assemblies
  • The profiles folder contains the output from the script get-profiles.py, which downloads the JASPAR CORE profiles for different taxa
  • The file environment.yml, within the conda folder, contains the conda environment used to generate the genomic tracks for JASPAR 2022 (see installation)
  • The script install-pwmscan.sh downloads and installs PWMScan and places its binaries in the bin folder
  • The script scan-sequence.py takes as its input the profiles folder and a nucleotide sequence in FASTA format
    (e.g. a genome), and outputs TFBS predictions
  • The script scans2bigBed creates a bigBed track file from TFBS predictions

The original scripts used for the publication of JASPAR 2018 have been placed in the folder version-1.0.

Dependencies

Note that running scan-sequence.py requires only the Python dependencies and PWMScan.

Installation

To install PWMScan, execute the script install-pwmscan.sh.
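
The build started by install-pwmscan.sh calls gcc through make, so both need to be on the PATH before the script is run. The following is a minimal sketch of a pre-flight check; the function name missing_tools and the default tool list are illustrative assumptions, not part of this repository.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("gcc", "make")) -> List[str]:
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        # Install these with the system package manager before building.
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("gcc and make were found on the PATH")
```

An empty result only means the tools were found; it does not guarantee that the build itself succeeds.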

The remaining dependencies can be installed through the conda package manager:

conda env create -f ./conda/environment.yml

Availability

Genomic tracks and TFBS predictions for human and seven other model organisms, covering 11 genome assemblies, are available online:

Usage

To illustrate how the genomic tracks are generated, we provide an example for the baker's yeast genome:

./scan-sequence.py --fasta-file ./genomes/sacCer3/sacCer3.fa --profiles-dir ./profiles/ \
    --output-dir ./tracks/sacCer3/ --threads 4 --latest --taxon fungi

For this example, the scanning step should take no longer than a minute. For human and other genomes of similar size, this step usually finishes within a few hours (the exact time depends on the number of --threads specified).

  • Create the genomic track
./scans2bigBed -c ./genomes/sacCer3/sacCer3.fa.sizes -i ./tracks/sacCer3/ -o ./tracks/sacCer3.bb -t 4

TFBS predictions from the previous step are merged into a bigBed track file. Column five stores scores derived from the PWMScan p-values, scaled between 0 and 1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10^-10. This allows prediction confidence to be compared across TFBSs. For this example, this step should complete within a few minutes; for larger genomes it can take a few hours.
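
The text above gives only the two endpoints of the score scale. One mapping consistent with them is score = -100 × log10(p-value), capped at 1000. The sketch below assumes that log scale; it is not taken from scans2bigBed, and the function names are hypothetical.

```python
import math

MAX_SCORE = 1000  # upper bound of the score column


def p_value_to_score(p_value: float) -> int:
    """Map a p-value in (0, 1] to an integer score in [0, 1000].

    ASSUMPTION: score = -100 * log10(p), capped at 1000 (p <= 1e-10).
    """
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p-value must be in the interval (0, 1]")
    return min(MAX_SCORE, int(round(-100.0 * math.log10(p_value))))


def score_to_p_value(score: int) -> float:
    """Invert the assumed mapping.

    A score of 1000 only yields an upper bound (1e-10), because all
    smaller p-values are capped to the same score.
    """
    if not 0 <= score <= MAX_SCORE:
        raise ValueError("score must be in the interval [0, 1000]")
    return 10.0 ** (-score / 100.0)
```

Under this assumption, a score threshold of 500 would keep predictions with p-value ≤ 10^-5. Scores are rounded to integers, so the round trip from score back to p-value is approximate.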

Important note: disk space requirements for large genomes (i.e. danRer11, hg19, hg38, mm10, and mm39) are substantial. In these cases, we highly recommend allocating at least 1 TB of disk space.
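
The -c argument in the command above points to a chromosome sizes file (sacCer3.fa.sizes). If such a file is not already produced by the scripts in the genomes folder, it can be derived from the FASTA file. The sketch below assumes the usual two-column, tab-separated layout (sequence name, length) used by UCSC tools; the function names are hypothetical.

```python
from typing import Dict, Iterable


def fasta_sizes(lines: Iterable[str]) -> Dict[str, int]:
    """Return sequence lengths keyed by name, in order of appearance."""
    sizes: Dict[str, int] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            header = line[1:].split()
            name = header[0] if header else ""
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes


def write_sizes(fasta_path: str, sizes_path: str) -> None:
    """Write an ASSUMED two-column "name<TAB>length" sizes file."""
    with open(fasta_path) as fasta, open(sizes_path, "w") as out:
        for name, length in fasta_sizes(fasta).items():
            out.write(f"{name}\t{length}\n")
```

For example, write_sizes("./genomes/sacCer3/sacCer3.fa", "./genomes/sacCer3/sacCer3.fa.sizes") would write one line per chromosome.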

jaspar-ucsc-tracks's Issues

About installation

Hi,
Thank you for the development of JASPAR-UCSC-tracks

I am very interested in using this program. However, when I try to install it by running the script install-pwmscan.sh with bash, the following error appears every time:

gcc -fPIC -O3 -std=gnu99 -W -Wall -o hashtable.o -c hashtable.c
make: gcc: Command not found
make: *** [Makefile:42: hashtable.o] Error 127

Could you help me?

Thank you!

fetch -p parameter

@oriolfornes

Shouldn't this part of the fetch* script say "from jaspar2pfm.py" rather than "from jaspar2meme.py"?

parser.add_option("-p", action="store", type="string", dest="profiles_dir", help="Profiles directory (from jaspar2meme.py)", metavar="<profiles_dir>")

mismatched taxon

Hi, thank you for the great resources.

I'm looking at hg38 http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2022/hg38/
and found some motifs that seem unrelated to Vertebrata, for example, MA2020 is from Arabidopsis thaliana. MA1879 is from Ciona intestinalis (these are just examples and there are more).

I see the option --taxon vertebrate for hg38 in your code (https://github.com/wassermanlab/JASPAR-UCSC-tracks/blob/master/scan-sequences.sh), so I assumed that only motifs linked to vertebrates are included.

Since other genome versions like hg19 or mm10 also include the non-vertebrate motif annotations, I wondered whether this is intended or a mistake.

Thank you!
Nana

mm10 build?

It would be great if you could also provide TFBS prediction tracks for mm10.

Provided binaries do not work

The provided binaries of matrix_scan and matrix_probe do not work on any of my Linux systems (RH7 and Ubuntu 16.0x-20.0x).

Do you have a working Linux binary available?

p_value clarification needed on their scaling and interpretation

In JASPAR UCSC tracks I read that the scores in the bigBed files are p-values which have been

(scaled between 0-1000, where 0 corresponds to p-value = 1 and 1000 to p-value ≤ 10^-10)

Can I use R's rescale function to recover the p-values from the scores? For instance, a score of 950 comes from a p-value of .05

library(scales)
rescale(950,c(1,10**-10),c(0,1000))
.05

In any case, I am having trouble effectively interpreting the p_value.

In PWMScan: a fast tool for scanning entire genomes with a position-specific weight matrix I read

The P-value of a PWM score x is defined as the probability that a random k-mer sequence of the length of the PWM has a binding score ≥ x given the base composition of the genome.

I would hope to find some measure of how well the sequence at a candidate (or putative) binding site identified by PWMScan matches the motif PWM. This does not seem to provide that. Or am I mistaken?

I would like the possibility of being more stringent in selecting candidates from this track by setting a threshold on the score. However, I am hesitant to adopt this approach, as thresholding on the scaled P_value could introduce a bias toward a subset of the universe of motifs. Is my reasoning suspect here?

edit: Perhaps another way of getting at this is to ask: do the motifs all have the same distribution of P_values? If they do, then thresholding across the board at any given P_value should remove an equal fraction of each motif's hits. Do you know if they do?

Error cannot convert float NaN to integer

Hello! Thank you for always answering the Issues.

I have had some problems with the installation of JASPAR, but I thought I had finally sorted them out. However, when I tried to run the example command:
./scan-sequence.py genomes/sacCer3/sacCer3.fa profiles/ -o tracks/sacCer3/ --threads 4 --taxon fungi

I got the error ValueError: cannot convert float NaN to integer.

My installation process was:
git clone https://github.com/wassermanlab/JASPAR-UCSC-tracks
bash install-pwmscan.sh
conda env create -f ./conda/environment.yml
mv pwmscan/* JASPAR-UCSC-tracks/ (I did this because of an error about where matrix_scan is)

run ./scan-sequence.py genomes/sacCer3/sacCer3.fa profiles/ -o tracks/sacCer3/ --threads 4 --taxon fungi

I'm using JASPAR version 1.0

Thank you so much!
