Giter Club home page Giter Club logo

anses-ploufragan / sigffrid Goto Github PK

View Code? Open in Web Editor NEW
2.0 4.0 0.0 3.62 MB

Find two-boxes transcription factor binding sites (made for sigma factor binding sites) by comparing intergenic DNA regions upstream pairs of orthologous genes using statistical analyses

License: Other

Perl 99.88% Shell 0.12%
transcription-factor-binding-site sigma-factor-binding-site orthologues statistics dna-motifs rmes prediction promoter sigffrid bmc-bioinformatics

sigffrid's Introduction

SIGffRid

Meaning of program name

SIGma factor (binding sites) finder using RMES to select input data.

SIGffRid name is mainly a recall to SigR Streptomyces coelicolor Sigma factor binding sites, well known and easily detectable even using counts instead of statistics in SIGffRid for motif extension. This first "confirmation" helped me to argue to develop the statistical extension criteria needed for SIGffRid to find other kinds of sigma factors.

PRESENTATION

SIGffRid try to deduce Sigma Factor Binding Sites using two phylogenetically close bacterial genomes (typically, showing 16S RNA identity of 97%, with the highest possible number of orthologous sigma factors).

PRINCIPLE

SIGffRid performs a simultaneous analysis of pairs of promoter regions of orthologous genes in Bacterial species. SIGffRid uses a prior identification of slighly over-represented 3 to 7 letters long words in the whole genomes of the two bacterial genomes, as selection criteria for potential -35 and -10 boxes using R'MES program.

From each pair of intergenic upstream sequences of orthologs (obtained from MBGD database), it get pairs of conserved (individually over-represented) words with close spacers on the two sequences: patterns. These patterns are then grouped independantly for each bacterium using pairs of short seeds (of which one is possibly gapped), allowing a variable-length spacer between them (one seed matching to one single word of the pattern). Conserved nucleotides in seeds concerned 6 positions (from 2 to 4 for an individual seed composing a pair).

Next, the conserved area corresponding to seed pair is extended guided by statistical considerations, a feature that ensures a selection of motifs with statistically relevant properties.The extension process is based on a Markov models of third order obtained from each whole genome and concerns the position flanking conserved area related to the letter whose the observed number of occurrences is the less expected in the n sequences concerned. If there are several statistically significant letters at this position, the extension is tried using several letters simultaneously. In this case, we obtain a motif with a position corresponding to several letters.If this motif is statistically significant, we stop the extension process: we have the consensus and all the motifs and positions in upstream sequences related to this motif. Otherwise, extension is tried on each statisitcally significant letter at this position. Sequences who are not concerned are grouped so as to run a distinct extension process. Entension process stop when the motif was extended by 10 letters without interesting results or when less than 8 sequences are concerned.

After each extension by one letter (at one of the 4 positions flanking seeds), the conserved motif is evaluated. We compute a ratio R of the number of occurrences on upstream sequences on the number of occurrences in whole genome (taking into account the two strands).The higher is R, the more specific is the motif for upstream sequences.

To know if this specificity is statistically significant, we use Likelyhood Ratio Test (LRT) which measure the difference of concentration of our motif in the two sets of sequences (test made on data corresponding to the bacteria the motif is got from). If R is greater than a threshold computed according to the proportion of upstream sequences in the studied bacteria, and LRT test gives a score greater than the quartile at 5 percent of the Xhi square law, the motif is considered as an interesting one.

Important Note: We have only used a minimum value threshold for LRT test, which decrease greatly the number of results given. Using our program on some bacteria, we suggest not to consider the motifs with higher LRT score as the best candidates. Known sigma factor binding sites have quite low LRT scores (from 20 to 50 in most cases) but with the highest R.

REQUIREMENT

System

SIGffRid runs on UNIX/Linux systems (tested on Ubuntu 16.04 LTS and Solus 4.1).

Dependencies

SIGffRid needs:

  • Perl (tested with version >= 5.8.8)

Install YAML

Install, using CPAN (cpan install )

  • XML::DOM::XPath (you probably will need to run sudo cpan -f -i XML:DOM::XPath (to force install) to avoid some problem with BioPerl installation (due to testing failures))
  • BioPerl
  • Bio::SeqIO (to extract data from GenBank files)
  • IPC::System::Simple (to run system command)
  • PerlIO::gzip (to read write zip files)
  • Proc::Daemon
  • Getopt::Long (for options)
  • Cwd (for current directory)
  • Statistics::Basic

Install rmes-3.1:

Configuration

add to your .bashrc

export SIGFFRID_DIR=path_to_your_sigffrid_global_directory
export ANSES_SCRIPTS_DIR=path_to_your_sigffrid_global_directory
export [email protected]

and if using SGE (replace long by the name of the SGE queue in which you want to submit jobs)

export ANSES_Q_NAME='long.q'

or if using SLURM (replace long by the name of the slurm queue in which you want to submit jobs)

export ANSES_SLURM_Q_NAME='long'

then type

source ~/.bashrc

Note: depending on SGE/SLURM parameters, mails relative to start/stop/errors will be sent at the mail address provided in ANSES_USER_MAIL environment variable.

WARNINGS

Memory usage

For a pair of bacterial genomes, SIGffRid requires at least 64 GBytes (RAM) (for genomes of 4 Mbases). The memory usage probably can go from 6 to 51 GBytes for bacterial genomes if you treat simultaneously all the pairs of orthologues found in genomes as long as 8 Mbases (typically Streptomyces coelicolor and Streptomyces avermitilis). In such a case, use "functional category" option (-use_fct) which groups sequences of pairs of orthologues by functional categories as defined in MBGD files. By using this option for big genomes, number of results (and false positives) will decrease, and memory usage too.

This option ("grouping by functional category") is NOT recommanded to treat genomes smaller than 5 Gbases: you will obtain more relevant results by using all orthologue relationships (in all cases).

Time consuming:

In all cases, SIGffRid will give results after 1 to 4 days (depending on "grouping by functional category" option , on number of seeds used) (100% CPU on a Intel core i7 8700K at 3.7 GHz, ssd NMVE).

Cluster mod 'SLURM' NOT TESTED YET (tested only with modes 'SGE' (Ubuntu) and 'no' (Solus).

Option -use_fct

This option groups pairs of orthologs by functional categories as defined in MBGD:

  • Amino acid biosynthesis
  • Purines, pyrimidines, nucleosides, and nucleotides
  • Fatty acid, phospholipid and sterol metabolism
  • Biosynthesis of cofactors, prosthetic groups, and carriers
  • Central intermediary metabolism
  • Energy metabolism
  • Transport and binding proteins
  • DNA replication, restriction, modification, recombination, and repair
  • Transcription
  • Translation
  • Regulatory functions
  • Cell envelope
  • Cellular processes
  • Other categories
  • Hypothetical

It decreases resources needs but restrict comparison to each category.

INPUTS

  • Ids of the two bacterial species: VERY IMPORTANT: Bacterial IDs MUST be the IDs used to identify the genome in MBGD file!!!!! (file called cluster when you download it, see 3 items below). We call bacterial ID the group of letters at the beginning of a gene ID (for example sco for Streptomyces coelicolor whose first gene locus is __SCO__0001). CAUTION: In some cases, genes IDs provided in MBGD are not the same as in the ncbi GenBank file.

  • Working directory: the directory where all the outputs and other intermediary files generated by SIGffRid program will be stored. Each working directory has a log file associated with it that stores information about the processes executed. It is a good advice to start SIGffRid program on a new dataset in a new working directory.

  • Genomic data: For each input species the user needs to input the genome file (GenBank format) for the whole bacterial genome (obtained from NCBI website at http://www.ncbi.nlm.nih.gov/genomes/static/eub_g.html). The input will be the sequence file (.gb/.gbk format). CAUTION: The .gb file MUST contain the sequence (default, not). Therefore, when the gb is display in your browser, check the box on the right site of the ncbi interface "show sequence" and then "update view". Afterwhat you can record the gb file by clicking on "Send", check "complet record", check "file", select "GenBank full".

  • Orthologous relationship file (MBGD file): This file is the file that contains information about the orthologous pairs of genes on the given set of species. This file can be obtained from the MBGD database (http://mbgd.genome.ad.jp/) or can be user-defined file. Such file can help to restrict the number of genes in analysis by using the user chosen genes in finding sigma factor binding sites. HELP TO GET ORTHOLOGOUS RELATIONSHIP AND FUNCTIONAL CATEGORY CLUSTERING FROM MBGD DATABASE:

  1. Go to MBGD homepage http://mbgd.genome.ad.jp/.
  2. Click on "Create Ortholog Table" button in the left column (send you to http://mbgd.genome.ad.jp/htbin/SelectOrganism.pl page).
  3. Check boxes related to the two species to compare in the displayed list (multiple selection or use the link to "taxonomy browser").
  4. Click on "Next" button at the top of the page.
  5. In the new page, Click on "Create/View Cluster Table" button.
  6. On the web page displayed, click on "Save Complete Table" button. The file obtained (default named as:'cluster.tab') is the one you have to give to SIGffRid as parameter to define orthologous relationships (you can rename it, its name has no importance).

PLEASE, CHECK cluster file after download (sometimes, database record does not work and cluster file has only the header line).

LAUNCH

Type in terminal

perl sigffrid_cmd_line.pl

and enter to get help.

To run genomes analysis, you can type:

perl sigffrid_cmd_line.pl  \
    -gb_f1 ../ref_genomes/bact1_chr_complete_genome.gb \
    -gb_f2 ../ref_genomes/bact2_chr_complete_genome.gb \
    -id_bact1 sco -id_bact2 sav -MBGD_f MBGD_file/cluster_SCO_SAV.tab \
    -res_dir sigffrid_res/

sco is for Streptomyces coelicolor A3(2) bacteria gene IDs, sav is for Streptomyces avermitilis bacteria gene IDs, for instance. Check bacterial ID to use in the downloaded file from MBGD database (must be the same in GenBank file at the beginning of gene loci: SCO0001).

  • Results folder (here sigffrid_res/) must exist before you run the program.
  • Use -filterout_16SrRNA_res option to automatically get 16S/18S RNA sequences and filter out motifs in results which match 16S/18S RNA sequences and have a strong conservation of their position compared to the +1 START sites of the downstream genes.
  • Use -cluster_mod 'sge' option to use SGE scheduler.

CAUTION: ensure the GenBank file you downloaded for each close bacteria contains the DNA sequence (it does not by default in NCBI web page when you request GenBank file).

CAUTION 2: the two bacteria you compare must be closely phylogenetically related, but not too close (ideally, maybe 97 or 98% of identity when you align their 16S RNA).

RESULTS

They consist in

  • motif consensus, occurrences, and upstream sequences where motifs are found,
  • annotation of genes concerned by each motif,

Located in: {Working_directory}/{your_id}/SIGffRid_orthologs/ directory

Listing of motifs sorted by score are in files:

  • res_sort_R_for_results_grep_{bacteria2_id}.txt
  • res_sort_R_for_results_grep_{bacteria1_id}.txt

You will find all motif occurrences and sequences they are extracted from in SIGffRid_results/MOTIF_UPSTR_treshmulti3_bsu_8/ directory, in the SIGffRid_results/MOTIF_UPSTR_treshmulti3_bsu_8/motif{bacterial ID}_{xxxxx}.txt file.

You will find annotations for genes related to a motif xxxxx in

  • MOTIF_ANNOT_UPSTR directory, in the
  • MOTIF_ANNOT_UPSTR/annot_motif{bacterial_ID}_{xxxxx}.txt file

Advanced users (using command lines) can know how many times a same motif is found by the program (with different seeds). This information, not provided by default can be used to select some interesting motifs (in addition to scores). For a complete list of motifs, type:

grep 'MOTIF' align* 

in SIGffRid_orthologs subdirectory if you do not use "grouping by functional category" option, otherwise type

grep 'MOTIF' */align*

REFERENCE

If you publish or use SIGffRid results, please cite:

Touzain F, Schbath S, Debled-Rennesson I, Aigle B, Kucherov G and Leblond P. SIGffRid: a tool to search for sigma factor binding sites in bacterial genomes using comparative approach and biologically driven statistics (article). BMC Bioinformatics 2008, 9(1):73. DOI:10.1186/1471-2105-9-73

Questions and comments should be sent to fabrice.touzain [@] anses.fr (or in github of course, especially for issues).

FINANCIAL SUPPORT

Fabrice Touzain was supported by the Région Lorraine and by the ACI IMPBio from the Ministère de l'Education Nationale, de l'Enseignement supérieur et de la Recherche.

An additional demi-ATER was usefull to finish this program and Fabrice Touzain PhD. Thank you again to the Région Lorraine.

ACKNOWLEDGMENT

Enormous THANKS to Professor Isabelle Debled-Rennesson (home) for her constant guidance and encouragement, to Sophie Schbath (home) for her statistical explanations and pedagogy to design the method, to Gregory Kucherov for the three hundred kilometers he drived under the snow to make my PhD defense possible in 2007.

Special thanks to PhD who shared by office for their help and kindness: Laurent Noé (home) and Laurent Provot (home).

LICENCE

Terms of the CeCILL licence can be found here: CeCILL

sigffrid's People

Contributors

ftouzain avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

sigffrid's Issues

Compatibility with small genomes

Hello,

I noticed in sigffrid_cmd_line.pl that there is an option "small_genome" which doesn't appear to ever be used. I'm currently trying to run an analysis with genomes around 2Mbases, which I figured would be considered small compared to the genomes of the example analyses shown for the program. I have only been able to get results from the program when using one specific comparison genome, but this one in particular is not 97-98% similar based on 16S comparison (my genome of interest is Porphyromonas gingivalis W83, and the comparison genome I was able to get results when using is Porphyromonas asaccharolytica). When running with other comparison genomes (some greater than 98% 16S similarity, some lower, but all higher than P.asaccharolytica), the program will run for maybe a minute, and after it completes, the main results files will be empty. It seems like perhaps the program is able to find motifs for these comparisons which are then not deemed significant. If this is a result of the smaller genome size, are there any parameters that can/should be adjusted to accommodate for this, or is this likely due to an issue of % similarity? Any recommendations would be sincerely appreciated!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.