Grape

Grape provides an extensive pipeline for RNA-Seq analyses, allowing the creation of an automated and integrated workflow to manage and analyze RNA-Seq data.

It uses Nextflow as the execution backend. Please check the Nextflow documentation for more information.

Grape has been adopted for RNA-seq integrative analysis within the IHEC consortium. Check the IHEC setup document to run the pipeline following IHEC recommendations.

Requirements

  • Unix-like operating system (Linux, macOS, etc.)
  • Java 11 or later
  • Nextflow 23.04.0 or later
  • Docker or Singularity engine

Quickstart

  1. Install Nextflow by using the following command:

    curl -s https://get.nextflow.io | bash
    
  2. Make a test run:

    nextflow run guigolab/grape-nf -with-docker
    

NOTE: the very first time you execute it, it will take a few minutes to download the pipeline from this GitHub repository and the Docker images needed for its execution.

Pipeline software

The preferred way to run the pipeline is to use Docker or Singularity to provision the programs needed for the execution. Just use the -with-docker or -with-singularity option in the pipeline command. Pre-built Grape containers are publicly available at the Grape page in Docker Hub.
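
For example, to make a test run using Singularity instead of Docker:

nextflow run guigolab/grape-nf -with-singularity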

Using Singularity

Singularity is the preferred container engine for running the pipeline in an HPC environment. To minimize potential issues, we recommend using Singularity version 3.0 or higher.

Image cache dir

The first time you run the pipeline with Singularity, the required images are downloaded from Docker Hub and saved in a folder inside the pipeline work dir. You can specify a different location (e.g. a centralized cache) by using the NXF_SINGULARITY_CACHEDIR environment variable, or by placing the following snippet in a file called nextflow.config in the current working folder of your pipeline:

singularity {
  cacheDir = "/data/singularity"
}
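
The same cache location can also be set through the environment variable before launching the pipeline (the path below is only an example):

# e.g. point to a centralized image cache
export NXF_SINGULARITY_CACHEDIR=/data/singularity
nextflow run guigolab/grape-nf -with-singularity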

Please check the Singularity section in Nextflow documentation for more information.

Bind mounts

Nextflow expects that data paths are defined system wide, and your Singularity images need to be able to access these paths. Singularity allows paths that do not currently exist within the container to be created and mounted dynamically by specifying them on the command line. For this to work the user bind control option must be set to true in the Singularity config file. Nextflow support for this feature is enabled by default for the pipeline, by defining the singularity.autoMounts = true setting in the main configuration file.
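
As a minimal sketch, extra host paths that cannot be inferred automatically can also be bound explicitly through the singularity configuration scope (the /data path below is a placeholder for a folder holding your input files):

singularity {
  autoMounts = true
  // bind an extra host path inside the containers (placeholder path)
  runOptions = '-B /data'
}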

Starting in version 3.0, Singularity can bind paths to non-existent mount points within the container even in the absence of the “overlay fs” feature, thus supporting architectures running legacy kernel versions (including RHEL6 vintage systems). For older versions of Singularity a kernel supporting the OverlayFS union mount filesystem is required for this functionality to be supported.

Please see here for further instructions on Singularity mounts.

Pipeline parameters

A usage message is provided and can be displayed by passing the --help option to the pipeline:

nextflow run guigolab/grape-nf --help

--index INDEX_FILE

  • specifies the path of the file containing the list of input files and the corresponding metadata (see the next section for more details).

--genome GENOME_FILE

  • sets the location of the input genome FASTA file

--annotation ANNOTATION_FILE

  • sets the location of the input GTF/GFF annotation file

--steps STEP[,STEP]..

  • defines the pipeline steps to be performed

--paired-end

  • specifies that the data is paired-end (to be used with BAM input files)
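
For example, assuming placeholder file names, a run restricted to the mapping, bigwig and quantification steps would look like this:

nextflow run guigolab/grape-nf --index input-files.tsv --genome genome.fa --annotation annotation.gtf --steps mapping,bigwig,quantification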

Mapping options

--max-mismatches THRESHOLD

  • set the maximum number of allowed mismatches

--max-multimaps THRESHOLD

  • set the maximum number of allowed multiple mappings per read

--bam-sort METHOD

  • set the sort method of the output BAM file

--add-xs

  • add the SAM tag XS to the output BAM file (useful for using the file with tools like Cufflinks or StringTie that use the tag to know the directionality of the split maps)
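
As an illustrative sketch combining the mapping options above (the 4/10 thresholds simply mirror the m4/n10 defaults visible in the test output file names; the input files are placeholders):

nextflow run guigolab/grape-nf --index input-files.tsv --genome genome.fa --annotation annotation.gtf --max-mismatches 4 --max-multimaps 10 --add-xs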

Read group options

These options are used to customize the @RG header tag of the BAM files produced by the mapping step, according to the SAM specifications.

--rg-platform PLATFORM

  • set the PL attribute

--rg-library LIBRARY

  • set the LB attribute

--rg-center-name CENTER_NAME

  • set the CN attribute

--rg-desc DESCRIPTION

  • set the DS attribute
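
As an illustration only, with placeholder values (the ID and SM fields are assigned by the pipeline from the run and sample identifiers, which is an assumption here; the actual header fields are tab-separated), the options above would yield a header line along these lines:

@RG ID:test1 SM:sample1 PL:ILLUMINA LB:lib1 CN:CRG DS:test run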

Pipeline input

The pipeline reads the paths of the FASTQ/BAM files to be processed and the corresponding metadata from a TSV file (see the --index parameter). The file must contain the following columns, in order:

  1. sampleID: the sample identifier, used to merge BAM files when multiple sequencing runs of the same sample are present
  2. runID: the run identifier (e.g. test1)
  3. path: the path to the FASTQ file (it can be absolute or relative to the TSV file)
  4. type: the file type (e.g. fastq)
  5. view: an attribute specifying the content of the file (e.g. FqRd for single-end data, FqRd1/FqRd2 for paired-end data)

NOTE: Please do not use Excel/LibreOffice or similar programs to create this file. You can use this online TSV editor, also available as a VSCode extension.

NOTE: Fastq files from paired-end data will be grouped together by runID.

Here is an example from the test run:

sample1  test1   data/test1_1.fastq.gz   fastq   FqRd1
sample1  test1   data/test1_2.fastq.gz   fastq   FqRd2

The sample and run identifiers can be the same in case you don't have or don't know the sample identifiers:

run1  run1   data/test1_1.fastq.gz   fastq   FqRd1
run1  run1   data/test1_2.fastq.gz   fastq   FqRd2

Pipeline results

The paths of the resulting output files and the corresponding metadata are stored in the pipeline.db file (TSV formatted), which sits inside the current working folder. The format of this file is the same as that of the index file, with a few additional columns:

  1. sampleID: the sample identifier, used to merge BAM files when multiple runs of the same sample are present
  2. runID: the run identifier (e.g. test1)
  3. path: the path to the output file
  4. type: the file type (e.g. bam)
  5. view: an attribute specifying the content of the file (e.g. GenomeAlignments)
  6. readType: the input data type (either Single-End or Paired-End)
  7. readStrand: the inferred experiment strandedness, if any (NONE for unstranded data, SENSE or ANTISENSE for single-end data, MATE1_SENSE or MATE2_SENSE for paired-end data)

Here is an example from the test run:

sample1   test1   /path/to/results/sample1.contigs.bed    bed      Contigs                     Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.isoforms.gtf   gtf      TranscriptQuantifications   Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.plusRaw.bw     bigWig   PlusRawSignal               Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.genes.gff      gtf      GeneQuantifications         Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/test1_m4_n10.bam       bam      GenomeAlignments            Paired-End   MATE2_SENSE
sample1   test1   /path/to/results/sample1.minusRaw.bw    bigWig   MinusRawSignal              Paired-End   MATE2_SENSE

Output files

The pipeline produces several output files during the workflow execution. Many files are to be considered temporary and can be removed once the pipeline completes. The following files are the ones reported in the pipeline.db file and constitute the pipeline's final output.

Alignments to the reference genome

views
GenomeAlignments

This BAM file contains information on the alignments to the reference genome. It includes all the reads from the FASTQ input. Reads that do not align to the reference are set as unmapped in the BAM file. The file can be the product of several steps of the pipeline depending on the given input parameters. It is initially produced by the mapping step, it can then be the result of merging different runs of the same experiment, and finally it can go through a duplicate-marking process that can optionally remove the reads marked as duplicates.

Alignments to the reference transcriptome

views
TranscriptomeAlignments

This BAM file contains information on the alignments to the reference transcriptome. It is generally used only for expression abundance estimation, as input in the quantification process. The file is generally produced in the mapping process and can be the result of merging of different runs from the same experiment.

Alignments statistics

views
BamStats

This JSON file contains alignment statistics computed with the bamstats program. It also reports RNA-Seq quality check metrics agreed within the IHEC consortium.

Signal tracks

views
RawSignal
MultipleRawSignal
MinusRawSignal
PlusRawSignal
MultipleMinusRawSignal
MultiplePlusRawSignal

These BigWig files (one or two, depending on the strandedness of the input data) represent the RNA-Seq signal.

Contigs

views
Contigs

This BED file reports RNA-seq contigs computed from the pooled signal tracks.

Quantifications

views
GeneQuantifications
TranscriptQuantifications

These two files report abundances for genes and transcripts in the processed RNA-seq samples. The format can be either GFF or TSV depending on the tool used to perform the quantification.

Pipeline configuration

Executors

Nextflow provides different executors to run the processes on the local machine, on a computational cluster or on different cloud providers, without the need to change the pipeline code.

By default the local executor is used, but it can be changed by using the executor configuration scope.

For example, to run the pipeline in a computational cluster using Sun Grid Engine you can create a nextflow.config file in your current working directory with something like:

process {
    executor = 'sge'
    queue    = 'my-queue'
    penv     = 'smp'
}
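
Similarly, to use a SLURM cluster (the queue name is a placeholder):

process {
    executor = 'slurm'
    queue    = 'my-queue'
}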

Pipeline profiles

The Grape pipeline can be run using different configuration profiles. The profiles essentially allow the user to run the analyses using different tools and configurations. To specify a profile you can use the -profile Nextflow option.

The following profiles are available at present:

profile    description
gemflux    uses the GEMtools pipeline for mapping and Flux Capacitor for isoform expression quantification
starrsem   uses STAR for mapping and signal tracks, and RSEM for isoform expression quantification
starflux   uses STAR for mapping and Flux Capacitor for isoform expression quantification

The default profile is starrsem.
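
For example, to make a test run with the starflux profile:

nextflow run guigolab/grape-nf -profile starflux -with-docker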

Run the pipeline

Here is a simple example of how you can run the pipeline:

nextflow -bg run grape-nf -r v1.1.4 --index input-files.tsv --genome refs/hg38.AXYM.fa --annotation refs/gencode.v21.annotation.AXYM.gtf --rg-platform ILLUMINA --rg-center-name CRG -resume > pipeline.log

It is strongly recommended to run one of the pipeline's released versions unless you have a very good reason not to do so. This is done via the -r command line option, as shown in the command above. Please see this section of the Nextflow documentation for more details.

By default the pipeline execution will stop as soon as one of the processes fails. This behaviour can be changed using the errorStrategy process directive, which can also be specified on the command line. For example, to ignore errors and keep processing you can use:

-process.errorStrategy=ignore
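
The same setting can be made persistent by adding the errorStrategy directive to the process scope of a nextflow.config file:

process {
    errorStrategy = 'ignore'
}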

It is also possible to run a subset of the pipeline steps using the option --steps. For example, the following command will only run mapping and quantification:

nextflow -bg run grape-nf --steps mapping,quantification --index input-files.tsv --genome refs/hg38.AXYM.fa --annotation refs/gencode.v21.annotation.AXYM.gtf --rg-platform ILLUMINA --rg-center-name CRG > pipeline.log

Tool versions

The pipeline can also be run natively by installing the required software on the local system or by using Environment Modules.

The versions of the tools that have been tested so far with the standard pipeline profile can be found in the pipeline repository.


grape-nf's Issues

Include QC?

Would it be possible to include some kind of QC?

I'm not sure if there is a set of QC parameters selected for RNA-seq within IHEC (similar to ChIP-seq and WGBS)? If something like this exists, I would suggest calculating those as the minimum.

Add flag to check/enforce strandness for a dataset

The pipeline allows processing data with different strandness within the same run. Many times all libraries in a dataset have the same strandness, and we would need to (1) check this and stop the pipeline if the automatic strandness inference gives different values, or (2) enforce a specific strandness for the process and skip the automatic detection.

The flag could be named --readStrand and have the following values:

  • CHECK for (1)
  • one of NONE, SENSE, ANTISENSE, MATE1_SENSE, MATE2_SENSE for (2)
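
A sketch of how the proposed flag could be used, assuming it gets implemented as described (the flag is a proposal from this issue, not an existing pipeline option; file names are placeholders):

# --readStrand is the flag proposed in this issue, not yet implemented
nextflow run guigolab/grape-nf --index input-files.tsv --genome genome.fa --annotation annotation.gtf --readStrand MATE2_SENSE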

List of software versions

Would it be possible to include a list of required software with versions?

In principle I find the Docker solution excellent, but I would need to run the pipeline on a cluster on which I'm not allowed to use Docker. I already got it running, but I'm sure I don't have the right versions of all the software.

STAR alignment for multiple lane fastq

Hi Developers,
I ran into a problem regarding STAR alignment using multiple-lane FASTQ files.
Given an input read index file containing the following:
sample1 run1 path_to_file/mRNA_L005_R1_001_val_1.fq.gz fastq FqRd1
sample1 run1 path_to_file/mRNA_L005_R2_001_val_2.fq.gz fastq FqRd2
sample1 run1 path_to_file/mRNA_L006_R1_001_val_1.fq.gz fastq FqRd1
sample1 run1 path_to_file/mRNA_L006_R2_001_val_2.fq.gz fastq FqRd2

Here is the command generated by Nextflow:
STAR --runThreadN 1 --genomeDir hs38 --readFilesIn mRNA_L005_R1_001_val_1.fq.gz mRNA_L006_R1_001_val_1.fq.gz mRNA_L005_R2_001_val_2.fq.gz mRNA_L006_R2_001_val_2.fq.gz --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 10 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --readFilesCommand pigz -p1 -dc --outSAMtype BAM Unsorted --outStd BAM_Unsorted --quantMode TranscriptomeSAM --outSAMattrRGline ID:run140212_SN758_0152_BC3ETGACXX PU:run140212_SN758_0152_BC3ETGACXX SM:41_Hf02_LiHe_Ct_replicate1 | samtools sort -@ 0.5 -m 536870912 - run140212_SN758_0152_BC3ETGACXX_m4_n10_toGenome

I found that the mapping process is not actually running even though the workflow reports nothing wrong. Actually, the workflow didn't run anything.

I am asking whether it's possible to change how reads are supplied to the STAR mapping process through --readFilesIn, in the same way shown here, with commas separating the different lanes.

Looking forward to your response

Read 1 decided by order of index file, not FqRd1/FqRd2 (Bug?)

The way I read the documentation, I thought the file containing Mate/Read 1 was decided by the FqRd1 tag ==> I didn't care about the order of my index file and it was sometimes jumbled. Analyzing a set of experiments I ended up with a mixture of MATE1_SENSE, MATE2_SENSE and NONE assignments. This was incorrect.

It seems the first read in the index file is treated as Mate 1, independent of what is given in the view column (column 5).

I think it comes down to this:

.map { line ->

Column 5 is never taken into account. For me, this is not intuitive.

Example:
One sample with two pairs of fastq files:

index:

sample A read1_001.fq.gz fastq FqRd1
sample A read2_001.fq.gz fastq FqRd2
sample B read2_002.fq.gz fastq FqRd2
sample B read1_002.fq.gz fastq FqRd1

==> resulting alignment step

A:
STAR .... --readFilesIn read1_001.fq.gz read2_001.fq.gz ....
B:
STAR .... --readFilesIn read2_002.fq.gz read1_002.fq.gz ....

B is wrong

Solutions:

  • Make excruciatingly clear in the documentation that the index file has to be sorted
  • Make use of the information in column 5?

I would prefer the latter solution and would have submitted a patch, but this goes a bit too deep into python/nextflow for me to feel comfortable. I will sort my index files from now on.

Move inferExp after markdup step

In some cases using the marked BAM would resolve errors inferring the strandness of the experiment as the tool ignores duplicated reads.

Documentation

Add the description of the option --add-xs to the documentation

Add option for setting read fraction threshold when detecting strandness

A read fraction is used as a threshold to decide the strandness of an experiment. The current value is 0.8 but for some experiments (e.g. low quality data) this is too high and leads to an incorrect inference.

The idea is to add a command line option to the grape_infer_experiment.py script and to the pipeline itself. The option will be a floating point number in the range [0,1] and will be initialised with a default value of 0.8.
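
A sketch of the proposed interface, assuming a hypothetical --fraction option on the script (the flag name and arguments are illustrative only, not the actual implementation):

# hypothetical flag proposed in this issue; not yet implemented
grape_infer_experiment.py --fraction 0.7 ...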

Using on local machine

Hello!
I am trying to run grape-nf on my local machine, and it is a bit limited on memory (16 GB RAM).
So when I'm running the pipeline, I get an error:

[warm up] executor > local
Exception in thread "Task submitter" nextflow.exception.ProcessNotRecoverableException:
Process requirement exceed available memory -- req: 62 GB; avail: 16 GB

Is there a way to run the pipeline with a limited amount of memory?

Thank you!

grape-nf does not execute some steps

Hi @emi80

I ran grape-nf with the following command inside a conda environment (however, not all the tools were installed through conda):
nextflow -c /home/abdosa/.nextflow/assets/guigolab/grape-nf/resource.config run -qs 1 -without-docker -w work grape-nf --index readIndex.tsv --genome $genomeFasta --annotation $transcriptAnnotation --steps mapping,bigwig,quantification --genomeIndex $genomeIndex

with genomeFasta=GCA_000001405.15_GRCh38_no_alt_analysis_set.fna (I'm not sure if .fna or .fa matters)

It finished successfully without any error, but without performing the bigwig and quantification steps.
I attached the log file and the detailed nextflow.log

grape-nf.log
nextflow.log

Best,
Abdul

Citation information

Hi,
our first project with grape-nf has reached the publication phase and I was wondering if you have a preferred way of citing it?
Otherwise I will refer to this GitHub repository...
Googling and thinking a bit I came up with https://zenodo.org/

Best,
Karl

Enhancement: bioconda?

Hi,
I spent some time adding missing software versions to Bioconda, which I have come to know as an easy way to set up and install bioinformatics software. I'm still running tests, but it seems to work just fine.

  1. If you want I can provide an environment file
  2. If yes, I still have to add the legacy versions of kentUtils and wanted to ask which tools are used. I have identified bedGraphToBigWig, gtfToGenePred and genePredToBed. Did I miss any?

All recipes provided by Bioconda are also provided as Docker/Singularity containers through BioContainers. It is also possible to generate multi-tool containers. Not sure if it is interesting, but it would relieve you of providing the Docker images.

Add step to download input data

Use a specific process because:

  • data source URLs might be unstable
  • use multiple source URLs and stop as soon as one works
  • use storeDir to keep input data in a specific location
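
A minimal Nextflow sketch of such a process, assuming a single URL per input file (the process name, params.inputDir and the URL handling are illustrative only):

process downloadInput {
    // keep downloaded files outside the work dir so they can be reused across runs
    // params.inputDir is a hypothetical parameter naming the storage location
    storeDir params.inputDir

    input:
    val url

    output:
    path "${url.tokenize('/')[-1]}"

    script:
    """
    curl -fsSL '${url}' -o '${url.tokenize('/')[-1]}'
    """
}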

Markdup step TimeOut exiting with IHEC usage

Hi,

I'm using the pipeline as part of IHEC, with the following versions:

  • Singularity version 3.2.1-1.el7
  • IHEC Fork of grape-nf
  • Nextflow version 19.04.0.5069

The pipeline itself is running well, except for the mergeBam step (not always working). When it comes to the markdup step, it takes a very long time to end (I tried allowing up to 3 days) and finishes with TIMEOUT. I noticed that the sambamba cmd is started but "stuck", using 0% CPU (monitoring with htop). Then I tried the sambamba cmd defined inside .command.sh outside the container (it worked) and inside the container (it worked too). I don't know what's happening there. If you need me to add some logs please tell me.

Thanks,
Paul

Add mark-duplicates step

Add an optional step to mark duplicates. Also add an option to remove the duplicates if needed.

Change view for the quantification process

The view attribute for the quantification process currently uses the annotation name or the RSEM transcript index name dynamically, e.g. GeneAnnotation or GeneTxDir.

To make the attribute value uniform with the other attributes in the pipeline.db file, I propose changing the quantification views to GeneQuantifications for genes and TranscriptQuantifications for transcripts. This way the values will have clearer semantics, won't depend on the reference annotation name and will always be the same.
