irna-cosi / apaeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples

License: MIT License

Shell 1.24% Python 54.23% Nextflow 38.07% Dockerfile 3.29% R 3.16%

alternative-polyadenylation benchmark bioinformatics open-science rna-seq

apaeval's People

Contributors

dominikburri chelseaherdman mrgazzara daneckaw mfansler melinaklostermann lschaerfen fitzsimmonscm dimmestp ninsch3000 vallurumk grexor sambryce-smith dexwel

apaeval's Issues

Execution Workflow: PAQR

WHAT

Write execution workflow for PAQR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

~~Output: Adhere to output specification for Identification challenge~~ No identification
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.
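The quantification BED layout described above can be sketched in a few lines of Python. This is only an illustration of the column order: the function names and the example site are hypothetical, and chromStart and chromEnd are set to the same position because that is what the specification text above asks for.

```python
def format_bed_line(chrom, pos, name, score, strand):
    """Return one 6-column BED line for a single-nucleotide poly(A) site.

    Following the specification text, chromStart and chromEnd are both
    set to the same position.
    """
    if strand not in {"+", "-", "."}:
        raise ValueError(f"invalid strand: {strand!r}")
    return "\t".join([chrom, str(pos), str(pos), name, str(score), strand])


def quantification_bed(sites):
    """Format (chrom, pos, name, tpm, strand) tuples as headerless BED text."""
    return "".join(
        format_bed_line(chrom, pos, name, tpm, strand) + "\n"
        for chrom, pos, name, tpm, strand in sites
    )
```

The returned text can be written to the output file as-is; the TPM value goes in the score column.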

Collect parameters that significantly alter a method's behaviour

in order to create the parameter_codes we'll need to collect parameters that change a method's behaviour in a significant way. I've created an additional column in the methods spreadsheet for that.
Could you all please indicate there what you came across while working on your method? Every method is different, so you need to judge for yourself whether a specific parameter might make a huge difference in the benchmarking challenge, and then we might be able to account for that by running a benchmark multiple times.
Thanks a lot for your help!!!

Matching up predicted with ground truth sites

Both for identification and for quantification we need to identify which ground truth site corresponds to each predicted site, if a 'site' is not a single base. We assume that ground truth sites, whether single position or interval, are correct and non-overlapping. The script should take as input BED files for both predicted and ground truth data, as well as a distance parameter specifying by how many nucleotides the predicted sites should be extended to the left and right. Predicted sites that overlap can be merged and their expression summed up. For each predicted site/cluster, the ground truth sites that overlap with it can then be identified. If a predicted site/cluster overlaps with more than one ground truth site, it can be assigned to each of the overlapping ground truth sites with a fractional weight. This could be calculated as the number of overlapping nucleotides divided by the length of the predicted site/cluster. The expression can be assigned proportionally to these weights.

To have the full information for computing various metrics the output should contain:

1. ID, start and end of predicted site
1. ID, start and end of overlapping ground truth site
1. weight with which the predicted site was assigned to the ground truth site
1. expression of the predicted site assigned to the ground truth site
1. expression level of ground truth site
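A minimal Python sketch of the matching procedure described above, not the actual script. It assumes all sites are on one chromosome and strand, that intervals are half-open with end > start (single-nucleotide sites given as start == end would need converting first), and it interprets "proportionally" as dividing each weight by the sum of weights for that predicted cluster. Function names and dictionary keys are hypothetical.

```python
def merge_predicted(predicted, distance):
    """Extend predicted sites by `distance` on both sides and merge overlaps.

    `predicted` is a list of (site_id, start, end, expression) tuples.
    Expression of merged sites is summed; IDs are joined with ','.
    """
    extended = sorted(
        (max(0, start - distance), end + distance, site_id, expr)
        for site_id, start, end, expr in predicted
    )
    merged = []
    for start, end, site_id, expr in extended:
        if merged and start < merged[-1][2]:
            prev_id, prev_start, prev_end, prev_expr = merged[-1]
            merged[-1] = (f"{prev_id},{site_id}", prev_start,
                          max(prev_end, end), prev_expr + expr)
        else:
            merged.append((site_id, start, end, expr))
    return merged


def match_sites(predicted, ground_truth, distance):
    """Assign each merged predicted cluster to overlapping ground truth sites."""
    rows = []
    for pred_id, p_start, p_end, p_expr in merge_predicted(predicted, distance):
        hits = []
        for gt_id, g_start, g_end, g_expr in ground_truth:
            overlap = min(p_end, g_end) - max(p_start, g_start)
            if overlap > 0:
                # Weight: overlapping nucleotides / length of predicted cluster
                hits.append((gt_id, g_start, g_end, g_expr,
                             overlap / (p_end - p_start)))
        total = sum(hit[4] for hit in hits)
        for gt_id, g_start, g_end, g_expr, weight in hits:
            rows.append({
                "pred_id": pred_id, "pred_start": p_start, "pred_end": p_end,
                "gt_id": gt_id, "gt_start": g_start, "gt_end": g_end,
                "weight": weight,
                "pred_expression": p_expr * weight / total,
                "gt_expression": g_expr,
            })
    return rows
```

Each returned row carries the five pieces of information listed above. Predicted clusters without any overlapping ground truth site produce no row in this sketch; whether they should be reported separately is left open.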

Put input files in S3 bucket

Put all (currently available) input files into AWS S3 bucket s3://rnasoc-scratch-us-east-1 under prefix input_files.

Generally, the procedure to set up aws-cli and interact with an s3 bucket is described in this guide. See below for a detailed description of all of the steps that need to be taken:

Ask Alex to DM you the access keys
Install AWS CLI (follow the corresponding link in the guide above)
Setup AWS CLI credentials via the keys obtained from Alex (follow the Configuration Basics link in the guide above)
Create an empty directory; move inside that directory and create a subdirectory input_files, then copy all files that you want to upload into that subdirectory; make sure that you are still in the original directory you created and that this directory only contains the single subdirectory input_files
Upload all files to s3 with the following command: aws s3 sync . s3://rnasoc-scratch-us-east-1; if the files were organized as described before, all files in the directory input_files will be created in the s3 bucket under the prefix input_files
Verify that all files (now more accurately objects) are available in the bucket by running aws s3 ls s3://rnasoc-scratch-us-east-1/input_files --recursive
Also verify that you can download a file with aws s3 cp s3://rnasoc-scratch-us-east-1/input_files/{name_of_file_you_want_to_download} .
Create a table that lists all input files, descriptions of the files and their s3 URIs
Share the table in channels #general and #data
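For the table of input files, a small Python helper could derive the s3 URIs from the file names. This is a sketch only: the column headers, the function name and the example file names are hypothetical, while the bucket and prefix are the ones given in the steps above.

```python
BUCKET = "rnasoc-scratch-us-east-1"
PREFIX = "input_files"


def input_file_table(descriptions, bucket=BUCKET, prefix=PREFIX):
    """Return a TSV table of file name, description and s3 URI.

    `descriptions` maps each uploaded file name to a short description.
    """
    lines = ["file\tdescription\ts3_uri"]
    for name in sorted(descriptions):
        uri = f"s3://{bucket}/{prefix}/{name}"
        lines.append(f"{name}\t{descriptions[name]}\t{uri}")
    return "\n".join(lines) + "\n"
```

The resulting text can be pasted into a spreadsheet or shared directly.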

Compile parameter codes

parameter_codes describing which datasets are to be run with which method parameters (if applicable) has to be created.

I1: Specification - Compute Resources

Create specification for compute resources.

Working definition: Obtain execution time and memory usage in MiB for method execution. If multiple rules or samples for method: sum time and get max MiB.
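The aggregation in the working definition (sum the time, take the maximum memory) could look like the Python sketch below. The function name and the output keys are placeholders, not part of any agreed specification.

```python
def summarize_resources(records):
    """Aggregate per-rule or per-sample resource records for one method.

    `records` is a list of (seconds, max_mib) tuples.
    Time is summed; memory is the maximum over all records.
    """
    if not records:
        raise ValueError("at least one record is required")
    return {
        "run_time_sec": sum(seconds for seconds, _ in records),
        "max_mem_mib": max(mib for _, mib in records),
    }
```

For example, two rules taking 10 s at 200 MiB and 5.5 s at 350 MiB would be reported as 15.5 s and 350 MiB.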

Execution Workflow: APAtrap

WHAT

Write execution workflow for APAtrap. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.
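A workflow could check its identification output against the field descriptions above before handing it on. The Python sketch below is one possible check, with a hypothetical function name; it only covers the rules stated in this issue (six fields, equal start and end, "." as score, valid strand).

```python
def validate_identification_line(line):
    """Return a list of problems found in one identification BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return [f"expected 6 tab-separated fields, found {len(fields)}"]
    chrom, start, end, name, score, strand = fields
    problems = []
    if not (start.isdecimal() and end.isdecimal()):
        problems.append("chromStart and chromEnd must be non-negative integers")
    elif int(start) != int(end):
        problems.append("chromStart and chromEnd must be equal")
    if score != ".":
        problems.append('score must be "."')
    if strand not in {"+", "-", "."}:
        problems.append('strand must be "+", "-" or "."')
    return problems
```

An empty list means the line passed all checks.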

Q3: Specification - Similarity of per-gene polyadenylation site distributions

Create specification for Q3.

Working definition: Distance between poly(A) site distributions from quantification and from the corresponding 3'end dataset calculated for each gene separately.

Notes: Could either be for all genes or a subset. Possibly with some threshold value to count genes with correct quantification. Whereas Q2 focuses on expression value, this measures site usage (values in [0,1]).
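To make the working definition concrete, here is a Python sketch for a single gene. The working definition does not name a distance measure; total variation distance is used here purely as an assumption, and the function names are hypothetical.

```python
def relative_usage(site_expression):
    """Convert per-site expression of one gene to usage fractions in [0, 1]."""
    total = sum(site_expression.values())
    if total <= 0:
        raise ValueError("total expression must be positive")
    return {site: value / total for site, value in site_expression.items()}


def usage_distance(predicted, ground_truth):
    """Total variation distance between two site usage distributions.

    Both arguments map site IDs to expression values for the same gene.
    Sites missing from one side count as usage 0 on that side.
    """
    p = relative_usage(predicted)
    q = relative_usage(ground_truth)
    sites = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in sites)
```

The result is 0 for identical usage and 1 when the two distributions share no sites, so a threshold on it could be used to count genes with correct quantification.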

Data processing: Upload A-seq2 HEK293 ground truth data

We need the bed files with TPMs from the A-seq2 pipeline of the 6 HEK293 samples

I've made a directory in the APAeval Google Drive for initial uploading and sharing of the "ground truth"/orthogonal 3'end data so you can upload those here: https://drive.google.com/drive/folders/1bbESWHzHG7Wv3Bk2FmNnateFophBuP8u?usp=sharing

Uploading based on the PAS cluster coordinates (instead of a single nucleotide) should be okay for now.

Execution Workflow: diffUTR

WHAT

Write execution workflow for diffUTR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

~~Output: Adhere to output specification for Identification challenge~~ No identification
~~Output: Adhere to output specification for quantification challenge~~ No quantification
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.
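The two-column, headerless TSV described above could be produced as in the following Python sketch. The function name and the gene IDs in the example are illustrative only.

```python
def differential_usage_tsv(results):
    """Format (gene_id, significance) pairs as two-column TSV text.

    No header line is emitted, as required by the specification text.
    """
    return "".join(f"{gene_id}\t{significance}\n" for gene_id, significance in results)
```

For instance, `differential_usage_tsv([("geneA", 0.01)])` yields a single line with the gene ID and its significance value separated by a tab.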

Data Processing: HEK293 RNA-seq with NF-Core

We need to align 2 replicates each for siControl-treated and siHNRNPC-treated HEK293T cells to hg38 using the nf-core RNA-seq workflow.

File accessions are:
SRR1573494
SRR1573495
SRR1573496
SRR1573497

More details on these samples are here:
https://docs.google.com/spreadsheets/d/17UEC823JupTH6pARdIjSKdOcXclLUSUC_zawfGtapFY/edit#gid=0&range=10:13

Data Processing: Mouse cortex RNA-seq with NF-Core

We need to align 2 replicates each for embryonic and adult mouse cortex to mm10 using the nf-core RNA-seq workflow.

File accessions are:
SRR1811005
SRR3067958
SRR3067957
SRR3067959

More details on these samples are here:
https://docs.google.com/spreadsheets/d/17UEC823JupTH6pARdIjSKdOcXclLUSUC_zawfGtapFY/edit#gid=0&range=A53:P56

Can an extra output file be added as a checkpoint for the Snakemake workflow?

Hi developers,

I was wondering if it makes sense to add an extra output file for each rule as a checkpoint, something along the lines of

output:
    out=os.path.join(config["out_dir"], "{sample}", "execute.out"),
    # Marker file that is only written after the command has finished
    finisher=os.path.join(config["out_dir"], "{sample}", "execute.done")
shell:
    """
    (EXECUTE_COMMAND \
        -i {input.bam} \
        -o {output.out} \
        -p1 {params.param1} \
        -p2 {params.param2}) \
        &> {log}
    echo "DONE" > {output.finisher}
    """

This acts as a checkpoint for each rule in cases where a program only partially executes or the node exits abruptly.
Thank you.

Data Processing: create ground truth from Mayr 3'end seq samples

In order to be able to use the primary immune cell data we'll have to process the available 3'end seq data from this study.
Starting point for creating a workflow could be the rules for 3'-Seq (Mayr) from the PolyASite workflow. However, an additional processing step which removes highly abundant immunoglobulin reads, as described in the publication, might have to be included (Attaching the corresponding paragraph from the Methods section of the publication)
mayr_primary_immune_cells_processing.txt

I4: Specification - Site Feature Annotation

Create specification for I4.

Working Definition: For each site, annotate it with the feature(s) that intersect it. Major features are:

utr5
cds
utr3
intronic
intergenic

Notes: We may additionally want to include some downstream labels, e.g.,

ds_1kb
ds_5kb
ds_10kb
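A minimal Python sketch of the annotation step, under simplifying assumptions: features are given as (label, start, end) tuples with half-open coordinates on the same chromosome and strand as the site, and a site that intersects no feature is labelled intergenic. The downstream labels from the notes are not covered, and the function name is hypothetical.

```python
def annotate_site(position, features):
    """Return the sorted labels of all features that contain `position`.

    `features` is a list of (label, start, end) tuples, e.g. with labels
    utr5, cds, utr3 or intronic. Sites outside all features are labelled
    "intergenic".
    """
    labels = sorted(
        {label for label, start, end in features if start <= position < end}
    )
    return labels or ["intergenic"]
```

Returning a list keeps sites that fall into several features, for example where a CDS and a 3' UTR of different transcripts overlap.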

Execution Workflow: mountainClimber

WHAT

Write execution workflow for mountainClimber. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

I2: Specification - Sensitivity and FDR against ground truth

Define a specification file for I2.

Working definition: Compare output of tool's RNA-seq-based site identification to "ground truth" from the orthogonal 3'-end sequencing data.

Sensitivity = TP/(TP+FN)
FDR = FP/(TP+FP)
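The two formulas above translate directly into Python. How TP, FP and FN are counted is not fixed in this issue; the sketch below assumes that a predicted site with at least one matched ground truth site is a TP, an unmatched predicted site is an FP, and an unmatched ground truth site is an FN. Function names are hypothetical.

```python
def count_outcomes(predicted_ids, ground_truth_ids, matches):
    """Count TP, FP and FN from (predicted_id, ground_truth_id) match pairs."""
    matched_pred = {pred for pred, _ in matches}
    matched_gt = {gt for _, gt in matches}
    tp = len(set(predicted_ids) & matched_pred)
    fp = len(set(predicted_ids) - matched_pred)
    fn = len(set(ground_truth_ids) - matched_gt)
    return tp, fp, fn


def sensitivity_and_fdr(tp, fp, fn):
    """Return (sensitivity, FDR); a value is None if its denominator is zero."""
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else None
    fdr = fp / (tp + fp) if (tp + fp) > 0 else None
    return sensitivity, fdr
```

Returning None for an undefined ratio avoids a division by zero when a tool predicts no sites at all.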

Extend Snakemake Pilot execution workflow for AWS compatibility

The pilot execution workflow is running with conda environments. AWS does not support conda environments but only containers.

TO-DO

Use test_data for integration test.
Consider using snakemake profiles for AWS execution.
Containerise execution of snakemake rules. Use biocontainers wherever possible.
- for MISO, this biocontainer might be usable: https://biocontainers.pro/tools/misopy.
- Snakemake provides functionality to create containers from conda environments: link.
Leave execution with conda as commented example for local testing.
Implement correct output directory structure and file names.
Update pilot README.md.

Comments

Ensure that sample file paths resolve in AWS buckets.

Inconsistent envs for calls to snakemake and nextflow

From an earlier discussion the issue arose that nextflow and snakemake are executed from different (inconsistent) conda environment specifications.
To solve this I need to

Modify the snakemake.yml to environment.yml and add both snakemake and nextflow specific versions
Replace the nextflow env from the templates

I cannot find the conda env from which nextflow is triggered. Can someone point it out? Is such an environment needed in the context of nextflow? @uniqueg @dominikburri @yuukiiwa

Data Processing: GTEx based simulation data with nf-core RNA-seq

I will align a subset of simulated cerebellum and skeletal muscle fastq files generated for a manuscript in preparation from Yoseph's lab using nf-core RNA-seq workflow.

My initial plan is to use 10 cerebellum and 10 muscle samples to allow for reproducibility to be calculated.

docs: Check links

Go through all README and other documentation files and check for missing/broken links, unclear explanations etc. Either fix yourself or reach out on Slack to clarify how to fix. Create new issues that might arise, if applicable.

Q2: Specification - Correlation to 3'-End Seq

Create specification for Q2.

Working definition: Compute the correlation coefficient between the tool's quantification obtained from RNA-seq and the quantification from the corresponding 3'-end sequencing dataset (for each pair of test datasets).
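As an illustration of the working definition, the Python sketch below computes a Pearson correlation over sites present in both datasets. Both choices (Pearson rather than, say, Spearman, and restricting to shared sites) are assumptions, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlate_quantification(predicted, ground_truth):
    """Correlate expression of sites found in both dicts (site ID -> value)."""
    shared = sorted(set(predicted) & set(ground_truth))
    return pearson([predicted[s] for s in shared],
                   [ground_truth[s] for s in shared])
```

Sites that appear in only one of the two datasets are ignored here; treating them as zero expression instead would change the coefficient.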

Execution Workflow: TAPAS

WHAT

Write execution workflow for TAPAS. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Kamikaze Pilot: Decide on one benchmark

Which specification is ready to go? Q2

uses ground truth? yes
relatively easy to implement?
@lschaerfen has submitted some scripts (#66) that could be used. Need to be integrated into a summary workflow.

To Do:

choose a suitable dataset for the benchmark (#78)
create a param code (#75)
If you would like to help with this, please help review the proposed param codes!
communicate the benchmark to the hackers
support the developers of the corresponding summary workflow
update this issue description

feat(Execution workflow): APA-Scan

WHAT

Write execution workflow for APA-Scan. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Can APA-Scan do differential PAS quantification? If yes:
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Question: what are you missing in the README in execution_workflows directory?

The README should contain more precise information on what these execution workflows should look like.
Some open questions:

what exactly does preprocessing mean within execution wfs? NOT QC or adapter removal, as those will be done in nf-core/rnaseq
test files for development (which are inputs, which are outputs, which to take?)
naming convention for output files (DATASET_NAME/METHOD_NAME/CHALLENGE_NAME.bed ???)
What should the outputs be? one file per sample? replicates aggregated?
explain challenges, identifiers, etc.;
what should the samples table in the snakemake template contain
how should arguments to methods be handled (always use defaults, contact developers, interface all arguments?)
if a method needs to be called differently for the different challenges, should that be done in one workflow or in separate ones?
...

Please comment on this issue with any open questions you have about execution workflows, or with suggestions on standardizing and documenting the execution workflows!

Add All Contributors app

Nice way to have everyone's code contributions tracked more prominently: https://allcontributors.org/

D1: Specification - Compute Resources

Create specification for D1.

Working Definition: Obtain execution time and memory usage in MiB for method execution. If multiple rules or samples for method: sum time and get max MiB.

Q1: Specification - Compute Resources

Create specification for Q1.

Working definition: Obtain execution time and memory usage in MiB for method execution. If multiple rules or samples for method: sum time and get max MiB.

Kamikaze Pilot: Decide on one dataset

Which dataset with corresponding ground truth is ready to go?
How do execution workflows have to be run on that? (If in doubt, would an individual sample (that has a ground truth) be a good start?)
To Do:

choose a dataset suitable for the Kamikaze Pilot benchmark (#77)
create a param code (#75)
communicate availability (also for genome/annotation files) to the hackers, along with instructions on what to take care of
update this issue description

param code

Work in progress discussion at PR93. Also available on GDrive sheet

data availability s3 bucket

The mapping of accession numbers, sample names, and s3 URI links to input bam and ground truth bed files consistent with the param code is in the table below (may be updated as Nextflow Tower executions finish). Also available on GDrive sheet

accession	sample_name	s3_uri	file_type
SRR6795718	Mayr_CD5B_R3	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795718.bam	bam_noChr
SRR6795719	Mayr_CD5B_R4	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795719.bam	bam_noChr
SRR6795720	Mayr_NB_R1	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795720.bam	bam_noChr
SRR6795721	Mayr_NB_R2	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795721.bam	bam_noChr
SRR6795723	Mayr_NB_R3	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795723.bam	bam_noChr
SRR6795724	Mayr_NB_R4	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795724.bam	bam_noChr
SRR6795725	Mayr_NB_R5	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795725.bam	bam_noChr
SRR6795722	Mayr_NB_R6	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795722.bam	bam_noChr
SRR6795726	Mayr_M_R2	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795726.bam	bam_noChr
SRR6795727	Mayr_M_R6	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795727.bam	bam_noChr
SRR6795713	Mayr_GC_R2	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795713.bam	bam_noChr
SRR6795714	Mayr_GC_R4	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795714.bam	bam_noChr
SRR6795715	Mayr_GC_R1	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795715.bam	bam_noChr
SRR6795716	Mayr_GC_R3	s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795716.bam	bam_noChr
SRR6795684	Mayr_CD5B_R3	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R3.SRR6795684.3seq.hg38.bed	ground_truth
SRR6795685	Mayr_CD5B_R4	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R4.SRR6795685.3seq.hg38.bed	ground_truth
SRR6795686	Mayr_CD5B_R5	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R5.SRR6795686.3seq.hg38.bed	ground_truth
SRR6795687	Mayr_CD5B_R6	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R6.SRR6795687.3seq.hg38.bed	ground_truth
SRR1005606	Mayr_NB_R1	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R1.SRR1005606.3seq.hg38.bed	ground_truth
SRR1005607	Mayr_NB_R2	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R2.SRR1005607.3seq.hg38.bed	ground_truth
SRR6795688	Mayr_NB_R3	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R3.SRR6795688.3seq.hg38.bed	ground_truth
SRR6795689	Mayr_NB_R4	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R4.SRR6795689.3seq.hg38.bed	ground_truth
SRR6795690	Mayr_M_R1	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_M_R1.SRR6795690.3seq.hg38.bed	ground_truth
SRR6795691	Mayr_M_R2	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_M_R2.SRR6795691.3seq.hg38.bed	ground_truth
SRR6795693	Mayr_GC_R2	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_GC_R2.SRR6795693.3seq.hg38.bed	ground_truth
SRR6795692	Mayr_GC_R1	s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_GC_R1.SRR6795692.3seq.hg38.bed	ground_truth
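To find out which samples in the table above have both an input bam and a ground truth bed file, the table can be parsed with a few lines of Python. This is a sketch assuming the table is available as tab-separated text with the header shown above; the function name is hypothetical.

```python
import csv
import io


def pair_by_sample(table_text):
    """Map sample_name to its bam and ground truth URIs.

    Only samples that have both a bam row and a ground_truth row are returned.
    """
    by_sample = {}
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        # File types in the table are "bam_noChr" and "ground_truth"
        kind = "bam" if row["file_type"].startswith("bam") else row["file_type"]
        by_sample.setdefault(row["sample_name"], {})[kind] = row["s3_uri"]
    return {
        sample: uris
        for sample, uris in by_sample.items()
        if "bam" in uris and "ground_truth" in uris
    }
```

Samples that appear with only one file type, such as Mayr_NB_R5 (bam only) or Mayr_M_R1 (ground truth only), are left out of the result.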

Execution Workflow: IsoSCM

WHAT

Write execution workflow for IsoSCM. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Does IsoSCM perform PAS quantification? If yes:
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Execution Workflow: DaPars2

WHAT

Write execution workflow for DaPars2. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Execution Workflow: GETUTR

WHAT

Write execution workflow for GETUTR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
~~Output: Adhere to output specification for differential usage challenge~~ No differential usage

Execution Workflow: QAPA

WHAT

Write execution workflow for QAPA. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

~~Output: Adhere to output specification for Identification challenge~~ No identification
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.
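
A minimal sketch of writing the differential usage TSV described above, assuming the significance values are already available as numbers. The function name `write_differential_usage` and the gene IDs are made up for illustration; the issue text only fixes the two columns and the absence of a header row.

```python
import csv
import io


def write_differential_usage(rows, handle):
    """Write (gene ID, significance) pairs as a headerless two-column TSV.

    Hypothetical helper for illustration only.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene_id, significance in rows:
        # No header row is written, as required by the specification.
        writer.writerow([gene_id, significance])


# Example with placeholder gene IDs, written to an in-memory buffer.
buffer = io.StringIO()
write_differential_usage([("geneA", 0.01), ("geneB", 0.5)], buffer)
tsv_text = buffer.getvalue()
```

In a workflow the `handle` would be an open output file rather than a `StringIO` buffer.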

feat(Execution workflow): Roar

WHAT

Write execution workflow for Roar. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains the single-nucleotide positions of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Does Roar perform PAS quantification? If yes:
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Execution Workflow: CSI-UTR

WHAT

Write execution workflow for CSI-UTR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Does CSI-UTR perform PAS identification? If yes:
Output: Adhere to output specification for Identification challenge

This BED file contains the single-nucleotide positions of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Data Processing: diffUTR mouse simulation data with nf-core

@plger previously generated mouse simulation data for his work on diffUTR (Gerber et al. 2021).

Six fastq.gz files have been uploaded to figshare (https://figshare.com/articles/dataset/diffUTR_simulation/13726143). We need to align these files to mm10 using the nf-core RNA-seq workflow.

I3: Specification - Percent of genome-wide sites identified

Create specification for I3.

Working definition: Compare the output of a tool's RNA-seq-based site identification to established atlas(es) of polyadenylation sites (PolyASite, PolyA_DB, GENCODE, RefSeq).

Notes: This would measure what percentage of the identified sites lie within a fixed window of a database's sites.
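
One possible reading of this working definition is sketched below. The function name `percent_near_reference`, the default window of 50 nt, and the requirement that chromosome and strand both match are assumptions made for illustration; the issue itself does not fix any of them.

```python
import bisect
from collections import defaultdict


def percent_near_reference(identified, reference, window=50):
    """Percent of identified sites within `window` nt of a reference site.

    Both inputs are lists of (chrom, position, strand) tuples. Matching on
    strand and the default window size are assumptions, not part of I3.
    """
    # Index reference positions per (chrom, strand) and sort for bisection.
    ref_index = defaultdict(list)
    for chrom, pos, strand in reference:
        ref_index[(chrom, strand)].append(pos)
    for positions in ref_index.values():
        positions.sort()

    if not identified:
        return 0.0

    hits = 0
    for chrom, pos, strand in identified:
        positions = ref_index.get((chrom, strand), [])
        # First reference position that is >= pos - window.
        i = bisect.bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            hits += 1
    return 100.0 * hits / len(identified)


# Toy example with placeholder coordinates.
reference_sites = [("chr1", 1000, "+"), ("chr1", 5000, "-")]
identified_sites = [
    ("chr1", 1010, "+"),  # within 50 nt of a reference site
    ("chr1", 2000, "+"),  # no reference site nearby
    ("chr1", 4990, "-"),  # within 50 nt of a reference site
    ("chr2", 100, "+"),   # chromosome absent from the reference
]
percent = percent_near_reference(identified_sites, reference_sites)
```

Sorting the reference positions once and bisecting keeps the lookup fast even for genome-wide atlases.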

Execution Workflow: Aptardi

WHAT

Write execution workflow for Aptardi. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains the single-nucleotide positions of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
~~Output: Adhere to output specification for quantification challenge~~
~~Output: Adhere to output specification for differential usage challenge~~

Execution Workflow: APAlyzer

WHAT

Write execution workflow for APAlyzer. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

~~Output: Adhere to output specification for Identification challenge~~ (No identification)
~~Output: Adhere to output specification for quantification challenge~~ (No quantification)
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Question: Does your method require special STAR settings?

Will you need specific or non-standard information from the .bam tags in order to run your method?

feat(Execution workflow): MISO

WHAT

Write execution workflow for MISO. Use the provided small files for testing (running the workflow on real data is a different issue).

Check out the pilot_benchmark and see whether one of the execution workflows there can be adapted

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

~~Output: Adhere to output specification for Identification challenge~~ No identification
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Housekeeping: Add licenses

Add LICENSE files (according to licenses described in README) to repository

I5: Specification - Sites per gene

Create specification for I5.

Working definition: For each gene, count the number of sites falling within it. Tabulate these counts.

Notes: May want to include an upper limit (e.g., 10+).
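
The counting and tabulation could be sketched as follows. The function name `tabulate_sites_per_gene`, the tuple layouts, the inclusive gene boundaries, and the strand matching are all assumptions for illustration; only the per-gene count and the capped top bin come from the text above.

```python
from collections import Counter


def tabulate_sites_per_gene(genes, sites, cap=10):
    """Tabulate how many genes have 0, 1, 2, ... or `cap`+ sites.

    genes: list of (gene_id, chrom, start, end, strand) tuples
    sites: list of (chrom, position, strand) tuples
    Inclusive boundaries and strand matching are assumptions.
    """
    per_gene = {}
    for gene_id, g_chrom, start, end, g_strand in genes:
        # Simple quadratic scan; fine for a sketch, slow for whole genomes.
        per_gene[gene_id] = sum(
            1
            for chrom, pos, strand in sites
            if chrom == g_chrom and strand == g_strand and start <= pos <= end
        )

    table = Counter()
    for count in per_gene.values():
        # Counts at or above the cap share one bin, e.g. "10+".
        label = f"{cap}+" if count >= cap else str(count)
        table[label] += 1
    return dict(table)


# Toy example with placeholder genes and a low cap to show the top bin.
example_genes = [
    ("g1", "chr1", 100, 200, "+"),
    ("g2", "chr1", 300, 400, "+"),
    ("g3", "chr2", 1, 50, "-"),
]
example_sites = [
    ("chr1", 150, "+"),
    ("chr1", 160, "+"),
    ("chr1", 350, "+"),
    ("chr1", 350, "-"),
]
table = tabulate_sites_per_gene(example_genes, example_sites, cap=2)
```

Genes without any site are kept in the table under "0", so the bins always sum to the number of genes.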

Execution Workflow: LABRAT

WHAT

Write execution workflow for LABRAT. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

~~Output: Adhere to output specification for Identification challenge~~ No identification?
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Test data: Should ground truth comply with execution workflow output specification?

The provided ground truth test files contain read counts in the score column rather than the TPM values required by the APAeval execution workflow output specification.

Might this be confusing or misleading for hackers, because the output of execution workflows has to contain TPM?
Might this hinder testing of summary workflows?
Or is it on purpose that ground truth and execution workflow outputs have different formats?

Run Snakemake on AWS

To increase reproducibility and ease of use, Snakemake can be run via Tibanna on AWS.
This way, data and code do not have to be distributed to, or run on, user workstations.

To make this possible, I think the following tasks are necessary:

AWS is configured and users have access.
Data is stored on AWS in buckets.
Set up Tibanna.
Perform test run with example in tests/pilot_benchmark/snakemake/
Add documentation in README describing the steps above.
Provide templates for configuration and Snakemake run.

If anything is missing or misleading, feel free to adjust the tasks.

Mouse simulation data:

Get the processed data (ground truth)
Upload the fastq files (6/7 files on figshare)
pre-process and upload the ground truth

Create Utils channel

Following the discussion at the first progress report, we want to create a place for common tasks.

The idea is that some tasks are recurring, like extracting polyA sites from a database, and should not be scattered around in individual execution and/or summary workflows. It therefore makes sense to create a directory that gathers such tasks.

The following steps should be considered:

Create a project page to gather related issues.
Create a directory utils.
Write a README.md within this directory describing the
- purpose,
- usage (How-to) and
- description of implemented scripts.
Append to the main README.md that
- When creating a new workflow, the utils directory should be checked for existing common tasks.
- If some common task is not yet in utils, a GitHub issue should be created. The task should then be written and added as an individual script.

Additionally, the following should be considered when implementing the structure:

When an execution or summary workflow is run, it should be able to use the scripts within the utils directory.
The individual scripts should be unit tested before being used in the workflows. That is, it should be ensured that each script is flexible, robust and actually does what it is intended to do.
...
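
To make the unit-testing point concrete, here is a minimal sketch of what a tested utils script might look like. The helper `parse_bed_site` is hypothetical and does not exist in the repository; it only stands in for the kind of small, shared task the utils directory is meant to hold.

```python
import unittest


def parse_bed_site(line):
    """Hypothetical utils helper: parse one six-column BED line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields
    if strand not in {".", "+", "-"}:
        raise ValueError(f"invalid strand: {strand!r}")
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "name": name,
        "score": score,
        "strand": strand,
    }


class ParseBedSiteTest(unittest.TestCase):
    """Checks both the expected behaviour and the failure modes."""

    def test_valid_line(self):
        site = parse_bed_site("chr1\t1000\t1000\tsite_1\t.\t+\n")
        self.assertEqual(site["start"], 1000)
        self.assertEqual(site["strand"], "+")

    def test_wrong_column_count(self):
        with self.assertRaises(ValueError):
            parse_bed_site("chr1\t1000\t1000\n")

    def test_invalid_strand(self):
        with self.assertRaises(ValueError):
            parse_bed_site("chr1\t1000\t1000\tsite_1\t.\t*\n")
```

Saved as a module, tests like these can be run with `python -m unittest` before a script is used in any workflow.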

Execution Workflow: DaPars

WHAT

Write execution workflow for DaPars. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

Use snakemake template or nextflow template to create your workflow.
Comment your code
Run individual rules/processes in either conda envs or docker/singularity containers for reproducibility
Input: .bam or .fastq from test_data
Give feedback about the method

OUTPUTS (see specification):

Output: Adhere to output specification for Identification challenge

This BED file contains the single-nucleotide positions of poly(A) sites identified by the tool.
Fields:

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - not used, leave as "."
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for quantification challenge

This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

chrom - the name of the chromosome
chromStart - the starting position of the feature in the chromosome
chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
name - defines the name of the identified poly(A) site
score - TPM value for the identified site
strand - defines the strand; either "." (=no strand) or "+" or "-".
Output: Adhere to output specification for differential usage challenge

This TSV file contains two columns:
- gene ID
- significance of differential PAS usage
Column names should not be added to the file.

Please comment on this issue with any open questions you have about execution workflows, or with suggestions on standardizing and documenting the execution workflows!

param code

data availability s3 bucket
