apaeval's People

Contributors

allcontributors[bot], chelseaherdman, chilampoon, daneckaw, dimmestp, dominikburri, faricazjj, grexor, lschaerfen, mfansler, mkatsanto, mrgazzara, mzavolan, ninsch3000, pjewell-biociphers, sambryce-smith, txellferret, uniqueg, yuukiiwa

apaeval's Issues

Execution Workflow: PAQR

WHAT

Write execution workflow for PAQR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge (No identification)

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.
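As a rough illustration of the two file layouts described above, the following Python sketch writes a quantification BED file and a differential usage TSV file. The function names and the example values are hypothetical and not part of the specification; only the column order and the rule that no header is written are taken from the text above.

```python
import csv
import io

VALID_STRANDS = {"+", "-", "."}


def write_quantification_bed(sites, handle):
    """Write one six-column BED line per poly(A) site; the score column holds TPM.

    `sites` is an iterable of (chrom, position, name, tpm, strand) tuples.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for chrom, position, name, tpm, strand in sites:
        if strand not in VALID_STRANDS:
            raise ValueError(f"invalid strand: {strand!r}")
        # Sites are single-nucleotide, so chromEnd repeats chromStart,
        # as stated in the specification text above.
        writer.writerow([chrom, position, position, name, tpm, strand])


def write_differential_tsv(results, handle):
    """Write gene ID and significance per line, without column names."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene_id, significance in results:
        writer.writerow([gene_id, significance])


# Hypothetical example values, written to an in-memory buffer.
bed = io.StringIO()
write_quantification_bed([("chr1", 1000, "site_1", 12.5, "+")], bed)
print(bed.getvalue(), end="")
```

Passing an open file handle instead of `io.StringIO()` writes the same lines to disk.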

Collect parameters that significantly alter a method's behaviour

In order to create the parameter_codes, we need to collect parameters that significantly change a method's behaviour. I've created an additional column in the methods spreadsheet for that.
Could you all please indicate there what you came across while working on your method? Every method is different, so you need to judge for yourself whether a specific parameter might make a huge difference in the benchmarking challenge, and then we might be able to account for that by running a benchmark multiple times.
Thanks a lot for your help!!!

Matching up predicted with ground truth sites

For both identification and quantification, we need to determine which ground truth site corresponds to each predicted site when a 'site' is not a single base. We assume that ground truth sites, whether single positions or intervals, are correct and non-overlapping. The script should take as input BED files for both predicted and ground truth data, as well as a distance parameter specifying by how many nucleotides the predicted sites should be extended to the left and right. Predicted sites that overlap can be merged and their expression summed. For each predicted site/cluster, the ground truth sites that overlap with it can then be identified. If a predicted site/cluster overlaps with more than one ground truth site, it can be assigned to each of the overlapping ground truth sites with a fractional weight. This could be calculated as the number of overlapping nucleotides divided by the length of the predicted site/cluster. The expression can be assigned proportionally to these weights.

To have the full information for computing various metrics the output should contain:

    1. ID, start and end of predicted site
    2. ID, start and end of overlapping ground truth site
    3. weight with which the predicted site was assigned to the ground truth site
    4. expression of the predicted site assigned to the ground truth site
    5. expression level of ground truth site
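The procedure described above could be sketched in Python as follows. This is only a minimal sketch under several assumptions that are not stated in the issue: all sites lie on one chromosome and strand, coordinates are inclusive, a cluster that overlaps exactly one ground truth site gets weight 1.0, and the `Site` type and function names are made up for illustration.

```python
from typing import NamedTuple


class Site(NamedTuple):
    site_id: str
    start: int  # inclusive
    end: int  # inclusive
    expression: float


def merge_predicted(predicted, distance):
    """Extend each predicted site by `distance` nt on both sides, then merge overlaps."""
    extended = sorted(
        (
            Site(s.site_id, max(0, s.start - distance), s.end + distance, s.expression)
            for s in predicted
        ),
        key=lambda s: (s.start, s.end),
    )
    merged = []
    for site in extended:
        if merged and site.start <= merged[-1].end:
            last = merged[-1]
            # Overlapping sites form one cluster; their expression is summed.
            merged[-1] = Site(
                last.site_id + "," + site.site_id,
                last.start,
                max(last.end, site.end),
                last.expression + site.expression,
            )
        else:
            merged.append(site)
    return merged


def match_sites(predicted, ground_truth, distance):
    """Return (cluster, truth, weight, assigned expression, truth expression) rows."""
    rows = []
    for cluster in merge_predicted(predicted, distance):
        length = cluster.end - cluster.start + 1
        overlaps = []
        for truth in ground_truth:
            shared = min(cluster.end, truth.end) - max(cluster.start, truth.start) + 1
            if shared > 0:
                overlaps.append((truth, shared))
        for truth, shared in overlaps:
            # Assumption: full weight for a single match, otherwise the
            # overlapping nucleotides divided by the cluster length.
            weight = 1.0 if len(overlaps) == 1 else shared / length
            rows.append((cluster, truth, weight, cluster.expression * weight, truth.expression))
    return rows
```

Each returned row carries the five pieces of information listed above. Note that with the weight definition from the issue, the weights of one cluster need not sum to 1, so part of its expression may remain unassigned.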

Put input files in S3 bucket

Put all (currently available) input files into AWS S3 bucket s3://rnasoc-scratch-us-east-1 under prefix input_files.

Generally, the procedure to set up aws-cli and interact with an s3 bucket is described in this guide. See below for a detailed description of all of the steps that need to be taken:

  • Ask Alex to DM you the access keys
  • Install AWS CLI (follow the corresponding link in the guide above)
  • Set up AWS CLI credentials using the keys obtained from Alex (follow the Configuration Basics link in the guide above)
  • Create an empty directory; move inside that directory and create a subdirectory input_files, then copy all files that you want to upload into that subdirectory; make sure that you are still in the original directory you created and that this directory only contains the single subdirectory input_files
  • Upload all files to s3 with the following command: aws s3 sync . s3://rnasoc-scratch-us-east-1; if the files were organized as described before, all files in the directory input_files will be created in the s3 bucket under the prefix input_files
  • Verify that all files (now more accurately objects) are available in the bucket by running aws s3 ls s3://rnasoc-scratch-us-east-1/input_files --recursive
  • Also verify that you can download a file with aws s3 cp s3://rnasoc-scratch-us-east-1/input_files/{name_of_file_you_want_to_download} .
  • Create a table that lists all input files, descriptions of the files and their s3 URIs
  • Share the table in channels #general and #data
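For the table of input files, a small Python helper along the following lines might be enough. It is only a sketch: the function name, the column headers and the example file names are hypothetical, while the bucket and prefix are the ones named above.

```python
def build_file_table(
    descriptions,
    bucket="s3://rnasoc-scratch-us-east-1",
    prefix="input_files",
):
    """Return TSV lines (header first) with file name, description and S3 URI.

    `descriptions` maps a file name (relative to the prefix) to a short description.
    """
    lines = ["file\tdescription\ts3_uri"]
    for name in sorted(descriptions):
        lines.append(f"{name}\t{descriptions[name]}\t{bucket}/{prefix}/{name}")
    return lines


# Hypothetical file names, for illustration only.
table = build_file_table({"example.bam": "small test alignments"})
print("\n".join(table))
```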

I1: Specification - Compute Resources

Create specification for compute resources.

Working definition: Obtain execution time and memory usage in MiB for method execution. If multiple rules or samples for method: sum time and get max MiB.
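The aggregation in the working definition could look like the following Python sketch. It assumes that each rule or sample has already been measured and is available as a (seconds, peak memory in MiB) pair; how these measurements are collected is not covered here, and the function name is hypothetical.

```python
def summarise_resources(records):
    """Sum execution time and take the maximum memory over all rules/samples.

    `records` is an iterable of (seconds, max_memory_mib) pairs.
    """
    records = list(records)
    if not records:
        raise ValueError("at least one measurement is required")
    total_seconds = sum(seconds for seconds, _ in records)
    peak_mib = max(mib for _, mib in records)
    return total_seconds, peak_mib
```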

Execution Workflow: APAtrap

WHAT

Write execution workflow for APAtrap. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.
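A quick sanity check for the identification output described above might look like this Python sketch. The function name is hypothetical, and the checks only cover what the specification text states: six tab-separated fields, identical start and end, "." as score, and a valid strand symbol.

```python
def is_valid_identification_line(line):
    """Check one line of an identification BED file against the stated rules."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return False
    chrom, start, end, name, score, strand = fields
    if not (start.isdigit() and end.isdigit()):
        return False
    return (
        bool(chrom)
        and bool(name)
        and int(start) == int(end)  # single-nucleotide site
        and score == "."  # score is not used for identification
        and strand in {"+", "-", "."}
    )
```

Applying the function to every line of a file, for example with `all(is_valid_identification_line(line) for line in handle)`, gives a simple pass/fail check before submitting outputs.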

Q3: Specification - Similarity of per-gene polyadenylation site distributions

Create specification for Q3.

Working definition: Distance between poly(A) site distributions from quantification and from the corresponding 3'end dataset calculated for each gene separately.

Notes: Could either be for all genes or a subset. Possibly with some threshold value to count genes with correct quantification. Whereas Q2 focuses on expression value, this measures site usage (values in [0,1]).
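One way to make the working definition concrete is sketched below in Python. The choice of distance is an assumption: the issue does not name one, so the sketch uses the total variation distance between the per-gene usage fractions. It also assumes that predicted and ground truth sites have already been matched and share site identifiers; all names are hypothetical.

```python
def site_usage(expression_by_site):
    """Convert per-site expression of one gene into usage fractions in [0, 1]."""
    total = sum(expression_by_site.values())
    if total <= 0:
        raise ValueError("total expression must be positive")
    return {site: value / total for site, value in expression_by_site.items()}


def usage_distance(predicted, truth):
    """Total variation distance between two site usage distributions."""
    p = site_usage(predicted)
    q = site_usage(truth)
    sites = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in sites)


def per_gene_distances(predicted_by_gene, truth_by_gene):
    """Compute the distance separately for every gene present in both inputs."""
    shared = predicted_by_gene.keys() & truth_by_gene.keys()
    return {
        gene: usage_distance(predicted_by_gene[gene], truth_by_gene[gene])
        for gene in sorted(shared)
    }
```

A distance of 0 means identical site usage and 1 means no shared usage, so a threshold on this value could be used to count genes with correct quantification.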

Execution Workflow: diffUTR

WHAT

Write execution workflow for diffUTR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge (No identification)

  • Output: Adhere to output specification for quantification challenge (No quantification)

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Can an extra output file be added as a checkpoint for the Snakemake workflow?

Hi developers,

I was wondering if it makes sense to add an extra output file for each rule as a checkpoint, something along the lines of

output:
    out=os.path.join(config["out_dir"], "{sample}", "execute.out"),
    finisher=os.path.join(config["out_dir"], "{sample}", "execute.done"),
# Write the marker file only if the command itself succeeded.
shell:
    """
    (EXECUTE_COMMAND \
        -i {input.bam} \
        -o {output.out} \
        -p1 {params.param1} \
        -p2 {params.param2}) \
        &> {log} \
    && echo "DONE" > {output.finisher}
    """

This acts as a checkpoint for each rule, for cases where a program executes only partially or the node exits abruptly.
Thank you.

Data Processing: create ground truth from Mayr 3'end seq samples

In order to be able to use the primary immune cell data we'll have to process the available 3'end seq data from this study.
A starting point for creating a workflow could be the rules for 3'-Seq (Mayr) from the PolyASite workflow. However, an additional processing step that removes highly abundant immunoglobulin reads, as described in the publication, might have to be included (the corresponding paragraph from the Methods section of the publication is attached):
mayr_primary_immune_cells_processing.txt

I4: Specification - Site Feature Annotation

Create specification for I4.

Working Definition: For each site, annotate it with the feature(s) that intersect it. Major features are:

  • utr5
  • cds
  • utr3
  • intronic
  • intergenic

Notes: We may additionally want to include some downstream labels, e.g.,

  • ds_1kb
  • ds_5kb
  • ds_10kb
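The annotation step in the working definition could be sketched in Python as below. The sketch assumes that all features lie on the same chromosome and strand as the site, uses inclusive coordinates, and reports "intergenic" when nothing intersects; the downstream labels (ds_1kb and so on) are not covered. The function name and feature layout are hypothetical.

```python
def annotate_site(position, features):
    """Return the sorted labels of all features that intersect a site.

    `features` is an iterable of (label, start, end) tuples with inclusive
    coordinates. A site that intersects no feature is labelled "intergenic".
    """
    labels = sorted(
        {label for label, start, end in features if start <= position <= end}
    )
    return labels or ["intergenic"]
```

For many sites and features, an interval tree or a sorted sweep would scale better than this linear scan.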

Execution Workflow: mountainClimber

WHAT

Write execution workflow for mountainClimber. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Extend Snakemake Pilot execution workflow for AWS compatibility

The pilot execution workflow is running with conda environments. AWS does not support conda environments but only containers.

TO-DO

  • Use test_data for integration test.
  • Consider using snakemake profiles for AWS execution.
  • Containerise execution of snakemake rules. Use biocontainers wherever possible.
  • Leave execution with conda as commented example for local testing.
  • Implement correct output directory structure and file names.
  • Update pilot README.md.

Comments

  • Ensure that sample file paths resolve in AWS buckets.

Inconsistent envs for calls to snakemake and nextflow

An issue arose from an earlier discussion: nextflow and snakemake are executed from different (inconsistent) conda environment specifications.
To solve this I need to

  • Change snakemake.yml to environment.yml and add specific versions of both snakemake and nextflow
  • Replace the nextflow env from the templates

I cannot find the conda env from which nextflow is triggered; can someone point it out? Is such an environment needed in the context of nextflow? @uniqueg @dominikburri @yuukiiwa

Data Processing: GTEx based simulation data with nf-core RNA-seq

I will align a subset of simulated cerebellum and skeletal muscle fastq files, generated for a manuscript in preparation from Yoseph's lab, using the nf-core RNA-seq workflow.

My initial plan is to use 10 cerebellum and 10 muscle samples to allow for reproducibility to be calculated.

docs: Check links

Go through all README and other documentation files and check for missing/broken links, unclear explanations etc. Either fix yourself or reach out on Slack to clarify how to fix. Create new issues that might arise, if applicable.

Q2: Specification - Correlation to 3'-End Seq

Create specification for Q2.

Working definition: Compute the correlation coefficient between the tool's quantification obtained from RNA-seq and the quantification from the corresponding 3'-end sequencing dataset (for each pair of test datasets).
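As a sketch of this metric, the Python function below computes a Pearson correlation coefficient between two equally long lists of expression values. Choosing Pearson is an assumption, since the working definition does not say which coefficient to use, and the sketch assumes that the values are already paired per matched site.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two paired value lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold the tool's TPM values and `y` the 3'-end sequencing values for the same sites.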

Execution Workflow: TAPAS

WHAT

Write execution workflow for TAPAS. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Kamikaze Pilot: Decide on one benchmark

Which specification is ready to go? Q2

  • uses ground truth? yes
  • relatively easy to implement?
    @lschaerfen has submitted some scripts (#66) that could be used. Need to be integrated into a summary workflow.

To Do:

  • choose a suitable dataset for the benchmark (#78)
  • create a param code (#75)
    If you would like to help with this, please help review the proposed param codes!
  • communicate the benchmark to the hackers
  • support the developers of the corresponding summary workflow
  • update this issue description

feat(Execution workflow): APA-Scan

WHAT

Write execution workflow for APA-Scan. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Can APA-Scan do differential PAS quantification? If yes:

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Question: what are you missing in the README in execution_workflows directory?

The README should contain more precise information on what these execution workflows should look like.
Some open questions:

  • what exactly does preprocessing mean within execution wfs? NOT QC or adapter removal, as those will be done in nf-core/rnaseq
  • test files for development (which are inputs, which are outputs, which to take?)
  • naming convention for output files (DATASET_NAME/METHOD_NAME/CHALLENGE_NAME.bed ???)
  • What should the outputs be? one file per sample? replicates aggregated?
  • explain challenges, identifiers, etc.;
  • what should the samples table in the snakemake template contain
  • how should arguments to methods be handled (always use defaults, contact developers, interface all arguments?)
  • if a method needs to be called differently for the different challenges, should that be done in one workflow or in separate ones?
  • ...

Please comment on this issue with any open questions you have about execution workflows, or with suggestions on standardizing and documenting the execution workflows!

D1: Specification - Compute Resources

Create specification for D1.

Working Definition: Obtain execution time and memory usage in MiB for method execution. If multiple rules or samples for method: sum time and get max MiB.

Q1: Specification - Compute Resources

Create specification for Q1.

Working definition: Obtain execution time and memory usage in MiB for method execution. If multiple rules or samples for method: sum time and get max MiB.

Kamikaze Pilot: Decide on one dataset

Which dataset with corresponding ground truth is ready to go?
How do execution workflows have to be run on that? (If in doubt, would an individual sample that has a ground truth be a good start?)
To Do:

  • choose a dataset suitable for the Kamikaze Pilot benchmark (#77)
  • create a param code (#75)
  • communicate availability (also for genome/annotation files) to the hackers, along with instructions on what to take care of
  • update this issue description

param code

Work in progress discussion at PR93. Also available on GDrive sheet

data availability s3 bucket

The mapping of accession numbers, sample names, and s3 URI links to input bam and ground truth bed files, consistent with the param code, is in the table below (may be updated as Nextflow Tower executions finish). Also available on GDrive sheet

accession sample_name s3_uri file_type
SRR6795718 Mayr_CD5B_R3 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795718.bam bam_noChr
SRR6795719 Mayr_CD5B_R4 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795719.bam bam_noChr
SRR6795720 Mayr_NB_R1 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795720.bam bam_noChr
SRR6795721 Mayr_NB_R2 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795721.bam bam_noChr
SRR6795723 Mayr_NB_R3 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795723.bam bam_noChr
SRR6795724 Mayr_NB_R4 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795724.bam bam_noChr
SRR6795725 Mayr_NB_R5 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795725.bam bam_noChr
SRR6795722 Mayr_NB_R6 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795722.bam bam_noChr
SRR6795726 Mayr_M_R2 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795726.bam bam_noChr
SRR6795727 Mayr_M_R6 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795727.bam bam_noChr
SRR6795713 Mayr_GC_R2 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795713.bam bam_noChr
SRR6795714 Mayr_GC_R4 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795714.bam bam_noChr
SRR6795715 Mayr_GC_R1 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795715.bam bam_noChr
SRR6795716 Mayr_GC_R3 s3://rnasoc-scratch-us-east-1/mayr_bams/no_chr/SRR6795716.bam bam_noChr
SRR6795684 Mayr_CD5B_R3 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R3.SRR6795684.3seq.hg38.bed ground_truth
SRR6795685 Mayr_CD5B_R4 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R4.SRR6795685.3seq.hg38.bed ground_truth
SRR6795686 Mayr_CD5B_R5 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R5.SRR6795686.3seq.hg38.bed ground_truth
SRR6795687 Mayr_CD5B_R6 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_CD5B_R6.SRR6795687.3seq.hg38.bed ground_truth
SRR1005606 Mayr_NB_R1 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R1.SRR1005606.3seq.hg38.bed ground_truth
SRR1005607 Mayr_NB_R2 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R2.SRR1005607.3seq.hg38.bed ground_truth
SRR6795688 Mayr_NB_R3 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R3.SRR6795688.3seq.hg38.bed ground_truth
SRR6795689 Mayr_NB_R4 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_NB_R4.SRR6795689.3seq.hg38.bed ground_truth
SRR6795690 Mayr_M_R1 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_M_R1.SRR6795690.3seq.hg38.bed ground_truth
SRR6795691 Mayr_M_R2 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_M_R2.SRR6795691.3seq.hg38.bed ground_truth
SRR6795693 Mayr_GC_R2 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_GC_R2.SRR6795693.3seq.hg38.bed ground_truth
SRR6795692 Mayr_GC_R1 s3://rnasoc-scratch-us-east-1/ground_truths/Mayr_GC_R1.SRR6795692.3seq.hg38.bed ground_truth

Execution Workflow: IsoSCM

WHAT

Write execution workflow for IsoSCM. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Does IsoSCM perform PAS quantification? If yes:

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Execution Workflow: DaPars2

WHAT

Write execution workflow for DaPars2. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Execution Workflow: GETUTR

WHAT

Write execution workflow for GETUTR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge (No differential usage)

Execution Workflow: QAPA

WHAT

Write execution workflow for QAPA. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge (No identification)

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

feat(Execution workflow): Roar

WHAT

Write execution workflow for Roar. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Does Roar perform PAS quantification? If yes:

    • Output: Adhere to output specification for quantification challenge

      This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

Execution Workflow: CSI-UTR

WHAT

Write execution workflow for CSI-UTR. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Does CSI-UTR perform PAS identification? If yes:

  • Output: Adhere to output specification for Identification challenge

    This BED file contains single-nucleotide position of poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.
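For the identification format listed in this checklist, a formatter could look roughly like the sketch below. The function name, the `chrom:position:strand` naming scheme for the name column, and the sorting and de-duplication step are assumptions for illustration only.

```python
def format_identification_bed(sites):
    """Return BED text for an iterable of (chrom, position, strand) tuples."""
    lines = []
    # Sorting and de-duplicating is a convenience chosen here, not a spec requirement.
    for chrom, position, strand in sorted(set(sites)):
        if strand not in {"+", "-", "."}:
            raise ValueError(f"invalid strand: {strand!r}")
        name = f"{chrom}:{position}:{strand}"  # hypothetical naming scheme
        # The score column is unused for identification and is left as ".".
        lines.append("\t".join([chrom, str(position), str(position), name, ".", strand]))
    return "\n".join(lines) + "\n" if lines else ""


# Hypothetical example sites, including one duplicate.
identification_bed_text = format_identification_bed(
    [("chr1", 500, "+"), ("chr1", 120, "-"), ("chr1", 500, "+")]
)
print(identification_bed_text, end="")
```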

I3: Specification - Percent of genome-wide sites identified

Create specification for I3.

Working definition: Compare output of tool's RNA-seq-based site identification to established atlas(es) of polyadenylation sites (PolyASite, PolyA_DB, GENCODE, RefSeq).

Notes: This would measure what percentage of the identified sites lie near (within a fixed window of) a site in one of these databases.
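One possible reading of this working definition is sketched below in Python. The function name, the 50-nucleotide default window, and the requirement that chromosome and strand match are all assumptions; the actual specification may choose differently.

```python
from bisect import bisect_left
from collections import defaultdict


def percent_near_atlas(identified, atlas, window=50):
    """Percent of identified (chrom, position, strand) sites within `window` nt of an atlas site."""
    atlas_index = defaultdict(list)
    for chrom, position, strand in atlas:
        atlas_index[(chrom, strand)].append(position)
    for positions in atlas_index.values():
        positions.sort()

    def is_near(chrom, position, strand):
        positions = atlas_index.get((chrom, strand), [])
        # First atlas position at or after the left edge of the window.
        i = bisect_left(positions, position - window)
        return i < len(positions) and positions[i] <= position + window

    identified = list(identified)
    if not identified:
        return 0.0
    matched = sum(is_near(*site) for site in identified)
    return 100.0 * matched / len(identified)


# Hypothetical atlas and tool output, not real data.
atlas_sites = [("chr1", 1000, "+"), ("chr1", 5000, "+"), ("chr2", 300, "-")]
tool_sites = [("chr1", 1020, "+"), ("chr1", 3000, "+"), ("chr2", 260, "-"), ("chr2", 300, "+")]
i3_percent = percent_near_atlas(tool_sites, atlas_sites, window=50)
print(i3_percent)
```

Sorting the atlas positions once per chromosome and strand lets each lookup use binary search, which should matter when the atlas holds many sites.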

Execution Workflow: Aptardi

WHAT

Write execution workflow for Aptardi. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains the single-nucleotide positions of the poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

  • Output: Adhere to output specification for differential usage challenge

Execution Workflow: APAlyzer

WHAT

Write execution workflow for APAlyzer. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge (No identification)

  • Output: Adhere to output specification for quantification challenge (No quantification)

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.
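The differential usage format above is simple enough to sketch in a few lines of Python. The function name and the mapping-based input are assumptions for this example, and the spec text here does not say which significance measure (for instance a raw or adjusted p-value) is expected.

```python
import csv
import io


def write_differential_usage_tsv(results, handle):
    """Write gene ID / significance pairs as a two-column TSV without a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene_id, significance in results.items():
        writer.writerow([gene_id, significance])


# Hypothetical gene IDs and significance values.
buffer = io.StringIO()
write_differential_usage_tsv({"gene_A": 0.003, "gene_B": 0.8}, buffer)
differential_tsv_text = buffer.getvalue()
print(differential_tsv_text, end="")
```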

feat(Execution workflow): MISO

WHAT

Write execution workflow for MISO. Use the provided small files for testing (running the workflow on real data is a different issue).

Check out the pilot_benchmark and see whether one of the execution workflows there can be adapted.

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge (No identification)

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.

I5: Specification - Sites per gene

Create specification for I5.

Working definition: For each gene, count the number of sites falling within it. Tabulate these counts.

Notes: May want to include an upper limit (e.g., 10+).
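A minimal Python sketch of this tabulation follows. The function name, the gene dictionary layout, the inclusive gene boundaries, and the strand-matching rule are assumptions; the cap is set to 3 in the example only to keep it short.

```python
from collections import Counter


def sites_per_gene_table(genes, sites, cap=10):
    """Tabulate how many genes contain 0, 1, 2, ... sites; counts >= cap share one 'cap+' bin."""
    table = Counter()
    for chrom, start, end, strand in genes.values():
        # Simple linear scan; an interval index would be preferable for large inputs.
        n = sum(
            1
            for s_chrom, s_pos, s_strand in sites
            if s_chrom == chrom and s_strand == strand and start <= s_pos <= end
        )
        table[f"{cap}+" if n >= cap else str(n)] += 1
    return dict(table)


# Hypothetical genes (chrom, start, end, strand) and sites (chrom, position, strand).
example_genes = {
    "gene_A": ("chr1", 100, 500, "+"),
    "gene_B": ("chr1", 1000, 2000, "-"),
    "gene_C": ("chr2", 1, 50, "+"),
}
example_gene_sites = [
    ("chr1", 150, "+"), ("chr1", 480, "+"), ("chr1", 490, "+"),
    ("chr1", 1500, "-"), ("chr1", 1600, "+"),
]
i5_table = sites_per_gene_table(example_genes, example_gene_sites, cap=3)
print(i5_table)
```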

Execution Workflow: LABRAT

WHAT

Write execution workflow for LABRAT. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge (No identification?)

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.
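When testing a workflow on the small files, a line-level check of the quantification output may be useful. The sketch below is one way to do that; the function name and the exact wording of the messages are assumptions, and the checks only cover the fields as described in the checklist above.

```python
def validate_quantification_line(line):
    """Return a list of problems found in one quantification BED line (empty if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return [f"expected 6 tab-separated fields, found {len(fields)}"]
    chrom, start, end, name, score, strand = fields
    problems = []
    try:
        if int(start) != int(end):
            problems.append("chromEnd must equal chromStart for single-nucleotide sites")
    except ValueError:
        problems.append("chromStart and chromEnd must be integers")
    try:
        if float(score) < 0:
            problems.append("score (TPM) must not be negative")
    except ValueError:
        problems.append("score (TPM) must be numeric")
    if strand not in {"+", "-", "."}:
        problems.append('strand must be "+", "-" or "."')
    return problems


# Hypothetical lines: one valid, one with three problems.
good_line_problems = validate_quantification_line("chr1\t1000\t1000\tsite_1\t12.5\t+\n")
bad_line_problems = validate_quantification_line("chr1\t1000\t1001\tsite_1\tabc\t*")
print(good_line_problems, bad_line_problems)
```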

Run Snakemake on AWS

To increase reproducibility and ease-of-use, Snakemake can be run via Tibanna on AWS.
This way, data and code do not have to be distributed, and nothing has to run on user workstations.

To make this possible, I think the following tasks are necessary:

  • AWS is configured and users have access.
  • Data is stored on AWS in buckets.
  • Set up Tibanna.
  • Perform a test run with the example in tests/pilot_benchmark/snakemake/
  • Add documentation in README describing the steps above.
  • Provide templates for configuration and Snakemake run.

If anything is missing or misleading, feel free to adjust the tasks.

Mouse simulation data:

Get the processed data (ground truth)
Upload the fastq files (6/7 files on figshare)
Pre-process and upload the ground truth

Create Utils channel

Following the discussion of the first progress report, we want to create a place for common tasks.

The idea is that some tasks recur, such as extracting polyA sites from a database, and should not be scattered across individual execution and/or summary workflows. It therefore makes sense to create a directory that gathers such tasks.

The following steps should be considered:

  • Create a project page to gather related issues.
  • Create a directory utils.
  • Write a README.md within this directory describing the
    • purpose,
    • usage (How-to) and
    • description of implemented scripts.
  • Append to the main README.md that
    • When creating a new workflow, the utils directory should be checked for existing common tasks.
    • If a common task is not yet in utils, a GitHub issue should be created. The task should then be implemented and added as an individual script.

Additionally, the following should be considered when implementing the structure:

  • When an execution or summary workflow is run, it should be able to use the scripts within the utils directory.
  • The individual scripts should be unit tested before being used in the workflows. That is, it should be ensured that each script is flexible, robust, and actually does what it is intended to do.
  • ...
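To make the unit-testing point above concrete, here is a minimal sketch of a utility function with an accompanying test, using Python's standard `unittest` module. The function `extract_sites`, its choice of `chromEnd` as the site position, and the test cases are hypothetical and serve only to show the pattern.

```python
import unittest


def extract_sites(bed_lines):
    """Yield (chrom, chromEnd, strand) from six-column BED lines, skipping blanks and comments."""
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Unpacking raises ValueError if fewer than six fields are present.
        chrom, _start, end, _name, _score, strand = line.split("\t")[:6]
        yield chrom, int(end), strand


class ExtractSitesTest(unittest.TestCase):
    def test_parses_sites_and_skips_comments(self):
        lines = ["# header\n", "chr1\t99\t100\tsite_1\t.\t+\n", "\n"]
        self.assertEqual(list(extract_sites(lines)), [("chr1", 100, "+")])

    def test_rejects_short_lines(self):
        with self.assertRaises(ValueError):
            list(extract_sites(["chr1\t99\t100\n"]))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExtractSitesTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the tests next to each script in utils would let a workflow author confirm a script's behaviour before depending on it.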

Execution Workflow: DaPars

WHAT

Write execution workflow for DaPars. Use the provided small files for testing (running the workflow on real data is a different issue).

CHECKLIST

OUTPUTS (see specification):

  • Output: Adhere to output specification for Identification challenge

    This BED file contains the single-nucleotide positions of the poly(A) sites identified by the tool.
    Fields:

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - not used, leave as "."
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for quantification challenge

    This BED file contains positions of unique poly(A) sites with TPM values for each identified site in the score column.

    chrom - the name of the chromosome
    chromStart - the starting position of the feature in the chromosome
    chromEnd - the ending position of the feature in the chromosome; as identified PAS are single-nucleotide, the ending position is the same as starting position
    name - defines the name of the identified poly(A) site
    score - TPM value for the identified site
    strand - defines the strand; either "." (=no strand) or "+" or "-".

  • Output: Adhere to output specification for differential usage challenge

    This TSV file contains two columns:

    • gene ID
    • significance of differential PAS usage

    Column names should not be added to the file.
