
scpca-nf

This repository holds a Nextflow workflow (scpca-nf) that is used to process 10X single-cell data as part of the Single-cell Pediatric Cancer Atlas (ScPCA) project. All dependencies for the workflow outside of the Nextflow workflow engine itself are handled automatically; setup generally requires only organizing the input files and configuring Nextflow for your computing environment. Nextflow will also handle parallelizing sample processing as allowed by your environment, minimizing total run time.

The workflow processes fastq files from single-cell and single-nuclei RNA-seq samples using alevin-fry to create gene-by-cell matrices. The workflow outputs gene expression data in two formats: as SingleCellExperiment objects and as AnnData objects. Reads are aligned with selective alignment to an index containing transcripts corresponding to both spliced cDNA and intronic regions, denoted by alevin-fry as a splici index. These matrices are filtered, and additional processing is performed to calculate quality control statistics, create reduced-dimension transformations, assign cell types using both SingleR and CellAssign, and create output reports. scpca-nf can also process libraries with ADT tags (e.g., CITE-seq), multiplexed libraries (e.g., cell hashing), bulk RNA-seq, and spatial transcriptomics samples.

For more information on the contents of the output files and the processing of all modalities, please see the ScPCA Portal docs.

Overview of Workflow

Using scpca-nf to process your samples

The default configuration of the scpca-nf workflow is currently set up to process samples as part of the ScPCA portal and requires access to AWS through the Data Lab. For all other users, scpca-nf can be set up for your computing environment with a few configuration files.

Instructions for using scpca-nf

⚠️ Please note that processing single-cell and single-nuclei RNA-seq samples requires access to a high performance computing (HPC) environment with nodes that can accommodate jobs requiring up to 24 GB of RAM and 12 CPUs.

To run scpca-nf on your own samples, you will need to complete the following steps:

  1. Organize your files so that each folder contains fastq files relevant to a single sequencing run.
  2. Prepare a run metadata file with one row per library containing all information needed to process your samples.
  3. Prepare a sample metadata file with one row per sample containing any relevant metadata about each sample (e.g., diagnosis, age, sex, cell line).
  4. Set up a configuration file, including the definition of a profile, dictating where Nextflow should execute the workflow.

You may also test your configuration file using example data.
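As a rough illustration, a minimal user configuration file might look like the sketch below. The profile name, executor settings, and file paths are illustrative assumptions, not the workflow's actual defaults; params.run_metafile is the workflow's parameter for the run metadata file, and the resource values echo the requirements noted above.

// my_config.config -- a minimal sketch; all values here are examples
params {
  run_metafile = 'path/to/run_metadata.csv'  // the run metadata file from step 2
  outdir = 'scpca_out'                       // hypothetical output location
}

profiles {
  // example profile for a SLURM cluster that meets the RAM/CPU requirements
  cluster {
    process {
      executor = 'slurm'
      memory = '24 GB'
      cpus = 12
    }
  }
}

You could then invoke the workflow with something like nextflow run AlexsLemonade/scpca-nf -c my_config.config -profile cluster.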

For ALSF Data Lab users, please refer to the internal instructions for how to run the workflow on our systems.

Contributors

allyhawkins, dvenprasad, jashapiro, pre-commit-ci[bot], sjspielman


Issues

Move production of filtered.rds output to its own process within workflow

Currently, there is one process, generate-rds.nf, that produces both the unfiltered and filtered rds files. We would like to separate these into two processes, both of which take the alevin output directory and the metadata as input.

This would also mean we would need to alter the filter_sce_rds.R script to read in the alevin output before filtering.

Create 0.2.0 release

With the addition of #78, #77, #84, #85, and (pending) #87, we have made a number of pretty big changes to the workflow (especially the step-skipping of #77 and #87), and this seems to me to justify a bump to version 0.2!

We will want to make this version update/release before we do the bulk mapping, as we want those metadata to reflect a real release.

Return alevin output as RDS files

Related to #6, after the alevin-fry process is complete, the output needs to be transformed into RDS files: one with the unfiltered matrix and one with the filtered matrix.

Add command line argument to set seed in `filter_sce_rds.R`

Based on discussion in #45 and in the scpca meeting, we will want to make sure we set a seed for reproducibility of the stochastic steps in the filter_sce_rds.R script, in particular the use of emptyDrops inside scpcaTools::filter_counts() and the implementation of miQC. To do this, we can add a command line argument to the filtering script that sets the seed, with a default value. This will set the seed in the global environment and allow us to use it for the steps in the filtering script.

One remaining question is whether we want to expose the seed as a Nextflow parameter, or forgo that flexibility and always use the default value of the argument added to filter_sce_rds.R.
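A minimal sketch of the proposed argument, assuming the script uses optparse (the option name and default value here are illustrative):

library(optparse)

option_list <- list(
  make_option("--random_seed", type = "integer", default = 2021,
              help = "seed for stochastic filtering steps (emptyDrops, miQC)")
)
opts <- parse_args(OptionParser(option_list = option_list))

# setting the seed once in the global environment covers the
# downstream stochastic steps in the filtering script
set.seed(opts$random_seed)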

Create tagged release 0.1.0 of scpca-nf

When we are ready to kick off production jobs with this workflow, we should tag a release.

This will allow us to reproducibly run the workflow with the following command or something similar:

nextflow run AlexsLemonade/scpca-nf -r v0.1.0 -profile batch --project SCPCP000001

(this command would run all samples for project SCPCP000001 on AWS Batch)

Add module for spatial transcriptomics quantification

Because ScPCA includes spatial transcriptomics libraries, we will need to add a module to the workflow that quantifies these libraries.

To do this, we will certainly need a process that quantifies the libraries using spaceranger from 10X. We can follow an approach similar to the one used in alsf-scpca/workflows.

To maintain consistency with other libraries, where we used alevin-fry for quantification, we may also want to process these libraries through alevin-fry. But we should start by adding the spaceranger workflow and then add alevin-fry processing as a next step, if we choose to go that route based on the results in AlexsLemonade/alsf-scpca#151.

Reorganize config and rename ambiguous params

The section of nextflow.config with index file locations is getting pretty big, and with multiple indices, some of the parameter names are getting ambiguous. For example, fasta and gtf should probably be something like ref_fasta and ref_gtf. Changing these will require changing their use throughout the modules.

We can also reorganize somewhat, breaking the config file up into smaller ones that define the processes and parameters separately and taking advantage of the includeConfig directive. I also realized that we can use variables in the config files, so we can cut a lot of redundant path info. More info here: https://www.nextflow.io/docs/latest/config.html?highlight=configuration#config-syntax
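For example, something along these lines, where the file split and parameter names are hypothetical:

// nextflow.config
includeConfig 'config/containers.config'
includeConfig 'config/reference_paths.config'

// config/reference_paths.config
// a shared variable cuts the redundant path prefix
ref_rootdir = 's3://example-reference-bucket'

params {
  ref_fasta = "${ref_rootdir}/fasta/genome.fa.gz"
  ref_gtf   = "${ref_rootdir}/annotation/genes.gtf.gz"
}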

Add support for cellhashing

At the moment, we only support RNA-seq and CITE-seq data. Adding support for cellhashing should be as simple as adding the correct technology types to the nextflow script. In theory, the feature processes should work the same for CITE-seq and other feature barcode libraries.

Change publishDir for alevin output to be different from publishDir for rds files

In an effort to keep the file organization close to how the final product will look, we want to keep the alevin output directories separate from the directories holding the final filtered and unfiltered rds files. To do this, we will want to make the publishDir different between the map_quant_rna workflow and the generate_rds process. For now, we still want to keep and store the alevin directories (without the RAD files), so we will just move them to a new place rather than removing the publishDir step completely.

Allow skipping of salmon mapping in bulk

Similar to #41 and #81, we should continue our efforts to make expensive, slow steps skippable by adding this functionality to the bulk workflow. Since we are unlikely to want to skip bulk but not single-cell mapping, or vice versa (an update to salmon would presumably affect both), we can consolidate the current --rad-skip option with this one into a single --skip-mapping option.

Alternatively, we could rename the option --repeat-mapping, which has the nice benefit that we can set it to false by default; nextflow will then set the value to true when the flag is passed without an argument. That is, we would not need to do something like nextflow run ... --rad-skip false as we do now, but rather just nextflow run ... --repeat-mapping, which is a bit cleaner.
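A minimal sketch of the flag, using an underscore variant of the proposed name (the surrounding workflow logic is illustrative):

// false unless the workflow is invoked with --repeat_mapping
params.repeat_mapping = false

workflow {
  if (params.repeat_mapping) {
    log.info "Re-running salmon mapping from fastq input"
  } else {
    log.info "Reusing previously stored RAD files"
  }
}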

Import spatial transcriptomics output as SpatialExperiment

Following quantification of the spatial transcriptomics libraries, we will need a second process that imports the output into R as a SpatialExperiment and then writes those objects out as RDS files.

To me, it only makes sense to have one file that includes the filtered results, rather than both unfiltered and filtered results.
Following creation of the SpatialExperiment, I also think we want to filter to only include spots that overlap the tissue, or is there a rationale to include all spots?

If we use alevin-fry and spaceranger for quantification, this issue will depend on the addition of a function to scpcaTools that merges output from both of those tools to create the SpatialExperiment.

Store RAD files & alevin folders after initial mapping

With the knowledge that RAD files do not contain any trace of sequence info, we will likely want to save those as well as, or instead of, the alevin output files discussed in #17.

In #17 we are discussing saving the alevin output folder after quantification, but the steps between RAD file generation by salmon alevin --rad and the RDS file are quite efficient; we are unlikely to need those intermediates, but we might want the first (uncollated) RAD file in case of future processing updates.

To do this, we would want to add publishDir directives to the alevin_rad and alevin_feature processes.
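For example, the directive might look like the sketch below; the process skeleton and output layout are illustrative, and the mapping command itself is unchanged:

process alevin_rad {
  // keep the uncollated RAD output for possible future reprocessing
  publishDir "${params.outdir}/internal/rad", mode: 'copy'

  input:
    tuple val(meta), path(fastq_dir)
  output:
    tuple val(meta), path(rad_dir)
  script:
    rad_dir = "${meta.run_id}-rad"
    """
    # existing salmon alevin --rad command goes here, unchanged
    """
}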

Make spaceranger process optional

Because the spaceranger step is very costly and takes a long time to run, we should add the option to skip this step (similar to how we skip production of the RAD file in alevin-fry). In doing this, we should break the file reorganization out into its own process, so that the first process runs spaceranger and the second handles file reorganization and creation of the metadata.json.

Add filtering method to metadata.json

Since the filtering method used is now stored as a piece of metadata in the SCE by scpcaTools::filter_counts(), we should incorporate that information into the metadata.json file. We can grab it from the metadata slot of the filtered SCE, as we do for some of the other fields that make up metadata.json.

Modify filepath for splici index

Based on changes that will be made in AlexsLemonade/alsf-scpca#132, we will need to update the file path that is used for the splici index in main.nf.

Discussion: add random seeds to scripts

I realized that some of the filtering we are doing (emptyDrops and miQC, notably) fits statistical models with random components. Should we eliminate this source of variation by adding fixed random seeds to the relevant scripts?

Add GTF input to sce file generation

When scpcaTools supports adding gene symbols to the output files (AlexsLemonade/scpcaTools#35), it will need a GTF file as input.

This issue is to track adding that support/input to the workflow and associated scripts, reading the GTF file location from the workflow parameters.

Include cellranger index in build-index.nf

Right now we only include the alevin-fry indices in build-index.nf; however, we will be using spaceranger for spatial transcriptomics libraries and will need to build a separate index for it. In doing this, we should make sure that the Ensembl version we build from is consistent across all of the indices we are using. We should use a setup similar to how we previously built the cellranger index in alsf-scpca.

Modify generation of filtered sce object to include basic cell statistics in colData

The filtered sce object should include colData in the output. This includes everything calculated by scater::addPerCellQC() with the subset of mito genes, along with a column containing the posterior probability of each cell being compromised, from miQC.

We will also want to add in the rowData to the object using scater::addPerFeatureQC() and remove any genes that are not detected in that sample.

Re-organize spatial outputs to reflect desired hand-off structure

Based on our conversation in the multi-team planning meeting, we have made small revisions to the download structure of the spatial outputs. We will need to adjust the current output of the process accordingly to reflect those changes. This should be done after addressing #81, so testing is less time consuming.

This is the structure that will be expected for hand-off.

└── SCPCP00006
    ├── samples_metadata.csv
    └── SCPCS000203
        ├── SCPCL000372_metadata.json
        └── SCPCL000372_spatial
            ├── SCPCL000372_spaceranger_summary.html
            ├── filtered_feature_bc_matrix
            │   ├── barcodes.tsv.gz
            │   ├── features.tsv.gz
            │   └── matrix.mtx.gz
            ├── raw_feature_bc_matrix
            │   ├── barcodes.tsv.gz
            │   ├── features.tsv.gz
            │   └── matrix.mtx.gz
            └── spatial
                ├── aligned_fiducials.jpg
                ├── detected_tissue_image.jpg
                ├── scalefactors_json.json
                ├── tissue_hires_image.png
                ├── tissue_lowres_image.png
                └── tissue_positions_list.csv

This includes removing spaceranger_metrics_summary.csv and spaceranger_versions.html and attaching _spatial to the library folder name.

Downloads will be similar, but will contain a libraries_metadata.csv instead of samples_metadata.csv, will not include metadata.json, and will also have a README.md. Like so:

└── SCPCP00006
    ├── libraries_metadata.csv
    ├── README.md
    └── SCPCS000203
        └── SCPCL000372_spatial
            ├── SCPCL000372_spaceranger_summary.html
            ├── filtered_feature_bc_matrix
            │   ├── barcodes.tsv.gz
            │   ├── features.tsv.gz
            │   └── matrix.mtx.gz
            ├── raw_feature_bc_matrix
            │   ├── barcodes.tsv.gz
            │   ├── features.tsv.gz
            │   └── matrix.mtx.gz
            └── spatial
                ├── aligned_fiducials.jpg
                ├── detected_tissue_image.jpg
                ├── scalefactors_json.json
                ├── tissue_hires_image.png
                ├── tissue_lowres_image.png
                └── tissue_positions_list.csv

Update README to include instructions on how to run pipeline for external users

Currently the main README file includes instructions on how to run the pipeline either locally or with batch on AWS. In an effort to transition the workflow to be usable by external users, we should start by updating the main README file by adding instructions on how to set up and run the pipeline on your own.

Some things that should be included in these instructions are:

  • Creating the run metadata file
  • What parameters need to/ can be adjusted
  • How to set up the profile that you need based on your system*
  • Creating your own cellranger and spaceranger containers if planning to run the spatial workflow

*Looking at how nf-core sets their options for profiles, I noticed that they have a set of basic profiles and then a series of custom configs set up for different institutions, all loaded at runtime; the user then selects the profile they want at the command line, or at least that is what I gather from the documentation and the repo. I think we could start with documentation on how to set up a profile, and go from there if we want to add any pre-set profiles based on what users want.

We may also want to break this issue into smaller steps once we have a better idea of what else would be important to include in the instructions, but I just wanted to get our thoughts and what we've discussed so far written down.

Export cell statistics & other metadata in a table/json

We need to export cell counts for the front end, so it makes sense to have that output be part of the workflow. The most logical place for this seems to be the QC report step, so we should modify the sce_qc_report process (and the sce_qc_report.R script: https://github.com/AlexsLemonade/scpca-nf/blob/main/bin/sce_qc_report.R) to also export either a table or a json file with all necessary stats for the front end, probably including project ID, library ID, cell count (unfiltered cell count?), and maybe things like the nextflow workflow version (we can get a lot from nextflow workflow introspection variables).

Whether we do json or csv is just a matter of output format, as we would build it as a single-row table in R either way.
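A sketch of the idea, assuming jsonlite is available in the container; the fields shown are illustrative:

library(jsonlite)

# build the stats as a single-row data frame
metadata_df <- data.frame(
  project_id = "SCPCP000001",   # example values only
  library_id = "SCPCL000001",
  filtered_cells = 4215,
  workflow_version = "v0.1.0"   # e.g., from nextflow's workflow.revision
)

# the same single-row table can be written in either format
write.table(metadata_df, "metadata.tsv", sep = "\t",
            quote = FALSE, row.names = FALSE)
writeLines(toJSON(as.list(metadata_df), auto_unbox = TRUE), "metadata.json")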

Improve aesthetics of QC report

The original vision for the sample QC report was a two-column layout (shown below), but this will probably require a bit of thought and design work. Some of this work falls under #508, but we might also want to consider a broader look at aesthetics, maybe even to fit with the CCDL/ALSF style.

In working on a trial run of #508, I did confirm that adding a custom .css file in the same directory as the .rmd should work as a means to accomplish most aesthetic adjustments.

[Screenshot: mock-up of the proposed two-column QC report layout]

Originally posted by @allyhawkins in AlexsLemonade/scpcaTools#37 (comment)

Allow explicit restarting from RAD files (or other points?)

While the -resume option in nextflow is great when a workflow is repeated shortly after a run and/or after failures, most intermediate files are not stored long-term, so a later rerun with changes in parameters or other code could well result in rerunning the full workflow.

Since we are storing intermediate files (in particular RAD files), it might be nice to have an option in the workflow that would skip mapping (the most time-consuming step!) if the RAD files are available (after #40 is implemented).

I am not exactly sure what the implementation of this would look like at this stage... We will have to investigate the best way to test for the presence of an alevin output directory and run a different section of the workflow in that case; one rough possibility is sketched below.
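One rough shape for the branching, where the meta maps and the stored RAD directory layout are hypothetical:

params.outdir = 'scpca_out'  // example location of stored output

workflow {
  runs_ch = Channel.of([run_id: 'SCPCR000001'], [run_id: 'SCPCR000002'])
  branched = runs_ch.branch { meta ->
    has_rad: file("${params.outdir}/internal/${meta.run_id}-rad").exists()
    needs_mapping: true
  }
  // runs without stored RAD output would go to the mapping process;
  // the rest would rejoin at the collate/quant steps
  branched.needs_mapping.view { "would map: ${it.run_id}" }
  branched.has_rad.view { "would reuse stored RAD: ${it.run_id}" }
}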

Add support for passing library IDs with workflows and joining on that id

Joining runs with multimodal data by sample IDs works for most projects, but does not cover the case where a sample might have multiple libraries with separately paired RNA and CITE-seq or cellhash data. To support this, in https://github.com/AlexsLemonade/ScPCA-admin/pull/202 I added a new column, scpca_library_id, which can be used as an unambiguous id for joining such libraries.

We could simply pass the library id where we currently pass sample_id in the workflows, but a more sustainable method might be the system that nf-core seems to use: rather than passing individual sample values (along with things like data files) as we do currently, which means passing a bunch of values to each process, we could take advantage of a groovy map (a dictionary in some other languages) to pass metadata values like the various ids as a single process value (following nf-core, we can call this meta). This would (partially) free us from the tyranny of argument order and allow easier changes to individual processes that require more or different metadata, without having to change unaffected processes.

So basically, we would use val(meta) in place of val(sample_id), val(run_id), etc., and then use values such as meta.sample_id within processes where we would previously have used sample_id.

In the simplest implementation, we might use all info from the input csv file, but it is probably worth doing a bit of transformation to simplify some column names and only retain the columns that we need, as in the sketch below.
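A sketch of the pattern; scpca_run_id, scpca_sample_id, and files_directory are illustrative column names (scpca_library_id is the column added above), and the process is a stand-in:

process tag_library {
  input:
    tuple val(meta), path(fastq_files)
  output:
    tuple val(meta), path("${meta.library_id}.txt")
  script:
    """
    echo "run ${meta.run_id}, sample ${meta.sample_id}" > ${meta.library_id}.txt
    """
}

workflow {
  fastq_ch = Channel.fromPath(params.run_metafile)
    .splitCsv(header: true)
    .map { row ->
      // keep only the columns we need, with simplified names
      def meta = [run_id: row.scpca_run_id,
                  library_id: row.scpca_library_id,
                  sample_id: row.scpca_sample_id]
      tuple(meta, file("${row.files_directory}/*.fastq.gz"))
    }
  tag_library(fastq_ch)
}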

filter_sce_rds fails when miQC fails to fit

When running sample SCPCL000018, we encountered the following error:

Error executing process > 'generate_sce:filter_sce (17)'

Caused by:
  Process `generate_sce:filter_sce (17)` terminated with an error exit status (1)

Command executed:

  filter_sce_rds.R           --unfiltered_file SCPCL000018_unfiltered.rds           --filtered_file SCPCL000018_filtered.rds           --lower 200           --random_seed 2021

Command exit status:
  1

Command output:
  (empty)

Command error:
  Warning message:
  In miQC::mixtureModel(filtered_sce) :
    Unable to identify two distributions. Use plotMetrics function
                  to confirm assumptions of miQC are met.
  Error in unlist(x[[m]]@parameters) :
    trying to get slot "parameters" from an object of a basic class ("NULL") with no slots
  Calls: <Anonymous> ... sapply -> lapply -> FUN -> sapply -> lapply -> FUN -> unlist
  Execution halted

This seems to be due to a failure to fit the model, which produced a warning at first, but then an error at the following line:

filtered_sce <- miQC::filterCells(filtered_sce, model, posterior_cutoff = 1, verbose = FALSE)

To address this, we should look at this sample (the unfiltered RDS is at s3://nextflow-ccdl-results/scpca-prod/publish/SCPCP000001/SCPCS000018), but we will likely need to update the filtering script to fail gracefully for future cases where the miQC model fails to fit. Since such failures can occur stochastically, in my experience, we might first make a second attempt at the fit (using the next seed); if that also fails, we would presumably want to fill in the probability_compromised column with NA_real_.
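A rough sketch of graceful handling in filter_sce_rds.R, where filtered_sce and random_seed come from the existing script context, and the retry logic and column name follow the suggestion above:

# try to fit the miQC model, returning NULL on failure; the failed
# fit warns before erroring, so treat warnings as failures too
fit_miqc <- function(sce, seed) {
  set.seed(seed)
  tryCatch(
    miQC::mixtureModel(sce),
    warning = function(w) NULL,
    error = function(e) NULL
  )
}

model <- fit_miqc(filtered_sce, seed = random_seed)
if (is.null(model)) {
  # one retry with the next seed before giving up
  model <- fit_miqc(filtered_sce, seed = random_seed + 1)
}

if (!is.null(model)) {
  filtered_sce <- miQC::filterCells(filtered_sce, model,
                                    posterior_cutoff = 1, verbose = FALSE)
} else {
  # fail gracefully: keep all cells and mark the probability as missing
  filtered_sce$probability_compromised <- NA_real_
}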

Add to workflow generation of output files for bulk RNA-sequencing

Currently the bulk workflow stops at running salmon for each individual sample and writing the entire salmon folder to the internal directory that we don't intend to publish.

My interpretation, based on discussions in ScPCA meetings, is that we will want the bulk RNA-sequencing data to be available as one matrix per project, with all samples for that project included in one file. This means we will probably want to use tximeta() to import all of the samples processed for each project and output a single tsv file per project. This would be the only file we actually publish and include in the portal for each project with bulk RNA-sequencing.

I think the approach would be to add another module that runs an R script to import all the samples and output the tsv file, along the lines of the sketch below.
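A loose sketch of that R script, assuming per-library salmon quant directories; the paths and library ids are examples:

library(tximeta)
library(SummarizedExperiment)

# one row per library processed with salmon for this project
coldata <- data.frame(
  names = c("SCPCL000001", "SCPCL000002"),  # example library ids
  files = file.path("salmon", c("SCPCL000001", "SCPCL000002"), "quant.sf")
)

se <- tximeta(coldata, type = "salmon")

# write a single quant matrix tsv for the whole project
counts <- as.data.frame(assay(se, "counts"))
write.table(cbind(id = rownames(counts), counts),
            "SCPCP000001_bulk_quant.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)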

Support for cellhashed data

After exploration of cellhash data is complete (AlexsLemonade/alsf-scpca#139), we need to implement those methods in our workflow.

One note here is that we are likely to need to pass a bit more data with the cellhash samples: the cellhash-to-sample translation data is currently stored in s3://ccdl-scpca-data/sample_info/christensen/barcodes/christensen-pools.tsv, but that data may need to be moved or rearranged for the workflow, and/or we may want to add a column to the sample data for the location of that data.

Update scpca-tools docker image to tagged version

For development, we have been using the edge tag of the scpca-tools docker image so we always have the latest version, but for a released version of the workflow we will want to use a stable tagged version of that image. The following line is all that should need to be updated:

SCPCATOOLS_CONTAINER = 'ghcr.io/alexslemonade/scpca-tools:edge'

Modify filtering to use lower=200 for emptyDrops

Based on testing of emptyDrops with various thresholds, we have decided to use emptyDrops with lower=200 rather than the default lower=100. Currently, we have a script that performs filtering using the filter_counts function in scpcaTools; that function allows us to pass through any of the options you can use with emptyDrops.

One option would be to pass lower=200 as a parameter through the workflow and make it an argument of the R script used for filtering, as sketched below. We could also set it within the R script without making it a parameter, but that might be less transparent.
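A sketch of the parameter route; the process skeleton is illustrative, and --lower names the proposed script argument:

params.lower = 200  // emptyDrops lower bound, overridable at the command line

process filter_sce {
  input:
    tuple val(meta), path(unfiltered_rds)
  output:
    tuple val(meta), path("${meta.library_id}_filtered.rds")
  script:
    """
    filter_sce_rds.R \
      --unfiltered_file ${unfiltered_rds} \
      --filtered_file ${meta.library_id}_filtered.rds \
      --lower ${params.lower}
    """
}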

One thing to note is that the emptyDropsCellRanger() function from DropletUtils will be available in the next Bioconductor release on October 22. In testing, that function showed slightly better consistency with Cell Ranger, so we may want to consider redoing the filtering after the soft launch and modifying it to use emptyDropsCellRanger().

Use release tagged scpcaTools for processing

As of #31, we are using the edge version of the scpca-tools docker image for processing. When we get close to running samples for release, we will want to change that to a specific release tag (which will require that we actually tag a release of scpcaTools). At that point, we will also want to tag this workflow with a version number.

Determine structure of output files and add process for writing unfiltered and filtered output

After each sample has gone through pre-processing, the data will be contained in an output folder with all information about the alevin-fry run and the output matrix files (currently in .mtx format). We plan to provide both the unfiltered and filtered output to users, so we will need to decide whether to provide the original .mtx files, SingleCellExperiment objects, or both.

We will also need to decide on the structure of the output folders for the unfiltered and filtered output and what files we want to include. After pre-processing, the next process in the scpca nextflow pipeline should be creating the unfiltered and filtered output in the proper format.

Remove explicit dependence on S3

As discussed in https://github.com/AlexsLemonade/ScPCA-admin/issues/298, the current metadata file that we use as the control list for this workflow, defined in params.run_metafile, assumes that certain files are on S3. When we make that explicit (allowing for files to be stored on other services or locally), we should adjust the workflow to use the URLs as specified, not adding s3:// as we currently do, for example at:

scpca-nf/modules/af-rna.nf, lines 99 to 102 (at commit bc587b5):

.map{meta -> tuple(meta,
                   file("s3://${meta.s3_prefix}/*_R1_*.fastq.gz"),
                   file("s3://${meta.s3_prefix}/*_R2_*.fastq.gz")
)}

Note that if we change the name of the s3_prefix field while accomplishing https://github.com/AlexsLemonade/ScPCA-admin/issues/298, this will also need to be updated.
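The adjusted version might look something like this, where files_directory is a hypothetical replacement for s3_prefix holding a full URL or local path:

.map{meta -> tuple(meta,
                   file("${meta.files_directory}/*_R1_*.fastq.gz"),
                   file("${meta.files_directory}/*_R2_*.fastq.gz")
)}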

Prepare for new workflow release (0.1.3)

There have been a number of workflow changes completed and pending since the last release, so we will likely want to create a new release soonish.

Changes that have/will be incorporated include:

We should probably also address #67 before release

Generate merged SCE objects with feature (cite-seq) data

With #5 and AlexsLemonade/scpcaTools#31 complete, we now need to update the workflow to generate SCE objects (and rds files) with both the RNA-seq and feature data (usually CITE-seq). For DRY reasons, this can be done by modifying the current R script, but it will probably require a separate process within nextflow, since two input directories will be required.

Allow processing by project/submitter

Right now we have flags to process individual enumerated run_ids or the whole table (for data types we can process). Since we are likely to want to fire off the workflow by project, we should probably add a flag to allow filtering by PI (submitter). That info is not currently captured in the metadata, but can easily be added before filtering.

Start building nextflow pipeline used to process scpca data with alevin-fry as the first process

Now that we have decided to move forward with alevin-fry, the splici index, and selective alignment for all samples, we can start to build the pipeline that we will use to process ScPCA samples and generate the desired output. There will be a few steps to the pipeline, starting with pre-processing the raw data with alevin-fry, generating the unfiltered and filtered counts matrices, and generating the QC report. We should start by adding the alevin-fry process and then build on the other steps as additional processes.

In doing this, we need to also update the salmon container to use salmon 1.5.1.

Add generation of QC report to scpca nextflow pipeline

After the counts matrices are obtained for each sample, we will want to generate a QC report. We will want to add a process to the nextflow pipeline that takes the counts matrices (or SingleCellExperiment objects) as input and outputs a QC report.

Modify workflow to run by project rather than by run id

Based on our current strategy of processing by project, it might be beneficial to alter the workflow to run all samples from a given submitter instead of typing in each individual run ID. We may also think about breaking it down further (perhaps by disease group?) for some of the bigger projects, if we are going to process in smaller chunks as samples are still coming in.

Update workflow to use new scpcaTools (0.1.2)

When scpcaTools 0.1.2 is completed (AlexsLemonade/scpcaTools#78), we will need to update this workflow to use it.

This should mostly require updating the scpcatools docker image to the latest version, but it is worth noting that this will also change the default filtering to emptydrops_CR and may alter miQC results.

We will want to make sure we are testing with updated versions before merging. (and note that downstream docs will require changes).

Decide what output files to include for Spatial Transcriptomics

Based on a discussion with @jashapiro in slack, before doing #63 we should decide whether we would like to create rds files of the spatial libraries for users, or whether we would prefer to create a tar.gz of the outs folder after running spaceranger and provide that to users.
Part of the reasoning here is that we would only be importing the spaceranger output into R as a SpatialExperiment and then outputting it as an rds file, without adding any additional analyses. Would we want to add the extra step of creating the rds file if we are not going to do anything extra to the SpatialExperiment object, and would people prefer to have the .cloupe file or .mtx.gz files to load into R themselves?

The pro of creating an .rds file is that the import scripts for the portal are already written to accept .rds files, so we could keep everything in the same format and consistent across all samples. So I guess the question is: how difficult would it be to change the importing for only the subset of samples that fall under the spatial category? Tagging @kurtwheeler for any thoughts he might have on that.

We also provide QC reports for the other libraries; however, here we are running spaceranger, which generates its own summary html file. Would it be sufficient to use this report, or is there any reason to create our own? (I think the only case for that would be if we were using alevin-fry + spaceranger together.)

Based on the decisions made here, we may or may not need to complete #63.

Adjust output of ST module to reflect desired output files

After quantification with spaceranger, we will want to reorganize the output files to publish only those that are going to be user facing. Right now the workflow publishes the entirety of the outs folder from spaceranger, which we will not need. We can adjust the workflow to only include the output files mentioned in #68 (comment). We should then include all of the files below as one zip file before publishing them as part of the Nextflow process.

Originally posted in #68 (comment).

After our discussion today in the ST benchmarking meeting, we have decided to use spaceranger and provide its outputs as a zip file. We should include the unfiltered and filtered output from spaceranger, the web summary (equivalent to our QC report), the spatial folder, and a metadata file that we add with version information. This is approximately what the contents of the download would look like for each ST library:

├── SCPCL000000_filtered_files
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
├── SCPCL000000_spaceranger_summary.html
├── SCPCL000000_spatial
│   ├── aligned_fiducials.jpg
│   ├── detected_tissue_image.jpg
│   ├── scalefactors_json.json
│   ├── tissue_hires_image.png
│   ├── tissue_lowres_image.png
│   └── tissue_positions_list.csv
├── SCPCL000000_unfiltered_files
│   ├── barcodes.tsv.gz
│   ├── features.tsv.gz
│   └── matrix.mtx.gz
└── SCPCL000000_metadata.json
