
scrna-seq's Issues

JackStraw settings hardcoded

Considering: JackStraw can be used to check for significant PCs; however, the number of dims to test was hardcoded.

This is an issue in larger datasets, where the variability is spread over more than the first 20 PCs.

In the following chunks I tried changing the hardcoded number of dims, which should work with the code below (I only checked with 30):

```r
# Perform JackStraw permutations to find significant PCs
seuset.jack <- JackStraw(
  object = seuset,
  dims = 30,
  num.replicate = 100
)
seuset.jack <- ScoreJackStraw(seuset.jack, dims = 1:30)
JackStrawPlot(object = seuset.jack, dims = 1:30)
```

So we could add a jackstraw_pc variable? Or maybe this could reuse the PC count already set in params$pcs_max_hvg.
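Either way, the hardcoded 30 could be replaced by a parameter; a minimal sketch, assuming a new params entry jackstraw_pc (reusing params$pcs_max_hvg would work the same way):

```r
# Sketch: drive the number of tested dims from the params block.
# jackstraw_pc is a hypothetical new parameter, not yet in the config.
n_pcs <- params$jackstraw_pc

seuset.jack <- JackStraw(object = seuset, dims = n_pcs, num.replicate = 100)
seuset.jack <- ScoreJackStraw(seuset.jack, dims = 1:n_pcs)
JackStrawPlot(object = seuset.jack, dims = 1:n_pcs)
```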

Meta-data in general

For now, I see three different scenarios for a user to provide meta-data.

  1. Cell names contain meta-data variables. In this case, all the fields are directly extracted based on extract_meta_columns. Possibly the most convenient way.
  2. The user provides custom meta-data in .tab/.csv delimited format. In this case, we would minimally require a genome and a sample library column to match the meta-data entries with cell identifiers. The library column, or additional columns specified, can be used for visualization and grouping in PCA, UMAP, etc.

Example:

| Genome | Library |
|--------|---------|
| GRCh38 | 820     |
  3. The user specifies no meta-data. In this case, we have to group on library, since it is the only information we can infer from cell names (if scenario 1 does not apply). We could do something similar to Seurat's CreateSeuratObject function, where you can specify an identity class for each cell based on the cell name syntax. For example, in our case the cell names have the format sample_well(barcode), so we could say the identity is always the first field after splitting by _. In case sample itself contains multiple _, we could ask the user to specify the cell identity index, which becomes the 1+ position in sample.
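For scenario 3, Seurat's CreateSeuratObject already exposes this via its names.delim and names.field arguments; a sketch, assuming counts is the loaded expression matrix:

```r
# Sketch: derive cell identities from the cell name syntax sample_well(barcode).
# names.delim splits the cell name, names.field picks which piece becomes the
# identity; a user-supplied identity index could be passed straight to names.field.
seuset <- CreateSeuratObject(
  counts = counts,   # assumed: the loaded count matrix
  names.delim = "_",
  names.field = 1    # first field after splitting by "_"
)
```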

Output to pdf

Concerning: .rmd -> .pdf

I really like the pdf output, it is overall really nice!! (:
(also the contents at the top, looks faancy)

There are some spots where lines run outside the page width; see below for an example.

Is there maybe a fix for this (a wrapping setting)? Or does this require manually editing the long lines in the rmd where this happens?
It also sometimes happens in code chunks.

(screenshot: example of a line running past the page margin)
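For the code-chunk overflow specifically, knitr can re-wrap code at a given width via its tidy option (this needs the formatR package installed); a sketch for the setup chunk, with 60 characters as an assumed cutoff:

```r
# Sketch: re-wrap code inside chunks at roughly 60 characters
# (requires the formatR package; only affects code, not text output)
knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 60)
)
```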

Parameter extract_meta_columns

I use the extract_meta_columns to generate a meta data table from my cell names.

For this variable, I always add well as the last field, since when loading the matrix the well-id is appended (with '_') to the cell names. Is there a way to make this the standard? One would not expect it to be the last field when filling in the parameters (or config file), since the folder names do not include this field either.

Maybe the cell name extraction in general is a method too specific to my, and only some other researchers', convention of writing the plates/columns, and therefore this is not worth looking into.

However, I was thinking:

  • Maybe we could always try to extract at least a Library variable and a Well name to add to the metadata (thereby always appending ,Well to the end of the extract_meta_columns variable)?
  • Alternative: the last field is removed before generating the meta-data table from the column names.
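The first option could be as simple as appending the field before parsing; a sketch, assuming extract_meta_columns is a comma-separated string of field names, counts holds the loaded matrix, and every cell name splits into the same number of fields:

```r
# Sketch: guarantee that Well is always the final meta-data field
meta_fields <- strsplit(params$extract_meta_columns, ",")[[1]]
if (tail(meta_fields, 1) != "Well") {
  meta_fields <- c(meta_fields, "Well")
}

# Split cell names (assumed format: field1_field2_..._well) into the fields
parts <- strsplit(colnames(counts), "_")
meta  <- as.data.frame(do.call(rbind, parts))
colnames(meta) <- meta_fields
```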

SCTransform and Jackstraw

While testing the workflow, JackStraw appeared not to work together with SCTransform.

So when switching on both variables, the workflow crashes.

We should only allow JackStraw to run when run.sct is FALSE (can one add multiple conditions to the eval argument of the {r running JackStraw, eval=params$run.jackstraw} chunk?), and state in the report when it did not run because of this setting.
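The eval chunk option accepts any R expression, so combining both conditions should work; a sketch (the second chunk is a hypothetical companion that prints a note when JackStraw is skipped):

```r
# Sketch of the two chunk headers (shown here as comments, since these would
# live in the .rmd itself):
#
# ```{r running JackStraw, eval=params$run.jackstraw && !params$run.sct}
# ...JackStraw code, runs only when requested AND SCTransform is off...
# ```
#
# ```{r jackstraw skipped note, eval=params$run.jackstraw && params$run.sct, results='asis'}
cat("JackStraw was requested but skipped: it is not run together with SCTransform (run.sct = TRUE).")
# ```
```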

Output of params settings

Question: Would it be possible to generate an overview with all the Parameter settings used for the knit?

This is especially useful when running this script separately from the s2s workflow, because in that case there is no config file documenting the settings used.

At the moment, this information is not visible in the knit. (Maybe we could print it somewhere at the bottom of the file?)
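Since params is just a named list inside the knit, a table of all settings could be printed at the end of the report; a sketch:

```r
# Sketch: render all knit parameters as a table at the end of the report
param_table <- data.frame(
  parameter = names(params),
  value     = vapply(params, toString, character(1))
)
knitr::kable(param_table, caption = "Parameter settings used for this knit")
```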

Doublet selection

Considering: Especially in droplet-based scRNA-seq methods, there is a relatively higher chance to get doublets in the dataset.

Doublets are two (or more) cells encapsulated in the same droplet, where one would expect only a single cell.

CellRanger has a built-in method to estimate the number of doublets, based on the expected rate for the number of cells loaded in the sample (I am not sure how this works exactly).

There are also several stand-alone methods that try to estimate which droplet/cell entry might have contained a doublet (Gert Jan has used one that we could have a look at). I used to only check the nCounts/nFeatures distributions over the UMAP, to get an idea of whether certain clusters are formed on the basis of these differences. (Doublets can still be a problem with FACS-based methods as well, although the sorting procedure already selects against them.)
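The quick nCount/nFeature check described above can be done directly in Seurat; a sketch, assuming the default RNA assay and cluster naming:

```r
# Sketch: inspect library size and gene counts over the UMAP.
# Clusters driven mainly by high nCount/nFeature are doublet suspects.
FeaturePlot(seuset, features = c("nCount_RNA", "nFeature_RNA"), reduction = "umap")
VlnPlot(seuset, features = c("nCount_RNA", "nFeature_RNA"), group.by = "seurat_clusters")
```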

However, there is apparently not really a consensus in the field yet on how to identify these properly.

I do think it is important to include this in the pipeline (in a later release of s2s), since one encounters this problem especially in 10X experiments.

Parameters adjustments

Concerning: params: in analysis/kb_seurat_pp.rmd

The interactive param block works really nicely before knitting in Rstudio.

Two points regarding the parameter block:

  • Would it be possible to show one line with a short description of each variable? Especially when working via "knit with parameters", one only sees the variable name. (For example, for the variable filtering the options are "in", "out", or nothing, and people will not be aware of this.)

  • The order of the variables is quite random, which also makes it harder to understand what they are for. (For example, amount_cells_expr and gene_tresh belong to the same filtering step, yet are listed apart, among other variables.) Maybe the ordering in the config file is different as well?
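The first point may already be covered by R Markdown's parameterized reports: each params entry can carry a label (shown in the Knit-with-Parameters dialog) and an input type. A sketch for the filtering variable from above (the label text and choices are my assumption):

```yaml
params:
  filtering:
    label: "Keep ('in') or remove ('out') the selected cells; leave empty to skip filtering"
    value: "out"
    input: select
    choices: ["in", "out", ""]
```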

A question: do the variables in params all have to be lower case? In some cases, for example nHVG, mixed case would also clarify the variable more. However, I understand lower case is probably a common convention (descriptions would do in that case as well).

I can help ordering and writing descriptions if you'd like!

Normalization method

Concerning: Seurat's normalization in analysis/kb_seurat_pp.rmd

Seurat has made updates over the years to their normalization methods. One new feature they built is SCTransform, a function that combines normalization, HVG selection, and scaling plus optional regression.

I was always used to the "older" normalization from Seurat: in short, log((count / library size) * 10,000 + 1), i.e. scaling per cell, adding a pseudocount, and taking the natural log. This is also what is incorporated in the rmd here.

For the lab as a general method, it would be better to use the updated method for normalization (I discussed this with Simon as well).

Based on this paper, the newer normalization method, which they set up at the same time as SCTransform, shows better results for scRNA-seq data.

Options:

  • Incorporating SCTransform: this will also change some of the assay naming in the object. To check when using this: the default method in SCTransform is still log1p, if I saw this correctly.

  • I think it might be worth a try to adjust the method used in Seurat's NormalizeData() (for instance CLR instead of LogNormalize) and check whether this does the same as SCTransform. However, this needs more reading of the paper to understand the other differences between the methods, because there are probably more! Otherwise this might be an "easy" fix: we would change the normalization, and the regression is already performed in scaling. I think there might still be a difference in HVG selection, though, which could work differently within SCTransform than when performed as a separate step.
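To make the two options concrete, this is roughly how they differ in Seurat calls (a sketch; using percent.mt as the regression variable is my assumption):

```r
# Option A (current): the classic three-step pipeline
seuset <- NormalizeData(seuset, normalization.method = "LogNormalize", scale.factor = 10000)
seuset <- FindVariableFeatures(seuset, selection.method = "vst")
seuset <- ScaleData(seuset, vars.to.regress = "percent.mt")

# Option B (proposed): SCTransform replaces all three steps in one call
# and stores its results in a new "SCT" assay (hence the naming changes).
seuset <- SCTransform(seuset, vars.to.regress = "percent.mt")
```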

Filepath issues

Not sure if this is a problem when running directly on s2s output:

I run the .rmd file separately from s2s, with: knit with parameters, in RStudio.

Thereby I run the .rmd from the location where it is saved: the analysis/ folder, where all the subsequent scripts and child-rmds are also stored. Within kb_seurat_pp.rmd, however, these extra scripts and files are searched for under analysis/.

So for me, with the settings as they were, running kb_seurat_pp.rmd from the location where it is stored did not work. But I don't know whether, in a normal run, you place the .rmd somewhere else?
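One way to make this robust (a sketch; the helper filename is a placeholder) would be to resolve paths relative to the .rmd itself instead of the current working directory:

```r
# Sketch: locate child scripts relative to the knitted .rmd,
# not relative to whatever the working directory happens to be
rmd_dir <- dirname(knitr::current_input(dir = TRUE))
source(file.path(rmd_dir, "some_helper.R"))  # "some_helper.R" is a placeholder name
```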

Parameter combined_id

Concerns: the combined_id parameter can no longer be used in UMAP representations.

Did this parameter name change, or is there another reason? In my latest version I am no longer able to label my UMAPs via umap_cols: with the variables I combined in meta_group_id; before, I could do this by using combined_id.
