hyunhwan-jeong / cb2

CB2 is an R package which provides functions for hit gene identification and quantification of sgRNA (single guide RNA) abundances for CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) pooled screen data analysis. Details are in Jeong et al. (2019) <doi:10.1101/gr.245571.118> and Baggerly et al. (2003) <doi:10.1093/bioinformatics/btg173>.

Home Page: https://cran.r-project.org/web/packages/CB2/index.html

License: Other

R 76.99% C++ 23.01%

crispr-cas9 screen beta-binomial r cran

cb2's Issues

Fasta file issue

I am trying to use the CRISPRCloud2 platform that you created for CRISPR analysis, but I am having trouble uploading the pooled sgRNA library data.
I am using a mouse library generated by the Weissman Lab, which was deposited at Addgene as two subpools: #893987 and #893988.
I tried uploading the Excel file that contains the sequence information, but this was not successful.
I also tried converting it to Word and to SnapGene format, and changing the file name, but none of these helped. Each time I got an "invalid file selected" error message.
Attached is a text file with all sequences in FASTA format. Please let me know what needs to be done so that the analysis can move forward.
Thank you very much.
gRANs-only2.fasta.txt
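
When a library is only available as a spreadsheet, one option is to export it to CSV and write a plain-text FASTA file from the ID and sequence columns. The sketch below is a minimal illustration in base R. The helper name `write_sgrna_fasta`, the column names, and the two example rows are made up for illustration, and whether a file produced this way resolves the "invalid file selected" error on CRISPRCloud2 is not confirmed in this thread.

```r
# Hypothetical helper (not part of CB2 or CRISPRCloud2): write sgRNA IDs and
# sequences to a plain-text FASTA file.
write_sgrna_fasta <- function(ids, seqs, path) {
  seqs <- toupper(gsub("\\s+", "", seqs))  # strip whitespace, upper-case
  stopifnot(
    length(ids) == length(seqs),
    anyDuplicated(ids) == 0,               # sgRNA names must be unique
    all(grepl("^[ACGTN]+$", seqs))         # only nucleotide letters allowed
  )
  # Interleave header and sequence lines: >id1, seq1, >id2, seq2, ...
  writeLines(as.vector(rbind(paste0(">", ids), seqs)), path)
  invisible(path)
}

# Made-up rows standing in for a spreadsheet that was exported to CSV.
library_table <- data.frame(
  id = c("GeneA_1", "GeneA_2"),
  sequence = c("acgtacgtacgtacgtacgt", "TTGACCATGCATGCAAGTCC"),
  stringsAsFactors = FALSE
)
fasta_path <- tempfile(fileext = ".fasta")
write_sgrna_fasta(library_table$id, library_table$sequence, fasta_path)
fasta_lines <- readLines(fasta_path)
```

The checks inside the helper stop early on duplicated IDs or non-nucleotide characters, which are easier to diagnose here than after an upload fails.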

non-targeting controls?

Thanks - this looks nice and is so refreshingly easy to install and use! I was wondering if there was a way in CB2 to account for non-targeting control sgRNAs, or if this is not needed? Thanks

calc_mappability

Hi,

It would be useful if this table also produced:

total reads
number of mapped reads.

Thanks

Is it possible to allow one mismatch in the guide RNA barcode?

Hi,

CB2 is very useful for analyzing pooled CRISPR screens.

I have one question. When I analyze the raw data, I find 1-2 mismatches in the reads of my FASTQ file.
Can CB2 tolerate a 1 bp mismatch when matching guides?

Thanks,

Best regards,
Sujin Kim
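
The thread does not say whether CB2 itself tolerates mismatches. As a standalone illustration of what "allowing one mismatch" means, the base R sketch below assigns an extracted guide-length sequence to a library guide when exactly one guide lies within Hamming distance 1. The function names and the two example guides are hypothetical, and this is not CB2's matching code.

```r
# Number of positions at which two equal-length sequences differ.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Return the name of the single guide within `max_mismatch` of `read_seq`,
# or NA when no guide (or more than one guide) qualifies.
assign_guide <- function(read_seq, guides, max_mismatch = 1L) {
  distances <- vapply(guides, hamming_distance, integer(1), b = read_seq)
  hits <- names(distances)[distances <= max_mismatch]
  if (length(hits) == 1L) hits else NA_character_
}

# Made-up 10 bp guides for illustration.
guides <- c(g1 = "ACGTACGTAC", g2 = "TTTTGGGGCC")

exact_hit    <- assign_guide("ACGTACGTAC", guides)  # identical to g1
one_mismatch <- assign_guide("ACGTACGTAA", guides)  # last base differs from g1
two_mismatch <- assign_guide("ACGAACGTAA", guides)  # two bases differ from g1
```

Returning NA for ambiguous reads is a deliberate choice here: if two guides differ by only one or two bases, a mismatched read cannot be assigned to either with confidence.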

run_sgrna_quant fails

Hi,

I see the following error:

Error in data.frame(sgRNA = quant_ret$sgRNA, sequence = quant_ret$sequence) :
arguments imply differing number of rows: 79633, 79637

traceback()
3: stop(gettextf("arguments imply differing number of rows: %s",
paste(unique(nrows), collapse = ", ")), domain = NA)
2: data.frame(sgRNA = quant_ret$sgRNA, sequence = quant_ret$sequence)
1: run_sgrna_quant(LIBRARY_FASTA, df_design)

Suggest adding a more robust check in the output section.
Cheers
Alistair
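
One possible cause of a row-count mismatch like this (an assumption, not confirmed in the thread) is that the library FASTA contains duplicated sgRNA names or duplicated sequences. The base R sketch below checks for both before quantification. The helper names are hypothetical, the example library is made up, and the parser assumes one sequence line per header.

```r
# Read a FASTA file with exactly one sequence line per header (assumption).
read_sgrna_fasta <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  stopifnot(sum(is_header) == sum(!is_header))
  data.frame(
    sgRNA = sub("^>", "", lines[is_header]),
    sequence = toupper(lines[!is_header]),
    stringsAsFactors = FALSE
  )
}

# Report names and sequences that occur more than once in the library.
find_library_duplicates <- function(library) {
  list(
    duplicated_names = unique(library$sgRNA[duplicated(library$sgRNA)]),
    duplicated_sequences = unique(library$sequence[duplicated(library$sequence)])
  )
}

# Made-up library with one repeated name (g2) and one repeated sequence (ACGT).
example_fasta <- tempfile(fileext = ".fasta")
writeLines(c(">g1", "ACGT", ">g2", "TTTT", ">g3", "ACGT", ">g2", "GGGG"), example_fasta)
duplicates <- find_library_duplicates(read_sgrna_fasta(example_fasta))
```

If either list is non-empty, deduplicating the library before calling run_sgrna_quant is a reasonable first thing to try.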

Error arising in run_sgrna_quant

Hello,

I have previously run CB2 successfully and enjoyed the methods as well as the documentation. When I went to run it on a different experiment, I received the error below. I thought it might have to do with my library construction, but I used a Python dictionary to populate the .fasta file, so each value should be unique. The row names in the warning don't look like sgRNA names, and I'm not sure where the issue is coming from.


Error in `.rowNamesDF<-`(x, value = value) :
  duplicate 'row.names' are not allowed
Calls: run_sgrna_quant ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
non-unique values when setting 'row.names': ‘00:00:00_10’, ‘00:00:00_3’, ‘00:00:00_5’, ‘00:00:00_7’, ‘00:00:00_9’

thanks,

Karson

Choice of count normalisation

Hi, and thank you for producing this package and the associated paper. I am interested in your idea of using the beta-binomial distribution for modelling CRISPR screen count data and in your comparisons with the MAGeCK package. I was wondering why CB2 normalises only by total library size rather than with a method that attempts to handle the imbalance caused by highly abundant sequences.

As I understand it, MAGeCK applies a median ratio normalisation (very similar to the one carried out by DESeq2) to the counts before evaluating fold changes. Does it make sense to also apply a median ratio normalisation before running the measure_sgrna_stats function from CB2, or should we stick to the total library size normalisation (implemented as get_CPM in CB2)?
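
To make the difference between the two normalisations concrete, here is a small base R sketch on made-up counts in which sample B is sequenced twice as deeply as sample A and additionally contains one highly abundant guide. It implements a generic median-of-ratios calculation, not MAGeCK's or DESeq2's exact code, and it does not show whether measure_sgrna_stats accepts pre-normalised counts.

```r
# Made-up counts: columns are samples, rows are guides.
counts <- matrix(
  c(100, 200, 300, 400, 1000,     # sample A
    200, 400, 600, 800, 50000),   # sample B: 2x depth plus one dominant guide
  ncol = 2,
  dimnames = list(paste0("sg", 1:5), c("A", "B"))
)

# Total library size normalisation (counts per million).
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# Median-of-ratios size factors: the median, per sample, of each guide's
# count divided by that guide's geometric mean across samples.
log_geo_mean <- rowMeans(log(counts))
usable <- is.finite(log_geo_mean)  # skip guides with a zero count
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median((log(sample_counts) - log_geo_mean)[usable]))
})
median_ratio <- sweep(counts, 2, size_factors, "/")
```

In this toy case the dominant guide inflates sample B's library size, so CPM makes sg1 to sg4 look depleted in B, while the median-of-ratios values for sg1 to sg4 agree between the two samples.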

maximum guide length

I have had great success using CB2 for CRISPR studies and greatly appreciate your work on this package!

I have a project where I am looking to analyze enrichment of guide pairs rather than individual guides. In another workflow, I simply merge the two guide sequences in my FASTA reference files and also merge the two reads from the FASTQ files, making a 42-mer in each file that is quantified by a simple k-mer match. So far I have been unsuccessful in using CB2 to quantify the guides when they are merged. I have been able to confirm that the individual guides can be quantified properly by CB2, and that the k-mer matching workflow I have used in the past quantifies the merged guides properly.

Can you please let me know what is the limit in guide length allowable to search with CB2?

Many thanks!

How to handle paired-end fastqs

Hi there. I really appreciate your continued work on this tool.

For my use case, I have paired-end read data (R1.fq, R2.fq). What is the appropriate way to prepare/input these files for use with CB2? It's not clear to me from the source / documentation what would be the correct approach.

Cluster setup failed. 31 of 31 workers failed to connect.

R 4.1 has a nasty bug involving the parallel package that seems to have been giving me Cluster setup failed. 31 of 31 workers failed to connect. errors whenever I attempt to run run_sgrna_quant without first executing parallel:::setDefaultClusterOptions(setup_strategy = "sequential"). This fix however makes execution speed untenably slow, so I've had to abandon R 4.1 altogether for this package and am now running CB2 in a singularity container with a prior R version.

Somewhat related, you may want to investigate replacing the parallel dependency with parallelly or BiocParallel. Some of the parallel functions used in CB2 (e.g. detectCores) are now deprecated, but have replacements in parallelly and BiocParallel.

Edit: For anyone who is having trouble running CB2 on R 4.1, but doesn't want to mess with rolling back to a prior version, this Singularity environment works well:

Bootstrap: docker
From: rocker/tidyverse:4.0.5

%post
    # Linux Dependencies
    # RcppArmadillo/conquer lib dependencies
    # apt-get quiet level 2 (implies -y)
    apt-get update -qq && apt-get -y --no-install-recommends install \
    liblapack-dev \
    liblapack3 \
    libopenblas-base \
    libopenblas-dev && \
    rm -rf /var/lib/apt/lists/*

    # R stuff
    # CB2 dependency multtest
    R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) { install.packages('BiocManager') }"
    R -e "BiocManager::install('multtest')"
    # CB2
    R -e "install.packages('CB2', method='wget')"

%environment
    export R_VERSION=4.0.5
    export TERM=xterm
    export LC_ALL=en_US.UTF-8
    export LANG=en_US.UTF-8
    export R_HOME=/usr/local/lib/R
    export CRAN=https://packagemanager.rstudio.com/cran/__linux__/focal/2021-05-17
    export TZ=Etc/UTC

CB2 vhat estimation with variable total read count

I am analysing a CRISPR screen dataset using CB2 and noticed that the software is sensitive to changes in the total raw read count per sample. For example, if I take the Evers_CRISPRn_RT112 example dataset and multiply the input read count data by a factor of 10, the vhat estimate (derived from the fit_ab function) for each guide changes. This appears to affect guides non-uniformly. I have attached an Rscript and output plot to illustrate this.

Could you describe why this happens? My expectation was to find that multiplying everything by 10 would not affect the variance estimation, as the data are normalised for analysis.

cb2_raw_read_count_vhat-1.R.zip
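
One general property of the beta-binomial model may be relevant here. For a count X out of n reads with mean proportion p and s = alpha + beta, the variance of the proportion is Var(X/n) = p(1 - p)(s + n) / (n(s + 1)). Part of this variance shrinks with n (the binomial sampling part) and part does not (the overdispersion part), so multiplying n by 10 changes the variance by a factor that depends on s. Whether the vhat returned by fit_ab corresponds exactly to this quantity is an assumption; the base R sketch below only illustrates the formula, with made-up parameter values.

```r
# Variance of the proportion X/n when X ~ BetaBinomial(n, alpha, beta),
# written in terms of p = alpha / (alpha + beta) and s = alpha + beta.
bb_proportion_variance <- function(p, s, n) {
  p * (1 - p) * (s + n) / (n * (s + 1))
}

p <- 1e-5  # made-up guide proportion
n <- 1e6   # made-up total read count

# Nearly binomial guide (very large s): variance is dominated by sampling noise.
ratio_low_dispersion <-
  bb_proportion_variance(p, s = 1e8, n) / bb_proportion_variance(p, s = 1e8, 10 * n)

# Strongly overdispersed guide (small s): variance barely depends on n.
ratio_high_dispersion <-
  bb_proportion_variance(p, s = 10, n) / bb_proportion_variance(p, s = 10, 10 * n)
```

Under this formula the first ratio is close to 10 and the second is close to 1, which would be consistent with a tenfold increase in read counts affecting guides non-uniformly even though the proportions themselves are unchanged.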

plot_count_distribution add export option

add option to return the ggplot object to user

export join_count_and_design

This would be useful for users wishing to do their own plots.

Error in measure_gene_stats(sgrna_stat)

Hi, I am trying to get gene stats from the code, but I get the following error:


Error in measure_gene_stats(sgrna_stat) : 
  It looks like `sgrna_stat` does not contain any result of a statistical test.

I used the counts produced by MAGeCK as input, and the format matches the CB2 format. How can I fix this issue? The sgRNA stats worked perfectly.

Thanks in advance

run_sgrna_quant reports the wrong sequences for sgRNA names

Thank you very much for the CB2 package, it is great and provides us with the quantification of the sgRNA in a very quick and easy way.

When running run_sgrna_quant, it outputs the correct count matrix with respect to the reference sgRNA names, but the sequences associated with the reference sgRNAs are incorrect. The problem seems to arise from the C++ function quant. It returns Rcpp::_["sequence"] = ref.seq, which somehow is not working. Replacing ref.seq with sgRNA_hash hopefully fixes the problem.

I am unsure about the fix but hopefully, you can find an easy solution to return the correct sequences associated with the ref. sgRNA names.

run_sgrna_quant not found

Unfortunately, when I try and follow the tutorial available from CRAN, the function run_sgrna_quant was not found. Could you resolve this issue?

inconsistent between .fq.gz and .fastq?

Strange as this may sound, I get different results when using a .fq.gz vs a .fastq file (when the .fastq is a zcat of original). I get different mappability results (0.018 ish vs the expected 85%).

Would be great if this was handled automatically and input from .fq.gz was handled correctly, or alternatively at least a warning that this might be what's happening.

Otherwise - very useful software thank you!
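
A quick way to rule out a mismatch between a file's extension and its actual content is to look at its first two bytes, since gzip files start with the magic bytes 0x1f 0x8b. The base R sketch below does this check. The helper name `is_gzip_file` is hypothetical, the example FASTQ record is made up, and the thread does not confirm that a mislabelled file is the cause of the discrepancy.

```r
# TRUE when the file starts with the gzip magic bytes 0x1f 0x8b.
is_gzip_file <- function(path) {
  # raw = TRUE disables R's automatic detection of compressed files,
  # so the bytes are read exactly as stored on disk.
  con <- file(path, open = "rb", raw = TRUE)
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 2L)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Made-up single-read FASTQ, written once as plain text and once gzip-compressed.
fastq_record <- c("@read1", "ACGTACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIIIIIII")
plain_path <- tempfile(fileext = ".fastq")
writeLines(fastq_record, plain_path)
gz_path <- tempfile(fileext = ".fq.gz")
gz_con <- gzfile(gz_path, "wt")
writeLines(fastq_record, gz_con)
close(gz_con)

plain_is_gzip <- is_gzip_file(plain_path)
gz_is_gzip <- is_gzip_file(gz_path)
```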

Problem with gene-level statistic

Hi,

I managed to run the sgRNA-level statistics, but when I try to compute gene-level statistics with the measure_gene_stats() function, I get this error message:
Error in library.dynam(lib, package, package.lib) :
  shared object ‘Matrix.so’ not found
In addition: Warning message:
S3 methods ‘print.sparseSummary’, ‘print.diagSummary’, ‘c.abIndex’, ‘c.sparseVector’, ‘as.array.Matrix’, ‘as.array.sparseVector’, ‘as.matrix.Matrix’, ‘as.matrix.sparseVector’, ‘as.vector.Matrix’, ‘as.vector.sparseVector’ were declared in NAMESPACE but not found
How could I fix it?
Thanks in advance!

bw,
Zsolt

Clarity around logFC for gene_stats

Hi,

For sgRNA the code is
log2(est$cpm_b + 1) - log2(est$cpm_a + 1)
And for gene level:
mean(logFC) of all sgRNA in that gene.

This has the unfortunate effect that if you try and re-calculate the logFC from the cpm values in the sheet level data, you get a difference. It is worth being explicit about this, or having multiple CPMs (although that would create confusion).

Best wishes
Alistair
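
The discrepancy described above comes from the fact that the mean of per-sgRNA log fold changes is generally not equal to the log fold change of the mean CPMs. The base R sketch below shows this on two made-up guides, using the sgRNA-level formula quoted in this issue. It assumes, for illustration only, that the gene-level cpm_a and cpm_b columns are plain averages of the sgRNA-level values.

```r
# Two made-up guides for one gene: a low-abundance guide that changes
# tenfold and a high-abundance guide that does not change.
sgrna <- data.frame(
  cpm_a = c(1, 1000),
  cpm_b = c(10, 1000)
)

# sgRNA-level logFC as quoted in this issue.
sgrna$logFC <- log2(sgrna$cpm_b + 1) - log2(sgrna$cpm_a + 1)

# Gene-level logFC as described in this issue: the mean of the sgRNA logFCs.
gene_logfc_mean_of_guides <- mean(sgrna$logFC)

# What a reader gets when recomputing logFC from the averaged CPM columns.
gene_logfc_from_mean_cpm <-
  log2(mean(sgrna$cpm_b) + 1) - log2(mean(sgrna$cpm_a) + 1)
```

Here the mean of the guide-level values is above 1, while the value recomputed from the averaged CPMs is close to 0, because the high-abundance guide dominates the averaged CPMs but not the averaged log fold changes.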

gene_stat cpms

Hi,

I've come across a case with confusing results:

At the gene_stat level, we sometimes get cpm_a < cpm_b while logFC is negative.
It's to do with the combining of the replicates, but I think it would be useful to have information at the gene level on how many probes are consistent (in this case 3 of 4), or perhaps number_sgrna_up and number_sgrna_down?

Thanks
Alistair

Gene level:
n_sgrna  cpm_a     cpm_b     logFC
4        10.20239  28.36188  -1.069289

sgRNA level:
n_a  n_b  phat_a        vhat_a  phat_b        vhat_b  cpm_a      cpm_b         logFC
1    1    1.461569e-05  0       8.329479e-07  0       14.615689  0.83294789    -3.090759
1    1    1.686701e-05  0       1.114901e-04  0       16.867005  111.49007463  2.654428
1    1    5.288807e-06  0       8.329479e-08  0       5.288807   0.08329479    -2.537360
1    1    4.038075e-06  0       1.041185e-06  0       4.038075   1.04118486    -1.303466
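
The suggested direction counts can be computed from the sgRNA-level table with a few lines of base R. The sketch below uses the four rows posted in this issue; the column names number_sgrna_up and number_sgrna_down follow the suggestion above and are not existing CB2 output columns.

```r
# sgRNA-level values copied from the table in this issue.
sgrna <- data.frame(
  cpm_a = c(14.615689, 16.867005, 5.288807, 4.038075),
  cpm_b = c(0.83294789, 111.49007463, 0.08329479, 1.04118486),
  logFC = c(-3.090759, 2.654428, -2.537360, -1.303466)
)

# Gene-level summary: averages plus the number of guides in each direction.
gene_summary <- data.frame(
  n_sgrna = nrow(sgrna),
  cpm_a = mean(sgrna$cpm_a),
  cpm_b = mean(sgrna$cpm_b),
  logFC = mean(sgrna$logFC),
  number_sgrna_up = sum(sgrna$logFC > 0),
  number_sgrna_down = sum(sgrna$logFC < 0)
)
```

Averaging these four rows reproduces the gene-level cpm_a, cpm_b and logFC shown above, and the counts make it visible that a single guide with a large cpm_b drives the averaged CPM while three of the four guides point down.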

Interaction Terms and Complex Designs

I was wondering if more complex experimental designs can be investigated using CB2, for example two different conditions with three levels each, and interaction terms between the two. The vignettes don't really specify much beyond a simple case/control setup (please correct me if I'm wrong).

E.g. with Condition1 either Control, Low, High and Condition2 being Control, A, B.

Y ~ Condition1 + Condition2 + Condition1:Condition2

Error in cb2_count()

Hi,
I am trying to use CB2 in R to analyze my CRISPR data.
The package is installed properly (I guess) and the example data runs fine.
With my own data, I can load the files correctly, but I am having trouble running the next steps.

Error:
cb2_count <- run_sgrna_quant(FASTA, df_design)
Error in data.frame(sgRNA = quant_ret$sgRNA, sequence = quant_ret$sequence) :
arguments imply differing number of rows: 201322, 202586

Commands (successful)

FASTA <- "/CB2analysis/CRISPRi_v2_human.trim_1_39_forward.fa"

df_design <- tribble(
  ~group, ~sample_name,
  "Pre",  "IMEV0023-Pre_R1_001",
  "Post", "IMEV0023-Post_R1_001",
  "C2",   "IMEV0023-C2-S1_R1_001",
  "C6",   "IMEV0023-C6-S4_R1_001"
) %>% mutate(
  fastq_path = glue("{ex_path}/{sample_name}.fastq.gz")
)

df_design
# A tibble: 4 × 3
  group  sample_name            fastq_path
1 Pre    IMEV0023-Pre_R1_001    /extdata/Screen23i/IMEV0023-Pre_R1_001.f…
2 Post   IMEV0023-Post_R1_001   /extdata/Screen23i/IMEV0023-Post_R1_001.…
3 C2     IMEV0023-C2-S1_R1_001  /extdata/Screen23i/IMEV0023-C2-S1_R1_001…
4 C6     IMEV0023-C6-S4_R1_001  /extdata/Screen23i/IMEV0023-C6-S4_R1_001…

I am using Rstudio: 2022.12.0+353
R 4.2.2

If you can point me in the right direction for resolving this, that would be highly appreciated.
HJ
