lschwcp_2023's Introduction

Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression

This repository contains data, code, and figures generated for the manuscript:

Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). [Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression](https://www.biorxiv.org/content/10.1101/2023.12.11.571168). bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168

The preprint is posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.11.571168

The Notebooks folder contains the code for all analyses in the preprint, from pre-processing of the raw data to final figure generation. Each notebook links directly to Google Colaboratory, where it can be executed.

Large datasets are stored on Caltech Data and can be accessed under the DOIs 10.22002/krqmp-5hy81 and 10.22002/k7xqw-88d74.

An interactive Krona plot shows all viruses expressed above the QC threshold in macaque cells that passed quality control, broken down by animal, timepoint, taxonomy, and fraction of positive cells occupied by each virus. The repository also provides code to reproduce the Krona plot.

The precomputed_refs folder contains precomputed reference indices for detecting viral RNA in sequencing data by alignment to the optimized PalmDB, with the human (or mouse) genome and transcriptome masked.

A description of kallisto, bustools, and kb-python including tutorials for their use can be found here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v1

# 1. Install kb-python (optional: install gget to fetch the host genome and transcriptome)
pip install kb-python gget

# 2. Download optimized PalmDB reference files
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt

# 3. Create reference index (+ optional masking of the host, here human, genome using the D-list)
# Single-thread runtime: 1.5 h; Max RAM: 4.4 GB; Size of generated index: 593 MB
# Without D-list: Single-thread runtime: 3.5 min; Max RAM: 3.9 GB; Size of generated index: 592 MB
kb ref \
    --aa \
    --d-list $(gget ref --ftp -w dna homo_sapiens) \
    -i index.idx --workflow custom \
    palmdb_rdrp_seqs.fa
    
# 4. Align sequencing reads
# Single-thread runtime: 1.5 min / 1 million sequences; Max RAM: 2.1 GB
kb count \
    --aa \
    -i index.idx -g palmdb_clustered_t2g.txt \
    --parity single \
    -x default \
    $USER_DATA.fastq.gz
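
The runtime quoted in step 4 can be scaled to get a rough idea of how long a larger dataset might take. The sketch below is only a back-of-the-envelope estimate: the function name `estimate_align_hours` is hypothetical, and it assumes that runtime grows linearly with the number of reads and shrinks in proportion to the number of threads, which is optimistic for real hardware.

```shell
# Rough wall-clock estimate for the alignment step, scaled from the benchmark
# above (1.5 min per 1 million sequences on a single thread).
# Assumption: linear scaling with read count and ideal scaling with threads.
estimate_align_hours() {
    # $1 = number of reads, $2 = number of threads
    awk -v reads="$1" -v threads="$2" \
        'BEGIN { printf "%.1f\n", reads / 1e6 * 1.5 / 60 / threads }'
}

# 100 million reads on one thread: 100 x 1.5 min = 150 min
estimate_align_hours 100000000 1   # prints 2.5 (hours)
```

Actual runtimes depend on read length, disk speed, and whether the D-list index is used, so the result is best read as an order-of-magnitude guide.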

lschwcp_2023's Issues

Usage for 10X datasets

Hi,
Thank you for this great approach to viral transcript quantification.

I am wondering how one would run this on 10x datasets. I noted that you benchmarked this on a Seq-Well experiment and a Parse Biosciences combinatorial indexing dataset.

In your example code, there is no assignment for barcode positions, so I presume that for Seq-Well and Parse datasets this is auto-detected?

Should I just follow the kb-python tutorial and assign the 10x chemistry in the technology parameter, but substitute the index as shown in your code?

Many thanks in advance
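
A minimal sketch of what such a call might look like, assuming the `index.idx` and `palmdb_clustered_t2g.txt` files from steps 2 and 3 of the README and a 10x v3 library. The helper name `build_kb_count_10x` and the FASTQ file names are hypothetical, and the function only prints the command so it can be inspected before running. Note that the `--parity` option from the bulk example is left out here, since the 10x technology string already defines the read layout.

```shell
# Print a kb count command for a droplet-based library aligned to the PalmDB index.
# Assumptions: index.idx and palmdb_clustered_t2g.txt exist in the working
# directory; the technology string and FASTQ names are placeholders.
build_kb_count_10x() {
    # $1 = technology string (e.g. 10xv3), $2 = barcode/UMI read, $3 = cDNA read
    printf 'kb count --aa -i index.idx -g palmdb_clustered_t2g.txt -x %s %s %s\n' \
        "$1" "$2" "$3"
}

# Dry run: show the command; copy it or pipe it to a shell to execute it.
build_kb_count_10x 10xv3 sample_R1.fastq.gz sample_R2.fastq.gz
```

Running `kb --list` shows the technology strings that the installed kb-python version supports.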

Query regarding multiple sample runs

Hi. I had a question.

For my 10X FASTQ data, I have one SRR sample with multiple samples within it (SRR1_S1, SRR1_S2, SRR1_S3, SRR1_S4). Each has its own paired reads.

How should I format my input to account for this? I would prefer to treat it as one sample, aggregating all the forward and reverse reads as the input for fastq_1 and fastq_2, but if you have any other ideas, that would be great.

Thank you in advance !
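
One possible approach, sketched under assumptions: `kb count` accepts several FASTQ pairs in a single call, listed as R1 R2 R1 R2 and so on, and processes them together. The helper below (hypothetical name `list_fastq_pairs`) assumes files named `<prefix>_R1.fastq.gz` and `<prefix>_R2.fastq.gz` in one directory and paths without spaces. Pooling is only appropriate if the files come from the same library, since cell barcodes from different libraries can collide.

```shell
# Emit all R1/R2 pairs found in a directory, in sorted order, on a single line.
# Assumption: mates are named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz.
list_fastq_pairs() {
    pairs=""
    for r1 in "$1"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # skip the literal pattern if nothing matches
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # derive the mate file name
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            return 1
        fi
        pairs="$pairs $r1 $r2"
    done
    printf '%s\n' "${pairs# }"                   # drop the leading space
}

# Dry run: print the resulting command for the (placeholder) directory "fastqs".
echo "kb count --aa -i index.idx -g palmdb_clustered_t2g.txt -x 10xv3 $(list_fastq_pairs fastqs)"
```

Keeping the pairs in a consistent order matters because the reads are consumed pairwise in the order given.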

Alignment to palmDB viral index is too slow compared to host index for scRNAseq 10xV3 reads

Hi,

Thank you for the fantastic method for viral quantification and sharing various analyses notebooks.

I am trying to apply your method to a large set of human 10x scRNAseq datasets of ~50 billion reads in order to detect and quantify viral sequences from PalmDB. I am using option 7, i.e., capturing host reads before alignment to PalmDB. During the alignment step, the host reads took around 50-60 hours to align, and the viral reads are taking far longer to align to PalmDB.

The job has now been running for over 4 days and has processed close to 25% of the 50 billion reads. Is there a way to speed up the alignment of viral reads, or is it expected to take this long to align to PalmDB given the large number of 10xV3 sequencing reads?
Here is the command I used to align the reads to PalmDB:

kallisto bus -n --aa -i ./humanCDNA_masked__virus_index.idx -o ./virus -t 8 -B ./HIVdonors_allSamples_batchFile.txt --batch-barcodes --rf-stranded -x 10xv3 --verbose

Any way to speed up this viral alignment would be really helpful. Thank you so much for your help!
