
Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression

This repository contains data, code, and figures generated for the manuscript:

Laura Luebbert, Delaney K Sullivan, Maria Carilli, Kristján Eldjárn Hjörleifsson, Alexander Viloria Winnett, Tara Chari, Lior Pachter (2023). [Efficient and accurate detection of viral sequences at single-cell resolution reveals novel viruses perturbing host gene expression](https://www.biorxiv.org/content/10.1101/2023.12.11.571168). bioRxiv 2023.12.11.571168; doi: https://doi.org/10.1101/2023.12.11.571168

The preprint is posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2023.12.11.571168

The Notebooks folder contains the code for all analyses in the preprint, from pre-processing of the raw data to final figure generation. Each notebook links directly to Google Colaboratory, where it can be executed.

Large datasets are stored on Caltech Data and can be accessed under the DOIs 10.22002/krqmp-5hy81 and 10.22002/k7xqw-88d74.

An interactive Krona plot shows all viruses expressed above the QC threshold in macaque cells that passed quality control, broken down by animal, timepoint, taxonomy, and the fraction of positive cells occupied by each virus. Code to reproduce the Krona plot is also provided.

The precomputed_refs folder contains precomputed reference indices for detecting viral RNA in sequencing data by alignment to the optimized PalmDB, with the human (or mouse) genome and transcriptome masked.

A description of kallisto, bustools, and kb-python including tutorials for their use can be found here: https://www.biorxiv.org/content/10.1101/2023.11.21.568164v1



# 1. Install kb-python (optional: install gget to fetch the host genome and transcriptome)
pip install kb-python gget

# 2. Download optimized PalmDB reference files
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_rdrp_seqs.fa
wget https://raw.githubusercontent.com/pachterlab/LSCHWCP_2023/main/PalmDB/palmdb_clustered_t2g.txt

# 3. Create reference index (+ optional masking of the host, here human, genome using the D-list)
# Single-thread runtime: 1.5 h; Max RAM: 4.4 GB; Size of generated index: 593 MB
# Without D-list: Single-thread runtime: 3.5 min; Max RAM: 3.9 GB; Size of generated index: 592 MB
kb ref \
    --aa \
    --d-list $(gget ref --ftp -w dna homo_sapiens) \
    -i index.idx --workflow custom \
    palmdb_rdrp_seqs.fa
    
# 4. Align sequencing reads
# Single-thread runtime: 1.5 min / 1 million sequences; Max RAM: 2.1 GB
kb count \
    --aa \
    -i index.idx -g palmdb_clustered_t2g.txt \
    --parity single \
    -x default \
    $USER_DATA.fastq.gz


lschwcp_2023's Issues

Usage for 10X datasets

Hi,
Thank you for this great approach to viral transcript quantification.

I would like to clarify how to run this on 10x datasets. I noted that you benchmarked the method on a Seq-Well experiment and a Parse Biosciences combinatorial indexing dataset.

Your example code does not assign barcode positions, so I presume these are auto-detected for the Seq-Well and Parse datasets?

Should I follow the kb-python tutorial and specify the 10x chemistry in the technology parameter, but substitute the index as shown in your code?

Many thanks in advance
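
A minimal sketch of how the example command might be adapted for 10x data. It is a dry run that only prints the command. It assumes that the chemistry is passed through the `-x` technology parameter (as in the kb-python tutorial), that `--parity single` is dropped, and that R1 and R2 are supplied as a pair. These choices are assumptions and are not confirmed by the authors in this thread. The FASTQ names and the `virus_out` output folder are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: prints a kb count command for a 10x run instead of executing it.
# Assumptions (not confirmed in this thread):
#   - the technology string (default here: 10xv3) matches the library chemistry
#   - index.idx and palmdb_clustered_t2g.txt were created as in the example above
#   - FASTQ names and the output folder "virus_out" are hypothetical placeholders
build_10x_cmd() {
    r1="$1"
    r2="$2"
    tech="${3:-10xv3}"
    printf 'kb count --aa -i index.idx -g palmdb_clustered_t2g.txt -x %s -o virus_out %s %s\n' \
        "$tech" "$r1" "$r2"
}

# Print the command for one hypothetical sample
build_10x_cmd sample_R1.fastq.gz sample_R2.fastq.gz
```

Printing the command first makes it easy to check the technology string and file order before starting a long alignment.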

Query regarding multiple sample runs

Hi, I have a question.

For my 10x FASTQ data, I have one SRR sample with multiple sub-samples within it (SRR1_S1, SRR1_S2, SRR1_S3, SRR1_S4). Each has its own pair of read files.

How should I format my input to account for this? I would prefer to treat it as one sample and aggregate all the forward and reverse reads as the input for fastq_1 and fastq_2, but any ideas would be welcome.

Thank you in advance!
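
One possible approach, sketched under assumptions: kb count accepts several FASTQ pairs in a single call when they are listed in interleaved order (R1, R2, R1, R2, ...), so the sub-samples can be treated as one sample without concatenating files. The helper below only builds that ordered list. The `<prefix>_R1.fastq.gz` / `<prefix>_R2.fastq.gz` naming scheme and the `-x 10xv3` chemistry in the printed command are hypothetical, and paths containing spaces are not handled.

```shell
#!/bin/sh
# Sketch: print all FASTQ files in a folder as interleaved R1/R2 pairs.
# Assumption: files follow a hypothetical <prefix>_R1.fastq.gz / <prefix>_R2.fastq.gz scheme.
interleave_pairs() {
    dir="$1"
    out=""
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                  # glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"       # derive the mate file name
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1" >&2
            return 1
        fi
        out="${out:+$out }$r1 $r2"                # append the pair, space-separated
    done
    printf '%s\n' "$out"
}

# Demo with empty placeholder files in a temporary folder
demo=$(mktemp -d)
for s in S1 S2 S3 S4; do
    touch "$demo/SRR1_${s}_R1.fastq.gz" "$demo/SRR1_${s}_R2.fastq.gz"
done
# Dry run: print the command rather than executing it
echo "kb count --aa -i index.idx -g palmdb_clustered_t2g.txt -x 10xv3 $(interleave_pairs "$demo")"
rm -rf "$demo"
```

Failing on a missing mate file is a deliberate choice here, since a silently skipped pair would shift the R1/R2 order for every file after it.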

Alignment to palmDB viral index is too slow compared to host index for scRNAseq 10xV3 reads

Hi,

Thank you for the fantastic method for viral quantification and sharing various analyses notebooks.

I am trying to apply your method to a large set of human 10x scRNA-seq datasets (~50 billion reads) to detect and quantify viral sequences from PalmDB. I am using option 7, i.e., capturing host reads before alignment to PalmDB. During the alignment step, the host reads took around 50-60 hours to align, and the viral alignment to PalmDB is taking much longer.

It has currently been running for over 4 days and has processed close to 25% of the 50 billion reads. Is there a way to speed up the alignment of viral reads, or is this runtime expected for PalmDB given the large number of 10x v3 sequencing reads?
Here is the command I used to align the reads to PalmDB:

kallisto bus -n --aa -i ./humanCDNA_masked__virus_index.idx -o ./virus -t 8 -B ./HIVdonors_allSamples_batchFile.txt --batch-barcodes --rf-stranded -x 10xv3 --verbose

Any suggestions for speeding up this viral alignment would be really helpful. Thank you so much for your help!
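
As a rough cross-check, the runtime can be estimated from the benchmark quoted in the example above (about 1.5 min per 1 million sequences, single-threaded). The sketch below is back-of-the-envelope arithmetic only. It assumes that the benchmark transfers to 10x v3 reads and to this hardware, and that runtime scales linearly with thread count, none of which is confirmed in this thread.

```shell
#!/bin/sh
# Sketch: estimate alignment wall-clock hours from the quoted benchmark.
# Assumptions: 1.5 min per 1 million reads on one thread (from the example above),
# and perfectly linear scaling with the number of threads (optimistic).
estimate_hours() {
    reads="$1"
    threads="${2:-1}"
    awk -v r="$reads" -v t="$threads" \
        'BEGIN { printf "%.0f\n", r / 1e6 * 1.5 / 60 / t }'
}

# 50 billion reads on 1 thread, then on 8 threads (-t 8 as in the command above)
echo "1 thread:  $(estimate_hours 50000000000 1) h"
echo "8 threads: $(estimate_hours 50000000000 8) h"
```

Under these assumptions, 50 billion reads correspond to 1250 hours on a single thread and roughly 156 hours (about 6.5 days) on 8 threads, so a multi-day runtime is at least of the same order of magnitude as the quoted benchmark.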
