nanoranger

nanoranger is a processing tool for long-read single-cell transcriptomics as described in our Nature Communications paper, and spatial transcriptomics as described in our Immunity paper.

Workflow

The input data can be obtained by sequencing 10x Genomics whole-transcriptome cDNA libraries, or amplicons obtained through targeted amplification, on Oxford Nanopore Technologies (ONT) or Pacific Biosciences devices. A schematic of the workflow is shown below.

[Figure: workflow schematic]

If you have a question about the software, or have any suggestions or ideas for new features or collaborations, feel free to create an issue here on GitHub, or write an email to [email protected].

Background

Two of the main challenges of ONT data analysis for single-cell applications have been (i) a higher sequencing error rate compared to Illumina data and (ii) the variable location of cell barcodes and unique molecular identifiers (UMIs) within each sequenced transcript.

To overcome these challenges, nanoranger introduces two innovations:

  • The processing pipeline starts with alignment of reads to a transcriptome reference. This initial transcriptome alignment step enables orientation and extraction of 'subread' components - the transcript and the part of the read upstream or downstream of the transcript that contains barcode and UMI.

    By extracting the flanking (soft-clipped) portions of a transcript alignment, it is possible to reliably assign cell barcodes to their transcript. This also limits the search space from a region of typically 200 nt at both ends of a read to a small 50 nt window, which both speeds up barcode matching and reduces the chance of assigning wrong barcodes to transcripts.

    Another feature automatically enabled by this approach is the recovery and reliable quantification of fused/follow-on reads (also called informatic chimeras), which newer ONT chemistries (LSK112 and LSK114) generate abundantly. This is done by processing all supplementary transcript alignments for each read, which occasionally leads to extraction of hundreds of supplementary transcripts from a single read. By accounting for such events we can recover up to 50% more usable transcripts from the raw reads. A natural extension of this feature is the processing and deconcatenation of libraries generated with concatenation methods such as MAS-ISO-seq, made commercially available as the Kinnex kit.

  • To perform barcode matching while accounting for indels and mismatches, nanoranger uses an aligner-based technique: the barcode components of the subreads are aligned against a reference of known barcodes (such as the 737K whitelist for 10x Genomics 5' libraries included here).
    Compared to techniques that rely solely on adapter identification, this approach avoids missed or erroneously assigned barcodes caused by frameshifts from errors in the flanking adapters. To achieve this, nanoranger uses STAR with a number of changes to the default options. The primary modification is switching the alignment mode to EndToEnd, instead of the default soft-clipping of unaligned read ends, which forces all bases of the barcode candidates to be mapped to the reference. At the same time, nanoranger pads the whitelist of barcodes with unknown nucleotides to avoid penalizing the adapter and UMI sequences that are kept in the barcode candidate reads.
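The subread extraction described in the first point can be illustrated with a minimal sketch. It is not nanoranger's own code: the function name `split_subreads` is hypothetical, and it assumes the aligner reports clipped ends as soft clips (`S`) so that the full read sequence is available. The 50 nt window follows the figure quoted above.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def split_subreads(seq, cigar, flank=50):
    """Split a read into (left_flank, transcript, right_flank).

    `seq` is the full read sequence and `cigar` its alignment CIGAR string.
    Only leading and trailing soft clips are used; hard-clipped alignments
    are outside the scope of this sketch.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    end = len(seq) - right
    transcript = seq[left:end]
    # Keep at most `flank` bases adjacent to the aligned transcript.
    left_flank = seq[max(0, left - flank):left]
    right_flank = seq[end:end + flank]
    return left_flank, transcript, right_flank


if __name__ == "__main__":
    read = "A" * 60 + "C" * 100 + "G" * 10
    lf, tr, rf = split_subreads(read, "60S100M10S")
    print(len(lf), len(tr), len(rf))
```

Only the flanks would then be searched for a barcode and UMI, which is what shrinks the search space relative to scanning both read ends.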

Different quantification 'modes' are available for different library structures and tasks, and the transcriptome reference can be modified accordingly. For whole-transcriptome gene expression analysis, a GENCODE transcriptome reference can be used. For 5' immune profiling this can be reduced to a reference of V transcripts, and similarly for 3' immune profiling it can be a reference of C transcripts. If a set of targets is used for enrichment from cDNA, analysis can be sped up by using a reference containing only the transcripts that are expected to be present.
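As a rough illustration of building such a reduced reference, the sketch below filters FASTA lines down to a set of targets. The function name `filter_fasta` is hypothetical and not part of nanoranger. It assumes uncompressed FASTA and matches a target against any `|`-delimited field of the first header token, which suits GENCODE-style headers where the gene name is one of those fields.

```python
def filter_fasta(lines, wanted):
    """Yield only the FASTA lines of records whose header matches `wanted`.

    A record is kept when any '|'-delimited field of the first
    whitespace-separated header token is in `wanted`.
    """
    wanted = set(wanted)
    keep = False
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            fields = header[0].split("|") if header else []
            keep = any(field in wanted for field in fields)
        if keep:
            yield line


if __name__ == "__main__":
    # Toy records; real input would be read from a transcriptome FASTA file.
    demo = [
        ">T1|G1|-|-|GENEA-201|GENEA|4|x|\n", "ACGT\n",
        ">T2|G2|-|-|GENEB-201|GENEB|4|x|\n", "TTTT\n",
    ]
    print("".join(filter_fasta(demo, {"GENEA"})), end="")
```

For a real file, the same generator can be fed an open file handle and its output written to a new FASTA that is then passed with `--t`.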

nanoranger has been primarily tested on targeted libraries generated using 10X 5' Chromium and slide-seq 3' platforms. It can be used for immune profiling and genotyping from other library types with minimal modifications.

Further developments for generating count matrices for whole transcriptome libraries as well as addition of other chemistry types are currently underway.

Software Dependencies

This tool has been tested on Python 3.7.10 under CentOS and Ubuntu systems.

The following programs are also assumed to be on your PATH when running the tool. Please refer to the provided link for each to install them before starting your data analysis. Alternatively, they are available as bioconda packages.

STAR is used for barcode correction against a set of known barcodes. With a few changes to its input parameters, STAR is run in a Smith-Waterman-like mode.
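To make the whitelist padding idea concrete, here is a minimal sketch of turning a barcode list into a padded FASTA reference. It is an illustration under stated assumptions, not the pipeline's implementation: the function name and the pad lengths are hypothetical, and the actual values nanoranger uses are not given here. In STAR, end-to-end alignment is selected with `--alignEndsType EndToEnd`; any other parameters the pipeline changes are not shown.

```python
def padded_whitelist_fasta(barcodes, left_pad=10, right_pad=10):
    """Yield FASTA lines with each barcode flanked by N padding.

    The N bases give adapter and UMI sequence in a barcode candidate
    read something to align against when the aligner is forced to place
    every base. Pad lengths here are placeholders, not nanoranger's values.
    """
    for bc in barcodes:
        bc = bc.strip().upper()
        if not bc:
            continue  # skip blank lines in the whitelist
        yield f">{bc}\n"
        yield "N" * left_pad + bc + "N" * right_pad + "\n"


if __name__ == "__main__":
    demo = ["AAACCTGAGAAACCAT", "AAACCTGAGAAACCGC"]
    print("".join(padded_whitelist_fasta(demo, left_pad=4, right_pad=4)), end="")
```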

minimap2 is used for the initial alignment of raw nanopore reads to a transcriptome and, depending on the operation mode, for subsequent alignment to a genome.

SAMtools is used for sorting and indexing BAM files.

pigz is used for compressing output and intermediate FASTA and FASTQ files.

MiXCR is used for VDJ alignment and clonotype extraction. We have used only MiXCR v3 in validating and benchmarking the results against Illumina-based data. Later versions of MiXCR are not fully tested with our workflow and do not seem to be compatible out of the box without tuning parameters.

SeqKit is used for splitting input FASTQ files in the case of very large libraries or libraries prepared with cDNA concatenation. Deconcatenation is sped up by processing the split input files in parallel. To enable this step, set the optional boolean flag --split.
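Before a first run, it can help to confirm that the dependencies listed above are discoverable. The following is a small sketch using only the Python standard library. The executable names in `REQUIRED_TOOLS` are assumptions based on how these tools are commonly installed, so adjust them if your installation differs.

```python
import shutil

# Assumed executable names for the tools listed above.
REQUIRED_TOOLS = ["STAR", "minimap2", "samtools", "pigz", "mixcr", "seqkit"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```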

Download and Install

git clone https://github.com/mehdiborji/nanoranger.git
cd nanoranger
chmod -R +x *
pip install -r requirements.txt

Sample Input Commands For Different Modes

The pipeline supports different chemistries through the --mode flag (written as --m in the examples below).

3pXCR_slideseq

  • Analysis of TCRs/BCRs from a slide-seq (Curio) spatial transcriptomics library (human and mouse C gene transcripts are available in the data folder and are passed to the pipeline with the --t flag; VDJ alignment is performed with MiXCR 3)
python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/slideseq_XCR.fastq.gz \
        --o XCR \
        --e Puck_220509_18 \
        --m 3pXCR_slideseq \
        --b ~/nanoranger/data/slideseq.matched.barcodes.tsv.gz \
        --t ~/nanoranger/data/XR_C_mouse.fa \
        --x mmu

5p10XTCR

  • Analysis of TCRs from a 10x Genomics Chromium 5' library (human and mouse V gene transcripts are available in the data folder and are passed to the pipeline with the --t flag; VDJ alignment is performed with MiXCR 3)
python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/TCR3.fastq.gz \
        --o TCR \
        --e TCR \
        --m 5p10XTCR \
        --t ~/nanoranger/data/TR_V_human.fa \
        --x hsa

5p10XGEX

  • Generation of a BAM file with barcode and UMI tags for variant calling from a 10x Genomics Chromium 5' library (GRCh38.primary_assembly.genome.fa.gz from https://www.gencodegenes.org/human/ can be used)
python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/1022_DNMT3A_RUNX1_SF3B1.fastq.gz \
        --o AML_1022 \
        --e DNMT3A_RUNX1_SF3B1_AML_1022 \
        --m 5p10XGEX \
        --t ~/nanoranger/data/panel_MT_trns.fa \
        --g ~/refs/GRCh38.primary_assembly.genome_v41.fa.gz
  • Detection of known fusions from a 10x Genomics Chromium 5' library (for fusions, genome alignment can be skipped by realigning the extracted transcripts to the initial transcriptome reference)
python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/K562_Kasumi1_BCRABL1_RUNX1_RUNX1T1.fastq.gz \
        --o K562_Kasumi1 \
        --e fusion \
        --m 5p10XGEX \
        --t ~/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa \
        --g ~/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa

Downstream of this process, we may want to collect the transcript-BC-UMI combinations associated with each read and keep only the meaningful fusions, after removing potential chimeras and events with few supporting reads. This can be done by running the following script on the final BAM file:

python ~/nanoranger/scripts/downstream/fusion_gene.py --b fusion_genome_tagged.bam --o fusion_reads.csv

For the RUNX1_RUNX1T1 fusion, we use a primer for the RUNX1T1 transcript close to the fusion site. Reads with a flanking barcode that align to RUNX1 are fusion reads. Such reads have another (supplementary or even primary) alignment to RUNX1T1; however, the flanking region of those alignments does not contain a barcode, so they are dropped automatically during processing. Reads with a flanking barcode that align to RUNX1T1 are wild-type reads.
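The classification rule above can be sketched in a few lines. This is an illustrative example only and does not reproduce fusion_gene.py: the function name, the `min_reads` threshold, the record layout, and the transcript labels `RUNX1` and `RUNX1T1` are all assumptions, and real reference names may differ.

```python
from collections import Counter, defaultdict


def call_runx1_fusion(records, min_reads=2):
    """Count fusion and wild-type molecules per cell barcode.

    `records` is an iterable of (barcode, umi, transcript) tuples, one per
    barcoded alignment. BC-UMI-transcript combinations with fewer than
    `min_reads` reads are ignored, and BC-UMI pairs seen on both
    transcripts are treated as chimeras and skipped.
    """
    support = Counter(records)  # reads per (barcode, umi, transcript)
    by_molecule = defaultdict(dict)
    for (bc, umi, trns), n in support.items():
        if n >= min_reads:
            by_molecule[(bc, umi)][trns] = n

    calls = defaultdict(Counter)
    for (bc, _umi), trns_counts in by_molecule.items():
        if len(trns_counts) != 1:
            continue  # ambiguous molecule, likely a chimera
        trns = next(iter(trns_counts))
        if trns == "RUNX1":
            calls[bc]["fusion"] += 1  # barcoded read on RUNX1
        elif trns == "RUNX1T1":
            calls[bc]["wild_type"] += 1  # barcoded read on RUNX1T1
    return {bc: dict(counts) for bc, counts in calls.items()}


if __name__ == "__main__":
    demo = [("AAA", "U1", "RUNX1")] * 3 + [("AAA", "U2", "RUNX1T1")] * 2
    print(call_runx1_fusion(demo))
```

Counting molecules (BC-UMI pairs) rather than raw reads keeps heavily amplified molecules from dominating the per-cell call.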

  • Analysis of MT transcripts in 15-mer MAS-seq arrays from a 10x Genomics Chromium 5' library (whole-genome alignment can be skipped by realigning the extracted transcripts to the mitochondrial chromosome only)
python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/1019_mtDNA.fastq.gz \
        --o AML_1019 \
        --e mito_15mer_AML_1019 \
        --m 5p10XGEX \
        --t ~/nanoranger/data/MT_trns.fa \
        --g ~/nanoranger/data/MT_chr.fa
  • Analysis of CAR-T cells from a 10x genomics Chromium 5' library to detect CAR and CD28 transcripts
python ~/nanoranger/pipeline.py \
        --c 8 \
        --i ~/nanoranger/sample_fastq/97_6_CAR.fastq.gz \
        --o 97_6 \
        --e CAR_97_6 \
        --m 5p10XGEX \
        --t ~/nanoranger/data/CAR_CD28.fa \
        --g ~/nanoranger/data/CAR_CD28.fa
  • Generation of BAM with barcode and UMI tags and genes-by-cells matrix from a 10x genomics Chromium 3' library (GRCh38.primary_assembly.genome.fa.gz from https://www.gencodegenes.org/human/ can be used)
Coming Soon!

3p10XGEX

  • Generation of BAM with barcode and UMI tags and genes-by-cells matrix from a 10x genomics Chromium 3' library (GRCh38.primary_assembly.genome.fa.gz from https://www.gencodegenes.org/human/ can be used)
Coming Soon!

3p10XTCR

  • Analysis of TCRs from a 10x genomics Chromium 3' library (Human and Mouse C gene transcripts available in data folder and alignment supported by MiXCR)
Coming Soon!

Downstream Analysis

nanoranger's People

Contributors

liviuspenter, mehdiborji


nanoranger's Issues

TCR matching error

When I run 5p10XTCR with the example data TCR3.fastq.gz on my macOS system, I get the following error:

Traceback (most recent call last):
  File "/Users/Home/nanoranger/pipeline.py", line 236, in <module>
    utils.process_matching_5p10XTCR(sample,outdir)
  File "/Users/Home/nanoranger/utils.py", line 733, in process_matching_5p10XTCR
    scores=sort_cnt(all_AS[all_AS[:,1]==0][:,0])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Error in 5p10XTCR example script

When running the 5p10XTCR example in a docker container the pipeline runs until the following point and then fails:

...<lines above cut>...
TRA chains: 156 (55.12%)
TRB chains: 127 (44.88%)
TCR_out/TCR_testrun_bcreads.fasta
TCR_out/TCR_testrun_ref/
Feb 19 22:24:35 ..... started STAR run
Feb 19 22:24:35 ... starting to generate Genome files

genomeGenerate.cpp:150:genomeGenerate: exiting because of *OUTPUT FILE* error: could not create output file TCR_out/TCR_testrun_ref//genomeParameters.txt
Solution: check that the path exists and you have write permission for this file

It looks like there might be an error in genomeGenerate.cpp that puts an additional / character into the path for the output file.

Fusion Calling Error

Hello,
Thank you for this tool. I have a 5' 10x library sequenced with nanopore sequencing. I previously used JAFFAL to recover known fusions from single-cell data, which works quite well, and I wanted to try your fusion detection pipeline with a FASTA file to see how it performs. However, I encounter this error message on my own data:

alignment to genome and generation of BC-UMI-Transcript tagged BAM 


cores = 20
ref = /home/user/nanoranger/FUSION_SEQUENCE.fa
infile= FUSION_TEST/fusion_deconcat.fastq.gz
outdir = FUSION_TEST
sample = fusion
[M::mm_idx_gen::0.001*1.50] collected minimizers
[M::mm_idx_gen::0.001*5.99] sorted minimizers
[M::main::0.001*5.96] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.001*5.82] mid_occ = 15
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.002*5.70] distinct minimizers: 626 (98.72% are singletons); average occurrences: 1.032; average spacing: 2.913; total length: 1882
[M::worker_pipeline::0.734*16.79] mapped 103327 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -aY --eqx -x splice -t 20 --secondary=no --sam-hit-only /home/user/nanoranger/FUSION_SEQUENCE.fa FUSION_TEST/fusion_deconcat.fastq.gz
[M::main] Real time: 0.738 sec; CPU: 12.330 sec; Peak RSS: 0.053 GB
[bam_sort_core] merging from 0 files and 20 in-memory blocks...
number of genome aligned reads =  4693
10000 barcode candidates processed
20000 barcode candidates processed
30000 barcode candidates processed
40000 barcode candidates processed
50000 barcode candidates processed
60000 barcode candidates processed
70000 barcode candidates processed
80000 barcode candidates processed
number of short UMI reads =  250
20000 Read-BC-UMI-Transcript tuples saved
40000 Read-BC-UMI-Transcript tuples saved
60000 Read-BC-UMI-Transcript tuples saved
rm: cannot remove 'FUSION_TEST/fusion_matching_*': No such file or directory

Surprisingly, I encounter the same error with the test data:

 alignment to genome and generation of BC-UMI-Transcript tagged BAM 


cores = 8
ref = /home/user/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa
infile= K562_Kasumi1/fusion_deconcat.fastq.gz
outdir = K562_Kasumi1
sample = fusion
[M::mm_idx_gen::0.001*1.89] collected minimizers
[M::mm_idx_gen::0.001*2.31] sorted minimizers
[M::main::0.001*2.30] loaded/built the index for 7 target sequence(s)
[M::mm_mapopt_update::0.002*2.22] mid_occ = 10
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 7
[M::mm_idx_stat::0.002*2.17] distinct minimizers: 2164 (96.63% are singletons); average occurrences: 1.035; average spacing: 2.973; total length: 6656
[M::worker_pipeline::0.050*6.04] mapped 3152 sequences
[M::main] Version: 2.26-r1175
[M::main] CMD: minimap2 -aY --eqx -x splice -t 8 --secondary=no --sam-hit-only /home/user/nanoranger/data/RUNX1_RUNX1T1_ABL1_BCR.fa K562_Kasumi1/fusion_deconcat.fastq.gz
[M::main] Real time: 0.050 sec; CPU: 0.303 sec; Peak RSS: 0.010 GB
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
number of genome aligned reads =  2883
number of short UMI reads =  4
rm: cannot remove 'K562_Kasumi1/fusion_matching_*': No such file or directory

Here is my working environment:

  • Minimap2 v2.26-r1175
  • STAR v2.7.9a
  • Samtools v1.6

The files present in the output directory for my data so far are:
fusion_barcode_scores.csv fusion_barcode_scores.pdf fusion_bcumi_dedup.csv fusion_BCUMI.fasta.gz fusion_deconcat.fastq.gz fusion_genome_tagged.bam fusion_genome_tagged.bam.bai fusion_knee.pdf fusion_matching.sam fusion_trns_ct.csv

I was hoping for an output file with the reads, barcodes, and presence of the fusion, but I'm not sure I've found this in any of these files. Do you have a wiki describing the output files and their contents? I guess I need to use fusion_gene.py in the downstream folder under scripts, but I am unsure which arguments it needs.
Also, regarding the scripts you provide, which one performs the extraction of the 10x barcodes? I saw that there are two bash scripts, barcode_align.sh and barcode_ref.sh, so I imagine those are the two that are called, right?

Thank you for your help,
Evan
