Giter Club home page Giter Club logo

sgcocaller's Introduction

sgcocaller: Calling crossovers from single-gamete DNA sequencing reads

Two main modules are available in sgcocaller, phasing and crossover calling. sgcocaller phase generates donor haplotype from single-gamete DNA sequencing data. sgcocaller xo and sgcocaller sxo both call crossovers in each single gamete using a Hidden Markov Model.

sgcocaller processes DNA reads from each single gamete in the aligned and sorted BAM file for inferring the haplotypes of single gamete genomes that can later be used for identifying crossovers through finding haplotype shifts (see comapr).

It takes the large bam file which contains all aligned DNA reads from gamete cells and summarizes allele counts for the provided informative SNP markers. While counting the alleles, the Viterbi algorithm is executed for finding the haplotype sequence for the list of SNP markers.

Publication

BioRxiv paper

sgcocaller_fig

Hidden Markov Model configuration

HMM_fig

Inputs

  • Bam, sorted and index bam file which contains DNA reads of single gamete cells with CB tag, eg. from single-cell preprocessing pipeline (CellRanger, STARSolo, etc)
  • VCF, variant call file that contains the list of informative SNPs (phased or unphased SNPs)
  • barcodeFile, the list of cell barcodes of the gametes

Main outputs from sgcocaller phase

  • *phased_snpAnnot.txt
    • _phased_snpAnnot.vcf.gz (if --outvcf)
    • _totalMtx.mtx
    • _altMtx.mtx
    • _gtMtx.mtx

Main outputs from sgcocaller swphase (or sgcocaller autophase)

  • *cellGenoVersusTemplate.txt
    • _corrected_phased_snpAnnot.txt
    • _corrected_phased_snpAnnot.vcf.gz
    • _switch_score.txt

Main outputs from sgcocaller xo

  • *.mtx
    • sparse matrix with columns corresponding to the list of gamete cell barcodes and rows corresponding to the list of SNP positions in VCF file
    • {sample}_chr1_altCount.mtx, a sparse mtx with entries representing alternative allele counts
    • {sample}_chr1_totalCount.mtx, a sparse mtx with entries representing total allele counts
    • {sample}_chr1_vi.mtx, a sparse mtx with entries representing inferred viterbi state (haplotype state)
  • {sample}_chr1_snpAnnot.txt, the SNP positions and allele
  • {sample}_chr1_viSegInfo.txt, statistics of the Viterbi state segments in text file format. It contains consecutive viterbi states for each chromosome with statistics including, starting SNP position, ending SNP position, the number of SNPs supporting the segment, the log likelihood ratio of the viterbi segment and the inferred hidden state.

Usage


  Usage:
      sgcocaller autophase [options] <BAM> <VCF> <barcodeFile> <out_prefix>
      sgcocaller phase [options] <BAM> <VCF> <barcodeFile> <out_prefix> 
      sgcocaller swphase [options] <gtMtxFile> <phasedSnpAnnotFile> <referenceVCF> <out_prefix> 
      sgcocaller sxo [options] <SNPPhaseFile> <phaseOutputPrefix> <barcodeFile> <out_prefix>
      sgcocaller xo [options] <BAM> <VCF> <barcodeFile> <out_prefix>
      

Arguments:

  <BAM> the read alignment file with records of single-cell DNA reads
  
  <VCF> the variant call file with records of SNPs

  <barcodeFile> the text file containing the list of cell barcodes

  <out_prefix>  the prefix of output files

  <out_vcf> the output vcf aftering phasing blocks in hapfile
  
  <gtMtxFile> the output gtMtx.mtx file from running sgcocaller phase

  <phasedSnpAnnotFile>  the output phased_snpAnnot.txt from running sgcocaller phase

Options:
  -t --threads <threads>  number of BAM decompression threads [default: 4]
  --barcodeTag <barcodeTag>  the cell barcode tag in BAM [default: CB]
  --minMAPQ <mapq>  Minimum MAPQ for read filtering [default: 20]
  --baseq <baseq>  base quality threshold for a base to be used for counting [default: 13]
  --chrom <chrom>  the selected chromsome (whole genome if not supplied,separate by comma if multiple chroms)
  --minDP <minDP>  the minimum DP for a SNP to be included in the output file [default: 1]
  --maxDP <maxDP>  the maximum DP for a SNP to be included in the output file [default: 5]
  --maxTotalDP <maxTotalDP>  the maximum DP across all barcodes for a SNP to be included in the output file [default: 25]
  --minTotalDP <minTotalDP>  the minimum DP across all barcodes for a SNP to be included in the output file [default: 10]
  --minSNPdepth <minSNPdepth>  the minimum depth of cell coverage for a SNP to be includes in generated genotype matrix file [default: 1]
  --thetaREF <thetaREF>  the theta for the binomial distribution conditioning on hidden state being REF [default: 0.1]
  --thetaALT <thetaALT>  the theta for the binomial distribution conditioning on hidden state being ALT [default: 0.9]
  --cmPmb <cmPmb>  the average centiMorgan distances per megabases default 0.1 cm per Mb [default: 0.1]
  --phased  the input VCF for calling crossovers contains the phased GT of heterozygous SNPs
  --outvcf  generate the output in vcf format (sgcocaller phase)  
  --templateCell <templateCell>  the cell's genotype to be used a template cell, as the cell's index (0-starting) in the barcode file, default as not supplied [default: -1]
  --maxDissim <maxDissim>  the maximum dissimilarity for a pair of cell to be selected as potential template cells due to not having crossovers in either cell [default: 0.0099]
  --maxExpand <maxExpand>  the maximum number of iterations to look for locally coexisting positions for inferring missing SNPs in template haplotype sequence [default: 1000]
  --posteriorProbMin <posteriorProbMin>  the min posterior probability for inferring missing SNPs [default: 0.99]
  --lookBeyondSnps <lookBeyondSnps>  the number of local SNPs to use when finding switch positions [default: 25]
  --minSwitchScore <minSwitchScore>  the minimum switch score for a site to be identified as having a switch error in the inferred haplotype  [default: 50.0]
  --minPositiveSwitchScores <minPositiveSwitchScores>  the min number of continuing SNPs with positive switch scores to do switch error correction [default: 8]  
  --binSize <binSize>  the size of SNP bins for scanning swith errors, users are recommended to increase this option when SNP density is high. [default: 2000]
  --stepSize <stepSize>  the move step size used in combination with --binSize. [default: 200]
  --dissimThresh <dissimThresh>  the threshold used on the allele concordance ratio for determining if a SNP bin contains a crossover. [default: 0.0099]
  --batchSize <batchSize>  the number of cells to process in one batch when running sxo. This option is only needed when the memory is limited. 
  --notSortMtx  do not sort the output mtx. 
  --maxUseNcells <maxUseNcells>  the number of cells to use for calculating switch scores. All cells are used if not set
  -h --help  show help


  Examples
      ./sgcocaller autophase possorted_bam.bam hetSNPs.vcf.gz barcodeFile.tsv phaseOutputPrefix
      ./sgcocaller phase possorted_bam.bam hetSNPs.vcf.gz barcodeFile.tsv phaseOutputPrefix
      ./sgcocaller xo --threads 4 possorted_bam.bam phased_hetSNPs.vcf.gz barcodeFile.tsv ./percell/ccsnp
      ./sgcocaller sxo snp_phase.txt phaseOutputPrefix barcodeFile.tsv ./percell/ccsnp

Run for single-cell DNA sequenced gametes with donor haplotype known

In cases where the haplotype of the donors is known (i.e the list of hetSNPs has been phased, or the donor is an F1 hybrid sample), sgcocaller xo can be called directly:

For F1 hybrid donor:

      ./sgcocaller xo --threads 4 possorted_bam.bam --cmPmb 1e-3  dbSNP-hetSNPs.vcf.gz barcodeFile.tsv ./percell/ccsnp

cmPmb controls the transition probabilites in the HMM model and it is dependent on the physical distances of the two markers[1]. It is recommended to set a smaller transition probability for sparse and low coverage dataset.

dbSNP-hetSNPs.vcf.gz contains the list of hetSNPs, and the list of REF alleles (or the list of ALT alleles) is the haplotype of the F1 hybrid donor.

For phased hetSNPs provided in VCF file, the GT field is used to extract the haplotype of the donor and sgcocaller xo can be called directly:

      ./sgcocaller xo --threads 4 --cmPmb 1e-3  --phased possorted_bam.bam phased-hetSNPs.vcf.gz barcodeFile.tsv ./percell/ccsnp

Run for single-cell DNA sequenced gametes with donor haplotype unknown

When hetSNPs' phase or haplotype of the donor is unknown, sgcocaller phase module can produce the donor haplotype using the single-gamate DNA sequencing dataset.

      ./sgcocaller phase --threads 4 --barcodeTag CB --chrom 'chr1' --minDP 2 possorted_bam hetSNPs.vcf.gz barcodeFile.txt outdir/path/withPrefix_ 

when specifying --outvcf with above command, ./sgcocaller phase will generate the phased hetSNPs in VCF format.

./sgcocaller swphase is an additional module for identifying potential switch errors in the genrated haplotype from ./sgcocaller phase. The provided utility R code can be used for plotting diagnostic plots which reveal whether ./sgcocaller swphase should be run to scrutinize switching errors.

       ./sgcocaller swphase --chrom "chr1" gtMtxfile.mtx chr1_phased_snpAnnot.txt unphasedHetSNPs.vcf.gz output/dirpath/phaseCorrected/chr1_

Since the ./sgcocaller phase has done the heavy lifting of parsing through the BAM and VCF file that generated the allele counts file and genotype matrix file, the crossover calling step in this use-case can be done through calling sgcocaller sxo that takes these outputs directly:

        ./sgcocaller sxo --thetaREF 0.1 --thetaALT 0.9 --cmPmb 1e-3 --chrom "chr1" sgphase_snp_annot.txt output/sgcocaller/phaseOneStep/apricot_ 

Run for a Bam of a single cell (bulk sample)

For cases where the DNA reads do not have a cell barocode tag (CB) ie. Bam file for one gamete or a bulk sample, sgcocaller can still be applied by adding one simple step:

1, Prepare a bulkBC.txt file in which only one dummy cell barcode "bulk" is listed. In other words, in the bulkBC.txt file, there is only one row and it has the text "bulk".

2, Apply sgcocaller as above :

    ./sgcocaller xo --threads 10 bulk.bam SNPs.vcf.gz bulkBC.txt ./output/bulksample

Setup/installation

Static builds

The static bianry can be simply downloaded which works for GNU/Linux type OS: ./src/sgcocaller

The static build was generated by using docker image docker://svirlyu/sgcocaller_nsb adapted from https://github.com/brentp/hts-nim/blob/master/Dockerfile

/usr/local/bin/nsb -s ./src/sgcocaller.nim -n sgcocaller.nimble -o /mnt/src -- --d:release --threads:on

A static binary build of sgcocaller is available as downloadable artifacts at gitlab repo: "https://gitlab.svi.edu.au/biocellgen-public/sgcocaller" or via the Releases tab

Install from bioconda

sgcocaller is available as a conda package and can be install with bioconda

Using docker

sgcocaller is also available in docker image [svirlyu/sgcocaller] https://hub.docker.com/r/svirlyu/sgcocaller

It can be executed by

docker run -it svirlyu/sgcocaller

## execute sgcocaller
/usr/bin/sgcocaller -h

Install with nimble

sgcocaller uses hts-nim(https://github.com/brentp/hts-nim) that requires the hts-lib library. If you are building the sgcocaller from source, you would need to install hts-lib

git clone --recursive https://github.com/samtools/htslib.git
cd htslib && git checkout 1.10 && autoheader && autoconf && ./configure --enable-libcurl

cd ..
make -j 4 -C htslib
export LD_LIBRARY_PATH=$HOME/htslib
ls -lh $HOME/htslib/*.so

Then, sgcocaller can be installed using nimble

nimble install https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git

The built package is located at $HOME/.nimble/bin/sgcocaller

Downstream analysis in R

The output files from sgcocaller can be directly parsed into R for construction of individual genetic maps using the R package comapr available from github.com/ruqianl/comapr

A complete analysis workflow can be accessed here.

References

[1] Hinch, AG. (2019). Factors influencing meiotic recombination revealed by whole-genome sequencing of single gamete Science, 363(6433)

sgcocaller's People

Contributors

davismcc avatar rrlyu avatar ruqianl avatar

Watchers

 avatar  avatar

sgcocaller's Issues

Reposting as a new issue

Hi, I'm very interested in using this tool. I also get an issue, using 0.3.9 (see below). I tried again using prior versions (0.3.6, 0.3.4 and 0.3.3) and get the same error. I then installed on both a linux cluster, and a WSL version of Ubuntu locally and get the same issue. File path does not appear to be the issue for me

./sgcocaller xo \

--phased
--cmPmb 0.0001
possorted_genome_bam.bam
Cross.filt.recal.vcf
barcodes.tsv
BULK_1_sgco
2022-11-02T12:41:15-05:00{"": barcodes.tsv, "": possorted_genome_bam.bam, "--binSize": 2000, "--dissimThresh": 0.0099, "--notSortMtx": false, "": nil, "phase": false, "--lookBeyondSnps": 25, "--thetaREF": 0.1, "--outvcf": false, "--baseq": 13, "--minMAPQ": 20, "--threads": 4, "--stepSize": 200, "--cmPmb": 0.0001, "--chrom": nil, "--minPositiveSwitchScores": 8, "--minSwitchScore": 50.0, "": Cross.filt.recal.vcf, "": nil, "--barcodeTag": CB, "--maxUseNcells": nil, "": nil, "--maxExpand": 1000, "sxo": false, "autophase": false, "--maxTotalDP": 25, "xo": true, "--batchSize": nil, "--phased": true, "--templateCell": -1, "--help": false, "": nil, "--maxDP": 5, "--minDP": 1, "--minTotalDP": 10, "--maxDissim": 0.0099, "swphase": false, "--posteriorProbMin": 0.99, "--minSNPdepth": 1, "--thetaALT": 0.9, "<out_prefix>": BULK_1_sgco, "": nil}
2022-11-02T12:41:15-05:00 running sgcocaller 0.3.9
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

I then tried installing from source though ran into an error here also:

nimble install https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git
Downloading https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git using git
Verifying dependencies for [email protected]
Info: Dependency on docopt@any version already satisfied
Verifying dependencies for [email protected]
Info: Dependency on regex@>= 0.11.1 already satisfied
Verifying dependencies for [email protected]
Info: Dependency on unicodedb@>= 0.7.2 already satisfied
Verifying dependencies for [email protected]
Info: Dependency on hts@>= 0.3.4 already satisfied
Verifying dependencies for [email protected]
Info: Dependency on https://github.com/ruqianl/distributions.git@any version already satisfied
Verifying dependencies for [email protected]
Installing [email protected]
Building sgcocaller/sgcocaller using c backend
Prompt: Build failed for 'https://gitlab.svi.edu.au/biocellgen-public/[email protected]', would you like to try installing 'https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git@#head' (latest unstable)? [y/N]
Answer:
Tip: 12 messages have been suppressed, use --verbose to show them.
Error: Aborting installation due to build failure
ianc@PC51:~$ nimble install https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git --verbose
Downloading https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git using git
Cloning latest tagged version: v0.3.9
Info /tmp/nimble_1192/gitlabsvieduau_biocellgenpublicsgcocallergit/sgcocaller_1192.nims(5, 22) Warning: imported and not used: 'strutils' [UnusedImport]
Verifying dependencies for [email protected]
Reading official package list
Checking for docopt@any version
Info: Dependency on docopt@any version already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for regex@>= 0.11.1
Info: Dependency on regex@>= 0.11.1 already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for unicodedb@>= 0.7.2
Info: Dependency on unicodedb@>= 0.7.2 already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for hts@>= 0.3.4
Info: Dependency on hts@>= 0.3.4 already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for https://github.com/ruqianl/distributions.git@any version
Info: Dependency on https://github.com/ruqianl/distributions.git@any version already satisfied
Verifying dependencies for [email protected]
Installing [email protected]
Building sgcocaller/sgcocaller using c backend
Prompt: Build failed for 'https://gitlab.svi.edu.au/biocellgen-public/[email protected]', would you like to try installing 'https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git@#head' (latest unstable)? [y/N]
Answer: y
Downloading https://gitlab.svi.edu.au/biocellgen-public/sgcocaller.git using git
Info /tmp/nimble_1192/gitlabsvieduau_biocellgenpublicsgcocallergit_#head/sgcocaller_1192.nims(5, 22) Warning: imported and not used: 'strutils' [UnusedImport]
Verifying dependencies for sgcocaller@#head
Reading official package list
Checking for docopt@any version
Info: Dependency on docopt@any version already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for regex@>= 0.11.1
Info: Dependency on regex@>= 0.11.1 already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for unicodedb@>= 0.7.2
Info: Dependency on unicodedb@>= 0.7.2 already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for hts@>= 0.3.4
Info: Dependency on hts@>= 0.3.4 already satisfied
Verifying dependencies for [email protected]
Reading official package list
Checking for https://github.com/ruqianl/distributions.git@any version
Info: Dependency on https://github.com/ruqianl/distributions.git@any version already satisfied
Verifying dependencies for [email protected]
Installing sgcocaller@#head
Building sgcocaller/sgcocaller using c backend
Error: Build failed for package: sgcocaller
... Details:
... Execution failed with exit code 1
... Command: "/usr/bin/nim" c --noNimblePath -d:NimblePkgVersion=0.3.5 -d:release --path:"/home/ianc/.nimble/pkgs/docopt-0.7.0" --path:"/home/ianc/.nimble/pkgs/regex-0.20.0" --path:"/home/ianc/.nimble/pkgs/unicodedb-0.10.0" --path:"/home/ianc/.nimble/pkgs/hts-0.3.23" --path:"/home/ianc/.nimble/pkgs/distributions-0.1.0" -o:"/tmp/nimble_1192/gitlabsvieduau_biocellgenpublicsgcocallergit_#head/sgcocaller" "/tmp/nimble_1192/gitlabsvieduau_biocellgenpublicsgcocallergit_#head/src/sgcocaller.nim"
... Output: Hint: used config file '/etc/nim/nim.cfg' [Conf]
... Hint: system [Processing]
... Hint: widestrs [Processing]
... Hint: io [Processing]
... Hint: sgcocaller [Processing]
... Hint: os [Processing]
... Hint: strutils [Processing]
... Hint: parseutils [Processing]
... Hint: math [Processing]
... Hint: bitops [Processing]
... Hint: macros [Processing]
... Hint: algorithm [Processing]
... Hint: unicode [Processing]
... Hint: pathnorm [Processing]
... Hint: osseps [Processing]
... Hint: posix [Processing]
... Hint: times [Processing]
... Hint: options [Processing]
... Hint: typetraits [Processing]
... Hint: docopt [Processing]
... Hint: regex [Processing]
... Hint: tables [Processing]
... Hint: hashes [Processing]
... Hint: sequtils [Processing]
... Hint: types [Processing]
... Hint: sets [Processing]
... Hint: properties [Processing]
... Hint: properties_data [Processing]
... Hint: common [Processing]
... Hint: compiler [Processing]
... Hint: parser [Processing]
... Hint: exptype [Processing]
... Hint: scanner [Processing]
... Hint: exptransformation [Processing]
... Hint: nfatype [Processing]
... Hint: litopt [Processing]
... Hint: nodematch [Processing]
... Hint: types [Processing]
... Hint: types_data [Processing]
... Hint: nfa [Processing]
... Hint: deques [Processing]
... Hint: nfafindall [Processing]
... Hint: nfamatch [Processing]
... Hint: util [Processing]
... Hint: hts [Processing]
... Hint: utils [Processing]
... Hint: hts_concat [Processing]
... /home/ianc/.nimble/pkgs/hts-0.3.23/hts/private/hts_concat.nim(127, 9) Error: undeclared identifier: 'csize_t'

SIGSEGV: Illegal storage access. (Attempt to read from nil?)

Hello, I tried to use sgcocaller v0.3.7 to detect COs in gametes using scRNA sequences from pollens, and our reference genome was phased. Here is the command I ran:

dir_to_sgcocaller/bin/sgcocaller xo --threads 8 --cmPmb 1 --phased merged.bam informative_SNPs.recode.vcf cb.list myprefix

Then I got an error as follows:

{"<barcodeFile>": cb.list, "<BAM>": merged.bam, "--binSize": 2000, "<referenceVCF>": nil, "phase": false, "--lookBeyondSnps": 25, "--thetaREF": 0.1, "--outvcf": false, "--baseq": 13, "--minMAPQ": 20, "--threads": 8, "--stepSize": 200, "--cmPmb": 1, "--chrom": nil, "--minPositiveSwitchScores": 8, "--minSwitchScore": 50.0, "<VCF>": informative_SNPs.recode.vcf, "<gtMtxFile>": nil, "--barcodeTag": CB, "<phasedSnpAnnotFile>": nil, "sxo": false, "--maxExpand": 1000, "--maxTotalDP": 25, "xo": true, "--phased": true, "--templateCell": -1, "--help": false, "<SNPPhaseFile>": nil, "--maxDP": 5, "--minDP": 1, "--minTotalDP": 10, "--maxDissim": 0.0099, "swphase": false, "--posteriorProbMin": 0.99, "--minSNPdepth": 1, "--thetaALT": 0.9, "<out_prefix>": rbrevi_test, "<phaseOutputPrefix>": nil}
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

The error message looks like from Nim, which I am not quite familiar with. Can you give me a clue about how to solve this?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.