hku-bal / clairs-to Goto Github PK

View Code? Open in Web Editor NEW

41.0 3.0 3.0 5.94 MB

ClairS-TO - a deep-learning method for tumor-only somatic variant calling

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.46% Python 81.86% C++ 7.50% C 9.16% Shell 0.69% Makefile 0.33%

bioinformatics deep-learning genomics illumina long-read-sequencing long-reads nanopore ont pacbio snvs

clairs-to's Introduction

ClairS-TO - a deep-learning method for tumor-only somatic small variant calling

Contact: Ruibang Luo, Zhenxian Zheng, Lei Chen
Email: {rbluo,zxzheng,lchen}@cs.hku.hk

Introduction

ClairS-TO (Somatic Tumor-Only) is a tool in the Clair series to support long-read somatic small variant calling with only tumor samples available.

Without a normal sample, non-somatic noises cannot be identified by finding common signals between a paired tumor and normal. The variant caller itself needs to be more proficient in telling noises from somatic signals.

In ClairS-TO, we use an ensemble of two neural networks with opposite objectives. With the same input, an Affirmative NN determines how likely a candidate is a somatic variant - P(Y_Aff), and a Negational NN determines how likely a candidate is NOT a somatic variant - P(Y_Neg). A conditional probability P(Y_Aff | Y_Neg) that determines how likely a candidate is a somatic variant given the probability that the candidate is not a somatic variant is calculated from the probability of both networks. A somatic variant candidate that doesn't look like a noise usually has a high P(Y_Aff) but a low P(Y_Neg), while a somatic variant candidate that can also be a noise can have both a high P(Y_Aff) and a high P(Y_Neg).

Below is a workflow of ClairS-TO.

Like other tumor-only somatic variant callers, ClairS-TO accepts panel of normals (PoNs) as input to remove non-somatic variants.

For somatic variant calling using paired tumor/normal samples, please try ClairS.

Latest Updates
Installation
Quick Demo
Pre-trained Models
Usage
Tagging non-somatic variant using panel of normals
Disclaimer

Latest Updates

v0.2.0 (Jul. 12, 2024): 1. Added a module called verdict to statistically classify a called variant into either a germline, somatic, or subclonal somatic variant based on the copy number alterations (CNA) profile and tumor purity estimation. To disable, use --disable_verdict option. Please check out more technical details about Verdict here.

v0.1.0 (Apr. 25, 2024): 1. Added support for somatic Indel calling. To disable, use --disable_indel_calling option. Indels are called only in the BED regions specified by the --calling_indels_only_in_these_regions option. The default regions are (whole genome - GIAB stratification v3.3 all difficult regions + CMRG v1.0 regions). 2. Added --panel_of_normals_require_allele_matching option that takes comma separated booleans to indicate whether to require allele matching for each of the PoNs given in --panel_of_normals. By default, allele matching is enabled when using germline variants sources (e.g., gnomAD, dbSNP) for non-somatic tagging, and is disabled when using panels (e.g., 1000G PoN). 3. Added multiple filters to remove as many spurious calls as possible. Including the use of i. phasing information: how good the alternative alleles are from a single haplotype after phasing (Simpson, 2024); ii. ancestral haplotype support: can an ancestral haplotype be found for reads that contain the alternative allele (Zheng et al., 2023); iii. BQ, MQ of the alternative allele reads; iv. variant position in read: whether the supporting alleles are gathered at the start or end of reads; v. strand bias; vi. realignment effect: for short read, whether both the count of supporting alt alleles and AF decreased after realignment. 4. Added --qual_cutoff_phaseable_region and --qual_cutoff_unphaseable_region to allow different qual cutoffs for tagging (as LowQual) the variants in the phaseable and unphaseable regions. Variants in unphaseable regions are suitable for a higher quality cutoff than those in the phaseable regions. 5. Added tags: i. H to indicate a variant is found in phaseable region; ii. SB showing the p-value of Fisher’s exact test on strand bias.

v0.0.2 (Jan. 26, 2024): 1. Added ONT Guppy 5kHz HAC (-p ont_r10_guppy_hac_5khz) and Dorado 4kHz HAC (-p ont_r10_dorado_hac_4khz) models, check here for more details. 2. Added FAU, FCU, FGU, FTU, RAU, RCU, RGU, and RTU tags for the count of forward/reverse strand reads supporting A/C/G/T. 3. Revamped the way how panel of normals (PoNs) are inputted. Population databases are also considered as PoNs, and users can disable default population databases and add multiple other PoNs. 4. Added file and md5 information of the PoNs to the VCF output header. 5. Enabled somatic variant calling in sex chromosomes. 6. Fixed an issue that misses PoNs tagging for low-quality variants.

v0.0.1 (Dec. 4, 2023): Initial release for early access.

Quick Demo

Oxford Nanopore (ONT) Q20+ data as input, see ONT Quick Demo.
PacBio HiFi Revio data as input, see PacBio HiFi Quick Demo.
Illumina NGS data as input, see Illumina Quick Demo.

Quick start

After following installation, you can run ClairS-TO with one command:

./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz

## Final SNV output VCF file: output/snv.vcf.gz
## Final Indel output VCF file: output/indel.vcf.gz

Check Usage for more options.

Pre-trained Models

ClairS-TO trained both Affirmative and Negational models using GIAB samples, and carry on benchmarking on HCC1395 tumor sample dataset. All models were trained with chr20 excluded (including only chr1-19, 21, 22).

Platform	Model name	Chemistry /Instruments	Basecaller	Latest update	Option (`-p/--platform`)	Reference	Aligner
ONT	r1041_e82_400bps_sup_v420	R10.4.1, 5khz	Dorado SUP	Nov. 10, 2023	`ont_r10_dorado_sup_5khz`	GRCh38_no_alt	Minimap2
ONT	r1041_e82_400bps_sup_v410	R10.4.1, 4khz	Dorado SUP	Nov. 10, 2023	`ont_r10_dorado_sup_4khz`	GRCh38_no_alt	Minimap2
ONT	r1041_e82_400bps_hac_v410	R10.4.1, 4khz	Dorado HAC	Jan. 19, 2024	`ont_r10_dorado_hac_4khz`	GRCh38_no_alt	Minimap2
ONT	r1041_e82_400bps_sup_g615	R10.4.1, 4khz	Guppy6 SUP	Nov. 10, 2023	`ont_r10_guppy_sup_4khz`	GRCh38_no_alt	Minimap2
ONT	r1041_e82_400bps_hac_g657	R10.4.1, 5khz	Guppy6 HAC	Jan. 21, 2024	`ont_r10_guppy_hac_5khz`	GRCh38_no_alt	Minimap2
Illumina	ilmn	NovaSeq/HiseqX	-	Nov. 10, 2023	`ilmn`	GRCh38	BWA-MEM
PacBio HiFi	hifi_revio	Revio with SMRTbell prep kit 3.0	-	Nov. 10, 2023	`hifi_revio`	GRCh38_no_alt	Minimap2

Installation

Option 1. Docker pre-built image

A pre-built docker image is available at DockerHub.

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR in docker.

docker run -it \
  -v ${INPUT_DIR}:${INPUT_DIR} \
  -v ${OUTPUT_DIR}:${OUTPUT_DIR} \
  hkubal/clairs-to:latest \
  /opt/bin/run_clairs_to \
  --tumor_bam_fn ${INPUT_DIR}/tumor.bam \      ## use your tumor bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \               ## use your reference file name here
  --threads ${THREADS} \                       ## maximum threads to be used
  --platform ${PLATFORM} \                     ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}
  --output_dir ${OUTPUT_DIR}                   ## output path prefix

Check Usage for more options.

Option 2. Singularity

Caution: Absolute path is needed for both INPUT_DIR and OUTPUT_DIR in singularity.

INPUT_DIR="[YOUR_INPUT_FOLDER]"        # e.g. /home/user1/input (absolute path needed)
OUTPUT_DIR="[YOUR_OUTPUT_FOLDER]"      # e.g. /home/user1/output (absolute path needed)
mkdir -p ${OUTPUT_DIR}

conda config --add channels defaults
conda create -n singularity-env -c conda-forge singularity -y
conda activate singularity-env

# singularity pull docker pre-built image
singularity pull docker://hkubal/clairs-to:latest

# run the sandbox like this afterward
singularity exec \
  -B ${INPUT_DIR},${OUTPUT_DIR} \
  clairs-to_latest.sif \
  /opt/bin/run_clairs_to \
  --tumor_bam_fn ${INPUT_DIR}/tumor.bam \      ## use your tumor bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \               ## use your reference file name here
  --threads ${THREADS} \                       ## maximum threads to be used
  --platform ${PLATFORM} \                     ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}
  --output_dir ${OUTPUT_DIR} \                 ## output path prefix
  --conda_prefix /opt/micromamba/envs/clairs-to

Option 3. Build a micromamba (or anaconda) virtual environment

Check here to install the tools step by step.

Use micromamba (recommended):

Please install micromamba using the official guide or using the commands below:

wget -O linux-64_micromamba-1.5.1-2.tar.bz2 https://micro.mamba.pm/api/micromamba/linux-64/latest
mkdir micromamba
tar -xvjf linux-64_micromamba-1.5.1-2.tar.bz2 -C micromamba
cd micromamba
./bin/micromamba shell init -s bash -p .
source ~/.bashrc

Or use anaconda:

Please install anaconda using the official guide or using the commands below:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x ./Miniconda3-latest-Linux-x86_64.sh 
./Miniconda3-latest-Linux-x86_64.sh

Install ClairS-TO using micromamba step by step:

# create and activate an environment named clairs-to
# install pypy and packages in the environment
# for micromamba
micromamba create -n clairs-to -c bioconda -c pytorch -c conda-forge pytorch tqdm clair3 bcftools einops scipy scikit-learn python=3.9.0 -y
micromamba activate clairs-to

## for anaconda 
#conda create -n clairs-to -c bioconda -c pytorch -c conda-forge pytorch tqdm clair3 bcftools einops python=3.9.0 -y
#source activate clairs-to

git clone https://github.com/HKU-BAL/ClairS-TO.git
cd ClairS-TO

# make sure in clairs-to environment
# download pre-trained models and other resources
echo ${CONDA_PREFIX}
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_models
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_databases
mkdir -p ${CONDA_PREFIX}/bin/clairs-to_cna_data
wget http://www.bio8.cs.hku.hk/clairs-to/models/clairs-to_models.tar.gz
wget http://www.bio8.cs.hku.hk/clairs-to/databases/clairs-to_databases.tar.gz
wget http://www.bio8.cs.hku.hk/clairs-to/cna_data/reference_files.tar.gz
tar -zxvf clairs-to_models.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_models/
tar -zxvf clairs-to_databases.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_databases/
tar -zxvf reference_files.tar.gz -C ${CONDA_PREFIX}/bin/clairs-to_cna_data/

#CLAIRSTO_PATH=`pwd`

## to enable realignment module
#sudo apt install g++ libboost-all-dev -y
#cd ${CLAIRSTO_PATH}/src/realign && g++ -std=c++14 -O1 -shared -fPIC -o realigner ssw_cpp.cpp ssw.c realigner.cpp && g++ -std=c++11 -shared -fPIC -o debruijn_graph -O3 debruijn_graph.cpp

## to install allele counter for verdict module
#sudo apt install curl zlib1g-dev libbz2-dev liblzma-dev libcurl4-openssl-dev gcc -y
#cd ${CLAIRSTO_PATH}/src/verdict/allele_counter && chmod +x ./setup.sh && /bin/bash ./setup.sh ${CLAIRSTO_PATH}/src/verdict/allele_counter

#cd ${CLAIRSTO_PATH}

./run_clairs_to --help

Option 4. Docker Dockerfile

This is the same as Option 1 except that you are building a docker image yourself. Please refer to Option 1 for usage.

git clone https://github.com/HKU-BAL/ClairS-TO.git
cd ClairS-TO

# build a docker image named hkubal/clairs-to:latest
# might require docker authentication to build docker image
docker build -f ./Dockerfile -t hkubal/clairs-to:latest .

# run the docker image like Option 1
docker run -it hkubal/clairs-to:latest /opt/bin/run_clairs_to --help

Usage

General Usage

./run_clairs_to \
  --tumor_bam_fn ${INPUT_DIR}/tumor.bam \    ## use your tumor bam file name here
  --ref_fn ${INPUT_DIR}/ref.fa \             ## use your reference file name here
  --threads ${THREADS} \                     ## maximum threads to be used
  --platform ${PLATFORM} \                   ## options: {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}
  --output_dir ${OUTPUT_DIR}                 ## output path prefix
 
## Final SNV output VCF file: output/snv.vcf.gz
## Final Indel output VCF file: output/indel.vcf.gz

Options

Required parameters:

  -T, --tumor_bam_fn TUMOR_BAM_FN   Tumor BAM file input. The input file must be samtools indexed.
  -R, --ref_fn FASTA                Reference file input. The input file must be samtools indexed.
  -o, --output_dir OUTPUT_DIR       VCF output directory.
  -t, --threads THREADS             Max threads to be used.
  -p, --platform PLATFORM           Select the sequencing platform of the input. Possible options {ont_r10_dorado_sup_4khz, ont_r10_dorado_hac_4khz, ont_r10_dorado_sup_5khz, ont_r10_guppy_sup_4khz, ont_r10_guppy_hac_5khz, ilmn, hifi_revio}.

Commonly used parameters:

  -s SAMPLE_NAME, --sample_name SAMPLE_NAME
                        Define the sample name to be shown in the VCF file. Default: SAMPLE.
  -c CTG_NAME, --ctg_name CTG_NAME                                                                                                                                                                         
                        The name of the contigs to be processed. Split by ',' for multiple contigs. Default: all contigs will be processed.
  --include_all_ctgs    Call variants on all contigs, otherwise call in chr{1..22,X,Y} and {1..22,X,Y}.                                                                
  -r REGION, --region REGION                                                                                                                                                                               
                        A region to be processed. Format: `ctg_name:start-end` (start is 1-based, including both end positions).                                                                                                         
  -b BED_FN, --bed_fn BED_FN                                                                                                                                                                               
                        Path to a BED file. Call variants only in the provided BED regions.                                                                                                                
  -G VCF_FN, --genotyping_mode_vcf_fn VCF_FN                                                                                                                               
                        VCF file input containing candidate sites to be genotyped. Variants will only be called at the sites in the VCF file if provided.                                                  
  -H VCF_FN, --hybrid_mode_vcf_fn VCF_FN                                                                                                                                           
                        Enable hybrid calling mode that combines the de novo calling results and genotyping results at the positions in the VCF file given.
  --print_ref_calls     Show reference calls (0/0) in VCF file in genotyping or hybrid mode.
  --disable_indel_calling
                        Disable Indel calling. Default: Enabled.
  --snv_min_af FLOAT
                        Minimal SNV AF required for a variant to be called. Decrease SNV_MIN_AF might increase a bit of sensitivity, but in trade of precision, speed and accuracy. Default: 0.05.
  --indel_min_af FLOAT
                        Minimal Indel AF required for a variant to be called. Default: 0.05.
  --min_coverage INT
                        Minimal coverage required for a variant to be called. Default: 4.
  -q INT, --qual INT    If set, variants with >INT will be tagged as PASS, or LowQual otherwise. Default: ONT - 12 , PacBio HiFi - 8, Illumina - 4.
  --qual_cutoff_phaseable_region INT
                        If set, variants called in phaseable regions with >INT will be tagged as PASS, or LowQual otherwise. Supersede by `--qual`.
  --qual_cutoff_unphaseable_region INT
                        If set, variants called in unphaseable regions with >INT will be tagged as PASS, or LowQual otherwise. Supersede by `--qual`.
  --panel_of_normals FILENAMES
                        The path of the panel of normals (PoNs) used for tagging non-somatic variants. Split by ',' if using multiple PoNs. Default: 'gnomad.r2.1.af-ge-0.001.sites.vcf.gz,dbsnp.b138.non-somatic.sites.vcf.gz,1000g-pon.sites.vcf.gz'.
  --panel_of_normals_require_allele_matching BOOLEANS
                        Use together with `--panel_of_normals`. Whether to require allele matching for each PoN. Split by ',' if using multiple PoNs. Default: 'True,True,False'.
  --snv_output_prefix PATH_PREFIX
                        Prefix for SNV output VCF filename. Default: snv.
  --indel_output_prefix PATH_PREFIX
                        Prefix for Indel output VCF filename. Default: indel.
  --call_indels_only_in_these_regions BED_FN
                        Call Indel only in the provided regions. Supersede by `--bed_fn`. To call Indel in the whole genome, input a BED covering the whole genome. Default: 'GRCh38Chr1-22XY_excludedGIABStratifV3.3AllDifficultRegions_includedCMRGv1.0.bed'                     
  --do_not_print_nonsomatic_calls
                        Do not print those non-somatic variants tagged by `--panel_of_normals`.

Other parameters:

  --snv_pileup_affirmative_model_path PATH                                                                                                                                            
                        Specify the path to your own SNV pileup affirmative model.                                                                                                  
  --snv_pileup_negational_model_path PATH                                                                                                                                              
                        Specify the path to your own SNV pileup negational model. 
  --indel_pileup_affirmative_model_path PATH
                        Specify the path to your own Indel pileup affirmative model.
  --indel_pileup_negational_model_path PATH
                        Specify the path to your own Indel pileup negational model.
  -d, --dry_run         Print the commands that will be ran, but do not run them.
  --chunk_size INT
                        The size of each chuck for parallel processing. Default: 5000000.
  --remove_intermediate_dir
                        Remove the intermediate directory before finishing to save disk space.
  --python PATH         Absolute path of python, python3 >= 3.9 is required.
  --pypy PATH           Absolute path of pypy3, pypy3 >= 3.6 is required.
  --samtools PATH       Absolute path of samtools, samtools version >= 1.10 is required.
  --parallel PATH       Absolute path of parallel, parallel >= 20191122 is required.
  --longphase PATH
                        Absolute path of longphase, longphase >= 1.3 is required.
  --whatshap PATH       Absolute path of whatshap, whatshap >= 1.0 is required.
  --use_longphase_for_intermediate_phasing
                        Use longphase for intermediate phasing.
  --use_whatshap_for_intermediate_phasing
                        Use whatshap for phasing.
  --use_longphase_for_intermediate_haplotagging USE_LONGPHASE_FOR_INTERMEDIATE_HAPLOTAGGING
                        Use longphase instead of whatshap for intermediate haplotagging.
  --disable_intermediate_phasing
                        Disable intermediate phasing, runs faster but reduces precision.
  --disable_nonsomatic_tagging
                        Disable non-somatic variants tagging and ignore `--panel_of_normals`.
  --disable_verdict
                        Disable using verdict to tag the variants in CNA regions. We suggest using the parameter only for sample with tumor purity estimation lower than 0.8. Default: Enabled.

Call Variants in one or multiple chromosomes using the `-C/--ctg_name` parameter

./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -C chr21,chr22

Call Variants in one specific region using the `-r/--region` parameter

./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -r chr20:1000000-2000000

Call Variants at interested variant sites (genotyping) using the `-G/--genotyping_mode_vcf_fn` parameter

./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -G input.vcf

Call Variants in the BED regions using the `-B/--bed_fn` parameter

We highly recommended using BED file to define multiple regions of interest like:

echo -e "${CTG1}\t${START_POS_1}\t${END_POS_1}" > input.bed
echo -e "${CTG2}\t${START_POS_2}\t${END_POS_2}" >> input.bed
...

Then:

./run_clairs_to -T tumor.bam -R ref.fa -o output -t 8 -p ont_r10_guppy_sup_4khz -B input.bed

Tagging non-somatic variant using panel of normals

ClairS-TO by default tags variants if they exist in provided panel of normals (PoNs, i.e., gnomad.r2.1.af-ge-0.001.sites.vcf.gz, dbsnp.b138.non-somatic.sites.vcf.gz, and 1000g-pon.sites.vcf.gz), and pass the filters listed in the table below.

Users can also use their own PoNs for tagging using the --panel_of_normals option.

Particularly, if the --panel_of_normals option is not specified, the three default PoNs will be included. And if users want to use all/part/none of the default PoNs as well as their own PoNs, corresponding file paths of the default PoNs (i.e., ${CONDA_PREFIX}/bin/clairs-to_databases/gnomad.r2.1.af-ge-0.001.sites.vcf.gz, ${CONDA_PREFIX}/bin/clairs-to_databases/dbsnp.b138.non-somatic.sites.vcf.gz, and ${CONDA_PREFIX}/bin/clairs-to_databases/1000g-pon.sites.vcf.gz), and their own PoNs, should be included in the --panel_of_normals option, split by ,.

In addition, we recommend using --panel_of_normals_require_allele_matching option that takes comma separated booleans to indicate whether to require allele matching for each of the PoNs given in --panel_of_normals. By default, allele matching is enabled when using germline variants sources (e.g., gnomAD, dbSNP) for non-somatic tagging, and is disabled when using panels (e.g., 1000G PoN).

Default PoNs	URL	Source	Source URL	Last visited	Total #Variants	Filters	#Variants used for tagging	Remaining Columns in the input
PoN 1	http://www.bio8.cs.hku.hk/clairs-to/databases/gnomad.r2.1.af-ge-0.001.sites.vcf.gz	GATK gnomAD	https://storage.googleapis.com/gatk-best-practices/somatic-hg38/af-only-gnomad.hg38.vcf.gz	July 10, 2023 PM10∶34∶07	268,225,276	Sites with AF ≥ 0.001	35,551,905	#CHROM POS ID REF ALT
PoN 2	http://www.bio8.cs.hku.hk/clairs-to/databases/dbsnp.b138.non-somatic.sites.vcf.gz	GATK dbSNP	https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf	July 10, 2023 PM10∶42∶22	60,691,395	Non-Somatic sites	60,683,019	#CHROM POS ID REF ALT
PoN 3	http://www.bio8.cs.hku.hk/clairs-to/databases/1000g-pon.sites.vcf.gz	GATK 1000G PoN	https://storage.googleapis.com/gatk-best-practices/somatic-hg38/1000g_pon.hg38.vcf.gz	July 10, 2023 PM10∶31∶32	2,609,566	All sites	2,609,566	#CHROM POS ID REF ALT

Disclaimer

NOTE: the content of this research code repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

clairs-to's People

Contributors

Stargazers

Watchers

Forkers

changqingw randolium yongchaodou

clairs-to's Issues

Different output

Hi,

First of all thank you for ClairS-TO.
I have been trying to use it with bed file option.
I get different output for the same sample if I re-run it.
In some output, it skips certain positions of a chromosome. It shows in the tmp file if I check individual chromosome vcf file. But some positions or entire chromosome get missed out in a merged output vcf. Also does not show phasing in a final vcf.

Thank you in advance.

Shreya

Documentation: training data to create a model

Hi @aquaskyline,

Thanks for developing such a great tool! I am currently testing your variant caller with ONT data originated from different organisms, and it could nicely recognize SNPs but I am having troubles with short deletions.
I was wondering whether it makes sense for me to train it with my own data to create a model. I saw that for Clair3 it is explained in the documentation how to train data, but I did not find the documentation for ClairS-TO.
If this is possible, would you please add in the documentation how users can create their own models?

Thank you very much!

Error at Haplotag BAM step

Thank you for this important piece of software. Using fresh pull of singularity image for ClairS-TO, but at the haplotagging stage get the output below.

I've made sure I am not loading a local instance of conda before I run the pipeline on a slurm node, but the image seems to be picking up pysam locally rather than using the instance inside the container - which it can see as it runs fine on other steps.

Command used to run ($A represents sample ID passed from slurm):

singularity exec ~/beggsa-clinicalnanopore/software/clairs-to_latest.sif /opt/bin/run_clairs_to -T "$A".sorted.bam -R ~/beggsa-clinicalnanopore/genomes/grch38/genome.fa -o clairs-to/ -t 32 -p ont_r10_dorado_hac_4khz --conda_prefix /opt/micromamba/envs/clairs-to
OS: RedHat Enterprise 8.3

Output:

parsing contig/chromosome: chrEBV ... skip
writeResult ... 2s

[INFO] Haplotag the Tumor BAM
[INFO] RUN THE FOLLOWING COMMAND:
( parallel --joblog /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/logs/phasing_log/parallel_3_haplotag_tumor.log -j 32 /opt/micromamba/envs/clairs-to/bin/whatshap haplotag --output /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/phasing_output/phased_bam_output/tumor_{1}.bam --reference /rds/homes/b/beggsa/beggsa-clinicalnanopore/genomes/grch38/genome.fa --regions {1}  --ignore-read-groups /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf.gz /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/S475276.sorted.bam :::: /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/CONTIGS ) 2>&1 | tee /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/logs/phasing_log/3_tumor_haplotag.log && parallel -j 32 samtools index  -@32 /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/phasing_output/phased_bam_output/tumor_{1}.bam :::: /rds/projects/b/beggsa-sarcomaaccelerator/LongRead/LA95_Sarcoma_03-05-24/S475276/20240503_1646_3A_PAW55915_ceedab86/clairs-to/tmp/CONTIGS

Traceback (most recent call last):
  File "/opt/micromamba/envs/clairs-to/bin/whatshap", line 7, in <module>
    from whatshap.__main__ import main
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/__main__.py", line 7, in <module>
    import whatshap.cli as cli_package
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/cli/__init__.py", line 5, in <module>
    from whatshap.bam import (
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/bam.py", line 6, in <module>
    import pysam
  File "/rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/../pysam.libs/libssh-f4fb36c6.so.4.8.7)
Traceback (most recent call last):
  File "/opt/micromamba/envs/clairs-to/bin/whatshap", line 7, in <module>
    from whatshap.__main__ import main
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/__main__.py", line 7, in <module>
    import whatshap.cli as cli_package
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/cli/__init__.py", line 5, in <module>
    from whatshap.bam import (
  File "/opt/micromamba/envs/clairs-to/lib/python3.9/site-packages/whatshap/bam.py", line 6, in <module>
    import pysam
  File "/rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/__init__.py", line 4, in <module>
    from pysam.libchtslib import *
ImportError: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.27' not found (required by /rds/homes/b/beggsa/.local/lib/python3.9/site-packages/pysam/../pysam.libs/libssh-f4fb36c6.so.4.8.7)

Stalling at Step6

Hi, I had to reinstall ClairS-TO and did at first not notice that there was an update last week. Whenever I ran my pipeline it stalled at step 6, when writing logs I think. After downgrading to version 0.1.0 it runs again through as before. Does someone else see this issue?

IndexError during STEP1

I got heaps of IndexError while running clairs_to with the latest docker image on Singularity:

Traceback (most recent call last):
  File "/opt/bin/clairs_to.py", line 107, in <module>
    main()
  File "/opt/bin/clairs_to.py", line 101, in main
    submodule.main()
  File "/opt/bin/src/extract_candidates_calling.py", line 597, in main
    extract_pair_candidates(args)
  File "/opt/bin/src/extract_candidates_calling.py", line 341, in extract_pair_candidates
    select_indel_candidates=select_indel_candidates
  File "/opt/bin/src/extract_candidates_calling.py", line 91, in decode_pileup_bases
    base_list[-1][1] = base + pileup_bases[base_idx: base_idx + advance]  # add indel seq
IndexError: list index out of range

I wonder if this is expected, or am I actually losing these candidate indels due to the error.

F1 scores and high coverage datasets

Hi there!! We're currently testing ClairS-To using ONT DNA reads and a variant truth set to call SNPs. We have noticed; however, that the F1 score decreases relevently whenever the coverage is above 1000x (e.g. F1 of 0.90 if cov. 1000x, F1 of 0.5 when using higher coverage).

Therefore, I wanted to kindly ask: could there be a reason for this behaviour?

Thank you very much for any reply in advance and thank you very much for ClairS-TO
Cheers!

IndexError in STEP5

I am getting the following IndexError in STEP 5 while running ClairS_TO with the latest docker image and I am using PacBio Hifi data (haplotagged.bam file) as the input.

Traceback (most recent call last):
File "/opt/bin/clairs_to.py", line 107, in
main()
File "/opt/bin/clairs_to.py", line 101, in main
submodule.main()
File "/opt/bin/src/postprocess_vcf.py", line 231, in main
merge_vcf(args)
File "/opt/bin/src/postprocess_vcf.py", line 99, in merge_vcf
qual = float(columns[5])
IndexError: list index out of range

Even with the error, the final output file output.vcf.gz is generated but I am concerned about the output. Could you please help me with that?

Value Error in Step 2-3 CALL_VARIANTS

Hi, I am using ClairS-TO to call SNVs in tumor data (Ilmn). In step 2-3 (pileup Model Calling Variants), roughly 6 hours into the job, I run into this problem:

[INFO] Pileup Model Calling Variants
[INFO] RUN THE FOLLOWING COMMAND:
( parallel --joblog /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/logs/parallel_2-3_call_variants.log -j 24 python3 /opt/bin/clairs_to.py call_variants --predict_fn /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/tmp/predict/{1/} --call_fn /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/tmp/vcf_output/p_{1/}.vcf --platform ilmn --likelihood_matrix_data /opt/micromamba/envs/clairs-to/bin/clairs-to_models/ilmn/likelihood_matrix.txt :::: /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/tmp/candidates/CANDIDATES_FILES ) 2>&1 | tee /hpc/pmc_holstege/rstudio/oscar_nanopore/snv_clairsto/SNV_called/logs/2-3_CALL_VARIANTS.log

[INFO] Calling tumor-only somatic variants ...
[INFO] Total time elapsed: 0.04 s
[INFO] Calling tumor-only somatic variants ...
Traceback (most recent call last):
File "/opt/bin/clairs_to.py", line 107, in
main()
File "/opt/bin/clairs_to.py", line 101, in main
submodule.main()
File "/opt/bin/clairs/call_variants.py", line 646, in main
call_variants_from_probability(args)
File "/opt/bin/clairs/call_variants.py", line 567, in call_variants_from_probability
output_vcf_from_probability(
File "/opt/bin/clairs/call_variants.py", line 234, in output_vcf_from_probability
best_match_alt_list, tumor_supported_reads_count_list = rank_variant_alt(
ValueError: too many values to unpack (expected 2)

The traceback is repeated hundreds of times. Step 3 seems to work fine even with this issue, but the vcf file doesn't contain any SNVs in the end.

do you have any suggestions on what might be the cause? Thanks

bcftools doesn't like tag name '1kGPoN'

I am getting "[W::bcf_hrec_check] Invalid tag name: "1kGPoN"" error when processing the vcf generated by ClairS-TO pipeline.

My understanding is that bcftools doesn't like any tag that starts with a number.

Can you change the tag name to PoN instead in shared/vcf.py line 24 and src/nonsomatic_tagging.py line 294?

Thanks a lot in advance.

Get a large number of variants from ClairS-TO

Hi, I am using Clairs-TO to call somatic small variant from PacBio Revio tumor data. As a result, ClairS-TO detected ~28000 variants, which is much larger than I expected. I am not sure whether there are many false positive and thus post this issue to enquire if it's normal condition when using ClairS-TO.

Thanks!

RefCall variants

Hi there! We are testing clairS-TO with the --print_ref_calls option. My question is: what does the RefCall flag mean? Could you further elaborate on it?

I am aware that in other software it can mean that the variant was proposed as mutation but later is rejected by the variant caller; however, I would appreciate if you can provide further information on what it means in ClairS-TO.

Thank you very much and thank you as well for your tool!
Regards!

DP seems downsampling in ClairS-TO

Hello,

Thank you for developing these great long-read small variants callers! We have some Cas9 edited, PCR-based Nanopore samples (Dorado duplex basecalling [email protected] and [email protected]), and want to quantify the editing percentage.

The sequencing depth (DP) is 372 k reads, and the mutation type should be T->TA at chr4:123 (based on BAM file IGV visualization, and previous edited samples). I have tried the following three mutation calling programs:

Run Clair3 with an old ont_guppy5 model. The mutation type is T deletion which is inconsistent with IGV visualization T->TA.

chr4 123 . T . 10.36 RefCall P GT:GQ:DP:AD:AF 0/0:10:372419:186689:0.5013

hkubal/clair3:v1.0.8 /opt/bin/run_clair3.sh \
--bam_fn=/input.bam \
--ref_fn=/hg38.fna --threads=8 \
--platform="ont" --model_path="/opt/models/ont_guppy5" --output=/input/clair3/ \
--var_pct_full=1 --var_pct_phasing=1 --ref_pct_full=1

Run Clair3 with the newest model "r1041_e82_400bps_sup_v430". No INDEL at chr14:123 detected. So I assume it's not germline so try solution 3.
Run ClairS-TO. The DP is 7071, which is different from 372k reads, so it's hard to quantify the editing percentage.

chr4 123 . T TA 15.2734 PASS FAU=21;FCU=0;FGU=13;FTU=3436;RAU=1;RCU=0;RGU=0;RTU=113;SB=0.08341 GT:GQ:DP:AF:AD:AU:CU:GU:TU 0/1:15:7071:0.3793:3549,2682:22:0:13:3549

  hkubal/clairs-to:v0.2.0 /opt/bin/run_clairs_to \
  --tumor_bam_fn /input.bam \
  --ref_fn /hg38.fna \
  --threads 8 \
  --platform ont_r10_dorado_sup_4khz \
  --output_dir /input/clairs_to/

Please let me know which caller and parameters are best suitable for the PCR-based editing sample for editing performance evaluation. Thank you!

Best,
Ermin

Indexing error at phasing step

Hi,

Firstly, many thanks for this incredibly useful tool.

I am running ClairS-TO with the singularity container on a ONT WGS run. The library was basecalled and aligned using dorado 0.6.2.

Command used to run

singularity exec -B ${input_dir},${ref_dir},${output_dir} --bind=/local:/local:rw /b06x-isilon/b06x-m/mnp_nanopore/software/clairs-to_latest.sif \
          /opt/bin/run_clairs_to \
          --tumor_bam_fn ${input_dir}/${id}.hg38.bam \
          --sample_name ${id} \
          --ref_fn ${ref_dir}/hg38.fa \
          --threads ${params.threads} \
          --platform ${params.platform} \
          --output_dir ${output_dir} \
          --conda_prefix /opt/micromamba/envs/clairs-to \

All the previous steps run fine and then it runs into this error at the phasing step:

[INFO] Phase the Tumor BAM
[INFO] RUN THE FOLLOWING COMMAND:
( parallel --joblog /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/logs/phasing_log/parallel_2_phase_tumor.log -j 8 /opt/micromamba/envs/clairs-to/bin/longphase phase  -s /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/vcf/{1}.vcf -b /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/bam/392718.calls.mods.sorted.hg38.bam -r /b06x-isilon/b06x-m/mnp_nanopore/software/hg38/hg38.fa -t 8 -o /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/phased_vcf_output/tumor_phased_{1} --ont :::: /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/CONTIGS && parallel -j 8 bgzip -f /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf :::: /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/CONTIGS ) 2>&1 | tee /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/logs/phasing_log/2_phase_tumor.log && parallel -j 8 tabix -f -p vcf /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/phasing_output/phased_vcf_output/tumor_phased_{1}.vcf.gz :::: /b06x-isilon/b06x-m/mnp_nanopore/analysis/ONT_R00200/snv/ClairsTo/tmp/CONTIGS

parsing VCF ... [W::vcf_parse] Contig '##contig=<ID=chr1,length=248387328>' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##contig=<ID=chr1,length=248387328>"
pos 0 missing GT value
parsing VCF ... [W::vcf_parse] Contig '##contig=<ID=chr1,length=248387328>' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##contig=<ID=chr1,length=248387328>"
pos 0 missing GT value
parsing VCF ... [W::vcf_parse] Contig '##contig=<ID=chr1,length=248387328>' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::bcf_hrec_check] Invalid contig name: "##contig=<ID=chr1,length=248387328>"
pos 0 missing GT value

The full log file is here-
run_clairs_to.log

Could you please help me with this?

Many thanks,
Areeba.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.