I am attempting to benchmark various SV callers and their combinations.

Question regarding matching parameters about truvari (closed)

acenglish commented on May 28, 2024

Question regarding matching parameters

from truvari.

Comments (1)

ACEnglish commented on May 28, 2024

Picking matching parameters is more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison.

For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of pctsim==0, pctsize==.5, pctovl==.5, refdist==1000 may seem fair.

Spiral Genetics' BioGraph Discovery reports precise breakpoints and full alternate allele sequences, so when benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. I also use defaults when doing a benchmark comparison to tools like Manta.

If you're still having trouble picking thresholds, it may be beneficial to do a few runs of truvari over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. I built truvari to be flexible in this manner, but more importantly, truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt).
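The strict-to-lenient sweep described above could be scripted roughly as follows. This is only a sketch: the ladder of threshold values is made up for illustration, the file names are placeholders, and the `truvari bench` subcommand and flag spellings are assumptions based on the parameter names used in this thread, so check `--help` for your installed version before running anything. The script only builds and prints the commands; it does not execute them.

```python
import shlex

# Hypothetical ladder of settings, strictest first (illustrative values only).
SETTINGS = [
    {"pctsim": 0.9, "pctsize": 0.9, "refdist": 250},
    {"pctsim": 0.7, "pctsize": 0.7, "refdist": 500},
    {"pctsim": 0.5, "pctsize": 0.5, "refdist": 1000},
    {"pctsim": 0.0, "pctsize": 0.5, "refdist": 1000},
]


def build_commands(base_vcf, comp_vcf, out_prefix, settings=SETTINGS):
    """Return one argument list per threshold setting.

    The flag names below are assumed, not verified against any
    particular truvari release.
    """
    commands = []
    for i, params in enumerate(settings):
        cmd = [
            "truvari", "bench",
            "-b", base_vcf,
            "-c", comp_vcf,
            "-o", f"{out_prefix}_run{i}",  # one output directory per run
        ]
        for name, value in params.items():
            cmd += [f"--{name}", str(value)]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Placeholder file names; substitute your own VCFs.
    for cmd in build_commands("base.vcf.gz", "calls.vcf.gz", "bench"):
        print(shlex.join(cmd))
```

Keeping each run in its own output directory makes it easy to compare the performance metrics side by side and to inspect which calls change status between neighbouring settings.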

If you're curious about what these specific parameters mean, you can see the implementation details pulled from the code below. Also, I've recently expanded the documentation in the README with more detail about pctsim. See the section Comparing Haplotype Sequences of Variants. For more details on the Levenshtein distance ratio, see https://stackabuse.com/levenshtein-distance-and-text-similarity-in-python/
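As a quick aside on pctsim before the truvari snippets, here is a minimal pure-Python sketch of a Levenshtein-style similarity ratio. It is not truvari's code. It assumes the ratio is defined as (len(a) + len(b) - distance) / (len(a) + len(b)), with a substitution counted as cost 2, which is how the `ratio` function of the python-Levenshtein package is usually described; the function names here are hypothetical, and how truvari builds the sequences it compares is covered in the README section mentioned above.

```python
def indel_distance(a, b):
    """Edit distance where insert/delete cost 1 and a substitution costs 2."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else 2)
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[-1]


def similarity_ratio(a, b):
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return (total - indel_distance(a, b)) / total


if __name__ == "__main__":
    print(similarity_ratio("ACGT", "ACGT"))  # identical sequences
    print(similarity_ratio("ACGT", "ACGA"))  # one mismatching base
    print(similarity_ratio("AAAA", "TTTT"))  # nothing in common
```

Under this definition, a pctsim threshold is a lower bound on the ratio: two calls only match if their sequences score at or above it, and pctsim==0 disables the sequence check entirely.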

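# pctsize
def var_sizesim(sizeA, sizeB):
    """
    Calculate the size similarity pct for the two entries.
    Each size comes from the longer of an entry's two alleles (REF or ALT).
    Returns (size similarity ratio, sizeA - sizeB).
    """
    # Guard added here for illustration: two zero-length entries
    # would otherwise raise ZeroDivisionError.
    if max(sizeA, sizeB) == 0:
        return 1.0, 0
    return min(sizeA, sizeB) / float(max(sizeA, sizeB)), sizeA - sizeB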

# pctovl
def get_rec_ovl(astart, aend, bstart, bend):
    """
    Compute reciprocal overlap between two spans.
    Returns the overlapping length divided by the longer span's length,
    or 0 when the spans do not overlap.
    """
    ovl_start = max(astart, bstart)
    ovl_end = min(aend, bend)
    if ovl_start < ovl_end:  # Otherwise, they're not overlapping
        ovl_pct = float(ovl_end - ovl_start) / max(aend - astart, bend - bstart)
    else:
        ovl_pct = 0
    return ovl_pct
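To make the two measures concrete, here is a small worked example. The functions are repeated so the snippet runs on its own; the coordinates, the `passes` helper, and the threshold values are hypothetical and only illustrate how a stricter or looser setting changes the outcome for one pair of deletion calls.

```python
def var_sizesim(sizeA, sizeB):
    """Size similarity ratio and signed size difference (as quoted above)."""
    return min(sizeA, sizeB) / float(max(sizeA, sizeB)), sizeA - sizeB


def get_rec_ovl(astart, aend, bstart, bend):
    """Reciprocal overlap between two spans (as quoted above)."""
    ovl_start = max(astart, bstart)
    ovl_end = min(aend, bend)
    if ovl_start < ovl_end:
        return float(ovl_end - ovl_start) / max(aend - astart, bend - bstart)
    return 0


def passes(base, comp, pctsize, pctovl):
    """Hypothetical helper: do two (start, end) deletions meet both thresholds?"""
    size_ratio, _ = var_sizesim(base[1] - base[0], comp[1] - comp[0])
    overlap = get_rec_ovl(base[0], base[1], comp[0], comp[1])
    return size_ratio >= pctsize and overlap >= pctovl


if __name__ == "__main__":
    base = (1000, 2000)  # 1000 bp deletion in the base set
    comp = (1200, 2100)  # 900 bp deletion from a caller, shifted by 200 bp
    print(var_sizesim(1000, 900))      # size ratio 900/1000, difference 100
    print(get_rec_ovl(*base, *comp))   # 800 bp shared / 1000 bp longer span
    print(passes(base, comp, pctsize=0.5, pctovl=0.5))
    print(passes(base, comp, pctsize=0.95, pctovl=0.5))
```

With these numbers the pair shares a size ratio of 0.9 and a reciprocal overlap of 0.8, so it matches under the lenient setting and fails once pctsize is raised to 0.95.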

For refdist, we only consider matching comparison calls within REFDIST base-pairs upstream of the base call's start position and REFDIST base-pairs downstream of the base call's end position.
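The refdist window can be sketched as below. This is an illustration rather than truvari's implementation: the function name is hypothetical, and it assumes "within" means the whole comparison call lies inside the padded window. A looser reading, where any overlap with the window is enough, is also plausible.

```python
def within_refdist(base_start, base_end, comp_start, comp_end, refdist):
    """Return True if the comparison call lies inside the base call's
    span padded by refdist bp on each side.

    Assumption: the comparison call must be fully contained in the window.
    """
    window_start = base_start - refdist
    window_end = base_end + refdist
    return window_start <= comp_start and comp_end <= window_end


if __name__ == "__main__":
    # Base call at 1000-2000, comparison call at 600-1900 (made-up coordinates).
    print(within_refdist(1000, 2000, 600, 1900, refdist=500))  # window 500-2500
    print(within_refdist(1000, 2000, 600, 1900, refdist=100))  # window 900-2100
```

Only calls that pass this positional filter go on to be compared by size, overlap, and sequence similarity, so a small refdist can hide matches from callers with imprecise breakpoints.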
