Giter Club home page Giter Club logo

irelate's Introduction

irelate.go

Streaming relation (overlap, distance, KNN) testing of (any number of) sorted files of intervals.

Currently supports BED, BAM, GFF, VCF.

![GoDoc] (https://godoc.org/github.com/brentp/irelate?status.png) Build Status Coverage Status

Motivation

We want to relate (e.g. intersect or by distance) sets of intervals. For example, we may want to report the nearest gene to a set of ChIP-Seq peaks. BEDTools does this extremely well, irelate is an attempt to provide an API so that users can write their own tools with little effort in go.

Design

  • data-sources must support the Relatable Interface. (we provide parsers for common formats).
  • a user-defined function returns true if 2 Relatable's are related. (only a small number of interval-pairs are sent to be tested--this is handled automatically by IRelate.). We provide CheckRelatedByOverlap to perform overlap testing.
  • i.Related() gives access to all of the related intervals (after they are added internally by IRelate)
  • the "API" is a for loop
  • A parallel chrom-sweep algorithm is used that avoids problems with chromosome order and parallelizes nicely up to about a dozen CPUs.

Example

(also see main/main.go which is similar to bedtools intersect -sorted -sortout -c)

print the number of b alignments that overlap an interval in a

// CheckRelatedByOverlap returns true if Relatables overlap.
func CheckRelatedByOverlap(a Relatable, b Relatable) bool {
        // note with distance == 0 this just overlap.
        return (b.Start() < a.End()) && (b.Chrom() == a.Chrom())
}

// determine ordering of Relatables.
func Less(a Relatable, b Relatable) bool {
    if a.Chrom() != b.Chrom() {
        return a.Chrom() < b.Chrom()
    }
    return a.Start() < b.Start() // || (a.Start() == b.Start() && a.End() < b.End())
}



// a and b are channels that send Relatables.
a, _ := bix.New('intervals.bed.gz')
b, _ := bix.New('some.vcf.gz')
for interval := range IRelate(CheckRelatedByOverlap, 0, Less, a, b) {
    fmt.Fprintf("%s\t%d\t%d\t%d\n", interval.Chrom(), interval.Start(), interval.End(), len(interval.Related()))
}

The 2nd argument determines the query set of intervals. So, only intervals from a (the 0th) source will be sent from IRelate. If this is set to -1, then all intervals from all sources will be sent. After this, any number of interval streams can be passed to IRelate

If we only want to count variants with a given mapping quality, the loop becomes:

for interval := range IRelate(CheckRelatedByOverlap, 0, Less, a, b) {
    n := 0
    for _, b := range interval.Related() {
         // cast to a bam to ge the mapping quality.
         if int(b.(*Variant).Score()) > 20 {
             n += 1
         }
    }
    fmt.Fprintf("%s\t%d\t%d\t%d\n", interval.Chrom(), interval.Start(), interval.End(), n))
}

note that any number of interval sources are supported even though the example is with 2. We can see the source of each interval with: interval.Source(). That value is set automatically inside of irelate.

This is a very simple example, but the point of this is that since the interface is a simple function (as in CheckRelatedByOverlap) and a for loop, it is easy to create custom applications.

For example, here is the function to relate all intervals within 2KB:

// CheckRelatedBy2KB returns true if intervals are within 2KB.
func CheckRelatedBy2KB(a Relatable, b Relatable) bool {
        distance := uint32(2000)
        // note with distance == 0 this just overlap.
        return (b.Start()-distance < a.End()) && (b.Chrom() == a.Chrom())
}

Note that we are guaranteed that b.Start() >= a.Start() so the check is quite simple.

Relatable

a key interface in irelated is:

// Relatable provides all the methods for irelate to function.
// See Interval in interval.go for a class that satisfies this interface.
// Related() likely returns and AddRelated() likely appends to a slice of
// relatables. Note that for performance reasons, Relatable should be implemented
// as a pointer to your data-structure (see Interval).
type Relatable interface {
        Chrom() string
        Start() uint32
        End() uint32
        Related() []Relatable // A slice of related Relatable's filled by IRelate
        AddRelated(Relatable) // Adds to the slice of relatables
        SetSource() uint32    // Internally marks the source (file/stream) of the Relatable
}

Performance

irelate is quite fast, but use PIRelate for parallel intersection. It is less flexible than irelate, but skips parsing of database intervals for sparse regions in the query. In addition, it has very good (automatic) parallelization.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.