This is to start discussion on how to implement e.g.: features.gfffeature, features.vc

I like that idea a lot, and it'll definitely be easier to work into <code class="notra

started in SHA: <a class="commit-link" data-hovercard-type="commit" data-hovercard-url

implement features.feature class to support methods like length() for files other than .bed about pybedtools HOT 13 CLOSED

daler commented on July 30, 2024

implement features.feature class to support methods like length() for files other than .bed

from pybedtools.

Comments (13)

daler commented on July 30, 2024

yeah, actually this is what I've been working on with the genomicfeatures
package (https://github.com/daler/genomicfeatures)!

genomicfeatures has everything but VCF format (just haven't gotten around
to it yet). Each is derived from a base Interval class, and has things
like a length() method and chrom/start/stop/strand attributes, similar to
what you suggest.

One issue I'd like to deal with is that bedtools has all sorts of derived,
bed-like files. For example, when you do an intersect with the -c flag,
you get an additional count column.

Another example is if you do an intersection but specify the -wao flag,
in which case you get 2 bed features on each line.

Eventually, I'd like to have pybedtools autodetect the kind of feature in
the file it's created from (or the user could give it hints) and keep track
of that feature type throughout the history of the bedtool instance.

I think a generic CompositeInterval class, which for now I'm just
imagining as a tuple of features, could handle the intersect -wao cases I mentioned above. So when bedtool.intersect sees the
wao=True kwarg, it knows that it should combine the feature types of a
and b into a new CompositeInterval.

Then you could do things like

a = bedtool('a.gff')
b = bedtool('b.bed')
c = a.intersect(b, wao=True)
for gfffeature,bedfeature in c.features():
    print len(gfffeature), len(bedfeature)

A potential issue with genomicfeatures is that it depends on Cython and
pysam. I figure though that people who use bedtools will have these
installed anyway, so maybe those dependencies aren't that big of an issue.

from pybedtools.

brentp commented on July 30, 2024

ah right, i'd seen genomicfeatures but not looked at it much.
might be able to save a lot of code doing:

class feature:
    position_locs = (0, 1, 2)
    __slots__ = ('chr', 'start', 'end', 'name')
    def __init__(self, line_arr):

        for attr, val in zip(self.__slots__, line_arr):
             setattr(self, val)
        self.start, self.end = int(self.start), int(self.end)
        self.post_initialize()

    @property
    def length(self):
          return self.end - self.start

class gfffeature(feature):
    __slots__ = ('chr', 'id', 'something', 'start', 'end', ...)
    position_locs = (0, 3, 4)

    def post_initialize(self):
          # set attrs

so then you just set the slots for each new class. the downside is it's probably slower than your implementation.

from pybedtools.

daler commented on July 30, 2024

I like that idea a lot, and it'll definitely be easier to work into pybedtools.

With genomicfeatures I was aiming for fast, full-featured classes for reuse in other code -- it's the stuff implied in those post_initialize() methods that get to be time-consuming (esp. with big BED12 files).

But maybe it's not appropriate to force that sort of optimization into pybedtools yet, and instead take a more pure (and cleaner) Python approach to get something quickly that actually works.

from pybedtools.

brentp commented on July 30, 2024

i will make a start of it. i suppose post_initialize type methods could be done in cython later.
or it could be even lazier

class feature:
    position_locs = (0, 1, 2)
    __slots__ = ('chr', 'start', 'end', 'name', 'other', 'blah', 'line_arr')
    def __init__(self, line_arr):
        self.line_arr = line_arr
        self.chr = line_arr[position_locs[0]] 
        self.start = int(line_arr[postion_locs[1]])
        self.end = int(line_arr[postion_locs[2]])

        self.post_initialize()

    def __getattr__(self, field):
        idx = self.__slots__.index(field)
        if idx == -1: raise AttributeError("%s does not exist" % field)
        return self.line_arr[idx]

then you delay / avoid some of the parsing overhead. and get a bit of performance
improvement from slots.
and each subclass just has to define slots, postion_locs, and any other @properties like gff .attrs

from pybedtools.

brentp commented on July 30, 2024

started in SHA: 784dfa8

let me know any comments. i think i will change so that the gff/gtf attributes don't get parsed until first use.

from here, it needs something that guess the appropriate class based on the column count and content. maybe we can see how aaron did that.

from pybedtools.

brentp commented on July 30, 2024

so i think now, each bedtools object will have an array of _feature_objects

so if it's a bed-file, it'll be

_feature_objects = [features.BedFeature]

then if you do an intersection with a gff file, it'll be:

_feature_objects = [features.BedFeature, features.GFFFeature]

then we can start to implement what you suggest in the comment above.

from pybedtools.

brentp commented on July 30, 2024

that requires that each class knows how many fields it consumes.

and probably means that the Feature() init method should accept the array, rather than the string.

from pybedtools.

daler commented on July 30, 2024

Looking good. I need to get better at git so I can pull your branches and make branches of those . . . and in the end get everything back into the master branch . . .

The __getattr__ method is nice and elegant. Do you think it would be useful to type cast as necessary (e.g., float score fields for BED and GFF) or leave everything as a string? In the minimilist spirit, maybe here we pass the buck to the user to convert as needed. What do you think?

Re: autodetect --

How about an autodetect() function that accepts a line and tries to figure out what it is:

# within the bedtool __init__ ...
feature_classes = auto_detect(line) # returns, e.g., [features.BedFeature]
self._feature_objects = feature_classes

This would only have to happen for the first handful of lines in a file, so speed shouldn't be affected too much. Once the feature classes are known for individual bedtools, combining them shouldn't be that bad using the history mechanism. Plus, a bedtool object would know when you're using intersect(c=True), and can increment field counts and create 'derived' feature classes appropriately.

As far as implementing an autodetect...

GTF/GFF always have 8 fields and can be differentiated based on attributes delimiters and the required "gene_id" field in GTFs. e.g., GFF satisfy the conditions:

attrs = self._line_arr[-1]
assert self._field_count == 8
assert (attrs.count('=') == attrs.count(';')) or (attrs.count('=') == attrs.count(';')-1)

GTF features should satisfy:

attrs = self._line_arr[-1]
assert self._field_count == 8
assert 'gene_id' in attrs
assert (attrs.count('=') != attrs.count(';')) and (attrs.count('=') != attrs.count(';')-1)

Of course, pathological cases like a GTF feature with a gene named '==' would break this, but hopefully it would be good enough for now.

SAM/BAM have 11 fields, so that should be easy (assuming BED11 isn't really a valid format since you need both block sizes and starts). I'm not familiar with VCF though.

BEDs with variable fields are the tricky ones, plus all the "bedtools-derived" formats (e.g., intersectBed -c which extends the field count by 1). Will need more work here.

from pybedtools.

brentp commented on July 30, 2024

cool. i'll try to write auto_detect(), i think your workflow sounds good for that:

check gff/gtf first, i.e. field_count == 8
check for field_count == 11 (and maybe check that filename doesn't end with .bed)
other types including vcf
otherwise, assume bed

maybe each Feature class can define a class method .isa(line_arr)
then as we add a class, it is responsible for that.

so:

class GFFFeature(self):
    @classmethod
    def isa(line_arr):
          return len(line_arr) == 8 and line_arr[-1].count("=") == line_arr[-1].count(";")

the the auto_detect function calls the isa() class method on each Feature subclass in some preset order.

def auto_detect(fn):
      line_arr = _get_non_comment_line(fn).split("\t")
      for klass in [GFFFeature, GTFFeature, SamFeature, VCFFeature, ..., BedFeature]:
           if klass.isa(line_arr): return klass
       raise UnrecognizedFileException(fn)

we do use variable-field beds a lot--just throwing stuff into the extra columns, so i agree that will be important to address.
and i have a lot of scripts that run intersectBed, then cut the appropriate columns...

from pybedtools.

daler commented on July 30, 2024

i like the isa idea a lot, making each class responsible for its own detection. That ought to make adding new feature classes much easier, too.

I'd prefer not to put too much weight on extensions (.bed or otherwise), because I have a ton of lazily-named files that don't have extensions. Plus, all the tempfiles made in the background by pybedtools would need to add the appropriate extensions as well.

from pybedtools.

brentp commented on July 30, 2024

ah, noted on the extensions. i also dont know much about git.
but i think you can do something like:

$ git remote add brentp git://github.com/brentp/pybedtools.git
$ git checkout -b brentp 
$ git fetch brentp
$ git rebase brentp/master # or merge.
# check/change what i did.
$ git checkout master
$ git rebase brentp # or merge.

from pybedtools.

daler commented on July 30, 2024

ok cool.

I started a new issue to discuss the Features string vs array thing: https://github.com/daler/pybedtools/issues#issue/4

from pybedtools.

daler commented on July 30, 2024

closed because 1) the features work great now, including length() and 2) we're not autodetecting any more

from pybedtools.

implement features.feature class to support methods like length() for files other than .bed about pybedtools HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent