Giter Club home page Giter Club logo

Comments (13)

daler avatar daler commented on July 30, 2024

yeah, actually this is what I've been working on with the genomicfeatures
package (https://github.com/daler/genomicfeatures)!

genomicfeatures has everything but VCF format (just haven't gotten around
to it yet). Each is derived from a base Interval class, and has things
like a length() method and chrom/start/stop/strand attributes, similar to
what you suggest.

One issue I'd like to deal with is that bedtools has all sorts of derived,
bed-like files. For example, when you do an intersect with the -c flag,
you get an additional count column.

Another example is if you do an intersection but specify the -wao flag,
in which case you get 2 bed features on each line.

Eventually, I'd like to have pybedtools autodetect the kind of feature in
the file it's created from (or the user could give it hints) and keep track
of that feature type throughout the history of the bedtool instance.

I think a generic CompositeInterval class, which for now I'm just
imagining as a tuple of features, could handle the intersect -wao cases I mentioned above. So when bedtool.intersect sees the
wao=True kwarg, it knows that it should combine the feature types of a
and b into a new CompositeInterval.

Then you could do things like

a = bedtool('a.gff')
b = bedtool('b.bed')
c = a.intersect(b, wao=True)
for gfffeature,bedfeature in c.features():
    print len(gfffeature), len(bedfeature)

A potential issue with genomicfeatures is that it depends on Cython and
pysam. I figure though that people who use bedtools will have these
installed anyway, so maybe those dependencies aren't that big of an issue.

from pybedtools.

brentp avatar brentp commented on July 30, 2024

ah right, i'd seen genomicfeatures but not looked at it much.
might be able to save a lot of code doing:

class feature:
    position_locs = (0, 1, 2)
    __slots__ = ('chr', 'start', 'end', 'name')
    def __init__(self, line_arr):

        for attr, val in zip(self.__slots__, line_arr):
             setattr(self, val)
        self.start, self.end = int(self.start), int(self.end)
        self.post_initialize()

    @property
    def length(self):
          return self.end - self.start

class gfffeature(feature):
    __slots__ = ('chr', 'id', 'something', 'start', 'end', ...)
    position_locs = (0, 3, 4)

    def post_initialize(self):
          # set attrs

so then you just set the slots for each new class. the downside is it's probably slower than your implementation.

from pybedtools.

daler avatar daler commented on July 30, 2024

I like that idea a lot, and it'll definitely be easier to work into pybedtools.

With genomicfeatures I was aiming for fast, full-featured classes for reuse in other code -- it's the stuff implied in those post_initialize() methods that get to be time-consuming (esp. with big BED12 files).

But maybe it's not appropriate to force that sort of optimization into pybedtools yet, and instead take a more pure (and cleaner) Python approach to get something quickly that actually works.

from pybedtools.

brentp avatar brentp commented on July 30, 2024

i will make a start of it. i suppose post_initialize type methods could be done in cython later.
or it could be even lazier

class feature:
    position_locs = (0, 1, 2)
    __slots__ = ('chr', 'start', 'end', 'name', 'other', 'blah', 'line_arr')
    def __init__(self, line_arr):
        self.line_arr = line_arr
        self.chr = line_arr[position_locs[0]] 
        self.start = int(line_arr[postion_locs[1]])
        self.end = int(line_arr[postion_locs[2]])

        self.post_initialize()

    def __getattr__(self, field):
        idx = self.__slots__.index(field)
        if idx == -1: raise AttributeError("%s does not exist" % field)
        return self.line_arr[idx]

then you delay / avoid some of the parsing overhead. and get a bit of performance
improvement from slots.
and each subclass just has to define slots, postion_locs, and any other @properties like gff .attrs

from pybedtools.

brentp avatar brentp commented on July 30, 2024

started in SHA: 784dfa8

let me know any comments. i think i will change so that the gff/gtf attributes don't get parsed until first use.

from here, it needs something that guess the appropriate class based on the column count and content. maybe we can see how aaron did that.

from pybedtools.

brentp avatar brentp commented on July 30, 2024

so i think now, each bedtools object will have an array of _feature_objects

so if it's a bed-file, it'll be

_feature_objects = [features.BedFeature]

then if you do an intersection with a gff file, it'll be:

_feature_objects = [features.BedFeature, features.GFFFeature]

then we can start to implement what you suggest in the comment above.

from pybedtools.

brentp avatar brentp commented on July 30, 2024

that requires that each class knows how many fields it consumes.

and probably means that the Feature() init method should accept the array, rather than the string.

from pybedtools.

daler avatar daler commented on July 30, 2024

Looking good. I need to get better at git so I can pull your branches and make branches of those . . . and in the end get everything back into the master branch . . .

The __getattr__ method is nice and elegant. Do you think it would be useful to type cast as necessary (e.g., float score fields for BED and GFF) or leave everything as a string? In the minimilist spirit, maybe here we pass the buck to the user to convert as needed. What do you think?

Re: autodetect --

How about an autodetect() function that accepts a line and tries to figure out what it is:

# within the bedtool __init__ ...
feature_classes = auto_detect(line) # returns, e.g., [features.BedFeature]
self._feature_objects = feature_classes

This would only have to happen for the first handful of lines in a file, so speed shouldn't be affected too much. Once the feature classes are known for individual bedtools, combining them shouldn't be that bad using the history mechanism. Plus, a bedtool object would know when you're using intersect(c=True), and can increment field counts and create 'derived' feature classes appropriately.

As far as implementing an autodetect...

GTF/GFF always have 8 fields and can be differentiated based on attributes delimiters and the required "gene_id" field in GTFs. e.g., GFF satisfy the conditions:

attrs = self._line_arr[-1]
assert self._field_count == 8
assert (attrs.count('=') == attrs.count(';')) or (attrs.count('=') == attrs.count(';')-1)

GTF features should satisfy:

attrs = self._line_arr[-1]
assert self._field_count == 8
assert 'gene_id' in attrs
assert (attrs.count('=') != attrs.count(';')) and (attrs.count('=') != attrs.count(';')-1)

Of course, pathological cases like a GTF feature with a gene named '==' would break this, but hopefully it would be good enough for now.

SAM/BAM have 11 fields, so that should be easy (assuming BED11 isn't really a valid format since you need both block sizes and starts). I'm not familiar with VCF though.

BEDs with variable fields are the tricky ones, plus all the "bedtools-derived" formats (e.g., intersectBed -c which extends the field count by 1). Will need more work here.

from pybedtools.

brentp avatar brentp commented on July 30, 2024

cool. i'll try to write auto_detect(), i think your workflow sounds good for that:

  • check gff/gtf first, i.e. field_count == 8
  • check for field_count == 11 (and maybe check that filename doesn't end with .bed)
  • other types including vcf
  • otherwise, assume bed

maybe each Feature class can define a class method .isa(line_arr)
then as we add a class, it is responsible for that.

so:

class GFFFeature(self):
    @classmethod
    def isa(line_arr):
          return len(line_arr) == 8 and line_arr[-1].count("=") == line_arr[-1].count(";")

the the auto_detect function calls the isa() class method on each Feature subclass in some preset order.

def auto_detect(fn):
      line_arr = _get_non_comment_line(fn).split("\t")
      for klass in [GFFFeature, GTFFeature, SamFeature, VCFFeature, ..., BedFeature]:
           if klass.isa(line_arr): return klass
       raise UnrecognizedFileException(fn)

we do use variable-field beds a lot--just throwing stuff into the extra columns, so i agree that will be important to address.
and i have a lot of scripts that run intersectBed, then cut the appropriate columns...

from pybedtools.

daler avatar daler commented on July 30, 2024

i like the isa idea a lot, making each class responsible for its own detection. That ought to make adding new feature classes much easier, too.

I'd prefer not to put too much weight on extensions (.bed or otherwise), because I have a ton of lazily-named files that don't have extensions. Plus, all the tempfiles made in the background by pybedtools would need to add the appropriate extensions as well.

from pybedtools.

brentp avatar brentp commented on July 30, 2024

ah, noted on the extensions. i also dont know much about git.
but i think you can do something like:

$ git remote add brentp git://github.com/brentp/pybedtools.git
$ git checkout -b brentp 
$ git fetch brentp
$ git rebase brentp/master # or merge.
# check/change what i did.
$ git checkout master
$ git rebase brentp # or merge.

from pybedtools.

daler avatar daler commented on July 30, 2024

ok cool.

I started a new issue to discuss the Features string vs array thing: https://github.com/daler/pybedtools/issues#issue/4

from pybedtools.

daler avatar daler commented on July 30, 2024

closed because 1) the features work great now, including length() and 2) we're not autodetecting any more

from pybedtools.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.