Comments (13)
yeah, actually this is what I've been working on with the genomicfeatures
package (https://github.com/daler/genomicfeatures)!
genomicfeatures
has everything but VCF format (just haven't gotten around
to it yet). Each is derived from a base Interval
class, and has things
like a length()
method and chrom/start/stop/strand attributes, similar to
what you suggest.
One issue I'd like to deal with is that bedtools has all sorts of derived,
bed-like files. For example, when you do an intersect with the -c flag,
you get an additional count column.
Another example is if you do an intersection but specify the -wao flag,
in which case you get 2 bed features on each line.
Eventually, I'd like to have pybedtools
autodetect the kind of feature in
the file it's created from (or the user could give it hints) and keep track
of that feature type throughout the history of the bedtool
instance.
I think a generic CompositeInterval
class, which for now I'm just
imagining as a tuple of features, could handle the intersect -wao
cases I mentioned above. So when bedtool.intersect
sees the
wao=True
kwarg, it knows that it should combine the feature types of a
and b
into a new CompositeInterval
.
Then you could do things like
a = bedtool('a.gff')
b = bedtool('b.bed')
c = a.intersect(b, wao=True)
for gfffeature,bedfeature in c.features():
print len(gfffeature), len(bedfeature)
A potential issue with genomicfeatures
is that it depends on Cython and
pysam. I figure though that people who use bedtools will have these
installed anyway, so maybe those dependencies aren't that big of an issue.
from pybedtools.
ah right, i'd seen genomicfeatures
but not looked at it much.
might be able to save a lot of code doing:
class feature:
position_locs = (0, 1, 2)
__slots__ = ('chr', 'start', 'end', 'name')
def __init__(self, line_arr):
for attr, val in zip(self.__slots__, line_arr):
setattr(self, val)
self.start, self.end = int(self.start), int(self.end)
self.post_initialize()
@property
def length(self):
return self.end - self.start
class gfffeature(feature):
__slots__ = ('chr', 'id', 'something', 'start', 'end', ...)
position_locs = (0, 3, 4)
def post_initialize(self):
# set attrs
so then you just set the slots for each new class. the downside is it's probably slower than your implementation.
from pybedtools.
I like that idea a lot, and it'll definitely be easier to work into pybedtools
.
With genomicfeatures
I was aiming for fast, full-featured classes for reuse in other code -- it's the stuff implied in those post_initialize()
methods that get to be time-consuming (esp. with big BED12 files).
But maybe it's not appropriate to force that sort of optimization into pybedtools
yet, and instead take a more pure (and cleaner) Python approach to get something quickly that actually works.
from pybedtools.
i will make a start of it. i suppose post_initialize type methods could be done in cython later.
or it could be even lazier
class feature:
position_locs = (0, 1, 2)
__slots__ = ('chr', 'start', 'end', 'name', 'other', 'blah', 'line_arr')
def __init__(self, line_arr):
self.line_arr = line_arr
self.chr = line_arr[position_locs[0]]
self.start = int(line_arr[postion_locs[1]])
self.end = int(line_arr[postion_locs[2]])
self.post_initialize()
def __getattr__(self, field):
idx = self.__slots__.index(field)
if idx == -1: raise AttributeError("%s does not exist" % field)
return self.line_arr[idx]
then you delay / avoid some of the parsing overhead. and get a bit of performance
improvement from slots.
and each subclass just has to define slots, postion_locs, and any other @properties like gff .attrs
from pybedtools.
started in SHA: 784dfa8
let me know any comments. i think i will change so that the gff/gtf attributes don't get parsed until first use.
from here, it needs something that guess the appropriate class based on the column count and content. maybe we can see how aaron did that.
from pybedtools.
so i think now, each bedtools object will have an array of _feature_objects
so if it's a bed-file, it'll be
_feature_objects = [features.BedFeature]
then if you do an intersection with a gff file, it'll be:
_feature_objects = [features.BedFeature, features.GFFFeature]
then we can start to implement what you suggest in the comment above.
from pybedtools.
that requires that each class knows how many fields it consumes.
and probably means that the Feature() init method should accept the array, rather than the string.
from pybedtools.
Looking good. I need to get better at git so I can pull your branches and make branches of those . . . and in the end get everything back into the master branch . . .
The __getattr__
method is nice and elegant. Do you think it would be useful to type cast as necessary (e.g., float score fields for BED and GFF) or leave everything as a string? In the minimilist spirit, maybe here we pass the buck to the user to convert as needed. What do you think?
Re: autodetect --
How about an autodetect()
function that accepts a line and tries to figure out what it is:
# within the bedtool __init__ ...
feature_classes = auto_detect(line) # returns, e.g., [features.BedFeature]
self._feature_objects = feature_classes
This would only have to happen for the first handful of lines in a file, so speed shouldn't be affected too much. Once the feature classes are known for individual bedtools, combining them shouldn't be that bad using the history mechanism. Plus, a bedtool object would know when you're using intersect(c=True)
, and can increment field counts and create 'derived' feature classes appropriately.
As far as implementing an autodetect...
GTF/GFF always have 8 fields and can be differentiated based on attributes delimiters and the required "gene_id" field in GTFs. e.g., GFF satisfy the conditions:
attrs = self._line_arr[-1]
assert self._field_count == 8
assert (attrs.count('=') == attrs.count(';')) or (attrs.count('=') == attrs.count(';')-1)
GTF features should satisfy:
attrs = self._line_arr[-1]
assert self._field_count == 8
assert 'gene_id' in attrs
assert (attrs.count('=') != attrs.count(';')) and (attrs.count('=') != attrs.count(';')-1)
Of course, pathological cases like a GTF feature with a gene named '==' would break this, but hopefully it would be good enough for now.
SAM/BAM have 11 fields, so that should be easy (assuming BED11 isn't really a valid format since you need both block sizes and starts). I'm not familiar with VCF though.
BEDs with variable fields are the tricky ones, plus all the "bedtools-derived" formats (e.g., intersectBed -c which extends the field count by 1). Will need more work here.
from pybedtools.
cool. i'll try to write auto_detect(), i think your workflow sounds good for that:
- check gff/gtf first, i.e. field_count == 8
- check for field_count == 11 (and maybe check that filename doesn't end with .bed)
- other types including vcf
- otherwise, assume bed
maybe each Feature class can define a class method .isa(line_arr)
then as we add a class, it is responsible for that.
so:
class GFFFeature(self):
@classmethod
def isa(line_arr):
return len(line_arr) == 8 and line_arr[-1].count("=") == line_arr[-1].count(";")
the the auto_detect function calls the isa() class method on each Feature subclass in some preset order.
def auto_detect(fn):
line_arr = _get_non_comment_line(fn).split("\t")
for klass in [GFFFeature, GTFFeature, SamFeature, VCFFeature, ..., BedFeature]:
if klass.isa(line_arr): return klass
raise UnrecognizedFileException(fn)
we do use variable-field beds a lot--just throwing stuff into the extra columns, so i agree that will be important to address.
and i have a lot of scripts that run intersectBed, then cut the appropriate columns...
from pybedtools.
i like the isa
idea a lot, making each class responsible for its own detection. That ought to make adding new feature classes much easier, too.
I'd prefer not to put too much weight on extensions (.bed or otherwise), because I have a ton of lazily-named files that don't have extensions. Plus, all the tempfiles made in the background by pybedtools
would need to add the appropriate extensions as well.
from pybedtools.
ah, noted on the extensions. i also dont know much about git.
but i think you can do something like:
$ git remote add brentp git://github.com/brentp/pybedtools.git
$ git checkout -b brentp
$ git fetch brentp
$ git rebase brentp/master # or merge.
# check/change what i did.
$ git checkout master
$ git rebase brentp # or merge.
from pybedtools.
ok cool.
I started a new issue to discuss the Features string vs array thing: https://github.com/daler/pybedtools/issues#issue/4
from pybedtools.
closed because 1) the features work great now, including length() and 2) we're not autodetecting any more
from pybedtools.
Related Issues (20)
- pybedtools.bedtool.BedTool.save_seqs leaves open .tmp files
- Support Python 3.10 and 3.11 HOT 1
- "python setup.py bdist_wheel did not run successfully" when pip installing with python v3.11 HOT 8
- to_dataframe() creates 0th row with generic names in nucleotide_content HOT 2
- build failure under python 3.11 HOT 6
- pybedtools intersect error HOT 2
- Cannot create a BedTool object from list of regions that uses np.int64 coordinates
- remove historical py27 support HOT 1
- bedtools intersect reported incorrect interval intersection HOT 3
- Cythonizing files requires `language_level=2` to be set in cythonize() HOT 4
- pybedtools multi_bam_coverage assistance HOT 2
- "fastaFromBed" error HOT 2
- intersect with multiple -b arguments not working with -sorted HOT 1
- Unable to install pybedtools==0.9.1 in Python3.10 HOT 4
- Len modifying the Bedtools after a filter HOT 2
- Has pybedtools considered packaging bedtools? HOT 3
- how to mask gap regions for randomization? HOT 1
- Issue while doing pip install pybedtools HOT 3
- Inconsistent behaviour when using files from `pathlib.PosixPath` with BedTool functions...
- pybedtools.bedtool.Bedtool.sort()
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pybedtools.