daler / pybedtools Goto Github PK
View Code? Open in Web Editor NEWPython wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
Home Page: http://daler.github.io/pybedtools
License: Other
Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")
Home Page: http://daler.github.io/pybedtools
License: Other
from comments on https://gist.github.com/887065 :
hmm, it could also be incorporated into closest (and others):
rh = r.closest(h, d=True, keep_new=('name', 'dist'))
then you get the original bed + only 2 new columns. then have:
class Bed2(BedFeature):
__slots__ = ('chr', 'name', 'start', 'stop', 'name2', 'dist')
and that class could be automatically generated inside closest. but that might be getting too complicated.
I have some BED3 features and I want to add a strand to them. Currently, the best way to do this is to grab the fields of the existing feature and then append dummy data to pad out the rest of the fields until strand is reached. This seems really awkward:
import pybedtools
x = pybedtools.example_bedtool('venn.b.bed')
def add_strand(f, strand):
fields = f.fields[:]
while len(fields) < 5:
fields.append('.')
fields.append(strand)
return pybedtools.create_interval_from_list(fields)
x.head(3)
x.each(add_strand, "+").saveas().head(3)
Ideally I'd like to the much more reasonable:
def add_strand(f, strand):
f.strand = strand
return f
However, this requires putting the first version of add_strand
above into the Cython code as well as similar code for e.g. GFF format. I'm not sure if this is worth it yet.
Expounding a little more on this . . . Strand is editable, but does not appear in the fields or the string representation. Score acts similarly.
i = x[0]
print i
# chr1 42999170 42999233
i.strand = "+"
print i # doesn't show up
# chr1 42999170 42999233
print i.strand # but the attribute exists
# '+'
print i.fields # just not added to fields
# ['chr1', '42999170', '42999233']
Furthermore, default strand values differ depending on how the Interval was made: IntervalIterator
, Interval
, and create_interval_from_list
all have strand
defaults of '.'
, but Intervals yielded by IntervalFile
have empty values.
This is because IntervalFile
's next()
method calls create_interval
, which in turn uses the underlying C++ BED class, which in turn has default values of empty string. Need some more thought on this, but a possible solution would be to send everything through create_interval_from_list
-- but only as long as this doesn't cost speed.
Trying to get coverage of 5' ends:
bedtools genomecov -bg -5 -ibam t.bam
I either can't figure out the proper keyword or it's not accounted for properly:
In [1]: from pybedtools import BedTool
In [2]: a = BedTool("BPSK96243.local.bam")
In [3]: b = a.genome_coverage(5=True,bg=True)
File "<ipython-input-10-49c9c92e7ea4>", line 1
SyntaxError: keyword can't be an expression
Build is via github source.
Creating BedTools from lists or iterables of tuples should be supported. Currently you need an iterator of Intervals . . .
import pybedtools
x = pybedtools.BedTool([
("chr1", 1, 100),
("chr1", 500, 505)])
not sure the best way to install files in pybedtools/test/data into a system wide-accessible location. currently setup.py is using data_files kwarg -- does this put things in a reasonable place?
setup.py should require argparse and easy_install it if needed
and not require cython to install, e.g.:
http://docs.cython.org/src/userguide/source_files_and_compilation.html#distributing-cython-modules
not sure if we should also go to setuptools instead of distutils or not
BEDTools raises an error -- file does not exist -- when trying to open an empty file. Venn diagram construction can create zero-length output files that are then sent to BEDTools to work on.
At least for the case of Venn diagrams, this should not be considered an error, since it's mostly outside the user's control. In fact, finding out whether there are zero jointly-shared features may be one reason why they're making a Venn diagram in the first place.
My issue:
I'm reading in a GTF file with exons. I want to collect the exons for all transcripts of each gene (as intervals in a list), create a BedTool from the list of exons for each gene, merge exons, and determine the total number of nucleotides covered by the exons. The problem is that when i create a bedtool from the list of intervals and call .merge(), the start sites end up incremented by one (i.e. as if the temporary file created was using the string fields instead of the correct .start attribute. That surprised me. Is this intended? If I'm missing something fundamental, feel free to ignore me and delete the issue.
Thanks.
Wolfgang
Reproducible example:
GTF file:
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4
IPython session:
In [1]: import pybedtools
In [2]: gtf = pybedtools.BedTool("test.gtf")
In [3]: gtf.file_type
Out[3]: 'gff'
In [4]: gtf_intervals = list(gtf)
In [5]: gtf_intervals
Out[5]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
In [6]: print "\n".join(str(x) for x in gtf)
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4
In [7]: gtf_intervals[0].start
Out[7]: 137924247L
In [24]: new_bt = pybedtools.BedTool(gtf_intervals)
In [25]: new_bt
Out[25]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]
In [26]: new_bt_merged = new_bt.merge()
In [27]: new_bt_merged
Out[27]: <BedTool(/tmp/pybedtools.aVwERJ.tmp)>
In [28]: new_bt_merged[0]
Out[28]: Interval(chr5:137924248-137924468)
In [29]: new_bt_merged[0].start
Out[29]: 137924248L
I tried using easy_install and pip to install, but received error messages after the "creating build/temp.macosx-10.5-x86_64-2.7/src" step. I've included a partial list of them below:
gcc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.5.sdk -DNDEBUG -g -O3 -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.5.sdk -Isrc/ -I/Library/Frameworks/EPD64.framework/Versions/7.3/include/python2.7 -c pybedtools/cbedtools.cpp -o build/temp.macosx-10.5-x86_64-2.7/pybedtools/cbedtools.o
In file included from /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/vector:66,
from pybedtools/cbedtools.cpp:253:
/Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/bits/stl_algobase.h:64:28: error: bits/c++config.h: No such file or directory
In file included from /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/bits/stl_algobase.h:69,
from /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/vector:66,
from pybedtools/cbedtools.cpp:253:
I used sudo and double checked that the files/directories it's complaining about do indeed exist (which they do). I am running Mac OS.10.8...
Starting a new issue to keep things organized...this was previously brought up on
https://github.com/daler/pybedtools/issues#issue/3 :
so i think now, each bedtools object will have an array of _feature_objects
so if it's a bed-file, it'll be
_feature_objects = [features.BedFeature]
then if you do an intersection with a gff file, it'll be:
_feature_objects = [features.BedFeature, features.GFFFeature]
then we can start to implement what you suggest in the comment above.
brentp about 15 hours ago | link | edit | delete
that requires that each class knows how many fields it consumes.and probably means that the Feature() init method should accept the array, rather than the string.
So should a Feature accept an array or a string?
As a concrete example to think about, imagine we have a line with 2 BED features. So:
bedtool._feature_objects = [features.BedFeature, features.BedFeature]
The line looks like:
chr1 200 500 feature1 1.5 + chr1 300 400
We assume we know the field counts of each feature, so:
bedtool._feature_objects[0]._field_num = 6
bedtool._feature_objects[1]._field_num = 3
Now for some thinking aloud:
If the string is to be passed to features.BedFeature, then we'd have to increment the BedFeature.position_locs
by 6 for the second feature so that it knows where to start. This could be implemented using an offset kwarg to Feature
objects, which would just add those offsets to self.locs
.
So bedtool.features()
might look something like:
for line in open(self.fn):
line_features = []
offset = 0
for featureclass in self._feature_objects:
line_features.append(featureclass(line, offset=offset))
offset += featureclass._field_num
yield line_features
If, on the other hand, Feature
objects took an array, then it might look like:
for line in open(self.fn):
line_features = []
L = line.rstrip().split('\t')
offset = 0
for featureclass in self._feature_objects:
line_features.append(featureclass(L[offset:offset+featureclass._field_num]))
yield line_features
The string case seems more elegant somehow, but requires the line to be split twice -- once in each featureclass.__init__
method. So I think the array case might be faster.
I think splitting the line within bedtool.features()
should be the limit of parsing though -- I don't think we want to have the Feature
signature as something like:
Feature(chrom='chr1', start=200, stop=500, attrs=attrs)
because it would mean bedtools.features()
needs all the brains that are already nicely encapsulated in each of the features.*
classes.
in the end, should look something like this:
mypool = multiprocessing.Pool(25)
bt.randomstats(_orig_pool=mypool, *args, **kwargs)
bt.random_op(_orig_pool=mypool, *args, **kwargs)
bt.random_jaccard(_orig_pool=mypool, *args, **kwargs)
Hello,
I am randomly generating reads from a BED and fasta file to simulate read coverage, using pybedtools.BedTool and pybedtools.BedTool.sequence. After I generate ~1000 paired-end reads (so 2000 sequences total), I get this error:
[obot@grettir ~]$ ./simulate_read_coverage.py --fasta /home/hyjkim/reference/mm9/genome/mm9.validated.fa --bed mm9_ensembl-genes_single-exon.bed
finished generating 0 paired-end reads
finished generating 1000 paired-end reads
<type 'exceptions.OSError'>: Too many open files
The command was:
fastaFromBed -fi /home/hyjkim/reference/mm9/genome/mm9.validated.fa -bed /tmp/pybedtools.kmwo6f.tmp -fo /tmp/pybedtools.FdABx4.tmp
Things to check:
* Too many files open -- please submit a bug report so that this can be fixed
Traceback (most recent call last):
File "./simulate_read_coverage.py", line 251, in <module>
main();
File "./simulate_read_coverage.py", line 214, in main
seq1 = bed1.sequence(cl.args['fasta']).print_sequence().split()
File "/opt/python/lib/python2.7/site-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/bedtool.py", line 578, in decorated
result = method(self, *args, **kwargs)
File "/opt/python/lib/python2.7/site-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/bedtool.py", line 188, in wrapped
check_stderr=check_stderr)
File "/opt/python/lib/python2.7/site-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/helpers.py", line 450, in call_bedtools
raise OSError('See above for commands that gave the error')
OSError: See above
Is there away to close and/or remove the temporary files?
Thanks,
Olga
Several BedTool methods -- for example, feature_centers
, normalized_counts
, and lengths
-- are no longer that useful since the new .each()
method can do operations per feature in a more flexible manner.
These methods should be removed, and their analogous .each()
function version should be provided in a new module.
For example, rename_feature
would become a new function that accepts an Interval:
def rename_features(feature, new_name):
feature.name = new_name
return feature
The new module could then act as a repository of useful per-feature functions to serve as ideas and documentation for users
Hello,
I have a bed file that looks like this:
chr1 3044313 3044814 +
chr1 3092096 3092206 +
chr1 4519097 4519204 +
chr1 5053140 5054728 +
chr1 7167819 7169118 +
chr1 7255883 7256013 +
chr1 8806218 8806328 +
chr1 9448750 9448853 +
chr1 9535488 9537535 +
chr1 9564628 9565730 +
And when I parse it using pybedtools.Bedtool("bed_file"), I can't get the strand extracted properly:
>>> all_transcripts = pybedtools.Bedtool("bed_file")
>>> import random as r
>>> txpt = r.choice(all_transcripts)
>>> print 'txpt', txpt
txpt chr6 8451235 8451356 -
>>> print 'txpt.length', txpt.length
txpt.length 121
>>> print 'txpt.strand:', txpt.strand
txpt.strand:
unless I'm missing something this snippet demonstrates the problem, only the first block actually generates a seqfn with any content:
from pybedtools import BedTool
import pybedtools
#from_string=True OK
chrom, start, end = "chr1", 1, 10
fa = pybedtools.example_filename('test.fa')
loc = BedTool("%s\t%i\t%i" % (chrom, start, end), from_string=True)
print open(loc.sequence(fi=fa).seqfn).read()
# no content in .seqfn
loc = BedTool([(chrom, start, end)])
print open(loc.sequence(fi=fa).seqfn).read()
# no content in .seqfn
out = open('t.fa', 'w')
out.write("%s\t%i\t%i" % (chrom, start, end))
out.close()
loc = BedTool(out.name)
print open(loc.sequence(fi=fa).seqfn).read()
I checked the tests and they also seem to only test the case where from_string=True
Design question (for anyone, but hopefully @brentp and @arq5x can weigh in):
Currently, an Interval object is equal to another if chrom, start, stop, and [possibly] name match -- see
pybedtools/pybedtools/cbedtools.pyx
Line 156 in f13ff48
Since an Interval object really just encapsulates a line in a file, I think it would be preferable to do matching on the reconstituted string. That way, if attributes are added, or strand is changed, Intervals will no longer be identical.
However, this means "<" and ">" no longer have meaning. I think this is OK, cause "<" and ">" operators have fuzzy semantics when strand comes into play -- for example, does >
mean "higher coords than" or "downstream from" ?
I propose implementing this with a __hash__
method, so that Intervals can become keys in dicts and we can make sets out of them. Then the __richcmp__
just tests hashes, which should be really fast (Cython accesses Python's C API for hash()
)
Like so:
# in Interval class, cbedtools.pyx
def __hash__(self):
return hash(str(self))
def __richcmp__(self, other, int op):
if op == 2:
return hash(self) == hash(other)
if op == 3:
return hash(self) != hash(other)
else:
raise NotImplementedError('Interval objects can only be tested '
'for equality')
thoughts?
call_bedtools
works, but i'm not convinced it's very efficient with Interval generators.
that's because currentlycall_bedtools
will "render" an generator down to the full string when sending info to p.communicate:
stdout, stderr = p.communicate('\n'.join(generator))
and this sort of defeats the purpose of iterators/generators.
possibly some ideas on this here?:
Just checked out the source and ran python setup.py develop
running build_ext
building 'pybedtools.cbedtools' extension
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/bedFile.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/bedFile.o
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/fileType.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/fileType.o
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/gzstream.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/gzstream.o
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c pybedtools/cbedtools.cpp -o build/temp.macosx-10.6-x86_64-2.7/pybedtools/cbedtools.o
i686-apple-darwin10-gcc-4.2.1: pybedtools/cbedtools.cpp: No such file or directory
i686-apple-darwin10-gcc-4.2.1: no input files
error: command '/usr/bin/gcc-4.2' failed with exit status 1
pybedtools.cpp should be built from pybedtools.pyx ? Doesn't seem to run through cython at the minute. Am | doing it right?
Hi,
When I run groupby
on a given BED, the result is not always a BED, but it's still extremely convenient to call groupby
from pybedtools. When the result of groupby
is not a true BED, it yields the following error when iterating through the result:
File "cbedtools.pyx", line 686, in pybedtools.cbedtools.IntervalFile.file_type (pybedtools/cbedtools.cpp:8852)
File "cbedtools.pyx", line 561, in pybedtools.cbedtools.create_interval_from_list (pybedtools/cbedtools.cpp:7293)
pybedtools.cbedtools.MalformedBedLineError: Unable to detect format from ['ENSMUSG00000025903', 'chr1', '4822392', '4822427', '11', '+']
Here is a small test case that reproduces this problem:
import os
import pybedtools
bed = pybedtools.BedTool(os.path.expanduser("~/test.bed"))
c_cols = [1, 2, 3, 5, 6]
# Gene ID
group_cols = [7]
# Operation columns: get comma-separated list of clusters
# including potential duplicates
ops_cols = len(c_cols) * ["collapse"]
grouped_bed = bed.groupby(g=group_cols, c=c_cols, ops=ops_cols)
print "Grouped without a problem.", grouped_bed
print "Now iterating:"
for line in grouped_bed:
print line.fields
test.bed
is:
chr1 4822392 4822427 180 11 + ENSMUSG00000025903
chr1 4848367 4848399 196 12 + ENSMUSG00000025903,ENSMUSG00000033813
chr1 5088389 5088423 221 12 + ENSMUSG00000033793
chr1 6797675 6797720 383 13 + ENSMUSG00000033740
chr1 7160723 7160752 420 24 + ENSMUSG00000051285
chr1 9902237 9902267 671 16 + ENSMUSG00000046101
chr1 10055420 10055451 707 11 + ENSMUSG00000056763
chr1 11131981 11132014 827 31 + ENSMUSG00000048960
chr1 13335253 13335284 1109 14 - ENSMUSG00000005886
chr1 14277180 14277209 1216 28 - ENSMUSG00000025932
Is there a way to iterate through a BedTool
even though it's malformed? Or access a BedTool
as a generator of lists that can then be converted to something else (like a numpy array)? I totally understand that there's no need to support non-BED files so I just want to extract the groupby
result and then treat it as a different non-BED format. Thanks!
should be able to build PDF docs, but make latexpdf
returns errors like
! Missing \endgroup inserted.
<inserted text>
\endgroup
l.8311 ...n{OriginalVerbatim}[commandchars=\\\{\}]
currently .name has a branch for isGff, but it assumes the key in the attribute string is ID.
https://github.com/brentp/pybedtools/blob/b59d5f65c7814b6f1a723c32a15d203d7b6ac36f/pybedtools/cbedtools.pyx#L42
next, do we want to enable setting the name for GFF and GTF? or should we just allow it for bed files?
Hi! I'm trying to install on OSX 10.8.2 with gcc 4.2 and Python 2.7.3 -- EPD 7.3-1 (32-bit).
when I try to install with pip or easy_install I get a error:
easy_install pybedtools
Searching for pybedtools
Reading http://pypi.python.org/simple/pybedtools/
Reading http://pypi.python.org/simple/pybedtools/none
Best match: pybedtools 0.6.1
Downloading http://pypi.python.org/packages/source/p/pybedtools/pybedtools-0.6.1.tar.gz#md5=c67a34e48deb8d747795be09f9bc3dd6
Processing pybedtools-0.6.1.tar.gz
Writing /var/folders/1j/lkx2dg7d7tn6f4t4c6lwmt9r0000gq/T/easy_install-TLOphz/pybedtools-0.6.1/setup.cfg
Running pybedtools-0.6.1/setup.py -q bdist_egg --dist-dir /var/folders/1j/lkx2dg7d7tn6f4t4c6lwmt9r0000gq/T/easy_install-TLOphz/pybedtools-0.6.1/egg-dist-tmp-j83m23
warning: pybedtools/cbedtools.pyx:596:16: Unreachable code
src/bedFile.cpp: In instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’:
src/bedFile.cpp:41: instantiated from here
src/bedFile.cpp:41: error: explicit instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’ but no definition available
src/bedFile.cpp: In instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’:
src/bedFile.cpp:41: instantiated from here
src/bedFile.cpp:41: error: explicit instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’ but no definition available
src/bedFile.cpp: In instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’:
src/bedFile.cpp:41: instantiated from here
src/bedFile.cpp:41: error: explicit instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’ but no definition available
error: Setup script exited with error: command 'gcc' failed with exit status 1
For streaming intersections of moderate-sized files (say, >5000 features), the following blocks::
z = a.intersect(b, stream=True).intersect(c, stream=True)
len(z)
The schematic below shows what's happening with stdin/stdout and pipes. The above command hangs when trying to write to the stdin of the second process, marked below as ^^^^^^
.
FILE -> stdin-|------------------|-stdout -> PIPE -> stdin-|------------------|-stdout -> PIPE -> IntervalIterator
| intersectBed (1) | | intersectBed (2) |
|------------------|-stderr ^^^^^^ |------------------|-stderr
Despite a forced flush of stdout of command (1) and stdin of command (2) in helpers.call_bedtools,as well as forcing flush of stdout in command (2) in the IntervalIterator, this still blocks.
In the Popen command, setting bufsize=1
or bufsize=0
doesn't help. Docs for Popen.communicate()
say that it'll block for large input.
Various stackoverflow answers for similar problems describe the solution to this as using separate threads for each call, however, initial tests make interactive work in IPython a little crazy.
My guess is that workarounds like "rendering" a streaming BedTool to disk will be needed for the near-to-mid-future, since fixes to this will be difficult.
Hello,
I have a BedTool named bed. When I run this snippet of code,
for i in range(len(bed)):
#print bed[i].start
#print bed[i].stop
#print bed[i].name #= GSM+"_{0}".format(i+1)
print i
if I have one print uncommented, "i" increases to 246 before giving me the error "Error: The requested bed file () could not be opened. Exiting!"
If I run the code with two print lines uncommented, "i" goes to 122 before quitting and I receive that same message.
If i run the code with three print lines uncommented, "i" goes to 81 before quitting and I receive the same message.
The bed file from which I am creating the BedTool has more than 246 lines. I am not sure how to resolve this issue. Looking through the other issues it seems doing something regarding "Intervals" might help with the memory issues, but I am not sure exactly how...
My eventual goal is to be able to run through the BedTool and rename the "name", as seen by the commented code..
Please let me know if my question is confusing or unclear or incorrectly formatted.
Best,
Alan
At least with the version I have installed (0.6.2dev), it looks like the window_maker() wrapper does not wrap the -s option. Are you able to reproduce this?
>>> import pybedtools as pbt
>>> pbt.__version__
'0.6.2dev'
>>> pbt.BedTool().window_maker(genome='hg19', w=1000000, s=500000).head()
chr1 0 1000000
chr1 1000000 2000000
chr1 2000000 3000000
chr1 3000000 4000000
chr1 4000000 5000000
chr1 5000000 6000000
chr1 6000000 7000000
chr1 7000000 8000000
chr1 8000000 9000000
chr1 9000000 10000000
>>> pbt.BedTool().window_maker(genome='hg19', w=1000000, s=0).head()
chr1 0 1000000
chr1 1000000 2000000
chr1 2000000 3000000
chr1 3000000 4000000
chr1 4000000 5000000
chr1 5000000 6000000
chr1 6000000 7000000
chr1 7000000 8000000
chr1 8000000 9000000
chr1 9000000 10000000
I ran these commands successfully in my Ubuntu shell:
$ sudo apt-get install bedtools
$ sudo easy_isntall pybedtools
And then these commands in ipython:
In [1]: import pybedtools
In [2]: a = pybedtools.example_bedtool('x.bam')
In [3]: a.genome_coverage()
NotImplementedError Traceback (most recent call last)
/home/sinclair/OD1/raw/<ipython-input-4-bfb19364e87a> in <module>()
----> 1 a.genome_coverage()
/usr/local/lib/python2.7/dist-packages/pybedtools-0.6.2-py2.7-linux-x86_64.egg/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
621 # this calls the actual method in the first place; *result* is
622 # whatever you get back
--> 623 result = method(self, *args, **kwargs)
624
625 # add appropriate tags
/usr/local/lib/python2.7/dist-packages/pybedtools-0.6.2-py2.7-linux-x86_64.egg/pybedtools/bedtool.pyc in not_implemented_func(*args, **kwargs)
139 if not_implemented:
140 def not_implemented_func(*args, **kwargs):
--> 141 raise NotImplementedError(help_str)
142 return not_implemented_func
143
NotImplementedError: "genomeCoverageBed" does not appear to be installed or on the path, so this method is disabled. Please install a more recent version of BEDTools and re-import to use this method.
The program is in my $PATH
:
$ which genomeCoverageBed
/usr/bin/genomeCoverageBed
The new stream=True kwarg and the ability to create BedTool objects out of Interval iterators means that the code in the BEDTool wrappers (BedTool.intersect, BedTool.subtract, etc) gets increasingly complex. Currently, stream is only implemented for BedTool.intersect. This needs to be factored out so it can easily be used by other wrapper methods.
I think this should be a method of BedTool, since it will have to figure out how it was constructed.
Thinking aloud . . . this method should:
a
kwarga
needs to be pulled from the current BedTool
a
is an IntervalIterator, make it yield text lines (rather than Intervals) and send that as stdin
to call_bedtool
and set the first arg to '-'
a
is an open file, send that directly as stdin
to call_bedtool
and set first arg to '-'
a
is a string, then stdin
is None, and call_bedtool
gets -a <filename>
as first argb
needs to be collapsed from an iterable into a tempfile (since you can't stream both -a and -b to a BEDTools program)
call_bedtool
I noticed that pybedtools tries to respect the input format when using saveas
, even though internally it always uses BED-like coordinates (which is a good design in my opinion), so that if you load a GFF file and try to save it, it outputs a GFF:
b = BedTool("foo.gff")
# foo.bed will be a GFF, not a BED
b.saveas("foo.bed")
Is there a way to force pybedtools
to convert the GFF to a BED? saveas
will still take trackline
arguments even if b
is a GFF-like BedTool:
b.saveas("foo.bed", trackline="name'test'")
will output:
$ head foo.bed
name='test'
chr10 SE exon 80065614 80065694 . + . ID=chr10:80065614:80065694:+@chr10:80065984:80066058:+@chr10:80071548:80071595:+.A.up;Parent=chr10:80065614:80065694:+@chr10:80065984:80066058:+@chr10:80071548:80071595:+.A
which would make sense for a BED but not for a GFF, so the result is an invalid GFF. It could be useful to have an argument like: as_bed=True
that forces a BedTool
to be serialized as BED if it's a GFF (vice versa is more complicated since GFF has too many fields). Is there already a way to do this that I missed? Thanks.
there are some docstrings in the code already but they arent quite in doctest format.
having them run as part of the test-suite--and the stuff in feature.py would be nice.
I try to merge two BedTools objects but internally one key in kwargs is 'None' which is not accepted.
import pybedtools
annotated = pybedtools.BedTool('file1.bed')
discovered = pybedtools.BedTool('file2.bed')
hg19genome = pybedtools.chromsizes('hg19')
ann_slop = annotated.slop(g=hg19genome, b=10)
disc_slop = discovered.slop(g=hg19genome, b=10)
mergd = ann_slop.merge(disc_slop, s=True, nms=True)
Gives the error:
/usr/local/lib/python2.7/dist-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
171 # At runtime, this will parse the kwargs, convert streams to
172 # tempfiles if needed, and return all the goodies
--> 173 cmds, tmp, stdin = self.handle_kwargs(prog=prog, **kwargs)
174
175 # Do the actual call
TypeError: handle_kwargs() keywords must be strings
And kwargs is
{None: <BedTool(/tmp/pybedtools.Iz9EUg.tmp)>, 's': True, 'nms': True}
The slopped files are fine themselves:
ipdb> disc_slop.head(2)
chr21 40563340 40563360 AATAAA 8 -
chr21 40564162 40564182 AGTAAA 12 -
ipdb> ann_slop.head(2)
chr1 328449 328470 0 0 +
chr1 529943 529964 0 0 +
In the method's current form, the name "with_column" isn't very intuitive. How about something like "transform"? Or "apply" like R? Other ideas?
the goal is to drop the scipy dependency for randomstats . . .
you guys have probably already handled this but let's add some tests that mix gff and bed input/output and make sure everything is as we expect for length, overlap amount, etc.
This is to start discussion on how to implement e.g.: features.gfffeature, features.vcffeature
could start with a generic features.feature class that takes a line and a tuple for chr, start, stop.
so bed would be
features.feature(line_arr, position_locs=(0, 1, 2))
and gff:
features.feature(line_arr, position_locs=(0, 3, 4))
or something like that. it could be an attribute on the class. so:
class bedfeature(feature):
position_locs = (0, 1, 2)
class gfffeature(feature):
position_locs = (0, 3, 4)
then use getattr for the other kwargs to set things like rgbItem and what not more generically.
Ryan, if you agree, and/or give some feedback, i'll be glad to take a first stab.
BedTool.with_column() currently requires "outcols" contents to be integers because there's no __setitem__
method.
e.g.,
a.with_column(incols=['start','stop'], func=length, outcols=[4])
instead of
a.with_column(incols=['start','stop'], func=length, outcols=['score'])
My naive attempt to at least get an integer Interval.__setitem__
working gives a cython compile error:
def __setitem__(self, object key, object value):
cdef int i
if isinstance(key, (int,long)):
nfields = self._bed.fields.size()
if key >= nfields:
raise IndexError('field index out of range')
elif key < 0: key = nfields + key
self._bed.fields[key] = value
^
------------------------------------------------------------
pybedtools/cbedtools.pyx:146:41: Cannot convert Python object to 'string'
How to do this?
BTW thanks for this nice library and for keeping improving it.
Luca
I have two fairly small bed files (~1600 features in each, overlapping by ~500 features). I'd like to calculate the empirical overlap p-value but to get a decently small p-value, I need a lot of shuffles.
pybedtools is apparently opening a new file for each iteration (not sure how since clearly the files are unlinked and the python objects are deleted as soon as the intersect results are done. relevant code
Any way around this? I can increase the user limit for open files on this machine, but I doubt it will suffice for tens of millions of overlaps... Looks like it's crapping out at only 4541 files on this machine.
In [2]: import pybedtools
In [2]: chromsizes = {'chr1':(1,1000)}
In [3]: a = pybedtools.example_bedtool('a.bed').set_chromsizes(chromsizes)
In [4]: b = pybedtools.example_bedtool('b.bed')
In [5]: results = a.randomstats(b, 1000000, debug=True)
<type 'exceptions.OSError'>: Too many open files
The command was:
shuffleBed -i /usr/local/lib/python2.7/dist-packages/pybedtools/test/data/a.bed -seed 4540 -g /tmp/pybedtools.DsnJBh.tmp
Things to check:
* Too many files open -- are you creating lots of BedTool objects without calling del() on them?
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
/home/wbiesing/<ipython console> in <module>()
/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in randomstats(self, other, iterations, **kwargs)
1543 distribution = self.randomintersection(other, iterations=iterations,
1544 **kwargs)
-> 1545 distribution = np.array(list(distribution))
1546
1547 # Median of distribution
/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in randomintersection(self, other, iterations, intersect_kwargs, shuffle_kwargs, debug)
1624 if debug:
1625 shuffle_kwargs['seed'] = i
-> 1626 tmp = self.shuffle(**shuffle_kwargs)
1627 tmp2 = tmp.intersect(other, **intersect_kwargs)
1628
/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
370 # this calls the actual method in the first place; *result* is
371 # whatever you get back
--> 372 result = method(self, *args, **kwargs)
373
374 # add appropriate tags
/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
174 # Do the actual call
175 stream = call_bedtools(cmds, tmp, stdin=stdin,
--> 176 check_stderr=check_stderr)
177 result = BedTool(stream)
178
/usr/local/lib/python2.7/dist-packages/pybedtools/helpers.pyc in call_bedtools(cmds, tmpfn, stdin, check_stderr)
299 print 'Things to check:'
300 print '\n\t' + '\n\t'.join(problems[err.errno])
--> 301 raise OSError('See above for commands that gave the error')
302
303 return output
OSError: See above for commands that gave the error
Currently in pybedtools we have a more-or-less static version of bedFile.h, a snapshot from some previous version of BEDTools. Recent changes (e.g., 7e989ea) are starting to diverge from BEDTools proper.
But there are some improvements in the latest BEDTools (bedFile.h) that would be nice to include in pybedtools. I'd also like to be flexible during future development so that when new features show up in BEDTools we can incorporate them into pybedtools. Furthermore, if we find bugs we could fix them in pybedtools and have nice clean patches to send to BEDTools.
There's probably some git hackery that will mirror bedtools as a sort of sub-repo inside pybedtools. Assuming most changes to bedtools are new features (rather than API changes) then cbedtools.pyx/pxi should still work with each BEDTools commit but just won't have the additional features until we manually add them. That way it's not too much of a maintenance overhead but still remains up to date.
Thoughts?
it works, just checking
The following code illustrates the leak:
import pybedtools
fn = pybedtools.example_filename('a.bed')
orig_fds = pybedtools.helpers.n_open_fds()
# no len()
n1 = []
for i in range(100):
x = pybedtools.BedTool(fn)
n1.append(pybedtools.helpers.n_open_fds())
print max(n1) - orig_fds
#2
# check len() each time
n2 = []
for i in range(100):
x = pybedtools.BedTool(fn)
len(x)
n2.append(pybedtools.helpers.n_open_fds())
print max(n2) - orig_fds
#100
I often use BED files that have _random
chromosome entries in them that are not necessarily part of my genome build, and I generally want these to be ignored. If I use fastaFromBed
to make FASTAs out of these files, fastaFromBed
just outputs a warning that these regions cannot be fetched but it correctly handles other regions in the file. I found that the sequence
wrapper to fastaFromBed
throws an error for these warnings:
mybed = pybedtools.BedTool("chrX_random 1 100", from_string=True)
mybed.sequence(fi="mm9.fa", fo="seq.fa", s=True)
results in:
File "/usr/local/lib/python2.7/dist-packages/pybedtools-0.6.2-py2.7-linux-x86_64.egg/pybedtools/helpers.py", line 412, in call_bedtools
'stderr, above', stderr)
pybedtools.helpers.BEDToolsError: ('Error message from BEDTools written to stderr, above', 'WARNING. chromosome (chrX_random) was not found in the FASTA file. Skipping.\n')
Since this is not really an error, it might be better to handle it differently or raise something like BEDToolsError that one could catch in a try/except
clause. The current solutions I can think of are: (1) catching the BEDToolsError
and parsing it out to see if it's a warning, or (2) prefiltering on the fly the entries that have these random chromosomes. Solution (1) is preferable to me, and seems more generic, as in:
try:
mybed.sequence(fi="mm9.fa", fo="seq.fa", s=True)
except pybedtools.helpers.BEDToolsError as bed_error:
if bed_error[1].startswith("WARNING"):
print "Just a warning..."
But would break if Aaron ever changes the output format of warnings. If there are better ways to deal with this I'd love to hear. Thanks.
I am following Ryan's template for creating a BedTool
on the fly to avoid temporary files (as outlined here #55). I'm parsing through a BedTool corresponding to a BED file a
. As I iterate through the file, I'm creating a new BedTool b
using create_interval_from_list
. The new BedTool b
represents some shuffles of each BED interval in a
. Once the new BedTool b
is created, I call b.sequence()
to generate a FASTA from it.
I found that the sequence()
call hangs after outputting a large portion of the FASTA, probably because of some issue with the way I'm making the iterator. Strangely, if I call len(b)
before b.sequence(...)
, then the problem goes away. I've looked at bedtool.py
and noticed that __len__()
calls self.count()
, which is apparently what resolves the iterator. Any idea what might be causing this halting condition? Do I need to raise a StopIteration
when feeding a generator to BedTool
?
Here is my code for making the generator, it's extremely simple:
def get_shuffled_clusters_intervals(clusters,
gene_ids_to_coords,
num_shuffles):
# 'clusters' is a BedTool
for cluster_bed in clusters:
...
# Make some intervals related to the current one
intervals = \
bedtools_utils.sample_intervals(...)
if intervals is None:
continue
# Record these as BED intervals
for shuffle_num, curr_interval in enumerate(intervals):
...
curr_interval_bed = \
pybedtools.create_interval_from_list([cluster_bed.chrom,
str(curr_interval[0]),
str(curr_interval[1]),
interval_name,
".",
str(cluster_bed.strand)])
yield curr_interval_bed
return
Now I create a BedTool using this function and call sequence()
:
sampled_clusters_intervals = \
get_shuffled_clusters_intervals(clusters,
gene_ids_to_coords,
num_shuffles)
# Make BEDTool out of cluster intervals
sampled_clusters_bed = pybedtools.BedTool(sampled_clusters_intervals)
# KEY LINE: without len(sampled_clusters_bed), this hangs
num_sampled_clusters = len(sampled_clusters_bed)
sampled_clusters_bed.sequence(fi=genome_seq_fname,
fo=sampled_clusters_fname,
name=True,
s=True)
Any thoughts on what might cause sequence
to hang? The BedTool seems to be made fine from the generator, it's the sequence
line that actually makes use of the generator that hangs. Thanks!
Update: further confirmation that it's a problem somewhere in the sequence()
call or downstream of it - if I take the BedTool that I made using the generator (sampled_clusters_bed
) and write it to a .bed
first:
with open("test.bed", "w") as file_out:
for bed in sampled_clusters_bed: file_out.write(str(bed))
And then load test.bed
as a BedTool instance, and call that object's .sequence()
method, then it all works. So it must be something about how sequence()
parses the generator it seems.
Currently, the genome registries use plain old dictionaries. Consequently, when iterating through chromosomes, the order of the chroms is unpredictable (well, it's predictable, but undesirable). If you were to switch to an OrderedDict, you could could support a specific (and desirable) chrom iteration order, e.g., [chr1, chr2, chr3, etc.). In the case of scurgen, this would allow the HilbertCurve to have a predictable chromosome ordering when doing whole genome plots. It is doable by writing custom key sort functions, but that is nasty at best. That said, OrderedDict is only in 2.7+
What do you think?
I dont know why this is happening, because a look at the code shows that it is calling close_or_delete but randomstats is leaving a ton of pybedtools.tmp* files in my tmp dir. and calling cleanup() does not remove them.
Perhaps what's getting sent to close_or_delete is a filehandle?
I've tried calling randomstats with an object and with object.fn and it never cleans up the files.
My call looks like this:
res = bed.randomstats(loh.fn, 100, processes=25)
I have three BED Tools, a b and c, and I expect that (a+b+c).count() should == (b+a+c).count(), since that function is intended to extract features that are common to all three BedTools. (as discussed here: http://packages.python.org/pybedtools/3-brief-examples.html#example-2-intersections-for-a-3-way-venn-diagram) However, in my tests, I find conflicting results, depending on the order of the operands. Also, the total number present in (a+b+c).count() [or any other order] seems to be an over-estimate of the actual number of overlaps, given that I find elements in (a+b+c) that are specific to only b. I can provide BED files for testing if needed.
as a consequence, "venn_mpl.py -c a.BED -a b.BED -b c.BED " gives a different result from "venn_mpl.py -a a.BED -b b.BED -c c.BED "
This test (commented out for now) doesn't run with GFF files:
https://github.com/daler/pybedtools/blob/master/pybedtools/test/test1.py#L123
It's because within IntervalIterator, I'm using create_interval_from_list, which appears to be BED-specific:
https://github.com/daler/pybedtools/blob/master/pybedtools/helpers.py#L328
Is there a good way to create an Interval out of a BED/GFF/VCF line?
Not sure what's going on, but here's how to reproduce it:
import pybedtools
f = pybedtools.example_bedtool('d.gff')
for i in f.intersect(f, stream=True):
print i
When a BedTool is instantiated from a list of intervals, it seems that you can only iterate over it once, including from any instance methods which happen to use the iterator.
Failing code:
import pybedtools
a = pybedtools.BedTool([pybedtools.Interval('chr1', 0, 1)])
print len(a)
print len(a)
The first print statement returns 1, while the second returns 0. This causes major problems if you want to use an instance method that relies on the iterator, such as field_count()
and intersect()
.
pybedtools: v0.6.2
OS X 10.8
sometimes, a user will just want to do something like:
for ab in a.intersection(b):
# do something with ab
and doesn't need to store the file. pybedtools could reduce some file/io by allowing a keyword like
a.intersection(b, stream=True)
which would then return a generator of features, streamed from Popen.stdout and not available after the first iteration unless saved as a list.
Currently, closing streams in the __del__
method of BedTools still leaves open file handles somewhere. See this comment for details:
Specifically, the following code should not return a "too many files open" error after a few thousand intersections:
import pybedtools
import sys
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
for i in range(10000):
sys.stderr.write('\r%s' % i)
sys.stderr.flush()
c = a.intersect(b, stream=True)
del c
Note, however, that this works fine (without the stream=True), suggesting that converting into an iterable is the issue:
import pybedtools
import sys
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
for i in range(10000):
sys.stderr.write('\r%s' % i)
sys.stderr.flush()
c = a.intersect(b)
del c
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.