daler / pybedtools Goto Github PK

View Code? Open in Web Editor NEW

297.0 16.0 103.0 28.9 MB

Python wrapper -- and more -- for BEDTools (bioinformatics tools for "genome arithmetic")

Home Page: http://daler.github.io/pybedtools

License: Other

Shell 0.92% Python 82.87% HTML 0.16% C++ 7.54% Dockerfile 0.29% Cython 8.23%

pybedtools's Issues

Automatically create custom derived classes with e.g. `closest`

from comments on https://gist.github.com/887065 :

hmm, it could also be incorporated into closest (and others):

rh = r.closest(h, d=True, keep_new=('name', 'dist'))

then you get the original bed + only 2 new columns. then have:

class Bed2(BedFeature):
    __slots__ = ('chr', 'name', 'start', 'stop', 'name2', 'dist')

and that class could be automatically generated inside closest. but that might be getting too complicated.

editing of strand, score, etc

I have some BED3 features and I want to add a strand to them. Currently, the best way to do this is to grab the fields of the existing feature and then append dummy data to pad out the rest of the fields until strand is reached. This seems really awkward:

import pybedtools
x = pybedtools.example_bedtool('venn.b.bed')

def add_strand(f, strand):
    fields = f.fields[:]
    while len(fields) < 5:
        fields.append('.')
    fields.append(strand)
    return pybedtools.create_interval_from_list(fields)

x.head(3)
x.each(add_strand, "+").saveas().head(3)

Ideally I'd like to the much more reasonable:

def add_strand(f, strand):
    f.strand = strand
    return f

However, this requires putting the first version of add_strand above into the Cython code as well as similar code for e.g. GFF format. I'm not sure if this is worth it yet.

Expounding a little more on this . . . Strand is editable, but does not appear in the fields or the string representation. Score acts similarly.

i = x[0]
print i
# chr1  42999170    42999233

i.strand = "+"
print i  # doesn't show up
# chr1  42999170    42999233

print i.strand  # but the attribute exists
# '+'

print i.fields  # just not added to fields
# ['chr1', '42999170', '42999233']

Furthermore, default strand values differ depending on how the Interval was made: IntervalIterator, Interval, and create_interval_from_list all have strand defaults of '.', but Intervals yielded by IntervalFile have empty values.

This is because IntervalFile's next() method calls create_interval, which in turn uses the underlying C++ BED class, which in turn has default values of empty string. Need some more thought on this, but a possible solution would be to send everything through create_interval_from_list -- but only as long as this doesn't cost speed.

genome coverage read end keyword issues

Trying to get coverage of 5' ends:

bedtools genomecov -bg -5 -ibam t.bam

I either can't figure out the proper keyword or it's not accounted for properly:

In [1]: from pybedtools import BedTool

In [2]: a = BedTool("BPSK96243.local.bam")

In [3]: b = a.genome_coverage(5=True,bg=True)
  File "<ipython-input-10-49c9c92e7ea4>", line 1
SyntaxError: keyword can't be an expression

Build is via github source.

BedTool creation should support lists of tuples

Creating BedTools from lists or iterables of tuples should be supported. Currently you need an iterator of Intervals . . .

import pybedtools
x = pybedtools.BedTool([
        ("chr1", 1, 100),
        ("chr1", 500, 505)])

installing data with setup.py

not sure the best way to install files in pybedtools/test/data into a system wide-accessible location. currently setup.py is using data_files kwarg -- does this put things in a reasonable place?

setup.py

setup.py should require argparse and easy_install it if needed

and not require cython to install, e.g.:

http://docs.cython.org/src/userguide/source_files_and_compilation.html#distributing-cython-modules

not sure if we should also go to setuptools instead of distutils or not

handle empty files in venn diagram construction

BEDTools raises an error -- file does not exist -- when trying to open an empty file. Venn diagram construction can create zero-length output files that are then sent to BEDTools to work on.

At least for the case of Venn diagrams, this should not be considered an error, since it's mostly outside the user's control. In fact, finding out whether there are zero jointly-shared features may be one reason why they're making a Venn diagram in the first place.

start site off-by-one errors when creating BedTool from list of intervals

My issue:
I'm reading in a GTF file with exons. I want to collect the exons for all transcripts of each gene (as intervals in a list), create a BedTool from the list of exons for each gene, merge exons, and determine the total number of nucleotides covered by the exons. The problem is that when i create a bedtool from the list of intervals and call .merge(), the start sites end up incremented by one (i.e. as if the temporary file created was using the string fields instead of the correct .start attribute. That surprised me. Is this intended? If I'm missing something fundamental, feel free to ignore me and delete the issue.

Thanks.
Wolfgang

Reproducible example:

GTF file:
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

IPython session:
In [1]: import pybedtools

In [2]: gtf = pybedtools.BedTool("test.gtf")

In [3]: gtf.file_type
Out[3]: 'gff'

In [4]: gtf_intervals = list(gtf)

In [5]: gtf_intervals
Out[5]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]

In [6]: print "\n".join(str(x) for x in gtf)
chr5 ucsc_refseq exon 137924248 137924468 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 5
chr5 ucsc_refseq exon 137925209 137925388 . - . gene_id "13856"; gene_name "Epo"; transcript_id "NM_007942"; tss_id "tss_07393"; exon_number 4

In [7]: gtf_intervals[0].start
Out[7]: 137924247L

In [24]: new_bt = pybedtools.BedTool(gtf_intervals)

In [25]: new_bt
Out[25]: [Interval(chr5:137924247-137924468), Interval(chr5:137925208-137925388)]

In [26]: new_bt_merged = new_bt.merge()

In [27]: new_bt_merged
Out[27]: <BedTool(/tmp/pybedtools.aVwERJ.tmp)>

In [28]: new_bt_merged[0]
Out[28]: Interval(chr5:137924248-137924468)

In [29]: new_bt_merged[0].start
Out[29]: 137924248L

install failed

I tried using easy_install and pip to install, but received error messages after the "creating build/temp.macosx-10.5-x86_64-2.7/src" step. I've included a partial list of them below:

gcc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.5.sdk -DNDEBUG -g -O3 -arch x86_64 -isysroot /Developer/SDKs/MacOSX10.5.sdk -Isrc/ -I/Library/Frameworks/EPD64.framework/Versions/7.3/include/python2.7 -c pybedtools/cbedtools.cpp -o build/temp.macosx-10.5-x86_64-2.7/pybedtools/cbedtools.o

In file included from /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/vector:66,

             from pybedtools/cbedtools.cpp:253:

/Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/bits/stl_algobase.h:64:28: error: bits/c++config.h: No such file or directory

In file included from /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/bits/stl_algobase.h:69,

             from /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/vector:66,

             from pybedtools/cbedtools.cpp:253:

I used sudo and double checked that the files/directories it's complaining about do indeed exist (which they do). I am running Mac OS.10.8...

Features() -- should they accept lines or arrays?

Starting a new issue to keep things organized...this was previously brought up on
https://github.com/daler/pybedtools/issues#issue/3 :

so i think now, each bedtools object will have an array of _feature_objects

so if it's a bed-file, it'll be

_feature_objects = [features.BedFeature]

then if you do an intersection with a gff file, it'll be:

_feature_objects = [features.BedFeature, features.GFFFeature]

then we can start to implement what you suggest in the comment above.

brentp about 15 hours ago | link | edit | delete
that requires that each class knows how many fields it consumes.

and probably means that the Feature() init method should accept the array, rather than the string.

So should a Feature accept an array or a string?

As a concrete example to think about, imagine we have a line with 2 BED features. So:

bedtool._feature_objects = [features.BedFeature, features.BedFeature]

The line looks like:

chr1  200  500 feature1 1.5 +  chr1  300  400

We assume we know the field counts of each feature, so:

bedtool._feature_objects[0]._field_num = 6
bedtool._feature_objects[1]._field_num = 3

Now for some thinking aloud:

If the string is to be passed to features.BedFeature, then we'd have to increment the BedFeature.position_locs by 6 for the second feature so that it knows where to start. This could be implemented using an offset kwarg to Feature objects, which would just add those offsets to self.locs.

So bedtool.features() might look something like:

for line in open(self.fn):
    line_features = []
    offset = 0
    for featureclass in self._feature_objects:    
        line_features.append(featureclass(line, offset=offset))
        offset += featureclass._field_num
    yield line_features

If, on the other hand, Feature objects took an array, then it might look like:

for line in open(self.fn):
    line_features = []
    L = line.rstrip().split('\t')
    offset = 0
    for featureclass in self._feature_objects:
        line_features.append(featureclass(L[offset:offset+featureclass._field_num]))
    yield line_features

The string case seems more elegant somehow, but requires the line to be split twice -- once in each featureclass.__init__ method. So I think the array case might be faster.

I think splitting the line within bedtool.features() should be the limit of parsing though -- I don't think we want to have the Feature signature as something like:

Feature(chrom='chr1', start=200, stop=500, attrs=attrs)

because it would mean bedtools.features() needs all the brains that are already nicely encapsulated in each of the features.* classes.

re-usable pools for random_op methods

in the end, should look something like this:

mypool = multiprocessing.Pool(25)
bt.randomstats(_orig_pool=mypool, *args, **kwargs)
bt.random_op(_orig_pool=mypool, *args, **kwargs)
bt.random_jaccard(_orig_pool=mypool, *args, **kwargs)

Too many files open when generating reads

Hello,
I am randomly generating reads from a BED and fasta file to simulate read coverage, using pybedtools.BedTool and pybedtools.BedTool.sequence. After I generate ~1000 paired-end reads (so 2000 sequences total), I get this error:

[obot@grettir ~]$ ./simulate_read_coverage.py --fasta /home/hyjkim/reference/mm9/genome/mm9.validated.fa --bed mm9_ensembl-genes_single-exon.bed
finished generating 0 paired-end reads
finished generating 1000 paired-end reads
<type 'exceptions.OSError'>: Too many open files
The command was:

        fastaFromBed -fi /home/hyjkim/reference/mm9/genome/mm9.validated.fa -bed /tmp/pybedtools.kmwo6f.tmp -fo /tmp/pybedtools.FdABx4.tmp

Things to check:

        * Too many files open -- please submit a bug report so that this can be fixed
Traceback (most recent call last):
  File "./simulate_read_coverage.py", line 251, in <module>
    main();
  File "./simulate_read_coverage.py", line 214, in main
    seq1 = bed1.sequence(cl.args['fasta']).print_sequence().split()
  File "/opt/python/lib/python2.7/site-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/bedtool.py", line 578, in decorated
    result = method(self, *args, **kwargs)
  File "/opt/python/lib/python2.7/site-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/bedtool.py", line 188, in wrapped
    check_stderr=check_stderr)
  File "/opt/python/lib/python2.7/site-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/helpers.py", line 450, in call_bedtools
    raise OSError('See above for commands that gave the error')
OSError: See above

Is there away to close and/or remove the temporary files?
Thanks,
Olga

reimplement relevant BedTool methods with .each()

Several BedTool methods -- for example, feature_centers, normalized_counts, and lengths -- are no longer that useful since the new .each() method can do operations per feature in a more flexible manner.

These methods should be removed, and their analogous .each() function version should be provided in a new module.

For example, rename_feature would become a new function that accepts an Interval:

    def rename_features(feature, new_name):
        feature.name = new_name
        return feature

The new module could then act as a repository of useful per-feature functions to serve as ideas and documentation for users

Strand not extracting from .bed

Hello,
I have a bed file that looks like this:

chr1    3044313 3044814 +
chr1    3092096 3092206 +
chr1    4519097 4519204 +
chr1    5053140 5054728 +
chr1    7167819 7169118 +
chr1    7255883 7256013 +
chr1    8806218 8806328 +
chr1    9448750 9448853 +
chr1    9535488 9537535 +
chr1    9564628 9565730 +

And when I parse it using pybedtools.Bedtool("bed_file"), I can't get the strand extracted properly:

>>> all_transcripts = pybedtools.Bedtool("bed_file")
>>> import random as r
>>> txpt = r.choice(all_transcripts)
>>> print 'txpt', txpt
txpt chr6       8451235 8451356 -
>>> print 'txpt.length', txpt.length
txpt.length 121
>>> print 'txpt.strand:', txpt.strand
txpt.strand:

.sequence() only works when from_string=True

unless I'm missing something this snippet demonstrates the problem, only the first block actually generates a seqfn with any content:

from pybedtools import BedTool
import pybedtools

#from_string=True OK
chrom, start, end = "chr1", 1, 10
fa = pybedtools.example_filename('test.fa')
loc = BedTool("%s\t%i\t%i" % (chrom, start, end), from_string=True)
print open(loc.sequence(fi=fa).seqfn).read()


# no content in .seqfn
loc = BedTool([(chrom, start, end)])
print open(loc.sequence(fi=fa).seqfn).read()


# no content in .seqfn
out = open('t.fa', 'w')
out.write("%s\t%i\t%i" % (chrom, start, end))
out.close()
loc = BedTool(out.name)
print open(loc.sequence(fi=fa).seqfn).read()

I checked the tests and they also seem to only test the case where from_string=True

What determines equivalence between Intervals?

Design question (for anyone, but hopefully @brentp and @arq5x can weigh in):

Currently, an Interval object is equal to another if chrom, start, stop, and [possibly] name match -- see

pybedtools/pybedtools/cbedtools.pyx

Line 156 in f13ff48

def __richcmp__(self, other, int op):

Since an Interval object really just encapsulates a line in a file, I think it would be preferable to do matching on the reconstituted string. That way, if attributes are added, or strand is changed, Intervals will no longer be identical.

However, this means "<" and ">" no longer have meaning. I think this is OK, cause "<" and ">" operators have fuzzy semantics when strand comes into play -- for example, does > mean "higher coords than" or "downstream from" ?

I propose implementing this with a __hash__ method, so that Intervals can become keys in dicts and we can make sets out of them. Then the __richcmp__ just tests hashes, which should be really fast (Cython accesses Python's C API for hash())

Like so:

# in Interval class, cbedtools.pyx
def __hash__(self):
    return hash(str(self))

def __richcmp__(self, other, int op):
    if op == 2:
        return hash(self) == hash(other)
    if op == 3:
        return hash(self) != hash(other)
    else:
        raise NotImplementedError('Interval objects can only be tested '
                                  'for equality')

thoughts?

improve call_bedtools performance when using iterators

call_bedtools works, but i'm not convinced it's very efficient with Interval generators.

that's because currentlycall_bedtools will "render" an generator down to the full string when sending info to p.communicate:

stdout, stderr = p.communicate('\n'.join(generator))

and this sort of defeats the purpose of iterators/generators.

possibly some ideas on this here?:

http://dcreager.net/2009/08/13/subprocess-callbacks/

build failure

Just checked out the source and ran python setup.py develop

running build_ext
building 'pybedtools.cbedtools' extension
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/bedFile.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/bedFile.o
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/fileType.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/fileType.o
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/gzstream.cpp -o build/temp.macosx-10.6-x86_64-2.7/src/gzstream.o
cc1plus: warning: command line option "-Wstrict-prototypes" is valid for C/ObjC but not for C++
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -fwrapv -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -Isrc/ -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c pybedtools/cbedtools.cpp -o build/temp.macosx-10.6-x86_64-2.7/pybedtools/cbedtools.o
i686-apple-darwin10-gcc-4.2.1: pybedtools/cbedtools.cpp: No such file or directory
i686-apple-darwin10-gcc-4.2.1: no input files
error: command '/usr/bin/gcc-4.2' failed with exit status 1

pybedtools.cpp should be built from pybedtools.pyx ? Doesn't seem to run through cython at the minute. Am | doing it right?

Iteration error on result of groupby

Hi,

When I run groupby on a given BED, the result is not always a BED, but it's still extremely convenient to call groupby from pybedtools. When the result of groupby is not a true BED, it yields the following error when iterating through the result:

  File "cbedtools.pyx", line 686, in pybedtools.cbedtools.IntervalFile.file_type (pybedtools/cbedtools.cpp:8852)
  File "cbedtools.pyx", line 561, in pybedtools.cbedtools.create_interval_from_list (pybedtools/cbedtools.cpp:7293)
pybedtools.cbedtools.MalformedBedLineError: Unable to detect format from ['ENSMUSG00000025903', 'chr1', '4822392', '4822427', '11', '+']

Here is a small test case that reproduces this problem:

import os
import pybedtools
bed = pybedtools.BedTool(os.path.expanduser("~/test.bed"))
c_cols = [1, 2, 3, 5, 6]
# Gene ID
group_cols = [7]
# Operation columns: get comma-separated list of clusters
# including potential duplicates
ops_cols = len(c_cols) * ["collapse"]
grouped_bed = bed.groupby(g=group_cols, c=c_cols, ops=ops_cols)
print "Grouped without a problem.", grouped_bed
print "Now iterating:"
for line in grouped_bed:
    print line.fields

test.bed is:

chr1    4822392 4822427 180 11  +   ENSMUSG00000025903
chr1    4848367 4848399 196 12  +   ENSMUSG00000025903,ENSMUSG00000033813
chr1    5088389 5088423 221 12  +   ENSMUSG00000033793
chr1    6797675 6797720 383 13  +   ENSMUSG00000033740
chr1    7160723 7160752 420 24  +   ENSMUSG00000051285
chr1    9902237 9902267 671 16  +   ENSMUSG00000046101
chr1    10055420    10055451    707 11  +   ENSMUSG00000056763
chr1    11131981    11132014    827 31  +   ENSMUSG00000048960
chr1    13335253    13335284    1109    14  -   ENSMUSG00000005886
chr1    14277180    14277209    1216    28  -   ENSMUSG00000025932

Is there a way to iterate through a BedTool even though it's malformed? Or access a BedTool as a generator of lists that can then be converted to something else (like a numpy array)? I totally understand that there's no need to support non-BED files so I just want to extract the groupby result and then treat it as a different non-BED format. Thanks!

build PDF docs

should be able to build PDF docs, but make latexpdf returns errors like

! Missing \endgroup inserted.
<inserted text> 
                \endgroup 
l.8311 ...n{OriginalVerbatim}[commandchars=\\\{\}]

.name attribute

currently .name has a branch for isGff, but it assumes the key in the attribute string is ID.
https://github.com/brentp/pybedtools/blob/b59d5f65c7814b6f1a723c32a15d203d7b6ac36f/pybedtools/cbedtools.pyx#L42

first, is it always ID for GFF?
next shall we assume 'gene_name' for GTF?
anything to do for VCF?

next, do we want to enable setting the name for GFF and GTF? or should we just allow it for bed files?

pybedtools-0.6.1build error

Hi! I'm trying to install on OSX 10.8.2 with gcc 4.2 and Python 2.7.3 -- EPD 7.3-1 (32-bit).

when I try to install with pip or easy_install I get a error:

easy_install pybedtools
Searching for pybedtools
Reading http://pypi.python.org/simple/pybedtools/
Reading http://pypi.python.org/simple/pybedtools/none
Best match: pybedtools 0.6.1
Downloading http://pypi.python.org/packages/source/p/pybedtools/pybedtools-0.6.1.tar.gz#md5=c67a34e48deb8d747795be09f9bc3dd6
Processing pybedtools-0.6.1.tar.gz
Writing /var/folders/1j/lkx2dg7d7tn6f4t4c6lwmt9r0000gq/T/easy_install-TLOphz/pybedtools-0.6.1/setup.cfg
Running pybedtools-0.6.1/setup.py -q bdist_egg --dist-dir /var/folders/1j/lkx2dg7d7tn6f4t4c6lwmt9r0000gq/T/easy_install-TLOphz/pybedtools-0.6.1/egg-dist-tmp-j83m23
warning: pybedtools/cbedtools.pyx:596:16: Unreachable code
src/bedFile.cpp: In instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’:
src/bedFile.cpp:41: instantiated from here
src/bedFile.cpp:41: error: explicit instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’ but no definition available
src/bedFile.cpp: In instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’:
src/bedFile.cpp:41: instantiated from here
src/bedFile.cpp:41: error: explicit instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’ but no definition available
src/bedFile.cpp: In instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’:
src/bedFile.cpp:41: instantiated from here
src/bedFile.cpp:41: error: explicit instantiation of ‘std::basic_ostream<_CharT, _Traits>& std::operator<<(std::basic_ostream<_CharT, _Traits>&, const std::basic_string<_CharT, _Traits, _Alloc>&) [with _CharT = char, _Traits = std::char_traits, _Alloc = std::allocator]’ but no definition available
error: Setup script exited with error: command 'gcc' failed with exit status 1

pipe blocking

For streaming intersections of moderate-sized files (say, >5000 features), the following blocks::

z = a.intersect(b, stream=True).intersect(c, stream=True)
len(z)

The schematic below shows what's happening with stdin/stdout and pipes. The above command hangs when trying to write to the stdin of the second process, marked below as ^^^^^^.


    FILE -> stdin-|------------------|-stdout  -> PIPE ->  stdin-|------------------|-stdout -> PIPE -> IntervalIterator
                  | intersectBed (1) |                           | intersectBed (2) |
                  |------------------|-stderr         ^^^^^^     |------------------|-stderr

Despite a forced flush of stdout of command (1) and stdin of command (2) in helpers.call_bedtools,as well as forcing flush of stdout in command (2) in the IntervalIterator, this still blocks.

In the Popen command, setting bufsize=1 or bufsize=0 doesn't help. Docs for Popen.communicate() say that it'll block for large input.

Various stackoverflow answers for similar problems describe the solution to this as using separate threads for each call, however, initial tests make interactive work in IPython a little crazy.

My guess is that workarounds like "rendering" a streaming BedTool to disk will be needed for the near-to-mid-future, since fixes to this will be difficult.

When iterating through bed_file: "Error: The requested bed file (<file>) could not be opened. Exiting!"

Hello,

I have a BedTool named bed. When I run this snippet of code,

    for i in range(len(bed)):
        #print bed[i].start
        #print bed[i].stop
        #print bed[i].name    #= GSM+"_{0}".format(i+1)                                                                                                  
        print i

if I have one print uncommented, "i" increases to 246 before giving me the error "Error: The requested bed file () could not be opened. Exiting!"

If I run the code with two print lines uncommented, "i" goes to 122 before quitting and I receive that same message.

If i run the code with three print lines uncommented, "i" goes to 81 before quitting and I receive the same message.

The bed file from which I am creating the BedTool has more than 246 lines. I am not sure how to resolve this issue. Looking through the other issues it seems doing something regarding "Intervals" might help with the memory issues, but I am not sure exactly how...

My eventual goal is to be able to run through the BedTool and rename the "name", as seen by the commented code..

Please let me know if my question is confusing or unclear or incorrectly formatted.

Best,

Alan

window_maker seems to not obey the -s (step size) option.

At least with the version I have installed (0.6.2dev), it looks like the window_maker() wrapper does not wrap the -s option. Are you able to reproduce this?

>>> import pybedtools as pbt
>>> pbt.__version__
'0.6.2dev'
 >>> pbt.BedTool().window_maker(genome='hg19', w=1000000, s=500000).head()
 chr1   0   1000000
 chr1   1000000 2000000
 chr1   2000000 3000000
 chr1   3000000 4000000
 chr1   4000000 5000000
 chr1   5000000 6000000
 chr1   6000000 7000000
 chr1   7000000 8000000
 chr1   8000000 9000000
 chr1   9000000 10000000

>>> pbt.BedTool().window_maker(genome='hg19', w=1000000, s=0).head()
chr1    0   1000000
 chr1   1000000 2000000
 chr1   2000000 3000000
 chr1   3000000 4000000
 chr1   4000000 5000000
 chr1   5000000 6000000
 chr1   6000000 7000000
 chr1   7000000 8000000
 chr1   8000000 9000000
 chr1   9000000 10000000

Example failed

I ran these commands successfully in my Ubuntu shell:

$ sudo apt-get install bedtools
$ sudo easy_isntall pybedtools

And then these commands in ipython:

In [1]: import pybedtools
In [2]: a = pybedtools.example_bedtool('x.bam')
In [3]: a.genome_coverage()

NotImplementedError                       Traceback (most recent call last)
/home/sinclair/OD1/raw/<ipython-input-4-bfb19364e87a> in <module>()
----> 1 a.genome_coverage()

/usr/local/lib/python2.7/dist-packages/pybedtools-0.6.2-py2.7-linux-x86_64.egg/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
    621             # this calls the actual method in the first place; *result* is

    622             # whatever you get back

--> 623             result = method(self, *args, **kwargs)
    624 
    625             # add appropriate tags


/usr/local/lib/python2.7/dist-packages/pybedtools-0.6.2-py2.7-linux-x86_64.egg/pybedtools/bedtool.pyc in not_implemented_func(*args, **kwargs)
    139         if not_implemented:
    140             def not_implemented_func(*args, **kwargs):
--> 141                 raise NotImplementedError(help_str)
    142             return not_implemented_func
    143 

NotImplementedError: "genomeCoverageBed" does not appear to be installed or on the path, so this method is disabled.  Please install a more recent version of BEDTools and re-import to use this method.

The program is in my $PATH:

$ which genomeCoverageBed
/usr/bin/genomeCoverageBed

Need a generic handler for parsing kwargs/streams/iterables

The new stream=True kwarg and the ability to create BedTool objects out of Interval iterators means that the code in the BEDTool wrappers (BedTool.intersect, BedTool.subtract, etc) gets increasingly complex. Currently, stream is only implemented for BedTool.intersect. This needs to be factored out so it can easily be used by other wrapper methods.

I think this should be a method of BedTool, since it will have to figure out how it was constructed.

Thinking aloud . . . this method should:

determine whether the BEDTool program should accept an a kwarg
determine if a needs to be pulled from the current BedTool
- if a is an IntervalIterator, make it yield text lines (rather than Intervals) and send that as stdin to call_bedtool and set the first arg to '-'
- if a is an open file, send that directly as stdin to call_bedtool and set first arg to '-'
- if a is a string, then stdin is None, and call_bedtool gets -a <filename> as first arg
determine whether b needs to be collapsed from an iterable into a tempfile (since you can't stream both -a and -b to a BEDTools program)
- if collapsed, send the newly collapsed filename to call_bedtool

GFF mode versus BED mode in pybedtools

I noticed that pybedtools tries to respect the input format when using saveas, even though internally it always uses BED-like coordinates (which is a good design in my opinion), so that if you load a GFF file and try to save it, it outputs a GFF:

b = BedTool("foo.gff")
# foo.bed will be a GFF, not a BED
b.saveas("foo.bed")

Is there a way to force pybedtools to convert the GFF to a BED? saveas will still take trackline arguments even if b is a GFF-like BedTool:

b.saveas("foo.bed", trackline="name'test'")

will output:

$ head foo.bed
name='test'
chr10   SE  exon    80065614    80065694    .   +   .   ID=chr10:80065614:80065694:+@chr10:80065984:80066058:+@chr10:80071548:80071595:+.A.up;Parent=chr10:80065614:80065694:+@chr10:80065984:80066058:+@chr10:80071548:80071595:+.A

which would make sense for a BED but not for a GFF, so the result is an invalid GFF. It could be useful to have an argument like: as_bed=True that forces a BedTool to be serialized as BED if it's a GFF (vice versa is more complicated since GFF has too many fields). Is there already a way to do this that I missed? Thanks.

create (and run) doctests in the docstrings

there are some docstrings in the code already but they arent quite in doctest format.
having them run as part of the test-suite--and the stuff in feature.py would be nice.

'None' as key in kwargs when trying to merge two tmp BedTools

I try to merge two BedTools objects but internally one key in kwargs is 'None' which is not accepted.

import pybedtools

annotated = pybedtools.BedTool('file1.bed')
discovered = pybedtools.BedTool('file2.bed')

hg19genome = pybedtools.chromsizes('hg19')

ann_slop = annotated.slop(g=hg19genome, b=10)
disc_slop = discovered.slop(g=hg19genome, b=10)

mergd = ann_slop.merge(disc_slop, s=True, nms=True)

Gives the error:

/usr/local/lib/python2.7/dist-packages/pybedtools-0.5.5-py2.7-linux-x86_64.egg/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
171             # At runtime, this will parse the kwargs, convert streams to

172             # tempfiles if needed, and return all the goodies

--> 173             cmds, tmp, stdin = self.handle_kwargs(prog=prog, **kwargs)
174 
175             # Do the actual call

TypeError: handle_kwargs() keywords must be strings

And kwargs is

{None: <BedTool(/tmp/pybedtools.Iz9EUg.tmp)>, 's': True, 'nms': True}

The slopped files are fine themselves:

ipdb> disc_slop.head(2)
chr21   40563340        40563360        AATAAA  8       -
chr21   40564162        40564182        AGTAAA  12      -
ipdb> ann_slop.head(2)
chr1    328449  328470  0       0       +
chr1    529943  529964  0       0       +

with_column name change?

In the method's current form, the name "with_column" isn't very intuitive. How about something like "transform"? Or "apply" like R? Other ideas?

implement scoreatpercentile and percentileofscore

the goal is to drop the scipy dependency for randomstats . . .

check for off-by-1 errors

you guys have probably already handled this but let's add some tests that mix gff and bed input/output and make sure everything is as we expect for length, overlap amount, etc.

implement features.feature class to support methods like length() for files other than .bed

This is to start discussion on how to implement e.g.: features.gfffeature, features.vcffeature

could start with a generic features.feature class that takes a line and a tuple for chr, start, stop.
so bed would be

features.feature(line_arr, position_locs=(0, 1, 2))

and gff:

features.feature(line_arr, position_locs=(0, 3, 4))

or something like that. it could be an attribute on the class. so:

class bedfeature(feature):
    position_locs = (0, 1, 2)

class gfffeature(feature):
    position_locs = (0, 3, 4)

then use getattr for the other kwargs to set things like rgbItem and what not more generically.
Ryan, if you agree, and/or give some feedback, i'll be glad to take a first stab.

Interval.setitem?

BedTool.with_column() currently requires "outcols" contents to be integers because there's no __setitem__ method.

e.g.,

a.with_column(incols=['start','stop'], func=length, outcols=[4])

instead of

a.with_column(incols=['start','stop'], func=length, outcols=['score'])

My naive attempt to at least get an integer Interval.__setitem__ working gives a cython compile error:

def __setitem__(self, object key, object value):
    cdef int i
    if isinstance(key, (int,long)):
        nfields = self._bed.fields.size()
        if key >= nfields:
            raise IndexError('field index out of range')
        elif key < 0: key = nfields + key
        self._bed.fields[key] = value
                                    ^
        ------------------------------------------------------------

        pybedtools/cbedtools.pyx:146:41: Cannot convert Python object to 'string'

How to do this?

peak_pie.py error and doc improvement

Launching peak_pie.py without argument generate a type error instead of reporting the usage (TypeError: 'NoneType' object is not iterable)
The doc should be improved a little bit. It is not clear where I can find a GFF with all the annotations categories (especially the introns) as the one you have in the example. If you use the GTF file for example from Genome Browser or from Ensemble and convert to GFF you only have CDS and exon. Could you explain in the doc how to get an annotated GFF file?

BTW thanks for this nice library and for keeping improving it.

Luca

randomstats with many iterations opens too many files

I have two fairly small bed files (~1600 features in each, overlapping by ~500 features). I'd like to calculate the empirical overlap p-value but to get a decently small p-value, I need a lot of shuffles.

pybedtools is apparently opening a new file for each iteration (not sure how since clearly the files are unlinked and the python objects are deleted as soon as the intersect results are done. relevant code

Any way around this? I can increase the user limit for open files on this machine, but I doubt it will suffice for tens of millions of overlaps... Looks like it's crapping out at only 4541 files on this machine.

In [2]: import pybedtools
In [2]: chromsizes = {'chr1':(1,1000)}
In [3]: a = pybedtools.example_bedtool('a.bed').set_chromsizes(chromsizes)
In [4]: b = pybedtools.example_bedtool('b.bed')
In [5]: results = a.randomstats(b, 1000000, debug=True)

<type 'exceptions.OSError'>: Too many open files
The command was:

        shuffleBed -i /usr/local/lib/python2.7/dist-packages/pybedtools/test/data/a.bed -seed 4540 -g /tmp/pybedtools.DsnJBh.tmp

Things to check:

        * Too many files open -- are you creating lots of BedTool objects without calling del() on them?

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)

/home/wbiesing/<ipython console> in <module>()

/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in randomstats(self, other, iterations, **kwargs)
   1543         distribution = self.randomintersection(other, iterations=iterations,
   1544                                                **kwargs)
-> 1545         distribution = np.array(list(distribution))
   1546
   1547         # Median of distribution


/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in randomintersection(self, other, iterations, intersect_kwargs, shuffle_kwargs, debug)
   1624             if debug:
   1625                 shuffle_kwargs['seed'] = i
-> 1626             tmp = self.shuffle(**shuffle_kwargs)
   1627             tmp2 = tmp.intersect(other, **intersect_kwargs)
   1628

/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in decorated(self, *args, **kwargs)
    370             # this calls the actual method in the first place; *result* is

    371             # whatever you get back

--> 372             result = method(self, *args, **kwargs)
    373
    374             # add appropriate tags


/usr/local/lib/python2.7/dist-packages/pybedtools/bedtool.pyc in wrapped(self, *args, **kwargs)
    174             # Do the actual call

    175             stream = call_bedtools(cmds, tmp, stdin=stdin,
--> 176                                    check_stderr=check_stderr)
    177             result = BedTool(stream)
    178

/usr/local/lib/python2.7/dist-packages/pybedtools/helpers.pyc in call_bedtools(cmds, tmpfn, stdin, check_stderr)
    299         print 'Things to check:'
    300         print '\n\t' + '\n\t'.join(problems[err.errno])
--> 301         raise OSError('See above for commands that gave the error')
    302
    303     return output

OSError: See above for commands that gave the error

strategy for maintaining alongside BEDTools?

Currently in pybedtools we have a more-or-less static version of bedFile.h, a snapshot from some previous version of BEDTools. Recent changes (e.g., 7e989ea) are starting to diverge from BEDTools proper.

But there are some improvements in the latest BEDTools (bedFile.h) that would be nice to include in pybedtools. I'd also like to be flexible during future development so that when new features show up in BEDTools we can incorporate them into pybedtools. Furthermore, if we find bugs we could fix them in pybedtools and have nice clean patches to send to BEDTools.

There's probably some git hackery that will mirror bedtools as a sort of sub-repo inside pybedtools. Assuming most changes to bedtools are new features (rather than API changes) then cbedtools.pyx/pxi should still work with each BEDTools commit but just won't have the additional features until we manually add them. That way it's not too much of a maintenance overhead but still remains up to date.

Thoughts?

test bug

it works, just checking

BedTool.len leaks file descriptors

The following code illustrates the leak:

import pybedtools
fn = pybedtools.example_filename('a.bed')
orig_fds = pybedtools.helpers.n_open_fds()

# no len()
n1 = []
for i in range(100):
    x = pybedtools.BedTool(fn)
    n1.append(pybedtools.helpers.n_open_fds())
print max(n1) - orig_fds
#2

# check len() each time
n2 = []
for i in range(100):
    x = pybedtools.BedTool(fn)
    len(x)
    n2.append(pybedtools.helpers.n_open_fds())
print max(n2) - orig_fds
#100

pybedtools sequence handles bedtools fastaFromBed warnings as error

I often use BED files that have _random chromosome entries in them that are not necessarily part of my genome build, and I generally want these to be ignored. If I use fastaFromBed to make FASTAs out of these files, fastaFromBed just outputs a warning that these regions cannot be fetched but it correctly handles other regions in the file. I found that the sequence wrapper to fastaFromBed throws an error for these warnings:

mybed = pybedtools.BedTool("chrX_random 1 100", from_string=True)
mybed.sequence(fi="mm9.fa", fo="seq.fa", s=True)

results in:

 File "/usr/local/lib/python2.7/dist-packages/pybedtools-0.6.2-py2.7-linux-x86_64.egg/pybedtools/helpers.py", line 412, in call_bedtools
    'stderr, above', stderr)
pybedtools.helpers.BEDToolsError: ('Error message from BEDTools written to stderr, above', 'WARNING. chromosome (chrX_random) was not found in the FASTA file. Skipping.\n')

Since this is not really an error, it might be better to handle it differently or raise something like BEDToolsError that one could catch in a try/except clause. The current solutions I can think of are: (1) catching the BEDToolsError and parsing it out to see if it's a warning, or (2) prefiltering on the fly the entries that have these random chromosomes. Solution (1) is preferable to me, and seems more generic, as in:

try:
    mybed.sequence(fi="mm9.fa", fo="seq.fa", s=True)
except pybedtools.helpers.BEDToolsError as bed_error:
    if bed_error[1].startswith("WARNING"):
        print "Just a warning..."

But would break if Aaron ever changes the output format of warnings. If there are better ways to deal with this I'd love to hear. Thanks.

BedTool.sequence hangs on a BedTool made with generator?

I am following Ryan's template for creating a BedTool on the fly to avoid temporary files (as outlined here #55). I'm parsing through a BedTool corresponding to a BED file a. As I iterate through the file, I'm creating a new BedTool b using create_interval_from_list. The new BedTool b represents some shuffles of each BED interval in a. Once the new BedTool b is created, I call b.sequence() to generate a FASTA from it.

I found that the sequence() call hangs after outputting a large portion of the FASTA, probably because of some issue with the way I'm making the iterator. Strangely, if I call len(b) before b.sequence(...), then the problem goes away. I've looked at bedtool.py and noticed that __len__() calls self.count(), which is apparently what resolves the iterator. Any idea what might be causing this halting condition? Do I need to raise a StopIteration when feeding a generator to BedTool?

Here is my code for making the generator, it's extremely simple:

def get_shuffled_clusters_intervals(clusters,
                                    gene_ids_to_coords,
                                    num_shuffles):
    # 'clusters' is a BedTool
    for cluster_bed in clusters:
        ...
        # Make some intervals related to the current one
        intervals = \
            bedtools_utils.sample_intervals(...)
        if intervals is None:
            continue
        # Record these as BED intervals
        for shuffle_num, curr_interval in enumerate(intervals):
            ...
            curr_interval_bed = \
                pybedtools.create_interval_from_list([cluster_bed.chrom,
                                                      str(curr_interval[0]),
                                                      str(curr_interval[1]),
                                                      interval_name,
                                                      ".",
                                                      str(cluster_bed.strand)])
            yield curr_interval_bed
    return

Now I create a BedTool using this function and call sequence():

    sampled_clusters_intervals = \
        get_shuffled_clusters_intervals(clusters,
                                        gene_ids_to_coords,
                                        num_shuffles)
    # Make BEDTool out of cluster intervals
    sampled_clusters_bed = pybedtools.BedTool(sampled_clusters_intervals)
    # KEY LINE: without len(sampled_clusters_bed), this hangs
    num_sampled_clusters = len(sampled_clusters_bed)
    sampled_clusters_bed.sequence(fi=genome_seq_fname,
                                                 fo=sampled_clusters_fname,
                                                 name=True,
                                                 s=True)

Any thoughts on what might cause sequence to hang? The BedTool seems to be made fine from the generator, it's the sequence line that actually makes use of the generator that hangs. Thanks!

Update: further confirmation that it's a problem somewhere in the sequence() call or downstream of it - if I take the BedTool that I made using the generator (sampled_clusters_bed) and write it to a .bed first:

with open("test.bed", "w") as file_out:
  for bed in sampled_clusters_bed: file_out.write(str(bed))

And then load test.bed as a BedTool instance, and call that object's .sequence() method, then it all works. So it must be something about how sequence() parses the generator it seems.

Use OrderedDicts for genome_registries

Currently, the genome registries use plain old dictionaries. Consequently, when iterating through chromosomes, the order of the chroms is unpredictable (well, it's predictable, but undesirable). If you were to switch to an OrderedDict, you could could support a specific (and desirable) chrom iteration order, e.g., [chr1, chr2, chr3, etc.). In the case of scurgen, this would allow the HilbertCurve to have a predictable chromosome ordering when doing whole genome plots. It is doable by writing custom key sort functions, but that is nasty at best. That said, OrderedDict is only in 2.7+

What do you think?

randomstats not cleaning up

I dont know why this is happening, because a look at the code shows that it is calling close_or_delete but randomstats is leaving a ton of pybedtools.tmp* files in my tmp dir. and calling cleanup() does not remove them.
Perhaps what's getting sent to close_or_delete is a filehandle?
I've tried calling randomstats with an object and with object.fn and it never cleans up the files.

My call looks like this:

        res = bed.randomstats(loh.fn, 100, processes=25)

(a+b+c) is not commutative and overestimates overlap

I have three BED Tools, a b and c, and I expect that (a+b+c).count() should == (b+a+c).count(), since that function is intended to extract features that are common to all three BedTools. (as discussed here: http://packages.python.org/pybedtools/3-brief-examples.html#example-2-intersections-for-a-3-way-venn-diagram) However, in my tests, I find conflicting results, depending on the order of the operands. Also, the total number present in (a+b+c).count() [or any other order] seems to be an over-estimate of the actual number of overlaps, given that I find elements in (a+b+c) that are specific to only b. I can provide BED files for testing if needed.

as a consequence, "venn_mpl.py -c a.BED -a b.BED -b c.BED " gives a different result from "venn_mpl.py -a a.BED -b b.BED -c c.BED "

Create Interval from line

This test (commented out for now) doesn't run with GFF files:
https://github.com/daler/pybedtools/blob/master/pybedtools/test/test1.py#L123

It's because within IntervalIterator, I'm using create_interval_from_list, which appears to be BED-specific:
https://github.com/daler/pybedtools/blob/master/pybedtools/helpers.py#L328

Is there a good way to create an Interval out of a BED/GFF/VCF line?

Segfault when iterating over stream

Not sure what's going on, but here's how to reproduce it:

import pybedtools
f = pybedtools.example_bedtool('d.gff')
for i in f.intersect(f, stream=True):
    print i

IntervalIterator doesn't reset when instantiating from list of Intervals

When a BedTool is instantiated from a list of intervals, it seems that you can only iterate over it once, including from any instance methods which happen to use the iterator.

Failing code:

import pybedtools
a = pybedtools.BedTool([pybedtools.Interval('chr1', 0, 1)])
print len(a)
print len(a)

The first print statement returns 1, while the second returns 0. This causes major problems if you want to use an instance method that relies on the iterator, such as field_count() and intersect().

System information

pybedtools: v0.6.2
OS X 10.8

call_bedtools should be able to return the Popen.stdout stream

sometimes, a user will just want to do something like:

for ab in a.intersection(b):
      # do something with ab

and doesn't need to store the file. pybedtools could reduce some file/io by allowing a keyword like

a.intersection(b, stream=True)

which would then return a generator of features, streamed from Popen.stdout and not available after the first iteration unless saved as a list.

robustly close streams in "BedTool.del"

Currently, closing streams in the __del__ method of BedTools still leaves open file handles somewhere. See this comment for details:

#38 (comment)

Specifically, the following code should not return a "too many files open" error after a few thousand intersections:

import pybedtools
import sys
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
for i in range(10000):
    sys.stderr.write('\r%s' % i)
    sys.stderr.flush()
    c = a.intersect(b, stream=True)
    del c

Note, however, that this works fine (without the stream=True), suggesting that converting into an iterable is the issue:

import pybedtools
import sys
a = pybedtools.example_bedtool('a.bed')
b = pybedtools.example_bedtool('b.bed')
for i in range(10000):
    sys.stderr.write('\r%s' % i)
    sys.stderr.flush()
    c = a.intersect(b)
    del c

daler / pybedtools Goto Github PK

pybedtools's Issues

System information

Recommend Projects

Recommend Topics

Recommend Org