more-itertools's Issues

Small lie in the documentation of first

The documentation for more_itertools.first claims:

It is marginally shorter than next(iter(...)) but saves you an entire try/except when you want to provide a fallback value.

This is completely false.

The next builtin has accepted a default second argument ever since it was introduced (see the What's New in Python 2.6 document: https://docs.python.org/3.6/whatsnew/2.6.html#other-language-changes), and that argument serves exactly the purpose first uses it for, so calling first doesn't save any try/except if you know how to use next properly.

I'd remove that final sentence and, instead, add the fact that first is simply a better name for that operation.
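
For reference, the two spellings being compared behave identically once next's default argument is used (behavior per the linked What's New entry):

>>> from more_itertools import first
>>> first([], 'missing')       # fallback without try/except
'missing'
>>> next(iter([]), 'missing')  # next has accepted a default since 2.6
'missing'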

Lists, not tuples, should be yielded by chunked

Possibly controversial:

I don't think the chunked function should yield tuples. I think it should yield lists.

The positions of the items within each returned sub-iterable have no intrinsic meaning, and the sub-iterables are homogeneous sub-sections of the original iterable. Therefore, IMO, the most "correct" thing to return is a list.

http://wescpy.blogspot.co.uk/2012/05/tuples-arent-what-you-think-theyre-for.html

Making this change to the existing chunked function is pretty trivial:


diff --git a/more_itertools/__init__.py b/more_itertools/__init__.py
index 0c769f4..d309375 100644
--- a/more_itertools/__init__.py
+++ b/more_itertools/__init__.py
@@ -28,7 +28,7 @@ def chunked(iterable, n):
     for group in izip_longest(*[iter(iterable)] * n, fillvalue=_marker):
         if group[-1] is _marker:
             # If this is the last group, shuck off the padding:
-            group = tuple(x for x in group if x is not _marker)
+            group = [x for x in group if x is not _marker]
         yield group

But if you agree with this, and also agree with #5, the new version of chunked that supports slicing should also yield lists.

The new `context` itertool is bad

The new context itertool tries to expose a context manager as an iterable. This breaks the context manager guarantee that __exit__ will be called. It's not enough to tell callers that they have to iterate over the whole iterable: even if there are no break or return statements in the loop, there is always the possibility of exceptions. The whole point of context managers is to guarantee that __exit__ is always called when a block terminates. This is why context managers and iterables are orthogonal concepts; in general, one cannot be made to look like the other.

Please remove context because it encourages people to write bad code.

There is no benefit to context in any case. Even the motivating example in the documentation is just:

consume(print(x, file=f) for f in context(file_obj) for x in it)

which can be written just as succinctly as:

with file_obj as f:
    consume(print(x, file=f) for x in it)
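
To make the failure mode concrete, a sketch (process() stands in for any code that might raise):

# with context(), an exception in the loop body ends iteration early,
# __exit__ never runs, and the file handle leaks
for f in context(open('out.txt', 'w')):
    for x in it:
        print(process(x), file=f)

# the with statement guarantees cleanup even when process(x) raises
with open('out.txt', 'w') as f:
    for x in it:
        print(process(x), file=f)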

nth item

I came across this recipe by R. Hettinger:

>>> x = [False,True,True,False,True,False,True,False,False,False,True,False,True]
>>> nth_item(50, True, x)
-1
>>> nth_item(0, True, x)
1
>>> nth_item(1, True, x)
2
>>> nth_item(2, True, x)
4
>>> nth_item(3, True, x)
6

Code

>>> from itertools import compress, count, imap, islice
>>> from functools import partial
>>> from operator import eq

>>> def nth_item(n, item, iterable):
        indices = compress(count(), imap(partial(eq, item), iterable))
        return next(islice(indices, n, None), -1)

I thought this may be a useful addition.

FEATURE: Stepped Sliding Window option

Abstract

This feature request proposes extending functionality in more_itertools.windowed by producing windows separated by a given step.

Sliding windows conventionally advance one item at a time (i.e. step=1). Could an option be added to support larger step sizes?

Example

>>> from more_itertools import windowed

>>> iterable = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        
# Present: Continuous Sliding
>>> all_windows = windowed(iterable, 3)
>>> list(all_windows)
[(1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9), (8, 9, 10)]

# Proposed: Steps
>>> stepped_windows = windowed(iterable, 3, step=2)
>>> list(stepped_windows)
[(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9)]

By default, the iterator returns windows spanning the length of the given iterable, so sliding may stop short when the remaining items cannot fill a whole window (e.g. leaving out item 10 above). However, if all items in the iterable are desired, there could be an option for including the tail and wrapping back to the head, as requested in this SO post:

# Proposed: Steps and Wrapping 
>>> stepped_windows = windowed(iterable, 3, step=2, wrap=True)
>>> list(stepped_windows)
[(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9), (9, 1, 2)]

Further Considerations

I have a working prototype of this with passing unit tests and comparable performance that I can post here if this seems like a plausible feature and you are interested in further discussion.
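
A simplified, sequence-only sketch of the stepping behavior (the real windowed would need to handle arbitrary iterables and its fillvalue semantics):

def stepped_windows(seq, n, step=1):
    """Yield n-length windows whose start indices advance by `step`.

    >>> list(stepped_windows([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3, step=2))
    [(1, 2, 3), (3, 4, 5), (5, 6, 7), (7, 8, 9)]
    """
    for i in range(0, len(seq) - n + 1, step):
        yield tuple(seq[i:i + n])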

Include implementations from jaraco.util.itertools

I just now discovered more-itertools. Thanks for that!

I'd like to leverage this project to supersede the implementations I have in jaraco.util.itertools (docs). I see in the README the invite to send pull requests, but since this effort would be a somewhat large undertaking, I'd like to preface with some questions and concerns.

  1. Would more-itertools likely accept most or all of jaraco.util.itertools?
  2. Do you see any obvious conflicts or undesirable parts of that module (so I can exclude them from the pull request(s))?
  3. Would you prefer individual pull requests for each tool, or would a large pull request be adequate?
  4. jaraco.util.itertools relies heavily on doctests (for documentation and testing). Can more-itertools leverage those tests and documentation?

Tool suggestion: identify elements adjacent to those matching a predicate

I just came up with another function that I don't think already exists and might make a useful addition to more-itertools: identifying elements in an iterable that match a predicate or are adjacent to those matching a predicate. As a stupid (but simple) example, suppose I want to detect which letters are vowels or adjacent to vowels in a word. The design I have in mind returns a tuple for each element in the iterable with a boolean indicating whether it is or is adjacent to a "selected" element, as well as the element itself.

>>> list(adjacent(lambda c: c in 'aeiou', 'thursday'))
[(False, 't'), (True, 'h'), (True, 'u'), (True, 'r'), (False, 's'), (True, 'd'), (True, 'a'), (True, 'y')]

In my application it's important to know where the (equivalents of the) vowel-centered groups begin and end, in addition to knowing which elements are in those groups, so I pass the result of adjacent() through groupby(). (This is akin to choosing context lines in a differ.) If I wanted an iterable of just ['h', 'u', 'r', 'd', 'a', 'y'] in that example, I could instead use filter() and map(), or a generator expression, i.e. (e[1] for e in adjacent(...) if e[0]). I think this flexibility is important.

The question I want to bring up before sending in a pull request is how to generalize this. My base implementation is the following:

def adjacent(predicate, iterable):
    i1, i2 = tee(iterable)
    selected = chain([False], map(predicate, i1), [False])
    adjacent = map(any, windowed(selected, 3))
    return zip(adjacent, i2)

(The design avoids calling predicate() more than once per item.) It's easy enough to change the number of elements of "context" by increasing the second argument to windowed(), and that would be a straightforward generalization. Is it also worthwhile to support arbitrary "masks" by using stagger() instead of windowed()? E.g. passing offsets=(-1, 1) to "mark" only elements which are before or after those which satisfy the predicate, leaving out the ones which satisfy the predicate themselves? Or offsets=(0, 1, 2) to "mark" items which satisfy the predicate and the two that follow them?
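
One straightforward generalization along the windowed() route adds a distance parameter; a sketch (distance=1 reproduces the base implementation above):

from itertools import chain, tee

from more_itertools import windowed

def adjacent(predicate, iterable, distance=1):
    # mark items that satisfy the predicate or sit within
    # `distance` positions of one that does
    i1, i2 = tee(iterable)
    padding = [False] * distance
    selected = chain(padding, map(predicate, i1), padding)
    adjacent_to = map(any, windowed(selected, 2 * distance + 1))
    return zip(adjacent_to, i2)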

Installation and examples in README

I suspect the README is the first (and often the only) stop from visitors to most packages. Through the visitor's lens, I observe the following:

First, it's actually not clear how to install the package (from either the README or the docs). I propose adding a line on installation to the README.

...

Installation

> pip install more_itertools

...

Second, right now the first tool in the API docs is the new more_itertools.adjacent, which has a heavier docstring than traditional ones. IMO, seeing this much text for the first tool is less inviting to newcomers trying to quickly figure out what this package does. I suggest adding a couple of elegant examples to the README, maybe one recipe and one original, e.g. flatten and chunked.

...

Examples

>>> import more_itertools as mit

# Itertools Recipe
>>> list(mit.flatten([[0, 1], [2, 3]]))
[0, 1, 2, 3]

# More-Itertools Original
>>> list(mit.chunked([1, 2, 3, 4, 5, 6, 7], 3))
[[1, 2, 3], [4, 5, 6], [7]]

...

The idea is to succinctly demonstrate up front that this package is a simple extension of itertools, and it is easy to use.

Iterable parameter in `windowed`

I noticed the signature for windowed included seq, i.e. windowed(seq, ...). I consider sequences to be "sliceable iterables", so I think the correct name should be iterable instead, i.e. windowed(iterable, ...).

From the Python docs:

sequence

An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a __len__() method that returns the length of the sequence. Some built-in sequence types are list, str, tuple, and unicode. Note that dict also supports __getitem__() and __len__(), but is considered a mapping rather than a sequence because the lookups use arbitrary immutable keys rather than integers.

iterable

An object capable of returning its members one at a time. Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict, file objects, and objects of any classes you define with an __iter__() method or with a __getitem__() method that implements sequence semantics.

The reason I bring this up is that sliced requires a sequence (as do certain Python builtins that exclude general iterables). Since windowed can also be applied to non-sequences (e.g. dicts), iterable seems the appropriate parameter name, consistent with other patterns in the source.

As far as I can tell, changing the name has no negative effect on the code, as the iterable is converted to an iterator and used nowhere else.

Combine peekable with spy

I like how peekable can look ahead and modify without affecting its iterator, and I like how spy can look ahead by more than one item.

It would be really nice if I could specify how far ahead to look with peek.

>>> a = peekable((1,2,3,4,5))
>>> a.peek(2)
2
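
Until something like this lands, a cache-based sketch shows one way a multi-item peek could work (peekable_n is a made-up name; the example above treats the argument as 1-indexed, so this does too):

from collections import deque

class peekable_n:
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = deque()  # items read ahead but not yet consumed

    def __iter__(self):
        return self

    def __next__(self):
        return self._cache.popleft() if self._cache else next(self._it)

    def peek(self, n=1):
        # read ahead until the cache holds n items, then look at the
        # nth without consuming anything
        while len(self._cache) < n:
            self._cache.append(next(self._it))
        return self._cache[n - 1]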

Add function to chunk an iterable based on a predicate

Add a generator that produces a generator for each chunk, where each chunk is defined by an "is first item" or "is last item" predicate.

Basic code for "is first item" idea:

def chunk_by_first(seq, pred):
    """chunk_by_first(iterable, callable) -> list, ...
    Breaks up an iterable based on a predicate: If true, break before that item.
    """
    buf = []
    for i in seq:
        if pred(i) and len(buf):
            yield buf
            buf = []
        buf.append(i)
    yield buf

(This implementation is of course undesirable if chunks are of considerable size, but it should concisely express the idea.)
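
For symmetry, the "is last item" variant might look like this (same caveat about large chunks):

def chunk_by_last(seq, pred):
    """chunk_by_last(iterable, callable) -> list, ...
    Breaks up an iterable based on a predicate: if true, break after that item.
    """
    buf = []
    for i in seq:
        buf.append(i)
        if pred(i):
            yield buf
            buf = []
    if buf:
        yield buf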

Request: Sort iterables by

I'm not sure if this is a fitting addition to more-itertools, but it's a method I use quite often. This function sorts iterables using a defined order of priority, so you can sort iterables according to a given sort pattern. I suppose it's tough to explain, so here are three examples.

# Will sort all iterables based on the ascending sort order of the first iterable
>>> sort_iterables_by([['a', 'd', 'c', 'd'], [1, 3, 2, 4]], key_list=(0,))
[('a', 'c', 'd', 'd'), (1, 2, 3, 4)]

# Will sort all iterables based on the ascending sort order of the first iterable,
# then the second iterable
>>> sort_iterables_by([['d', 'd', 'd', 'c'], [4, 3, 7, 10], [1, 2, 3, 4]],
...                   key_list=(0, 1))
[('c', 'd', 'd', 'd'), (10, 3, 4, 7), (4, 2, 1, 3)]

# Will sort all iterables based on the descending sort order of the first iterable,
# then the second iterable
>>> sort_iterables_by([['a', 'b', 'b'], [1, 3, 2]],
...                   key_list=(0, 1),
...                   reverse=True)
[('b', 'b', 'a'), (3, 2, 1)]

Here is the function I propose

import operator

def sort_iterables_by(iterables, key_list=(0,), reverse=False):

    return list(zip(*sorted(zip(*iterables),
                            key=operator.itemgetter(*key_list),
                            reverse=reverse)))

What do you guys think? A useful addition? One remark: because zip is used, iterables are trimmed to the length of the shortest iterable before sorting. An alternate form of the function could use zip_longest, although for lists of heterogeneous objects no fillvalue will be an obvious choice.

Example:

import operator
import itertools

def sort_iterables_by(iterables, key_list=(0,), reverse=False, fillvalue=None):

    return list(zip(*sorted(itertools.zip_longest(*iterables, fillvalue=fillvalue),
                            key=operator.itemgetter(*key_list),
                            reverse=reverse)))

Support optional slicing in chunked

Hi,

First - really like this package. I've written most of these things several times for various projects so it's nice to have them all in one place.

I have a couple of suggestions which I'll add as tickets. I'm more than happy to do the work to implement them if you prefer (in fact I've already made a start), but I wanted to settle a few design decisions first.

So this ticket relates to the chunked function. Sometimes the existing behaviour is exactly what you want - yielding fixed-sized chunks of a possibly infinite-length iterable. However, sometimes you actually want to slice the iterable to get back the chunks. For example, if the API you're using makes a call to a database with an offset and a limit each time you slice the iterable. In this case you don't want to load all the rows into memory to start yielding chunks. You're essentially paginating the iterable, and yielding a page at a time.

The basic implementation of this pattern is here: http://stackoverflow.com/questions/3744451/is-this-how-you-paginate-or-is-there-a-better-algorithm

def getrows_byslice(seq, rowlen):
    for start in xrange(0, len(seq), rowlen):
        yield seq[start:start+rowlen]

This could be added to the library in a couple of ways. It could be an additional function alongside chunked (I'm thinking chunked_slices or maybe just paginate). Alternatively, it could be implemented as an alternative behaviour of chunked, by passing an argument, like this:

def chunked(iterable, n, slice=False):
    if slice:
        iterable = list(iterable)
        for start in xrange(0, len(iterable), n):
            yield tuple(iterable[start:start + n])
    else:
        for group in izip_longest(*[iter(iterable)] * n, fillvalue=_marker):
            if group[-1] is _marker:
                group = tuple(x for x in group if x is not _marker)
            yield group

What do you think?

Functional Programming

I've been looking for a good Python functional programming library for a while now. PyFunctional is an OK start, but it lacks a lot of features I would expect from a functional library. Most importantly, it does not create reusable pipes, meaning composed functions cannot be used more than once, which isn't good if a pipe is going to be used a lot.
Do you know of any libraries that do allow reusable pipes?
I also think the functionality of more-itertools would be an amazing addition to the PyFunctional package, or something similar.
Let me know what you think!

tox doesn't like the comment in the tox.ini env list

Using tox 1.4.2 I get the following error when I run tox:

$ tox
ERROR: unknown environment 'py32  # Python 3.1 and 3.0 might work as well.'

It looks like tox doesn't handle comments in the env list. Removing the comment fixes the issue and tox runs successfully.
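
For illustration, the fix is just moving the comment onto its own line (environment names here are examples):

[tox]
; broken: tox parses the trailing comment as part of the last env name
;   envlist = py26, py27, py32  # Python 3.1 and 3.0 might work as well.
; fixed: keep the comment on its own line
; (Python 3.1 and 3.0 might work as well.)
envlist = py26, py27, py32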

add empty or has method to peekable class

It is often useful to check whether a peekable has anything left as part of more complex checks.
For example:

while p.peek(None) is not None and p.peek().type == 1:

would be shorter and more readable as:

while not p.empty() and p.peek().type == 1:

or even:

while p.has() and p.peek().type == 1:
# === variant
while p.more() and p.peek().type == 1:
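
In the meantime, a sentinel-based helper gets the same readability out of the existing API (_marker and has_more are illustrative names):

_marker = object()

def has_more(p):
    # True while the peekable still has items, using peek's
    # default-argument behavior instead of exception handling
    return p.peek(_marker) is not _marker

while has_more(p) and p.peek().type == 1:
    ...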

always_iterable treats dict as iterable

I'm finally getting around to reconciling always_iterable in jaraco.itertools, as contributed to this project in #37 and #108.

In September last year, I discovered that one would be unlikely to want to iterate over a dictionary (or other Mapping) when using always_iterable, so in jaraco.itertools 2.0, Mappings were treated as non-iterable. The reasoning, as found in the docs, is that a dictionary is likely to be intended as a single object rather than a sequence of keys, but also that one can readily pass iter(dict) or dict.keys() if one does want the value to be treated as iterable.

In order for more_itertools.more.always_iterable to supplant jaraco.itertools.always_iterable, I'd like for more_itertools to adopt this behavior as well.

What do you think?
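
A sketch of the proposed semantics (not the library's actual code; the base_type parameter is illustrative):

from collections.abc import Iterable, Mapping

def always_iterable(obj, base_type=(str, bytes, Mapping)):
    # treat mappings (and strings) as single objects rather than
    # iterating over their keys (or characters)
    if isinstance(obj, base_type) or not isinstance(obj, Iterable):
        return iter((obj,))
    return iter(obj)

With this, list(always_iterable({'a': 1})) gives [{'a': 1}], and callers who do want the keys can pass iter(d) or d.keys() explicitly.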

Make `setup.py test` work.

Some atexit handler explodes after running the tests with python setup.py test:

----------------------------------------------------------------------
Ran 7 tests in 0.022s

OK
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/Cellar/python2.6/2.6.7/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/local/Cellar/python2.6/2.6.7/lib/python2.6/multiprocessing/util.py", line 258, in _exit_function
    info('process shutting down')
TypeError: 'NoneType' object is not callable
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/local/Cellar/python2.6/2.6.7/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "/usr/local/Cellar/python2.6/2.6.7/lib/python2.6/multiprocessing/util.py", line 258, in _exit_function
    info('process shutting down')
TypeError: 'NoneType' object is not callable

Peek without peekable

Hello,

sometimes I need to peek at the first element of an iterable. I usually implement it like this:

def peek(iterable):
    iterator = iter(iterable)
    item = next(iterator)
    return item, itertools.chain([item], iterator)

element, my_list = peek(my_list)

Would this be an interesting addition to more-itertools? If so, I'll send a pull request sometime in the future.

Cheers.

No `StopIteration` in `peekable`

I expect the following peekable code to raise a StopIteration error, but it runs without warning:

iterable = "A B C".split()
p = mit.peekable(iterable)

while p:
    line = next(p)
    print(line, end=" ")
# A B C 

By comparison, most iterators and generators I've tried raise an error:

iterable = "A B C".split()
p = iter(iterable)

while p:
    line = next(p)
    print(line, end=" ")

Output

A B C 
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
<ipython-input-67-fe7c490ca45a> in <module>()
      3 
      4 while p:
----> 5     line = next(p)
      6     print(line, end=" ")

StopIteration:

Note: if while p is replaced with while True, the StopIteration is raised as expected. However, the sudden ending of the while loop in the first example seems like a bug, as it is unclear why the loop ended. I understand the peek method has an exception handler, but that method is not directly called in the first example.

Before investigating further, regarding the first example:

  1. Is it intended behavior for peekable not to raise a StopIteration?
  2. If so, what signals the while loop to end? (A guess at the mechanism is sketched below.)
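
My guess at the mechanism (a sketch, not the actual source): peekable implements __bool__ by attempting a peek and swallowing the StopIteration, which is what lets while p: end silently:

def __bool__(self):
    # an exhausted iterator makes the peekable falsy, so `while p:`
    # stops cleanly instead of raising StopIteration
    try:
        self.peek()
    except StopIteration:
        return False
    return True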

Sliceable peekables

I had thought to add an index to peek() for grabbing, say, the 2nd item (and making peekable stack-based), but it'd be a little awkward to have a 2nd param on there. What if, instead, peekables were sliceable?

peekable(some_iter)[1] would be equivalent to peekable(some_iter).peek().

peekable(some_iter)[2:8] would also work (and look ahead, without appearing to advance the iterator).

We'd probably start off supporting only indexing and might never support negative indices.

We could also support peekable(some_iter).get(2, 'default') for having default fallbacks for arbitrary indexes.
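
A rough sketch of the caching approach (indexing and simple forward slices only; zero-based, so p[0] corresponds to peek()):

from collections import deque

class sliceable_peekable:
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = deque()  # look-ahead buffer

    def __iter__(self):
        return self

    def __next__(self):
        return self._cache.popleft() if self._cache else next(self._it)

    def __getitem__(self, index):
        # fill the cache far enough ahead, then read from it without
        # appearing to advance the iterator
        stop = index.stop if isinstance(index, slice) else index + 1
        while len(self._cache) < stop:
            try:
                self._cache.append(next(self._it))
            except StopIteration:
                break
        return list(self._cache)[index]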

Additional recipes

The Python 3 docs have additional recipes that aren't in more-itertools, in particular:

  • tail
  • all_equal
  • partition
  • first_true

I'll add them to recipes. I think we should also have accumulate for Python 2.7 users.
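
For reference, the recipes in question are short (quoted from the Python 3 itertools documentation from memory, so double-check against the current docs):

from collections import deque
from itertools import filterfalse, groupby, tee

def tail(n, iterable):
    "Return an iterator over the last n items."
    return iter(deque(iterable, maxlen=n))

def all_equal(iterable):
    "Returns True if all the elements are equal to each other."
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

def partition(pred, iterable):
    "Use a predicate to partition entries into false and true entries."
    t1, t2 = tee(iterable)
    return filterfalse(pred, t1), filter(pred, t2)

def first_true(iterable, default=False, pred=None):
    "Returns the first true value in the iterable, or default."
    return next(filter(pred, iterable), default)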

`roundrobin` vs. `interleave_longest`

Recently, while looking at the recipes, I noticed roundrobin gives similar results to interleave_longest:

import more_itertools as mit

iterables = ['ABC', 'D', 'EF']

list(mit.roundrobin(*iterables))
# ['A', 'D', 'E', 'B', 'F', 'C']

list(mit.interleave_longest(*iterables))
# ['A', 'D', 'E', 'B', 'F', 'C']

I realize interleave_longest was discussed among other items in #22, but it seems its similarity to an existing recipe may have been overlooked. Is there a rationale for keeping both tools?

Request: pushback

I'd like to suggest adding a wrapper that allows pushing a value back onto an iterator, so that the next call to next(it) will return the pushed value before the next element from the underlying iterable. I find myself wanting this from time to time (usually in parsing applications), and I could have sworn it was implemented somewhere standard, but I looked around and couldn't find it. Would this be a good addition to more-itertools?

I do have code to offer, but I'm posing this as an issue instead of a pull request because I have a dilemma. I've come up with two implementations, one as a generator function

def pushback(iterable, maxlen=None):
    iterable = iter(iterable)
    # add 1 to account for the append(None)
    stack = deque(maxlen=maxlen + 1 if maxlen is not None else None)
    while True:
        if stack:
            e = stack.pop()
        else:
            e = next(iterable)
        sent = yield e
        if sent is not None:
            stack.append(sent)
            stack.append(None) # dummy value to return from send()

and the other as a class

class pushback:
    def __init__(self, iterable, maxlen=None):
        self.iterable = iter(iterable)
        self.stack = deque(maxlen=maxlen)
    def __iter__(self):
        return self
    def __next__(self):
        return self.stack.pop() if self.stack else next(self.iterable)
    def send(self, value):
        self.stack.append(value)

The function implementation is about twice as fast in my preliminary tests (using IPython)

In [13]: %timeit list(pushback_function(range(10)))
100000 loops, best of 3: 5.45 µs per loop
In [14]: %timeit list(pushback_class(range(10)))
100000 loops, best of 3: 10.8 µs per loop

On the other hand the class implementation is conceptually cleaner, and also does not need to be "primed" by calling next(it) before sending in a value with it.send(x).

Now, in most cases, you can prime the generator iterator without losing an item by running it.send(next(it)), and that could be done in a wrapper function to make it transparent to client code. But only the class implementation allows pushing in front of an empty iterable (admittedly a rather pathological use case):

>>> it = pushback([])
>>> it.send(10)
>>> list(it)
[10]

So my point is: if this is something you want for more-itertools, which implementation to use? Or is there a way to "fix" one of them to make it strictly better than the other, that I'm not seeing? (Or does this whole thing already exist and I wasted an evening?)

Intersperse every n items

From this SO post, given

number = 123456789012345678901234567890
expected = "12345 67890 12345 67890 12345 67890"

This looks like an opportunity for intersperse. However, the present implementation "injects" (actually zips) one unique element between every element of the iterable. I propose modifying intersperse to inject an element between every n elements, e.g. a space every 5 characters in the expected string.

Here is a quick modification to the intersperse code adding an n keyword argument:

import itertools

import more_itertools as mit


def intersperse(e, iterable, n=1):
    it = iter(mit.chunked(iterable, n))                    # dependency 
    filler = itertools.repeat(e)         
    zipped = mit.collapse(zip(filler, it))                 # dependency
    next(zipped)
    return zipped

Results

print(list(intersperse('x', 'ABCD')))
print(list(intersperse('x', 'ABCD', 2)))
# ['A', 'x', 'B', 'x', 'C', 'x', 'D']
# ['A', 'B', 'x', 'C', 'D']


print(list(intersperse(None, [1,2,3])))
print(list(intersperse(None, [1,2,3], 2)))
# [1, None, 2, None, 3]
# [1, 2, None, 3]

print("".join(intersperse(" ", str(number), 5)))
# 12345 67890 12345 67890 12345 67890

These are minor changes, i.e. adding chunked and substituting flatten with collapse. The downside is that this modified implementation depends on other tools, and I imagine the hope is to keep new recipes independent. Before proceeding: are there any thoughts on adding the keyword, suggestions for a different implementation, or a preference to keep things as they are?


Fix the PyPI docs for the 4.0.0 release

Oof. I think this is the :func: directive again: our regex doesn't catch the . in :func:`run_length.decode`.

PyPI used to have a way to manually edit an existing release to fix things like this, but that seems to be gone.

Version of chunked that emits iterators and not lists

PR #56 and PR #58 both raise the idea of having a function that splits an iterator into a group of sub-iterators with a fixed length. That is, a version of chunked() that emits iterators instead of lists.

I was hoping to be able to modify chunked() to do this via a parameter or something, but I think performance would suffer. The simple version from #58 isn't viable.

>>> for func in (original, ichunked_new, ichunked_pr58):
...     def stmt():
...         iterable = range(2000)  # Obviously performance will vary with iterable and n
...         n = 101
...         all_chunks = list(func(iterable, n))
...         assert len(all_chunks) == 20
... 
...     result = timeit(stmt, number=10000)
...     print(func.__name__, result)
original 0.7484618649759796
ichunked_new 0.971211633994244
ichunked_pr58 17.300433913012967

So I think a separate function (ichunked, I guess) is called for!


from itertools import chain, islice, zip_longest
from more_itertools import consume, peekable
from timeit import timeit

def original(iterable, n):
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk


def ichunked_pr58(iterable, n, emit_lists=True):
    p = peekable(iterable)
    while p:
        chunk = islice(p, n)
        if emit_lists:
            yield list(chunk)
        else:
            yield chunk
            consume(chunk)

def ichunked_new(iterable, n, emit_lists=True):
    it = iter(iterable)
    while True:
        test_chunk = islice(it, n)
        try:
            item = next(test_chunk)
        except StopIteration:
            return
        chunk = chain([item], test_chunk)
        if emit_lists:
            yield list(chunk)
        else:
            yield chunk
            consume(test_chunk)


for func in (original, ichunked_new, ichunked_pr58):
    def stmt():
        iterable = range(2000)
        n = 101
        all_chunks = list(func(iterable, n))
        assert len(all_chunks) == 20

    result = timeit(stmt, number=10000)
    print(func.__name__, result)

`bucket` deprecated?

I'm getting an error trying to access bucket.

>>> more_itertools.bucket(iterable, key=lambda s: s[0]) 
...
AttributeError: module 'more_itertools' has no attribute 'bucket'

Has bucket been deprecated? If so, the latest docs need to be updated.

Same for more_itertools.collapse.

Memory-happy `chunked`

This ticket is a minor feature suggestion, not a bug/issue.

I've been using a chunked equivalent for years; the difference is that the yielded lists (chunks) are often huge. The scenario is numerical computing.

To allow memory to be released more quickly, I don't keep a reference to the yielded object inside chunked, so that it can be garbage collected as soon as all outside references are gone.

Code:

import itertools

def chunked(iterable, chunksize):
    """
    Return elements from the iterable in `chunksize`-ed lists. The last returned
    list may be smaller (if length of collection is not divisible by `chunksize`).

    >>> list(chunked(xrange(10), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    it = iter(iterable)
    while True:
        wrapped_chunk = [list(itertools.islice(it, chunksize))]
        if not wrapped_chunk[0]:
            break
        # memory opt: wrap the chunk and then pop(), to avoid leaving behind a reference
        yield wrapped_chunk.pop()

numeric_range with one argument

Is there any use-case for numeric_range with only one argument? It seems the type of the returned objects depends solely on the types of start and step.

When only stop is given (even if I use floats or Decimal as stop) it will always return integers. And it's a lot slower than range.
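
An illustration of the reported behavior (assuming the defaults start=0 and step=1 are plain ints):

>>> from more_itertools import numeric_range
>>> list(numeric_range(3.5))  # float stop, but integer values
[0, 1, 2, 3]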

ilen returns zero inside lambda

I am slightly confused but it seems there might be either a bug of some kind (possibly related to more-itertools) or a misunderstanding on my part.

In a Python 3.6 shell:

>>> from itertools import islice
>>> from more_itertools import ilen
>>> iterable = [0, 40, 20, 30]
>>> ilen(iterable)
4
>>> i = 0
>>> slicesz=2
>>> slc = islice(iterable, i, slicesz)
>>> slc
<itertools.islice object at 0x7fce0964c728>
>>> ilen(slc)
2
>>> avg = lambda l: sum(l)/ilen(l)
>>> avg(slc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero

How can ilen return zero inside the lambda?
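
The likely explanation: islice returns a one-shot iterator, so sum(l) consumes it before ilen(l) runs, and ilen of an exhausted iterator is 0. Materializing the slice first (e.g. l = list(slc)) avoids this:

>>> slc = islice(iterable, 0, 2)
>>> sum(slc)   # consumes the iterator: 0 + 40
40
>>> ilen(slc)  # nothing left to count
0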

@consumer decorator from PEP 342

from functools import wraps

def consumer(func):
    """
    Decorator that automatically advances a "reverse iterator" to its first
    yield point when initially called
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # prime the generator (written gen.next() in the PEP's Python 2 original)
        return gen
    return wrapper
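
Usage might look like this (a running-average coroutine, purely illustrative):

@consumer
def averager():
    total, count, average = 0.0, 0, None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = averager()     # already advanced to the first yield
print(avg.send(10))  # 10.0
print(avg.send(30))  # 20.0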

Request: Faster `all_equal` for strings

Observation

I understand the code for all_equal derives from the itertools recipe. This legacy implementation has the benefit of working with generic iterables. However, I came across this SO post, which shows an elegant implementation of the same operation on strings, that is, verifying that all letters in a string are equal.

import itertools

import more_itertools

s = "aaaa"

%timeit more_itertools.all_equal(s)
1000000 loops, best of 3: 1.1 µs per loop

%timeit s == s[0] * len(s)
1000000 loops, best of 3: 438 ns per loop

We see the SO algorithm in this case is 2x-3x faster for strings.

Request

Can the SO algorithm be included in more_itertools.all_equal so that if a string is passed as the argument, this faster algorithm is preferred?

For example:

def all_equal(iterable):
    """
    Returns True if all the elements are equal to each other.
    
    Uses a faster implementation for strings. 
    http://stackoverflow.com/a/14321721/4531270
    
        >>> all_equal('aaaa')
        True
        >>> all_equal('aaab')
        False
        >>> all_equal([1,1,1,1])
        True
        >>> all_equal([1,1,1,0])
        False
        
    """
    if isinstance(iterable, str):
        s = iterable
        return s == s[0] * len(s)
    g = itertools.groupby(iterable)
    return next(g, True) and not next(g, False)

Tests

My local tests confirm these results:

# New algorithm
all_equal("aaaa")
# True
all_equal("aaab")
# False

# Legacy algorithm
all_equal([1,1,1,1])
# True
all_equal([1,1,1,0])
# False

Performance

The speedup over the legacy implementation is substantial, with the benefit growing for longer strings.

s = "a"*100000

# Legacy implementation
%timeit -n 1000 more_itertools.all_equal(s)
1000 loops, best of 3: 1.09 ms per loop
    
# Proposed implementation
%timeit -n 1000 all_equal(s)
1000 loops, best of 3: 9.64 µs per loop

last()

A last() function, with an API mirroring first(), would come in handy for things like https://gist.github.com/4019721, removing the need to use the less readable deque(seq, 1)[0]. Plus, we could take advantage of the __reversed__ method if it exists on the sequence.
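
A sketch of what that might look like (_marker mirrors the sentinel pattern first() uses for its default):

from collections import deque

_marker = object()

def last(iterable, default=_marker):
    """Return the last item of an iterable, or `default` if it is empty."""
    try:
        if hasattr(iterable, '__reversed__'):
            return next(reversed(iterable))
        return deque(iterable, maxlen=1)[0]
    except (IndexError, StopIteration):
        if default is _marker:
            raise ValueError('last() called on an empty iterable')
        return default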

range for floats

I didn't see a float range function in the library, so frange could be helpful. I also used the recommendation from the itertools.count docs to reduce floating-point error:

When counting with floating point numbers, better accuracy can sometimes be achieved by substituting multiplicative code such as: (start + step * i for i in count()).

import itertools, operator

# frange(stop)
# frange(start, stop[, step])
def frange(*args):
  if len(args) == 1:
    start = 0
    stop = args[0]
    step = 1
  elif len(args) == 2:
    start, stop = args
    step = 1
  elif len(args) == 3:
    start, stop, step = args
  else:
    raise TypeError('frange expected 1 to 3 arguments, got {}.'.format(len(args)))

  if start < stop and 0 < step:
    compare_with = operator.lt
  elif start > stop and 0 > step:
    compare_with = operator.gt
  else:
    return

  for step_count in itertools.count():
    val = start + step * step_count
    if compare_with(val, stop):
      yield val
    else:
      break


import unittest

class Testfrange(unittest.TestCase):
  def test_frange(self):
    self.assertEqual(
        tuple(frange(5)),
        tuple( range(5))
      )

    self.assertEqual(
        tuple(frange(-5)),
        tuple( range(-5))
      )

    self.assertEqual(
        tuple(frange(2, 5)),
        tuple( range(2, 5))
      )

    self.assertEqual(
        tuple(frange(2, 10, 2)),
        tuple( range(2, 10, 2))
      )

    self.assertEqual(
        tuple(frange(2, -5)),
        tuple( range(2, -5))
      )

    self.assertEqual(
        tuple(frange(2, -10, -2)),
        tuple( range(2, -10, -2))
      )

    self.assertEqual(
        tuple(frange(2, 10, -2)),
        tuple( range(2, 10, -2))
      )

    self.assertEqual(
        tuple(frange(2, -10, 2)),
        tuple( range(2, -10, 2))
      )

    self.assertEqual(
        tuple(frange(2.5, 4, 0.5)),
        (2.5, 3, 3.5)
      )

    self.assertEqual(
        tuple(frange(2.5, 4.1, 0.5)),
        (2.5, 3, 3.5, 4)
      )

    with self.assertRaises(TypeError):
      tuple(frange())

    with self.assertRaises(TypeError):
      tuple(frange(1, 2, 3, 4))


if __name__ == '__main__':
  unittest.main()

New itertool: window

How about adding a sliding window itertool?

There are multiple implementations that trade off memory consumption, speed, etc., but here's a version that's worked well for me:

from collections import deque

def window(seq, n=2):
    # note: the same deque object is yielded each time and mutated in
    # place, so copy it (e.g. tuple(win)) if windows must be kept around;
    # inputs shorter than n are padded with None by next(it, None)
    it = iter(seq)
    win = deque((next(it, None) for _ in xrange(n)), maxlen=n)
    yield win
    append = win.append
    for e in it:
        append(e)
        yield win

This package is not yet in the standard lib

Yes, this is a bug.

It's inconceivable that Python doesn't ship with an implementation of flatten already. It might be the most-asked question about Python on Stack Overflow.

Since this package is so useful, what about pushing it into the standard lib?

cycle with cycle count

Hi, I just found your more-itertools library and really like what's in it. I stumbled upon it because I was looking for an iterator similar to itertools.cycle that also yields the number of cycles completed so far, as (cycle count, object) pairs.

I have created this and think it would be a good addition to your package. Let me know if you want this incorporated into the package.

from itertools import cycle

def count_cycle(iterable):
  '''
  similar to itertools.cycle, but also give the number of cycles
  that have already been completed
  '''
  iterable = cycle(iterable)
  count = 0

  # guard the empty case: under PEP 479 a bare next() here would
  # turn StopIteration into a RuntimeError inside this generator
  try:
    first = next(iterable)
  except StopIteration:
    return
  # note: identity-based detection works because cycle() re-yields
  # the same objects, but it can fire early if the first object also
  # appears elsewhere in the input (e.g. interned small ints)
  first_id = id(first)

  yield count, first

  for item in iterable:
    if id(item) == first_id:
      count += 1

    yield count, item


import unittest

class TestCycleCount(unittest.TestCase):
  def test_count_cycle(self):
    self.assertEqual(
        tuple(count_cycle(())),
        ()
      )

    self.assertEqual(
        tuple(cc for i, cc in zip(range(9), count_cycle(range(3)))),
        ((0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2))
      )

if __name__ == '__main__':
  unittest.main()

Support chunking with a fixed number of chunks

Something I do relatively often is chunk data into a certain number of chunks, rather than chunks of a certain size. I find this useful when parallelizing work that depends on a bottleneck (e.g. a database or VCS server), and I want to ensure I don't overload it if a massive amount of work comes in. I have an implementation of this in https://github.com/bhearsum/chunkify, but it seems like it might fit well into chunked.

It would require an API break, with an interface such as:
def chunked(data, chunk_size=None, total_chunks=None)

...where one (and only one) of chunk_size or total_chunks is required. chunkify also supports returning only a specific chunk, which is a bit more efficient for large lists, but that's not crucial.
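
For comparison, a minimal sequence-based sketch of the total_chunks behavior (the name divide is just illustrative):

def divide(n, seq):
    # split a sequence into n contiguous chunks whose lengths
    # differ by at most one
    q, r = divmod(len(seq), n)
    start = 0
    for i in range(n):
        size = q + (1 if i < r else 0)
        yield seq[start:start + size]
        start += size

For example, list(divide(3, list(range(10)))) yields chunks of sizes 4, 3, and 3.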

`lstrip` vs. `dropwhile`

I notice more_itertools.lstrip is similar to itertools.dropwhile:

import itertools as it

import more_itertools as mit


iterable = [0, None, 1, 2, 0, 3, None, 0]
pred = lambda x: x in {None, 0}

list(mit.lstrip(iterable, pred))
# [1, 2, 0, 3, None, 0]

list(it.dropwhile(pred, iterable))
# [1, 2, 0, 3, None, 0]

I recall lstrip is a derivative of strip, but it may be worth noting in the docstring the similarity between lstrip and dropwhile (see #122).

more.chunked API inconsistent with recipes

Recipes.grouper has this signature:

def grouper(n, iterable, fillvalue=None):

But more.chunked has this signature:

def chunked(iterable, n):

Because these two functions serve almost exactly the same purpose, only with chunked not providing any fill value, it would be nice if it also had a congruent interface.

I know it's a lot to ask for an API to change so dramatically, but I think it would be worth the backward-incompatible change to make these congruent.

I make this post for your consideration and feedback. I won't be offended if the idea is rejected.
