How to Yield Batches from an Iterable in Python
In a past project, I built a pipeline to ingest crawled job postings from Elasticsearch and standardize them into a unified schema. Since the postings couldn’t all fit in memory, I needed to process them in batches. The Elasticsearch client yields postings one by one, so I wrapped it with a batching generator that groups them into chunks.
To illustrate, suppose you have a generator that yields numbers 0 to 9 (in reality, it might yield thousands or millions of items). To read them in batches (for example, batch size of 4), you can loop and use itertools.islice to take the next 4 items. Each call to itertools.islice consumes up to 4 elements from the generator, and the loop continues until the generator is exhausted.
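A sketch of this pattern follows; the helper name `batched_islice` and the toy `numbers` generator are illustrative stand-ins for the real pipeline code:

```python
from itertools import islice

def numbers():
    # Stand-in for a real streaming source, e.g. an Elasticsearch scan
    yield from range(10)

def batched_islice(iterable, size):
    """Yield tuples of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        # islice consumes up to `size` items from the underlying iterator
        chunk = tuple(islice(it, size))
        if not chunk:
            # The source is exhausted; stop the generator
            return
        yield chunk

print(list(batched_islice(numbers(), 4)))
# [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
```

Note that the final batch can be shorter than the requested size, which is usually what you want when draining a stream.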
The code above can be simplified with yield from. The key idea is that yield from g is equivalent to for v in g: yield v, so yield from delegates the loop and yields each chunk until the source iterator is exhausted. To build that chunk iterator, use the two-argument form iter(callable, sentinel), which calls the callable repeatedly until it returns the sentinel value that signals the end of iteration (in this case, an empty tuple).
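A minimal sketch of the yield from version; the helper name `batched_yield_from` is illustrative:

```python
from itertools import islice

def batched_yield_from(iterable, size):
    """Yield tuples of up to `size` items from `iterable`."""
    it = iter(iterable)
    # iter(callable, sentinel) calls the lambda repeatedly and stops
    # when it returns the sentinel — here the empty tuple (), which
    # tuple(islice(...)) produces once `it` is exhausted.
    yield from iter(lambda: tuple(islice(it, size)), ())

print(list(batched_yield_from(range(10), 4)))
# [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9)]
```

The whole while loop collapses into a single line: iter builds an iterator of chunks, and yield from drains it.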
In Python 3.12 or later, you can simply use itertools.batched, which yields successive fixed-size tuples from any iterable (the final batch may be shorter).