Using itertools for file operations

21 Sep 2008

Using the itertools module in Python, you can get rid of a lot of boilerplate code and pick up a minor performance boost at the same time.

Skipping or Taking the first N lines of a file

No:

# skip the first line
count = 0
for line in file:
    count += 1
    if count == 1:
        continue
    ...

Yes:

import itertools

for line in itertools.islice(file, 1, None):
    ...

The first argument is the iterable, the second is where to start (i.e. how many initial lines to skip), and the third is where to stop. So taking the first 10 lines of a file would be islice(f, 0, 10). This case can be abbreviated to islice(f, 10).
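A quick check, with a made-up list standing in for a file object (islice works the same on any iterable):

```python
import itertools

lines = ["header", "row 1", "row 2", "row 3"]

# skip the first line
print(list(itertools.islice(lines, 1, None)))   # ['row 1', 'row 2', 'row 3']

# take the first 2 lines; islice(lines, 2) is the short form
print(list(itertools.islice(lines, 0, 2)))      # ['header', 'row 1']
```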

Since islice may not make your intent obvious, define a couple of helper functions:

def take(iterable, n):
    """ take first n items """
    return itertools.islice(iterable, n)

def drop(iterable, n):
    """ drop first n items from sequence """
    return itertools.islice(iterable, n, None)

Making progress monitors

If you are batch processing a large file, a lot of the code in the main loop can end up being console "user interface". Get rid of it using a generator.

No:

import sys

count = 0
for f in file:
    count += 1
    if count % 10000 == 0:
        sys.stderr.write("Read %10d features    \r" % count)
    ...

sys.stderr.write("Read %10d features, total   \n" % count)

Yes:

import sys
import itertools

def countstatus(iterable, mod=0):
    c = 0
    for c, item in itertools.izip(itertools.count(1), iterable):
        if mod and c % mod == 0:
            sys.stderr.write("Read %10d items\r" % c)
        yield item
    sys.stderr.write("Read %10d items, total\n" % c)

You can sex up the code by showing total elapsed time, lap time, or transactions per second. Then to use it:

for f in countstatus(file, 10000):
    ...
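For the elapsed-time and rate variant, one possible sketch (the name timedstatus and the output format are made up, not from the original helper):

```python
import sys
import time

def timedstatus(iterable, mod=0):
    """Yield items, periodically reporting count, elapsed time, and rate."""
    start = time.time()
    c = 0
    for c, item in enumerate(iterable, 1):
        if mod and c % mod == 0:
            elapsed = time.time() - start
            rate = c / elapsed if elapsed > 0 else 0.0
            sys.stderr.write("Read %10d items  %8.1f s  %10.1f items/s\r"
                             % (c, elapsed, rate))
        yield item
    sys.stderr.write("Read %10d items, total\n" % c)
```

It drops in the same way: for f in timedstatus(file, 10000): ...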

Mix and Match

# skip the first 1,000,000 items
for line in countstatus(drop(file, 1000000), 1000):
    ...

or

# use only the first 1,000,000 items
for line in countstatus(take(file, 1000000), 1000):
    ...
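The two helpers compose, too. A minimal sketch using a range in place of a file (the numbers are arbitrary):

```python
import itertools

def take(iterable, n):
    """ take first n items """
    return itertools.islice(iterable, n)

def drop(iterable, n):
    """ drop first n items from sequence """
    return itertools.islice(iterable, n, None)

# items 100 through 109 of a stream: drop 100, then take 10
window = take(drop(range(1000), 100), 10)
print(list(window))  # [100, 101, ..., 109]
```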