Category Archives: Programming

Chunked encoding and python’s requests library

I’ve been investigating long polling solutions. This blog entry describes the technique I used on the client side (I will probably change my mind a few more times before settling on a server-side implementation, and I may end up not using the code below on the client at all; but it may be useful to others who, for their own obscure reasons, want to iterate over chunks as they get produced by the server).

The server produces response snippets as they become available and sends them down the HTTP connection as chunks. In my case the content is XML, so it works well to concatenate a series of XML blobs into one large XML stream (somewhat similar to XMPP’s streams).
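
For reference, this is roughly what such a response looks like on the wire (an illustration, not captured from a real server; chunk sizes are hexadecimal byte counts, each chunk is terminated by a CRLF, and a zero-sized chunk ends the response):

HTTP/1.1 200 OK
Content-Type: text/xml
Transfer-Encoding: chunked

19
<event><id>1</id></event>
19
<event><id>2</id></event>
0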

On the client side, I wanted a way to consume chunks as they become available. As it turns out, python’s httplib is not very generator-friendly: if the server specifies chunked encoding, the library will correctly decode chunks, but it won’t give me control to stop at chunk boundaries.

So here is a working example that handles both chunked and non-chunked responses, and exposes the data as it gets produced.

import httplib
import requests
import sys

def main():
    if len(sys.argv) != 2:
        print "Usage: %s " % sys.argv[0]
        return 1

    headers = { 'Accept-Encoding' : 'identity' }
    sess = requests.sessions.Session()
    sess.headers.update(headers)
    sess.verify = False
    sess.prefetch = False
    sess.hooks.update(response=response_hook)
    resp = sess.get(sys.argv[1])
    cb = lambda x: sys.stdout.write("Read: %s\n" % x)
    for chunk in resp.iter_chunks():
        cb(chunk)

def response_hook(response, *args, **kwargs):
    # response.raw is urllib3's response object; its _fp attribute is the
    # underlying httplib response, which is what iter_chunks() expects
    response.iter_chunks = lambda amt=None: iter_chunks(response.raw._fp, amt=amt)
    return response

def iter_chunks(response, amt=None):
    """
    A copy-paste version of httplib.HTTPConnection._read_chunked() that
    yields chunks served by the server.
    """
    if response.chunked:
        while True:
            line = response.fp.readline().strip()
            arr = line.split(';', 1)
            try:
                chunk_size = int(arr[0], 16)
            except ValueError:
                # close the connection; protocol synchronization is lost
                response.close()
                raise httplib.IncompleteRead(line)
            if chunk_size == 0:
                break
            value = response._safe_read(chunk_size)
            yield value
            # we read the whole chunk, get another
            response._safe_read(2)      # toss the CRLF at the end of the chunk

        # read and discard trailer up to the CRLF terminator
        ### note: we shouldn't have any trailers!
        while True:
            line = response.fp.readline()
            if not line:
                # a vanishingly small number of sites EOF without
                # sending the trailer
                break
            if line == '\r\n':
                break

        # we read everything; close the "file"
        response.close()
    else:
        # Non-chunked response. If amt is None, then just drop back to
        # response.read()
        if amt is None:
            yield response.read()
        else:
            # Yield chunks as read from the HTTP connection
            while True:
                ret = response.read(amt)
                if not ret:
                    break
                yield ret

if __name__ == '__main__':
    sys.exit(main())

Save it as test-request.py and run it against a server that produces chunks.
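
If you don’t have such a server handy, a minimal one can be improvised with BaseHTTPServer (a sketch for testing only; the port and the XML payload are made up):

# test-chunked-server.py: emits a few XML blobs as HTTP chunks
import time
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

class ChunkedHandler(BaseHTTPRequestHandler):
    # chunked encoding is only defined for HTTP/1.1
    protocol_version = 'HTTP/1.1'

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/xml')
        self.send_header('Transfer-Encoding', 'chunked')
        self.end_headers()
        for i in range(3):
            data = '<event><id>%d</id></event>' % i
            # each chunk: size in hex, CRLF, payload, CRLF
            self.wfile.write('%x\r\n%s\r\n' % (len(data), data))
            self.wfile.flush()
            time.sleep(1)
        # a zero-sized chunk terminates the response
        self.wfile.write('0\r\n\r\n')

if __name__ == '__main__':
    HTTPServer(('', 8000), ChunkedHandler).serve_forever()

With the server running, python test-request.py http://localhost:8000/ should print one Read: line per chunk, roughly a second apart.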

The requests library does not directly allow one to do this, but it has a hook mechanism in place, permitting access to various entities as they get produced (in this case, the response, before it gets read).

I hope this will be useful to others too.

Python and Meta-Programming

Meta-programming is one of the lesser known features in Python that can simplify (and sometimes obscure) your code.

This was initially intended to be a 5-minute lightning talk at PyCarolinas 2012, but it could not quite fit the timeframe.

As described in the Python documentation, meta-programming allows you to customize class creation. Why would you need that? Keep reading, for I will discuss two common use cases I encountered.

This is what a normal class definition looks like:

class A(object):
    """Normal class"""
    def __init__(self):
        pass

And this is a skeleton class with a metaclass defined:

class A(object):
    "Customizing the creation of class A"
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            # Do something clever with name, base or attributes
            cls = type.__new__(mcs, name, bases, attributes)
            # Do something cleverer with the class itself
            print "I've created class", name, attributes
            return cls

    def __init__(self):
        pass

class B(A):
    static = 1

Running this piece of code (without instantiating objects of class A or class B) will produce the following output:

I've created class A {'__module__': '__main__', '__metaclass__': <class '__main__.__metaclass__'>, '__doc__': 'Customizing the creation of class A', '__init__': <function __init__ at 0x...>}
I've created class B {'__module__': '__main__', 'static': 1}

So your code gets executed as the class gets defined, which gives you tremendous control over your class creation.

At some point, the metaclass’ __new__ method does need to call type.__new__ to let Python create your class, but you have the opportunity to modify the class name, base classes or class attributes before class creation. Why would you want to do that? Let’s explore two use cases.
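
Before moving on to those, here is a quick illustration (a made-up example) of modifying the attributes before the class gets created:

class A(object):
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            # inject an attribute before the class is created
            attributes['created_by'] = 'metaclass'
            return type.__new__(mcs, name, bases, attributes)

print A.created_by      # prints: metaclass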

Use case 1: Slots

Slots are also well described in the Python documentation, and should be used for several reasons. The first one is memory footprint: slotted classes are more memory efficient, because the per-object dictionary (__dict__) is no longer allocated. Another good reason to use slots is “weak typing” – it describes a class’ interface. How many times have you assigned data to myobj.vaule and wondered why there’s no data in myobj.value?

Here is what a slotted class looks like:

class A(object):
    __slots__ = [ 'data' ]

Now this is valid:

a = A()
a.data = 1

While this is not:

a.dat = 1

Python will raise an AttributeError exception.

So far so good. Now let’s bring inheritance into the mix.

class Base(object):
    __slots__ = ['data']

class A(Base):
    pass

So this should be invalid:

a = A()
a.dat = 1

But it is not. The reason? Slots have to be defined in the base class, as well as in each subclass, even if they are empty. So this is the proper definition:

class Base(object):
    __slots__ = ['data']

class A(Base):
    __slots__ = []

It is rather unfortunate that you have to remember to define empty slots, so let’s try to simplify that. This is where modifying the class’ attributes at class creation time comes in handy.

class Base(object):
    __slots__ = []
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            if '__slots__' not in attributes:
                attributes.update(__slots__=[])
            cls = type.__new__(mcs, name, bases, attributes)
            return cls

class A(Base):
    pass

Note how, if __slots__ is not present in the attribute dictionary, we add it as an empty list.

Now this works as expected (in that trying to assign to the undeclared attribute .a will raise an AttributeError):

a = A()
a.a = 1

Use case 2: A class registry

What we will explore now is an application that spawns different object types (instantiated from different classes) depending on the input. This is a common pattern when processing XML nodes as part of a SAX parser, where you would like a customized (non-generic) object created when the closing tag is encountered in the XML stream. Doing this for a few variants of objects is not a problem (just a matter of the proper if/then/else construct), but it becomes cumbersome as soon as the number of classes grows.

In a very simplified scenario, we assume the input is plain text, and our custom classes define a Name attribute to indicate which input they are willing to handle. Here is the full example:

class Registry(object):
    _registry = {}

    @classmethod
    def register(cls, klass):
        cls._registry[klass.Name] = klass
        # returning the class also makes register() usable as a class
        # decorator (see the alternative implementation below)
        return klass

    @classmethod
    def process(cls, text):
        print "-> Processing: %s" % text
        klass = cls._registry.get(text)
        if klass is None:
            return None
        return klass()

class Base(object):
    Name = None
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            cls = type.__new__(mcs, name, bases, attributes)
            # every class built with this metaclass is registered,
            # including Base itself (under Name=None)
            Registry.register(cls)
            return cls

class A(Base):
    Name = "A"

class B(Base):
    Name = "B"

Note the line in __metaclass__.__new__:

Registry.register(cls)

This is where our newly created class gets added to the registry.

Let’s run the above example and process some input:

print Registry.process('A')
print Registry.process('B')
print Registry.process('C')

The output will be:

-> Processing: A
<__main__.A object at 0x...>
-> Processing: B
<__main__.B object at 0x...>
-> Processing: C
None

Notice how, since there is no handler class for C, None is returned by the third call.

An alternative, and probably easier to follow, implementation for this use case could use class decorators. This is how we would do it in that case (the Registry class is the same; note that register() returning the class is what makes it usable as a decorator):

@Registry.register
class A(object):
    Name = "A"

@Registry.register
class B(object):
    Name = "B"

However, using decorators to slotify sub-classes as in use case 1 will not work, since slots have to be defined at class creation time.
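
A quick way to convince yourself of that:

class C(object):
    pass

# too late: the class was already created with a __dict__
C.__slots__ = []

c = C()
c.anything = 1      # no AttributeError is raised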

List comprehension (done bad) for everybody!

As I was trying to fix a test, I found this masterpiece of Python code (edited a bit to highlight its resourcefulness):

def getModuleHooks(self):
    ModuleHooks = []
    for path, dirs, files in os.walk('/some/directory'):
        for file in [filename for filename in files]:
            if fnmatch.fnmatch(filename, '*.bar'):
                joined = os.path.join('hooks', file)
                ModuleHooks.append(joined)
    return ModuleHooks

There are several offenders there: iterating over a list comprehension that merely copies the list, and using the loop variable from inside the comprehension outside of it (I didn’t even know that was possible).
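
For contrast, a straightforward version of the same method might look like this (a sketch; the directory and the 'hooks' prefix are the placeholders from the snippet above):

import fnmatch
import os

def getModuleHooks(self):
    moduleHooks = []
    for path, dirs, files in os.walk('/some/directory'):
        # filter the file names directly instead of copying the list
        for filename in fnmatch.filter(files, '*.bar'):
            moduleHooks.append(os.path.join('hooks', filename))
    return moduleHooks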

It should be a candidate for thedailywtf.com.

keyutils, python and you

Over the weekend I wrote Python bindings for keyutils, so this blog post announces python-keyutils.

If you are not familiar with keyutils, it is a library that allows you to securely store sensitive information directly inside the Linux kernel, with a reasonable guarantee that the information cannot be retrieved from memory or swap.

keyutils comes with a binary, keyctl(1), that gives you access to the kernel’s key management facilities. The man page describes the types of available keyrings. The ones most interesting for the use case I had in mind were the per-thread, per-process and per-session keyrings.

The need for python bindings came when we realized that our release process requires typing the passphrase for signing packages way too many times, so there was a real need for a key agent of some sort. Searching for gpg-agent protocol specifications (or seahorse) returned some information, but nothing I could readily use (I may not have found the proper examples for speaking assuan; the end result was that I could not get anywhere in this direction).

Future versions of Conary will have the ability to read passphrases from the session keyring, if python-keyutils is installed. You can get python-keyutils from either contrib.rpath.org@rpl:2 or foresight.rpath.org@fl:2-devel (depending on whether you need the python 2.4 or python 2.6 version).

Keep in mind that I only implemented the bare minimum I needed to set and get key information. The library provides other functions that could be useful to have; if you find the need for one, let me know – as usual, patches will be cheerfully accepted.
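
For the curious, here is roughly what setting and getting a key looks like with the bindings (a sketch: the function names add_key, request_key and read_key match the current python-keyutils API, and the key description and payload are made up):

import keyutils

keyring = keyutils.KEY_SPEC_SESSION_KEYRING

# store the passphrase in the session keyring
key_id = keyutils.add_key('signing-passphrase', 's3cret', keyring)

# later, possibly from another process attached to the same session;
# request_key returns None if the key is not found
key_id = keyutils.request_key('signing-passphrase', keyring)
if key_id is not None:
    print keyutils.read_key(key_id)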

The code is hosted on bitbucket and can be checked out with Mercurial.