Chunked encoding and Python’s requests library

I’ve been investigating long polling solutions. This blog entry describes the technique I used on the client side (I will probably change my mind a few more times before settling on a server-side implementation, and I may end up not using the code below on the client at all; but it may be useful to others who, for other obscure reasons, want to iterate over chunks as they are produced by the server).

The server produces response snippets as they become available and sends them down the HTTP connection as chunks. In my case the content is XML, so it works well to concatenate a series of XML blobs into one large XML stream (somewhat similar to XMPP’s streams).

On the client side, I wanted a way to consume chunks as they become available. As it turns out, Python’s httplib is not very generator-friendly: if the server specifies chunked encoding, the library will correctly decode the chunks, but it won’t give me control to stop at chunk boundaries.

So here is a working example that handles both chunked and non-chunked responses, and exposes the data as it gets produced.

import httplib
import requests
import sys

def main():
    if len(sys.argv) != 2:
        print "Usage: %s " % sys.argv[0]
        return 1

    # Ask for an uncompressed response, so chunks arrive as the server wrote them
    headers = { 'Accept-Encoding' : 'identity' }
    sess = requests.sessions.Session()
    sess.headers.update(headers)
    sess.verify = False
    # Do not read the response body eagerly; we will consume it in chunks
    sess.prefetch = False
    # Run our hook on the response object before its body is read
    sess.hooks.update(response=response_hook)
    resp = sess.get(sys.argv[1])
    cb = lambda x: sys.stdout.write("Read: %s\n" % x)
    for chunk in resp.iter_chunks():
        cb(chunk)

def response_hook(response, *args, **kwargs):
    response.iter_chunks = lambda amt=None: iter_chunks(response.raw._fp, amt=amt)
    return response

def iter_chunks(response, amt=None):
    """
    A copy-paste version of httplib.HTTPResponse._read_chunked() that
    yields chunks as served by the server.
    """
    if response.chunked:
        while True:
            line = response.fp.readline().strip()
            arr = line.split(';', 1)
            try:
                chunk_size = int(arr[0], 16)
            except ValueError:
                # Not a valid chunk size; the connection is out of sync
                response.close()
                raise httplib.IncompleteRead(line)
            if chunk_size == 0:
                break
            value = response._safe_read(chunk_size)
            yield value
            # we read the whole chunk, get another
            response._safe_read(2)      # toss the CRLF at the end of the chunk

        # read and discard trailer up to the CRLF terminator
        ### note: we shouldn't have any trailers!
        while True:
            line = response.fp.readline()
            if not line:
                # a vanishingly small number of sites EOF without
                # sending the trailer
                break
            if line == '\r\n':
                break

        # we read everything; close the "file"
        response.close()
    else:
        # Non-chunked response. If amt is None, then just drop back to
        # response.read()
        if amt is None:
            yield response.read()
        else:
            # Yield chunks as read from the HTTP connection
            while True:
                ret = response.read(amt)
                if not ret:
                    break
                yield ret

if __name__ == '__main__':
    sys.exit(main())

Save it as test-request.py and run it against a server that produces chunks.
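
If you don’t have a chunk-producing server handy, here is a minimal sketch of one, using Python’s BaseHTTPServer (the port, the one-second delay and the XML payload are arbitrary choices for illustration):

import BaseHTTPServer
import time

class ChunkedHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    # Chunked transfer encoding requires HTTP/1.1
    protocol_version = 'HTTP/1.1'

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'application/xml')
        self.send_header('Transfer-Encoding', 'chunked')
        self.end_headers()
        for i in range(5):
            blob = '<item n="%d"/>' % i
            # A chunk is the size in hex, CRLF, the payload, CRLF
            self.wfile.write('%x\r\n%s\r\n' % (len(blob), blob))
            time.sleep(1)
        # A zero-sized chunk terminates the response
        self.wfile.write('0\r\n\r\n')

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('', 8000), ChunkedHandler).serve_forever()

With the server running, python test-request.py http://localhost:8000/ should print one Read: line per second.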

The Requests library does not directly allow one to do this, but it has a hook mechanism in place that grants access to various entities as they get produced (in this case, the response object, before its body gets read).

I hope this will be useful to others too.

Python and Meta-Programming

Meta-programming is one of the lesser-known features of Python that can simplify (and sometimes obscure) your code.

This was initially intended to be a 5-minute lightning talk at PyCarolinas 2012, but it could not quite fit the timeframe.

As described in the Python documentation, meta-programming allows you to customize class creation. Why would you need that? Keep reading, for I will discuss two common use cases I have encountered.

This is what a normal class definition looks like:

class A(object):
    """Normal class"""
    def __init__(self):
        pass

And this is a skeleton class with a meta-class defined:

class A(object):
    "Customizing the creation of class A"
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            # Do something clever with name, base or attributes
            cls = type.__new__(mcs, name, bases, attributes)
            # Do something cleverer with the class itself
            print "I've created class", name, attributes
            return cls

    def __init__(self):
        pass

class B(A):
    static = 1

Running this piece of code (without instantiating objects of class A or class B) will produce the following output:

I've created class A {'__module__': '__main__', '__metaclass__': <class '__main__.__metaclass__'>, '__doc__': 'Customizing the creation of class A', '__init__': <function __init__ at 0x...>}
I've created class B {'__module__': '__main__', 'static': 1}

So your code gets executed as the class gets defined, which gives you tremendous control over your class creation.

At some point, the metaclass’ __new__ method does need to call type.__new__ to let Python create your class, but you have the opportunity to modify the class name, base classes or class attributes before class creation. Why would you want to do that? Let’s explore two use cases.

Use case 1: Slots

Slots are also well described in the Python documentation, and should be used for several reasons. The first one is memory footprint: slotted classes are more memory efficient, because the object dictionary (__dict__) is no longer allocated. Another good reason to use slots is that they describe a class’ interface and catch typos. How many times have you assigned data to myobj.vaule and wondered why there’s no data in myobj.value?

Here is what a slotted class looks like:

class A(object):
    __slots__ = [ 'data' ]

Now this is valid:

a = A()
a.data = 1

While this is not:

a.dat = 1

Python will raise an AttributeError exception.
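
The memory claim from above is easy to observe (a minimal sketch): instances of a slotted class carry no per-instance dictionary at all.

class Plain(object):
    pass

class Slotted(object):
    __slots__ = []

p = Plain()
s = Slotted()
print hasattr(p, '__dict__')    # True: every instance carries a dict
print hasattr(s, '__dict__')    # False: no per-instance dict is allocated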

So far so good. Now let’s bring inheritance into the mix.

class Base(object):
    __slots__ = ['data']

class A(Base):
    pass

So this should be invalid:

a = A()
a.dat = 1

But it is not. The reason? Slots have to be defined in the base class, as well as in each subclass, even if they are empty. So this is the proper definition:

class Base(object):
    __slots__ = ['data']

class A(Base):
    __slots__ = []

It is rather unfortunate that you have to remember to define empty slots, so let’s try to simplify that. This is where modifying the class’ attributes at class creation time comes in handy.

class Base(object):
    __slots__ = []
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            if '__slots__' not in attributes:
                attributes.update(__slots__=[])
            cls = type.__new__(mcs, name, bases, attributes)
            return cls

class A(Base):
    pass

Note how, if __slots__ is not present in the attribute dictionary, we add it as an empty list.

Now this works as expected (in that trying to assign to field .a will raise an AttributeError):

a = A()
a.a = 1

Use case 2: A class registry

What we will explore now is an application that spawns different object types (instantiated from different classes) depending on the input. This is a common pattern when processing XML nodes as part of a SAX parser, where you would like a customized (non-generic) object created when the closing tag is encountered in the XML stream. Doing this for very few variants of objects is not a problem (only a matter of the proper if/then/else construct), but it becomes cumbersome as soon as the number of classes grows.

In a very simplified scenario, we assume the input is plain text, and our custom classes define a Name attribute to indicate which input they are willing to handle. Here is the full example:

class Registry(object):
    _registry = {}

    @classmethod
    def register(cls, klass):
        cls._registry[klass.Name] = klass
        # Return the class, so this method can also be used as a class decorator
        return klass

    @classmethod
    def process(cls, text):
        print "-> Processing: %s" % text
        klass = cls._registry.get(text)
        if klass is None:
            return None
        return klass()

class Base(object):
    Name = None
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            cls = type.__new__(mcs, name, bases, attributes)
            # Register every new class; note that Base itself also ends up
            # in the registry, under Name = None
            Registry.register(cls)
            return cls

class A(Base):
    Name = "A"

class B(Base):
    Name = "B"

Note the line in __metaclass__.__new__:

Registry.register(cls)

This is where our newly created class gets added to the registry.

Let’s run the above example and process some input:

print Registry.process('A')
print Registry.process('B')
print Registry.process('C')

The output will be:

-> Processing: A
<__main__.A object at 0x...>
-> Processing: B
<__main__.B object at 0x...>
-> Processing: C
None

Notice how, since there is no handler class for C, None is returned by the third call.

An alternative, and probably easier to follow, implementation for this use case could use class decorators (this is why register returns the class). This is how we would do it in that case (the Registry class is the same):

@Registry.register
class A(object):
    Name = "A"

@Registry.register
class B(object):
    Name = "B"

However, using decorators to slotify sub-classes as in use case 1 will not work, since slots have to be defined at class creation time.
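
A quick sketch of why a decorator is too late for slots: by the time a decorator runs, the class has already been created (with a per-instance __dict__), and assigning __slots__ after the fact has no effect.

class Base(object):
    __slots__ = ['data']

class A(Base):
    pass

A.__slots__ = []   # too late: A was created without slots
a = A()
a.dat = 1          # no AttributeError; the typo goes unnoticed
print a.__dict__   # {'dat': 1}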

List comprehension (done bad) for everybody!

As I was trying to fix a test, I found this masterpiece of Python code (edited a bit to highlight its resourcefulness):

def getModuleHooks(self):
    ModuleHooks = []
    for path, dirs, files in os.walk('/some/directory'):
        for file in [filename for filename in files]:
            if fnmatch.fnmatch(filename, '*.bar'):
                joined = os.path.join('hooks', file)
                ModuleHooks.append(joined)
    return ModuleHooks

There are several offenders there: iterating over a list comprehension (instead of simply iterating over the list), and using the comprehension’s loop variable outside of it (I didn’t even know that was possible). As a bonus, the leaked filename is always the last element of files, so the fnmatch test never even looks at the file being processed.
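
For contrast, here is what the function presumably intended (a hedged reconstruction; the /some/directory path and the '*.bar' pattern come from the original):

import fnmatch
import os

def getModuleHooks(self):
    moduleHooks = []
    for path, dirs, files in os.walk('/some/directory'):
        # fnmatch.filter() picks the matching names in one call
        for filename in fnmatch.filter(files, '*.bar'):
            moduleHooks.append(os.path.join('hooks', filename))
    return moduleHooks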

The original should be a candidate for thedailywtf.com.

Managing SSH keys with Conary

Since we “eat our own dogfood” here at rPath, our IT department does use a Platform-as-a-Service model.

They maintain their own platform that contains all the bits required on all the systems, like a baseline. As a consumer of the platform, all I have to do is add my own software.

In this case, I was basing my Jira product on IT’s custom platform.

IT wants their standard SSH keys (for their own access, as well as for automated backups) on all their machines. But I also want SSH access (and my SSH key) on my appliance. Since one cannot share /root/.ssh/authorized_keys among multiple packages, Conary’s tag handlers to the rescue!

Foresight’s development branch now has an ssh-keys package that allows you to manage multiple sources for your ssh keys.

To manage your keys with Conary, all you need to do is drop your SSH public keys in /etc/ssh/keys.d/<username>/<somename>.pub within a Conary package that has a build requirement on ssh-keys:runtime (so that the file is properly tagged).
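
I have not packaged keys recently, so treat this as a rough sketch of what such a Conary recipe might look like (the package name, user name and key file are made up, and the recipe details are assumptions):

class MySshKeys(PackageRecipe):
    name = 'my-ssh-keys'
    version = '1.0'
    buildRequires = ['ssh-keys:runtime']

    def setup(r):
        # Install the public key where the ssh-keys tag handler expects it;
        # 'myuser' and 'mykey.pub' are hypothetical names
        r.addSource('mykey.pub', dest='/etc/ssh/keys.d/myuser/mykey.pub')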

When the file gets installed, ssh-keys’ tag handler will append the file to the user’s authorized_keys file, thus granting you access.

Key removal is not yet implemented, although it would not be hard at all to add.

Setting the orienteering course for Sunday, May 6th

Just got back from setting the long course for Sunday.

I waited for the rain to stop, but by 5:30 it was clear it wasn’t going to. So half of the time I was in the rain, and even though I had a plastic map cover, the water still got in – to the point where the top side of the map was so wet that the ink was getting smudged (yes, I only have a deskjet at home).

And for extra fun, I had to set the last 5 controls in the dark. Night orienteering is not easy at Umstead, when you’re looking for tiny orange ribbons.

Mud Run

Today I ran my first 5K mud run. I was part of a 4-person co-ed team from the Raleigh Trail Runners meetup group.

The obstacles were numerous and challenging, but we all had a blast. It is definitely not your typical 5K run; the run itself was actually the easy part. I am very curious how long it took us to finish the course – I know the start time, but none of us paid attention to the finish time. Results will probably be posted over the next few days.

20 of the 32 obstacles were featured in these short video clips on YouTube before the race, but there were some surprises (like obstacle 21, The Weaver, where you had to go over a log and under the next one – for a total of probably 16 logs – without touching the ground). This Google Map has a description of all the obstacles and links to the video clips above.

Damage: $32.50 (not bad at all for a race!), a scraped and bumped knee, and a few minor scratches in addition to a rather large one (most of them from The Weaver).

For its low cost, the race was incredibly well organized. Building that course must have been a huge volunteer effort.

The beginning of a new Orienteering year

For Backwoods Orienteering Klub, the month of September is the beginning of a new year. In part because membership runs from one September to the next, but also because during the summer the only events are advanced and sprint ones, so this month does mark the return of regular events.

Today’s course was beautifully chosen. I got very tired very fast; I think I managed to kill my legs yesterday when I ran in preparation for today. So it was a constant struggle to keep up the appearance of running while on the course, but overall I am very pleased with the result. I made two minor mistakes which probably cost me a few minutes each – I can blame the rest only on my slow pace.

Overall, between yesterday’s 6-mile run at Lake Bond and today’s 88 minutes of pain, I think I met my goal of running 26 miles per week. (I ran 3.5 miles on Monday, a very fast 6+ on Loblolly Trail on Tuesday, and a slow 6 on a combination of Company Mill, Graylyn and Sycamore Trail on Thursday.) Props to the Raleigh Trail Runners group for giving me the motivation to wake up early.

And the Oscar goes to…

After today’s World Cup Final, I am even more committed to boycotting soccer (or football, for the other 99.9% of the nations out there) by not watching it.

Too much acting. Too much cheating.

Imagine this discussion with your child (and no, it didn’t happen to mine, but I am sure it could):

“Daddy, who is Cristiano Ronaldo?”

“He’s this very famous player that earns millions per year.”

“Wow, he is really good. Look at his skills. The other team can only stop him with fouls, see? Look at the replay, you see how they… Oh, wait… They didn’t even touch him. He just fell off his feet and the other guy got red carded.”

“Well, you see, sometimes they do a little bit of acting, you know, to improve their odds…”

“But isn’t that cheating?”

“It is, but…”

“So why do I get punished if I cheat at my test, or I am called out for plagiarizing, but they get away with it? And they have no shame that everybody will see the replay, and realize how crooked they are? Including the referees who got fooled for a second, and will get (rightfully) stonewalled after the game? And why is FIFA’s slogan ‘Fair Play’? Is it really fair to cheat?”

I am not original; I read some of these opinions on other sites. I am sure a lot of people feel the same. I am just wasting zeros and ones here.

Spain deserved to win; they were the better team (although I must admit I did not watch the whole game). The Netherlands did not deserve the silver medal, after all the theatrics they put on. No matter how talented Robben is, I lost all respect for him the moment he fell off his feet and claimed a penalty kick, or whatever he was claiming when he was booked.

I am sure that, 4 years from now, I will have forgotten all this, and I will waste more 90-minute chunks of my life. And I will again feel sorry for that. If FIFA does nothing to make the game what it used to be, I am afraid the game is doomed.

Recovering data from one disk from a RAID1 array

Last night I helped a friend recover the data he had stored on an Iomega NAS.

The disks were fine, the rest of the hardware had failed.

Prior to me getting involved, my friend had installed Ubuntu on an older machine and attached both drives to it.

Not having played with RAID for quite some time, I had to acquire some knowledge first – Google to the rescue!

In the process I used the wrong option to mdadm (--create instead of --assemble), so I messed up the RAID descriptor on one of the disks. Fortunately, the second disk was fine.

Here is what I ended up doing:

  • Install mdadm, a utility to configure RAID devices.
  • Install lvm2, a utility to configure LVM (Logical Volume Manager).
  • Run:

mdadm --assemble /dev/md9 /dev/sdc1 --run

(this adds one of the partitions on the existing drives, /dev/sdc1, into a RAID device /dev/md9 running in degraded mode, i.e. with not enough disks – that’s what --run does)

vgchange -a y

(this scans all drives, including the newly created /dev/md9, for logical volumes)

It should print something about a new device with a rather cryptic name – I think something like /dev/vg1_md9/lv1. lvdisplay will show the available volumes.

This new device has a filesystem that can be mounted:

mkdir /tmp/olddrive
mount /dev/vg1_md9/lv1 /tmp/olddrive -o ro

After this, the directory /tmp/olddrive exposes the contents of the filesystem.

There may be better ways to achieve the same thing, but this is what worked.