Category Archives: Linux

Chunked encoding and python’s requests library

I’ve been investigating long polling solutions. This blog entry describes the technique I used on the client side (I will probably change my mind a few more times before settling on a server-side implementation, and I may end up not using the code below on the client at all; but it may be useful to others who, for their own obscure reasons, want to iterate over chunks as they are produced by the server).

The server produces response snippets as they become available, and sends them down the HTTP connection as chunks. In my case, the content being XML, it works well to concatenate a series of XML blobs into a large XML stream (somewhat similar to XMPP’s streams).

On the client side, I wanted a way to consume chunks as they become available. As it turns out, python’s httplib is not very generator-friendly: if the server specifies chunked encoding, the library will correctly decode chunks, but it won’t give me control to stop at chunk boundaries.

So here is a working example that handles both chunked and non-chunked responses, and exposes the data as it gets produced.

import httplib
import requests
import sys

def main():
    if len(sys.argv) != 2:
        print "Usage: %s <url>" % sys.argv[0]
        return 1

    headers = { 'Accept-Encoding' : 'identity' }
    sess = requests.sessions.Session()
    sess.headers.update(headers)
    sess.verify = False
    sess.prefetch = False
    sess.hooks.update(response=response_hook)
    resp = sess.get(sys.argv[1])
    cb = lambda x: sys.stdout.write("Read: %s\n" % x)
    for chunk in resp.iter_chunks():
        cb(chunk)

def response_hook(response, *args, **kwargs):
    response.iter_chunks = lambda amt=None: iter_chunks(response.raw._fp, amt=amt)
    return response

def iter_chunks(response, amt=None):
    """
    A copy-paste version of httplib.HTTPConnection._read_chunked() that
    yields chunks served by the server.
    """
    if response.chunked:
        while True:
            line = response.fp.readline().strip()
            arr = line.split(';', 1)
            try:
                chunk_size = int(arr[0], 16)
            except ValueError:
                # malformed chunk-size line; give up on the response
                response.close()
                raise httplib.IncompleteRead(line)
            if chunk_size == 0:
                break
            value = response._safe_read(chunk_size)
            yield value
            # we read the whole chunk, get another
            response._safe_read(2)      # toss the CRLF at the end of the chunk

        # read and discard trailer up to the CRLF terminator
        ### note: we shouldn't have any trailers!
        while True:
            line = response.fp.readline()
            if not line:
                # a vanishingly small number of sites EOF without
                # sending the trailer
                break
            if line == '\r\n':
                break

        # we read everything; close the "file"
        response.close()
    else:
        # Non-chunked response. If amt is None, then just drop back to
        # response.read()
        if amt is None:
            yield response.read()
        else:
            # Yield chunks as read from the HTTP connection
            while True:
                ret = response.read(amt)
                if not ret:
                    break
                yield ret

if __name__ == '__main__':
    sys.exit(main())

Save it as test-request.py and run it against a server that produces chunks.

The Requests library does not directly allow one to do this, but it has a hook mechanism in place, thus permitting access to various entities as they get produced (in this case, the response, before it gets read).
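The chunk framing that iter_chunks decodes can be illustrated in isolation. Here is a minimal sketch (in Python 3 syntax, operating on an in-memory byte stream instead of a socket) of the same wire format: each chunk is its size in hex, optionally followed by extensions, then CRLF, the data, and another CRLF, with a zero-sized chunk terminating the body:

```python
import io

def iter_chunks_from_bytes(raw):
    """Decode an HTTP/1.1 chunked-encoded body, yielding each chunk."""
    fp = io.BytesIO(raw)
    while True:
        # each chunk starts with "<size-in-hex>[;extensions]\r\n"
        size_line = fp.readline().strip()
        chunk_size = int(size_line.split(b';', 1)[0], 16)
        if chunk_size == 0:
            break  # the zero-sized chunk terminates the body
        yield fp.read(chunk_size)
        fp.read(2)  # discard the CRLF trailing the chunk data

body = b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
print(list(iter_chunks_from_bytes(body)))  # [b'hello', b' world']
```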

I hope this will be useful to others too.

Python and Meta-Programming

Meta-programming is one of the lesser known features in Python that can simplify (and sometimes obscure) your code.

This was initially intended to be a 5-minute lightning talk at PyCarolinas 2012, but it could not quite fit the timeframe.

As described in the Python documentation, meta-programming allows you to customize class creation. Why would you need that? Keep reading, for I will discuss two common use cases I encountered.

This is what a normal class definition looks like:

class A(object):
    """Normal class"""
    def __init__(self):
        pass

And this is a skeleton class with a meta-class defined:

class A(object):
    "Customizing the creation of class A"
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            # Do something clever with name, base or attributes
            cls = type.__new__(mcs, name, bases, attributes)
            # Do something cleverer with the class itself
            print "I've created class", name, attributes
            return cls

    def __init__(self):
        pass

class B(A):
    static = 1

Running this piece of code (without instantiating objects of class A or class B) will produce the following output:

I've created class A {'__module__': '__main__', '__metaclass__': <class '__main__.__metaclass__'>, '__doc__': 'Customizing the creation of class A', '__init__': <function __init__ at 0x...>}
I've created class B {'__module__': '__main__', 'static': 1}

So your code gets executed as the class gets defined, which gives you tremendous control over your class creation.
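For reference, Python 3 dropped the __metaclass__ attribute in favor of a metaclass keyword in the class statement. A minimal sketch of the same hook (names here are illustrative):

```python
created = []

class Meta(type):
    def __new__(mcs, name, bases, attributes):
        cls = type.__new__(mcs, name, bases, attributes)
        created.append(name)  # record every class built with this metaclass
        return cls

class A(metaclass=Meta):
    pass

class B(A):  # subclasses inherit the metaclass
    static = 1

print(created)  # ['A', 'B']
```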

At some point, the metaclass’ __new__ method does need to call type.__new__ to let Python create your class, but you have the opportunity to modify the class name, base classes or class attributes before class creation. Why would you want to do that? Let’s explore two use cases.

Use case 1: Slots

Slots are also well described in the Python documentation, and should be used for several reasons. The first one is memory footprint: slotted classes are more memory efficient, because the per-object dictionary (__dict__) is no longer allocated. Another good reason to use slots is interface enforcement – __slots__ describes a class’ interface, and attributes outside of it are rejected. How many times have you assigned data to myobj.vaule and wondered why there’s no data in myobj.value?

Here is what a slotted class looks like:

class A(object):
    __slots__ = [ 'data' ]

Now this is valid:

a = A()
a.data = 1

While this is not:

a.dat = 1

Python will raise an AttributeError exception.

So far so good. Now let’s bring inheritance into the mix.

class Base(object):
    __slots__ = ['data']

class A(Base):
    pass

So this should be invalid:

a = A()
a.dat = 1

But it is not. The reason? Slots have to be defined in the base class, as well as in each subclass, even if they are empty. So this is the proper definition:

class Base(object):
    __slots__ = ['data']

class A(Base):
    __slots__ = []

It is rather unfortunate that you have to remember to define empty slots, so let’s try to simplify that. This is where modifying the class’ attributes at class creation time comes in handy.

class Base(object):
    __slots__ = []
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            if '__slots__' not in attributes:
                attributes.update(__slots__=[])
            cls = type.__new__(mcs, name, bases, attributes)
            return cls

class A(Base):
    pass

Note how, if __slots__ is not present in the attribute dictionary, we add it as an empty list.

Now this works as expected (in that trying to assign to field .a will raise an AttributeError):

a = A()
a.a = 1
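The same slot-injecting trick, sketched as a self-contained example in Python 3 syntax:

```python
class SlotsMeta(type):
    def __new__(mcs, name, bases, attributes):
        # inject an empty __slots__ into any class that did not declare one
        attributes.setdefault('__slots__', [])
        return type.__new__(mcs, name, bases, attributes)

class Base(metaclass=SlotsMeta):
    __slots__ = ['data']

class A(Base):
    pass  # __slots__ = [] is injected automatically

a = A()
a.data = 1  # allowed: 'data' is a declared slot
try:
    a.dat = 1  # typo: no such slot anywhere in the hierarchy
except AttributeError as exc:
    print("rejected:", exc)
```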

Use case 2: A class registry

What we will explore now is an application that spawns different object types (instantiated from different classes) depending on the input. This is a common pattern when processing XML nodes in a SAX parser, where you would like a customized (non-generic) object created when the closing tag is encountered in the XML stream. Doing this for a handful of object variants is not a problem (just a matter of the proper if/then/else construct), but it becomes cumbersome as soon as the number of classes grows.

In a very simplified scenario, we assume the input is plain text, and our custom classes define a Name attribute to indicate which input they are willing to handle. Here is the full example:

class Registry(object):
    _registry = {}

    @classmethod
    def register(cls, klass):
        cls._registry[klass.Name] = klass
        # returning the class lets register double as a decorator
        return klass

    @classmethod
    def process(cls, text):
        print "-> Processing: %s" % text
        klass = cls._registry.get(text)
        if klass is None:
            return None
        return klass()

class Base(object):
    Name = None
    class __metaclass__(type):
        def __new__(mcs, name, bases, attributes):
            cls = type.__new__(mcs, name, bases, attributes)
            Registry.register(cls)
            return cls

class A(Base):
    Name = "A"

class B(Base):
    Name = "B"

Note the line in __metaclass__.__new__:

Registry.register(cls)

This is where our newly created class gets added to the registry.

Let’s run the above example and process some input:

print Registry.process('A')
print Registry.process('B')
print Registry.process('C')

The output will be:

-> Processing: A
<__main__.A object at 0x...>
-> Processing: B
<__main__.B object at 0x...>
-> Processing: C
None

Notice how, since there is no handler class for C, None is returned by the third call.

An alternative, and probably easier to follow, implementation for this use case uses class decorators (note that register must return the class for this to work, or the decorated name would be bound to None). This is how we would do it in that case:

@Registry.register
class A(object):
    Name = "A"

@Registry.register
class B(object):
    Name = "B"

However, using decorators to slotify subclasses as in use case 1 will not work, since slots have to be defined at class creation time, and a decorator only runs after the class has been created.
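Putting the decorator variant together as a runnable sketch (Python 3 syntax; note the return klass in register):

```python
class Registry:
    _registry = {}

    @classmethod
    def register(cls, klass):
        cls._registry[klass.Name] = klass
        return klass  # must return the class, or the decorated name becomes None

    @classmethod
    def process(cls, text):
        klass = cls._registry.get(text)
        return klass() if klass is not None else None

@Registry.register
class A:
    Name = "A"

@Registry.register
class B:
    Name = "B"

print(type(Registry.process("A")).__name__)  # A
print(Registry.process("C"))                 # None
```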

List comprehension (done bad) for everybody!

As I was trying to fix a test, I found this masterpiece of Python code (edited a bit to highlight its resourcefulness):

def getModuleHooks(self):
    ModuleHooks = []
    for path, dirs, files in os.walk('/some/directory'):
        for file in [filename for filename in files]:
            if fnmatch.fnmatch(filename, '*.bar'):
                joined = os.path.join('hooks', file)
                ModuleHooks.append(joined)
    return ModuleHooks

There are several offenders here: iterating over a pointless copy of the list made by the comprehension, and using the comprehension’s loop variable outside of it (I didn’t even know that was possible – in Python 2, the comprehension variable leaks into the enclosing scope). Worse, the fnmatch call tests the leaked filename – stuck at the last entry of files – instead of file, so whether anything gets appended depends only on the last file in each directory.

It should be a candidate for thedailywtf.com.
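For completeness, a straightforward rewrite – a sketch that keeps the original’s placeholder directory as a parameter:

```python
import fnmatch
import os

def get_module_hooks(top='/some/directory'):
    """Collect 'hooks/<name>' paths for every *.bar file under top."""
    hooks = []
    for path, dirs, files in os.walk(top):
        # filter each directory's files once, with no leaked loop variables
        hooks.extend(os.path.join('hooks', name)
                     for name in fnmatch.filter(files, '*.bar'))
    return hooks
```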

Managing SSH keys with Conary

Since we “eat our own dogfood” here at rPath, our IT department uses a Platform-as-a-Service model.

They maintain their own platform that contains all the bits required on all the systems, like a baseline. As a consumer of the platform, all I have to do is add my own software.

In this case, I was basing my Jira product on IT’s custom platform.

IT wants their standard SSH keys (for their own access, as well as for automated backups) on all their machines. But I also want SSH access (and my SSH key) on my appliance. Since one cannot share /root/.ssh/authorized_keys among multiple packages, Conary’s tag handlers to the rescue!

Foresight’s development branch now has an ssh-keys package that allows you to manage multiple sources for your ssh keys.

To manage your keys with Conary, all you need to do is drop your SSH public keys in /etc/ssh/keys.d/<username>/<somename>.pub within a Conary package that has a build requirement on ssh-keys:runtime (so that the file is properly tagged).

When the file gets installed, ssh-keys’ tag handler will append the file to the user’s authorized_keys file, thus granting you access.
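The core job of such a tag handler can be sketched as follows. This is a simplified illustration, not the actual ssh-keys implementation; the function name and the home-directory layout are assumptions for the example:

```python
import os

def install_keys(keys_dir, home_base):
    """Append each <keys_dir>/<user>/<name>.pub to that user's
    authorized_keys file under <home_base>/<user>/.ssh/.

    Hypothetical sketch of what a key-merging tag handler does;
    the real handler works on /etc/ssh/keys.d and real home dirs.
    """
    for user in os.listdir(keys_dir):
        user_dir = os.path.join(keys_dir, user)
        auth = os.path.join(home_base, user, '.ssh', 'authorized_keys')
        os.makedirs(os.path.dirname(auth), exist_ok=True)
        with open(auth, 'a') as out:
            for pub in sorted(os.listdir(user_dir)):
                if pub.endswith('.pub'):
                    with open(os.path.join(user_dir, pub)) as f:
                        out.write(f.read())
```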

Key removal is not yet done, although it would not be hard at all to implement.

Recovering data from one disk from a RAID1 array

Last night I helped a friend recover the data he had stored on an Iomega NAS.

The disks were fine, the rest of the hardware had failed.

Prior to me being involved in this, my friend had installed Ubuntu on an older machine and had installed both drives.

Not having played with RAID for quite some time, I had to acquire some knowledge first – google to the rescue!

In the process I used the wrong option to mdadm (--create instead of --assemble), so I messed up the RAID descriptor on one of the disks. Fortunately, the second disk was fine.

Here is what I ended up doing:

  • Install mdadm, a utility to configure RAID devices.
  • Install lvm2, a utility to configure LVM (Logical Volume Manager).
  • Run:

mdadm --assemble /dev/md9 /dev/sdc1 --run

(this adds one of the partitions on the existing drives, /dev/sdc1, into a RAID device /dev/md9 running in degraded mode, i.e. with not enough disks – that’s what --run does)

vgchange -a y

(this scans all drives, including the newly created /dev/md9, for logical volumes)

It should print something about a new device with a rather cryptic name, I think something like /dev/vg1_md9/lv1. lvdisplay will show the available volumes.

This new device has a filesystem that can be mounted:

mkdir /tmp/olddrive
mount /dev/vg1_md9/lv1 /tmp/olddrive -o ro

After this, the directory /tmp/olddrive is associated with the contents of the filesystem.

There may be better ways to achieve the same thing, but this is what worked.

keyutils, python and you

Over the weekend I wrote Python bindings for keyutils. So this blog announces python-keyutils.

If you are not familiar with keyutils, it is a library that allows you to securely store sensitive information, directly inside the Linux kernel. You have a reasonable guarantee that the information cannot be retrieved from the memory or swap.

keyutils comes with a binary, keyctl(1), that gives you access to the kernel’s key management facilities. The man page describes the types of available keyrings. The ones most interesting for the use case I had in mind were the per-thread, per-process and per-session keyrings.

The need for Python bindings came when we realized that our release process requires typing the passphrase for signing packages way too many times, so there was a real need for a key agent of some sort. Searching for gpg-agent protocol specifications (or seahorse) turned up some information, but nothing I could readily use (I may not have found the proper examples for speaking assuan; the end result was that I could not get anywhere in this direction).

Future versions of Conary will have the ability to read passphrases from the session keyring, if python-keyutils is installed. You can get python-keyutils from either contrib.rpath.org@rpl:2 or foresight.rpath.org@fl:2-devel (depending on whether you need the python 2.4 or python 2.6 version).

Keep in mind that I only implemented the bare minimum I needed to be able to set and get key information. There are other functions the library provides that could be useful to have. If you find the need for one, let me know; as usual, patches will be cheerfully accepted.

The code is hosted on bitbucket and can be checked out with Mercurial.

The king is dead, long live the king!

As we here at rPath are trying to embrace standards [1], I got to work on CIM again. Kind of a deja-vu since, in a previous life, I started to write Cimbiote – a way to write CIM providers in python.

Cimbiote did not go anywhere in the past two and a half years, so I dusted it off and started to play with it. As it turns out, major pieces were missing: the ability to create references, support for associations, etc.

When trying to add reference support, I found out from the sblim mailing list that there is still hope in the world. Today I finished packaging cmpi-bindings=contrib.rpath.org@rpl:2 and wrote a simple plugin that does not do much, but was enough to prove the approach far superior to cimbiote.

[1] You, in the back row, stop chuckling. We all know that standards are wonderful things, as Tigger would say.

VIM tip of the day

An operation I happen to do a lot:

Given a list of words spread on separate lines, sort the words.

One can try to do :sort or !sort on a range, but that will sort the lines, not the words inside the lines.

The easiest way I found so far (it requires visual mode, which is only available in vim):

  • visual select of the lines you want to sort
  • !!fmt -1
  • visual select of the lines you want to sort
  • :sort
  • gw}

Clear as mud :-)

Explanation:

! will run the selection through an external program, in this case fmt (a Unix command to re-format lines of text). -1 says to format with a text width of 1 character, which effectively breaks the lines at each space, leaving one word per line.

:sort will sort the selected lines

gw} will format from where the cursor is to the end of the paragraph (re-joining the lines).
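For comparison, the same split-sort-rejoin transformation is a one-liner in Python:

```python
text = "pear plum\napple cherry\nbanana"
# split() breaks on any whitespace (including newlines), sorted() orders
# the words, and join() glues them back together
print(" ".join(sorted(text.split())))  # apple banana cherry pear plum
```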

Attaching random commands to a keyboard shortcut

I thought I should share this simple way to run a command with a key combination.

Metacity will let you configure keyboard shortcuts for generating events it knows about (sound events, window events, etc.). For that, all you have to do is go to System->Preferences->Keyboard Shortcuts.

But what if you wanted to run an arbitrary command? In my case, if I suspend my laptop while the display is set to use the external monitor, and I resume trying to use the laptop’s LCD screen, I would have to somehow invoke xrandr to get the display back. Sounds like a keyboard shortcut would solve the problem, since I could type it blindly, without anything on the display.

This is the solution I found:

  • Run gconf-editor
  • Go to apps->metacity->keyboard_commands
  • Edit one of the command_N keys to add the command you want run (in my case xrandr)
  • Go to global_keybindings, edit the matching run_command_N key and add the keyboard shortcut, like <Ctrl><Alt><Shift>R

No need to restart anything.