Category Archives: python

Table syncing with Python

Recently, I had to move around a lot of data in an annoyingly error-prone process: I would receive awfully colorful spreadsheets in ever-changing formats. Frightened by the seemingly infinite creativity of Excel users, I decided to move the data to a staging environment first. After some quality checking – aka Excel exorcism – I would then merge the source data with my production database, the target.

In the lingo of database people, I needed to implement INSERT, UPDATE and DELETE. For my Python program, this meant covering three cases:

  • new records in the source (INSERT),
  • changes between source and target (UPDATE) as well as
  • lost records, which only appear in target (DELETE).

In the context of a production system, the latter case is the nasty one: you cannot simply delete records when child records (i.e. foreign key relationships) already depend on them. That would mean shovelling data around manually. So my program handles only the first two cases automatically and merely warns about the third.

[Image: Moving around data and breaking some rules for the sake of backwards compatibility.]

Object-relational mapping and a set()

The central piece of my Python code is an object-relational mapping between every database record and my custom class:

class Room(object):
    
    def __init__(self, pk, name, size):

        self.pk = pk
        self.name = name
        self.size = size

The “Syncer” does the work of fetching from target and source databases, comparing and taking proper action. I store all the records in a Python set() – a very powerful data structure for the task.

class RoomSyncer(object):
    
    def __init__(self):
        
        self.target_rooms = set()
    
    def get_target(self):
        """Fetch rooms from production db"""
        
        ...
        
        for room in target:
            ...
            self.target_rooms.add(
                Room(pk, name, size)
            )
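
The fetching itself is elided above. Purely for illustration – with sqlite3 standing in for whatever driver you use, and the database path, table and column names being assumptions of mine – it could look roughly like this:

import sqlite3  # stand-in; any DB-API driver works much the same way

class RoomSyncer(object):

    def __init__(self):

        self.target_rooms = set()

    def get_target(self):
        """Fetch rooms from production db (sketch)"""

        conn = sqlite3.connect('production.db')  # hypothetical path
        cursor = conn.cursor()
        cursor.execute('SELECT pk, name, size FROM rooms')

        for pk, name, size in cursor.fetchall():
            self.target_rooms.add(Room(pk, name, size))

        conn.close()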

If you run into the problem that your target and source do not resemble each other (i.e. their attributes differ), you cannot instantiate both with the same __init__() method. Use a factory method instead:

class Room(object):

    ...

    @classmethod
    def from_source(cls, pk, other_name, size):
        
        # do some magic to make the objects similar
        name = other_name
        
        return cls(pk, name, size)

class RoomSyncer(object):
    
    def __init__(self):
        
        ...
        
        self.source_rooms = set()
    
    def get_source(self):
        """Fetch rooms from staging db"""
        
        ...
        
        for room in source:
            ...
            self.source_rooms.add(
                Room.from_source(pk, other_name, size)
            )

set() operations

So far we have two sets filled with our objects: target and source. If you subtract the target from the source you’ll get the newly arrived objects:

# rooms only in source
rooms_to_add = self.source_rooms - self.target_rooms

Oh no! The code won’t work, because Python does not know whether your objects are equal or not, so the set operations don’t work out of the box. You have to override the __hash__ and __eq__ methods – Python will then compare the objects as you would expect (more info on Stack Overflow):

class Room(object):
    
    ...

    def __eq__(self, other):
        
        return self.pk == other.pk
    
    def __hash__(self):
        
        return hash(self.pk)
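
As a quick sanity check (the pk, names and sizes are made up), two Room objects sharing a pk now compare equal and collapse inside a set:

room_a = Room(1, 'Lecture hall', 120)
room_b = Room(1, 'Lecture hall A', 150)  # same pk, different attributes

print room_a == room_b       # True -- equality looks at pk only
print len({room_a, room_b})  # 1 -- the set collapses the two objects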

With this simple update you can now easily determine objects that only appear on one side. The last thing we need to do is cover the UPDATE case. By overriding the methods above, we implicitly confine the powers of Python’s set data structure, so for a more thorough search for updates we have to look at every attribute. Luckily for us, a Python object keeps its attributes in a plain dictionary (__dict__) under the hood. As a consequence we simply have to test the candidates’ dictionaries for equality:

class Room(object):
    
    ...

    def cmp(self, other):
        # is entire object equal?
        
        return self.__dict__ == other.__dict__

class RoomSyncer(object):

    ...

    def cmp_rooms(self):
        """Handle objects that appear on both sides"""

        rooms = self.target_rooms.intersection(self.source_rooms)

        for r in rooms:

            # simply get one object by key
            t = self._get_target(r.pk)
            s = self._get_source(r.pk)

            if t.cmp(s):
                # objects are identical
                ...
            
            else:
                # update
                ...

    def _get_target(self, pk):
        for r in self.target_rooms:
            if r.pk == pk:
                return r
    
    def _get_source(self, pk):  
        for r in self.source_rooms:
            if r.pk == pk:
                return r
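
To tie everything together, the overall flow could look roughly like the sketch below. The _insert() helper and the warning output are placeholders of my own; how you actually write to the production database depends on your setup:

class RoomSyncer(object):

    ...

    def sync(self):
        """Sketch of the overall flow (helper names are hypothetical)"""

        self.get_source()
        self.get_target()

        # INSERT: records that only exist in the staging db
        for room in self.source_rooms - self.target_rooms:
            self._insert(room)  # hypothetical helper writing to the target

        # DELETE candidates: records only in production -- just warn,
        # child records may still depend on them
        for room in self.target_rooms - self.source_rooms:
            print 'WARNING: room %s only exists in the target' % room.pk

        # UPDATE: records that appear on both sides
        self.cmp_rooms()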

Memcached decorator for Python

Bad English, yet findable by search engines…

After returning from EuroPython 2011 I was especially excited about two things: memcached and Python’s decorator syntax. I don’t know why it took so long, but it was finally time to try things out.

Memcached is a memory-based caching system that can easily be installed on various operating systems. Online documentation is abundant (try this), so let’s focus on its purpose. Imagine some expensive function, e.g. a CPU-costly database calculation or some network-bound call. For the sake of simplicity my example just sleeps for 5 seconds:

import time

def expensive_function(x):
    time.sleep(5)
    return x

expensive_function('some_input')

The function in my example always returns the same output for a given input, so it is said to be referentially transparent. Hence the expensive function call can be replaced with its value. (This technique is called either caching or memoization, but I don’t really care about the disambiguation.)

The first approach looks like this:

import memcache
import time

def expensive_function(x):
    mc = memcache.Client(['127.0.0.1:11211'], debug=0)
    # calculate a unique key
    key = 'expensive_function_%s' % (x,)    
    # check if key/value pair exists
    value = mc.get(key)
    if value:
        return value
    else:
        time.sleep(5)
        value = x
        # cache result for 5 mins
        mc.set(key, value, 60 * 5)
        return value

expensive_function('some_input')
expensive_function('some_input')

On the first invocation the function now runs for five seconds, whereas the second call returns immediately. A unique key is needed to store and retrieve the values. Since memcached is not bound to the Python runtime, the speedup also carries over to a second process.
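
A quick way to see the effect (assuming a memcached instance on 127.0.0.1:11211 and the expensive_function defined above):

import time

start = time.time()
expensive_function('some_input')  # first call: takes about 5 seconds
print 'first call:  %.2fs' % (time.time() - start)

start = time.time()
expensive_function('some_input')  # second call: answered from the cache
print 'second call: %.2fs' % (time.time() - start)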

The return value depends on the function’s input value, so the key must contain (some form of) the input. I will show a better solution below.

By now it’s possible to cache the results of a single function. In a bigger application context we run into two problems. First, we violate the DRY principle because we have to write the code for getting and setting key-value pairs over and over again. Second, and more severe, we risk setting the same key for different combinations of functions and input parameters. Since this could lead to harmful errors that are hard to debug, it seems advisable to place the caching logic in a single point of your code.

Thanks to Python’s decorators the solution is easy. (A decorator is simply a function that intercepts the execution of another function or method. So every time we call expensive_function, a wrapper function will look for cached results and decide whether to invoke the function as intended or return the cached value immediately.) The function can now be decorated, and the code’s readability is preserved.

#!/usr/bin/env python

import time
import memcache
import hashlib

def memoize(f):

    def newfn(*args, **kwargs):
        mc = memcache.Client(['127.0.0.1:11211'], debug=0)
        # generate md5 out of args and function
        m = hashlib.md5()
        margs = [x.__repr__() for x in args]
        mkwargs = [x.__repr__() for x in kwargs.values()]
        map(m.update, margs + mkwargs)
        m.update(f.__name__)
        m.update(f.__class__.__name__)
        key = m.hexdigest()

        value = mc.get(key)
        if value:
            return value
        else:
            value = f(*args, **kwargs)
            mc.set(key, value, 60)
            return value

    return newfn

@memoize
def expensive_function(x):
    time.sleep(5)
    return x

if __name__ == '__main__':
    print expensive_function('abc')
    print expensive_function('abc')

The function name, the class name and all parameters passed produce a hash that memcached uses as the key. I’m still not happy with this solution because uniqueness of the key is not guaranteed. (And all solutions I can think of revolve around some string concatenation magic…) A glimpse at more sophisticated implementations, e.g. Django’s cache decorator or memorised, shows that they work pretty much the same way, though.
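
For what it’s worth, one source of collisions is that the successive m.update() calls simply concatenate bytes: expensive_function(1, 23) and expensive_function(12, 3) both feed '123' into the hash, and kwargs.values() drops the keyword names, so calls that differ only in the name of a keyword argument map to the same key. Hashing the repr of one single tuple sidesteps at least these cases, although it is still string magic – a sketch of mine, not part of the original decorator:

import hashlib

def make_key(f, args, kwargs):
    """Build a less collision-prone cache key (still repr/string based)"""
    m = hashlib.md5()
    # hash one repr of a single tuple, including the keyword names,
    # instead of concatenating the individual pieces
    m.update(repr((f.__module__, f.__name__, args, sorted(kwargs.items()))))
    return m.hexdigest()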