Recently, I had to move around a lot of data in an annoyingly error-prone process: I would receive awfully colorful spreadsheets in ever changing formats. Frightened by seemingly infinite creativity of Excel users I decided to move the data to a staging environment first. After some quality checking – aka: Excel exorcism – I subsequently would merge the source data with my production database, the target.
In the lingo of database people, I needed to implement INSERT, UPDATE and DELETE. Regarding my Python programme, this meant covering three cases:
- new records in the source (INSERT),
- changes between source and target (UPDATE) as well as
- lost records, which only appear in target (DELETE).
In the context of a production system the latter case is the nasty one: You cannot simply delete records when you already have child records (i.e. foreign key relationships) in your system. So this meant shovelling data around manually. Thus my programme only handles the first two cases automatically and simply warns about the third case.
Object-relational mapping and a set()
The central piece of my Python code is an object-relational mapping between every database record and my custom class:
class Room(object): def __init__(self, pk, name, size): self.pk = pk self.name = name self.size = size
The “Syncer” does the work of fetching from target and source databases, comparing and taking proper action. I store all the records in a Python
set()– a very powerful data structure for the task.
class RoomSyncer(object): def __init__(self): self.target_rooms = set() def get_target(self): """Fetch rooms from production db""" ... for room in target: ... self.target_rooms.add( Room(pk, name, size ) )
If you run into the problem that your target and source do not resemble each other (i.e. that their attributes are different) you cannot instantiate with the
__init__()method in both cases. Use a factory method instead:
class Room(object): ... @classmethod def from_source(cls, pk, other_name, size): # do some magic to make objects simiar name = other_name return cls(pk, name, size) class RoomSyncer(object): def __init__(self): ... self.source_rooms = set() def get_source(self): """Fetch rooms from staging db""" ... for room in source: ... self.source_rooms.add( Room.from_source(pk, other_name, size ) )
So far we have two sets filled with our objects: target and source. If you subtract the target from the source you’ll get the newly arrived objects:
# rooms only in source rooms_to_add = self.source_rooms - self.target_rooms
Oh no! The code won’t work since Python does not know whether your objects are equal or not. Hence the set operations won’t work out of the box. So you have to override the
__eq__methods – Python will then compare as you would expect (more info on stackoverflow):
class Room(object): ... def __eq__(self, other): return self.pk == other.pk def __hash__(self): return hash(self.pk)
With this simple update you can now easily determine objects that only appear on one side. The last thing we need to do is to cover the UPDATE case. By overriding the methods above we implicitly confine the powers of Python’s set data structure. For a more sophisticated search for updates we’ll have to look at every attribute. Luckily for us, a Python object is just a simple dictionary under the hood. As a consequence we simply have to test the candidate dictionaries for equality:
class Room(object): ... def cmp(self, other): # is entire object equal? return self.__dict__ == other.__dict__ class RoomSyncer(object): .... def cmp_rooms(self, rooms): """objects appear on both sides""" rooms = self.target_rooms.intersection(self.source_rooms) for r in rooms: # simply get one object by key t = self._get_target(r.pk) s = self._get_source(r.pk) if t.cmp(s): # objects are identical ... else: # update ... def _get_target(self, pk): for r in self.target_rooms: if r.pk == pk: return r def _get_source(self, pk): for r in self.source_rooms: if r.pk == pk: return r