Advanced SQLAlchemy & Performance Concerns

Here at Tracelytics, performance is obviously one of our top priorities. However, writing performant code is nearly impossible if that code isn’t clear and well-organized. One of our favorite tools for doing this is the Python ORM SQLAlchemy. It provides a nice Pythonic abstraction for writing queries, and also a powerful interface for controlling exactly what SQL is generated when performance matters. Let’s take a look at a few of the features we love.

Loading Strategies

One of the most common pitfalls of using an ORM is accidentally creating a Query in a Loop. In SQLAlchemy, writing something like this:

# Assuming a User class, with a group relationship to a different table...

>>> users = Session.query(User).all()
>>> for r in users:
...     print user.group

This runs an initial query to get the list of users, then an individual query for each user to get the group! In many ORMs, you’d have to change the structure here: batch load the users, batch load the groups, then iterate over the loaded results. SQLAlchemy provides a better way, called Loading Strategies. Instead of changing the logic, you just specify which relationships to load, and SQLAlchemy will get everything you need in one query. In our example, we’d like to load all groups at the same time as the users.

>>> users = Session.query(User).options(joinedload('group')).all()
>>> for r in users:
...     print user.group

Fewer roundtrips, much faster, and hardly any code change.

Reconstructing Objects

One common pattern in OO designs is to lazily compute properties, cache them, and return the cached version after that. Not only does this have performance benefits, but it also keeps the computation logic for given properties separate and coherent. A simple method might look like this:

class Foo(object):

    def __init__(self):
        self._bar = None

    def bar(self):
        if self._bar is None:
            self._bar = ... # Compute bar
        return self._bar

This pattern unfortunately doesn’t work with SQLAlchemy. Specifically, objects loaded from the DB don’t re-run the __init__ constructor, so they won’t have the _bar property available, and the line self._bar is None will fail with AttributeError: 'Foo' object has no attribute '_bar' . To fix this, SQLAlchemy offers a hook that runs after loading: sqlalchemy.orm.reconstructor. In our case:

from sqlalchemy.orm import reconstructor

class Foo(object):

    def __init__(self):
        # Initialize persistent properties here
        self.init()

    @reconstructor
    def init(self):
        # Initialize transient properties here
        self._bar = None

    def bar(self):
        if self._bar is None:
            self._bar = ... # Compute bar
        return self._bar

In general, we do all transient initialization in a reconstructor method and only touch persistent properties in the constructor. Without that, new objects may behave differently than loaded objects!

Polymorphic Identities

ORMs typically model a single row in the database as an object, with each column represented as a field, and SQLAlchemy is no different. For most data, this makes sense. A “users” table with columns “first_name”, “joined_on”, and “active” maps cleanly to an object with the same properties. However, not all columns represent simple data. Something like “role” identifies a behavior on an object. Naively, the model methods that use that column might look like this:

Base = declarative_base()
class User(Base):
    __tablename__ = 'user'

    role = Column('role', String(20), nullable=False)
    # other column definitions here...

    def get_menu_dropdown_items(self):
        if self.role == 'owner':
            # return all choices
        elif self.role == 'admin':
            # return all choices, except delete organization
        elif self.role == 'base':
            # return just personal settings
        else:
            # return blank list

Instead of returning the column data itself, the object behaves differently based on the value of the column. A common OO pattern when faced with this kind of switch is to choose a different class for each (e.g.) role. By pushing this conditional “up a level”, to class instantiation time, we only have to make that choice once, instead of repeating this if/elif/else block across all the model methods that might need it. (If you want to read more about this pattern, check out the wiki article Replace Conditional with Polymorphism).

By default, SQLAlchemy picks the class type when it loads the rows. To change this, we can give the base class a polymorphic type — a row name that SQLAlchemy will use to figure out which class to instantiate, passed to the mapper. Each subclass can then declare a polymorphic identity — a row value to associate with the class. If you’re using the declarative base, you can set the __mapper_args__ to the appropriate column and value. For instance, the above example could become:

Base = declarative_base()
class User(Base):
    __tablename__ = 'user'
    __mapper_args__ = { 'polymorphic_on': 'role' }

    role = Column('role', String(20), nullable=False)

    def get_menu_dropdown_items(self):
        # default implementation: return blank list

class AdminUser(User):
    __mapper_args__ = { 'polymorphic_identity': 'admin' }

    def get_menu_dropdown_items(self):
        # return all choices, except delete organization

class OwnerUser(User):
    __mapper_args__ = { 'polymorphic_identity': 'owner' }

    def get_menu_dropdown_items(self):
        # return all choices

class BaseUser(User):
    __mapper_args__ = { 'polymorphic_identity': 'base' }

    def get_menu_dropdown_items(self):
        # return just personal settings

Now, querying the User class will actually return a number of different implementations, matched to the data!

This pattern is useful when you have a number of different types that you expect to have to change with some frequency. Obviously splitting up the get_menu_dropdown_items logic across multiple classes makes it somewhat harder to see all types of behavior. In return, implementing a new class (e.g., guest), is much easier, since all the methods are in one place, and can even be statically checked (for existence) by tools such as PyLint.

Lazy Loading Dependencies

In an ideal world, all models have no dependencies. It would even be fine to have models with one-way dependencies, with no cycles. Unfortunately, this isn’t always the case. Consider two classes, with mutual references:

from sqlalchemy import *
from sqlalchemy import orm
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()
class Account(Base):
    __table_name__ = 'account'

    id       = Column('id', Integer, primary_key=True)
    prefs_id = Column('organization_id', Integer, ForeignKey('preferences.id'))

    prefs = orm.relation(AccountPrefs, uselist=False) # One-to-one relationship

class AccountPrefs(Base):
    __table_name__ = 'preferences'

    id         = Column('id', Integer, primary_key=True)
    account_id = Column('organization_id', Integer, ForeignKey('account.id'))

    account = orm.relation(Account, uselist=False) # One-to-one relationship

However, when we try to create this, it fails with NameError: name 'AccountPrefs' is not defined. This is because we’re referencing the AccountPrefs class before its definition. We can fix this by changing the relationship lines to the following:

    prefs = orm.relation("AccountPrefs", uselist=False) # One-to-one relationship
    # ...
    account = orm.relation("Account", uselist=False) # One-to-one relationship

Instead of the class instead itself, we can pass a string to relation and defer creating the relationship until the loader is first used — typically after the import has finished.

If these classes are sufficiently large, or have other support code, it may make sense to break them up into two modules (e.g., account.py and account_prefs.py). This introduces a new problem: if one of these modules is loaded without the other, SQLAlchemy will again throw a NameError. Deferring this loading forever won’t help — the class definitions simply aren’t available!

Fundamentally, we have to force both modules to load if either module is loaded. We could just put the appropriate import statement in both files, but importing modules without a clear use of any of the variables seems kludgy (remember that we’re hiding the relationship declaration in strings already — tools like pylint will issue warnings on these unused imports). Ideally, we’d have an explicit declaration, with a clear purpose. Fortunately, SQLAlchemy 0.7.7 introduces a new hook: __declare_last__. This allows us to execute a function after the relationships are loaded, but before they are used. This gives us a chance to nest the circular import:

# account.py
Base = declarative_base()
class Account(Base):
    __table_name__ = 'account'

    id       = Column('id', Integer, primary_key=True)
    prefs_id = Column('organization_id', Integer, ForeignKey('preferences.id'))

    @classmethod
    def __declare_last__(cls):
        from account_prefs import AccountPrefs
        cls.prefs = orm.relation(AccountPrefs, uselist=False)

# account_prefs.py

from account import Base
class AccountPrefs(Base):
    __table_name__ = 'preferences'

    id         = Column('id', Integer, primary_key=True)
    account_id = Column('organization_id', Integer, ForeignKey('account.id'))

    account = orm.relation("Account", uselist=False) # One-to-one relationship

We like this because it lets us manage different chunks of logic, even when that code depends on data that is naturally tightly coupled.

Even more…

There’s obviously a lot more to SQLAlchemy. What are some of your favorite features? Let us know in the comments!

  • bloodearnest

    I think your caching property example is more complicated that it needs to be. Can you not just use a class property – like so:

    class Foo(object):

    _bar = None

    def bar(self):
    if self._bar is None:
    self._bar = …
    return self._bar

    This is how I’ve always done cached properties

    Or is the reset of self._bar that you need @reconstructor for? To avoid re-using the a stale cache? In that case, maybe your way is better – an Attrribute not found exception ensures you won’t be using the cached value by accident if you forget @reconstructor. Interesting.,,

  • TR Jordan

    @bloodearnest

    That caching example definitely works. Personally, I prefer to initialize instance variables in instance methods, just to explicitly mark the scope, but that’s more of a style preference than anything else.

    Like you mentioned, though, the real advantage to using an @reconstructor function is more complicated initialization. In one reconstructor method, we actually set up a logger with some predefined constants, based on the persistant values in the row. In that case, it made more sense to initialize early — in the reconstructor method.

    Thanks for the comment! There’s always more than one way to skin a cat.