Here at Tracelytics, performance is obviously one of our top priorities. However, writing performant code is nearly impossible if that code isn’t clear and well-organized. One of our favorite tools for doing this is the Python ORM SQLAlchemy. It provides a nice Pythonic abstraction for writing queries, and also a powerful interface for controlling exactly what SQL is generated when performance matters. Let’s take a look at a few of the features we love.
Loading Strategies
One of the most common pitfalls of using an ORM is accidentally creating a Query in a Loop. In SQLAlchemy, writing something like this:
# Assuming a User class, with a group relationship to a different table... >>> users = Session.query(User).all() >>> for r in users: ... print user.group
This runs an initial query to get the list of users, then an individual query for each user to get the group! In many ORMs, you’d have to change the structure here: batch load the users, batch load the groups, then iterate over the loaded results. SQLAlchemy provides a better way, called Loading Strategies. Instead of changing the logic, you just specify which relationships to load, and SQLAlchemy will get everything you need in one query. In our example, we’d like to load all groups at the same time as the users.
>>> users = Session.query(User).options(joinedload('group')).all()
>>> for r in users:
... print user.group
Fewer roundtrips, much faster, and hardly any code change.
Reconstructing Objects
One common pattern in OO designs is to lazily compute properties, cache them, and return the cached version after that. Not only does this have performance benefits, but it also keeps the computation logic for given properties separate and coherent. A simple method might look like this:
class Foo(object):
def __init__(self):
self._bar = None
def bar(self):
if self._bar is None:
self._bar = ... # Compute bar
return self._bar
This pattern unfortunately doesn’t work with SQLAlchemy. Specifically, objects loaded from the DB don’t re-run the __init__ constructor, so they won’t have the _bar property available, and the line self._bar is None will fail with AttributeError: 'Foo' object has no attribute '_bar' . To fix this, SQLAlchemy offers a hook that runs after loading: sqlalchemy.orm.reconstructor. In our case:
from sqlalchemy.orm import reconstructor
class Foo(object):
def __init__(self):
# Initialize persistent properties here
self.init()
@reconstructor
def init(self):
# Initialize transient properties here
self._bar = None
def bar(self):
if self._bar is None:
self._bar = ... # Compute bar
return self._bar
In general, we do all transient initialization in a reconstructor method and only touch persistent properties in the constructor. Without that, new objects may behave differently than loaded objects!
Polymorphic Identities
ORMs typically model a single row in the database as an object, with each column represented as a field, and SQLAlchemy is no different. For most data, this makes sense. A “users” table with columns “first_name”, “joined_on”, and “active” maps cleanly to an object with the same properties. However, not all columns represent simple data. Something like “role” identifies a behavior on an object. Naively, the model methods that use that column might look like this:
Base = declarative_base()
class User(Base):
__tablename__ = 'user'
role = Column('role', String(20), nullable=False)
# other column definitions here...
def get_menu_dropdown_items(self):
if self.role == 'owner':
# return all choices
elif self.role == 'admin':
# return all choices, except delete organization
elif self.role == 'base':
# return just personal settings
else:
# return blank list
Instead of returning the column data itself, the object behaves differently based on the value of the column. A common OO pattern when faced with this kind of switch is to choose a different class for each (e.g.) role. By pushing this conditional “up a level”, to class instantiation time, we only have to make that choice once, instead of repeating this if/elif/else block across all the model methods that might need it. (If you want to read more about this pattern, check out the wiki article Replace Conditional with Polymorphism).
By default, SQLAlchemy picks the class type when it loads the rows. To change this, we can give the base class a polymorphic type — a row name that SQLAlchemy will use to figure out which class to instantiate, passed to the mapper. Each subclass can then declare a polymorphic identity — a row value to associate with the class. If you’re using the declarative base, you can set the __mapper_args__ to the appropriate column and value. For instance, the above example could become:
Base = declarative_base()
class User(Base):
__tablename__ = 'user'
__mapper_args__ = { 'polymorphic_on': 'role' }
role = Column('role', String(20), nullable=False)
def get_menu_dropdown_items(self):
# default implementation: return blank list
class AdminUser(User):
__mapper_args__ = { 'polymorphic_identity': 'admin' }
def get_menu_dropdown_items(self):
# return all choices, except delete organization
class OwnerUser(User):
__mapper_args__ = { 'polymorphic_identity': 'owner' }
def get_menu_dropdown_items(self):
# return all choices
class BaseUser(User):
__mapper_args__ = { 'polymorphic_identity': 'base' }
def get_menu_dropdown_items(self):
# return just personal settings
Now, querying the User class will actually return a number of different implementations, matched to the data!
This pattern is useful when you have a number of different types that you expect to have to change with some frequency. Obviously splitting up the get_menu_dropdown_items logic across multiple classes makes it somewhat harder to see all types of behavior. In return, implementing a new class (e.g., guest), is much easier, since all the methods are in one place, and can even be statically checked (for existence) by tools such as PyLint.
Lazy Loading Dependencies
In an ideal world, all models have no dependencies. It would even be fine to have models with one-way dependencies, with no cycles. Unfortunately, this isn’t always the case. Consider two classes, with mutual references:
from sqlalchemy import *
from sqlalchemy import orm
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
class Account(Base):
__table_name__ = 'account'
id = Column('id', Integer, primary_key=True)
prefs_id = Column('organization_id', Integer, ForeignKey('preferences.id'))
prefs = orm.relation(AccountPrefs, uselist=False) # One-to-one relationship
class AccountPrefs(Base):
__table_name__ = 'preferences'
id = Column('id', Integer, primary_key=True)
account_id = Column('organization_id', Integer, ForeignKey('account.id'))
account = orm.relation(Account, uselist=False) # One-to-one relationship
However, when we try to create this, it fails with NameError: name 'AccountPrefs' is not defined. This is because we’re referencing the AccountPrefs class before its definition. We can fix this by changing the relationship lines to the following:
prefs = orm.relation("AccountPrefs", uselist=False) # One-to-one relationship
# ...
account = orm.relation("Account", uselist=False) # One-to-one relationship
Instead of the class instead itself, we can pass a string to relation and defer creating the relationship until the loader is first used — typically after the import has finished.
If these classes are sufficiently large, or have other support code, it may make sense to break them up into two modules (e.g., account.py and account_prefs.py). This introduces a new problem: if one of these modules is loaded without the other, SQLAlchemy will again throw a NameError. Deferring this loading forever won’t help — the class definitions simply aren’t available!
Fundamentally, we have to force both modules to load if either module is loaded. We could just put the appropriate import statement in both files, but importing modules without a clear use of any of the variables seems kludgy (remember that we’re hiding the relationship declaration in strings already — tools like pylint will issue warnings on these unused imports). Ideally, we’d have an explicit declaration, with a clear purpose. Fortunately, SQLAlchemy 0.7.7 introduces a new hook: __declare_last__. This allows us to execute a function after the relationships are loaded, but before they are used. This gives us a chance to nest the circular import:
# account.py
Base = declarative_base()
class Account(Base):
__table_name__ = 'account'
id = Column('id', Integer, primary_key=True)
prefs_id = Column('organization_id', Integer, ForeignKey('preferences.id'))
@classmethod
def __declare_last__(cls):
from account_prefs import AccountPrefs
cls.prefs = orm.relation(AccountPrefs, uselist=False)
# account_prefs.py
from account import Base
class AccountPrefs(Base):
__table_name__ = 'preferences'
id = Column('id', Integer, primary_key=True)
account_id = Column('organization_id', Integer, ForeignKey('account.id'))
account = orm.relation("Account", uselist=False) # One-to-one relationship
We like this because it lets us manage different chunks of logic, even when that code depends on data that is naturally tightly coupled.
Even more…
There’s obviously a lot more to SQLAlchemy. What are some of your favorite features? Let us know in the comments!
