Django ORM performance patch

February 14, 2008 [last comment: May 15, 2008]

Django ORM is simple and obvious, but when from tutorial models project moved to real-world models, same ORM often becomes headache. I work on big Django-project that currently have 278 models in 40 applications with a lot of relations between them. For such number of models QuerySet.select_related() is best friend and worst enemy in same time. The best friend because it allows to decrease number of SQL-queries, and worst enemy because it can't load only those fields of involved in queryset models that I need for this concrete view.

For example following query it's standard way to receive with 1 SQL query and then show 10 last blog entries titles with author names and avatars.


entries = Entry.objects.select_related(depth=1)[:10]
for e in entries:
    print e.headline, e.user.username, e.user.get_avatar_url()
 

The problem is in SQL query that Django ORM will generate for this code – it will load from database all fields of Entry and related User model for each from 10 blog entries even if you need only 3 small fields Entry.headline, User.username, User.avatar and don't need other heavy fields like Entry.content, User.about_text, etc.

A lot of resources wasted here: it's not cheap to select from database big textfield, then transfer utf-8 bytes to web-server, convert from utf-8 to unicode, allocate kilobytes of memory in python process, and finally run garbage collector that will find that those already megabytes of memory (for each object from object list for each client request) were not need at all and it safe now to spend some more system resources for deleting these never used objects.

Current QuerySet.values() can't help here because it can't select fields from related models and often we need model instances only and not dictionaries.

Special lightweight models (when you bind to same database table several Django models with different field subsets of a heavy model) results in code duplication and require to implement lightweight models for all related heavy models also, and actually it's not web development done right.

Returning back to old good SQL is possible but it's "not right" and in this case you also lose business logic already defined in models.

I checked out queryset-ref Django branch and didn't find solution for above problem. This branch makes step ahead and allows to control what related models to join and what don't to join, but it don't give me what I need – full control on the loaded fields without loosing advantages of models.

Then I found 2 Django tickets #5420: Allow database API users to specify the fields to exclude in a SELECT statement and #5768: Allow QuerySet.values() to return values spanning joins. This tickets address same problem but unfortunately it were assigned to nobody, so I created the patch and going to attach it to above tickets.

You can download the patch here, it is quite simple but allows to write queries that loads only those fields that you requested (and plan to use in view or template). Above example with new fields() method can be rewritten as:


entries = Entry.objects.fields('headline', user=('username', 'avatar'))[:10]
for e in entries:
    print e.headline, e.user.username, e.user.get_avatar_url()

The patch include tests/modeltests/fields/models.py with examples of usage of new methods and some docs. The patch is backward compatible and don't fail any test from Django test site. The patch also change signature and implementation of current QuerySet.values() method improving it with ability to select fields in related models also. It may look like:


entries = Entry.objects.values('headline', user=('username',))[:10]
for e in entries:
    print e['headline'], e['user']['username']

Both values() and fields() use same logic, only return format is different, so I also removed with this patch ValuesQuerySet that isn't need.

And a final note about lazy loading of field, that was mentioned in the Django ticket #5420. Proposed patch don't support and from my experience of using Django in big project lazy field loading is a dangerous feature. When different developers write views and templates and same code can be modified by many developers, it's very easy to find eventually in a template single not-optimal line of code that for objects list from 10 items will generate 20 heavy SQL queries. It may look like:

{% for room in room_list %}
Blog {{ blog.name }} have {{ blog.entry_set.all|length }} 
post{{ blog.entry_set.all|length|pluralize }}
{% endfor %}

With lazy field loading we can have same problem. For example frontend developer can add to a template {{ entry.summary }} // {{ entry.date }} that he copy-pasted from other template used by other view that loaded these fields. If new view doesn't load these fields, with lazy field loading you will have 2 extra SQL queries per each blog entry on the page. When number of developers exceeds 10, it very hard to prevent such problems. If in a template new model fields are need, frontend developer should ask backend developer instead of using of lazy field loading.

I tested the patch with Django trunk in my environment and all 234 available tests were passed. If the patch will broke something in other environments, please let me know in comments.

Tags: concept  django  performance  sql 
Submitted 3 comments: accepted - 3, in moderation queue - 0. Add your comment.
  • Joel Bernstein on February 14, 2008
    You know, I have to agree with mtredinnick on point 2.

    Lazy field loading is potentially an annoyance, sure, but everything still works (albeit slower) and the fix is simple (add the extra columns to the query).

    Your alternative, though, generates half-loaded model instances. The very idea makes my skin crawl. If you call save() on that instance, you could very well wipe out the other columns on that record.

    Django's ORM may very well be smart enough to only update columns that have changed; I don't know off the top of my head. Even if it is, I still don't like it.
  • Dima Dogadaylo on February 14, 2008
    fields() and values() should only be used in read-only patterns that actually is 99% of most sites usage. For sure, you should load all fields if you are going to change and save model.
  • Milos on May 15, 2008
    Hi, your patch is great, I've had similar problem with my app, my database is a special spatial database where each table has some kind of spatial column, quite large..

    I'm using geodjango, which has its own patched version of _get_sql_clause, so the additional patch is needed like this:

    --- query.py 2008-05-15 20:59:07.421875000 +0200
    +++ query.py.orig 2008-05-15 20:50:34.875000000 +0200
    @@ -7,7 +7,6 @@
    from django.contrib.gis.db.models.fields import GeometryField, PointField
    # parse_lookup depends on the spatial database backend.
    from django.contrib.gis.db.backend import gqn, parse_lookup, SpatialBackend
    -from django.db.models.query import _get_model_fields
    from django.contrib.gis.geos import GEOSGeometry, Point

    # Shortcut booleans for determining the backend.
    @@ -74,7 +73,7 @@
    # QuerySet.extra(select=foo) because extra() adds an an _additional_
    # field to be selected. Used in returning transformed geometries, and
    # handling the selection of native database geometry formats.
    - for f in _get_model_fields(opts, "", self._fields, self._select):
    + for f in opts.fields:
    # Getting the selection format string.
    if hasattr(f, '_geom'):
    sel_fmt = self._geo_fmt
    @@ -112,8 +111,7 @@
    fill_table_cache(opts, select, tables, where,
    old_prefix=opts.db_table,
    cache_tables_seen=[opts.db_table],
    - max_depth=self._max_related_depth,
    - fields=self._fields)
    + max_depth=self._max_related_depth)

    # Add any additional SELECTs.
    if self._select:


   
Web log, research lab and soft parade of Dima Dogadaylo.
Email: entropyhacker at gmail dot com