Django ORM performance patch

February 14, 2008

Django ORM is simple and obvious, but when from tutorial models project moved to real-world models, same ORM often becomes headache. I work on big Django-project that currently have 278 models in 40 applications with a lot of relations between them. For such number of models QuerySet.select_related() is best friend and worst enemy in same time. The best friend because it allows to decrease number of SQL-queries, and worst enemy because it can't load only those fields of involved in queryset models that I need for this concrete view.

For example following query it's standard way to receive with 1 SQL query and then show 10 last blog entries titles with author names and avatars.


entries = Entry.objects.select_related(depth=1)[:10]
for e in entries:
    print e.headline, e.user.username, e.user.get_avatar_url()
 

The problem is in SQL query that Django ORM will generate for this code – it will load from database all fields of Entry and related User model for each from 10 blog entries even if you need only 3 small fields Entry.headline, User.username, User.avatar and don't need other heavy fields like Entry.content, User.about_text, etc.

A lot of resources wasted here: it's not cheap to select from database big textfield, then transfer utf-8 bytes to web-server, convert from utf-8 to unicode, allocate kilobytes of memory in python process, and finally run garbage collector that will find that those already megabytes of memory (for each object from object list for each client request) were not need at all and it safe now to spend some more system resources for deleting these never used objects.

Current QuerySet.values() can't help here because it can't select fields from related models and often we need model instances only and not dictionaries.

Special lightweight models (when you bind to same database table several Django models with different field subsets of a heavy model) results in code duplication and require to implement lightweight models for all related heavy models also, and actually it's not web development done right.

Returning back to old good SQL is possible but it's "not right" and in this case you also lose business logic already defined in models.

I checked out queryset-ref Django branch and didn't find solution for above problem. This branch makes step ahead and allows to control what related models to join and what don't to join, but it don't give me what I need – full control on the loaded fields without loosing advantages of models.

Then I found 2 Django tickets #5420: Allow database API users to specify the fields to exclude in a SELECT statement and #5768: Allow QuerySet.values() to return values spanning joins. This tickets address same problem but unfortunately it were assigned to nobody, so I created the patch and going to attach it to above tickets.

You can download the patch here, it is quite simple but allows to write queries that loads only those fields that you requested (and plan to use in view or template). Above example with new fields() method can be rewritten as:


entries = Entry.objects.fields('headline', user=('username', 'avatar'))[:10]
for e in entries:
    print e.headline, e.user.username, e.user.get_avatar_url()

The patch include tests/modeltests/fields/models.py with examples of usage of new methods and some docs. The patch is backward compatible and don't fail any test from Django test site. The patch also change signature and implementation of current QuerySet.values() method improving it with ability to select fields in related models also. It may look like:


entries = Entry.objects.values('headline', user=('username',))[:10]
for e in entries:
    print e['headline'], e['user']['username']

Both values() and fields() use same logic, only return format is different, so I also removed with this patch ValuesQuerySet that isn't need.

And a final note about lazy loading of field, that was mentioned in the Django ticket #5420. Proposed patch don't support and from my experience of using Django in big project lazy field loading is a dangerous feature. When different developers write views and templates and same code can be modified by many developers, it's very easy to find eventually in a template single not-optimal line of code that for objects list from 10 items will generate 20 heavy SQL queries. It may look like:

{% for room in room_list %}
Blog {{ blog.name }} have {{ blog.entry_set.all|length }} 
post{{ blog.entry_set.all|length|pluralize }}
{% endfor %}

With lazy field loading we can have same problem. For example frontend developer can add to a template {{ entry.summary }} // {{ entry.date }} that he copy-pasted from other template used by other view that loaded these fields. If new view doesn't load these fields, with lazy field loading you will have 2 extra SQL queries per each blog entry on the page. When number of developers exceeds 10, it very hard to prevent such problems. If in a template new model fields are need, frontend developer should ask backend developer instead of using of lazy field loading.

I tested the patch with Django trunk in my environment and all 234 available tests were passed. If the patch will broke something in other environments, please let me know in comments.

Tags: concept  django  performance  sql 

WebAlchemy vs StaticGenerator

January 09, 2008

When two months ago I published WebAlchemy code, several people asked me how to make it working with nginx or lighttpd. Yesterday new tool similar to WebAlchemy StaticGenerator was published. Now it seems right time to explain how to use WebAlchemy with nginx and other servers that don't have .htaccess files, highlight some important differences between WebAlchemy and StaticGenerator and say thanks to Django for the signal framework that makes it all possible.

I designed WebAlchemy to support publishing of resources of any content type with any URL scheme and ability to be used in RESTful environment when GET-requests to same url are served as static requests, but POST-, PUT- and other requests are served by Django. It means that for example request to /feed/comments/ that also served by WebAlchemy on this site should return application/rss+xml content type, but request to normal page like /about/ should return text/html; charset=utf-8 content type. It's not possible to detect content type from url itself, so web-server should use resolved filename extension and we end with situation when we can't have single index file for directories and sometimes need to use index.html, sometimes index.xml, sometimes index.pdf, etc.

Other issue and important difference between WebAlchemy and StaticGenerator is additional load that each tool creates. StaticGenerator generates static versions of changed pages every time when master model is changed and doesn't take into account transaction, so same path may be transformed even several times during a transaction. WebAlchemy don't create additional load on web-server, it don't generates static version especially, but uses generated in result of real user request content (all future requests of this url are served as static).

Both WebAlchemy and StaticGenerator use post_save Django signals but WebAlchemy process all signals at once when request is completed and so it's able to remove duplicates (when changes in several models affects same path) and can't create problems for monitored models. However StaticGenerator generates static versions of resources for each signal, and if during generating of the resource an error will occur, saving of original model that fired the signal will be failed also.

So if you don't want to use .htaccess files, to resolve above problems you need to address different content types issue. For example for all non-html content it's possible to use URLs with file extension (/feed.rss2 instead of /feed/). It solves the problem, but limits you a bit in choosing of URL format.

When content types issues are resolved, you can use WebAlchemy with any web-server. You just need to use your trivial publisher that don't deal with .htaccess files. For example, you can add to publishers.py form WebAlchemy following TrivialPublisher or wrote your own.


class TrivialPublisher(object):
    """ Save/delete only page content and don't deal with .htaccess.
    This publisher can be used with any web-server that can check 
    presense of static file and send published version of the resource 
    if static file exists.
    The server configuration may look like (used nginx syntax):
    http {
      upstream django {
        server localhost:8000;
      }
      server {
        server_name  example.com;
        root   /var/www/;

        location / {
          if (-f $request_filename/index.html) {
            rewrite (.*) $1/index.html break;
          }
          if (-f $request_filename/index.xml) {
            rewrite (.*) $1/index.xml break;
          }
          if (!-f $request_filename) {
            proxy_pass http://django;
            break;
          }
        }
      }
    }
    """

    def __init__(self, root):
        self.root = root

    def resolve_path(self, path):
        """Return cached path to be used in mod_rewrite for given path.
        >>> p = TrivialPublisher('/var/www/')
        >>> p.resolve_path('/')
        '/var/www/index.html'
        >>> p.resolve_path('/dir/')
        '/var/www/dir/index.html'
        >>> p.resolve_path('/dir/noext')
        '/var/www/dir/noext'
        >>> p.resolve_path('/dir/file.css')
        '/var/www/dir/file.css'
        """
        if path[-1] == '/':
            path += "index.html"
        return os.path.join(self.root, path[1:])

    def save_file(self, path, response):
        fpath = self.resolve_path(path)
        if not os.path.exists(os.path.dirname(fpath)):
            os.makedirs(os.path.dirname(fpath))
        write_file(fpath, response.content)

    def delete_file(self, path):
        fpath = self.resolve_path(path)
        remove_file(fpath)

Nothing in original WebAlchemy core code was changed so you can use it. Link to source code and instructions for WebAlchemy configuration you can find at original blog entry WebAlchemy accelerates Django in 100 times.

Web log, research lab and soft parade of Dima Dogadaylo.
Email: entropyhacker at gmail dot com