Best practices for using Databases with Django ORM (Object-Relational Mapper)

Handling data is a big part of developing applications with Django. To do that, we use Django’s ORM, from retrieving a specific set of instances to updating several rows in the database.

It’s no surprise then that the way we make those queries can significantly impact our app’s performance. So, in this blog post, we will look into some valuable functionalities provided by Django.  

We will also provide some best practices to keep in mind when writing our queries and how we can combine them to maximize the efficiency of our application.

Django ORM and Querysets

A queryset is a collection of data from the database. The Django ORM allows us to easily make queries in Python without the need of using the SQL language.

Querysets are lazy, meaning that just creating a queryset does not trigger any database hits. You can add filters and modify the same queryset many times, but it will only access the database once it needs to be evaluated.

https://gist.github.com/brunomichetti/d0d5430bc8b0181acdb2ea0a7fa103f3

In this example, even though we create and manipulate the queryset more than once, the only time when Django accesses the database is in the print function.

Each queryset contains a cache to minimize database access:

  1. The first time the queryset is evaluated (the first time we hit the database with that queryset), the cache is empty; after that, Django saves the result.
  2. If the queryset is again evaluated, the cache will be used instead of hitting the database.

Django provides an explain function that shows the execution plan of the queryset:

https://gist.github.com/brunomichetti/ec58553fc6c14e8d8846bd1508c5e732

This is useful when you want to analyze queries that you may need to optimize.

Best practices for using Querysets

The following is a list of good practices to take into consideration to be efficient when using querysets:

  • Try to reduce the number of hits to the database as much as possible.
  • When working on data from queries, try to make the database do as much work as possible instead of handling that work yourself with python code (sorting, comparing fields, etc.).
  • If you know what fields you will need, try to fetch only those fields instead of all of them for each instance.
  • Don’t trigger hits to the database if it’s not necessary (for example, to get an empty queryset).
  • Create a database index for any unique field you know you will frequently use to search instances.
  • If you are working not only with a particular model instance but also with instances of models related to it, see if you can use select_related and prefetch_related to improve your queries.
  • When you have an instance of a given model A in memory, and you want to access the id of a related model instance (let’s call it model B), use instance_a.instance_b_id instead of instance_a.instance_b.id . Why? When you define a foreign key from A to B, you already have the id of the model B instance in the model A table as an attribute. In the next section, we will explain this a little bit more.
  • For updating/creating multiple instances of a Model, it’s better to do it as a bulk operation by using bulk_update or bulk_create, respectively, instead of one by one. Consider that using these functions doesn’t trigger the save or create methods, so any custom code in those methods will not be executed.

Next, we will show a defined reality and introduce examples for all concepts. After that, we will go into more detail regarding best practices.

Reality example

Our reality example here is the Target app:

  • Users register in the system with their email.
  • They also have optional data: name, username, date of birth, and zodiac sign.
  • They can create/update at most ten targets on an actual map.
  • Each target has a title, topic, radius in meters, and location.
  • When a user creates a target in the map in a given location with a given radius, the system checks if there are other targets from other users that have the same topic and intersecting area. If that happens, a match is created between those owner users. The app then allows them to know each other.
  • Each user has a list of matched users (that user has at least one match with him).

The following is an example image of matching targets with the intersecting area and the same topic:

Reality example - Target app
Target app

Models and relationships

The following is a simple diagram showing the models and relationships:

Diagram highlighting the models and relationships
Models and relationships

Models definition

Here are the models defined from reality example:

https://gist.github.com/brunomichetti/b7367342a75620d03308d6fca6f75499

Examples of good practices

Seeing SQL statement/s triggered by a queryset

To see what SQL query is going to be executed for a particular Django queryset you have two options (both require you to launch a python console using python manage.py shell or python manage.py shell_plus if you have the django-extensions library installed):

Option 1

You can use either print(queryset.query) or str(queryset.query):

https://gist.github.com/brunomichetti/d0e4787cffa36a8233188f20508cbec6

Option 2 (the Django DEBUG setting must be set to True)

You can use the queries attribute of the connection module from django.db:

https://gist.github.com/brunomichetti/1ca76c938ca8447a0b1a7442153bcd24

connection.queries will display a list of dictionaries with the following keys:

  • sql, whose value is the SQL statement
  • time, whose value is how long the statement took to execute (in seconds)
Example:
https://gist.github.com/brunomichetti/0b32a0f9701da575823d66d1dda977e8

All queries executed since the start of the session will be listed; it doesn’t matter when the connection module was imported. If at any point you want to clear the queries list, you can use the reset_queries function:

https://gist.github.com/brunomichetti/c3831f57111a23ca9d786712300bdc16

Using these tools, you can also count how many hits to the database a particular queryset triggers. You can do this by looking at the difference in the value of len(connection. queries) before and after the queryset is executed, or by clearing the queryset list and then looking at the importance of len(connection.queries) after evaluating the queryset.

Count vs len

As a rule of thumb, if you only want to know how many elements are in a queryset, it’s best to use the queryset count function instead of using len(queryset), except if the queryset was already evaluated. In that case, the count will return the len of the queryset’s cached value).

For example, let’s try to fetch how many users we have in our database:

https://gist.github.com/brunomichetti/716458c29b18656015c40106fbeea4e9

As you can see, using len(queryset) first evaluates the queryset, so it fetches all the users and then calculates the length of the collection. 

In contrast, when using the count function, the users are counted at the database level, which is more efficient memory-wise since you avoid every user instance being loaded into memory and is also generally faster.

Fetching only necessary fields

Sometimes, you may want to use only a specific subset of fields from a model. In those cases, it may be a good idea to only get those fields for every row in the database, so you avoid fetching and loading into memory data that you will not use.

As how there are many options. Using our previous example with the user model, let’s say you want to retrieve only the id and email fields; you could use:

  • User.objects.values(“id”, “email”)
  • User.objects.only(“id”, “email”)
  • User.objects.values_list(“id”, “email”)
https://gist.github.com/brunomichetti/148f77deadf46b040bc2457f8919030b

As you can see, all the options presented execute the same SQL statement under the hood. The difference is that one returns a list of dictionaries and the other a list of tuples, with the last one a list of model instances.

It’s worth noting that you can still access fields other than id and email in the instances fetched with only, but a different SQL query will be triggered specifically to fetch the value of that field if you do so.

In case you want to select all the fields except a minor subset of them, you can use defer.

Empty queryset without hitting the database

In some specific cases, you could need an empty queryset, and a natural solution for this would be to make a queryset using a filter with the condition that no model instances could meet.

This would not be the best way since you could still hit the database at least once to evaluate the queryset.

The best way would be using the querysets none function, which provides you with an empty queryset without triggering any database hits. But, if you need to use a filter to generate your empty queryset, you can use the condition pk__in=[], any model instance can’t meet this condition, and the ORM is smart enough to realize this and not make any hits to the database.

https://gist.github.com/brunomichetti/63f82180397af98d21a3cb550113c068

When a model A has a ForeignKey field to a model B, every instance of A has a field called b_id (name derived from the name of the related model in snake case + the word “id”). 

So, when we fetch an instance of A from the database we already have the id of its related B instance available. One can wonder if there is a difference then between using a.b_id and a.b.id.

Let’s try with an example from our reality:

https://gist.github.com/brunomichetti/e60f08b32b6062f47fd82d39a66f9f64

As we can see, when accessing the user id through target.user_id, we trigger no SQL query since this value is part of the data we fetched from the database when we retrieved our target instance. But, when accessing target.user.id, we trigger a new SQL query to retrieve every field of the related user (unless you use the select_related function).

Bulk operations

As mentioned before, when working with several instances, it’s better to use bulk operations to reduce the number of hits to the database. 

As an example, if we want to create several users, we can do the following:

https://gist.github.com/brunomichetti/c78a51fe3156340f7acfebd940150522

This works but we are hitting the database 3 times. With bulk_create, we can do the very same thing hitting only once, defining a list of objects:

https://gist.github.com/brunomichetti/64309b581b4960045e7e8a424d304e44

In the same way, we can use bulk_update, but in this case, we must have a list of user objects already created in the database. Suppose we want this for each user that has a name but doesn’t have a username:

  • Get the name, remove spaces, and transform it to lowercase. This value will be the username.
  • As an example, if the user’s name is John Doe, then we want the username to be johndoe.

This is a solution to get this done:

https://gist.github.com/brunomichetti/41625ce7dd2f481dacc5bf771563c130

The bulk_update receives two lists:

  • The list of modified objects that you want to update. These objects already exist in the database and you have changed at least one field.
  • The list of columns that you have changed in the objects.

The corresponding SQL query:

https://gist.github.com/brunomichetti/7468793c5c9056cfd4e88b89a84f40b1

Some important considerations:

  • When using bulk_create and bulk_update, the model’s save function is not executed.
  • The bulk_create function:
    • doesn’t work with many-to-many relationships.
    • doesn’t work with child models in a multi-table inheritance scenario.
    • accepts a batch_size param in case the list of objects to create is too big. This param splits the bulk creation in more hits to the database but helps to avoid very big SQL queries.
  • The bulk_update function:
    • can’t update the primary key field.
    • if updates fields defined on multi-table inheritance ancestors will generate an extra query per ancestor.
    • if updates a large number of columns in a large number of rows, the SQL generated can be very large. Avoid this by adding a suitable batch_size param to the function. This is analogous to batch_size in bulk_create

Querysets examples

In this section, we will describe different queries examples to show efficient ways of addressing them. We will describe each case in natural language, then present the queryset and finally show the resultant SQL query.

F expressions

As mentioned, if possible it’s recommended to delegate computations to the database instead of doing them directly in memory.

F expressions help us to do precisely that, allowing us to reference database fields and letting the database handle operations with that value.

For example in the following query:

https://gist.github.com/brunomichetti/a2ccc33622364da2d3582e683ae31b1e

Filters all targets that have an updated_at value posterior to their created_at value, but the value of updated_at is never loaded into memory, it’s just referenced so the database can use it in the comparison:

As we can see in the generated SQL:

https://gist.github.com/brunomichetti/2c85b93b8725a40e2a4b37217203ee70

Another use of F expressions can be for updating database rows. For example, if we wanted to increase the created_at value for all rows by 3 days, we could retrieve all instances, modify the attribute value, and save all instances.

This would require at least 2 hits to the database, but using F expressions we can do it in just one:

https://gist.github.com/brunomichetti/b943aac74e0163494c0af9df1903f0a6

This generates the following SQL query:

https://gist.github.com/brunomichetti/d06d8d6c7bfc2fb09a0828f25226dde8

Q objects

Q objects represent conditions that can be used in database queries. They are helpful not only because they let you reuse conditions in multiple questions but also because they can be combined to express complex conditions.

A common use case for Q objects is defining an OR condition for a filter. As an example, let’s retrieve the targets whose topic is “Sports” or that contains the word sport in their title:

https://gist.github.com/JorgeMichelena/9a0f02c9a9c8f54b111b6bf4361064d8

The generated SQL is as follows:

https://gist.github.com/JorgeMichelena/321643c644b2028b3787d769f4f0f31c

N+1 problem

Imagine if we need to print for each target: the target title, and then the email of the corresponding owner user. This can be a solution:

https://gist.github.com/JorgeMichelena/1516885d92552ea93d49d7236120fd9b

This is a working solution, but how many times does it access the database? Imagine that N is the number of targets in the system:

  • Accesses one time to get all the targets ( Target.objects.all() : 1)
  • For each target accesses one time to get the email of the owner user (target.user.email : N)
  • To get the title doesn’t need to access the database because it’s brought in the first query.

That is a total of N + 1 hits. As mentioned, it’s important to reduce the number of hits to the database. If N is a huge number, then we are accessing it a lot of times and this will generate a performance issue. This is known as the N + 1 problem. 

Let’s look at a way of addressing this with select_related and prefetch_related.

Let’s see the definition in Django docs:

This returns a QuerySet that will “follow” foreign-key relationships, selecting additional related-object data when it executes its query. This performance booster results in a single, more complex question but means later use of foreign-key relationships won’t require database queries.

Continuing the example, the solution would be this:

https://gist.github.com/JorgeMichelena/0f32ad5f7fde61ddc6a0b3757a75bbc1

Now, the output is the same as before, but we only access the database once with this solution. That is because select_related brought in the same query, each user’s email. We can use selected_related with foreign-key and one-to-one relationships fields. Let’s see now how to do this when we have many-to-many relationships.

Now we need to print for each user: the user id and the emails of the related users that matched with him (remember we have the mateched_users relationship). Let’s see the definition in the Django docs:

This returns a QuerySet that will automatically retrieve, in a single batch, related objects for each of the specified lookups.

So, it has the same purpose as select_related but it has a different strategy. The solution to the mentioned problem using the function is this:

https://gist.github.com/JorgeMichelena/8fbcbe4eefa6cecd3d53cd9475cba219

This function accesses two times to the database:

  • One time to get all the users (User.objects.all() : 1).
  • A second time to get the matched users of each user (.prefetch_related(“matched_users”) : 1).

If we didn’t use the prefetch_related function in the previous solution, we would have the database accessed N + 1 times (being N the number of users in the system).

Quick summary:

  • select_related solves the N + 1 problem accessing only one time to the database. It’s used for foreign-key and one-to-one relationships fields.
  • prefetch_related solves the N + 1 problem accessing only two times to the database. It’s used for many-to-many relationships fields.

Aggregate, Max, and Avg

Sometimes we need to get information from the database that results from aggregate data of different rows. Let’s see an example.

Here, we want to get the maximum radius of a target in the database:

https://gist.github.com/JorgeMichelena/d8b9b1fb37052f35a961f202c7b8eb09

And the SQL is:

https://gist.github.com/JorgeMichelena/7627927d00f43fa1ab15e532d3ac831d

The result is a dictionary with the specified key:

https://gist.github.com/JorgeMichelena/7140dd6b1ad94e5bcef36a18bdd3794f

Now, we need to know the difference between the highest radius and the average of the radius of all the targets:

https://gist.github.com/JorgeMichelena/e89296779efab860c3673d34cf8619e4

The return value is also a dictionary:

https://gist.github.com/JorgeMichelena/132e316cab1c21abd036ab69e7ca2816

And the SQL:

https://gist.github.com/JorgeMichelena/357c1fc59fc76bf1844fe430a690a8ba

There are other aggregation functions, such as SUM, COUNT, MIN. Take a look at the Django docs if you want to know more.

Most used topic in targets

If we want to know the topic most targets use, we can use the function Count and annotate the value in the queryset. Then, sort by that annotated value and get the first element in the result (this would be the topic that most targets use since we sorted by the topic count). 

annotate adds information (depending on the number of arguments it receives) to every element of the queryset. 

An annotation can be a value, a reference to a field on the model, or an aggregate expression (average, sum, etc.), and then it can be accessed as an attribute of each instance and used in functions like filter or order_by.

The query would look something like this:

https://gist.github.com/JorgeMichelena/b170dbbef472d78e349992d976eac9a3

Note that we must use the values function for this, since the Count only takes into account the values of topic for each particular database row, by using values we generate a GROUP BY as part of the SQL allowing us to count all different values of topic.

https://gist.github.com/JorgeMichelena/aca9bdde4853932efdbc2dd646730e7f

Sort tips

Sorting is an essential part of making queries. Let’s look at some particular sorting instances and how to achieve them with Django.

order_by with two arguments vs. chaining 2 order_by with different arguments

Let’s say we want to sort all users by two attributes: their date of birth and their name. We would like to get date_of_birth as the first criteria for the ordering, and if more than one user shares the same value for that field, use the name field to sort them relative to each other.

Some queryset functions, such as filters, can be chained with different arguments, and the result is the same as applying only one filter function with all the views at once. 

But, ordery_by doesn’t work like that; in effect, the following query:

https://gist.github.com/JorgeMichelena/b273afdc4006501aac3b5c3a59ca7f87

Doesn’t yield the same result as this query:

https://gist.github.com/JorgeMichelena/ddbd0ac44c4a6d49f7a9d5fc09df2b2d

For the first one, we get the SQL we would expect:

https://gist.github.com/JorgeMichelena/11d3f31ddb3224e09b5d8b21dccbb602

With an ORDER BY with the two fields we provided. In contrast, the second query produces:

https://gist.github.com/JorgeMichelena/6d754132eec65950022cebb55cb569b9

As we can see, we get the user only sorted by name. In general, when chaining multiple order_by calls, the one that will be applied will be the last one.

Randomly sorting queries

To get a queryset with all the instances in a random disposition, we can sort them using python code, but the Django ORM gives us the possibility of letting the database do it for us by using ? as an argument for order_by:

https://gist.github.com/JorgeMichelena/572c5d9a0d0d03e3ca1b9605c80d0913

The resulting SQL is as follows:

https://gist.github.com/JorgeMichelena/0a9ff61e33853623bb5e9e9245e86c5e

It is worth noting that for some databases this random sorting can be slow, so be careful when deciding to use this functionality.

Sort targets by the number of matches

If we want to sort by the number of matches we could do something similar to what we did to get the most used topic value by using Count with a couple of differences:

  • Because we were counting the number of occurrences of a value in all the rows in a table we needed to use values in our query to generate a GROUP BY SQL instruction. In this case, because we are interested in counting the number of related Match instances for each Target instance we don’t need to use values. When using Count on a foreign-key or many-to-many field the SQL query generated already uses GROUP BY on the target’s id.
  • Since a Match can be related to two targets and thus has two ForeignKey fields, we need to annotate a field and assign it the sum of two Counts as its value.
https://gist.github.com/JorgeMichelena/ffb11721719271c9d116cd65899b2ee7

The SQL query:

https://gist.github.com/JorgeMichelena/70cdf57a6478cd7a5b1a775faef468e0

Segment query depending on condition

Another case where we can use order_by is when we want to retrieve a set of rows from the database but order them so that all rows fulfilling a condition appear before those that don’t fulfill the said condition. 

This could be useful, for example, to show the users their targets, showing first the ones that have a match and therefore offer an opportunity for more interaction with the app.

We can use annotate to create a boolean field and then sort by that field as follows:

https://gist.github.com/JorgeMichelena/b76c1a0baa30a526113f10b7b0189e0a

In this case, the query used generates 3 SQL queries, one to retrieve the targets and 2 others for the related entities in the prefetch_related

https://gist.github.com/JorgeMichelena/c1487d9c4b3a72256cadbbf00df11781

Using datetimes

Sometimes you may want to retrieve instances from the database created or updated at a particular time but without using all the precision that a timestamp usually provides (seconds, milliseconds, etc).

In such a case, something like filtering by crated_at=now() wouldn’t be ideal, so we recommend going the following path:

https://gist.github.com/JorgeMichelena/3e74ee2090575a1d1f7f742aee70539d

SQL query:

https://gist.github.com/JorgeMichelena/52b2492e72d00f65d3583a6f7c3bf167

As can be seen, the database is tasked with separating the data by its components (date, hour, and minute).

Matches query

Now, let’s implement the query at the heart of our app’s main feature. When a new target is created (named in the code as my_target) we will retrieve all targets that are a match for it.

That is, all targets whose areas overlap with this new target and that also share the same topic but don’t belong to the same user.

https://gist.github.com/JorgeMichelena/ad92f9d3763540190cfdf0b6e7218d2e

As you can see, we have used an annotation to store the value of the distance between our new target and all existing targets, and also an F expression to add the radius of each target in the database with that of our new target.

The Distance function we imported and used lets the database handle the calculation of the distance between two points.

The resulting SQL is the following:

https://gist.github.com/JorgeMichelena/d60d21b140d2e6f9fb4702009555f476

Here, ST_DistanceSphere is the function we generated by using Distance in our query.

Use different cases

Now, let’s say we want to be able to give a particular score to users by their activity on the app and by their compatibility with other users. We then could sort the users by this score and use it as a metric for a future feature.

Let’s take a given user (we will refer to it as my_user) as a reference and score every other user with the following criteria:

  • It will receive 5 points if it has any matches
  • It will receive another 5 points if it has more than 5 matches
  • It will receive 5 points if its zodiac sign is the same as my_user’s zodiac sign
https://gist.github.com/JorgeMichelena/67571dd12d2319d58f2ae6f591a9001b

Note: you can compare fields in the query using <field_name>__<compare_expression>, where the <compare_expression> can be:

  • gt: Greater than
  • gte: Greater or equal than
  • In analogous way with less: lt and lte
  • exact: equal for text (key sensitive)
  • iexact: equal for text (key insensitive)

There are more <compare_expression>. Take a look at the Django docs to know more.

The generated SQL code is:

https://gist.github.com/JorgeMichelena/33d77db6821823ca14abf3818febf6af

Union, difference, and intersection

The union, difference, and intersection functions are used when working with more than one query, and we need to combine them.

As mentioned, we should try to retrieve the needed data in a unique query to minimize the hits to the database. But, we explain these functions here if you don’t have another choice but to work with two or more queries.

Let’s assume we have an already evaluated queryset from the users’ table called q1. This queryset has the users having Scorpio zodiac signs. Now we want:

All users with the Scorpio zodiac sign and those 21 years old or older (despite the zodiac sign). The queryset using union is:

https://gist.github.com/JorgeMichelena/c6cadba194cfff4cdde7ed3745acb697

Now, the resultant SQL query:

https://gist.github.com/JorgeMichelena/a7e53263cfb4e87c19406b106af54a46

Continuing with q1, suppose we want the users that have the Scorpio zodiac sign, but we want to remove from q1 those who have an empty username (equal to null). The queryset would be:

https://gist.github.com/JorgeMichelena/289dbca7e20b4527876a3973a0db330a

And the SQL query:

https://gist.github.com/JorgeMichelena/f6ab5580486cdba10bd8efef023716e5

And for the final case, let’s use intersect. Suppose we want the users that have the Scorpio zodiac sign, but we want to get only those in q1 who were born in October:

https://gist.github.com/JorgeMichelena/3fe63bcf05a378ceb6c9e7fd0ab71e50

The resultant SQL query:

https://gist.github.com/JorgeMichelena/67f66c12fff194ee65ec1b176d6a2a11

Some notable considerations:

  • Remember that it’s essential to try to make only one query. In this case, we assumed that we had q1 already evaluated to explain the union, difference, and intersection. Anyway, you may not have another choice but to work with more than one query.
  • When you compone queries using the mentioned functions, please take into account that may be Django versions or Database engines may fail when slicing them.

Summary

This article presented good practices when using databases with the Django ORM. In the software development area, the performance of the applications is critical, and the use of databases can be highly related to that, hence why we focused on their use and access of them.

Besides general tips and best practices, we presented an example of reality and queried to show tools and performant ways to use them. Also, for each case, we indicated the corresponding SQL query to understand better what Django does behind the scenes.

The main conclusion here is that the Django ORM is a powerful tool to work with databases without the need to know the SQL language and provides different ways to do the same. For that reason, we enumerated good practices.

Co-authors:

Bruno Michetti & Jorge Michelena

Want to work with Bruno Michetti?
We are hiring :)

When you think of design thinking, what comes to mind? You might imagine stock images of people in suits smiling […]
November 25
5 min read
Slicing and atomization have become popular terms in the software industry, but many developers may need help understanding their true […]
November 22
5 min read
In recent years, automating software quality tests has become widely popular in the tech industry. Current automation tools allow us […]
November 21
5 min read
While I was in a session with one of my mentors some time ago, we did an exercise called Ikigai, […]
November 18
5 min read
Did you know that 76% of internet users access social media for product research? Your target audience is likely part […]
November 9
5 min read