Handling data is a big part of developing applications with Django. To do that, we use Django’s ORM, from retrieving a specific set of instances to updating several rows in the database.
It’s no surprise then that the way we make those queries can significantly impact our app’s performance. So, in this blog post, we will look into some valuable functionalities provided by Django.
We will also provide some best practices to keep in mind when writing our queries and how we can combine them to maximize the efficiency of our application.
Django ORM and Querysets
A queryset is a collection of data from the database. The Django ORM allows us to easily make queries in Python without the need of using the SQL language.
Querysets are lazy, meaning that just creating a queryset does not trigger any database hits. You can add filters and modify the same queryset many times, but it will only access the database once it needs to be evaluated.
In this example, even though we create and manipulate the queryset more than once, the only time when Django accesses the database is in the print function.
Each queryset contains a cache to minimize database access:
- The first time the queryset is evaluated (the first time we hit the database with that queryset), the cache is empty; after that, Django saves the result.
- If the queryset is again evaluated, the cache will be used instead of hitting the database.
Django provides an explain function that shows the execution plan of the queryset:
This is useful when you want to analyze queries that you may need to optimize.
Best practices for using Querysets
The following is a list of good practices to take into consideration to be efficient when using querysets:
- Try to reduce the number of hits to the database as much as possible.
- When working on data from queries, try to make the database do as much work as possible instead of handling that work yourself with python code (sorting, comparing fields, etc.).
- If you know what fields you will need, try to fetch only those fields instead of all of them for each instance.
- Don’t trigger hits to the database if it’s not necessary (for example, to get an empty queryset).
- Create a database index for any unique field you know you will frequently use to search instances.
- If you are working not only with a particular model instance but also with instances of models related to it, see if you can use select_related and prefetch_related to improve your queries.
- When you have an instance of a given model A in memory, and you want to access the id of a related model instance (let’s call it model B), use instance_a.instance_b_id instead of instance_a.instance_b.id . Why? When you define a foreign key from A to B, you already have the id of the model B instance in the model A table as an attribute. In the next section, we will explain this a little bit more.
- For updating/creating multiple instances of a Model, it’s better to do it as a bulk operation by using bulk_update or bulk_create, respectively, instead of one by one. Consider that using these functions doesn’t trigger the save or create methods, so any custom code in those methods will not be executed.
Next, we will show a defined reality and introduce examples for all concepts. After that, we will go into more detail regarding best practices.
Our reality example here is the Target app:
- Users register in the system with their email.
- They also have optional data: name, username, date of birth, and zodiac sign.
- They can create/update at most ten targets on an actual map.
- Each target has a title, topic, radius in meters, and location.
- When a user creates a target in the map in a given location with a given radius, the system checks if there are other targets from other users that have the same topic and intersecting area. If that happens, a match is created between those owner users. The app then allows them to know each other.
- Each user has a list of matched users (that user has at least one match with him).
The following is an example image of matching targets with the intersecting area and the same topic:
Models and relationships
The following is a simple diagram showing the models and relationships:
Here are the models defined from reality example:
Examples of good practices
Seeing SQL statement/s triggered by a queryset
To see what SQL query is going to be executed for a particular Django queryset you have two options (both require you to launch a python console using python manage.py shell or python manage.py shell_plus if you have the django-extensions library installed):
You can use either print(queryset.query) or str(queryset.query):
Option 2 (the Django DEBUG setting must be set to True)
You can use the queries attribute of the connection module from django.db:
connection.queries will display a list of dictionaries with the following keys:
- sql, whose value is the SQL statement
- time, whose value is how long the statement took to execute (in seconds)
All queries executed since the start of the session will be listed; it doesn’t matter when the connection module was imported. If at any point you want to clear the queries list, you can use the reset_queries function:
Using these tools, you can also count how many hits to the database a particular queryset triggers. You can do this by looking at the difference in the value of len(connection. queries) before and after the queryset is executed, or by clearing the queryset list and then looking at the importance of len(connection.queries) after evaluating the queryset.
Count vs len
As a rule of thumb, if you only want to know how many elements are in a queryset, it’s best to use the queryset count function instead of using len(queryset), except if the queryset was already evaluated. In that case, the count will return the len of the queryset’s cached value).
For example, let’s try to fetch how many users we have in our database:
As you can see, using len(queryset) first evaluates the queryset, so it fetches all the users and then calculates the length of the collection.
In contrast, when using the count function, the users are counted at the database level, which is more efficient memory-wise since you avoid every user instance being loaded into memory and is also generally faster.
Fetching only necessary fields
Sometimes, you may want to use only a specific subset of fields from a model. In those cases, it may be a good idea to only get those fields for every row in the database, so you avoid fetching and loading into memory data that you will not use.
As how there are many options. Using our previous example with the user model, let’s say you want to retrieve only the id and email fields; you could use:
- User.objects.values(“id”, “email”)
- User.objects.only(“id”, “email”)
- User.objects.values_list(“id”, “email”)
As you can see, all the options presented execute the same SQL statement under the hood. The difference is that one returns a list of dictionaries and the other a list of tuples, with the last one a list of model instances.
It’s worth noting that you can still access fields other than id and email in the instances fetched with only, but a different SQL query will be triggered specifically to fetch the value of that field if you do so.
In case you want to select all the fields except a minor subset of them, you can use defer.
Empty queryset without hitting the database
In some specific cases, you could need an empty queryset, and a natural solution for this would be to make a queryset using a filter with the condition that no model instances could meet.
This would not be the best way since you could still hit the database at least once to evaluate the queryset.
The best way would be using the querysets none function, which provides you with an empty queryset without triggering any database hits. But, if you need to use a filter to generate your empty queryset, you can use the condition pk__in=, any model instance can’t meet this condition, and the ORM is smart enough to realize this and not make any hits to the database.
Access the id of a related model instance
When a model A has a ForeignKey field to a model B, every instance of A has a field called b_id (name derived from the name of the related model in snake case + the word “id”).
So, when we fetch an instance of A from the database we already have the id of its related B instance available. One can wonder if there is a difference then between using a.b_id and a.b.id.
Let’s try with an example from our reality:
As we can see, when accessing the user id through target.user_id, we trigger no SQL query since this value is part of the data we fetched from the database when we retrieved our target instance. But, when accessing target.user.id, we trigger a new SQL query to retrieve every field of the related user (unless you use the select_related function).
As mentioned before, when working with several instances, it’s better to use bulk operations to reduce the number of hits to the database.
As an example, if we want to create several users, we can do the following:
This works but we are hitting the database 3 times. With bulk_create, we can do the very same thing hitting only once, defining a list of objects:
In the same way, we can use bulk_update, but in this case, we must have a list of user objects already created in the database. Suppose we want this for each user that has a name but doesn’t have a username:
- Get the name, remove spaces, and transform it to lowercase. This value will be the username.
- As an example, if the user’s name is John Doe, then we want the username to be johndoe.
This is a solution to get this done:
The bulk_update receives two lists:
- The list of modified objects that you want to update. These objects already exist in the database and you have changed at least one field.
- The list of columns that you have changed in the objects.
The corresponding SQL query:
Some important considerations:
- When using bulk_create and bulk_update, the model’s save function is not executed.
- The bulk_create function:
- doesn’t work with many-to-many relationships.
- doesn’t work with child models in a multi-table inheritance scenario.
- accepts a batch_size param in case the list of objects to create is too big. This param splits the bulk creation in more hits to the database but helps to avoid very big SQL queries.
- The bulk_update function:
- can’t update the primary key field.
- if updates fields defined on multi-table inheritance ancestors will generate an extra query per ancestor.
- if updates a large number of columns in a large number of rows, the SQL generated can be very large. Avoid this by adding a suitable batch_size param to the function. This is analogous to batch_size in bulk_create
In this section, we will describe different queries examples to show efficient ways of addressing them. We will describe each case in natural language, then present the queryset and finally show the resultant SQL query.
As mentioned, if possible it’s recommended to delegate computations to the database instead of doing them directly in memory.
F expressions help us to do precisely that, allowing us to reference database fields and letting the database handle operations with that value.
For example in the following query:
Filters all targets that have an updated_at value posterior to their created_at value, but the value of updated_at is never loaded into memory, it’s just referenced so the database can use it in the comparison:
As we can see in the generated SQL:
Another use of F expressions can be for updating database rows. For example, if we wanted to increase the created_at value for all rows by 3 days, we could retrieve all instances, modify the attribute value, and save all instances.
This would require at least 2 hits to the database, but using F expressions we can do it in just one:
This generates the following SQL query:
Q objects represent conditions that can be used in database queries. They are helpful not only because they let you reuse conditions in multiple questions but also because they can be combined to express complex conditions.
A common use case for Q objects is defining an OR condition for a filter. As an example, let’s retrieve the targets whose topic is “Sports” or that contains the word sport in their title:
The generated SQL is as follows:
Imagine if we need to print for each target: the target title, and then the email of the corresponding owner user. This can be a solution:
This is a working solution, but how many times does it access the database? Imagine that N is the number of targets in the system:
- Accesses one time to get all the targets ( Target.objects.all() : 1)
- For each target accesses one time to get the email of the owner user (target.user.email : N)
- To get the title doesn’t need to access the database because it’s brought in the first query.
That is a total of N + 1 hits. As mentioned, it’s important to reduce the number of hits to the database. If N is a huge number, then we are accessing it a lot of times and this will generate a performance issue. This is known as the N + 1 problem.
Let’s look at a way of addressing this with select_related and prefetch_related.
Let’s see the definition in Django docs:
This returns a QuerySet that will “follow” foreign-key relationships, selecting additional related-object data when it executes its query. This performance booster results in a single, more complex question but means later use of foreign-key relationships won’t require database queries.
Continuing the example, the solution would be this:
Now, the output is the same as before, but we only access the database once with this solution. That is because select_related brought in the same query, each user’s email. We can use selected_related with foreign-key and one-to-one relationships fields. Let’s see now how to do this when we have many-to-many relationships.
Now we need to print for each user: the user id and the emails of the related users that matched with him (remember we have the mateched_users relationship). Let’s see the definition in the Django docs:
This returns a QuerySet that will automatically retrieve, in a single batch, related objects for each of the specified lookups.
So, it has the same purpose as select_related but it has a different strategy. The solution to the mentioned problem using the function is this:
This function accesses two times to the database:
- One time to get all the users (User.objects.all() : 1).
- A second time to get the matched users of each user (.prefetch_related(“matched_users”) : 1).
If we didn’t use the prefetch_related function in the previous solution, we would have the database accessed N + 1 times (being N the number of users in the system).
- select_related solves the N + 1 problem accessing only one time to the database. It’s used for foreign-key and one-to-one relationships fields.
- prefetch_related solves the N + 1 problem accessing only two times to the database. It’s used for many-to-many relationships fields.
Aggregate, Max, and Avg
Sometimes we need to get information from the database that results from aggregate data of different rows. Let’s see an example.
Here, we want to get the maximum radius of a target in the database:
And the SQL is:
The result is a dictionary with the specified key:
Now, we need to know the difference between the highest radius and the average of the radius of all the targets:
The return value is also a dictionary:
And the SQL:
There are other aggregation functions, such as SUM, COUNT, MIN. Take a look at the Django docs if you want to know more.
Most used topic in targets
If we want to know the topic most targets use, we can use the function Count and annotate the value in the queryset. Then, sort by that annotated value and get the first element in the result (this would be the topic that most targets use since we sorted by the topic count).
annotate adds information (depending on the number of arguments it receives) to every element of the queryset.
An annotation can be a value, a reference to a field on the model, or an aggregate expression (average, sum, etc.), and then it can be accessed as an attribute of each instance and used in functions like filter or order_by.
The query would look something like this:
Note that we must use the values function for this, since the Count only takes into account the values of topic for each particular database row, by using values we generate a GROUP BY as part of the SQL allowing us to count all different values of topic.
Sorting is an essential part of making queries. Let’s look at some particular sorting instances and how to achieve them with Django.
order_by with two arguments vs. chaining 2 order_by with different arguments
Let’s say we want to sort all users by two attributes: their date of birth and their name. We would like to get date_of_birth as the first criteria for the ordering, and if more than one user shares the same value for that field, use the name field to sort them relative to each other.
Some queryset functions, such as filters, can be chained with different arguments, and the result is the same as applying only one filter function with all the views at once.
But, ordery_by doesn’t work like that; in effect, the following query:
Doesn’t yield the same result as this query:
For the first one, we get the SQL we would expect:
With an ORDER BY with the two fields we provided. In contrast, the second query produces:
As we can see, we get the user only sorted by name. In general, when chaining multiple order_by calls, the one that will be applied will be the last one.
Randomly sorting queries
To get a queryset with all the instances in a random disposition, we can sort them using python code, but the Django ORM gives us the possibility of letting the database do it for us by using ? as an argument for order_by:
The resulting SQL is as follows:
It is worth noting that for some databases this random sorting can be slow, so be careful when deciding to use this functionality.
Sort targets by the number of matches
If we want to sort by the number of matches we could do something similar to what we did to get the most used topic value by using Count with a couple of differences:
- Because we were counting the number of occurrences of a value in all the rows in a table we needed to use values in our query to generate a GROUP BY SQL instruction. In this case, because we are interested in counting the number of related Match instances for each Target instance we don’t need to use values. When using Count on a foreign-key or many-to-many field the SQL query generated already uses GROUP BY on the target’s id.
- Since a Match can be related to two targets and thus has two ForeignKey fields, we need to annotate a field and assign it the sum of two Counts as its value.
The SQL query:
Segment query depending on condition
Another case where we can use order_by is when we want to retrieve a set of rows from the database but order them so that all rows fulfilling a condition appear before those that don’t fulfill the said condition.
This could be useful, for example, to show the users their targets, showing first the ones that have a match and therefore offer an opportunity for more interaction with the app.
We can use annotate to create a boolean field and then sort by that field as follows:
In this case, the query used generates 3 SQL queries, one to retrieve the targets and 2 others for the related entities in the prefetch_related
Sometimes you may want to retrieve instances from the database created or updated at a particular time but without using all the precision that a timestamp usually provides (seconds, milliseconds, etc).
In such a case, something like filtering by crated_at=now() wouldn’t be ideal, so we recommend going the following path:
As can be seen, the database is tasked with separating the data by its components (date, hour, and minute).
Now, let’s implement the query at the heart of our app’s main feature. When a new target is created (named in the code as my_target) we will retrieve all targets that are a match for it.
That is, all targets whose areas overlap with this new target and that also share the same topic but don’t belong to the same user.
As you can see, we have used an annotation to store the value of the distance between our new target and all existing targets, and also an F expression to add the radius of each target in the database with that of our new target.
The Distance function we imported and used lets the database handle the calculation of the distance between two points.
The resulting SQL is the following:
Here, ST_DistanceSphere is the function we generated by using Distance in our query.
Use different cases
Now, let’s say we want to be able to give a particular score to users by their activity on the app and by their compatibility with other users. We then could sort the users by this score and use it as a metric for a future feature.
Let’s take a given user (we will refer to it as my_user) as a reference and score every other user with the following criteria:
- It will receive 5 points if it has any matches
- It will receive another 5 points if it has more than 5 matches
- It will receive 5 points if its zodiac sign is the same as my_user’s zodiac sign
Note: you can compare fields in the query using <field_name>__<compare_expression>, where the <compare_expression> can be:
- gt: Greater than
- gte: Greater or equal than
- In analogous way with less: lt and lte
- exact: equal for text (key sensitive)
- iexact: equal for text (key insensitive)
There are more <compare_expression>. Take a look at the Django docs to know more.
The generated SQL code is:
Union, difference, and intersection
The union, difference, and intersection functions are used when working with more than one query, and we need to combine them.
As mentioned, we should try to retrieve the needed data in a unique query to minimize the hits to the database. But, we explain these functions here if you don’t have another choice but to work with two or more queries.
Let’s assume we have an already evaluated queryset from the users’ table called q1. This queryset has the users having Scorpio zodiac signs. Now we want:
All users with the Scorpio zodiac sign and those 21 years old or older (despite the zodiac sign). The queryset using union is:
Now, the resultant SQL query:
Continuing with q1, suppose we want the users that have the Scorpio zodiac sign, but we want to remove from q1 those who have an empty username (equal to null). The queryset would be:
And the SQL query:
And for the final case, let’s use intersect. Suppose we want the users that have the Scorpio zodiac sign, but we want to get only those in q1 who were born in October:
The resultant SQL query:
Some notable considerations:
- Remember that it’s essential to try to make only one query. In this case, we assumed that we had q1 already evaluated to explain the union, difference, and intersection. Anyway, you may not have another choice but to work with more than one query.
- When you compone queries using the mentioned functions, please take into account that may be Django versions or Database engines may fail when slicing them.
This article presented good practices when using databases with the Django ORM. In the software development area, the performance of the applications is critical, and the use of databases can be highly related to that, hence why we focused on their use and access of them.
Besides general tips and best practices, we presented an example of reality and queried to show tools and performant ways to use them. Also, for each case, we indicated the corresponding SQL query to understand better what Django does behind the scenes.
The main conclusion here is that the Django ORM is a powerful tool to work with databases without the need to know the SQL language and provides different ways to do the same. For that reason, we enumerated good practices.
Bruno Michetti & Jorge Michelena