The last component of any application with a persistence layer is, as most of you know, the database. There are several database managers to choose from: Oracle, PostgreSQL, MySQL, SQL Server, among others. All of them have their strengths and weaknesses, but fortunately they all operate on the same basic principle: they collect statistics about your application's data and use them to choose better paths to reach the final result. I will talk about some of the most basic tweaks that take advantage of this principle to improve your application's overall performance.
How I start analyzing my DB
The examples will use a PostgreSQL database, since it is free and widely used for web applications.
The first thing to consider is statistics. PostgreSQL keeps a system catalog called pg_statistic, where it collects data about the contents of our database; the query planner uses this data to choose execution plans. So first we are going to analyze a table to check the health of our data, and then we are going to vacuum it to reclaim wasted space. The autovacuum daemon normally does all of this automatically, but since it can be disabled, it is good practice to run these commands yourself on a regular basis, on a low-traffic day:
ANALYZE VERBOSE articles;
INFO:  analyzing "public.articles"
INFO:  "articles": scanned 1 of 1 pages, containing 2 live rows and 24 dead rows; 2 rows in sample, 2 estimated total rows
ANALYZE
That command tells us important things about our table. For example, it reports 24 dead rows: space wasted on dead data. So the next step is to vacuum the table:
VACUUM FULL articles;
And now if we run our analyze query again:
ANALYZE VERBOSE articles;
INFO:  analyzing "public.articles"
INFO:  "articles": scanned 1 of 1 pages, containing 2 live rows and 0 dead rows; 2 rows in sample, 2 estimated total rows
ANALYZE
That’s more like it. This particular example comes from a development project with an editorial section that displays articles; imagine how much wasted space and how many improvements you will find in a live environment.
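Two extra notes on this routine. VACUUM FULL rewrites the whole table and holds an exclusive lock while it runs, so on a busy system a plain VACUUM, which reclaims dead space for reuse without blocking reads and writes, is usually the better everyday choice. The sketch below shows a plain vacuum combined with an analyze, plus a query against the standard statistics view pg_stat_user_tables to check whether autovacuum has been keeping up:

```sql
-- Plain VACUUM (no FULL) reclaims dead space for reuse without an
-- exclusive lock; adding ANALYZE refreshes planner statistics in one pass.
VACUUM (VERBOSE, ANALYZE) articles;

-- Check when each table was last vacuumed/analyzed and how many dead
-- rows it is estimated to carry (n_dead_tup).
SELECT relname, last_vacuum, last_autovacuum, last_analyze, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```

If last_autovacuum is recent and n_dead_tup stays low, the daemon is doing its job and manual runs are rarely needed.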
Another aspect to keep in mind is joins. A good practice is to examine your app’s logs, find the most frequently executed queries, and then trace their execution plans:
EXPLAIN VERBOSE SELECT * FROM articles WHERE author_id IN (SELECT id FROM authors WHERE name LIKE '%mbell');
Nested Loop  (cost=0.00..11.67 rows=2 width=166)
  Output: articles.id, articles.author_id, articles.title, articles.body, articles.status, articles.created_at, articles.updated_at, articles.background_image, articles.short_description, articles.slug, articles.start_date, articles.end_date
  Join Filter: (articles.author_id = authors.id)
  ->  Seq Scan on public.authors  (cost=0.00..10.62 rows=1 width=4)
        Output: authors.id, authors.name, authors.bio, authors.avatar, authors.created_at, authors.updated_at
        Filter: ((authors.name)::text ~~ '%mbell'::text)
  ->  Seq Scan on public.articles  (cost=0.00..1.02 rows=2 width=166)
        Output: articles.id, articles.author_id, articles.title, articles.body, articles.status, articles.created_at, articles.updated_at, articles.background_image, articles.short_description, articles.slug, articles.start_date, articles.end_date
There we are looking for articles written by any author whose name ends with the string ‘mbell’ (the leading % wildcard matches any prefix). Let’s take a look at the numbers on the cost line:
cost=0.00: the estimated startup cost, i.e. the effort spent before the first row can be returned
..11.67: the estimated total cost to return all rows; note that these are arbitrary planner units (by default, multiples of a sequential page fetch), not milliseconds
width: the estimated average width (in bytes) of rows output by this plan node (we want to keep it low)
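Because those costs are only estimates in abstract units, it is often worth comparing them against real timings. EXPLAIN with the ANALYZE option actually runs the query and prints actual times and row counts next to the planner’s estimates, a quick sketch:

```sql
-- Caution: ANALYZE really executes the statement, so wrap any
-- data-modifying query in a transaction you can roll back.
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM articles
WHERE author_id IN (SELECT id FROM authors WHERE name LIKE '%mbell');
```

Large gaps between estimated and actual row counts are a classic sign that the table statistics are stale and an ANALYZE is overdue.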
We are using an IN clause to join those two tables. Let’s see what happens if we use a common join:
EXPLAIN VERBOSE SELECT * FROM articles a, authors b WHERE a.author_id = b.id AND b.name LIKE '%mbell';
Nested Loop  (cost=0.00..11.67 rows=1 width=1734)
  Output: a.id, a.author_id, a.title, a.body, a.status, a.created_at, a.updated_at, a.background_image, a.short_description, a.slug, a.start_date, a.end_date, b.id, b.name, b.bio, b.avatar, b.created_at, b.updated_at
  Join Filter: (a.author_id = b.id)
  ->  Seq Scan on public.authors b  (cost=0.00..10.62 rows=1 width=1568)
        Output: b.id, b.name, b.bio, b.avatar, b.created_at, b.updated_at
        Filter: ((b.name)::text ~~ '%mbell'::text)
  ->  Seq Scan on public.articles a  (cost=0.00..1.02 rows=2 width=166)
        Output: a.id, a.author_id, a.title, a.body, a.status, a.created_at, a.updated_at, a.background_image, a.short_description, a.slug, a.start_date, a.end_date
The second time we can observe one fewer estimated row and the same total cost, but look at the width! It jumps because we are now selecting every column from both tables. This shows that the most innocent-looking change can have a big impact on overall efficiency.
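One easy win, then, is to select only the columns the application actually needs instead of *. A sketch of the same join with a narrower column list (assuming only the article id, its title and the author’s name are needed):

```sql
-- Narrower select: only three columns flow through the join,
-- so the estimated row width of the top node drops sharply.
EXPLAIN VERBOSE
SELECT a.id, a.title, b.name
FROM articles a
JOIN authors b ON a.author_id = b.id
WHERE b.name LIKE '%mbell';
```

Less width per row means less data copied through every node of the plan and over the wire to your application.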
If we are using a framework like Rails, it will generate the queries for us. We can rest assured that the Rails developers have put a lot of thought into this, but keep in mind that the framework lets you do all kinds of things (many-to-many relations, scopes, etc.) that translate into queries, and those queries may not be as efficient as you would like.
So what can we do when the queries are created by the framework? We can start by analyzing the execution plans of the most common queries in our app log and experimenting with adding or removing indexes. Keep in mind that there are several types of indexes; each database manager offers its own specialized index types in addition to the common ones. Don’t worry though, as each manager usually has great documentation on the matter. See for example http://www.postgresql.org/docs/9.3/static/sql-createindex.html
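As a concrete illustration of a specialized index type: a plain B-tree index on authors.name cannot help our LIKE '%mbell' filter, because the pattern starts with a wildcard. PostgreSQL’s pg_trgm contrib extension provides trigram operator classes that let a GIN index serve exactly this kind of search (the index name below is just an example):

```sql
-- pg_trgm ships with PostgreSQL's contrib modules.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index can speed up LIKE/ILIKE even with a
-- leading wildcard, which a B-tree index cannot do.
CREATE INDEX authors_name_trgm_idx ON authors USING gin (name gin_trgm_ops);
```

After creating it, rerun the EXPLAIN from before and check whether the planner switches from a sequential scan to a bitmap index scan on authors.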
This is a difficult subject, and it is not my intention to hand out recipes or teach you everything about indexes and execution plans, because nothing here is 100% true or correct (the cost numbers above are just the tip of the iceberg; beneath them lies the entire execution path of the query, and that is where problems can appear). Keep in mind that your application is like a human being: it changes over time, sometimes for better, sometimes not so much, so you should consider revisiting your indexes and strategies as time goes by. A good starting point is the queries I’ve shown you. The key is to always be aware of what is happening under the hood.