Database optimization

Sometimes, especially on interviews, developers are asked about what would be their steps to fix a database that works slowly. That’s a really extensive subject to discuss. Here, I decided to summarize my experience. I hope it help to not only increase the performance but do that in acceptable time and with less misery.

First, of all, fixing the database queries is the last thing you should do. Fixing code first is premature optimization that is a root of all evil as we know. It could take huge amount of time whereas the real problem is kept in a wrong configuration file.

So in my prospective, the right steps to make your database more peppy are something like follows below. Keep in mind that it is not a technical tutorial with code snippets that you could paste and make the job done.

Ensure you use SSD drives. It sounds silly but still you should check it. SSD gives you up to 10 times performance boost immediately. Ensure that SSD drive is a highlighted feature in you hoster’s plan.

Check you database config before tweaking something else. Usually, the default configurations are set up poorly to make the DB work on cheap computers. Some distributions carry config templates which names include memory or CPU values. Read them, they are well-documented. Google for best practices when configuring your database. Read books.

Consider a case when the entire database is kept in memory. That also involves DBA and Ops teams so probably you won’t be allowed to perform such a move by your own. But since memory is cheat nowadays, it is possible. For backups, replicate such a database using streaming replications on servers that are run on hard drives. PostgreSQL supports various replication scenarios as well. You may also dump memory images into files and restore them on demand.

I used to work in a project where they ran 400Gb database into RAM and it worked quite fast.

Check you application uses connection pooling when communicating to the DB. Suddenly, such wildly spreaded frameworks as Django do not use them. On each HTTP request, they open a fresh connection and close it once the response has been sent. Now imagine your application serves 20 requests per second. Network communication will be quite expensive in that case!

Consider using either PGBouncer in front your Postgres installation or any library for your language that deals with reusable connections by its own. Say, for Clojure, HikariCP would be a good choice.

Everything that might be read from replica should be read from it, not the prod DB instance. Don’t run reports against the production. All the analytics and statistics should not touch it. Modern frameworks allow you to declare multiple database backends and switch between them on demand. For example, in Django, calling .using('replica') method on a Queryset instance dispatches a query into another database, not the default one.

For quite complicated queries, use materialized views that may be turned into physical tables on demand. Querying them would be much faster than processing a query full of joins and aggregates every time. Setup a cron job to rebuild such views every night for example when the load is low.

Select only those fields that you really need to process a request. Avoid using select * from table since it takes extra time for the database to find out what fields to fetch exactly. With Django ORM, either specify needed fields with .only() method or bypass some of them calling .defer(). But keep in mind that touching a non-fetched field hits the database again.

When writing unit tests, add a case that counts a number of database queries made during request. Sometimes, it leads to terrifying findings. Touching fields of foreign entities in a cycle without joining them initially may end up with up to 100 queries or more.

Be carefull when passing a collection of IDs as a parameter. On the database level, it turns into something like

STATEMENT: SELECT * FROM users WHERE id IN (?, ?, ?, <100 more of them>)
PARAMETERS: $1=10035, $2=10036, $3=10037, $4=10038, $5=10039,...

When not limiting the number of such IDs, one may end up with 1000 of them what is quite harmful for the database. Either process them by chunks or try to compose a single query with raw SQL.

For fast data processing, don’t use ORM at all. Building a model instance from a low-lever tuple that the database driver returns is really costly. Even SQLAlchemy (Python’s the most powerful ORM) spends double time on building instances than reading data from psycopg2 (Python PostgreSQL driver) directly.

Denormalize your tables. Sometimes, it really worth moving any columns into foreign tables and joining them on demand. A good example might be ranking system. You’ve got a users table full of fields. Now you need to add rank field and recalculate it every 6 hours. Adding yet another fields would be a bad choice since our users table is already under load.

But having the users_ranks table that has a unique foreign key to a user and a rank value would be a good choice. Now that, we may rewrite it as we wish without touching actual user’s data. When needing to sort users by their rank, we join that table and order the result by rank as well.

The same relates to a search document. When implementing full search on some entity, we need to build a special document known as a search vector. Storing that vector inside an entity is a bad choice because it is not the data but just technical information.

Join tables only by foreign and primary keys but not any custom conditions like strings equality, regex matching and so on. When you need such a join, probably it is better to fetch both parts of it and compose the result in your app’s code.

When a query really works badly, examine it with EXPLAIN operator. If you don’t understand its output completely, use online services that turn it into more human-friendly format. Ensure the query hits indexes. Don’t add multiple indexes since having them a lot may slow down writing procedure.

Finally, never be afraid of raw SQL. No matter how powerful your ORM is, check what queries it produces exactly. On my dev machine, I always start Postgres not as a service but as a process passing -E flag to print all the queries:

postgres -E -D /usr/local/var/postgres

That gives me the whole vision of what’s going on with the database.

Summarizing all together:

use modern hardware;
configure your database properly;
know all the shadowed parts of your ORM;
rise your SQL skills.

← Clojure in Highload Cup

Пруф →

Олег Комков, 22nd Sep 2017, link

Как бы Django умеет пул соединение https://docs.djangoproject....

Ivan Grishaev, 22nd Sep 2017, link , parent

По дефолту он не включен:
> The default value is 0, preserving the historical behavior of closing the database connection at the end of each request.

Timur Malikin, 24th Sep 2017, link

Согласен во всем. Как относишься к хранимым процедурам? Нас на работе хватает любителей пол программы вынести в oracle pl/sql. ИМХО страшная вещь, когда бизнес логику туда пихают.

Ivan Grishaev, 24th Sep 2017, link , parent

Бизнес-логику, конечно, нельзя, потому что она постоянно меняется. А всякие технические штуки вполне. У нас еще триггеров полно.