The only thing worse than discovering your application is crashing is having a customer tell you your application is crashing. Not too long ago I received panicked emails stating that, at some random intervals, our application was throwing 500s. Unfortunately I turned off push notifications (actually it has helped separate work/life - if it’s an emergency you’ll be getting a call) so I didn’t receive these emails until the next morning when I sat down at my desk.
My co-worker had already diagnosed the problem by the time I had arrived. The recent changes we had pushed included a 1s ajax poll on a particular page which, in most polling cases, ended up hitting the database. Also, those changes were going live with a few thousand more users than the previous version. My co worker discovered that the database was the culprit, but why?
It turns out one of tables we were querying on didn’t have an index set for a particular column. Coupled with the fact we added more users it was obvious why the database was choking: it had to perform a full table scan. The solution to the problem was easy, of course, we just needed to add the appropriate index.
Performance before and after indexing
It has been a few years since I’ve directly interacted with MySQL so I thought it would be fun to run a couple benchmarks on a table with and without indexing. The first thing I did was setup a Rackspace cloud server with 8 GB of memory and 4 vCPUs. If you don’t have a testing box, I’d highly recommend it. You get great computing power at your fingertips on demand. The best part is: it’s cheap. You can turn your machine off when it’s not being used so you’re not charged for something you’re not using. It took me a couple hours to setup MySQL and write the python (I chose python to familiarize myself with it) scripts for inserting data. Two hours only cost me $0.96.
I created the following test data:
- 5 million rows of randomly generated users with: id, firstname, lastname, email, year, month and date.
- 100k randomly generated products with: id, name, category, price and views.
- 1 million rows of purchases consisting of: productid and userid.
The first query was searching for a particular email address.
1 2 3 4 5 6 7
That’s pretty damn slow. How can we improve this? Let’s add the constraint that email addresses will be unique. Next, we can add a unique index to our email column. Depending on your database engine you could define an index type (BTree or Hash). I decided to use the InnoDB engine which only supports a BTree index.
1 2 3
With the index added let’s see how much of an improvement that made:
1 2 3 4 5 6 7
What a huge improvement! The query was so much faster it couldn’t be measured in seconds, perhaps if we had measured in milliseconds we would have a non zero response time.
The query without the index had to perform a full table sequential scan, meaning it had to look at all 5 million rows. Think of it as an array with 5 million elements we have to traverse. Adding the index transformed this array into a self balancing tree, more specifically, a btree. This means searching only takes O(log n) steps, or at most 23 steps (222 = 4.2 million, 223 = 8.3 million). This is a huge improvement over our previous 5 million steps!
Another slow query
Let’s suppose I want to find all products bought by users who were born in 1994 without using an index.
1 2 3 4 5 6 7
3.04 seconds - unacceptably slow.
1 2 3 4 5 6 7 8 9 10 11 12
A 434% improvement! I’m sure this query could be improved even more, but I had to head out after running this last query, so this is where I stopped.
When you’ll be frequently querying on a particular field in your database remember to add an index for that column to increase performance and keep your customers happy!