In the previous article, we’ve seen that “average” SQL performance metrics that ORACLE provides out of the box can be useful, but only in a limited set of circumstances when underlying data distribution is normal.
Let’s try to find better metrics.
Here is what the requirements are:
- We still want a summary (one point) metric, so that we do not need to store and process large raw data during performance analysis
- It has to be meaningful, i.e. represent what people actually care about
- It has to be universal, i.e. not dependent on data shape or sample size
Thinking clearly about performance
Let’s step back a bit and talk about performance in general. It is a well known maxima (thanks, Cary Millsap!), that
People feel variance, not the “mean”
I.e. people will generally feel better if things (i.e. query timing) remain constant but will detect (and complain about) things that get far outside the norm.
In my experience with tuning SQL queries, this maxima usually gets one sided.
I.e. imagine yourself tuning a random OLTP query and improving its latency by a factor of 10. Chances are, this change will not even be noticed by the users (yes, it’s a thankless job sometimes…). However, make this query performance 10 times worse and I’ll bet than people will start “screaming” … (relatively speaking )
So my slightly modified maxima is:
People feel (right side “BAD”) variance, not the “mean”
In very simple terms, when it comes to query performance, people only care about slow performing queries.
So, how do we track them ?
How to track “slow” queries ?
The first approach that comes to mind is to set a “slow threshold” and just record all (or, at least the count of) individual executions crossing it. This way we should have a very precise answer of how many queries are “bad”.
However, there is slight problem with this approach: What if we do not see anything ?
Does the fact that we measure 0 “bad” executions tell us if things are all good ?
I.e. you can imagine the situation when queries are “almost bad”, or slowly creeping to be “almost bad”, but because of the strict threshold, we are blind to their existence until they actually cross it (at which point it might be too late to react).
A more general solution to track “bad queries” is to think in “percentages” (or: “percentiles”). The idea is actually quite simple.
Let’s take the same measurements from our UPSERT example, but now order individual runs by “elapsed time”:
With the newly ordered data, let’s take a point at “90″ on X axis and think what it represents in terms of performance.
I can summarize it as:
Latency of the worst 10% of individual UPSERT executions is at least as bad as 248 milliseconds
Just for contrast, compare it with implied performance definition when we are using “averages”:
(what we think is) typical UPSERT latency is 102 milliseconds
What if we selected “data point at 90″ to represent query performance ?
- This would still be a single data point to deal with (“Summary metric” – check!)
- It has a very well defined and obvious performance meaning (“Meaningful metric” – check!)
- It does not really depend on data size or shape as “any shape can be reordered” (“Universality” – check!)
So, the bottom line is that this point (let’s finally give it a proper name – “90th percentile” or p90 for short), makes for a very good performance metric, a lot better than “average”.
In addition to that, percentiles are continuous – as long as sample size is not 0, p90 will always have some value. It can be tracked over time, plotted on screen and we (and more importantly our automated tools) will have a chance to react in time to potential problems.
The resulting performance plot will look very similar to the original “average” plot (except for it being more precise and meaningful, of course )
The typical use of percentiles is to track “bad” queries with ever increasing precision. That’s why, it makes sense to capture and track several percentile metrics, not just one.
I.e. you might want to track p50 (“50% of our queries are at least as bad as …”), p90 (“the worst 10%”) and p99 (“the worst 1%”). Occasionally, there might be a need to be more precise and track i.e. p99.9 or p99.999 percentiles if requirements are very strict.
R code to reproduce examples:
Ok, where can I find these “percentiles” ?
I hope you agree that percentiles are useful performance metrics and tracking them is a worthy thing to do.
But where can we find them ? If you look around ORACLE dictionary, you’ll notice that V$SQL does not exactly have ELAPSED_TIME_P90 or ELAPSED_TIME_P99 columns and v$SQL_PERCENTILES view is nowhere to be found as well.
The (sad) truth is that, at the moment, tracking percentiles is a “do it yourself” exercise. Stick around as I’ll be talking about it in the next article.