The 95th Percentile
Imagine a reality where you can detect and fix issues without your users noticing that something went wrong.
We all aspire to measure performance in some way, and choosing what to measure can be a challenge in itself. By default, we think about averages, and we forget that there are many other possible measurements.
For example, a query can average an execution time of 10ms. When we look at its percentiles, we may discover that our perception is skewed: some executions actually take 90ms. This is rather revealing, especially when those execution times are so far removed from the average. It tells us that the customer experience varies a lot, and that we may be on the verge of losing customers due to poor perceived performance.
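To make the gap between the average and the tail concrete, here is a minimal Python sketch. The latency values are made up for illustration (they are not measurements from a real system), and the standard library's `statistics.quantiles` with its default interpolation is only one of several reasonable ways to estimate a percentile.

```python
import statistics

# Hypothetical sample: 90 fast queries (1 ms) and 10 slow ones (91 ms).
latencies_ms = [1.0] * 90 + [91.0] * 10

average_ms = statistics.fmean(latencies_ms)

# quantiles(n=100) returns the 99 cut points between percentiles;
# index 94 is the 95th percentile.
p95_ms = statistics.quantiles(latencies_ms, n=100)[94]

print(f"average: {average_ms:.1f} ms")
print(f"95th percentile: {p95_ms:.1f} ms")
```

In this sample the average is 10 ms, yet one query in ten takes 91 ms, and only the percentile exposes that.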
Percentile values are often far more important for highly available (HA) solutions. Remember that a greater availability requirement demands a higher percentile. Normally, we care about the 95th percentile. However, for solutions that deal with a high volume of transactions, we can leverage higher percentiles like the 99.999th percentile (a.k.a. five 9s).
The 95th percentile is the value that 95 out of 100 measurements meet or beat. For response times, it means that 95% of requests complete at least that fast.
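The definition can be spelled out in a few lines of Python. This is a sketch of the nearest-rank method, which is one common convention among several; the function name and the sample data are assumptions made for illustration, and the function is meant for whole-number percentiles such as 95.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # Rank is 1-based: the position that covers p% of the observations.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]

# Hypothetical response times in milliseconds: 1, 2, ..., 100.
response_times_ms = list(range(1, 101))
p95 = nearest_rank_percentile(response_times_ms, 95)

# Count how many requests were at least as fast as the 95th percentile.
within = sum(1 for t in response_times_ms if t <= p95)
print(p95, within)
```

With 100 evenly spread measurements, the 95th percentile is 95 ms and exactly 95 of the 100 requests are at or below it.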
Percentiles are valuable because they give us an idea of how our solution is behaving. Tracking multiple percentiles over time gives us insight into how fast our services are degrading. This information is crucial for early detection, diagnostics and remediation. It gives us a timeframe to work with. If the degradation is slow, there may be less urgency. On the other hand, if time is of the essence, we find out early and can start damage control before users start to complain. When support tells a customer that they are aware of the issue and that engineering is already deploying a fix, it shows that we genuinely care. This is how we build trust and loyalty. And this is how we influence customers to share positively about our company on social media.
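As a rough illustration of tracking several percentiles, the Python sketch below compares two time windows and reports which percentiles got worse. The helper names, the 20% tolerance and the sample windows are all assumptions chosen for the example, not recommended values.

```python
import statistics

def percentile_summary(latencies_ms, percentiles=(50, 95, 99)):
    """Return {percentile: value} for one time window of measurements."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}

def degraded(previous, current, tolerance=1.2):
    """List percentiles that grew by more than `tolerance` times between windows."""
    return [p for p in current if current[p] > previous[p] * tolerance]

# Hypothetical windows: the second one has a slower tail but the same median.
window_1 = [10.0] * 90 + [20.0] * 10
window_2 = [10.0] * 90 + [60.0] * 10

before = percentile_summary(window_1)
after = percentile_summary(window_2)
print(degraded(before, after))
```

Here the median stays at 10 ms in both windows, while the 95th and 99th percentiles triple. Watching only the typical request would miss the degradation that the tail percentiles flag.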
To leverage percentiles, we need a meaningful data set. The more demanding the target percentile, the more data points it takes: as a rule of thumb, each additional 9 calls for another order of magnitude. For a 95th percentile, we need a minimum of 100 data points. A 99.9th percentile needs 1,000 data points, and the 99.99th percentile (a.k.a. four 9s) requires 10,000 data points.
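One possible reading of this rule of thumb is that the data set should be large enough for at least one data point to fall beyond the target percentile, rounded up to a power of ten. The Python sketch below encodes that interpretation; the function name is hypothetical, and the power-of-ten rounding is an assumption chosen because it reproduces the figures above.

```python
from fractions import Fraction

def minimum_data_points(percentile):
    """Smallest power of ten that leaves at least one point beyond the percentile."""
    # Parse via str() so 99.9 becomes exactly 999/10, not a binary approximation.
    p = Fraction(str(percentile))
    if not 0 < p < 100:
        raise ValueError("percentile must be between 0 and 100, exclusive")
    tail_fraction = 1 - p / 100
    n = 10
    while n * tail_fraction < 1:
        n *= 10
    return n

for p in (95, 99.9, 99.99, 99.999):
    print(p, minimum_data_points(p))
```

Under this reading, five 9s would call for at least 100,000 data points, which is why very high percentiles only make sense for high-volume solutions.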
NOTE: Don’t accept Service Level Agreements (SLAs) based on averages. Averages are sneaky and hide undesirable behavior. SLAs based on percentiles bring clarity and transparency through easily verifiable data.
Updated: June 2016
Azure Application Insights can be a great way to start leveraging percentiles. Read “Cool AppInsights Analytics: Percentiles” to find out how.