I/O latency is one of the most important topics in any discussion of RDBMS storage performance. Today, I would like to share my view on latency monitoring. In a future post, I will talk about latency, throughput, and IOPS, and about which of them matters most in which situation.
I/O latency is defined simply as the time that it takes to complete a single I/O operation.
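As a rough illustration (not a benchmarking tool), measuring a single I/O's latency is just a timed read. This is a minimal sketch; note that the OS page cache will make repeated reads of the same block unrealistically fast unless you use direct I/O or drop caches first:

```python
import os
import time

def read_latency_ms(path, size=8192, offset=0):
    """Time a single read of `size` bytes. 8KB matches a database page,
    not the 512B used in many vendor benchmarks. Rough sketch only: the
    OS page cache makes repeat reads of a cached block unrealistically
    fast unless O_DIRECT is used or caches are dropped beforehand."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        start = time.perf_counter()
        os.read(fd, size)
        return (time.perf_counter() - start) * 1000.0
    finally:
        os.close(fd)
```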
So, what is an acceptable I/O latency number? Before answering this question, I should explain a bit about storage vendors’ published latency numbers. There are two important pieces of information that many vendors neglect to mention:
- When designing the workload for a latency benchmark, vendors use the smallest I/O size possible in order to speed up the I/O operation. There is nothing wrong with testing the read latency of a 512B I/O, except that Oracle or SQL Server will never read 512B of data. A log I/O operation may be 512B, but a data I/O is always 8KB or larger.
- The latency reported by most vendors is the storage internal latency and does not include the initiator connectivity latency.
When dealing with I/O latency measured in milliseconds (ms), these two factors might not matter much, but when latencies are on the order of nanoseconds, as with high-performance DRAM-based storage, both factors become very significant.
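The arithmetic behind this point is easy to sketch. The 50µs fabric figure below is an illustrative assumption, not a measured or vendor number:

```python
def fabric_share_pct(device_latency_us, fabric_latency_us=50.0):
    """Percentage of end-to-end I/O time spent in the connectivity path.
    The 50us fabric figure is an illustrative assumption, not a vendor
    number; device latency is the storage-internal number vendors quote."""
    total_us = device_latency_us + fabric_latency_us
    return 100.0 * fabric_latency_us / total_us

# 10ms disk read: the fabric is well under 1% of the total -> negligible.
# 100us DRAM-class device: the fabric is about a third of the total.
```

The faster the device, the larger the share of the end-to-end latency the connectivity path represents, which is why internal-only vendor numbers are misleading at the low end.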
So, what is a good, acceptable disk latency? You would not believe how many times I speak with customers who state that average latency should not exceed 10ms (12ms is also a very common answer from customers) and maximum latency should not exceed 50ms. Where do these numbers come from? Well, Microsoft and Oracle have both published best-practice rules of thumb and, no surprise, the magic 10ms number is clearly stated there. Here is one of Microsoft's answers:
> The Avg. Disk sec/Read performance counter indicates the average time, in seconds, of a read of data from the disk. The average value of the Avg. Disk sec/Read performance counter should be under 10 milliseconds. The maximum value of the Avg. Disk sec/Read performance counter should not exceed 50 milliseconds.
So, what is the problem with this answer? Well, say I am a customer who has a performance problem in my application and I decide to check the storage subsystem. Based on the above rules, here are the issues:
- An average I/O latency of 15ms—should I start shopping around for a new and better storage subsystem? I am not sure about that.
- What about an average latency of 5ms? This is way below the 10ms guide, so I guess my I/O subsystem is not the cause of the application performance problem. Again—I’m not sure about that.
If you are like me, coming from the world of databases rather than storage, try to answer the following question: what is an acceptable number of statement compilations per second? Do you have a best-practice number that will fit all applications? No? Why not? Because each application is different. An OLTP application will (and should) behave differently from an OLAP application, compiling zero-cost statements is different from compiling complex statements, and so on. A hundred compilations per second might be good enough for one application, while 5 per second might be a real problem for another. How do you know whether compilation is an issue? You examine the wait events to understand exactly where the database spends its time. If you see that too much time is spent on compilation, then that is the problem you need to solve.
The same is true for I/O latency. You must look at the performance of your storage only after analyzing the wait events of your RDBMS application. This is the most important takeaway from this post. You should not look at I/O stats and decide that you have a problem with your storage. You must first understand whether your application suffers from I/O wait and only then look at the storage performance.
What is I/O wait? It is the total time that your session's worker processes are blocked waiting for I/O operations to complete. If that time is high, you will benefit from improving the speed of your storage. If you have no I/O wait, or the I/O wait time is minimal, then you will not.
You basically need to be able to break the elapsed time of the application into two measurements, processing time (CPU) and waits, and then break down the waits by type. If a substantial amount of the elapsed time is waits, and a significant amount of the waits is I/O wait, then you are correct in looking into the storage system's performance.
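The breakdown described above can be sketched as a small calculation. The input numbers here are hypothetical; in practice they come from an AWR report (Oracle) or from wait-stats DMVs and perfmon counters (SQL Server):

```python
def time_breakdown(cpu_s, waits_s_by_event):
    """Split elapsed database time into CPU vs. waits, plus the I/O share.
    Inputs are hypothetical seconds; in practice they come from an AWR
    report (Oracle) or wait-stats DMVs and perfmon (SQL Server)."""
    total_wait_s = sum(waits_s_by_event.values())
    elapsed_s = cpu_s + total_wait_s
    return {
        "cpu_pct": 100.0 * cpu_s / elapsed_s,
        "wait_pct": 100.0 * total_wait_s / elapsed_s,
        "io_wait_pct": 100.0 * waits_s_by_event.get("io", 0.0) / elapsed_s,
    }

# Hypothetical workload: 900s on CPU, 50s I/O wait, 50s other waits
# -> 90% CPU, 10% waits, with I/O wait at 5% of elapsed time.
```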
|               | App 1 | App 2 |
|---------------|-------|-------|
| CPU           | 90%   | 10%   |
| Waits         | 10%   | 90%   |
| I/O wait time | 5%    | 80%   |
App 1 is an example of an Oracle application on storage with 15ms I/O latency. The breakdown of CPU and waits is 90% and 10%, and I/O wait accounts for 5% of the total time.
App 2 is an example of an Oracle application on storage with 5ms latency. The breakdown of CPU and waits is 10% and 90%, and I/O wait is responsible for 80% of the total time. In the case of App 1, there is no business need or justification for replacing the storage: although the average latency is 15ms, the maximum improvement you can expect is 5%. For App 2, on the other hand, although the average latency is already 5ms, faster storage will yield a significant improvement in the total response time of the application.
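This ceiling is simply Amdahl's law with the I/O wait fraction as the improvable part. A minimal sketch, using the App 1 and App 2 figures above and a hypothetical 1ms replacement array:

```python
def max_speedup(io_wait_frac, old_latency_ms, new_latency_ms):
    """Amdahl-style upper bound on application speedup when only the I/O
    latency improves. The I/O-wait fractions and old latencies below are
    the App 1 / App 2 figures; the 1ms target is a hypothetical array."""
    remaining = (1.0 - io_wait_frac) + io_wait_frac * (new_latency_ms / old_latency_ms)
    return 1.0 / remaining

# App 1: 5% I/O wait at 15ms; even a 1ms array buys less than 5%.
# App 2: 80% I/O wait at 5ms; a 1ms array cuts elapsed time to
# roughly a third of what it was.
```

Note that even if the new latency were zero, App 1 could improve by at most its 5% I/O wait share, which is the point the example in the text makes.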
Can your application generate a high number of IOPS with high I/O latency and still see little benefit from an improvement in storage latency? Yes. Again, you should look at the CPU and waits breakdown before looking at I/O stats. I will discuss in a future post how a SQL (or Oracle) application can have 20ms-30ms latencies and still not gain much from faster storage.
To create the above measurements, I am using the following:
- Oracle: AWR + iostat/vmstat
- SQL: DMVs + perfmon (I am using an SQL script that I wrote in-house)
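To illustrate what these tools compute, iostat's per-interval average read latency (`r_await`) is essentially the delta of time spent reading divided by the delta of reads completed, both taken from `/proc/diskstats`-style counters. A minimal sketch of that calculation:

```python
def avg_read_latency_ms(reads_t0, read_ms_t0, reads_t1, read_ms_t1):
    """Average read latency over a sampling interval, computed the way
    iostat derives r_await: delta(ms spent reading) / delta(reads
    completed), with both counters sampled from /proc/diskstats."""
    completed = reads_t1 - reads_t0
    if completed == 0:
        return 0.0  # no reads in the interval; avoid division by zero
    return (read_ms_t1 - read_ms_t0) / completed

# 1,000 reads completed in the interval, 15,000ms spent reading -> 15ms.
```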
If you do not know how to create a report that shows the wait-event breakdown, you are in good company: most DBAs do not know how to do that, not to mention storage administrators, who in many cases have no privilege to access the database layer.