Stop Chasing Ghosts: Hardware Utilization is Not an Outcome Metric

I often need to multiply two numbers together, and I am tired of leaving the command line to use GUI calculators or painfully inefficient Python1. So, I decided to write my own mega hardware-efficient calculator … for multiplying two integers. My goal? Achieving peak performance, measured by the holy grail of CPU optimization: Instructions Per Cycle (IPC).

I started with a naive implementation in C++, simply using the built-in multiplication operator.
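
The actual code for both versions is linked further down; as a minimal sketch (with the IPC bookkeeping omitted), the naive version amounts to something like this:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s a b\n", argv[0]);
        return 1;
    }
    const int64_t a = std::strtoll(argv[1], nullptr, 10);
    const int64_t b = std::strtoll(argv[2], nullptr, 10);
    // Let the hardware do what it is good at: a single multiply instruction.
    std::printf("%lld * %lld = %lld\n", static_cast<long long>(a),
                static_cast<long long>(b), static_cast<long long>(a * b));
    return 0;
}

Running it (the IPC figure comes from hardware performance counters):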

hamish@server:$ ./calculator_unopt 4 15000000
4 * 15000000 = 60000000
IPC
0.59

This approach was straightforward, but my IPC was poor; I wasn’t even close to executing one instruction per cycle. So, I needed to optimize. I consulted Agner Fog’s microarchitecture reference to learn the Skylake microarchitecture. I examined every line of code, profiled my calculator program, and inspected every performance counter, all to handwrite the perfect assembly. I achieved microarchitectural perfection and the pinnacle of hardware efficiency:

hamish@server:$ ./calculator_opt 4 15000000
4 * 15000000 = 60000000
IPC
5.93

My optimized calculator nearly hit an IPC of 6, over a 10x improvement. This is practically the maximum possible on the Skylake Xeon I am using, which can allocate 6 uops a cycle from the instruction decode queue to the backend scheduler. The code for both versions can be found here. Surely this is the world’s most hardware-efficient calculator?

But is it actually fast? Well, no. In my facetious example, I neglected to include anything about wall-clock time. The initial version took 173 microseconds, while the “optimized” version took a whopping 12,617 microseconds, nearly 73 times slower. Why? Because the “optimized” version uses a silly algorithm that adds input1 to itself input2 times, intermingled with a few useless instructions, resulting in the “optimized” version executing nearly 150 times as many instructions. Despite the phenomenal IPC, this dramatic slowdown starkly highlights the danger of obsessing over hardware utilization metrics as if they are the end goal. They aren’t. Hardware utilization is often an intermediate or proxy metric, not an outcome metric. Optimizing solely for utilization without rigorously measuring the actual desired outcome (execution time, throughput, or cost) can be deeply misleading and actively harmful to your goals.
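
Since execution time is roughly instruction count / (IPC × clock frequency), driving IPC up only helps if the instruction count doesn’t grow even faster. In illustrative C++ (the real version is hand-written assembly, so treat this purely as a sketch of the algorithm; a compiler may reorganize or strip it), the trick boils down to something like this:

#include <cstdint>

// Illustrative only: multiply by repeated addition, interleaved with cheap,
// independent busywork. IPC rises because the core always has something to
// issue; wall-clock time rises far more because the instruction count explodes.
int64_t multiply_by_addition(int64_t a, int64_t b) {
    int64_t result = 0;
    // volatile keeps the filler work from being optimized away in this sketch.
    volatile int64_t filler0 = 0, filler1 = 0, filler2 = 0;
    for (int64_t i = 0; i < b; ++i) {
        result += a;            // the only useful work per iteration
        filler0 = filler0 + 1;  // independent filler instructions
        filler1 = filler1 ^ i;
        filler2 = filler2 + 3;
    }
    return result;
}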

While this may be obvious, I am writing this blog post because, on more than one occasion, I have had a fundamental disagreement with individuals about what their, or our, goal actually was.

Defining what matters (outcome vs. proxy metrics)

Before we continue, let’s clarify what I mean by these terms in the context of software systems and performance engineering.

Outcome metrics: These quantify the results that directly impact the user, the business, or the primary goal of the software. Think about:

  • Latency: How long does a web request take? How quickly does a batch job finish? (e.g., p99 request latency, total job runtime).
  • Throughput: How many requests can the system handle per second? How many items are processed per hour?
  • Cost: What’s the cloud bill for running this system? How much energy does this computation consume? (e.g., cost per million transactions).
  • User Experience: Are users satisfied? Is the UI responsive? (Harder to quantify, but ultimately crucial).

Proxy / intermediate metrics: These are measurements of internal system or hardware behavior that might influence outcome metrics but aren’t goals in themselves. Examples:

  • Database operator throughput (tuples/sec)
  • Request throughput (requests/sec)
  • CPU utilization (%)
  • Instructions or operations per cycle or per second (IPC, IPS, flop/s); see the measurement sketch after this list
  • Cache hit/miss rates (L1, L2, L3)
  • Memory bandwidth usage (GB/s)
  • Network I/O (packets/sec, MB/s)
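
On Linux, many of these counters can be read with the perf tool or programmatically through the perf_event_open interface. As a hedged sketch (not the measurement code behind the numbers above; error handling omitted, and counter availability depends on the CPU and kernel configuration), measuring the IPC of a region of code looks roughly like this:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#include <cstdint>
#include <cstdio>
#include <cstring>

// Open one hardware counter for the calling thread.
static int open_counter(uint64_t config) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(syscall(SYS_perf_event_open, &attr,
                                    0 /*this thread*/, -1 /*any cpu*/,
                                    -1 /*no group*/, 0));
}

int main() {
    int cycles_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int instr_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);

    ioctl(cycles_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(instr_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(instr_fd, PERF_EVENT_IOC_ENABLE, 0);

    // ... the code you actually care about goes here ...

    ioctl(cycles_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(instr_fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t cycles = 0, instructions = 0;
    read(cycles_fd, &cycles, sizeof(cycles));
    read(instr_fd, &instructions, sizeof(instructions));
    std::printf("IPC %.2f\n",
                cycles ? static_cast<double>(instructions) / cycles : 0.0);
    return 0;
}

The perf stat command-line tool wraps the same interface and is usually the easier place to start.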

The danger lies in mistaking a proxy for an outcome. This is particularly tempting with resource utilization metrics. If we pay for hardware that can achieve 20 TFLOPS, why not strive to “optimize” our code to efficiently use the hardware and get as close to 20 TFLOPS as possible? The short answer is that hardware resource efficiency, in isolation, does not matter. At the end of the day, what matters is time, or money, or both. If better resource efficiency means you need less hardware (i.e. less money) to achieve your goals, that’s fantastic. Goal achieved. Similarly, getting CPU utilization to 100% feels “efficient”, but if it skyrockets your request latency because there’s no headroom for bursts, disrupting your paying users, you’ve failed. Improving IPC seems great, but as my example showed, if it massively degrades the outcome metric (execution latency), you lose. Proxy metrics only provide information in the context of your goal and your system.
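
To see why driving utilization toward 100% can wreck latency, a toy queueing model helps. In an idealized M/M/1 queue (a deliberately simplified assumption, not a claim about any particular system), the mean time a request spends in the system is 1 / (service rate - arrival rate), which explodes as utilization approaches 1:

#include <cstdio>

int main() {
    const double service_rate = 1000.0;  // requests per second the server can handle (mu)
    const double utilizations[] = {0.50, 0.80, 0.90, 0.95, 0.99};
    for (double utilization : utilizations) {
        const double arrival_rate = utilization * service_rate;  // lambda
        // M/M/1 mean time in system: W = 1 / (mu - lambda), shown here in ms.
        const double mean_latency_ms = 1000.0 / (service_rate - arrival_rate);
        std::printf("utilization %2.0f%% -> mean latency %6.1f ms\n",
                    utilization * 100.0, mean_latency_ms);
    }
    return 0;
}

The exact numbers are model-dependent, but the shape of the curve is why headroom matters so much for tail latency.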

As a PhD student working on data systems, I find that my outcome metrics primarily revolve around performance, e.g. query execution time. (Though my ultimate metric is thesis completion latency, which in turn depends on accepted paper throughput.) The measure of impact for academic research in data systems is frequently, but not always, performance. This is not because performance is the perfect outcome metric but because it is the easiest to quantify and is widely accepted, so researchers don’t have to spend as much time convincing reviewers and the research community that it matters. Reporting results on standardized benchmarks (TPC-C, TPC-H, TPC-DS, etc.) allows for more straightforward comparisons, and carefully controlled workloads ensure reproducibility and isolate variables. This environment naturally prioritizes optimizing for direct performance outcomes, like execution speed or throughput on these specific tasks; improving related proxy metrics like IPC or cache hit rates can demonstrate or help explain technical novelty.

This focus can sometimes diverge from the problems data system users experience outside of academia, because standardized benchmarks don’t necessarily reflect how data systems are used in practice, and researchers have limited access to real-world workloads. Though occasionally, companies will publish papers with insights or workload traces (example A and B). Still, evaluating metrics like real-world deployment cost or energy efficiency is challenging in academia, often requiring complex modeling and numerous assumptions about hardware pricing, power usage, and operational scale that are difficult to validate universally. As a result, academic work may inadvertently optimize for performance within a constrained context, potentially overlooking cost-effectiveness or behavior under the diverse, messy conditions typical of production environments.

Effective profiling: connecting proxy metrics to real outcomes

The correlation between proxy metrics and outcomes is not always straightforward for systems performance. For instance, total CPU cycles often directly correlate with execution time for CPU-bound single-threaded tasks, assuming a constant clock speed and limited off-CPU time. Reducing the cycle count usually leads to a faster program, making it a strong, albeit still proxy, indicator.

However, performance-sensitive systems, i.e. systems where performance actually matters, are rarely single-threaded and often involve substantial off-CPU time due to synchronization and I/O. Just determining what proxy metrics might be relevant to your objective can be challenging. Consider these extremely non-exhaustive examples:

  • Memory accesses and cache misses (L1/L2/L3): A high cache miss rate seems bad, but is it the cause of slowness or merely a symptom? How many misses are acceptable? Does a burst of misses stall the pipeline completely, or can the CPU hide the latency with out-of-order execution? Is the latency to other NUMA nodes also hidden, or do we need to try to localize our memory accesses? Does the memory access pattern fit in the TLB? Is the CPU the only device accessing system DRAM? (DeepSeek’s Fire-Flyer paper has a fascinating discussion on this in the context of transfers between InfiniBand NICs, GPUs, and system DRAM.)
  • Subtle microarchitectural hazards: Deeper issues like 4K aliasing can be incredibly tricky to diagnose. This occurs when different virtual memory addresses inadvertently map to the same cache sets, causing conflict misses even when overall cache usage seems low. Standard cache miss counters might not pinpoint this specific problem; it requires sophisticated tools and analysis to detect these pathological access patterns based on address bits. Similar complexities arise with false sharing (see the sketch after this list) or interference patterns in branch predictors.
  • I/O accesses: Is your application I/O bound? Simply knowing the theoretical peak IOPS (Input/Output Operations Per Second) or throughput of your SSD or disk array isn’t enough. The actual achievable performance may be limited earlier in the chain. Is the bottleneck the physical device, the interconnect between the device and target memory, the kernel, or the system misusing interfaces (e.g. many small synchronous requests when batched or async requests are possible)? Is there other useful work that can be done while waiting on I/O? High latency per I/O operation often hurts chains of dependent reads and writes more than raw throughput limits hurt data-parallel processing.
  • Operating System Interactions: The OS can be a significant performance factor, often in non-obvious ways. Is there significant time spent on context switches because too many runnable threads are fighting for limited cores? Are there expensive system calls on critical paths leading to costly transitions between user mode and kernel mode? Are page faults causing delays? Is kernel lock contention or time spent handling interrupts consuming resources? These OS-level activities consume real time and impact system latency and throughput but might not neatly map to simple CPU utilization percentages or hardware execution counters.
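
To give a flavor of how subtle these hazards can be, here is a small false-sharing demonstration (a sketch; the effect and its magnitude vary by machine). Two threads increment two independent counters; the only difference between the two runs is whether the counters are likely to share a cache line:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

constexpr long kIters = 20000000;

struct SharedLine {     // two counters that likely share one 64-byte cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};
struct SeparateLines {  // each counter gets its own cache line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

static void hammer(std::atomic<long>& counter) {
    for (long i = 0; i < kIters; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);
}

template <typename Counters>
static double run_ms(Counters& c) {
    const auto start = std::chrono::steady_clock::now();
    std::thread t0(hammer, std::ref(c.a));
    std::thread t1(hammer, std::ref(c.b));
    t0.join();
    t1.join();
    return std::chrono::duration<double, std::milli>(
               std::chrono::steady_clock::now() - start).count();
}

int main() {
    SharedLine shared;
    SeparateLines separate;
    std::printf("same cache line:      %.0f ms\n", run_ms(shared));
    std::printf("separate cache lines: %.0f ms\n", run_ms(separate));
    return 0;
}

Both versions run the same algorithm and retire essentially the same instructions; the gap shows up in cache-coherence traffic, which is exactly the kind of counter you only think to check once you know where the time is going.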

Understanding the proxy metrics available to you and when they are relevant to your desired outcome is crucial because the diagnosis dictates the solution. Is the problem solvable with careful code adjustments to help the compiler emit more efficient instructions? Does it require reorganizing data layouts (e.g., switching from array-of-structs to struct-of-arrays) to improve cache locality? Or does the profiling data reveal that the fundamental algorithm interacts poorly with the hardware architecture, necessitating a complete rethink rather than micro-optimizations?
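
As a sketch of the data-layout option (illustrative field names, not from any particular system): when a scan only needs one column, a struct-of-arrays layout turns every fetched cache line into useful bytes, whereas the array-of-structs layout drags the whole record through the cache:

#include <array>
#include <cstdint>
#include <vector>

// Array-of-structs: reading `price` pulls each full 64-byte record into cache.
struct OrderAoS {
    int64_t id;
    int64_t customer_id;
    double price;
    double tax;
    char status[32];
};

double total_price_aos(const std::vector<OrderAoS>& orders) {
    double total = 0.0;
    for (const OrderAoS& o : orders) total += o.price;  // ~1/8 of each cache line is useful
    return total;
}

// Struct-of-arrays: the same scan reads a dense array of doubles, so every
// cache line fetched is fully useful (and the loop vectorizes easily).
struct OrdersSoA {
    std::vector<int64_t> id;
    std::vector<int64_t> customer_id;
    std::vector<double> price;
    std::vector<double> tax;
    std::vector<std::array<char, 32>> status;
};

double total_price_soa(const OrdersSoA& orders) {
    double total = 0.0;
    for (double p : orders.price) total += p;
    return total;
}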

This intricate relationship between proxy metrics and actual outcomes underscores why simply optimizing one metric in isolation, like IPC, without understanding its impact on the entire system can lead you astray. Effective profiling is about building a hypothesis, testing it, and connecting those low-level events back to the ultimate goal.

This is not easy. I don’t profess to be a world expert, but my general advice is to take a top-down2 approach starting from your target metric, e.g. query execution time. Look at where the real, i.e. wall-clock, time goes3, and find which module or component is the primary bottleneck4. Starting with real time is important, as CPU time can be a poor proxy for real time, especially in parallel systems. Is the bottleneck in parsing the SQL, query planning, query execution, etc.? If the problem is in query execution (assuming the optimizer produced a good plan), is any particular operator the bottleneck? Methodologies such as flame graphs can be helpful here, but always be careful what you are measuring; a flame graph of stack traces sampled using CPU time may hide off-CPU issues.

Once you have narrowed down where the problem lies, you may already have formed a hypothesis about the root cause, and real/CPU time may be sufficient to test it. However, for concurrency/synchronization, memory, I/O, and microarchitectural issues, you will likely need to collect more information, i.e. OS, hardware, and system metrics, to understand why a certain segment of code takes up so much time. It is much easier to correlate metrics with specific system behavior or a specific part of the code than to take a bottom-up approach, starting from low-level metrics and then hunting for the part of the system that contributes to an unusual-looking metric (which may not even matter for overall performance). It’s the difference between a doctor asking, “Where does it hurt?” to narrow down the problem area before ordering specific tests, versus ordering every conceivable medical test for a patient and then sifting through thousands of results, hoping to find the diagnosis.
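
To make the real-time versus CPU-time distinction concrete, here is a tiny sketch (illustrative only; a real investigation would rely on proper instrumentation or off-CPU profiling): a thread that spends most of its life blocked barely registers in CPU time even though it dominates end-to-end latency, which is exactly what a CPU-time-sampled flame graph would miss:

#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

// CPU time consumed by the calling thread, in milliseconds.
static double thread_cpu_ms() {
    timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main() {
    const auto wall_start = std::chrono::steady_clock::now();
    const double cpu_start = thread_cpu_ms();

    // Simulate waiting on I/O or a lock: wall-clock time passes, CPU time does not.
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    // A bit of actual computation so CPU time is not exactly zero.
    volatile double x = 0.0;
    for (int i = 0; i < 10000000; ++i) x = x + i * 1e-9;

    const double wall_ms = std::chrono::duration<double, std::milli>(
                               std::chrono::steady_clock::now() - wall_start).count();
    std::printf("wall-clock: %.0f ms, CPU: %.0f ms\n", wall_ms,
                thread_cpu_ms() - cpu_start);
    return 0;
}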

Focus on what counts

Chasing hardware utilization metrics or other proxies in isolation feels productive, and it can generate nice-looking graphs, but it can easily lead you away from your actual goals.

To summarize:

  1. Identify your true outcome metrics: What defines success for your project? Is it latency, throughput, cost per transaction, user sign-ups, job completion time, an accepted academic paper? Be specific.
  2. Measure outcomes rigorously: Put monitoring and measurement in place for these key outcome metrics first. This is your ground truth.
  3. Use proxy metrics for diagnosis: Leverage utilization and hardware metrics (CPU%, IPC, cache misses, network I/O, etc.) and structured methods like Top-Down Analysis as diagnostic tools. When your outcome metrics degrade, use these proxies to understand why. When outcome metrics are great, look at utilization to see if you’re overprovisioned (potential cost savings).
  4. Always ask: “Does this change improve my actual outcome metric?” If optimizing a proxy metric doesn’t move the needle on what truly matters, question whether it’s worth the effort or if it might even be causing harm elsewhere.

Stop chasing ghosts in the machine. Focus on the results. Optimize for outcomes.


  1. No shade to Python, just playing on the “Python is slow” meme. It is actually pretty fast for numerical stuff, especially when using libraries like NumPy, which is all Fortran/C/C++ and BLAS under the hood. 

  2. A bottom-up approach can also be useful for full system optimization, e.g. identifying time spent in frequently called functions. Optimizing frequently called functions can have wider benefits. I specifically discourage a bottom-up approach with respect to examining proxy metrics before your outcome metric. 

  3. Measuring where real time goes can be surprisingly tricky. In modern systems utilizing parallelism or asynchronous operations, simply summing up the time attributed to individual operations (like disk I/O, network waits, CPU processing, or a thread waiting for work) can be misleading. For example, a query processing storage-resident data might overlap fetching the next block from disk using asynchronous I/O while simultaneously processing the current block on the CPU. Because these activities happen concurrently, the time spent fetching data and the time spent processing don’t add up neatly to the total wall-clock time; the total time is determined by the longest chain of dependent operations (the critical path), including waits, considering all overlaps. Assigning time to operations requires knowing when a specific operation is on the critical path. Generally, this requires some effort to be invested in instrumentation and tracing infrastructure. This is further exacerbated by parallelism, e.g., in the same example, if multiple threads are issuing async I/O and processing data. 

  4. Hope that your problem is a result of clear bottlenecks. If every part of your code is equally slow, solving performance issues becomes more challenging and the approach will be system and situation dependent.