Twelve years ago I gave a talk on “performance signatures” and submitted a paper on the subject. This paper was about various approaches to help give tactical performance guidance to people without having done a lot of measurement. The idea was that there are often common patterns worth considering. I also wrote some notes about design and metrics considerations at the same time that never saw the light of day (except those few that read them at the time). Here’s that set of notes from ~2006.
[Click for the cheat sheet]
Performance Personae — Code and Data
This document is intended to serve as a brief introduction to the notion of Performance Personae that we wish to apply in the. The general idea is that when writing new code or reviewing existing code, in addition to thinking about the Customer Persona for which the code is intended we will also attempt to describe the likely usage for code/data of key methods. This will help us to focus design work in the right areas and give general guidance as to what should be considered. By construction, code and data that falls into none of these categories is of lesser importance. There are only four code categories and three data categories so they are easy to remember and apply.
The following sections describe the personae and give some guidance as to possible issues and metrics. Naturally if there are important scenarios in your particular project that are not covered by the simple ones below then you should feel free to add those to make sure you have adequate coverage. These are intended to be motivational only — they do not provide a complete picture. To use the personae just apply one data persona to each class you are annotating and one code persona to each public method.
- Inner Loop
- High Volume
Using Performance Personae
Some day we’ll have all of our APIs annotated with their performance persona but even without a full annotation it’s still useful/instructional to think about how the personae relate to one another and how a hypothetical enforcement tool might work. To complete the thinking we have to add a couple of additional notions:
1. The “None” persona which indicates the API is not approved for use in any performance sensitive context. Unmarked methods would be assumed to be in the “None” persona.
2. Hotness/Coldness of basic blocks
- a basic block with a “throw” in it is cold because its on an error path o a catch clause is “cold” for the same reason
- other blocks could be annotated as cold (by for instance calling a static method that inlines to nothing)
For the following discussion we’ll assume that some satisfactory combination of the above has been done to enable things. With that in mind we end up with these rules:
- Cold code can call anything regardless of other annotations, this rule trumps the others.
- Hot code can never call methods marked “None.”
- Any method may call a method marked Inner Loop
- Methods marked Inner Loop may only call other Inner Loop methods
- Methods marked Throughput may call other methods marked Throughput
- Methods marked Interactive may call other methods marked Interactive or they may call Throughput methods when performing a “mode change” (e.g. the user just clicked on “Replace All” and the application is temporarily in a throughput mode)
- Methods marked Startup may call other methods marked Startup, Throughput, or (as always) Inner Loop
This leads to the following diagram rendered in ASCII Art where each persona is allowed to call to a lower level in the diagram or stay in its own box on the hot path
| | |
| Startup | Interactive |
| | |
| Throughput |
| Inner Loop |
Code Persona: Inner Loop
The key thing that distinguishes inner loop code from other categories is that there is a strong expectation that the code in question is going to be called repeatedly. Depending on the nature of the code the number of calls might be linear in the size of the input, or it might be an even larger function. Examples of this kind of code are: hashing functions, comparison functions, parsing functions, low-level collection access and modification functions, and others of this ilk.
General Issues and Guidance
Because the number of calls is likely to be high we must write the code so as to be frugal with both of the main low level resources — cpu cycles and memory.
- path length (number of instructions) for the function should be minimized, speed is a key feature in this kind of code
- cost per instruction should be minimized, where there are CPI problems there are likely to be benefits from measuring secondary contributors such as missed branch predictions, L2 cache misses, cache splits, and others
- data locality is essential — since the algorithm is to be called repeatedly if it touches large regions of memory it will effectively render the cache contents useless for other purposes — being a good citizen in the cache is important
- largely due to the previous point temporary allocations on the heap should be (virtually) non-existent, only memory needed for the actual result of the computation should be allocated if at all possible, otherwise the repeated loops will induce garbage collections and effectively age the entire heap more than is necessary -inner loop functions should not do this
- the large number of iterations is likewise likely to make synchronization primitives prohibitively expensive, Inner loop code is generally written so that any locking needed was done at a higher level, the code itself will assume that the necessary locks have been taken
Both scenario benchmarks and microbenchmarks are likely to be helpful but at the very least the former is indispensable. For the relevant scenarios you should measure:
- cpu cost per interation (in cycles)
- measure CPI to indicate that these cycles are effective (i.e. not a lot of “stalling”)
- consult secondary CPI factors, especially cache hits, to explain poor costs and target fixes
- allocations per iteration (in bytes)
Code Persona: Throughput
Continuous sustainable performance under an ongoing workload is the distinguishing factor of throughput oriented code. The workload can come from streaming through a file, receiving web requests, activation of managed stored procedures, or other mechanisms but this class of code expects the work to be ongoing for at least some processing phase, To succeed the code must marshal the computer’s overall resources for the long haul.
General Issues and Guidance
The throughput phase by definition excludes Startup issues and is at a higher level of abstraction than the Inner Loop so those issues are often not the appropriate ones. Instead consider issues that would tend to erode performance over time.
- Limit growth of the GC heap — once initialized the necessary state should be established, if significant growth (especially non-linear growth) in the heap is necessary for continued processing you will likely find that the overall algorithm scales poorly with the size of the problem. Better algorithms exhibit little or no data growth with the size of the problem. Data should pass through the system as much as possible rather than accumulate. Memory leaks are totally fatal to this class of algorithm
- Limit generation 2 collections — the rate of promotion to generation 2 while the algorithm runs should be as close to zero as possible, while temporary allocations often are critical in throughput applications they should be temporary, meaning as each chunk of data is processed (whatever the transaction or other unit of work is) those temporaries can be retired, this in turn will mean that most if not all garbage collections will be partial collects
- Complex and/or redundant calculations, that is calculations that would need to be repeated in each unit of work should be (cached with suitable policy in the cache) so that work done on previous batches tends to simplify the work needed on subsequent similar batches • Scalability of the algorithm from the perspective of its ability to run on larger processors or larger processor farms is often an issue, throughput class algorithms often run on multiple processors running on multiple machines. This class of algorithm tends to define the rules for threading, scale up, and scale out
- Scalability of the algorithm from the perspective of its ability to handle larger problem sets is frequently a consideration and interacts with scale up and scale out characteristics.
- Throughput algorithms should rely on composing Loop type algorithms together and not much, if at all, on the other personae.
It should be no surprise that Throughput class algorithms are measured on their throughput — any algorithm in this class is measured by its ability to process representative workloads effectively. Common success and diagnostic metrics are as follows:
- Throughput of the activity — transactions per second, requests per second, records per second, bytes per second etc.
- Secondary considerations such as latency, guarantees (with % probability) service time are often vital
- Processor utilization (high utilization is generally indicative of good performance) , or, more generally, utilization of the gating resource for the algorithm (full disk use for instance, full network use, etc.)
- Garbage collection statistics — low overall % time in GC, low promotion rates to generation 2, appropriate overall allocation volumes, and total heap size
- Secondary large scale CPU usage indicators like contention, page faults, interrupts, etc. can indicate reasons why throughput isn’t what it could be
- Primary CPU indicators like Cycles Per Instruction usually indicate a problem with the algorithms the Throughput code is using
Consider the context in which the throughput algorithm is to be used and make sure it is suitably constrained that it could run in its target environment. Measurements to confirm that peak memory, peak disk usage, etc. are compatible with the systems available and the other software running on those systems, are crucial to success.
Code Persona: Interactive
Ultimately the Interactive persona is all about preventing observable lags in your GUI. As such it shares many issues with the Throughput persona in as much as both are in a sense dealing with regular incoming bits of work that need to be managed. At first blush you might guess that what characterizes the Interactive class of algorithms is its need to offer very low latency for what are often (but not always) quite small units of work. If you combine those issues with a need to be able to “go idle”, “do things in the background”, “resume quickly”, and perhaps “manage power”, then you start to strike even closer to the mark. But perhaps the crispest way to say all of that is to simply note that Throughput algorithms succeed by maximizing the use of computer resources while they run whereas Interactive algorithms succeed by minimizing the use of resources while they run. Low latency results with minimal resources used: that’s Interactive.
General Issues and Guidance
Not all code that is run interactively fits the Interactive persona and this is a good thing to remember. For instance, suppose you are implementing a Search and Replace feature in an application. The Search and Replace dialog handling code is almost certainly interactive but the code that does the actual replacing when the user presses “Replace All” is actually Throughput code. That’s a key point to remember as we consider the qualities of Interactive code.
- Small requests from the user should result in only small perturbations to objects to allow for brisk editing with limited/no collections
- object visitations should proportional to the size of requested operations
- processor cache should stays hot with data centric to the objects being edited
- temporary allocations are permissible, but promotions to generation 2 should be frowned upon due to potential pausing
- reducing temporary allocations generally so that objects age more slowly is generally a good thing
- external dependencies should be limited
- starting up a new subsystem or new major interactive dialog is really under the Startup persona
- dynamic loading etc. should not allowed if we expect to be in an interactive mode
- enable battery conservation modes by blocking as much as possible
- background work should settle comparatively quickly
- minimize the need to spin up the disk for any kind of edits
In Interactive code the goal is to minimize the impact of running the application (sometimes even at the expense of raw speed). A variety of test scenarios representative of important customer activities should be measured. Key indicators are:
- application response time in various steps of the scenarios on representative hardware
- processor utilization under load (going for low utilization)
- resource utilization in general for any controlling resources such as disk/network etc. always wanting low usage
- any unexpected disk i/o is likely a problem — validate with a tool like Resource Monitor
- steady-state working set — touching little memory
- maximize the amount of the application that can be swapped out after startup and stay swapped out
- page file access data, force out pages and watch them come back in (or hopefully not come back in)
- GC statistics, % time in GC, number of GC, and nil or nearly nil promotion rate to generation 2
- any loading activity or anomalous code loading events
Code Persona: Startup
Startup code is distinguished by the fact that it is only likely to be run once, but it is run at an extremely sensitive/important time… startup (duh). Generally startup algorithms are not that complicated and so time complexity of algorithms (e.g. O(n^2), etc.) is not the most common issue. Rather we’re faced with a situation where we must get the application running in an orderly fashion while using the fewest external resources — network and disk — and usually the latter is the issue.
General Issues and Guidance
Because external resources are the gating factor each access to such a resource must be carefully examined. Effectively a full understanding of your dependencies is called for. In some cases the access to the resource is disguised but it is real none-the-less. Some of the more direct costs are listed below
- Code: every bit of code you run will have to be paged into the process, at least via soft-faults and potentially hard-faults, this is often the largest source of disk i/o and in turn the largest consumer of external resources
- analysis of your startup path is your best line of defense, don’t put things on the startup path that do not need to be on the startup path
- keep DLL dependencies to a minimum
- do not force complex subsystems to initialize at startup if it can be avoided
- Configuration: configuration information often comes from the registry or else some kind of config file
- there are direct costs associated with getting the data from the file in question, remember even the registry is in a file
- touching more initialization files makes it more difficult for the operating system to do its job scheduling the other i/os
- keep things concentrated, don’t mix startup data with other data such that you have to read vast amounts of config just to get a few booleans or coordinates
- avoid complex registry configuration (e.g. long search paths, complex trees)
- there are indirect costs associated with loading the code that is going to be needed to read the configuration data, often this code is MUCH larger than the data itself, consider this when choosing your configuration medium
- Remoting: discovery of services in other processes on the system can cause hefty transports to be loaded that could otherwise be avoided, use the simplest configuration that is reasonably possible
- network based configuration or loads can add significant load time
- network based module loading compounds any code issues and makes it all the more vital that dependencies are minimized
- For interactive applications, the user of the application should feel like the application is doing useful work on his/her behalf as soon as possible
Normal startup scenarios should be defined and resource usage should be measured. Note that in the final analysis we only truly care about startup time but as it is often very difficult to measure representative times on developer machines we often measure the underlying sources of delays directly and track those instead.
- Total disk usage set for startup (all disk i/o’s from everywhere including faults)
- Working Set analysis especially to identify opportunities for sharing
- Performance counters around disk and network i/o
- Time on representative hardware (this is the bottom line after all)
Remember that end users machines will be running applications other than your test application and so if you are targeting (e.g.) a 4G machine that doesn’t mean you get all 4G — you should probably test on a 1G machine to simulate actual load, or otherwise add the expected applications to the scenario.
Data Persona: High Volume
Classes where you can reasonably expect there to be huge numbers of live instances often deserve special treatment in their implementation. Naturally, the greater the order of magnitude of the usage the more attention is called for. In these sorts of classes every bit of per-object overhead is magnified to the point where it is readily measurable and almost always impactful. As a result normally minor issues — such as field layout and default sizing.
General Issues and Guidance
The most important bit of guidance with for this data persona is to test it in the manner in which it is designed to be used. Create and organize the objects according to typical data patterns and at high volume. Observe the inherent costs associated With operating on the data and tune accordingly.
- A high volume of objects multiplies all per-object overheads
- some of these overheads are inherent, things like the object headers and the method tables are unavoidable
- other overheads are associated with how objects are padded and these can be controlled
- even larger overheads are associated potentially optional fields which are always present in the object but contain nil values — these should be explored
- collatoral storage associated with the object (such as arrays owned by the object) will likely need tuning, the default sizes may be too big or too small
- choose storage-oriented data structures at the low level not abstraction oriented ones — that approach leads to more bytes to hold data and less to hold pointers
- Almost any object type that is happening in high volume is likely to impact the garbage collector
- overall allocation volume will increase frequency of collections and age other objects
- rich linking between the various objects will increase the cost of tracing the objects when computing what is reachable
- very large objects may land on the large object heap, if these are in high volume there may be fragmentation issues — reusable object sizing at this size becomes important
- High volume objects may make good value-types depending on lifetime considerations
- consider the objects mutability carefully when making these choices
- immutable reference types may provide excellent threading and reuse
- mutable types may help with lifetime issues and facilitate recycling
- Observe the flow of these objects into the various generations — reducing each type of creation/promotion is a good objective
- consider the persona of the code that will typically use these objects, are allocations friendly to that code persona?
- See the Document Persona rules if these structures participate in an overall structure that is actually one document
- See the Cache Persona rules if these structures participate in an overall structure that is actually one cache
- size (bytes) per object
- overhead (bytes) per object
- cache lines touched per operation (various operations)
- garbage collector metrics — % time In GC and promotion rates under various representative scenarios
Data Persona: Document
Document classes don’t necessarily hold things that you would normally think of as a document it might be a dataset, or it might be the representation of a complex form. The defining characteristics of this type of data are the fact that it is expected to have significant durability, it has at least medium size, and it is likely that the data is of value to the end-user — in contrast to a cache.
General Issues and Guidance
The document is normally large enough that operations on the data typically mutate it in place in some manner. This will mean that fast incremental operations are the key to success. Combine this with the general importance of the document data to the user experience and you can get a set of constraints that look something like this:
- incremental operations common, operations generally mutate the document
- locality of the document pieces is vital to success, operations on the document should touch memory proportional to the complexity of the operation
- notebly most of the basic incremental operations should touch a minimum of memory
- to get the best GC performance similarly aged objects should be involved in the logical operations, so do not frequently cause old objects to point to new as this is bad for the GC
- documents typically define the “transaction”, or “unit of work”, whatever that may be — this is what is isolate, what participates in undo, what can be committed to disk, etc.
- the document will define its threading model (if it has one) and it should be very clear what that model is
- how will locks behaving internally and externally?
- what combinations of reads and writes are permissible without deadlocking?
- what is the isolation policy? (i.e. what changes are visible across threads and when do they become visible?)
- what kind of transactions (batch commit) are supported if any?
- consider searching as a key scenario for document data, documents must also consider things like “background printing” and “background saving”
- Since our presumption is that the data is precious we must guard ourselves against an assortment of failure cases — each of these factor into choices about the unit of work and hence the overall performance
- how do we handle out of memory cases?
- how do we handle disk full?
- how do we handle a power failure?
The things we will be measuring revolve around how effectively the document is able to represent the underlying data while meeting the isolation and recovery needs. Typical document shapes and edit operations should be defined and these should be used to collect these metrics:
- Raw size to represent typical documents (bytes)
- Waste bytes (pointer overheads etc.) for typical documents (what percentage of the storage is for the Actual Document rather than overhead)
- How much overhead (bytes) is associated with typical “transactions” while they are in flight?
Data Persona: Cache
Caching Implies Policy and the presence of discardable data with a clear policy as to how the discarding happens is what makes a cache. A cache with no policy (or bad policy) isn’t a cache, it’s a memory leak. If the cache is never out of date and always has the data then it isn’t really a cache at all perhaps it’s an indexed Document — in which case it should be following those rules. But if the data is potentially not authoritative, and is peppered with auxiliary information for managing a discard policy (such as LRU with a fixed size cap) then it’s a cache and it will need to follow the guidelines below to be successful.
General Issues and Guidance
Policy is everything. Good caches are driven by algorithms that want them to shrink, not to grow, as resource minimization is a hallmark of good caches. We’re trying to reduce CPU utilization, improve locality of reference, eliminate disk and/or network i/o operations by noting and leveraging popular patterns in data access. Putting caches too at too low of a level in the system is probably the most common source of failure — if the cache is at too low a level in the system then the code will not be able to make good assumptions about the access patterns. Higher levels of the system understand the overall trends and lifetimes and can make these choices — remember a cache with the wrong policy is often worse than no cache at all.
- cache policy is the key decision
- put the policy decisions at the right level of abstraction
- choose policies whose goal is to minimize resource usage which maximizing cache hits
- most times ‘weak references’ as part of your caches indicate a problem
- policy should drive discard choices
- object lifetime is usually a bad policy choice so GC driven caches tend to fail
- if the cache starts to feel like a document it is likely not doing its job
- cache policies should include right-sizing logic to create an effective cache
- hit rates and other metrics should be readily observable for important caches
- cache policies often incorporate dynamic feedback to help with self tuning
- when using dynamic feedback be sure to consider the overall effectiveness of the system and not the local effectiveness of the cache — do not maximize hits, maximize overall throughput
- memory tied up in a cache is memory you can’t use for your primary needs
- caches can simplify isolation and threading issues
- if your cache policies are such that your application is liberated from complicated threading or isolation requirements by first checking a cache with a thread-friendly access pattern you can achieve significant savings both in performance and complexity
- underlying transacted stores can have simpler rules if high performance access is typically satisfied by lock-free (or lock-reduced) caches
- caches are not an end unto themselves
- if the underlying document can be organized such that a cache is not needed or adds little value that is a good thing
Ultimately the success of the cache is represented by savings in the resources otherwise consumed by accessing the cached objects the normal way. Policies are scenario driven so there should be representative scenarios for your cache classes in which you observe the cost of the cache.
- Typical cache sizes (in representative cases)
- report savings vs. the un-cached approach Hit rate vs. size of cache (bytes)
- correlate savings from hits to cache size so that you can find the sweet spot and verify that your policy is putting you close to that spot
- Throughput of representative algorithms vs. size of cache
- improvement in the key effectiveness metric for the scenario ultimately decides the value of the cache
- sometimes non-speed metrics, such as battery life, are the key success indicator
- more popular indicators are throughput metrics such as requests/second, transactions/second, records/second etc.