Twelve years ago I gave a talk on “performance signatures” and submitted a paper on the subject. This paper was about various approaches to help give tactical performance guidance to people without having done a lot of measurement. The idea was that there are often common patterns worth considering. I also wrote some notes about design and metrics considerations at the same time that never saw the light of day (except those few that read them at the time). Here’s that set of notes from ~2006.

[Click for the cheat sheet]

Performance Personae — Code and Data

The following sections describe the personae and give some guidance as to possible issues and metrics. Naturally if there are important scenarios in your particular project that are not covered by the simple ones below then you should feel free to add those to make sure you have adequate coverage. These are intended to be motivational only — they do not provide a complete picture. To use the personae just apply one data persona to each class you are annotating and one code persona to each public method.

Code Personae

  • Startup
  • Throughput
  • Interactive

Data Personae

  • Document
  • Cache

Using Performance Personae

1. The “None” persona which indicates the API is not approved for use in any performance sensitive context. Unmarked methods would be assumed to be in the “None” persona.

2. Hotness/Coldness of basic blocks

  • a basic block with a “throw” in it is cold because its on an error path o a catch clause is “cold” for the same reason
  • other blocks could be annotated as cold (by for instance calling a static method that inlines to nothing)

For the following discussion we’ll assume that some satisfactory combination of the above has been done to enable things. With that in mind we end up with these rules:

  • Cold code can call anything regardless of other annotations, this rule trumps the others.
  • Hot code can never call methods marked “None.”
  • Any method may call a method marked Inner Loop
  • Methods marked Inner Loop may only call other Inner Loop methods
  • Methods marked Throughput may call other methods marked Throughput
  • Methods marked Interactive may call other methods marked Interactive or they may call Throughput methods when performing a “mode change” (e.g. the user just clicked on “Replace All” and the application is temporarily in a throughput mode)
  • Methods marked Startup may call other methods marked Startup, Throughput, or (as always) Inner Loop

This leads to the following diagram rendered in ASCII Art where each persona is allowed to call to a lower level in the diagram or stay in its own box on the hot path

/------------------------\
| | |
| Startup | Interactive |
| | |
|------------mode change-|
| |
| Throughput |
| |
|------------------------|
| |
| Inner Loop |
| |
\------------------------/

Code Persona: Inner Loop

General Issues and Guidance

  • path length (number of instructions) for the function should be minimized, speed is a key feature in this kind of code
  • cost per instruction should be minimized, where there are CPI problems there are likely to be benefits from measuring secondary contributors such as missed branch predictions, L2 cache misses, cache splits, and others
  • data locality is essential — since the algorithm is to be called repeatedly if it touches large regions of memory it will effectively render the cache contents useless for other purposes — being a good citizen in the cache is important
  • largely due to the previous point temporary allocations on the heap should be (virtually) non-existent, only memory needed for the actual result of the computation should be allocated if at all possible, otherwise the repeated loops will induce garbage collections and effectively age the entire heap more than is necessary -inner loop functions should not do this
  • the large number of iterations is likewise likely to make synchronization primitives prohibitively expensive, Inner loop code is generally written so that any locking needed was done at a higher level, the code itself will assume that the necessary locks have been taken

Metrics

  • cpu cost per interation (in cycles)
  • measure CPI to indicate that these cycles are effective (i.e. not a lot of “stalling”)
  • consult secondary CPI factors, especially cache hits, to explain poor costs and target fixes
  • allocations per iteration (in bytes)

Code Persona: Throughput

General Issues and Guidance

  • Limit growth of the GC heap — once initialized the necessary state should be established, if significant growth (especially non-linear growth) in the heap is necessary for continued processing you will likely find that the overall algorithm scales poorly with the size of the problem. Better algorithms exhibit little or no data growth with the size of the problem. Data should pass through the system as much as possible rather than accumulate. Memory leaks are totally fatal to this class of algorithm
  • Limit generation 2 collections — the rate of promotion to generation 2 while the algorithm runs should be as close to zero as possible, while temporary allocations often are critical in throughput applications they should be temporary, meaning as each chunk of data is processed (whatever the transaction or other unit of work is) those temporaries can be retired, this in turn will mean that most if not all garbage collections will be partial collects
  • Complex and/or redundant calculations, that is calculations that would need to be repeated in each unit of work should be (cached with suitable policy in the cache) so that work done on previous batches tends to simplify the work needed on subsequent similar batches • Scalability of the algorithm from the perspective of its ability to run on larger processors or larger processor farms is often an issue, throughput class algorithms often run on multiple processors running on multiple machines. This class of algorithm tends to define the rules for threading, scale up, and scale out
  • Scalability of the algorithm from the perspective of its ability to handle larger problem sets is frequently a consideration and interacts with scale up and scale out characteristics.
  • Throughput algorithms should rely on composing Loop type algorithms together and not much, if at all, on the other personae.

Metrics

  • Throughput of the activity — transactions per second, requests per second, records per second, bytes per second etc.
  • Secondary considerations such as latency, guarantees (with % probability) service time are often vital
  • Processor utilization (high utilization is generally indicative of good performance) , or, more generally, utilization of the gating resource for the algorithm (full disk use for instance, full network use, etc.)
  • Garbage collection statistics — low overall % time in GC, low promotion rates to generation 2, appropriate overall allocation volumes, and total heap size
  • Secondary large scale CPU usage indicators like contention, page faults, interrupts, etc. can indicate reasons why throughput isn’t what it could be
  • Primary CPU indicators like Cycles Per Instruction usually indicate a problem with the algorithms the Throughput code is using

Consider the context in which the throughput algorithm is to be used and make sure it is suitably constrained that it could run in its target environment. Measurements to confirm that peak memory, peak disk usage, etc. are compatible with the systems available and the other software running on those systems, are crucial to success.

Code Persona: Interactive

General Issues and Guidance

  • Small requests from the user should result in only small perturbations to objects to allow for brisk editing with limited/no collections
  • object visitations should proportional to the size of requested operations
  • processor cache should stays hot with data centric to the objects being edited
  • temporary allocations are permissible, but promotions to generation 2 should be frowned upon due to potential pausing
  • reducing temporary allocations generally so that objects age more slowly is generally a good thing
  • external dependencies should be limited
  • starting up a new subsystem or new major interactive dialog is really under the Startup persona
  • dynamic loading etc. should not allowed if we expect to be in an interactive mode
  • enable battery conservation modes by blocking as much as possible
  • background work should settle comparatively quickly
  • minimize the need to spin up the disk for any kind of edits

Metrics

  • application response time in various steps of the scenarios on representative hardware
  • processor utilization under load (going for low utilization)
  • resource utilization in general for any controlling resources such as disk/network etc. always wanting low usage
  • any unexpected disk i/o is likely a problem — validate with a tool like Resource Monitor
  • steady-state working set — touching little memory
  • maximize the amount of the application that can be swapped out after startup and stay swapped out
  • page file access data, force out pages and watch them come back in (or hopefully not come back in)
  • GC statistics, % time in GC, number of GC, and nil or nearly nil promotion rate to generation 2
  • any loading activity or anomalous code loading events

Code Persona: Startup

General Issues and Guidance

  • Code: every bit of code you run will have to be paged into the process, at least via soft-faults and potentially hard-faults, this is often the largest source of disk i/o and in turn the largest consumer of external resources
  • analysis of your startup path is your best line of defense, don’t put things on the startup path that do not need to be on the startup path
  • keep DLL dependencies to a minimum
  • do not force complex subsystems to initialize at startup if it can be avoided
  • Configuration: configuration information often comes from the registry or else some kind of config file
  • there are direct costs associated with getting the data from the file in question, remember even the registry is in a file
  • touching more initialization files makes it more difficult for the operating system to do its job scheduling the other i/os
  • keep things concentrated, don’t mix startup data with other data such that you have to read vast amounts of config just to get a few booleans or coordinates
  • avoid complex registry configuration (e.g. long search paths, complex trees)
  • there are indirect costs associated with loading the code that is going to be needed to read the configuration data, often this code is MUCH larger than the data itself, consider this when choosing your configuration medium
  • Remoting: discovery of services in other processes on the system can cause hefty transports to be loaded that could otherwise be avoided, use the simplest configuration that is reasonably possible
  • network based configuration or loads can add significant load time
  • network based module loading compounds any code issues and makes it all the more vital that dependencies are minimized
  • For interactive applications, the user of the application should feel like the application is doing useful work on his/her behalf as soon as possible

Metrics

  • Total disk usage set for startup (all disk i/o’s from everywhere including faults)
  • Working Set analysis especially to identify opportunities for sharing
  • Performance counters around disk and network i/o
  • Time on representative hardware (this is the bottom line after all)

Remember that end users machines will be running applications other than your test application and so if you are targeting (e.g.) a 4G machine that doesn’t mean you get all 4G — you should probably test on a 1G machine to simulate actual load, or otherwise add the expected applications to the scenario.

Data Persona: High Volume

General Issues and Guidance

  • A high volume of objects multiplies all per-object overheads
  • some of these overheads are inherent, things like the object headers and the method tables are unavoidable
  • other overheads are associated with how objects are padded and these can be controlled
  • even larger overheads are associated potentially optional fields which are always present in the object but contain nil values — these should be explored
  • collatoral storage associated with the object (such as arrays owned by the object) will likely need tuning, the default sizes may be too big or too small
  • choose storage-oriented data structures at the low level not abstraction oriented ones — that approach leads to more bytes to hold data and less to hold pointers
  • Almost any object type that is happening in high volume is likely to impact the garbage collector
  • overall allocation volume will increase frequency of collections and age other objects
  • rich linking between the various objects will increase the cost of tracing the objects when computing what is reachable
  • very large objects may land on the large object heap, if these are in high volume there may be fragmentation issues — reusable object sizing at this size becomes important
  • High volume objects may make good value-types depending on lifetime considerations
  • consider the objects mutability carefully when making these choices
  • immutable reference types may provide excellent threading and reuse
  • mutable types may help with lifetime issues and facilitate recycling
  • Observe the flow of these objects into the various generations — reducing each type of creation/promotion is a good objective
  • consider the persona of the code that will typically use these objects, are allocations friendly to that code persona?
  • See the Document Persona rules if these structures participate in an overall structure that is actually one document
  • See the Cache Persona rules if these structures participate in an overall structure that is actually one cache

Metrics

  • overhead (bytes) per object
  • cache lines touched per operation (various operations)
  • garbage collector metrics — % time In GC and promotion rates under various representative scenarios

Data Persona: Document

General Issues and Guidance

  • incremental operations common, operations generally mutate the document
  • locality of the document pieces is vital to success, operations on the document should touch memory proportional to the complexity of the operation
  • notebly most of the basic incremental operations should touch a minimum of memory
  • to get the best GC performance similarly aged objects should be involved in the logical operations, so do not frequently cause old objects to point to new as this is bad for the GC
  • documents typically define the “transaction”, or “unit of work”, whatever that may be — this is what is isolate, what participates in undo, what can be committed to disk, etc.
  • the document will define its threading model (if it has one) and it should be very clear what that model is
  • how will locks behaving internally and externally?
  • what combinations of reads and writes are permissible without deadlocking?
  • what is the isolation policy? (i.e. what changes are visible across threads and when do they become visible?)
  • what kind of transactions (batch commit) are supported if any?
  • consider searching as a key scenario for document data, documents must also consider things like “background printing” and “background saving”
  • Since our presumption is that the data is precious we must guard ourselves against an assortment of failure cases — each of these factor into choices about the unit of work and hence the overall performance
  • how do we handle out of memory cases?
  • how do we handle disk full?
  • how do we handle a power failure?

Metrics

  • Raw size to represent typical documents (bytes)
  • Waste bytes (pointer overheads etc.) for typical documents (what percentage of the storage is for the Actual Document rather than overhead)
  • How much overhead (bytes) is associated with typical “transactions” while they are in flight?

Data Persona: Cache

General Issues and Guidance

  • cache policy is the key decision
  • put the policy decisions at the right level of abstraction
  • choose policies whose goal is to minimize resource usage which maximizing cache hits
  • most times ‘weak references’ as part of your caches indicate a problem
  • policy should drive discard choices
  • object lifetime is usually a bad policy choice so GC driven caches tend to fail
  • if the cache starts to feel like a document it is likely not doing its job
  • cache policies should include right-sizing logic to create an effective cache
  • hit rates and other metrics should be readily observable for important caches
  • cache policies often incorporate dynamic feedback to help with self tuning
  • when using dynamic feedback be sure to consider the overall effectiveness of the system and not the local effectiveness of the cache — do not maximize hits, maximize overall throughput
  • memory tied up in a cache is memory you can’t use for your primary needs
  • caches can simplify isolation and threading issues
  • if your cache policies are such that your application is liberated from complicated threading or isolation requirements by first checking a cache with a thread-friendly access pattern you can achieve significant savings both in performance and complexity
  • underlying transacted stores can have simpler rules if high performance access is typically satisfied by lock-free (or lock-reduced) caches
  • caches are not an end unto themselves
  • if the underlying document can be organized such that a cache is not needed or adds little value that is a good thing

Metrics

  • Typical cache sizes (in representative cases)
  • report savings vs. the un-cached approach Hit rate vs. size of cache (bytes)
  • correlate savings from hits to cache size so that you can find the sweet spot and verify that your policy is putting you close to that spot
  • Throughput of representative algorithms vs. size of cache
  • improvement in the key effectiveness metric for the scenario ultimately decides the value of the cache
  • sometimes non-speed metrics, such as battery life, are the key success indicator
  • more popular indicators are throughput metrics such as requests/second, transactions/second, records/second etc.

Written by

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store