A brief word about some common metrics

Rico Mariani
7 min read · Jan 26, 2018

[originally published 8/21/2017]

It’s not easy to figure out exactly what the various Mach-O (haha) metrics mean… To be sure that things meant what I thought they meant, I wrote this test program. I tested on my Mac, but I’m fairly sure the results would be similar for iOS.

The metrics we’re studying are:

From getrusage

  • Maximum Resident Set, Minor Faults, Major Faults, System Time, User Time

From task_info

  • Copy on Write Faults, Virtual Bytes, Resident Bytes

Textbook Meanings

Maximum Resident Set

  • The highest number of physical bytes ever used by your process (note you can ask for different objects, but in this case I’m studying RUSAGE_SELF)
  • Note that some of your bytes might be swapped out to make room for new allocations; as a consequence, it is possible to allocate and even modify new bytes without incurring any additional change here
  • Why I care: an excellent metric to assess the memory pressure your application is adding to the system

Minor Faults

  • Page faults that did not require i/o to resolve. This includes demand-zero faults and copy-on-write faults, as we’ll see below.
  • A demand-zero fault occurs the first time you modify a freshly allocated page. The page contained all zeros and was shared with other all-zero pages, as this is a common situation. When you modify the page, it forces a physical page to be allocated for that use. This is the demand-zero fault.
  • Why I care: each fault costs CPU to field (but not i/o); they tell you what you’re bringing in from the disk cache, plus demand-zero faults. After faulting you will likely take CPU cache misses too.

Major Faults

  • A page fault that required i/o to resolve. This includes loading code for the first time, loading constant data for the first time, or restoring a page that was swapped out to try to re-use physical memory.
  • Why I care: All the penalties of minor faults PLUS i/o (there must have been i/o or it would have been minor). Major faults can grind your app to a halt as you wait for the i/os to resolve.

System Time

  • Measured in CPU-seconds, this is the amount of time that a thread was running any kind of non-user mode code on behalf of the measured process
  • This could be more or less than wall-clock time: if there is much waiting it is likely to be less; if there is little waiting and multiple threads running, it is likely to be more.
  • Why I care: many of the most costly things a process does are done by the kernel on behalf of the application. Those things burn CPU as well, and this lets you track them.

User Time

  • As above, but for user-mode code, i.e. the code you wrote. This includes runtime libraries that run in user mode… which is nearly every library.
  • Why I care: this is CPU you control fairly directly, it’s a leading indicator of many things gone wrong, including faults, and a primary driver of battery usage.

Copy on Write Faults

A copy on write fault occurs when a data page that was initialized to some constant data is modified. There are two reasons this can reasonably happen:

  • You alter an initialized variable: with int global = 5; you set global = 6.
  • At startup, an initialized pointer like char *string = "String"; requires rebasing so that it points to the text of "String". This looks just the same as global = 6; to the VM system. These fix-ups are applied before main.
  • Why I care: COW faults that happen at startup are especially painful because they slow down the startup experience. Incremental COW faults that happen as your app runs, on a pay-for-play basis, are no worse/better than other minor faults.

Virtual Bytes

  • The total of all virtual space reservations made, whether or not the code/data is resident. Virtual bytes can go down if address space is released. Swapping/faulting does not affect this number. Note this is a very large number, as it includes all shared libraries your program may reference, whether they are used or not. Typically this is much larger than the resident set.
  • Why I care: if your system is under pressure and your resident set keeps getting trimmed, leading to lots of faults, but no growth in max resident set size, you can see who is to blame by watching for processes whose virtual bytes are growing.

Resident Bytes

  • The actual number of bytes allocated to your process (or the measured thing) that are stored in physical memory. Obviously less than or equal to Max Resident Set by definition. The operating system is likely to try to “trim” your process to recover physical memory if you don’t seem to be using it. This results in Resident Bytes being lower than Max. The fact that this is going on makes Resident Bytes a tricky metric to interpret.
  • Why I care: see Max Resident Set and add this: it’s possible to give back bytes with deallocation, so if you’re watching the resident set over long periods of time, Max Resident Set will become useless after one bad moment.

Program output and interpretation

Pre-main approximation
maxrss 437, min 711, maj 3, cow 90, virt 618436, res 440, usr = 0.001161, sys = 0.001242

There are 90 COW faults before main. The virtual memory printed is in pages, so that’s 618,000 pages (!) of virtual allocation. Huge… massive. The resident set is 440 pages. There are very few major faults, so it looks like lots of disk cache hits for the bits we needed. Do not ask me how maxrss is < the current resident set.

If we run this again with the #if set to 1, so that there are lots more references to bigdata, then we get this output.

Premain approximation
maxrss 437, min 720, maj 4, cow 98, virt 618436, res 440, usr = 0.000953, sys = 0.001234

Note that copy-on-write faults went up to 98 to apply all the fixups necessary to initialize the array. Minor faults also went up. There was quite a bit more initialized data, so there was an extra major fault as well. Don’t read too much into that: I ran the program several times, so the data should be mostly in the disk cache, hence major faults will be low. So copy-on-write faults count as minor faults. Sorry the accounting isn’t quite perfect; it’s hard to get it perfectly stable.

From here I’ll use the results with the #if set to 0. The other option was just to illustrate faults before main.

Regular alloc phase
maxrss 450, min 724, maj 3, cow 90, virt 623557, res 450, usr = 0.001193, sys = 0.001267
...
maxrss 450, min 724, maj 3, cow 90, virt 669637, res 450, usr = 0.001212, sys = 0.001308

During this phase we make 10 allocations of 20 megs each. Note that maxrss doesn’t go up. That’s because these pages are all zeros, so they’re all sharing the same memory. The other fault counts stay constant as well. Virtual memory usage does go up, though, as can be seen.

Demand zero phase
maxrss 5571, min 5845, maj 3, cow 90, virt 674757, res 5571, usr = 0.006330, sys = 0.007717
...
maxrss 51651, min 51925, maj 3, cow 90, virt 720837, res 51651, usr = 0.052957, sys = 0.079686

During this phase we allocate 20 meg chunks and also zero-fill them. Big difference: maxrss now clearly goes up. Minor faults go up, so that means demand-zero faults are minor faults. Copy-on-write faults are not going up. And of course virtual memory is going up.

Data writing phase
maxrss 56771, min 52180, maj 4868, cow 5209, virt 720837, res 56771, usr = 0.057620, sys = 0.100440

In the data writing phase, we clobber 20 megabytes of initialized data. This causes copy on write faults. Note that in this case there were also major faults for all that data because the 20M of zero-filled initialized data (even though I only initialized a few bytes of it explicitly) was not in the disk cache. So major faults.

BSS writing phase
maxrss 61891, min 57300, maj 4868, cow 5209, virt 720837, res 61891, usr = 0.061720, sys = 0.106610

For historical reasons, data that is by-default-zero-initialized such as int global; is called BSS. The reason for such names tends to be things like “That’s what the segment name was on the PDP10 and we just kept using it years later.” But I don’t know the origin of this one. I looked once…

Anyway, when we modify that data there are no copy on write faults. This data is behaving exactly like allocated memory: it’s getting demand zero faults. That’s a good thing.

Big alloc phase 
maxrss 1120682, min 2105331, maj 4868, cow 5209, virt 2768852, res 1115007, usr = 2.105623, sys = 3.189343

The final phase allocates buckets of memory and fills them with zeros. This forces many, many faults and puts tremendous memory pressure on the system. Enough to finally create a visible difference between resident bytes and maximum resident set. Note that both of those are scaled down to 4k pages in the output.

And with that very large operation you can finally see a big CPU consumption situation. Note that there is more system time than user time for this job; the faults carry a big cost.

Appendix: Source Code

#include <sys/resource.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <mach/mach.h>
#include <mach/task_info.h>

void printmetrics();

char bigdata[20<<20] = "This is a test";
char bigbss[20<<20]; // zero init
extern char *lotsaPointers[];

struct sysmetrics {
    struct rusage rusage;
    struct task_basic_info basic;
    struct task_events_info event;
};

static void getsysmetrics(struct sysmetrics *metrics)
{
    getrusage(RUSAGE_SELF, &metrics->rusage);
    mach_port_t task = mach_task_self();
    mach_msg_type_number_t tcnt;
    tcnt = TASK_BASIC_INFO_COUNT;
    task_info(task, TASK_BASIC_INFO, (task_info_t)&metrics->basic, &tcnt);
    tcnt = TASK_EVENTS_INFO_COUNT;
    task_info(task, TASK_EVENTS_INFO, (task_info_t)&metrics->event, &tcnt);
}

int main()
{
    int i;
    printf("Pre-main approximation\n");
    printmetrics();

    printf("Regular alloc phase\n");
    for (i = 0; i < 10; i++) {
        void *m = malloc(20 << 20);
        printmetrics();
    }

    printf("Demand zero phase\n");
    for (i = 0; i < 10; i++) {
        void *m = malloc(20 << 20);
        memset(m, 0, 20 << 20);
        printmetrics();
    }

    printf("Data writing phase\n");
    memset(bigdata, 0, sizeof(bigdata));
    printmetrics();

    printf("BSS writing phase\n");
    memset(bigbss, 0, sizeof(bigbss));
    printmetrics();

    printf("Big alloc phase\n");
    for (i = 0; i < 400; i++) {
        void *m = malloc(20 << 20);
        memset(m, 0, 20 << 20);
    }
    printmetrics();

    // prevent linker from dead eliminating the data
    memcpy(lotsaPointers[0], lotsaPointers[1], 1);
    return 0;
}

void printmetrics()
{
    struct sysmetrics m;
    getsysmetrics(&m);
    printf(
        "maxrss %ld, min %ld, maj %ld, "
        "cow %d, virt %ld, res %ld, "
        "usr = %f, sys = %f\n",
        m.rusage.ru_maxrss / 4096,
        m.rusage.ru_minflt,
        m.rusage.ru_majflt,
        m.event.cow_faults,
        m.basic.virtual_size / 4096,
        m.basic.resident_size / 4096,
        m.rusage.ru_utime.tv_sec + m.rusage.ru_utime.tv_usec / 1000000.0,
        m.rusage.ru_stime.tv_sec + m.rusage.ru_stime.tv_usec / 1000000.0);
}

char *lotsaPointers[] =
{
    bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata,
#if 1
    bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata, bigdata,
    // ... many thousand rows of the same
#endif
};


Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.