Considerations for Big Virtual Memory Pages

Rico Mariani
Oct 3, 2023


A few days ago, I wrote a couple of articles about Virtual Memory (as before, I will use VM in this article to mean Virtual Memory, not Virtual Machines).

A lot of people jumped in and got their creative juices flowing on the various ideas mentioned there, and a popular one was “Why not just use 2M pages for everything on x64 systems?”

The general argument goes something like this: “We built systems with 4k pages when we had 4M of memory or less, on systems with slow disks. Now we have 4G or more of memory, on systems with fast disks. Surely bigger pages make sense?”

When this idea was suggested, I thought it was interesting. But I didn’t leap to “OMG let’s do this right away”; instead, I’m a bit more cautious. The purpose of this article is to explain what some of the considerations are for such a thing, because I think they are interesting and educational. Notwithstanding that I’m going to cite a bunch of concerns, this still might be a great idea. But, in this article, I really do just want to get your VM thoughts stimulated.

Purpose of the VM System

This will be important as we analyze the possible consequences of going to 2M pages universally. The purpose is two-fold, if you will:

  1. Allow programs to use more memory than actually exists, by virtualization, and
  2. Allow programs to run using less memory than they demanded, again by virtualization.

The first point is probably the one people think about the most. The system can pretend it has more memory than it really does. But it sort of presumes we’re paging already, and that is likely to be a bad situation. I actually think the second item is more important — it keeps your system running well.

What do we mean by use less memory? We mean that, for instance, even though you loaded in some large library (e.g., shell32.dll is 8,587,752 bytes) the likelihood that you need ALL of shell32.dll is very low indeed. You probably need only a tiny slice of it that does “this one weird thing”.

Operating Systems are frequently in the business of not loading the stuff you asked for and the VM system is the key conspirator in these shenanigans. Your virtual memory is a lie by design.
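
To make the lie concrete, here is a minimal sketch (POSIX flavored; the file path is just a placeholder) that maps a big library, touches one byte, and then asks the kernel how much of the mapping is actually resident. Typically the answer is a handful of 4k pages, not the whole file.

```c
/* Sketch only: map a big file, touch one byte, count resident pages. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    /* placeholder: any large shared library or data file will do */
    const char *path = "/usr/lib/libbig-example.so";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    long page = sysconf(_SC_PAGESIZE);   /* 4096 on typical x64 setups */

    /* The whole file gets address space, but no physical memory yet. */
    unsigned char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    volatile unsigned char b = base[0];  /* fault in (roughly) one page */
    (void)b;

    /* Ask the kernel which pages of the mapping are actually resident. */
    size_t pages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);
    mincore(base, st.st_size, vec);

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++) resident += vec[i] & 1;
    printf("mapped %zu pages, resident %zu (readahead may add a few)\n",
           pages, resident);
    return 0;
}
```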

So, keeping these things in mind, let’s consider the 2M pages from various points of view.

Selected Consequences of 2M Pages

Reduced ability to reclaim memory for unnecessary code and data.

If all pages are 2M, then if even one byte is used in any given 2M page, the whole 2M has to stay. It seems like this would happen a lot. You are likely to significantly lose the ability to swap out unneeded code.

Companies like Microsoft spend an insane amount of engineering effort to localize code (e.g., moving all the hot code to the front of the TEXT segment), even going so far as to split functions into hot and cold parts, in an effort to minimize memory waste. You want to use all 4k of every page. But even with these efforts there are always stragglers that are hard or impossible to remove. You could easily double the resident size of, say, shell32.dll with bigger pages, and this would likely be widespread.
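
For a flavor of what that localization looks like at the source level, here is a tiny sketch using GCC/Clang hot and cold attributes; the real tooling (profile-guided layout and function splitting) is far more elaborate than this.

```c
/* Sketch of code localization via compiler hints (GCC/Clang attributes). */
#include <stdio.h>

/* Grouped with other hot code (placed in .text.hot when function
   reordering is enabled), so the hot pages stay dense. */
__attribute__((hot)) static long sum(const long *v, long n) {
    long total = 0;
    for (long i = 0; i < n; i++) total += v[i];
    return total;
}

/* Pushed out to .text.unlikely so it does not pollute the hot pages. */
__attribute__((cold)) static void report_overflow(void) {
    fputs("overflow handling: rarely executed, rarely resident\n", stderr);
}

int main(void) {
    long v[4] = {1, 2, 3, 4};
    long t = sum(v, 4);
    if (t < 0) report_overflow();   /* the unlikely path */
    printf("%ld\n", t);
    return 0;
}
```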

So maybe you’re doubling resident code size? How bad is that? This is a thing to investigate.

Reduced TLB misses, hence reduced average cycles per instruction.

Any kind of Cycles Per Instruction (CPI) reduction is straight up raw speed. In some workloads TLB costs can be around 10% of all execution time, often less, but sometimes even more than 10%. So, this is not a small win.
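
The win comes from TLB reach. With an assumed 1536-entry TLB (real TLBs are split by level and by page size, so treat this purely as illustration) the covered address range changes dramatically:

```c
/* Rough TLB-reach math with an assumed entry count. */
#include <stdio.h>

int main(void) {
    const long long entries = 1536;   /* assumption: a typical L2 TLB size */
    const long long p4k = 4096, p2m = 2 * 1024 * 1024;

    printf("4k pages: TLB covers %lld MB\n", entries * p4k / (1024 * 1024));  /*    6 MB */
    printf("2M pages: TLB covers %lld MB\n", entries * p2m / (1024 * 1024));  /* 3072 MB */
    return 0;
}
```

That is 6MB of reach versus 3GB, under that assumed entry count; how much of it turns into a CPI win depends on the workload’s access pattern.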

Increased page faults (in bytes) and disk I/O generally.

Reducing the amount of memory that can be reclaimed for general use means that when there are competing programs each will be under more memory pressure. This means each is more likely to take page faults for its own unique memory usage. The more heterogeneous the workload is, the more likely this is to occur.

Further, reducing available memory will reduce the size of the disk cache. In any modern operating system there is (mostly) no such thing as truly free memory. Other than a modest pool that is truly immediately available for use, any available memory will be repurposed for other important things, primarily disk cache. Memory that is doing nothing is wasted so we don’t want a lot of that. Disk cache pages can be (and will be) repurposed for other uses on demand.
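
If you want to see this on a Linux box, the gap between “free” and “available” memory tells the story; a quick sketch:

```c
/* Print the lines of /proc/meminfo that show how little memory is truly
   free versus how much is reclaimable cache (Linux-specific). */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "MemFree:", 8) == 0 ||
            strncmp(line, "MemAvailable:", 13) == 0 ||
            strncmp(line, "Cached:", 7) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```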

So, you can expect your disk cache to be less effective if there are 2M pages. And you can expect that total fault count is reduced but total faulted bytes might actually increase.

Increased swap throughput.

The marginal cost of swapping one byte goes down when you are reading and writing in 2M chunks. Basically, every disk type in existence is better at sequential reads. More in-order disk operations are a good thing.

Of course, the benefit here could be marginalized with clever disk layout and by choosing to read/write smaller pages in bigger chunks. For instance, an OS could choose to always page in 64k units. Such an OS could get most of the I/O benefits of large pages while keeping the flexibility of smaller pages.

In fact, it could choose to do 2M disk operations as a rule even with 4k pages. It could further optimize by reading/writing only the dirty range of the 2M chunk; keeping to one I/O operation that is ≤ 2M could be very interesting. Less I/O is better.

So… do you really need 2M pages to get the benefits of 2M I/O operations? Maybe not?
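
Here is a sketch of that fusion idea: when one 4k page is wanted, read the whole 2M-aligned cluster it lives in with a single operation. The file name and the trivial “pager” are placeholders; a real pager is vastly more involved.

```c
/* Sketch of "2M I/O without 2M pages": fuse the read for one 4k page
   into a single read of its 2M-aligned cluster. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_4K  4096
#define CLUSTER  (2 * 1024 * 1024)

/* Read the 2M cluster containing 'offset' from the backing file in one I/O. */
static ssize_t read_cluster(int fd, off_t offset, void *buf) {
    off_t cluster_start = offset & ~(off_t)(CLUSTER - 1);  /* align down to 2M */
    return pread(fd, buf, CLUSTER, cluster_start);          /* one big sequential read */
}

int main(void) {
    int fd = open("pagefile.bin", O_RDONLY);   /* placeholder backing store */
    if (fd < 0) { perror("open"); return 1; }

    void *buf = malloc(CLUSTER);
    off_t fault_offset = 5 * PAGE_4K + 123;    /* pretend this 4k page faulted */

    ssize_t got = read_cluster(fd, fault_offset, buf);
    printf("faulted at %lld, read %zd bytes in one operation\n",
           (long long)fault_offset, got);

    free(buf);
    close(fd);
    return 0;
}
```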

Increased swap latency and variability per swap operation

Bigger pages will take longer to read/write on average. We still expect more throughput, but latency also matters. Bigger operations will have a larger mean and can reasonably be expected to also have a larger standard deviation. This will result in less consistency of experience when there is swapping.

An argument could be made that by the time you are swapping any meaningful amount things are already horrible. But the counter argument to that is that with bigger pages the swapping will happen sooner and be worse when it starts.

Or will it? Will the throughput wins be enough to overcome the latency hit? The cost per byte should be better, right?
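
Some back-of-envelope math, with every constant invented but plausible for a fast SSD, shows the shape of the trade: latency per operation gets worse, cost per byte gets much better.

```c
/* Made-up but plausible numbers: 4k vs 2M page-in on a fast SSD. */
#include <stdio.h>

int main(void) {
    const double fixed_us = 80.0;            /* assumed per-op overhead (queue + device) */
    const double gbps     = 3.0;             /* assumed sequential throughput, GB/s */

    const double bytes_4k = 4096.0;
    const double bytes_2m = 2.0 * 1024 * 1024;

    double t_4k = fixed_us + bytes_4k / (gbps * 1000.0);  /* microseconds per op */
    double t_2m = fixed_us + bytes_2m / (gbps * 1000.0);

    printf("4k: %6.1f us/op, %8.3f us per KB\n", t_4k, t_4k / (bytes_4k / 1024));
    printf("2M: %6.1f us/op, %8.3f us per KB\n", t_2m, t_2m / (bytes_2m / 1024));
    return 0;
}
```

Under these invented numbers the 2M operation costs roughly ten times the latency per operation but is roughly fifty times cheaper per byte.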

Spinning media adds new considerations to everything.

With SSDs we haven’t had to think about seek time as much (or at all). But those old-school disks are still out there. If your system has one, then you might appreciate fewer disk operations even more. The big reads mean far fewer seeks. This should be good, right?

Well, maybe, but if there are other disk operations interleaved with memory access, things could go badly. The disk cache was already handicapped by increased memory retention, so any reads that are a bit far afield are less likely to be cached. So more ad hoc reads/writes? And they interleave with VM reads/writes?

Adding to this, as we saw before, you don’t actually need 2M pages to decide to do 2M page file disk operations. You might get most of the wins just by doing 2M chunk disk layout and fusing read/write operations. Kind of like when you overpaint one bigger rectangle rather than draw two or three smaller rectangles that are a better fit. You can choose to merge.

If we think about the disk queue, we’ll find that overall throughput is up, but so is average service time per request, which means other things in the queue will be waiting longer. But, on the other hand, average queue length is likely to be shorter. But on the other, other hand, variability is up, which also increases queue length with no corresponding throughput improvement on that axis. Phew.

In the spinning media case, latency is likely to drive a lot of these choices. Any increase in disk queue length can cause massive additional latency hits because the latency is roughly average queue length times average service time.
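
A similarly invented example for spinning media (assuming roughly 10ms of seek plus rotation and 150MB/s of transfer):

```c
/* Back-of-envelope queueing, all numbers invented: waiting behind VM traffic
   costs roughly (requests ahead of you) x (average service time per request). */
#include <stdio.h>

int main(void) {
    const double seek_ms  = 10.0;    /* assumed seek + rotational latency */
    const double mb_per_s = 150.0;   /* assumed sequential transfer rate  */

    double svc_4k = seek_ms + (4.0 / 1024.0) / mb_per_s * 1000.0;  /* ~10 ms */
    double svc_2m = seek_ms + 2.0 / mb_per_s * 1000.0;             /* ~23 ms */

    double queue_len = 4.0;          /* assumed average requests ahead of you */

    printf("4k requests: ~%.1f ms each, ~%.0f ms waiting behind %g of them\n",
           svc_4k, queue_len * svc_4k, queue_len);
    printf("2M requests: ~%.1f ms each, ~%.0f ms waiting behind %g of them\n",
           svc_2m, queue_len * svc_2m, queue_len);
    return 0;
}
```

Under these invented numbers a request stuck behind four VM requests waits roughly 40ms in the 4k world and roughly 93ms in the 2M world.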

Significant improvement for machines with one workload

The most likely benefit comes from cases where there is one workload dominating the machine’s resources, or a few similar workloads. Here the TLB savings would not be offset so much by extra memory pressure. Keeping all of the one workload resident is likely to provide significant benefits.

The more heterogeneous the workload is, the more the big pages hurt your ability to manage memory frugally and therefore cause lost performance due to page faults.

Are these even all of the considerations?

Oh hell no. But I have to stop somewhere, and I think I hit the big ones.

Conclusion?

I didn’t intend to make one, and I won’t. This stuff is making me feel like Vizzini: “The memory cost might be greater so I certainly cannot choose the cup in front of me!” [go watch The Princess Bride].

To get answers to these kinds of concerns a bunch of experiments would be required. Could you just change how you do I/O? Could you just fuse pages opportunistically? Do you have to go hit Intel over the head until they give you 8k or 16k pages or something? Is the TLB going to save the day with compression anyway?

This is where one of my most quotable quotes comes in handy.

“If you’re really good at software performance engineering you’re only wrong about 95% of the time. If you’re bad, it’s worse.”

Frankly, this one seems super complicated to me.

I think if you got to the right answer, or the best compromise at least, in less than 20 attempts (95% wrong) that would be a miracle.

I’m sure that at least some of my guesses as to how things might play out are wrong. Hopefully the considerations are at least mostly right.

If you ever wondered what your favorite developers that work on VM systems are thinking about on any given day… a lot of it is stuff like the above.
