Some VM follow-up information
I kept wanting to write postscripts to my previous article, but that soon gets ridiculous, so here are some follow-ups in the form of Q&A.
Why mention the utility of odd page sizes that are supported basically nowhere?
I have friends that work on CPUs. Some of them might be listening.
Why not use big pages just for code or other things that don’t grow?
The VM subsystem is actually primarily (IMO) in the business of giving you the memory you need, not the memory you asked for. This usually works out well because programs usually ask for more memory than they need. We want programs to be frugal and play nice with others. For instance, many programs use some shared library but then only use a tiny slice of it. Putting it all into physical memory up-front would be wasteful. Putting the whole TEXT section (where the code lives) into one big page would make it impossible to swap out (small) parts of it. This basically translates to internal fragmentation of physical memory because effectively there’s stuff you don’t need in there right beside the stuff you do need.
Further, if the memory were swapped out, you would need to find a contiguous 2M chunk of physical memory to swap it back in. If all pages were 2M then there would be no trouble finding such chunks, but then you get serious internal fragmentation.
Note: sometimes the goal of being frugal is at odds with “All your bytes are belong to me. Give me now.” That was not a typo. :D
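If you want to see this frugality in action, here’s a minimal sketch (Linux-specific, assuming mmap and mincore are available) that reserves a big mapping as a stand-in for a large shared library, touches exactly one page, and then asks the kernel how many pages are actually resident:

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 256 * page;   /* stand-in for a big shared library */

    /* Reserve address space; no physical memory is committed yet. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    p[10 * page] = 1;           /* touch exactly one page */

    /* Ask which of the 256 pages are resident in physical memory. */
    unsigned char vec[256];
    if (mincore(p, len, vec) != 0) { perror("mincore"); return 1; }

    int resident = 0;
    for (int i = 0; i < 256; i++) resident += vec[i] & 1;
    printf("%d of 256 pages resident\n", resident);

    munmap(p, len);
    return 0;
}
```

On a typical Linux box this reports 1 resident page: the other 255 cost you nothing physical until touched, which is exactly the frugality one giant page would throw away.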
What about big pages for static data?
The issues are the same as with code. Plus, you would be burning a lot of space in the page file because the pages are not sharable.
Some operating systems have a user-mode loader (e.g., iOS) and in those cases basically ALL of your data section is going to be faulted in at startup because there’s always at least one fixup on any given page. This does have some advantages: preloading all the data in order gives some goodness. On iOS it works well for small applications and it’s super painful for big ones. And you still have the option of paging out the data in small parts if the pages are small.
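To make “fixup” concrete, here’s a hedged sketch (the names are mine, invented for illustration) of the kind of slot a user-mode loader has to patch:

```c
#include <stdio.h>

/* A global that holds the address of another global.  The image on
   disk can only store a placeholder, because the real address isn't
   known until the image is loaded somewhere.  Writing the final
   address into this slot faults in and dirties the page holding
   pointer_to_value, even if nothing else on that page is ever used. */
int value = 42;
int *pointer_to_value = &value;   /* this initializer is a fixup */

int main(void) {
    printf("*pointer_to_value = %d\n", *pointer_to_value);
    return 0;
}
```

One such pointer per page is all it takes to fault the whole data section in at startup.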
What about unifying pages just in the page table?
I really like the idea of unifying the mappings in the page table and nowhere else. Keep in mind that Intel processors are not presently flexible enough with their paging strategy to actually do this. A better page table representation would not be compatible, but in principle it could be done. And maybe ARM and RISC-V have more flexibility in this area; I can’t say for sure.
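As a thought experiment only, a more flexible entry might carry a span hint so the hardware walker could treat an aligned run of small pages as one mapping. This layout is entirely hypothetical, my own invention for illustration, not any shipping format:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical page-table entry with a "span" field: the entry is part
   of a naturally aligned run of 2^span contiguous pages, so a walker
   could install one translation for the whole run. */
typedef struct {
    uint64_t pfn   : 40;  /* physical frame number */
    uint64_t span  : 4;   /* run covers 2^span pages (0 = single page) */
    uint64_t valid : 1;
    uint64_t write : 1;
    uint64_t user  : 1;
    uint64_t _rsvd : 17;  /* pad the entry out to 64 bits */
} FlexiblePte;

int main(void) {
    /* span = 9 means 2^9 x 4K = 2M covered by same-shaped entries. */
    FlexiblePte pte = { .pfn = 0x12345, .span = 9, .valid = 1, .write = 1 };
    printf("entry is %zu bytes; covers a run of %u pages\n",
           sizeof pte, 1u << pte.span);
    return 0;
}
```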
What about unifying in just the TLB?
I think this is a slick idea. If the TLB unified even just pairs of pages, it could effectively compress its own storage, thereby doubling its utility for very little marginal cost.
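Here’s a toy model of pair-coalescing, with structure and names I’m making up just to show the trick: one entry serves two adjacent virtual pages whenever their physical frames happen to be adjacent too:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Toy sketch (not a real TLB design): a coalesced entry that covers a
   pair of adjacent 4K virtual pages when their physical frames are
   adjacent as well.  One entry, two translations. */
typedef struct {
    uint64_t vpn;        /* virtual page number of the first of the pair */
    uint64_t pfn;        /* physical frame number it maps to */
    bool     pair_valid; /* true if vpn+1 also maps to pfn+1 */
} CoalescedTlbEntry;

/* Translate a virtual page number, if this entry covers it. */
bool tlb_lookup(const CoalescedTlbEntry *e, uint64_t vpn, uint64_t *pfn_out) {
    if (vpn == e->vpn) {
        *pfn_out = e->pfn;
        return true;
    }
    if (e->pair_valid && vpn == e->vpn + 1) {
        *pfn_out = e->pfn + 1;  /* the pair is physically contiguous */
        return true;
    }
    return false;
}

int main(void) {
    CoalescedTlbEntry e = { .vpn = 0x1000, .pfn = 0x5000, .pair_valid = true };
    uint64_t pfn;
    if (tlb_lookup(&e, 0x1001, &pfn))
        printf("vpn 0x1001 -> pfn 0x%llx (via the coalesced entry)\n",
               (unsigned long long)pfn);
    return 0;
}
```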
Sounds good, right?
Yeah, well, if you think the L2 cache is on the hot path, that’s nothing compared to the TLB. The TLB is basically constantly active, so you want the thing to be as simple as possible. It might be easier to actually double its memory rather than add this extra decoding logic. I can’t say.
What about reshuffling physical memory so that you can get 2M when you need it?
I mentioned this briefly in the original article. Doing this fully is kind of like making a garbage collector for physical memory. There could be a lot of sliding needed to do this operation. The worst thing about this is that you might have to stop processes other than the one taking the fault in order to adjust their page tables for the new physical layout, because the memory of different processes is all mixed up in physical memory.
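Here’s a toy user-space simulation of that garbage-collector flavor, with made-up structures; real compaction would also copy each 4K of data and patch every page table that references a moved frame, which is the stop-the-world part:

```c
#include <stdio.h>

#define NFRAMES 4096
#define RUN      512   /* 2M / 4K: frames needed for one huge page */

static char used[NFRAMES];   /* 1 = frame in use, 0 = free */

/* Slide used frames toward low memory so a contiguous free run opens
   up at the top.  In a real kernel each move copies a 4K frame and
   remaps the owning process's page-table entry. */
static int compact_for_huge_page(void) {
    int dst = 0;
    for (int src = 0; src < NFRAMES; src++) {
        if (used[src]) {
            if (src != dst) {
                used[dst] = 1;   /* "move" the frame contents down */
                used[src] = 0;   /* old slot becomes free */
            }
            dst++;
        }
    }
    /* A real allocator would also need the run to be 2M-aligned. */
    return (NFRAMES - dst >= RUN) ? dst : -1;
}

int main(void) {
    /* Scatter allocations so no contiguous 512-frame run exists. */
    for (int i = 0; i < NFRAMES; i += 2) used[i] = 1;
    int base = compact_for_huge_page();
    if (base >= 0)
        printf("contiguous run available starting at frame %d\n", base);
    return 0;
}
```

Even this toy version has to move half of memory to open up one run; now imagine doing it underneath live page tables.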
Now if all the pages are 2M (or any supported fixed size, really) you don’t have this problem, but then you have much bigger paging latencies and much more internal fragmentation, which bring their own problems.
But, for instance, 16k pages might be just the ticket to make things better on 2023 hardware. Not happening on Intel.
So, what, it’s hopeless?
Compared to the miracles done in silicon for things like vectorization, internal register renaming, hyperthreading, x64 mode in general (I could go on…), this does not seem like an intractable problem. But this stuff doesn’t happen fast.
With the situation being what it is on Intel, I’m getting a real sense of why the compromises that exist were made. But they won’t always make sense.
It’s amazing how intertwined VM goals are with page-size choice:
- reducing physical memory usage as much as possible,
- using available memory as a disk cache,
- quickly paging in arbitrarily relocated code with fixups in it,
- lazily loading static code and data,
- allocating large blocks of memory and using them immediately,
- allocating large blocks of memory and paying for them lazily,
- minimizing TLB misses (ITLB and DTLB as appropriate),
- offering predictable page fault times,
- minimizing the noisy-neighbor effect on your own processes,
- and, lest we forget, actually running on hardware that exists today.
That’s probably enough for now.