[I wrote this on 7/25/2017 after doing some research. I think it’s generally useful because, although the equivalent info is readily available in Apple videos, this is hella shorter. I think this is pretty accurate.]
I did some quick homework for iOS today, mostly for myself, but some of this stuff is worth writing down. This is a summary of what I learned:
All images are faulted in on demand, and this includes all dylibs. The page size is 16k on ARM64, 4k elsewhere.
- Reading an image sequentially triggers kernel read-ahead, so sequential access is more economical.
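To make the page-size point concrete, here is a minimal sketch that counts how many pages a byte range overlaps, which is an upper bound on the faults needed to read it cold. The function name and constants are my own for illustration, and it ignores read-ahead entirely.

```python
ARM64_PAGE = 16 * 1024   # 16k pages on ARM64
OTHER_PAGE = 4 * 1024    # 4k pages elsewhere


def pages_touched(offset, length, page_size):
    """Number of distinct pages overlapped by the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return last - first + 1


# A 100,000-byte read from the start of an image:
small_pages = pages_touched(0, 100_000, OTHER_PAGE)   # 25 pages
large_pages = pages_touched(0, 100_000, ARM64_PAGE)   # 7 pages
# Two bytes that straddle a page boundary still cost two pages.
straddle = pages_touched(ARM64_PAGE - 1, 2, ARM64_PAGE)
```

Bigger pages mean fewer faults for the same bytes, but each fault drags in more data, so scattered access wastes more on ARM64.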
Images that are “loaded” at startup are loaded more economically because the operating system strictly controls threading at that time. Later loads need extra loader magic to make sure everything stays thread-safe.
- OS images are pre-optimized in a fashion that is not discussed but that apparently makes them considerably more economical to load.
Dylib loading happens mostly in user mode: only the image mapping itself is done by the kernel, while the transitive mapping and whatnot is done by dyld. This seems prudent.
Image files consist of three segments: __TEXT, __DATA, and __LINKEDIT
- The __TEXT segment contains read-only data plus code. The code is generated position-independent, so it does not need fix-ups to slide around. This is important because it means that you do not need to page in lots of code just to fix it up!
- The fix-ups are stored in __LINKEDIT. They have to be faulted in to be applied, but then they are DISCARDED, so they do not need to stay in physical memory.
- The __DATA segment is copy-on-write (COW).
All fix-ups are applied to the __DATA segment. This means that any initialized pointers (to code or data) have to live in __DATA, which includes things like vtables.
- Because there is no Windows-like affordance for paging in and fixing up code on the fly for ASLR, the data pages that hold vtables are guaranteed to be COW dirty.
- For the same reason, fix-ups have good locality: they are all guaranteed to be in the __DATA segment.
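Why locality matters can be shown with a toy count of dirtied pages. This is a sketch under my own assumptions: fix-ups are modeled as byte offsets into the data segment, and every page that receives at least one write counts as COW dirty. The names are made up for illustration.

```python
PAGE_SIZE = 16 * 1024  # assuming ARM64-sized pages


def dirty_pages(fixup_offsets, page_size=PAGE_SIZE):
    """Set of page indices that would be dirtied by writes at the given offsets."""
    return {offset // page_size for offset in fixup_offsets}


# 1000 pointer-sized (8-byte) fix-ups packed back to back.
clustered = [i * 8 for i in range(1000)]
# The same number of fix-ups, but one per page.
scattered = [i * PAGE_SIZE for i in range(1000)]

clustered_cost = len(dirty_pages(clustered))
scattered_cost = len(dirty_pages(scattered))
```

Same number of fix-ups, wildly different number of dirty pages: the packed case fits in a single page while the scattered case dirties a thousand.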
There are a variety of pre-main fix-up types, including binding (to resolve imports) and rebasing (for relocation).
- Additionally, non-trivial initialization, or initialization of pointers to code, requires fix-ups and/or initializers to run. There are phases for this.
- There are special phases for Objective-C-specific things like class registration and selector management (de-duping).
- The upshot is that the mere existence of virtual methods causes a startup cost. In fact, any initialized constant pointer to code or data causes a startup cost.
- However, and this is important, the existence of code does not have an inherent load cost absent fix-ups. So for instance a giant algorithm costs you nothing at startup if it doesn’t run. But giant interfaces do. Giant initialized data trees do if you use pointers in them.
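A toy model of the relocation fix-up might help here. This is not dyld's actual data format, just a sketch: the data segment is a list of words, the fix-up list names which slots hold pointers, and the slide is how far the image landed from its preferred address. All names are hypothetical.

```python
def rebase(data_words, pointer_slots, slide):
    """Return a copy of data_words with slide added to every slot that holds a pointer."""
    fixed = list(data_words)
    for index in pointer_slots:
        fixed[index] += slide
    return fixed


# Toy data segment: two plain integers and two pointers at their link-time addresses.
data = [42, 0x1000, 7, 0x2040]
pointer_slots = [1, 3]          # only these slots need fixing up

slid = rebase(data, pointer_slots, slide=0x50000)

# Data with no pointers in it has an empty fix-up list and is never written.
plain = rebase([1, 2, 3], [], slide=0x50000)
```

The cost scales with the number of pointer slots, not with the amount of code or plain data, which is the point of the bullets above.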
Refactoring the code into delay-loaded dylibs (truly dynamically loaded) is a huge win because all those initializers and bindings and fix-ups do not have to happen at startup.
- However, as noted above, loading after startup costs more per load, so it’s crucial that the load actually be delayed.
- There’s no particular reason why hundreds or even thousands of libraries that are never loaded would be a problem: you can leave as many libraries unloaded as you like. The architecture has to consider how many will typically be loaded, not how many will stay unloaded.
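The delay-load pattern itself is simple, so here is a minimal sketch of it. `LazyLibrary` and `fake_loader` are names I made up. In a real program the loader would do the actual dynamic load (something like `dlopen`, or `ctypes.CDLL` in Python), but here it is a stand-in so the sketch runs anywhere.

```python
import threading


class LazyLibrary:
    """Defers an expensive load until the first call to get()."""

    def __init__(self, loader):
        self._loader = loader          # hypothetical callable that does the real load
        self._lock = threading.Lock()  # late loads have to be thread-safe
        self._loaded = False
        self._handle = None
        self.load_count = 0            # how many times the loader actually ran

    def get(self):
        with self._lock:
            if not self._loaded:
                self._handle = self._loader()
                self._loaded = True
                self.load_count += 1
            return self._handle


def fake_loader():
    # Stand-in for the real work: mapping, binding, fix-ups, initializers.
    return "handle-for-rarely-used-feature"


rare_feature = LazyLibrary(fake_loader)
# Nothing has been loaded yet; the cost is only paid on the first rare_feature.get().
```

If nobody ever calls `get()`, the loader never runs, which is exactly the case you want for most of those hundreds of libraries.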
The cost of a cold startup is going to be dominated by page faults (!). The page faults will come mainly from two sources: the fix-ups, which force __DATA and __LINKEDIT to be loaded, and the code that actually runs, which forces __TEXT to be loaded.
- The secret to fast loading is: initialize less, have fewer fix-ups, run less code.