[originally posted 5/17/2017]
One of the cool things that’s happened recently, and I’m showing my age by saying it’s recent, is that virtually every popular language now has some notion of continuations. This is super cool because scheduling work for “later” was, in the past, a lot more of a pain. In fact, sometimes it was the secret sauce of how you kept things responsive. MFC, for instance, had an easy-to-access idle infrastructure, and that made a lot of things work. Good frameworks had affordances for deferred work because it wasn’t easy.
The trouble with continuations, however they are offered, is that they can create orchestration problems as easily as they solve them. For instance, over-production is a leading cause of pain when everything is a “future”, so these systems need some kind of pushback, whether it’s congestion control or something else. It’s far too easy to just keep producing new work.
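To make the pushback idea concrete, here’s a minimal sketch (the names and the doubling “work” are purely illustrative) using a bounded queue: the producer simply blocks when the consumer falls behind, so it can’t keep producing new work without limit.

```python
import queue
import threading

done = object()  # sentinel marking the end of production

def producer(q, n):
    for i in range(n):
        q.put(i)   # blocks when the queue is full: that's the pushback
    q.put(done)

def consumer(q, results):
    while True:
        item = q.get()
        if item is done:
            break
        results.append(item * 2)  # stand-in for real work

q = queue.Queue(maxsize=4)  # small bound: at most 4 items in flight
results = []
t = threading.Thread(target=consumer, args=(q, results))
t.start()
producer(q, 100)
t.join()
```

The same shape appears in real systems as bounded channels, semaphores, or TCP-style congestion windows; the common thread is that production is gated by consumption rather than running open-loop.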
Orchestration isn’t a new problem at all. While I was working on Edge it was often the most important source of performance improvements, or performance problems if you like, same thing. :) I think we can all agree that when you have a bunch of work to do it’s important to choose the right order to do that work. If you can easily defer pieces of it for “later”, without having to do all kinds of state-capturing shenanigans manually, so much the better. The work you defer for later may be what enables you to get your (e.g.) UI ready in a reasonable amount of time.
There are other good reasons to defer work, even if the work is on the critical path. Re-ordering work gives you the opportunity to properly manage your resources. This is no small effect. For instance, suppose you’re doing operation “A”, which uses a bunch of memory, and while doing “A” you tend to produce work “B” that needs doing. The obvious way to handle this is to do each B as you encounter it, so you get a pattern like ABABABAB… This is all fine and well if A and B tend to use the very same data structures, but what if they don’t? In that case deferring the work so you get AAABBB might be a fantastic thing to do: all your caches stay hot for A, then you switch to B and they stay hot for B.
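The AAABBB idea above can be sketched in a few lines (the work functions here are placeholders): instead of running each B as it’s produced, collect them and drain them as a batch after the A’s finish.

```python
def do_a(i, deferred, trace):
    # "A" work; as a side effect it produces a "B" that needs doing
    trace.append(f"A{i}")
    deferred.append(i)  # don't run the B now, just remember it

def do_b(i, trace):
    # "B" work, run later as a batch
    trace.append(f"B{i}")

trace = []
deferred = []
for i in range(1, 4):
    do_a(i, deferred, trace)  # all the A's first; caches stay hot for A
for i in deferred:
    do_b(i, trace)            # then drain the deferred B's

# trace is now A1 A2 A3 B1 B2 B3 instead of A1 B1 A2 B2 A3 B3
```

In a continuation-based system the `deferred` list is effectively free: each B is just a future you chose not to await yet.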
Actually I said cache thinking of memory, but it could just as easily be a disk cache we’re talking about, or any other resource that has locality issues, like the disk itself, cache or no. If B tends to evict things from whatever cache A needs, then interleaving them is probably a very bad idea. But the reverse could also be true. Consider A1B1A2B2A3B3: maybe that’s the best order because B1 tends to use the same resources as A1, and B1 will be very economical if done immediately after A1. Then again, A1A2A3B1B2B3 might be best after all.
Actually my favorite example was of this kind; the best pattern turned out to be A1A2A3B3B2B1. Let me tell you that story very quickly.
Back when I was working on making “cvpack” faster (that’s the tool that de-duplicates debug information from the various C++ compilands), my first strategy was to avoid writing out all of the raw debug information from the linker just so that I could read it all back again, de-dupe it, and then write it out much smaller. That’s what used to happen: link->big foo.exe->cvpack->small foo.exe. I changed the system so that cvpack ran as a post step in the linker and compressed directly from the linker’s in-memory representation. That was a big help. But here’s what I found: the packer packed each segment in the order the linker produced them. That meant that if the linker produced an executable big enough to not fit in the disk cache (and this was 1993, so caches were smaller), then for sure LRU would make it so that, when the packer started, the first thing it wanted was not resident. I fixed this problem by packing the modules in the opposite order the linker produced them, so whatever was hot would stay hot. This made a huge difference, since most of the time most of the stuff was still mostly cached. Link1 Link2 Link3 Pack3 Pack2 Pack1…
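The shape of that fix is just LIFO draining of deferred work. Here’s a hypothetical sketch (none of this is the real cvpack code, just the ordering idea): drain in the reverse of production order, so the most recently produced, and therefore most likely still-cached, item is processed first.

```python
def orchestrate(n):
    # produce n "link" work items, then pack them in REVERSE order,
    # so the last-linked (hottest) module is the first one packed
    trace = []
    produced = []
    for i in range(1, n + 1):
        trace.append(f"Link{i}")
        produced.append(i)
    while produced:
        i = produced.pop()  # last produced, first packed (LIFO)
        trace.append(f"Pack{i}")
    return trace
```

With a FIFO drain you’d start packing the coldest module; the one-character difference between `pop()` and `pop(0)` is the whole trick.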
All of this is just an example of work orchestration.
If there’s actual concurrency, orchestration becomes even more important. Some operations overlap well with each other. If you have ABCD, maybe you should do A, then B and C together, then D, because B and C play nice with each other. But that is just one consideration. How about this: if you have 4 threads, should you do ABCD on each of the 4 threads, thereby doing 4 jobs in parallel? Or should you do A on one thread, B on another, and so forth, making a pipeline? The pipeline is far more efficient if each operation A, B, C, D fits well in the processor’s cache when run on its own, but moving work through all the stages on one processor causes the overall operation to not fit. So pipeline? Well, maybe don’t pipeline, right? Because it’s also possible that all the stages use the same data, and if you run ABCD on one thread each stage goes super-fast because the previous stage heated everything up just right for the next one.
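Here’s a minimal sketch of the pipeline alternative, with a queue between stages so each thread’s working set is just its own stage. The two stage functions are placeholders; real stages would be the A/B/C/D operations above.

```python
import queue
import threading

END = object()  # sentinel that flows through the pipeline to shut it down

def stage(fn, inq, outq):
    # one pipeline stage: transform items until the END marker arrives
    while True:
        item = inq.get()
        if item is END:
            outq.put(END)  # pass shutdown along to the next stage
            break
        outq.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
stage_a = threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1))
stage_b = threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2))
stage_a.start()
stage_b.start()

for i in range(5):
    q0.put(i)
q0.put(END)

results = []
while (item := q2.get()) is not END:
    results.append(item)
stage_a.join()
stage_b.join()
```

The run-everything-on-each-thread alternative is even simpler to write; which one wins is exactly the cache question in the paragraph above, and you can’t know without measuring.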
It’s not at all difficult to come up with situations where simply reordering work that can be re-ordered (ABC becomes ACB, assuming that’s legal) results in far superior execution. All you need is for C to depend strongly on the memory that A uses and B to splat the cache doing its job, thereby ruining things for C. In short if B is a hog and C isn’t then ACB is probably much better than ABC.
Good use of continuations can help you manage these considerations without turning your work into something unnatural and unreadable. But make no mistake, locality is no small effect, so it’s unwise to proceed without considering good ordering for performance critical work.
I consider orchestration to be a “macro-optimization” worthy of early consideration.
As always forgive me for giving only approximately correct information, it takes far too long to make the prose more rigorous.
And of course, don’t believe a word I say until you measure it for yourself :)