I want all the threads for meeeee
Recently there was a discussion about using lots of threads on a server workload and the costs/consequences of this practice.
I have to start with “your mileage may vary”, which is really the only answer anyone can give that’s guaranteed to be correct. But putting aside that the situation is very complicated, there’s maybe some general advice that’s helpful.
I usually begin these discussions with this mostly correct statement: “If you decide you’re going to use a zillion threads, where a zillion means greater than the number of cores you have, then basically you’re choosing to give up doing your own orchestration and instead you’re going to let the scheduler do the job and hope for the best.”
That might work out for you. It’s unlikely to be optimal, but maybe it’s good enough. You should measure.
But when it comes to controlling thread count, sometimes we get pushback like “that stuff doesn’t matter anymore because x64 and tons of memory”.
Well, it’s true that you have more room for threads than the old days, but, alas, threads are still not free and their costs do matter. So now we get to the point of this article: what are some of the consequences of having a lot of threads, like say 1,000, or even 10,000?
Scheduling
While I haven’t researched every single scheduler out there, I can be pretty confident saying that there are no schedulers where creating more threads results in less overhead. And for this to even be worth considering, we presume, of course, that most of the threads are “blocked.” If most of the threads are “ready” then things are really going to be ugly, because they will be running in slow motion, pre-empting each other constantly. Not a good place.
Now if most of the threads are “idle” then you have to ask, “Why do they even exist?” Here “idle” means waiting for work to arrive. If they’re “blocked”, i.e., they have work to do but presently are waiting on some synchronous operation, then at least they are holding a spot in line. In an expensive way mind you, but yes, holding a spot.
Let’s think about that a little more: what is it that they could be waiting on?
Waiting on Disk
If they’re waiting on disk, then probably some very small number of them are being serviced and the rest are in a disk queue. This queue is just like a queue you could have created yourself to dispatch work. Had you done so you could control orchestration and cancellation, but now you probably can’t, because you have threads in some arbitrary stack waiting on disk in a queue you don’t own. It seems very unlikely that you have O(10⁴) outstanding disk requests to the hardware, and if you did, it seems very unlikely that this would be a good thing for disk performance.
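To make that concrete, here’s a minimal sketch (Python, with hypothetical file paths and a placeholder process() standing in for real work) of what owning the queue can look like: only a handful of threads ever block on the disk, and the rest of the backlog stays somewhere you can inspect, reprioritize, or cancel.

```python
# Sketch only: the file names are made up and process() is a stub.
import concurrent.futures
import queue

def process(data: bytes) -> None:
    ...  # placeholder for whatever you actually do with the bytes

# The backlog lives in a queue you own, not in thousands of blocked threads.
read_requests: "queue.Queue[str]" = queue.Queue()
for path in (f"data/file_{i}.bin" for i in range(10_000)):  # hypothetical paths
    read_requests.put(path)

def reader() -> None:
    # Drain the queue; only this small pool ever blocks on the disk.
    while True:
        try:
            path = read_requests.get_nowait()
        except queue.Empty:
            return
        with open(path, "rb") as f:
            process(f.read())

# Eight threads touch the disk; everything else waits in *your* queue.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(8):
        pool.submit(reader)
```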
Waiting on Network
Mostly the same as the above, except you can have O(10⁴) sockets. I’m not sure that’s a good idea. In fact it seems like a terrible idea. Presumably the sockets are all idle/waiting on … something. See the next two entries for examples.
Waiting on a Server Resource
This is probably “waiting on network” with an asterisk. Like suppose it’s a SQL server: yes, there is (probably) a socket for the connection, but there is also a logical SQL session open. Starting O(10⁴) SQL connections seems like a really bad idea. Assuming you managed to pull that off, there’s no way SQL is actually going to process all the work from all those connections concurrently, so again you’re waiting in a queue, only now the queue is on another server where you have even less control.
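If you do control the fan-out, the usual move is to cap it explicitly. Here’s a hedged little sketch; sqlite3 is standing in for whatever server you actually talk to, run_query() is a made-up helper, and the limit of 16 is an invented number you’d tune:

```python
# Sketch only: sqlite3 stands in for your real database server and the
# concurrency cap is a guess you would tune.
import sqlite3
import threading

MAX_DB_CONCURRENCY = 16
db_gate = threading.BoundedSemaphore(MAX_DB_CONCURRENCY)

def run_query(sql: str) -> list:
    # At most MAX_DB_CONCURRENCY callers hold a connection at a time;
    # everyone else waits here, in your process, instead of piling up
    # as open sessions on the database server.
    with db_gate:
        conn = sqlite3.connect("example.db")  # stand-in for the real server
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()
```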
Waiting on HTTP
This is another special case of Waiting on Network. If you’re talking to a web server it, too, will enqueue requests and serve them according to some discipline. Probably having O(10⁴) connections to the web server will not make it more efficient. But if you made it work at all, again the result is you’re waiting in someone else’s queue…
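The same bounding trick applies here; a rough sketch with made-up URLs and a made-up in-flight limit:

```python
# Sketch only: the URLs are invented and the worker count is a guess.
import concurrent.futures
import urllib.request

urls = [f"https://example.com/item/{i}" for i in range(10_000)]  # hypothetical

def fetch(url: str) -> bytes:
    # Each in-flight request ties up one worker thread and one socket.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()

# 32 requests in flight against the web server; the rest of the work waits
# in the executor's queue inside your process, not as thousands of idle
# connections against someone else's front door.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(fetch, urls))
```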
Waiting on critical sections or other mutex-like things
Every time you wrap anything in a critical section you’ve created a software resource: it has a service time, a queue, and maybe multiple “servers” and whatnot. Again you’ve moved the work from a queue of work items you control to a queue of threads on a waitable object you don’t control.
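One way to picture the difference, using logging as a stand-in for any lock-protected resource: in the first version the waiters are threads parked on a mutex you can’t see into; in the second they’re plain work items in a queue you can measure, cap, and shed.

```python
# Sketch only: logging stands in for any shared resource behind a lock.
import queue
import threading

# Version 1: the lock is the queue, and it's made of blocked threads
# you can't count, reorder, or cancel.
log_lock = threading.Lock()

def write_line_locked(f, line: str) -> None:
    with log_lock:
        f.write(line + "\n")

# Version 2: callers enqueue a work item; one writer thread drains it.
# The backlog is now visible (qsize) and bounded (maxsize), and you get
# to decide what happens when it gets too deep.
log_queue: "queue.Queue[str]" = queue.Queue(maxsize=10_000)

def write_line_queued(line: str) -> None:
    log_queue.put(line)

def log_writer(f) -> None:
    while True:
        f.write(log_queue.get() + "\n")
```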
Memory
Every thread needs a stack; the default commit size on Win64 is (I think) 4MB (maybe someone check that). That’s not nothing. Maybe you have memory to spare? Even if you do, I’m not aware of any allocators that get faster when there is more allocated memory. The reverse, on the other hand, is quite common. Just considering the virtual memory allocation costs (the ones that underlie whatever allocator you’re using), you’re incurring a penalty for having more address space pieces to manage. This says nothing about managing the page tables and so forth. If you allocated 10,000 threads, you just burned 40G on stack. Divide that by 512 to get the size of the page tables needed to map it. And this assumes your threads use no other memory.
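For the curious, the arithmetic goes like this, taking the 4MB figure above at face value:

```python
# Back-of-the-envelope math using the assumed 4MB of committed stack per
# thread, ordinary 4KB pages, and 8-byte page table entries.
threads = 10_000
stack_per_thread = 4 * 2**20   # 4MB committed stack (the assumed figure)
page_size = 4 * 2**10          # 4KB pages
pte_size = 8                   # bytes per 64-bit page table entry

stack_total = threads * stack_per_thread
page_table_bytes = stack_total // page_size * pte_size  # i.e. stack_total / 512

print(stack_total / 2**30, "GiB of stack")                    # ~39 GiB, the "40G"
print(page_table_bytes / 2**20, "MiB of page table entries")  # ~78 MiB
```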
Context Switch Costs
Assuming the threads are blocked enough that there is hardly any starvation, you still face the costs of switching in cold threads. With normal working set trimming going on chances are that you’ll take page faults when you switch. Now maybe you say, “Yes but page faults are normal”, and yes they are — one of the Windows kernel devs (and I wish I could remember which one) once told me that context switches are actually really cheap at the processor level, the problem is that immediately after the logical switch there is a very expensive period of faulting in the old state — that’s where the true cost comes from. Fair enough, some faults are normal. But again, I’m not aware of any kernels where adding more threads reduces the cost of getting a new thread running.
Garbage collection
Sometimes collection requires complex rendezvous with many threads. Maybe all the threads. This is tricky.
Collector algorithms, even concurrent collectors, have had tons of thought put into the problems of managing many threads, but I’m not aware of any collector that gets faster if you create O(10⁴) user threads.
Perhaps we’re seeing a pattern here?
Lots of threads improve nothing; all they let you do is tie up more stacks blocking on something synchronous which could have been managed with an await pattern.
Managing Your Own Queues
If you decide that you’re going to step up to the problem of orchestration and manage your own work queue(s) you pay a price: you have to deal with async patterns. This price might not be so bad. There are good await patterns in lots of languages. And even if you have to use raw Windows IO Completion Ports, that isn’t so terribly bad. What do you get? Well, in addition to control of your thread count you get policy. And, this is important, it’s your choice of policy:
- you can decide how many threads to create
- you can decide what work to prioritize in the queue
- you can decide what work to cancel in the queue, and you can do this in such a way that one bad burst doesn’t tank the SLA for everything that follows
- you can report clearly on the state of your system since waiting puts things back in your queue where you control them, and
- you can make domain specific decisions on adding/removing parallelism
In short, you get control.
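To make it concrete, here’s a minimal sketch of what “your own queue” can look like in Python. The handle_request() function is hypothetical and the worker count and deadline are invented numbers; the point is to show where the policy knobs live, not to be a production dispatcher. The same shape works with async/await or IO Completion Ports underneath.

```python
# Sketch only: handle_request() is a placeholder and the numbers are guesses.
import queue
import threading
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class WorkItem:
    priority: int                           # policy: what gets served first
    deadline: float = field(compare=False)  # policy: when it stops being worth doing
    payload: object = field(compare=False)

work: "queue.PriorityQueue[WorkItem]" = queue.PriorityQueue()

def handle_request(payload) -> None:
    ...  # placeholder for the real work

def worker() -> None:
    while True:
        item = work.get()
        if time.monotonic() > item.deadline:
            continue  # policy: shed stale work so one bad burst doesn't
                      # tank the SLA for everything that follows
        handle_request(item.payload)

# Policy: a handful of workers, not one thread per request.
for _ in range(8):
    threading.Thread(target=worker, daemon=True).start()

# Producers enqueue; the backlog stays in your process where you can
# report on it, reorder it, or drop it.
work.put(WorkItem(priority=0, deadline=time.monotonic() + 0.250, payload="ping"))
print("queued:", work.qsize())
```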
Fundamentally, there are two things that drive system performance:
- consumption
- orchestration
You don’t want to give up on either of them.
Super-linear scaling
I can’t resist touching on this topic. Parallelism is all fine and well, but if you have a lot of cores to play with, running the same workload on all of them may not be the right choice at all. For instance, suppose you go from 1 core to 2 cores (an easy example): you might be able to achieve more than a 2x speedup with good orchestration. How? With pipelining.
Suppose that each request coming in requires two atoms of work: A and B. I could schedule the work (AB) on two threads such that each does A then B. But I could also break the work so that one thread always does A and the other always does B. Now maybe this is no better… but maybe all the code for A fits into a single core’s cache and the same for B. Now both processors are running faster because they are each doing only one thing. So now, instead of doing 2*(AB) in parallel, I pipeline from A to B and my overall throughput is better.
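As a shape (not a benchmark), the hand-off can be as simple as two queues and two dedicated workers. This sketch uses placeholder stage_a() and stage_b() functions; note that for CPU-bound work in CPython you’d want processes rather than threads to actually use two cores, but the structure is the same.

```python
# Sketch only: stage_a() and stage_b() are placeholders for real work.
import queue
import threading

requests: "queue.Queue" = queue.Queue()  # incoming work
handoff:  "queue.Queue" = queue.Queue()  # A's output, B's input

def stage_a(x):
    return ("A", x)  # placeholder work

def stage_b(x):
    return ("B", x)  # placeholder work

def a_worker() -> None:
    # This thread only ever runs A, so A's code and data stay hot in its cache.
    while True:
        handoff.put(stage_a(requests.get()))

def b_worker() -> None:
    # Likewise, this thread only ever runs B.
    while True:
        result = stage_b(handoff.get())  # deliver 'result' somewhere real

threading.Thread(target=a_worker, daemon=True).start()
threading.Thread(target=b_worker, daemon=True).start()
```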
Of course your work is probably more complicated than just two stages, but then you do have more cores than two. A suitable pipeline with good stages could easily yield super-linear speedups from superior cache effects at every stage.
Will this always happen? Of course not. But that’s not the point; the point is that there are wins to be had from orchestration, and they can be very big wins. Shrugging away these opportunities is not likely to get you the best results.
Summary
- Threads are not free
- Orchestration is important
- Measure measure measure
- Thanks if you made it this far