[originally posted 6/30/2017, seeding some old content here]
I’ve been doing this for some time and I have a few basic ways to get to the bottom of typical GC problems. Note: this is not even remotely a comprehensive guide on the subject but rather the way to think about the most basic types of problems you will encounter. I make reference to the performance counters available in .NET because I know them the best but I think every collector worth mentioning has these notions (where they apply).
Is there a problem?
The top-level counter is the % time spent collecting. If the collector is a “stop the world” collector then this will tell you literally the fraction of time the collector is running. If the collector is concurrent then you need to look at the % CPU time that is going to the collector.
Generally: If your percentage is in the mid to high single digits you don’t have a problem. That’s likely comparable to what you would get from a traditional allocator. YMMV.
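In .NET this is the “% Time in GC” performance counter. On the JVM you can approximate the same top-level number from the GC MXBeans; here’s a minimal sketch (a wall-clock approximation only — a concurrent collector’s CPU cost will be understated, and the class name here is made up):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeFraction {
    // Rough sketch: sum reported collection time across all collectors
    // and divide by process uptime to approximate "% time in GC".
    public static double gcTimeFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector doesn't report it
            if (t > 0) gcMillis += t;
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptimeMillis > 0 ? (double) gcMillis / uptimeMillis : 0.0;
    }

    public static void main(String[] args) {
        System.out.printf("~%.1f%% of wall time in GC%n", 100 * gcTimeFraction());
    }
}
```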
I have a problem, now what?
The next question to answer is this: is the problem that your collections are too frequent? Or is the problem that any given collection is taking far too long? Or it could be both.
Have a look at the rate counters that are available. If your collector supports partial collections you want to look carefully at the rate at which objects are moving from the youngest generation to the oldest generation. You’d really like (in round numbers) something like 90% of your objects to die before they get into the next older generation. So in .NET that means maybe 1% of your objects survive to Generation 2 in steady state. Again, YMMV. The promotion counters are invaluable here.
If you find that you are promoting a lot of objects, which then die quickly, then you are going to be driving a lot of large collections to get that memory back. Those collections are not cheap. Try to address the lifetime problems by either making the objects more durable (which has its own problems) or less durable (probably a better idea). A classic reason you can get into this situation is that you are allocating a lot of objects with finalizers. Objects that require finalization necessarily survive at least one collection. If you can eliminate the need for finalization (by forcing the issue with explicit cleanup for instance) you will be doing much better. Alternatively if you can recycle those objects instead of making them die you may be ok. The worst thing to do is to drive a lot of death into the eldest generation.
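To make the finalization point concrete, here’s a sketch in Java (the same logic applies to .NET finalizers vs. IDisposable); both wrapper types are hypothetical names for illustration:

```java
// Two hypothetical resource wrappers. An object with a finalizer must be
// queued and survive at least one collection before its memory comes back;
// an AutoCloseable cleaned up explicitly can die in the youngest generation.
class FinalizableBuffer {
    private long handle = 42; // stand-in for a native handle

    @SuppressWarnings("deprecation")
    @Override
    protected void finalize() {
        handle = 0; // runs on the finalizer thread, after the object survived a GC
    }
}

class ClosableBuffer implements AutoCloseable {
    private boolean closed;

    @Override
    public void close() { closed = true; } // deterministic cleanup, no finalizer cost

    boolean isClosed() { return closed; }
}
```

With try-with-resources (`try (ClosableBuffer b = new ClosableBuffer()) { ... }`) cleanup happens at scope exit, and the collector never has to promote the object just to finalize it.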
To get clues about what objects may be causing this situation, dump the heap. Many heap dumpers will tell you what’s in there that’s already doomed. That’s your clue. Alternatively try to get a dump of what’s being promoted.
If you don’t have a collector with a partial collection strategy, or if the promote rate is looking ok, then it’s likely that your overall allocation rate is the problem. Look for sources of temporary objects. Big sources of temporary objects often result from silly things like having object comparison methods that allocate, or object hashing methods that allocate (if you have an object with 5 fields don’t hash it by allocating an array and storing the 5 fields in it and then hashing the array).
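Here’s that hashing antipattern sketched in Java (`Point5` is a made-up type): `Objects.hash` builds a varargs array and boxes every field on each call, while a hand-rolled combine allocates nothing.

```java
import java.util.Objects;

// An object with 5 fields, hashed two ways.
final class Point5 {
    final int a, b, c, d, e;

    Point5(int a, int b, int c, int d, int e) {
        this.a = a; this.b = b; this.c = c; this.d = d; this.e = e;
    }

    int badHash() {
        // Objects.hash allocates an Object[] and boxes each int -- garbage on every call.
        return Objects.hash(a, b, c, d, e);
    }

    int goodHash() {
        // Combine the fields in place: same quality of hash, zero allocation.
        int h = a;
        h = 31 * h + b;
        h = 31 * h + c;
        h = 31 * h + d;
        h = 31 * h + e;
        return h;
    }
}
```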
To get clues about what might be driving your allocation rate, attach an allocation profiler to your process and look at the statistics. The data types or allocation stacks should tell you where the spam is coming from. Nix the spammer.
If the promotion rate is decent, and the allocation rate is decent, but you just have really slow collections from time to time, maybe enough to really make your application/server get glitchy, then you have a more challenging problem to solve. Collections are slow because they require visiting a lot of objects. If your heap is very small, there’s just no way this can happen. If your heap is big you have two choices:
- store less overall (this is always a good idea)
- store the same but do it with fewer objects and especially fewer pointers
If the collector has to trace a lot of objects this is going to have a bad effect on your application/server no matter the strategy. If it stops the world then you get big stalls; if it’s concurrent then dancing around all those pointers in the background while your program is trying to do useful work is going to ruin your locality and therefore your performance.
To minimize the work the collector has to do, make your durable objects low in pointers and rich in values. Use larger objects (like arrays) and blocky objects (like a B-tree) rather than fine-grained objects (like linked lists).
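A small Java sketch of the difference (the `Layout` class is hypothetical): the pointer-rich version gives the tracer a node and a boxed value per element; the value-rich version is one object with no interior pointers at all.

```java
import java.util.List;

public class Layout {
    // Pointer-rich: every element is a separate node plus a boxed Integer,
    // all of which the collector must trace.
    static long sumPointerRich(List<Integer> xs) {
        long s = 0;
        for (int x : xs) s += x;
        return s;
    }

    // Value-rich: one array object, zero pointers to chase.
    static long sumValueRich(int[] xs) {
        long s = 0;
        for (int x : xs) s += x;
        return s;
    }
}
```

Both compute the same result; the difference the collector sees is two objects per element versus one object total.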
It’s important to take a close look at your building blocks. For instance, in Java the standard HashMap links its entries together as separate node objects. That’s gonna be a lot of objects. If you have a huge hashmap you’ll find that the collector has to work overtime tracing it compared to some other structures that are low in pointers.
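One low-pointer alternative, sketched in Java: a read-only map backed by two parallel arrays (`FlatIntMap` is a made-up example, not a library type). However many entries it holds, the collector sees exactly two objects.

```java
import java.util.Arrays;

// A "blocky", low-pointer read-only map: keys and values live in two
// parallel arrays instead of a web of entry nodes. Lookup is binary search.
final class FlatIntMap {
    private final int[] keys;   // must be sorted ascending
    private final int[] values; // values[i] belongs to keys[i]

    FlatIntMap(int[] sortedKeys, int[] values) {
        this.keys = sortedKeys;
        this.values = values;
    }

    Integer get(int key) {
        int i = Arrays.binarySearch(keys, key);
        return i >= 0 ? values[i] : null;
    }
}
```

Lookups are O(log n) instead of O(1), so this is a trade; but for big, mostly-static maps the tracing (and cache) win can be worth it.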
Collectors aren’t magic. Minimize the work they have to do by having collector-friendly object lifetimes and heap contents. More values, fewer pointers: that’s always a good thing. And fewer object stores mean fewer write barriers, too.