Diagnosing Garbage Collector Problems 101

[originally posted 6/30/2017, seeding some old content here]

I’ve been doing this for some time and I have a few basic ways to get to the bottom of typical GC problems. Note: this is not even remotely a comprehensive guide on the subject but rather the way to think about the most basic types of problems you will encounter. I make reference to the performance counters available in .NET because I know them the best but I think every collector worth mentioning has these notions (where they apply).

Is there a problem?

The top-level counter is the % time spent collecting. If the collector is a “stop the world” collector then this will tell you literally the fraction of time the collector is running. If the collector is concurrent then you need to get the % CPU time that is going to the collector.

Generally: If your percentage is in the mid to high single digits you don’t have a problem. That’s likely comparable to what you would get from a traditional allocator. YMMV.
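
As an illustration on a different runtime, here is a minimal sketch of how the same top-level number could be approximated on the JVM using the standard management beans. The class and method names (GcTimeFraction, totalGcMillis, percentInGc) are made up for this example, and what the reported collection time covers varies by collector, so treat the result as a rough indicator rather than an exact equivalent of the .NET counter.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcTimeFraction {
    /** Total milliseconds the JVM reports having spent in collections so far. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report a time
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Pure helper: percentage of the elapsed time that went to collecting. */
    static double percentInGc(long gcMillis, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / elapsedMillis;
    }

    public static void main(String[] args) {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("%.2f%% of uptime spent in GC%n",
                percentInGc(totalGcMillis(), uptimeMillis));
    }
}
```

This reports the average since process start; sampling the same values twice and taking the difference gives the figure for a recent window instead.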

I have a problem, now what?

The next question to answer is this: is the problem that your collections are too frequent? Or is the problem that any given collection is taking far too long? Or it could be both.
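
One way to separate "too frequent" from "too slow" is to sample the collection count and the collection time over a window and look at both the rate and the mean cost per collection. The sketch below does this on the JVM with the standard GarbageCollectorMXBean; the GcRates class, its helper names, and the 5 second window are assumptions chosen for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

final class GcRates {
    /** Collections per second over a sampling window. */
    static double collectionsPerSecond(long countDelta, long windowMillis) {
        return windowMillis <= 0 ? 0.0 : countDelta * 1000.0 / windowMillis;
    }

    /** Mean milliseconds per collection; 0 when nothing was collected. */
    static double meanMillisPerCollection(long timeDeltaMillis, long countDelta) {
        return countDelta <= 0 ? 0.0 : (double) timeDeltaMillis / countDelta;
    }

    public static void main(String[] args) throws InterruptedException {
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        long[] count0 = new long[beans.size()];
        long[] time0 = new long[beans.size()];
        for (int i = 0; i < beans.size(); i++) {
            count0[i] = beans.get(i).getCollectionCount();
            time0[i] = beans.get(i).getCollectionTime();
        }

        long windowMillis = 5_000; // assumed sampling window
        Thread.sleep(windowMillis);

        // One line per collector: how often it ran and how long each run took on average.
        for (int i = 0; i < beans.size(); i++) {
            long dCount = beans.get(i).getCollectionCount() - count0[i];
            long dTime = beans.get(i).getCollectionTime() - time0[i];
            System.out.printf("%s: %.2f collections/s, %.1f ms each%n",
                    beans.get(i).getName(),
                    collectionsPerSecond(dCount, windowMillis),
                    meanMillisPerCollection(dTime, dCount));
        }
    }
}
```

A high rate with a small mean points at frequency; a low rate with a large mean points at slow collections.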

Frequent Collections

Have a look at the rate counters that are available. If your collector supports partial collections you want to look carefully at the rate at which objects are moving from each generation into the next older one, and ultimately into the oldest generation. You’d really like (in round numbers) something like 90% of your objects to die before they get into the next older generation. So in .NET, where objects pass through Generation 0 and Generation 1 on the way, that means maybe 1% of your objects (10% of 10%) survive to Generation 2 in steady state. Again, YMMV. The promotion counters are invaluable here.

If you find that you are promoting a lot of objects, which then die quickly, then you are going to be driving a lot of large collections to get that memory back. Those collections are not cheap. Try to address the lifetime problems by either making the objects more durable (which has its own problems) or less durable (probably a better idea). A classic reason you can get into this situation is that you are allocating a lot of objects with finalizers. Objects that require finalization necessarily survive at least one collection. If you can eliminate the need for finalization (by forcing the issue with explicit cleanup for instance) you will be doing much better. Alternatively if you can recycle those objects instead of making them die you may be ok. The worst thing to do is to drive a lot of death into the eldest generation.
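
To make "forcing the issue with explicit cleanup" concrete, here is a minimal Java sketch. The NativeBuffer type and the resource it stands for are hypothetical; the point is only that the class has no finalizer and is released deterministically through AutoCloseable and try-with-resources, so a short-lived instance can die in the youngest generation. (In .NET the analogous shape is IDisposable, usually with GC.SuppressFinalize when a finalizer still exists as a backstop.)

```java
/** Hypothetical wrapper around some non-memory resource. It has no finalizer. */
final class NativeBuffer implements AutoCloseable {
    private boolean closed;

    void write(int value) {
        if (closed) {
            throw new IllegalStateException("buffer is closed");
        }
        // ... use the underlying resource here ...
    }

    /** Releases the resource right away; calling it twice is harmless. */
    @Override
    public void close() {
        if (!closed) {
            closed = true;
            // ... release the handle, unmap memory, etc. ...
        }
    }

    boolean isClosed() {
        return closed;
    }
}

final class ExplicitCleanupDemo {
    /** Uses a buffer and returns it only so the caller can observe that it was closed. */
    static NativeBuffer useAndRelease() {
        NativeBuffer used;
        try (NativeBuffer buf = new NativeBuffer()) {
            buf.write(42);
            used = buf;
        } // close() runs here, with no help from the collector
        return used;
    }

    public static void main(String[] args) {
        System.out.println(useAndRelease().isClosed()); // prints true
    }
}
```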

If you don’t have a collector with a partial collection strategy, or if the promote rate is looking ok, then it’s likely that your overall allocation rate is the problem. Look for sources of temporary objects. Big sources of temporary objects often result from silly things like having object comparison methods that allocate, or object hashing methods that allocate (if you have an object with 5 fields don’t hash it by allocating an array and storing the 5 fields in it and then hashing the array).
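
Here is what the hashing case can look like in Java. The Order class and its five fields are invented for the example. Objects.hash takes a varargs Object[], so it boxes the primitive fields and builds a temporary array on each call (the JIT can sometimes optimize that away, but I would not count on it); the hand-written hashCode combines the fields the same way and produces the same value with no temporaries.

```java
import java.util.Objects;

/** Hypothetical value type with five fields. */
final class Order {
    final int id;
    final int customer;
    final int quantity;
    final long priceCents;
    final String sku;

    Order(int id, int customer, int quantity, long priceCents, String sku) {
        this.id = id;
        this.customer = customer;
        this.quantity = quantity;
        this.priceCents = priceCents;
        this.sku = sku;
    }

    /** Allocating version: boxes the primitives and passes them in a temporary Object[]. */
    int allocatingHash() {
        return Objects.hash(id, customer, quantity, priceCents, sku);
    }

    /** Same result, computed field by field without creating any objects. */
    @Override
    public int hashCode() {
        int h = 1;
        h = 31 * h + id;
        h = 31 * h + customer;
        h = 31 * h + quantity;
        h = 31 * h + Long.hashCode(priceCents);
        h = 31 * h + (sku == null ? 0 : sku.hashCode());
        return h;
    }

    /** Field-by-field comparison, also without temporaries. */
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof Order)) {
            return false;
        }
        Order other = (Order) o;
        return id == other.id
                && customer == other.customer
                && quantity == other.quantity
                && priceCents == other.priceCents
                && Objects.equals(sku, other.sku);
    }
}
```

If this object is a key in a large map, the allocating version creates garbage on every lookup, which is exactly the kind of silly temporary-object source to look for.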

Slow Collections

If the promotion rate is decent, and the allocation rate is decent, but you just have really slow collections from time to time, maybe enough to really make your application/server get glitchy, then you have a more challenging problem to solve.

Collections are slow because they require visiting a lot of objects. If your heap is very small, there’s just no way this can happen. If your heap is big you have two choices:

  • store less overall (this is always a good idea)
  • store the same but do it with fewer objects and especially fewer pointers

If the collector has to trace a lot of objects this is going to have a bad effect on your application/server no matter the strategy. If it stops the world then you get big stalls; if it’s concurrent then dancing around all those pointers in the background while your program is trying to do useful work is going to ruin your locality and therefore your performance.
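
A small sketch of "the same data with fewer objects and fewer pointers", using hypothetical names (PointStorage, Point, Columns). The first layout has one heap object per point plus an array of references the collector has to follow; the second keeps the same numbers in two primitive arrays, which contain nothing to trace.

```java
final class PointStorage {
    /** One heap object per point: an array of n references for the collector to follow. */
    static final class Point {
        final double x;
        final double y;

        Point(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /** The same data as two primitive arrays: a handful of objects in total, no inner pointers. */
    static final class Columns {
        final double[] xs;
        final double[] ys;

        Columns(int n) {
            xs = new double[n];
            ys = new double[n];
        }
    }

    static Point[] asObjects(int n) {
        Point[] points = new Point[n];
        for (int i = 0; i < n; i++) {
            points[i] = new Point(i, 2.0 * i);
        }
        return points;
    }

    static Columns asColumns(int n) {
        Columns c = new Columns(n);
        for (int i = 0; i < n; i++) {
            c.xs[i] = i;
            c.ys[i] = 2.0 * i;
        }
        return c;
    }

    static double sumX(Point[] points) {
        double sum = 0;
        for (Point p : points) {
            sum += p.x; // one pointer dereference per element
        }
        return sum;
    }

    static double sumX(Columns c) {
        double sum = 0;
        for (double x : c.xs) {
            sum += x; // sequential read of a flat array
        }
        return sum;
    }
}
```

Both layouts answer the same questions; the difference is how many objects and references sit in the heap for the collector to visit.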

It’s important to take a close look at your building blocks. For instance, in Java the standard HashMap allocates a separate entry object for every mapping and links the entries in each bucket together. That’s gonna be a lot of objects. If you have a huge hashmap you’ll find that the collector has to work overtime tracing it compared to some other structures that are low in pointers.
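
For contrast, here is a minimal sketch of a structure that is low in pointers: an int-to-int map using open addressing with linear probing over two primitive arrays. IntIntMap is a hypothetical class written for this example, not a library type; it reserves Integer.MIN_VALUE as the "empty" marker, has no removal, and keeps the table at most half full so probing always terminates.

```java
import java.util.Arrays;

final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // assumed sentinel: cannot be used as a key
    private int[] keys;
    private int[] values;
    private int size;

    IntIntMap(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // keep the capacity a power of two
        }
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    private static int slot(int key, int mask) {
        int h = key * 0x9E3779B9; // cheap multiplicative mixing
        return (h ^ (h >>> 16)) & mask;
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        if ((size + 1) * 2 > keys.length) {
            grow(); // stay at or below 50% load
        }
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int getOrDefault(int key, int fallback) {
        int mask = keys.length - 1;
        int i = slot(key, mask);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return fallback;
    }

    int size() {
        return size;
    }

    private void grow() {
        int[] oldKeys = keys;
        int[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        values = new int[oldValues.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != EMPTY) {
                put(oldKeys[i], oldValues[i]); // reinsert into the larger table
            }
        }
    }
}
```

However many entries it holds, this map is three heap objects (the map and its two arrays) with no references between entries, whereas a HashMap<Integer, Integer> of the same size carries an entry object per mapping plus boxed keys and values. The trade is generality for a heap that is much cheaper to trace.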

Summary

Collectors aren’t magic. Minimize the work they have to do by having collector-friendly object lifetimes and heap contents. More values, fewer pointers, that’s always a good thing. And fewer object stores means fewer write barriers, too.

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.