Wrathmark: An Interesting Compute Workload (Part 2)

Rico Mariani
7 min read · Oct 18, 2023


Context

Continuing from the first article, where we discussed the benchmark and the initial attempts to port it to C#: at this stage I started making changes in the codebase to try to improve the speed and to get a more specific sense of where the extra costs were coming from.

For additional information about the benchmark and the test machines please see Part 1.

Source Code

Nature of the Experiment

I began with the assumption that the extra branches we had observed were due to array bounds checks, but as I started looking at the disassembly this did not pan out. What I saw instead was that some of the loopier functions on the “flip” path got very aggressive treatment from clang -O3 that .NET did not match. In the getvert function clang removed the loop entirely and vectorized instead. In getdiag1, getdiag2, putdiag1, and putdiag2 the loop was unrolled to the point that it became an 8-way switch with fall-through. To do this there was a good bit of strength reduction, where computations that were based on the loop index could become constants (e.g., 1<<x can become 1, 2, 4, 8, etc.).

// clang compiles this to no loop at all
// you get a series of vector instructions
uint16_t getvert(uint8_t *me, uint8_t *him, int x, int y) {
  int8_t mask_in = 1 << x;
  uint16_t mask_me = 0x100;
  uint16_t mask_him = 1;
  uint16_t row = 0;
  for (int i = 0; i < 8; i++, mask_me <<= 1, mask_him <<= 1) {
    row |= (me[i] & mask_in) ? mask_me : 0;
    row |= (him[i] & mask_in) ? mask_him : 0;
  }
  return row & ~(0x100 << y);
}

See the disassembly for yourself at Godbolt:

That amount of vector processing is maybe too much to ask a JIT to do, but some unrolling would help and that seems within reach. In the new benchmark I manually unrolled the loop completely (it’s only 8 iterations anyway). I did the same for putvert. Attempting to do loop unrolling more widely made the C# version slower as code size became a factor. Manual unrolling did not help the native code at all because clang was already doing it better than I could. Trying to outsmart clang is a losing proposition these days.
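
To make the change concrete, here’s a sketch of the unrolled shape, shown in C to match the snippet above (the real benchmark code and its C# port may differ in details):

// Sketch of the manual unroll: the getvert loop written out so the
// row masks become compile-time constants (illustrative, not the exact benchmark code)
uint16_t getvert_unrolled(uint8_t *me, uint8_t *him, int x, int y) {
  int8_t mask_in = 1 << x;
  uint16_t row = 0;
  row |= (me[0] & mask_in) ? 0x0100 : 0;  row |= (him[0] & mask_in) ? 0x0001 : 0;
  row |= (me[1] & mask_in) ? 0x0200 : 0;  row |= (him[1] & mask_in) ? 0x0002 : 0;
  row |= (me[2] & mask_in) ? 0x0400 : 0;  row |= (him[2] & mask_in) ? 0x0004 : 0;
  row |= (me[3] & mask_in) ? 0x0800 : 0;  row |= (him[3] & mask_in) ? 0x0008 : 0;
  row |= (me[4] & mask_in) ? 0x1000 : 0;  row |= (him[4] & mask_in) ? 0x0010 : 0;
  row |= (me[5] & mask_in) ? 0x2000 : 0;  row |= (him[5] & mask_in) ? 0x0020 : 0;
  row |= (me[6] & mask_in) ? 0x4000 : 0;  row |= (him[6] & mask_in) ? 0x0040 : 0;
  row |= (me[7] & mask_in) ? 0x8000 : 0;  row |= (him[7] & mask_in) ? 0x0080 : 0;
  return row & ~(0x100 << y);
}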

The second important pattern happens in several places, including putdiag1 and putdiag2. These functions write back the result of a flip in one of the two diagonal directions. At their core they do something like this:

for (int i = y0; i <= y1; i++, mask <<= 1) {
  if (hi & mask) me[i] |= mask; else me[i] &= ~mask;
  if (row & mask) him[i] |= mask; else him[i] &= ~mask;
}

The clang x64 output for one of those lines is shown below. It uses cmovne to do the job of the store without actually branching: both arms of the if are computed and one of them is assigned. There are two copies of this code:

movzx   r9d, r10b
movzx   r11d, byte ptr [rdi + rax + 1]
mov     ebx, r9d
not     bl
mov     ebp, r11d
and     bpl, bl
or      r11b, r9b
test    r10d, edx
movzx   r10d, r11b
movzx   r11d, bpl
cmovne  r11d, r10d
mov     byte ptr [rdi + rax + 1], r11b

The ARM code is wicked good. It does a bit clear and a bit set and then picks the correct result with csel. There are two copies of this code:

ldrb    w13, [x0, x8]
tst     w12, w9
bic     w14, w13, w12
orr     w13, w13, w12
csel    w13, w14, w13, eq
strb    w13, [x0, x8]

But it’s basically the same principle as the X64 — do both and pick the correct one.
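
In C terms, the transform both compilers are applying to each if/else looks roughly like the sketch below; the variable names are mine, for illustration only.

// Both arms are evaluated unconditionally and one result is selected;
// the select becomes cmovne on x64 and csel on ARM rather than a branch.
uint8_t cleared = me[i] & ~mask;      // the "clear the bit" arm
uint8_t set     = me[i] |  mask;      // the "set the bit" arm
me[i] = (hi & mask) ? set : cleared;  // branchless select of one arm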

The .NET code is more naïve, with two copies of the below:

            movzx   r11, byte ptr [rdx+r10]
            test    r9d, ecx
            jne     OR_CASE
            andn    r11d, ecx, r11d
            movzx   r11, r11b
STORE_BYTE: mov     byte ptr [rdx+r10], r11b

            … control flow continues here…

OR_CASE:    or      r11d, ecx
            movzx   r11, r11b
            jmp     STORE_BYTE

Making This Better

In addition to unrolling the loop in getvert/putvert, I made this transform in the *diag functions:

for (int i = y0; i <= y1; i++, mask <<= 1) {
  // flip the bits that need flipping
  me[i] ^= (hi ^ me[i]) & mask;
  him[i] ^= (row ^ him[i]) & mask;
}

This xor logic removes the conditionals entirely. It computes the bits that need to change and then changes them. The identity is that:

x ^= (x ^ y) & mask

puts the masked bits of y into x by first computing what is different between x and y, then keeping only the bits in mask, and then applying those differences with ^=. This is a slick trick to avoid any control flow and get great economy.
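
Here’s a quick worked example, with made-up values, to show the identity in action:

// x              = 0b10100110  (0xA6)
// y              = 0b01010101  (0x55)
// mask           = 0b00001111  (0x0F)
// x ^ y          = 0b11110011  (every bit where x and y differ)
// (x ^ y) & mask = 0b00000011  (only the differing bits we care about)
// x ^= that      = 0b10100101  (0xA5: low nibble now matches y, high nibble untouched)
uint8_t x = 0xA6, y = 0x55, mask = 0x0F;
x ^= (x ^ y) & mask;
// now (x & mask) == (y & mask) and the bits outside mask are unchanged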

On X64, Clang compiles one of those ^= lines into this:

movzx r10d, byte ptr [rdi + rax + 1]
mov r11d, r9d
xor r11d, r10d
and r11d, edx
xor r11d, r10d
mov byte ptr [rdi + rax + 1], r11b

On ARM, clang handles this “xor bit set” trick in a funky way. It has figured out that the xors cancel and it can just move the needed bits directly: it strips them in the destination with bic first, then it can orr in the new bits.

This is hot.

ldrb    w14, [x0, x12]
and     w15, w9, w13
bic     w14, w14, w13
orr     w14, w15, w14
strb    w14, [x0, x12]

The new .NET code looks good too!

movzx   r11, byte ptr [r10]
mov     ebx, r11d
xor     r11d, edi
and     r11d, eax
movzx   r11, r11b
xor     r11d, ebx
mov     byte ptr [r10], r11b

It’s not quite as good as clang but very close and of course no branches now.

And if we do all this? Then what?

New Results

Please refer to Part 1 for the line-by-line meanings in the table. The format is exactly the same.

Interpretation

The main observations that we make here are these:

  • Native speed basically didn’t change; clang didn’t require any assistance from us, and the changes maybe even hurt a little.
  • The ARM results are the most consistent run-to-run, and they showed a slightly worse result; no surprise there really.
  • Both X64 and ARM showed a clear win from removing these branches in every .NET version.
  • The best of the .NET options (inline arrays) is now running at over 70% of native speed, with ARM clocking in at nearly 77%.
  • .NET on ARM is basically 20% faster; X64 is more like 15% faster.

This gives us a sense of how much branches matter on the various architectures.

Microarchitectural Analysis

I repeated the study of key CPU counters in the same cases as Part 1.

These are the same counters as in Part 1, with the same meanings. As noted, in the interest of comparison with Part 1, I used the duration from the original Native benchmark (marked in red) as the control for duration (only) because that seemed more interesting.

Note that the new native benchmark has statistics very similar to the old, hence the choice to use the old duration as the control.

Below is a comparison of the two results showing percentage change. Recall that in all these metrics lower is better.

Observations

  • As expected, this benchmark is not cache sensitive; it fits entirely in cache. The miss rate was already very low. The misses likely correspond with infrequent rescheduling to a different CPU. The gains in cache misses did not correlate with throughput gains at all.
  • Branches dropped clearly and significantly. The simpler control flow paths resulted in fewer overall instructions.
  • Branch misses were not expected to be a huge problem as the miss rate was low; this is borne out by the data.
  • Removing the branches mainly helped because it reduced instruction count, not because it reduced branch misses (the gains don’t correlate with the miss-rate improvement). This, too, was expected.

Conclusions

I have no doubt that with a little further work I could get the managed code to around 90% of full native speed. That’s an interesting rule of thumb for the best you can reasonably hope to do with good effort and experience on the modern .NET Core Framework. I think it’s actually an astonishingly good place to be, especially compared to where things once were.

In none of these benchmarks did we need to give up memory safety, so all the usual .NET benefits are intact.

Some of the high-end -O3 optimizations in Clang seem to be well within reach (e.g., we know that conditional move can kick in; it does in other places). So that really bodes well for the main needs.

Other high-end -O3 features might be possible with enough dynamic information but, inasmuch as they are almost always counter-indicated (because “space is speed” about 98% of the time), I’m not sure they are ripe for consideration.

Even so, today’s JIT is doing some great stuff. We’re seriously comparing it to clang -O3 here, so I’m really pleased. I’m sure the CLR team of 2005 would be proud :D

