WASM and SQLite Investigations Part 2

Continuing the previous discussion with some more issues and followup.

This document focuses on tuning the VFS and running some additional experiments with lower-level libraries.

Issue #11: What factors are affecting IndexedDB performance the most?

What about deferred writes?

Consider the essential method in wa-sqlite/src/examples/IndexedDbVFS.js

_putBlock(store, file, index, blockData) {
  const key = this._blockKey(file.meta.name, index);
  file.cache.delete(key);
  file.cache.set(key, { data: blockData, dirty: true });
  /* this._purgeCache(store, file); */
}

The existing VFS always flushes its write cache on every block, and it does this on all writes, not just at sync operations. That’s pretty conservative. Maybe we still need to do this on writes to the journal, but a simple experiment is to remove the forced flush entirely. So I commented out that call.
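As a sketch, the deferred policy amounts to marking blocks dirty in the cache and writing them out only at sync time. This is a minimal illustration of the idea, not the actual wa-sqlite code; the class and method names here are hypothetical:

```javascript
// Minimal sketch of a lazy-write block cache (hypothetical names; the
// real VFS keys blocks by file name + block index in IndexedDB).
class LazyBlockCache {
  constructor(writeBlock) {
    this.cache = new Map();       // key -> { data, dirty }
    this.writeBlock = writeBlock; // async (key, data) => void, e.g. a storage put
  }

  // Writes touch only the in-memory cache; nothing hits storage yet.
  put(key, data) {
    this.cache.delete(key); // delete + re-insert refreshes LRU order
    this.cache.set(key, { data, dirty: true });
  }

  // Dirty blocks reach storage only when SQLite asks for a sync.
  async sync() {
    for (const [key, entry] of this.cache) {
      if (entry.dirty) {
        await this.writeBlock(key, entry.data);
        entry.dirty = false;
      }
    }
  }
}
```

With this policy a hot block can be overwritten many times in cache and reach IndexedDB only once, in roughly its finished form.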

This is WAL mode, with the same “insert 1,000,000 rows and select twice” test as in the previous document. The results were:

8k cache, lazy write: 43.285s, 42.024s, 7.985s *, 41.871s, 8.570s *, 45.750s

We’re seeing more of those hyper-fast runs again (marked *), but this time I have a solid theory; more on that later. Looking at just the normal runs, we see a marked reduction from the baseline of 76.192s: the new execution time is about 57% of the old.

What about a bigger block size?

Again looking at wa-sqlite/src/examples/IndexedDbVFS.js, we see:

// Default block size for new databases.
const BLOCK_SIZE = 32768; // was 8192

Holding the write policy unchanged (i.e. back to normal, no lazy write) but increasing the block size from 8k to 32k gave these results:

32k block writes, normal policy: 25.205s *, 40.186s, 41.561s, 24.280s *

Again these are markedly better than before. The normal runs were about 53% of the baseline time. Again this is with no change in sync policy.

What about both?

32k block with lazy writes: 19.106s *, 40.067s, 16.755s *

The “normal” case is not affected much by combining the two (in fairness, I only did the one run), but the mystery * case is still going on. We’ll get to that in a second…

Are you hallucinating?

No, I redid some control runs. Note that the “mystery factor” is much subdued here; in fact, as we’ll see shortly, these runs are really the same. With 8k pages and no cache policy changes, we consistently get very close to the numbers we saw in the first experiment.

Control Run: 74.705s *, 76.767s, 74.237s *, 78.628s

What is the mystery factor already?

This picture tells the story. I was using exclusive locking, so whenever I started a new test case I ran into the problem that the db was left locked (we will need an answer to that problem, by the way). I selected “Delete and Reload” and then ran the experiment.

Why should this make things faster? Clearing out the database? We have to look at the VFS for the answer but it’s quite direct:

if (blockIndex <= lastBlockIndex && blockBytes < meta.blockSize) {
  // The write is to only part of a block that may have been
  // already written.
  blockData = await this._getBlock(store, file, blockIndex);
}
if (!blockData) {
  // We should reach here when:
  // - writing a complete block
  // - writing past the previous last block of the file
  // - writing in an empty gap
  blockData = new ArrayBuffer(meta.blockSize);
}

As you can see, if the write is partial and the data is not off the end of the file then we have to do a read/modify/write operation to do the write. This is considerably more expensive. If the database begins empty then almost every write we do will be off the end, and even once we have done a partial block write the last block is hot in the cache so we won’t have to read it from IndexedDB (i.e. any getBlock would be quite fast).
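Put another way, the read-modify-write is needed only for a partial write to an existing, uncached block. A small predicate capturing that decision (the helper name and signature are mine, for illustration):

```javascript
// Hypothetical helper: a write needs a preliminary read only when it
// covers part of a block that already exists in the file and is not
// sitting in the cache. Every other write can start from a fresh buffer.
function needsReadBeforeWrite(blockIndex, writeBytes, blockSize,
                              lastBlockIndex, cachedIndexes) {
  const partial = writeBytes < blockSize;        // not a full-block write
  const existing = blockIndex <= lastBlockIndex; // not past end of file
  return partial && existing && !cachedIndexes.has(blockIndex);
}
```

On an empty database almost every write fails the `existing` test, which is exactly why the * runs are so fast.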

Now you may cry foul: the script doesn’t have to delete 1,000,000 rows of the dummy table if it starts empty, so I controlled for that. The drop by itself takes only ~1.4s, and that is enough to make the control runs effectively the same duration, because the * runs save exactly that raw 1.4s.

So what this means is that even just changing to 32k pages results in a lot more clean writes: the big page stays in cache, allowing later writes to land fully within the cached page, and hence no read. On standard pages with lazy writing we’re saving writes directly, because a page doesn’t hit the disk until it’s basically in its finished form. The two combined are better still, as both effects are helping. But there are downsides to large page sizes when it comes to writing within the file.

This means that if the file is not growing, we run in ~22% of the time, or about 4.5x faster.

What about the mystery effect in the last document?

I am guessing here, and it only happened sometimes, but I have to assume that when we saw the fast numbers it was because SQLite had not yet reclaimed the space of the just-dropped table in the file, for whatever reason, so we were effectively running in fresh space, just as we are if we delete the database first. I was using “force unlock” before. With the drop in place, the presence of old data was necessary but not sufficient for a “fast” run.

So with lazy writes we can reasonably expect to insert 1M rows in ~20s. This is still maybe 40x slower than raw native, but it’s a lot better than the 76s we were seeing previously. Random writes will benefit from lazy writing and a good cache policy. The bulk insert case is very bad for caching overall, with only the final block possibly benefiting due to re-reads.

Issue #12: Could we possibly run our other base libraries?

We had tremendous success running the CQL demos (see previous results) but what about more of the code? Again things went spectacularly well.

I didn’t want to drag BUCK kicking and screaming into this, but luckily we have a simple script that compiles some low-level files directly. I made trivial changes to this script so that “emcc” was used everywhere “clang” or “clang++” appeared. I ran into only these trivial problems:

  • The emcc target was “unknown”, so I faked it as standard Linux
  • Array sizes specified by a const int didn’t work, so I changed the const to a macro
  • The ‘abort’ shim in the test didn’t work in this environment, so I disabled that test

That’s it: 559 tests for the base fabric all passed in a browser with virtually no effort.

Issue #13: What about pthread and friends?

I looked into pthread support. I didn’t want to take on more of our libraries, as I had no stand-alone script for them, and I didn’t think it was necessary in order to learn what we needed to learn. Generally the documentation was very clear that pthread (plus locking on threads, etc.) is supported in later versions of WebAssembly.

The long and the short of this is that it is very easy to find threading demos that verify it really does work. However, the bad news is that the necessary primitives, especially SharedArrayBuffer, are not supported on Safari.

And indeed:

The good news is that it is in their Tech Preview but it has to be manually enabled.

However, there is also bad news:

So apparently this is going to need some rethinking before Apple is likely to buy into it.

Chrome appears to be able to do this with COOP and COEP protections. Maybe that would work for Apple too?

The security concerns are discussed in more detail at:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/SharedArrayBuffer#security_requirements
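Per the MDN page above, a page only gets `SharedArrayBuffer` when it is cross-origin isolated, which the server opts into with two response headers: `Cross-Origin-Opener-Policy: same-origin` and `Cross-Origin-Embedder-Policy: require-corp`. A small feature-detect sketch (the `canUseThreads` helper name is mine):

```javascript
// Returns true only when the environment is cross-origin isolated and
// actually exposes SharedArrayBuffer. Pass in globalThis (or self in a
// worker); without the COOP/COEP headers this comes back false.
function canUseThreads(globalObj) {
  return globalObj.crossOriginIsolated === true &&
         typeof globalObj.SharedArrayBuffer === "function";
}
```

A pthread-enabled wasm build should check something like this up front and fall back to a single-threaded path when it fails.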

I had planned to stop here, but the quick test I did on Safari opened such a large can of worms that I had to investigate.

Issue #14: What about other browsers?

I thought a quick test of Safari would be in order and imagine my surprise when I got these results:

This is the first run (fast because it’s the empty-db append case) on the stock build with no optimizations. So this would be a “*” run.

Now keep in mind this is effectively the control run for Safari. The baseline numbers on this same machine on Chrome are:

Chrome Control Run: 74.705s *, 76.767s, 74.237s *, 78.628s

Again, the * above indicates the clean-db case. Safari is over 10x faster!

The non-clean case comes in well under 10s. Let’s have a quick look at those…

So Safari Control: 9.114s, 8.326s, and 6.041s.

This demanded extra investigation.

Data source: https://rhashimoto.github.io/wa-sqlite/demo/ running the same insert 10⁶ rows benchmark detailed previously.

Safari with optimizations mentioned previously: 4.436s*, 7.388s, 7.121s

This brings us to…

Issue #15: What the hell is going on here?

So I decided to start doing some real tracing rather than just simple wall-clock times. The first thing I did was look into Chrome and see what was going on. The trace looks rather like this in their dev tools.

Now this is just a small window because, of course, you can’t see anything if you look at all 70s; it’s a blur. But what we can see are clear impulses of CPU with huge gaps between them. The above is about 100ms of time.

The Safari tools offer less clarity, but nonetheless they show a different picture.

What we see in this interval is clearly that the CPU stays very busy the whole time. This is a ~1s interval, which was the best I could do with those devtools, but clearly it doesn’t have the huge gaps. So a lot more async is going on here and a lot less blocking on writes. It could come down to differences in the IndexedDB implementation or in how Safari handles micro-tasks.

There is a mostly regular heartbeat for the inserts here, which we would expect. We would also expect occasional breaks in the beat as the WAL is written.
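One plausible shape for the Safari/Chrome difference above: if the VFS awaits every storage request individually, each write pays a full task-queue round trip; if the requests are queued together and awaited once, the engine can overlap them. A generic sketch of the two patterns (`write` stands in for an IndexedDB put; this is an illustration, not the wa-sqlite code):

```javascript
// One round trip per item: nothing overlaps, and gaps appear between
// writes while each one is awaited in turn.
async function serialWrites(items, write) {
  for (const item of items) {
    await write(item);
  }
}

// All requests issued up front, one await for the whole batch: the
// storage layer is free to overlap the work, keeping the CPU busy.
async function batchedWrites(items, write) {
  await Promise.all(items.map((item) => write(item)));
}
```

Both produce the same writes; they differ only in how much concurrency the storage layer is allowed, which is exactly the kind of difference these traces suggest.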

So I decided to take a look where the tooling is better and moved to Windows, which would also let us check out a different ecosystem and the Edge browser.

So let’s have a look at some ETW traces:

The above is a trace of 118ms of the baseline build on the Edge browser (this is, of course, the Chromium-based Edge). I went here because I expected it to have the best tooling. Here you can see the I/O on the lower graph and, above it, the CPU usage spread mostly across 3 threads. Note that “400% CPU” usage means “4 threads at 100%”. So here we’re typically seeing three threads running, with occasional peaks. But you can see how the sync I/O is happening. These are 8k pages, and there is a whole lot of waiting for those pages to write in this build.

I did fewer runs here, but the standard run was taking about 25s on this hardware. This compares to the ~70s run on my MacBook. It’s a pretty burly laptop, so I’m not that surprised that it’s a bit faster than the MacBook.

Edge Control Run: 23.058s*, 24.882s, 25.627s

Specs: Core i7-6700HQ @ 2.60GHz; 64GB RAM (4 cores, 1 socket, 8 threads, 256k L1, 1M L2, 6M L3)

Now if we look at what some of our optimizations did, we get this new trace:

The time scale is similar: it’s 122ms vs. 118ms on the x axis, so nearly the same. The width of the writes is also nearly the same, at about 4ms in both. But clearly the second is much more effective. As before, the second trace has the optimizations mentioned earlier.

This build takes only 14.3s to run the benchmark (normal case).

Optimized run: 6.767s*, 6.850s*, 12.511s, 11.825s, 11.679s => roughly 2x better

I also controlled this against Chrome and got the same times, which is not a surprise since they are both Chromium.

All of this leads me to these conclusions:

  • It’s likely that Safari is taking a very different approach to IndexedDB writes; we could look into this, but I suspect they are less complete when they let the async promise resolve
  • I was able to get a significant speedup in the VFS with just a cursory look at it; if we thought harder about how to do the async and when exactly to sync, we could do considerably better still
  • Safari is probably giving us the best-we-can-hope-for number, and that number seems very reasonable; more careful use of await can probably get us closer to the Safari time on Chromium
  • We should probably have a look at Firefox and maybe follow up with some perf folks on Edge/Chrome to see what’s going on here

I’m a software engineer at Facebook; I specialize in software performance engineering and programming tools generally. I survived Microsoft from 1988 to 2017.