2024-05-28 07:15:53

Hector Martin on Nostr: So I just pushed a kernel fix for Asahi Linux to (hopefully) fix random kernel ...

So I just pushed a kernel fix for Asahi Linux to (hopefully) fix random kernel panics.

The fix? Increase kernel stacks to 32K.

We were running out of stack. It turns out that when you have zram enabled and are running out of physical RAM, a memory allocation can trigger a ridiculous call-chain through zram and back into the allocator. This, combined with one or two large-ish stack frames in our GPU driver (2-3K), was simply overflowing the kernel stack.

Here's the thing though: if we were hitting this with simple GPU stuff, I *guarantee* there are kernel call paths that would also run out of stack, today, in upstream kernels with zram (i.e. vanilla Fedora setups). Yes, the GPU driver has a few large stack frames because Rust, but its call stack is shallow, and all it's doing is a regular memory allocation, which then triggers the rest of the chain all the way into the overflow.

I'm honestly baffled that, in this day and age, 1) people still think 16K is acceptable, and 2) we still haven't figured out dynamically sized Linux kernel stacks. If we're so close to the edge that a couple KB of extra stack from Rust nonsense causes kernel panics, then long-tail corner cases of complex subsystem layering are definitely going over the edge *already*, and people's machines are definitely crashing already, just perhaps less often.

I know there was talk of dynamic kernel stacks recently, and one of the issues was that implementing it is hard on x86 due to a series of bad decisions made many years ago, including the x86 double-fault model and the fact that on x86 the CPU implicitly uses the stack on faults. Of course, none of this is a problem for ARM64, so maybe we should just implement it on ARM64 first and let the x86 people figure something out for their architecture on their own ;).

But on the other hand, why not increase stacks to 32K? ARM64 got bumped to 16K in *2013*, over 10 years ago. Minimum RAM size has at *least* doubled since then, so it stands to reason that doubling the kernel stack size is entirely acceptable. Consider a typical GUI app with ~30 threads: with 32K stacks, that's 960K of kernel stack in total, less than 1MB of RAM, and any random GUI app is already going to use many times more than that in graphics surfaces.
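
For anyone who wants to check the arithmetic, here's a quick back-of-the-envelope sketch in Python. It assumes exactly one kernel stack per thread and ignores every other per-thread cost; the function name and the 30-thread figure are just illustrative, not anything from the kernel.

```python
KIB = 1024  # bytes per KiB

def kernel_stack_bytes(threads: int, stack_kib: int) -> int:
    """Total kernel stack memory, assuming one stack per thread (simplification)."""
    return threads * stack_kib * KIB

threads = 30  # rough thread count for a typical GUI app, per the estimate above
at_16k = kernel_stack_bytes(threads, 16)
at_32k = kernel_stack_bytes(threads, 32)

print(f"16K stacks: {at_16k // KIB} KiB total")
print(f"32K stacks: {at_32k // KIB} KiB total")
print(f"extra cost of doubling: {(at_32k - at_16k) // KIB} KiB")
```

So the actual increase from going 16K to 32K for such an app is about 480K, which is even smaller than the total.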

Of course, the hyperscalers will complain because they run services that spawn a billion threads (hi Java) and they like to multiply the RAM usage increase by the size of their fleet to justify their opinions (even though all of this is inherently relative anyway). But the hyperscalers are running custom kernels anyway, so they can crank the size down to 16K if they really want to (or 8K, I heard Google still uses that).
Author Public Key
npub1qk9x6yrvten3jqyvundn7exggm90fxf9yfarj5eaz25yd7aty8hqe9azpx