Let's write a setjmp

February 12, 2023

nullprogram.com/blog/2023/02/12/

Yesterday I wrote that setjmp is handy and that it would be nice
to have without linking the C standard library. It’s conceptually simple,
after all. Today let’s explore some differently-portable implementation
possibilities with distinct trade-offs. At the very least it should
illuminate why setjmp sometimes requires the use of volatile.

First, a quick review: setjmp and longjmp are a form of non-local
goto.

typedef void *jmp_buf[N];
int setjmp(jmp_buf);
void longjmp(jmp_buf, int);

Calling setjmp saves the execution context in a jmp_buf, and longjmp
restores this context, returning the thread to this previous point of
execution. This means setjmp returns twice: (1) after saving the
context, and (2) from longjmp. To distinguish these cases, the first
time it returns zero and the second time it returns the value passed to
longjmp.

jmp_buf is an array of some platform-specific type and length. I’ll be
using void pointers in this article because it’s a register-sized type
that isn’t behind a typedef. Plus they print nicely in GDB as hexadecimal
addresses which eased in working it out.

Using GCC intrinsics

Let’s start with the easiest option. GCC has two intrinsics doing
all the hard work for us: __builtin_setjmp and __builtin_longjmp. Its
worst case jmp_buf is length 5, but the most popular architectures only
use the first 3 elements. Clang supports these intrinsics as well for GCC
compatibility.

Be mindful that the semantics are slightly different from the standard C
definition, namely that you cannot use longjmp from the same function as
setjmp. It also doesn’t touch the signal mask. However, it’s easier to
use and you don’t need to worry about volatile.

// NOTE to copy-pasters: semantics differ slightly from standard C
typedef void *jmp_buf[5];
#define setjmp __builtin_setjmp
#define longjmp __builtin_longjmp

If you only care about GCC and/or Clang, then that’s it! It works as-is on
every supported target and nothing more is needed. As a bonus, it will be
more efficient than the libc version, though I should hope that won’t
matter in practice. These are so awesome and convenient that I’m already
second-guessing myself: “Do I really need to support other compilers…?”

Using assembly

If I want to support more compilers I’ll need to write it myself. It’s
also an excuse to dig into the details. The execution context is no more
than an array of saved registers, and longjmp is merely restoring those
registers. One of the registers is the instruction pointer, and setting
the instruction pointer is called a jump.

Since we’re talking about registers, that means assembly. We’ll also need
to know the target’s calling convention, so this really narrows things
down. This implementation will target x86-64, a.k.a x64, Windows, but it
will support MSVC as an additional compiler. So it’s a different kind of
portability. I’ll start with GCC via w64devkit then massage it into
something MSVC can use.

I mentioned before that setjmp returns twice. So to return a second time
we just need to simulate a normal function return. Obviously that
includes restoring the stack pointer like the ret instruction, but it
means preserving all the non-volatile registers a callee is supposed to
preserve. These will all go in the execution context.

The x64 calling convention specifies 9 non-volatile rsp, rsp,
rbx, rdi, rsi, r12, r13, r14, and r15. We’ll also need the
instruction pointer, rip, making it 10 total.

typedef void *jmp_buf[10];

setjmp assembly

The tricky issue is that we need to save the registers immediately inside
setjmp before the compiler has manipulated them in a function prologue.
That will take more than mere inline assembly. We’ll start with a naked
function, which means that GCC will not create a prologue or epilogue.
However, that means no local variables, and the function body will be
limited to inline assembly, including a ret instruction for the
epilogue.

__attribute__((naked))
int setjmp(jmp_buf buf)
{
    __asm(
        // ...
    );
}

The x64 calling convention uses rcx for the first pointer argument, so
that’s where we’ll find buf. I’ve arbitrarily decided to store rip
first, then the other registers in order. However, the current value of
rip isn’t the one we need. The rip we need was just pushed on top of
the stack by the caller. I’ll read that off the stack into a scratch
register, rax, and then store it in the first element of buf.

    mov (%rsp), %rax
    mov %rax,  0(%rcx)

The stack pointer, rsp, is also indirect since I want the pointer just
before rip was pushed, as it would be just after a ret. I use a lea,
load effective address, to add 8 bytes (recall: stack grows down),
placing the result in a scratch register, then write it into the second
element of buf (i.e. 8 bytes into %rcx).

    lea 8(%rsp), %rax
    mov %rax,  8(%rcx)

Everything else is a matter of elbow grease.

    mov %rbp, 16(%rcx)
    mov %rbx, 24(%rcx)
    mov %rdi, 32(%rcx)
    mov %rsi, 40(%rcx)
    mov %r12, 48(%rcx)
    mov %r13, 56(%rcx)
    mov %r14, 64(%rcx)
    mov %r15, 72(%rcx)

With all work complete, return zero to the caller.

Putting it altogether, and avoiding a -Wunused-variable:

__attribute__((naked,returns_twice))
int setjmp(jmp_buf buf)
{
    (void)buf;
    __asm(
        "mov (%rsp), %raxn"
        "mov %rax,  0(%rcx)n"
        "lea 8(%rsp), %raxn"
        "mov %rax,  8(%rcx)n"
        "mov %rbp, 16(%rcx)n"
        "mov %rbx, 24(%rcx)n"
        "mov %rdi, 32(%rcx)n"
        "mov %rsi, 40(%rcx)n"
        "mov %r12, 48(%rcx)n"
        "mov %r13, 56(%rcx)n"
        "mov %r14, 64(%rcx)n"
        "mov %r15, 72(%rcx)n"
        "xor %eax, %eaxn"
        "retn"
    );
}

Also take note of the returns_twice attribute. It informs GCC of this
function’s unusual nature, saying the function doesn’t preserve most
non-volatile registers, and induces -Wclobbered diagnostics. Technically
this means we could get away with saving only rip, rsp, and rbp —
exactly as __builtin_setjmp does — but we’ll need the others for MSVC
anyway.

longjmp assembly

In longjmp we need to restore all those registers. For purely aesthetic
reasons I’ve decided to do it in reverse order. Everything but rip is
easy.

    mov 72(%rcx), %r15
    mov 64(%rcx), %r14
    mov 56(%rcx), %r13
    mov 48(%rcx), %r12
    mov 40(%rcx), %rsi
    mov 32(%rcx), %rdi
    mov 24(%rcx), %rbx
    mov 16(%rcx), %rbp
    mov  8(%rcx), %rsp

The instruction set doesn’t have direct access to rip. It will be a
jmp instead of mov, but before jumping we’ll need to prepare the
return value. The x64 calling convention says the second argument is
passed in rdx, so move that to rax, then jmp to the caller. It’s
only a 32-bit operand, C int, so edx instead of rdx.

    mov %edx, %eax
    jmp *0(%rcx)

Putting it all together, and adding the noreturn attribute:

__attribute__((naked,noreturn))
void longjmp(jmp_buf buf, int ret)
{
    (void)buf;
    (void)ret;
    __asm(
        "mov 72(%rcx), %r15n"
        "mov 64(%rcx), %r14n"
        "mov 56(%rcx), %r13n"
        "mov 48(%rcx), %r12n"
        "mov 40(%rcx), %rsin"
        "mov 32(%rcx), %rdin"
        "mov 24(%rcx), %rbxn"
        "mov 16(%rcx), %rbpn"
        "mov  8(%rcx), %rspn"
        "mov %edx, %eaxn"
        "jmp *0(%rcx)n"
    );
}

The C standard says that if ret is zero then longjmp will return 1
from setjmp instead. I leave that detail as a reader exercise. Otherwise
this is a complete, working setjmp. It works perfectly when I swap it in
for setjmp.h in my u-config test suite.

Considering volatile

Now that you’ve seen the guts, let’s talk about volatile and why it’s
necessary. Consider this function, example, which calls a work
function that may return through setjmp (e.g. on failure).

void work(jmp_buf);

int example(void)
{
    int r = 0;
    jmp_buf buf;
    if (!setjmp(buf)) {
        // first return
        r = 1;
        work(buf);
    } else {
        // second return
    }
    return r;
}

It stores to r after the first setjmp return, then loads r after the
second setjmp return. However, r may have been stored in the execution
context. Since it’s used across function calls, it would be reasonable to
store this variable in non-volatile register like ebx. If so, it will be
restored to its value at the moment of the first call to setbuf, in
which case the old r would be read after restoration by longjmp. If
it’s not stored in a register, but on the stack, then on the second return
the function will read the latest value out of the stack. In practice, if
work returns through longjmp, this function may return either 0 or 1,
probably determined by the optimization level.

The solution is to qualify r with volatile, which forces the compiler
to store the variable on the stack and never cache it in a register.

Though since our setbuf is marked returns_twice, GCC will never store
r in a register across setjmp calls. This potentially hides a bug in
the program that would occur under some other compilers, but GCC will
(usually) warn about it.

Pure assembly and MSVC

MSVC doesn’t understand __attribute__ nor the inline assembly, so it
cannot compile these functions. I could compile my setjmp with GCC and
the rest of the program with MSVC, which means I need two compilers.
Instead, I’ll move to pure assembly, assemble with GNU as (TODO: port
to MASM?) so we’ll only need a tiny piece of the GNU toolchain.

	.global setjmp
setjmp:
        mov (%rsp), %rax
	mov %rax,  0(%rcx)
	lea 8(%rsp), %rax
	mov %rax,  8(%rcx)
	mov %rbp, 16(%rcx)
	mov %rbx, 24(%rcx)
	mov %rdi, 32(%rcx)
	mov %rsi, 40(%rcx)
	mov %r12, 48(%rcx)
	mov %r13, 56(%rcx)
	mov %r14, 64(%rcx)
	mov %r15, 72(%rcx)
	xor %eax, %eax
	ret

	.globl longjmp
longjmp:
	mov 72(%rcx), %r15
	mov 64(%rcx), %r14
	mov 56(%rcx), %r13
	mov 48(%rcx), %r12
	mov 40(%rcx), %rsi
	mov 32(%rcx), %rdi
	mov 24(%rcx), %rbx
	mov 16(%rcx), %rbp
	mov  8(%rcx), %rsp
	mov %edx, %eax
	jmp *0(%rcx)

Then some declarations in C:

typedef void *jmp_buf[10];
int setjmp(jmp_buf);
_Noreturn void longjmp(jmp_buf, int);

I’ll need to enable C11 for that _Noreturn in MSVC. Assemble, compile,
and link:

$ as -o setjmp.obj setjmp.s
$ cl /std:c11 program.c setjmp.obj

That generally works! If I rename to xsetjmp and xlongjmp to avoid
conflicting with the CRT definitions, drop them into the u-config test
suite in place of setjmp.h, then compile with MSVC, it passes all tests
using my alternate implementation in MSVC as well as GCC. Pretty cool!

Takeaway

I’m not sure if I’ll ever use the assembly, but writing this article led
me to try the GCC intrinsics, and I’m so impressed I’m still thinking
about ways I can use them. My main thought is out-of-memory situations in
arena allocators, using a non-local exit to roll back to a savepoint, even
if just to return an error. This is nicer than either terminating the
program or handling OOM errors on every allocation. Very roughly:

typedef struct {
    size_t cap;
    size_t off;
    void *jmp_buf[5];
} Arena;

// Place an arena and savepoint an out-of-memory jump.
#define OOM(a, m, n) __builtin_setjmp((a = place(m, n))->jmp_buf)

// Place a new arena at the front of the buffer.
Arena *place(void *mem, size_t size)
{
    assert(size >= sizeof(Arena));
    Arena *a = mem;
    a->cap = size;
    a->off = sizeof(Arena);
    return a;
}

void *alloc(Arena *a, size_t size)
{
    size_t avail = a->cap - a->off;
    if (avail < size) {
        __builtin_longjmp(a->jmp_buf, 1);
    }
    void *p = (char *)a + a->off;
    a->off += size;
    return p;
}

Usage would look like:

int compute(void *workmem, size_t memsize)
{
    Arena *arena;
    if (OOM(arena, workmem, memsize)) {
        // jumps here when out of memory
        return COMPUTE_OOM;
    }

    Thing *t = PUSHSTRUCT(arena, Thing);
    // ...

    return COMPUTE_OK;
}

More granular snapshots can be made further down the stack by allocating
subarenas out of the main arena. I have yet to try this out in a practical
program.

Let’s write a setjmp

Using GCC intrinsics

Using assembly

setjmp assembly

longjmp assembly

Considering volatile

Pure assembly and MSVC

Takeaway

Bayesian Statistics: The three cultures

Reverse-engineering my speakers’ API to get reasonable volume control

Zen 5’s 2-ahead branch predictor: how a 30 year old idea allows for new tricks

LEAVE A REPLY Cancel reply

Most Popular

Facebook doesn’t think hackers accessed third-party sites

It’s getting a lot harder for global brands to win in China

Why it’s time for investors to go on the defense

Facebook doesn’t think hackers accessed third-party sites

Recent Comments

EDITOR PICKS

Top Fashion Trends to Look for in Every Important Collection

Spring Fashion Show at the University of Michigan Has Started

Top Ten Kitchen Shortcuts for Indian Food Delights

POPULAR POSTS

Reflecting on 18 Years at Google

Gboard Hat Version

Feathered robotic wing paves way for flapping drones

POPULAR CATEGORY