Skip to content

Conversation

@pkova
Copy link
Collaborator

@pkova pkova commented Jan 20, 2026

No description provided.

wmypku and others added 30 commits September 2, 2025 20:28
Signed-off-by: wmypku <wmypku@outlook.com>
dozreg-toplud and others added 28 commits January 11, 2026 11:07
QoL PR: you can now generate `compile_commands.json` file with:

```
zig build -Dgenerate-commands
```
I was seeing 502 Bad Gateway in some of responses to polling requests in
the web interface of a Gall agent. Caddy logs showed an error:

```
transport connection broken: malformed HTTP version "nullHTTP/1.1"
```

Notably, I was sending "null" body with 204 No Content response, which
must not have a body. Changing response to "true" changed the HTTP
version to "trueHTTP/1.1"

After playing around in http.c I came to a conclusion that h2o is
probably not waiting for the rest of the response, since it assumes that
it won't have a body according to [the HTTP
spec](https://datatracker.ietf.org/doc/html/rfc9110#section-15.3.5-5).

So what probably happens is, Eyre sends the rest of 204 response body,
while h2o already moved on. Body of the response is left in some buffer
which gets concatenated with the next response, corrupting it.

This PR adds another HTTP request state, `u3_rsat_sown`, which
guarantees that the response was already sent. Following `%continue` and
`%cancel` effects are noops.

Additionaly, response body is set to 0 for all informational responses
(1xx), 204, 205 (Reset Content) and 304 (Not Modified).
After #932 I decided to take a closer look on memory leak reports on
exit from vere, which were previously shrugged off as false positives on
objects that have static lifetimes.

I found that `_lick_io_exit` forgets to free the path to IPC directory
before freeing the struct itself. I also decided to explicitly annotate
objects that supposedly have static lifetimes, so that LSAN printouts on
exit have less noise.

To review: do the objects annotated with `__lsan_ignore_object` actually
have static lifetimes?
Discovered by ~sarpen-laplux, @joemfb might want to add some context
here for posterity.
This PR draft documents the attempt to make road promotion safer when it
comes to async signal handling and `u3m_bail` calls.

- [ ] `jets.c`
- [ ] `nock.c`

### Problem

In Vere we have nonlinear control flow in the form of exception raising
with `u3m_bail` and catching it in various wrappers like `u3m_soft_run`
and signal handling via `u3m_soft_top`/`u3m_signal`. So there are three
ways we can "return" from a wrapped computation:

1. We can actually return from it, copying the product and some
persistent state out of the child road, returning the product to the
caller and integrating the produced persistent state into the state of
the parent road;
2. We can jump out via `u3m_bail` call. The exception will either be
turned into a noun, allowing Nock virtualization via metacircular jets,
or the exception will be raised again to bubble it up to the top,
eventually crashing the event transaction and printing a stacktrace. We
still promote and integrate the accumulated persistent state;
3. We can jump out all the way to the top wrapper via a signal. In this
case we repeatedly perform road promotion and accumulate the stack
traces from all roads. Doing so, **we also promote and integrate the
accumulated persistent state**.

I noticed the latter when experimenting with Ford Lightning build
system: hitting ^C mid build would interrupt it, and if the build was
retried it would take less time than if it was done the first time
without interruption. This is because some persistent caches were
already accumulated and promoted to the home road on SIGINT handling.

This rang an alarm bell in my head: for this to properly work we would
need to make sure that all functions that modify road state (`u3h_*`,
functions that modify bytecode programs) are async-signal safe,
otherwise, we might be promoting state that does not uphold its
invariants.

After a discussion at core blitz on ~2026.1.9, @joemfb raised a concern
that the issue has a broader scope, as signals are not the only piece of
non-linear control flow in the codebase. Functions that modify road
state must also be exception-handling safe: at each `u3m_bail` callsite
the road state needs to conform to its invariants too.

It is worth noting that not all invariants need to be upheld at either
bail or signal raise: for HAMTs it is sufficient to be walkable and have
valid values. An example of HAMT having an invalid value: if a SIGINT
comes between lines 1502-1506 here:

https://github.com/urbit/vere/blob/33671ea187baff627a05996b66f98dc579239fca/pkg/noun/jets.c#L1502-L1506

, the site struct will have a bytecode program pointer that does not
correspond to the battery stored in `bat` field.

_(This particular example does not allow to produce a bug because this
fields are erased anyway when [we call a program from a senior
road](https://github.com/urbit/vere/blob/33671ea187baff627a05996b66f98dc579239fca/pkg/noun/nock.c#L1722-L1723))_

### Solution

I decided to minimize the surface area of code where the async-signal
safety is required by only promoting the road state if:

- the wrapped function returned sucesfully,
- OR if the error is deterministic

With that, the only thing we are doing when async signal or
nondeterministic error are raised is copying out nouns.

#### Nouns are (almost) async-signal safe

The only operation that is not async-signal safe is `u3i_edit`, if it is
performed on a mutable noun (refcount of 1). The only place where it is
used is in the Nock bytecode interpreter, and in case of any crash the
product from the road stack is not being promoted anyway, so the
stacktraces and error reports are safe.

#### HAMTs are fine

The only calls to `u3m_bail` in `hashtable.c` were in `u3_assert` (bail
with c3__oops mote, which is not recoverable) and `u3h/t` macros to
disassemble key-value pairs. These are always cells by construction, so
no change is necessary, but I replaced them with asserting versions for
documentation purposes, just to be sure.

#### Nock interpreter: to verify

Cursory look at `nock.c`/`jets.c` didn't raise any alarms, but I might
be missing something. In `jets.c` `u3h/t` macros are used everywhere to
dissassemble nouns from jet state and to disassemble cores for Nock 9
calls. The former should be infallible, and the latter could
legitimately crash, so more attention is necessary there.
Forces `--no-demand`, makes the docked binary name `<ship>/.run.exe`
instead of `<ship>/.run` and upload the binaries to bootstrap.urbit.org
from CI on windows.
@pkova pkova requested a review from a team as a code owner January 20, 2026 15:37
@pkova pkova merged commit 46cfbb3 into release Jan 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants