Lab outage: btrfs cannot delete its way out of a full disk

The lab is down.

The disk filled. That should be a bad afternoon. On btrfs, in the year 2026, on Linux 7.0.1, it is an outage. The filesystem will not delete files, because btrfs is copy-on-write: deleting a file writes new metadata before the old blocks are released, and writing metadata requires space. It will not delete snapshots either, for the same reason. The one operation a full filesystem must support, making itself less full, is the operation it cannot perform.

This is nuts. It would have been nuts in 2010.

We lost no data. The distributed replicas did their job. What we did lose is availability, because the lab frontend is configured to talk only to the master. That is on us, and we are fixing it in the same window.

Here is the plan, executing now:

  • The PostgreSQL master is moving to ZFS on OmniOS. ZFS does not lock itself out of its own free-space accounting when a pool fills. It tells you the pool is full and lets you delete things. That is the bar.
  • The lab is being reconfigured to fail over to a replica when the master is unreachable, instead of sitting there staring at a dead socket.
  • WAL archiving is being pointed at object storage so the next standby we bring up does not have to be in the same rack as the master to be useful.
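For the failover item, a minimal sketch of the idea, not the lab's actual code: keep an ordered list of database endpoints, primary first, and pick the first one that accepts a TCP connection. The host names, the port, and the helper names tcp_reachable and pick_endpoint are all hypothetical.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]

# Hypothetical endpoints, in priority order: primary first, replicas after.
LAB_DB_ENDPOINTS = [
    ("db-master.lab.example", 5432),
    ("db-replica-1.lab.example", 5432),
]


def tcp_reachable(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def pick_endpoint(
    endpoints: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first endpoint that answers the probe, or None if none do."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

This only checks reachability. A PostgreSQL standby is read-only until it is promoted, so a client that lands on a replica has to degrade to read-only behavior. If the client connects through libpq, a comma-separated host list combined with target_session_attrs may cover the same ground without custom code.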

Services that do not depend on the lab database — the site, the tap, release artifacts on Codeberg — stay up.

ETA for full recovery is today.

Two things written down so we read them next time:

  1. A filesystem that cannot delete files when it is full is not finished.
  2. A client that only knows how to talk to the primary does not have a replica. It has a spare it has never met.

Back shortly.
