The Agentic Drift

Debugging distributed storage with an AI agent, and where it runs out.

Sam Li

2026-05-29

FOSS

Claude Code reduces friction in repetitive workflows such as testing, benchmarking, debugging, and scaffolding. But the mental fatigue it brings is not negligible. I have suffered headaches and exhaustion from constantly judging whether its answers are reasonable. The Opus models are powerful at coding, and we can find plenty of excuses for them when they fail to follow directions or go off track, but there are still problems they cannot solve without human intervention.

It is a disaster when an engineer stops steering the project and just lets Claude improvise. I admit there is something seductive about watching it scaffold a toy project, stand up a system component, or produce performance statistics. But it makes me wonder what the real goal of building a project is, and how to drive it forward. What problems are left to work on? At what level do we need to understand a project?

I have run into several hard issues while working on the stability of RMR, a distributed block storage system over RDMA networks. They are not always reproducible, and they can behave differently across environments. Finding a root cause can mean crossing several modules. For example, one concurrency issue on RMR appears only on QEMU VMs. The first error line is either “rdma_rxe: rxe_reg_fast_mr: mr->lkey = 0x12c634 not free” or “rtrs_client: Failed IB_WR_REG_MR: WR flushed”, followed by a cascade of IO errors. At first glance I thought it was related to IO queue management: maybe in-flight requests were not tracked properly, or there was a use-after-free. I had Claude run the repetitive part: rebuilding the modules, booting the VMs, running the concurrent IO test, and watching the same disconnect come back. That fix-verify loop just went down a rabbit hole.

In the end, a colleague suggested running the tests on real hardware instead of VMs, and the issue did not occur there. It was the rxe soft-RoCE stack on QEMU buckling under concurrent load — a virtual transport with no flow control, dropping the link. When there is not enough training data, an agent cannot do much better than a human. It had no feel for how that particular soft-RoCE setup behaved under stress, so it had nothing to push back with. It simply went along with my framing and kept optimizing inside it.

Another issue was hard to capture because it reproduces easily only when IO is running to the servers while data is also syncing between them. The first error line is infiniband mlx5_0: mlx5_ib_post_send:1101:(pid 800921) — mlx5 returning -ENOMEM because its send-work-queue ring is full. The real question was why mlx5 thought the queue was full when RTRS’s own bookkeeping insisted there was plenty of room left.

The trigger is the overlap, not the load. A single-node pool under heavy fio — even at queue_depth=2048 — stays clean. Pour the same IO over a replicated pool, so that the storage↔storage sync session is draining dirty chunks from srv0 to srv1 at the same time, and it disconnects within tens of seconds, deterministically. The question then becomes why this rarely happens on the RNBD stack (a block device over RTRS), or on the md-raid1-over-RNBD stack. Stress testing across ten RNBD devices eventually reproduced the same issue.

The root cause turned out to be a unit mismatch buried two layers down. rtrs-server tracks a software “credit” counter for how many send-queue slots it thinks are free, and it spends and refunds that counter in units of chains — one per response. But each chain is actually one to four work requests, and the hot case on the sync read path (RDMA_WRITE -> SEND_WITH_INV -> IMM) is three. mlx5 fills its ring in WRs, not chains. On average the two ledgers balanced, so there was no steady leak and nothing looked wrong in isolation. The damage was bursty: between two signaled completions — signal_interval = 512 by default — rtrs would admit hundreds of chains, each quietly pushing up to three WRs into a queue it was counting as one. After about 910 admissions, the real ring hit its max_post of 2730 while the credit counter still read 626, “plenty free.” The next post overflowed, and the whole session came down.
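The arithmetic of that burst can be replayed in a few lines of Python. This is a toy model under stated assumptions, not the rtrs code: the function name admit_burst is made up for illustration, every chain is taken to be the three-WR hot case, and no signaled completion refunds anything during the burst. The ring size of 2730 comes from the numbers above. The starting credit of 1536 is an assumption, chosen because 1536 - 910 gives the 626 that the counter still showed.

```python
def admit_burst(ring_capacity, initial_credits, wrs_per_chain):
    """Admit chains back to back, with no signaled completion in between.

    Returns (chains_admitted, credits_left, overflowed).
    """
    credits = initial_credits   # software ledger, counted in chains
    ring_used = 0               # hardware ring occupancy, counted in WRs
    admitted = 0
    while credits > 0:
        if ring_used + wrs_per_chain > ring_capacity:
            # The ledger still shows room, but the ring cannot take the post.
            return admitted, credits, True
        ring_used += wrs_per_chain
        credits -= 1            # one credit per chain, whatever its length
        admitted += 1
    return admitted, credits, False


# 2730 and 3 come from the text; 1536 is an assumed starting credit.
print(admit_burst(ring_capacity=2730, initial_credits=1536, wrs_per_chain=3))
# -> (910, 626, True)
```

With wrs_per_chain=1 the same call returns (1536, 0, False): when a chain really is one WR, the two ledgers agree and the credits run out before the ring does.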

In testing, no rtrs disconnection occurred at sess_queue_depth = 1024. The reason turned out to be almost incidental: max_send_wr = 3·qd + 1, so doubling the session depth doubled the send queue as a side effect, and the bursty overshoot suddenly had room to land. That means the more honest version of the workaround is to raise max_send_wr directly, giving the SQ enough headroom without inflating the whole session window and receive side along with it. But both are still headroom, not correctness: they enlarge the ring so the undercount no longer overflows it on this workload, while leaving the unit mismatch in place to bite again the moment chain_len or the burst pattern shifts. The actual fix was unglamorous once we understood it: count credits in the WRs the chain will really post, and lower signal_interval so refunds arrive often enough to keep the ledger honest.
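Both halves of that paragraph, the headroom arithmetic and the WR-based ledger, can be sketched in Python. This is an illustration under assumptions, not the actual patch: WrCredits and its methods are hypothetical names, and the depth of 512 is inferred from the remark that 1024 was a doubling. Only the formula max_send_wr = 3·qd + 1 is taken from the text.

```python
def max_send_wr(qd):
    """Send-queue size as given in the text: 3 * qd + 1."""
    return 3 * qd + 1


class WrCredits:
    """Hypothetical send-queue ledger counted in WRs instead of chains."""

    def __init__(self, total_wrs):
        self.free = total_wrs

    def try_reserve(self, chain_len):
        """Reserve one slot per WR in the chain; refuse if they do not all fit."""
        if chain_len > self.free:
            return False
        self.free -= chain_len
        return True

    def refund(self, wrs_completed):
        """Return slots once a signaled completion covers those WRs."""
        self.free += wrs_completed


print(max_send_wr(512), max_send_wr(1024))   # -> 1537 3073

credits = WrCredits(max_send_wr(512))        # 512 is an assumed session depth
admitted = 0
while credits.try_reserve(3):                # three-WR chains, no refunds yet
    admitted += 1
print(admitted, credits.free)                # -> 512 1
```

Counted this way, admission stops at 512 three-WR chains with one slot to spare, so the ledger can never promise more than the ring holds. How soon the sender is unblocked then depends on how often refund is called, which is the role a lower signal_interval plays.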

I wish those were the only tedious parts. The standard workflow for testing kernels on our staging machines is to build Debian packages — which takes a couple of hours at minimum — install them across several servers, and then set up the test environment on each freshly booted kernel. I managed to cut most of that out by building the kernel from source directly on the servers instead of packaging it. Because the parts I am changing are loadable modules, I can rebuild and reload just those modules without rebooting at all. The scaffolding mostly disappears, and I get to stay in the reasoning instead of waiting on builds.

I had spent a few weeks researching strong consistency for replicated storage, trying to pin down what guarantee the map-version work actually needs. That led me to add a Raft-like replicated log to RMR. I wrote the basic skeleton code and kept my attention on the design. The design documentation grew as the conversation went on. The agent built the tests out from there, tests first, then settled into the test-fix loop and got stuck.

Log-length-as-freshness works only because it smuggles in a total order, and producing that order forced rules that do not fit our topology: the connected-to-client election rule, the term-inflation guard, solo-commit versus quorum. Most of the recent effort went into debugging those Raft mechanics rather than the freshness problem I set out to solve, which was the first sign I was heading the wrong way.

The realization is that strong consistency and version control are two different problems, and I had been treating them as one. Version control asks which node holds the freshest data and how to resync the rest. It is an after-the-fact question, answered during recovery by map_ver, log length, or DRBD-style generation identifiers. Strong consistency asks whether all writers agree on the in-set at the moment they set or clear a dirty bit, so that the maps never diverge to begin with. So the map-version effort was aimed at the wrong axis. The design needs to be revisited starting from the consistency requirement, with version control kept as a separate concern.