Resident KV Claims

A semantic contract for how inference runtimes should behave when resident KV blocks and active KV pressure compete for the same memory pool.

Contract before implementation

The paper is now available as an arXiv preprint, "Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure." It was first submitted on 22 May 2026.

The work starts with a deliberately narrow question: what should an inference runtime promise when resident KV blocks and active KV pressure cannot both fit comfortably in the same pool?

The point is not to pretend memory pressure disappears. The point is to make behavior under pressure explicit enough that systems can be tested, compared, and trusted.

What the paper claims

Resident KV Claims treats future reusable KV as an accepted obligation rather than a loose cache preference.

The paper argues that an accepted claim needs a stable identity, a useful-state predicate, lifecycle events, conflict outcomes, and telemetry scoped to the claim. Without that boundary, ordinary cache behavior can look stronger than it is: a priority value, no-admit path, offload tier, routing hint, or cached-token counter may be useful without showing what happened to an accepted future-reuse object.
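As a rough illustration of that boundary, the sketch below models one accepted claim with the five pieces named above. Every name here (ClaimRecord, Event, is_useful, the choice of events, and the "all claimed blocks still resident" predicate) is a hypothetical stand-in chosen for this note, not the paper's schema or any runtime's API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Event(Enum):
    # Hypothetical lifecycle events; the real event set may differ.
    ACCEPTED = "accepted"
    REUSED = "reused"
    EVICTED = "evicted"
    EXPIRED = "expired"


# Events after which the claim can no longer serve future reuse (assumption).
TERMINAL = {Event.EVICTED, Event.EXPIRED}


@dataclass
class ClaimRecord:
    claim_id: str          # stable identity
    claimed_blocks: int    # KV blocks the runtime accepted
    resident_blocks: int   # KV blocks still resident right now
    events: list = field(default_factory=lambda: [Event.ACCEPTED])

    def record(self, event, lost_blocks=0):
        """Append a lifecycle event and account for any blocks lost."""
        self.resident_blocks = max(0, self.resident_blocks - lost_blocks)
        self.events.append(event)

    def is_useful(self):
        """Assumed useful-state predicate: not ended, all blocks resident."""
        ended = any(e in TERMINAL for e in self.events)
        return not ended and self.resident_blocks == self.claimed_blocks

    def telemetry(self):
        """Telemetry scoped to this claim rather than to the whole cache."""
        return {
            "claim_id": self.claim_id,
            "useful": self.is_useful(),
            "events": [e.value for e in self.events],
        }


claim = ClaimRecord("claim-1", claimed_blocks=60, resident_blocks=60)
claim.record(Event.EVICTED, lost_blocks=10)
print(claim.telemetry())
```

The point of the sketch is only that the answer to "what happened to claim-1?" comes from the claim's own record. A pool-wide cached-token counter could keep rising while this claim is no longer useful.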

The contribution is a conformance contract and a small set of witnesses. A MicroRuntime isolates the semantics. A minimal vLLM prototype shows how hard-protected resident claims can turn active/resident infeasibility into a scheduler-visible active refusal with direct blocking-claim attribution.
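To make "active refusal with blocking-claim attribution" concrete, here is a minimal sketch of an admission check. It is not the vLLM prototype's code and uses no vLLM API; Decision, admit_active, and the attribution rule (name the protected claims only when the request would fit without them) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    admitted: bool
    blocking_claims: tuple = ()  # claim ids the refusal is attributed to


def admit_active(needed_blocks, pool_blocks, active_blocks, protected):
    """Decide whether an active request can be admitted.

    protected maps a hypothetical claim id to the number of blocks held
    by that hard-protected resident claim.
    """
    free = pool_blocks - active_blocks - sum(protected.values())
    if needed_blocks <= free:
        return Decision(admitted=True)

    # The request would fit if protected claims were not holding blocks,
    # so the refusal is attributed to them, largest holder first.
    if needed_blocks <= pool_blocks - active_blocks:
        blockers = tuple(sorted(protected, key=protected.get, reverse=True))
        return Decision(admitted=False, blocking_claims=blockers)

    # Infeasible even with no resident claims: refuse without attribution.
    return Decision(admitted=False)


print(admit_active(70, 80, 0, {"claim-a": 40, "claim-b": 20}))
```

The design choice worth noticing is that the refusal carries claim identities back to the caller, so a scheduler can distinguish "the pool is simply too small" from "accepted resident claims are in the way."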

What it does not claim

This is not a production speedup result, a new eviction policy, or a claim that existing runtimes lack useful KV machinery.

It does not claim native vLLM support, native TensorRT-LLM/SGLang/Dynamo conformance, production offloadability, broad workload incidence, or a general serving benchmark. The vLLM evidence is a controlled prototype and artifact witness for the contract boundary, not a statement that upstream runtimes already implement ResidentClaims.

The pressure case

Modern inference stacks are shaped by memory pressure. When that pressure becomes concrete, the runtime has to decide which blocks remain resident, which blocks move, and what invariants callers can depend on.

This paper treats resident claims as a semantic contract rather than an implementation trick. That distinction matters because it separates the behavior a runtime should expose from the specific allocator or backend that happens to provide it.

The smallest pressure example is intentionally blunt:

protected resident KV = 60 blocks
active live KV        = 70 blocks
usable KV pool        = 80 blocks

60 + 70 = 130 > 80

No naming trick makes 130 blocks fit into an 80-block pool. A runtime can evict, demote, expire, offload, refuse, defer, route elsewhere, recompute, or report harm. The contract question is what the runtime must expose after it has accepted resident state for future reuse.
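The arithmetic above can be written as a small feasibility check. The function name and the returned fields are hypothetical; only the three block counts come from the example.

```python
def check_pressure(protected_resident, active_live, usable_pool):
    """Compare combined KV demand against the usable pool, in blocks."""
    demand = protected_resident + active_live
    return {
        "demand": demand,
        "overflow": max(0, demand - usable_pool),
        "feasible": demand <= usable_pool,
    }


result = check_pressure(protected_resident=60, active_live=70, usable_pool=80)
print(result)  # {'demand': 130, 'overflow': 50, 'feasible': False}
```

The check only establishes that 50 blocks have to go somewhere. Which of the listed outcomes the runtime picks, and what it reports about the accepted claims afterwards, is the part the contract is meant to pin down.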

What I am looking for feedback on

The most useful review is narrow. I am not looking for a general endorsement of the research line. I am looking for corrections to the obligation boundary.

The key question is: which obligation is wrong, too strong, too weak, or already covered by an existing runtime path I missed?

A useful answer could be: the distinction is real; a named runtime path already carries the missing obligation; the contract should be weaker; the contract should be stronger; or this belongs as a debugging and conformance artifact rather than as a runtime API.

That kind of feedback is enough to improve the work.

The current follow-on work uses that posture. One public bridge is a vLLM prefix-cache observability note: a disabled-by-default lookup observation could help explain prefix-cache hits and misses without asking vLLM to own external ResidentClaim semantics.

Follow-up

For the story around the paper, see "The Dream Is Large. The Claim Is Small.", a field note on why this small contract feels like a first possible contribution to open-source AI/ML.

The next step is implementation work: asking how this contract can be lowered onto real serving runtimes without turning every nearby cache primitive into a false positive.