The Dream Is Large. The Claim Is Small.
A first field note on the Resident KV Claims paper, open-source ambition, and why a bounded runtime contract can still feel like a real contribution.
I have wanted, for a long time, to make a useful contribution to open-source AI and machine learning.
That sentence is easy to make too large.
It can start carrying all the wrong things: the wish to be useful, the wish to belong near the frontier, the wish to do work that is not only commentary about other people's systems. None of that is evidence. None of it makes a result true. At its worst, ambition can become a kind of private inflation engine. The story gets bigger because the desire behind it is real.
So the first rule has to be smaller than the dream.
The result has to stand on its own.
The paper I submitted is called "Resident KV Claims: A Conformance Contract for Future Reuse under Active KV Pressure." It is not the whole dream. It is a small conformance-contract paper about one boundary inside LLM serving runtimes. But it is the first piece of work I have made that feels like it might be a real contribution to the open-source AI/ML space, because its boundary is narrow enough to inspect.
The paper should speak for itself. This note is not a replacement abstract. It is a way of saying what the paper is trying to be, why I care about it, and where its claim has to stop.
The Problem Is Not "Cache Better"
The KV cache is the stored attention state a model can reuse instead of recomputing a repeated prefix from scratch.
Modern inference runtimes already know a lot about reusing it. They can reuse repeated prefixes. They can page KV memory. They can expose priority, duration hints, offload paths, routing metadata, scheduler modes, and event streams.
That matters because the contribution is not "someone forgot that KV cache exists." They did not.
The question is narrower.
When a runtime accepts future-reuse state, what does that acceptance mean if a later active request needs the same finite memory?
That was the pressure point I kept coming back to. A future-reuse hint is cheap while it is only language. It becomes expensive when it spends memory. At that moment the system has to decide whether the accepted resident state is still protected, whether the active request should be refused or deferred, whether the claim should be demoted or expired, whether state should move somewhere else, or whether the runtime should report that it harmed a claim it had already accepted.
If the runtime does none of that, accepted future-reuse state collapses back into ordinary cache eviction. Maybe that is fine for a soft hint. It is not fine if the runtime has accepted something stronger.
That is why the paper calls the object a resident KV claim.
A resident claim is stronger than a cache priority or a loose retention hint.
A claim is an application-visible future-reuse object with a materialization predicate, a lifecycle state, an active/resident feasibility outcome, and claim-level telemetry.
Materialization is the part that keeps this from becoming raw block counting. Keeping some scattered KV blocks is not the same as preserving the useful future computation object. A prefix can have value that falls off a cliff: if the leading part is gone, the later fragments may still count as retained blocks while failing the thing the caller actually needed.
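A minimal sketch of that difference, in Python. The function names and the `required_prefix_blocks` threshold are hypothetical illustrations, not the paper's API or vLLM's. The sketch assumes a prefix is only reusable up to its first missing block, which is one possible materialization predicate among several.

```python
from typing import Sequence


def blocks_retained(resident: Sequence[bool]) -> int:
    """Raw block counting: how many of the prefix's blocks are still resident."""
    return sum(1 for present in resident if present)


def usable_prefix_blocks(resident: Sequence[bool]) -> int:
    """Length of the contiguous leading run of resident blocks.

    Assumption: reuse stops at the first missing block, so later
    fragments do not count toward the useful prefix.
    """
    count = 0
    for present in resident:
        if not present:
            break
        count += 1
    return count


def is_materialized(resident: Sequence[bool], required_prefix_blocks: int) -> bool:
    """Hypothetical predicate: the claim holds only if enough of the
    leading prefix survives."""
    return usable_prefix_blocks(resident) >= required_prefix_blocks


# Two six-block prefixes that each lost exactly one block.
head_lost = [False, True, True, True, True, True]
tail_lost = [True, True, True, True, True, False]
```

Both prefixes retain five of six blocks, so a raw block counter treats them the same. Under the leading-run predicate, `tail_lost` still offers five usable blocks while `head_lost` offers none. That is the cliff.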
That distinction is small. I think it is also the sort of distinction runtimes need if future-reuse hints are going to become testable contracts rather than informal promises.
The Boundary Is Physical
In the paper, I try to show that boundary with a deliberately simple example:
protected resident KV = 60 blocks
active live KV = 70 blocks
usable KV pool = 80 blocks
60 + 70 = 130 > 80
No naming trick makes 130 fit into 80.
Something has to give.
A runtime can evict or relax resident state. It can refuse, defer, split, recompute, offload, route, or bound active work. It can deny future reusable admission for the active request. It can add capacity if the system has capacity to add. But if protected resident KV and active live KV draw from the same usable pool, the runtime cannot preserve both without some explicit action.
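The arithmetic can be written as a small feasibility check. This is a sketch under the example's own assumption that protected resident KV and active live KV draw from a single usable pool. The names `Outcome` and `check_feasibility` are made up for illustration and do not come from the paper or from vLLM.

```python
from enum import Enum
from typing import Tuple


class Outcome(Enum):
    FITS = "fits"
    CONFLICT = "conflict"


def check_feasibility(
    protected_resident: int, active_live: int, usable_pool: int
) -> Tuple[Outcome, int]:
    """Return whether both demands fit in one pool, and the shortfall in blocks."""
    demand = protected_resident + active_live
    shortfall = max(0, demand - usable_pool)
    outcome = Outcome.FITS if shortfall == 0 else Outcome.CONFLICT
    return outcome, shortfall


outcome, shortfall = check_feasibility(
    protected_resident=60, active_live=70, usable_pool=80
)
print(outcome.value, shortfall)  # conflict 50
```

The check only detects the conflict and sizes it at 50 blocks. Which action closes the gap (eviction, refusal, deferral, offload, or added capacity) is a separate policy decision.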
One reason this boundary is easy to miss is that "no-admit" sounds stronger than it is.
A write no-admit decision can prevent a large active request from becoming future-reusable state. That is useful. But the request still consumes live KV while it is being served. If active allocation draws from the same physical pool as resident reusable KV, no-admit can arrive too late to protect residents.
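A toy simulation makes the timing visible. It is a sketch, not a model of any real allocator: `Pool` and `serve` are hypothetical, and the pool is assumed to be unprotected, meaning it evicts resident blocks whenever live allocation needs the room.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    capacity: int
    resident: int = 0  # reusable KV held for future requests
    live: int = 0      # KV held by requests being served right now

    def free(self) -> int:
        return self.capacity - self.resident - self.live


def serve(pool: Pool, needed_live: int, no_admit: bool) -> int:
    """Serve one request on an unprotected pool; return resident blocks evicted."""
    if needed_live > pool.capacity - pool.live:
        raise MemoryError("request cannot fit even with every resident block evicted")
    # Live allocation happens first and evicts residents if it has to.
    evicted = min(pool.resident, max(0, needed_live - pool.free()))
    pool.resident -= evicted
    pool.live += needed_live
    # ... the request runs here, holding its live KV ...
    pool.live -= needed_live
    # no-admit only takes effect now, after the request has been served.
    if not no_admit:
        pool.resident += needed_live
    return evicted


with_no_admit = Pool(capacity=80, resident=60)
lost_with_no_admit = serve(with_no_admit, needed_live=70, no_admit=True)

without_no_admit = Pool(capacity=80, resident=60)
lost_without_no_admit = serve(without_no_admit, needed_live=70, no_admit=False)

print(lost_with_no_admit, lost_without_no_admit)  # 50 50
```

In both runs 50 resident blocks are evicted during serving. The flag changes what is resident afterward (10 blocks with no-admit, 80 without), not what the residents lost while the request was live.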
That was the hinge I found interesting.
The problem was not only which cached block to evict. It was what responsibility the runtime had accepted before eviction became necessary.
What The Paper Claims
In the paper, I make that concrete through two artifacts: a small MicroRuntime model and a minimal vLLM arbiter prototype. I needed the smaller model because vLLM was too entangled to separate ownership, admission, and useful-prefix survival cleanly. The MicroRuntime isolates those semantics: claims, capacity, materialization, active live pressure, future-reuse admission, refusal, demotion, expiry, and claim-level telemetry. The vLLM arbiter artifact carries the runtime-facing evidence: litmus traces, a patched vLLM prototype, capacity sweeps, and scheduler-path pressure evidence.
The point is not "faster inference."
The point is that the contract changes what the runtime is allowed to hide.
In the vLLM probes, the decisive step was not only that resident loss changed. It was that the prototype could turn the conflict into scheduler-visible active refusal and name the blocking resident claim directly. That is the contract claim: once the runtime has accepted resident state, it cannot silently reclassify predicate-breaking loss as ordinary cache eviction. If it breaks the claim, it has to emit a conflict outcome that can be reconstructed.
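One way to picture a reconstructible conflict outcome is as a small record that the arbiter emits instead of silently evicting. The sketch below is illustrative only: `Claim`, `ConflictOutcome`, `arbitrate`, and the field names are assumptions, not the paper's schema or the patched vLLM prototype's interface. Naming the largest claim as the blocker is an arbitrary choice made to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Claim:
    claim_id: str
    protected_blocks: int


@dataclass(frozen=True)
class ConflictOutcome:
    action: str                       # "admit" or "refuse_active"
    request_id: str
    blocking_claim_id: Optional[str]  # which accepted claim blocked the request
    shortfall_blocks: int


def arbitrate(
    request_id: str, needed_live: int, usable_pool: int, claims: Sequence[Claim]
) -> ConflictOutcome:
    """Admit the request if it fits beside accepted claims; otherwise refuse
    it and name a blocking claim rather than evicting silently."""
    protected = sum(c.protected_blocks for c in claims)
    shortfall = max(0, protected + needed_live - usable_pool)
    if shortfall == 0:
        return ConflictOutcome("admit", request_id, None, 0)
    # Assumption: report the largest accepted claim as the blocker.
    blocker = max(claims, key=lambda c: c.protected_blocks, default=None)
    blocker_id = blocker.claim_id if blocker is not None else None
    return ConflictOutcome("refuse_active", request_id, blocker_id, shortfall)


result = arbitrate(
    "req-1", needed_live=70, usable_pool=80, claims=[Claim("claim-A", 60)]
)
print(result.action, result.blocking_claim_id, result.shortfall_blocks)
# refuse_active claim-A 50
```

The record carries enough to reconstruct what happened: which request was refused, which accepted claim stood in the way, and by how many blocks.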
That is the contribution I want the paper to make legible.
It identifies a boundary.
It defines the smallest contract surface I could make coherent.
It gives that surface executable tests.
It shows a real runtime path where the conflict outcome changes.
Why The Limits Matter
I do not read the paper's limits as a defensive wrapper around the contribution. They are part of the empirical work. If the claim touches runtimes with serious prior art, then the paper has to say what those systems already do, what they do not define, and where this contract adds a narrower boundary.
The temptation, especially around AI systems work, is to turn every local mechanism into a frontier story. A small runtime witness becomes a serving claim. A conformance trace becomes a benchmark. A patch becomes an API. A good word in the abstract becomes a promise the evidence cannot pay back.
I tried to keep the ambition and the evidence in the same room: honor the closest prior work, name the neighboring mechanisms clearly, and refuse the claims the evidence did not earn.
The dream is still large. I want local machines to do more serious work. I want agentic workflows to produce evidence instead of laundering uncertainty. I want open systems to win more often than closed ones. I want to make contributions that runtime builders, researchers, and other people working close to the metal can actually inspect.
But the first claim has to be the one the evidence can actually carry.
That is how this became a conformance paper. The larger ambition is still systems work: better runtime behavior, stronger performance evidence, and tools that make local AI research less fragile. This paper is the part of that line that became credible first. It names a boundary in the runtime vocabulary, separates existing retention fragments from the accepted-claim lifecycle, defines a minimal contract, gives that contract litmus tests, and shows a vLLM path where the conflict outcome changes.
That is enough.
At least for the first rung, it has to be enough.
Why This One Matters To Me
A lot of my work starts from a strange combination of hunger and constraint.
I care about AI and machine learning research. I care about open source. I also work from local hardware, limited budgets, incomplete time, and the usual mess of trying to turn scattered obsession into something that survives contact with evidence.
That constraint can be frustrating. It can also be clarifying.
The machine under my desk cannot hide the cost of an idea. Memory pressure is not a metaphor there. Runtime behavior is not a diagram. If accepted resident KV and active live KV do not fit in the same pool, the runtime has to pick an outcome. Something has to be evicted, refused, deferred, moved, or reported clearly. If none of that happens, the conflict gets hidden.
That physicality is part of why this paper feels like the right first candidate.
The local setup is enough for a bounded question, not for a global serving claim. The vLLM prototype is evidence for one contract boundary, not for a general policy. The question is smaller and more testable: when future-reuse state becomes an accepted object, what must a runtime do when active work competes for the same memory?
That question is narrow enough to test.
It is also wide enough to matter.
If the paper earns attention (hopeium), the next useful work is not to declare the contract finished. It is to ask how the contract can be lowered onto real serving runtimes without turning every nearby primitive into a false positive. That means obligation-level checks, runtime descriptors, and conformance profiles that let maintainers see when a backend preserves the contract, approximates it, or has to fail closed.
That is slower than the story wants to be.
Good.
The story should move at the speed of the evidence, and of what the search space gives back.
This is a small paper. It is a small contract. It is a small claim on a very large dream.
That is the part I like about it.