Research

Systems close to the machine.

I work at the point where model ideas meet practical systems: memory limits, local hardware, agent workflows, and the tooling needed to study them.

The work moves between papers, implementation probes, and research tools, with a bias toward things that can be tested on real machines.

Resident KV arc

Future reuse is an obligation boundary.

Resident KV Claims is a research line about when future KV reuse becomes a runtime responsibility, not just a cache hint. The public paper defines the contract: accepted future-reuse state needs claim identity, a materialization predicate, ordered lifecycle events, and claim-scoped outcomes.
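As a rough illustration only, the four parts of that contract could be carried by a single record. The sketch below is a minimal Python example under stated assumptions, not the paper's definition: the names `Claim`, `Event`, `record`, and `resolve`, the specific event names, and the outcome strings "reused" and "missed" are all hypothetical and chosen just to make the shape concrete.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Event(Enum):
    # Hypothetical lifecycle event names; the paper may define different ones.
    ACCEPTED = "accepted"
    MATERIALIZED = "materialized"
    EVICTED = "evicted"
    RELEASED = "released"


@dataclass
class Claim:
    claim_id: str                                      # claim identity
    is_materialized: Callable[[], bool]                # materialization predicate
    events: List[Event] = field(default_factory=list)  # ordered lifecycle events
    outcome: Optional[str] = None                      # claim-scoped outcome

    def record(self, event: Event) -> None:
        """Append an event, preserving the order in which it was observed."""
        self.events.append(event)

    def resolve(self) -> str:
        """Set this claim's outcome from its own predicate (assumed labels)."""
        self.outcome = "reused" if self.is_materialized() else "missed"
        return self.outcome


# Usage: a toy "resident set" stands in for real runtime KV state.
resident = set()
claim = Claim("c1", lambda: "c1" in resident)
claim.record(Event.ACCEPTED)
first = claim.resolve()       # nothing is resident yet
resident.add("c1")
claim.record(Event.MATERIALIZED)
second = claim.resolve()      # the predicate now holds
```

The point of the sketch is only that the outcome is attached to the claim itself and derived from its own predicate, rather than inferred from aggregate cache statistics.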

Follow-on work is still in calibration. The next public bridge is a narrow vLLM observability note about prefix-cache lookup debugging. The lowering work asks how existing runtime primitives can satisfy those obligations without turning useful substrates into false conformance claims.

Not claimed: production speedup, native vLLM support, native TensorRT-LLM/SGLang/Dynamo conformance, or a verdict that existing runtimes lack useful KV machinery.

arXiv preprint

Resident KV Claims

A semantic contract for how inference runtimes should behave when resident KV blocks and active KV pressure compete for the same memory pool.

manuscript, inference, runtime, KV cache

arXiv preprint

Fail-Closed Lowering Of Resident KV Claims

A Resident KV paper on when runtime primitives, adapters, or patches can be treated as evidence that accepted future-KV obligations are met.

manuscript, inference, runtime, KV cache

Open questions

The thread keeps moving.

Which future-KV obligations are too strong, too weak, or already covered by runtime paths I missed?
What can local constraints reveal about useful model systems?
How should inference systems handle memory pressure and long context?
How can agent workflows become more reliable without becoming opaque?