Issue #6 · April 28, 2026

Your vLLM KV cache eviction is over-pruning the layers that matter most

Issue #6 of sumocat -- sharp insights from this week's AI research for builders.

TL;DR

Five papers dropped on arXiv this week that will be AlphaSignal headlines in 3-6 months -- but the implementation window is now, before every team running LangChain agents or vLLM long-context inference copies the same playbook. Here's what to ship before that happens.

  • Switch vLLM's KV cache eviction from uniform to per-layer budget allocation on Llama 3 70B -- free quality recovery at the same memory cost
  • Add a Pinecone/pgvector retrieval layer in front of your OpenAI Agents SDK or LangChain tool registry before you cross 30 tools
  • Run a spaCy NER dedup pass in your LlamaIndex ingestion pipeline before upserting to Pinecone -- cut index size 25-36% this sprint
  • Add a population-diversity eval to your Braintrust suite using OpenAI text-embedding-3-large pairwise distances -- catch persona collapse your per-agent evals are hiding

This Week in AI Systems

1. Your vLLM KV cache eviction is over-pruning the layers that matter most

Every team running H2O or SnapKV-style uniform eviction on Llama 3 70B or Mistral 7B in vLLM is applying the same pruning ratio across every layer (32 in Mistral 7B, 80 in Llama 3 70B). Early layers are far more sensitive to token loss than late layers, so uniform budgets over-prune where it hurts and under-prune where it doesn't -- the paper quantifies 10-20% retrieval quality loss at equivalent memory budgets on long-doc QA tasks. The fix is a non-uniform per-layer budget, and you can prototype it in vLLM's custom eviction hooks in a sprint.

What to ship:

  • Profile per-layer attention entropy using vLLM's attention backend hooks or a one-off HuggingFace Transformers forward pass on your real workload (sketched after this list) -- log which layers show high variance; those are your sensitive layers
  • Set a 2x KV token budget for layers 0-8, 1x for middle layers, 0.5x for final layers in your vLLM custom eviction config on Llama 3 70B -- then benchmark against your existing LangSmith or Braintrust summarization evals before/after
  • If you're on HuggingFace TGI, monkey-patch KV eviction with a precomputed per-layer retention ratio from a one-time offline sensitivity sweep on your corpus and track ROUGE deltas in Phoenix
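
Here's a minimal sketch of that entropy-profiling step, assuming a HuggingFace checkpoint you can load locally (Mistral 7B as a stand-in for your production model) and a representative long prompt at a placeholder path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # stand-in; use your production model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # flash/SDPA backends don't return attention maps
)

# A representative long-doc prompt from your real workload (placeholder path)
prompt = open("sample_workload.txt").read()
inputs = tok(prompt, return_tensors="pt", truncation=True).to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is one [batch, heads, q_len, k_len] tensor per layer
for i, attn in enumerate(out.attentions):
    probs = attn.float().clamp_min(1e-9)
    entropy = -(probs * probs.log()).sum(dim=-1)  # entropy over key positions
    print(f"layer {i:2d}  mean entropy {entropy.mean():.3f}  std {entropy.std():.3f}")

# Layers with high entropy variance are the token-loss-sensitive ones:
# give them the larger KV budgets in your non-uniform eviction config.
```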

Ship window: next sprint


2. Your LangChain or OpenAI Agents SDK tool list is silently degrading past 30 tools

Flat tool registration in LangChain agent prompts or OpenAI Agents SDK function calling contexts doesn't scale -- the paper quantifies accuracy degradation past ~50 skills and shows it's architectural, not a prompt fix. GPT-4o mis-selects tools at elevated rates even when the right skill is retrieved and present in context, meaning retrieval alone isn't enough: you need to measure the incorporation gap separately. You probably don't have that metric in your LangSmith dashboards right now.

What to ship:

  • Replace flat tool lists with a two-stage pipeline (sketched after this list): embed all skills via OpenAI text-embedding-3-small, store in Pinecone or pgvector, and retrieve the top 5-10 at query time before injecting into the LangChain or OpenAI Agents SDK context
  • Add a three-part skill eval to Braintrust or LangSmith that tracks (1) retrieval hit rate, (2) whether the agent actually invokes the retrieved skill, and (3) end-task success -- the paper shows these three metrics diverge badly and you're probably only measuring the third
  • Gate skill injection with an OpenAI structured outputs schema that forces the agent to output an explicit yes/no decision on whether an external skill is needed before loading any tool -- this directly patches the incorporation gap
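
A minimal sketch of the two-stage retrieval, using an in-memory cosine index in place of Pinecone or pgvector to keep it self-contained -- the tool registry here is hypothetical, so swap in your real tool descriptions:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

TOOLS = {  # hypothetical registry: name -> the description the embedding matches on
    "search_orders": "Look up a customer's order history by email or order ID.",
    "refund_order": "Issue a full or partial refund for a given order.",
    "escalate_ticket": "Escalate a support ticket to a human agent.",
    # ...the rest of your 30+ tools
}

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

names = list(TOOLS)
tool_vecs = embed([TOOLS[n] for n in names])  # build the index once, offline

def retrieve_tools(query: str, k: int = 5) -> list[str]:
    qv = embed([query])[0]
    top = np.argsort(tool_vecs @ qv)[::-1][:k]
    return [names[i] for i in top]  # inject only these into the agent context

print(retrieve_tools("customer wants money back for order 4821"))
```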

Ship window: next sprint


3. Your LlamaIndex ingestion pipeline is pumping redundant chunks into Pinecone and you're paying for it

Fixed-size and recursive text splitters in LangChain and LlamaIndex generate near-duplicate chunks that land in Pinecone or pgvector with no dedup -- the paper shows NER-based filtering cuts index size 25-36% with no meaningful retrieval quality loss. That index bloat means slower ANN query latency, higher Pinecone pod costs, and top-k results diluted with semantically identical content. Two spaCy calls in your ingestion script fix this.

What to ship:

  • Add a spaCy en_core_web_sm NER pass post-chunking in your LlamaIndex ingestion pipeline -- drop any chunk where named entity overlap with an already-indexed chunk exceeds ~0.5 IoU, implemented as a custom NodePostprocessor (sketched after this list)
  • Wire a sentence-transformers cosine similarity dedup step (threshold ~0.92) into your LangChain TextSplitter -> embedding -> Pinecone upsert pipeline to catch semantic duplicates the NER pass misses -- cuts upsert volume immediately
  • Instrument your RAG eval in Braintrust or LangSmith with chunk-level token overlap IoU as a precision metric so you can A/B test filtered vs unfiltered indexes against retrieval quality before shipping
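
A minimal sketch of that NER-overlap gate, assuming spaCy's en_core_web_sm model is installed (python -m spacy download en_core_web_sm); in production, wrap this logic in a LlamaIndex NodePostprocessor:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
IOU_THRESHOLD = 0.5  # the ~0.5 entity-overlap cutoff named above; tune on your corpus

def entity_set(text: str) -> set[str]:
    return {ent.text.lower() for ent in nlp(text).ents}

def iou(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup_chunks(chunks: list[str]) -> list[str]:
    kept: list[str] = []
    kept_ents: list[set[str]] = []
    for chunk in chunks:
        ents = entity_set(chunk)
        if any(iou(ents, prev) > IOU_THRESHOLD for prev in kept_ents):
            continue  # near-duplicate by named-entity overlap; skip the upsert
        kept.append(chunk)
        kept_ents.append(ents)
    return kept
```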

Ship window: this week


4. Your multi-agent persona system in GPT-4o or Claude 3.5 Sonnet is producing a behavioral monoculture

Any production system using LangChain or OpenAI Agents SDK to generate diverse synthetic user populations -- UX research bots, red-team swarms, debate simulators -- is producing near-identical behavioral outputs even when agents carry distinct demographic profiles. The paper shows models with the highest per-persona benchmark scores are the worst offenders for population-level homogenization. Your LangSmith or Braintrust evals are almost certainly measuring per-agent fidelity only, so the collapse is invisible in your existing dashboards.

What to ship:

  • Add a population-diversity eval to your Braintrust suite: embed all agent outputs with OpenAI text-embedding-3-large, compute pairwise cosine distances across the full agent population, and alert when mean intra-population similarity exceeds 0.85 (sketched after this list)
  • Replace flat demographic descriptors in your LangChain or OpenAI Agents SDK persona templates with explicit BFI-44 behavioral axis overrides (e.g., "Agreeableness 18/40: interrupt, push back, reject consensus") -- force the model off its stereotype attractor instead of relying on demographic labels
  • Run the paper's open-source eval toolkit against your vLLM or OpenAI batch inference pipeline on Coverage, Uniformity, and Complexity axes before your next synthetic data generation run, and gate data use on a minimum Uniformity threshold
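
A minimal sketch of the diversity check -- the transcripts below are placeholders for one real output per persona, and the 0.85 cutoff is the heuristic from the first bullet:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Placeholder transcripts; use one real output per persona in your population
agent_outputs = [
    "I think we should go with the plan as proposed.",
    "Agreed, the proposed plan looks good to me.",
    "Yes, let's proceed with the plan as written.",
]

resp = client.embeddings.create(model="text-embedding-3-large", input=agent_outputs)
vecs = np.array([d.embedding for d in resp.data])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

sim = vecs @ vecs.T
n = len(agent_outputs)
mean_sim = (sim.sum() - n) / (n * (n - 1))  # mean pairwise cosine, diagonal excluded

print(f"mean intra-population similarity: {mean_sim:.3f}")
if mean_sim > 0.85:
    print("ALERT: persona collapse -- the population is a behavioral monoculture")
```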

Ship window: next sprint


5. Your Braintrust or LangSmith eval for your code review bot is measuring sprint pressure, not comment quality

If you're training or evaluating an LLM code review bot using GitHub PR comment resolve/dismiss signals as labels, the paper shows you're hitting an accuracy ceiling of ~62% because engineers dismiss valid comments due to deadlines, not quality. LLM-as-a-Judge pipelines using GPT-4.1-mini or Gemini 2.5 Pro against those labels inherit the same ceiling -- 44-62% agreement, barely above random on contested cases. Any clean accuracy numbers on your Phoenix or Braintrust dashboard built on this signal are misleading.

What to ship:

  • Audit your Braintrust or LangSmith eval datasets and filter any wontFix or dismissed labels collected during sprint-end or release freeze windows using PR merge timestamp metadata from the GitHub API -- mark these as noisy, not ground truth
  • Add a rubric-based LLM-as-a-Judge layer in Phoenix using GPT-4.1 (not mini) via OpenAI structured outputs, scoring comment quality on a 0-4 Likert scale independent of developer action (sketched after this list) -- track divergence between judge score and developer label as a separate signal
  • Build a contested comment queue: when your ACR bot comment gets dismissed, fire a Slack or Linear task asking the PR author one forced-choice question on technical validity and feed those responses into Braintrust as a separate gold-label stream
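
A minimal sketch of the rubric judge, using the OpenAI SDK's structured-output parse helper with a Pydantic schema -- the rubric wording here is illustrative, not the paper's:

```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ReviewJudgment(BaseModel):
    quality: int  # 0-4 Likert: 0 = incorrect/noise, 4 = a real must-fix defect
    rationale: str

RUBRIC = (
    "Score the code review comment on a 0-4 Likert scale for technical quality "
    "only: 4 = identifies a real defect, 0 = incorrect or noise. Ignore whether "
    "the developer resolved or dismissed it."
)

def judge(diff: str, comment: str) -> ReviewJudgment:
    resp = client.beta.chat.completions.parse(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Diff:\n{diff}\n\nComment:\n{comment}"},
        ],
        response_format=ReviewJudgment,
    )
    return resp.choices[0].message.parsed

# Log judge(...).quality next to the developer's resolve/dismiss label and
# track divergence between the two as its own eval signal.
```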

Ship window: this sprint


What's Breaking

  • vLLM uniform KV eviction on Llama 3 70B. Every team running 128k context windows with H2O-style eviction is uniformly degrading early attention layers and losing retrieval quality they'd keep for free with per-layer budgets.
  • LangChain and OpenAI Agents SDK flat tool lists past 30 entries. Tool-selection accuracy is silently degrading, and your LangSmith evals probably only track end-task success, hiding the retrieval and incorporation gaps.
  • Pinecone and pgvector indexes built from LlamaIndex or LangChain recursive splitters. Near-duplicate chunk pollution is costing you real money in pod sizing and slowing ANN queries on every request.
  • Braintrust and LangSmith evals using GitHub PR resolve/dismiss as ground truth for code review bots. The label source is poisoned by org workflow noise and your accuracy ceiling is 62% -- you won't detect regressions with this signal alone.
  • OpenAI Agents SDK and LangChain multi-agent persona systems. GPT-4o and Claude 3.5 Sonnet are collapsing distinct persona profiles into homogeneous behavioral blobs and your per-agent fidelity evals are structurally blind to it.

Build Ideas

Idea: A LlamaIndex ingestion middleware layer -- a pip-installable NodePostprocessor chain -- that runs spaCy NER dedup, sentence-transformers semantic similarity filtering, and auto-reports index compression ratio to a Braintrust eval dataset on every ingestion run.

Why now: Every team building RAG on Pinecone or pgvector is over-indexing redundant chunks today and has no instrumentation to know how bad it is -- this is a pain point with a two-sprint fix and no good existing tooling.

Start with:

  • Build the LlamaIndex NodePostprocessor using spaCy en_core_web_sm for NER overlap detection with configurable IoU threshold
  • Add sentence-transformers all-MiniLM-L6-v2 cosine similarity as the second dedup gate before the Pinecone upsert call (sketched after this list)
  • Log chunk retention rate, dedup ratio, and retrieval quality delta (MRR on a held-out eval set) as a Braintrust eval dataset so teams can tune thresholds without guessing
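
A minimal sketch of that second gate, assuming the all-MiniLM-L6-v2 checkpoint and the ~0.92 threshold named above; run it after the NER pass and before the upsert:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
SIM_THRESHOLD = 0.92  # the ~0.92 cutoff from the bullet above; tune per corpus

def semantic_dedup(chunks: list[str]) -> list[str]:
    vecs = model.encode(chunks, normalize_embeddings=True)  # unit vectors
    kept_idx: list[int] = []
    for i, v in enumerate(vecs):
        if kept_idx and float(np.max(vecs[kept_idx] @ v)) > SIM_THRESHOLD:
            continue  # semantic near-duplicate of an already-kept chunk
        kept_idx.append(i)
    if chunks:
        print(f"dedup ratio: {1 - len(kept_idx) / len(chunks):.1%} of chunks dropped")
    return [chunks[i] for i in kept_idx]
```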
