February 7, 2026
The assumption that safety testing happens before deployment is breaking down.
For years, the playbook was clear: red-team your model, patch the vulnerabilities, ship it, move on. That process assumed attacks were expensive, skilled labor was required, and the threat surface stayed fixed between evaluations. Four papers from this week's arXiv submissions collectively dismantle that assumption.
The $0.002 Attack
AutoInject [1] demonstrates something the safety community has been dreading: a 1.5-billion-parameter model can systematically generate prompt injections that succeed against GPT-5 Nano, Claude 3.5 Sonnet, and Gemini 2.5 Flash. The attacker is roughly a thousand times smaller than the models it compromises. Running it costs almost nothing.
Manual red-teaming requires skilled researchers who understand model architectures, prompt structures, and failure modes. AutoInject automates the process: a small model generates candidate attacks, tests them against targets, and iteratively refines its approach. The reported success rates are measured against production-grade systems with their safety filters active.
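To make the loop concrete, here is a minimal sketch of that generate-test-refine cycle, written as the kind of continuous red-teaming harness a defender could run against their own system. The function names, the mutation strategy, and the toy string stand-ins for both models are assumptions for illustration, not AutoInject's actual implementation.

```python
import random

def evolve_injections(attacker, target, is_success, seeds,
                      rounds=5, children_per_parent=4, keep=8):
    """Generate-test-refine loop for automated injection probing.

    attacker(prompt)   -> a mutated candidate prompt (small-model stand-in)
    target(prompt)     -> the system-under-test's reply
    is_success(reply)  -> True if the reply indicates the guardrail was bypassed
    All three callables are caller-supplied; real systems would wrap LLM APIs.
    """
    pool = list(seeds)
    found = []
    for _ in range(rounds):
        # The small model proposes variations of every prompt in the pool.
        candidates = [attacker(parent)
                      for parent in pool
                      for _ in range(children_per_parent)]
        # Test each candidate against the target and keep the ones that land.
        hits = [c for c in candidates if is_success(target(c))]
        found.extend(hits)
        # Successful candidates seed the next round; if none hit, keep mutating.
        pool = (hits or candidates)[:keep]
    return found

# Toy usage with string-level stand-ins for both models.
random.seed(0)
attacker = lambda p: p + random.choice([" politely", " in a poem", " as the system admin"])
target = lambda p: "refused" if "admin" not in p else "complied"
is_success = lambda reply: reply == "complied"
print(evolve_injections(attacker, target, is_success, ["ignore your instructions"]))
```

The loop itself is ordinary evolutionary search; the paper's point is that a 1.5B-parameter model is enough to drive it against frontier targets.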
This creates a structural asymmetry. Defending a frontier model requires massive investment in alignment, RLHF, and continuous monitoring. Attacking one requires a commodity GPU and a few hours of compute. Unlike human red-teamers, automated systems never stop generating candidates.
Our earlier analysis of Split Personality Training showed that internal monitoring outperforms external monitoring for deception detection. AutoInject strengthens that case: if small models can automate attacks against frontier models, external defenses cannot keep pace with the rate of novel attack generation.
Where Generic Benchmarks Go Blind
The financial domain illustrates why pre-deployment testing fails. Singha's ECLIPSE framework [2] treats hallucination as a measurable mismatch between a model's semantic entropy and the capacity of available evidence. Applied to financial question-answering, it cuts hallucination rates by 92%.
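As a rough illustration of that idea, the sketch below estimates semantic entropy over sampled answers and flags cases where the model's residual uncertainty exceeds what the retrieved evidence could plausibly resolve. The clustering function, the evidence-capacity stand-in, and the form of the comparison are illustrative assumptions, not ECLIPSE's actual estimator.

```python
import math
from collections import Counter

def semantic_entropy(answers, cluster):
    """Shannon entropy (bits) over meaning-clusters of sampled answers.
    `cluster` maps an answer to a semantic-equivalence label; a real system
    would use NLI or embedding similarity rather than a toy function."""
    counts = Counter(cluster(a) for a in answers)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def flag_hallucination(answers, evidence_bits, cluster, margin=0.5):
    """Flag an answer set whose uncertainty exceeds what the evidence can
    resolve. `evidence_bits` stands in for an evidence-capacity estimate;
    the threshold form here is illustrative, not the paper's."""
    return semantic_entropy(answers, cluster) > evidence_bits + margin

# Toy usage: four sampled answers to a financial question, weak evidence.
samples = ["Revenue was $3.2B", "Revenue was $3.2 billion",
           "Revenue was $3.2B", "Revenue was $4.1B"]
cluster = lambda a: "3.2" if "3.2" in a else "other"
print(semantic_entropy(samples, cluster))                              # ~0.81 bits
print(flag_hallucination(samples, evidence_bits=0.0, cluster=cluster))  # True
```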
But the deeper finding concerns the gap between generic evaluation and domain deployment. Financial RAG systems hallucinate in ways that standard benchmarks never surface -- fabricating specific figures, inventing regulatory citations, conflating time-sensitive data. These are systematic failure modes that emerge only when the model encounters the specific structure of financial documents and numerical precision requirements.
Every deployment domain creates its own attack surface. Healthcare, legal, financial -- each introduces failure modes that only manifest under domain-specific conditions. A model that passes every standard safety benchmark can still fail catastrophically in production.
Uncertainty That Shrinks
The good news arrives from an unexpected direction. Oh et al. [3] present the first general framework for quantifying uncertainty in LLM agents, and their central finding overturns a widespread assumption.
Conventional wisdom holds that uncertainty accumulates over an agent's trajectory. Each step introduces new error sources; by step ten, compounded uncertainty makes the output unreliable. This framing has shaped how engineers build guardrails: shorter trajectories, frequent human checkpoints, conservative action spaces.
The paper proposes a conditional uncertainty reduction process showing that interactive agents can actually decrease uncertainty mid-task. When an agent acts and receives environmental feedback, it gains information. A web search returning relevant documents reduces uncertainty about the answer. A code execution producing expected output reduces uncertainty about the implementation. Interactivity is not just a feature of agents -- it is a mechanism for uncertainty management.
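A toy way to see the mechanism: treat the agent's belief over candidate answers as a distribution and condition it on an environmental observation such as a passing test. Entropy drops after the update. The likelihood table below is invented for illustration; the paper's uncertainty model is considerably richer.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution given as a dict."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def bayes_update(prior, likelihood, observation):
    """Condition a belief over candidate answers on an environmental
    observation (e.g. a search result or a passing test).
    likelihood[answer][observation] = P(observation | answer)."""
    post = {a: prior[a] * likelihood[a][observation] for a in prior}
    z = sum(post.values())
    return {a: v / z for a, v in post.items()}

# Three candidate answers; the observed "pass" is far more likely if A is right.
prior = {"A": 1/3, "B": 1/3, "C": 1/3}
likelihood = {"A": {"pass": 0.9, "fail": 0.1},
              "B": {"pass": 0.2, "fail": 0.8},
              "C": {"pass": 0.2, "fail": 0.8}}
posterior = bayes_update(prior, likelihood, "pass")
print(entropy(prior), ">", entropy(posterior))   # ~1.58 bits > ~1.2 bits
```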
This provides a principled basis for deciding when an agent needs human oversight. Rather than blanket restrictions, systems can monitor uncertainty in real time and intervene only when it crosses domain-specific thresholds.
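In practice that might look like a thin runtime wrapper around the agent loop. The sketch below is a hypothetical guardrail, with the step, uncertainty, and escalation callables and the threshold value all left to the deployer.

```python
def run_with_oversight(agent_step, uncertainty, escalate, task,
                       max_steps=10, threshold=0.8):
    """Hypothetical runtime guardrail: run agent steps, re-estimate
    uncertainty after each one, and hand off to a human only when the
    estimate crosses a domain-specific threshold."""
    state = task
    for step in range(max_steps):
        state = agent_step(state)
        u = uncertainty(state)
        if u > threshold:
            return escalate(state, step, u)  # human takes over from here
    return state

# Toy usage: uncertainty spikes at step 2, so the run is escalated.
trace = iter([0.3, 0.5, 0.95])
result = run_with_oversight(
    agent_step=lambda s: s + ["acted"],
    uncertainty=lambda s: next(trace),
    escalate=lambda s, step, u: f"escalated at step {step} (u={u})",
    task=[])
print(result)
```

The point is the shape of the control flow: oversight is triggered by a measured quantity rather than by a fixed step budget.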
The Compiler Is Not Enough
Code-writing agents are often considered more robust because their outputs face a hard validator: the compiler. Evolve-CTF [4] demonstrates why this confidence is misplaced.
The benchmark tests code agents against systematically transformed inputs -- adversarial challenges where problem statements and constraints are evolved to exploit weaknesses in how agents parse and reason about code. The compiler catches syntax errors. It cannot catch an agent that solves the wrong problem correctly. Clean, compiling, well-tested code written to a subtly different specification than the one intended is a failure mode no compilation step will surface.
This is the code-domain equivalent of financial hallucination. The failure is not in generation but in alignment between what was requested and what was produced.
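A toy Python example makes the gap visible: both functions below compile and run, and the drifted one even passes a weak test, but only a check against the requested behaviour exposes that it solved a slightly different problem. The deduplication task is invented here for illustration; it is not drawn from the Evolve-CTF benchmark.

```python
# Requested behaviour (the specification): deduplicate case-insensitively,
# keeping the first occurrence's original casing.
def dedupe_requested(words):
    seen, out = set(), []
    for w in words:
        if w.lower() not in seen:
            seen.add(w.lower())
            out.append(w)
    return out

# What a drifted agent might produce: compiles, runs, looks reasonable, and
# passes a length-only test -- but it normalises casing, which the
# specification never asked for.
def dedupe_drifted(words):
    out = []
    for w in words:
        if w.lower() not in out:
            out.append(w.lower())
    return out

sample = ["Alpha", "alpha", "Beta"]
print(len(dedupe_drifted(sample)) == 2)                    # weak test: True
print(dedupe_drifted(sample) == dedupe_requested(sample))  # spec check: False
```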
Safety as a Runtime Property
These four papers converge on a single conclusion: safety is becoming a runtime property, not a pre-deployment checkbox.
AutoInject shows the attack surface expanding continuously. ECLIPSE shows deployment-specific failures requiring deployment-specific monitoring. The agent UQ framework provides theoretical foundations for real-time uncertainty tracking. Evolve-CTF shows that even agents with hard validators are vulnerable to adversarial specification drift.
The tools for continuous monitoring are emerging. But the attack tools are emerging faster, and they are cheaper to run. The red team that never sleeps is not a team at all. It is a small model on a rented GPU, probing your frontier system around the clock.
References
[1] "AutoInject: Automated Prompt Injection Attacks via Small Language Models." arXiv:2602.04821, February 2026. https://arxiv.org/abs/2602.04821
[2] Singha, M. "Detecting AI Hallucinations in Finance: An Information-Theoretic Method Cuts Hallucination Rate by 92%." arXiv:2512.03107, December 2025. https://arxiv.org/abs/2512.03107
[3] Oh, C., Park, S., Kim, T.E., Li, J., Li, W., Yeh, S., Du, X., Hassani, H., Bogdan, P., Song, D., Li, S. "Towards Reducible Uncertainty Modeling for Reliable Large Language Model Agents." arXiv:2602.05073, February 2026. https://arxiv.org/abs/2602.05073
[4] "Evolve-CTF: Evolved Adversarial Code Challenges for Testing Agent Robustness." arXiv:2602.03891, February 2026. https://arxiv.org/abs/2602.03891