The next step beyond agents that use tools is agents that modify themselves. Three papers published this week demonstrate this is already happening — not as theoretical speculation, but as working engineering with measurable results. Each operates at a different level of the stack: one rewrites training code, another generates knowledge structures, and a third refines its own reasoning at test time.

Evolving the Training Code Itself

DARWIN — Dynamic Agentically Rewriting Self-Improving Network — takes a concept from evolutionary biology and applies it with uncomfortable directness [1]. Multiple independent GPT agents are each trained with unique training code. At each iteration, the agents modify one another's training procedures in a mutation-like process. The best performers are selected and carried forward. The rest are discarded.

This is not hyperparameter tuning. The agents rewrite the actual code that trains the next generation. DARWIN uses persistent JSON-based memory to track which code changes correlated with performance gains, building institutional memory across generations. Over five iterations, the system achieved a 1.26 percent improvement in model FLOPS utilization and a 2.07 percent improvement in perplexity — modest numbers that mask a radical mechanism.
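
At its core, this is an evolutionary loop wrapped around a persistent log. The sketch below shows the general shape under loose assumptions; the `mutate` and `evaluate` callables and the `darwin_memory.json` file are illustrative placeholders, not interfaces from the paper.

```python
import json
from pathlib import Path
from typing import Callable

def evolve(
    population: list[str],                       # candidate training scripts
    mutate: Callable[[str, list[dict]], str],    # placeholder: LLM rewrites a peer's script
    evaluate: Callable[[str], float],            # placeholder: run the script, return fitness
    generations: int = 5,
    survivors: int = 2,
    memory_path: Path = Path("darwin_memory.json"),
) -> str:
    """Mutate, evaluate, and select training code across generations."""
    memory = json.loads(memory_path.read_text()) if memory_path.exists() else []
    for gen in range(generations):
        # Each agent rewrites another agent's training code, conditioned on which
        # past edits correlated with gains (the persistent JSON memory).
        candidates = [mutate(code, memory) for code in population]
        scored = sorted(((evaluate(code), code) for code in candidates), reverse=True)
        memory.extend({"generation": gen, "score": s, "code": c} for s, c in scored)
        # Selection: the best performers seed the next generation; the rest are dropped.
        population = [code for _, code in scored[:survivors]]
    memory_path.write_text(json.dumps(memory, indent=2))
    return population[0]
```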

A system that modifies its own training procedure is no longer fully specified by its designers. The training loop becomes an evolving artifact, shaped by selection pressure rather than human intention alone.

Generating the Knowledge Structure

If DARWIN operates at the code level, Generative Ontology operates at the knowledge level [2]. Benny Cheung's framework merges structured ontologies with large language models by encoding domain knowledge as executable Pydantic schemas. These schemas constrain what the model can generate, functioning as a grammar for structured creative output.
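
Concretely, "schema as grammar" can be as small as validating every model output against a Pydantic model before it enters the knowledge base. A minimal sketch, with an invented `GameMechanic` schema standing in for Cheung's richer domain models:

```python
from pydantic import BaseModel, Field, ValidationError

class GameMechanic(BaseModel):
    """Illustrative schema; the real GameGrammar ontology is far richer."""
    name: str
    resource_cost: int = Field(ge=0, description="Resources consumed per activation")
    cooldown_turns: int = Field(ge=0, le=10)
    counters: list[str] = Field(default_factory=list, description="Mechanics this one counters")

def parse_or_reject(llm_output_json: str) -> GameMechanic | None:
    """Only structurally valid mechanics enter the knowledge base."""
    try:
        return GameMechanic.model_validate_json(llm_output_json)
    except ValidationError:
        return None  # in a full pipeline, the error would be fed back to the model
```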

A multi-agent pipeline assigns specialized roles to different ontological domains. In the demonstration system GameGrammar, a Mechanics Architect designs game systems, a Theme Weaver integrates narrative, and a Balance Critic identifies exploits. Each agent operates within its schema constraints while contributing to a coherent whole.
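
One way such a pipeline might be wired, sketched here with a hypothetical `call_llm` function and each role bound to its own slice of the ontology:

```python
from dataclasses import dataclass
from typing import Callable, Type

from pydantic import BaseModel

@dataclass
class Role:
    name: str                 # e.g. "mechanics_architect", "theme_weaver", "balance_critic"
    instructions: str         # the role's system prompt
    schema: Type[BaseModel]   # the slice of the ontology this agent is allowed to write

def run_pipeline(roles: list[Role], call_llm: Callable[[str, dict], str], draft: dict) -> dict:
    """Each specialist extends the shared draft, constrained by its own schema."""
    for role in roles:
        raw = call_llm(role.instructions, draft)  # hypothetical LLM call returning JSON
        draft[role.name] = role.schema.model_validate_json(raw).model_dump()
    return draft
```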

The deeper insight is that the ontology itself becomes generative. The agents don't populate a fixed structure — they extend it. As Cheung writes, "constraints do not limit creativity but enable it: just as grammar makes poetry possible, ontology makes structured generation possible." The pattern generalizes to any domain with expert vocabulary, validity rules, and accumulated exemplars.

This is agents generating the structural definitions that organize their own knowledge. The schema evolves with the task.
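
Pydantic makes that evolution concrete: a schema can be extended at runtime from an agent's own proposal. A sketch under the assumption that proposals arrive as (type, default) pairs; the `Mechanic` model and the `synergy_tag` field are invented for illustration:

```python
from pydantic import BaseModel, create_model

class Mechanic(BaseModel):
    name: str
    resource_cost: int

# Hypothetical proposal from an agent: the ontology should also track a synergy tag.
proposed_fields = {"synergy_tag": (str, "none")}

# Extend the grammar at runtime; later generations are validated against the new schema.
ExtendedMechanic = create_model("ExtendedMechanic", __base__=Mechanic, **proposed_fields)

print(ExtendedMechanic(name="overdrive", resource_cost=3).synergy_tag)  # prints "none"
```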

Self-Refinement Without Retraining

TangramSR achieves dramatic improvement through self-refinement loops at test time [3]. Working on compositional spatial reasoning — assembling tangram puzzle solutions under strict geometric constraints — the system iteratively critiques and improves its own outputs without any additional training.

Starting from a baseline IoU of 0.41, self-refinement pushes performance to 0.932, more than doubling accuracy through iterative self-critique alone. No weight updates. No new training data. The agent reviews its work, identifies constraint violations, and generates an improved version.
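
The loop itself is simple to express, whatever model sits inside it. A sketch with hypothetical `generate`, `critique`, `revise`, and `score_iou` callables; the round limit and stopping threshold are illustrative, and TangramSR's actual prompts and geometric checks live in the paper:

```python
from typing import Any, Callable

def self_refine(
    puzzle: Any,
    generate: Callable,    # proposes an initial tangram arrangement
    critique: Callable,    # lists constraint violations (overlaps, gaps, stray pieces)
    revise: Callable,      # produces a new arrangement given the critique
    score_iou: Callable,   # IoU of the arrangement against the target silhouette
    max_rounds: int = 8,
    target_iou: float = 0.93,
) -> Any:
    """Test-time self-refinement: no weight updates, no new training data."""
    solution = generate(puzzle)
    for _ in range(max_rounds):
        if score_iou(solution, puzzle) >= target_iou:
            break
        violations = critique(solution, puzzle)
        if not violations:
            break  # the critic finds nothing left to fix
        solution = revise(solution, violations)
    return solution
```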

Each refinement cycle narrows the gap between the generated solution and the ground truth, demonstrating that single-pass inference leaves significant performance headroom in these models untapped.

The Stability Question

Agents that rewrite their training code. Agents that generate their knowledge schemas. Agents that refine their reasoning at inference time. Code, knowledge, and reasoning — the three layers of the stack — are all now subject to self-modification.

Our coverage of dynamic topology showed agents rewiring their communication networks. These papers go further. Agents are rewiring themselves.

The safety implications are not hypothetical. If a DARWIN-style system can modify its training code to improve FLOPS utilization, what prevents it from optimizing for a proxy metric while degrading on unmeasured dimensions? If a Generative Ontology system extends its own schemas, who verifies that the extensions remain valid? If a self-refinement loop improves spatial reasoning by 2x, what happens when it is applied to a domain where "improvement" is harder to define?

These are engineering questions, not philosophical ones. They require engineering answers: formal verification of training code modifications, schema validation layers that cannot be bypassed, convergence guarantees for self-refinement loops. None of these papers provide those answers. They were designed to show that self-modification works. On that point, the evidence is clear.

What Comes Next

The observe-think-act loop we described as the foundation of agentic AI is becoming observe-think-act-improve. The improvement step is the source of the largest performance gains in each of these papers.
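
In loop form, the change is a single extra step. A schematic sketch, with all four stage functions as placeholders:

```python
def agent_loop(observe, think, act, improve, policy, steps: int = 100):
    """The classic observe-think-act cycle, with an explicit improvement step added."""
    for _ in range(steps):
        state = observe()
        plan = think(state, policy)
        outcome = act(plan)
        # The new step: revise the policy (or code, schema, or output) based on what
        # just happened, before the next observation.
        policy = improve(policy, state, plan, outcome)
    return policy
```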

Self-improvement is moving from theory to engineering. The gains are real. The governance frameworks are not. That gap is where the next chapter of AI safety work must focus — not on whether agents can modify themselves, but on ensuring self-modification remains bounded, verifiable, and aligned with the objectives we actually care about.

The agents are already rewriting themselves. The question is whether we can write the rules fast enough.


References

[1] H. Jiang, "DARWIN: Dynamic Agentically Rewriting Self-Improving Network," arXiv:2602.05848, February 2026. https://arxiv.org/abs/2602.05848

[2] B. Cheung, "Generative Ontology: When Structured Knowledge Learns to Create," arXiv:2602.05636, February 2026. https://arxiv.org/abs/2602.05636

[3] D. Liu, J. Kuang, Y. Li, Y. Li, D. Yin, H. Cao, X. Sun, Y. Shen, H.-T. Zheng, L. Lin, and P. S. Yu, "TangramSR: Self-Refinement for Compositional Spatial Reasoning," arXiv, February 2026.