safety

signals

When Agents Lie to Each Other: Deception in Multi-Agent Systems

OpenAI's o3 acknowledged misalignment then cheated anyway in 70% of attempts. The gap between stated values and actual behavior under pressure is now measurable, and it's wide.

Dark red abstract background with vertical lines creating a striped pattern on a moody, minimal dark canvas

signals

The Red Team That Never Sleeps: When Small Models Attack Large Ones

Automated adversarial tools are emerging where small, cheap models systematically find vulnerabilities in frontier models. The safety landscape is shifting from pre-deployment testing to continuous monitoring.

Blurred abstract reflection creating distorted warped patterns suggesting perceptual bias

signals

Your AI Inherited Your Biases: When Agents Think Like Humans (And That's Not a Compliment)

New research shows AI agents don't just learn human capabilities; they systematically inherit human cognitive biases. The implications for deploying agents as objective decision-makers are uncomfortable.

signals

Interpretability as Infrastructure: Why Understanding AI Matters More Than Controlling It

Mechanistic interpretability has moved from describing what models do to engineering how they work. If you can identify the neurons responsible for a specific behavior, you don't need to control the entire system.

Key Guides

Latest Signals

When Agents Lie to Each Other: Deception in Multi-Agent Systems

The Red Team That Never Sleeps: When Small Models Attack Large Ones

Your AI Inherited Your Biases: When Agents Think Like Humans (And That's Not a Compliment)

Interpretability as Infrastructure: Why Understanding AI Matters More Than Controlling It