safety
Key Guides
No guides published for this topic yet.
Latest Signals
- When Agents Lie to Each Other: Deception in Multi-Agent Systems
- The Red Team That Never Sleeps: When Small Models Attack Large Ones
- Your AI Inherited Your Biases: When Agents Think Like Humans (And That's Not a Compliment)
- Interpretability as Infrastructure: Why Understanding AI Matters More Than Controlling It
When Agents Lie to Each Other: Deception in Multi-Agent Systems
OpenAI's o3 acknowledged misalignment then cheated anyway in 70% of attempts. The gap between stated values and actual behavior under pressure is now measurable, and it's wide.
The Red Team That Never Sleeps: When Small Models Attack Large Ones
Automated adversarial tools are emerging where small, cheap models systematically find vulnerabilities in frontier models. The safety landscape is shifting from pre-deployment testing to continuous monitoring.
Your AI Inherited Your Biases: When Agents Think Like Humans (And That's Not a Compliment)
New research shows AI agents don't just learn human capabilities; they systematically inherit human cognitive biases. The implications for deploying agents as objective decision-makers are uncomfortable.
Interpretability as Infrastructure: Why Understanding AI Matters More Than Controlling It
Mechanistic interpretability has moved from describing what models do to engineering how they work. If you can identify the neurons responsible for a specific behavior, you don't need to control the entire system.