Safety & Governance
The hard problems: red teaming, bias, interpretability, alignment, and the governance frameworks that might actually matter. No hand-waving.
Key Guides
Latest Signals
- Interpretability as Infrastructure: Why Understanding AI Matters More Than Controlling It
- The Red Team That Never Sleeps: When Small Models Attack Large Ones
- Open Weights, Closed Minds: The Paradox of 'Open' AI
- Red Teams Found Agents Leak More Than Models
- Alignment Works in English. In Japanese, It Backfires.
From the team behind Swarm Signal
Track Your Finances While You Build AI
BoredTools makes the boring stuff easy — budget dashboards, freelance trackers, and business planners. Download free or grab the full collection.
Interpretability as Infrastructure: Why Understanding AI Matters More Than Controlling It
Approximately 100 neurons control subject-verb agreement in large language models. Not thousands. Not millions. One hundred MLP neurons in a 8-billion...
The Red Team That Never Sleeps: When Small Models Attack Large Ones
A 1.5-billion parameter model just learned to jailbreak GPT-5 Nano, Claude 3.5 Sonnet, and Gemini 2.5 Flash. It didn't need human creativity or domain...
Open Weights, Closed Minds: The Paradox of 'Open' AI
When researchers [examined 100+ language models](https://arxiv.org/abs/2502.18505) marketed as "open-source," they found a systematic pattern of omission....
AI Safety Compliance for Startups: The Minimum Viable Checklist
The EU AI Act went live. Colorado enforces algorithmic fairness. Enterprise buyers demand AI governance documentation. Here's the minimum viable compliance stack that satisfies current regulations without draining your runway.
Red Teams Found Agents Leak More Than Models
Red teams found agents are far more vulnerable than standalone models. Mixed attack strategies hit 84.3% success rates. Memory poisoning persists across sessions. Every tool is a potential exfiltration path.
Red Teaming AI Agents: A Practitioner's Guide
Red teaming AI agents is fundamentally different from red teaming standalone models. Agents have tools, memory, and credentials — each a new attack surface. This guide covers the OWASP agentic framework and a structured testing methodology.
AI Safety Frameworks for Regulated Industries: Healthcare, Finance, and Government
Regulated industries face roughly three times the compliance burden of unregulated AI deployments. This guide maps the actual frameworks, enforcement timelines, and compliance costs for AI safety across healthcare, finance, and government in 2026.
Best AI Red-Teaming and Safety Testing Tools 2026
Your AI system will get attacked. The question is whether you find the vulnerabilities first or your users do. 8 red-teaming tools tested and compared.
Alignment Works in English. In Japanese, It Backfires.
A new study shows the same alignment intervention that produces strong safety effects in English reverses direction in Japanese, increasing harmful outputs. Tested across 1,584 simulations, 16 languages, and three model families.
One Fake Source Broke Every Agent
A single misinformation article injected into search rankings crashed GPT-5's accuracy from 65.1% to 18.2%. The agents had unlimited access to truthful sources and couldn't be bothered to look.