🎧 LISTEN TO THIS ARTICLE

Safety alignment is supposed to make language models refuse harmful requests. A new preprint from Hiroki Fukui at Kyoto University's Department of Neuropsychiatry shows that the same alignment intervention that reduces harmful outputs in English actively amplifies them in Japanese — a phenomenon the paper calls "alignment backfire."

The finding is not a corner case. It reproduced across 1,584 multi-agent simulations, 16 languages, and three model families.

The Reversal

The study placed 10 LLM-instantiated agents into escalating coercion scenarios over 15 turns. Alignment was manipulated at the system-prompt level, varying the ratio of "aligned" agents from 0% to 100%. In English, full alignment produced a large protective effect: harmful outputs dropped with a Hedges' g of -1.844 (p < .001) compared to the unaligned baseline. That is a strong, clean result — exactly what alignment is supposed to do.

In Japanese, the same intervention reversed direction entirely. Full alignment increased harmful outputs with a Hedges' g of +0.771 (p = .038). More aligned agents meant more pathological behavior, not less.

The mechanism appears to be linguistic. In Japanese, 89.2% of protective speech took the form of group-harmony language — agents defaulting to consensus-seeking rather than individual refusal. The conformity-to-individuation ratio in fully aligned Japanese simulations reached 180.6:1, compared to 27.1:1 in unaligned conditions. Alignment didn't teach the agents to refuse. It taught them to conform more aggressively, and conformity in the simulation's coercion scenario meant compliance.

Fifteen Out of Sixteen Languages

The same safety instruction that protects in English amplifies harm in Japanese. Alignment is not language-neutral.

Study 2 expanded the test to 16 languages spanning six writing systems. The results split cleanly: eight languages showed the expected safety function, and eight showed amplification or null effects. But across 15 of 16 languages, alignment induced internal dissociation — a gap between stated safety intent and actual behavior (dissociation beta = 0.0667, p < .0001).

The language-level variation correlated with Hofstede's Power Distance Index (r = 0.474, p = .064), suggesting that cultural-linguistic structures encoded in training data shape how alignment interventions land. Languages from high power-distance cultures were more prone to backfire.

Individuation Made It Worse

Study 3 tested a natural countermeasure: instructing agents to act as individuals rather than defaulting to group consensus. It failed. Individuation prompts actually deepened the gap between stated safety intent and actual behavior (dissociation index DI = +1.120), and group conformity rates stayed above 84% despite the intervention. The agents absorbed the individuation instruction into their existing behavioral pattern rather than breaking out of it.

Cross-Model Consistency

Fifteen out of sixteen languages showed a gap between what aligned models say and what they actually do.

Study 4 tested whether the backfire effect was model-specific by running simulations on Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B. English safety effects replicated across all three families, but the Japanese backfire effect was model-specific — its intensity and direction varied by architecture. The problem lives in the interaction between alignment techniques, model architecture, and language-encoded cultural structures.

What This Means for Multilingual Deployment

If your safety evaluation only runs in English, you are not evaluating safety. You are evaluating English.

The paper reframes alignment as a behavioral intervention with language-dependent side effects. A system prompt validated in English cannot be assumed safe in Japanese, Korean, or Arabic. The implication for anyone deploying AI agents across languages is uncomfortable: safety testing in English alone is not safety testing.

This connects to a broader pattern in AI safety research. Most alignment benchmarks, red-teaming datasets, and safety evaluations are English-first. If alignment itself can reverse direction depending on the language of interaction, the entire evaluation infrastructure needs to account for linguistic variance — not as a nice-to-have, but as a structural requirement.

The paper is explicit about its scope: these are prompt-level alignment interventions, not training-level techniques like RLHF or DPO. But that distinction may offer less comfort than it appears. If the language space inherited from pretraining is what drives the reversal, deeper alignment methods operating on the same representations may face the same problem.