
LLMs Can't Find What's Already In Their Heads

Knowledge graphs have a well-documented lookup problem. When you ask an LLM to traverse a KG and reason over multi-hop paths, it doesn't search the graph so much as it pattern-matches against whatever training data happens to rhyme with the query. The Explore-on-Graph paper highlights the gap: standard methods constrain LLM reasoning within fixed demonstrations or rule-based generation, limiting their ability to explore diverse reasoning paths on the graph. Models default to shallow retrieval that stops at the first plausible-looking node rather than committing to multi-hop exploration. That's not exploration. That's educated guessing.

The Explore-on-Graph (EoG) paper from Yan, Chen, Zhou et al. targets this directly, and what it proposes is simpler and more interesting than the usual "add more supervised data" fix. The core idea: use path-refined reward modeling to give the RL training signal something to chew on beyond binary correct/incorrect outcomes. Instead of rewarding only terminal correctness, EoG rewards intermediate path quality, penalizing dead-end retreats and incentivizing the model to commit to multi-hop chains that are structurally coherent, even before it knows whether they'll pan out.

The Reward Shape Is the Architecture

Here's the analogy that stuck with me: training a KG reasoning agent on terminal rewards alone is like teaching someone to find their way across a city by only telling them when they've arrived at the wrong destination. No partial credit for taking the right highway. No signal that they were two turns away from success before looping back to the hotel. The model never learns to value the path itself.

EoG's path-refined reward model scores intermediate reasoning steps by evaluating whether a traversal sequence maintains semantic coherence with the query intent. It does this by learning a discriminative signal over partial paths, not just endpoint matches. The reward shaping pushes the model toward committed exploration: longer chains that don't backtrack just because an intermediate node looks ambiguous. On KGQA benchmarks, the results are striking. On WebQSP, EoG achieves 92.8% Hit@1 compared to 86.3% for the best RL baseline. On CWQ, EoG hits 86.6% versus 70.5%. Across five KGQA datasets, EoG consistently outperforms prior methods by wide margins, even exceeding closed-source models like Gemini 2.5 Pro and GPT-5. That gap is the reward function doing real work.
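To make the reward structure concrete, here is a minimal sketch of what a path-refined reward might look like. The function names and weights are my own illustration, not EoG's actual formulation: I assume a learned scorer `coherence(query, partial_path)` returning a value in [0, 1], credit every prefix of the traversal, and penalize revisiting a node (the "dead-end retreat" the paper penalizes).

```python
def path_refined_reward(query, path, answer_correct, coherence,
                        step_bonus=0.1, backtrack_penalty=0.2):
    """Combine terminal correctness with per-step path quality.

    coherence(query, partial_path) -> float in [0, 1] is a hypothetical
    learned scorer over partial traversals, standing in for EoG's
    discriminative signal over partial paths.
    """
    terminal = 1.0 if answer_correct else 0.0
    shaped = 0.0
    visited = set()
    for i in range(1, len(path) + 1):
        node = path[i - 1]
        # Credit each prefix of the traversal, not just the endpoint.
        shaped += step_bonus * coherence(query, path[:i])
        # Penalize retreating to an already-visited node (a dead-end loop).
        if node in visited:
            shaped -= backtrack_penalty
        visited.add(node)
    return terminal + shaped


# Even with a flat coherence scorer, a committed three-hop chain
# outscores a chain that loops back on itself.
flat = lambda q, p: 0.5
committed = path_refined_reward("q", ["a", "b", "c"], True, flat)
looping = path_refined_reward("q", ["a", "b", "a"], True, flat)
```

Under terminal-only rewards, both trajectories above would score identically; the shaped version separates them, which is the whole point.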

What makes this non-obvious is that the fix isn't in the model architecture or the KG representation. It's entirely in how the training objective is structured. The underlying LLM is unchanged. The graph is unchanged. The only thing EoG modifies is what the model gets credit for during RL fine-tuning.

The Parametric Knowledge Problem Compounds This

EoG operates on external knowledge graphs, but a closely related paper from Ma and Hewitt at Columbia University surfaces a parallel failure mode that makes the whole picture worse. Their finding: reasoning language models, the ones trained via RL to produce extended thinking traces on math and coding tasks, don't automatically apply that same reasoning to access their own parametric knowledge. When asked factual recall questions where deliberate thinking would help (the paper uses the example of inferring Canberra is Australia's capital by reasoning through purpose-built capitals and political history), RL-trained models produce their best answer less often than if they'd been prompted to reason explicitly first.

The failure mode isn't that the model lacks the knowledge. It's that the RL training on task-type X doesn't generalize the reasoning behavior to knowledge-access type Y. The model has the answer stored. It just doesn't think to think before retrieving it. That's a jarring finding if you assumed reasoning training was building some general "deliberate cognition" capability. It isn't. It's building task-specific deliberation.

Ma and Hewitt's proposed fix, EoG-style incentivization applied to internal knowledge retrieval, directly echoes the same logic. Reward the model for generating reasoning traces before producing knowledge claims, not just for getting the fact right at the end. They train models via RL on world-knowledge QA (TriviaQA) and show that performance transfers: +9.9% on TriviaQA, +4.2% on Natural Questions, +2.1% on HotpotQA, +3.0% on StrategyQA, and +0.6% on SimpleQA. The reward shaping insight generalizes beyond graph traversal.
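A sketch of what rewarding deliberation-before-answer could look like in a reward function. The `<think>` tag convention and the 10-word threshold are my assumptions for illustration, not the paper's exact format; the structural point is that correctness still dominates, with a bonus only when a non-trivial trace precedes the answer.

```python
import re

def knowledge_access_reward(completion, gold_answer, trace_bonus=0.2):
    """Reward factual correctness, plus a bonus for deliberating first.

    Assumes completions wrap the trace in <think>...</think> and put the
    final answer after it -- a common convention, assumed here.
    """
    m = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if m:
        trace, answer = m.group(1).strip(), m.group(2).strip()
    else:
        trace, answer = "", completion.strip()
    correct = 1.0 if answer.lower() == gold_answer.lower() else 0.0
    # Only credit a non-trivial trace; correctness dominates so the
    # model cannot farm the bonus with empty thinking.
    deliberated = trace_bonus if len(trace.split()) >= 10 else 0.0
    return correct + deliberated


# The Canberra example from the paper, rendered in this format:
c = ("<think>Sydney and Melbourne both wanted the capital, so a "
     "purpose-built city was planned between them</think> Canberra")
```

A bare correct answer still earns the terminal reward; the trace bonus tilts the policy toward reasoning before retrieving.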

What the Headlines Miss

Both papers will get cited in the "LLMs can now reason over KGs" wave of coverage. That framing is too generous. The honest read is narrower: we've found a training signal trick that stops models from giving up on multi-hop paths too early, and a related trick that stops RL-trained models from bypassing deliberation on factual queries. Neither paper claims the underlying reasoning is genuine. Neither one should.

The RADAR paper from Xue et al. runs parallel to EoG but takes a discriminative rather than generative angle on KGR. Where EoG trains the model to commit to exploration on KGQA tasks, RADAR recasts knowledge graph reasoning as discriminative entity selection, training the model to distinguish valid reasoning chains from superficially plausible ones using aligned representations. On FB15k-237 link prediction, RADAR achieves 0.377 MRR and 27.3% Hits@1, outperforming the previous best (COSIGN) by 2.4% MRR. On triple classification, it reaches 81.6% on FB15k-237N and 95.3% on WN18RR. The two approaches aren't directly comparable because they target different tasks (KGQA vs. link prediction/triple classification), and RADAR relies on stratified negative samples during training, constructed at three difficulty tiers, which adds labeling cost that EoG avoids.
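The stratified-negatives idea can be sketched as a ranking loss with tier-specific margins. This is my illustration of the general mechanism, not RADAR's actual objective: I assume three tiers from hardest to easiest negatives, and assign larger required margins to easier negatives (hard negatives get a smaller, achievable margin), a common design choice in discriminative training.

```python
def stratified_margin_loss(pos_score, neg_scores_by_tier,
                           margins=(0.5, 1.0, 2.0)):
    """Ranking loss over stratified negatives at three difficulty tiers.

    neg_scores_by_tier: list of lists, ordered hardest to easiest.
    margins: required score gap per tier (illustrative values).
    """
    total = 0.0
    for margin, negs in zip(margins, neg_scores_by_tier):
        # Hinge per negative: the valid chain's score must beat each
        # negative by at least the tier's margin.
        tier = sum(max(0.0, margin - (pos_score - n)) for n in negs)
        total += tier / len(negs)
    return total / len(margins)


# A hard negative scoring close to the positive still incurs loss;
# easy negatives far below it incur none.
loss = stratified_margin_loss(3.0, [[2.8], [1.5], [0.0]])
```

This is also where the labeling cost enters: the tiers have to come from somewhere, which is exactly the clean supervision that production KGs rarely have.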

I've watched the KGR benchmark numbers climb steadily this year, and each time the winning method relies on an increasingly specific training recipe that works on these benchmarks but has questionable transfer to real enterprise knowledge graphs where the schema is messier, the entities are noisier, and nobody has clean stratified negative samples ready to go. The benchmark performance is real. The operational generalizability is not established.

Where ExpLang Fits

The ExpLang paper from Gao et al. adds a dimension that the EoG discussion tends to ignore: what language the model reasons in. Their finding, that on-policy thinking language selection across 12 languages improves reasoning performance on math benchmarks, matters here because KG traversal in multilingual graphs is a real deployment constraint that the EoG paper doesn't address at all. If your knowledge graph has entity labels in Mandarin, Korean, or Arabic, and your reward model was trained on English-centric path representations, the incentivization signal degrades.

ExpLang consistently outperforms English-only RL training given the same compute budget, with particularly strong gains in exploration diversity (Pass@k improvements of up to 10 points on AIME 2025) and near-perfect thinking-language compliance even for unseen languages. The implication for EoG is clear: the reward shaping work needs a language-aware component if it's going to hold up in production. This is the part that actually worries me about the EoG framing: the benchmark is clean, monolingual, and well-structured. Real knowledge graphs are none of those things. The reward model's ability to score intermediate paths as semantically coherent depends heavily on embedding quality, and embedding quality still falls off for lower-resource entity types and languages.

The Deeper Pattern

All three approaches (EoG's path-refined rewards, Ma and Hewitt's parametric reasoning incentives, RADAR's discriminative alignment) are really solving the same underlying problem from different angles. RL training with sparse terminal rewards doesn't teach models to value the process of reasoning. It teaches them to correlate surface patterns with correct endpoints. When the task requires genuine multi-step deliberation (traversing a graph, recalling a fact that requires inference, checking a reasoning chain for logical coherence), models trained only on terminal signals cut corners wherever they can get away with it.

The solution in each paper is some version of dense intermediate rewards. Credit the journey, not just the destination. This is not a new idea in RL; it's basically curriculum-adjacent reward shaping that practitioners have used in robotics and game-playing for years. What's notable is that it needed to be rediscovered for LLM reasoning, and that it works this well on tasks where we previously assumed the model's pattern-matching was good enough.
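The classic form of this idea is potential-based reward shaping (Ng, Harada, and Russell, 1999), which densifies a sparse signal without changing the optimal policy. A minimal sketch, where the potential function phi could be, for instance, a learned estimate of partial-path quality:

```python
def shaped_reward(env_reward, phi_s, phi_s_next, gamma=0.99):
    """Potential-based reward shaping: add gamma * phi(s') - phi(s).

    This densifies a sparse terminal signal while provably preserving
    the optimal policy. phi is any state-potential estimate; here it
    would stand in for a score of how promising a partial path looks.
    """
    return env_reward + gamma * phi_s_next - phi_s


# A step that moves toward a higher-potential state earns positive
# shaped reward even though the environment reward is still zero.
step = shaped_reward(0.0, phi_s=0.2, phi_s_next=0.5)
```

Summed over a trajectory, the shaping terms telescope, which is why the end-of-episode incentives stay intact; the per-step terms are pure guidance.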

It wasn't good enough. The gaps across KGQA and knowledge-access benchmarks make that clear.

This also connects to a broader problem we've covered before: reasoning models have a memory and retrieval architecture that treats stored knowledge like a static lookup when it's better understood as something that needs active reconstruction. The Ma and Hewitt result fits that framing precisely: the knowledge is there, but the access pathway is broken without deliberation. As we noted in Multi-Agent Reasoning's Memory Problem, the retrieval failure often isn't absence of knowledge; it's absence of the right trigger to surface it. The same dynamic shows up in agentic RAG pipelines, where the retrieval step itself needs reasoning support to pull the right context.

What This Actually Changes

EoG and the parametric reasoning work from Ma and Hewitt represent a genuine shift in how we should think about training objectives for knowledge-intensive tasks. The insight that intermediate path quality needs its own reward signal is both obvious in hindsight and underutilized in practice. If you're building RAG pipelines or KG-integrated agents today, the implication is concrete: fine-tuning on terminal correctness alone is leaving performance on the table, and the gap is measurable.

What this doesn't change is the benchmark-to-production translation problem. WebQSP, CWQ, and FB15k-237 are clean test environments. Enterprise KGs aren't. The reward models in both EoG and RADAR were trained on structured, well-labeled data that most production deployments don't have. That's not a reason to dismiss the papers; the core incentivization mechanism is sound. It's a reason to be careful about assuming that benchmark-level improvements transfer without additional work to adapt the reward model to your schema.

The broader takeaway: stop treating reasoning as a capability that either exists or doesn't in a given model. It exists conditionally, in specific task domains, for specific training distributions. Reward shaping can extend that domain. But it has to be designed into the training objective deliberately. It doesn't emerge from scale alone.

Sources

Research Papers:

Benchmarks & Reference:

Related Swarm Signal Coverage: