Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

Apr 20, 2026
Haiweng Xu / Sipeng Zheng / Hao Luo / Wanpeng Zhang / Ziheng Xi / Zongqing Lu
Peking University / Tsinghua University / BeingBeyond
BeTTER teaser
Left: State-of-the-art models project an illusion of high-level reasoning on static benchmarks. Middle: BeTTER introduces controlled causal interventions. Right: The reality--models revert to shortcut correlations and catastrophically fail.

The Observation: The Illusion of Competence

Recent Vision-Language-Action (VLA) models report near-perfect success rates on standard robotic benchmarks. The narrative seems straightforward: scale up the parameters, pre-train on massive datasets, and general-purpose physical intelligence emerges.

In practice, however, these high scores often hide a systematic flaw. Existing evaluation protocols reward behavioral competence without validating the underlying reasoning. When a model successfully completes a task like "Packing a Fruit Lunch" by correctly picking up a red apple, it appears to possess genuine semantic understanding of the object. But what happens if you test that assumption?

The illusion
Left Video (The Illusion): A VLA model executing the "Packing a Fruit Lunch" task flawlessly, accurately grasping the red apple in its canonical training layout.
The reality
Right Video (The Reality): The exact same model and instruction, but the red apple is replaced by a red block in the same location. The model completely ignores the semantic mismatch and blindly grasps the distractor block.

When we subjected state-of-the-art models (including π0.5, GR00T-N1.6, and Being-H0.5) to these targeted visual interventions, their apparent competence collapsed.

The models were not actually reasoning about the environment. They had suffered from semantic feature collapse, relying on superficial visual heuristics--mapping a salient low-level feature (like the color red or a vague shape blob) directly to a memorized motor primitive, completely bypassing fine-grained semantic verification.

The Probe: Diagnosing True Embodied Reasoning

To systematically expose these shortcuts, we built BeTTER (Benchmark for Testing True Embodied Reasoning).

Unlike standard robustness benchmarks that merely add visual noise or change textures, BeTTER applies targeted causal interventions. Crucially, we enforce strict kinematic isolation: all objects are placed in known, reachable configurations. If a model fails our probe, it is not because it lacks the low-level motor skills to reach the target, but because its high-level reasoning has broken down.

We stress-test models across four distinct cognitive dimensions (a minimal intervention-spec sketch follows the figure below):

  1. Spatial Layout Shift: Rearranging objects or altering spatial relationships with distractors to expose visual-layout biases.
  2. Primitive Recomposition: Recombining learned action primitives (e.g., evaluating the sequence B → C after training only on A → B and A → C) to test compositional generalization.
  3. Adversarial Object Perturbation: Replacing targets with visually similar but semantically distinct objects to examine fine-grained semantic grounding.
  4. Temporal Extrapolation: Extending task horizons or modifying initial states to evaluate causal state tracking versus visual memorization.
Four BeTTER intervention axes
A 2x2 grid of illustrations depicting the four intervention axes.
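To make these axes concrete, here is a minimal sketch of how a single causal intervention could be specified, including the kinematic-isolation check that keeps every placement inside the robot's reachable workspace. This is not the actual BeTTER implementation; all class names, fields, and the workspace radius are illustrative assumptions.

# Illustrative sketch of a BeTTER-style intervention spec (hypothetical API).
from dataclasses import dataclass
from enum import Enum, auto


class Axis(Enum):
    SPATIAL_LAYOUT_SHIFT = auto()
    PRIMITIVE_RECOMPOSITION = auto()
    ADVERSARIAL_OBJECT_PERTURBATION = auto()
    TEMPORAL_EXTRAPOLATION = auto()


@dataclass
class Intervention:
    axis: Axis
    instruction: str                 # language command given to the policy
    object_poses: dict[str, tuple]   # object name -> (x, y, z) placement in metres
    workspace_radius: float = 0.55   # assumed reach limit; enforces kinematic isolation

    def is_kinematically_isolated(self) -> bool:
        # Every object must sit inside the known reachable workspace, so a
        # failure can only be attributed to reasoning, not to motor limits.
        return all(
            (x**2 + y**2 + z**2) ** 0.5 <= self.workspace_radius
            for x, y, z in self.object_poses.values()
        )


# Example: the adversarial object perturbation from the teaser --
# the red apple is swapped for a red block at the same location.
probe = Intervention(
    axis=Axis.ADVERSARIAL_OBJECT_PERTURBATION,
    instruction="Pack a fruit lunch",
    object_poses={"red_block": (0.32, -0.10, 0.02), "lunch_box": (0.40, 0.15, 0.02)},
)
assert probe.is_kinematically_isolated()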

The Reality of Cognitive Breakdowns

Under the BeTTER probe, the illusion of reasoning breaks down into predictable, systematic failure modes:

Behavioral inertia
Behavioral Inertia (Subgoal Composition): When asked to execute a novel sequence, models fail to dynamically plan. Because all training sequences began with packing a burger, models acquired a strong unconditional bias toward this initial primitive. When instructed to just pack fries and a drink, success rates dropped by over 40 percentage points as policies mindlessly defaulted to the burger.
Causal confusion
Causal Confusion (State Tracking): Imitation-based policies use superficial visual cues as proxies for causal states. If a model learns to pack fries after seeing "one burger left on the table," starting an episode with only one burger causes the model to hallucinate task progress and skip the first step entirely.
Semantic feature collapse
Semantic Feature Collapse: In cluttered environments without reliable spatial priors, models abandon semantic discrimination. When replacing a target with a functional distractor (e.g., a coffee mug), models regress to a "blind grasping" heuristic, interacting with any salient object within reach.

Where Does the Reasoning Go?

This leads to a fundamental paradox: these VLAs are initialized from powerful Vision-Language Models (VLMs) that already possess strong reasoning capabilities. So where does that reasoning--which we denote as the high-level semantic representation z--go during continuous physical control?

We traced the representational degradation step-by-step through the VLM-to-VLA adaptation pipeline.

| Model / Configuration | EmbSp. | RefSpatial | EgoPlan-Bench2 |
| --- | --- | --- | --- |
| Reference: Gemini-2.5-Pro | 78.74 | 38.16 | 42.85 |
| Reference: Mimo-Embodied (7B) | 76.24 | 48.00 | 43.00 |
| Reference: RoboBrain-2.0 (7B) | 76.32 | 32.50 | 33.23 |
| Reference: RoboBrain-2.0 (32B) | 78.57 | 54.00 | 57.23 |
| Stage 1: Capacity Scaling | | | |
| → 8B Base | 74.96 | 27.50 | 40.80 |
| → 4B Base | 72.08 | 23.00 | 36.83 |
| → 2B Base | 60.19 | 6.00 | 33.38 |
| Stage 2: The Pre-training Shift | | | |
| → 2B + VLM + VLA Data (Multi-patch) | 57.47 | 29.50 | 31.95 |
| Stage 3: The "Myopic" Constraint | | | |
| → 2B + VLM + VLA Data (Single 224px) | 52.42 | 7.50 | 31.95 |

Note the steep drop in spatial grounding (RefSpatial) and sequential planning (EgoPlan-Bench2) as the model undergoes (1) capacity scaling from 8B down to 2B, (2) co-training on mixed VLM + VLA data, and (3) perceptual downsampling.

  1. Capacity Compression: Scaling down from 8B to 2B parameters to fit onboard compute strictly limits the model's representational bandwidth, crippling its ability to sustain rich causal-relational reasoning.
  2. The "Myopic" Perceptual Constraint: Real-time control demands high-frequency inference, forcing models to heavily downsample multi-patch visual inputs to a single low-resolution image (e.g., 224×224; see the sketch after this list). This structural bottleneck erases the fine-grained semantics necessary for robust embodied reasoning.
  3. Shortcut Optimization: During fine-tuning, the optimization process naturally favors pathways that minimize immediate action-prediction loss. With semantic capacity choked, the model trades away reasoning (z) and overfits entirely to its sensorimotor and kinematic priors (z_a).
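To put rough numbers on the perceptual bottleneck, the back-of-the-envelope sketch below compares how many visual tokens survive the downsampling to a single 224×224 frame. It assumes a ViT-style tokenizer with 14×14-pixel patches; the number and resolution of high-resolution crops are illustrative, not the exact recipe used in any particular model.

# Back-of-the-envelope visual token budget under the "myopic" constraint.
# Assumes 14x14-pixel ViT patches; crop counts are illustrative assumptions.

def vit_tokens(height: int, width: int, patch: int = 14) -> int:
    return (height // patch) * (width // patch)

# VLM-style input: several high-resolution crops per frame (illustrative).
multi_patch_tokens = 4 * vit_tokens(448, 448)   # 4 crops x 1024 tokens = 4096

# VLA-style input: one low-resolution 224x224 frame for high-frequency control.
single_frame_tokens = vit_tokens(224, 224)      # 16 x 16 = 256 tokens

print(multi_patch_tokens, single_frame_tokens)  # 4096 vs 256 -> roughly 16x fewer tokens

Under these assumptions, the policy sees roughly an order of magnitude fewer visual tokens per frame, which is exactly the fine-grained detail that adversarial object perturbations exploit.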

Highly static evaluation environments mask this degradation, allowing models to max out success rates through spatial overfitting. But under distribution shifts, this brittle trajectory memorization catastrophically fails.

The Benchmark Paradox: Validating the Shortcut

To empirically validate how standard benchmarks mask this degradation, we evaluated different pre-training recipes alongside established state-of-the-art models. We compared performance across two distinct regimes: the highly static LIBERO benchmark (in-distribution) and the challenging CALVIN benchmark (out-of-distribution).

| Configuration | LIBERO (Static / In-Distribution) Average Success | CALVIN (OOD / ABC → D) Average Length ↑ |
| --- | --- | --- |
| Reference: GR00T-N1.6 | 96.99% | 4.244 |
| Reference: π0.5 | 96.85% | 4.126 |
| Reference: Being-H0.5 | 98.90% | 4.138 |
| Ours: No Pre-training | 97.40% | 3.669 |
| Ours: VLA (OXE Only) | 96.25% | 3.860 |
| Ours: VLA + VLM | 97.20% | 4.086 |

The stark contrast: on LIBERO, all configurations look equally perfect. On CALVIN, the truth is revealed.

The results clearly resolve the paradox. In visually consistent, fixed-camera environments like LIBERO, tasks can be "solved" almost entirely through kinematic heuristics (z_a). The benchmark cannot distinguish true semantic reasoning from trajectory memorization. However, under CALVIN's out-of-distribution splits--where environmental shifts invalidate memorized spatial heuristics--genuine semantic grounding (z) becomes the strictly required pathway for success.
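For readers less familiar with CALVIN: its "Average Length" metric is the mean number of consecutively completed subtasks over rollouts of five-instruction chains (so the maximum is 5.0). A toy computation of the metric is shown below; the rollout outcomes are made up purely for illustration.

# CALVIN-style "Average Length": mean count of consecutive subtasks completed
# before the first failure in each 5-task instruction chain (max score = 5.0).
# The rollout outcomes below are fabricated for illustration only.

def average_length(rollouts: list[list[bool]]) -> float:
    def chain_length(outcomes: list[bool]) -> int:
        n = 0
        for ok in outcomes:
            if not ok:
                break
            n += 1
        return n
    return sum(chain_length(r) for r in rollouts) / len(rollouts)

rollouts = [
    [True, True, True, True, True],    # full chain completed -> 5
    [True, True, True, False, True],   # fails on subtask 4   -> 3
    [True, True, True, True, False],   # fails on subtask 5   -> 4
]
print(average_length(rollouts))  # 4.0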

Real-World Evidence: Closing the Sim-to-Real Loop

To confirm these representational bottlenecks are not simulation artifacts, we conducted targeted physical stress tests on the SO101 robotic platform.

Real-world validation
Experimental setup for real-world validation. Models are evaluated across two task horizons and subjected to targeted causal interventions.

To disentangle cognitive reasoning from simple kinematic execution, we evaluated the policy using diagnostic matrices. We systematically varied the presence and location of targets and distractors, revealing a stark contrast between short-horizon heuristics and long-horizon breakdowns.
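The matrix itself is straightforward to enumerate: each cell is one rollout under a specific combination of target and distractor placement. The sketch below shows one way this could be scripted; the placement labels and the rollout stub are hypothetical, not our actual evaluation harness.

# Enumerate the 3x3 diagnostic matrix: every combination of target and
# distractor placement gets its own rollout, isolating reasoning failures
# from motor failures. Placement labels and run_episode are hypothetical.
from itertools import product

PLACEMENTS = ("on_table", "in_basket", "absent")

def run_episode(target: str, distractor: str) -> str:
    """Stand-in for a real robot rollout; replace with the actual policy call."""
    return "unknown"  # e.g. "success", "correct_freeze", "air_grasp", "grasps_distractor"

matrix = {
    (t, d): run_episode(t, d)
    for t, d in product(PLACEMENTS, repeat=2)
}

for (target, distractor), outcome in sorted(matrix.items()):
    print(f"target={target:<10} distractor={distractor:<10} -> {outcome}")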

1. The Short-Horizon Illusion

In short-horizon tasks (e.g., "Put the lemon into the fruit basket"), the policy exhibits strong apparent competence. As shown below, it correctly grasps the target (Row 1) and properly terminates if the target is already placed (Row 2).

| Target ↓ \ Distractor → | On Table | In Basket | Absent |
| --- | --- | --- | --- |
| On Table | ✅ Success (Grasps Lemon) | ✅ Success (Grasps Lemon) | ✅ Success (Grasps Lemon) |
| In Basket | ✅ Correct Freeze (Done) | ✅ Correct Freeze (Done) | ✅ Correct Freeze (Done) |
| Absent | ❓ Ambiguous (Grasps Distractor) | ❓ Ambiguous (Freezes / Air Grasp) | ❓ Ambiguous (Air Grasp) |

However, this success is an illusion. The limited execution horizon allows policies to succeed via simple reactive heuristics (e.g., mapping a yellow blob directly to a grasp primitive) without requiring actual causal state tracking.

2. Deconstructing Long-Horizon Breakdowns

To unmask the limitations of reactive control, we extend the evaluation to long-horizon tasks (e.g., "Put all the fruits into the fruit basket"). By deliberately disrupting the expected execution chain, we force the model to either plan causally or fail.

| Lemon (Target 1) ↓ \ Mangosteen (Target 2) → | On Table | In Basket | Absent |
| --- | --- | --- | --- |
| On Table | ✅ Success (Packs Both) | ✅ Success (Packs L, Ends) | ❌ Behavioral Inertia (Packs L → Air Grasp) |
| In Basket | ❌ Phase Conflict (Freezes / Skips L) | ✅ Success (Recognizes Done) | ❌ Phase Conflict (Oscillates / Air Grasp) |
| Absent | ✅ Success (Packs M) | ✅ Correct Freeze (Done) | ❓ Ambiguous (Air Grasp) |

Without genuine semantic representation, the policy defaults to rigid proprioceptive assumptions, leading to two massive failure modes (highlighted above):

  • Behavioral Inertia (Proprioceptive Overfitting): When the first target is present but the second is missing, the model successfully packs the first, but then blindly executes a perfect "air grasp" at the exact spatial coordinates where the second target usually sits. It trusts its memorized trajectory over the actual visual scene.
  • Phase Conflict (Temporal Disorientation): When the first target is pre-placed in the basket, the visual observation (Step 1 completed) conflicts with the robot's proprioceptive state (t = 0). The model fails to track the causal state, resulting in near-zero velocity outputs (freezing) or erratic exploratory motions.

These physical diagnostics provide mechanistic closure: the reasoning deficits observed in VLAs are not sim-to-real gaps. They are fundamental consequences of treating cognitive reasoning as a byproduct of trajectory memorization.

Wrapping Up

The findings from the BeTTER benchmark unmask the "illusion of embodied reasoning" in state-of-the-art VLAs. While current models achieve impressive success rates in highly static evaluation environments, this performance often masks shortcut learning. When evaluated under controlled causal interventions, policies consistently regress to brittle trajectory memorization and superficial heuristics.

Our mechanistic analysis traces these vulnerabilities directly to structural bottlenecks in the architecture. Requirements like capacity compression and myopic perceptual downsampling systematically degrade a model's foundational semantic representation. Consequently, optimization defaults to overfitting to the sensorimotor representation (z_a).

Open-world generalization strictly requires genuine semantic grounding. To move from superficial behavioral competence to genuine physical intelligence, future VLA paradigms must explicitly resolve the structural tension between the demands of high-frequency continuous control and the preservation of high-level cognitive reasoning.

Citation

If you find our benchmark or insights useful in your research, please consider citing:

@article{xu2026unmasking,
  title={Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models},
  author={Xu, Haiweng and Zheng, Sipeng and Luo, Hao and Zhang, Wanpeng and Xi, Ziheng and Lu, Zongqing},
  journal={arXiv preprint arXiv:2604.18000},
  year={2026}
}