Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Xiangtong Yao*, Hongkuan Zhou*, Oier Mees*, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll · 2024 (accepted to IJRR) · arXiv:2312.10807

One-liner. A 70-page survey that re-taxonomizes language-conditioned manipulation not by algorithm or by robotic module, but by the functional role language plays in the control loop — state evaluation, policy condition, cognitive planning/reasoning, and unified VLAs — then cross-cuts the whole field along five engineering axes (action granularity, data/supervision, cost/latency, environments, task specification).

Problem & motivation

Language-conditioned manipulation has exploded across RL, IL, neuro-symbolic planning, and foundation-model approaches, and prior surveys (Tellex et al. 2020; Hu et al. 2023; Xiao et al. 2024; Firoozi et al. 2025) organized the field either by technical underpinning ("lexically grounded" vs. "learning methods") or by the robotic module language enhances (perception / decision-making / control). The authors argue these axes obscure the deeper question: the central design choice is how language enters the control loop. Two methods can both "use an LLM" yet assign language fundamentally distinct roles: one uses it to shape a reward, another to emit actions directly. They propose a new, orthogonal taxonomy around language's functional role, then layer a cross-cutting comparative analysis to expose practical trade-offs (latency, data cost, granularity) that the role taxonomy alone does not surface. The framing is "Language – Perception – Control" (Fig 2), but the survey's claim is that the interesting variation lives in how language bridges those three modules.

Survey scope / taxonomy

The spine is a four-way taxonomy of the functional role of language (Fig 3), each a section with its own sub-taxonomy figure and comparison table:

1. Language for state evaluation (§4). Language → numerical feedback (reward for RL, cost for planners). Subdivided into reward designing/learning (sparse vs. dense vs. learned; e.g. ZSRM uses CLIP image–text similarity as reward, PixL2R learns a dense relatedness shaping reward, LOReL learns a binary classifier), FM-driven reward generation (Text2Reward / EUREKA write executable reward code with LLMs; Reward-Self-Align and R* then learn the component weights the code leaves fragile; VLM critics like Video-Language Critic and ReWiND drop the privileged-simulator dependence), and cost-function mapping for motion planners (VoxPoser writes Python to compose 3D value maps; ReKep writes symbolic keypoint constraints; LACO learns a collision function; IMPACT rates "contact tolerance" with GPT-4o).

2. Language as a policy condition (§5). Language directly conditions a policy π_θ(a_t|s_t, l) — from goal specifier to behavior specifier. Subdivided by learning paradigm: RL (LCGG decouples language into a goal generator; LanCon-Learn / MILLION gate skill modules; FLaRe fine-tunes a BC policy with sparse linguistic reward), BC (CLIPORT, PerAct's voxel "next-best-action", RVT/Act3D/GNFactor 3D representations, HiveFormer's history-aware transformer, HULC/SPIL for long horizon), and diffusion policy (StructDiffusion, ChainedDiffuser, LCD, PoCo, MDT/GR-MG for multimodal goals).

3. Language for cognitive planning and reasoning (§6). Language as an internal reasoning medium — the robot "thinks in language." Three branches: classic neuro-symbolic (learning-for-reasoning, reasoning-for-learning, learning-reasoning — including Silver et al. 2023's learned predicate skills + bilevel planner, and LEMMo-Plan injecting tactile/force data into PDDL); LLM-empowered (open-loop planning like SayCan, closed-loop like SayPlan, plus summarization, prompt engineering, code-as-policies, iterative self-reflection, LLM+PDDL/behavior-tree structured planning); VLM-empowered (contrastive CLIP-based grounding, generative goal-image and world-model approaches that resolve the LLM "grounding problem").

4. Language in unified VLAs (§7). Language is jointly modeled with vision and action inside one embodied policy, not an external condition. Organized along an optimization-direction flow Perception → Reasoning → Action → Adaptation (Fig 16): perception (data/augmentation, 3D grounding via SpatialVLA/PointVLA/BridgeVLA, multimodal sensing/fusion via VTLA/Tactile-VLA/OmniVTLA/ForceVLA/FuSe), reasoning (long-horizon planning, preserving VLM capabilities against catastrophic forgetting via MoE / knowledge insulation, internal world models like WorldVLA/DreamVLA), action (continuous flow-matching like π₀, FAST tokenization, discrete diffusion), and adaptation (OpenVLA-OFT, ControlVLA, ConRFT, RIPT-VLA).
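
The gap between roles 1 and 2 is easiest to see in the type signatures. The following is a minimal sketch, not any surveyed method's implementation: random vectors stand in for CLIP-style image and text embeddings, the "policy" is a deterministic linear map rather than a learned distribution π_θ(a_t|s_t, l), and every name and dimension here is a hypothetical chosen for illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def language_reward(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Role 1 (state evaluation): language scores an observation with a scalar."""
    return cosine_similarity(image_emb, text_emb)


def language_conditioned_policy(
    state: np.ndarray, text_emb: np.ndarray, weights: np.ndarray
) -> np.ndarray:
    """Role 2 (policy condition): language is an input to the action map.

    A linear map on the concatenation [s; l] stands in for a learned policy.
    """
    return weights @ np.concatenate([state, text_emb])


# Placeholder embeddings (assumption: real systems would use a pretrained encoder).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=8)
image_emb = text_emb + 0.1 * rng.normal(size=8)  # an observation close to the goal text
state = rng.normal(size=4)
weights = rng.normal(size=(7, 12))  # hypothetical 7-dimensional action

reward = language_reward(image_emb, text_emb)                    # scalar feedback
action = language_conditioned_policy(state, text_emb, weights)   # action vector
```

In role 1 the instruction never touches the action path; it only produces feedback that some other learner or planner consumes. In role 2 the same embedding is consumed at every control step.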

On top of the role-taxonomy, §8 adds a cross-cutting comparative analysis along five axes: action granularity (skill / trajectory / low-level torque), data & supervision regime (expert demos / play data / web data; target / outcome / auxiliary supervision), system cost & latency, environments & evaluations (with a three-family task taxonomy: object-centric / interaction-centric / long-horizon), and task specification (language vs. image/video conditioning). §9 stages three "heat debates," §10 covers challenges and future directions.

Coverage

Datasets / benchmarks: Simulators — PyBullet, MuJoCo, CoppeliaSim, NVIDIA Omniverse, Unity, UniSim (Table 7). Benchmarks — CALVIN, Meta-World, RLBench, VIMAbench, LoHoRavens, ARNOLD, RoboGen, LIBERO (Table 8). Real-world datasets — Open X-Embodiment (2.4M+ trajectories, 22 robots, 527 skills), DROID (76k traj, ~350h), Galaxea Open-World (100k traj, ~500h). A three-family task taxonomy (Table 9): object-centric / interaction-centric / long-horizon.
Hardware / simulator: not reported — survey, no robot experiments of its own. Surveyed embodiments span Franka Panda, UR5, Sawyer, mobile bimanual platforms (Galaxea R1 Lite), etc.
Baselines: not reported. The paper reviews prior surveys (Tellex 2020, Hu 2023, Xiao 2024, Firoozi 2025) and positions its functional-role taxonomy as orthogonal to their model-type / robotic-module organizations.
Compute: not reported. (The survey itself reports inference cost/latency for surveyed models in Table 6, e.g. PaLM-E 562B, π₀ ~10,000 training hours, diffusion policies running ~1Hz vs. RT-1 ~3Hz.)

Key insights / classifications

The survey's value is the organizing claims, not headline numbers. The most useful:

The functional-role taxonomy is the contribution. Four roles — state evaluation, policy condition, cognitive planning, unified VLA — each with an evolutionary arc from hand-engineered → data-driven → FM-automated. E.g. state evaluation evolved from manual reward/cost specification (ZSRM, VoxPoser) to autonomous FM-driven reward+success-detector generation (ARCHIE).
Five cross-cutting axes (§8) expose the real trade-offs. Action granularity: skill-level (SayCan, Code-as-Policies, TAMP) matches "language as natural units" but fails when a task needs a skill the library does not cover; trajectory-level (most IL) suffers covariate shift over long horizons; low-level torque (Bi-LAT, ManiFoundation, TA-VLA) is needed for contact-rich/dexterous tasks but is hard to teleoperate and hard to transfer from sim to real.
Task-difficulty taxonomy warns against naive success-rate comparison (Table 9): object-centric benchmarks mostly test visual-language grounding + short-horizon visuomotor control; interaction-centric and long-horizon benchmarks additionally test physical interaction, force, memory, and closed-loop replanning. Strong numbers on one family don't transfer.
Language and image/video conditioning are complementary (§8.5). Language gives low-bandwidth specification, compositional generalization, interactive editability, and uniquely encodes negative constraints / safety protocols ("stay away from the yellow bottle," "watch out for the vase") and relational preferences; image/video give dense perceptual grounding of the geometric configurations that language under-specifies. The "contemporary consensus" they articulate: language to dictate what, image/video to specify how precisely, fused in hybrid VLAs.
Three heat debates (§9): (1) Are VLAs the right path? Scaling laws may not transfer to embodiment; the field's gains came from expanding the task manifold, not model size; the survey argues for structure-aware scaling over pure scale. (2) Are world models the right path? They offer safety via "imagination" and shared dynamics priors, but model bias and distributional brittleness accumulate over imagined rollouts. (3) Does scaling help under real-time constraints? Latency is the under-reported wall; control cycles have hard deadlines; the authors "strongly encourage authors to report latency prominently as a primary performance metric."
Future directions (§10): generalization (data scale, lifelong learning, cross-embodiment alignment) and real-world safety (language ambiguity → clarification dialogue, failure recovery, real-time performance). Zero-shot capability is reframed as partial and conditional — strongest at semantic/planning levels, fragile at policy/control levels.
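
The "hard deadline" point in debate (3) comes down to simple arithmetic: a control rate fixes a per-step time budget, and an inference call either fits in it or does not. The sketch below is my own illustration, not something the survey provides. The function names are hypothetical, and the `chunk_size` argument assumes the policy emits several actions per call and that the next call runs in the background while the current chunk executes.

```python
def control_period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return 1000.0 / rate_hz


def meets_deadline(inference_ms: float, rate_hz: float, chunk_size: int = 1) -> bool:
    """True if one inference call finishes before its predicted actions run out.

    Assumption: a call that returns `chunk_size` actions covers `chunk_size`
    control periods, and the next call overlaps with their execution.
    """
    return inference_ms <= chunk_size * control_period_ms(rate_hz)


budget_1hz = control_period_ms(1.0)   # 1000 ms per step
budget_3hz = control_period_ms(3.0)   # about 333 ms per step

# A hypothetical 300 ms model on a 10 Hz loop (100 ms per step):
single_step_ok = meets_deadline(300.0, rate_hz=10.0)              # False
chunked_ok = meets_deadline(300.0, rate_hz=10.0, chunk_size=5)    # True
```

Under these assumptions a model that is three times too slow for single-step control can still hold a 10 Hz loop by predicting five actions per call, at the cost of reacting to new observations only once per chunk.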

Limitations & open questions

(a) Author-stated. The survey concedes the field "still lacks a unified evaluation protocol" for measuring how much grounding/reasoning actually contributes to task success; that direct success-rate comparison is misleading because benchmarks differ wildly in difficulty; that latency is rarely treated as a first-class metric in the literature it reviews; and that real-world, large-scale, cross-embodiment evaluation benchmarks are missing. It also flags its own taxonomy choice (focusing on the "primary contribution" of each method) as lossy, since most methods touch multiple roles at once.
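
A small sketch of why pooled success rates mislead when benchmarks mix task families. The episode counts are made up purely for illustration; they are not from the survey or from any benchmark, and the helper names are hypothetical.

```python
from collections import defaultdict


def success_by_family(episodes):
    """Success rate per task family from (family, succeeded) pairs."""
    counts = defaultdict(lambda: [0, 0])  # family -> [successes, trials]
    for family, succeeded in episodes:
        counts[family][0] += int(succeeded)
        counts[family][1] += 1
    return {family: s / n for family, (s, n) in counts.items()}


def pooled_success(episodes):
    """Single success rate over all episodes, ignoring task family."""
    episodes = list(episodes)
    return sum(int(succeeded) for _, succeeded in episodes) / len(episodes)


# Illustrative, made-up evaluation log: many easy episodes, few hard ones.
episodes = (
    [("object-centric", True)] * 9
    + [("object-centric", False)] * 1
    + [("long-horizon", True)] * 1
    + [("long-horizon", False)] * 4
)

per_family = success_by_family(episodes)  # 9/10 object-centric, 1/5 long-horizon
pooled = pooled_success(episodes)         # 10/15 overall
```

The pooled figure sits at two thirds, which hides that the same policy succeeds on one long-horizon episode in five. Reporting per family, as Table 9's grouping suggests, keeps that visible.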

(b) What I noticed reading it.

Tactile/force/audio sensing is under-weighted relative to this batch's thesis. Non-visual sensing appears almost entirely inside §7.2.3 (VLA multimodal fusion: VTLA, Tactile-VLA, OmniVTLA, ForceVLA, FuSe) and as a passing note that LEMMo-Plan injects tactile/force-torque into PDDL to reason about "invisible" events like cable tension / insertion forces. There is no dedicated treatment of tactile representation learning, acoustic sensing, or the systematic claim that many predicates are non-visual. For a survey named "Bridging Language and Action," touch/sound is a footnote, not a pillar.
Recency padding / citation inflation risk. The comparison tables cite many 2025 and even 2026-dated works (H-RDT 2026, PointVLA/BridgeVLA 2026, MemoryVLA 2026). The arXiv ID places v1 in December 2023, so 2026 citations imply heavy later revision; the taxonomy is sound, but some "representative" entries are very fresh and may not have stabilized as canonical.
No quantitative synthesis. Despite the comparative-analysis framing, §8 is qualitative — the tables list mechanisms/advantages/disadvantages, not head-to-head numbers on a shared benchmark. So the "trade-offs" are argued, not measured; a reader can't rank approaches from this survey alone.
The "functional role" boundary is fuzzy at §5/§7. The authors themselves spend a paragraph (CLIPORT example) disambiguating "language as policy condition" from "unified VLA," which signals the taxonomy's cleanest cut is also its softest. Methods drift between cells as they add reasoning heads.

Why I care

This survey is the map of the territory BLADE sits in, and a useful positioning device. BLADE — Learning Compositional Behaviors from Demonstration and Language — lands squarely in this survey's §6 "classic neuro-symbolic / learning-reasoning" branch: it learns PDDL-style predicate abstractions + diffusion-policy controllers from language-annotated demos and composes them with a bilevel planner. Notably the survey explicitly cites Silver et al. 2023 ("learn neuro-symbolic skills from demonstrations... predefined 'language' of symbolic predicates... bilevel planner uses these learned skills") as the canonical instance of this branch — that is exactly BLADE's lineage and the cleanest place to position BLADE in a related-work section. It gives Weiyu a ready-made, citable framing: "language-conditioned manipulation by functional role," with our work as the predicate-learning + bilevel-planning exemplar.

On the batch's bigger thesis (that many manipulation predicates such as is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight are not visually evaluable and live in touch/force/sound), this survey is partially on-theme but does not make that argument. Its one resonant move is §8.5: language uniquely encodes constraints and relations that vision under-specifies, while image/video ground the geometric details language under-specifies. That's the dual of our thesis (it argues vision grounds what language can't; we argue touch grounds what vision can't). The LEMMo-Plan example, which uses force-torque to define predicates over "invisible" events like cable tension, is the single clearest seed in the survey for the non-visual-predicate idea, and worth chasing. But the survey treats tactile/acoustic sensing as a late VLA-fusion topic, not as a challenge to the visual-predicate assumption. So: use it as the structural map and the positioning citation, not as evidence for the non-visual-predicate thesis; for that, lean on the dedicated tactile/audio papers in this batch (clusters A–G).
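
To make the non-visual-predicate idea concrete, here is a minimal sketch of is_grasped evaluated from gripper signals alone. It is not how BLADE, LEMMo-Plan, or any surveyed system defines the predicate: the `GripperReading` type, its fields, and both thresholds are hypothetical, and a learned classifier would normally replace the hand-set cutoffs.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GripperReading:
    """One proprioceptive/tactile sample (hypothetical fields)."""
    grip_force_n: float  # normal force at the fingertips, in newtons
    finger_gap_m: float  # opening between the fingers, in metres


def is_grasped(
    reading: GripperReading,
    min_force_n: float = 2.0,    # assumed contact threshold
    min_gap_m: float = 0.005,    # assumed gap below which the fingers closed on nothing
) -> bool:
    """True when the fingers press on something and have not fully closed."""
    return reading.grip_force_n >= min_force_n and reading.finger_gap_m >= min_gap_m


def holds_throughout(readings: Iterable[GripperReading]) -> bool:
    """True if the predicate holds for every sample in a non-empty window."""
    readings = list(readings)
    return bool(readings) and all(is_grasped(r) for r in readings)


holding = GripperReading(grip_force_n=5.0, finger_gap_m=0.03)
no_contact = GripperReading(grip_force_n=0.1, finger_gap_m=0.03)
closed_empty = GripperReading(grip_force_n=5.0, finger_gap_m=0.0)
```

None of the three cases needs a camera. The `closed_empty` case (force present, fingers fully closed) is the kind of state an occluded wrist view cannot separate from a successful grasp.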

Quotable

"A deeper analysis reveals that the central research questions are not just about these components, but about the specific functional role language plays in bridging them. Different approaches leverage language in fundamentally distinct ways to solve the manipulation problem." — §1.2 / p.4

"The contemporary consensus points toward a unified multimodal task-specification framework: utilize language as the primary interface to dictate what to do ... while using image/video conditioning to specify how precisely to execute it." — §8.5.4 / p.45

"We note that latency is not consistently treated as a first-class evaluation metric in the literature, and often it is mentioned only in footnotes. ... we strongly encourage authors to report latency prominently as a primary performance metric." — §9.3 / p.47

(a) Cited here, worth ingesting next (forward references):

Towards Forceful Robotic Foundation Models (survey) — complementary Cluster-J survey; takes the force angle this one under-weights, directly relevant to the non-visual-predicate thesis.
What Foundation Models can Bring (survey) — the model-type / robotic-module survey (Xiao et al. 2024) this paper explicitly positions itself against; useful contrast for a related-work framing.
Beyond Sight (FuSe) — cited as a multimodal-fusion VLA in §7.2.3; canonical heterogeneous-sensor + language policy.
Tactile-VLA — the survey's example (Huang et al. 2025a) of grounding force-adverbs ("gently", "hard") in tactile feedback; directly on the non-visual-predicate thesis.
ForceVLA — the force-aware MoE late-fusion VLA the survey highlights (Yu et al. 2026).

(b) Newly ingested in 2026-06-24 batch — directly relevant:

BLADE — the work this survey's §6 learning-reasoning branch (Silver et al. 2023 lineage) describes; positions our predicate-learning + bilevel-planning approach within the taxonomy.
PaLM-E — the survey's canonical VLM-empowered planner / embodied multimodal LM anchor (§6.3, Table 4, Table 6).
FuSe — if already summarized, the survey's multimodal-fusion VLA exemplar.
Tactile-VLA — if already summarized, the force-adverb-grounding instance closest to the batch thesis.