One-liner. A 70-page survey that re-taxonomizes language-conditioned manipulation not by algorithm or by robotic module, but by the functional role language plays in the control loop — state evaluation, policy condition, cognitive planning/reasoning, and unified VLAs — then cross-cuts the whole field along five engineering axes (action granularity, data/supervision, cost/latency, environments, task specification).
Language-conditioned manipulation has exploded across RL, IL, neuro-symbolic planning, and foundation-model approaches, and prior surveys (Tellex et al. 2020; Hu et al. 2023; Xiao et al. 2024; Firoozi et al. 2025) organized the field either by technical underpinning ("lexically grounded" vs. "learning methods") or by the robotic module language enhances (perception / decision-making / control). The authors argue these axes obscure the deeper question: the central design choice is how language enters the control loop. Two methods can both "use an LLM" yet play fundamentally distinct roles — one shaping a reward, another directly emitting actions. They propose a new, orthogonal taxonomy around language's functional role, then layer a cross-cutting comparative analysis to expose practical trade-offs (latency, data cost, granularity) that the role-taxonomy alone doesn't surface. The framing is "Language – Perception – Control" (Fig 2), but the survey's claim is that the interesting variation lives in how language bridges those three modules.
The spine is a four-way taxonomy of the functional role of language (Fig 3), each a section with its own sub-taxonomy figure and comparison table:
1. Language for state evaluation (§4). Language → numerical feedback (reward for RL, cost for planners). Subdivided into reward designing/learning (sparse vs. dense vs. learned; e.g. ZSRM uses CLIP image–text similarity as reward, PixL2R learns a dense relatedness shaping reward, LOReL learns a binary classifier), FM-driven reward generation (Text2Reward / EUREKA write executable reward code with LLMs; Reward-Self-Align and R* then learn the component weights the code leaves fragile; VLM critics like Video-Language Critic and ReWiND drop the privileged-simulator dependence), and cost-function mapping for motion planners (VoxPoser writes Python to compose 3D value maps; ReKep writes symbolic keypoint constraints; LACO learns a collision function; IMPACT rates "contact tolerance" with GPT-4o).
2. Language as a policy condition (§5). Language directly conditions a policy π_θ(a_t | s_t, l), ranging from goal specifier to behavior specifier. Subdivided by learning paradigm: RL (LCGG decouples language into a goal generator; LanCon-Learn / MILLION gate skill modules; FLaRe fine-tunes a BC policy with sparse linguistic reward), BC (CLIPORT, PerAct's voxel "next-best-action", RVT/Act3D/GNFactor 3D representations, HiveFormer's history-aware transformer, HULC/SPIL for long horizon), and diffusion policy (StructDiffusion, ChainedDiffuser, LCD, PoCo, MDT/GR-MG for multimodal goals).
3. Language for cognitive planning and reasoning (§6). Language as an internal reasoning medium — the robot "thinks in language." Three branches: classic neuro-symbolic (learning-for-reasoning, reasoning-for-learning, learning-reasoning — including Silver et al. 2023's learned predicate skills + bilevel planner, and LEMMo-Plan injecting tactile/force data into PDDL); LLM-empowered (open-loop planning like SayCan, closed-loop like SayPlan, plus summarization, prompt engineering, code-as-policies, iterative self-reflection, LLM+PDDL/behavior-tree structured planning); VLM-empowered (contrastive CLIP-based grounding, generative goal-image and world-model approaches that resolve the LLM "grounding problem").
4. Language in unified VLAs (§7). Language is jointly modeled with vision and action inside one embodied policy, not an external condition. Organized along an optimization-direction flow Perception → Reasoning → Action → Adaptation (Fig 16): perception (data/augmentation, 3D grounding via SpatialVLA/PointVLA/BridgeVLA, multimodal sensing/fusion via VTLA/Tactile-VLA/OmniVTLA/ForceVLA/FuSe), reasoning (long-horizon planning, preserving VLM capabilities against catastrophic forgetting via MoE / knowledge insulation, internal world models like WorldVLA/DreamVLA), action (continuous flow-matching like π0, FAST tokenization, discrete diffusion), and adaptation (OpenVLA-OFT, ControlVLA, ConRFT, RIPT-VLA).
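To make the difference between roles 1 and 2 concrete, here is a minimal sketch of where the instruction enters the loop in each case. It assumes precomputed embedding vectors; no real CLIP model is called, and the function names, the toy linear policy, and the weight layout are hypothetical illustrations rather than any cited method's implementation.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def language_reward(image_emb: Sequence[float], text_emb: Sequence[float]) -> float:
    """Role 1 (state evaluation): language scores the current observation.

    Assumption: image_emb and text_emb come from some shared image-text
    embedding space; here they are just plain lists of floats.
    """
    return cosine_similarity(image_emb, text_emb)


def language_conditioned_policy(
    state: Sequence[float],
    text_emb: Sequence[float],
    weights: Sequence[Sequence[float]],
) -> List[float]:
    """Role 2 (policy condition): a toy linear policy a = W [s; l].

    The instruction embedding is concatenated with the state and mapped
    straight to an action, instead of producing a scalar score.
    """
    features = list(state) + list(text_emb)
    return [sum(w * x for w, x in zip(row, features)) for row in weights]
```

In the first function language only produces a scalar that some other learner or planner consumes; in the second it is an input to the thing that emits actions. That is the distinction the role taxonomy is built on.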
On top of the role-taxonomy, §8 adds a cross-cutting comparative analysis along five axes: action granularity (skill / trajectory / low-level torque), data & supervision regime (expert demos / play data / web data; target / outcome / auxiliary supervision), system cost & latency, environments & evaluations (with a three-family task taxonomy: object-centric / interaction-centric / long-horizon), and task specification (language vs. image/video conditioning). §9 stages three "heat debates," §10 covers challenges and future directions.
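For ingesting papers against this map, one option is a small record type that stores the primary role plus the cross-cutting axes. This is a hypothetical sketch: the class and field names are mine, only two of the five axes are modeled, and the axis fields are deliberately left unset because these notes do not assign granularity or task family to individual methods. The role assignments follow the section each method is listed under above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Set


class Role(Enum):
    STATE_EVALUATION = "state evaluation"
    POLICY_CONDITION = "policy condition"
    COGNITIVE_PLANNING = "cognitive planning and reasoning"
    UNIFIED_VLA = "unified VLA"


class Granularity(Enum):
    SKILL = "skill"
    TRAJECTORY = "trajectory"
    LOW_LEVEL = "low-level"


class TaskFamily(Enum):
    OBJECT_CENTRIC = "object-centric"
    INTERACTION_CENTRIC = "interaction-centric"
    LONG_HORIZON = "long-horizon"


@dataclass
class MethodEntry:
    name: str
    primary_role: Role
    # Most methods touch several roles; keep the extras instead of dropping them.
    secondary_roles: Set[Role] = field(default_factory=set)
    granularity: Optional[Granularity] = None   # unset until read from the paper
    task_family: Optional[TaskFamily] = None    # unset until read from the paper


def group_by_role(entries: List[MethodEntry]) -> Dict[Role, List[str]]:
    """Group method names by primary role, preserving input order."""
    groups: Dict[Role, List[str]] = {role: [] for role in Role}
    for entry in entries:
        groups[entry.primary_role].append(entry.name)
    return groups


ENTRIES = [
    MethodEntry("ZSRM", Role.STATE_EVALUATION),
    MethodEntry("CLIPORT", Role.POLICY_CONDITION),
    MethodEntry("SayCan", Role.COGNITIVE_PLANNING),
    MethodEntry("OpenVLA-OFT", Role.UNIFIED_VLA),
]
```

Keeping `secondary_roles` as a set is a small hedge against the lossiness the survey itself admits in its "primary contribution" assignment.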
The survey's value is the organizing claims, not headline numbers. The most useful:
(a) Author-stated. The survey concedes the field "still lacks a unified evaluation protocol" for measuring how much grounding/reasoning actually contributes to task success; that direct success-rate comparison is misleading because benchmarks differ wildly in difficulty; that latency is rarely treated as a first-class metric in the literature it reviews; and that real-world, large-scale, cross-embodiment evaluation benchmarks are missing. It also flags its own taxonomy choice (focusing on the "primary contribution" of each method) as lossy, since most methods touch multiple roles at once.
(b) What I noticed reading it.
This survey is the map of the territory BLADE sits in, and a useful positioning device. BLADE — Learning Compositional Behaviors from Demonstration and Language — lands squarely in this survey's §6 "classic neuro-symbolic / learning-reasoning" branch: it learns PDDL-style predicate abstractions + diffusion-policy controllers from language-annotated demos and composes them with a bilevel planner. Notably the survey explicitly cites Silver et al. 2023 ("learn neuro-symbolic skills from demonstrations... predefined 'language' of symbolic predicates... bilevel planner uses these learned skills") as the canonical instance of this branch — that is exactly BLADE's lineage and the cleanest place to position BLADE in a related-work section. It gives Weiyu a ready-made, citable framing: "language-conditioned manipulation by functional role," with our work as the predicate-learning + bilevel-planning exemplar.
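As a reminder of what "predicate abstractions + bilevel planner" means structurally, here is a toy sketch. It is not BLADE's or Silver et al.'s implementation: the operators are hand-written rather than learned, the predicate and operator names are made up, the high level is plain breadth-first search, and the controller is a stub standing in for a learned low-level policy. A real bilevel planner would also replan when a controller fails, which this sketch does not do.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional

State = FrozenSet[str]  # a symbolic state is the set of predicates that hold


@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str]
    delete_effects: FrozenSet[str]

    def applicable(self, state: State) -> bool:
        return self.preconditions <= state

    def apply(self, state: State) -> State:
        return (state - self.delete_effects) | self.add_effects


def symbolic_plan(init: State, goal: State, operators: List[Operator]) -> Optional[List[Operator]]:
    """High level: breadth-first search over symbolic states; None if the goal is unreachable."""
    frontier = deque([(init, [])])
    visited = {init}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for op in operators:
            if op.applicable(state):
                nxt = op.apply(state)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [op]))
    return None


def execute(plan: List[Operator], controller: Callable[[Operator], bool]) -> bool:
    """Low level: run one controller per operator, stopping at the first failure."""
    return all(controller(op) for op in plan)


# Hypothetical two-step domain for illustration only.
OPERATORS = [
    Operator("grasp_cup", frozenset({"hand_empty", "cup_on_table"}),
             frozenset({"is_grasped"}), frozenset({"hand_empty", "cup_on_table"})),
    Operator("place_in_cabinet", frozenset({"is_grasped"}),
             frozenset({"cup_in_cabinet", "hand_empty"}), frozenset({"is_grasped"})),
]
INIT: State = frozenset({"hand_empty", "cup_on_table"})
GOAL: State = frozenset({"cup_in_cabinet"})
```

Calling `symbolic_plan(INIT, GOAL, OPERATORS)` yields the grasp-then-place sequence, and `execute` hands each step to whatever controller is plugged in.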
On the batch's bigger thesis, that many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) are not visually evaluable and live in touch/force/sound, this survey is partially on-theme but does not make that argument. Its one resonant move is §8.5: language uniquely encodes constraints and relations that vision under-specifies, while vision/touch ground geometric and contact details language under-specifies. That's the dual of our thesis (it argues vision grounds what language can't; we argue touch grounds what vision can't). The LEMMo-Plan example, using force-torque to define predicates over "invisible" events like cable tension, is the single clearest seed in the survey for the non-visual-predicate idea, and worth chasing. But the survey treats tactile/acoustic sensing as a late VLA-fusion topic, not as a challenge to the visual-predicate assumption. So: use it as the structural map and the positioning citation, not as evidence for the non-visual-predicate thesis; for that, lean on the dedicated tactile/audio papers in this batch (clusters A–G).
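To pin down what a non-visual predicate would look like in code, here is a minimal sketch of a predicate evaluated from a force-torque reading rather than from pixels. Everything in it is an assumption for illustration: the `Observation` layout, the `cable_is_taut` name, and the 5 N threshold are hypothetical and are not taken from LEMMo-Plan or any other paper in the batch.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Observation:
    rgb_features: Sequence[float]  # placeholder for camera-derived features
    wrench: Sequence[float]        # assumed layout: (fx, fy, fz, tx, ty, tz)


def force_magnitude(obs: Observation) -> float:
    """Euclidean norm of the force part of the wrench, in newtons."""
    fx, fy, fz = obs.wrench[:3]
    return math.sqrt(fx * fx + fy * fy + fz * fz)


def cable_is_taut(obs: Observation, threshold_newtons: float = 5.0) -> bool:
    """Hypothetical predicate: true once the measured force reaches the threshold.

    Note that rgb_features is never read: the predicate is decided by the
    force-torque channel alone.
    """
    return force_magnitude(obs) >= threshold_newtons


# Registry mapping predicate names to evaluators over a full observation.
PREDICATES: Dict[str, Callable[[Observation], bool]] = {
    "cable_is_taut": cable_is_taut,
}


def evaluate_state(obs: Observation) -> Dict[str, bool]:
    """Evaluate every registered predicate on one observation."""
    return {name: fn(obs) for name, fn in PREDICATES.items()}
```

The point of the sketch is the signature: predicates take the whole multimodal observation, so a planner can stay agnostic about which sensor actually decides each one.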
"A deeper analysis reveals that the central research questions are not just about these components, but about the specific functional role language plays in bridging them. Different approaches leverage language in fundamentally distinct ways to solve the manipulation problem." — §1.2 / p.4
"The contemporary consensus points toward a unified multimodal task-specification framework: utilize language as the primary interface to dictate what to do ... while using image/video conditioning to specify how precisely to execute it." — §8.5.4 / p.45
"We note that latency is not consistently treated as a first-class evaluation metric in the literature, and often it is mentioned only in footnotes. ... we strongly encourage authors to report latency prominently as a primary performance metric." — §9.3 / p.47
(a) Cited here, worth ingesting next (forward references):
(b) Newly ingested in 2026-06-24 batch — directly relevant: