PaLM-E: An Embodied Multimodal Language Model

Driess, Xia, Sajjadi, Lynch, Chowdhery, Ichter, Wahid, Tompson, Vuong, Yu, Huang, Chebotar, Sermanet, Duckworth, Levine, Vanhoucke, Hausman, Toussaint, Greff, Zeng, Mordatch, Florence (Robotics at Google, TU Berlin, Google Research) · 2023 · ICML 2023 · arXiv:2303.03378

One-liner. PaLM-E injects continuous sensor observations (images, 3D scene representations, robot state vectors) directly into the token embedding space of a frozen-or-finetuned pretrained LLM, turning the LLM into a single embodied generalist that outputs natural-language plans for a robot control loop — and the recipe for "encode a non-text modality into vectors the same width as word tokens, then interleave them with text" is the structural template for adding any sensor (touch, force, audio) to a language model.

Problem & motivation

LLMs carry vast world knowledge and strong reasoning, but for real-world embodied tasks they are ungrounded: trained on text alone, their representations don't connect to the robot's physical percepts. Prior LLM-for-planning work like SayCan (Ahn et al., 2022) interfaces an LLM with learned affordance functions but feeds the LLM only text, which is insufficient when the geometric configuration of the scene matters. The authors further show that an out-of-the-box SOTA vision-language model (PaLI) trained on typical VQA-style data cannot directly solve robotic reasoning tasks. The goal: a single model that ingests continuous multimodal observations and produces grounded sequential decisions, while remaining a competent vision-language and language generalist.

Method

The central architectural idea (Sec 3, Fig 1): inject continuous embodied observations into the language embedding space of a pretrained decoder-only LLM, forming multimodal sentences — sequences of tokens where some positions are word-token embeddings and others are encoder outputs from arbitrary modalities, interleaved freely with text.

Multimodal sentences / token injection. A normal LLM maps a text token w_i to a word-embedding vector x_i = γ(w_i) ∈ R^k via the embedding matrix. PaLM-E replaces selected token positions with encoder outputs: each continuous observation O_j is mapped by an encoder φ_j: O → X^q into a sequence of q vectors of the same dimension k as word embeddings, which are then interleaved into the prefix (Eq. 3). One observation usually becomes multiple embedding vectors, and different encoders can be mixed at different positions. Crucially, the observation embeddings are not inserted at fixed positions (contrast with cross-attention VLMs like Flamingo) but placed dynamically within the surrounding text, reusing the LLM's existing positional encodings. The model trains end-to-end with a cross-entropy loss computed only over the non-prefix (target) text tokens. Special placeholder tokens in the input text mark where each observation belongs and are replaced by the encoder embeddings at those locations.
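
The interleaving step can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's code: the vocabulary, the `<img>` placeholder name, the width k = 16, and the random, untrained `encode_image` projection are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 16  # embedding width shared by word tokens and observations (toy value)
vocab = {"Q:": 0, "What": 1, "is": 2, "in": 3, "<img>": 4, "?": 5}
gamma = rng.normal(size=(len(vocab), k))  # word-embedding matrix, one row per token


def encode_image(image, q=3):
    """Hypothetical stand-in for an encoder phi: one observation -> q vectors of width k."""
    flat = image.reshape(-1)
    proj = rng.normal(size=(flat.size, q * k))  # random, untrained projection
    return (flat @ proj).reshape(q, k)


def build_multimodal_sentence(tokens, observations):
    """Replace each placeholder token with its block of encoder vectors, in place."""
    rows, is_observation = [], []
    for tok in tokens:
        if tok in observations:
            block = observations[tok]            # (q, k) vectors for this observation
            rows.extend(block)
            is_observation.extend([True] * len(block))
        else:
            rows.append(gamma[vocab[tok]])       # ordinary word embedding
            is_observation.append(False)
    return np.stack(rows), np.array(is_observation)


tokens = ["Q:", "What", "is", "in", "<img>", "?"]
image = rng.normal(size=(4, 4))
prefix, mask = build_multimodal_sentence(tokens, {"<img>": encode_image(image)})
print(prefix.shape)   # (8, 16): 5 word vectors plus 3 observation vectors
print(mask.tolist())  # marks which positions came from the encoder
```

Nothing in the sequence reserves a fixed slot for the image: moving `<img>` elsewhere in `tokens` moves its three vectors with it.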

Embodying the output: robot control loop. PaLM-E is a text-generating decoder-only model. For embodied question answering / scene description, the text output is the answer. For embodied planning/control, PaLM-E generates text decisions that are interpreted as a sequence of low-level skills from a (small) pre-existing skill vocabulary; each skill is executed by a separately-trained low-level policy (e.g. RT-1, or the Lynch et al. Language-Table policies). PaLM-E thus acts as a high-level policy that sequences low-level policies, and because execution yields new observations, it can replan at each step, giving closed-loop control robust to disturbances.
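
The closed-loop structure described above can be sketched as follows. The planner and executor here are toy stand-ins (assumptions for illustration): in the paper the planner is PaLM-E generating text and the executor is a learned low-level policy, and the skill strings and observation strings below are hypothetical.

```python
from typing import Callable, Dict, List


def run_closed_loop(
    plan_next_skill: Callable[[str, str, List[str]], str],
    execute_skill: Callable[[str], str],
    instruction: str,
    observation: str,
    max_steps: int = 10,
) -> List[str]:
    """Ask the high-level planner for one skill at a time, execute it, then replan."""
    history: List[str] = []
    for _ in range(max_steps):
        skill = plan_next_skill(instruction, observation, history)
        if skill == "done":
            break
        observation = execute_skill(skill)  # execution yields a fresh observation
        history.append(skill)
    return history


# Toy world: the first grasp is disturbed, so the object ends up back on the counter.
world: Dict[str, int] = {"disturbances_left": 1}


def toy_planner(instruction: str, observation: str, history: List[str]) -> str:
    if observation == "chips in drawer":
        return "done"
    if observation == "holding chips":
        return "place chips in drawer"
    return "pick up chips"


def toy_executor(skill: str) -> str:
    if skill == "pick up chips":
        if world["disturbances_left"] > 0:
            world["disturbances_left"] -= 1
            return "chips on counter"      # grasp undone by the disturbance
        return "holding chips"
    return "chips in drawer"


executed = run_closed_loop(toy_planner, toy_executor,
                           "put the chips in the drawer", "chips on counter")
print(executed)
# ['pick up chips', 'pick up chips', 'place chips in drawer']
```

Because the planner sees the new observation after every skill, the disturbed grasp is simply retried; no separate recovery logic is needed in this sketch.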

Input & scene representations (Sec 4). The paper studies several encoders mapping a modality into the language embedding space: (i) state-estimation vectors — a robot/scene state vector s ∈ R^S (poses, sizes, colors) mapped by an MLP φ_state; (ii) ViT — a Vision Transformer (ViT-4B from Chen et al., or ViT-22B) producing token embeddings, projected to width k by a learned affine map ψ; (iii) ViT token learner (ViT+TL) trained from scratch; (iv) object-centric ViT — given ground-truth instance masks, decompose ViT features per object; and (v) Object Scene Representation Transformer (OSRT) — an unsupervised 3D-aware object-centric scene representation (slots o_j) projected per-slot into multiple embeddings by an MLP. Entity referrals: for object-centric encoders the prompt labels each object's tokens as Object j is <obj_j>, so PaLM-E can reference objects by special obj_j tokens in its generated plan and the low-level policies operate on those tokens.
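
A minimal sketch of two of these encoders and the entity-referral prompt, in NumPy. All sizes, the two-layer ReLU MLP, the single output vector for the state encoder, and the random untrained weights are assumptions for illustration; the ViT features are replaced by random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d_vit, m, k = 6, 32, 4, 16  # toy sizes: state dim, ViT width, ViT tokens, LLM width

# phi_state: MLP from a state vector to the LLM embedding width.
W1, b1 = rng.normal(size=(S, 64)), np.zeros(64)
W2, b2 = rng.normal(size=(64, k)), np.zeros(k)


def phi_state(s):
    hidden = np.maximum(s @ W1 + b1, 0.0)   # ReLU hidden layer
    return (hidden @ W2 + b2)[None, :]      # shape (1, k): one embedding vector


# psi: affine map that brings ViT token embeddings to the LLM width k.
A, c = rng.normal(size=(d_vit, k)), np.zeros(k)


def psi(vit_tokens):
    return vit_tokens @ A + c               # (m, d_vit) -> (m, k)


def entity_referral_prompt(num_objects):
    """Label each object's embedding slot so a plan can refer to it by token."""
    return " ".join(f"Object {j} is <obj_{j}>." for j in range(1, num_objects + 1))


state_tokens = phi_state(rng.normal(size=S))
image_tokens = psi(rng.normal(size=(m, d_vit)))
print(state_tokens.shape, image_tokens.shape)  # (1, 16) (4, 16)
print(entity_referral_prompt(2))               # Object 1 is <obj_1>. Object 2 is <obj_2>.
```

Both encoders end at the same width k, which is the only interface the LLM needs: everything upstream of that width can differ per modality.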

Training recipes (Sec 5). PaLM-E is built on the pretrained 8B / 62B / 540B PaLM as the decoder-only LLM, plus a 4B or 22B ViT. Naming: 8B LLM + 4B ViT = PaLM-E-12B; 62B + 22B = PaLM-E-84B; 540B + 22B = PaLM-E-562B (the largest VLM reported at the time). The three components are an encoder φ, a projector ψ, and the LLM p_LM; the paper ablates freezing the LLM (train only encoder+projector, a form of input-conditioned soft-prompting) vs. finetuning it end-to-end. Co-training across tasks is the key experimental knob: the "full mixture" is mostly diverse internet-scale vision-language and language data, with only 8.9% embodied data, to test whether transfer from general VL domains boosts embodied performance.
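
The two recipes and the naming convention reduce to simple bookkeeping, sketched below. The recipe names `frozen_llm` and `finetune` and both helper functions are hypothetical labels for this note, not identifiers from the paper.

```python
COMPONENTS = ("encoder", "projector", "llm")


def trainable_components(recipe):
    """Return which of the three components receive gradient updates."""
    if recipe == "frozen_llm":
        return {"encoder", "projector"}   # LLM weights stay fixed
    if recipe == "finetune":
        return set(COMPONENTS)            # everything trains end-to-end
    raise ValueError(f"unknown recipe: {recipe}")


def model_name(llm_billions, vit_billions):
    """PaLM-E names add the LLM and ViT parameter counts (in billions)."""
    return f"PaLM-E-{llm_billions + vit_billions}B"


names = [model_name(8, 4), model_name(62, 22), model_name(540, 22)]
print(names)  # ['PaLM-E-12B', 'PaLM-E-84B', 'PaLM-E-562B']
print(sorted(trainable_components("frozen_llm")))  # ['encoder', 'projector']
```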

Setup

Datasets / benchmarks: Three robotics domains — a TAMP grasp-and-stack environment, the Language-Table tabletop pushing dataset (Lynch et al.), and a SayCan-style mobile manipulation kitchen environment. General VL benchmarks: OK-VQA, VQA v2, COCO captioning. Plus 21 general language (NLU/NLG) benchmarks for catastrophic-forgetting analysis. The "full mixture" adds WebLI, VQA, COCO, and other internet-scale VL/language data (App. A).
Hardware / simulator: Real robots in two closed-loop domains (Language-Table tabletop robot; a mobile manipulator in a real kitchen, following the SayCan setup) plus simulation. Low-level policies are RT-1 (mobile manipulation) and the Lynch et al. policies (Language-Table). TAMP is in simulation.
Baselines: PaLI (a SOTA VLM not trained on robot data, zero-shot); SayCan with oracle affordances; for failure detection, CLIP-FT and CLIP-FT-hindsight (Xiao et al.); for affordance, QT-OPT value-function thresholding; Frozen (Tsimpoukelli et al.) for the frozen-LLM comparison; Flamingo for general VQA.
Compute: not reported (parameter counts given — up to 562B — but no GPU/TPU-hours or training cost stated in the main text).

Results

Headline: PaLM-E-562B sets a new SOTA on OK-VQA (66.1 val), beating models finetuned specifically for OK-VQA, without task-specific finetuning — while being a single model that also controls real robots. Co-training on the full mixture roughly doubles embodied performance vs. training on in-domain robot data alone (positive transfer, Fig 3).

TAMP planning (Table 1, trained on only 1% of the data, i.e. 320 examples per task): a frozen pretrained 62B+ LLM with state inputs reaches planning success in the high 90s, and OSRT is the most data-efficient input representation. Selected planning success (mean of p₁ and p₂) for PaLM-E-12B (Fig 4), showing the transfer and LLM-pretraining effects:

PaLM-E-12B variant (TAMP, 1% data)	Planning success
LLM frozen, single robot	31.8%
without pretraining (LLM+ViT from scratch)	42.9%
LLM finetune, single robot	48.6%
LLM frozen, full mixture	74.3%
LLM finetune, full mixture	94.9%
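
To make the size of the transfer effect concrete, the arithmetic below compares the full-mixture and single-robot rows of the table. It uses only the four numbers listed above; the function name is an illustrative choice.

```python
# Planning success (%) from the table above, keyed by (LLM setting, training data).
success = {
    ("frozen", "single robot"): 31.8,
    ("frozen", "full mixture"): 74.3,
    ("finetune", "single robot"): 48.6,
    ("finetune", "full mixture"): 94.9,
}


def mixture_gain(llm_setting):
    """Absolute gain (points) and ratio of full-mixture over single-robot training."""
    single = success[(llm_setting, "single robot")]
    full = success[(llm_setting, "full mixture")]
    return full - single, full / single


for setting in ("frozen", "finetune"):
    points, ratio = mixture_gain(setting)
    print(f"{setting}: +{points:.1f} points, x{ratio:.2f}")
# frozen: +42.5 points, x2.34
# finetune: +46.3 points, x1.95
```

In these rows the frozen-LLM variant more than doubles, while the finetuned variant lands just under a factor of two.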

Where it wins: full-mixture co-training roughly doubles ViT-4B TAMP planning success vs. single-robot training (31.8% → 74.3% with the LLM frozen, 48.6% → 94.9% with it finetuned); OSRT (3D-aware, no large-scale pretraining) gives the best input encoding; PaLM-E generalizes one-shot and zero-shot on real Language-Table tasks (novel object pairs, an unseen toy turtle) and survives adversarial human disturbance in the kitchen long-horizon task (Fig 5). On mobile manipulation, PaLM-E-12B (full mixture) reaches 0.91 F1 on both failure detection and affordance prediction, beating CLIP-FT (0.65 failure detection), CLIP-FT-hindsight (0.89 failure detection), and QT-OPT (0.63 affordance prediction) (Table 4).

Where it loses / costs: the zero-shot PaLI and SayCan baselines score 0.0 on the harder planning tasks — a clean demonstration that off-the-shelf VLMs and affordance-only LLMs fail, but it also means the comparison lacks a competitive non-trivial baseline on those tasks. Scaling helps language retention: the smallest PaLM-E-12B loses 87.3% of its NLG performance (relative) to catastrophic forgetting under multimodal training, while PaLM-E-562B loses only 3.9% (Fig 6) — i.e. small models pay a steep generalist tax. The frozen-LLM route (train only encoders) sometimes struggled on robotics tasks vs. full end-to-end training (Table 2).

Limitations & open questions

From the authors:

Two paths to retain language ability — freeze the LLM, or train end-to-end and rely on scale — and the frozen route "occasionally struggled for robotics tasks." No single clean recipe dominates.
Catastrophic forgetting is only mitigated by scale, not solved; smaller models forget heavily.
PaLM-E depends on a fixed, small library of pre-trained low-level skill policies; it sequences them but does not learn or generate the low-level control itself.

What I noticed reading it:

All injected modalities here are visual or symbolic state (ViT images, OSRT slots, pose/color state vectors). No genuinely non-visual continuous sensor (force, tactile, audio) is ever fed in — the architecture claims generality over "arbitrary modalities" but the evidence is vision + state only. The recipe is the contribution; the non-visual demonstration is left to successors.
Object-centric variants (object-ViT, entity referrals) depend on ground-truth instance masks at input time in several experiments — a strong supervision assumption that isn't always available on a real robot.
The most striking embodied numbers (TAMP 94.9%) come from 1%-data / 320-example regimes with only 2 planning tasks (p₁, p₂) averaged; the strong transfer story rests on a fairly narrow task slice.
Compute cost is undisclosed despite 562B parameters — reproducibility and the practical cost of "just scale the LLM to fix forgetting" are unaddressed.
Low-level skills are assumed to succeed when invoked; failure attribution between PaLM-E's plan and the underlying policy is not separated out in the planning metrics.

Why I care

This is the method anchor for the entire "inject continuous sensor tokens into an LLM" recipe that the 2026-06-24 multimodal-sensing batch is organized around. The structural move — encode a non-text modality into a sequence of vectors the same width as word-token embeddings, interleave them with text, train end-to-end — is exactly the template a robot learner would reuse to add touch, force, or audio to a language model. PaLM-E only ever demonstrates it with vision and state vectors, but φ_tactile or φ_force drop into the same slot as φ_ViT with no architectural change. This matters directly for the thesis underlying my work: many manipulation predicates — is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight — are not visually evaluable; they live in touch/force/sound. PaLM-E gives the clean structural answer for how a language model could ever read those signals, and the touch-language and tactile-VLA papers in this batch (FuSe, Tactile-VLA, OmniVTLA, UniTouch) are essentially PaLM-E's recipe instantiated for non-visual sensors.

Relative to BLADE: PaLM-E is the canonical "scale-everything / structure-implicit" counterpoint to BLADE's "explicit symbolic abstraction" pole of the structure-vs-scale debate. Both produce a high-level plan that sequences low-level policies and replan after each step — PaLM-E's control loop and BLADE's bi-level replanning are strikingly parallel — but PaLM-E's plan is free-form generated text grounded only by entity-referral tokens, whereas BLADE's plan is a typed PDDL operator sequence with learned precondition/effect predicates. PaLM-E has no notion of an explicit predicate that could be grounded in a force sensor; BLADE has the predicate slot but (currently) only visual classifiers. The synthesis I care about — BLADE-style learned predicates whose classifiers read PaLM-E-style injected tactile/force tokens — is exactly the gap between these two anchors.

Quotable

We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. — Abstract / p.1

The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model. — §3 / p.3

Each vector x_i in the prefix is formed from either the word token embedder γ or an encoder φ_i … the observation embeddings are not inserted at fixed positions, but instead placed dynamically within the surrounding text. — §3, Multi-modal sentences / p.4

Papers cited that should likely be ingested next:

SayCan (Ahn et al. 2022) — the text-only LLM + affordance baseline PaLM-E is built against; the mobile-manipulation domain and low-level-skill-sequencing setup come from here.
RT-1 (Brohan et al. 2022) — the low-level policy PaLM-E sequences in the kitchen domain; the action-execution layer under the plans.
Gato (Reed et al. 2022) — the closest generalist multi-embodiment agent; PaLM-E's explicit contrast for the transfer claim.
Flamingo (Alayrac et al. 2022) — the cross-attention VLM PaLM-E contrasts its token-injection design against; VQA baseline.
OSRT (Sajjadi et al. 2022) — the 3D object-centric scene representation that gives PaLM-E its best data-efficiency; key input encoder.
PaLM (Chowdhery et al. 2022) — the underlying LLM; foundational dependency.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

ImageBind and LanguageBind — the multimodal-binding siblings in Cluster H; where PaLM-E injects modalities into one LLM's embedding space, these bind many modalities to a shared embedding space — two routes to the same "put touch/audio next to language" goal.
Meta-Transformer and Any2Policy — the other Cluster H anchors generalizing "unify arbitrary modalities in one backbone"; Any2Policy carries the recipe all the way to a visuomotor policy.
FuSe, Tactile-VLA, and OmniVTLA — the tactile/force VLA policies (Cluster C) that instantiate PaLM-E's inject-sensor-tokens recipe for genuinely non-visual modalities; the successors that fill PaLM-E's vision-only evidence gap.
UniTouch — binds touch into a vision-language embedding space; the touch-side analogue of PaLM-E's "make the modality look like language tokens" move.