PaLM-E: An Embodied Multimodal Language Model

Driess, Xia, Sajjadi, Lynch, Chowdhery, Ichter, Wahid, Tompson, Vuong, Yu, Huang, Chebotar, Sermanet, Duckworth, Levine, Vanhoucke, Hausman, Toussaint, Greff, Zeng, Mordatch, Florence (Robotics at Google, TU Berlin, Google Research) · 2023 · ICML 2023 · arXiv:2303.03378 · PDF · project page

One-liner. PaLM-E injects continuous sensor observations (images, 3D scene representations, robot state vectors) directly into the token embedding space of a pretrained LLM, which is either kept frozen or finetuned, turning it into a single embodied generalist that outputs natural-language plans for a robot control loop. The recipe (encode a non-text modality into vectors the same width as word-token embeddings, then interleave them with text) is the structural template for adding any sensor, such as touch, force, or audio, to a language model.

Problem & motivation

LLMs carry vast world knowledge and strong reasoning, but for real-world embodied tasks they are ungrounded: trained on text alone, their representations don't connect to the robot's physical percepts. Prior LLM-for-planning work like SayCan (Ahn et al., cited as [3]) interfaces an LLM with learned affordance functions but feeds the LLM only text, which is insufficient when the geometric configuration of the scene matters. The authors further show that out-of-the-box SOTA vision-language models (PaLI) trained on typical VQA-style data cannot directly solve robotic reasoning tasks. The goal: a single model that ingests continuous multimodal observations and produces grounded sequential decisions, while remaining a competent vision-language and language generalist.

Method

The central architectural idea (Sec 3, Fig 1): inject continuous embodied observations into the language embedding space of a pretrained decoder-only LLM, forming multimodal sentences — sequences of tokens where some positions are word-token embeddings and others are encoder outputs from arbitrary modalities, interleaved freely with text.

Multimodal sentences / token injection. A normal LLM maps a text token w_i to a word-embedding vector x_i = γ(w_i) ∈ R^k via the embedding matrix. PaLM-E replaces selected token positions with encoder outputs: each continuous observation O_j is mapped by an encoder φ_j: O → X^q into a sequence of q vectors of the same dimension k as word embeddings, which are then interleaved into the prefix (Eq. 3). One observation usually becomes multiple embedding vectors, and different encoders can be mixed at different positions. Crucially, the observation embeddings are not inserted at fixed positions (contrast with cross-attention VLMs like Flamingo) but placed dynamically within the surrounding text, reusing the LLM's existing positional encodings. The whole thing trains end-to-end with a cross-entropy loss over the non-prefix (target) text tokens; special text tokens in the input get replaced by the encoder embeddings at those locations.
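The interleaving and the target-only loss mask can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: the width K, the vocabulary size, the PLACEHOLDER id, and both function names are hypothetical, and a random matrix stands in for the word embedder γ and for the encoder output.

```python
import numpy as np

K = 8             # embedding width k (toy value, assumption)
VOCAB = 50        # hypothetical vocabulary size
PLACEHOLDER = -1  # hypothetical id meaning "insert the next observation here"

rng = np.random.default_rng(0)
word_embedding = rng.normal(size=(VOCAB, K))  # stands in for gamma


def build_multimodal_sentence(token_ids, observations):
    """Interleave word embeddings with encoder outputs.

    token_ids: list of ints; each PLACEHOLDER consumes the next observation.
    observations: list of arrays, each of shape (q_j, K).
    Returns (embeddings of shape [n, K], boolean is_text mask of shape [n]).
    """
    rows, is_text = [], []
    obs_iter = iter(observations)
    for tok in token_ids:
        if tok == PLACEHOLDER:
            obs = next(obs_iter)
            assert obs.shape[1] == K, "encoder output must match LLM width"
            rows.extend(obs)                    # one observation -> q_j vectors
            is_text.extend([False] * len(obs))
        else:
            rows.append(word_embedding[tok])
            is_text.append(True)
    return np.stack(rows), np.array(is_text)


def target_loss_mask(is_text, prefix_len):
    """True only at text positions after the prefix, where the loss applies."""
    positions = np.arange(len(is_text))
    return is_text & (positions >= prefix_len)


image_tokens = rng.normal(size=(4, K))  # one image encoded as q = 4 vectors
ids = [3, 7, PLACEHOLDER, 11, 12, 13]   # text, text, <image>, text, text, text
x, is_text = build_multimodal_sentence(ids, [image_tokens])
mask = target_loss_mask(is_text, prefix_len=7)
print(x.shape, mask.astype(int))  # (9, 8) [0 0 0 0 0 0 0 1 1]
```

The placeholder can sit anywhere in the token list, which is the point of the "placed dynamically" remark: position is decided by the surrounding text, not by the architecture.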

Embodying the output: robot control loop. PaLM-E is a text-generating decoder-only model. For embodied question answering / scene description, the text output is the answer. For embodied planning/control, PaLM-E generates text decisions that are interpreted as a sequence of low-level skills from a (small) pre-existing skill vocabulary; each skill is executed by a separately-trained low-level policy (e.g. RT-1, or the Lynch et al. Language-Table policies). PaLM-E thus acts as a high-level policy that sequences low-level policies, and because execution yields new observations, it can replan at each step, giving closed-loop control robust to disturbances.
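The control loop described above can be sketched as follows. Everything here is a stand-in: run_closed_loop, the toy planner, and the toy executor are hypothetical names, the planner is a hand-written rule rather than a language model, and the "disturbance" is simulated by making the first grasp fail.

```python
from typing import Callable, List


def run_closed_loop(
    plan_next_skill: Callable[[str, dict, List[str]], str],
    execute_skill: Callable[[str], dict],
    initial_observation: dict,
    instruction: str,
    skill_vocabulary: List[str],
    max_steps: int = 10,
    done_token: str = "done",
) -> List[str]:
    """Ask the high-level policy for one skill at a time, execute, re-observe."""
    observation = initial_observation
    history: List[str] = []
    for _ in range(max_steps):
        skill = plan_next_skill(instruction, observation, history)
        if skill == done_token:
            break
        if skill not in skill_vocabulary:
            raise ValueError(f"unknown skill: {skill!r}")
        observation = execute_skill(skill)  # low-level policy runs here
        history.append(skill)
    return history


# Toy world: the first grasp is disturbed, so the planner must re-issue it.
world = {"holding": False, "in_drawer": False, "pick_attempts": 0}


def toy_planner(instruction, obs, history):
    if obs["in_drawer"]:
        return "done"
    return "place chips in drawer" if obs["holding"] else "pick chips"


def toy_executor(skill):
    if skill == "pick chips":
        world["pick_attempts"] += 1
        world["holding"] = world["pick_attempts"] > 1  # first attempt fails
    elif skill == "place chips in drawer":
        world["holding"], world["in_drawer"] = False, True
    return dict(world)


steps = run_closed_loop(
    toy_planner,
    toy_executor,
    dict(world),
    "put the chips in the drawer",
    ["pick chips", "place chips in drawer"],
)
print(steps)  # ['pick chips', 'pick chips', 'place chips in drawer']
```

Because the planner sees a fresh observation at every step, the failed grasp needs no special handling: the repeated "pick chips" falls out of replanning.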

Input & scene representations (Sec 4). The paper studies several encoders mapping a modality into the language embedding space: (i) state-estimation vectors: a robot/scene state vector s ∈ R^S (poses, sizes, colors) mapped by an MLP φ_state; (ii) ViT: a Vision Transformer (ViT-4B from Chen et al., or ViT-22B) producing token embeddings, projected to width k by a learned affine map ψ; (iii) ViT token learner (ViT+TL) trained from scratch; (iv) object-centric ViT: given ground-truth instance masks, decompose ViT features per object; and (v) Object Scene Representation Transformer (OSRT): an unsupervised 3D-aware object-centric scene representation (slots o_j) projected per-slot into multiple embeddings by an MLP. Entity referrals: for object-centric encoders the prompt labels each object's tokens as Object j is <obj_j>, so PaLM-E can reference objects by special obj_j tokens in its generated plan and the low-level policies operate on those tokens.
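A minimal sketch of two of these encoders and of the entity-referral prompt, assuming toy dimensions and randomly initialised weights. The names affine_projector, state_mlp, and entity_referral_prompt are illustrative, the hidden size of 32 is arbitrary, and emitting a single embedding per state vector is an assumption rather than something the summary above specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 16        # LLM embedding width k (toy value)
VIT_DIM = 12  # hypothetical ViT token width
S = 6         # hypothetical state-vector size


def affine_projector(vit_tokens, weight, bias):
    """psi: map ViT tokens of shape [m, VIT_DIM] to LLM width [m, K]."""
    return vit_tokens @ weight + bias


def state_mlp(state, w1, b1, w2, b2):
    """phi_state: one-hidden-layer MLP from s in R^S to one [1, K] embedding."""
    hidden = np.maximum(state @ w1 + b1, 0.0)  # ReLU
    return (hidden @ w2 + b2)[None, :]


def entity_referral_prompt(num_objects):
    """Build the 'Object j is <obj_j>.' prefix for object-centric encoders."""
    return " ".join(f"Object {j} is <obj_{j}>." for j in range(1, num_objects + 1))


vit_tokens = rng.normal(size=(5, VIT_DIM))
img_emb = affine_projector(vit_tokens, rng.normal(size=(VIT_DIM, K)), np.zeros(K))
state_emb = state_mlp(
    rng.normal(size=S),
    rng.normal(size=(S, 32)), np.zeros(32),
    rng.normal(size=(32, K)), np.zeros(K),
)
print(img_emb.shape, state_emb.shape)  # (5, 16) (1, 16)
print(entity_referral_prompt(2))       # Object 1 is <obj_1>. Object 2 is <obj_2>.
```

Both encoders end in the same width K, which is the only contract the LLM imposes; any other sensor encoder would need to satisfy the same shape constraint.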

Training recipes (Sec 5). PaLM-E is built on the pretrained 8B / 62B / 540B PaLM as the decoder-only LLM, plus a 4B or 22B ViT. Naming: 8B LLM + 4B ViT = PaLM-E-12B; 62B + 22B = PaLM-E-84B; 540B + 22B = PaLM-E-562B (the largest VLM reported at the time). The three components are an encoder φ, a projector ψ, and the LLM p_LM; the paper ablates freezing the LLM (train only encoder+projector, a form of input-conditioned soft-prompting) vs. finetuning it end-to-end. Co-training across tasks is the key experimental knob: the "full mixture" is mostly diverse internet-scale vision-language and language data, with only 8.9% embodied data, to test whether transfer from general VL domains boosts embodied performance.
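The naming arithmetic and the frozen-vs-finetuned ablation can be written down as a small configuration object. The class PalmEConfig and its field names are hypothetical bookkeeping for this note, not anything from the paper's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PalmEConfig:
    llm_params_b: int  # PaLM size in billions of parameters
    vit_params_b: int  # ViT size in billions of parameters
    freeze_llm: bool   # True: train only encoder + projector

    @property
    def name(self) -> str:
        # The model name is the sum of LLM and ViT parameter counts.
        return f"PaLM-E-{self.llm_params_b + self.vit_params_b}B"

    def trainable_components(self) -> List[str]:
        parts = ["encoder", "projector"]
        if not self.freeze_llm:
            parts.append("llm")
        return parts


configs = [
    PalmEConfig(8, 4, freeze_llm=True),
    PalmEConfig(62, 22, freeze_llm=False),
    PalmEConfig(540, 22, freeze_llm=False),
]
for cfg in configs:
    print(cfg.name, cfg.trainable_components())
# PaLM-E-12B ['encoder', 'projector']
# PaLM-E-84B ['encoder', 'projector', 'llm']
# PaLM-E-562B ['encoder', 'projector', 'llm']
```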

Setup

Results

Headline: PaLM-E-562B sets a new SOTA on OK-VQA (66.1 val), beating models finetuned specifically for OK-VQA, without task-specific finetuning — while being a single model that also controls real robots. Co-training on the full mixture roughly doubles embodied performance vs. training on in-domain robot data alone (positive transfer, Fig 3).

TAMP planning (Table 1, trained on only 1% of the data, i.e. 320 examples per task): a frozen pretrained LLM at 62B or larger with state inputs reaches planning success in the high 90s, and OSRT is the most data-efficient input representation. Selected planning success (mean of p1 and p2) for PaLM-E-12B (Fig 4), showing the effect of transfer and of LLM pretraining:

PaLM-E-12B variant (TAMP, 1% data) | Planning success
--- | ---
LLM frozen, single robot | 31.8%
without pretraining (LLM+ViT from scratch) | 42.9%
LLM finetune, single robot | 48.6%
LLM frozen, full mixture | 74.3%
LLM finetune, full mixture | 94.9%
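As a quick check on the size of the transfer effect, the ratio of full-mixture to single-robot success can be computed directly from the figures above. The dictionary layout and the helper name mixture_gain are just for this note.

```python
# Planning success (%) for PaLM-E-12B on TAMP with 1% data, from the table above.
success = {
    ("frozen", "single robot"): 31.8,
    ("frozen", "full mixture"): 74.3,
    ("finetune", "single robot"): 48.6,
    ("finetune", "full mixture"): 94.9,
}


def mixture_gain(llm_mode):
    """Ratio of full-mixture to single-robot success for one LLM setting."""
    return success[(llm_mode, "full mixture")] / success[(llm_mode, "single robot")]


for mode in ("frozen", "finetune"):
    print(f"{mode}: x{mixture_gain(mode):.2f}")
# frozen: x2.34
# finetune: x1.95
```

So the gain is a bit more than 2x with the LLM frozen and just under 2x with finetuning, which is consistent with the "roughly doubles" headline.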

Where it wins: full-mixture co-training lifts PaLM-E-12B (ViT-4B) TAMP planning success from 31.8% to 74.3% with the LLM frozen and from 48.6% to 94.9% with the LLM finetuned, roughly doubling the single-robot result in both cases; OSRT (3D-aware, no large-scale pretraining) gives the best input encoding; PaLM-E generalizes one-shot and zero-shot on real Language-Table tasks (novel object pairs, an unseen toy turtle) and survives adversarial human disturbance in the kitchen long-horizon task (Fig 5). On mobile manipulation, PaLM-E-12B (full mixture) reaches 0.91 F1 on both failure detection and affordance prediction, beating CLIP-FT (0.65 on failure detection), CLIP-FT-hindsight (0.89 on failure detection), and QT-OPT (0.63 on affordance prediction) (Table 4).

Where it loses / costs: the zero-shot PaLI and SayCan baselines score 0.0 on the harder planning tasks — a clean demonstration that off-the-shelf VLMs and affordance-only LLMs fail, but it also means the comparison lacks a competitive non-trivial baseline on those tasks. Scaling helps language retention: the smallest PaLM-E-12B loses 87.3% of its NLG performance (relative) to catastrophic forgetting under multimodal training, while PaLM-E-562B loses only 3.9% (Fig 6) — i.e. small models pay a steep generalist tax. The frozen-LLM route (train only encoders) sometimes struggled on robotics tasks vs. full end-to-end training (Table 2).
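To make the "relative" loss figures concrete, the snippet below converts each reported relative degradation into the fraction of the original NLG score that survives multimodal training. It uses only the two percentages quoted above; the helper name retained_fraction is hypothetical, and no absolute benchmark scores are assumed.

```python
def retained_fraction(relative_loss_pct):
    """Fraction of the original score kept after a relative loss in percent."""
    if not 0.0 <= relative_loss_pct <= 100.0:
        raise ValueError("relative loss must be between 0 and 100 percent")
    return 1.0 - relative_loss_pct / 100.0


reported_relative_loss = {"PaLM-E-12B": 87.3, "PaLM-E-562B": 3.9}
for model, loss in reported_relative_loss.items():
    print(f"{model}: keeps {retained_fraction(loss):.1%} of its original NLG score")
# PaLM-E-12B: keeps 12.7% of its original NLG score
# PaLM-E-562B: keeps 96.1% of its original NLG score
```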

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is the method anchor for the entire "inject continuous sensor tokens into an LLM" recipe that the 2026-06-24 multimodal-sensing batch is organized around. The structural move (encode a non-text modality into a sequence of vectors the same width as word-token embeddings, interleave them with text, train end-to-end) is exactly the template a robot learner would reuse to add touch, force, or audio to a language model. PaLM-E only ever demonstrates it with vision and state vectors, but in principle a φ_tactile or φ_force would drop into the same slot as φ_ViT with no architectural change. This matters directly for the thesis underlying my work: many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) are not visually evaluable; they live in touch/force/sound. PaLM-E gives the clean structural answer for how a language model could ever read those signals, and the touch-language and tactile-VLA papers in this batch (FuSe, Tactile-VLA, OmniVTLA, UniTouch) are essentially PaLM-E's recipe instantiated for non-visual sensors.

Relative to BLADE: PaLM-E is the canonical "scale-everything / structure-implicit" counterpoint to BLADE's "explicit symbolic abstraction" pole of the structure-vs-scale debate. Both produce a high-level plan that sequences low-level policies and replan after each step — PaLM-E's control loop and BLADE's bi-level replanning are strikingly parallel — but PaLM-E's plan is free-form generated text grounded only by entity-referral tokens, whereas BLADE's plan is a typed PDDL operator sequence with learned precondition/effect predicates. PaLM-E has no notion of an explicit predicate that could be grounded in a force sensor; BLADE has the predicate slot but (currently) only visual classifiers. The synthesis I care about — BLADE-style learned predicates whose classifiers read PaLM-E-style injected tactile/force tokens — is exactly the gap between these two anchors.

Quotable

We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. — Abstract / p.1
The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model. — §3 / p.3
Each vector xi in the prefix is formed from either the word token embedder γ or an encoder φi … the observation embeddings are not inserted at fixed positions, but instead placed dynamically within the surrounding text. — §3, Multi-modal sentences / p.4

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: