One-liner. PaLM-E injects continuous sensor observations (images, 3D scene representations, robot state vectors) directly into the token embedding space of a frozen-or-finetuned pretrained LLM, turning the LLM into a single embodied generalist that outputs natural-language plans for a robot control loop — and the recipe for "encode a non-text modality into vectors the same width as word tokens, then interleave them with text" is the structural template for adding any sensor (touch, force, audio) to a language model.
LLMs carry vast world knowledge and strong reasoning, but for real-world embodied tasks they are ungrounded: trained on text alone, their representations don't connect to the robot's physical percepts. Prior LLM-for-planning work like SayCan (Ahn et al., cited as [3]) interfaces an LLM with learned affordance functions but feeds the LLM only text, which is insufficient when the geometric configuration of the scene matters. The authors further show that out-of-the-box SOTA vision-language models (PaLI) trained on typical VQA-style data cannot directly solve robotic reasoning tasks. The goal: a single model that ingests continuous multimodal observations and produces grounded sequential decisions, while remaining a competent vision-language and language generalist.
The central architectural idea (Sec 3, Fig 1): inject continuous embodied observations into the language embedding space of a pretrained decoder-only LLM, forming multimodal sentences — sequences of tokens where some positions are word-token embeddings and others are encoder outputs from arbitrary modalities, interleaved freely with text.
Multimodal sentences / token injection. A normal LLM maps a
text token $w_i$ to a word-embedding vector
$x_i = \gamma(w_i) \in \mathbb{R}^k$ via the
embedding matrix. PaLM-E replaces selected token positions with encoder outputs:
each continuous observation $O_j$ is mapped by an encoder
$\phi_j : \mathcal{O} \to \mathcal{X}^q$ into a sequence of
$q$ vectors of the same dimension $k$ as word embeddings, which
are then interleaved into the prefix (Eq. 3). One observation usually becomes
multiple embedding vectors, and different encoders can be mixed at different
positions. Crucially, the observation embeddings are not inserted at
fixed positions (contrast with cross-attention VLMs like Flamingo) but placed
dynamically within the surrounding text, reusing the LLM's existing positional
encodings. The whole thing trains end-to-end with a cross-entropy loss over the
non-prefix (target) text tokens; special text tokens in the input get replaced by
the encoder embeddings at those locations.
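The injection step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: `embed_word`, `encode_image`, `build_prefix`, the `<img>` placeholder, and the toy width `K = 4` are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
K = 4  # embedding width k; tiny here, purely illustrative


def embed_word(token: str) -> Vector:
    """Toy stand-in for the word embedder gamma (hypothetical, not PaLM's)."""
    seed = sum(ord(c) for c in token)
    return [float((seed * (i + 1)) % 7) for i in range(K)]


def encode_image(obs: Sequence[float]) -> List[Vector]:
    """Toy stand-in for an encoder phi: one observation -> q = 2 vectors of width K."""
    mean = sum(obs) / len(obs)
    return [[mean] * K, [max(obs)] * K]


def build_prefix(
    tokens: List[str],
    observations: Dict[str, Sequence[float]],
    encoder: Callable[[Sequence[float]], List[Vector]],
) -> List[Vector]:
    """Replace each placeholder token with its encoder outputs, in place in the text."""
    prefix: List[Vector] = []
    for tok in tokens:
        if tok in observations:
            vectors = encoder(observations[tok])
            if any(len(v) != K for v in vectors):
                raise ValueError("encoder output must match the word-embedding width")
            prefix.extend(vectors)  # one observation becomes several positions
        else:
            prefix.append(embed_word(tok))
    return prefix


tokens = ["What", "is", "in", "<img>", "?"]
prefix = build_prefix(tokens, {"<img>": [0.1, 0.5, 0.9]}, encode_image)
print(len(prefix))  # 6: four word vectors plus two image vectors
```

The only contract the encoder has to satisfy is the output width, which is why the placeholder can sit anywhere in the text.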
Embodying the output: robot control loop. PaLM-E is a text-generating decoder-only model. For embodied question answering / scene description, the text output is the answer. For embodied planning/control, PaLM-E generates text decisions that are interpreted as a sequence of low-level skills from a (small) pre-existing skill vocabulary; each skill is executed by a separately-trained low-level policy (e.g. RT-1, or the Lynch et al. Language-Table policies). PaLM-E thus acts as a high-level policy that sequences low-level policies, and because execution yields new observations, it can replan at each step, giving closed-loop control robust to disturbances.
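A rough sketch of that closed loop, assuming a toy environment and a rule-based stand-in for the planner. `run_episode`, `ToyKitchen`, `toy_planner`, and the skill strings are hypothetical; in the real system the planner is PaLM-E and each skill is executed by a learned low-level policy.

```python
from typing import Callable, Dict, List


def run_episode(
    plan_next_skill: Callable[[str, Dict[str, bool], List[str]], str],
    execute_skill: Callable[[str], None],
    observe: Callable[[], Dict[str, bool]],
    instruction: str,
    max_steps: int = 10,
) -> List[str]:
    """High-level loop: pick one skill, execute it, re-observe, replan."""
    history: List[str] = []
    obs = observe()
    for _ in range(max_steps):
        skill = plan_next_skill(instruction, obs, history)
        if skill == "done":
            break
        execute_skill(skill)
        history.append(skill)
        obs = observe()  # fresh observation is what makes the loop closed
    return history


class ToyKitchen:
    """Hypothetical environment; the first grasp is knocked away once."""

    def __init__(self) -> None:
        self.state = {"drawer_open": False, "holding": False, "delivered": False}
        self.disturbed = False

    def observe(self) -> Dict[str, bool]:
        return dict(self.state)

    def execute(self, skill: str) -> None:
        if skill == "open the drawer":
            self.state["drawer_open"] = True
        elif skill == "pick the chips":
            if not self.disturbed:
                self.disturbed = True  # simulated human disturbance
            else:
                self.state["holding"] = True
        elif skill == "bring the chips":
            self.state["delivered"] = self.state["holding"]


def toy_planner(instruction: str, obs: Dict[str, bool], history: List[str]) -> str:
    """Rule-based stand-in for the text-generating high-level policy."""
    if obs["delivered"]:
        return "done"
    if not obs["drawer_open"]:
        return "open the drawer"
    if not obs["holding"]:
        return "pick the chips"
    return "bring the chips"


env = ToyKitchen()
history = run_episode(toy_planner, env.execute, env.observe, "bring me the chips")
print(history)
```

Because the planner sees the post-disturbance observation, it simply issues the pick skill a second time instead of continuing a stale plan.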
Input & scene representations (Sec 4). The paper studies
several encoders mapping a modality into the language embedding space:
(i) state-estimation vectors: a robot/scene state vector
$s \in \mathbb{R}^S$ (poses, sizes, colors) mapped by an MLP
$\phi_{\text{state}}$; (ii) ViT: a Vision Transformer
(ViT-4B from Chen et al., or ViT-22B) producing token embeddings, projected to
width $k$ by a learned affine map $\psi$; (iii) ViT
token learner (ViT+TL) trained from scratch; (iv) object-centric ViT:
given ground-truth instance masks, decompose ViT features per object; and
(v) Object Scene Representation Transformer (OSRT): an
unsupervised 3D-aware object-centric scene representation (slots
$o_j$) projected per-slot into multiple embeddings by an MLP.
Entity referrals: for object-centric encoders the prompt labels
each object's tokens as Object j is <obj_j>, so PaLM-E can
reference objects by special obj_j tokens in its generated plan and
the low-level policies operate on those tokens.
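The entity-referral convention can be illustrated with a small text-only sketch. The helper names `referral_prompt` and `referenced_objects` and the example plan string are hypothetical; in the model itself the `<obj_j>` placeholders in the prompt are replaced by that object's embeddings rather than kept as text.

```python
import re
from typing import List


def referral_prompt(num_objects: int) -> str:
    """Build the 'Object j is <obj_j>.' preamble for j = 1..num_objects."""
    return " ".join(f"Object {j} is <obj_{j}>." for j in range(1, num_objects + 1))


def referenced_objects(plan: str) -> List[int]:
    """Extract the object indices a generated plan refers to, in order."""
    return [int(j) for j in re.findall(r"<obj_(\d+)>", plan)]


prompt = referral_prompt(3)
plan = "Push <obj_2> to the left of <obj_3>."  # made-up plan text for illustration
print(prompt)
print(referenced_objects(plan))  # [2, 3]
```

A low-level policy can then look up the referenced indices against the same object ordering the encoder used.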
Training recipes (Sec 5). PaLM-E is built on the pretrained
8B / 62B / 540B PaLM as the decoder-only LLM, plus a 4B or 22B ViT. Naming:
8B LLM + 4B ViT = PaLM-E-12B; 62B + 22B = PaLM-E-84B; 540B + 22B = PaLM-E-562B
(the largest VLM reported at the time). The three components are an encoder
$\phi$, a projector $\psi$, and the LLM
$p_{LM}$; the paper ablates freezing the LLM (train
only encoder+projector, a form of input-conditioned soft-prompting) vs.
finetuning it end-to-end. Co-training across tasks is the key
experimental knob: the "full mixture" is mostly diverse internet-scale
vision-language and language data, with only 8.9% embodied data,
to test whether transfer from general VL domains boosts embodied performance.
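As a quick check of the naming arithmetic and of what the freezing ablation changes, here is a small sketch. The function names and the component labels `"encoder"`, `"projector"`, `"llm"` are placeholders for this note, not identifiers from the paper's code.

```python
from typing import Set


def palm_e_name(llm_billions: int, vit_billions: int) -> str:
    """Model name is the summed parameter count of the LLM and the ViT."""
    return f"PaLM-E-{llm_billions + vit_billions}B"


def trainable_components(freeze_llm: bool) -> Set[str]:
    """Frozen-LLM variant trains only encoder + projector; otherwise all three."""
    parts = {"encoder", "projector"}
    if not freeze_llm:
        parts.add("llm")
    return parts


for llm_b, vit_b in [(8, 4), (62, 22), (540, 22)]:
    print(palm_e_name(llm_b, vit_b))

print(sorted(trainable_components(freeze_llm=True)))
print(sorted(trainable_components(freeze_llm=False)))
```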
Headline: PaLM-E-562B sets a new SOTA on OK-VQA (66.1 val), beating models finetuned specifically for OK-VQA, without task-specific finetuning — while being a single model that also controls real robots. Co-training on the full mixture roughly doubles embodied performance vs. training on in-domain robot data alone (positive transfer, Fig 3).
TAMP planning (Table 1, trained on only 1% of the data, i.e. 320 examples per task): a frozen pretrained 62B+ LLM with state inputs reaches planning success in the high 90s, and OSRT is the most data-efficient input representation. Selected planning success (mean of $p_1$ and $p_2$) for PaLM-E-12B (Fig 4), showing the effect of transfer and of LLM pretraining:
| PaLM-E-12B variant (TAMP, 1% data) | Planning success |
|---|---|
| LLM frozen, single robot | 31.8% |
| without pretraining (LLM+ViT from scratch) | 42.9% |
| LLM finetune, single robot | 48.6% |
| LLM frozen, full mixture | 74.3% |
| LLM finetune, full mixture | 94.9% |
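To make the size of the transfer effect concrete, the snippet below computes the absolute and relative gain from single-robot to full-mixture training, using only the four numbers in the table above. The variable names are mine.

```python
# Planning success (%) copied from the table above.
results = {
    ("frozen", "single robot"): 31.8,
    ("frozen", "full mixture"): 74.3,
    ("finetune", "single robot"): 48.6,
    ("finetune", "full mixture"): 94.9,
}

ratio = {}
for mode in ("frozen", "finetune"):
    single = results[(mode, "single robot")]
    full = results[(mode, "full mixture")]
    ratio[mode] = full / single
    print(f"{mode}: +{full - single:.1f} points, x{ratio[mode]:.2f}")

# frozen: +42.5 points, x2.34
# finetune: +46.3 points, x1.95
```

So the frozen-LLM variant more than doubles, while the finetuned variant lands just under 2x because it starts from a higher single-robot baseline.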
Where it wins: full-mixture co-training roughly doubles PaLM-E-12B (ViT-4B) TAMP planning success vs. single-robot training (31.8% to 74.3% with the LLM frozen, 48.6% to 94.9% with it finetuned); OSRT (3D-aware, no large-scale pretraining) gives the best input encoding; PaLM-E generalizes one-shot and zero-shot on real Language-Table tasks (novel object pairs, an unseen toy turtle) and survives adversarial human disturbance in the kitchen long-horizon task (Fig 5). On mobile manipulation, PaLM-E-12B (full mixture) reaches 0.91 F1 on both failure detection and affordance prediction, beating CLIP-FT (0.65), CLIP-FT-hindsight (0.89 on failure detection), and QT-OPT (0.63 on affordance) (Table 4).
Where it loses / costs: the zero-shot PaLI and SayCan baselines score 0.0 on the harder planning tasks — a clean demonstration that off-the-shelf VLMs and affordance-only LLMs fail, but it also means the comparison lacks a competitive non-trivial baseline on those tasks. Scaling helps language retention: the smallest PaLM-E-12B loses 87.3% of its NLG performance (relative) to catastrophic forgetting under multimodal training, while PaLM-E-562B loses only 3.9% (Fig 6) — i.e. small models pay a steep generalist tax. The frozen-LLM route (train only encoders) sometimes struggled on robotics tasks vs. full end-to-end training (Table 2).
What I noticed reading it:
This is the method anchor for the entire "inject continuous
sensor tokens into an LLM" recipe that the 2026-06-24 multimodal-sensing batch is
organized around. The structural move — encode a non-text modality into
a sequence of vectors the same width as word-token embeddings, interleave them
with text, train end-to-end — is exactly the template a robot learner
would reuse to add touch, force, or audio to a language model.
PaLM-E only ever demonstrates it with vision and state vectors, but
$\phi_{\text{tactile}}$ or $\phi_{\text{force}}$ drop
into the same slot as $\phi_{\text{ViT}}$ with no architectural
change. This matters directly for the thesis underlying my work: many manipulation
predicates — is_grasped, is_inserted,
is_full, surface_is_rough, is_screwed_tight
— are not visually evaluable; they live in touch/force/sound. PaLM-E
gives the clean structural answer for how a language model could ever read those
signals, and the touch-language and tactile-VLA papers in this batch (FuSe,
Tactile-VLA, OmniVTLA, UniTouch) are essentially PaLM-E's recipe instantiated for
non-visual sensors.
Relative to BLADE: PaLM-E is the canonical "scale-everything / structure-implicit" counterpoint to BLADE's "explicit symbolic abstraction" pole of the structure-vs-scale debate. Both produce a high-level plan that sequences low-level policies and replan after each step — PaLM-E's control loop and BLADE's bi-level replanning are strikingly parallel — but PaLM-E's plan is free-form generated text grounded only by entity-referral tokens, whereas BLADE's plan is a typed PDDL operator sequence with learned precondition/effect predicates. PaLM-E has no notion of an explicit predicate that could be grounded in a force sensor; BLADE has the predicate slot but (currently) only visual classifiers. The synthesis I care about — BLADE-style learned predicates whose classifiers read PaLM-E-style injected tactile/force tokens — is exactly the gap between these two anchors.
From the authors:
"We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts." (Abstract, p.1)
"The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model." (§3, p.3)
"Each vector $x_i$ in the prefix is formed from either the word token embedder $\gamma$ or an encoder $\phi_i$ … the observation embeddings are not inserted at fixed positions, but instead placed dynamically within the surrounding text." (§3, Multi-modal sentences, p.4)
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: