Any2Policy: Learning Visuomotor Policy with Any-Modality

Yichen Zhu, Zhicai Ou, Feifei Feng, Jian Tang* · Midea Group · NeurIPS 2024 · OpenReview · PDF

One-liner. Any2Policy is an end-to-end manipulation policy that accepts task instructions in any of four modalities (text, speech audio, image end-goal, video demonstration) and observations in any of three (image, video, point cloud) by encoding everything into a shared ImageBind latent space and fusing instruction with observation through a learnable cross-attention "embodied alignment" module; it is trained and evaluated on RoboAny, a new 30-task real-world dataset annotated across all of these modalities.

Problem & motivation

Most robot-learning methods commit to a single modality for task specification (usually text) and a single modality for observation (usually 2D images), treating each sensory channel as a distinct, specialized problem. That limits generalizability: a human can be told a task in speech, shown a goal picture, or given a video demo, and can perceive the world through images and 3D. The paper argues that simultaneous multi-modal learning yields richer, more transferable representations (citing dual-coding and intersensory-redundancy work from cognitive science, [16,17]). Prior multi-modal systems like VIMA and MUTEX engage several instruction modalities but still fix a single observation type. What is missing, and what Any2Policy targets, is a single unified policy that handles arbitrary combinations of instruction and observation modalities, together with a benchmark dataset that makes such evaluation possible in the first place.

Method

The Any-to-Policy ("Any2Policy") architecture has two primary components, shown in Fig 1 and detailed in Fig 2: a set of multi-modal encoders that map every input into a shared latent space, and an embodied alignment module that fuses the instruction token with the observation tokens to condition a policy network. The objective is a multi-task policy π(a | s, o) producing continuous actions; demonstrations d = [(s, o, a)] pair an instruction s, observation o, and action a.

Multimodal encoder. Rather than maintain heterogeneous per-modality backbones, the authors adopt ImageBind [25] as a single unified encoder across all five modalities (text, audio, image, video, point cloud). The encoder is a frozen mapping P(x); only the downstream projection layers are trained, which keeps compute low and the representation modality-agnostic. If instruction and observation share a modality, they share the encoder.

Embodied alignment. Projection layers map each modality's features into a uniform token form. Token counts differ by modality: 81×t tokens for video (t = frames), 256 for point clouds, and a single token each for text and audio. Because self-attention is quadratic in token count, the authors follow RT-1 in applying TokenLearner [105] per visual modality to prune redundant tokens: images and videos drop to 8 tokens each, point clouds to 16. The pruned tokens from co-present observation modalities are amalgamated, with a modality token inserted between each adjacent pair to mark boundaries, plus absolute position embeddings; with all three visual modalities present the total is just 34 (8+8+16+2). A single Transformer block then mixes them.
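The token arithmetic in this paragraph can be checked with a small sketch. The pruned counts (8, 8, 16) and the 81-tokens-per-frame figure come from the notes above; the function names and the "one separator between each adjacent pair" rule are my own reading, not code from the paper.

```python
# Token counts after TokenLearner pruning, as stated in the notes.
PRUNED_TOKENS = {"image": 8, "video": 8, "point_cloud": 16}


def observation_token_count(modalities):
    """Total tokens entering the mixing Transformer block.

    Assumption: one modality (separator) token sits between each adjacent
    pair of modalities, so k modalities contribute k - 1 separators.
    """
    if not modalities:
        return 0
    pruned = sum(PRUNED_TOKENS[m] for m in modalities)
    separators = len(modalities) - 1
    return pruned + separators


def raw_video_tokens(num_frames, tokens_per_frame=81):
    """Unpruned video token count: 81 tokens per frame, per the notes."""
    return tokens_per_frame * num_frames


print(observation_token_count(["image", "video", "point_cloud"]))
print(observation_token_count(["image"]))
print(raw_video_tokens(4))
```

Under this reading, the three-modality case gives 8 + 8 + 16 + 2 = 34, matching the figure quoted above, while even a 4-frame clip would carry 324 tokens before pruning.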

To align instruction with observation, the authors use cross-attention: given an observation sequence P_o and instruction sequence P_s, they project P_o into keys (K), P_s into values (V), and use a fixed set of 16 learnable query embeddings (Q) for the cross-attention, h_i = Softmax(Q_iK_i^T/√d_h)V_i (Eq 1), in the BERT-style QKV framing. This learnable-query design adapts the different modalities for efficient instruction–observation fusion.
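A dependency-free sketch of Eq 1 with a fixed bank of learnable queries may make the shapes concrete. Several things here are assumptions rather than details from the paper: a single attention head, no learned projection matrices (inputs are treated as already projected), random stand-ins for the learned queries, and keys and values of equal sequence length, which the product requires. How the paper reconciles the lengths of observation-derived keys and instruction-derived values is not recorded in these notes.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head Softmax(Q K^T / sqrt(d_h)) V on lists of row vectors."""
    d_h = len(queries[0])
    output, weights = [], []
    for q in queries:
        # Scaled dot product of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_h)
                  for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of value rows under the attention distribution.
        output.append([sum(a * v[j] for a, v in zip(attn, values))
                       for j in range(len(values[0]))])
    return output, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# 16 queries per the notes; D_H and SEQ_LEN are illustrative choices.
NUM_QUERIES, D_H, SEQ_LEN = 16, 32, 34
queries = random_matrix(NUM_QUERIES, D_H, rng)  # stand-in for learned queries
keys = random_matrix(SEQ_LEN, D_H, rng)
values = random_matrix(SEQ_LEN, D_H, rng)

fused, attn = cross_attention(queries, keys, values)
print(len(fused), len(fused[0]))
```

The useful property of this design is that the output always has 16 rows, whatever the input sequence length, so the policy network sees a fixed-size fused representation regardless of which modalities are present.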

Policy network. The policy is conditioned on the observation (and optionally instruction) sequence through a stack of Transformer blocks, each with self-attention, cross-attention, and a feed-forward network, plus residual connections to the rollout trajectory. One history action token is appended (similar to VIMA [59]). The instruction representation is cached to avoid recomputation since it aligns with observation at every step. A final MLP emits continuous actions; the training loss is behavior-cloning MSE. The action space is absolute joint position.
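The behavior-cloning objective mentioned above is a plain mean squared error between predicted and demonstrated actions. A minimal sketch, assuming actions are flat joint-position vectors; the 7-dimensional example values are hypothetical and are not taken from RoboAny.

```python
def bc_mse_loss(predicted, target):
    """Mean squared error between predicted and demonstrated actions.

    Both arguments are batches: lists of equal-length action vectors.
    """
    if not predicted or len(predicted) != len(target):
        raise ValueError("batches must be non-empty and the same size")
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        if len(p_vec) != len(t_vec):
            raise ValueError("action dimensions differ")
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical 7-joint position targets (radians) and policy outputs.
demo = [[0.0, -0.5, 0.0, -2.0, 0.0, 1.5, 0.8],
        [0.1, -0.4, 0.0, -1.9, 0.0, 1.6, 0.8]]
pred = [[0.0, -0.5, 0.0, -2.0, 0.0, 1.5, 0.8],
        [0.1, -0.4, 0.0, -1.9, 0.0, 1.6, 0.6]]
print(bc_mse_loss(pred, demo))
```

Because the action space is absolute joint position, the loss is measured directly in squared joint units; a single joint off by 0.2 rad in one of two samples contributes 0.04 spread over 14 entries.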

Setup

Datasets / benchmarks: RoboAny, a new real-world dataset of n=30 distinct tasks (pick-and-place, sorting by color, "open the drawer", etc.), each with m=30 human teleoperated trajectories, drawn from a set of 70 objects. Every task is annotated with k=5 distinct text instructions (paraphrased via GPT-4) plus speech (Amazon Polly voices), image end-goals, and video demonstrations; depth recordings are converted to point clouds for observation. Simulation benchmarks: Franka Kitchen [92] (text–image) and ManiSkill2 [94] (text–image and text–{image, point cloud}).
Hardware / simulator: Franka real robot with two workstations (Fig 3); data via teleoperation. Simulators: Franka Kitchen, ManiSkill2. Training on A100 GPUs, PyTorch.
Baselines: real-world — modality-specific variants of the same architecture (single instruction–observation pair), Any2Policy without embodied alignment (concatenation ablation), and SOTA robot models VIMA [59], R3M [109], T5 [110]. Simulation — R3M [109], BLIP-2 [22], Embodied-GPT [113].
Compute: A100 GPUs; Franka Kitchen trained 40K steps; AdamW, lr 3e-5 (real-world) / 1e-3 & 3e-4 (Franka Kitchen, ManiSkill2), weight decay 1e-6, cosine schedule with 2% warmup, gradient clip 1.0. Exact GPU count / wall-clock not reported.
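The schedule in the Compute line can be sketched as follows. The notes give only "cosine schedule with 2% warmup"; the linear warmup shape, decay to zero, and the pairing of the 1e-3 rate with the 40K-step Franka Kitchen run are my assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.02):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 40_000  # Franka Kitchen step count from the notes
PEAK_LR = 1e-3        # assumed to be the Franka Kitchen rate

for step in (0, 799, 800, 20_400, 39_999):
    print(step, lr_at_step(step, TOTAL_STEPS, PEAK_LR))
```

With these numbers the warmup covers the first 800 steps, after which the rate falls along a half cosine and is close to zero by the final step.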

Results

Real-world evaluations use a 7/1/2 train/val/test split, objects positioned randomly, mean success rate over 10 trials.

1. Multi-modal beats modality-specific (Table 1). With total training data held fixed, Any2Policy outperforms single-modality variants across all twelve instruction→observation pairs. Examples: Text→Image 51 vs 39; Text→Point Cloud 62 vs 47; Audio→Video 55 vs 48; Image→Video 63 vs 47. The gain is attributed to cross-modal training improving generalization.

2. Embodied alignment is essential (Table 2). Removing the alignment module (concatenating projection outputs, fusing via MLP) collapses performance: Text→Video drops 57→18, Video→Video 57→9, Image→Image 56→15. The cross-attention fusion is doing most of the work.

3. More modalities at inference help (Table 4 ablation). Starting from Text→Image (51), stacking instruction modalities (Text+Audio+ImageEndGoal+VideoDemo) reaches 62; stacking observation modalities (Image+Video+PointCloud) reaches 66; a mixed combination reaches 69. Adding video demonstrations to the instruction plus video to the observation gave a +12 point jump. Text and audio convey similar information, so combining them adds only about 1 point.

4. Beats SOTA robot models (Table 3, image-observation columns; success rate in %):

Instr→Obs	Image→Image	Text→Image	Video→Image	Image+Text→Image
VIMA [59]	–	–	–	49
R3M [109]	42	–	46	–
T5 [110]	–	39	–	–
Any2Policy	44	51	59	62

5. Simulation (Franka Kitchen, Fig 4). Benchmarked against R3M, BLIP-2, Embodied-GPT with 10 or 25 demonstrations across five tasks (two camera views). Any2Policy is superior on all tasks except Knobs-Left with 25 demonstrations — the one reported loss.

Where it loses / weak spots: point-cloud pairings (image–point cloud, video–point cloud) score lower than other observation pairs, which the authors attribute to the difficulty of fusing 2D visual features with 3D spatial data; point cloud helps most when paired with text/audio instruction.

Limitations & open questions

Author-stated:

Integrating point-cloud (3D) observations with image/video (2D) remains hard — the authors explicitly flag that more sophisticated fusion is needed and that these pairings underperform.
One reported loss (Knobs-Left, 25 demos) against baselines in simulation.
Reliance on a frozen ImageBind backbone means the policy inherits whatever modality gaps and biases that encoder carries (implied throughout the method discussion).

What I noticed reading it:

No tactile, no force, no contact-audio. Despite the "any-modality" branding, the modalities are all perceptual at a distance: text, speech, image, video, point cloud. There is no touch, no force/torque, no proprioception, no contact sound. This is "any visual or linguistic modality," not "any sensory modality" in the contact-rich sense.
Tiny statistics. Real-world numbers are mean success over only 10 trials per cell; no variance / std reported, unlike e.g. BLADE's seeded tables. With 10 trials a difference of 51 vs 39 is roughly 5 vs 4 successes — the confidence intervals must be wide and are not shown.
"Same training data" claim is load-bearing but thin. The headline ("multi-modal training improves generalization at fixed data") rests on holding total trajectories fixed, but each task already has 5 text paraphrases + speech + image-goal + video; the multi-modal model sees strictly more annotation per trajectory, so it isn't obviously a clean apples-to-apples comparison.
Tasks are simple and same-environment. All 30 tasks are in one environment; most are pick-and-place. Long-horizon composition, contact-rich manipulation, and cross-environment transfer are untested.
ImageBind dependence caps the ceiling. Freezing the encoder is presented as an efficiency win, but it also means representation quality for, say, point clouds is bounded by ImageBind's (relatively weak) point-cloud arm — plausibly why the 3D pairings lag.
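To put a number on the "tiny statistics" point, here is a sketch of a 95% Wilson score interval for a success rate estimated from 10 trials. It follows the simplification used in the note above, treating one table cell as 10 independent Bernoulli trials; the paper's cells are means whose effective sample size is not stated, so this is illustrative only.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half, center + half


for successes in (4, 5):
    low, high = wilson_interval(successes, 10)
    print(f"{successes}/10: [{low:.2f}, {high:.2f}]")
```

Under these assumptions the intervals for 4/10 and 5/10 each span roughly half of the unit range and overlap almost entirely, so a one-success gap at n = 10 cannot separate two methods on its own.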

Why I care

This sits in my Cluster H multimodal-binding anchors alongside ImageBind, PaLM-E, and Meta-Transformer. It is the most policy-level of those: it actually closes the loop from a shared multi-modal embedding to robot actions, which is the embodied step the pure binding papers stop short of.

Relative to my thesis, that many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) live in touch/force/sound rather than pixels, Any2Policy is a clean example of the limitation I want to address. It markets "any-modality" but its modalities are entirely vision + language + 3D geometry; there is no channel that could ever evaluate is_screwed_tight. It is the visual-binding paradigm taken to its policy conclusion, and precisely because it stops at the non-contact modalities it sharpens the argument for adding tactile/force/audio binding (FuSe, OmniVTLA, AnyTouch) into exactly this kind of shared-latent-space-plus-cross-attention recipe.

For BLADE specifically: Any2Policy is the opposite design axis. BLADE keeps a small visual observation channel but adds symbolic structure (PDDL operators, predicates, bi-level planning) for long-horizon composition; Any2Policy keeps a flat end-to-end BC policy but maximizes input-modality breadth. Both are routes to generalization: structure versus modality. A future system wants both: BLADE-style abstractions whose predicate classifiers can read from a multi-modal (including tactile) observation encoder like Any2Policy's. The "instruction in any modality" piece (speech, image-goal, video-demo → FOL goal) is also directly relevant to BLADE's LLM goal-translation step.

Quotable

We present an end-to-end general-purpose multi-modal system named Any-to-Policy Embodied Agents. This system empowers robots to handle tasks using various modalities, whether in combinations like text-image, audio-image, text-point cloud, or in isolation. — Abstract / p.1

We leverage existing well-established models, ImageBind, a unified high-performance encoder across five modalities, to encode inputs of various modalities. With the help of ImageBind, we are spared from managing many numbers of heterogeneous modal encoders. — §3.1, Multimodal Encoder / p.4

Our dataset, RoboAny, stands out as the first to support a comprehensive range of modalities in robotics. It encompasses both instructions and observations across images, videos, audio, language, and point clouds. — §2, Robot Manipulation and Datasets / p.3

Papers cited that should likely be ingested next:

[25] Girdhar et al. 2023 — ImageBind (CVPR) — the unified five-modality encoder Any2Policy is built on; in this batch as imagebind_one_embedding_space.
[59] Jiang et al. — VIMA — multimodal-prompt manipulation; the closest prior on multi-modal instruction and the key baseline.
[26] Zhang et al. 2023 — Meta-Transformer — unified multimodal transformer; in this batch as meta_transformer_unified_modalities.
[12] Lee et al. 2019 — Making Sense of Vision and Touch (ICRA) — the contact-rich multimodal-representation paper Any2Policy cites but does not include touch from; in this batch as making_sense_of_vision_and_touch.
[35] Brohan et al. 2022 — RT-1 — source of the TokenLearner token-pruning trick Any2Policy reuses.
[92] Franka Kitchen and [94] ManiSkill2 — the two simulation benchmarks used.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

ImageBind — the exact backbone; Any2Policy is ImageBind operationalized into a robot policy.
Meta-Transformer and LanguageBind — sibling unified-modality encoders; the binding-vs-policy contrast.
PaLM-E — the other embodied-multimodal-LM anchor in Cluster H; LLM-centric where Any2Policy is encoder-centric.
Beyond Sight (FuSe), OmniVTLA, and AnyTouch — the contact-modality counterparts: these add the touch/force/audio channels Any2Policy conspicuously omits, into a similar shared-space-plus-fusion recipe.
Kaiwu — another multimodal manipulation dataset; the contact-rich data foil to RoboAny's vision/language-only annotation.