Any2Policy: Learning Visuomotor Policy with Any-Modality

Yichen Zhu, Zhicai Ou, Feifei Feng, Jian Tang* · Midea Group · NeurIPS 2024

One-liner. Any2Policy is an end-to-end manipulation policy that accepts task instructions in any of four modalities (text, audio speech, image end-goal, video demonstration) and observations in any of three (image, video, point cloud), by encoding everything into a shared ImageBind latent space and fusing instruction-with-observation via a learnable cross-attention "embodied alignment" module — trained and evaluated on a new 30-task real-world dataset (RoboAny) richly annotated across all of these modalities.

Problem & motivation

Most robot-learning methods commit to a single modality for task specification (usually text) and a single modality for observation (usually 2D images), treating each sensory channel as a distinct, specialized problem. That limits generalizability: a human can be told a task in speech, shown a goal picture, or given a video demo, and can perceive the world through images and 3D. The paper argues that simultaneous multi-modal learning yields richer, more transferable representations (citing dual-coding and intersensory-redundancy work from cognitive science, [16,17]). Prior multi-modal systems like VIMA and MUTEX engage several instruction modalities but still fix a single observation type. The gap Any2Policy targets is a single unified policy that handles arbitrary combinations of instruction and observation modalities, and a benchmark dataset that even makes such evaluation possible.

Method

The Any-to-Policy ("Any2Policy") architecture has two primary components, shown in Fig 1 and detailed in Fig 2: a set of multi-modal encoders that map every input into a shared latent space, and an embodied alignment module that fuses the instruction token with the observation tokens to condition a policy network. The objective is a multi-task policy π(a | s, o) producing continuous actions; demonstrations d = [(s, o, a)] pair an instruction s, observation o, and action a.

Multimodal encoder. Rather than maintain heterogeneous per-modality backbones, the authors adopt ImageBind [25] as a single unified encoder across all five modalities (text, audio, image, video, point cloud). The encoder is a frozen mapping P(x); only the downstream projection layers are trained, which keeps compute low and the representation modality-agnostic. If instruction and observation share a modality, they share the encoder.

Embodied alignment. Projection layers map each modality's features into a uniform token form. Token counts differ by modality: 81×t tokens for video (t = number of frames), 256 for point clouds, and a single token each for text and audio. Because self-attention cost is quadratic in token count, the authors follow RT-1 in applying TokenLearner [105] per visual modality to prune redundant tokens: images and videos drop to 8 tokens each, point clouds to 16. The pruned tokens from co-present observation modalities are concatenated, with a modality token inserted between adjacent modalities to mark boundaries, plus absolute position embeddings; with all three visual modalities present the total is 34 (8+8+16+2). A single Transformer block then mixes them.
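The token budget above can be checked with a small sketch. This is illustrative bookkeeping only, not the authors' code: the tuple representation of tokens and the names `PRUNED_TOKENS` and `assemble_observation_sequence` are assumptions made for the example, and a real model would carry embedding vectors rather than labels.

```python
# Token budgets after TokenLearner pruning, as quoted in the note above.
PRUNED_TOKENS = {"image": 8, "video": 8, "point_cloud": 16}


def assemble_observation_sequence(modalities):
    """Concatenate pruned tokens with one separator between adjacent modalities.

    Tokens are (kind, modality, index) tuples; this layout is hypothetical and
    only stands in for the embedding vectors a real model would use.
    """
    sequence = []
    for position, modality in enumerate(modalities):
        if position > 0:
            # Boundary marker standing in for the "modality token".
            sequence.append(("sep", modality, 0))
        for index in range(PRUNED_TOKENS[modality]):
            sequence.append(("tok", modality, index))
    return sequence


full = assemble_observation_sequence(["image", "video", "point_cloud"])
print(len(full))  # 8 + 8 + 16 tokens plus 2 separators = 34
```

With a single observation modality no separator is added, so an image-only observation contributes just its 8 pruned tokens.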

To align instruction with observation, the authors use cross-attention: given an observation sequence $P_o$ and instruction sequence $P_s$, they project $P_o$ into keys ($K$), $P_s$ into values ($V$), and use a fixed set of 16 learnable query embeddings ($Q$) for the cross-attention, $h_i = \mathrm{Softmax}\left(Q_i K_i^{\top} / \sqrt{d_h}\right) V_i$ (Eq 1), in the BERT-style QKV framing. This learnable-query design adapts the different modalities for efficient instruction–observation fusion.
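A minimal single-head sketch of Eq 1 in plain Python may help make the shapes concrete. It is not the authors' implementation. The head width `d_h = 4`, the random initialisation, and the helper names are assumptions. The formula needs the same number of key and value tokens; this note does not say how the paper reconciles the observation and instruction sequence lengths, so the sketch takes `keys` and `values` as given matrices of equal length.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head attention: Softmax(Q K^T / sqrt(d_h)) V.

    queries: m x d_h, keys: n x d_h, values: n x d_v (lists of lists).
    Returns an m x d_v list of lists.
    """
    scale = math.sqrt(len(queries[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return outputs


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_h = 4                                # toy head width (assumed)
queries = random_matrix(16, d_h, rng)  # 16 learnable queries, randomly set here
keys = random_matrix(34, d_h, rng)     # stand-in for projected key tokens
values = random_matrix(34, d_h, rng)   # stand-in for projected value tokens
fused = cross_attention(queries, keys, values)
print(len(fused), len(fused[0]))  # 16 4
```

The output always has 16 rows regardless of how many key/value tokens come in, which is the practical appeal of a fixed learnable-query set: downstream layers see a constant-size fused representation.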

Policy network. The policy is conditioned on the observation (and optionally instruction) sequence through a stack of Transformer blocks, each with self-attention, cross-attention, and a feed-forward network, plus residual connections to the rollout trajectory. One history action token is appended (similar to VIMA [59]). The instruction representation is cached to avoid recomputation since it aligns with observation at every step. A final MLP emits continuous actions; the training loss is behavior-cloning MSE. The action space is absolute joint position.
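The behavior-cloning objective mentioned above reduces to a mean squared error between predicted and demonstrated joint positions. The sketch below assumes actions are lists of per-joint floats and that the mean is taken over every joint and time step; the function name `bc_mse_loss` and the toy values are illustrative, and the paper's exact reduction is not stated in this note.

```python
def bc_mse_loss(predicted_actions, expert_actions):
    """Mean squared error over all joints and all time steps."""
    total, count = 0.0, 0
    for pred, expert in zip(predicted_actions, expert_actions):
        for p, e in zip(pred, expert):
            total += (p - e) ** 2
            count += 1
    return total / count


# Two time steps, two joints each (toy absolute joint positions).
pred = [[0.0, 0.5], [1.0, 1.0]]
expert = [[0.0, 1.0], [1.0, 0.0]]
print(bc_mse_loss(pred, expert))  # (0 + 0.25 + 0 + 1) / 4 = 0.3125
```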

Setup

Results

Real-world evaluations use a 7/1/2 train/val/test split, objects positioned randomly, mean success rate over 10 trials.

1. Multi-modal beats modality-specific (Table 1). With total training data held fixed, Any2Policy outperforms single-modality variants across all twelve instruction→observation pairs. Examples: Text→Image 51 vs 39; Text→Point Cloud 62 vs 47; Audio→Video 55 vs 48; Image→Video 63 vs 47. The gain is attributed to cross-modal training improving generalization.
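For a quick sense of scale, the four pairs quoted above can be turned into absolute gains. The dictionary holds only the numbers cited in this note (first value Any2Policy, second the single-modality variant), not the full Table 1, and the variable names are my own.

```python
# (Any2Policy, modality-specific) success rates in percent, from the note above.
table1_pairs = {
    "Text->Image": (51, 39),
    "Text->Point Cloud": (62, 47),
    "Audio->Video": (55, 48),
    "Image->Video": (63, 47),
}

gains = {pair: joint - single for pair, (joint, single) in table1_pairs.items()}
mean_gain = sum(gains.values()) / len(gains)

for pair, gain in gains.items():
    print(f"{pair}: +{gain} points")
print(f"mean over quoted pairs: +{mean_gain} points")
```

Over these four pairs the gains are 12, 15, 7 and 16 points, a mean of 12.5; the remaining eight pairs in Table 1 are not reproduced here.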

2. Embodied alignment is essential (Table 2). Removing the alignment module (concatenating projection outputs, fusing via MLP) collapses performance: Text→Video drops 57→18, Video→Video 57→9, Image→Image 56→15. The cross-attention fusion is doing most of the work.

3. More modalities at inference help (Table 4 ablation). Starting from Text→Image (51), stacking instruction modalities (Text+Audio+ImageEndGoal+VideoDemo) reaches 62; stacking observation modalities (Image+Video+PointCloud) reaches 66; a mixed combination reaches 69. Adding video demonstrations in instruction plus video in observation gave a +12 point jump. Text and audio convey similar information, so combining them adds only about 1 point.

4. Beats SOTA robot models (Table 3, Image-observation column):

Instr→Obs: Image→Image, Text→Image, Video→Image, Image+Text→Image
VIMA [59]: 49
R3M [109]: 42, 46
T5 [110]: 39
Any2Policy: 44, 51, 59, 62

5. Simulation (Franka Kitchen, Fig 4). Benchmarked against R3M, BLIP-2, Embodied-GPT with 10 or 25 demonstrations across five tasks (two camera views). Any2Policy is superior on all tasks except Knobs-Left with 25 demonstrations — the one reported loss.

Where it loses / weak spots: point-cloud pairings (image–point cloud, video–point cloud) score lower than other observation pairs, which the authors attribute to the difficulty of fusing 2D visual features with 3D spatial data; point cloud helps most when paired with text/audio instruction.

Limitations & open questions

Author-stated:

What I noticed reading it:

Why I care

This sits in my Cluster H multimodal-binding anchors alongside ImageBind, PaLM-E, and Meta-Transformer. It is the most policy-level of those: it actually closes the loop from a shared multi-modal embedding to robot actions, which is the embodied step the pure binding papers stop short of.

Relative to my thesis, that many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) live in touch/force/sound rather than pixels, Any2Policy is a clean example of the limitation I want to address. It markets "any-modality", but its modalities are entirely vision + language + 3D geometry (audio enters only as spoken instructions, not as sensed contact sound); there is no channel that could ever evaluate is_screwed_tight. It is the visual-binding paradigm taken to its policy conclusion, and precisely because it stops at the non-contact modalities it sharpens the argument for adding tactile/force/contact-audio binding (FuSe, OmniVTLA, AnyTouch) into exactly this kind of shared-latent-space-plus-cross-attention recipe.

For BLADE specifically: Any2Policy is the opposite design axis. BLADE keeps a small visual observation channel but adds symbolic structure (PDDL operators, predicates, bi-level planning) for long-horizon composition; Any2Policy keeps a flat end-to-end BC policy but maximizes input-modality breadth. Both are ways to generalize, structure-vs-modality. A future system wants both: BLADE-style abstractions whose predicate classifiers can read from a multi-modal (including tactile) observation encoder like Any2Policy's. The "instruction in any modality" piece (speech, image-goal, video-demo → FOL goal) is also directly relevant to BLADE's LLM goal-translation step.

Quotable

We present an end-to-end general-purpose multi-modal system named Any-to-Policy Embodied Agents. This system empowers robots to handle tasks using various modalities, whether in combinations like text-image, audio-image, text-point cloud, or in isolation. — Abstract / p.1
We leverage existing well-established models, ImageBind, a unified high-performance encoder across five modalities, to encode inputs of various modalities. With the help of ImageBind, we are spared from managing many numbers of heterogeneous modal encoders. — §3.1, Multimodal Encoder / p.4
Our dataset, RoboAny, stands out as the first to support a comprehensive range of modalities in robotics. It encompasses both instructions and observations across images, videos, audio, language, and point clouds. — §2, Robot Manipulation and Datasets / p.3

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: