ImageBind: One Embedding Space To Bind Them All

Rohit Girdhar*, Alaaeldin El-Nouby*, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra* · FAIR, Meta AI · CVPR 2023 · arXiv:2305.05665

One-liner. Train a contrastive model that aligns five "naturally paired-with-image" modalities (audio, depth, thermal, IMU, plus text) each to a frozen CLIP image encoder — and the image space acts as a hub, so modalities you never trained together (e.g. audio↔text, audio↔depth) become aligned emergently, giving zero-shot cross-modal retrieval, classification, arithmetic, and "upgrade-an-existing-CLIP-model" for free.

Problem & motivation

CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities normally requires datasets where all modalities co-occur for the same instance, which is infeasible to collect. Prior multimodal work is therefore limited to a single pair of modalities at a time (video–audio, image–depth), and those embeddings are siloed: a video–audio encoder can't be used for image–text tasks. The paper's key observation: images already co-occur with everything (web image–text, video–audio, RGB-D depth, thermal–image, egocentric video–IMU). So you don't need all-modality data: image-paired data is sufficient to bind the whole set, because the image embedding is a natural anchor every modality already sees.

Method

Binding via image as hub. ImageBind only ever trains pairs of the form (ℬ, ℳ) where ℬ is image/video and ℳ is one other modality. Given aligned observations I_i (the image or video side) and M_i (its paired observation), deep encoders produce normalized embeddings q_i = f(I_i) and k_i = g(M_i), optimized with a symmetric InfoNCE loss (Eq 1): L_ℬ,ℳ + L_ℳ,ℬ, treating every other example in the mini-batch as a negative and scaling similarities by a temperature τ (the ablations favor a fixed τ over a learned one).
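The loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the batch-mean reduction, and the default `tau=0.07` are assumptions made here for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(q, k, tau=0.07):
    """q: (N, d) image embeddings; k: (N, d) paired-modality embeddings.

    Row i of q and row i of k form the positive pair; every other row in
    the batch acts as a negative. tau=0.07 is an assumed value.
    """
    q, k = l2_normalize(q), l2_normalize(k)
    logits = q @ k.T / tau                       # (N, N) scaled similarities
    idx = np.arange(len(q))
    image_to_modality = -log_softmax(logits, axis=1)[idx, idx].mean()
    modality_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return image_to_modality + modality_to_image

# Toy check: correctly paired rows should score a lower loss than mispaired rows.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, 16))
loss_aligned = symmetric_info_nce(q, q + 0.01 * rng.normal(size=(8, 16)))
loss_mispaired = symmetric_info_nce(q, q[::-1])
```

In the setup these notes describe, only the encoder producing `k` would receive gradients, since the image encoder is frozen.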

Emergent alignment. Because each modality ℳ₁, ℳ₂ is independently pulled toward image embeddings, the two also become aligned to each other without ever seeing a (ℳ₁, ℳ₂) pair (Fig 2). This is what unlocks zero-shot audio↔text classification despite training audio only against images. The paper notes the analogy to emergent zero-shot translation in multilingual MT when languages share a latent space.
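A toy simulation can make the hub argument concrete. It is a sketch under strong assumptions: the "image space" is a set of random class anchors, and each spoke (text, audio) is modeled as the anchor plus independent noise. Nothing here comes from the paper's data, and `zero_shot_classify` is a hypothetical helper name.

```python
import numpy as np

def zero_shot_classify(query_emb, class_text_emb):
    """Return, for each query, the index of the most cosine-similar class prompt."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    return (q @ t.T).argmax(axis=1)

rng = np.random.default_rng(0)
num_classes, dim, num_clips = 5, 64, 200

# Stand-in for the frozen image space: one anchor vector per class.
image_anchor = rng.normal(size=(num_classes, dim))

# Each spoke is aligned to the image anchor only; the noise models imperfect alignment.
text_emb = image_anchor + 0.3 * rng.normal(size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=num_clips)
audio_emb = image_anchor[labels] + 0.3 * rng.normal(size=(num_clips, dim))

# Audio and text were never compared during "training", yet nearest-text lookup works.
pred = zero_shot_classify(audio_emb, text_emb)
accuracy = (pred == labels).mean()
```

The simulation only shows that two spokes tied to the same anchor end up close to each other; how well this holds with real encoders is exactly what the paper's zero-shot tables measure.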

Encoders. A Transformer for every modality (ViT for images; same encoder for video via temporally inflating the patch-projection layer and sampling 2 frames over 2s). Audio: 2s @16kHz → 128 mel-spectrogram bins, treated as a 2D image, ViT with patch 16 / stride 10. Thermal and depth are treated as one-channel images and ViT-encoded (depth converted to disparity maps for scale invariance). IMU: 5s of accelerometer+gyroscope across X,Y,Z → 2K-timestep readings projected with a 1D conv (kernel 8), then a Transformer. Each encoder gets a modality-specific linear projection head to a fixed d-dim embedding.
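The patch and token counts implied by these settings follow from the usual sliding-window formula. The sketch below is illustrative: `num_windows` is a hypothetical helper, "2K" is read as 2,000 timesteps, and the IMU stride is not given above, so a stride equal to the kernel size is assumed.

```python
def num_windows(length, kernel, stride):
    """Count positions of a sliding window of size `kernel` over `length` samples (no padding)."""
    if length < kernel:
        return 0
    return (length - kernel) // stride + 1

# Frequency axis of the audio spectrogram: 128 mel bins, patch 16, stride 10.
# Overlapping patches: (128 - 16) // 10 + 1 = 12 patches along frequency.
freq_patches = num_windows(128, kernel=16, stride=10)

# IMU: assumed 2,000 timesteps, kernel 8, assumed stride 8 (non-overlapping).
# (2000 - 8) // 8 + 1 = 250 tokens per clip.
imu_tokens = num_windows(2000, kernel=8, stride=8)
```

The time axis of the spectrogram is left out on purpose: the number of spectrogram frames per 2s clip is not stated in these notes.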

Initialization. The image and text encoders are taken from a pretrained, frozen OpenCLIP (ViT-H 630M image / 302M text params); the audio, depth, thermal, and IMU encoders are trained from scratch to align to that fixed image space. The SUN RGB-D (depth) and LLVIP (thermal) pairs are replicated 50× for training because those datasets are small.

Setup

Datasets / benchmarks: Training pairs — web (image, text) [OpenCLIP]; (video, audio) from Audioset; (image, depth) from SUN RGB-D; (image, thermal) from LLVIP; (video, IMU) from Ego4D. Evaluation (Table 1): Audioset Audio-only, ESC, Clotho, AudioCaps, VGGSound (audio cls/retrieval); SUN-D, NYU-v2 (depth scene cls); LLVIP (thermal person cls); Ego4D (IMU scenario cls); plus standard ImageNet/Places-365/K400/MSR-VTT.
Hardware / simulator: none (no robot; this is a representation-learning paper).
Baselines: AudioCLIP [27] and AVFIC [51] (audio–text with explicit supervision); MIL-NCE, SupportSet, FIT (video retrieval); self-supervised AudioMAE and supervised models for few-shot; MultiMAE for depth; OpenCLIP / "Text Paired" CLIP applied directly as upper-bound foils; DINO vs DeiT image encoders for the evaluation-tool experiment.
Compute: not reported (epoch counts given for ablations — 16 vs 32 vs 64; IMU encoder 8 epochs — but no GPU-hours / device counts).

Results

Headline: ImageBind sets a new state of the art on emergent zero-shot recognition across non-visual modalities, often matching or beating specialist supervised models — all without training on text pairs for those modalities.

Emergent zero-shot classification (Table 2, top-1 / Recall@1 / mAP as noted):

Method	AS-A (audio, mAP)	VGGS (audio)	ESC (audio)	SUN-D (depth)	NYU-D (depth)	LLVIP (thermal)	Ego4D (IMU)
Random	0.62	0.32	2.75	5.26	10.0	50.0	0.9
ImageBind	17.6	27.8	66.9	35.1	54.0	63.4	25.0
Text Paired (CLIP direct)	28.4†	—	68.6†	25.4*	41.9*	—	—

(† AudioCLIP-style, uses AS class names as supervision — not truly zero-shot; * OpenCLIP on grayscale-rendered depth.)

Where it wins and loses:

Audio retrieval (Table 3): ImageBind doubles AVFIC's Clotho performance (R@1 6.0 vs 3.0) without any audio–text supervision; on ESC classification (66.9) it's comparable to supervised AudioCLIP (68.6).
Text→audio/video retrieval (Table 4, MSR-VTT 1k-A): audio-only ImageBind (R@1 6.8) is competitive with prior video methods (MIL-NCE 8.6); A+V together jumps to 36.8 — beating all prior work and showing the modalities compose.
Few-shot (Fig 3): beats self-supervised AudioMAE by ~40% top-1 at ≤4-shot on ESC; outperforms supervised at ≤2-shot; beats MultiMAE across all few-shot depth settings.
Loses / weaker: on AS-A and ESC, the AudioCLIP "Text Paired" upper bound (which uses class-name supervision) still edges ImageBind. Standard image/video zero-shot just matches OpenCLIP (expected, since those encoders are frozen-inherited).
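The retrieval metric and the A+V composition in the list above can be sketched as follows. This is a toy illustration on synthetic embeddings: summing unit vectors is an assumed fusion rule, the noise levels are arbitrary, and the resulting numbers are unrelated to the Table 4 figures.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose matching gallery item (same row index) ranks in the top k."""
    sims = normalize(query_emb) @ normalize(gallery_emb).T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(query_emb))[:, None]
    return (top_k == targets).any(axis=1).mean()

def fuse(audio_emb, video_emb):
    """Combine two embeddings of the same clips by summing unit vectors (assumed rule)."""
    return normalize(audio_emb) + normalize(video_emb)

rng = np.random.default_rng(0)
n_clips, dim = 50, 32
text_emb = rng.normal(size=(n_clips, dim))                      # one caption per clip
audio_emb = text_emb + 2.0 * rng.normal(size=(n_clips, dim))    # weakly aligned spoke
video_emb = text_emb + 0.1 * rng.normal(size=(n_clips, dim))    # strongly aligned spoke

r1_audio = recall_at_k(text_emb, audio_emb, k=1)
r1_fused = recall_at_k(text_emb, fuse(audio_emb, video_emb), k=1)
```

Because every modality lives in the same space, fusion needs no extra training: the combined vector is queried exactly like a single-modality embedding.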

Applications. (1) Embedding-space arithmetic (Fig 4): adding an image embedding + an audio embedding retrieves images combining both concepts (fruit image + bird sound → birds among fruit). (2) Audio-promptable object detection (Fig 5): swap Detic's CLIP text "class" embeddings for ImageBind audio embeddings → detector localizes a dog from a bark, no retraining. (3) Audio-to-image generation (Fig 1): feed audio embeddings to a DALLE-2 decoder built for CLIP text embeddings. (4) Vision-model evaluation tool (Table 8): DINO beats DeiT on emergent multimodal binding, and binding quality is not correlated with ImageNet accuracy, so ImageBind measures a different axis of vision-model quality.

Ablations (Tables 5–7): a stronger (ViT-H vs ViT-B) image encoder lifts emergent zero-shot on all modalities (+7% depth, +4% audio), confirming the hub-quality thesis; fixed temperature, linear projection heads, longer training, spatially-aligned crops, and temporally-aligned audio all help.
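The embedding-space arithmetic in application (1) amounts to summing two unit vectors and running a nearest-neighbor search. The sketch below uses synthetic "concept" vectors in place of real encoder outputs; `compose_and_retrieve` and the four-item gallery are hypothetical and only illustrate the mechanics.

```python
import numpy as np

def unit(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose_and_retrieve(image_query, audio_query, gallery, top_k=3):
    """Sum two unit embeddings, renormalize, and return indices of the nearest gallery items."""
    combined = unit(unit(image_query) + unit(audio_query))
    sims = unit(gallery) @ combined
    return np.argsort(-sims)[:top_k]

rng = np.random.default_rng(0)
dim = 64
fruit, bird = rng.normal(size=dim), rng.normal(size=dim)   # stand-in concept directions
gallery = np.stack([
    fruit,                   # 0: fruit only
    bird,                    # 1: bird only
    fruit + bird,            # 2: both concepts
    rng.normal(size=dim),    # 3: unrelated
])

# Query with a "fruit image" embedding plus a "bird sound" embedding.
ranked = compose_and_retrieve(image_query=fruit, audio_query=bird, gallery=gallery)
```

The same lookup pattern underlies application (2): a detector that scores regions against text class embeddings can score them against audio embeddings instead, as long as both sit in one space.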

Limitations & open questions

From the authors (§6):

Embeddings are trained with no specific downstream task, so they lag task-specialist models; adapting general-purpose embeddings to structured tasks (e.g. detection) is open.
The image-alignment loss could be enriched with other alignment data (modalities paired with text, or with each other, e.g. audio+IMU) — currently only image-paired data is used.
Explicitly called "a research prototype" not ready for real-world use.
New benchmarks are needed to measure emergent multimodal abilities.

What I noticed reading it:

The hub-and-spoke design means all emergent alignment quality is gated by the image encoder. Modalities that are weakly image-correlated by nature (a "is_screwed_tight" sound, a faint vibration) have no path into the space except through whatever the image happened to capture. This is an information bottleneck the paper never stress-tests.
Audio is a fixed 2s mel-spectrogram clip — fine for environmental sound classes, but contact events in manipulation are sub-second transients; the temporal granularity may be far too coarse for force/contact reasoning.
Reported numbers are single-run; there are no seeds or variance bars on the main Table 2, so the margins over baselines (especially the few-point audio retrieval gaps in Table 3) come with no estimate of statistical significance.
"Binding" is purely perceptual co-occurrence, not causal or physical — the model learns that birds and chirps appear together, not that a screw becoming tight produces a particular torque-sound. That gap matters a lot for the manipulation thesis below.

Why I care

Method anchor, not a manipulation paper. ImageBind is the canonical "bind any modality into one embedding space" result, and it's the direct ancestor of the touch/force/audio robot papers in this batch — UniTouch literally applies the ImageBind recipe with touch as a new spoke, and FuSe leans on image-as-hub binding to fuse heterogeneous sensors under language. So it earns a place here as the representation-learning template the embodied work specializes.

The connection to my thesis (relative to BLADE): many manipulation predicates — is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight — are not visually evaluable; they live in touch, force, and sound. ImageBind is the clearest existence proof that you can fold those non-visual channels into a single shared space using only naturally-paired data, without per-predicate text labels. The tantalizing version for BLADE: if a predicate classifier f_θ(p) could read from a bound audio/tactile embedding rather than RGB pixels, the categorical abstraction layer would finally have access to the sensory channels where these predicates actually live.

But two caveats sharpen rather than soften the point: (1) ImageBind binds via image co-occurrence, and the predicates I care about are precisely the ones where the image is uninformative (you can't see torque) — so naive image-as-hub binding may be the wrong topology for force/touch; UniTouch and the tactile-foundation-model line are the papers testing that. (2) The binding is perceptual, not symbolic/causal, so it complements rather than replaces BLADE's structured precondition/effect reasoning. The interesting research question is the seam between them: a bound multisensory embedding feeding learned symbolic predicates.

Quotable

We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. — Abstract

We observe an emergent behavior in the embedding space that aligns two pairs of modalities (ℳ₁, ℳ₂) even though we only train using the pairs (ℬ, ℳ₁) and (ℬ, ℳ₂). — §3.2, Emergent alignment

The emergent capabilities improve with the strength of the image encoder ... and ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks. — Abstract

Papers cited that should likely be ingested next:

AudioCLIP [27] (Guzhov et al.) — closest baseline; adds audio to CLIP with explicit (audio, text) supervision. PDF (if dropped).
CLIP [60] (Radford et al.) — the (image, text) backbone whose frozen encoders ImageBind inherits and extends.
AVID [50] (Morgado et al.) — audio-visual instance discrimination with cross-modal agreement; the (video, audio) training-pair lineage.
DALLE-2 / unCLIP [61] — the diffusion decoder ImageBind repurposes for audio→image generation.
OmniMAE / Omnivore [20, 21] (Girdhar et al.) — same authors' single-model-many-visual-modalities precursors.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

UniTouch — the direct embodied successor: applies the ImageBind recipe with touch as a new modality bound through images, exactly the touch-channel extension my "Why I care" wants.
LanguageBind — sibling binding paper that swaps the hub from image to language; the obvious contrast in anchor choice.
Meta-Transformer — unified frozen-encoder-across-12-modalities alternative; a different route to "one model, many modalities."
FuSe — builds on image-as-hub binding to fuse heterogeneous robot sensors under language; the manipulation-policy payoff of this representation idea.
AudioCLIP and CLAP — the audio–language binding line ImageBind is measured against (Tables 3–4).
PaLM-E — the other Cluster-H anchor: embodied multimodal LM; complementary "tokens-into-an-LLM" route vs ImageBind's "contrastive shared space."
Any2Policy — any-modality visuomotor policy; downstream consumer of bound multimodal representations.