ImageBind: One Embedding Space To Bind Them All

Rohit Girdhar*, Alaaeldin El-Nouby*, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra* · FAIR, Meta AI · CVPR 2023 · arXiv:2305.05665

One-liner. Train a contrastive model that aligns five "naturally paired-with-image" modalities (audio, depth, thermal, IMU, plus text) each to a frozen CLIP image encoder — and the image space acts as a hub, so modalities you never trained together (e.g. audio↔text, audio↔depth) become aligned emergently, giving zero-shot cross-modal retrieval, classification, arithmetic, and "upgrade-an-existing-CLIP-model" for free.

Problem & motivation

CLIP-style models give one shared (image, text) space, but extending that to a true joint embedding over many sensory modalities would normally require datasets in which all modalities co-occur for the same instance, which is infeasible to collect. Prior multimodal work is therefore limited to a single pair of modalities at a time (video–audio, image–depth), and the resulting embeddings are siloed: a video–audio encoder can't be used for image–text tasks. The paper's key observation: images already co-occur with everything (web image–text, video–audio, RGB-D depth, thermal–image, egocentric video–IMU). So you don't need all-modality data; image-paired data is sufficient to bind the whole set, because the image embedding is a natural anchor every modality already sees.

Method

Binding via image as hub. ImageBind only ever trains pairs of the form (ℬ, ℳ), where ℬ is image/video and ℳ is one other modality. Given aligned observations I_i and M_i, deep encoders produce normalized embeddings q_i = f(I_i) and k_i = g(M_i), optimized with a symmetric InfoNCE loss (Eq 1): L_{ℬ,ℳ} + L_{ℳ,ℬ}, treating every other example in the mini-batch as a negative, with a learnable temperature τ.
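To make the loss concrete, here is a minimal NumPy sketch of a symmetric InfoNCE objective over one mini-batch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, τ is passed as a fixed float rather than learned, and the default value 0.07 is only a placeholder.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each embedding to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(q, k, tau=0.07):
    """Symmetric InfoNCE for a batch of N aligned pairs.

    q: (N, d) image embeddings, k: (N, d) other-modality embeddings.
    Row i of q is the positive for row i of k; every other row in the
    batch acts as a negative. tau is a fixed temperature here (assumption).
    """
    q = l2_normalize(q)
    k = l2_normalize(k)
    logits = q @ k.T / tau                  # (N, N) similarity matrix
    idx = np.arange(len(q))                 # positives sit on the diagonal
    loss_q_to_k = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> modality
    loss_k_to_q = -log_softmax(logits, axis=0)[idx, idx].mean()  # modality -> image
    return loss_q_to_k + loss_k_to_q

# Toy check: nearly identical pairs should score a lower loss than unrelated ones.
rng = np.random.default_rng(0)
q_demo = rng.normal(size=(8, 16))
aligned_loss = symmetric_info_nce(q_demo, q_demo + 0.01 * rng.normal(size=(8, 16)))
shuffled_loss = symmetric_info_nce(q_demo, rng.normal(size=(8, 16)))
```

The two terms correspond to L_{ℬ,ℳ} and L_{ℳ,ℬ}: the same similarity matrix is normalized once along rows and once along columns.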

Emergent alignment. Because ℳ1 and ℳ2 are each independently pulled toward image embeddings, the two also become aligned to each other without ever seeing a (ℳ1, ℳ2) pair (Fig 2). This is what unlocks zero-shot audio↔text classification despite training audio only against images. The paper notes the analogy to emergent zero-shot translation in multilingual MT when languages share a latent space.
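A toy sketch of why hub alignment is enough for zero-shot classification. Everything here is synthetic and hypothetical: the "text" and "audio" embeddings are simulated as noisy copies of shared image-space anchors (standing in for two encoders that were each trained only against images), and the noise scale 0.05 is an arbitrary choice.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def zero_shot_classify(query_emb, class_text_emb):
    """Assign each query to the class whose text embedding is most cosine-similar."""
    sims = l2_normalize(query_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)

rng = np.random.default_rng(0)
num_classes, dim, num_clips = 5, 64, 200

# Shared image-space anchors: one per class (the "hub").
image_anchor = l2_normalize(rng.normal(size=(num_classes, dim)))

# Text was aligned to images; audio was aligned to images. Never to each other.
text_emb = image_anchor + 0.05 * rng.normal(size=(num_classes, dim))
labels = rng.integers(0, num_classes, size=num_clips)
audio_emb = image_anchor[labels] + 0.05 * rng.normal(size=(num_clips, dim))

# Audio is classified against text prompts with no audio-text training pairs.
pred = zero_shot_classify(audio_emb, text_emb)
accuracy = (pred == labels).mean()
```

In this simulation the audio and text vectors land near the same anchors, so nearest-neighbor lookup across the two works even though no audio–text pair was ever used.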

Encoders. A Transformer for every modality (ViT for images; same encoder for video via temporally inflating the patch-projection layer and sampling 2 frames over 2s). Audio: 2s @16kHz → 128 mel-spectrogram bins, treated as a 2D image, ViT with patch 16 / stride 10. Thermal and depth are treated as one-channel images and ViT-encoded (depth converted to disparity maps for scale invariance). IMU: 5s of accelerometer+gyroscope across X,Y,Z → 2K-timestep readings projected with a 1D conv (kernel 8), then a Transformer. Each encoder gets a modality-specific linear projection head to a fixed d-dim embedding.
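The tokenization numbers above translate into sequence lengths roughly as follows. This is back-of-envelope arithmetic with two assumptions the notes do not state: the spectrogram is taken to have 204 time frames (this depends on the hop length), and the IMU 1D conv is taken to use a stride equal to its kernel size. The helper names are hypothetical.

```python
def num_patches_1d(length, patch, stride):
    """Count sliding windows of size `patch` with step `stride`, no padding."""
    if length < patch:
        return 0
    return (length - patch) // stride + 1

def num_patches_2d(height, width, patch, stride):
    """Count square patches over a 2D input (e.g. a spectrogram treated as an image)."""
    return num_patches_1d(height, patch, stride) * num_patches_1d(width, patch, stride)

# Audio: 128 mel bins (from the notes) x 204 time frames (ASSUMED),
# patch 16 with stride 10, so neighboring patches overlap by 6.
audio_tokens = num_patches_2d(height=128, width=204, patch=16, stride=10)

# IMU: "2K" timesteps read as 2000, kernel 8, stride 8 (ASSUMED non-overlapping).
imu_tokens = num_patches_1d(length=2000, patch=8, stride=8)
```

Under those assumptions the audio encoder sees 12 × 19 = 228 patches and the IMU encoder sees 250 tokens; a different hop length or conv stride changes both counts.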

Initialization. The image and text encoders are taken from a pretrained, frozen OpenCLIP (ViT-H 630M image / 302M text params); the audio, depth, thermal, and IMU encoders are trained from scratch to align to that fixed image space. The depth and thermal datasets (SUN RGB-D, LLVIP) are small, so they are replicated 50× for training.

Setup

Results

Headline: ImageBind sets a new state of the art on emergent zero-shot recognition across non-visual modalities, often matching or beating specialist supervised models — all without training on text pairs for those modalities.

Emergent zero-shot classification (Table 2, top-1 / Recall@1 / mAP as noted):

Method | AS-A (audio, mAP) | VGGS (audio) | ESC (audio) | SUN-D (depth) | NYU-D (depth) | LLVIP (thermal) | Ego4D (IMU)
Random | 0.62 | 0.32 | 5.26 | 5.26 | 10.0 | 2.75 | 0.9
ImageBind | 17.6 | 27.8 | 66.9 | 35.1 | 54.0 | 63.4 | 25.0
Text Paired (CLIP direct) | 28.4† | - | 68.6† | 25.4* | 41.9* | - | -

(† AudioCLIP-style, uses AS class names as supervision — not truly zero-shot; * OpenCLIP on grayscale-rendered depth.)

Where it wins and loses:

Applications. (1) Embedding-space arithmetic (Fig 4): adding an image embedding and an audio embedding retrieves images combining both concepts (fruit image + bird sound → birds among fruit). (2) Audio-promptable object detection (Fig 5): swap Detic's CLIP text "class" embeddings for ImageBind audio embeddings → detector localizes a dog from a bark, no retraining. (3) Audio-to-image generation (Fig 1): feed audio embeddings to a DALLE-2 decoder built for CLIP text embeddings. (4) Vision-model evaluation tool (Table 8): DINO beats DeiT on emergent multimodal binding, and binding quality is not correlated with ImageNet accuracy, so ImageBind measures a different axis of vision-model quality.

Ablations (Tables 5–7): a stronger (ViT-H vs ViT-B) image encoder lifts emergent zero-shot on all modalities (+7% depth, +4% audio), confirming the hub-quality thesis; fixed temperature, linear projection heads, longer training, spatially-aligned crops, and temporally-aligned audio all help.
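The embedding-space arithmetic in application (1) can be sketched as a sum-then-retrieve step. This is a toy with hand-built unit vectors, not the paper's pipeline: the equal-weight sum, the function name, and the four-item gallery are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def compose_and_retrieve(image_emb, audio_emb, gallery_emb, top_k=3):
    """Add two unit embeddings (equal weights, ASSUMED) and rank gallery items by cosine similarity."""
    query = l2_normalize(l2_normalize(image_emb) + l2_normalize(audio_emb))
    sims = l2_normalize(gallery_emb) @ query
    return np.argsort(-sims)[:top_k]

# Hypothetical concept directions in a shared 4-d space.
fruit = np.array([1.0, 0.0, 0.0, 0.0])
bird = np.array([0.0, 1.0, 0.0, 0.0])
other = np.array([0.0, 0.0, 1.0, 0.0])

# Gallery of image embeddings: fruit only, bird only, birds among fruit, unrelated.
gallery = np.stack([fruit, bird, l2_normalize(fruit + bird), other])

# Query: a fruit image embedding plus a bird-sound audio embedding.
ranking = compose_and_retrieve(fruit, bird, gallery, top_k=4)
best_match = int(ranking[0])
```

Here the combined query points halfway between the two concepts, so the gallery item containing both ranks first (index 2), ahead of either single-concept image.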

Limitations & open questions

From the authors (§6):

What I noticed reading it:

Why I care

Method anchor, not a manipulation paper. ImageBind is the canonical "bind any modality into one embedding space" result, and it's the direct ancestor of the touch/force/audio robot papers in this batch — UniTouch literally applies the ImageBind recipe with touch as a new spoke, and FuSe leans on image-as-hub binding to fuse heterogeneous sensors under language. So it earns a place here as the representation-learning template the embodied work specializes.

The connection to my thesis (relative to BLADE): many manipulation predicates — is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight — are not visually evaluable; they live in touch, force, and sound. ImageBind is the clearest existence proof that you can fold those non-visual channels into a single shared space using only naturally-paired data, without per-predicate text labels. The tantalizing version for BLADE: if a predicate classifier fθ(p) could read from a bound audio/tactile embedding rather than RGB pixels, the categorical abstraction layer would finally have access to the sensory channels where these predicates actually live.

But two caveats sharpen rather than soften the point: (1) ImageBind binds via image co-occurrence, and the predicates I care about are precisely the ones where the image is uninformative (you can't see torque) — so naive image-as-hub binding may be the wrong topology for force/touch; UniTouch and the tactile-foundation-model line are the papers testing that. (2) The binding is perceptual, not symbolic/causal, so it complements rather than replaces BLADE's structured precondition/effect reasoning. The interesting research question is the seam between them: a bound multisensory embedding feeding learned symbolic predicates.

Quotable

We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. — Abstract
We observe an emergent behavior in the embedding space that aligns two pairs of modalities (ℳ1, ℳ2) even though we only train using the pairs (ℬ, ℳ1) and (ℬ, ℳ2). — §3.2, Emergent alignment
The emergent capabilities improve with the strength of the image encoder ... and ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks. — Abstract

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: