LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, JiaXi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan · 2024 · ICLR 2024

One-liner. Instead of binding all modalities through images the way ImageBind does, LanguageBind freezes a CLIP text encoder and contrastively aligns each new modality (video, audio, depth, infrared) directly to language, arguing that the highly semantic language space is a better shared hub than pixels. It backs the argument with a five-modality, language-paired dataset (VIDAL-10M).

Problem & motivation

Vision-language (VL) pretraining (CLIP) works well, but extending it to N≥3 modalities is hard. ImageBind solves the scaling problem by binding everything indirectly through the image modality — but most downstream tasks (zero-shot retrieval, classification) ultimately require alignment to language, and routing through images introduces an extra hop that can degrade performance. LanguageBind's thesis: the language modality is "well-explored and contains rich semantics" (highest information density), so it should be the binding anchor, not images. A second bottleneck is data: existing multi-modal datasets either pair only vision+text or use truncated segments from long videos with fragmented semantics, and no large-scale dataset directly aligns depth/infrared/audio to language. The paper attacks both the method and the data gap.

Method

LanguageBind has three parts (Fig 3): (a) multi-modal encoders, (b) a frozen language encoder, and (c) language-anchored multi-modal joint learning.

Frozen language anchor. The language encoder is a 12-layer, 768-dim transformer initialized from OpenCLIP and kept frozen throughout. Each text is BPE-tokenized and encoded to a logit y ∈ R^(L×C). Because the anchor never moves, every modality is forced into the pre-existing, semantically-rich CLIP text space.

Per-modality encoders. All non-language modalities use a 24-layer, 1024-dim ViT (patch size 14) initialized from OpenCLIP-large. Modalities are coerced into image-like tensors: depth and infrared are treated as RGB, with the single channel replicated 3× along the channel dimension; audio is converted to a 10-second spectrogram (128 mel-bins). Clips shorter than 10 s are repeated and padded, while for longer clips three 10-s segments (one each from the front, middle, and back third) are sampled and stacked. This "modality extending" recipe (process the input into a token sequence, then initialize from OpenCLIP) is what makes adding the N-th modality cheap.
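The coercion step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' preprocessing code: the function names are hypothetical, short clips are handled by tiling and cropping (the notes say only "repeated and padded"), a short clip's single segment is copied three times to match the stacked shape, and the window start inside each third is drawn uniformly.

```python
import numpy as np


def replicate_channels(single_channel: np.ndarray) -> np.ndarray:
    """Turn an (H, W) depth or infrared map into an (H, W, 3) RGB-like array."""
    if single_channel.ndim != 2:
        raise ValueError("expected a 2-D (H, W) map")
    return np.repeat(single_channel[:, :, None], 3, axis=2)


def audio_segments(spec: np.ndarray, seg_frames: int,
                   rng: np.random.Generator) -> np.ndarray:
    """Coerce an (n_mels, T) spectrogram into a (3, n_mels, seg_frames) stack."""
    n_mels, t = spec.shape
    if t == 0:
        raise ValueError("empty spectrogram")
    if t <= seg_frames:
        # Assumption: repeat the clip along time, crop to length,
        # and reuse the same segment for all three slots.
        reps = -(-seg_frames // t)  # ceiling division
        seg = np.tile(spec, (1, reps))[:, :seg_frames]
        return np.stack([seg, seg, seg])
    segments = []
    for k in range(3):
        lo = k * t // 3            # first frame of the k-th third
        end = (k + 1) * t // 3     # one past the last frame of that third
        hi = max(lo, end - seg_frames)
        start = int(rng.integers(lo, hi + 1))
        start = min(start, t - seg_frames)  # keep the window inside the clip
        segments.append(spec[:, start:start + seg_frames])
    return np.stack(segments)
```

Either way the output has three image-like planes, which is what lets a ViT initialized from an RGB checkpoint consume it without architectural changes.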

Patch masking + LoRA fine-tuning for efficiency. Rather than fully fine-tune each ViT, they (i) mask tokens MAE-style (only a visible subset x = {m'_i + P_i} for i ∈ M_v is processed; a mask ratio of 0.5 is optimal, cutting compute to ~1/4), and (ii) apply LoRA: the frozen weight W_0 gets a low-rank update h(x) = W_0 x + BA x, with B ∈ R^(d×r), A ∈ R^(r×k). Smaller LoRA rank works best (less overfitting). This is the key training-efficiency move: the language hub stays frozen, and each modality is adapted with minimal new parameters.
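A minimal NumPy sketch of the two efficiency tricks, assuming row-vector inputs and the common LoRA initialization (B at zero, A small random), which these notes do not confirm for LanguageBind. `keep_visible` and `LoRALinear` are hypothetical names, and no training loop is shown.

```python
import numpy as np


def keep_visible(tokens: np.ndarray, mask_ratio: float,
                 rng: np.random.Generator):
    """Keep a random (1 - mask_ratio) subset of patch tokens, MAE-style.

    tokens: (num_patches, dim). Returns the kept tokens and their indices.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[idx], idx


class LoRALinear:
    """h(x) = W_0 x + B A x with W_0 frozen and only A, B trainable."""

    def __init__(self, w0: np.ndarray, rank: int, rng: np.random.Generator):
        d, k = w0.shape
        self.w0 = w0                                  # frozen, shape (d, k)
        self.B = np.zeros((d, rank))                  # assumption: zero init
        self.A = rng.normal(0.0, 0.01, (rank, k))     # assumption: small init

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (..., k); row-vector form of W_0 x + B A x.
        return x @ self.w0.T + (x @ self.A.T) @ self.B.T

    def trainable_params(self) -> int:
        return self.A.size + self.B.size
```

With a d×k weight and rank r, the trainable count is r(d + k) instead of dk, which is where the parameter savings come from; with B at zero the layer initially reproduces the frozen projection exactly.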

Contrastive language-binding objective. Each modality M is bound to text T with a symmetric InfoNCE loss (Eq. 3): a modality-to-text term L_M2T and a text-to-modality term L_T2M, both computed on normalized features and scaled by a temperature τ (learnable, initialized at 0.07, which beats ImageBind's fixed temperature). Crucially, modalities are aligned to language pairwise and independently, with no image intermediary. Because they all share the same frozen text anchor, they still align to each other implicitly, which enables emergent cross-modal retrieval (e.g. depth→infrared) on pairs never seen in training.
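The two loss terms can be written out as a short NumPy sketch. It assumes a batch where row i of the modality features is paired with row i of the text features, and it treats τ as a plain float rather than a learned parameter. The function names are illustrative, and how the two terms are combined (sum or average) is left to the caller since these notes do not spell it out.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(z: np.ndarray, axis: int) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def symmetric_info_nce(mod_feats: np.ndarray, txt_feats: np.ndarray,
                       tau: float = 0.07):
    """Return (L_M2T, L_T2M) for a batch of paired features of shape (K, dim)."""
    m = l2_normalize(mod_feats)
    t = l2_normalize(txt_feats)
    logits = m @ t.T / tau               # (K, K) scaled cosine similarities
    diag = np.arange(logits.shape[0])    # matched pairs sit on the diagonal
    loss_m2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2m = -log_softmax(logits, axis=0)[diag, diag].mean()
    return loss_m2t, loss_t2m
```

In the setup described above only the modality branch would receive gradients, since the text features come from the frozen encoder.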

VIDAL-10M dataset. To supply direct language pairs for the non-vision modalities, they build VIDAL-10M: 10M data points spanning Video, Infrared, Depth, Audio and their Language (3M VL, 3M IL, 3M DL, 1M AL). Construction (Fig 4): (1) build a 100k balanced search-term database by POS-tagging labels/captions from YouTube-8M, MSR-VTT, COCO, AVA, HMDB-51, ImageNet; (2) collect short videos (≤20 s, ≥2-word titles) from YouTube Shorts and audio from Freesound, filtered by ratings/comments/tags; (3) generate the extra modalities — depth via GLPN, infrared via an sRGB-TIR model — and enhance the language side with multi-view text (title + hashtags + OFA keyframe captions + mPLUG-owl video captions, all refined by ChatGPT, Fig 5). VIDAL-10M is claimed as the first large-scale multi-modal dataset with depth and infrared aligned to language.

Setup

Results

Headline: LanguageBind beats ImageBind across the board by routing through language instead of images, with especially large margins on the non-RGB modalities. Zero-shot X-Language classification (Table 4, top-1 unless noted):

Method                      K400 (video)  LLVIP (infrared)  NYU-D (depth)  ESC-50 (audio)
ImageBind (Huge)            50.0          63.4              54.0           66.9
OpenCLIP (Large)            60.7          82.2              45.4           -
LanguageBind (Large)        64.0          87.2              65.1           91.8
LanguageBind* (full-tune)   -             -                 -              94.0

Against ImageBind. Margins from Table 4: +23.8 points on LLVIP infrared, +11.1 on NYU-D depth, +24.9 on ESC-50 audio, and +14.0 on video (K400).

Zero-shot video-text retrieval (Table 2). SOTA on four datasets, e.g. MSR-VTT R@1 42.8 (10M) / 44.8 (Huge), beating InternVideo by 1.9% on MSR-VTT, 8.8% on MSVD, 6.3% on DiDeMo, and 4.4% on ActivityNet.

Audio-language retrieval (Table 5). On Clotho, R@1 is 12.1 (16.7 full-tune) vs. 6.0 for ImageBind and 8.4 for VALOR.

Emergent zero-shot retrieval (Table 7). Results on never-trained cross-modal directions are mixed. ImageBind scores higher on RGB→A (AVE R@1 36.9 vs. 10.6 for LanguageBind; VGGS 28.7 vs. 10.0), but adding a second query modality helps LanguageBind (NYU D+RGB→T 77.4, +1.4 over D→T alone; LLVIP I+RGB→T 79.3, +16.9 over RGB→T).

Ablations (Table 8). LoRA beats full tuning on LLVIP/FLIRv1/Clotho while using ~1/2 the time and <1/2 the memory; a 0.5 mask ratio is best; a learnable temperature beats a fixed one.

Where it loses. On the two emergent RGB→Audio directions (AVE, VGGS) ImageBind actually scores higher (e.g. AVE R@1 36.9 vs. 10.6). This is unsurprising, since ImageBind is image-anchored and those queries originate from images. On plain video R@5/R@10 some non-CLIP baselines remain competitive.

Limitations & open questions

(a) Author-stated. Depth and infrared in VIDAL-10M are model-generated (GLPN / sRGB-TIR) rather than sensor-captured; the authors concede "some limitations may exist" but argue diversity reduces model bias. They note the focus is on depth/infrared as "visual and spatial" modalities, leaving truly non-visual sensing (true thermal, contact, IMU) underexplored. ImageBind covers six modalities including IMU; LanguageBind demonstrates five.

(b) What I noticed reading it. (i) The "direct language alignment beats image alignment" claim is partly confounded with scale and data: LanguageBind also introduces a brand-new, language-rich, ChatGPT-enhanced dataset, so it's hard to separate the architecture win (frozen-language anchor) from the data win (VIDAL-10M's better captions). The CLIP4Clip dataset ablation (Table 3) helps but only isolates VL, not the depth/audio claims. (ii) The generated-modality concern is real for robotics: a depth map predicted by GLPN from an internet video is not a sensor reading and may bake in the depth model's failure modes; the alignment is to synthetic depth. (iii) All "modalities" are coerced into the image ViT (depth/IR replicated to 3 channels, audio → spectrogram image) — so this is really still a vision backbone, not a native encoder per sensor; whether that survives genuinely non-image signals (force, raw audio waveforms, tactile gel images aside) is untested. (iv) The emergent cross-modal numbers, while a nice story, are weak in absolute terms (single-digit to low-teens R@1) and lose to ImageBind on the directions it was built for.

Why I care

This is an off-genre method anchor, not a manipulation paper — no robot, no policy, no contact. Its relevance to my thesis is structural and conditional. The big idea I'm chasing is that many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) are not visually evaluable — they live in touch, force, and sound. LanguageBind matters iff language can serve as the binding hub across those sensors the way it does here for depth/IR/audio. That's an attractive hypothesis for BLADE-style predicate learning: if a tactile or acoustic encoder is contrastively bound to a frozen language anchor, a predicate like is_inserted could in principle be read out as a language-space classifier over a non-visual embedding — exactly the gap where vision-only predicate classifiers fail. The contrast with ImageBind (image hub) is the live design question for any sensor-fusion stack: bind sensors through vision, or through language? LanguageBind argues language, because downstream tasks are language-shaped — and robot goals/instructions are too. The honest caveat: this paper only demonstrates the claim on image-coercible, model-generated modalities (depth/IR as fake-RGB, audio as spectrogram-image), so it's evidence the recipe could extend to touch/contact-audio, not proof. Adjacent, not load-bearing, but it sets the template the touch-language papers (UniTouch, TVL) build on.
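To make the "language-space classifier over a non-visual embedding" idea concrete, here is a toy sketch of a zero-shot predicate readout. Everything in it is hypothetical: the function name, the prompt wording, and above all the existence of a tactile or acoustic encoder already bound to the text space. The embeddings in the usage example are hand-made vectors, not model outputs.

```python
import numpy as np


def zero_shot_predicate(sensor_emb: np.ndarray, prompt_embs: np.ndarray,
                        labels: list, tau: float = 0.07):
    """Score one sensor embedding against one text-prompt embedding per label.

    sensor_emb: (dim,), prompt_embs: (num_labels, dim).
    Returns the best label and a softmax over scaled cosine similarities.
    """
    s = sensor_emb / np.linalg.norm(sensor_emb)
    p = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    logits = p @ s / tau
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return labels[int(np.argmax(probs))], probs


# Toy usage with made-up 2-D embeddings standing in for encoder outputs.
labels = ["the peg is inserted", "the peg is not inserted"]
prompt_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
sensor_emb = np.array([0.9, 0.1])
label, probs = zero_shot_predicate(sensor_emb, prompt_embs, labels)
```

The readout needs no predicate-specific training, which is the appeal; whether real touch or force embeddings separate such prompts is exactly the open question.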

Quotable

"We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics." — Abstract / p.1
"In our work, we propose LanguageBind, a direct alignment mechanism designed to align alternative modalities directly with the language modality, which has the highest information density." — §2 Related Work / p.3

Related

(a) Cited here, worth ingesting next:

(b) Newly ingested in 2026-06-24 batch — directly relevant: