AudioCLIP: Extending CLIP to Image, Text and Audio

Andrey Guzhov, Federico Raue, Jörn Hees, Andreas Dengel · DFKI GmbH / TU Kaiserslautern · 2021 (arXiv; later ICASSP 2022) · arXiv:2106.13043

One-liner. AudioCLIP bolts an audio encoder (ESResNeXt) onto a frozen-then-fine-tuned CLIP text+image model and trains all three heads contrastively on AudioSet's audio/frame/label triples, producing a single tri-modal embedding space that lets you query across text, image, and audio in any direction and do zero-shot environmental-sound classification.

Problem & motivation

Audio classification had benefited from importing visual-domain models, but multimodal audio work almost always combined at most two modalities and handled them sequentially, one modality at a time, rather than jointly. Meanwhile, labeled audio data is scarce, which had pushed the field toward zero-/few-shot contrastive methods conditioned on textual descriptions. The paper's bet: CLIP already aligns text and image in a shared space with strong domain-transfer ("zero-shot") behavior, so adding audio as a third, first-class modality should (i) extend CLIP's zero-shot generalization to audio and (ii) enable cross-modal querying in any combination of the three. The glue that makes this possible is AudioSet, whose YouTube snippets supply audio, video frames, and class labels simultaneously.

Method

The architecture (Fig. 1) consists of three encoder heads feeding one shared embedding space of size 1024 (the embedding dimension of CLIP's ResNet-50 variant).

Text and image heads = CLIP. A 12-layer Transformer text encoder (BPE vocab 49 408, sequence length clipped at 76) and a modified ResNet-50 image encoder (global average pool replaced by a QKV-attention layer). The ResNet variant is chosen over ViT for lower compute. These were pre-trained jointly on CLIP's ~400M text-image pairs and are used here as weight initializers.

Audio head = ESResNeXt. A ResNeXt-50 backbone fronted by a trainable time-frequency transformation built on complex frequency B-spline wavelets (~30M params). It natively handles multi-channel audio and is robust to additive white Gaussian noise and sample-rate decrease. It is initialized from ImageNet weights, then pre-trained on AudioSet.

Loss. CLIP's symmetric cross-entropy similarity loss is extended from one term (text↔image) to three: text↔image, text↔audio, image↔audio. All three modalities (or any pair) can be processed simultaneously.
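A minimal NumPy sketch of the three-term loss described above. It assumes one positive per row within a batch, a single fixed temperature (`scale`), and equal weighting of the three pairwise terms; none of these details are confirmed by these notes, and the function names are illustrative rather than taken from the AudioCLIP code base.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Unit-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def symmetric_ce(a, b, scale=1.0):
    """CLIP-style symmetric cross-entropy between two embedding batches.

    Row i of `a` is assumed to be the only positive match for row i of `b`.
    """
    logits = scale * l2_normalize(a) @ l2_normalize(b).T  # (N, N)
    idx = np.arange(len(a))
    loss_ab = -_log_softmax(logits, axis=1)[idx, idx].mean()  # a queries b
    loss_ba = -_log_softmax(logits, axis=0)[idx, idx].mean()  # b queries a
    return 0.5 * (loss_ab + loss_ba)

def trimodal_loss(text, image, audio, scale=1.0):
    # Sum of the three pairwise terms; equal weighting is an assumption.
    return (symmetric_ce(text, image, scale)
            + symmetric_ce(text, audio, scale)
            + symmetric_ce(image, audio, scale))
```

Dropping one argument pair reduces this to CLIP's original single-term loss, which matches the note that any pair of modalities can be processed on its own.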

Staged training. The recipe is deliberately multi-step:

(1) Audio-head standalone pre-training on AudioSet (ImageNet-init ESResNeXt, trained for more epochs than the original — 30 vs 5).
(2) Cooperative audio-head pre-training: drop the classification layer for a randomly-initialized embedding layer, then train the audio head jointly with the frozen text and image heads — i.e., a multi-modal knowledge-distillation setup where text/image heads are teachers.
(3) AudioCLIP full training: unfreeze and tune all three heads together on AudioSet (audio snippets + their video frames + class names) so the model also absorbs AudioSet's image and text distributions, not just CLIP's.
(4) Downstream fine-tuning: tune the audio head on UrbanSound8K and ESC-50 in a bimodal (audio+text) fashion, one class label per sample.
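The four stages above differ mainly in which heads take part and which of those are updated. The sketch below records that schedule as plain data; the `Stage` class, the stage names, and the `active`/`trainable` split are hypothetical bookkeeping for these notes, not structures from the paper's code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    active: frozenset     # heads that take part in the forward pass
    trainable: frozenset  # heads whose weights receive gradient updates

    @property
    def frozen(self) -> frozenset:
        # Heads that are used but not updated (the "teachers" in stage 2).
        return self.active - self.trainable

ALL = frozenset({"text", "image", "audio"})

STAGES = (
    # (1) standalone audio pre-training: text and image heads are not involved
    Stage("audio_pretrain", "AudioSet", frozenset({"audio"}), frozenset({"audio"})),
    # (2) cooperative pre-training: frozen text/image heads act as teachers
    Stage("cooperative_pretrain", "AudioSet", ALL, frozenset({"audio"})),
    # (3) full training: all three heads are unfrozen and tuned together
    Stage("full_training", "AudioSet", ALL, ALL),
    # (4) downstream: bimodal audio+text, only the audio head is tuned
    Stage("downstream_finetune", "UrbanSound8K / ESC-50",
          frozenset({"audio", "text"}), frozenset({"audio"})),
)

for stage in STAGES:
    print(f"{stage.name}: train={sorted(stage.trainable)} frozen={sorted(stage.frozen)}")
```

In a real training loop the `trainable` set would decide which parameter groups are handed to the optimizer; that wiring is left out here.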

Augmentations for the small audio datasets: time scaling (joint duration+pitch, factors in [−1.5, 1.5]), time inversion (p=0.5), random crop/padding, and AWGN (SNR 10–120 dB, p=0.25). Optimizer SGD with Nesterov momentum 0.9, weight decay 5×10⁻⁴, batch 64.
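Three of the four augmentations are simple enough to sketch on a mono waveform with NumPy. This is a reading aid under stated assumptions: the SNR is drawn uniformly from the quoted range (the sampling distribution is not given in these notes), padding is zero-padding split at a random offset, and time scaling is omitted because it needs a resampler. The function names are illustrative.

```python
import numpy as np

def time_inversion(x, rng, p=0.5):
    # Reverse the waveform in time with probability p.
    return x[::-1].copy() if rng.random() < p else x

def crop_or_pad(x, target_len, rng):
    # Random crop if the clip is too long, random zero-padding if too short.
    n = len(x)
    if n >= target_len:
        start = rng.integers(0, n - target_len + 1)
        return x[start:start + target_len]
    pad = target_len - n
    left = rng.integers(0, pad + 1)
    return np.pad(x, (left, pad - left))

def add_awgn(x, rng, snr_db_range=(10.0, 120.0), p=0.25):
    # Add white Gaussian noise at a randomly chosen SNR with probability p.
    if rng.random() >= p:
        return x
    snr_db = rng.uniform(*snr_db_range)  # uniform sampling is an assumption
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def augment(x, target_len, rng):
    # One possible ordering of the steps; the paper's order is not stated here.
    x = time_inversion(x, rng)
    x = crop_or_pad(x, target_len, rng)
    return add_awgn(x, rng)
```

At the upper end of the range (120 dB) the added noise is negligible, so the augmentation mostly matters for draws near 10 dB.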

Setup

Datasets / benchmarks: AudioSet (~1.8M train, ~20k eval, 527 classes; audio + extracted video frames) as the tri-modal training glue; ImageNet as image zero-shot target; UrbanSound8K (8732 tracks, 10 classes, 10 folds) and ESC-50 (2000 tracks, 50 classes, 5 folds) as audio classification targets. CLIP's ~400M text-image dataset used only indirectly (weight init).
Hardware / simulator: not reported (no GPU type / count given; no simulator — this is a non-robot ML paper).
Baselines: Audio-classification SOTA — ESResNet [9], WEANET, DenseNet-201 ensemble [17], ESResNeXt [10], AST [8], ERANN [30]; zero-shot audio baselines VGGish+Word2Vec(+GloVe) [32,33]; Human [19] on ESC-50; and AudioCLIP's own audio-head (ESResNeXt, their training) as an internal baseline (Table 3).
Compute: not reported (epoch counts given — 30 for AudioSet training, 50 for downstream — but no wall-clock or hardware).

Results

Headline: AudioCLIP (full training) sets a new SOTA for environmental-sound classification, with 90.07% accuracy on UrbanSound8K and 97.15% on ESC-50, and new zero-shot baselines of 68.78% (US8K) and 69.40% (ESC-50, partial training); the latter beats a conventionally trained baseline CNN (64.50%).

Model (Table 3)	US8K acc%	ESC-50 acc%
Human [19]	–	81.30
ESResNeXt (2021) [10]	89.14	95.20
AST (2021) [8]	–	95.60
ERANN (2021) [30]	–	96.10
Audio-Head (ESResNeXt, their training)	89.49	95.90
AudioCLIP (partial training)	89.95	96.65
AudioCLIP (full training)	90.07	97.15
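The zero-shot figures in the headline rest on CLIP-style inference: class names are embedded by the text head, a clip is embedded by the audio head, and the nearest class in the shared space wins. A minimal sketch, assuming the embeddings are already computed (the encoders themselves are not modeled) and that plain cosine similarity with argmax is the decision rule; the function name is hypothetical.

```python
import numpy as np

def zero_shot_classify(audio_emb, class_text_emb):
    """Assign each audio embedding to the class whose text embedding is closest.

    audio_emb:      (N, D) array, one row per audio clip
    class_text_emb: (C, D) array, one row per class name
    Returns predicted class indices (N,) and the (N, C) cosine similarities.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    sims = a @ t.T
    return sims.argmax(axis=1), sims
```

No classifier layer is trained on the target dataset in this setting; swapping in a different label set only changes the rows of `class_text_emb`.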

Where each component helps (Table 2, ablation of partial vs full AudioSet training). Full tri-modal training improves AudioSet mAP (audio 25.85→28.36; both-modality 25.11→32.38) and the downstream audio accuracies. Where it loses: full training hurts image-only performance — ImageNet accuracy drops 40.51→21.79 and AudioSet image-mAP behavior is mixed — because tuning the image head on AudioSet frames pulls it away from CLIP's natural image distribution. Extended audio-head pre-training alone (30 vs 5 epochs, Table 1) already lifts AudioSet mAP 28.17→34.14 and downstream US8K/ESC-50 to 89.49/95.90.

Cross-modal querying (Table 4). Audio-by-text on ESC-50 reaches P@1 51.78 / mAP ~77; text-by-image (ImageNet) audio-head P@1 5.42 / R@1 84.15 / mAP 52.91. Full training generally improves AudioSet and downstream querying mAP but degrades ImageNet querying (again the distribution-shift cost). The qualitative point stands: a single model retrieves across text↔image↔audio in any direction.
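For reference, the querying metrics can be sketched with their textbook definitions over a ranked list of relevance flags. This assumes cosine-similarity ranking and the standard P@k, R@k, and average-precision formulas; the paper's exact protocol is not spelled out in these notes and may differ. Under these definitions R@1 cannot exceed P@1 for any query, so Table 4 rows where it does presumably follow a different convention.

```python
import numpy as np

def rank_gallery(query, gallery):
    # Indices of gallery items sorted by decreasing cosine similarity to the query.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

def precision_at_k(rel, k):
    # Fraction of the top-k results that are relevant.
    rel = np.asarray(rel, dtype=float)
    return float(rel[:k].sum() / k)

def recall_at_k(rel, k):
    # Fraction of all relevant items that appear in the top-k results.
    rel = np.asarray(rel, dtype=float)
    total = rel.sum()
    return float(rel[:k].sum() / total) if total > 0 else 0.0

def average_precision(rel):
    # Mean of the precision values measured at each relevant rank.
    rel = np.asarray(rel, dtype=bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    return float(np.mean(hits[rel] / ranks[rel]))

def mean_average_precision(rel_lists):
    # mAP: average precision averaged over all queries.
    return float(np.mean([average_precision(r) for r in rel_lists]))
```

Here `rel` is the relevance of each gallery item in ranked order, for example `labels[rank_gallery(q, gallery)] == query_label` when each item carries a single class label.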

Limitations & open questions

From the authors:

Evaluated on a narrow set of datasets/tasks; they flag wider evaluation as future work.
Backbones are mid-tier (ResNet-50 image head, ESResNeXt audio head); they note stronger backbones could improve results.
Implicitly: full tri-modal training trades image-domain performance for audio gains (the ImageNet drop), an unresolved tension.

What I noticed reading it:

The "zero-shot" framing is weak. AudioCLIP is trained on AudioSet (527 classes) and then "zero-shot" evaluated on ESC-50 / US8K, whose class vocabularies overlap heavily with everyday-sound AudioSet labels. This is closer to transfer with label overlap than to true open-vocabulary zero-shot; the paper leans on CLIP's terminology without quantifying label leakage.
Audio is first bound through frozen vision/text teachers (stage 2), not learned jointly from scratch. The shared space is anchored to CLIP's geometry; the audio modality is fit into a pre-existing text-image manifold, and the later unfreezing in stage 3 only adjusts that starting point. Whether the resulting audio embedding captures acoustic structure that text/image can't express (timbre, onset, material resonance) is never probed; only label-classification accuracy is.
No ablation isolating the three loss terms. We see partial (audio-head only) vs full, but not, e.g., text↔audio alone vs +image↔audio. So the marginal value of the image modality to audio learning is unquantified.
AudioSet supervision is coarse. Frames are sampled from 10s YouTube clips; the audio-frame correspondence is weak (the visible frame need not depict the sound source). This noisy pairing likely caps how tight the image↔audio alignment can get.

Why I care

Adjacent / method anchor — not a manipulation paper. There is no robot, no policy, no contact here; AudioCLIP is an audio-language representation-learning paper. I'm filing it as the canonical audio↔language alignment method anchor for the batch, the audio analogue of how BLADE relies on language to name predicates.

The tangential relevance to my thesis — that many manipulation predicates (is_inserted, is_full, is_screwed_tight, surface_is_rough) live in touch/force/sound rather than in vision — is this: if a predicate like is_pouring or cup_is_full is most legible in the audio stream, then a CLIP-style contrastive recipe that binds audio to language is exactly the mechanism that could let an LLM-named predicate (BLADE-style) be grounded in an acoustic classifier rather than a visual one. AudioCLIP is the upstream representation; the robot papers in cluster D (ManiWAV, SonicSense, Making Sense of Audio Vibration) are where that grounding gets cashed out on contact-rich tasks. AudioCLIP itself does none of this — I should not overstate it. It is a building block, not a manipulation result.

Concretely worth tracking: AudioCLIP predates and is methodologically parallel to CLAP (cluster E), and is the audio precursor to the all-modality binding papers ImageBind and LanguageBind (cluster H) — AudioCLIP binds three modalities by extending CLIP's loss; ImageBind generalizes the trick to six by binding everything to images.

Quotable

Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. — Abstract / p.1

The joint use of three modalities during the training results in out-performance of previous models … extends zero-shot capabilities of the base architecture to the audio modality and introduces an ability to perform cross-modal querying using text, image and audio in any combination. — §1 Introduction / p.2

Parameters of the two other subnetworks, namely text- and image-head, were frozen during the cooperative pre-training of the audio encoding head, thus, these heads served as teachers in a multi-modal knowledge distillation setup. — §4.3 Training / p.7

Papers cited that should likely be ingested next:

Radford et al. 2021 — CLIP [21] (Learning Transferable Visual Models from Natural Language Supervision) — the base model AudioCLIP extends; foundational dependency. Not in this batch's cross-ref list, but the obvious upstream anchor.
Guzhov et al. 2021 — ESResNe(X)t-fbsp [10] (IJCNN) — the audio encoder; direct predecessor by the same authors.
Gemmeke et al. 2017 — AudioSet [7] (ICASSP) — the tri-modal training dataset; the "glue" enabling the whole approach.
Xie & Virtanen 2019/2021 [32,33] — zero-shot audio classification via class-label / semantic embeddings; the prior zero-shot-audio line AudioCLIP compares against.

Newly ingested in the 2026-06-24 batch — directly relevant:

CLAP — the parallel audio↔language contrastive model; closest peer (audio-text vs AudioCLIP's audio-text-image), same cluster E.
ImageBind — generalizes AudioCLIP's "extend CLIP's contrastive loss to a new modality" trick to six modalities bound through images; the natural successor.
LanguageBind — same binding idea but anchored on language rather than images; AudioCLIP is a two-extra-loss-term special case.
Meta-Transformer — unified-encoder alternative to per-modality heads; contrasts with AudioCLIP's separate ESResNeXt/ResNet/Transformer design.
The Sound of Pixels, Objects that Sound, Visually Indicated Sounds — the earlier audio-visual self-supervision line (cluster E) that AudioCLIP's contrastive-binding approach supersedes for the classification/querying use case.