CLAP: Learning Audio Concepts From Natural Language Supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang · Microsoft · arXiv 2022 (later ICASSP 2023) · arXiv:2206.04769

One-liner. CLAP is "CLIP for audio": train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio–caption pairs to build a joint embedding space, which then does zero-shot audio classification for any user-typed class name and sets zero-shot SoTA on sound-event tasks — the canonical method anchor for grounding acoustic events in language.

Problem & motivation

Mainstream audio models follow the "one class label → many recordings, one task" supervised paradigm: they require labeled audio and can only predict a fixed, predefined set of categories. Self-supervised audio pretraining removes the label dependence but discards the semantic knowledge carried in natural language, and still bolts on a static, fixed-vocabulary classifier head downstream. Computer vision had already shown a third path — learn representations from natural-language supervision (CLIP, Florence, ALIGN) for flexible zero-shot prediction — but in audio the closest attempts (Wav2CLIP, AudioCLIP) distilled from CLIP and trained on audio + class labels from AudioSet rather than on audio + free-form language. CLAP asks whether learning directly from audio–text caption pairs buys the same open-vocabulary flexibility and generalization for sound.

Method

CLAP (Fig 1) is a two-tower contrastive model, structurally a direct audio analogue of CLIP.

Encoders. An audio encoder f_a maps a processed log-Mel spectrogram X_a ∈ R^F×T (F Mel bins, T time bins) to an audio representation; a text encoder f_t maps the caption X_t to a text representation. Concretely the audio encoder is CNN14 (from PANNs, 80.8M params, embedding size 2048, pretrained on 2M AudioSet clips) and the text encoder is BERT-base-uncased (110M params), using the [CLS] token of the final layer (size 768) as the text embedding.

Joint space + contrastive loss. Two learnable linear projections L_a, L_t map both towers into a shared d=1024-dim multimodal space, giving embeddings E_a, E_t. For a batch of N pairs the similarity matrix is C = τ · (E_t · E_a^⊤) with N correct pairs on the diagonal and N²−N negatives off-diagonal. Training uses a symmetric cross-entropy loss L = 0.5·(ℓ_text(C) + ℓ_audio(C)) — softmax-over-rows and softmax-over-columns — jointly optimizing both encoders and both projections. The temperature τ is learnable (init 0.007), and τ-scaled logits are clipped at 100 for stability.
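To make the loss concrete for myself, here is a minimal NumPy sketch of the symmetric objective described above. It is not the authors' code. The function name `clap_style_loss`, the `scale` argument (standing in for τ), the L2 normalization of the projected embeddings, and the elementwise reading of "clipped at 100" are all my assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clap_style_loss(audio_feats, text_feats, W_a, W_t, scale, max_logit=100.0):
    """Symmetric cross-entropy over an N x N text-by-audio similarity matrix.

    audio_feats: (N, d_audio) audio-encoder outputs.
    text_feats:  (N, d_text) text-encoder outputs.
    W_a, W_t:    linear projections into the shared space (L_a, L_t in the notes).
    scale:       multiplier on the similarities (tau in the notes).
    """
    E_a = audio_feats @ W_a
    E_t = text_feats @ W_t
    # Assumption: embeddings are L2-normalized so the dot product is a cosine.
    E_a = E_a / np.linalg.norm(E_a, axis=1, keepdims=True)
    E_t = E_t / np.linalg.norm(E_t, axis=1, keepdims=True)
    # Assumption: "clipped at 100" is read as an elementwise clip on the logits.
    C = np.clip(scale * (E_t @ E_a.T), -max_logit, max_logit)
    idx = np.arange(C.shape[0])                        # matched pairs sit on the diagonal
    loss_text = -log_softmax(C, axis=1)[idx, idx].mean()   # softmax over rows
    loss_audio = -log_softmax(C, axis=0)[idx, idx].mean()  # softmax over columns
    return 0.5 * (loss_text + loss_audio)
```

A quick sanity check on the sketch: with `scale=0` every logit is zero, so the loss equals log N regardless of the embeddings, and with perfectly aligned one-hot embeddings and a large scale it approaches zero.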

Zero-shot classification (§2.2). At test time, embed the N test audios and the C class names (here C counts classes; it is not the similarity matrix above) through the frozen towers + projections; because both live in the same space, compute cosine similarity between each audio and every class name, then apply softmax (binary/multiclass) or sigmoid (multilabel) over the logits. There is no training and no fixed label set: classes are typed in at inference. Class names are wrapped in a prompt template, by default 'This is a sound of [class label]', to close the gap between caption-style training text and bare-word labels (with per-domain variants for emotion / keyword / speaker-counting).
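The inference path above can be sketched in a few lines of NumPy. This assumes the audio and class-prompt embeddings have already been produced by the frozen towers and projections; the helper names `build_prompts` and `zero_shot_scores` are hypothetical, and the sketch applies softmax or sigmoid directly to unscaled cosine similarities, which may differ from whatever scaling the released model uses.

```python
import numpy as np

PROMPT = "This is a sound of {}"  # default template quoted in the notes

def build_prompts(class_names, template=PROMPT):
    # Wrap each bare class name in the caption-style template.
    return [template.format(name) for name in class_names]

def zero_shot_scores(audio_emb, class_emb, multilabel=False):
    """Score every audio clip against every class prompt.

    audio_emb: (n_audio, d) projected audio embeddings.
    class_emb: (n_classes, d) projected embeddings of the prompted class names.
    Returns an (n_audio, n_classes) array of probabilities.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    c = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    logits = a @ c.T                                   # cosine similarities
    if multilabel:
        return 1.0 / (1.0 + np.exp(-logits))           # independent sigmoid per class
    z = logits - logits.max(axis=1, keepdims=True)     # stable softmax across classes
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)
```

Changing the label set only means calling `build_prompts` on a new list and re-embedding the text; nothing on the audio side is retrained.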

Setup

Datasets / benchmarks: Training uses 128,010 audio–text pairs assembled from four sources (Table 4): FSD50K (36,796), ClothoV2 (29,646; 5 captions/clip), AudioCaps (44,292), MACS (17,276); 90,947 unique audios. Evaluation covers 16 downstream datasets across 8 domains (Table 5): sound-event classification (ESC50, FSD50K, US8K, DCASE17 Task4, AudioSet), music (GTZAN Music/Speech, GTZAN Genres, Mridangam Stroke/Tonic), instrument (Beijing Opera), acoustic scene (TUT2017), emotion (CREMA-D, RAVDESS), keyword spotting (Speech Commands V2), vocal-sound classification (Vocal Sound), speaker counting (LibriCount).
Hardware / simulator: 16GB V100 GPUs, scaling 8–24 GPUs; no robot or simulator (this is a pure audio-ML paper). Audio pre-processed to log-Mel (44.1 kHz, 64 Mel bins, 5-sec random-crop clips).
Baselines: Zero-shot — AudioCLIP and Wav2CLIP (the prior CLIP-distillation audio models), plus per-task literature best (`Benchmark (ZS)` / `Benchmark (Best)`). Supervised feature-extraction comparison (Table 6) against YAMNet, OpenL3, Wav2CLIP, PANN, Wav2Vec2.
Compute: 40 training epochs, Adam, lr 10⁻³ with plateau decay; batch sizes 32–768 across 4–24 GPUs (Fig 2 sweep). Exact GPU-hours not reported.
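As a quick arithmetic check on the training-set composition listed under Datasets, the four per-source pair counts can be summed and turned into shares. The variable names are mine; the numbers are the ones quoted above from Table 4.

```python
# Per-source audio-text pair counts as listed in the notes (Table 4).
train_pairs = {
    "FSD50K": 36_796,
    "ClothoV2": 29_646,
    "AudioCaps": 44_292,
    "MACS": 17_276,
}

total = sum(train_pairs.values())
# Fraction of the training pairs contributed by each source.
shares = {name: count / total for name, count in train_pairs.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print("total pairs:", total)
```

The total comes out to 128,010, matching the headline figure, with AudioCaps as the largest single source at roughly a third of the pairs.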

Results

Headline: trained on ~128k pairs — the authors note this is at least ~0.001% the pair count of comparable CV models — CLAP still sets zero-shot SoTA on the core sound-event benchmarks and achieves SoTA in 5 tasks under supervised finetuning.

Task (metric)	Prior ZS benchmark	CLAP (ZS)	Best in lit.	CLAP (Best)
ESC50 (acc)	0.694 (AudioCLIP)	0.826	0.9715	0.9670
US8K (acc)	0.6531 (AudioCLIP)	0.7324	0.9007	0.8796
FSD50K (mAP)	0.0302 (Wav2CLIP)	0.3024	0.641	0.5859
GTZAN Music vs Speech (acc)	—	1.00	0.992	1.00

Where it wins:

ESC50 zero-shot 82.6% beats AudioCLIP (69.4%) by about 13 points absolute and exceeds the reported human performance of 81%; US8K 73.2% beats AudioCLIP's 65.3% by about 8 points; FSD50K 30.2% mAP beats Wav2CLIP (3.0%) by about 27 points absolute.
On GTZAN Music-vs-Speech the zero-shot model hits 100% acc, edging supervised models — a clean "no training at all" result.
Supervised finetune (CLAP-Best) sets SoTA on 5 tasks: GTZAN Music/Speech (100%), GTZAN Genres (91.3%), Mridangam Stroke (97.94%), Mridangam Tonic (95.34%), Vocal Sound (97.95%).
Ablation (Table 2): unfreezing both encoders gives the best average ZS score; surprisingly, unfreezing the text encoder helps more than unfreezing the audio encoder, which suggests you can keep a chosen pretrained audio encoder largely fixed and still turn it into a zero-shot classifier. Prompt choice matters (Table 3): the right template lifts ESC50 by ~4 points (0.786 → 0.826).
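The zero-shot margins can be recomputed directly from the results table, which is a useful guard against rounding slips when quoting them. This is just arithmetic on the table values; the dictionary layout is my own.

```python
# (prior zero-shot benchmark, CLAP zero-shot) taken from the results table above.
results = {
    "ESC50 (acc)": (0.694, 0.826),
    "US8K (acc)": (0.6531, 0.7324),
    "FSD50K (mAP)": (0.0302, 0.3024),
}

# Absolute margin in percentage points, rounded to two decimals.
margins = {
    task: round(100 * (clap - prior), 2)
    for task, (prior, clap) in results.items()
}

for task, points in margins.items():
    print(f"{task}: +{points} points")
```

This gives margins of 13.2, 7.93, and 27.22 points for ESC50, US8K, and FSD50K respectively.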

Where it loses:

On speech-content tasks CLAP is weak: zero-shot emotion recognition and keyword spotting only ~4% above random; finetuned RAVDESS emotion 64% vs SoTA 81%. The authors attribute this to training captions describing sound events / scenes / objects but rarely speech content.
Even finetuned, CLAP trails per-task SoTA by up to 7% on several tasks (Table 1).
Naively adding ~1.7M AudioSet title/label pairs hurt zero-shot (ESC50 82.6% → 67.15%), because AudioSet text describes the whole video, not the clip — pair quality dominates pair quantity (App. C.2).

Limitations & open questions

From the authors:

Speech-content tasks (emotion, keyword spotting) underperform due to caption-domain mismatch; they posit more speech-descriptive captions would close the gap, but don't demonstrate it.
Sourcing good training pairs is hard: large noisy web pairs (AudioSet-style title/description) degrade performance, so "relying on large-scale noisy pairs is the only scalable approach" is stated as an open tension, not solved.
Batch-size scaling helps up to 256 but drops at 768 (Fig 2); flagged as an anomaly left to future work.

What I noticed reading it:

The audio encoder (CNN14) is pretrained on 2M AudioSet clips, and several downstream sound-event datasets overlap AudioSet's label space — some "zero-shot" SoTA on SEC may partly reflect that the encoder already saw the acoustic distribution, even if it never saw the paired captions. The paper doesn't isolate this.
"Beats human performance of 81%" on ESC50 cites a single number; no confidence interval or per-fold variance is given for the headline ZS results, so the SoTA margins are point estimates.
The batch-size–vs–performance curve (Fig 2) and the AudioSet-hurts result are both single-seed observations — thin evidence for the "pair quality > quantity" thesis, even though it's plausible.
Only the simplest design (CNN14 + BERT + linear projections) is tried; no ablation on encoder choice or projection depth, so it's unclear how much of the result is the contrastive recipe vs. the strong pretrained CNN14.

Why I care

Adjacent / method anchor, not a manipulation paper. CLAP has no robot, no policy, no contact — it's the "CLIP-for-audio" reference that the robotics audio-sensing literature builds on. I care about it as the grounding recipe, not as a result to chase.

The connection to my thesis: many manipulation predicates I care about (is_full, is_screwed_tight, surface_is_rough, liquid pouring, contact onset) are not visually evaluable — they live in sound and touch. BLADE learns visual classifiers f_θ(p): O → {T,F} for predicates; CLAP is the existence proof that you can instead learn a language-grounded acoustic classifier with no fixed label set, where a predicate name typed in natural language is scored against an audio embedding by cosine similarity. That's structurally exactly the open-vocabulary predicate grounding BLADE flags as future work — only in the acoustic modality. A CLAP-style audio tower is the natural drop-in when a precondition/effect is audible (the pour finished, the click happened, the drill bit caught) rather than visible. This is why CLAP keeps reappearing upstream of contact-audio manipulation work — it's the joint-embedding backbone those policies inherit.

Quotable

We propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which learns to connect language and audio by using two encoders and a contrastive learning to bring audio and text descriptions into a joint multimodal space. — Abstract

Zero-shot requires no training stage so there are no predefined categories. To enable such flexibility and generalization, models need to learn the relationships between the acoustic semantics and language semantics. — §1, Introduction

The quality of audio-text pairs is key in training CLAP. ... In general, finding helpful training data for CLAP based on public datasets is difficult, thus relying on large-scale noisy pairs is the only scalable approach. — App. C.2 / p.8

Papers cited that should likely be ingested next:

[6] Radford et al. 2021 — CLIP (ICML) — the image–language contrastive model CLAP is a direct audio port of. The foundational template; not in this batch's cross-ref list.
[9] Wu et al. 2022 — Wav2CLIP (ICASSP) — CLIP-distillation audio baseline CLAP beats; the immediate prior.
[10] Guzhov et al. 2022 — AudioCLIP (ICASSP) — the other CLIP-distillation baseline; in this batch as audioclip_extending_clip.
[23] Kong et al. 2020 — PANNs / CNN14 — the pretrained audio encoder backbone. Foundational dependency.

Newly ingested in the 2026-06-24 batch — directly relevant:

AudioCLIP: closest sibling, the prior CLIP-distillation audio–language model and CLAP's direct zero-shot baseline (CLAP beats it by about 13 points on ESC50, 0.826 vs 0.694). Read the two together as the "distill-from-CLIP vs. learn-from-captions" contrast.
Objects that Sound and The Sound of Pixels and Visually Indicated Sounds — the audio-visual self-supervision line (same Cluster E); CLAP swaps the visual anchor for a language anchor, so these frame what "audio–X binding" looked like before language supervision.
ImageBind and LanguageBind — multimodal-binding successors that fold an audio tower (CLAP-style) into a single space with vision/touch/depth; CLAP is the audio leg those binding models generalize.
SonicSense and ManiWAV and Making Sense of Audio Vibration (pouring) — the robot-side payoff: contact/contact-audio manipulation that needs exactly the audio–language grounding CLAP provides for audible predicates (pour-done, contact onset, material). CLAP is the upstream representation anchor for this cluster.