CLAP: Learning Audio Concepts From Natural Language Supervision
Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang
· Microsoft
· arXiv 2022 (later ICASSP 2023)
· arXiv:2206.04769
One-liner. CLAP is "CLIP for audio": train a paired
audio encoder + text encoder with a symmetric contrastive loss on 128k
audio–caption pairs to build a joint embedding space, which then does
zero-shot audio classification for any user-typed class name and
sets zero-shot SoTA on sound-event tasks — the canonical method anchor
for grounding acoustic events in language.
Problem & motivation
Mainstream audio models follow the "one class label → many recordings,
one task" supervised paradigm: they require labeled audio and can only predict
a fixed, predefined set of categories. Self-supervised audio pretraining
removes the label dependence but discards the semantic knowledge carried in
natural language, and still bolts on a static, fixed-vocabulary classifier head
downstream. Computer vision had already shown a third path — learn
representations from natural-language supervision (CLIP, Florence, ALIGN) for
flexible zero-shot prediction — but in audio the closest attempts
(Wav2CLIP, AudioCLIP) distilled from CLIP and trained on audio + class
labels from AudioSet rather than on audio + free-form language. CLAP asks
whether learning directly from audio–text caption pairs buys the same
open-vocabulary flexibility and generalization for sound.
Method
CLAP (Fig 1) is a two-tower contrastive model, structurally a direct audio
analogue of CLIP.
Encoders. An audio encoder $f_a$ maps
a processed log-Mel spectrogram $X_a \in \mathbb{R}^{F \times T}$
($F$ Mel bins, $T$ time bins) to an audio representation; a text encoder
$f_t$ maps the caption $X_t$ to a text
representation. Concretely the audio encoder is CNN14 (from
PANNs, 80.8M params, embedding size 2048, pretrained on 2M AudioSet clips) and
the text encoder is BERT-base-uncased (110M params), using the
[CLS] token of the final layer (size 768) as the text embedding.
Joint space + contrastive loss. Two learnable linear
projections $L_a$, $L_t$ map both towers into a
shared $d = 1024$-dimensional multimodal space, giving embeddings
$E_a$, $E_t$. For a batch of $N$ pairs the similarity
matrix is $C = \tau \cdot (E_t \cdot E_a^{\top})$,
with $N$ correct pairs on the diagonal and $N^2 - N$ negatives off the diagonal.
Training uses a symmetric cross-entropy loss
$\mathcal{L} = 0.5 \cdot (\ell_{text}(C) + \ell_{audio}(C))$,
with one term taking the softmax over rows and the other over columns, jointly optimizing
both encoders and both projections. The temperature $\tau$ is
learnable (init 0.007), and $\tau$-scaled logits are clipped at 100 for stability.
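The symmetric loss above can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the authors' implementation: the helper names, the L2 normalization of the embeddings (borrowed from CLIP), the choice of which axis is called "text" and which "audio", and the `scale` value used in the example are all assumptions.

```python
import math

def l2_normalize(v):
    # Assumption: embeddings are unit-normalized before the dot product, as in CLIP.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_contrastive_loss(text_emb, audio_emb, scale):
    """text_emb[i] and audio_emb[i] form the i-th matched pair; scale plays the role of tau."""
    e_t = [l2_normalize(v) for v in text_emb]
    e_a = [l2_normalize(v) for v in audio_emb]
    n = len(e_t)
    # C[i][j] = scale * <E_t[i], E_a[j]>; matched pairs sit on the diagonal.
    c = [[scale * sum(p * q for p, q in zip(e_t[i], e_a[j])) for j in range(n)]
         for i in range(n)]
    # Row-wise term: each caption must pick out its own audio clip.
    loss_rows = -sum(log_softmax(c[i])[i] for i in range(n)) / n
    # Column-wise term: each audio clip must pick out its own caption.
    cols = [[c[i][j] for i in range(n)] for j in range(n)]
    loss_cols = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_rows + loss_cols)

# Toy check: perfectly aligned pairs give a near-zero loss, swapped pairs a large one.
aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]], scale=10.0)
shuffled = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]], scale=10.0)
print(aligned < shuffled)
```

Because the two terms are averaged, swapping which axis is labelled text or audio leaves the total unchanged.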
Zero-shot classification (§2.2). At test time, embed the
N test audios and the C class names through the frozen towers + projections;
because both live in the same space, compute cosine similarity between each
audio and every class name, then softmax (binary/multiclass) or sigmoid
(multilabel) over the logits. No training, no fixed label set — classes
are typed in at inference. Class names are wrapped in a prompt template
— default 'This is a sound of [class label]' — to close
the gap between caption-style training text and bare-word labels (per-domain
variants for emotion / keyword / speaker-counting).
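The zero-shot procedure can be sketched as follows. Only the default prompt template comes from the paper; `TOY_TEXT_EMB`, the 3-dimensional vectors, and the `embed_text` callable are hypothetical stand-ins for the frozen text tower plus projection, chosen so the example runs without any model.

```python
import math

PROMPT = "This is a sound of {}"  # default template quoted above

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def zero_shot(audio_emb, class_names, embed_text, multilabel=False):
    """Score one audio embedding against prompt-wrapped class names."""
    prompts = [PROMPT.format(name) for name in class_names]
    logits = [cosine(audio_emb, embed_text(p)) for p in prompts]
    # Softmax for binary/multiclass, independent sigmoids for multilabel.
    # Unscaled cosines give a flat softmax; the ranking is what matters here.
    scores = [sigmoid(z) for z in logits] if multilabel else softmax(logits)
    return dict(zip(class_names, scores))

# Hypothetical text embeddings standing in for the real text tower.
TOY_TEXT_EMB = {
    "This is a sound of dog bark": [0.9, 0.1, 0.0],
    "This is a sound of rain": [0.1, 0.9, 0.1],
    "This is a sound of siren": [0.0, 0.2, 0.9],
}
classes = ["dog bark", "rain", "siren"]
scores = zero_shot([0.8, 0.2, 0.1], classes, TOY_TEXT_EMB.__getitem__)
best = max(scores, key=scores.get)
print(best)
```

Changing the label set only means changing `classes`; nothing is retrained, which is the point of the joint space.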
Setup
- Datasets / benchmarks: Training — 128,010
audio–text pairs assembled from four sources (Table 4): FSD50K (36,796),
ClothoV2 (29,646; 5 captions/clip), AudioCaps (44,292), MACS (17,276);
90,947 unique audios. Evaluation — 16 downstream datasets
across 8 domains (Table 5): sound-event classification (ESC50, FSD50K, US8K,
DCASE17 Task4, AudioSet), music (GTZAN Music/Speech, GTZAN Genres, Mridangam
Stroke/Tonic), instrument (Beijing Opera), acoustic scene (TUT2017), emotion
(CREMA-D, RAVDESS), keyword spotting (Speech Commands V2), vocal-sound
classification (Vocal Sound), speaker counting (LibriCount).
- Hardware / simulator: 16GB V100 GPUs, scaling 8–24
GPUs; no robot or simulator (this is a pure audio-ML paper). Audio
pre-processed to log-Mel (44.1 kHz, 64 Mel bins, 5-sec random-crop clips).
- Baselines: Zero-shot — AudioCLIP and Wav2CLIP
(the prior CLIP-distillation audio models), plus per-task literature
best (`Benchmark (ZS)` / `Benchmark (Best)`). Supervised feature-extraction
comparison (Table 6) against YAMNet, OpenL3, Wav2CLIP, PANN, Wav2Vec2.
- Compute: 40 training epochs, Adam, lr $10^{-3}$
with plateau decay; batch sizes 32–768 across 4–24 GPUs (Fig 2
sweep). Exact GPU-hours not reported.
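As a quick sanity check on the setup numbers, the snippet below tallies the four training sources and derives the crop length in samples. The `random_crop` helper is a hypothetical illustration of 5-second random cropping; zero-padding clips shorter than the crop is an assumption, not something these notes confirm.

```python
import random

# Pair counts per source, as listed under Datasets / benchmarks.
TRAIN_PAIRS = {"FSD50K": 36_796, "ClothoV2": 29_646, "AudioCaps": 44_292, "MACS": 17_276}
total_pairs = sum(TRAIN_PAIRS.values())
shares = {name: count / total_pairs for name, count in TRAIN_PAIRS.items()}

SAMPLE_RATE_HZ = 44_100
CLIP_SECONDS = 5
samples_per_crop = SAMPLE_RATE_HZ * CLIP_SECONDS

def random_crop(waveform, crop_len, rng):
    """Return a random contiguous window of crop_len samples from a list of floats."""
    if len(waveform) <= crop_len:
        # Assumption: short clips are zero-padded on the right.
        return list(waveform) + [0.0] * (crop_len - len(waveform))
    start = rng.randrange(len(waveform) - crop_len + 1)
    return waveform[start:start + crop_len]

print(total_pairs, samples_per_crop)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```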
Results
Headline: trained on ~128k pairs (the authors note this is at least
~0.001% of the pair count of comparable CV models), CLAP still sets
zero-shot SoTA on the core sound-event benchmarks and achieves
SoTA in 5 tasks under supervised finetuning.
| Task (metric) | Prior ZS benchmark | CLAP (ZS) | Best in lit. | CLAP (Best) |
|---|---|---|---|---|
| ESC50 (acc) | 0.694 (AudioCLIP) | 0.826 | 0.9715 | 0.9670 |
| US8K (acc) | 0.6531 (AudioCLIP) | 0.7324 | 0.90–7 | 0.8796 |
| FSD50K (mAP) | 0.0302 (Wav2CLIP) | 0.3024 | 0.641 | 0.5859 |
| GTZAN Music vs Speech (acc) | — | 1.00 | 0.992 | 1.00 |
Where it wins:
- ESC50 zero-shot 82.6% beats AudioCLIP (69.4%) by about 13 points absolute and
beats reported human performance of 81%; US8K 73% beats AudioCLIP
65% by about 8 points; FSD50K 30.2% mAP beats Wav2CLIP (3%) by about 27 points absolute.
- On GTZAN Music-vs-Speech the zero-shot model hits 100% acc, edging
supervised models — a clean "no training at all" result.
- Supervised finetune (CLAP-Best) sets SoTA on 5 tasks: GTZAN Music/Speech
(100%), GTZAN Genres (91.3%), Mridangam Stroke (97.94%), Mridangam Tonic
(95.34%), Vocal Sound (97.95%).
- Ablation (Table 2): unfreezing both encoders gives the best average ZS
score; surprisingly, unfreezing the text encoder helps more than
unfreezing the audio encoder — suggesting you can keep a
chosen pretrained audio encoder largely fixed and still turn it into a
zero-shot classifier. Prompt choice matters (Table 3): the right template
lifts ESC50 accuracy by 4 points (0.786 → 0.826, about 5% relative).
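The zero-shot margins can be recomputed directly from the results table. The sketch below uses only the prior-benchmark and CLAP (ZS) columns and reports both the absolute gain in points and the relative gain; the `gains` helper is a hypothetical name.

```python
# (prior zero-shot benchmark, CLAP zero-shot), copied from the results table.
ROWS = {
    "ESC50 (acc)": (0.694, 0.826),
    "US8K (acc)": (0.6531, 0.7324),
    "FSD50K (mAP)": (0.0302, 0.3024),
}

def gains(prior, clap):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (clap - prior) * 100
    relative_percent = (clap - prior) / prior * 100
    return round(absolute_points, 1), round(relative_percent, 1)

results = {task: gains(prior, clap) for task, (prior, clap) in ROWS.items()}
for task, (points, relative) in results.items():
    print(f"{task}: +{points} points absolute, +{relative}% relative")
```

Keeping the two measures apart matters most for FSD50K, where the prior score is so low that the relative gain is far larger than the absolute one.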
Where it loses:
- On speech-content tasks CLAP is weak: zero-shot emotion
recognition and keyword spotting only ~4% above random; finetuned RAVDESS
emotion 64% vs SoTA 81%. The authors attribute this to training captions
describing sound events / scenes / objects but rarely speech content.
- Even finetuned, CLAP trails per-task SoTA by up to 7% on several tasks
(Table 1).
- Naively adding ~1.7M AudioSet title/label pairs hurt zero-shot
(ESC50 82.6% → 67.15%), because AudioSet text describes the whole video,
not the clip — pair quality dominates pair quantity (App. C.2).
Limitations & open questions
From the authors:
- Speech-content tasks (emotion, keyword spotting) underperform due to
caption-domain mismatch; they posit more speech-descriptive captions would
close the gap, but don't demonstrate it.
- Sourcing good training pairs is hard: large noisy web pairs (AudioSet-style
title/description) degrade performance, so "relying on large-scale noisy
pairs is the only scalable approach" is stated as an open tension, not solved.
- Batch-size scaling helps up to 256 but drops at 768 (Fig 2);
flagged as an anomaly left to future work.
What I noticed reading it:
- The audio encoder (CNN14) is pretrained on 2M AudioSet clips, and several
downstream sound-event datasets overlap AudioSet's label space — some
"zero-shot" SoTA on SEC may partly reflect that the encoder already saw the
acoustic distribution, even if it never saw the paired captions. The paper
doesn't isolate this.
- "Beats human performance of 81%" on ESC50 cites a single number; no
confidence interval or per-fold variance is given for the headline ZS
results, so the SoTA margins are point estimates.
- The batch-size–vs–performance curve (Fig 2) and the
AudioSet-hurts result are both single-seed observations — thin evidence
for the "pair quality > quantity" thesis, even though it's plausible.
- Only the simplest design (CNN14 + BERT + linear projections) is tried; no
ablation on encoder choice or projection depth, so it's unclear how much of
the result is the contrastive recipe vs. the strong pretrained CNN14.
Why I care
Adjacent / method anchor, not a manipulation paper. CLAP has
no robot, no policy, no contact — it's the "CLIP-for-audio" reference that
the robotics audio-sensing literature builds on. I care about it as the
grounding recipe, not as a result to chase.
The connection to my thesis: many manipulation predicates I care about
(is_full, is_screwed_tight, surface_is_rough,
liquid pouring, contact onset) are not visually evaluable — they
live in sound and touch. BLADE
learns visual classifiers $f_\theta(p): O \to \{T, F\}$ for
predicates; CLAP is the existence proof that you can instead learn a
language-grounded acoustic classifier with no fixed label set, where a
predicate name typed in natural language is scored against an audio embedding by
cosine similarity. That's structurally exactly the open-vocabulary predicate
grounding BLADE flags as future work — only in the acoustic modality. A
CLAP-style audio tower is the natural drop-in when a precondition/effect is
audible (the pour finished, the click happened, the drill bit caught) rather than
visible. This is why CLAP keeps reappearing upstream of contact-audio
manipulation work — it's the joint-embedding backbone those policies inherit.
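A minimal sketch of what such an audible predicate could look like, assuming a CLAP-style joint space is available. Everything here is hypothetical: the function name, the positive/negative prompt wording, the `scale` value, and the 2-dimensional toy embeddings that stand in for real audio and text towers. Neither CLAP nor BLADE defines this interface.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def audible_predicate(audio_emb, positive_prompt, negative_prompt, embed_text, scale=10.0):
    """Evaluate a boolean predicate by contrasting two natural-language descriptions."""
    pos = scale * cosine(audio_emb, embed_text(positive_prompt))
    neg = scale * cosine(audio_emb, embed_text(negative_prompt))
    # A two-way softmax over (pos, neg) equals a sigmoid of the logit difference.
    p_true = 1.0 / (1.0 + math.exp(-(pos - neg)))
    return p_true >= 0.5, p_true

# Hypothetical embeddings standing in for a trained text tower.
TOY_TEXT_EMB = {
    "the sound of pouring has stopped": [1.0, 0.0],
    "the sound of liquid pouring": [0.0, 1.0],
}
holds, p_true = audible_predicate(
    [0.9, 0.2],
    "the sound of pouring has stopped",
    "the sound of liquid pouring",
    TOY_TEXT_EMB.__getitem__,
)
print(holds)
```

The predicate stays open-vocabulary in the same sense as the zero-shot classifier: a new precondition or effect is a new pair of strings, not a new trained head.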
Quotable
We propose to learn audio concepts from natural language supervision. We call
our approach Contrastive Language-Audio Pretraining (CLAP), which learns to
connect language and audio by using two encoders and a contrastive learning to
bring audio and text descriptions into a joint multimodal space.
— Abstract
Zero-shot requires no training stage so there are no predefined categories. To
enable such flexibility and generalization, models need to learn the
relationships between the acoustic semantics and language semantics.
— §1, Introduction
The quality of audio-text pairs is key in training CLAP. ... In general, finding
helpful training data for CLAP based on public datasets is difficult, thus
relying on large-scale noisy pairs is the only scalable approach.
— App. C.2 / p.8
Related
Papers cited that should likely be ingested next:
- [6] Radford et al. 2021 — CLIP (ICML) — the
image–language contrastive model CLAP is a direct audio port of. The
foundational template; not in this batch's cross-ref list.
- [9] Wu et al. 2022 — Wav2CLIP (ICASSP) —
CLIP-distillation audio baseline CLAP beats; the immediate prior.
- [10] Guzhov et al. 2022 — AudioCLIP (ICASSP) —
the other CLIP-distillation baseline; in this batch as
audioclip_extending_clip.
- [23] Kong et al. 2020 — PANNs / CNN14 — the
pretrained audio encoder backbone. Foundational dependency.
Newly ingested in the 2026-06-24 batch — directly relevant:
- AudioCLIP — closest
sibling: the prior CLIP-distillation audio–language model and CLAP's
direct zero-shot baseline (CLAP beats it by about 13 points on ESC50). Read the two
together as the "distill-from-CLIP vs. learn-from-captions" contrast.
- Objects that Sound and
The Sound of Pixels and
Visually Indicated Sounds
— the audio-visual self-supervision line (same Cluster E); CLAP swaps
the visual anchor for a language anchor, so these frame what
"audio–X binding" looked like before language supervision.
- ImageBind and
LanguageBind —
multimodal-binding successors that fold an audio tower (CLAP-style) into a
single space with vision/touch/depth; CLAP is the audio leg those binding
models generalize.
- SonicSense and
ManiWAV and
Making Sense of Audio Vibration (pouring)
— the robot-side payoff: contact/contact-audio manipulation that needs
exactly the audio–language grounding CLAP provides for audible
predicates (pour-done, contact onset, material). CLAP is the upstream
representation anchor for this cluster.