CLAP: Learning Audio Concepts From Natural Language Supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang · Microsoft · arXiv 2022 (later ICASSP 2023) · arXiv:2206.04769

One-liner. CLAP is "CLIP for audio": train a paired audio encoder + text encoder with a symmetric contrastive loss on 128k audio–caption pairs to build a joint embedding space, which then does zero-shot audio classification for any user-typed class name and sets zero-shot SoTA on sound-event tasks — the canonical method anchor for grounding acoustic events in language.

Problem & motivation

Mainstream audio models follow the "one class label → many recordings, one task" supervised paradigm: they require labeled audio and can only predict a fixed, predefined set of categories. Self-supervised audio pretraining removes the label dependence but discards the semantic knowledge carried in natural language, and still bolts on a static, fixed-vocabulary classifier head downstream. Computer vision had already shown a third path — learn representations from natural-language supervision (CLIP, Florence, ALIGN) for flexible zero-shot prediction — but in audio the closest attempts (Wav2CLIP, AudioCLIP) distilled from CLIP and trained on audio + class labels from AudioSet rather than on audio + free-form language. CLAP asks whether learning directly from audio–text caption pairs buys the same open-vocabulary flexibility and generalization for sound.

Method

CLAP (Fig 1) is a two-tower contrastive model, structurally a direct audio analogue of CLIP.

Encoders. An audio encoder f_a maps a processed log-Mel spectrogram X_a ∈ ℝ^{F×T} (F Mel bins, T time bins) to an audio representation; a text encoder f_t maps the caption X_t to a text representation. Concretely, the audio encoder is CNN14 (from PANNs, 80.8M params, embedding size 2048, pretrained on 2M AudioSet clips) and the text encoder is BERT-base-uncased (110M params), using the [CLS] token of the final layer (size 768) as the text embedding.

Joint space + contrastive loss. Two learnable linear projections L_a, L_t map both towers into a shared d=1024-dim multimodal space, giving embeddings E_a, E_t (each N×d for a batch of N pairs). The similarity matrix is C = τ · (E_t · E_a^⊤), an N×N matrix with the N correct pairs on the diagonal and N²−N negatives off-diagonal. Training uses a symmetric cross-entropy loss L = 0.5·(ℓ_text(C) + ℓ_audio(C)), i.e. softmax over rows and softmax over columns, jointly optimizing both encoders and both projections. The temperature τ is learnable (init 0.007), and τ-scaled logits are clipped at 100 for stability.
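The symmetric loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names are hypothetical, the L2 normalisation of the projected embeddings is an assumption (the note does not say whether it is applied), `scale` is a plain multiplicative stand-in for the learnable temperature term, and the toy dimensions replace the real 2048/768 → 1024 sizes.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    # Assumption: embeddings are unit-normalised before the dot product.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clap_style_loss(audio_feats, text_feats, W_a, W_t, scale):
    """Symmetric cross-entropy over an N x N audio-text similarity matrix."""
    E_a = l2_normalize(audio_feats @ W_a)  # (N, d) projected audio embeddings
    E_t = l2_normalize(text_feats @ W_t)   # (N, d) projected text embeddings
    C = scale * (E_t @ E_a.T)              # (N, N): rows = texts, cols = audios
    idx = np.arange(C.shape[0])            # matched pairs sit on the diagonal
    loss_text = -log_softmax(C, axis=1)[idx, idx].mean()   # softmax over rows
    loss_audio = -log_softmax(C, axis=0)[idx, idx].mean()  # softmax over columns
    return 0.5 * (loss_text + loss_audio), C

# Toy usage with random features (dimensions are illustrative only).
rng = np.random.default_rng(0)
N, d_audio, d_text, d = 4, 16, 12, 8
loss, C = clap_style_loss(
    rng.normal(size=(N, d_audio)), rng.normal(size=(N, d_text)),
    rng.normal(size=(d_audio, d)), rng.normal(size=(d_text, d)),
    scale=10.0,
)
print(float(loss), C.shape)
```

With perfectly aligned, mutually orthogonal pairs the loss approaches zero as `scale` grows, and with `scale = 0` it equals log N, which is the chance level for an N-way choice.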

Zero-shot classification (§2.2). At test time, embed the N test audios and the C class names through the frozen towers + projections; because both live in the same space, compute cosine similarity between each audio and every class name, then softmax (binary/multiclass) or sigmoid (multilabel) over the logits. No training, no fixed label set — classes are typed in at inference. Class names are wrapped in a prompt template — default 'This is a sound of [class label]' — to close the gap between caption-style training text and bare-word labels (per-domain variants for emotion / keyword / speaker-counting).
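The zero-shot step can be sketched as follows, assuming the audio and prompt embeddings have already been produced by the frozen towers and projections. The helper names and the toy embeddings are hypothetical; only the default prompt wording and the softmax/sigmoid choice come from the description above.

```python
import numpy as np

PROMPT = "This is a sound of {}"  # default template quoted in the note

def build_prompts(class_names, template=PROMPT):
    # Wrap bare class names so they look like caption-style training text.
    return [template.format(name) for name in class_names]

def zero_shot_scores(audio_emb, text_emb, multilabel=False):
    """Score each audio against each class prompt by cosine similarity."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T  # (num_audios, num_classes) cosine similarities
    if multilabel:
        return 1.0 / (1.0 + np.exp(-logits))        # independent sigmoid per class
    z = logits - logits.max(axis=1, keepdims=True)  # stable softmax over classes
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Toy usage: three classes, two audios. In practice text_emb would come from
# embedding build_prompts(class_names) with the text tower; here it is made up.
class_names = ["dog bark", "rain", "siren"]
prompts = build_prompts(class_names)
text_emb = np.eye(3)
audio_emb = np.array([[0.9, 0.1, 0.0],
                      [0.0, 0.2, 0.8]])
probs = zero_shot_scores(audio_emb, text_emb)
predicted = [class_names[i] for i in probs.argmax(axis=1)]
print(prompts, predicted)
```

Because the class list is just an argument, changing the label set means re-embedding a new list of prompts; nothing is retrained.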

Setup

Results

Headline: trained on ~128k pairs (the authors note this is at least ~0.001% of the pair count of comparable CV models), CLAP still sets zero-shot SoTA on the core sound-event benchmarks and achieves SoTA in 5 tasks under supervised finetuning.

| Task (metric) | Prior ZS benchmark | CLAP (ZS) | Best in lit. | CLAP (Best) |
| --- | --- | --- | --- | --- |
| ESC50 (acc) | 0.694 (AudioCLIP) | 0.826 | 0.9715 | 0.9670 |
| US8K (acc) | 0.6531 (AudioCLIP) | 0.7324 | 0.90–7 | 0.8796 |
| FSD50K (mAP) | 0.0302 (Wav2CLIP) | 0.3024 | 0.641 | 0.5859 |
| GTZAN Music vs Speech (acc) | - | 1.00 | 0.992 | 1.00 |

Where it wins:

Where it loses:

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

Adjacent / method anchor, not a manipulation paper. CLAP has no robot, no policy, no contact — it's the "CLIP-for-audio" reference that the robotics audio-sensing literature builds on. I care about it as the grounding recipe, not as a result to chase.

The connection to my thesis: many manipulation predicates I care about (is_full, is_screwed_tight, surface_is_rough, liquid pouring, contact onset) are not visually evaluable — they live in sound and touch. BLADE learns visual classifiers fθ(p): O → {T,F} for predicates; CLAP is the existence proof that you can instead learn a language-grounded acoustic classifier with no fixed label set, where a predicate name typed in natural language is scored against an audio embedding by cosine similarity. That's structurally exactly the open-vocabulary predicate grounding BLADE flags as future work — only in the acoustic modality. A CLAP-style audio tower is the natural drop-in when a precondition/effect is audible (the pour finished, the click happened, the drill bit caught) rather than visible. This is why CLAP keeps reappearing upstream of contact-audio manipulation work — it's the joint-embedding backbone those policies inherit.

Quotable

We propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which learns to connect language and audio by using two encoders and a contrastive learning to bring audio and text descriptions into a joint multimodal space. — Abstract
Zero-shot requires no training stage so there are no predefined categories. To enable such flexibility and generalization, models need to learn the relationships between the acoustic semantics and language semantics. — §1, Introduction
The quality of audio-text pairs is key in training CLAP. ... In general, finding helpful training data for CLAP based on public datasets is difficult, thus relying on large-scale noisy pairs is the only scalable approach. — App. C.2 / p.8

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: