Objects that Sound

Relja Arandjelović, Andrew Zisserman (DeepMind / VGG, Oxford) · 2018 · ECCV 2018 · arXiv:1712.06651 · PDF

One-liner. Train two ConvNets from scratch on unlabelled video with a single self-supervised objective — "does this frame and this 1 s of audio come from the same video?" — and you get (a) aligned audio/visual embeddings good enough for cross-modal retrieval and (b), with a tiny architectural change, a network that points at which pixels are making the sound, all without a single label or bounding box.

Problem & motivation

Unlabelled video gives you a synchronized image stream and audio stream for free, an enormous source of "automatic" supervision. The prior L³-Net (Arandjelović & Zisserman, "Look, Listen and Learn", ICCV 2017) showed that the audio-visual correspondence (AVC) task, which classifies a (frame, audio) pair as matched or mismatched, teaches good unimodal features. But L³-Net fuses the two streams by concatenation and only then applies its fully-connected layers, so the visual and audio features are never aligned in a shared space and cannot be compared directly for cross-modal retrieval. And no prior cross-modal network was actually designed or shown to answer "where in the image is the object making this sound?" This paper targets both gaps: a network whose embeddings are directly comparable across modalities, and a network that localizes the sounding object, both still label-free.

Method

Shared backbone. Both networks below build on the same two subnetworks: a vision ConvNet that ingests a 224×224 RGB frame, and an audio ConvNet that ingests 1 s of audio resampled to 48 kHz and converted to a 257×200 log-spectrogram treated as a greyscale image (Fig 2a, 2b). Both are trained from scratch, with no ImageNet initialization.
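As a rough illustration of the audio input, the sketch below turns 1 s of 48 kHz audio into a 257×200 log-spectrogram. The window length, hop, FFT size, Hann window and padding are assumptions chosen only so that the output shape matches the one quoted above; these notes do not record the authors' exact settings.

```python
import numpy as np

SAMPLE_RATE = 48_000  # from the notes: audio is resampled to 48 kHz
WIN = 480             # assumption: 0.01 s analysis window
HOP = 240             # assumption: half-window overlap
N_FFT = 512           # assumption: zero-padded FFT, giving 512 // 2 + 1 = 257 bins


def log_spectrogram(wave: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Map 1 s of mono audio (48000 samples) to a (257, 200) log-magnitude array."""
    assert wave.shape == (SAMPLE_RATE,)
    # Assumption: pad half a hop on each side so the frame count is exactly 200.
    padded = np.pad(wave, (HOP // 2, HOP // 2), mode="reflect")
    n_frames = 1 + (len(padded) - WIN) // HOP
    # Index matrix of shape (n_frames, WIN): one row of sample indices per frame.
    idx = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = padded[idx] * np.hanning(WIN)
    mag = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))  # (n_frames, 257)
    return np.log(mag + eps).T                          # (257, n_frames)
```

The result is a single-channel array that the audio ConvNet can treat like a greyscale image.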

AVE-Net (retrieval; Fig 2c). The key move vs. L³-Net is to push the fully-connected layers into each subnetwork, so each modality produces a 128-D, L2-normalized embedding. Correspondence is then computed purely as a function of the Euclidean distance between the two embeddings: that single scalar is the information bottleneck, passed through a tiny FC (which scales and shifts it, effectively learning a distance threshold) and a softmax for the matched/mismatched decision. Because the distance is the only channel through which the network can solve AVC, training forces the visual and audio embeddings into a shared, retrieval-ready metric space. The authors note this resembles a contrastive (metric-learning) loss, but with no margin hyper-parameter to tune.
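A minimal NumPy sketch of the AVE-Net decision head described above, assuming the two 128-D features have already been produced by the subnetworks. The function names, the parameter shapes `w` and `b`, and the convention that class 1 means "matched" are illustrative choices, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def ave_head(vis_feat: np.ndarray, aud_feat: np.ndarray,
             w: np.ndarray, b: np.ndarray):
    """vis_feat, aud_feat: (batch, 128). w, b: (2,) parameters of the tiny FC.

    Returns (dist, probs), where probs[:, 1] is P(matched) under the
    assumed label convention (class 0 = mismatched, class 1 = matched).
    """
    v = l2_normalize(vis_feat)
    a = l2_normalize(aud_feat)
    # The only quantity that reaches the classifier: one scalar per pair, in [0, 2].
    dist = np.linalg.norm(v - a, axis=-1)
    # Tiny FC on a scalar input: an affine map from distance to two logits.
    logits = dist[:, None] * w[None, :] + b[None, :]
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return dist, probs
```

With `w = [4, -4]` and `b = [-4, 4]`, for example, the two logits cross at distance 1, so the head behaves like a soft threshold on the embedding distance. At retrieval time only `l2_normalize` and the distance are needed.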

AVOL-Net (localization; Fig 4). To localize, the vision subnetwork stops pooling and keeps operating at 14×14 spatial resolution; its FC layers are converted to 1×1 convolutions, yielding a 14×14 grid of 128-D region descriptors. A single 128-D audio embedding is compared (scalar product) against each of the 196 region descriptors, producing a 14×14 similarity / localization map. This is framed as Multiple Instance Learning: the max similarity over the grid is taken as the image-audio correspondence score and trained with the AVC logistic loss. For matched pairs at least one region must respond strongly (highlighting the sounding object); for mismatched pairs the whole map should stay low. The audio embedding effectively acts as a filter that "looks" for matching image patches — an attention-like mechanism.
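The MIL scoring step can be sketched in a few lines of NumPy, assuming the 14×14×128 region descriptors and the 128-D audio embedding are already computed. Applying a sigmoid to each similarity before the max is an assumption made here so that the score can be fed to a logistic loss; the function names are hypothetical.

```python
import numpy as np


def avol_score(region_desc: np.ndarray, audio_emb: np.ndarray):
    """region_desc: (14, 14, 128) grid; audio_emb: (128,).

    Returns (loc_map, score): a (14, 14) localization map and the
    image-level correspondence score obtained by max-pooling it.
    """
    sim = region_desc @ audio_emb          # scalar product per region, (14, 14)
    loc_map = 1.0 / (1.0 + np.exp(-sim))   # assumption: sigmoid squashes to (0, 1)
    score = float(loc_map.max())           # MIL: the best region speaks for the image
    return loc_map, score


def avc_logistic_loss(score: float, label: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy on the pooled score (label 1 = matched pair)."""
    s = min(max(score, eps), 1.0 - eps)
    return float(-(label * np.log(s) + (1 - label) * np.log(1.0 - s)))
```

Because only the maximum enters the loss, a matched pair is satisfied by a single strongly responding region, while a mismatched pair pushes the whole map down.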

Shortcut prevention. A cautionary contribution: naive AVC negative sampling lets the net cheat. Positives always align the audio midpoint to a frame at a multiple of 0.04 s (25 fps); if negatives don't, the net learns to detect audio sampled at multiples of 0.04 s (an MPEG/resampling artifact) instead of semantics. Fix: sample negative audio only at multiples of 0.04 s too.
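The fix can be illustrated with a small pair sampler that keeps every audio midpoint, positive or negative, on the 0.04 s frame grid. The 10 s clip length and 25 fps come from these notes; the function name, the dictionary layout and the use of integer milliseconds are assumptions for the sketch.

```python
import random

FRAME_MS = 40        # 25 fps, so frames sit at multiples of 0.04 s
CLIP_MS = 10_000     # AudioSet clips are 10 s long
HALF_AUDIO_MS = 500  # half of the 1 s audio window

# Frame indices whose 1 s audio window fits entirely inside the clip.
FIRST = -(-HALF_AUDIO_MS // FRAME_MS)           # ceiling division
LAST = (CLIP_MS - HALF_AUDIO_MS) // FRAME_MS


def sample_pair(video_ids, rng: random.Random, positive: bool) -> dict:
    """Draw one (frame, audio) training pair; label 1 = matched."""
    vid = rng.choice(video_ids)
    frame_idx = rng.randint(FIRST, LAST)
    if positive:
        aud_vid, aud_idx = vid, frame_idx
    else:
        aud_vid = rng.choice([v for v in video_ids if v != vid])
        # The fix: negatives are also drawn on the frame grid, never at
        # an arbitrary time offset, so grid alignment carries no label signal.
        aud_idx = rng.randint(FIRST, LAST)
    return {
        "frame": (vid, frame_idx * FRAME_MS),
        "audio_mid": (aud_vid, aud_idx * FRAME_MS),
        "label": int(positive),
    }
```

A quick sanity check on any such sampler is that the distribution of `audio_mid` modulo 40 ms is identical for both labels.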

Setup

Datasets / benchmarks: AudioSet (10 s YouTube clips), filtered to 110 musical-instrument / singing / tools classes — the "AudioSet-Instruments" subset: 263k train, 30k val, 4.3k test clips. Labels used only for evaluation, never for training. 500 val clips were hand-annotated with the sounding instrument's location for quantitative localization eval.
Hardware / simulator: 16 GPUs, synchronous TensorFlow training, per-worker batch of 128 (effective batch 2048). No robot / no simulator (vision+audio ML paper).
Baselines: L³-Net (retrained, same data/procedure); L³-Net aligned with CCA; VGG16-ImageNet visual features (supervised) aligned to L³-audio via CCA; random chance. Localization: a "predict image center" baseline.
Compute: Adam with weight decay 1e-5; learning rate chosen by grid search and reduced by 6% every 16 epochs; stride 2 in the first conv layer for a 4× speedup. Exact GPU-hours not reported.
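The batch and learning-rate arithmetic above can be written out as a short sketch. The step-wise form of the decay is an assumption, and `base_lr` is a placeholder argument: the notes record only that the rate was grid-searched, not its value.

```python
NUM_WORKERS = 16
PER_WORKER_BATCH = 128
EFFECTIVE_BATCH = NUM_WORKERS * PER_WORKER_BATCH  # 16 * 128 = 2048


def learning_rate(base_lr: float, epoch: int,
                  drop: float = 0.06, every: int = 16) -> float:
    """Assumed step schedule: multiply by (1 - drop) once per `every` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)
```

Under this reading the rate is unchanged for epochs 0 to 15, drops to 0.94 of its starting value at epoch 16, and to roughly 0.884 at epoch 32.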

Results

On the AVC task itself, AVE-Net reaches 81.9% accuracy vs. L³-Net 80.8% — but the authors stress AVC accuracy is only a proxy; retrieval is the real test. Cross-modal and intra-modal retrieval measured by mean nDCG@30 (im = image, aud = audio):

Method	im-im	im-aud	aud-im	aud-aud
Random chance	.407	.407	.407	.407
L³-Net	.567	.418	.385	.653
L³-Net + CCA	.578	.531	.560	.649
VGG16-ImageNet (supervised)	.600	—	—	—
VGG16-ImageNet + L³-Audio CCA	.493	.458	.464	.618
AVE-Net	.604	.561	.587	.665
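For reference, a minimal sketch of nDCG@k as it is commonly defined, with a linear gain and a log2 position discount. How relevance between a query and a retrieved clip is scored (for example from shared AudioSet labels) and the exact gain form used in the paper are not recorded in these notes, so both are assumptions here.

```python
import numpy as np


def dcg_at_k(rels, k: int) -> float:
    """Discounted cumulative gain of the first k relevances (linear gain)."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum(rels / discounts))


def ndcg_at_k(ranked_rels, k: int = 30) -> float:
    """ranked_rels: relevance of every database item, in the order the
    system ranked them for one query. Returns DCG@k divided by the best
    achievable DCG@k for that query (0.0 if nothing is relevant)."""
    ideal = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0
```

The table reports the mean of this quantity over queries, with k = 30, for each query-database modality pairing.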

AVE-Net beats every baseline on all four query-database modality combos. Notably it edges out fully-supervised VGG16-ImageNet on image-image (.604 vs .600) despite never training on classification, and crushes raw L³-Net on cross-modal (im-aud .561 vs .418) because L³'s un-aligned features sit near chance across modalities. AVE-Net even does intra-modal retrieval (im-im, aud-aud) well via transitivity, though never explicitly trained for it.

Localization (AVOL-Net). AVC accuracy is unchanged under the MIL setup, i.e. switching to localization costs no semantic quality. Quantitatively, AVOL-Net localizes the sounding instrument with 81.7% accuracy vs. 57.2% for the center-prediction baseline. Crucially, mismatch experiments (Fig 6) show it is not just a saliency detector: given an image of a piano+flute, playing flute audio highlights the flute, playing piano audio highlights the piano — localization genuinely depends on the sound.
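The localization metric can be sketched as a pointing-style check: take the peak of the 14×14 map, project it back to image coordinates, and test whether it falls inside the annotation. Representing the annotation as a pixel bounding box `(x0, y0, x1, y1)` and using the centre of the peak cell are assumptions made for illustration.

```python
import numpy as np


def pointing_hit(loc_map: np.ndarray, box, image_size: int = 224) -> bool:
    """True if the peak of loc_map lands inside box = (x0, y0, x1, y1) in pixels."""
    grid_h, grid_w = loc_map.shape
    row, col = np.unravel_index(np.argmax(loc_map), loc_map.shape)
    # Assumption: map a grid cell to the pixel at the centre of that cell.
    x = (col + 0.5) * image_size / grid_w
    y = (row + 0.5) * image_size / grid_h
    x0, y0, x1, y1 = box
    return bool(x0 <= x <= x1 and y0 <= y <= y1)


def centre_hit(box, image_size: int = 224) -> bool:
    """Baseline: always point at the image centre."""
    c = image_size / 2.0
    x0, y0, x1, y1 = box
    return bool(x0 <= c <= x1 and y0 <= c <= y1)


def accuracy(hits) -> float:
    """Fraction of annotated clips on which the prediction was a hit."""
    hits = list(hits)
    return sum(hits) / len(hits) if hits else 0.0
```

Running `pointing_hit` and `centre_hit` over the same annotated clips yields the two kinds of numbers compared above.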

Negative result worth noting: adding multiple frames (AVE+MF) or optical flow (AVE+OF) raised AVC accuracy (84.7%, 84.9%) but did not improve retrieval — the net exploits low-level motion correlations to solve AVC more cheaply, reducing pressure to learn semantics. So a single frame is used everywhere else.

Limitations & open questions

Author-stated. Being unsupervised, the localizer may fire on only a discriminative part of an object (piano keys, not the whole piano), raising the "what is an object that makes a sound" question (body? keyboard? player?). Max-pooling is crude; the authors propose an explicit soft-attention mechanism as future work. AudioSet is noisy: audio is often dubbed (album cover, unrelated visuals), and ideal nDCG of 1 is unreachable since one frame / 1 s can miss the labelled event. Failure cases include firing on overlaid text/music-sheet artifacts.

What I noticed reading it. (i) Localization is evaluated on a small home-grown set of 500 hand-annotated clips with a single coarse metric (is the heatmap mode inside the annotation), because no instrument bounding-box dataset exists — so the headline 81.7% rests on one author-defined protocol, not a community benchmark. (ii) Everything is musical instruments / singing / tools; whether AVC localization transfers to non-periodic, transient contact sounds (a click, a snap, a pour) is untested — instrument timbre is an unusually clean, sustained audio signal. (iii) The whole result hinges on the Euclidean-distance bottleneck, but there is no ablation isolating "distance bottleneck vs. concat+FC" under matched capacity beyond the AVE-Net-vs-L³ comparison, which also changes where the FCs live. (iv) The shortcut-prevention finding is a genuinely transferable warning: self-supervised correspondence objectives over real data invite exactly this kind of low-level cheating, and the "fix" is dataset-specific (0.04 s grid) rather than principled.

Why I care

This is an off-theme, non-robot audio-visual ML paper — I treat it as a method anchor / adjacent, not a manipulation result. It does not touch robots, manipulation, planning, or language-conditioned policies, so I won't manufacture relevance. Its value to my agenda is as the canonical architecture pattern for label-free cross-modal alignment and sound-source grounding, and it is the conceptual ancestor of much of the audio-for-manipulation cluster I'm building (ManiWAV, SonicSense, See/Hear/Feel).

The thread that does connect to my thesis: many manipulation predicates I care about for BLADE-style abstraction learning — is_inserted, is_full, is_screwed_tight, surface_is_rough — are not visually evaluable; they live in touch, force, and sound. "Objects that Sound" is the cleanest demonstration that a contact/audio signal can be grounded into spatial structure with zero labels via a correspondence objective. The AVOL-Net MIL trick (audio embedding as a spatial filter over a vision grid) is a template I'd reach for if I ever wanted to localize where a manipulation sound is coming from on an object without annotation. But that bridge is mine to build; the paper itself stops at instrument videos.

Quotable

"The AVC task stimulates the learnt visual and audio representations to be both discriminative... and semantically meaningful... the only way for a network to solve the task is if it learns to classify semantic concepts in both modalities, and then judge whether the two concepts correspond." — §1 / p.2

"In essence, the audio representation forms a filter which 'looks' for relevant image patches in a similar manner to an attention mechanism." — §4 / p.10

"Deep neural networks are notorious for finding subtle data shortcuts to exploit in order to 'cheat'... the audio for the negative pair is also sampled only from multiples of 0.04s." — §3.3 / p.8

Cited here, candidates to ingest next:

The Sound of Pixels (Zhao et al., ECCV 2018) — concurrent ECCV work; goes beyond localization to audio source separation from video.
Visually Indicated Sounds (Owens et al., CVPR 2016) — predicts sound from silent video of contact events; the contact-audio direction most relevant to manipulation.

Newly ingested in 2026-06-24 batch — directly relevant:

The Sound of Pixels — same ECCV'18 venue, same AVC self-supervision lineage, extends from localization to separation.
Visually Indicated Sounds — the other Cluster-E adjacent anchor; closest to physical contact-sound prediction.
AudioCLIP — later, scaled-up audio-visual-text embedding; AVE-Net is its small label-free ancestor.
CLAP — audio-language contrastive embedding; same cross-modal-alignment family one rung up the abstraction ladder.
SonicSense — carries the "objects that sound" intuition onto a robot hand for in-hand acoustic property inference.
