One-liner. Train two ConvNets from scratch on unlabelled video with a single self-supervised objective — "does this frame and this 1 s of audio come from the same video?" — and you get (a) aligned audio/visual embeddings good enough for cross-modal retrieval and (b), with a tiny architectural change, a network that points at which pixels are making the sound, all without a single label or bounding box.
Unlabelled video gives you a synchronized image stream and audio stream for free, an enormous source of "automatic" supervision. The prior L³-Net (Arandjelović & Zisserman, "Look, Listen and Learn", ICCV 2017) showed that the audio-visual correspondence (AVC) task, i.e. classifying a (frame, audio) pair as matched or mismatched, teaches good unimodal features. But L³-Net fuses its two feature streams by concatenation and only then applies the fully-connected layers, so the modalities are never aligned in a shared space and the features are useless for cross-modal retrieval. And no prior cross-modal network was actually designed or shown to answer "where in the image is the object making this sound?" This paper targets both gaps: a network whose embeddings are directly comparable across modalities, and a network that localizes the sounding object, both still label-free.
Shared backbone. A vision ConvNet ingests a 224×224 RGB frame; an audio ConvNet ingests 1 s of audio resampled to 48 kHz and converted to a 257×200 log-spectrogram treated as a greyscale image (Fig 2a, 2b). Both are trained from scratch, no ImageNet init.
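As a sanity check on the 257×200 input size, here is a minimal arithmetic sketch. The window length (480 samples), hop (240 samples), FFT size (512) and end padding (240 samples) are assumptions chosen to reproduce the stated shape; these notes do not record the actual STFT settings, and `spectrogram_shape` is a hypothetical helper.

```python
def spectrogram_shape(sample_rate=48_000, duration_s=1.0,
                      window=480, hop=240, n_fft=512, pad=240):
    """Return (frequency_bins, time_frames) for a one-sided STFT.

    window, hop, n_fft and pad are illustrative assumptions, not values
    recorded in these notes.
    """
    n_samples = int(round(sample_rate * duration_s))
    freq_bins = n_fft // 2 + 1                       # one-sided real FFT
    frames = 1 + (n_samples + pad - window) // hop   # sliding-window count
    return freq_bins, frames


print(spectrogram_shape())  # (257, 200)
```

Under these assumed settings the 257 comes purely from the FFT size and the 200 from the hop, so other window choices could give the same shape.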
AVE-Net (retrieval; Fig 2c). The key move vs. L³-Net is to push the fully-connected layers into each subnetwork, so each modality produces a 128-D, L2-normalized embedding. Correspondence is then computed as a function of the Euclidean distance between the two embeddings: the single scalar distance is the only information bottleneck, passed through a tiny FC (which scales/shifts it and effectively learns a distance threshold) and a softmax for the matched/mismatched decision. Because the only channel through which the network can solve AVC is that distance, training forces the visual and audio embeddings into a shared, retrieval-ready metric space. The authors note this resembles a parameter-free contrastive/metric-learning loss but with no margin hyper-parameter to tune.
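The distance bottleneck can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' code: `ave_head` is a hypothetical name, and the weights `w` and `b` stand in for the learned tiny FC layer with made-up values that place the match/mismatch threshold at a distance of 0.75.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def ave_head(img_emb, aud_emb, w=(-4.0, 4.0), b=(3.0, -3.0)):
    """Return (distance, p_match) for one (frame, audio) pair.

    w and b play the role of the learned 1 -> 2 FC layer; the values are
    invented for illustration (threshold at d = 0.75).
    """
    img, aud = l2_normalize(img_emb), l2_normalize(aud_emb)
    # The single scalar below is all the classifier ever sees.
    d = math.sqrt(sum((i - a) ** 2 for i, a in zip(img, aud)))
    logit_match = w[0] * d + b[0]
    logit_mismatch = w[1] * d + b[1]
    # Two-way softmax, written stably by subtracting the larger logit.
    m = max(logit_match, logit_mismatch)
    e_match = math.exp(logit_match - m)
    e_mismatch = math.exp(logit_mismatch - m)
    return d, e_match / (e_match + e_mismatch)


d_same, p_same = ave_head([1.0, 0.0], [2.0, 0.0])    # same direction
d_far, p_far = ave_head([1.0, 0.0], [-1.0, 0.0])     # opposite direction
print(d_same, p_same > 0.5, d_far, p_far < 0.5)
```

Because unit-length embeddings are at most 2 apart, the FC layer only has to place one threshold on the interval [0, 2], which is why the notes describe it as learning a distance threshold.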
AVOL-Net (localization; Fig 4). To localize, the vision subnetwork stops pooling and keeps operating at 14×14 spatial resolution; its FC layers are converted to 1×1 convolutions, yielding a 14×14 grid of 128-D region descriptors. A single 128-D audio embedding is compared (scalar product) against each of the 196 region descriptors, producing a 14×14 similarity / localization map. This is framed as Multiple Instance Learning: the max similarity over the grid is taken as the image-audio correspondence score and trained with the AVC logistic loss. For matched pairs at least one region must respond strongly (highlighting the sounding object); for mismatched pairs the whole map should stay low. The audio embedding effectively acts as a filter that "looks" for matching image patches — an attention-like mechanism.
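A toy version of the MIL scoring step, assuming a sigmoid is applied to each scalar product before the max (these notes only say "similarity", so the sigmoid is an assumption). The function names are hypothetical, and a 2×2 grid of 2-D descriptors stands in for the 14×14 grid of 128-D descriptors.

```python
import math


def localize(region_grid, audio_emb):
    """Return (similarity map, peak (row, col), max score).

    region_grid is an H x W nested list of D-dim region descriptors and
    audio_emb is one D-dim audio embedding.
    """
    sim_map = [
        [1.0 / (1.0 + math.exp(-sum(r * a for r, a in zip(desc, audio_emb))))
         for desc in row]
        for row in region_grid
    ]
    peak = max(
        ((r, c) for r in range(len(sim_map)) for c in range(len(sim_map[0]))),
        key=lambda rc: sim_map[rc[0]][rc[1]],
    )
    # MIL: the image-level score is the best-matching region.
    return sim_map, peak, sim_map[peak[0]][peak[1]]


def avc_logistic_loss(score, matched, eps=1e-12):
    """Binary cross-entropy on the max-pooled correspondence score."""
    return -math.log(score + eps) if matched else -math.log(1.0 - score + eps)


grid = [[[0.0, 0.0], [0.1, 0.0]],
        [[0.0, 3.0], [-1.0, 0.0]]]
audio = [0.0, 1.0]
sim_map, peak, score = localize(grid, audio)
print(peak, round(score, 3))  # (1, 0) 0.953
```

Only the winning region receives gradient from the max in a matched pair, which is one way to read the note above that the audio embedding behaves like a filter searching for matching patches.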
Shortcut prevention. A cautionary contribution: naive AVC negative sampling lets the net cheat. Positives always align the audio midpoint to a frame at a multiple of 0.04 s (25 fps); if negatives don't, the net learns to detect audio sampled at multiples of 0.04 s (an MPEG/resampling artifact) instead of semantics. Fix: sample negative audio only at multiples of 0.04 s too.
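A minimal sketch of the fix, assuming each video is described only by its duration in milliseconds. `random_grid_time` and `sample_pair` are hypothetical helpers; integer milliseconds are used so that "multiple of 0.04 s" can be checked exactly.

```python
import random

FRAME_PERIOD_MS = 40   # 25 fps, i.e. one frame every 0.04 s
HALF_CLIP_MS = 500     # half of the 1 s audio window


def random_grid_time(duration_ms, rng):
    """Pick an audio midpoint on the 40 ms grid with room for a full 1 s clip.

    Raises ValueError if the video is too short to hold such a midpoint.
    """
    first = -(-HALF_CLIP_MS // FRAME_PERIOD_MS) * FRAME_PERIOD_MS  # round up to grid
    last = (duration_ms - HALF_CLIP_MS) // FRAME_PERIOD_MS * FRAME_PERIOD_MS
    return rng.randrange(first, last + 1, FRAME_PERIOD_MS)


def sample_pair(durations_ms, positive, rng=random):
    """Return (frame_video, frame_ms, audio_video, audio_mid_ms)."""
    videos = list(durations_ms)
    frame_video = rng.choice(videos)
    frame_ms = random_grid_time(durations_ms[frame_video], rng)
    if positive:
        # Matched pair: audio is centred on the chosen frame.
        return frame_video, frame_ms, frame_video, frame_ms
    # Mismatched pair: audio from another video, but still on the 40 ms grid
    # so that grid alignment carries no information about the label.
    audio_video = rng.choice([v for v in videos if v != frame_video])
    audio_ms = random_grid_time(durations_ms[audio_video], rng)
    return frame_video, frame_ms, audio_video, audio_ms


rng = random.Random(0)
print(sample_pair({"a": 10_000, "b": 7_000, "c": 3_000}, positive=False, rng=rng))
```

The point of the design is that positives and negatives share the same timestamp distribution, so the only remaining cue is content.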
Baselines. Retrieval: L³-Net (retrained, same data/procedure); L³-Net aligned with CCA; VGG16-ImageNet visual features (supervised) aligned to L³-audio via CCA; random chance. Localization: a "predict image center" baseline.
On the AVC task itself, AVE-Net reaches 81.9% accuracy vs. 80.8% for L³-Net, but the authors stress that AVC accuracy is only a proxy; retrieval is the real test. Cross-modal and intra-modal retrieval is measured by mean nDCG@30 (im = image, aud = audio):
| Method | im-im | im-aud | aud-im | aud-aud |
|---|---|---|---|---|
| Random chance | .407 | .407 | .407 | .407 |
| L³-Net | .567 | .418 | .385 | .653 |
| L³-Net + CCA | .578 | .531 | .560 | .649 |
| VGG16-ImageNet (supervised) | .600 | — | — | — |
| VGG16-ImageNet + L³-Audio CCA | .493 | .458 | .464 | .618 |
| AVE-Net | .604 | .561 | .587 | .665 |
AVE-Net beats every baseline on all four query-database modality combos. Notably it edges out fully-supervised VGG16-ImageNet on image-image (.604 vs .600) despite never training on classification, and crushes raw L³-Net on cross-modal retrieval (im-aud .561 vs .418, aud-im .587 vs .385) because L³'s un-aligned features sit at or below the .407 chance level across modalities. AVE-Net even does intra-modal retrieval (im-im, aud-aud) well via transitivity, though it is never explicitly trained for it.
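For readers unfamiliar with the metric in the table, here is a minimal nDCG@k sketch. It assumes linear gain with a log2 rank discount and takes graded relevances as given; how relevance between a query and a database item is defined in the paper is not recorded in these notes, so `ndcg_at_k` is a generic illustration only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k=30):
    """nDCG@k for one query.

    ranked_relevances lists the relevance of every database item in the
    order the system retrieved them; the ideal ordering sorts the same
    relevances from highest to lowest.
    """
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all for this query
    return dcg(ranked_relevances[:k]) / ideal_dcg


print(ndcg_at_k([3, 2, 1, 0]))          # perfect ordering
print(round(ndcg_at_k([0, 1], k=2), 4))  # relevant item ranked second
```

The table reports the mean of this quantity over queries, which is why even random ranking scores well above zero (.407) when many database items share some relevance with the query.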
Localization (AVOL-Net). AVC accuracy is unchanged under the MIL setup, i.e. switching to localization costs no semantic quality. Quantitatively, AVOL-Net localizes the sounding instrument with 81.7% accuracy vs. 57.2% for the center-prediction baseline. Crucially, mismatch experiments (Fig 6) show it is not just a saliency detector: given an image of a piano+flute, playing flute audio highlights the flute, playing piano audio highlights the piano — localization genuinely depends on the sound.
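The localization accuracy can be read as a pointing test: a prediction counts as correct when the heatmap peak lands inside the annotation. The sketch below assumes annotations are axis-aligned boxes `(x0, y0, x1, y1)` in pixels and that each grid cell maps to its centre in a 224×224 image; both are assumptions made for illustration, and the function names are hypothetical.

```python
def heatmap_peak_pixel(sim_map, image_size=224):
    """Map the arg-max cell of an H x W heatmap to its (x, y) pixel centre."""
    rows, cols = len(sim_map), len(sim_map[0])
    r, c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: sim_map[rc[0]][rc[1]],
    )
    return (c + 0.5) * image_size / cols, (r + 0.5) * image_size / rows


def peak_in_box(sim_map, box, image_size=224):
    """True if the heatmap peak falls inside the annotated box."""
    x, y = heatmap_peak_pixel(sim_map, image_size)
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def pointing_accuracy(samples, image_size=224):
    """Fraction of (heatmap, box) pairs whose peak lies inside the box."""
    hits = sum(peak_in_box(m, b, image_size) for m, b in samples)
    return hits / len(samples)


heatmap = [[0.0] * 14 for _ in range(14)]
heatmap[3][10] = 0.9                      # peak at row 3, column 10
print(heatmap_peak_pixel(heatmap))        # (168.0, 56.0)
print(pointing_accuracy([(heatmap, (150, 40, 200, 80)),
                         (heatmap, (0, 0, 50, 50))]))  # 0.5
```

Under this reading, the center-prediction baseline corresponds to always returning the image centre as the peak.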
Negative result worth noting: adding multiple frames (AVE+MF) or optical flow (AVE+OF) raised AVC accuracy (84.7%, 84.9%) but did not improve retrieval — the net exploits low-level motion correlations to solve AVC more cheaply, reducing pressure to learn semantics. So a single frame is used everywhere else.
Author-stated. Being unsupervised, the localizer may fire on only a discriminative part of an object (piano keys, not the whole piano), raising the "what is an object that makes a sound" question (body? keyboard? player?). Max-pooling is crude; the authors propose an explicit soft-attention mechanism as future work. AudioSet is noisy: audio is often dubbed (album cover, unrelated visuals), and ideal nDCG of 1 is unreachable since one frame / 1 s can miss the labelled event. Failure cases include firing on overlaid text/music-sheet artifacts.
What I noticed reading it. (i) Localization is evaluated on a small home-grown set of 500 hand-annotated clips with a single coarse metric (is the heatmap mode inside the annotation), because no instrument bounding-box dataset exists — so the headline 81.7% rests on one author-defined protocol, not a community benchmark. (ii) Everything is musical instruments / singing / tools; whether AVC localization transfers to non-periodic, transient contact sounds (a click, a snap, a pour) is untested — instrument timbre is an unusually clean, sustained audio signal. (iii) The whole result hinges on the Euclidean-distance bottleneck, but there is no ablation isolating "distance bottleneck vs. concat+FC" under matched capacity beyond the AVE-Net-vs-L³ comparison, which also changes where the FCs live. (iv) The shortcut-prevention finding is a genuinely transferable warning: self-supervised correspondence objectives over real data invite exactly this kind of low-level cheating, and the "fix" is dataset-specific (0.04 s grid) rather than principled.
This is an off-theme, non-robot audio-visual ML paper — I treat it as a method anchor / adjacent, not a manipulation result. It does not touch robots, manipulation, planning, or language-conditioned policies, so I won't manufacture relevance. Its value to my agenda is as the canonical architecture pattern for label-free cross-modal alignment and sound-source grounding, and it is the conceptual ancestor of much of the audio-for-manipulation cluster I'm building (ManiWAV, SonicSense, See/Hear/Feel).
The thread that does connect to my thesis: many manipulation predicates I care about for BLADE-style abstraction learning — is_inserted, is_full, is_screwed_tight, surface_is_rough — are not visually evaluable; they live in touch, force, and sound. "Objects that Sound" is the cleanest demonstration that a contact/audio signal can be grounded into spatial structure with zero labels via a correspondence objective. The AVOL-Net MIL trick (audio embedding as a spatial filter over a vision grid) is a template I'd reach for if I ever wanted to localize where a manipulation sound is coming from on an object without annotation. But that bridge is mine to build; the paper itself stops at instrument videos.
"The AVC task stimulates the learnt visual and audio representations to be both discriminative... and semantically meaningful... the only way for a network to solve the task is if it learns to classify semantic concepts in both modalities, and then judge whether the two concepts correspond." — §1 / p.2
"In essence, the audio representation forms a filter which 'looks' for relevant image patches in a similar manner to an attention mechanism." — §4 / p.10
"Deep neural networks are notorious for finding subtle data shortcuts to exploit in order to 'cheat'... the audio for the negative pair is also sampled only from multiples of 0.04s." — §3.3 / p.8
Cited here, candidates to ingest next:
Newly ingested in 2026-06-24 batch — directly relevant: