Visually Indicated Sounds

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman · MIT / U.C. Berkeley / Google Research · CVPR 2016 · arXiv:1512.08512

One-liner. Train a CNN+LSTM to synthesize a plausible impact sound for a silent video of a drumstick hitting or scratching objects, and show that doing so forces the network to implicitly learn material and physical-interaction properties, because the sound a contact makes is a readout of what the object is made of.

Problem & motivation

Objects make distinctive sounds when struck, and those sounds encode material properties (stiffness, density) and the action that produced them. The authors propose predicting sound from silent video as a proxy task for learning about physical interactions in a scene: to predict a video's held-out soundtrack, an algorithm must reason about what is being hit and how. Unlike classic material-recognition work, the supervision is never an explicit material label — the network discovers material structure by learning the regularities of raw audio-visual data. The framing is inspired by how infants learn intuitive physics by poking and prodding objects. The task is a form of automatic Foley (the film-industry craft of faking impact sounds), but with the human taken out of the loop.

Method

Formulated as regression: map a sequence of video frames to a sequence of audio features, then convert those features back into a waveform (Fig 4).

Sound representation. Waveforms are decomposed into subband envelopes: the waveform is passed through a bank of 40 band-pass filters spaced on an ERB scale (plus a low-pass and a high-pass band), the Hilbert envelope of each response is taken, downsampled to 90 Hz, and compressed with exponent c=0.3 (Eq. 1). The result is a cochleagram. The resulting 42-dim feature vector per timestep is PCA-projected to 10 dims for the regression target.
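A minimal sketch of the subband-envelope pipeline, to make the steps concrete. It is not the authors' code: the half-cosine filter shape, the 50 Hz lower cutoff, the block-average downsampling, and the function names are all assumptions, and it covers only the 40 band-pass channels (no extra low- or high-pass bands, no PCA step).

```python
import numpy as np

def hz_to_erb(f_hz):
    """Glasberg-Moore ERB-number scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def cochleagram(wave, sr, n_bands=40, f_lo=50.0, env_sr=90, c=0.3):
    """Return compressed subband envelopes with shape (n_bands, n_frames)."""
    n = len(wave)
    spectrum = np.fft.fft(wave)
    erb = hz_to_erb(np.abs(np.fft.fftfreq(n, d=1.0 / sr)))

    # Points evenly spaced on the ERB scale; edges[i + 1] is the center of band i.
    # f_lo and the half-cosine filter shape below are assumptions.
    edges = np.linspace(hz_to_erb(f_lo), hz_to_erb(sr / 2.0), n_bands + 2)
    half_width = edges[1] - edges[0]

    # Analytic-signal mask: keep DC (and Nyquist), double positive bins, zero negative bins.
    analytic = np.zeros(n)
    analytic[0] = 1.0
    analytic[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        analytic[n // 2] = 1.0

    step = sr // env_sr              # assumes sr is a multiple of env_sr
    n_frames = n // step
    out = np.empty((n_bands, n_frames))
    for i in range(n_bands):
        offset = (erb - edges[i + 1]) / half_width
        gain = np.where(np.abs(offset) < 1.0, np.cos(0.5 * np.pi * offset), 0.0)
        # Hilbert envelope = magnitude of the band-limited analytic signal.
        envelope = np.abs(np.fft.ifft(spectrum * gain * analytic))
        # Downsample by block averaging (an assumption), then compress.
        frames = envelope[: n_frames * step].reshape(n_frames, step).mean(axis=1)
        out[i] = frames ** c
    return out

sr = 9000                            # hypothetical sampling rate, chosen so sr / 90 is an integer
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
coch = cochleagram(tone, sr)
print(coch.shape)                    # (40, 90)
```

For a one-second 1 kHz tone the energy concentrates in the few bands whose centers sit near 1 kHz on the ERB scale, which is a quick way to sanity-check the filter bank.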

Image representation. Motion is made explicit via spacetime images: each frame's three channels are grayscale versions of the previous, current, and next frames (a cheap surrogate for optical flow, which is hard to estimate under fast, non-rigid motion). Per frame t, the input feature x_t = [φ(F_t), φ(I_1)] concatenates AlexNet fc7 features of the spacetime image F_t and of the first color frame I_1 (Eq. 2). The CNN is either trained from scratch or initialized from ImageNet and fine-tuned.
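The spacetime-image construction can be sketched in a few lines. The grayscale weights and the choice to repeat the first and last frame at the clip boundaries are assumptions made for this illustration; the function name is hypothetical.

```python
import numpy as np

def spacetime_images(frames):
    """frames: (T, H, W, 3) RGB video. Returns (T, H, W, 3) spacetime images
    whose channels are the grayscale previous, current, and next frames."""
    # Luminance weights are an assumption (ITU-R BT.601).
    gray = frames @ np.array([0.299, 0.587, 0.114])               # (T, H, W)
    # Boundary handling is an assumption: repeat the first and last frame.
    padded = np.concatenate([gray[:1], gray, gray[-1:]], axis=0)  # (T + 2, H, W)
    return np.stack([padded[:-2], padded[1:-1], padded[2:]], axis=-1)

# Toy clip: frame t is filled with the constant value t.
clip = np.zeros((4, 2, 2, 3))
for t in range(4):
    clip[t] = t
st = spacetime_images(clip)
print(st.shape)       # (4, 2, 2, 3)
print(st[1, 0, 0])    # channels hold frames 0, 1, 2
```

Each output image would then go through the CNN to give φ(F_t), alongside φ(I_1) for the first color frame.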

Sound prediction model. An LSTM RNN consumes the CNN feature sequence and emits subband-envelope features via an affine map of its hidden state (Eq. 3). CNN feature vectors are replicated k=3 times to bridge the video/audio sampling-rate gap. Training minimizes a robust (log-bounded) loss between predicted and ground-truth sound features over time (Eq. 4); CNN and RNN are trained jointly in Caffe, with dropout and gradient clipping.
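Two pieces of this setup are easy to show in isolation: the k-fold replication of frame features and a log-type robust loss. The sketch below assumes the loss has the form ρ(r) = log(ε + r²) applied to the per-timestep L2 residual; both that form and the value of ε are assumptions here, not taken from these notes, and the LSTM itself is omitted.

```python
import numpy as np

def replicate_features(x, k=3):
    """Repeat each per-frame feature vector k times along the time axis,
    so the sequence length matches the audio feature rate."""
    return np.repeat(x, k, axis=0)

def robust_loss(pred, target, eps=1.0 / 25.0 ** 2):
    """Sum over timesteps of log(eps + r^2), where r is the L2 residual.
    The functional form and eps are assumptions for illustration."""
    r_sq = np.sum((pred - target) ** 2, axis=1)
    return float(np.sum(np.log(eps + r_sq)))

video_feats = np.arange(6, dtype=float).reshape(2, 3)   # 2 frames, 3-dim features
audio_rate_feats = replicate_features(video_feats)
print(audio_rate_feats.shape)                            # (6, 3)

target = np.zeros((4, 10))
small = robust_loss(target + 1.0, target)
large = robust_loss(target + 10.0, target)
print(small, large)
```

With a loss of this shape, a tenfold larger residual adds only a constant per timestep rather than a hundredfold penalty, which is what makes it tolerant of the occasional badly mispredicted impact.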

Waveform generation. Two options: (i) parametric synthesis, which imposes the predicted subband envelopes on white noise (the procedure is iterative, but a single iteration is used); (ii) example-based synthesis, which snaps a window of predicted sound features to its nearest exemplar in the training set (by L1 distance) and transfers that exemplar's real waveform. Example-based synthesis gives a strong natural-sound prior and is the one that fools humans best.
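The example-based option reduces to a nearest-neighbor lookup, sketched below. Array shapes, names, and the toy data are hypothetical; how the window is centered and sized is not specified in these notes.

```python
import numpy as np

def nearest_exemplar(pred_window, train_windows, train_waves):
    """pred_window: (W, D) predicted sound features.
    train_windows: (N, W, D) feature windows from the training set.
    train_waves: list of N waveforms aligned with train_windows.
    Returns the index and waveform of the L1-nearest training exemplar."""
    dists = np.abs(train_windows - pred_window).sum(axis=(1, 2))
    best = int(np.argmin(dists))
    return best, train_waves[best]

# Toy data: three exemplars with 2-step, 2-dim feature windows.
train_windows = np.array([
    [[0.0, 0.0], [0.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[5.0, 5.0], [5.0, 5.0]],
])
train_waves = [np.zeros(4), np.ones(4), np.full(4, 5.0)]
pred_window = np.array([[0.9, 1.2], [1.0, 0.8]])

idx, wave = nearest_exemplar(pred_window, train_windows, train_waves)
print(idx)    # 1
```

Because the output is always a real recorded waveform, the result sounds natural even when the predicted features are blurry, at the cost of never producing a sound outside the training set.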

Setup

Results

Psychophysical study (Fig 5a). A two-alternative forced-choice "real or fake" test on Mechanical Turk. The full system (RGB + spacetime + ImageNet pretraining + example-based synthesis) fooled participants 40.01% ± 1.66% of the time. That compares with a 46.90% ± 1.49% ceiling for matching to a real held-out sound from the same video, and is well above the image-matching baselines (~33%) and the random-sound baseline (19.77%). Headline ablations:

Model                         % labeled real
Real sound match (ceiling)    46.90 ± 1.49
Full system (ours)            40.01 ± 1.66
− No spacetime                37.88 ± 1.67
− Parametric synthesis        34.66 ± 1.62
− Trained from scratch        36.46 ± 1.68
Image match                   32.98 ± 1.59
Image + spacetime             33.77 ± 1.18
− No RNN                      29.96 ± 1.55
Random impact sound           19.77 ± 1.34

Full system significantly beat image matching (p < 0.001); RGB-only vs. full was not significant (p = 0.08). On automated auditory metrics (loudness error, spectral-centroid error; Fig 5a right), the network beat image matching too. Where it loses: parametric (non-example-based) synthesis did poorly on hard materials like wood and metal while doing much better on dirt (confusion rate 18% ± 5% for metal vs. 62% ± 6% for dirt), and the example-based variant struggled with highly variable textural sounds like splashing water (Fig 7).

Material/action emerges from sound (Sec 6.3). An SVM trained on real sounds and applied to the network's predicted sounds reached 22.7% class-averaged material accuracy (chance 5.9%) — rising to 28.8% with ImageNet pretraining — demonstrating the predicted sound carries genuine material information. Predicted-sound action classification hit 53.5% (chance 50%) and reaction 55.2%. As a sanity check, an SVM on real impact sounds gets 45.8% material accuracy (Sec 4), confirming impact sounds are materially informative in the first place. Impact detection on long videos: average precision 43.6% (RGB+spacetime) vs. 21.6% (RGB-only).
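The train-on-real, test-on-predicted protocol and the class-averaged metric can be sketched as follows. A nearest-centroid classifier stands in for the SVM to keep the example dependency-free, and the data are synthetic; none of this reproduces the paper's numbers. For reference, a chance level of 5.9% is what 1/17 gives, which would match 17 balanced material classes (the class count is inferred here, not stated in these notes).

```python
import numpy as np

def fit_centroids(features, labels):
    """Nearest-centroid classifier (a stand-in for the SVM)."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(features, classes, centroids):
    d = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so rare classes count as much as common ones."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c)
                          for c in np.unique(y_true)]))

# Synthetic stand-ins: "real" sound features cluster tightly per material,
# "predicted" sound features are noisier versions of the same clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 20)
real = centers[labels] + 0.1 * rng.standard_normal((60, 2))
predicted = centers[labels] + 1.0 * rng.standard_normal((60, 2))

classes, centroids = fit_centroids(real, labels)       # train on real sounds
acc = class_averaged_accuracy(labels, predict(predicted, classes, centroids))
print(acc)
```

The point of the protocol is that the classifier never sees predicted sounds during training, so any accuracy above chance has to come from material information the network put into its output.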

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

Adjacent, not a manipulation paper: there is no robot, no policy, no planning here. It is a 2016 audio-visual computer-vision classic, and it earns its place in this batch as a conceptual anchor rather than a method I'd directly build on.

The load-bearing idea, though, is exactly the thesis behind the contact-audio robotics cluster: the sound of a contact is a near-direct readout of material and physical-interaction properties — precisely the predicates (surface_is_rough, is_made_of_metal, hardness, the hit-vs-scratch distinction) that are not visually evaluable and live in the touch/sound channel. BLADE learns its predicate classifiers from vision; a predicate like "the lid is screwed tight" or "this surface is rough" has no clean visual signature but a sharp acoustic one. This paper is the foundational demonstration that a learned model can extract that signature from impact audio at all — the empirical bedrock that later robot contact-audio papers (SonicSense, ManiWAV, See, Hear, and Feel) exploit for manipulation. The "predict one modality from another as self-supervision" recipe is also the conceptual seed for the binding/alignment line (Objects that Sound, ImageBind).

Quotable

Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. — Abstract / p.1
This task implicitly requires material recognition, but unlike traditional work on this problem, we never explicitly tell the algorithm about materials. Instead, it learns about them by identifying statistical regularities in the raw audio-visual signal. — §1 Introduction / p.1
This accuracy indicates that our model learned an output representation that was informative about material, even though it was only trained to predict sound. — §6.3 / p.7

Related

Papers cited here that could be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: