Visually Indicated Sounds

Andrew Owens, Phillip Isola, Josh McDermott, Antonio Torralba, Edward H. Adelson, William T. Freeman · MIT / U.C. Berkeley / Google Research · CVPR 2016 · arXiv:1512.08512

One-liner. Train a CNN+LSTM to synthesize the plausible impact sound a silent video of a drumstick hitting/scratching objects should make — and show that doing so forces the network to implicitly learn material and physical-interaction properties, because the sound a contact makes is a readout of what the object is made of.

Problem & motivation

Objects make distinctive sounds when struck, and those sounds encode material properties (stiffness, density) and the action that produced them. The authors propose predicting sound from silent video as a proxy task for learning about physical interactions in a scene: to predict a video's held-out soundtrack, an algorithm must reason about what is being hit and how. Unlike classic material-recognition work, the supervision is never an explicit material label — the network discovers material structure by learning the regularities of raw audio-visual data. The framing is inspired by how infants learn intuitive physics by poking and prodding objects. The task is a form of automatic Foley (the film-industry craft of faking impact sounds), but with the human taken out of the loop.

Method

Formulated as regression: map a sequence of video frames to a sequence of audio features, then convert those features back into a waveform (Fig 4).

Sound representation. Waveforms are decomposed into subband envelopes: apply a bank of 40 band-pass filters spaced on an ERB scale (plus a low-pass and a high-pass filter, 42 channels in all), take the Hilbert envelope of each response, downsample to 90 Hz, and compress with exponent c=0.3 (Eq. 1). The result is a cochleagram. The 42-dim feature vector at each timestep is PCA-projected to 10 dims for the regression target.
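A minimal sketch of the envelope, downsample, and compress steps for one band, assuming the band-pass filtering has already been applied. The function names are hypothetical, a naive DFT stands in for an FFT library, and downsampling is done by plain striding with no anti-alias filter, so this illustrates the idea rather than the authors' pipeline.

```python
import cmath
import math


def analytic_envelope(x):
    """Hilbert envelope: magnitude of the analytic signal, via a naive O(N^2) DFT."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            continue              # keep DC (and Nyquist for even n) unchanged
        if k < (n + 1) // 2:
            spectrum[k] *= 2      # double the positive frequencies
        else:
            spectrum[k] = 0       # zero the negative frequencies
    analytic = [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]
    return [abs(z) for z in analytic]


def subband_envelope(band_signal, step, c=0.3):
    """Envelope of one already band-passed signal, strided down and compressed."""
    env = analytic_envelope(band_signal)
    return [e ** c for e in env[::step]]


# A pure tone of amplitude 2 has a flat envelope of 2, so every output is 2 ** 0.3.
tone = [2.0 * math.cos(2 * math.pi * 4 * t / 64) for t in range(64)]
features = subband_envelope(tone, step=8)
```

Running this over every channel of the filter bank and stacking the results per timestep would give the cochleagram-style feature matrix described above.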

Image representation. Motion is made explicit via spacetime images: each frame's three channels are grayscale versions of the previous, current, and next frames (a cheap optical-flow surrogate that avoids fast non-rigid motion problems). Per frame t, the input feature x_t = [φ(F_t), φ(I_1)] concatenates AlexNet fc7 features of the spacetime image F_t and of the first color frame I_1 (Eq. 2). The CNN is either trained from scratch or initialized from ImageNet and fine-tuned.
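The spacetime-image construction can be sketched in a few lines. This is an illustration under assumptions: frames are nested lists of (r, g, b) tuples, grayscale is a plain channel mean, and clip boundaries are handled by clamping, none of which is specified in these notes.

```python
def to_gray(frame):
    """Average the RGB channels of a frame given as rows of (r, g, b) tuples."""
    return [[sum(pixel) / 3.0 for pixel in row] for row in frame]


def spacetime_image(frames, t):
    """Stack grayscale frames t-1, t, t+1 as the three channels of one image."""
    last = len(frames) - 1
    # Clamp at the clip boundaries (an assumption of this sketch).
    neighbors = [frames[max(t - 1, 0)], frames[t], frames[min(t + 1, last)]]
    grays = [to_gray(f) for f in neighbors]
    height, width = len(grays[0]), len(grays[0][0])
    return [
        [(grays[0][y][x], grays[1][y][x], grays[2][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Three tiny 1x2 frames whose brightness increases over time.
frames = [[[(v, v, v), (v, v, v)]] for v in (0, 30, 60)]
middle = spacetime_image(frames, 1)
```

A static pixel ends up with three equal channel values, while a moving edge produces channel disagreement, which is what lets a standard three-channel CNN pick up motion without explicit optical flow.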

Sound prediction model. An LSTM RNN consumes the CNN feature sequence and emits subband-envelope features via an affine map of its hidden state (Eq. 3). CNN feature vectors are replicated k=3 times to bridge the video/audio sampling-rate gap. Training minimizes a robust (log-bounded) loss between predicted and ground-truth sound features over time (Eq. 4); CNN and RNN are trained jointly with Caffe, dropout, and gradient clipping.
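The feature replication and the robust loss are easy to sketch. The penalty form log(eps + r^2) and the value of eps are assumptions of this sketch, since Eq. 4 is not reproduced in these notes, and the function names are hypothetical.

```python
import math


def replicate(features, k=3):
    """Repeat each per-frame feature vector k times to match the audio feature rate."""
    return [vec for vec in features for _ in range(k)]


def robust_loss(predicted, target, eps=1.0 / 25 ** 2):
    """Sum over timesteps of log(eps + r^2), with r the Euclidean feature distance.

    The log bounds the influence of any single badly predicted timestep.
    The penalty form and eps are assumed here, not taken from the notes.
    """
    total = 0.0
    for p, s in zip(predicted, target):
        r2 = sum((a - b) ** 2 for a, b in zip(p, s))
        total += math.log(eps + r2)
    return total


video_features = [[0.1, 0.2], [0.3, 0.4]]   # two video frames
audio_rate_features = replicate(video_features)  # six steps at the audio rate
```

Compared with a squared-error loss, the log penalty grows slowly for large residuals, so a mistimed impact costs far less than it would under L2.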

Waveform generation. Two options: (i) parametric synthesis — iteratively impose the predicted subband envelopes on white noise (one iteration); (ii) example-based synthesis — snap a window of predicted sound features to its nearest exemplar in the training set (by L1 distance) and transfer that exemplar's real waveform. Example-based synthesis gives a strong natural-sound prior and is the one that fools humans best.
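Example-based synthesis reduces to a nearest-neighbor lookup. A minimal sketch, assuming each training exemplar is stored as a (feature window, waveform) pair with the window flattened to a list; the names and the string stand-ins for waveforms are illustrative only.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length feature windows."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_exemplar(predicted, exemplars):
    """Index of the training exemplar whose features are closest in L1 distance."""
    return min(
        range(len(exemplars)),
        key=lambda i: l1_distance(predicted, exemplars[i][0]),
    )


def example_based_synthesis(predicted, exemplars):
    """Return the real waveform attached to the nearest training exemplar."""
    return exemplars[nearest_exemplar(predicted, exemplars)][1]


# Hypothetical training set: (flattened feature window, waveform placeholder).
exemplars = [([0.0, 0.0], "thud"), ([1.0, 1.0], "clang")]
chosen = example_based_synthesis([0.9, 0.8], exemplars)
```

Because the output is always a recorded sound, it is natural by construction; the cost is that the system can never produce a sound unlike anything in the training set.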

Setup

Datasets / benchmarks: the Greatest Hits dataset, introduced here — 977 videos (64% indoor, 36% outdoor) of a person probing scenes with a drumstick, 46,577 total actions, ~48 actions per ~35 s video (~69% hits, ~31% scratches). Semantic annotations (material, action hit-vs-scratch, reaction, impact pixel location) collected on ~62% of impacts via Amazon Mechanical Turk — used only for analysis, never for training the sound model.
Hardware / simulator: not a robot paper. Data capture used a handheld drumstick, a shotgun microphone on the camera (wind cover + denoising for outdoor scenes), and a separate audio recorder. AlexNet CNN + LSTM trained on GPUs (compute specifics not reported).
Baselines: image-based nearest-neighbor retrieval variants (image match, spacetime match, image+spacetime); a random-impact baseline; oracle models that draw exemplars from videos with the same material label; plus ablations of the full system (from scratch, no spacetime, parametric synthesis, no RNN, RGB-only).
Compute: not reported.

Results

Psychophysical study (Fig 5a). A two-alternative forced-choice "real or fake" test on Mechanical Turk. The full system (RGB + spacetime + ImageNet pretraining + example-based synthesis) fooled participants 40.01% ± 1.66% of the time, vs. a 46.90% ± 1.49% ceiling for matching to a real held-out sound from the same video, and far above the image-matching baselines (~33%) and the random-sound baseline (19.77%). Headline ablations and baselines:

Model	% labeled real
Real sound match (ceiling)	46.90 ± 1.49
Full system (ours)	40.01 ± 1.66
− No spacetime	37.88 ± 1.67
− Trained from scratch	36.46 ± 1.68
− Parametric synthesis	34.66 ± 1.62
− No RNN	29.96 ± 1.55
Image + spacetime	33.77 ± 1.18
Image match	32.98 ± 1.59
Random impact sound	19.77 ± 1.34

Full system significantly beat image matching (p < 0.001); RGB-only vs. full was not significant (p = 0.08). On automated auditory metrics (loudness error, spectral-centroid error; Fig 5a right), the network beat image matching too. Where it loses: parametric (non-example-based) synthesis fooled participants on soft, textural materials but did poorly on hard materials like wood and metal (confusion rate 62% ± 6% for dirt vs. 18% ± 5% for metal), and the example-based variant struggled with highly variable textural sounds like splashing water (Fig 7).

Material/action emerges from sound (Sec 6.3). An SVM trained on real sounds and applied to the network's predicted sounds reached 22.7% class-averaged material accuracy (chance 5.9%) — rising to 28.8% with ImageNet pretraining — demonstrating the predicted sound carries genuine material information. Predicted-sound action classification hit 53.5% (chance 50%) and reaction 55.2%. As a sanity check, an SVM on real impact sounds gets 45.8% material accuracy (Sec 4), confirming impact sounds are materially informative in the first place. Impact detection on long videos: average precision 43.6% (RGB+spacetime) vs. 21.6% (RGB-only).
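Class-averaged accuracy, the metric quoted for the material readout, weights every class equally regardless of how many examples it has. A small sketch with made-up labels; the reading that a 5.9% chance level corresponds to 17 equally weighted classes is an inference from the number, not something stated above.

```python
from collections import defaultdict


def class_averaged_accuracy(true_labels, predicted_labels):
    """Mean of per-class accuracies, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy labels: a classifier that always answers "metal".
true_labels = ["metal"] * 4 + ["dirt"]
predicted_labels = ["metal"] * 5
score = class_averaged_accuracy(true_labels, predicted_labels)

# Chance level for N equally weighted classes is 1 / N (N = 17 is assumed here).
chance_percent = 100.0 / 17
```

In the toy case, plain accuracy would be 80% while the class-averaged score is 50%, which is why this metric is the fairer one for an imbalanced material distribution.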

Limitations & open questions

From the authors:

Splash sounds violate the single-amplitude-peak assumption of the onset detector, so the dataset/onset pipeline is biased toward crisp hits and scratches.
The model misses or hallucinates impacts when the drumstick moves erratically; the no-RNN ablation often fails to detect hit timing and under-predicts amplitude (Fig 8 failure modes: missed railing tap, false cushion hit).
Example-based synthesis is poor on highly variable textural sounds (water); parametric synthesis is poor on hard materials.

What I noticed reading it:

The "fool rate" metric tops out at 46.9% even for ground-truth real sounds: people are bad at this task, which inflates how good 40% sounds. The right framing is "85% of the human ceiling" (40.01 / 46.90 ≈ 0.85), not "fooled 40% of the time."
Material/action recovery is read out by an SVM trained on real audio, not from the network's internal features — so it measures whether the output is materially informative, a weaker claim than a learned representation being so. The fc7-feature confusion analysis (30.2% accuracy) is the representation-level result and is reported separately.
Drumstick-only contact is a narrow, consistent action distribution. Whether the material readout generalizes to arbitrary contact tools or forces (the regime a robot gripper actually operates in) is untested.
Single dataset, single capture rig; no cross-environment generalization study. The indoor/outdoor split hints at domain shift but isn't isolated.
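The first observation above, about reading fool rates relative to the real-sound ceiling, is simple arithmetic on the results table. A small sketch; the dictionary and function names are illustrative, and the numbers are copied from the table.

```python
# Fool rates (% of trials labeled real) copied from the results table.
FOOL_RATES = {
    "Full system": 40.01,
    "Image match": 32.98,
    "Random impact sound": 19.77,
}
REAL_SOUND_CEILING = 46.90


def fraction_of_ceiling(rate, ceiling=REAL_SOUND_CEILING):
    """Express a fool rate as a fraction of the real-sound-match ceiling."""
    return rate / ceiling


for name, rate in FOOL_RATES.items():
    print(f"{name}: {100 * fraction_of_ceiling(rate):.1f}% of ceiling")
```

Normalizing this way also makes the baselines easier to compare, since every row is measured against the same achievable maximum rather than against 100%.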

Why I care

Adjacent, not a manipulation paper: there is no robot, no policy, no planning here. It is a 2016 audio-visual computer-vision classic, and it earns its place in this batch as an insight anchor rather than a method I'd directly build on.

The load-bearing idea, though, is exactly the thesis behind the contact-audio robotics cluster: the sound of a contact is a near-direct readout of material and physical-interaction properties — precisely the predicates (surface_is_rough, is_made_of_metal, hardness, the hit-vs-scratch distinction) that are not visually evaluable and live in the touch/sound channel. BLADE learns its predicate classifiers from vision; a predicate like "the lid is screwed tight" or "this surface is rough" has no clean visual signature but a sharp acoustic one. This paper is the foundational demonstration that a learned model can extract that signature from impact audio at all — the empirical bedrock that later robot contact-audio papers (SonicSense, ManiWAV, See, Hear, and Feel) exploit for manipulation. The "predict one modality from another as self-supervision" recipe is also the conceptual seed for the binding/alignment line (Objects that Sound, ImageBind).

Quotable

Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. — Abstract / p.1

This task implicitly requires material recognition, but unlike traditional work on this problem, we never explicitly tell the algorithm about materials. Instead, it learns about them by identifying statistical regularities in the raw audio-visual signal. — §1 Introduction / p.1

This accuracy indicates that our model learned an output representation that was informative about material, even though it was only trained to predict sound. — §6.3 / p.7

Papers cited here that could be ingested next:

Arnab et al. 2015 [2] — semantic segmentation that incorporates audio from impact sounds; the direct prior on using contact audio for material/object recognition.
Davis et al. 2014 [9] — The Visual Microphone — passive recovery of sound from high-speed video of vibrating objects; the inverse problem (sound from subtle visual vibration).
Owens et al. — Greatest Hits / ambient-sound features — the same group's follow-up using these sounds as a supervisory signal for visual representation learning.
Sinapov, Wiemer & Stoytchev 2009 [41] — ICRA work on interactive learning of acoustic properties of household objects; the robotics-side precursor to contact-audio sensing.

Newly ingested in the 2026-06-24 batch — directly relevant:

Objects that Sound and The Sound of Pixels — the audio-visual correspondence/separation line that grew out of this "predict-across-modalities" framing; closest siblings (Cluster E).
SonicSense, Active Acoustic Sensing for Manipulation, and VibeCheck — the robotics realization of this paper's core insight: contact / acoustic vibration encodes material and object state (Cluster D).
ManiWAV, See, Hear, and Feel, and Making Sense of Audio Vibration (pouring) — audio-as-feedback manipulation policies; downstream uses of the contact-sound signal this paper first characterized (Cluster D).
ObjectFolder and ObjectFolder 2.0 — multisensory (incl. impact-audio) object datasets; the simulated counterpart to Greatest Hits (Cluster I).
ImageBind — the modern many-modality binding model whose audio-visual leg descends from this self-supervised cross-modal-prediction idea (Cluster H).