One-liner. Train a CNN+LSTM to synthesize the plausible impact sound a silent video of a drumstick hitting/scratching objects should make — and show that doing so forces the network to implicitly learn material and physical-interaction properties, because the sound a contact makes is a readout of what the object is made of.
Objects make distinctive sounds when struck, and those sounds encode material properties (stiffness, density) and the action that produced them. The authors propose predicting sound from silent video as a proxy task for learning about physical interactions in a scene: to predict a video's held-out soundtrack, an algorithm must reason about what is being hit and how. Unlike classic material-recognition work, the supervision is never an explicit material label — the network discovers material structure by learning the regularities of raw audio-visual data. The framing is inspired by how infants learn intuitive physics by poking and prodding objects. The task is a form of automatic Foley (the film-industry craft of faking impact sounds), but with the human taken out of the loop.
Formulated as regression: map a sequence of video frames to a sequence of audio features, then convert those features back into a waveform (Fig 4).
Sound representation. Waveforms are decomposed into
subband envelopes: apply a bank of 40 band-pass filters spaced on an ERB scale
(plus a low-pass and a high-pass band, giving 42 channels), take the Hilbert
envelope of each response, downsample to 90 Hz, and compress with exponent
c=0.3 (Eq. 1). The result is a cochleagram. The 42-dim feature vector at each
timestep is PCA-projected to 10 dims for the regression target.
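The envelope pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the half-cosine filter shape, the Glasberg-Moore ERB-rate formula, the 20 Hz lower edge, and the plain decimation (no anti-alias filter) are all assumptions, and the sketch produces only the 40 band-pass channels, without any extra low-pass or high-pass band.

```python
import numpy as np

def erb_rate(f_hz):
    """Glasberg-Moore ERB-rate scale (assumed choice of ERB formula)."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def subband_envelopes(wave, sr, n_bands=40, env_sr=90, c=0.3, f_lo=20.0):
    """Return compressed subband envelopes with shape (n_frames, n_bands)."""
    wave = np.asarray(wave, dtype=float)
    n = len(wave)
    spectrum = np.fft.fft(wave)
    freqs = np.fft.fftfreq(n, d=1.0 / sr)
    pos = freqs > 0                      # positive-frequency bins only
    e_bins = erb_rate(freqs[pos])
    # n_bands + 2 points evenly spaced on the ERB-rate axis; interior
    # points are band centers, neighbors are the band edges.
    pts = np.linspace(erb_rate(f_lo), erb_rate(sr / 2.0), n_bands + 2)
    hop = int(round(sr / env_sr))        # naive decimation factor
    bands = []
    for i in range(1, n_bands + 1):
        lo, mid, hi = pts[i - 1], pts[i], pts[i + 1]
        inside = (e_bins > lo) & (e_bins < hi)
        resp = np.zeros(n)
        # Half-cosine band-pass response: 1 at the center, 0 at the edges.
        resp[pos] = np.where(inside, np.cos(np.pi * (e_bins - mid) / (hi - lo)), 0.0)
        # Zeroing negative frequencies and doubling the positive ones gives
        # the analytic signal; its magnitude is the Hilbert envelope.
        analytic = np.fft.ifft(2.0 * spectrum * resp)
        envelope = np.abs(analytic)
        bands.append(envelope[::hop] ** c)   # downsample, then compress
    return np.stack(bands, axis=1)
```

For a one-second clip sampled at 9 kHz this returns a 90 x 40 array, roughly three envelope samples per video frame at 30 fps.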
Image representation. Motion is made explicit via
spacetime images: each frame's three channels are grayscale versions
of the previous, current, and next frames (a cheap optical-flow surrogate that
avoids fast non-rigid motion problems). Per frame t, the input
feature x_t = [φ(F_t), φ(I_1)] concatenates AlexNet
fc7 features of the spacetime image F_t and of the
first color frame I_1 (Eq. 2). The CNN is either trained from
scratch or initialized from ImageNet and fine-tuned.
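As a rough sketch of the spacetime-image construction, the snippet below stacks grayscale versions of the previous, current, and next frames into the three channels of each output frame. The luma weights and the choice to repeat the first and last frames at the clip boundaries are assumptions made for illustration; the notes above do not say how the boundaries are handled.

```python
import numpy as np

def to_gray(frame_rgb):
    """Luma-weighted grayscale of an (H, W, 3) frame (weights are assumed)."""
    return frame_rgb @ np.array([0.299, 0.587, 0.114])

def spacetime_images(frames_rgb):
    """(T, H, W, 3) RGB clip -> (T, H, W, 3) with channels gray(prev, cur, next)."""
    gray = np.stack([to_gray(f) for f in frames_rgb])       # (T, H, W)
    # Repeat the first and last frames so every t has a prev and a next.
    padded = np.concatenate([gray[:1], gray, gray[-1:]])    # (T + 2, H, W)
    return np.stack([padded[:-2], padded[1:-1], padded[2:]], axis=-1)
```

Because the result still has three channels, it can be fed to an image CNN with an unchanged input layer, which is what makes the trick cheap.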
Sound prediction model. An LSTM RNN consumes the CNN
feature sequence and emits subband-envelope features via an affine map of its
hidden state (Eq. 3). CNN feature vectors are replicated k=3 times
to bridge the video/audio sampling-rate gap. Training minimizes a robust
(log-bounded) loss between predicted and ground-truth sound features over time
(Eq. 4); CNN and RNN are trained jointly with Caffe, dropout, and gradient
clipping.
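Two pieces of this step are easy to illustrate in isolation: the k-fold replication of per-frame features and the robust loss. The sketch below assumes the log-bounded penalty has the form rho(r) = log(eps + r^2) applied to the per-timestep L2 error; both that form and the value of `eps` are assumptions here, so check them against Eq. 4 before relying on them.

```python
import numpy as np

def replicate_features(cnn_feats, k=3):
    """Repeat each per-frame feature row k times: (T, D) -> (k * T, D)."""
    return np.repeat(cnn_feats, k, axis=0)

def robust_loss(pred, target, eps=1.0 / 25 ** 2):
    """Sum over timesteps of log(eps + r^2), with r the L2 error per step.

    The functional form and eps are assumed for illustration.
    """
    r = np.linalg.norm(pred - target, axis=1)
    return float(np.sum(np.log(eps + r ** 2)))
```

The point of the logarithm is that a badly mistimed impact, which produces a very large residual for a few timesteps, adds only a slowly growing penalty instead of dominating the objective the way a squared error would.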
Waveform generation. Two options: (i) parametric
synthesis — iteratively impose the predicted subband envelopes on
white noise (one iteration); (ii) example-based synthesis — snap
a window of predicted sound features to its nearest exemplar in the training
set (by L1 distance) and transfer that exemplar's real waveform.
Example-based synthesis gives a strong natural-sound prior and is the one that
fools humans best.
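The example-based option amounts to a nearest-neighbor lookup. A minimal sketch, assuming the training set is stored as an array of feature windows with a parallel list of waveforms (the names `train_windows` and `train_waveforms` are hypothetical):

```python
import numpy as np

def nearest_exemplar(pred_window, train_windows):
    """Index of the training window closest to pred_window in L1 distance.

    pred_window: (T, D) predicted sound features.
    train_windows: (N, T, D) feature windows from the training set.
    """
    dists = np.abs(train_windows - pred_window).sum(axis=(1, 2))
    return int(np.argmin(dists))

def example_based_synthesis(pred_window, train_windows, train_waveforms):
    """Return the real waveform attached to the nearest training exemplar."""
    return train_waveforms[nearest_exemplar(pred_window, train_windows)]
```

Since the output is always a real recording, it cannot sound synthetic, but it also cannot produce a sound that is absent from the training set.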
Psychophysical study (Fig 5a). A two-alternative forced-choice "real or fake" test on Mechanical Turk. The full system (RGB + spacetime + ImageNet pretraining + example-based synthesis) fooled participants 40.01% ± 1.66% of the time, against a 46.90% ± 1.49% ceiling for matching to a real held-out sound from the same video, and well above the image-matching baselines (about 33%) and the random-sound baseline (19.77%). Headline ablations:
| Model | % labeled real |
|---|---|
| Real sound match (ceiling) | 46.90 ± 1.49 |
| Full system (ours) | 40.01 ± 1.66 |
| Full system, without spacetime images | 37.88 ± 1.67 |
| Full system, with parametric synthesis | 34.66 ± 1.62 |
| Full system, trained from scratch | 36.46 ± 1.68 |
| Image match | 32.98 ± 1.59 |
| Image match + spacetime | 33.77 ± 1.18 |
| Full system, without RNN | 29.96 ± 1.55 |
| Random impact sound | 19.77 ± 1.34 |
Full system significantly beat image matching (p < 0.001);
RGB-only vs. full was not significant (p = 0.08). On automated
auditory metrics (loudness error, spectral-centroid error; Fig 5a right), the
network beat image matching too. Where it loses: parametric (non
example-based) synthesis did well on noisy, loosely structured materials such
as dirt but poorly on hard materials like wood and metal (confusion rate
62% ± 6% for dirt vs. 18% ± 5% for metal), and the example-based variant
struggled with highly variable textural sounds like splashing water (Fig 7).
Material/action emerges from sound (Sec 6.3). An SVM trained on real sounds and applied to the network's predicted sounds reached 22.7% class-averaged material accuracy (chance 5.9%), rising to 28.8% with ImageNet pretraining, which shows that the predicted sound carries genuine material information. On predicted sounds, action classification reached 53.5% (chance 50%) and reaction classification 55.2%. As a sanity check, an SVM on real impact sounds gets 45.8% material accuracy (Sec 4), confirming that impact sounds are informative about material in the first place. Impact detection on long videos: average precision 43.6% (RGB+spacetime) vs. 21.6% (RGB-only).
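Because the material numbers are class-averaged, it may help to see how that metric differs from plain accuracy. The sketch below is a generic implementation, not the authors' evaluation code, and the labels in the usage note are made up. The quoted 5.9% chance level matches a uniform guess over 17 classes (1/17 ≈ 0.059); that class count is inferred from the figure, not stated in these notes.

```python
from collections import defaultdict

def class_averaged_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

With eight "metal" clips and two "dirt" clips, a classifier that always answers "metal" scores 80% plain accuracy but only 50% class-averaged accuracy, which is why the class-averaged figure is the fairer one on an imbalanced material set.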
From the authors:
What I noticed reading it:
Adjacent, not a manipulation paper — there is no robot, no policy, no planning here. It is a 2016 audio-visual computer-vision classic, and it earns its place in this batch as a method/insight anchor rather than a method I'd directly build on.
The load-bearing idea, though, is exactly the thesis behind the contact-audio
robotics cluster: the sound of a contact is a near-direct readout of
material and physical-interaction properties — precisely the
predicates (surface_is_rough, is_made_of_metal,
hardness, the hit-vs-scratch distinction) that are not visually
evaluable and live in the touch/sound channel. BLADE
learns its predicate classifiers from vision; a predicate like
"the lid is screwed tight" or "this surface is rough" has no clean visual
signature but a sharp acoustic one. This paper is the foundational
demonstration that a learned model can extract that signature from impact
audio at all — the empirical bedrock that later robot contact-audio
papers (SonicSense,
ManiWAV,
See, Hear, and Feel)
exploit for manipulation. The "predict one modality from another as
self-supervision" recipe is also the conceptual seed for the binding/alignment
line (Objects that Sound,
ImageBind).
Objects make distinctive sounds when they are hit or scratched. These sounds reveal aspects of an object's material properties, as well as the actions that produced them. — Abstract / p.1
This task implicitly requires material recognition, but unlike traditional work on this problem, we never explicitly tell the algorithm about materials. Instead, it learns about them by identifying statistical regularities in the raw audio-visual signal. — §1 Introduction / p.1
This accuracy indicates that our model learned an output representation that was informative about material, even though it was only trained to predict sound. — §6.3 / p.7
Papers cited here that could be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant: