Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation
Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta
· CMU / Olin / Meta AI
· 2024 · arXiv preprint
· arXiv:2405.08576
One-liner. Treat a cheap piezo contact microphone as an
"audio-native" tactile sensor, and you can bootstrap its representation from
2M+ internet audio-visual clips (AudioSet, via AVID) instead of training from
scratch — the resulting contact-audio features measurably boost low-data
behavior cloning and, surprisingly, make policies more visually robust
despite a large domain gap between YouTube sounds and robot scraping noises.
Problem & motivation
Vision representations for robots get to ride internet-scale pretraining
(R3M, VIP, MVP); every other modality — especially touch — is
trained from scratch on a few hundred task-specific samples, because no
internet-scale tactile corpus exists. The authors' wedge: a piezo
contact microphone produces a signal that is literally audio
(structural vibrations sampled at 32–48 kHz, up to ~1000× the bandwidth of
optical/magnetic tactile sensors). So the "no internet-scale tactile data"
problem dissolves — you can pretrain the touch encoder on the enormous
pool of internet audio-visual video. This is pitched as the first method to use
large-scale multisensory (not vision-only) pretraining for robot
manipulation, and it targets the low-data regime (≤60 demos/task) where
scratch-trained sensory encoders are weakest.
Method
Two-stage pipeline (Fig 2): large-scale pretraining of two encoders, then
end-to-end behavior cloning on a handful of in-domain demos.
Sensors. Four piezo contact microphones mounted on the
Franka gripper, each recording at 32 kHz; signals are averaged into one channel
and downsampled to 16 kHz. Because they read structural vibration, they pick up
not just direct gripper-object contact but indirect contact traveling
along a grasped tool (spatula, spoon) — subtle surface interactions vision
can't see. Per timestep the policy ingests an image v_t from a
fixed third-person RealSense and a 2-second contact-audio clip
a_t, rendered as a mel spectrogram (AVID's preprocessing).
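The audio path described above (average four microphones, go from 32 kHz to 16 kHz, keep the latest 2 seconds) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the pair-averaging downsampler is a crude stand-in for whatever resampler they used, and the mel-spectrogram step is left out.

```python
from typing import List, Sequence

RAW_RATE_HZ = 32_000      # per-microphone recording rate stated in the notes
TARGET_RATE_HZ = 16_000   # rate after downsampling
CLIP_SECONDS = 2          # length of the contact-audio clip a_t


def mix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-microphone signals sample by sample into one channel."""
    n = min(len(ch) for ch in channels)
    return [sum(ch[i] for ch in channels) / len(channels) for i in range(n)]


def downsample_by_two(signal: Sequence[float]) -> List[float]:
    """Halve the sample rate by averaging adjacent sample pairs.

    Assumption: pair-averaging is a crude substitute for a proper
    anti-aliasing filter; the notes do not say which resampler was used.
    """
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def latest_clip(signal: Sequence[float], rate_hz: int = TARGET_RATE_HZ,
                seconds: int = CLIP_SECONDS) -> List[float]:
    """Return the most recent `seconds` of audio, left-padded with zeros if short."""
    want = rate_hz * seconds
    tail = list(signal[-want:])
    return [0.0] * (want - len(tail)) + tail


# Usage: three seconds of synthetic constant signals from four microphones.
four_mics = [[float(m)] * (RAW_RATE_HZ * 3) for m in range(4)]
clip = latest_clip(downsample_by_two(mix_to_mono(four_mics)))
print(len(clip))  # 32000 samples = 2 s at 16 kHz
```

In a real pipeline the resulting clip would then be converted to a mel spectrogram before it reaches the audio encoder.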
Encoder pretraining. The audio encoder is lifted
from AVID (Audio-Visual Instance Discrimination, Morgado et al.
CVPR 2021), self-supervised with cross-modal instance discrimination on
AudioSet (~2M 10-second internet clips). The vision
encoder is R3M (ResNet18 pretrained on Ego4D with
time-contrastive + video-language alignment) — deliberately a known-good
robot vision representation, so any gain is attributable to the audio side.
Both encoders are fine-tuned rather than frozen during policy learning (following Dean
et al., "Don't freeze your embedding").
Audio-visual behavior cloning. Four image embeddings
spanning the same 2 s window plus the single audio embedding (5 tokens total) receive
learned positional embeddings and pass through a single self-attention
transformer block (the fusion mechanism, à la See-Hear-Feel [6]); the output tokens are
concatenated and fed to a 2-layer MLP head. The policy is quasi
open-loop: at time t it predicts H actions and
executes h ≤ H (here h = 2) before re-predicting
— balancing reactivity to audio against non-Markovian demo artifacts
(pauses). Each action is a 6-D delta end-effector command (Cartesian xyz + Euler
αβγ); standard MSE loss. ~20M params total.
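The observation window and the quasi open-loop execution can be sketched as follows. This is an illustration under assumptions: `frame_indices` assumes evenly spaced frames ending at the current one (the notes only say the 4 images span the 2 s window), and `predict` / `execute` are hypothetical callables standing in for the trained policy and the robot interface.

```python
from typing import Callable, List, Sequence

# 6-D delta end-effector command: (dx, dy, dz, dalpha, dbeta, dgamma)
Action = Sequence[float]


def frame_indices(t: int, fps: int = 30, window_s: float = 2.0,
                  n_frames: int = 4) -> List[int]:
    """Pick n_frames evenly spaced frame indices over the window ending at frame t.

    Assumption: even spacing, last frame at t, indices clamped at 0.
    """
    span = int(fps * window_s)
    step = span / (n_frames - 1)
    return [max(0, round(t - span + k * step)) for k in range(n_frames)]


def run_receding_horizon(
    predict: Callable[[int], List[Action]],
    execute: Callable[[Action], None],
    total_steps: int,
    h: int = 2,
) -> int:
    """Predict H actions, execute only the first h, then re-predict.

    Returns how many times the policy was queried.
    """
    if h < 1:
        raise ValueError("h must be at least 1")
    t = 0
    queries = 0
    while t < total_steps:
        plan = predict(t)              # H predicted actions starting at time t
        if not plan:
            raise ValueError("policy returned an empty plan")
        queries += 1
        for action in plan[:h]:        # execute h <= H of them
            if t >= total_steps:
                break
            execute(action)
            t += 1
    return queries


# Usage with a dummy policy that always predicts H = 5 zero actions.
executed: List[Action] = []
n_queries = run_receding_horizon(
    predict=lambda t: [(0.0,) * 6 for _ in range(5)],
    execute=executed.append,
    total_steps=7,
)
print(frame_indices(90), n_queries, len(executed))  # [30, 50, 70, 90] 4 7
```

A small h keeps the policy responsive to new contact audio, while predicting a longer chunk H is what lets it ride through pauses in the demonstrations.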
Setup
- Datasets / benchmarks: Pretraining — AudioSet
(~2M clips, for AVID audio encoder) and Ego4D (for R3M vision encoder).
Downstream — three self-collected real-world tasks: flipping (40
demos), scooping (60 demos), zipping (50 demos), each teleoperated via an
Oculus Quest. Train/test splits use deliberately large visual differences
(different objects, backgrounds, pans).
- Hardware / simulator: Franka Emika Panda (7-DoF, IK from
6-DoF delta EE), end-effector commands at 30 Hz. Four piezo contact mics on
the gripper (32 kHz → averaged → 16 kHz). One Intel D435 RealSense,
fixed third-person, 480×640 at 30 Hz. Real-robot only; no simulator.
- Baselines: Vision-Only (same architecture,
image-only); Scratch (audio encoder randomly initialized);
BYOL-A (in-domain-only self-supervised audio pretraining). All
audio-using methods share the R3M-initialized image encoder.
- Compute: single GPU (RTX 3080Ti), 1–1.5 hours per
BC policy; batch 64, ≤100 epochs with early stopping. (Pretraining
compute for AVID/R3M not reported — inherited from prior work.)
Results
Headline: across all three tasks the AVID-pretrained method beats every
baseline, averaging +23 percentage points in binary (0/1) success rate and
+76% in reward over the next-best baseline, and wins or ties in
8/9 task configurations. Table I (success % only; rewards are not tabulated here):
| Method | Flipping (succ%) | Scooping (succ%) | Zipping (succ%) |
| --- | --- | --- | --- |
| Ours (AVID/AudioSet) | 50.0 | 78.1 | 88.9 |
| BYOL-A (in-domain audio SSL) | 25.0 | 25.0 | 66.7 |
| Scratch (random audio enc.) | 15.4 | 50.0 | 72.2 |
| Vision-Only | 0.0 | 28.1 | 44.4 |
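As a sanity check, the +23 headline can be reproduced from the table above. The sketch assumes the headline is the mean, over the three tasks, of the gap between Ours and the best baseline on that task; the notes do not spell out the averaging, so treat this as one plausible reading.

```python
# Success rates (%) copied from the table above.
success = {
    "Ours":        {"Flipping": 50.0, "Scooping": 78.1, "Zipping": 88.9},
    "BYOL-A":      {"Flipping": 25.0, "Scooping": 25.0, "Zipping": 66.7},
    "Scratch":     {"Flipping": 15.4, "Scooping": 50.0, "Zipping": 72.2},
    "Vision-Only": {"Flipping": 0.0,  "Scooping": 28.1, "Zipping": 44.4},
}


def margin_over_next_best(task: str) -> float:
    """Gap in percentage points between Ours and the best baseline on one task."""
    best_baseline = max(v[task] for k, v in success.items() if k != "Ours")
    return success["Ours"][task] - best_baseline


tasks = ["Flipping", "Scooping", "Zipping"]
margins = {task: round(margin_over_next_best(task), 1) for task in tasks}
mean_margin = round(sum(margins.values()) / len(margins), 1)

print(margins)      # {'Flipping': 25.0, 'Scooping': 28.1, 'Zipping': 16.7}
print(mean_margin)  # 23.3
```

Note that the next-best baseline differs by task: BYOL-A on flipping, Scratch on scooping and zipping.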
Where it wins and how:
- Vision-Only trails every audio-using method on flipping and zipping,
and on scooping only BYOL-A (25.0 vs 28.1) scores lower: evidence that
contact audio carries action-relevant signal vision misses (e.g. whether the
spatula actually slid under the bagel, whether the zipper is snagged).
- Large-scale > in-domain pretraining. BYOL-A (pretrained
only on the small in-domain contact-audio set) beats Scratch on flipping
(25.0 vs 15.4) but loses to it on scooping (25.0 vs 50.0) and zipping (66.7 vs
72.2), which suggests its augmentation tricks don't transfer to contact audio
in the low-data regime. Pretraining at AudioSet scale is the load-bearing component.
- Visual robustness is the surprise. Under train→test
visual shift on flipping, Vision-Only drops ~60% in success; Ours drops only
~20% (Fig 6c). Good audio features appear to regularize the policy
against overfitting to visual detail. t-SNE (Fig 5) shows Ours' trajectory
embeddings converge over time across visually-different settings; baselines
stay scattered.
- Where it loses: one zipping configuration, the only one of
the nine where it neither wins nor ties. On scooping reward,
Scratch (6.9) comes close, and the ordering by reward is noisier than the ordering by success%.
- Ablations: (a) a frozen AVID encoder is only slightly worse and
still beats the next-best method on zipping, so the representations transfer
almost without fine-tuning; (b) scaling from 30 to 90 demos improves steadily,
tracking Vision-Only's slope; (c) replacing the transformer with a
parameter-matched MLP drops success/reward by ~50%, so self-attention fusion is
necessary but, on its own (with scratch features), insufficient.
Limitations & open questions
From the authors:
- Contact mics help only on dynamic, vibration-emitting contact.
They expect little benefit on quasi-static pick-and-place, when the robot
itself generates dominant vibration, or with deformable objects that don't
ring on contact.
- Open: which properties of the pretraining dataset actually drive transfer
(they don't isolate this — AudioSet is just "big and diverse").
- Future: combine contact mics with optical visuotactile sensors (e.g.
GelSight/DIGIT) under a shared pretrained audio/tactile representation.
What I noticed reading it:
- Tiny-N, no seeds. Every success% is a count out of
16–36 rollouts per task (so 50.0% is somewhere between 8 and 18 successes), from
a single training run with no multi-seed std reported. The +23% headline rests
on point estimates with no error bars: a much weaker statistical claim
than the bar charts' confidence suggests.
- Confound between modality and regularization. The biggest
selling point (visual robustness) may be partly an auxiliary-input
regularization effect: any extra non-visual token that's hard to overfit
might reduce visual overfitting. A noise-token or proprioception control
would sharpen the claim that it's contact audio specifically.
- Domain-gap hand-wave. "Surprisingly it works despite the
gap" is asserted, never measured: there is no analysis of which AudioSet
features the contact audio actually activates, so we don't know whether genuine
AVID feature transfer or just a better-conditioned init is doing the work. The
frozen-AVID ablation hints transfer is real but doesn't quantify it.
- Third-person vision, no wrist cam. The Vision-Only
baseline is handicapped by a fixed external camera that can't see
fine contact; a wrist-mounted view might close much of the gap. The audio
advantage may be partly an artifact of a weak vision setup.
- No language anywhere — this is a representation/pretraining paper, not
a touch-language or VLA paper, despite living in the same batch.
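To make the tiny-N point concrete, here is a minimal sketch of the error bars that are missing, using the Wilson score interval for a binomial proportion. The counts are an assumption: 25/32 and 16/32 are chosen because they match the reported scooping rates (78.1% and 50.0%) and fall inside the 16–36 rollout range, but the actual rollout counts are not confirmed here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts (hypothetical): scooping, Ours = 25/32, Scratch = 16/32.
ours_lo, ours_hi = wilson_interval(25, 32)
scratch_lo, scratch_hi = wilson_interval(16, 32)
print(f"Ours:    {ours_lo:.2f} to {ours_hi:.2f}")
print(f"Scratch: {scratch_lo:.2f} to {scratch_hi:.2f}")
```

Under these assumed counts the intervals come out at roughly 61–89% and 34–66%, so they overlap slightly even for a 28-point gap. That is not a significance test, but it shows how wide the uncertainty is at this sample size.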
Why I care
This paper is a clean, direct datapoint for the batch thesis: many
manipulation predicates are not visually evaluable — is_scooped,
spatula_is_under(bagel), zipper_is_snagged,
tool_is_in_contact live in vibration, not pixels. Relative to BLADE,
where predicate classifiers fθ(p): O → {T,F}
are learned purely over RGB crops, Hearing Touch is evidence that BLADE's
own-stated limitation — "caging grasps and contact-rich tasks would need
richer contact detection" — has a cheap sensory answer: a piezo mic gives
a high-bandwidth contact channel a predicate classifier could read. The
intriguing transfer story (internet audio → contact audio) also suggests a
path to pretrained contact-predicate classifiers, which would attack
BLADE's data-hungry classifier-learning step in the low-demo regime.
Caveat for my purposes: this is a flat BC policy with end-to-end fused
features, not a structured/abstraction method — there's no
symbolic layer, no planning, no predicates as such. Its relevance to my
long-horizon/planning-abstraction line is at the sensor &
representation level (feeding non-visual predicate grounding), not at the
method level. I should treat it as a building block, not a competing
architecture.
Quotable
Our key insight is that contact microphones capture inherently audio-based
information, allowing us to leverage large-scale audio-visual pretraining to
obtain representations that boost the performance of robotic manipulation.
— Abstract
Despite the domain gap between the audio in Audioset and contact audio
obtained through manipulation, we find that our approach improves performance
over visual-only policies—especially in test settings where objects and
locations differ significantly from the training data.
— §I, Introduction
Pre-trained audio features prevent the network from overfitting to visual
details in the training setting, hence attaining better generalization
abilities.
— §IV-E.3, Generalization
Related
Papers cited that should likely be ingested next (forward references):
- See, Hear, and Feel (SHF) [6] — the self-attention
multisensory-fusion mechanism this paper adapts; fuses camera + GelSight +
contact mic. Direct architectural ancestor.
see_hear_feel_sensory_fusion
- That Sounds Right [12] (Thankaraj & Pinto): the
closest competitor. It also does contact-audio pretraining for manipulation, but
with in-domain SSL on 5,000 data points per task, versus this paper's
internet-scale pretraining plus <100 demos. The head-to-head framing.
that_sounds_right_auditory_self_supervision
- Play it by Ear [19] (Du et al.) — audio-visual
imitation through occlusion; sibling audio-for-manipulation work.
play_it_by_ear_audio_visual_imitation
- Making Sense of Vision and Touch [10] (Lee et al.) —
canonical self-supervised multimodal (vision+force) representation for
contact-rich tasks; the decoupled-representation lineage.
making_sense_of_vision_and_touch
- GelSight [21] (Yuan et al.) — the optical tactile
sensor the authors propose to combine with contact mics; sensor foundation.
gelsight_high_resolution_tactile_sensors
- DIGIT [22] (Lambeta et al.) — the low-cost
visuotactile sensor cited in the bandwidth comparison; sensor foundation.
digit_low_cost_compact_tactile_sensor
Newly ingested in the 2026-06-24 batch — directly relevant:
- That Sounds Right
— the direct contrast: in-domain contact-audio SSL vs this paper's
internet-scale audio-visual pretraining. Same Cluster D thesis.
- See, Hear, and Feel
— source of the self-attention sensory-fusion block; also fuses a
contact mic.
- Play it by Ear
— audio-visual imitation under occlusion; same audio-for-manipulation cluster.
- ManiWAV
— in-the-wild audio-visual manipulation; closest follow-on in scaling
contact-audio policies.
- SonicSense
and Active Acoustic Sensing
— acoustic/vibration sensing for in-hand and active manipulation; same
Cluster D sensing principle.
- Making Sense of Vision and Touch
— the multimodal self-supervised-representation predecessor for
contact-rich tasks.