Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta · CMU / Olin / Meta AI · 2024 · arXiv preprint · arXiv:2405.08576

One-liner. Treat a cheap piezo contact microphone as an "audio-native" tactile sensor, and you can bootstrap its representation from 2M+ internet audio-visual clips (AudioSet, via AVID) instead of training from scratch — the resulting contact-audio features measurably boost low-data behavior cloning and, surprisingly, make policies more visually robust despite a large domain gap between YouTube sounds and robot scraping noises.

Problem & motivation

Vision representations for robots get to ride internet-scale pretraining (R3M, VIP, MVP); every other modality — especially touch — is trained from scratch on a few hundred task-specific samples, because no internet-scale tactile corpus exists. The authors' wedge: a piezo contact microphone produces a signal that is literally audio (structural vibrations at 32–48 kHz, up to ~1000× the bandwidth of optical/magnetic tactile sensors). So the "no internet-scale tactile data" problem dissolves — you can pretrain the touch encoder on the enormous pool of internet audio-visual video. This is pitched as the first method to use large-scale multisensory (not vision-only) pretraining for robot manipulation, and it targets the low-data regime (≤60 demos/task) where scratch-trained sensory encoders are weakest.

Method

Two-stage pipeline (Fig 2): large-scale pretraining of two encoders, then end-to-end behavior cloning on a handful of in-domain demos.

Sensors. Four piezo contact microphones mounted on the Franka gripper, each recording at 32 kHz; signals are averaged into one channel and downsampled to 16 kHz. Because they read structural vibration, they pick up not just direct gripper-object contact but indirect contact traveling along a grasped tool (spatula, spoon) — subtle surface interactions vision can't see. Per timestep the policy ingests an image v_t from a fixed third-person RealSense and a 2-second contact-audio clip a_t, rendered as a mel spectrogram (AVID's preprocessing).
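
A minimal sketch of the audio path described above (average four mics, 32 kHz → 16 kHz, last 2 seconds, log-mel spectrogram), using only NumPy. The filter length, n_fft, hop, and n_mels values are my own placeholder assumptions, not AVID's actual preprocessing settings, and the windowed-sinc low-pass stands in for whatever resampler the authors used.

```python
import numpy as np

MIC_RATE = 32_000      # per-microphone sampling rate (from the notes)
TARGET_RATE = 16_000   # rate after downsampling (from the notes)
CLIP_SECONDS = 2.0     # length of the contact-audio clip a_t (from the notes)


def mix_and_downsample(mics: np.ndarray) -> np.ndarray:
    """Average (n_mics, n_samples) audio into mono and halve the sample rate."""
    mono = mics.mean(axis=0)
    # Windowed-sinc low-pass at the new Nyquist (8 kHz) to limit aliasing.
    taps = 63  # assumed filter length
    n = np.arange(taps) - (taps - 1) / 2
    kernel = 0.5 * np.sinc(0.5 * n) * np.hamming(taps)
    kernel /= kernel.sum()
    return np.convolve(mono, kernel, mode="same")[::2]


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_mels: int, n_fft: int, rate: int) -> np.ndarray:
    """Triangular filters evenly spaced on the mel scale: (n_mels, n_fft//2 + 1)."""
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(rate / 2), n_mels + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / rate)
    bank = np.zeros((n_mels, freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_mel_spectrogram(audio, rate=TARGET_RATE, n_fft=512, hop=160, n_mels=64):
    """Log-mel spectrogram of a mono clip, shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (audio.size - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, rate).T
    return np.log(mel + 1e-6).T


def latest_clip(audio, rate=TARGET_RATE, seconds=CLIP_SECONDS):
    """Keep only the most recent `seconds` of audio."""
    return audio[-int(rate * seconds):]


# Usage with 3 seconds of synthetic four-microphone noise.
rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 3 * MIC_RATE))
mono = mix_and_downsample(raw)
spec = log_mel_spectrogram(latest_clip(mono))
print(spec.shape)
```

Averaging before filtering keeps the cost at one convolution per timestep instead of four; whether the authors order the steps this way is not stated in the paper summary above.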

Encoder pretraining. The audio encoder is lifted from AVID (Audio-Visual Instance Discrimination, Morgado et al. CVPR 2021), self-supervised with cross-modal instance discrimination on AudioSet (~2M 10-second internet clips). The vision encoder is R3M (ResNet18 pretrained on Ego4D with time-contrastive + video-language alignment) — deliberately a known-good robot vision representation, so any gain is attributable to the audio side. Both encoders are kept unfrozen during policy learning (following Dean et al., "Don't freeze your embedding").
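
To make "cross-modal instance discrimination" concrete, here is a simplified stand-in: a symmetric in-batch InfoNCE loss in which each video embedding must pick out its own clip's audio embedding, and vice versa. This is my simplification for illustration; AVID's published objective differs in detail (for example in how negatives are drawn), and the temperature and embedding sizes below are assumed values.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def diagonal_cross_entropy(logits: np.ndarray) -> float:
    """Cross-entropy where the correct class for row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def cross_modal_info_nce(video_emb, audio_emb, temperature=0.07) -> float:
    """Symmetric contrastive loss between paired video and audio embeddings."""
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature  # (batch, batch) cosine similarities
    video_to_audio = diagonal_cross_entropy(logits)
    audio_to_video = diagonal_cross_entropy(logits.T)
    return 0.5 * (video_to_audio + audio_to_video)


# Synthetic check: paired embeddings should score a lower loss than mispaired ones.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32))
aligned_audio = video + 0.1 * rng.standard_normal((8, 32))
shuffled_audio = np.roll(aligned_audio, 1, axis=0)  # break the pairing

loss_aligned = cross_modal_info_nce(video, aligned_audio)
loss_shuffled = cross_modal_info_nce(video, shuffled_audio)
print(loss_aligned, loss_shuffled)
```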

Audio-visual behavior cloning. A sequence of 4 images spanning the same 2s window plus the single audio token (5 tokens total) receives learned positional embeddings and passes through a single self-attention transformer block (the fusion mechanism, à la See-Hear-Feel [6]); the output tokens are concatenated and fed to a 2-layer MLP head. The policy is quasi open-loop: at time t it predicts H actions and executes h ≤ H of them (here h = 2) before re-predicting, which balances reactivity to audio against non-Markovian demo artifacts (pauses). Each action is a 6-D delta end-effector command (Cartesian xyz + Euler αβγ), trained with a standard MSE loss. ~20M params total.
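
The fusion and execution scheme above can be sketched roughly as follows. Everything here is a toy under stated assumptions: the encoders are replaced by random vectors, the token width D, horizon H, hidden size, single attention head, and residual connection are my guesses rather than the paper's configuration, and I assume the h executed actions are the first h of each prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64           # token width (assumed)
H = 4            # prediction horizon (assumed)
h = 2            # actions executed before re-predicting (from the notes)
ACTION_DIM = 6   # delta xyz + Euler angles (from the notes)


def init(shape):
    return rng.standard_normal(shape) * 0.02


params = {
    "pos": init((5, D)),  # learned positional embeddings, one per token
    "wq": init((D, D)), "wk": init((D, D)), "wv": init((D, D)),
    "w1": init((5 * D, 128)), "b1": np.zeros(128),
    "w2": init((128, H * ACTION_DIM)), "b2": np.zeros(H * ACTION_DIM),
}


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def policy(image_tokens, audio_token, p=params):
    """Map 4 image embeddings + 1 audio embedding to H six-dimensional actions."""
    tokens = np.vstack([image_tokens, audio_token[None, :]]) + p["pos"]  # (5, D)
    q, k, v = tokens @ p["wq"], tokens @ p["wk"], tokens @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(D))                  # (5, 5) attention
    fused = tokens + attn @ v                             # residual (assumed)
    hidden = np.maximum(fused.reshape(-1) @ p["w1"] + p["b1"], 0.0)  # ReLU
    return (hidden @ p["w2"] + p["b2"]).reshape(H, ACTION_DIM)


def mse_loss(pred, target):
    return float(np.mean((pred - target) ** 2))


def rollout(observe, act, steps):
    """Predict H actions, execute the first h, then re-predict."""
    executed, n_predictions = 0, 0
    while executed < steps:
        image_tokens, audio_token = observe()
        n_predictions += 1
        for action in policy(image_tokens, audio_token)[:h]:
            if executed >= steps:
                break
            act(action)
            executed += 1
    return n_predictions


# Usage with stubbed observations standing in for R3M / AVID embeddings.
log = []
fake_observe = lambda: (rng.standard_normal((4, D)), rng.standard_normal(D))
n_predictions = rollout(fake_observe, log.append, steps=5)
print(len(log), n_predictions)
```

With h = 2 the policy re-reads the audio every two control steps, which is the reactivity-versus-smoothness trade-off the paragraph describes.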

Setup

Datasets / benchmarks: Pretraining — AudioSet (~2M clips, for AVID audio encoder) and Ego4D (for R3M vision encoder). Downstream — three self-collected real-world tasks: flipping (40 demos), scooping (60 demos), zipping (50 demos), each teleoperated via an Oculus Quest. Train/test splits use deliberately large visual differences (different objects, backgrounds, pans).
Hardware / simulator: Franka Emika Panda (7-DoF, IK from 6-DoF delta EE), end-effector commands at 30 Hz. Four piezo contact mics on the gripper (32 kHz → averaged → 16 kHz). One Intel D435 RealSense, fixed third-person, 480×640 at 30 Hz. Real-robot only; no simulator.
Baselines: Vision-Only (same architecture, image-only); Scratch (audio encoder randomly initialized); BYOL-A (in-domain-only self-supervised audio pretraining). All audio-using methods share the R3M-initialized image encoder.
Compute: single GPU (RTX 3080Ti), 1–1.5 hours per BC policy; batch 64, ≤100 epochs with early stopping. (Pretraining compute for AVID/R3M not reported — inherited from prior work.)

Results

Headline: across all three tasks the AVID-pretrained method beats every baseline, averaging +23 percentage points of 0-1 success rate and +76% reward over the next-best baseline, and it wins or ties in 8/9 task configurations. Table I (success % only; rewards are not reproduced here):

Method	Flipping (succ%)	Scooping (succ%)	Zipping (succ%)
Ours (AVID/AudioSet)	50.0	78.1	88.9
BYOL-A (in-domain audio SSL)	25.0	25.0	66.7
Scratch (random audio enc.)	15.4	50.0	72.2
Vision-Only	0.0	28.1	44.4
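
As a sanity check on the headline margin, the per-task gap to the next-best baseline can be recomputed from the success rates in the table above. The dictionary is just my transcription of those rows; nothing else is assumed.

```python
# Success rates (%) transcribed from the table above.
success = {
    "Ours":        {"flipping": 50.0, "scooping": 78.1, "zipping": 88.9},
    "BYOL-A":      {"flipping": 25.0, "scooping": 25.0, "zipping": 66.7},
    "Scratch":     {"flipping": 15.4, "scooping": 50.0, "zipping": 72.2},
    "Vision-Only": {"flipping": 0.0,  "scooping": 28.1, "zipping": 44.4},
}
tasks = ["flipping", "scooping", "zipping"]

margins = {}
for task in tasks:
    # Best baseline on this task, excluding the proposed method.
    best_baseline = max(rates[task] for name, rates in success.items()
                        if name != "Ours")
    margins[task] = success["Ours"][task] - best_baseline

average_margin = sum(margins.values()) / len(margins)
print({t: round(m, 1) for t, m in margins.items()}, round(average_margin, 1))
```

The three gaps average to roughly 23 points, which matches the stated headline.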

Where it wins and how:

Vision-Only is worst on flipping and zipping and near the bottom on scooping (28.1%, just above BYOL-A's 25.0%), which is evidence that contact audio carries action-relevant signal vision misses (e.g. whether the spatula actually slid under the bagel, whether the zipper is snagged).
Large-scale > in-domain pretraining. BYOL-A (pretrained only on the small in-domain contact-audio set) beats Scratch on flipping (25.0 vs 15.4) but trails it on scooping (25.0 vs 50.0) and zipping (66.7 vs 72.2), which suggests its augmentation tricks don't transfer to contact audio in the low-data regime. The AudioSet-scale aspect is the load-bearing component.
Visual robustness is the surprise. Under train→test visual shift on flipping, Vision-Only drops ~60% in success; Ours drops only ~20% (Fig 6c). Good audio features appear to regularize the policy against overfitting to visual detail. t-SNE (Fig 5) shows Ours' trajectory embeddings converge over time across visually-different settings; baselines stay scattered.
Where it loses: it fails to win or tie in one zipping configuration (the only miss out of 9). On scooping reward, Scratch (6.9) comes close, and the relative ordering is noisier than for success%.
Ablations: (a) frozen AVID encoder only slightly worse and still beats next-best on zipping — representations transfer near zero-shot; (b) scaling 30→90 demos improves steadily, tracking Vision-Only's slope; (c) replacing the transformer with a param-matched MLP drops success/reward ~50% — self-attention fusion is necessary but, alone (with scratch features), insufficient.

Limitations & open questions

From the authors:

Contact mics help only on dynamic, vibration-emitting contact. They expect little benefit on quasi-static pick-and-place, when the robot itself generates dominant vibration, or with deformable objects that don't ring on contact.
Open: which properties of the pretraining dataset actually drive transfer (they don't isolate this — AudioSet is just "big and diverse").
Future: combine contact mics with optical visuotactile sensors (e.g. GelSight/DIGIT) under a shared pretrained audio/tactile representation.

What I noticed reading it:

Tiny-N, no seeds. Every success% is a count out of 16–36 rollouts per task (e.g. 50.0% = a handful of successes), single training run, no multi-seed std reported. The +23% headline rests on point estimates with no error bars — a much weaker statistical claim than the bar charts' confidence suggests.
Confound between modality and regularization. The biggest selling point (visual robustness) may be partly an auxiliary-input regularization effect: any extra non-visual token that's hard to overfit might reduce visual overfitting. A noise-token or proprioception control would sharpen the claim that it's contact audio specifically.
Domain-gap hand-wave. "Surprisingly it works despite the gap" is asserted, never measured — no analysis of what AudioSet features the contact audio actually activates, so we don't know if AVID transfer or just a better-conditioned init is doing the work. The frozen-AVID ablation hints transfer is real but doesn't quantify it.
Third-person vision, no wrist cam. The Vision-Only baseline is handicapped by a fixed external camera that can't see fine contact; a wrist-mounted view might close much of the gap. The audio advantage may be partly an artifact of a weak vision setup.
No language anywhere — this is a representation/pretraining paper, not a touch-language or VLA paper, despite living in the same batch.
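
To put a number on the tiny-N concern raised above, here is a Wilson score interval for a binomial success rate. The counts are hypothetical: I assume 32 scooping rollouts, which is consistent with the reported 78.1% (25/32) and 50.0% (16/32) and falls in the 16–36 range, but the per-task rollout count is my inference, not a figure confirmed here.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 32 rollouts per method on scooping (assumed, see above).
ours = wilson_interval(25, 32)      # 78.1% reported for Ours
scratch = wilson_interval(16, 32)   # 50.0% reported for Scratch
intervals_overlap = ours[0] < scratch[1]
print(ours, scratch, intervals_overlap)
```

Under that assumed N, the approximate 95% intervals for Ours and Scratch on scooping overlap, so even the largest single-task gap is less decisive than the point estimates suggest.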

Why I care

This paper is a clean, direct datapoint for the batch thesis: many manipulation predicates are not visually evaluable — is_scooped, spatula_is_under(bagel), zipper_is_snagged, tool_is_in_contact live in vibration, not pixels. Relative to BLADE, where predicate classifiers f_θ(p): O → {T,F} are learned purely over RGB crops, Hearing Touch is evidence that BLADE's own-stated limitation — "caging grasps and contact-rich tasks would need richer contact detection" — has a cheap sensory answer: a piezo mic gives a high-bandwidth contact channel a predicate classifier could read. The intriguing transfer story (internet audio → contact audio) also suggests a path to pretrained contact-predicate classifiers, which would attack BLADE's data-hungry classifier-learning step in the low-demo regime.

Caveat for my purposes: this is a flat BC policy with end-to-end fused features, not a structured/abstraction method — there's no symbolic layer, no planning, no predicates as such. Its relevance to my long-horizon/planning-abstraction line is at the sensor & representation level (feeding non-visual predicate grounding), not at the method level. I should treat it as a building block, not a competing architecture.

Quotable

Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. — Abstract

Despite the domain gap between the audio in Audioset and contact audio obtained through manipulation, we find that our approach improves performance over visual-only policies—especially in test settings where objects and locations differ significantly from the training data. — §I, Introduction

Pre-trained audio features prevent the network from overfitting to visual details in the training setting, hence attaining better generalization abilities. — §IV-E.3, Generalization

Papers cited that should likely be ingested next (forward references):

See, Hear, and Feel (SHF) [6] — the self-attention multisensory-fusion mechanism this paper adapts; fuses camera + GelSight + contact mic. Direct architectural ancestor. see_hear_feel_sensory_fusion
That Sounds Right [12] (Thankaraj & Pinto) — the closest competitor: also contact-audio pretraining for manipulation, but in-domain SSL on 5,000 pts/task vs this paper's internet-scale + <100 demos. The head-to-head framing. that_sounds_right_auditory_self_supervision
Play it by Ear [19] (Du et al.) — audio-visual imitation through occlusion; sibling audio-for-manipulation work. play_it_by_ear_audio_visual_imitation
Making Sense of Vision and Touch [10] (Lee et al.) — canonical self-supervised multimodal (vision+force) representation for contact-rich tasks; the decoupled-representation lineage. making_sense_of_vision_and_touch
GelSight [21] (Yuan et al.) — the optical tactile sensor the authors propose to combine with contact mics; sensor foundation. gelsight_high_resolution_tactile_sensors
DIGIT [22] (Lambeta et al.) — the low-cost visuotactile sensor cited in the bandwidth comparison; sensor foundation. digit_low_cost_compact_tactile_sensor

Newly ingested in the 2026-06-24 batch — directly relevant:

That Sounds Right — the direct contrast: in-domain contact-audio SSL vs this paper's internet-scale audio-visual pretraining. Same Cluster D thesis.
See, Hear, and Feel — source of the self-attention sensory-fusion block; also fuses a contact mic.
Play it by Ear — audio-visual imitation under occlusion; same audio-for-manipulation cluster.
ManiWAV — in-the-wild audio-visual manipulation; closest follow-on in scaling contact-audio policies.
SonicSense and Active Acoustic Sensing — acoustic/vibration sensing for in-hand and active manipulation; same Cluster D sensing principle.
Making Sense of Vision and Touch — the multimodal self-supervised-representation predecessor for contact-rich tasks.