Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta · CMU / Olin / Meta AI · 2024 · arXiv preprint · arXiv:2405.08576

One-liner. Treat a cheap piezo contact microphone as an "audio-native" tactile sensor, and you can bootstrap its representation from 2M+ internet audio-visual clips (AudioSet, via AVID) instead of training from scratch — the resulting contact-audio features measurably boost low-data behavior cloning and, surprisingly, make policies more visually robust despite a large domain gap between YouTube sounds and robot scraping noises.

Problem & motivation

Vision representations for robots get to ride internet-scale pretraining (R3M, VIP, MVP); every other modality, especially touch, is trained from scratch on a few hundred task-specific samples, because no internet-scale tactile corpus exists. The authors' wedge: a piezo contact microphone produces a signal that is literally audio (structural vibrations sampled at 32–48 kHz, up to ~1000× the bandwidth of optical/magnetic tactile sensors). So the "no internet-scale tactile data" problem dissolves: the touch encoder can be pretrained on the enormous pool of internet audio-visual video. This is pitched as the first method to use large-scale multisensory (not vision-only) pretraining for robot manipulation, and it targets the low-data regime (≤60 demos/task) where scratch-trained sensory encoders are weakest.

Method

Two-stage pipeline (Fig 2): large-scale pretraining of two encoders, then end-to-end behavior cloning on a handful of in-domain demos.

Sensors. Four piezo contact microphones mounted on the Franka gripper, each recording at 32 kHz; signals are averaged into one channel and downsampled to 16 kHz. Because they read structural vibration, they pick up not just direct gripper-object contact but indirect contact traveling along a grasped tool (spatula, spoon) — subtle surface interactions vision can't see. Per timestep the policy ingests an image v_t from a fixed third-person RealSense and a 2-second contact-audio clip a_t, rendered as a mel spectrogram (AVID's preprocessing).
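The audio path described above (average four microphones, go from 32 kHz to 16 kHz, cut a 2-second window, render a spectrogram) can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' code: the function names are hypothetical, the pair-averaging decimation is a crude stand-in for a proper low-pass filter, and a plain log-magnitude STFT is swapped in for the mel spectrogram of AVID's preprocessing.

```python
import numpy as np

SR_IN = 32_000     # per-microphone sampling rate stated in the notes
SR_OUT = 16_000    # rate after downsampling
CLIP_SECONDS = 2   # length of the contact-audio window a_t


def mix_and_downsample(mics: np.ndarray) -> np.ndarray:
    """mics: (4, n_samples) at 32 kHz -> mono (n_samples // 2,) at 16 kHz."""
    mono = mics.mean(axis=0)          # average the four microphones into one channel
    n = (mono.shape[0] // 2) * 2      # drop a trailing odd sample, if any
    # ASSUMPTION: averaging adjacent sample pairs acts as a crude anti-alias
    # filter before 2x decimation; a real pipeline would use a proper low-pass.
    return mono[:n].reshape(-1, 2).mean(axis=1)


def latest_clip(mono_16k: np.ndarray, t_sample: int) -> np.ndarray:
    """Return the 2 s window ending at t_sample, zero-padded on the left."""
    length = SR_OUT * CLIP_SECONDS
    clip = mono_16k[max(t_sample - length, 0):t_sample]
    return np.pad(clip, (length - clip.shape[0], 0))


def log_spectrogram(clip: np.ndarray, n_fft: int = 512, hop: int = 160) -> np.ndarray:
    """ASSUMPTION: log-magnitude STFT standing in for the mel frontend."""
    window = np.hanning(n_fft)
    n_frames = 1 + (clip.shape[0] - n_fft) // hop
    frames = np.stack(
        [clip[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Shape: (n_frames, n_fft // 2 + 1)
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-6)
```

The `n_fft` and `hop` values are illustrative defaults, not values taken from the paper.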

Encoder pretraining. The audio encoder is lifted from AVID (Audio-Visual Instance Discrimination, Morgado et al. CVPR 2021), self-supervised with cross-modal instance discrimination on AudioSet (~2M 10-second internet clips). The vision encoder is R3M (ResNet18 pretrained on Ego4D with time-contrastive + video-language alignment) — deliberately a known-good robot vision representation, so any gain is attributable to the audio side. Both encoders are kept unfrozen during policy learning (following Dean et al., "Don't freeze your embedding").
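To make "cross-modal instance discrimination" concrete: the audio embedding of a clip should score higher against the video embedding of the same clip than against any other clip, and vice versa. The sketch below is a simplified in-batch, symmetric InfoNCE loss in NumPy. It is an illustrative assumption, not AVID's actual objective (AVID uses noise-contrastive estimation against a memory bank), and the function name and `temperature` value are hypothetical.

```python
import numpy as np


def _log_softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax with max-subtraction for numerical stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def cross_modal_info_nce(audio_emb: np.ndarray,
                         video_emb: np.ndarray,
                         temperature: float = 0.07) -> float:
    """audio_emb, video_emb: (batch, dim); row i of each comes from clip i."""
    # L2-normalise so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature      # (batch, batch) similarity matrix
    idx = np.arange(a.shape[0])         # positives sit on the diagonal
    loss_a2v = -_log_softmax(logits)[idx, idx].mean()    # audio queries video
    loss_v2a = -_log_softmax(logits.T)[idx, idx].mean()  # video queries audio
    return float(0.5 * (loss_a2v + loss_v2a))
```

The loss is low when matching audio-video pairs are the most similar entries in their row and column, and high when the pairing is scrambled.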

Audio-visual behavior cloning. A sequence of 4 images spanning the same 2 s window, plus the single audio token (5 tokens total), receives learned positional embeddings and passes through a single self-attention transformer block (the fusion mechanism, à la See-Hear-Feel [6]); the output tokens are concatenated and fed to a 2-layer MLP head. The policy is quasi open-loop: at time t it predicts H actions and executes h ≤ H (here h = 2) before re-predicting, which balances reactivity to audio against non-Markovian demo artifacts (pauses). Each action is a 6-D delta end-effector command (Cartesian xyz + Euler αβγ), trained with a standard MSE loss. ~20M params total.
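The quasi open-loop execution scheme (predict H actions, execute the first h, then re-predict) can be sketched as a small control loop. This is a minimal illustration under stated assumptions: `predict`, `get_obs`, and `step` are hypothetical callables standing in for the policy, the sensor read, and the robot command interface, none of which come from the paper.

```python
from typing import Callable, List, Sequence

# 6-D delta end-effector command: (dx, dy, dz, d_alpha, d_beta, d_gamma)
Action = Sequence[float]


def run_quasi_open_loop(
    predict: Callable[[object], List[Action]],  # hypothetical: obs -> H actions
    get_obs: Callable[[], object],              # hypothetical: images + audio clip
    step: Callable[[Action], None],             # hypothetical: send one command
    total_steps: int,
    h: int = 2,
) -> int:
    """Execute the first h actions of each H-step plan, then re-predict.

    Returns the number of policy queries that were made.
    """
    executed = 0
    queries = 0
    while executed < total_steps:
        plan = predict(get_obs())       # H predicted actions from fresh sensing
        queries += 1
        if h > len(plan):
            raise ValueError("h must not exceed the prediction horizon H")
        for action in plan[:h]:         # run only the first h actions open-loop
            if executed >= total_steps:
                break
            step(action)
            executed += 1
    return queries
```

A smaller h makes the policy react faster to new contact audio at the cost of more policy queries; a larger h smooths over pauses in the demonstrations but delays the response to new observations.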

Setup

Results

Headline: across all three tasks the AVID-pretrained method beats every baseline, averaging +23% absolute 0-1 success rate and +76% reward over the next-best baseline, and wins or ties in 8/9 task configurations. Table I (success %, plus reward where reported):

| Method | Flipping (succ%) | Scooping (succ%) | Zipping (succ%) |
| --- | --- | --- | --- |
| Ours (AVID/AudioSet) | 50.0 | 78.1 | 88.9 |
| BYOL-A (in-domain audio SSL) | 25.0 | 25.0 | 66.7 |
| Scratch (random audio enc.) | 15.4 | 50.0 | 72.2 |
| Vision-Only | 0.0 | 28.1 | 44.4 |
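As a consistency check on the headline, the margin over the next-best baseline can be recomputed from the three success-rate columns above. The sketch assumes "next-best" is taken per task and that the three columns are averaged with equal weight; the paper's own average may cover more task configurations, so this is only a sanity check on the numbers in these notes.

```python
# Success rates (%) copied from the table above.
success = {
    "Ours":        {"Flipping": 50.0, "Scooping": 78.1, "Zipping": 88.9},
    "BYOL-A":      {"Flipping": 25.0, "Scooping": 25.0, "Zipping": 66.7},
    "Scratch":     {"Flipping": 15.4, "Scooping": 50.0, "Zipping": 72.2},
    "Vision-Only": {"Flipping": 0.0,  "Scooping": 28.1, "Zipping": 44.4},
}


def margins_over_next_best(table: dict, ours: str = "Ours") -> dict:
    """Absolute gap (percentage points) between `ours` and the best baseline per task."""
    out = {}
    for task, score in table[ours].items():
        best_baseline = max(
            row[task] for name, row in table.items() if name != ours
        )
        out[task] = round(score - best_baseline, 1)
    return out


margins = margins_over_next_best(success)
mean_margin = sum(margins.values()) / len(margins)
print(margins)                 # per-task margins in percentage points
print(round(mean_margin, 1))   # equal-weight mean over the three tasks
```

Under these assumptions the per-task margins are 25.0, 28.1, and 16.7 points, for a mean of about 23.3, which agrees with the reported +23% absolute improvement.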

Where it wins and how:

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This paper is a clean, direct datapoint for the batch thesis: many manipulation predicates are not visually evaluable. Predicates such as is_scooped, spatula_is_under(bagel), zipper_is_snagged, and tool_is_in_contact live in vibration, not pixels. Relative to BLADE, where predicate classifiers fθ(p): O → {T,F} are learned purely over RGB crops, Hearing Touch is evidence that BLADE's self-stated limitation ("caging grasps and contact-rich tasks would need richer contact detection") has a cheap sensory answer: a piezo mic gives a high-bandwidth contact channel a predicate classifier could read. The intriguing transfer story (internet audio → contact audio) also suggests a path to pretrained contact-predicate classifiers, which would attack BLADE's data-hungry classifier-learning step in the low-demo regime.

Caveat for my purposes: this is a flat BC policy with end-to-end fused features, not a structured/abstraction method — there's no symbolic layer, no planning, no predicates as such. Its relevance to my long-horizon/planning-abstraction line is at the sensor & representation level (feeding non-visual predicate grounding), not at the method level. I should treat it as a building block, not a competing architecture.

Quotable

Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. — Abstract
Despite the domain gap between the audio in Audioset and contact audio obtained through manipulation, we find that our approach improves performance over visual-only policies—especially in test settings where objects and locations differ significantly from the training data. — §I, Introduction
Pre-trained audio features prevent the network from overfitting to visual details in the training setting, hence attaining better generalization abilities. — §IV-E.3, Generalization

Related

Papers cited that should likely be ingested next (forward references):

Newly ingested in the 2026-06-24 batch — directly relevant: