VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation

Kaidi Zhang*, Do-Gon Kim*, Eric T. Chang*, Hua-Hsuan Liang, Zhanpeng He, Kathryn Lampo, Philippe Wu, Ioannis Kymissis, Matei Ciocarlie · Columbia University · 2026 (arXiv v2, Jun 2026) · arXiv:2504.15535

One-liner. A parallel gripper in which each finger holds an identical piezoelectric disk (one emits a 20 Hz–20 kHz acoustic sweep through the grasped object, the other receives it), turning the object itself into a resonant sensor that reports material, grasp position, internal structure, and extrinsic contact type and, as the headline result, closes a peg-insertion loop using only that acoustic feedback.

Problem & motivation

Conventional tactile sensors acquire information by making and breaking contact, so they only see local contact points; passive/intrinsic contact is bounded by the sensor's contact area and leaves global object properties (material, mass distribution, internal geometry) and extrinsic contacts (how the object touches the rest of the world) unmeasured. The paper's premise: introduce an active component — emit an acoustic signal into the system and observe the response, "like shaking a wrapped present" — and read object/contact state out of the resonant response. Prior active-acoustic grippers (Lu & Culbertson [11]; Yi, Lee & Fazeli [12]) showed static classification (material, grasp position, occluded contact); VibeCheck streamlines the hardware (same component as emitter and receiver) and, crucially, goes beyond static classification to learn a closed-loop long-horizon policy (peg insertion) on acoustic feedback alone.

Method

Hardware (Fig 2). Each finger embeds an Adafruit 1740 piezoelectric disk (8 kHz resonant frequency). Piezoelectrics convert both voltage→strain and strain→voltage, so the same part acts as a speaker on one finger and a contact microphone on the other. The microphone-side disk is unamplified and is "relatively immune to ambient noise." The disk's plastic housing is removed for compactness and sensitivity, leaving a disk 0.2 mm thick with a 6 mm radius; the transducers sit in 3D-printed Clear-V4-resin housings on sorbothane isolation pads (important for absorbing robot-arm vibration), with replaceable high-friction polyethylene grip strips at the fingertips. Two Teensy 4.0 boards with audio adapters generate and sample the signals; the emitter signal passes through an inverting amplifier (gain 1.5) and is biased to a 2.5 V center with 2.4 V amplitude; the received signal is sampled at 44.1 kHz; data reaches the host over micro-ROS.

Sensing procedure & preprocessing. A sinusoid is swept linearly from 20 Hz to 20 kHz over 1 second while the opposite finger records. Only the first 42,000 sweep points are retained because of data corruption, which effectively ends the sweep at 19.029 kHz. The raw time-domain signal goes through an FFT and then kernel PCA with a cosine kernel for dimensionality reduction before training. Across tasks the authors keep only 5–10 principal components ("~90% of variance"), which they find key to generalizing to unseen test conditions.
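The preprocessing chain described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes NumPy, takes the magnitude of the real FFT as the feature vector (the notes only say "FFT"), implements cosine-kernel PCA by hand, and feeds it synthetic recordings. The names linear_sweep, spectrum_features, and cosine_kernel_pca are hypothetical.

```python
import numpy as np

FS = 44_100               # receiver sampling rate (Hz), from the notes
F0, F1 = 20.0, 20_000.0   # sweep endpoints (Hz), from the notes
DURATION = 1.0            # sweep length (s), from the notes
KEEP = 42_000             # samples retained after truncation, from the notes


def linear_sweep(fs=FS, f0=F0, f1=F1, duration=DURATION):
    """Sine whose instantaneous frequency rises linearly from f0 to f1."""
    t = np.arange(int(fs * duration)) / fs
    phase = 2.0 * np.pi * (f0 * t + 0.5 * (f1 - f0) / duration * t**2)
    return np.sin(phase)


def spectrum_features(recording, keep=KEEP):
    """Truncate a recording and return its magnitude spectrum (assumed feature)."""
    x = np.asarray(recording, dtype=float)[:keep]
    return np.abs(np.fft.rfft(x))


def cosine_kernel_pca(X, n_components):
    """Kernel PCA with k(a, b) = a.b / (|a||b|); returns scores and variance ratios."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    K = Xn @ Xn.T                                  # cosine-similarity Gram matrix
    n = K.shape[0]
    J = np.full((n, n), 1.0 / n)
    Kc = K - J @ K - K @ J + J @ K @ J             # centre the kernel in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)          # eigh returns ascending eigenvalues
    eigvals = np.clip(eigvals[::-1], 0.0, None)    # descending, clip round-off negatives
    eigvecs = eigvecs[:, ::-1]
    scores = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    return scores, eigvals[:n_components] / eigvals.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sweep = linear_sweep()
    t = np.arange(sweep.size) / FS
    # Synthetic stand-ins for receiver recordings: the sweep with a gain bump
    # at a random point in time (a crude "resonance"), plus noise.
    recordings = [sweep * (1.0 + np.exp(-((t - c) / 0.05) ** 2))
                  + 0.05 * rng.standard_normal(sweep.size)
                  for c in rng.uniform(0.1, 0.9, size=12)]
    X = np.stack([spectrum_features(r) for r in recordings])
    Z, ratio = cosine_kernel_pca(X, n_components=5)
    print(X.shape, Z.shape, ratio.sum())
```

The scores returned here are the projections of the training recordings only; projecting new recordings would need the training Gram matrix kept around, which is omitted to keep the sketch short.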

Static estimation tasks (Sec IV). For each task a small MLP (sizes 400–250–100) does classification or regression on the kPCA features. Four capabilities: (1) object classification — 9 rods (~21×21×130 mm) of different material/geometry; (2) grasping position — edge / quarter / center along a rod; (3) pose from internal structure — a smooth PLA cylinder with asymmetric internal geometry, regress rotation 0–170° (10° bins, flips beyond 170°); (4) contact-type classification — diagonal / line / in-hole for a rod meeting a square hole, the classifier later reused as the peg-insertion observation.
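For a sense of scale, the small MLP mentioned above can be written out directly. This is a sketch under assumptions the notes do not confirm: it reads 400–250–100 as three hidden layers, uses ReLU activations and a softmax output, assumes NumPy, and uses randomly initialized (untrained) weights. All function names are hypothetical.

```python
import numpy as np


def init_mlp(n_in, n_out, hidden=(400, 250, 100), seed=0):
    """Random weights for an MLP with the layer widths quoted in the notes."""
    rng = np.random.default_rng(seed)
    sizes = (n_in, *hidden, n_out)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """ReLU hidden layers (an assumption) followed by a linear output layer."""
    h = np.asarray(x, dtype=float)
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b


def class_probabilities(params, x):
    """Softmax over the output logits, as used for a classification task."""
    logits = forward(params, x)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)


def n_parameters(params):
    """Total number of weights and biases."""
    return sum(W.size + b.size for W, b in params)


if __name__ == "__main__":
    # 10 kPCA features in, 3 contact types out (diagonal / line / in-hole).
    params = init_mlp(n_in=10, n_out=3)
    batch = np.random.default_rng(1).standard_normal((4, 10))
    print(class_probabilities(params, batch).shape, n_parameters(params))
```

With 10 input features and 3 output classes this network has about 130 thousand parameters, which is small next to the 400,000-plus raw samples in even the smallest training set, but the input it actually sees is only the 5–10 kPCA features per recording.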

Task learning: peg insertion (Sec V). A handcrafted "ideal" insertion trajectory (incremental rotations about z, then x, until aligned) yields ground-truth expert actions a_expert (action space = move ±4.5° about x or z; 112 discrete poses). Because handcrafting a policy directly on noisy classifier outputs is brittle, they train in a simulator: given the trained contact classifier h and the ground-truth contact c_gt, they build the categorical distribution p(c_obs | c_gt) from h's confusion matrix and sample observations from it, so the imitation policy learns to be robust to classifier errors. The policy π(a | c_obs^0,...,c_obs^n) conditions on an n=10-step observation history and is trained by minimizing L = -log π(a_expert | history) (behavior cloning). On the UR5, a heuristic threshold on the receiver signal as the robot descends detects the moment of contact (and a second z threshold acts as a safety stop for misaligned pegs); a rollout succeeds if it reaches the insertion state within 50 steps.
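The confusion-matrix observation model and the behavior-cloning loss described above can be sketched with the standard library alone. The confusion counts below are made-up placeholders, not numbers from the paper, and the function names and the four-action encoding ("+x", "-x", "+z", "-z") are illustrative assumptions.

```python
import math
import random

CLASSES = ["diagonal", "line", "in-hole"]
ACTIONS = ["+x", "-x", "+z", "-z"]   # assumed encoding of +/-4.5 deg about x or z

# Hypothetical confusion counts (rows = ground truth, columns = prediction).
# Placeholder values for illustration only; NOT taken from the paper.
HYPOTHETICAL_CONFUSION = [
    [188, 10, 2],
    [8, 186, 6],
    [1, 7, 192],
]


def observation_model(confusion):
    """Row-normalise a confusion matrix into p(c_obs | c_gt)."""
    model = {}
    for c_gt, row in zip(CLASSES, confusion):
        total = sum(row)
        model[c_gt] = [count / total for count in row]
    return model


def sample_observation(model, c_gt, rng):
    """Draw one noisy classifier output for a known ground-truth contact."""
    return rng.choices(CLASSES, weights=model[c_gt], k=1)[0]


def noisy_history(model, gt_sequence, rng, n=10):
    """Noisy observations for a ground-truth sequence, keeping the last n steps."""
    observations = [sample_observation(model, c, rng) for c in gt_sequence]
    return observations[-n:]


def bc_loss(action_probs, expert_action):
    """Behavior-cloning loss L = -log pi(a_expert | history) for one step."""
    return -math.log(action_probs[expert_action])


if __name__ == "__main__":
    rng = random.Random(0)
    model = observation_model(HYPOTHETICAL_CONFUSION)
    ground_truth = ["diagonal"] * 4 + ["line"] * 5 + ["in-hole"] * 3
    print(noisy_history(model, ground_truth, rng))
    uniform_policy = {a: 1.0 / len(ACTIONS) for a in ACTIONS}
    print(bc_loss(uniform_policy, "+x"))
```

Because the policy only ever sees sampled c_obs values during training, any systematic confusion in the classifier (for example line mistaken for diagonal) shows up in its training histories at the same rate it would on the robot, assuming the confusion matrix stays representative.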

Setup

Datasets / benchmarks: Self-collected. Object: 9 rods, 900 train / 225 test. Grasping position: 2700 train / 675 test. Pose regression: 1800 train / 450 test (18 poses). Contact type: 1000 train / 200 test per class. Test sets drawn from separate collection sessions to capture distribution shift; out-of-distribution tests on new surfaces and new orientations. No public benchmark.
Hardware / simulator: Custom parallel gripper (twin Adafruit-1740 piezo disks, Teensy 4.0 ×2, micro-ROS) on a Universal Robots UR5; peg = 13.1×12.8×53.9 mm rod, hole 3 mm expanded per direction (6° tolerance). A simple simulator tracks the 112 discrete poses and samples classifier outputs for policy training.
Baselines: not reported — no external method comparison; "baselines" are internal ablations over frequency range and number of principal components, and in-distribution vs. new-surface / new-orientation / OOD test splits.
Compute: not reported.

Results

Headline: a peg-insertion policy driven by acoustic contact-type feedback as the only sensing modality reaches 95% success in simulation and transfers to the real UR5. Static-estimation accuracies (Table I, best frequency range per task; the 0.02–9.19 kHz band generally beats the full range on unseen sets):

Task	In-distribution	New surface	New orientation
Object classification	1.00	1.00	1.00
Grasping position	1.00	0.99	0.90
Contact type	0.95	interp. (in-dist) 0.73 · interp. (OOD) 0.80

Pose-from-internal-structure regression: RMSE ~20.0° over 0–170° (Fig 6), worse at low/high angles; interpolated unseen poses show similar error scale, a positive generalization sign. Object classification hits 100% in-distribution using only the top 3 PCs (~75% of variance), though ~91% of variance is needed for full unseen-case performance.
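The trade-off between number of principal components and retained variance comes down to a cumulative sum. The sketch below picks the smallest number of components that reaches a target fraction; the variance ratios are invented for illustration and are not the paper's spectrum, and components_for_variance is a hypothetical helper name.

```python
from itertools import accumulate


def components_for_variance(explained_ratio, target):
    """Smallest k whose top-k components reach `target` cumulative variance."""
    ordered = sorted(explained_ratio, reverse=True)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative >= target - 1e-12:   # tolerance for float round-off
            return k
    raise ValueError("target variance is not reachable with these components")


# Made-up variance ratios for illustration only; NOT values from the paper.
HYPOTHETICAL_RATIOS = [0.45, 0.20, 0.10, 0.07, 0.05, 0.04, 0.03, 0.02, 0.02, 0.02]

if __name__ == "__main__":
    for target in (0.75, 0.90):
        print(target, components_for_variance(HYPOTHETICAL_RATIOS, target))
```

On a spectrum shaped like this placeholder, a handful of components covers most of the variance and each extra component adds little, which is the regime in which a 3-component model can already separate in-distribution classes while unseen conditions need a few more.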

Peg insertion on UR5 (Table II, success over 10 trials):

Start condition	θ_z=45° fixed	Random start
θ_x=45° (in-dist)	6/10	9/10
40.5°≤θ_x≤81° (OOD)	N/A	6/10

Where it wins: random-start in-distribution insertion is strong (9/10), and clearly out-of-distribution start poses still insert 6 times out of 10.
Robustness check: the contact-type classifier still reaches 87% with a 75 dB music distractor playing (Sec VI-A).
Where it loses / is weak: the fixed-start case is only 6/10, worse than random start (the authors note that borderline diagonal/line and line/in-hole states are hardest); new-orientation grasping-position accuracy drops to 0.90 (and to 0.43 in the high-frequency-only band); and the pose RMSE of 20° is "fairly large."

Limitations & open questions

From the authors (Sec VI-C/D):

Signal response is influenced by many factors — minor hardware adjustments, motor heating, geometry, material, contact state — some informative, some nuisance, introducing variability.
Learned models are specific to the object sets used; they do not generalize broadly to entirely unseen objects, nor is that expected, since the response depends on the whole acoustic system plus environment.
Truly general acoustic sensing needs careful curation of a diverse dataset; for now best suited to sorting known objects or grasping a familiar object under occlusion.
Adding compliant (elastomer) finger surfaces — desirable for grasp stability — would attenuate vibrations and require a more powerful actuator to coexist with other tactile modalities.

What I noticed reading it:

Real-robot statistics are thin: peg insertion is reported as count-of-success out of 10 (6/10, 9/10, 6/10), no seeds or confidence intervals — a weaker claim than the simulator's 95%.
No external baseline anywhere: all comparisons are internal ablations. The 95% sim number rests on the same confusion matrix that generated the training observations, so sim success is partly tautological; the real transfer (6–9/10) is the load-bearing result.
The "only acoustic feedback" claim has an asterisk: contact onset is detected by a hand-tuned threshold on the receiver signal plus a hard-coded z safety stop — so proprioception/geometry priors quietly enter the loop, not pure acoustic state estimation.
The sweep is truncated to 19.029 kHz due to a data corruption bug rather than by design; unclear how much the resonant-feature content above that would have helped.
Object classification is trivially saturated (100% even on new surfaces), which suggests the 9-rod set is acoustically very separable — the harder, more honest tasks are pose regression (RMSE 20°) and the borderline contact-type confusions that bottleneck insertion.
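To make the first point in the list above concrete, here is how wide a 95% interval is for 10-trial success counts. This is my own back-of-envelope calculation, not something reported in the paper; it uses the Wilson score interval, and wilson_interval is a hypothetical helper name.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials**2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    cases = (("fixed start, in-dist", 6),
             ("random start, in-dist", 9),
             ("random start, OOD", 6))
    for label, k in cases:
        lo, hi = wilson_interval(k, 10)
        print(f"{label}: {k}/10 -> [{lo:.2f}, {hi:.2f}]")
```

The intervals come out at roughly 0.31 to 0.83 for 6/10 and 0.60 to 0.98 for 9/10. They overlap substantially, so ten trials per condition cannot establish that random start really beats fixed start, or that the OOD case is really worse.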

Why I care

This paper is a clean instance of the thesis behind my batch interest: many manipulation predicates are not visually evaluable — they live in touch, force, and sound. VibeCheck's in-hole vs. diagonal vs. line contact-type classifier is, in BLADE's vocabulary, a learned predicate (an is_aligned / is_inserted precondition) whose ground truth is an acoustic resonance signature, not a pixel pattern. BLADE learns visual classifiers f_θ(p): O → {T,F} for predicates and diffusion controllers for the bodies; VibeCheck demonstrates an acoustic classifier for exactly the kind of contact predicate that BLADE's faucet-pixel cropping could never evaluate (peg-in-hole alignment is occluded and contact-defined). The natural BLADE extension: let the predicate classifier's input modality be acoustic/tactile where the predicate is physically contact-defined — predicate grounding in the right sensory channel.

Two more connections to my themes. (1) Their confusion-matrix-as-observation-model trick — train the imitation policy on classifier errors sampled in sim — is the same robustness concern BLADE flags ("noisy state classification causes planning failures; need planners robust to estimation noise"); VibeCheck answers it at the policy layer rather than the planner. (2) It is a long-horizon-ish contact-rich task closed on a single non-visual modality — a concrete data point that the abstraction layer for contact-rich manipulation may need force/sound channels, not just vision. The big caveat: VibeCheck has no language, no planning, no symbolic abstraction — it is a sensing + single-skill-policy paper, so its relevance to BLADE is at the predicate-grounding-modality level, not the planning-abstraction level.

Quotable

The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. — Abstract / p.1

To our knowledge, we are the first to go beyond static classification and demonstrate that a long-horizon manipulation task, in this case a peg insertion task, can be learned using active acoustic sensing as the only sensing modality. — §I, contributions / p.2

We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. — Abstract / p.1

Papers cited that should likely be ingested next:

[11] Lu & Culbertson 2023 — Active Acoustic Sensing for Robot Manipulation (IROS) — the hardware platform VibeCheck rebuilds and streamlines. Direct predecessor; see batch entry Active Acoustic Sensing for Robot Manipulation.
[12] Yi, Lee & Fazeli 2024 — Visual-auditory Extrinsic Contact Estimation (arXiv:2409.14608) — the other prior active-acoustic-in-gripper work VibeCheck positions against (extrinsic contact under occlusion).
[2] Liu & Chen 2024 — SonicSense (CoRL) — passive contact-microphone multi-fingered hand for container contents / 3D shape / re-ID; the passive-acoustic counterpoint. See batch entry SonicSense.
[1] Kim & Rodriguez 2022 — Active Extrinsic Contact Sensing (peg-in-hole) (ICRA) — the active-extrinsic-contact / peg-insertion lineage VibeCheck's task slots into.
[20] Liu et al. 2024 — ManiWAV (CoRL) — in-the-wild audio-visual manipulation policy; see batch entry ManiWAV.
[21] Mejia, Dean, Hellebrekers & Gupta 2024 — Hearing Touch (ICRA) — audio-tactile pretraining for contact-rich manipulation; see batch entry Hearing Touch.
[23] Du, Lee, Nair & Finn — Play it by Ear (RSS) — learning skills amidst occlusion via audio-visual imitation; see batch entry Play it by Ear.

Newly ingested in the 2026-06-24 batch — directly relevant:

SonicSense — closest sibling: in-hand passive acoustic/vibration sensing for object understanding; VibeCheck is the active (emit-and-listen) counterpart that adds a closed-loop policy.
Active Acoustic Sensing for Robot Manipulation — the Lu & Culbertson platform [11] VibeCheck builds on; same emit-through-object principle, VibeCheck unifies emitter/receiver hardware and goes beyond static classification.
ManiWAV and Making Sense of Audio Vibration (pouring) — audio-as-contact-feedback for manipulation; ManiWAV uses contact audio in an imitation policy, the pouring work reads vibration for a contact-rich pour — both share VibeCheck's "sound encodes contact state" premise.
See, Hear, and Feel and Hearing Touch — multisensory fusion / audio-tactile pretraining; VibeCheck is the single-modality (audio-only) ablation point of this fusion line.
The Sound of Simulation — generative audio sim2real; relevant to VibeCheck's confusion-matrix-driven simulator that trains the policy on sampled acoustic-classifier observations.