VibeCheck: Using Active Acoustic Tactile Sensing for Contact-Rich Manipulation

Kaidi Zhang*, Do-Gon Kim*, Eric T. Chang*, Hua-Hsuan Liang, Zhanpeng He, Kathryn Lampo, Philippe Wu, Ioannis Kymissis, Matei Ciocarlie · Columbia University · 2026 (arXiv v2, Jun 2026) · arXiv:2504.15535

One-liner. A parallel-gripper where each finger holds an identical piezoelectric disk — one emits a 20 Hz–20 kHz acoustic sweep through the grasped object and the other receives it — turning the object itself into a resonant sensor that reports material, grasp position, internal structure, and extrinsic contact type, and (the headline) closes a peg-insertion loop using only that acoustic feedback.

Problem & motivation

Conventional tactile sensors acquire information by making and breaking contact, so they only see local contact points; passive/intrinsic contact is bounded by the sensor's contact area and leaves global object properties (material, mass distribution, internal geometry) and extrinsic contacts (how the object touches the rest of the world) unmeasured. The paper's premise: introduce an active component — emit an acoustic signal into the system and observe the response, "like shaking a wrapped present" — and read object/contact state out of the resonant response. Prior active-acoustic grippers (Lu & Culbertson [11]; Yi, Lee & Fazeli [12]) showed static classification (material, grasp position, occluded contact); VibeCheck streamlines the hardware (same component as emitter and receiver) and, crucially, goes beyond static classification to learn a closed-loop long-horizon policy (peg insertion) on acoustic feedback alone.

Method

Hardware (Fig 2). Each finger embeds an Adafruit 1740 piezoelectric disk (8 kHz resonant frequency). Piezoelectrics convert both voltage→strain and strain→voltage, so the same part acts as speaker on one finger and contact microphone on the other. The microphone-side disk is unamplified and is "relatively immune to ambient noise." The plastic housing is removed for compactness/sensitivity (0.2 mm thick, 6 mm radius); transducers sit in 3D-printed Clear-V4-resin housings on sorbothane isolation pads (important for absorbing robot-arm vibration), with replaceable high-friction polyethylene grip strips at the fingertips. Two Teensy 4.0 boards with audio adapters generate/sample; the emitter signal passes an inverting amplifier (gain 1.5), biased to 2.5 V center, 2.4 V amplitude; received signal sampled at 44.1 kHz; micro-ROS to the host.

Sensing procedure & preprocessing. A sinusoid is linearly swept 20 Hz→20 kHz over 1 second while the opposite finger records. Only the first 42,000 sweep points are retained (the rest is dropped because of data corruption), effectively terminating the sweep at 19.029 kHz. Raw time-domain signal → FFT → kernel PCA with a cosine kernel for dimensionality reduction before training. Across tasks they keep only 5–10 principal components ("~90% of variance"), which they find key to generalizing to unseen test conditions.
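The preprocessing chain above (sweep, truncate, FFT, cosine-kernel PCA) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the synthetic "recordings" (the emitted chirp plus Gaussian noise), and the use of the FFT magnitude as the feature are all hypothetical choices made here. Under this idealized linear-sweep model the truncated sweep ends near 19.05 kHz, close to the 19.029 kHz quoted above.

```python
import numpy as np

FS = 44_100               # receiver sampling rate in Hz (from the notes)
F0, F1 = 20.0, 20_000.0   # sweep start and end frequency in Hz
DURATION = 1.0            # sweep length in seconds
N_KEEP = 42_000           # number of samples retained after truncation


def linear_chirp(fs=FS, f0=F0, f1=F1, duration=DURATION):
    """Sinusoid whose instantaneous frequency rises linearly from f0 to f1."""
    t = np.arange(int(round(fs * duration))) / fs
    # Phase is 2*pi times the integral of f(t) = f0 + (f1 - f0) * t / duration.
    phase = 2.0 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t**2 / duration)
    return np.sin(phase)


def sweep_frequency_at(n_samples, fs=FS, f0=F0, f1=F1, duration=DURATION):
    """Instantaneous sweep frequency (Hz) reached after n_samples samples."""
    return f0 + (f1 - f0) * (n_samples / fs) / duration


def magnitude_spectrum(recording, n_keep=N_KEEP):
    """Truncate to n_keep samples, then take the one-sided FFT magnitude.
    ASSUMPTION: the notes say only "FFT"; using the magnitude is a guess."""
    return np.abs(np.fft.rfft(recording[:n_keep]))


def cosine_kernel_pca(X, n_components):
    """Kernel PCA with a cosine kernel; returns projections of the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    K = Xn @ Xn.T                                   # cosine similarity matrix
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones  # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)           # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


# Synthetic stand-ins for receiver recordings (hypothetical data).
rng = np.random.default_rng(0)
chirp = linear_chirp()
recordings = [g * chirp + 0.1 * rng.standard_normal(chirp.size)
              for g in rng.uniform(0.5, 1.0, size=12)]
X = np.stack([magnitude_spectrum(r) for r in recordings])
Z = cosine_kernel_pca(X, n_components=5)
print(X.shape, Z.shape, sweep_frequency_at(N_KEEP))
```

The rows of `Z` are what a downstream classifier would consume. Keeping only a handful of components mirrors the 5–10 components the notes report.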

Static estimation tasks (Sec IV). For each task a small MLP (sizes 400–250–100) does classification or regression on the kPCA features. Four capabilities: (1) object classification: 9 rods (~21×21×130 mm) of different material/geometry; (2) grasping position: edge / quarter / center along a rod; (3) pose from internal structure: a smooth PLA cylinder with asymmetric internal geometry, regress rotation 0–170° (10° bins, flips beyond 170°); (4) contact-type classification: diagonal / line / in-hole for a rod meeting a square hole, the classifier later reused as the peg-insertion observation.
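To make the classifier shape concrete, here is a minimal NumPy forward pass for an MLP with the 400–250–100 hidden sizes mentioned above. Everything other than those hidden sizes is an assumption for illustration: the ReLU activations, the He-style random initialization, the softmax output, the 8-dimensional input (standing in for 5–10 kPCA components), and the 3 output classes (standing in for the contact types). The weights are untrained, so the outputs carry no meaning beyond their shape.

```python
import numpy as np

HIDDEN_SIZES = (400, 250, 100)  # hidden layer widths quoted in the notes


def init_mlp(n_in, n_out, hidden=HIDDEN_SIZES, seed=0):
    """Random (untrained) weights; He-style scaling is an assumption."""
    rng = np.random.default_rng(seed)
    sizes = (n_in, *hidden, n_out)
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """ReLU hidden layers followed by a softmax over classes (both assumed)."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


# Hypothetical batch: 4 samples of 8 kPCA features, 3 contact-type classes.
params = init_mlp(n_in=8, n_out=3)
features = np.random.default_rng(1).standard_normal((4, 8))
probs = forward(params, features)
print(probs.shape)
```

For the pose-regression task the same stack would end in a single linear output rather than a softmax; that variant is not shown here.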

Task learning: peg insertion (Sec V). A handcrafted "ideal" insertion trajectory (incremental rotations about z then x until aligned) yields ground-truth expert actions a_expert (action space = move ±4.5° about x or z; 112 discrete poses). Because handcrafting a policy directly on noisy classifier outputs is brittle, they train in a simulator: given the trained contact classifier h and the ground-truth contact c_gt, they build the categorical distribution p(c_obs | c_gt) from h's confusion matrix and sample observations from it, so the imitation policy learns to be robust to classifier errors. The policy π(a | c_obs^0,...,c_obs^n) conditions on an n=10-step observation history and is trained by minimizing L = -log π(a_expert | history) (behavior cloning). On the UR5, a heuristic threshold on the receiver signal as the robot descends detects the moment of contact (and a second z threshold acts as a safety stop for misaligned pegs); a rollout succeeds if it reaches the insertion state within 50 steps.
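The confusion-matrix observation model is simple enough to sketch directly. The example below row-normalizes a confusion matrix into p(c_obs | c_gt), samples noisy observations along a ground-truth contact sequence, keeps a 10-step history, and evaluates the behavior-cloning loss for one step. The confusion counts, the ground-truth sequence, and all function names are hypothetical and chosen only for illustration; the paper's actual confusion matrix and policy network are not reproduced here.

```python
from collections import deque

import numpy as np

CONTACT_TYPES = ("diagonal", "line", "in-hole")
HISTORY_LEN = 10  # observation history length from the notes

# HYPOTHETICAL confusion counts: rows = ground truth, columns = predicted.
CONFUSION = np.array([[46, 4, 0],
                      [3, 45, 2],
                      [0, 1, 49]])


def observation_model(confusion):
    """Row-normalize counts into the categorical p(c_obs | c_gt)."""
    confusion = np.asarray(confusion, dtype=float)
    return confusion / confusion.sum(axis=1, keepdims=True)


def sample_observation(p_obs, c_gt, rng):
    """Draw a noisy classifier output given the true contact index."""
    return int(rng.choice(p_obs.shape[1], p=p_obs[c_gt]))


def bc_loss(action_probs, expert_action):
    """Behavior-cloning loss: negative log-likelihood of the expert action."""
    return float(-np.log(action_probs[expert_action]))


rng = np.random.default_rng(0)
p_obs = observation_model(CONFUSION)

# Hypothetical ground-truth contact sequence along an expert trajectory.
gt_sequence = [0] * 4 + [1] * 5 + [2] * 3
history = deque(maxlen=HISTORY_LEN)
for c_gt in gt_sequence:
    history.append(sample_observation(p_obs, c_gt, rng))

print([CONTACT_TYPES[c] for c in history])

# Loss for a policy that is uniform over the 4 actions (+/-4.5 deg about x or z).
uniform_policy = np.full(4, 0.25)
print(bc_loss(uniform_policy, expert_action=2))
```

A policy trained on such sampled histories sees the same kinds of mistakes the real classifier makes, which is the robustness argument made above.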

Setup

Results

Headline: a peg-insertion policy driven by acoustic contact-type feedback as the only sensing modality succeeds in simulation at 95% and transfers to the real UR5. Static-estimation accuracies (Table I, reported for the best frequency range; the 0.02–9.19 kHz band generally beats the full range on unseen sets):

| Task | In-distribution | New surface | New orientation |
|---|---|---|---|
| Object classification | 1.00 | 1.00 | 1.00 |
| Grasping position | 1.00 | 0.99 | 0.90 |

Contact type: 0.95 in-distribution; interp. (in-dist) 0.73 · interp. (OOD) 0.80
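Restricting features to the 0.02–9.19 kHz band amounts to masking FFT bins by frequency. A minimal sketch, assuming the one-sided FFT of the 42,000 retained samples at 44.1 kHz; the function name and the random stand-in spectrum are hypothetical:

```python
import numpy as np

FS = 44_100       # sampling rate in Hz
N_KEEP = 42_000   # retained samples per recording


def band_mask(n_samples, fs, f_lo, f_hi):
    """Boolean mask over one-sided FFT bins whose frequency lies in [f_lo, f_hi]."""
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    return (freqs >= f_lo) & (freqs <= f_hi)


mask = band_mask(N_KEEP, FS, f_lo=20.0, f_hi=9_190.0)

# Stand-in spectrum (random values) with the right number of bins.
spectrum = np.abs(np.random.default_rng(0).standard_normal(N_KEEP // 2 + 1))
low_band = spectrum[mask]
print(mask.size, low_band.size)
```

With these settings the bin spacing is 1.05 Hz, so the low band keeps well under half of the one-sided spectrum.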

Pose-from-internal-structure regression: RMSE ~20.0° over 0–170° (Fig 6), worse at low/high angles; interpolated unseen poses show similar error scale, a positive generalization sign. Object classification hits 100% in-distribution using only the top 3 PCs (~75% of variance), though ~91% of variance is needed for full unseen-case performance.

Peg insertion on UR5 (Table II, success over 10 trials):

| Start condition | θz=45° fixed | Random start |
|---|---|---|
| θx=45° (in-dist) | 6/10 | 9/10 |
| 40.5°≤θx≤81° (OOD) | N/A | 6/10 |

Where it wins: random-start in-distribution insertion is strong (9/10), and even significantly out-of-distribution poses still insert 60% of the time.

Robustness check: contact-type classifier still hits 87% with a 75 dB music distractor playing (Sec VI-A).

Where it loses / is weak: the fixed-start case is only 6/10 (worse than random start; the authors note borderline diagonal/line and line/in-hole states are hardest); new-orientation grasping drops to 0.90 (and to 0.43 in the high-frequency-only band); pose RMSE of 20° is "fairly large."

Limitations & open questions

From the authors (Sec VI-C/D):

What I noticed reading it:

Why I care

This paper is a clean instance of the thesis behind my batch interest: many manipulation predicates are not visually evaluable — they live in touch, force, and sound. VibeCheck's in-hole vs. diagonal vs. line contact-type classifier is, in BLADE's vocabulary, a learned predicate (an is_aligned / is_inserted precondition) whose ground truth is an acoustic resonance signature, not a pixel pattern. BLADE learns visual classifiers fθ(p): O → {T,F} for predicates and diffusion controllers for the bodies; VibeCheck demonstrates an acoustic classifier for exactly the kind of contact predicate that BLADE's faucet-pixel cropping could never evaluate (peg-in-hole alignment is occluded and contact-defined). The natural BLADE extension: let the predicate classifier's input modality be acoustic/tactile where the predicate is physically contact-defined — predicate grounding in the right sensory channel.

Two more connections to my themes. (1) Their confusion-matrix-as-observation-model trick — train the imitation policy on classifier errors sampled in sim — is the same robustness concern BLADE flags ("noisy state classification causes planning failures; need planners robust to estimation noise"); VibeCheck answers it at the policy layer rather than the planner. (2) It is a long-horizon-ish contact-rich task closed on a single non-visual modality — a concrete data point that the abstraction layer for contact-rich manipulation may need force/sound channels, not just vision. The big caveat: VibeCheck has no language, no planning, no symbolic abstraction — it is a sensing + single-skill-policy paper, so its relevance to BLADE is at the predicate-grounding-modality level, not the planning-abstraction level.

Quotable

The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. — Abstract / p.1
To our knowledge, we are the first to go beyond static classification and demonstrate that a long-horizon manipulation task, in this case a peg insertion task, can be learned using active acoustic sensing as the only sensing modality. — §I, contributions / p.2
We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. — Abstract / p.1

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: