One-liner. A parallel-gripper where each finger holds an identical piezoelectric disk — one emits a 20 Hz–20 kHz acoustic sweep through the grasped object and the other receives it — turning the object itself into a resonant sensor that reports material, grasp position, internal structure, and extrinsic contact type, and (the headline) closes a peg-insertion loop using only that acoustic feedback.
Conventional tactile sensors acquire information by making and breaking contact, so they only see local contact points; passive/intrinsic contact is bounded by the sensor's contact area and leaves global object properties (material, mass distribution, internal geometry) and extrinsic contacts (how the object touches the rest of the world) unmeasured. The paper's premise: introduce an active component — emit an acoustic signal into the system and observe the response, "like shaking a wrapped present" — and read object/contact state out of the resonant response. Prior active-acoustic grippers (Lu & Culbertson [11]; Yi, Lee & Fazeli [12]) showed static classification (material, grasp position, occluded contact); VibeCheck streamlines the hardware (same component as emitter and receiver) and, crucially, goes beyond static classification to learn a closed-loop long-horizon policy (peg insertion) on acoustic feedback alone.
Hardware (Fig 2). Each finger embeds an Adafruit 1740 piezoelectric disk (8 kHz resonant frequency). Piezoelectrics convert both voltage→strain and strain→voltage, so the same part acts as speaker on one finger and contact microphone on the other. The microphone-side disk is unamplified and is "relatively immune to ambient noise." The plastic housing is removed for compactness and sensitivity, leaving a disk 0.2 mm thick with a 6 mm radius; transducers sit in 3D-printed Clear-V4-resin housings on sorbothane isolation pads (important for absorbing robot-arm vibration), with replaceable high-friction polyethylene grip strips at the fingertips. Two Teensy 4.0 boards with audio adapters generate and sample the signals; the emitter signal passes through an inverting amplifier (gain 1.5) and is biased to a 2.5 V center with 2.4 V amplitude; the received signal is sampled at 44.1 kHz; the boards talk to the host over micro-ROS.
Sensing procedure & preprocessing. A sinusoid is
linearly swept 20 Hz→20 kHz over 1 second while the
opposite finger records. Only the first 42,000 sweep points are retained (the rest
are discarded because of data corruption), effectively terminating the sweep at 19.029 kHz. Raw time-domain
signal → FFT → kernel PCA with a cosine
kernel for dimensionality reduction before training. Across tasks they
keep only 5–10 principal components ("~90% of variance"), which they find
key to generalizing to unseen test conditions.
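The preprocessing chain above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`linear_sweep`, `fft_features`, `fit_kpca`, `transform_kpca`) are hypothetical, the "recordings" are synthetic stand-ins (the sweep under a shifted Gaussian envelope plus noise), and the cosine-kernel PCA is written out by hand rather than taken from a library.

```python
import numpy as np

FS = 44_100               # receiver sampling rate (Hz), from the hardware notes
F0, F1 = 20.0, 20_000.0   # sweep endpoints (Hz)
DURATION = 1.0            # sweep length (s)
N_KEEP = 42_000           # samples retained after truncation


def linear_sweep(fs=FS, f0=F0, f1=F1, duration=DURATION):
    """Linear sine sweep; the phase is the integral of the instantaneous frequency."""
    t = np.arange(int(fs * duration)) / fs
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) / duration * t**2)
    return np.sin(phase)


def fft_features(recording, n_keep=N_KEEP):
    """Magnitude spectrum of the truncated time-domain recording."""
    return np.abs(np.fft.rfft(recording[:n_keep]))


def cosine_kernel(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_n @ B_n.T


def fit_kpca(X, n_components):
    """Kernel PCA: eigendecompose the double-centered cosine kernel matrix."""
    K = cosine_kernel(X, X)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    vals = np.maximum(vals[order], 1e-12)   # guard against tiny negative eigenvalues
    alphas = vecs[:, order] / np.sqrt(vals)
    return {"X": X, "K": K, "alphas": alphas}


def transform_kpca(model, X_new):
    """Project new spectra using the training-set kernel statistics."""
    K_new = cosine_kernel(X_new, model["X"])
    K = model["K"]
    Kc = (K_new - K_new.mean(axis=1, keepdims=True)
          - K.mean(axis=0, keepdims=True) + K.mean())
    return Kc @ model["alphas"]


# Synthetic stand-in data: 12 fake "recordings" with different resonance envelopes.
rng = np.random.default_rng(0)
sweep = linear_sweep()
t = np.arange(sweep.size) / FS
recordings = [
    sweep * np.exp(-((t - c) / 0.1) ** 2) + 0.01 * rng.standard_normal(sweep.size)
    for c in np.linspace(0.1, 0.9, 12)
]
X = np.stack([fft_features(r) for r in recordings])
model = fit_kpca(X, n_components=5)
Z = transform_kpca(model, X)   # low-dimensional features that would feed the MLP
print(X.shape, Z.shape)
```

Because the cosine kernel normalizes each spectrum, the projection ignores overall signal amplitude and responds only to spectral shape, which is one plausible reason such a kernel suits this kind of data.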
Static estimation tasks (Sec IV). For each task a small
MLP (sizes 400–250–100) does classification or regression on the
kPCA features. Four capabilities: (1) object classification —
9 rods (~21×21×130 mm) of different material/geometry; (2)
grasping position — edge / quarter /
center along a rod; (3) pose from internal structure
— a smooth PLA cylinder with asymmetric internal geometry, regress
rotation 0–170° (10° bins, flips beyond 170°); (4)
contact-type classification — diagonal /
line / in-hole for a rod meeting a square hole, the
classifier later reused as the peg-insertion observation.
Task learning — peg insertion (Sec V). A handcrafted
"ideal" insertion trajectory (incremental rotations about z then
x until aligned) yields ground-truth expert actions
a_expert (action space = move ±4.5° about x
or z; 112 discrete poses). Because handcrafting a policy directly
on noisy classifier outputs is brittle, they train in a simulator:
given the trained contact classifier h and ground-truth contact
c_gt, they build the categorical distribution
p(c_obs | c_gt) from h's confusion matrix and sample
observations from it, so the imitation policy learns to be robust to
classifier errors. The policy
π(a | c_obs^0,...,c_obs^n) conditions on an
n=10-step observation history, trained by minimizing
L = -log π(a_expert | history) (behavior cloning). On the UR5,
a heuristic threshold on the receiver signal as the robot descends detects the
moment of contact (and a second z threshold acts as a safety stop
for misaligned pegs); a rollout succeeds if it reaches the insertion state
within 50 steps.
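The confusion-matrix observation model and the behavior-cloning loss can be sketched as follows. This is a hedged illustration only: the confusion counts are made up (not the paper's), the helper names are hypothetical, and the four-action space (±x, ±z) is an assumption read off the action description above.

```python
import math
import random

CONTACTS = ["diagonal", "line", "in-hole"]

# HYPOTHETICAL confusion counts (rows = ground truth, columns = classifier output).
# These are illustrative numbers, not values reported in the paper.
CONFUSION = [
    [90, 8, 2],
    [6, 88, 6],
    [1, 7, 92],
]


def observation_model(confusion):
    """Row-normalize confusion counts into p(c_obs | c_gt)."""
    return [[count / sum(row) for count in row] for row in confusion]


def sample_observation(p_obs, c_gt, rng):
    """Draw one noisy observation c_obs ~ p(. | c_gt)."""
    return rng.choices(range(len(p_obs)), weights=p_obs[c_gt], k=1)[0]


def noisy_history(p_obs, gt_sequence, rng, n=10):
    """Corrupt a ground-truth contact sequence and keep the last n observations."""
    observed = [sample_observation(p_obs, c, rng) for c in gt_sequence]
    return observed[-n:]


def bc_loss(action_probs, a_expert):
    """Behavior-cloning loss for one step: -log pi(a_expert | history)."""
    return -math.log(action_probs[a_expert])


rng = random.Random(0)
p_obs = observation_model(CONFUSION)

# A simulated rollout that passes through diagonal -> line -> in-hole contact.
gt_sequence = [0] * 4 + [1] * 4 + [2] * 4
history = noisy_history(p_obs, gt_sequence, rng)
print([CONTACTS[c] for c in history])

# Loss of an untrained (uniform) policy over the assumed four actions.
uniform_policy = [0.25] * 4
print(bc_loss(uniform_policy, a_expert=2))
```

Training on histories sampled this way exposes the policy to the same kinds of misclassification it will meet on the robot, so it cannot rely on any single observation being correct.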
Headline: a peg-insertion policy driven by acoustic contact-type feedback as the only sensing modality succeeds 95% of the time in simulation and transfers to the real UR5. Static-estimation accuracies (Table I, reported for the best frequency range; the 0.02–9.19 kHz band generally beats the full range on unseen sets):
| Task | In-distribution | New surface | New orientation |
|---|---|---|---|
| Object classification | 1.00 | 1.00 | 1.00 |
| Grasping position | 1.00 | 0.99 | 0.90 |
| Contact type | 0.95 | interp. (in-dist) 0.73 · interp. (OOD) 0.80 | |
Pose-from-internal-structure regression: RMSE ~20.0° over 0–170° (Fig 6), worse at low/high angles; interpolated unseen poses show similar error scale, a positive generalization sign. Object classification hits 100% in-distribution using only the top 3 PCs (~75% of variance), though ~91% of variance is needed for full unseen-case performance.
Peg insertion on UR5 (Table II, success over 10 trials):
| Start condition | θz=45° fixed | Random start |
|---|---|---|
| θx=45° (in-dist) | 6/10 | 9/10 |
| 40.5°≤θx≤81° (OOD) | N/A | 6/10 |
Where it wins: random-start in-distribution insertion is strong (9/10), and
even significantly out-of-distribution poses still insert 60% of the time.
Robustness check: contact-type classifier still hits 87% with
a 75 dB music distractor playing (Sec VI-A). Where it loses / is weak:
the fixed-start case is only 6/10 (worse than random start — the authors
note borderline diagonal/line and
line/in-hole states are hardest); new-orientation
grasping drops to 0.90 (and to 0.43 in the high-frequency-only band);
pose RMSE of 20° is "fairly large."
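With only 10 trials per condition, the Table II success rates carry wide uncertainty. The sketch below is my own check, not an analysis from the paper: it computes approximate 95% Wilson score intervals for the reported counts, with `wilson_interval` as a hypothetical helper name. Under this approximation the intervals for 6/10 and 9/10 overlap, so the gap between fixed and random starts should be read cautiously.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Success counts as reported in Table II (10 trials each).
for label, k in [("fixed start", 6), ("random start", 9), ("OOD random start", 6)]:
    lo, hi = wilson_interval(k, 10)
    print(f"{label}: {k}/10, approx. 95% interval [{lo:.2f}, {hi:.2f}]")
```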
From the authors (Sec VI-C/D):
What I noticed reading it:
z safety stop — so proprioception/geometry
priors quietly enter the loop, not pure acoustic state estimation.
This paper is a clean instance of the thesis behind my batch interest:
many manipulation predicates are not visually evaluable — they
live in touch, force, and sound. VibeCheck's in-hole vs.
diagonal vs. line contact-type classifier is, in
BLADE's
vocabulary, a learned predicate (an is_aligned /
is_inserted precondition) whose ground truth is an
acoustic resonance signature, not a pixel pattern. BLADE learns visual
classifiers fθ(p): O → {T,F} for predicates
and diffusion controllers for the bodies; VibeCheck demonstrates an
acoustic classifier for exactly the kind of contact predicate that
BLADE's faucet-pixel cropping could never evaluate (peg-in-hole alignment is
occluded and contact-defined). The natural BLADE extension: let the predicate
classifier's input modality be acoustic/tactile where the predicate is
physically contact-defined — predicate grounding in the right
sensory channel.
Two more connections to my themes. (1) Their confusion-matrix-as-observation-model trick — train the imitation policy on classifier errors sampled in sim — is the same robustness concern BLADE flags ("noisy state classification causes planning failures; need planners robust to estimation noise"); VibeCheck answers it at the policy layer rather than the planner. (2) It is a long-horizon-ish contact-rich task closed on a single non-visual modality — a concrete data point that the abstraction layer for contact-rich manipulation may need force/sound channels, not just vision. The big caveat: VibeCheck has no language, no planning, no symbolic abstraction — it is a sensing + single-skill-policy paper, so its relevance to BLADE is at the predicate-grounding-modality level, not the planning-abstraction level.
The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. — Abstract / p.1
To our knowledge, we are the first to go beyond static classification and demonstrate that a long-horizon manipulation task, in this case a peg insertion task, can be learned using active acoustic sensing as the only sensing modality. — §I, contributions / p.2
We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. — Abstract / p.1
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant: