SonicSense: Object Perception from In-Hand Acoustic Vibration

Jiaxun Liu, Boyuan Chen · Duke University · CoRL 2024 · arXiv:2406.17932

One-liner. A cheap (~$215) four-finger robot hand with a piezoelectric contact microphone in each fingertip taps, grasps, and shakes 83 real-world objects, then learns end-to-end models that read the contact-borne vibration for four tasks: container inventory differentiation, material classification, sparse-tapping 3D shape reconstruction, and object re-identification. The result shows that "what an object sounds like when you touch it" is a rich, noise-robust perception channel that scales past the small-N, single-finger acoustic-sensing setups that came before.

Problem & motivation

Acoustic vibration is an underused tactile channel: human skin captures high-frequency vibrations to read material and geometry, but robots rarely exploit this. Prior acoustic-sensing work is constrained on three axes: (1) small object sets (N<5) of simple primitives with homogeneous materials; (2) single-finger sensing; (3) train/test on the same objects (different contacts, same instance), so generalization to unseen objects is unproven. Worse, much prior data is collected by a human manually moving the hand or by replaying fixed pre-defined poses, which doesn't scale. A second framing issue: air microphones near the robot pick up sound waves through air and are swamped by ambient noise, whereas contact microphones sense only vibrations transmitted through physical contact. SonicSense argues for a holistic hardware+software design built around contact microphones, autonomous heuristic exploration, and a large diverse real-world object set, so that one system generalizes across multiple object-perception tasks on unseen objects.

Method

The contribution is a co-designed sensing platform, three task models (material classification, shape reconstruction, object re-identification), and an autonomous heuristic data-collection policy.

Hardware (Fig 2). A four-finger hand, each finger one joint / one DoF, enabling tapping, grasping, and shaking primitives. Each fingertip embeds a piezoelectric contact microphone inside the 3D-printed plastic shell; a round counterweight is mounted on the outer shell to increase finger momentum — the authors found the counterweight is important for producing large striking vibrations during tapping. The four microphones are synchronized and sample at 44,100 Hz. Total build cost is $215.26 from off-the-shelf parts and 3D printing.

Heuristic interaction policy (§3.3). Rather than human teleop or fixed-pose replay, a simple but effective heuristic autonomously collects responses across variable sizes/geometries. The hand approaches from top-down and side directions, from high to low with a fixed step, until a first contact event is detected per direction (detected via motor voltage-resistance feedback, chosen for robustness/ease over acoustic change). From those contacts it estimates object height (top-down) and radius (side), then uses a grid-sampling schedule to tap-contact the surface and gather acoustic responses. Stated assumptions: max object height fits the hand, object size is graspable during tapping, and the object is fixed to the table (following prior work).
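The exploration loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `approach_until_contact` and `tap_grid`, the simulated contact callbacks, the step size, and the cylindrical grid layout are all assumptions made for the example. The notes only say that the hand steps toward the object until first contact, estimates height and radius, and then grid-samples tap locations.

```python
import math

def approach_until_contact(start, stop, step, in_contact):
    """Step from `start` toward `stop` in fixed decrements and return the
    first position at which `in_contact` reports contact.
    `in_contact` stands in for the motor-feedback contact detector."""
    n_steps = int(round((start - stop) / step))
    for i in range(n_steps + 1):
        pos = start - i * step  # recompute from start to limit float drift
        if in_contact(pos):
            return pos
    raise RuntimeError("no contact detected along this direction")

def tap_grid(height, radius, n_heights, n_angles):
    """Hypothetical grid schedule: tap targets on a cylinder that
    approximates the object, at evenly spaced heights and angles."""
    targets = []
    for i in range(n_heights):
        z = height * (i + 0.5) / n_heights
        for j in range(n_angles):
            theta = 2.0 * math.pi * j / n_angles
            targets.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return targets

# Simulated object (assumed values, in metres) fixed to the table.
TRUE_HEIGHT = 0.125
TRUE_RADIUS = 0.043

# Top-down approach estimates height; side approach estimates radius.
height = approach_until_contact(0.30, 0.0, 0.01, lambda z: z <= TRUE_HEIGHT)
radius = approach_until_contact(0.15, 0.0, 0.01, lambda r: r <= TRUE_RADIUS)
taps = tap_grid(height, radius, n_heights=3, n_angles=4)
print(f"height~{height:.2f} m, radius~{radius:.2f} m, {len(taps)} tap targets")
```

Because the approach uses a fixed step, each estimate is accurate only to within one step of the true dimension, which is one reason the contact locations fed to the shape model are described as rough and noisy.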

Material classification model (Fig 4A). Input is the Mel-spectrogram A of a single tapping position's vibration signal; three CNN layers + two MLP layers output a per-contact material label m, trained with cross-entropy (Eq. 1). At inference, an iterative refinement assumes material is locally uniform: filter low-occurrence predictions with threshold M, then re-assign each point by majority vote among its K nearest neighbors, repeated N steps (M,K,N chosen on validation).
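The inference-time refinement is easy to misread, so here is a small sketch of one plausible reading. It assumes that labels occurring fewer than M times are dropped from voting, that a point's K nearest neighbors exclude the point itself, and that updates within a step are synchronous; none of these details are confirmed by the notes, and `refine_labels` is a hypothetical name.

```python
import math
from collections import Counter

def refine_labels(points, labels, min_count, k, n_steps):
    """Iteratively smooth per-contact material labels.
    min_count ~ M, k ~ K, n_steps ~ N in the description above."""
    # Filter: labels predicted fewer than min_count times cannot vote.
    counts = Counter(labels)
    current = [lab if counts[lab] >= min_count else None for lab in labels]
    for _ in range(n_steps):
        updated = []
        for i, p in enumerate(points):
            # Rank all other contact points by Euclidean distance to p.
            others = [j for j in range(len(points)) if j != i]
            others.sort(key=lambda j: math.dist(p, points[j]))
            # Majority vote among the k nearest neighbors with valid labels.
            votes = Counter(current[j] for j in others[:k] if current[j] is not None)
            updated.append(votes.most_common(1)[0][0] if votes else current[i])
        current = updated
    return current

# Toy example: a wooden region with one spurious "glass" prediction,
# plus a separate plastic region.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0),
          (10, 0, 0), (11, 0, 0), (12, 0, 0)]
labels = ["wood", "glass", "wood", "wood", "plastic", "plastic", "plastic"]
refined = refine_labels(points, labels, min_count=2, k=2, n_steps=2)
print(refined)
```

The spurious prediction is absorbed by its neighbors while the spatially separate plastic region is left alone, which is the "locally uniform material" assumption in action.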

Shape reconstruction model (Fig 4B). Sparse contact points C (rough, noisy fingertip-contact locations) are completed into a dense 2,024-point cloud P via a Point Completion Network: two stacked PointNet layers encode C to a 1×1024 global feature, then fully-connected decoder layers produce the cloud (no folding decoder). Trained with Chamfer Distance loss (Eqs. 2–3). Because real contact data is limited, the network is pre-trained on synthetic contact interactions simulated in a sim environment (Fig 5), then trained with a curriculum that gradually shifts from synthetic to real data.
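For reference, the Chamfer Distance used as the training loss and evaluation metric can be written out directly. The sketch below uses a common Chamfer-L1 convention (the average of the two directional mean nearest-neighbor Euclidean distances); the exact normalization in the paper's Eqs. 2–3 is not reproduced in these notes, so treat the 0.5 factor as an assumption.

```python
import math

def chamfer_l1(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two point clouds given as
    lists of (x, y, z) tuples. Brute force, O(len(a) * len(b))."""
    def mean_nearest(src, dst):
        # For each source point, distance to its closest point in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (mean_nearest(cloud_a, cloud_b) + mean_nearest(cloud_b, cloud_a))

# Two tiny clouds offset by 0.1 m along z: every nearest-neighbor
# distance is 0.1, so the Chamfer distance is 0.1.
predicted = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_l1(predicted, reference))
```

The metric needs no point-to-point correspondence, which suits a decoder that emits an unordered cloud from only a handful of contact points.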

Object re-identification model (Fig 4C). Hardest task: after interacting with an object, re-identify it from a new, different set of tapping interactions (different contact locations). Input fuses 15 channels of Mel-spectrograms (audio encoder of conv layers) and the corresponding 15 contact positions (PCN contact-point encoder); fused MLP layers output the object label among 82 objects (Eq. 4). Ablations drop either branch to test the contribution of audio vs. contact geometry.

Setup

Results

Headline numbers (means from Figs 8–11):

| Task / metric | Random | Nearest Neighbor | SonicSense |
| --- | --- | --- | --- |
| Material classification (avg F1) | 0.115 | 0.371 | 0.763 (0.523 before refinement) |
| Shape reconstruction (Chamfer-L1, m, lower is better) | 0.03653 | 0.01347 | 0.00876 |
| Object re-identification (accuracy) | 1.33% | 43.45% | 92.52% |
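To make the gaps in the headline numbers concrete, the short calculation below derives the margins from the figures quoted above. The derived quantities are my own arithmetic, not values reported in the paper.

```python
# Headline means as quoted in the table above.
f1_refined, f1_unrefined, f1_nn = 0.763, 0.523, 0.371
chamfer_sonic, chamfer_nn = 0.00876, 0.01347
reid_sonic, reid_nn = 92.52, 43.45

# Absolute F1 gained by the iterative refinement step.
f1_gain_from_refinement = f1_refined - f1_unrefined
# Relative Chamfer-L1 reduction versus the nearest-neighbor baseline.
chamfer_reduction_vs_nn = 1.0 - chamfer_sonic / chamfer_nn
# Re-identification accuracy gain over nearest neighbor, in percentage points.
reid_gain_points = reid_sonic - reid_nn

print(f"refinement adds {f1_gain_from_refinement:.3f} F1")
print(f"Chamfer-L1 is {chamfer_reduction_vs_nn:.1%} lower than nearest neighbor")
print(f"re-identification gains {reid_gain_points:.2f} points over nearest neighbor")
```

Read this way, refinement accounts for a large share of the material-classification margin over nearest neighbor, so the local-uniformity assumption is doing real work in that result.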

Key findings:

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This sits squarely on the batch thesis that many manipulation predicates are not visually evaluable: is_full (water in an occluded bottle), object material (surface_is_rough / is_ceramic), and even object identity under occlusion are read here purely from contact-borne sound, exactly the touch/force/sound channel that visual predicate classifiers in BLADE cannot see. BLADE learns neural predicate classifiers from RGB-D and explicitly flags partial observability and force-modulated tasks as bottlenecks; SonicSense is a concrete demonstration that an acoustic classifier could ground predicates like container_has_liquid(x) or material(x)=glass that a camera-only classifier would mark unknowable. The container-inventory task is the cleanest example: differentiating 0/100/200 mL of water in a closed bottle is a precondition/effect distinction (is_full before vs. after pouring) that lives entirely in sound.

The caveat for my line of work: SonicSense is a perception paper, not a planning or abstraction paper — there is no symbolic layer, no language, no composition over skills. Its relevance is as a sensor/predicate-grounding substrate that a BLADE-style system could consume, not as a method I'd compare planning against. The honest open question it raises for me: if these properties are only legible through active interaction (tapping, shaking), then the abstraction layer needs information-gathering actions as first-class operators, which BLADE's purely categorical, perception-passive operator model does not yet have.

Quotable

By shaking a container, we can tell its inventory status from the generated acoustic vibrations, such as the quantity and geometry of the objects inside. — §1 Introduction / p.1
On the other hand, contact microphones only sense the acoustic vibrations caused by physical contact. — §1 Introduction / p.1
Our framework underscores the significance of in-hand acoustic vibration sensing in advancing robot tactile perception. — Abstract / p.1

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: