SonicSense: Object Perception from In-Hand Acoustic Vibration

Jiaxun Liu, Boyuan Chen · Duke University · CoRL 2024 · arXiv:2406.17932

One-liner. A cheap (~$215) four-finger robot hand with a piezoelectric contact microphone in each fingertip taps, grasps, and shakes 83 real-world objects, and end-to-end models read the resulting contact-borne vibration to support four tasks: container inventory differentiation, material classification, sparse-tapping 3D shape reconstruction, and object re-identification. The claim is that "what an object sounds like when you touch it" is a rich, noise-robust perception channel that scales past the small-N, single-finger acoustic-sensing setups that came before.

Problem & motivation

Acoustic vibration is an underused tactile channel: human skin captures high-frequency vibrations to read material and geometry, but robots rarely exploit this. Prior acoustic-sensing work is constrained on three axes: (1) small object sets (N<5) of simple primitives with homogeneous materials; (2) single-finger sensing; (3) train/test on the same objects (different contacts, same instance), so generalization to unseen objects is unproven. Worse, much prior data is collected by a human manually moving the hand or by replaying fixed pre-defined poses, which doesn't scale. A second framing issue: air microphones near the robot pick up sound waves through air and are swamped by ambient noise, whereas contact microphones sense only vibrations transmitted through physical contact. SonicSense argues for a holistic hardware+software design built around contact microphones, autonomous heuristic exploration, and a large diverse real-world object set, so that one system generalizes across multiple object-perception tasks on unseen objects.

Method

The contribution is a co-designed sensing platform plus three task-specific models and an autonomous, heuristic data-collection policy.

Hardware (Fig 2). A four-finger hand, each finger one joint / one DoF, enabling tapping, grasping, and shaking primitives. Each fingertip embeds a piezoelectric contact microphone inside the 3D-printed plastic shell; a round counterweight is mounted on the outer shell to increase finger momentum — the authors found the counterweight is important for producing large striking vibrations during tapping. The four microphones are synchronized and sample at 44,100 Hz. Total build cost is $215.26 from off-the-shelf parts and 3D printing.

Heuristic interaction policy (§3.3). Rather than human teleop or fixed-pose replay, a simple but effective heuristic autonomously collects responses from objects of varying size and geometry. The hand approaches from the top-down and side directions, moving from high to low with a fixed step, until a first contact event is detected in each direction. Contact is detected via motor voltage-resistance feedback, chosen over detecting a change in the acoustic signal for robustness and ease. From those contacts the policy estimates object height (top-down) and radius (side), then uses a grid-sampling schedule to tap the surface and gather acoustic responses. Stated assumptions: the maximum object height fits the hand, the object is of a graspable size during tapping, and the object is fixed to the table (following prior work).
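
A minimal sketch of the exploration heuristic, run against a simulated object rather than the real hand. The function names (`first_contact`, `tap_grid`), the cylinder-shaped tap layout, and every numeric value are assumptions made for illustration; the note only establishes that the hand steps toward the object until motor feedback reports contact, estimates height and radius from those contacts, and then grid-samples tap locations.

```python
import math
from typing import Callable, List, Optional, Tuple


def first_contact(start: float, stop: float, step: float,
                  in_contact: Callable[[float], bool]) -> Optional[float]:
    """Step from `start` down to `stop`; return the first position reporting contact."""
    n_steps = int((start - stop) / step + 1e-9) + 1
    for k in range(n_steps):
        pos = start - k * step  # recomputed from k to avoid accumulating float error
        if in_contact(pos):
            return pos
    return None  # no contact found within the search range


def tap_grid(height: float, radius: float,
             n_heights: int, n_angles: int) -> List[Tuple[float, float, float]]:
    """Evenly spaced (x, y, z) tap targets on a cylinder of the estimated size."""
    targets = []
    for i in range(n_heights):
        z = height * (i + 0.5) / n_heights  # mid-height of each vertical band
        for j in range(n_angles):
            theta = 2.0 * math.pi * j / n_angles
            targets.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return targets


# Simulated object standing in for the real contact signal (hypothetical sizes, metres).
TRUE_HEIGHT, TRUE_RADIUS = 0.12, 0.04
est_height = first_contact(0.30, 0.0, 0.01, lambda z: z <= TRUE_HEIGHT)  # top-down
est_radius = first_contact(0.15, 0.0, 0.01, lambda r: r <= TRUE_RADIUS)  # from the side
targets = tap_grid(est_height, est_radius, n_heights=3, n_angles=4)
print(est_height, est_radius, len(targets))
```

The estimates can only be accurate to within one step, which is one reason a height-plus-radius model gives coarse coverage on irregular shapes.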

Material classification model (Fig 4A). Input is the Mel-spectrogram A of a single tapping position's vibration signal; three CNN layers + two MLP layers output a per-contact material label m, trained with cross-entropy (Eq. 1). At inference, an iterative refinement assumes material is locally uniform: filter low-occurrence predictions with threshold M, then re-assign each point by majority vote among its K nearest neighbors, repeated N steps (M,K,N chosen on validation).
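
The inference-time refinement can be sketched in a few lines of plain Python. This is one reading of the description above, not the authors' code: `refine_labels` and its parameters `min_count`, `k`, and `n_steps` (standing in for M, K, and N) are hypothetical names, and treating filtered predictions as "no vote" and excluding a point from its own neighbourhood are assumptions.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def refine_labels(points: Sequence[Point], labels: Sequence[str],
                  min_count: int, k: int, n_steps: int) -> List[Optional[str]]:
    """Filter rare predictions, then smooth labels by k-nearest-neighbour majority vote."""
    counts = Counter(labels)
    # Assumption: a label predicted fewer than min_count times is dropped (None).
    current: List[Optional[str]] = [
        lab if counts[lab] >= min_count else None for lab in labels
    ]
    for _ in range(n_steps):
        updated: List[Optional[str]] = []
        for i, p in enumerate(points):
            # Indices of the k closest other contact points (Euclidean distance).
            neighbours = sorted(
                (j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(p, points[j]),
            )[:k]
            votes = Counter(current[j] for j in neighbours if current[j] is not None)
            # Keep the previous label when no neighbour casts a valid vote.
            updated.append(votes.most_common(1)[0][0] if votes else current[i])
        current = updated  # synchronous update: all points change together
    return current


# Six contact points on a line; one isolated "wood" prediction among "glass".
line = [(0.01 * i, 0.0, 0.0) for i in range(6)]
raw = ["glass", "glass", "wood", "glass", "glass", "glass"]
print(refine_labels(line, raw, min_count=2, k=2, n_steps=2))
```

The locally-uniform assumption cuts both ways: it removes isolated errors, but it would also erase a genuinely small patch of a second material on a multi-material object.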

Shape reconstruction model (Fig 4B). Sparse contact points C (rough, noisy fingertip-contact locations) are completed into a dense 2,024-point cloud P via a Point Completion Network: two stacked PointNet layers encode C to a 1×1024 global feature, then fully-connected decoder layers produce the cloud (no folding decoder). Trained with Chamfer Distance loss (Eqs. 2–3). Because real contact data is limited, the network is pre-trained on synthetic contact interactions simulated in a sim environment (Fig 5), then trained with a curriculum that gradually shifts from synthetic to real data.
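
For reference, a brute-force sketch of a symmetric Chamfer distance between two point clouds. The normalization used here (the sum of the two directional mean nearest-neighbour distances) is one common convention and is an assumption; the exact form of the paper's Eqs. 2–3 is not reproduced in this note, and some implementations halve the sum. `chamfer_l1` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_l1(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Sum of the mean nearest-neighbour distances pred->target and target->pred."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # For each source point, distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)


# A one-point prediction covers half of a two-point target (units: metres).
pred = [(0.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_l1(pred, target))
```

The second term is what penalizes an incomplete reconstruction: every predicted point can sit on the surface while parts of the target remain far from any prediction.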

Object re-identification model (Fig 4C). Hardest task: after interacting with an object, re-identify it from a new, different set of tapping interactions (different contact locations). Input fuses 15 channels of Mel-spectrograms (audio encoder of conv layers) and the corresponding 15 contact positions (PCN contact-point encoder); fused MLP layers output the object label among 82 objects (Eq. 4). Ablations drop either branch to test the contribution of audio vs. contact geometry.

Setup

Datasets / benchmarks: Self-collected real-world dataset of 83 objects (54 everyday + 29 3D-printed primitives with various surface materials), spanning nine material categories (plastic, glass, wood, metal, ceramic, paper, rubber, foam, fabric); 22.9% are multi-material. The dataset provides high-quality 3D-scanned meshes/point clouds plus per-point material annotations. Material classification uses object-level splits of 60/11/11 (train/val/test), repeated over 3 splits to report mean ± std; re-identification is over 82 objects.
Hardware / simulator: Custom four-finger, 4-DoF hand with one piezoelectric contact microphone + counterweight per fingertip, 44.1 kHz synchronized sampling, ~$215 build. A simulation environment augments shape-reconstruction training with synthetic contact interactions (Fig 5).
Baselines: Random search and Nearest Neighbor baselines across all three quantitative tasks; for re-identification also Ours (contact-points only) and Ours (audio only) ablations. The authors also tested pre-training the material classifier on external acoustic-material datasets (cited [6,45,54]).
Compute: All models trained on a single NVIDIA GeForce RTX 3090 GPU; each training run takes from under an hour to a few hours.
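
A small sketch of what an object-level 60/11/11 split looks like in practice: whole objects, not individual taps, are assigned to train, validation, or test, so no test object contributes any contact to training. The helper name `object_level_split`, the object IDs, and the use of three random seeds are assumptions; the note only states that three splits were used.

```python
import random
from typing import List, Sequence, Tuple


def object_level_split(object_ids: Sequence[str], n_train: int, n_val: int,
                       n_test: int, seed: int) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle object IDs with a fixed seed and cut them into three disjoint groups."""
    if n_train + n_val + n_test != len(object_ids):
        raise ValueError("split sizes must sum to the number of objects")
    ids = sorted(object_ids)           # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# 82 placeholder object IDs, split three times with different (hypothetical) seeds.
objects = [f"obj_{i:02d}" for i in range(82)]
splits = [object_level_split(objects, 60, 11, 11, seed) for seed in (0, 1, 2)]
for train, val, test in splits:
    print(len(train), len(val), len(test))
```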

Results

Headline numbers (means from Figs 8–11):

Task / metric	Random	Nearest Neighbor	SonicSense
Material classification (avg F1)	0.115	0.371	0.763 (0.523 before refinement)
Shape reconstruction (Chamfer-L1, m, lower better)	0.03653	0.01347	0.00876
Object re-identification (accuracy)	1.33%	43.45%	92.52%

Key findings:

Material: iterative refinement lifts F1 from 0.523 to 0.763 and beats both baselines, generalizing to unseen objects (Fig 8A). Confusion matrix (Fig 8B): even soft materials (foam, fabric) that produce weak striking signals are classified well via subtle stiffness cues; ceramic↔glass are confused (similar properties), and plastic is hard because the dataset's plastics vary widely in thickness/stiffness.
Material pre-training failed: pre-training on external acoustic-material datasets [6,45,54] hurt performance because those signals were collected with air microphones in noise-controlled settings — a large domain gap from in-hand contact microphones.
Shape: Chamfer-L1 0.00876 m vs 0.03653 (random) and 0.01347 (NN); reconstructs concave geometries; the sim-augmentation likely drives generalization (e.g., a single wine-glass instance is still reconstructed). Failures: spray nozzle, bottle cap (complex shape in a small region; few such objects in data).
Re-identification: 92.52% with both branches; ablations show audio-only 84.11% vs contact-points-only 48.89% — acoustic vibration is the more informative branch, and fusing rough contact positions adds a further boost. Smaller objects and similar-material pairs (ceramic/glass) are harder.
Noise robustness (Fig 6): as injected Gaussian white noise rose from ~38 to ~80 dB, the external air microphone's amplitude rose by thousands of units while the contact fingertip microphone changed by only a few units — strong evidence the hardware isolates physical-contact vibration from ambient air noise.
Container inventory (Fig 7): qualitatively, vibration signals from shaking a container of dice (varying count/shape) and from a bottle with varying water amounts (0/100/200 mL static; 100/200/300 mL shaking) form clear t-SNE clusters over 12 hand-crafted acoustic features (RMS, spectral centroid, bandwidth, contrast, flatness, roll-off, ZCR, tempogram, poly features, MFCC, chroma, tonnetz), distinguishing both solid counts/geometries and subtle continuous liquid states.
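
To make the hand-crafted features in the container-inventory item concrete, here is a standard-library sketch of three of the twelve: RMS, zero-crossing rate, and spectral centroid. It uses a naive DFT, so it is only practical for short windows, and it is not the authors' feature pipeline, whose implementation is not described in this note. The function names and the synthetic 1 kHz test tone are assumptions; the 44,100 Hz rate is the one reported for the fingertip microphones.

```python
import cmath
import math
from typing import Sequence


def rms(signal: Sequence[float]) -> float:
    """Root-mean-square amplitude of the window."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def zero_crossing_rate(signal: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose sign differs."""
    signs = [x >= 0.0 for x in signal]
    changes = sum(a != b for a, b in zip(signs, signs[1:]))
    return changes / (len(signal) - 1)


def spectral_centroid(signal: Sequence[float], sample_rate: float) -> float:
    """Magnitude-weighted mean frequency (Hz), computed with a naive O(n^2) DFT."""
    n = len(signal)
    weighted = total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total > 0.0 else 0.0


# Synthetic 10 ms window: a 1 kHz sine sampled at 44.1 kHz (exactly 10 cycles).
SAMPLE_RATE = 44100
tone = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(441)]
print(rms(tone), zero_crossing_rate(tone), spectral_centroid(tone, SAMPLE_RATE))
```

In a real pipeline these would be computed per frame and summarized per recording before any t-SNE projection; a vectorized FFT would replace the naive DFT.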

Limitations & open questions

From the authors:

The object is assumed fixed to the table; relaxing this needs online object estimation and tactile-informed tracking during interaction.
The system does not consider multiple objects; in cluttered scenes acoustic vibration could still help but isn't tested.
The hardware lacks dexterous manipulation skills; a more anthropomorphic, higher-DoF hand is needed to extend the findings.
Focus is acoustic-only; integrating multiple sensing modalities for complementary information is left as future work.

What I noticed reading it:

Container-inventory results are only qualitative t-SNE plots and spectrogram visualizations — no classifier, no accuracy number, no baseline. This is the headline "shake a container to read its inventory" capability and it is the least quantified of the four tasks.
The dataset is small in instance count per material (60/11/11 object split), and the test set is "unbalanced caused by different contact points per object," which is why F1 rather than accuracy is reported. With only 11 objects in each held-out split, the variance of material F1 across the 3 splits matters more than the headline mean conveys.
Shape reconstruction leans heavily on synthetic pre-training; the paper concedes the wine-glass result is "likely achieved through our augmentation dataset," which raises the question of how much real acoustic contact is actually doing the work vs. a learned 3D shape prior over a limited primitive library.
The heuristic policy estimates only height + radius (a cylinder bounding model), so contact sampling on genuinely complex geometries (the failure cases) is under-served by the very exploration policy meant to feed the shape model — a coupling the paper doesn't analyze.
All tasks are trained as separate networks; despite the "holistic" framing there's no shared representation or multi-task model, so the claim that one channel supports four tasks is a claim about the sensor, not about a unified perception model.
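
The F1-versus-accuracy point above is easy to see with a toy example. The sketch below assumes the reported "avg F1" is a macro average over classes, which this note does not confirm; `macro_f1` and the label counts are illustrative only.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Unbalanced toy test set: 9 plastic contacts, 1 foam contact.
y_true = ["plastic"] * 9 + ["foam"]
y_pred = ["plastic"] * 10  # a classifier that always answers "plastic"
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

Accuracy rewards the always-plastic classifier with 0.9, while macro F1 drops below 0.5 because the foam class scores zero. The same sensitivity means a single hard object in an 11-object test split can move the score noticeably.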

Why I care

This sits squarely on the batch thesis that many manipulation predicates are not visually evaluable — is_full (water in an occluded bottle), object material (surface_is_rough / is_ceramic), and even object identity under occlusion are read here purely from contact-borne sound, exactly the touch/force/sound channel that visual predicate classifiers in BLADE cannot see. BLADE learns neural predicate classifiers from RGB-D and explicitly flags partial observability and force-modulated tasks as bottlenecks; SonicSense is a concrete demonstration that an acoustic classifier could ground predicates like container_has_liquid(x) or material(x)=glass that a camera-only classifier would mark unknowable. The container-inventory task is the cleanest example: differentiating 0/100/200 mL of water in a closed bottle is a precondition/effect distinction (is_full before vs. after pouring) that lives entirely in sound.

The caveat for my line of work: SonicSense is a perception paper, not a planning or abstraction paper — there is no symbolic layer, no language, no composition over skills. Its relevance is as a sensor/predicate-grounding substrate that a BLADE-style system could consume, not as a method I'd compare planning against. The honest open question it raises for me: if these properties are only legible through active interaction (tapping, shaking), then the abstraction layer needs information-gathering actions as first-class operators, which BLADE's purely categorical, perception-passive operator model does not yet have.

Quotable

By shaking a container, we can tell its inventory status from the generated acoustic vibrations, such as the quantity and geometry of the objects inside. — §1 Introduction / p.1

On the other hand, contact microphones only sense the acoustic vibrations caused by physical contact. — §1 Introduction / p.1

Our framework underscores the significance of in-hand acoustic vibration sensing in advancing robot tactile perception. — Abstract / p.1

Papers cited that should likely be ingested next:

[6] Gao et al. 2021 — ObjectFolder — the implicit visual/auditory/tactile object dataset SonicSense tried to pre-train on (domain gap with air-mic data). → objectfolder_dataset_implicit_representations
[15] Chen, Chiquier, Lipson, Vondrick 2022 — The Boombox (CoRL) — same lab lineage; visual reconstruction from acoustic vibrations, the closest predecessor in spirit.
[16] Wall, Zöller, Brock 2023 — Passive and active acoustic sensing for soft pneumatic actuators (IJRR) — key prior acoustic-sensing baseline this paper positions against (single-finger, same-object).
[17] Lu & Culbertson 2023 — Active Acoustic Sensing for Robot Manipulation (IROS) — the active-acoustic counterpart; direct comparison point. → active_acoustic_sensing_manipulation
[51] Yuan et al. — Point Completion Network (PCN) — the shape-completion backbone reused here.

Newly ingested in the 2026-06-24 batch — directly relevant:

Active Acoustic Sensing for Robot Manipulation — the active-acoustic sibling (cited [17]); SonicSense is the passive contact-microphone analogue scaled to 83 objects.
VibeCheck — active acoustic tactile sensing; same "vibration as a tactile channel" premise, different excitation strategy.
Making Sense of Audio Vibration (pouring) — audio for liquid-amount estimation; the focused version of SonicSense's container-water-level result.
See, Hear, and Feel and ManiWAV — audio fused into manipulation policies; the policy-learning counterpart to SonicSense's perception-only models.
That Sounds Right and Hearing Touch — self-supervised auditory / audio-tactile pretraining; relevant to SonicSense's failed external-dataset pretraining and its domain-gap lesson.
The Sound of Simulation — generative audio for sim2real; directly addresses the synthetic contact-audio augmentation gap SonicSense leans on for shape reconstruction.
ObjectFolder, ObjectFolder 2.0, and ObjectFolder Benchmark — the multisensory (incl. audio) object datasets SonicSense compared against and found a domain gap with.
BLADE — Weiyu's anchor; SonicSense supplies acoustic grounding for predicates (material, container-fill) that BLADE's visual classifiers can't evaluate.