SonicSense: Object Perception from In-Hand Acoustic Vibration
Jiaxun Liu, Boyuan Chen · Duke University · CoRL 2024 · arXiv:2406.17932
One-liner. A cheap (~$215) four-finger robot hand with a
piezoelectric contact microphone in each fingertip taps, grasps, and shakes
83 real-world objects, then uses the contact-borne vibration for four
tasks: container inventory differentiation, material classification,
sparse-tapping 3D shape reconstruction, and object re-identification. The
last three use learned end-to-end models; container inventory is shown
qualitatively. The takeaway is that "what an object sounds like when you
touch it" is a rich, noise-robust perception channel that scales past the
small-N, single-finger acoustic-sensing setups that came before.
Problem & motivation
Acoustic vibration is an underused tactile channel: human skin captures
high-frequency vibrations to read material and geometry, but robots rarely
exploit this. Prior acoustic-sensing work is constrained on three axes: (1)
small object sets (N<5) of simple primitives with homogeneous
materials; (2) single-finger sensing; (3) train/test on the same
objects (different contacts, same instance), so generalization to unseen
objects is unproven. Worse, much prior data is collected by a human manually
moving the hand or by replaying fixed pre-defined poses, which doesn't scale.
A second framing issue: air microphones near the robot pick up
sound waves through air and are swamped by ambient noise, whereas
contact microphones sense only vibrations transmitted through
physical contact. SonicSense argues for a holistic hardware+software design
built around contact microphones, autonomous heuristic exploration, and a
large diverse real-world object set, so that one system generalizes across
multiple object-perception tasks on unseen objects.
Method
The contribution is a co-designed sensing platform, an autonomous heuristic
data-collection policy, and three task models.
Hardware (Fig 2). A four-finger hand with one joint (one DoF) per finger,
enabling tapping, grasping, and shaking primitives. Each fingertip
embeds a piezoelectric contact microphone inside the 3D-printed
plastic shell; a round counterweight is mounted on the outer shell to
increase finger momentum — the authors found the counterweight is
important for producing large striking vibrations during tapping. The four
microphones are synchronized and sample at 44,100 Hz. Total build
cost is $215.26 from off-the-shelf parts and 3D printing.
Heuristic interaction policy (§3.3). Rather than human
teleop or fixed-pose replay, a simple but effective heuristic autonomously
collects responses across objects of varying size and geometry. The hand approaches from
top-down and side directions, from high to low with a fixed step, until a
first contact event is detected in each direction. Contact is detected from motor
voltage-resistance feedback, which the authors chose over detecting a change in the
acoustic signal for robustness and ease. From those contacts it estimates object height (top-down) and radius
(side), then uses a grid-sampling schedule to tap-contact the surface and
gather acoustic responses. Stated assumptions: max object height fits the
hand, object size is graspable during tapping, and the object is fixed to the
table (following prior work).
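The exploration loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the millimetre units, the step size, and the lambda contact checks (standing in for the motor-feedback contact detector) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def first_contact(start_mm: int, stop_mm: int, step_mm: int,
                  in_contact: Callable[[int], bool]) -> int:
    """Move from start_mm toward stop_mm in fixed steps and return the first
    position where in_contact fires (stop_mm if it never fires)."""
    pos = start_mm
    while pos > stop_mm:
        if in_contact(pos):
            return pos
        pos -= step_mm
    return stop_mm


def tap_grid(height_mm: int, n_heights: int,
             n_angles: int) -> List[Tuple[float, float]]:
    """Evenly spaced (height_mm, angle_deg) tap targets over the estimated
    object extent; the estimated radius would set how far the fingers close."""
    heights = [height_mm * (i + 0.5) / n_heights for i in range(n_heights)]
    angles = [360.0 * j / n_angles for j in range(n_angles)]
    return [(h, a) for h in heights for a in angles]


# Hypothetical contact checks for a toy object 200 mm tall with a 50 mm radius.
height_mm = first_contact(300, 0, 10, lambda z: z <= 200)  # top-down approach
radius_mm = first_contact(150, 0, 10, lambda r: r <= 50)   # side approach
targets = tap_grid(height_mm, n_heights=3, n_angles=4)
```

With these toy values the two probes stop at the object's top and side, and the grid yields 12 tap targets. The fixed step bounds how precisely height and radius can be estimated, which is one reason the contact locations fed to the shape model are described as rough.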
Material classification model (Fig 4A). Input is the
Mel-spectrogram A of a single tapping position's vibration signal;
three CNN layers + two MLP layers output a per-contact material label
m, trained with cross-entropy (Eq. 1). At inference, an
iterative refinement assumes material is locally uniform: filter
low-occurrence predictions with threshold M, then re-assign each
point by majority vote among its K nearest neighbors, repeated
N steps (M,K,N chosen on validation).
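A minimal sketch of the refinement step, under stated assumptions: the threshold M is interpreted here as a minimum fraction of all contacts (`min_frac`), a point's own label is excluded from its vote, and ties go to the label seen first. None of these details are confirmed by the summary above, and the function and parameter names are hypothetical.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def refine_labels(points: Sequence[Point], labels: Sequence[str],
                  min_frac: float, k: int, n_steps: int) -> List[str]:
    """Filter rare labels, then repeat a k-nearest-neighbour majority vote."""
    labels = list(labels)
    # Labels predicted on fewer than min_frac of the contacts are treated as noise.
    counts = Counter(labels)
    keep = {lab for lab, c in counts.items() if c / len(labels) >= min_frac}
    # Precompute each point's k nearest neighbours (Euclidean, excluding itself).
    neighbours = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        neighbours.append(others[:k])
    for _ in range(n_steps):
        new_labels = []
        for i, nbrs in enumerate(neighbours):
            votes = Counter(labels[j] for j in nbrs if labels[j] in keep)
            # Keep the current label if no neighbour carries a surviving label.
            new_labels.append(votes.most_common(1)[0][0] if votes else labels[i])
        labels = new_labels
    return labels


# Five contacts along a line; one isolated "foam" prediction among "glass".
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
       (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
raw = ["glass", "glass", "foam", "glass", "glass"]
refined = refine_labels(pts, raw, min_frac=0.3, k=2, n_steps=1)
```

In this toy case the lone "foam" prediction falls below the occurrence threshold and is overwritten by its neighbours. The same mechanism can erase genuinely small material regions on multi-material objects, which is presumably why M, K, and N are tuned on validation data.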
Shape reconstruction model (Fig 4B). Sparse contact points
C (rough, noisy fingertip-contact locations) are completed into a
dense 2,024-point cloud P via a Point Completion Network: two
stacked PointNet layers encode C to a 1×1024 global feature,
then fully-connected decoder layers produce the cloud (no folding decoder).
Trained with Chamfer Distance loss (Eqs. 2–3). Because real contact data
is limited, the network is pre-trained on synthetic contact interactions
generated in simulation (Fig 5), then trained with a curriculum
that gradually shifts from synthetic to real data.
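For reference, a small sketch of the two training ingredients named above. The Chamfer function uses one common symmetric form (the mean nearest-neighbour distance in each direction, averaged); the paper's Eqs. 2–3 may normalize differently. The linear schedule is purely an assumption, since the summary only says the mix shifts gradually from synthetic to real.

```python
from math import dist
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_l1(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Symmetric Chamfer distance with unsquared Euclidean distances."""
    def one_way(a: Sequence[Point], b: Sequence[Point]) -> float:
        # Mean distance from each point in a to its nearest point in b.
        return sum(min(dist(p, q) for q in b) for p in a) / len(a)
    return 0.5 * (one_way(pred, target) + one_way(target, pred))


def synthetic_fraction(epoch: int, n_epochs: int) -> float:
    """Hypothetical linear curriculum: all synthetic at epoch 0, all real
    at the final epoch."""
    if n_epochs < 2:
        return 0.0
    return max(0.0, 1.0 - epoch / (n_epochs - 1))


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1)]
cd = chamfer_l1(pred, target)
mix = [synthetic_fraction(e, 11) for e in (0, 5, 10)]
```

The brute-force nearest-neighbour search is quadratic in the number of points, which is fine for a sketch but would normally be replaced by a batched GPU implementation during training.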
Object re-identification model (Fig 4C). Hardest task: after
interacting with an object, re-identify it from a new, different set
of tapping interactions (different contact locations). The input combines 15
Mel-spectrogram channels, encoded by a convolutional audio encoder, with the
corresponding 15 contact positions, encoded by a PCN contact-point encoder;
MLP layers over the fused features output the object label among 82 objects
(Eq. 4). Ablations drop either branch to test the
contribution of audio vs. contact geometry.
Setup
- Datasets / benchmarks: Self-collected real-world dataset
of 83 objects (54 everyday + 29 3D-printed primitives with
various surface materials), spanning nine material categories (plastic,
glass, wood, metal, ceramic, paper, rubber, foam, fabric); 22.9% are
multi-material. Provides high-quality 3D-scanned meshes/point clouds plus
per-point material annotations. Material classification uses object-level
splits of 60/11/11 (train/val/test), 3 splits for mean±std;
re-identification is over 82 objects.
- Hardware / simulator: Custom four-finger, 4-DoF hand
with one piezoelectric contact microphone + counterweight per fingertip,
44.1 kHz synchronized sampling, ~$215 build. A simulation environment
augments shape-reconstruction training with synthetic contact interactions
(Fig 5).
- Baselines: Random search and Nearest Neighbor baselines
across all three quantitative tasks; for re-identification also Ours
(contact-points only) and Ours (audio only) ablations. The authors also
tested pre-training the material classifier on external acoustic-material
datasets (cited [6,45,54]).
- Compute: All models trained on a single NVIDIA GeForce
RTX 3090 GPU; each training run takes from under an hour to a few hours.
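The object-level splitting mentioned in the Datasets bullet can be illustrated as below. This is a sketch under assumptions: the object IDs, the seeds, and the shuffling procedure are hypothetical, and only the 60/11/11 sizes and the use of three splits come from the summary.

```python
import random
from typing import List, Sequence, Tuple


def object_level_split(object_ids: Sequence[str], n_train: int, n_val: int,
                       n_test: int, seed: int
                       ) -> Tuple[List[str], List[str], List[str]]:
    """Split by object, so every contact from one object lands in exactly
    one of train/val/test."""
    ids = sorted(object_ids)
    if n_train + n_val + n_test > len(ids):
        raise ValueError("split sizes exceed the number of objects")
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Hypothetical object IDs; three seeds stand in for the three reported splits.
objects = [f"obj_{i:02d}" for i in range(82)]
splits = [object_level_split(objects, 60, 11, 11, seed=s) for s in range(3)]
```

Splitting by object rather than by contact is what makes the reported material scores a test of generalization to unseen objects: a per-contact split would leak taps from the same instance into both train and test.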
Results
Headline numbers (means from Figs 8–11):
| Task / metric | Random | Nearest Neighbor | SonicSense |
| --- | --- | --- | --- |
| Material classification (avg F1) | 0.115 | 0.371 | 0.763 (0.523 before refinement) |
| Shape reconstruction (Chamfer-L1, m, lower better) | 0.03653 | 0.01347 | 0.00876 |
| Object re-identification (accuracy) | 1.33% | 43.45% | 92.52% |
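To put the table's margins in relative terms, the short calculation below derives the gaps directly from the reported means. It adds no new measurements; the variable names are just labels for the table entries.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better metric versus a baseline."""
    return (baseline - ours) / baseline


# Values copied from the table above.
cd_vs_nn = relative_reduction(0.01347, 0.00876)      # Chamfer-L1 vs Nearest Neighbor
cd_vs_random = relative_reduction(0.03653, 0.00876)  # Chamfer-L1 vs Random
f1_gain_from_refinement = 0.763 - 0.523              # absolute F1 gain
reid_gap_points = 92.52 - 43.45                      # accuracy points vs Nearest Neighbor

print(f"Chamfer-L1 reduction vs NN:     {cd_vs_nn:.1%}")
print(f"Chamfer-L1 reduction vs random: {cd_vs_random:.1%}")
print(f"F1 gain from refinement:        {f1_gain_from_refinement:.3f}")
print(f"Re-ID gap vs NN:                {reid_gap_points:.2f} points")
```

That works out to roughly a 35% lower Chamfer-L1 than Nearest Neighbor and 76% lower than random, a 0.240 absolute F1 gain from refinement, and about a 49-point re-identification margin over Nearest Neighbor.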
Key findings:
- Material: iterative refinement lifts F1 from 0.523 to
0.763 and beats both baselines, generalizing to unseen objects (Fig 8A).
Confusion matrix (Fig 8B): even soft materials (foam, fabric) that produce
weak striking signals are classified well via subtle stiffness cues;
ceramic↔glass are confused (similar properties), and plastic is hard
because the dataset's plastics vary widely in thickness/stiffness.
- Material pre-training failed: pre-training on external
acoustic-material datasets [6,45,54] hurt performance because those
signals were collected with air microphones in noise-controlled settings
— a large domain gap from in-hand contact microphones.
- Shape: Chamfer-L1 0.00876 m vs 0.03653 (random) and
0.01347 (NN); reconstructs concave geometries; the sim-augmentation likely
drives generalization (e.g., a single wine-glass instance is still
reconstructed). Failures: spray nozzle, bottle cap (complex shape in a
small region; few such objects in data).
- Re-identification: 92.52% with both branches; ablations
show audio-only 84.11% vs contact-points-only 48.89% — acoustic
vibration is the more informative branch, and fusing rough contact
positions adds a further boost. Smaller objects and similar-material pairs
(ceramic/glass) are harder.
- Noise robustness (Fig 6): as injected Gaussian white
noise rose from ~38 to ~80 dB, the external air microphone's amplitude rose
by thousands of units while the contact fingertip microphone changed by only
a few units — strong evidence the hardware isolates physical-contact
vibration from ambient air noise.
- Container inventory (Fig 7): qualitatively, vibration
signals from shaking a container of dice (varying count/shape) and from a
bottle with varying water amounts (0/100/200 mL static; 100/200/300 mL
shaking) form clear t-SNE clusters over 12 hand-crafted acoustic features
(RMS, spectral centroid, bandwidth, contrast, flatness, roll-off, ZCR,
tempogram, poly features, MFCC, chroma, tonnetz), distinguishing both solid
counts/geometries and subtle continuous liquid states.
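Three of the twelve hand-crafted features listed for the container-inventory plots (RMS, zero-crossing rate, spectral centroid) are simple enough to write out. The sketch below uses textbook definitions over a whole clip and a naive DFT; the paper's framing, windowing, and feature library are not stated in this summary, so treat the details as assumptions.

```python
import cmath
import math
from typing import Sequence


def rms(x: Sequence[float]) -> float:
    """Root-mean-square amplitude of the clip."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def zero_crossing_rate(x: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(x, x[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(x) - 1)


def spectral_centroid(x: Sequence[float], sample_rate: float) -> float:
    """Magnitude-weighted mean frequency (Hz) from a naive one-sided DFT."""
    n = len(x)
    weighted, total = 0.0, 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        weighted += (k * sample_rate / n) * abs(coeff)
        total += abs(coeff)
    return weighted / total if total else 0.0


# Toy clip: a 1 kHz tone sampled at 8 kHz (the hardware samples at 44,100 Hz).
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
features = (rms(tone), spectral_centroid(tone, 8000))
```

The naive DFT is quadratic in clip length and is only meant to make the definition explicit; an FFT-based audio library would be the practical choice for real 44.1 kHz recordings.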
Limitations & open questions
From the authors:
- The object is assumed fixed to the table; relaxing this needs
online object estimation and tactile-informed tracking during interaction.
- The system does not consider multiple objects; in cluttered
scenes acoustic vibration could still help but isn't tested.
- Hardware lacks complex manipulation skill (dexterous manipulation); a
more anthropomorphic, higher-DoF hand is needed to extend the findings.
- Focus is acoustic-only; integrating multiple sensing modalities for
complementary information is left as future work.
What I noticed reading it:
- Container-inventory results are only qualitative t-SNE plots and
spectrogram visualizations — no classifier, no accuracy number, no
baseline. This is the headline "shake a container to read its inventory"
capability and it is the least quantified of the four tasks.
- The dataset is small in instance count per material (60/11/11 object
split), and the test set is "unbalanced caused by different contact points
per object," which is why F1 rather than accuracy is reported. With only 11
held-out objects per split, the variance of material F1 across the
3 splits matters more than the headline mean conveys.
- Shape reconstruction leans heavily on synthetic pre-training; the
paper concedes the wine-glass result is "likely achieved through our
augmentation dataset," which raises the question of how much real acoustic
contact is actually doing the work vs. a learned 3D shape prior over a
limited primitive library.
- The heuristic policy estimates only height and radius (effectively a
bounding cylinder), so contact sampling on genuinely complex geometries
(the failure cases) is under-served by the very exploration policy meant
to feed the shape model. The paper doesn't analyze this coupling.
- All tasks are trained as separate networks; despite the "holistic"
framing there's no shared representation or multi-task model, so the claim
that one channel supports four tasks is a claim about the sensor,
not about a unified perception model.
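The point about F1 versus accuracy on an unbalanced test set can be made concrete with a toy example. The sketch assumes "avg F1" means an unweighted (macro) average over classes, which the summary does not confirm, and the labels below are invented for illustration.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """F1 for each class, computed from true/false positives and false negatives."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented example: nine plastic contacts, one glass contact, and a model
# that predicts "plastic" everywhere.
y_true = ["plastic"] * 9 + ["glass"]
y_pred = ["plastic"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

Here accuracy is 0.9 while macro F1 is about 0.47, because the missed minority class counts as much as the majority class. With contact counts varying per object, that gap is the reason F1 is the more honest headline metric.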
Why I care
This sits squarely on the batch thesis that many manipulation predicates are
not visually evaluable — is_full (water in
an occluded bottle), object material (surface_is_rough /
is_ceramic), and even object identity under occlusion are read
here purely from contact-borne sound, exactly the touch/force/sound channel
that visual predicate classifiers in BLADE
cannot see. BLADE learns neural predicate classifiers from RGB-D and explicitly
flags partial observability and force-modulated tasks as bottlenecks;
SonicSense is a concrete demonstration that an acoustic classifier
could ground predicates like container_has_liquid(x) or
material(x)=glass that a camera-only classifier would mark
unknowable. The container-inventory task is the cleanest example: differentiating
0/100/200 mL of water in a closed bottle is a precondition/effect distinction
(is_full before vs. after pouring) that lives entirely in sound.
The caveat for my line of work: SonicSense is a perception paper, not
a planning or abstraction paper — there is no symbolic layer, no language,
no composition over skills. Its relevance is as a sensor/predicate-grounding
substrate that a BLADE-style system could consume, not as a method I'd
compare planning against. The honest open question it raises for me: if these
properties are only legible through active interaction (tapping, shaking), then
the abstraction layer needs information-gathering actions as
first-class operators, which BLADE's purely categorical, perception-passive
operator model does not yet have.
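One way to picture "information-gathering actions as first-class operators" is a predicate that stays unknown until a sensing action resolves it. The sketch below is entirely hypothetical: `BeliefState`, `SensingOperator`, and the lambda classifier are illustrative names, not part of BLADE or SonicSense.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class BeliefState:
    """Predicate values are True, False, or None (not yet observable)."""
    facts: Dict[str, Optional[bool]] = field(default_factory=dict)

    def value(self, predicate: str) -> Optional[bool]:
        return self.facts.get(predicate)


@dataclass(frozen=True)
class SensingOperator:
    """An action whose only effect is to make one predicate known."""
    name: str
    observes: str

    def apply(self, state: BeliefState,
              classify: Callable[[], bool]) -> BeliefState:
        updated = dict(state.facts)
        updated[self.observes] = classify()  # e.g. an acoustic classifier
        return BeliefState(updated)


before = BeliefState()
shake = SensingOperator(name="shake(bottle)",
                        observes="container_has_liquid(bottle)")
# Stand-in for a classifier run on the vibration recorded while shaking.
after = shake.apply(before, classify=lambda: True)
```

The design choice worth noting is that `apply` changes what the planner knows rather than the world state, so a planner could schedule `shake(bottle)` before any operator whose precondition depends on the fill level.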
Quotable
By shaking a container, we can tell its inventory status from the generated
acoustic vibrations, such as the quantity and geometry of the objects inside.
— §1 Introduction / p.1
On the other hand, contact microphones only sense the acoustic vibrations
caused by physical contact.
— §1 Introduction / p.1
Our framework underscores the significance of in-hand acoustic vibration
sensing in advancing robot tactile perception.
— Abstract / p.1
Related
Papers cited that should likely be ingested next:
- [6] Gao et al. 2021 — ObjectFolder — the
implicit visual/auditory/tactile object dataset SonicSense tried to
pre-train on (domain gap with air-mic data).
→ objectfolder_dataset_implicit_representations
- [15] Chen, Chiquier, Lipson, Vondrick 2022 — The Boombox
(CoRL) — same lab lineage; visual reconstruction from acoustic
vibrations, the closest predecessor in spirit.
- [16] Wall, Zöller, Brock 2023 — Passive and active
acoustic sensing for soft pneumatic actuators (IJRR) — key
prior acoustic-sensing baseline this paper positions against (single-finger,
same-object).
- [17] Lu & Culbertson 2023 — Active Acoustic Sensing for
Robot Manipulation (IROS) — the active-acoustic counterpart;
direct comparison point.
→ active_acoustic_sensing_manipulation
- [51] Yuan et al. — Point Completion Network (PCN)
— the shape-completion backbone reused here.
Newly ingested in the 2026-06-24 batch — directly relevant:
- Active Acoustic Sensing for Robot Manipulation
— the active-acoustic sibling (cited [17]); SonicSense is the passive
contact-microphone analogue scaled to 83 objects.
- VibeCheck
— active acoustic tactile sensing; same "vibration as a tactile channel"
premise, different excitation strategy.
- Making Sense of Audio Vibration (pouring)
— audio for liquid-amount estimation; the focused version of
SonicSense's container-water-level result.
- See, Hear, and Feel
and ManiWAV
— audio fused into manipulation policies; the policy-learning
counterpart to SonicSense's perception-only models.
- That Sounds Right
and Hearing Touch
— self-supervised auditory / audio-tactile pretraining; relevant to
SonicSense's failed external-dataset pretraining and its domain-gap lesson.
- The Sound of Simulation
— generative audio for sim2real; directly addresses the synthetic
contact-audio augmentation gap SonicSense leans on for shape reconstruction.
- ObjectFolder,
ObjectFolder 2.0, and
ObjectFolder Benchmark
— the multisensory (incl. audio) object datasets SonicSense compared
against and found a domain gap with.
- BLADE
— Weiyu's anchor; SonicSense supplies acoustic grounding for predicates
(material, container-fill) that BLADE's visual classifiers can't evaluate.