Active Acoustic Sensing for Robot Manipulation

Shihan Lu, Heather Culbertson · University of Southern California · 2023 · arXiv preprint · arXiv:2308.01600

One-liner. A $15 gripper sensor that injects a known vibration waveform into a grasped object through one fingertip and records how the object distorts it at the other fingertip — turning the object's acoustic resonance into a readout of material, shape, grasp position, internal state, and external contact, exactly the global, internally-hidden properties that vision and surface tactile sensors can't see.

Problem & motivation

Vision gives global-but-occludable information; optical tactile sensors (GelSight, etc.) give local surface geometry. Neither can perceive an object's internal state — how much water is in a closed cup, where a toy is inside a package, whether two metal bars differ in shape despite identical appearance. The combination of vision + haptics is also bulky, biased to surface features, and usually needs object-specific exploratory motions (shaking, tilting) before the actual task. The paper's framing question: "can we perceive an object's global states from a localized contact?" Humans do this by tapping a cup and listening to the resonance. The authors propose active acoustic sensing — emit a waveform through the grasp, capture the waveform after it travels through the object — as a compact tool that integrates into the normal grasping routine without interrupting it, and senses the object's state change (exteroceptive) rather than the sensor's own state.

Method

The contribution is split across a sensing concept, a hardware design, a modal-based simulator, and proof-of-concept tasks.

Sensing principle. The spectral characteristics of a waveform after it travels through an object depend on the object's resonant properties, which are set by material, shape, grasping point, internal-state distribution, and external contact formations (which add damping). By emitting a controlled excitation signal and recording the received signal (Fig 1), one can infer those latent states from spectral differences.

Excitation signals. Three controllable inputs: impulse (Dirac-like, 0.01 s, for impulse response), linear sweep, and exponential sweep (both 20 Hz–10 kHz over 0.5 s). Sweeps give wide spectral coverage for exciting resonances; the exponential sweep additionally emphasizes low frequencies, which helps differentiate contacts. All signals loop every 0.5 s. Each received window is transformed with an FFT; because actuator-to-mic vibration leaks through the gripper body and motors, only the 3–10 kHz band at 50 Hz steps is kept, giving (10,000 − 3,000) / 50 = 140 bins, i.e. a 140-dim feature vector fed to downstream classification/regression.
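
The signal generation and feature extraction described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the rectangular pulse shape, the choice to average FFT magnitudes inside each 50 Hz bin, and all function names are assumptions made for illustration, and the paper's exact windowing may differ.

```python
import numpy as np

FS = 44_100                    # sound-card sample rate from the hardware description (Hz)
DURATION = 0.5                 # loop period of every excitation signal (s)
F_LO, F_HI = 20.0, 10_000.0    # sweep range (Hz)

def impulse(fs=FS, duration=DURATION, width=0.01):
    """Rectangular stand-in for the 0.01 s Dirac-like pulse (pulse shape is an assumption)."""
    x = np.zeros(int(fs * duration))
    x[: int(fs * width)] = 1.0
    return x

def linear_sweep(fs=FS, duration=DURATION, f0=F_LO, f1=F_HI):
    """Sine sweep whose frequency rises linearly from f0 to f1."""
    t = np.arange(int(fs * duration)) / fs
    # Phase is 2*pi times the integral of f(t) = f0 + (f1 - f0) * t / T.
    phase = 2 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t**2 / duration)
    return np.sin(phase)

def exponential_sweep(fs=FS, duration=DURATION, f0=F_LO, f1=F_HI):
    """Sine sweep whose frequency rises exponentially from f0 to f1."""
    t = np.arange(int(fs * duration)) / fs
    k = np.log(f1 / f0)
    # Phase is 2*pi times the integral of f(t) = f0 * exp(k * t / T).
    phase = 2 * np.pi * f0 * duration / k * (np.exp(k * t / duration) - 1.0)
    return np.sin(phase)

def spectral_features(received, fs=FS, band=(3_000.0, 10_000.0), step=50.0):
    """Mean FFT magnitude in consecutive step-Hz bins across band (binning rule is an assumption)."""
    mag = np.abs(np.fft.rfft(received))
    freqs = np.fft.rfftfreq(len(received), d=1.0 / fs)
    n_bins = int(round((band[1] - band[0]) / step))       # (10000 - 3000) / 50 = 140
    edges = np.linspace(band[0], band[1], n_bins + 1)
    feats = np.empty(n_bins)
    for i in range(n_bins):
        mask = (freqs >= edges[i]) & (freqs < edges[i + 1])
        feats[i] = mag[mask].mean()
    return feats

# Stand-in for a recording: here the excitation itself is fed straight back in.
features = spectral_features(exponential_sweep())
print(features.shape)  # (140,)
```

In a real pipeline the argument to `spectral_features` would be the microphone recording rather than the excitation itself; the sketch only shows where the 140 dimensions come from.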

Hardware (Fig 3). On one Franka Hand fingertip, a bone-conduction actuator (Dayton Audio BCE-1, response to ~19 kHz, 22×14×8 mm) injects vibration directly into the contact object rather than into air. On the other fingertip, a piezoelectric contact microphone (Adafruit, r=7 mm, 20 Hz–20 kHz) records structure-borne vibration; being a contact mic, it is insensitive to ambient air sound. Both connect to an external sound card (Sound Blaster Play! 3) running at 44.1 kHz and controlled via PyAudio; the actuator is driven through a 2 W Class-D amplifier. 3D-printed housings with Sorbothane vibration-isolation pads damp actuator/motor leakage. Total prototype cost ~US$15, far cheaper than force/tactile sensors.

Modal-based simulator. To scale data collection, a two-pass simulator synthesizes the received signal as the displacement of the object vertex at the contact point. Pass 1 precomputes the object's modal model from a tetrahedral mesh via linear modal analysis: it solves the generalized eigenproblem for the stiffness/mass pair (K, M) (Eq 2) under Rayleigh damping C = αM + βK, which diagonalizes the dynamics into independent modes that can be solved analytically (Eq 3). Contact-dependent damping is added via the viscous contact model of Zheng and James [27] (Eq 4–5): damping scales with contact-force magnitude and direction at each contact point, with a material-dependent γ. Pass 2 runs a PyBullet robot simulation to extract time-series contact forces/positions for object-gripper and object-environment pairs (modeling object + two fingertips + environment as mass-spring-damper systems, Fig 4b), excites the modal model with those collision impulses, and reads the displacement at the vertex nearest the mic. Real idle-state "leaked vibration" is pre-recorded and superimposed onto the simulated output to match the physical signal.
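
To make Pass 1 concrete, here is a minimal sketch of linear modal analysis and synthesis on a toy three-mass spring chain instead of a tetrahedral mesh. It uses the textbook per-mode equation q̈_i + (α + βλ_i) q̇_i + λ_i q_i = φ_iᵀ f, which may differ in notation from the paper's Eq 2–3. The toy stiffness and mass values, the damping coefficients, the lumped (diagonal) mass matrix, and the function names are all illustrative assumptions; contact-dependent damping (Eq 4–5) is left out.

```python
import numpy as np

def modal_model(K, M):
    """Solve K phi = lambda M phi for symmetric K and a diagonal (lumped) mass matrix M."""
    m_inv_sqrt = 1.0 / np.sqrt(np.diag(M))
    A = m_inv_sqrt[:, None] * K * m_inv_sqrt[None, :]    # M^(-1/2) K M^(-1/2)
    lam, V = np.linalg.eigh(A)                           # eigenvalues in ascending order
    Phi = m_inv_sqrt[:, None] * V                        # mass-normalised: Phi.T @ M @ Phi = I
    return lam, Phi

def impulse_response(lam, Phi, alpha, beta, src, dst, t):
    """Displacement at DOF dst after a unit impulse at DOF src, summed over all modes."""
    c = alpha + beta * lam                   # Rayleigh damping seen by each mode
    omega_d = np.sqrt(lam - 0.25 * c**2)     # damped frequency (assumes every mode is underdamped)
    gain = Phi[dst] * Phi[src] / omega_d     # how strongly each mode couples src to dst
    modes = gain[:, None] * np.exp(-0.5 * c[:, None] * t) * np.sin(omega_d[:, None] * t)
    return modes.sum(axis=0)

# Toy object: three equal masses in a chain fixed at both ends (hypothetical values).
k, m = 1.0e6, 0.01                           # spring stiffness (N/m), mass (kg)
K = k * np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
M = m * np.eye(3)

lam, Phi = modal_model(K, M)
t = np.arange(0.0, 0.5, 1.0 / 44_100)        # 0.5 s at the sound-card sample rate
u = impulse_response(lam, Phi, alpha=5.0, beta=1e-7, src=0, dst=2, t=t)
```

Here `src` plays the role of the actuator-side contact and `dst` the vertex nearest the microphone. Because each mode has a closed-form response, the received signal is a cheap sum of damped sinusoids once the eigenproblem has been solved.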

Setup

Datasets / benchmarks: Self-collected. Seven objects spanning two shapes (bar, tube) and five materials (aluminium, steel, wood, ABS plastic, 3D-printed plastic); this is not the full 2 × 5 cross, which would give ten objects. Bar and tube of the same material are visually identical in-hand (Fig 5). Object recognition: each object grasped 20× (single-object scene) and 10× (multi-object scene with external contacts) per excitation signal. Grasping-position estimation: 13 grasp positions 5 mm apart, recorded 5× each. Only the first 1 s of each 5 s recording is used.
Hardware / simulator: Franka Emika arm + Franka Hand with the paired active acoustic sensor; 40 N constant grasp force; Intel RealSense D435 depth camera for object pose/position; antipodal grasp sampler (Dex-Net 2.0 [30]). Modal simulator built on PyBullet [28] + linear modal analysis/synthesis. Material params (density, Young's modulus, Poisson's ratio, damping α/β/γ) are listed in Table I for ABS, aluminium, steel, and wood.
Baselines: not reported as competing methods. Comparison is across three excitation signals (impulse / linear / exponential) and three off-the-shelf classifiers (KNN, SVM, MLP via scikit-learn). No comparison to vision-only or optical-tactile sensing baselines.
Compute: not reported.

Results

Two proof-of-concept tasks on rigid objects.

Object recognition (Table II). Best single→single accuracy: 82.9% (exponential sweep, MLP); linear sweep was the most robust across all three classifiers (up to 80.0% MLP). For the harder single→multiple transfer (train on single-object scene, test on multi-object scene with external contacts), the exponential sweep dominated at 84.3% (MLP), while the impulse signal collapsed to ~50% — impulses lack the spectral richness to transfer to complex contact scenes. Confusion matrices (Fig 8) show aluminium tube misclassified as steel bar (similar metal properties, different shape) and the two 3D-printed-plastic shapes confused.

Train→Test	Signal	KNN (%)	SVM (%)	MLP (%)
Single→Single	Impulse	75.7	69.3	81.4
Single→Single	Linear	78.6	77.1	80.0
Single→Single	Exponential	67.1	66.4	82.9
Single→Multiple	Impulse	47.1	55.7	51.4
Single→Multiple	Linear	67.1	60.0	72.9
Single→Multiple	Exponential	70.0	70.0	84.3
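
The train-on-single, test-on-multiple protocol behind the table can be sketched with scikit-learn as follows. This is a hedged illustration only: the feature matrices are synthetic placeholders (random per-object spectral templates, with the multi-object scene mimicked by a uniform attenuation), the hyperparameters are guesses, and the resulting scores say nothing about the paper's numbers.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
N_OBJECTS, N_FEATS = 7, 140

# SYNTHETIC placeholder spectra: one random template per object (not real data).
templates = rng.uniform(0.0, 1.0, size=(N_OBJECTS, N_FEATS))

def make_scene(n_per_object, damping, noise):
    """Draw noisy copies of each template; damping < 1 crudely mimics external contacts."""
    X = np.concatenate([
        damping * templates[c] + noise * rng.standard_normal((n_per_object, N_FEATS))
        for c in range(N_OBJECTS)
    ])
    y = np.repeat(np.arange(N_OBJECTS), n_per_object)
    return X, y

# 20 grasps per object in the single-object scene, 10 in the multi-object scene.
X_single, y_single = make_scene(20, damping=1.0, noise=0.3)
X_multi, y_multi = make_scene(10, damping=0.8, noise=0.3)

# Hyperparameters below are illustrative assumptions, not values from the paper.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(kernel="rbf"),
    "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
}

scores = {}
for name, model in models.items():
    clf = make_pipeline(StandardScaler(), model)   # scale features, then classify
    clf.fit(X_single, y_single)                    # train on the single-object scene only
    scores[name] = clf.score(X_multi, y_multi)     # accuracy on the multi-object scene

for name, acc in scores.items():
    print(f"{name}: {100 * acc:.1f}%")
```

With 7 objects × 10 grasps the multi-object test set holds 70 samples, so each test sample is worth about 1.4 percentage points, which is the granularity visible in the Single→Multiple rows.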

Grasping-position estimation (Fig 9). KNN regression (3 neighbors, 3-fold CV) predicts distance-from-center. Best RMSE per object: wood bar 0.9 mm (linear), aluminium bar 1.1 mm (linear), aluminium tube 2.1 mm (linear), ABS plastic bar 3.5 mm (impulse). Linear sweep won on 3/4 objects; on ABS plastic, the sweeps lost to the impulse, attributed to object slip under the intense sweep vibration on a slippery plastic surface.
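
The regression setup above (3-neighbor KNN, 3-fold cross-validation, RMSE in millimetres) can be sketched in plain NumPy. The feature matrix here is a synthetic placeholder that varies smoothly with grasp position, the position labels 0 to 60 mm are an assumed layout of the 13 positions, and the helper names are hypothetical; the sketch shows the evaluation protocol, not the paper's data or results.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test label as the mean label of its k nearest training samples."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def cv_rmse(X, y, k=3, n_folds=3, seed=0):
    """Shuffled n-fold cross-validated RMSE of the KNN regressor."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        pred = knn_predict(X[train], y[train], X[test], k)
        sq_err.append((pred - y[test]) ** 2)
    return float(np.sqrt(np.concatenate(sq_err).mean()))

# 13 grasp positions 5 mm apart, 5 recordings each -> 65 samples (labels in mm, assumed layout).
positions = np.repeat(np.arange(13) * 5.0, 5)

# SYNTHETIC placeholder features: 140 bins that drift smoothly with position, plus noise.
rng = np.random.default_rng(1)
bins = np.arange(140)
X = np.cos(2 * np.pi * np.outer(positions, bins) / (60.0 * 140))
X += 0.05 * rng.standard_normal(X.shape)

rmse_mm = cv_rmse(X, positions)
print(f"cross-validated RMSE: {rmse_mm:.2f} mm")
```

Because neighbours are averaged, a prediction can land between the 5 mm grid points, which is how sub-5 mm RMSE values are possible with this estimator.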

Sim-vs-real validation (Fig 6). Qualitative only. Simulated spectra match real ones for most material/shape/grasp/contact conditions, but the simulator failed for the aluminium tube (its shape pushed eigenvalues toward singularity) and for aluminium-bar-on-steel contact (hard to get accurate contact-dynamics values). No quantitative sim2real number is reported.

Limitations & open questions

From the authors:

Modal simulation is inefficient for high-frequency sweep signals (many collisions to synthesize).
Linear modal analysis assumes rigid bodies undergoing only small deformations, so it doesn't apply to thin-shelled or soft objects; it also assumes isotropic, homogeneous material, which most everyday objects violate.
Rigid-fluid (water in a bottle) and rigid-deformable (plush toy in a package) interactions — precisely the high-value "internal state" cases — cannot yet be modeled; needs more advanced physics.
Sim material params must be manually tuned (α, β, γ); 3D-printed objects weren't simulated at all (unknown material properties).

What I noticed reading it:

The evaluation is a closed-set classification of 7 lab objects and single-axis grasp regression on 4 objects — a genuine proof-of-concept, not a generalization study. The headline capabilities advertised in the abstract (internal state, mass/volume, flow distribution) are illustrated qualitatively (Fig 2 cup-with-liquid) but never quantitatively evaluated; the experiments stay on rigid objects.
The whole approach "relies on pre-recorded data or simulated data from the object for the state inference" (authors' own words) — i.e. it's object-instance-conditioned, not a zero-shot sensor. Generalization to unseen objects is asserted ("inference model easily generalized") but not tested.
No vision-only or optical-tactile baseline, so the central claim that this sees what those can't is argued, not measured. The bar-vs-tube "visually identical" point is the strongest evidence, and it's anecdotal.
Grasp force is fixed at 40 N throughout; since damping is force-dependent (Eq 4), robustness to grasp-force variation — a real concern for any deployed gripper — is untested.
Only the first 1 s of 5 s recordings is used, with no ablation on window length; the time budget the sensor actually needs is unclear.

Why I care

This paper is a clean instance of the batch thesis that many manipulation predicates are not visually evaluable — is_full(cup), contains(package, toy), material_is(x, steel), grasp_position(x) live in sound and vibration, not pixels. Active acoustic sensing is a particularly nice fit for the BLADE predicate-grounding story: BLADE currently learns predicate classifiers f_θ(p): O → {T, F} over visual observations and flags that internal/contact states are a gap. A sensor like this is exactly the kind of non-visual observation channel a turned-on(faucet)-style classifier could read — e.g. a learned is_full(cup) or in_contact_with(x, table) predicate grounded in the 140-dim spectral feature rather than in cropped pixels. It also has a BLADE-adjacent active flavor: the robot controls the excitation signal, so the sensory output is action-conditioned — resonant with the "controllable data collection / tight action-to-sensation coupling" the authors emphasize, and with the abstraction-layer view that the agent should choose informative probing actions.

Two caveats on relevance. First, it's a 2023 hardware/sensing paper with no learning of abstractions, language, or planning — the ML is off-the-shelf KNN/SVM/MLP. So it's a sensor-modality anchor for the "predicates in touch/sound" thesis, not a method to build on directly. Second, its instance-conditioned pre-recording requirement is the opposite of the open-vocabulary, generalize-to-novel-objects regime BLADE targets — which is itself a useful tension to flag.

Quotable

Active acoustic sensing utilizes the state change of the objects, rather than the state change of the sensor itself, which is called exteroceptive sensing. — §III / p.2

In daily activities, a person can gauge the water volume in a cup by listening to the resonance due to tapping. — §I / p.1

The total cost of a prototype is about 15 US dollars, which is notably more affordable compared to common force or tactile sensors. — §IV.B / p.4

Papers cited that should likely be ingested next:

[16] Clarke et al. 2018 — Learning audio feedback for estimating amount and flow of granular material (CoRL) — audio-for-pouring/flow estimation; closest prior on inferring internal container state from sound. See batch slug making_sense_audio_vibration_pouring.
[17] Clarke et al. 2022 — DiffImpact: Differentiable rendering and identification of impact sounds (CoRL) — analysis-by-synthesis for impact-sound object properties; the differentiable counterpart to this paper's modal simulator.
[18] Du et al. 2022 — Play it by Ear (RSS) — audio-visual imitation under occlusion. See batch slug play_it_by_ear_audio_visual_imitation.
[20] Donlon et al. 2018 — GelSlim (IROS) — the optical-tactile sensor this paper positions itself as complementing.
[27] Zheng & James 2011 — Toward high-quality modal contact sound (SIGGRAPH Asia) — the viscous contact-damping model adopted for the simulator (Eq 4–5).

Newly ingested in the 2026-06-24 batch — directly relevant:

VibeCheck — the closest sibling: also active acoustic/vibroacoustic tactile sensing; direct comparison point for the actuator-plus-receiver probing paradigm.
SonicSense — in-hand acoustic-vibration sensing for object properties; the passive/contact-vibration counterpart to this active method.
Making Sense of Audio Vibration (pouring) — inferring liquid state from vibration during pouring; the internal-state use case this paper illustrates but doesn't quantify.
Hearing Touch and See, Hear, and Feel — audio-as-a-sensing-channel-for-manipulation papers; situate active acoustics within the broader audio-for-manipulation cluster.
The Sound of Simulation — generative audio sim2real; the modern counterpart to this paper's modal-synthesis simulator-for-data-scaling goal.
ObjectFolder 2.0 — multisensory (incl. acoustic) object dataset with sim2real; the dataset/simulator infrastructure this paper's "scale up via simulation" ambition points toward.