Active Acoustic Sensing for Robot Manipulation

Shihan Lu, Heather Culbertson · University of Southern California · 2023 · arXiv preprint · arXiv:2308.01600

One-liner. A $15 gripper sensor that injects a known vibration waveform into a grasped object through one fingertip and records how the object distorts it at the other fingertip — turning the object's acoustic resonance into a readout of material, shape, grasp position, internal state, and external contact, exactly the global, internally-hidden properties that vision and surface tactile sensors can't see.

Problem & motivation

Vision gives global-but-occludable information; optical tactile sensors (GelSight, etc.) give local surface geometry. Neither can perceive an object's internal state — how much water is in a closed cup, where a toy is inside a package, whether two metal bars differ in shape despite identical appearance. The combination of vision + haptics is also bulky, biased to surface features, and usually needs object-specific exploratory motions (shaking, tilting) before the actual task. The paper's framing question: "can we perceive an object's global states from a localized contact?" Humans do this by tapping a cup and listening to the resonance. The authors propose active acoustic sensing — emit a waveform through the grasp, capture the waveform after it travels through the object — as a compact tool that integrates into the normal grasping routine without interrupting it, and senses the object's state change (exteroceptive) rather than the sensor's own state.

Method

The contribution is split across a sensing concept, a hardware design, a modal-based simulator, and proof-of-concept tasks.

Sensing principle. The spectral characteristics of a waveform after it travels through an object depend on the object's resonant properties, which are set by material, shape, grasping point, internal-state distribution, and external contact formations (which add damping). By emitting a controlled excitation signal and recording the received signal (Fig 1), one can infer those latent states from spectral differences.

Excitation signals. Three controllable inputs: impulse (Dirac-like, 0.01 s, for impulse response), linear sweep, and exponential sweep (both 20 Hz–10 kHz over 0.5 s). Exponential sweeps emphasize low frequencies to better differentiate contacts; sweeps give wide spectral coverage for resonance. All signals loop every 0.5 s. Received windows are FFT'd; because actuator-to-mic vibration leaks through the gripper body and motors, only the 3–10 kHz band at 50 Hz steps is kept — a 140-dim feature vector fed to downstream classification/regression.
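As a rough illustration of the signal pipeline described above, the sketch below generates the two sweeps and reduces one received window to a 140-dim feature vector. It is not the authors' code. The function names are hypothetical, the sweep formulas are the standard sine-chirp definitions, and averaging FFT magnitudes inside each 50 Hz bin is an assumption about how the "50 Hz steps" are obtained. The impulse is left out because its exact pulse shape is not given here.

```python
import numpy as np

FS = 44_100               # sample rate reported for the sound card (Hz)
DURATION = 0.5            # sweep length and loop period (s)
F0, F1 = 20.0, 10_000.0   # sweep band (Hz)


def time_axis(duration=DURATION, fs=FS):
    """Sample times for one loop period."""
    return np.arange(int(round(duration * fs))) / fs


def linear_sweep(f0=F0, f1=F1, duration=DURATION, fs=FS):
    """Sine chirp whose instantaneous frequency rises linearly from f0 to f1."""
    t = time_axis(duration, fs)
    phase = 2 * np.pi * (f0 * t + (f1 - f0) / (2 * duration) * t**2)
    return np.sin(phase)


def exponential_sweep(f0=F0, f1=F1, duration=DURATION, fs=FS):
    """Sine chirp whose frequency grows exponentially, so it dwells longer at low frequencies."""
    t = time_axis(duration, fs)
    rate = np.log(f1 / f0)
    phase = 2 * np.pi * f0 * duration / rate * (np.exp(t * rate / duration) - 1.0)
    return np.sin(phase)


def spectral_features(window, fs=FS, lo=3_000.0, hi=10_000.0, step=50.0):
    """Mean FFT magnitude per 50 Hz bin over 3-10 kHz (binning rule is an assumption)."""
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    edges = np.arange(lo, hi + step, step)  # 141 edges give 140 bins
    features = np.empty(len(edges) - 1)
    for i in range(len(edges) - 1):
        in_bin = (freqs >= edges[i]) & (freqs < edges[i + 1])
        features[i] = spectrum[in_bin].mean()
    return features


# Stand-in for a received window: here simply the emitted sweep itself.
received = exponential_sweep()
features = spectral_features(received)
print(features.shape)  # (140,)
```

In a real setup `received` would be the contact-microphone recording for one 0.5 s loop, and `features` would be the input to the downstream classifier or regressor.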

Hardware (Fig 3). On one Franka Hand fingertip, a bone-conduction actuator (Dayton Audio BCE-1, response to ~19 kHz, 22×14×8 mm) injects vibration directly into the contact object rather than into air. On the other fingertip, a piezo-electric contact microphone (Adafruit, r=7 mm, 20 Hz–20 kHz) records structure-borne vibration; being a contact mic, it is insensitive to ambient air sound. Both connect through an external sound card (Sound Blaster Play! 3) at 44.1 kHz; a 2 W Class-D amplifier drives the actuator, and playback and recording are controlled via PyAudio. 3D-printed housings with Sorbothane vibration-isolation pads damp actuator/motor leakage. Total prototype cost is ~US$15, far cheaper than force/tactile sensors.

Modal-based simulator. To scale data collection, a two-pass simulator synthesizes the received signal as the displacement of the object vertex at the contact point. Pass 1: precompute the object's modal model from a tetrahedral mesh via linear modal analysis — solve the generalized eigenproblem for stiffness/mass pair (K,M) (Eq 2) under Rayleigh damping C = αM + βK, diagonalizing the dynamics into independent modes solvable analytically (Eq 3). Contact-dependent damping is added via Zheng et al.'s viscous contact model (Eq 4–5): damping scales with contact-force magnitude and direction at each contact point, with a material-dependent γ. Pass 2: a PyBullet robot simulation extracts time-series contact forces/positions for object-gripper and object-environment pairs (modeling object + two fingertips + environment as mass-spring-damper systems, Fig 4b), excites the modal model with those collision impulses, and reads the displacement at the vertex nearest the mic. Real idle-state "leaked vibration" is pre-recorded and superimposed onto the simulated output to match the physical signal.
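The first pass can be illustrated on a toy system. The sketch below is not the paper's simulator: it swaps the tetrahedral mesh for a three-mass spring chain with a lumped (diagonal) mass matrix, and all numeric values (`k`, `m`, `alpha`, `beta`) are made up for illustration. It solves the generalized eigenproblem K φ = ω² M φ, applies Rayleigh damping C = αM + βK mode by mode, and sums the analytic impulse responses to get the displacement at one vertex. Contact-dependent damping and the PyBullet pass are left out.

```python
import numpy as np


def modal_model(K, m):
    """Solve K phi = w^2 M phi for a lumped mass matrix M = diag(m)."""
    inv_sqrt_m = 1.0 / np.sqrt(m)
    # Equivalent symmetric standard problem: (M^-1/2 K M^-1/2) v = w^2 v
    A = K * inv_sqrt_m[:, None] * inv_sqrt_m[None, :]
    eigvals, V = np.linalg.eigh(A)
    Phi = V * inv_sqrt_m[:, None]  # mass-normalised modes: Phi^T M Phi = I
    omega = np.sqrt(np.clip(eigvals, 0.0, None))
    return omega, Phi


def impulse_response(omega, Phi, impulse, vertex, t, alpha, beta):
    """Displacement at `vertex` after an impulse vector applied at t = 0."""
    # Rayleigh damping decouples per mode: 2 * xi_i * w_i = alpha + beta * w_i^2
    xi = (alpha + beta * omega**2) / (2.0 * omega)
    omega_d = omega * np.sqrt(1.0 - xi**2)  # assumes every mode is underdamped
    gain = Phi.T @ impulse                  # impulse projected onto each mode
    decay = np.exp(-np.outer(xi * omega, t))
    q = (gain / omega_d)[:, None] * decay * np.sin(np.outer(omega_d, t))
    return Phi[vertex] @ q                  # superpose modes at the read-out vertex


# Toy object: three equal masses joined by equal springs, fixed at both ends.
k, mass = 1.0e6, 0.01                       # N/m and kg, illustrative values only
K = k * np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])
m = np.full(3, mass)

omega, Phi = modal_model(K, m)
t = np.arange(0, 0.05, 1.0 / 44_100)
impulse = np.array([1.0e-3, 0.0, 0.0])      # excite the first mass (N*s)
u = impulse_response(omega, Phi, impulse, vertex=2, t=t, alpha=5.0, beta=1.0e-7)
```

Here `u` plays the role of the simulated received signal: the excitation enters at one vertex and the displacement is read at another, so changing the stiffness, mass, or damping values changes the spectrum of `u`.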

Setup

Results

Two proof-of-concept tasks on rigid objects.

Object recognition (Table II). Best single→single accuracy: 82.9% (exponential sweep, MLP); linear sweep was the most robust across all three classifiers (up to 80.0% MLP). For the harder single→multiple transfer (train on single-object scene, test on multi-object scene with external contacts), the exponential sweep dominated at 84.3% (MLP), while the impulse signal collapsed to ~50% — impulses lack the spectral richness to transfer to complex contact scenes. Confusion matrices (Fig 8) show aluminium tube misclassified as steel bar (similar metal properties, different shape) and the two 3D-printed-plastic shapes confused.

Train→Test | Signal | KNN % | SVM % | MLP %
Single→Single | Impulse | 75.7 | 69.3 | 81.4
Single→Single | Linear | 78.6 | 77.1 | 80.0
Single→Single | Exponential | 67.1 | 66.4 | 82.9
Single→Multiple | Impulse | 47.1 | 55.7 | 51.4
Single→Multiple | Linear | 67.1 | 60.0 | 72.9
Single→Multiple | Exponential | 70.0 | 70.0 | 84.3

Grasping-position estimation (Fig 9). KNN regression (3 neighbors, 3-fold CV) predicts distance-from-center. Best RMSE per object: wood bar 0.9 mm (linear), aluminium bar 1.1 mm (linear), aluminium tube 2.1 mm (linear), ABS plastic bar 3.5 mm (impulse). Linear sweep won on 3/4 objects; on ABS plastic, the sweeps lost to the impulse, attributed to object slip under the intense sweep vibration on a slippery plastic surface.
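The evaluation protocol (3-neighbor KNN regression scored by 3-fold cross-validated RMSE) can be sketched in a few lines of NumPy. The data below are synthetic stand-ins, not the paper's measurements: a spectral peak that shifts with grasp offset is an invented toy model, and the helper names `knn_predict` and `cross_validated_rmse` are hypothetical.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=3):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)


def cross_validated_rmse(x, y, k=3, folds=3, seed=0):
    """Shuffle once, hold out each fold in turn, pool the squared errors."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    sq_errors = []
    for test_idx in np.array_split(order, folds):
        train_idx = np.setdiff1d(order, test_idx)
        pred = knn_predict(x[train_idx], y[train_idx], x[test_idx], k)
        sq_errors.append((pred - y[test_idx]) ** 2)
    return float(np.sqrt(np.concatenate(sq_errors).mean()))


# Synthetic stand-in data: a spectral peak whose location shifts with grasp offset.
rng = np.random.default_rng(0)
offset_mm = np.linspace(-30.0, 30.0, 60)   # hypothetical grasp offsets from center
bins = np.arange(140)                      # 140 spectral bins, as in the feature vector
features = np.exp(-((bins[None, :] - (70.0 + offset_mm[:, None])) / 10.0) ** 2)
features += 0.01 * rng.standard_normal(features.shape)
distance_mm = np.abs(offset_mm)            # regression target: distance from center

rmse = cross_validated_rmse(features, distance_mm)
print(f"3-fold RMSE on synthetic data: {rmse:.2f} mm")
```

The number printed here says nothing about the real sensor. The point is only that the reported per-object RMSE values come from a nonparametric regressor, so they depend on how densely the grasp positions were sampled in the training folds.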

Sim-vs-real validation (Fig 6). Qualitative only. Simulated spectra match real ones for most material/shape/grasp/contact conditions, but the simulator failed for the aluminium tube (its shape pushed eigenvalues toward singularity) and for aluminium-bar-on-steel contact (hard to get accurate contact-dynamics values). No quantitative sim2real number is reported.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This paper is a clean instance of the batch thesis that many manipulation predicates are not visually evaluable: is_full(cup), contains(package, toy), material_is(x, steel), and grasp_position(x) live in sound and vibration, not pixels. Active acoustic sensing is a particularly nice fit for the BLADE predicate-grounding story: BLADE currently learns predicate classifiers fθ(p): O → {T, F} over visual observations and flags that internal/contact states are a gap. A sensor like this is exactly the kind of non-visual observation channel a turned-on(faucet)-style classifier could read, e.g. a learned is_full(cup) or in_contact_with(x, table) predicate grounded in the 140-dim spectral feature rather than in cropped pixels. It also has a BLADE-adjacent active flavor: the robot controls the excitation signal, so the sensory output is action-conditioned. That echoes the "controllable data collection / tight action-to-sensation coupling" the authors emphasize, and the abstraction-layer view that the agent should choose informative probing actions.

Two caveats on relevance. First, it's a 2023 hardware/sensing paper with no learning of abstractions, language, or planning — the ML is off-the-shelf KNN/SVM/MLP. So it's a sensor-modality anchor for the "predicates in touch/sound" thesis, not a method to build on directly. Second, its instance-conditioned pre-recording requirement is the opposite of the open-vocabulary, generalize-to-novel-objects regime BLADE targets — which is itself a useful tension to flag.

Quotable

Active acoustic sensing utilizes the state change of the objects, rather than the state change of the sensor itself, which is called exteroceptive sensing. — §III / p.2
In daily activities, a person can gauge the water volume in a cup by listening to the resonance due to tapping. — §I / p.1
The total cost of a prototype is about 15 US dollars, which is notably more affordable compared to common force or tactile sensors. — §IV.B / p.4

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: