Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Jianwei Zhang · Universität Hamburg / UCLA / Tsinghua · 2019 · ICRA 2019 · arXiv:1903.00650

One-liner. Pouring water into an opaque cup shortens the air column above the liquid, which sweeps a resonance up the spectrum, much like blowing across a bottle as it fills. PouringNet is an RNN that reads that sweep from a microphone in real time and estimates liquid height for closed-loop control, with no camera or force sensor needed. It is the canonical "this predicate lives in sound, not pixels" result for manipulation.

Problem & motivation

Robotic pouring needs accurate, real-time perception of how full the target container is, so the robot can stop before spilling. The two standard modalities both break. Vision fails on opaque containers (you can't see the liquid line) and is sensitive to lighting and liquid color, with reported mean height errors >4 mm even on transparent liquids. Haptic/force sensing on the source container measures how much liquid has left the source (a volume integral) but cannot read the liquid height in an unseen target container, and the force-to-fill mapping varies with end-effector and container. The paper's insight, borrowed from how humans pour by ear, is that the air cavity above the liquid acts like an organ pipe: as the air column shortens, its resonance frequency rises, so the instantaneous resonance frequency directly encodes the current fill level. That makes it an immediate readout that needs no temporal integration and so accumulates no integration drift.
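To make the organ-pipe intuition concrete, here is a minimal sketch using the textbook quarter-wave formula for a tube closed at one end, f ≈ c / (4 (L + δ)). This is an idealization I am adding for illustration, not the paper's model: the speed of sound (343 m/s), the end-correction factor (0.6 times the radius) and the sample lengths are all assumptions, and real cups will deviate from it.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: dry air at roughly 20 C


def quarter_wave_frequency_hz(air_column_m, radius_m=0.0, end_correction=0.6):
    """Fundamental of an idealized tube closed at the bottom, open at the top.

    end_correction * radius_m lengthens the effective column slightly;
    the 0.6 factor is a textbook approximation, not a value from the paper.
    """
    if air_column_m <= 0:
        raise ValueError("air column length must be positive")
    effective_length_m = air_column_m + end_correction * radius_m
    return SPEED_OF_SOUND_M_S / (4.0 * effective_length_m)


# Hypothetical air-column lengths: the resonance rises as the cup fills.
for h_mm in (120, 90, 60, 30):
    f_hz = quarter_wave_frequency_hz(h_mm / 1000.0)
    print(f"air column {h_mm:3d} mm -> about {f_hz:.0f} Hz")
```

The only point of the sketch is the direction and rough scale of the effect: halving the air column doubles the idealized resonance frequency.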

Method

Air column, not liquid height, as the label. The key modeling move (Fig 4): the regression target is the length of the air column H_a above the liquid, not the liquid height itself. Two differently-shaped containers filled to different liquid heights but sharing the same air-column length produce a similar air resonance — so air-column length is the container-agnostic quantity the resonance actually determines. Using liquid height as the label would map one acoustic feature to many labels; using air-column length makes the target single-valued and transfers across container shapes. This also gives free overfill protection: the controller just stops when H_a hits a small threshold, regardless of the target container's geometry.
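The label choice and the stop rule can be sketched as follows. The function names are hypothetical, and I am treating the container heights listed in the setup (glass 127 mm, mug 99 mm) as inner depths, which is an assumption; the liquid heights are made up for illustration.

```python
def air_column_mm(container_depth_mm, liquid_height_mm):
    """Air-column length above the liquid, clamped to [0, container depth]."""
    if container_depth_mm <= 0:
        raise ValueError("container depth must be positive")
    length = container_depth_mm - liquid_height_mm
    return min(max(length, 0), container_depth_mm)


def should_stop(predicted_air_column_mm, target_air_column_mm):
    """Stop once the predicted air column has shrunk to the target length."""
    return predicted_air_column_mm <= target_air_column_mm


# Two containers at different liquid heights share one air-column label.
glass = air_column_mm(127, 97)  # assumed inner depth 127 mm
mug = air_column_mm(99, 69)     # assumed inner depth 99 mm
print(glass, mug)
print(should_stop(predicted_air_column_mm=29.0, target_air_column_mm=30.0))
```

Because the stop rule only compares air-column lengths, it never needs the target container's depth at run time.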

Audio front-end. Raw audio (recorded at 44.1 kHz) is resampled to 16 kHz; spectrograms use a 0.032 s window with half-window overlap and a 512-point FFT, yielding 257-bin slices. A 4 s clip becomes a 257×251 time-frequency input. In the spectrograms (Fig 3) a single high-energy rising curve between ~256–2048 Hz is the air resonance; the authors note it tracks liquid height and is largely invariant to pouring speed, while the container's own (slightly falling) resonance is not clearly visible in their data.
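The 257×251 input shape follows from the stated parameters by simple arithmetic. The sketch below reproduces it under one assumption of mine: that framing is centered (the signal is padded by half a window at each end), which gives 1 + n_samples // hop frames. The paper's actual tooling is not specified in these notes.

```python
def spectrogram_shape(clip_seconds=4.0, sample_rate_hz=16_000,
                      window_seconds=0.032, n_fft=512):
    """Return (frequency bins, time frames) for a one-sided STFT."""
    window = round(window_seconds * sample_rate_hz)  # 512 samples
    hop = window // 2                                # half-window overlap
    n_samples = round(clip_seconds * sample_rate_hz)
    n_bins = n_fft // 2 + 1                          # one-sided spectrum
    n_frames = 1 + n_samples // hop                  # assumed centered framing
    return n_bins, n_frames


print(spectrogram_shape())  # (257, 251)
```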

PouringNet (Fig 5). A recurrent encoder (one layer of LSTM or GRU) consumes the spectrogram one time-slice at a time, producing a 256-d recurrent feature; a 2-layer MLP height predictor regresses the instantaneous air-column length. Recurrence is chosen for two reasons: it suits the sequential nature of audio, and it implicitly bakes in the prior that liquid level rises monotonically, smoothing predictions. Two losses: an MSE term L_height = ∥Ĥ_a − H_a∥² and an auxiliary monotonicity penalty L_mono = ∑_t max(0, Ĥ_a(t+1) − Ĥ_a(t)), which punishes any predicted increase in air-column length from one time step to the next (since it should only shrink). They are combined as L_audio = L_height + α · L_mono with α = 0.01. Data augmentation: random 4 s crops from each full pouring sequence, with the number of crops proportional to trial length.
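A minimal pure-Python sketch of the two loss terms, operating on one sequence of per-frame predictions. Averaging the squared error over frames is my assumption (the notes only give the squared norm), and the function names and numbers are hypothetical.

```python
def height_loss(pred_mm, true_mm):
    """Mean squared error between predicted and true air-column lengths."""
    if len(pred_mm) != len(true_mm) or not pred_mm:
        raise ValueError("sequences must be non-empty and equal in length")
    return sum((p - t) ** 2 for p, t in zip(pred_mm, true_mm)) / len(pred_mm)


def mono_loss(pred_mm):
    """Sum of positive step-to-step increases in predicted air-column length."""
    return sum(max(0.0, nxt - cur) for cur, nxt in zip(pred_mm, pred_mm[1:]))


def audio_loss(pred_mm, true_mm, alpha=0.01):
    """L_audio = L_height + alpha * L_mono, with alpha = 0.01 as reported."""
    return height_loss(pred_mm, true_mm) + alpha * mono_loss(pred_mm)


pred = [50.0, 48.0, 49.0, 45.0]  # the 48 -> 49 step violates monotonicity
true = [50.0, 48.0, 46.0, 45.0]
print(height_loss(pred, true), mono_loss(pred), audio_loss(pred, true))
```

Only the single upward step contributes to the penalty; steps where the predicted air column shrinks cost nothing.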

Deployment. The AudioLSTM model trained on the human dataset is fine-tuned on a small spout-equipped robot dataset (the spout slows pouring and reduces spillage from the high gripper), then used as a closed-loop feedback signal that terminates pouring the moment the desired H_a is reached. One feedback loop runs in ~21 ms (~20 ms of which is spectrogram processing).
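The closed-loop use of the estimate can be sketched as a polling loop. Everything here is hypothetical scaffolding: `estimate_air_column_mm` stands in for the model's inference call and `stop_pouring` for the robot command, and the timeout is a safety guard I added. Only the roughly 21 ms loop period comes from the notes above.

```python
import time


def pour_until(target_air_column_mm, estimate_air_column_mm, stop_pouring,
               period_s=0.021, timeout_s=30.0):
    """Poll the estimator and stop pouring once the target is reached.

    Returns the estimate that triggered the stop, or None on timeout.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        estimate = estimate_air_column_mm()
        if estimate <= target_air_column_mm:
            stop_pouring()
            return estimate
        time.sleep(period_s)
    stop_pouring()  # safety fallback: never keep pouring past the timeout
    return None


# Usage with a fake estimator that replays a shrinking air column.
readings = iter([80.0, 60.0, 41.0, 39.0, 20.0])
events = []
final = pour_until(40.0, lambda: next(readings), lambda: events.append("stop"),
                   period_s=0.0)
print(final, events)
```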

Setup

Datasets / benchmarks: A self-collected multimodal pouring dataset — the paper claims the first multimodal dataset for the pouring perception task — with >3000 human pouring recordings (1000 trials × 3 target containers, 2 subjects), each containing audio, force/torque, video, motion-tracking trajectories, and scale-derived height labels. Training used water only; recordings ran 4–11 s. Height ground truth comes from a digital scale plus a per-container quadratic weight→height polynomial. A separate 30-trial-per-container spout dataset is collected for robot fine-tuning. Code/data/video released at the project page.
Hardware / simulator: Real-world only (no simulator). Collection rig (Fig 2): Behringer B-5 mic (44.1 kHz), ATI Mini40 F/T sensor (500 Hz), Maul Logic digital scale (1 Hz), Logitech webcam (30 Hz), PhaseSpace Impulse X2E motion tracker (240 Hz); three training containers (glass 127 mm, thermos 150 mm, mug 99 mm). Robot experiments on a UR5 with a spout-equipped source container; three unseen containers (red mug, blue mug, plastic cup).
Baselines: Internal architecture ablation only — AudioFC (feed-forward MLP) vs AudioLSTM vs AudioGRU; plus single-container vs all-container training. For the robot pouring precision, an informal comparison to prior vision-based pouring numbers (Schenck mean error 38 ml; Do 22.53 ml). No external learned audio/vision/haptic baseline is re-run on the same setup.
Compute: Inference machine: 20-core Intel i9-7900X CPU + two GTX 1080Ti GPUs. Training compute and time not reported.
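The scale-derived labels described under Datasets can be sketched as a quadratic weight-to-height map followed by interpolation from the 1 Hz scale readings to spectrogram frame times. The coefficients below are made up, and linear interpolation is my assumption about how sparse scale samples would be aligned with audio frames; the notes do not say how the authors do it.

```python
def height_from_weight_mm(weight_g, coeffs):
    """Quadratic weight -> liquid height map; coeffs = (a, b, c)."""
    a, b, c = coeffs
    return a * weight_g ** 2 + b * weight_g + c


def interpolate(times_s, values, t):
    """Linearly interpolate sparse samples at time t, clamped at both ends."""
    if t <= times_s[0]:
        return values[0]
    if t >= times_s[-1]:
        return values[-1]
    for i in range(1, len(times_s)):
        if t <= times_s[i]:
            t0, t1 = times_s[i - 1], times_s[i]
            w = (t - t0) / (t1 - t0)
            return values[i - 1] + w * (values[i] - values[i - 1])


HYPOTHETICAL_COEFFS = (-0.0002, 0.45, 0.0)  # illustrative only
scale_times_s = [0.0, 1.0, 2.0]             # 1 Hz scale readings
scale_weights_g = [0.0, 40.0, 100.0]
heights_mm = [height_from_weight_mm(w, HYPOTHETICAL_COEFFS)
              for w in scale_weights_g]
frame_time_s = 1.5                          # one spectrogram frame timestamp
print(interpolate(scale_times_s, heights_mm, frame_time_s))
```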

Results

Headline. On the human test set both recurrent models reach 90% of sequences below 2 mm air-column error with mean absolute length error below 1.5 mm; AudioLSTM slightly beats AudioGRU and both crush the AudioFC feed-forward baseline (Fig 6), confirming that recurrence (and its implicit monotonic prior) matters and that audio alone suffices to infer height across containers. The all-container model beats any single-container model, i.e. it benefits from data scale and generalizes better. Among containers, the stainless-steel thermos is easiest ("crispest sound") and the glass hardest.
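For reference, the two headline metrics could be computed along these lines. I am assuming the "90% below 2 mm" figure is the fraction of test sequences whose per-sequence mean absolute error is under the threshold; the notes do not pin down the exact definition, and the error values below are invented.

```python
def mean_abs_error_mm(pred_mm, true_mm):
    """Mean absolute error over the frames of one sequence."""
    if len(pred_mm) != len(true_mm) or not pred_mm:
        raise ValueError("sequences must be non-empty and equal in length")
    return sum(abs(p - t) for p, t in zip(pred_mm, true_mm)) / len(pred_mm)


def fraction_below(per_sequence_errors_mm, threshold_mm=2.0):
    """Share of sequences whose error is strictly below the threshold."""
    if not per_sequence_errors_mm:
        raise ValueError("need at least one sequence")
    hits = sum(1 for e in per_sequence_errors_mm if e < threshold_mm)
    return hits / len(per_sequence_errors_mm)


# Hypothetical per-sequence errors, in millimetres.
errors = [0.5, 1.9, 2.5, 1.0]
print(fraction_below(errors))
```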

Robot closed-loop (Fig 8). Mean absolute height error and its std are both below 3 mm on the three seen containers and below 4.5 mm on the three unseen containers, despite a large human-to-robot domain gap (different trajectories, noisier environment). Accuracy improves as the target air column gets shorter (liquid higher), a useful property since spill risk is highest there. Converted to amount error (Table I):

Container	Mean amount error ± std	Seen?
Glass	9.54 ± 7.81 ml	seen
Thermos	9.91 ± 8.48 ml	seen
Mug	13.79 ± 11.04 ml	seen
Red Mug	7.92 ± 7.14 ml	unseen
Blue Mug	6.42 ± 6.31 ml	unseen
Plastic Cup	10.72 ± 8.70 ml	unseen

These amount errors are reported as lower than prior vision-based pouring work (Schenck 38 ml; Do 22.53 ml), though on a different setup.
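As a sanity check on how millimetre height errors map to millilitre amount errors, here is a sketch that assumes a straight-walled cylindrical container. The 40 mm inner radius is hypothetical and the real containers are not perfect cylinders, so this only illustrates the order of magnitude, not the paper's conversion.

```python
import math


def amount_error_ml(height_error_mm, inner_radius_mm):
    """Volume corresponding to a height error in an ideal cylinder."""
    volume_mm3 = math.pi * inner_radius_mm ** 2 * abs(height_error_mm)
    return volume_mm3 / 1000.0  # 1 ml = 1000 mm^3


# A 3 mm height error in a cup with an assumed 40 mm inner radius.
print(f"{amount_error_ml(3.0, 40.0):.1f} ml")
```

Under that assumption a 3 mm height error is about 15 ml, the same order as the Table I values.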

Generalization sweeps. The method holds up across eight microphone positions (Fig 9 — including varying distance to the noisy UR5 control box, the main noise source), across initial liquid heights {10,20,30,40 mm} (Fig 10a), and across liquid types: pure water, carbonated water, and orange juice generalize well. Where it loses: 1.8% milk fails — higher viscosity weakens the pouring sound and degrades the recording; the authors conclude generalization is negatively correlated with liquid viscosity.

Limitations & open questions

From the authors:

Fails on high-viscosity liquids (milk) because the sound is too weak to record/analyze cleanly.
The human-to-robot transfer needs a spout-equipped re-collection and fine-tuning; the robot pour is slowed and the gripper is fixed high above the scale for sound quality.
Sensitive to environment assumptions: data collected in a quiet room; future work is flagged for noisier environments with human voice and for variable source–target distance.
The rich force/trajectory/video channels of the dataset are collected but unused here; multimodal fusion is left as future work.

What I noticed reading it:

Tiny robot-test statistics. Every robot generalization result is five pours per condition. Means ± std on n=5 are weak evidence; the std bars (Fig 8c, e.g. thermos ranges roughly −1 to 9 mm) are wide relative to the headline means.
No external baseline on the same setup. The 90%@2mm and sub-3mm numbers have no competing vision/force/audio method run on this rig; the Schenck/Do comparison is cross-paper, cross-container, cross-metric, so "higher precision" is suggestive, not controlled.
Distance/geometry are quietly fixed. Source–target and mic–target distances are constrained to dataset-like ranges; the air-resonance signal almost certainly depends on mic placement and pour geometry, and the mic-position sweep only spans two distances on the collision-free side.
The container-resonance term is asserted but not observed. The physical model invokes two resonances, but the paper admits the container's own resonance "cannot be clearly seen" in their data — so the model leans entirely on the air resonance, which may limit transfer to very different container materials/wall stiffness.
No quantitative ablation of L_mono. The monotonicity loss is motivated and a single α is reported, but there's no with/without comparison showing how much it actually helps.

Why I care

This is a clean, early (2019) instance of the thesis behind the whole 2026-06-24 batch and the reason it matters for BLADE: a manipulation state variable that is not visually evaluable. BLADE learns predicate classifiers from vision (e.g. turned-on(faucet)) by cropping to the relevant object — but a predicate like is_full(cup) or fill_level(cup, h) on an opaque container is precisely the case where a visual classifier has no signal to read. This paper shows that quantity lives in the audio channel: the air-resonance sweep is the fill-level estimator. So PouringNet is a concrete worked example of a continuous predicate / state estimator grounded in sound rather than pixels, exactly the gap BLADE's purely-visual predicate layer leaves open. It also speaks to BLADE's noted limitation that "continuous parameters (pour amount...) sit entirely inside the diffusion policy" — audio gives an explicit, plannable readout of pour fill that a symbolic precondition (not is_full) could be grounded against. Relative to BLADE the paper is narrow (single task, single fixed trajectory, no planning, no language) and is best read as a sensing primitive, not a manipulation framework; its value to my line of work is as the audio-grounded-predicate existence proof.

Directly related batch papers: SonicSense and Active Acoustic Sensing push the same acoustic-sensing-for-manipulation idea to in-hand object/property estimation; Play it by Ear, ManiWAV, and See, Hear, and Feel fold contact/pouring audio into policies rather than a standalone estimator; Clarke et al. (granular material audio) [19] is the closest prior cited here.

Quotable

Inspired by how human judge the liquid height during pouring with their hearing, we try to design a model that can estimate the position of liquid height with audio vibration. — §I / p.1

To make sense of audio vibration, using the length of the air column as groundtruth could be more generative and indicative than using the height of the liquid level. — §IV / p.4

The generalization performances of our PouringNet are negatively correlated to the viscosity of the liquid. — §V-B4 / p.7

Papers cited here that could be ingested next:

[19] Clarke et al., CoRL 2018 — Learning audio feedback for estimating amount and flow of granular material — the closest prior; audio-from-shaking for granular weight, which this paper distinguishes from liquid-height regression.
[2] Schenck & Fox, ICRA 2017 — Visual closed-loop control for pouring liquids — the vision-based pouring baseline whose error this paper undercuts.
[18,20] Do & Burgard — RGB-D liquid-level / accurate pouring — the vision-perception comparison points.
[6] Rozo et al. 2013 — Force-based pouring via parametric HMM — the haptic-pouring counterpoint.

Newly ingested in the 2026-06-24 batch — directly relevant:

SonicSense — in-hand acoustic vibration sensing for object/property estimation; same "structure lives in sound" thesis at the object level.
Active Acoustic Sensing for Robot Manipulation — injects/probes acoustic signals to sense contact state; active counterpart to this passive pouring-sound readout.
Play it by Ear and ManiWAV — audio-(visual) imitation policies; use contact/pouring sound inside a policy rather than as an explicit state estimator.
See, Hear, and Feel — multisensory fusion including audio for contact-rich manipulation.
ObjectFolder 2.0 — multisensory (incl. acoustic) object dataset/simulator; the representation-learning backdrop for acoustic property sensing.