Making Sense of Audio Vibration for Liquid Height Estimation in Robotic Pouring

Hongzhuo Liang, Shuang Li, Xiaojian Ma, Norman Hendrich, Timo Gerkmann, Fuchun Sun, Jianwei Zhang · Universität Hamburg / UCLA / Tsinghua · 2019 · ICRA 2019 · arXiv:1903.00650

One-liner. Pour water into an opaque cup and the rising liquid shortens the air column inside, sweeping a resonance up the spectrum like blowing across a bottle; PouringNet is an RNN that reads that sweep from the microphone in real time to estimate liquid height closed-loop, with no camera or force sensor: the canonical "this predicate lives in sound, not pixels" result for manipulation.

Problem & motivation

Robotic pouring needs accurate, real-time perception of how full the target container is, so the robot can stop before spilling. The two standard modalities both break. Vision fails on opaque containers (you can't see the liquid line) and is sensitive to lighting and liquid color, with reported mean height errors >4 mm even on transparent liquids. Haptic/force sensing on the source container measures how much liquid has left the source (a volume integral) but cannot read the liquid height in an unseen target container, and the force-to-fill mapping varies with end-effector and container. The paper's insight, borrowed from how humans pour by ear, is that the air cavity above the liquid acts like an organ pipe: as the air column shortens, its resonance frequency rises, so the instantaneous resonance frequency directly encodes the current fill level. That makes it an immediate readout that needs no temporal integration and so accumulates no integration drift.

Method

Air column, not liquid height, as the label. The key modeling move (Fig 4): the regression target is the length of the air column Ha above the liquid, not the liquid height itself. Two differently-shaped containers filled to different liquid heights but sharing the same air-column length produce a similar air resonance — so air-column length is the container-agnostic quantity the resonance actually determines. Using liquid height as the label would map one acoustic feature to many labels; using air-column length makes the target single-valued and transfers across container shapes. This also gives free overfill protection: the controller just stops when Ha hits a small threshold, regardless of the target container's geometry.
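The organ-pipe intuition can be made concrete with the textbook quarter-wave model for a tube closed at one end (the liquid surface) and open at the other, f = c / (4 L). This is only a rough sketch, not the paper's method: PouringNet learns the mapping from data and does not use this formula. The function names, the 343 m/s speed of sound, the container depths, and the optional end-correction argument are all illustrative assumptions.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed: air at roughly 20 °C


def air_column_length_mm(container_depth_mm: float, liquid_height_mm: float) -> float:
    """Air-column length = inner depth of the container minus liquid height."""
    if not 0.0 <= liquid_height_mm <= container_depth_mm:
        raise ValueError("liquid height must lie between 0 and the container depth")
    return container_depth_mm - liquid_height_mm


def quarter_wave_frequency_hz(air_column_mm: float, end_correction_mm: float = 0.0) -> float:
    """Fundamental of an idealized tube closed at one end: f = c / (4 * L)."""
    effective_length_m = (air_column_mm + end_correction_mm) / 1000.0
    if effective_length_m <= 0.0:
        raise ValueError("effective air-column length must be positive")
    return SPEED_OF_SOUND_M_S / (4.0 * effective_length_m)


# Two hypothetical containers of different depth, filled to different heights,
# share the same 50 mm air column and therefore the same idealized resonance.
tall = air_column_length_mm(container_depth_mm=150.0, liquid_height_mm=100.0)
short = air_column_length_mm(container_depth_mm=90.0, liquid_height_mm=40.0)
assert tall == short == 50.0

# The idealized frequency rises as the air column shrinks.
for h_a in (100.0, 50.0, 25.0):
    print(f"air column {h_a:5.1f} mm -> f = {quarter_wave_frequency_hz(h_a):7.1f} Hz")
```

Real cups deviate from this ideal (open-end effects, non-cylindrical walls), which is one plausible reason to regress the air-column length with a learned model rather than invert a formula.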

Audio front-end. Raw audio (recorded at 44.1 kHz) is resampled to 16 kHz; spectrograms use a 0.032 s window with half-window overlap and a 512-point FFT, yielding 257-bin slices. A 4 s clip becomes a 257×251 time-frequency input. In the spectrograms (Fig 3) a single high-energy rising curve between ~256–2048 Hz is the air resonance; the authors note it tracks liquid height and is largely invariant to pouring speed, while the container's own (slightly falling) resonance is not clearly visible in their data.
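The 257×251 input size follows from the stated front-end parameters by simple arithmetic, sketched below. One assumption is needed to reach 251 frames: centered (padded) framing, as in common STFT implementations such as librosa's default `center=True`. The note does not say which framing the authors used, and the function name is hypothetical.

```python
from typing import Tuple


def spectrogram_shape(sample_rate_hz: int = 16_000,
                      window_s: float = 0.032,
                      clip_s: float = 4.0,
                      centered: bool = True) -> Tuple[int, int]:
    """Return (frequency bins, time frames) for the stated front-end settings."""
    window = round(sample_rate_hz * window_s)   # samples per window
    hop = window // 2                           # half-window overlap
    n_fft = window                              # FFT size equals the window length
    n_bins = n_fft // 2 + 1                     # one-sided spectrum
    n_samples = round(sample_rate_hz * clip_s)  # samples in the clip
    if centered:
        # Assumed: signal padded so frames are centered on multiples of the hop.
        n_frames = 1 + n_samples // hop
    else:
        # No padding: only windows that fit entirely inside the clip.
        n_frames = 1 + (n_samples - window) // hop
    return n_bins, n_frames


print(spectrogram_shape())                # centered framing
print(spectrogram_shape(centered=False))  # unpadded framing, for comparison
```

With these settings the window is 512 samples, so the 512-point FFT needs no zero-padding of individual frames.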

PouringNet (Fig 5). A recurrent encoder (one layer of LSTM or GRU) consumes each time-slice of the spectrogram progressively, producing a 256-d recurrent feature; a 2-layer MLP height predictor regresses the instantaneous air-column length. Recurrence is chosen for two reasons: it suits the sequential nature of audio, and it implicitly bakes in the prior that liquid level rises monotonically, smoothing predictions. Two losses: an MSE term $L_{\text{height}} = \lVert \hat{H}_a - H_a \rVert^2$ and an auxiliary monotonicity penalty $L_{\text{mono}} = \sum_t \max(0, \hat{H}_{a,t+1} - \hat{H}_{a,t})$ that punishes any predicted increase in air-column length (since it should only shrink). Combined as $L_{\text{audio}} = L_{\text{height}} + \alpha \cdot L_{\text{mono}}$ with $\alpha = 0.01$. Data augmentation: random 4 s crops from each full pouring sequence, count proportional to trial length.
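The combined loss can be sketched in plain Python on a single predicted sequence. This is a minimal illustration of the two terms as described above, not the authors' training code: the function names are hypothetical, and averaging the squared error over time steps is an assumption (the note does not specify how the height term is normalized).

```python
from typing import Sequence

ALPHA = 0.01  # weight of the monotonicity penalty, as stated in the text


def height_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Squared error between predicted and true air-column length.

    Assumed normalization: mean over time steps.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def monotonic_loss(pred: Sequence[float]) -> float:
    """Penalize every step where the predicted air column grows."""
    return sum(max(0.0, nxt - cur) for cur, nxt in zip(pred, pred[1:]))


def audio_loss(pred: Sequence[float], target: Sequence[float],
               alpha: float = ALPHA) -> float:
    """Height term plus alpha times the monotonicity term."""
    return height_loss(pred, target) + alpha * monotonic_loss(pred)


# Toy sequence in mm: the third prediction rises, which is physically
# implausible during pouring and is the only step the penalty fires on.
predicted = [10.0, 9.0, 9.5, 8.0]
truth = [10.0, 9.0, 8.5, 8.0]
print(height_loss(predicted, truth), monotonic_loss(predicted), audio_loss(predicted, truth))
```

Because the penalty looks only at consecutive predictions, it needs no labels and fires only when the predicted air column grows between steps.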

Deployment. The AudioLSTM model trained on the human dataset is fine-tuned on a small spout-equipped robot dataset (the spout slows pouring and reduces spillage from the high gripper), then used as a closed-loop feedback signal that terminates pouring the moment the desired Ha is reached. One feedback loop runs in ~21 ms (~20 ms of which is spectrogram processing).
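The termination logic amounts to a threshold check on each new estimate. The sketch below assumes a stream of air-column estimates and a stop callback; both are hypothetical stand-ins, since this note does not describe the robot-side interface or the pouring trajectory controller.

```python
from typing import Callable, Iterable, Optional


def pour_until(target_air_column_mm: float,
               estimates_mm: Iterable[float],
               stop_pouring: Callable[[], None]) -> Optional[int]:
    """Stop at the first estimate at or below the target air-column length.

    Returns the index of the stopping step, or None if the stream ended
    before the target was reached.
    """
    for step, estimate in enumerate(estimates_mm):
        if estimate <= target_air_column_mm:
            stop_pouring()
            return step
    return None


# Simulated estimates (mm), one per feedback cycle; values are made up.
events = []
stopped_at = pour_until(
    target_air_column_mm=30.0,
    estimates_mm=[80.0, 62.5, 47.0, 33.1, 29.4, 20.0],
    stop_pouring=lambda: events.append("stop"),
)
print(stopped_at, events)
```

At roughly 21 ms per loop the check runs about 48 times per second, so the overshoot after the threshold is crossed is bounded mainly by how quickly the robot can stop the flow.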

Setup

Results

Headline. On the human test set both recurrent models reach 90% of sequences below 2 mm air-column error with absolute mean length error below 1.5 mm; AudioLSTM slightly beats AudioGRU and both crush the AudioFC feed-forward baseline (Fig 6), confirming that recurrence (and its implicit monotonic prior) matters and that audio alone suffices to infer height across containers. The all-container model beats any single-container model, i.e. it benefits from data scale and generalizes better. Among containers, the stainless-steel thermos is easiest ("crispest sound") and the glass hardest.

Robot closed-loop (Fig 8). Absolute mean height error and std are both below 3 mm on the three seen containers and below 4.5 mm on three unseen containers, despite a large human-to-robot domain gap (different trajectories, noisier environment). Accuracy improves as the target air column gets shorter (liquid higher) — a useful property since spill risk is highest there. Converted to amount error (Table I):

| Container | Mean amount error ± std | Seen? |
|---|---|---|
| Glass | 9.54 ± 7.81 ml | seen |
| Thermos | 9.91 ± 8.48 ml | seen |
| Mug | 13.79 ± 11.04 ml | seen |
| Red Mug | 7.92 ± 7.14 ml | unseen |
| Blue Mug | 6.42 ± 6.31 ml | unseen |
| Plastic Cup | 10.72 ± 8.70 ml | unseen |

These amount errors are reported as lower than prior vision-based pouring work (Schenck 38 ml; Do 22.53 ml), though on a different setup.

Generalization sweeps. The method holds up across eight microphone positions (Fig 9 — including varying distance to the noisy UR5 control box, the main noise source), across initial liquid heights {10,20,30,40 mm} (Fig 10a), and across liquid types: pure water, carbonated water, and orange juice generalize well. Where it loses: 1.8% milk fails — higher viscosity weakens the pouring sound and degrades the recording; the authors conclude generalization is negatively correlated with liquid viscosity.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is a clean, early (2019) instance of the thesis behind the whole 2026-06-24 batch and the reason it matters for BLADE: a manipulation state variable that is not visually evaluable. BLADE learns predicate classifiers from vision (e.g. turned-on(faucet)) by cropping to the relevant object — but a predicate like is_full(cup) or fill_level(cup, h) on an opaque container is precisely the case where a visual classifier has no signal to read. This paper shows that quantity lives in the audio channel: the air-resonance sweep is the fill-level estimator. So PouringNet is a concrete worked example of a continuous predicate / state estimator grounded in sound rather than pixels, exactly the gap BLADE's purely-visual predicate layer leaves open. It also speaks to BLADE's noted limitation that "continuous parameters (pour amount...) sit entirely inside the diffusion policy" — audio gives an explicit, plannable readout of pour fill that a symbolic precondition (not is_full) could be grounded against. Relative to BLADE the paper is narrow (single task, single fixed trajectory, no planning, no language) and is best read as a sensing primitive, not a manipulation framework; its value to my line of work is as the audio-grounded-predicate existence proof.

Directly related batch papers: SonicSense and Active Acoustic Sensing push the same acoustic-sensing-for-manipulation idea to in-hand object/property estimation; Play it by Ear, ManiWAV, and See, Hear, and Feel fold contact/pouring audio into policies rather than a standalone estimator; Clarke et al. (granular material audio) [19] is the closest prior cited here.

Quotable

Inspired by how human judge the liquid height during pouring with their hearing, we try to design a model that can estimate the position of liquid height with audio vibration. — §I / p.1
To make sense of audio vibration, using the length of the air column as groundtruth could be more generative and indicative than using the height of the liquid level. — §IV / p.4
The generalization performances of our PouringNet are negatively correlated to the viscosity of the liquid. — §V-B4 / p.7

Related

Papers cited here that could be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: