One-liner. Pour water into an opaque cup and the rising liquid shortens the air column inside, sweeping a resonance up the spectrum like blowing across a bottle — PouringNet is an RNN that reads that microphone sweep in real time to estimate liquid height closed-loop, no camera or force sensor needed, the canonical "this predicate lives in sound, not pixels" result for manipulation.
Robotic pouring needs accurate, real-time perception of how full the target container is, so the robot can stop before spilling. The two standard modalities both break: vision fails on opaque containers (you can't see the liquid line) and is sensitive to lighting and liquid color, with reported mean height errors >4 mm even on transparent liquids; haptic/force sensing on the source container measures how much liquid left the source (a volume integral), but cannot read the liquid height in an unseen target container, and the force-to-fill mapping varies with end-effector and container. The paper's insight, borrowed from how humans pour by ear, is that the air cavity above the liquid acts like an organ pipe: as the column shortens the air resonates faster, so the instantaneous resonance frequency directly encodes the current fill level — an immediate readout that needs no temporal integration and so accumulates no integration drift.
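To make the organ-pipe picture concrete, the sketch below uses the textbook quarter-wave formula for an idealized tube closed at one end, f = c / (4 L). This is an illustration under stated assumptions, not the paper's model: real cups have end corrections and non-cylindrical shapes, and the function name and the 343 m/s speed of sound are chosen here.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at roughly 20 degrees C


def quarter_wave_frequency_hz(air_column_m: float) -> float:
    """Fundamental of an idealized tube closed at one end: f = c / (4 L)."""
    if air_column_m <= 0:
        raise ValueError("air column length must be positive")
    return SPEED_OF_SOUND_M_S / (4.0 * air_column_m)


# As the cup fills, the air column shrinks and the resonance rises.
for air_column_mm in (120, 80, 40):
    f = quarter_wave_frequency_hz(air_column_mm / 1000.0)
    print(f"{air_column_mm} mm air column -> {f:.0f} Hz")
```

The mapping is one-to-one in this idealization, which is why a single frequency reading can stand in for the fill level without integrating anything over time.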
Air column, not liquid height, as the label. The key
modeling move (Fig 4): the regression target is the length of the air
column Ha above the liquid, not the liquid height
itself. Two differently-shaped containers filled to different liquid heights
but sharing the same air-column length produce a similar air resonance —
so air-column length is the container-agnostic quantity the resonance actually
determines. Using liquid height as the label would map one acoustic feature to
many labels; using air-column length makes the target single-valued and
transfers across container shapes. This also gives free overfill protection:
the controller just stops when Ha hits a small
threshold, regardless of the target container's geometry.
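A minimal sketch of the labeling idea, assuming the air column is simply the inner container height minus the liquid height (both measured from the inner bottom). The container dimensions below are hypothetical and only illustrate that two different liquid heights can share one label.

```python
def air_column_mm(container_height_mm: float, liquid_height_mm: float) -> float:
    """Air-column length above the liquid, the quantity used as the label."""
    if not 0.0 <= liquid_height_mm <= container_height_mm:
        raise ValueError("liquid height must lie between 0 and the container height")
    return container_height_mm - liquid_height_mm


# Hypothetical containers: a 100 mm cup and a 150 mm bottle.
short_cup = air_column_mm(container_height_mm=100.0, liquid_height_mm=60.0)
tall_bottle = air_column_mm(container_height_mm=150.0, liquid_height_mm=110.0)

# Different liquid heights (60 mm vs 110 mm), same air column, same label.
print(short_cup, tall_bottle)
```

With liquid height as the label, these two cases would need different targets for a similar sound; with air-column length they collapse to one.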
Audio front-end. Raw audio (recorded at 44.1 kHz) is
resampled to 16 kHz; spectrograms use a 0.032 s window with half-window
overlap and a 512-point FFT, yielding 257-bin slices. A 4 s clip becomes a
257×251 time-frequency input. In the spectrograms (Fig 3) a
single high-energy rising curve between ~256–2048 Hz is the air
resonance; the authors note it tracks liquid height and is largely invariant to
pouring speed, while the container's own (slightly falling) resonance is not
clearly visible in their data.
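The front-end numbers can be checked with a short NumPy sketch. The sample rate, window length, overlap, and FFT size come from the text; the Hann window and the reflect padding that centers each frame are assumptions made here (the padding is one common convention that yields 251 frames for a 4 s clip), and `spectrogram` is a name chosen for this example.

```python
import numpy as np

SAMPLE_RATE = 16_000                      # Hz after resampling (from the text)
WINDOW_S = 0.032                          # seconds (from the text)
N_FFT = 512                               # FFT size (from the text)
WIN = int(round(SAMPLE_RATE * WINDOW_S))  # 512 samples per window
HOP = WIN // 2                            # half-window overlap: 256 samples


def spectrogram(audio: np.ndarray) -> np.ndarray:
    """Magnitude spectrogram with shape (frequency_bins, frames)."""
    # Assumption: reflect-pad half a window on each side so frames are centered.
    padded = np.pad(audio, WIN // 2, mode="reflect")
    n_frames = 1 + (len(padded) - WIN) // HOP
    window = np.hanning(WIN)  # assumption: Hann window
    frames = np.stack(
        [padded[i * HOP : i * HOP + WIN] * window for i in range(n_frames)]
    )
    # rfft of a 512-point frame gives 512 / 2 + 1 = 257 frequency bins.
    return np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)).T


clip = np.zeros(4 * SAMPLE_RATE)  # a silent 4 s clip, just to check shapes
print(spectrogram(clip).shape)    # (257, 251)
```

Each bin spans 16000 / 512 = 31.25 Hz, so the ~256–2048 Hz resonance band covers roughly bins 8 to 66.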
PouringNet (Fig 5). A recurrent encoder — one layer of
LSTM or GRU — consumes each time-slice of the spectrogram progressively,
producing a 256-d recurrent feature; a 2-layer MLP height predictor regresses
the instantaneous air-column length. Recurrence is chosen for two reasons: it
suits the sequential nature of audio, and it implicitly bakes in the prior that
liquid level rises monotonically, smoothing predictions. Two losses:
an MSE term $L_{\text{height}} = \lVert \hat{H}_a - H_a \rVert^2$
and an auxiliary monotonicity penalty
$L_{\text{mono}} = \sum_t \max(0, \hat{H}_{a,t+1} - \hat{H}_{a,t})$
that punishes any predicted increase in air-column length (since it
should only shrink). Combined as
$L_{\text{audio}} = L_{\text{height}} + \alpha \cdot L_{\text{mono}}$
with α = 0.01. Data augmentation: random 4 s crops from
each full pouring sequence, count proportional to trial length.
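The combined loss described above can be sketched in NumPy as follows. The value α = 0.01 and the form of the two terms follow the text; averaging the squared error over time steps is an assumption made here, and `audio_loss` is a hypothetical name, not the authors' code.

```python
import numpy as np

ALPHA = 0.01  # weight on the monotonicity term, as reported in the text


def audio_loss(pred_mm, target_mm, alpha: float = ALPHA) -> float:
    """Height MSE plus a penalty on predicted increases in air-column length."""
    pred = np.asarray(pred_mm, dtype=float)
    target = np.asarray(target_mm, dtype=float)
    # Assumption: squared error is averaged over the time steps of the clip.
    height_term = float(np.mean((pred - target) ** 2))
    # np.diff gives pred[t+1] - pred[t]; only positive steps (growth) are penalized.
    mono_term = float(np.sum(np.maximum(0.0, np.diff(pred))))
    return height_term + alpha * mono_term


# The third prediction rises by 1 mm, so the penalty term is 1.0.
pred = [50.0, 48.0, 49.0, 45.0]
target = [50.0, 48.0, 46.0, 44.0]
print(audio_loss(pred, target))
```

With α this small, the penalty acts as a gentle regularizer: in the example the 1 mm upward blip adds only 0.01 to a height term of 2.5.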
Deployment. The AudioLSTM model trained on the human dataset
is fine-tuned on a small spout-equipped robot dataset (the spout slows pouring
and reduces spillage from the high gripper), then used as a closed-loop
feedback signal that terminates pouring the moment the desired
Ha is reached. One feedback loop runs in ~21 ms
(~20 ms of which is spectrogram processing).
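A minimal sketch of the stopping rule, assuming the estimator is exposed as a stream of air-column predictions in millimetres. The function `pour_until`, the callback, and the numbers in the usage example are hypothetical; the robot interface used in the paper is not described here.

```python
from typing import Callable, Iterable


def pour_until(
    target_air_mm: float,
    estimates_mm: Iterable[float],
    stop_pouring: Callable[[], None],
) -> int:
    """Stop at the first estimate at or below the target air-column length.

    Returns the index of the step that triggered the stop, or -1 if the
    stream ended before the target was reached.
    """
    for step, estimate in enumerate(estimates_mm):
        if estimate <= target_air_mm:
            stop_pouring()
            return step
    return -1


events = []
fake_stream = [80.0, 60.0, 41.0, 29.5, 20.0]  # made-up predictions in mm
stopped_at = pour_until(30.0, fake_stream, lambda: events.append("stop"))
print(stopped_at, events)  # 3 ['stop']
```

At roughly 21 ms per loop the estimator would deliver about 47 readings per second, so the stop decision lags the true level by at most one short step plus the robot's own reaction time.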
Headline. On the human test set both recurrent models reach 90% of sequences below 2 mm air-column error with absolute mean length error below 1.5 mm; AudioLSTM slightly beats AudioGRU and both crush the AudioFC feed-forward baseline (Fig 6), confirming that recurrence (and its implicit monotonic prior) matters and that audio alone suffices to infer height across containers. The all-container model beats any single-container model, i.e. it benefits from data scale and generalizes better. Among containers, the stainless-steel thermos is easiest ("crispest sound") and the glass hardest.
Robot closed-loop (Fig 8). Absolute mean height error and std are both below 3 mm on the three seen containers and below 4.5 mm on three unseen containers, despite a large human-to-robot domain gap (different trajectories, noisier environment). Accuracy improves as the target air column gets shorter (liquid higher) — a useful property since spill risk is highest there. Converted to amount error (Table I):
| Container | Mean amount error ± std | Seen? |
|---|---|---|
| Glass | 9.54 ± 7.81 ml | seen |
| Thermos | 9.91 ± 8.48 ml | seen |
| Mug | 13.79 ± 11.04 ml | seen |
| Red Mug | 7.92 ± 7.14 ml | unseen |
| Blue Mug | 6.42 ± 6.31 ml | unseen |
| Plastic Cup | 10.72 ± 8.70 ml | unseen |
These amount errors are reported as lower than prior vision-based pouring work (Schenck 38 ml; Do 22.53 ml), though on a different setup.
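For a rough sense of how a height error becomes an amount error, the sketch below assumes a perfectly cylindrical container. The 35 mm inner radius is a hypothetical value, not a measurement from the paper, and the paper's own conversion for Table I may differ.

```python
import math


def height_error_to_ml(height_error_mm: float, inner_radius_mm: float) -> float:
    """Volume of a cylindrical slab of liquid, in millilitres (1 ml = 1000 mm^3)."""
    volume_mm3 = math.pi * inner_radius_mm**2 * abs(height_error_mm)
    return volume_mm3 / 1000.0


# Hypothetical mug with a 35 mm inner radius and a 3 mm height error.
print(f"{height_error_to_ml(3.0, 35.0):.1f} ml")  # 11.5 ml
```

Under this assumption a few millimetres of height error lands in the same range as the per-container amount errors in the table, and wider containers turn the same height error into a larger volume error.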
Generalization sweeps. The method holds up across eight microphone positions (Fig 9 — including varying distance to the noisy UR5 control box, the main noise source), across initial liquid heights {10,20,30,40 mm} (Fig 10a), and across liquid types: pure water, carbonated water, and orange juice generalize well. Where it loses: 1.8% milk fails — higher viscosity weakens the pouring sound and degrades the recording; the authors conclude generalization is negatively correlated with liquid viscosity.
From the authors:
What I noticed reading it:
Lmono. The monotonicity loss is motivated and a single α is reported, but
there's no with/without comparison showing how much it actually helps.
This is a clean, early (2019) instance of the thesis behind the whole
2026-06-24 batch, and the reason it matters for BLADE:
a manipulation state variable that is not visually evaluable.
BLADE learns predicate classifiers from vision (e.g.
turned-on(faucet)) by cropping to the relevant object — but a
predicate like is_full(cup) or fill_level(cup, h) on an
opaque container is precisely the case where a visual classifier has no
signal to read. This paper shows that quantity lives in the audio channel: the
air-resonance sweep is the fill-level estimator. So PouringNet is a
concrete worked example of a continuous predicate / state estimator grounded in
sound rather than pixels, exactly the gap BLADE's purely-visual predicate layer
leaves open. It also speaks to BLADE's noted limitation that "continuous
parameters (pour amount...) sit entirely inside the diffusion policy" —
audio gives an explicit, plannable readout of pour fill that a symbolic
precondition (not is_full) could be grounded against. Relative to
BLADE the paper is narrow (single task, single fixed trajectory, no planning,
no language) and is best read as a sensing primitive, not a manipulation
framework; its value to my line of work is as the audio-grounded-predicate
existence proof.
Directly related batch papers: SonicSense and Active Acoustic Sensing push the same acoustic-sensing-for-manipulation idea to in-hand object/property estimation; Play it by Ear, ManiWAV, and See, Hear, and Feel fold contact/pouring audio into policies rather than a standalone estimator; Clarke et al. (granular material audio) [19] is the closest prior cited here.
Inspired by how human judge the liquid height during pouring with their hearing, we try to design a model that can estimate the position of liquid height with audio vibration. — §I / p.1
To make sense of audio vibration, using the length of the air column as groundtruth could be more generative and indicative than using the height of the liquid level. — §IV / p.4
The generalization performances of our PouringNet are negatively correlated to the viscosity of the liquid. — §V-B4 / p.7