Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation

Xiangyi Wei, Haotian Zhang, Xinyi Cao, Siyu Xie, Weifeng Ge, Yang Li, Changbo Wang · East China Normal University / Fudan University · 2025 · arXiv:2511.09958 · PDF

One-liner. Audio-VLA bolts a contact-microphone audio stream onto an OpenVLA-style policy — AudioCLIP audio encoder + DINOv2/SigLIP vision + proprio, fused into a Llama2-7B backbone with LoRA — so the policy can "hear" contact events (friction, impact, scooping resistance) that vision literally cannot see, and it ships an audio-augmented LIBERO/RLBench plus a Task Completion Rate metric to measure dynamic-process perception rather than just end-state success.

Problem & motivation

Mainstream VLA models (RT-2, OpenVLA, π₀, CoT-VLA) perceive the world through vision alone. The authors argue this is a fundamental blind spot for contact-rich manipulation: whether an eraser is actually pressing the board, whether a spoon has scooped enough oatmeal, whether two parts have seated — these are contact states, often occluded, that vision cannot reliably read. Tactile sensors (GelSight, BioTac) are the usual fix but are expensive, require end-effector integration, sense only local contact, and sample too slowly for fast tool–object dynamics. The paper's bet is that contact audio from cheap piezo microphones is a complementary, non-invasive, high-frequency signal that penetrates occlusion and is invariant to lighting/color domain shift. The gap they target: prior audio-manipulation work (ManiWAV, Hearing Touch) lacks high-level language/semantic understanding, and existing VLA frameworks have no audio pathway or audio-enhanced training/eval environment.

Method

Audio-VLA is a four-component architecture (Fig 2): a multi-modal encoder, a multi-modal projector, a Llama2-7B language module, and an MLP action head.

Multi-modal encoder. Vision: two LoRA fine-tuned ViTs (DINOv2 and SigLIP) encode a third-person and a wrist image (N_p=256 patches each); features are concatenated into F^vis. Audio: the encoder is AudioCLIP, which starts from its original weights, is further pre-trained on the ManiWAV robotic dataset, and is then LoRA fine-tuned inside Audio-VLA. Crucially, audio is processed independently at each timestep (not over a long window) for instantaneous contact detection and tight temporal alignment with vision/action. In the Frequency B-Spline Projection (FBSP) layer they shrink window length (L_win=1024) and hop length (L_hop=256) to boost temporal resolution while keeping enough frequency resolution for high-frequency contact events. The complex spectrogram Z goes through power-spectrum computation, log scaling, and a ResNeXt to give F^aud (Eq 4–5). Proprio: an MLP encodes joint state into F^prop.
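
To make the window/hop trade-off concrete, here is a minimal sketch of the audio front end. It substitutes a plain Hann-window STFT for the FBSP layer, which is an assumption on my part. The per-step chunk length (one 25 Hz control period of 44.1 kHz audio), the dB-style log scaling, and the function name are also illustrative and not taken from the paper.

```python
import numpy as np

def log_power_spectrogram(wave, win_len=1024, hop_len=256, eps=1e-10):
    """Hann-window STFT -> power spectrum -> log scaling (stand-in for FBSP)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(wave) - win_len) // hop_len
    frames = np.stack([
        wave[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    z = np.fft.rfft(frames, axis=1)       # complex spectrogram Z
    power = np.abs(z) ** 2                # power spectrum
    return 10.0 * np.log10(power + eps)   # log scaling (dB convention assumed)

sample_rate = 44_100                      # real-robot microphone rate (Setup)
samples_per_step = sample_rate // 25      # hypothetical: audio from one control step
chunk = np.random.default_rng(0).standard_normal(samples_per_step)

spec = log_power_spectrogram(chunk)
hop_ms = 1000.0 * 256 / sample_rate       # time advance per frame
bin_hz = sample_rate / 1024               # width of one frequency bin
print(spec.shape, round(hop_ms, 2), round(bin_hz, 1))
```

Under these assumptions each frame advances about 5.8 ms and each frequency bin spans about 43 Hz, which is the resolution trade the notes above describe.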

Multi-modal projector. Three projections (vision = linear, audio = 3-layer MLP, proprio = 2-layer MLP) map each modality into the LLM's embedding dimension and are concatenated along the sequence dimension (Eq 7), explicitly preserving intra-modal temporal continuity and cross-modal alignment so the LLM can reason over the temporal evolution of contact.
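
A shape-level sketch of the projector may help. The layer depths (linear, 3-layer MLP, 2-layer MLP) come from the description above; everything else is an assumption for illustration: the toy embedding width, the feature widths, the ReLU activation, the modality order, the vision token count of 2 x 256, and especially the audio token count, which the paper does not report.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LLM = 64  # toy embedding width; Llama2-7B itself uses 4096

def linear(x, d_out):
    """Randomly initialised linear map, standing in for a trained layer."""
    w = rng.standard_normal((x.shape[-1], d_out)) * 0.02
    return x @ w

def mlp(x, d_out, n_layers, d_hidden=128):
    """n_layers linear maps with ReLU in between (activation is assumed)."""
    for _ in range(n_layers - 1):
        x = np.maximum(linear(x, d_hidden), 0.0)
    return linear(x, d_out)

f_vis = rng.standard_normal((512, 96))   # 2 views x 256 patches (assumed layout)
f_aud = rng.standard_normal((4, 48))     # 4 audio tokens: hypothetical N_a
f_prop = rng.standard_normal((1, 7))     # one proprio token (assumed)

tokens = np.concatenate(
    [
        linear(f_vis, D_LLM),             # vision: linear projection
        mlp(f_aud, D_LLM, n_layers=3),    # audio: 3-layer MLP
        mlp(f_prop, D_LLM, n_layers=2),   # proprio: 2-layer MLP
    ],
    axis=0,                               # concatenate along the sequence axis
)
print(tokens.shape)
```

The point of the sketch is only that every modality ends up as rows of the same width, so the LLM sees one token sequence.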

Language module + action head. The tokenized instruction is concatenated with the multimodal token sequence and with K·D learnable empty action embeddings, then fed to Llama2-7B with parallel decoding (Eq 9–10). Action hidden states are pulled out and a 4-layer MLP action head regresses an action block of K future steps, each of dimension D, in [-1,1] (Eq 11). Training is plain mean L1 imitation loss against expert action chunks (Eq 12) — no flow-matching, no discretization.
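
The training objective is simple enough to write down directly. This is a minimal sketch of a mean L1 loss over an action chunk; the chunk length K=8 and action dimension D=7 are hypothetical values chosen for the example, and the function name is mine.

```python
import numpy as np

def l1_chunk_loss(pred_actions, expert_actions):
    """Mean absolute error over all K steps and D action dimensions."""
    return float(np.mean(np.abs(pred_actions - expert_actions)))

K, D = 8, 7                          # hypothetical chunk length and action dim
pred = np.zeros((K, D))              # stand-in for the action head's output
expert = np.full((K, D), 0.5)        # stand-in for an expert action chunk

loss = l1_chunk_loss(pred, expert)
print(loss)  # 0.5
```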

Audio-enhanced simulators. A second contribution: they augment LIBERO and RLBench with collision-triggered audio. Real contact sounds are recorded at 48 kHz on physical objects matched to sim objects by material and dimension, then indexed into a library by material-pair / interaction-type / force-magnitude. At sim runtime, collisions (gripper–object and grasped-object–environment) are detected; the impact velocity and force query the library, and the retrieved clip is modulated (amplitude by force, pitch by object size, duration for continuous contact). The authors promise an open-source release.
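
The retrieve-then-modulate idea can be sketched in a few lines. This is my own toy version, not the authors' implementation: the force bin edges, the linear gain rule, the reference force, and the clipping to [-1, 1] are all assumptions, and pitch and duration modulation are left out.

```python
import numpy as np

FORCE_BIN_EDGES = np.array([1.0, 5.0, 20.0])  # hypothetical bin edges in newtons

def force_bin(force):
    """Map a contact force to a coarse magnitude bin index."""
    return int(np.digitize(force, FORCE_BIN_EDGES))

def retrieve_and_modulate(library, material_pair, interaction, force, ref_force=5.0):
    """Look up a recorded clip and scale its amplitude by contact force."""
    clip = library[(material_pair, interaction, force_bin(force))]
    gain = force / ref_force                  # linear amplitude rule (assumed)
    return np.clip(clip * gain, -1.0, 1.0)    # keep samples in a valid range

# Toy library with a single entry: 1 ms of constant signal at 48 kHz.
library = {(("metal", "wood"), "impact", 1): np.full(48, 0.2)}

out = retrieve_and_modulate(library, ("metal", "wood"), "impact", force=2.5)
print(out[:3])
```

Because the mapping from collision to sound is a deterministic lookup plus scaling, a policy trained in this simulator could in principle key on the rule itself, which is the concern raised later in these notes.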

TCR metric. Task Completion Rate = Achieved Progress / Task Target ∈ [0,1] (Eq 13), a continuous per-task progress score (e.g., area of marks erased; weight of oatmeal scooped) meant to capture dynamic-process understanding that binary success rate misses.
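
A minimal sketch of the metric as a function. The ratio is the paper's definition; clipping overshoot to 1.0 is my assumption about how the stated [0,1] range is enforced, and the gram values are made-up examples.

```python
def task_completion_rate(achieved, target):
    """Achieved progress divided by task target, clipped to [0, 1]."""
    if target <= 0:
        raise ValueError("target must be positive")
    return min(max(achieved / target, 0.0), 1.0)

partial = task_completion_rate(4.0, 5.0)     # hypothetical: 4 g scooped of 5 g
overshoot = task_completion_rate(6.5, 5.0)   # hypothetical: overshoot is clipped
print(partial, overshoot)  # 0.8 1.0
```

Note that clipping treats overshoot as full completion, which may be too generous for a task like S5GO where scooping too much is also a miss; the notes do not say how the paper handles this.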

Setup

Datasets / benchmarks: LIBERO (Spatial/Object/Goal/Long, official dataset) and 5 RLBench tasks (close jar, insert onto square peg, light bulb in, open drawer, put item in drawer; 100 demos/task from the PerAct/Shridhar split), both audio-augmented. Also a domain-shift condition (random lighting + desktop color/texture, following The Colosseum). Two real-world tasks: Erasing All Whiteboard Marks (EAWM) and Scooping 5 g of Oatmeal (S5GO), 40 teleop demos each.
Hardware / simulator: AgileX Mobile ALOHA with dual 7-DOF Piper arms (right arm used); Orbbec wrist + third-person cameras; two piezo contact microphones on the gripper recording at 44.1 kHz (48 kHz for the sim audio library). Inference on an NVIDIA H20, ROS, 25 Hz continuous control.
Baselines: π₀-FAST, CoT-VLA (sim only), and OpenVLA-OFT in simulation; OpenVLA-OFT and π₀-FAST in the real world. All vision-only. Ablations: vision-only Audio-VLA, and Audio-VLA w/o audio-encoder LoRA.
Compute: Training on 2× NVIDIA H20 GPUs; LoRA rank 32, 50k–100k steps, batch size 8, lr 1e-4 cosine. Backbone is Llama2-7B.
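
For reference, "lr 1e-4 cosine" can be read as the schedule below. This is a sketch under assumptions the notes do not confirm: no warmup, decay to zero, and a 50k-step horizon (the lower end of the stated range).

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 50_000  # lower end of the reported 50k-100k step range
for step in (0, total // 2, total):
    print(step, cosine_lr(step, total))
```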

Results

Headline: in the standard sim environment Audio-VLA hits 97.6% avg on LIBERO and 55.1% on RLBench, beating all vision-only baselines; the gap is largest on contact-intensive RLBench Task2 (insert square peg) and Task3 (light bulb in), where it beats OpenVLA-OFT by 2.7 and 10.2 points. Under domain shift it degrades least (74.7% LIBERO / 41.5% RLBench).

Method	LIBERO Avg (%)	RLBench Avg (%)	LIBERO (shift, %)	RLBench (shift, %)
π₀-FAST	85.6	43.9	64.2	32.1
CoT-VLA	83.9	n/a	57.8	n/a
OpenVLA-OFT	97.1	48.1	71.0	35.5
Audio-VLA	97.6	55.1	74.7	41.5

Real-world (Table II, success rate / TCR, %): Audio-VLA roughly triples success over the vision-only baselines on both tasks in the seen condition, and holds up far better under unseen conditions where baselines collapse to near zero.

Method	EAWM SR / TCR (seen)	S5GO SR / TCR (seen)	EAWM SR / TCR (unseen)	S5GO SR / TCR (unseen)
π₀-FAST	20 / 34	10 / 23	0 / 16	0 / 11
OpenVLA-OFT	20 / 45	10 / 34	10 / 26	0 / 24
Audio-VLA	60 / 73	30 / 72	30 / 57	20 / 56

Ablations. On RLBench (Table III): full 55.1 avg vs vision-only 48.0 (audio worth ~7 points) and w/o-LoRA 51.5 (audio-encoder LoRA worth ~3.6). Real-world (Table IV): audio roughly triples or doubles success on EAWM; LoRA on the audio encoder doubles EAWM success. The authors stress that the frozen pretrained AudioCLIP cannot process manipulation-specific sounds without fine-tuning. Where it does not dominate: on RLBench Task2 (square-peg insertion) absolute numbers stay low for everyone (15.0% for Audio-VLA), so the "contact-intensive" wins are relative gains on a task that remains unsolved.

Limitations & open questions

From the authors:

Frozen AudioCLIP is inadequate for manipulation sounds — audio-encoder LoRA fine-tuning is necessary (stated as a finding, but it bounds out-of-the-box transfer).
Extracting contact-event information from high-frequency audio and the lack of audio-enhanced training/eval environments are framed as the core technical obstacles — their sim solution is a retrieval-and-modulate heuristic, not learned/physical audio synthesis.
No other explicit limitations or failure analysis section; no future-work discussion beyond the conclusion.

What I noticed reading it:

Tiny-N real-world statistics. Real-world success rates are multiples of 10% (e.g., 60%, 30%, 20%, 10%, 0%), implying roughly 10 trials per cell with no seeds or confidence intervals. "Threefold improvement" is 60% vs 20% — i.e., 6/10 vs 2/10. The headline claims rest on a very small sample; contrast with the seeded CALVIN tables in BLADE.
The sim audio is synthetic-by-retrieval, then tested in sim. LIBERO/RLBench gains are measured in the same audio-augmented simulators the authors built; the audio "ground truth" is a collision-triggered clip-lookup they designed. There is a real risk the policy learns to exploit the deterministic audio-generation rule rather than genuine contact physics. The real-robot results are the load-bearing evidence, and those are the small-N ones.
Domain-shift claim conflates modalities. Audio being lighting/color-invariant is almost tautological; the interesting question — does the audio encoder generalize across materials it was not trained on? — is only partially probed (darker oatmeal, different marker) and not isolated.
No tactile baseline. The motivation is "audio is cheaper than tactile," yet no tactile-VLA (e.g., Tactile-VLA, OmniVTLA) is run head to head, so the cost/benefit tradeoff is asserted, not measured.
Audio token count and latency unreported. Per-timestep audio into a 7B LLM at 25 Hz is non-trivial; neither the number of audio tokens (N_a) nor end-to-end latency is reported.
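
To put numbers on the tiny-N point in the first item above, here is a sketch that computes 95% Wilson score intervals for 6/10 vs 2/10. The trial count of 10 per cell is my inference from the multiples-of-10% pattern, not a figure stated in the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

n_trials = 10  # assumed trials per cell
lo_audio, hi_audio = wilson_interval(6, n_trials)   # Audio-VLA, EAWM seen
lo_base, hi_base = wilson_interval(2, n_trials)     # vision-only baseline
print(round(lo_audio, 3), round(hi_audio, 3))
print(round(lo_base, 3), round(hi_base, 3))
```

Under that assumption the intervals come out at roughly 31% to 83% and 6% to 51%, so they overlap: the threefold gap is suggestive but not well separated at this sample size.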

Why I care

This paper is a direct, concrete instance of the thesis behind this batch: many manipulation predicates are not visually evaluable. "The eraser is in contact with the board," "enough oatmeal has been scooped," "the peg is seated" — these are exactly the is_pressed / is_full / is_inserted predicates that live in sound/force, not pixels. Audio-VLA's EAWM example ("vision guides the eraser to the whiteboard but cannot determine surface contact") is almost a textbook argument for why a BLADE-style predicate like in-contact(eraser, board) would need a non-visual classifier. Relative to BLADE, this is the opposite design philosophy: BLADE imposes structure (LLM-proposed PDDL operators, learned predicate classifiers, symbolic bi-level planning) over imitation-learned skills; Audio-VLA is end-to-end scale-everything — it dumps audio tokens into a 7B LLM and regresses actions with an L1 loss, no symbols, no planner, no explicit predicate. The lesson I'd extract for my own line: the sensing argument is modality-agnostic w.r.t. the policy class. If contact audio carries the contact-state signal, then a structured policy could learn an in-contact predicate classifier off the same audio stream — an audio analogue of BLADE's visual predicate classifiers — and get interpretability + planning that this monolithic VLA forgoes. The TCR metric is also worth noting: it is a crude continuous progress signal, but it gestures at the same need BLADE has for intermediate state evaluation along a long-horizon execution, not just terminal success. This is squarely on-theme.

Quotable

Unlike vision, audio signals penetrate occlusions and capture temporal dynamics that are visually imperceptible… acoustic information reveals contact quality between objects and environmental changes during dynamic interactions—natural blind spots of visual perception that are indispensable for intelligent manipulation. — §I, Introduction / p.1

In EAWM, vision guides the eraser to the whiteboard but cannot determine surface contact; in S5GO, it cannot model spoon depth-weight relationships. Audio-VLA overcomes these limitations through acoustic signatures of interaction physics. — §IV-B, Real-World Robot Experiment Results / p.6

Papers cited that should likely be ingested next:

ManiWAV (Liu et al. 2024) — the "ear-in-hand" contact-audio policy; Audio-VLA reuses its dataset to pretrain AudioCLIP. PDF
Hearing Touch (Mejia et al. ICRA 2024) — audio-visual pretraining for contact-rich manipulation; closest audio-perception ancestor. PDF
OmniVTLA (Cheng et al. 2025) — the vision-tactile-LA counterpart this paper positions against (audio-vs-tactile). PDF
AudioCLIP (Guzhov et al. ICASSP 2022) — the audio encoder backbone. PDF
OpenVLA / π₀ / CoT-VLA — the vision-only VLA baselines and architectural lineage.
GelSight (Yuan et al. 2017) — the tactile-sensor foil cited as expensive/local. PDF

Newly ingested in the 2026-06-24 batch — directly relevant:

Beyond Sight (FuSe) — same cluster: fuses heterogeneous (incl. audio) sensors into a language-conditioned policy; the most direct multisensory-VLA peer.
Tactile-VLA, OmniVTLA, ForceVLA, VLA-Touch — the tactile/force VLA siblings; Audio-VLA argues audio is a cheaper substitute for the tactile signal these add.
VLAS — the other audio-into-VLA paper in the cluster, but for speech instructions rather than contact audio; clean contrast on what "audio" means in a VLA.
ManiWAV and Hearing Touch — the contact-audio-for-manipulation ancestors Audio-VLA builds on (and whose data it reuses).
AudioCLIP — the audio encoder it adopts.