Audio-VLA: Adding Contact Audio Perception to Vision-Language-Action Model for Robotic Manipulation

Xiangyi Wei, Haotian Zhang, Xinyi Cao, Siyu Xie, Weifeng Ge, Yang Li, Changbo Wang · East China Normal University / Fudan University · 2025 · arXiv:2511.09958

One-liner. Audio-VLA bolts a contact-microphone audio stream onto an OpenVLA-style policy — AudioCLIP audio encoder + DINOv2/SigLIP vision + proprio, fused into a Llama2-7B backbone with LoRA — so the policy can "hear" contact events (friction, impact, scooping resistance) that vision literally cannot see, and it ships an audio-augmented LIBERO/RLBench plus a Task Completion Rate metric to measure dynamic-process perception rather than just end-state success.

Problem & motivation

Mainstream VLA models (RT-2, OpenVLA, π0, CoT-VLA) perceive the world through vision alone. The authors argue this is a fundamental blind spot for contact-rich manipulation: whether an eraser is actually pressing the board, whether a spoon has scooped enough oatmeal, whether two parts have seated — these are contact states, often occluded, that vision cannot reliably read. Tactile sensors (GelSight, BioTac) are the usual fix but are expensive, require end-effector integration, sense only local contact, and sample too slowly for fast tool–object dynamics. The paper's bet is that contact audio from cheap piezo microphones is a complementary, non-invasive, high-frequency signal that penetrates occlusion and is invariant to lighting/color domain shift. The gap they target: prior audio-manipulation work (ManiWAV, Hearing Touch) lacks high-level language/semantic understanding, and existing VLA frameworks have no audio pathway or audio-enhanced training/eval environment.

Method

Audio-VLA is a four-component architecture (Fig 2): a multi-modal encoder, a multi-modal projector, a Llama2-7B language module, and an MLP action head.

Multi-modal encoder. Vision: two LoRA fine-tuned ViTs, DINOv2 and SigLIP, encode a third-person image and a wrist image (Np=256 patches each); their features are concatenated into Fvis. Audio: the encoder is AudioCLIP, which is first further pre-trained on the ManiWAV robot dataset starting from its original weights and then LoRA fine-tuned inside Audio-VLA. Crucially, audio is processed independently at each timestep (not over a long window), which gives instantaneous contact detection and tight temporal alignment with vision and action. In the Frequency B-Spline Projection (FBSP) layer the authors shorten the window length (Lwin=1024) and hop length (Lhop=256) to raise temporal resolution while keeping enough frequency resolution for high-frequency contact events. The complex spectrogram Z is converted to a power spectrum, log-scaled, and passed through a ResNeXt to give Faud (Eq 4–5). Proprio: an MLP encodes joint state into Fprop.
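To make the window/hop trade-off concrete, here is a minimal sketch of a log-power spectrogram with Lwin=1024 and Lhop=256. It is not the paper's FBSP layer: a fixed Hann-window STFT stands in for the learnable B-spline projection, the 48 kHz sample rate is borrowed from the recording rate quoted for the simulator library, and the dB-style log scaling and the `eps` floor are assumptions.

```python
import numpy as np

def log_power_spectrogram(wave, win_len=1024, hop_len=256, eps=1e-10):
    """Log-scaled power spectrogram from a plain Hann-window STFT.

    Stand-in for the learnable FBSP layer: same window and hop settings,
    but a fixed Fourier basis instead of learned B-spline wavelets.
    """
    window = np.hanning(win_len)
    n_frames = 1 + (len(wave) - win_len) // hop_len
    frames = np.stack([
        wave[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    Z = np.fft.rfft(frames, axis=1)        # complex spectrogram, shape (frames, bins)
    power = np.abs(Z) ** 2                 # power spectrum
    return 10.0 * np.log10(power + eps)    # log scaling (dB-style, an assumption)

SAMPLE_RATE = 48_000  # assumption: the recording rate quoted for the sound library
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
wave = np.sin(2 * np.pi * 3000.0 * t)      # one second of a synthetic 3 kHz tone

spec = log_power_spectrogram(wave)
hop_ms = 1000.0 * 256 / SAMPLE_RATE        # time step between frames, in ms
bin_hz = SAMPLE_RATE / 1024                # spacing between frequency bins, in Hz
peak_bin = int(spec.mean(axis=0).argmax()) # bin holding the 3 kHz tone
```

Under these assumptions each frame advances by roughly 5.3 ms while frequency bins stay about 47 Hz apart, which is the kind of balance the text describes: fine enough in time for impacts, fine enough in frequency to separate contact signatures.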

Multi-modal projector. Three projections (vision = linear, audio = 3-layer MLP, proprio = 2-layer MLP) map each modality into the LLM's embedding dimension and are concatenated along the sequence dimension (Eq 7), explicitly preserving intra-modal temporal continuity and cross-modal alignment so the LLM can reason over the temporal evolution of contact.
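The projector step can be sketched as three per-modality maps into a shared width followed by concatenation along the sequence axis. Everything numeric here is hypothetical (the embedding width `D_LLM`, the feature widths, the token counts, the random weights, and the ReLU nonlinearity); only the layer depths per modality and the sequence-wise concatenation come from the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LLM = 64  # hypothetical embedding width; a real 7B LLM is far wider

def linear(x, d_out):
    """One randomly initialised linear map (illustration only, no bias)."""
    w = rng.standard_normal((x.shape[-1], d_out)) * 0.02
    return x @ w

def mlp(x, d_out, n_layers):
    """n_layers linear maps with ReLU in between (ReLU is an assumption)."""
    for _ in range(n_layers - 1):
        x = np.maximum(linear(x, d_out), 0.0)
    return linear(x, d_out)

# Hypothetical encoder outputs, shaped (tokens, feature_width).
f_vis = rng.standard_normal((512, 32))   # e.g. two views x 256 patches
f_aud = rng.standard_normal((4, 48))     # hypothetical audio token count
f_prop = rng.standard_normal((1, 8))     # one proprioceptive token

tokens = np.concatenate(
    [
        linear(f_vis, D_LLM),            # vision: single linear projection
        mlp(f_aud, D_LLM, n_layers=3),   # audio: 3-layer MLP
        mlp(f_prop, D_LLM, n_layers=2),  # proprio: 2-layer MLP
    ],
    axis=0,                              # concatenate along the sequence dimension
)
```

Concatenating along the sequence axis, rather than summing or pooling, keeps every modality's tokens individually addressable by the LLM's attention.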

Language module + action head. The tokenized instruction is concatenated with the multimodal token sequence and with K·D learnable empty action embeddings, then fed to Llama2-7B with parallel decoding (Eq 9–10). Action hidden states are pulled out and a 4-layer MLP action head regresses an action block of K future steps, each of dimension D, in [-1,1] (Eq 11). Training is plain mean L1 imitation loss against expert action chunks (Eq 12) — no flow-matching, no discretization.
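The training objective described above reduces to a mean absolute error over an action chunk. A minimal sketch follows; the chunk length `K`, action dimension `D`, and the toy values are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np

def mean_l1_loss(pred_chunk, expert_chunk):
    """Mean L1 distance between a predicted and an expert action chunk.

    Both arrays have shape (K, D): K future steps, D action dimensions.
    """
    return float(np.mean(np.abs(pred_chunk - expert_chunk)))

K, D = 8, 7                        # hypothetical chunk length and action dimension
pred = np.zeros((K, D))            # toy prediction, all zeros
expert = np.full((K, D), 0.5)      # toy expert actions inside [-1, 1]
expert[0, 0] = -0.5                # sign does not matter under an absolute error

loss = mean_l1_loss(pred, expert)  # every element is off by 0.5
```

Because the head regresses continuous values directly, the loss needs no action tokenizer or noise schedule, which is the contrast with discretized and flow-matching policies drawn in the text.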

Audio-enhanced simulators. A second contribution: they augment LIBERO and RLBench with collision-triggered audio. Real contact sounds are recorded at 48 kHz on physical objects matched to the sim objects by material and dimension, then indexed into a library by material pair, interaction type, and force magnitude. At sim runtime, collisions (gripper–object and grasped-object–environment) are detected; the impact velocity and force query the library, and the retrieved clip is modulated (amplitude by force, pitch by object size, duration for continuous contact). The authors promise an open-source release.
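The retrieve-then-modulate loop might look roughly like the sketch below. The key layout, force-bin edges, reference force, and the resampling-based pitch shift are all assumptions made for illustration; the paper's library also uses impact velocity in the query, which this sketch leaves out.

```python
import numpy as np

SAMPLE_RATE = 48_000  # recording rate quoted in the text

def force_bin(force_n, edges=(1.0, 5.0, 20.0)):
    """Map a contact force in newtons to a coarse bin index (edges are hypothetical)."""
    return int(np.searchsorted(edges, force_n))

def retrieve_clip(library, material_pair, interaction, force_n):
    """Look up a recorded clip by (material pair, interaction type, force bin)."""
    return library[(material_pair, interaction, force_bin(force_n))]

def modulate(clip, force_n, ref_force_n, size_ratio):
    """Scale amplitude with force and shift pitch with object size.

    Pitch is shifted by naive resampling: size_ratio > 1 stretches the clip,
    lowering its pitch and lengthening it. A real system would likely
    decouple pitch from duration.
    """
    gain = force_n / ref_force_n
    n_out = int(round(len(clip) * size_ratio))
    src_idx = np.linspace(0, len(clip) - 1, n_out)
    stretched = np.interp(src_idx, np.arange(len(clip)), clip)
    return gain * stretched

# Toy library with one 10 ms synthetic "impact" standing in for a real recording.
t = np.arange(480) / SAMPLE_RATE
library = {(("metal", "wood"), "impact", 2): np.sin(2 * np.pi * 1000.0 * t)}

clip = retrieve_clip(library, ("metal", "wood"), "impact", force_n=10.0)
out = modulate(clip, force_n=10.0, ref_force_n=5.0, size_ratio=2.0)
```

Here a 10 N impact falls in the third force bin, the clip is played back at twice the reference amplitude, and the doubled size ratio stretches it to twice its original length.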

TCR metric. Task Completion Rate = Achieved Progress / Task Target ∈ [0,1] (Eq 13), a continuous per-task progress score (e.g., area of marks erased; weight of oatmeal scooped) meant to capture dynamic-process understanding that binary success rate misses.
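As a worked example of the ratio, here is a minimal sketch of TCR. The clamp to [0, 1] is an assumption (the text gives only the range, not how overshoot is handled), and the erased-area numbers are made up for illustration.

```python
def task_completion_rate(achieved, target):
    """Task Completion Rate: achieved progress over task target, kept in [0, 1]."""
    if target <= 0:
        raise ValueError("target must be positive")
    ratio = achieved / target
    return min(max(ratio, 0.0), 1.0)  # clamping overshoot is an assumption

# Hypothetical erasing episode: 30 of 40 square centimetres of marks removed.
tcr_partial = task_completion_rate(30.0, 40.0)
# Hypothetical scooping episode that overshoots the target weight.
tcr_overshoot = task_completion_rate(55.0, 50.0)
```

A binary success metric would score the partial episode as 0, whereas TCR credits it with 0.75, which is the distinction the metric is meant to surface.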

Setup

Results

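Headline: in the standard sim environment Audio-VLA hits 97.6% avg on LIBERO and 55.1% on RLBench, beating all vision-only baselines; the gap is largest on contact-intensive RLBench Task2 (insert square peg) and Task3 (light bulb in), where it beats OpenVLA-OFT by 2.7 and 10.2 points. Under domain shift it keeps the highest scores (74.7% LIBERO / 41.5% RLBench) and loses the smallest share of its standard-environment performance (about 23% on LIBERO and 25% on RLBench, computed from the table below), although π0-FAST's drop in absolute points is slightly smaller (21.4 vs 22.9 on LIBERO, 11.8 vs 13.6 on RLBench).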

Method      | LIBERO Avg | RLBench Avg | LIBERO (shift) | RLBench (shift)
π0-FAST     | 85.6       | 43.9        | 64.2           | 32.1
CoT-VLA     | 83.9       | n/a         | 57.8           | n/a
OpenVLA-OFT | 97.1       | 48.1        | 71.0           | 35.5
Audio-VLA   | 97.6       | 55.1        | 74.7           | 41.5
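How much each method degrades under shift depends on whether the drop is measured in points or as a share of the starting score. The short calculation below uses only the numbers in the table above; the `drops` helper and the dictionary layout are illustrative.

```python
# (standard, shifted) average success rates in %, copied from the table above.
results = {
    "π0-FAST": {"LIBERO": (85.6, 64.2), "RLBench": (43.9, 32.1)},
    "CoT-VLA": {"LIBERO": (83.9, 57.8)},  # no RLBench numbers reported
    "OpenVLA-OFT": {"LIBERO": (97.1, 71.0), "RLBench": (48.1, 35.5)},
    "Audio-VLA": {"LIBERO": (97.6, 74.7), "RLBench": (55.1, 41.5)},
}

def drops(standard, shifted):
    """Return (absolute drop in points, relative drop in % of the standard score)."""
    absolute = standard - shifted
    return round(absolute, 1), round(100.0 * absolute / standard, 1)

table = {
    method: {bench: drops(*scores) for bench, scores in benches.items()}
    for method, benches in results.items()
}

# Method with the smallest relative drop on each benchmark.
least_relative = {
    bench: min(
        (m for m in table if bench in table[m]),
        key=lambda m: table[m][bench][1],
    )
    for bench in ("LIBERO", "RLBench")
}
```

On these numbers Audio-VLA has the smallest relative drop on both benchmarks, while π0-FAST has the smallest absolute drop in points, so a claim about which method degrades least is sensitive to which definition is used.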

Real-world (Table II, success rate / TCR, %): Audio-VLA roughly triples success over the vision-only baselines on both tasks in the seen condition, and holds up far better under unseen conditions where baselines collapse to near zero.

Method      | EAWM SR / TCR (seen) | S5GO SR / TCR (seen) | EAWM SR / TCR (unseen) | S5GO SR / TCR (unseen)
π0-FAST     | 20 / 34              | 10 / 23              | 0 / 16                 | 0 / 11
OpenVLA-OFT | 20 / 45              | 10 / 34              | 10 / 26                | 0 / 24
Audio-VLA   | 60 / 73              | 30 / 72              | 30 / 57                | 20 / 56

Ablations. On RLBench (Table III): full 55.1 avg vs vision-only 48.0 (audio worth ~7 points) and w/o-LoRA 51.5 (audio-encoder LoRA worth ~3.6). Real-world (Table IV): audio roughly triples or doubles success on EAWM; LoRA on the audio encoder doubles EAWM success — the authors stress the frozen pretrained AudioCLIP cannot process manipulation-specific sounds without fine-tuning. Where it does not dominate: on RLBench Task2 (square-peg insertion) absolute numbers stay low for everyone (15.0% for Audio-VLA), so "contact-intensive" wins are relative, not solved.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This paper is a direct, concrete instance of the thesis behind this batch: many manipulation predicates are not visually evaluable. "The eraser is in contact with the board," "enough oatmeal has been scooped," "the peg is seated" — these are exactly the is_pressed / is_full / is_inserted predicates that live in sound/force, not pixels. Audio-VLA's EAWM example ("vision guides the eraser to the whiteboard but cannot determine surface contact") is almost a textbook argument for why a BLADE-style predicate like in-contact(eraser, board) would need a non-visual classifier. Relative to BLADE, this is the opposite design philosophy: BLADE imposes structure (LLM-proposed PDDL operators, learned predicate classifiers, symbolic bi-level planning) over imitation-learned skills; Audio-VLA is end-to-end scale-everything — it dumps audio tokens into a 7B LLM and regresses actions with an L1 loss, no symbols, no planner, no explicit predicate. The lesson I'd extract for my own line: the sensing argument is modality-agnostic w.r.t. the policy class. If contact audio carries the contact-state signal, then a structured policy could learn an in-contact predicate classifier off the same audio stream — an audio analogue of BLADE's visual predicate classifiers — and get interpretability + planning that this monolithic VLA forgoes. The TCR metric is also worth noting: it is a crude continuous progress signal, but it gestures at the same need BLADE has for intermediate state evaluation along a long-horizon execution, not just terminal success. This is squarely on-theme.

Quotable

Unlike vision, audio signals penetrate occlusions and capture temporal dynamics that are visually imperceptible… acoustic information reveals contact quality between objects and environmental changes during dynamic interactions—natural blind spots of visual perception that are indispensable for intelligent manipulation. — §I, Introduction / p.1
In EAWM, vision guides the eraser to the whiteboard but cannot determine surface contact; in S5GO, it cannot model spoon depth-weight relationships. Audio-VLA overcomes these limitations through acoustic signatures of interaction physics. — §IV-B, Real-World Robot Experiment Results / p.6

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: