One-liner. Audio-VLA bolts a contact-microphone audio stream onto an OpenVLA-style policy — AudioCLIP audio encoder + DINOv2/SigLIP vision + proprio, fused into a Llama2-7B backbone with LoRA — so the policy can "hear" contact events (friction, impact, scooping resistance) that vision literally cannot see, and it ships an audio-augmented LIBERO/RLBench plus a Task Completion Rate metric to measure dynamic-process perception rather than just end-state success.
Mainstream VLA models (RT-2, OpenVLA, π0, CoT-VLA) perceive the world through vision alone. The authors argue this is a fundamental blind spot for contact-rich manipulation: whether an eraser is actually pressing the board, whether a spoon has scooped enough oatmeal, whether two parts have seated — these are contact states, often occluded, that vision cannot reliably read. Tactile sensors (GelSight, BioTac) are the usual fix but are expensive, require end-effector integration, sense only local contact, and sample too slowly for fast tool–object dynamics. The paper's bet is that contact audio from cheap piezo microphones is a complementary, non-invasive, high-frequency signal that penetrates occlusion and is invariant to lighting/color domain shift. The gap they target: prior audio-manipulation work (ManiWAV, Hearing Touch) lacks high-level language/semantic understanding, and existing VLA frameworks have no audio pathway or audio-enhanced training/eval environment.
Audio-VLA is a four-component architecture (Fig 2): a multi-modal encoder, a multi-modal projector, a Llama2-7B language module, and an MLP action head.
Multi-modal encoder. Vision: two ViTs (DINOv2 and SigLIP, both LoRA fine-tuned)
encode a third-person and a wrist image (N_p = 256 patches each); the features
are concatenated into F_vis. Audio: the encoder is AudioCLIP, first further
pre-trained on the ManiWAV robotic dataset starting from its original weights,
then LoRA fine-tuned inside Audio-VLA. Crucially, audio is processed
independently at each timestep (not over a long window) for instantaneous
contact detection and tight temporal alignment with vision and action. In the
Frequency B-Spline Projection (FBSP) layer they shrink the window length
(L_win = 1024) and hop length (L_hop = 256) to boost temporal resolution while
keeping enough frequency resolution for high-frequency contact events. The
complex spectrogram Z goes through power-spectrum computation, log scaling,
and a ResNeXt to give F_aud (Eq 4–5). Proprio: an MLP encodes the joint state
into F_prop.
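To make the audio front end concrete, here is a minimal NumPy sketch of the "complex spectrogram, then power spectrum, then log scaling" step using the window and hop lengths quoted above. It is an illustration under assumptions, not the paper's code: the learnable FBSP layer is swapped for a fixed Hann-window STFT, the ResNeXt stage is omitted, and the function name and the 48 kHz sample rate in the demo are hypothetical choices.

```python
import numpy as np

L_WIN = 1024  # window length quoted in the notes
L_HOP = 256   # hop length quoted in the notes


def log_power_spectrogram(wave, l_win=L_WIN, l_hop=L_HOP, eps=1e-10):
    """Fixed Hann-window STFT -> power spectrum -> log scaling.

    Stand-in for the learnable FBSP layer (an assumption, not the paper's code).
    Returns an array of shape (n_frames, l_win // 2 + 1).
    """
    wave = np.asarray(wave, dtype=np.float64)
    if wave.size < l_win:                       # pad very short clips to one frame
        wave = np.pad(wave, (0, l_win - wave.size))
    n_frames = 1 + (wave.size - l_win) // l_hop
    window = np.hanning(l_win)
    frames = np.stack([wave[i * l_hop: i * l_hop + l_win] * window
                       for i in range(n_frames)])
    z = np.fft.rfft(frames, axis=1)             # complex spectrogram Z
    power = np.abs(z) ** 2                      # power spectrum
    return np.log(power + eps)                  # log scaling


if __name__ == "__main__":
    sr = 48_000                                 # assumed sample rate for the demo
    t = np.arange(sr // 10) / sr                # 100 ms of audio
    click = np.sin(2 * np.pi * 6000 * t) * np.exp(-200 * t)  # decaying "impact"
    print(log_power_spectrogram(click).shape)   # (15, 513)
```

With a 256-sample hop, a 100 ms clip at the assumed rate yields 15 frames, which is the kind of fine temporal grid the notes say the authors want for short contact events.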
Multi-modal projector. Three projections (vision = linear, audio = 3-layer MLP, proprio = 2-layer MLP) map each modality into the LLM's embedding dimension and are concatenated along the sequence dimension (Eq 7), explicitly preserving intra-modal temporal continuity and cross-modal alignment so the LLM can reason over the temporal evolution of contact.
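The projector described above can be sketched as a shape exercise in NumPy. Everything numeric here is an assumption for illustration: the toy widths, the ReLU activations, the token counts (512 vision tokens from two 256-patch images, a guessed number of audio tokens, one proprio token), and the vision/audio/proprio ordering. Only the layer depths (linear, 3-layer MLP, 2-layer MLP) and the sequence-axis concatenation come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
D_LLM = 64                          # toy width; Llama2-7B's hidden size is 4096
D_VIS, D_AUD, D_PROP = 48, 32, 8    # hypothetical encoder output widths
N_VIS, N_AUD = 512, 4               # assumed token counts (N_AUD is a guess)


def init(dims):
    """Random (weight, bias) pairs standing in for trained layers."""
    return [(rng.normal(0.0, 0.02, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]


def mlp(x, layers):
    """Linear layers with ReLU in between (activation is an assumption)."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x


proj_vis = init([D_VIS, D_LLM])                  # linear projection
proj_aud = init([D_AUD, D_LLM, D_LLM, D_LLM])    # 3-layer MLP
proj_prop = init([D_PROP, D_LLM, D_LLM])         # 2-layer MLP


def project_and_concat(f_vis, f_aud, f_prop):
    """Map each modality to the LLM width, then concatenate along the sequence axis."""
    tokens = [mlp(f_vis, proj_vis), mlp(f_aud, proj_aud), mlp(f_prop, proj_prop)]
    return np.concatenate(tokens, axis=0)


seq = project_and_concat(rng.normal(size=(N_VIS, D_VIS)),
                         rng.normal(size=(N_AUD, D_AUD)),
                         rng.normal(size=(1, D_PROP)))
print(seq.shape)  # (517, 64)
```

Concatenating along the sequence axis rather than the feature axis keeps each modality's tokens separately addressable by attention, which is what lets the LLM attend across modalities at the token level.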
Language module + action head. The tokenized instruction is
concatenated with the multimodal token sequence and with K·D
learnable empty action embeddings, then fed to Llama2-7B with
parallel decoding (Eq 9–10). Action hidden states are pulled out
and a 4-layer MLP action head regresses an action block of K future
steps, each of dimension D, in [-1,1] (Eq 11).
Training uses a plain mean L1 imitation loss against expert action chunks
(Eq 12): no flow matching, no discretization.
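As a small illustration of the training objective, the sketch below computes a mean L1 loss between a predicted action chunk and an expert chunk of the same shape (K steps, D dimensions each). The values of K and D and the function name are hypothetical; the notes do not give them.

```python
import numpy as np

K, D = 8, 7  # hypothetical chunk length and action dimension


def l1_chunk_loss(pred, expert):
    """Mean absolute error over all K*D entries of an action chunk."""
    pred = np.asarray(pred, dtype=np.float64)
    expert = np.asarray(expert, dtype=np.float64)
    if pred.shape != expert.shape:
        raise ValueError(f"shape mismatch: {pred.shape} vs {expert.shape}")
    return float(np.mean(np.abs(pred - expert)))


pred = np.zeros((K, D))           # stand-in for the action head's output
expert = np.full((K, D), 0.5)     # stand-in for the expert action chunk
print(l1_chunk_loss(pred, expert))  # 0.5
```

Because the loss is a direct regression on continuous actions, no action tokenizer or sampling procedure is needed at inference time.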
Audio-enhanced simulators. A second contribution: they augment LIBERO and RLBench with collision-triggered audio. Real contact sounds are recorded at 48 kHz on physical objects matched to sim objects by material and dimension, indexed into a library by material-pair / interaction-type / force-magnitude. At sim runtime, collisions (gripper–object and grasped-object–environment) are detected; the impact velocity and force query the library and the retrieved clip is modulated (amplitude by force, pitch by object size, duration for continuous contact). Promised open-source.
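The lookup-and-modulate pipeline described above might look roughly like the following. This is a sketch under stated assumptions, not the released simulator code: the key layout, the force thresholds, the reference force, the naive resampling used for pitch, and the direction of the size effect (larger object, lower pitch) are all hypothetical. Only the 48 kHz recording rate and the three modulation axes (amplitude by force, pitch by object size, duration for continuous contact) come from the notes.

```python
import numpy as np

SR = 48_000                # recording rate quoted in the notes
FORCE_BINS = (1.0, 5.0)    # hypothetical thresholds in newtons: low / mid / high


def force_bin(force_n):
    """Index of the force-magnitude bucket a contact force falls into."""
    return int(np.searchsorted(FORCE_BINS, force_n))


def make_key(mat_a, mat_b, interaction, force_n):
    """Library key: unordered material pair, interaction type, force bucket."""
    return (tuple(sorted((mat_a, mat_b))), interaction, force_bin(force_n))


def modulate(clip, force_n, size_ratio=1.0, duration_s=None, ref_force_n=5.0):
    """Scale amplitude by force, shift pitch by size, fit duration if requested."""
    out = np.asarray(clip, dtype=np.float64) * (force_n / ref_force_n)
    # Naive pitch shift: stretch the waveform by size_ratio (assumed mapping).
    n_out = max(1, int(round(len(out) * size_ratio)))
    src = np.linspace(0, len(out) - 1, n_out)
    out = np.interp(src, np.arange(len(out)), out)
    if duration_s is not None:                  # continuous contact: loop and trim
        n_target = int(round(duration_s * SR))
        reps = -(-n_target // len(out))         # ceiling division
        out = np.tile(out, reps)[:n_target]
    return out


# Toy library with one recorded clip (a constant signal, purely for illustration).
library = {make_key("steel", "wood", "impact", 3.0): np.ones(4800)}

# A collision reported by the simulator, with materials in the other order.
clip = library[make_key("wood", "steel", "impact", 4.0)]
sound = modulate(clip, force_n=10.0, size_ratio=2.0)
print(len(sound), sound.max())
```

Sorting the material pair makes the key symmetric, so a gripper-on-object and an object-on-gripper collision retrieve the same recording.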
TCR metric. Task Completion Rate = Achieved Progress /
Task Target ∈ [0,1] (Eq 13), a continuous per-task progress
score (e.g., area of marks erased; weight of oatmeal scooped) meant to capture
dynamic-process understanding that binary success rate misses.
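For concreteness, the ratio can be written as a one-line function. The clamp to [0, 1] and the example quantities are assumptions made for this sketch; the notes only state that TCR is achieved progress divided by the task target and lies in [0, 1].

```python
def task_completion_rate(achieved, target):
    """TCR = achieved progress / task target, clamped to [0, 1] (clamp is assumed)."""
    if target <= 0:
        raise ValueError("task target must be positive")
    return min(max(achieved / target, 0.0), 1.0)


# Hypothetical episode: 18 g of oatmeal scooped against a 25 g target.
print(task_completion_rate(18.0, 25.0))  # 0.72
```

An episode like this one would count as a failure under binary success rate but still register 72% progress under TCR, which is the distinction the metric is meant to expose.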
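Headline: in the standard sim environment Audio-VLA hits 97.6% avg on LIBERO and 55.1% on RLBench, beating all vision-only baselines; the gap is largest on contact-intensive RLBench Task2 (insert square peg) and Task3 (light bulb in), where it beats OpenVLA-OFT by 2.7 and 10.2 points. Under domain shift it keeps the highest scores (74.7% LIBERO / 41.5% RLBench) and loses the smallest fraction of its standard-environment score on both benchmarks (about 23% and 25%), although its absolute point drops (22.9 and 13.6) are not the smallest in the table.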
| Method | LIBERO Avg | RLBench Avg | LIBERO (shift) | RLBench (shift) |
|---|---|---|---|---|
| π0-FAST | 85.6 | 43.9 | 64.2 | 32.1 |
| CoT-VLA | 83.9 | n/a | 57.8 | n/a |
| OpenVLA-OFT | 97.1 | 48.1 | 71.0 | 35.5 |
| Audio-VLA | 97.6 | 55.1 | 74.7 | 41.5 |
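
The averages in the table are enough to check how much each method loses under domain shift, in absolute points and as a fraction of its standard-environment score. The snippet below only re-derives numbers from the table; the dictionary layout and function name are illustrative.

```python
# (standard, shifted) average success rates in %, copied from the table above.
scores = {
    "π0-FAST":     {"LIBERO": (85.6, 64.2), "RLBench": (43.9, 32.1)},
    "CoT-VLA":     {"LIBERO": (83.9, 57.8)},            # RLBench not reported
    "OpenVLA-OFT": {"LIBERO": (97.1, 71.0), "RLBench": (48.1, 35.5)},
    "Audio-VLA":   {"LIBERO": (97.6, 74.7), "RLBench": (55.1, 41.5)},
}


def drops(standard, shifted):
    """Return (absolute drop in points, relative drop in % of the standard score)."""
    abs_drop = standard - shifted
    return abs_drop, 100.0 * abs_drop / standard


for bench in ("LIBERO", "RLBench"):
    for method, per_bench in scores.items():
        if bench in per_bench:
            a, r = drops(*per_bench[bench])
            print(f"{bench:8s} {method:12s} -{a:4.1f} pts  -{r:4.1f}%")
```

On these numbers Audio-VLA has the smallest relative drop on both benchmarks, while π0-FAST has the smallest absolute drop, so which method "degrades least" depends on the measure used.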
Real-world (Table II, success rate / TCR, %): Audio-VLA roughly triples success over the vision-only baselines on both tasks in the seen condition, and holds up far better under unseen conditions where baselines collapse to near zero.
| Method | EAWM SR / TCR (seen) | S5GO SR / TCR (seen) | EAWM (unseen) | S5GO (unseen) |
|---|---|---|---|---|
| π0-FAST | 20 / 34 | 10 / 23 | 0 / 16 | 0 / 11 |
| OpenVLA-OFT | 20 / 45 | 10 / 34 | 10 / 26 | 0 / 24 |
| Audio-VLA | 60 / 73 | 30 / 72 | 30 / 57 | 20 / 56 |
Ablations. On RLBench (Table III): full 55.1 avg vs vision-only 48.0 (audio worth ~7 points) and w/o-LoRA 51.5 (audio-encoder LoRA worth ~3.6). Real-world (Table IV): audio roughly triples or doubles success on EAWM; LoRA on the audio encoder doubles EAWM success — the authors stress the frozen pretrained AudioCLIP cannot process manipulation-specific sounds without fine-tuning. Where it does not dominate: on RLBench Task2 (square-peg insertion) absolute numbers stay low for everyone (15.0% for Audio-VLA), so "contact-intensive" wins are relative, not solved.
From the authors:
What I noticed reading it:
N_a and end-to-end latency are not reported.

This paper is a direct, concrete instance of the thesis behind this batch:
many manipulation predicates are not visually evaluable.
"The eraser is in contact with the board," "enough oatmeal has been scooped,"
"the peg is seated" — these are exactly the is_pressed /
is_full / is_inserted predicates that live in
sound/force, not pixels. Audio-VLA's EAWM example ("vision guides the eraser to
the whiteboard but cannot determine surface contact") is almost a textbook
argument for why a BLADE-style
predicate like in-contact(eraser, board) would need a non-visual
classifier. Relative to BLADE, this is the opposite design philosophy:
BLADE imposes structure (LLM-proposed PDDL operators, learned predicate
classifiers, symbolic bi-level planning) over imitation-learned skills;
Audio-VLA is end-to-end scale-everything — it dumps audio tokens into a 7B
LLM and regresses actions with an L1 loss, no symbols, no planner, no explicit
predicate. The lesson I'd extract for my own line: the sensing argument is
modality-agnostic w.r.t. the policy class. If contact audio carries the
contact-state signal, then a structured policy could learn an
in-contact predicate classifier off the same audio stream
— an audio analogue of BLADE's visual predicate classifiers — and
get interpretability + planning that this monolithic VLA forgoes. The TCR metric
is also worth noting: it is a crude continuous progress signal, but it gestures
at the same need BLADE has for intermediate state evaluation along a
long-horizon execution, not just terminal success. This is squarely on-theme.
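To make the "audio predicate classifier" idea tangible, here is a deliberately tiny sketch: a one-feature logistic regression that learns an in-contact predicate from per-frame log energy. Everything in it is hypothetical and synthetic (the noise amplitudes standing in for contact and free-space audio, the single feature, the training settings); it is neither BLADE's nor Audio-VLA's method, only an illustration that such a predicate can be trained on the audio stream alone.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME = 1024  # samples per audio frame (matches the window length in the notes)


def log_energy(frame):
    """Single scalar feature: log of the mean squared amplitude."""
    return np.log(np.mean(np.square(frame)) + 1e-10)


def synth_frame(in_contact):
    """Synthetic stand-in for real audio: loud noise in contact, quiet otherwise."""
    amp = 0.2 if in_contact else 0.01
    return amp * rng.standard_normal(FRAME)


def fit_in_contact(frames, labels, lr=0.5, steps=500):
    """Fit a 1-D logistic regression and return it as a boolean predicate."""
    x = np.array([log_energy(f) for f in frames])
    mu, sigma = x.mean(), x.std() + 1e-8
    xs = (x - mu) / sigma                        # standardize the feature
    y = np.asarray(labels, dtype=np.float64)
    w, b = 0.0, 0.0
    for _ in range(steps):                       # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-(w * xs + b)))
        w -= lr * np.mean((p - y) * xs)
        b -= lr * np.mean(p - y)

    def in_contact(frame):
        return bool(w * (log_energy(frame) - mu) / sigma + b > 0.0)

    return in_contact


labels = [i % 2 == 0 for i in range(200)]
frames = [synth_frame(flag) for flag in labels]
in_contact = fit_in_contact(frames, labels)
print(in_contact(synth_frame(True)), in_contact(synth_frame(False)))
```

A real predicate would need recorded contact audio and richer features, but the returned callable has the shape a symbolic planner expects: a state observation in, a truth value out.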
Unlike vision, audio signals penetrate occlusions and capture temporal dynamics that are visually imperceptible… acoustic information reveals contact quality between objects and environmental changes during dynamic interactions—natural blind spots of visual perception that are indispensable for intelligent manipulation. — §I, Introduction / p.1
In EAWM, vision guides the eraser to the whiteboard but cannot determine surface contact; in S5GO, it cannot model spoon depth-weight relationships. Audio-VLA overcomes these limitations through acoustic signatures of interaction physics. — §IV-B, Real-World Robot Experiment Results / p.6
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant: