VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang · Westlake University / Zhejiang University / Xi'an Jiaotong University · ICLR 2025 · arXiv:2502.13508

One-liner. VLAS is the first end-to-end VLA that takes raw speech (not ASR-transcribed text) as its instruction modality, so it can exploit non-semantic cues in the audio — chiefly the speaker's voiceprint — to retrieve personal knowledge and disambiguate otherwise under-specified commands like "pick up my cup."

Problem & motivation

Existing VLAs ride on vision-language models that only consume text instructions. Supporting spoken commands means bolting on an external ASR system, which (i) inflates the pipeline and propagates transcription errors into the policy, and (ii) discards everything in the speech beyond its words — identity, emotion, intonation. The authors argue that for customized home-care tasks, that discarded auxiliary information is exactly what lets the robot pick the right cup. The motivating example (Fig 1): given "Please pick up my cup," a text-only VLA fails (it can't know which cup), while VLAS, hearing "This is Li's voice. He owns a green cup," succeeds. The key question posed: how to integrate speech directly into a VLA "to produce a simpler and better end-user experience."

Method

VLAS extends the open-source LLaVA VLM (CLIP ViT vision encoder, Vicuna LLM) to accept speech, then fine-tunes it into a robot policy. Architecture in Fig 2.

Speech path. A speech instruction s is first converted into an 80-bin mel-spectrogram (STFT, padded to 3000 frames), which a frozen Whisper encoder turns into 1500 hidden representations. A time-dimension reshape with reduction factor 5 shortens the sequence to 300 speech tokens, and an MLP projects them into the shared LLM semantic space, mirroring how LLaVA's MLP projects vision tokens. Formally the LLM input is concat(MLP_s(Emb_s(s)), Tok_l(RAG(s)), MLP_v(Emb_v(O))) (Eq 1), and actions are produced autoregressively (Eq 2).
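The sequence-length arithmetic in the speech path can be checked with a short sketch. It assumes the reshape folds each group of 5 consecutive hidden states into one wider vector by concatenating their features; the notes only state the reduction factor, so that folding rule is an assumption. The function name and the tiny hidden width are hypothetical placeholders, not the real Whisper dimensions.

```python
def downsample_by_reshape(hidden_states, factor):
    """Fold `factor` consecutive time steps into one wider vector (assumed rule)."""
    n_steps = len(hidden_states)
    if n_steps % factor != 0:
        raise ValueError("sequence length must be divisible by the reduction factor")
    folded = []
    for start in range(0, n_steps, factor):
        group = hidden_states[start:start + factor]
        # Concatenate the feature vectors of the grouped time steps.
        folded.append([value for step in group for value in step])
    return folded


ENCODER_STEPS = 1500   # hidden states out of the encoder (from the notes)
REDUCTION = 5          # reduction factor (from the notes)
HIDDEN_DIM = 8         # placeholder width, NOT the real encoder width

# Dummy hidden states: step t is a vector filled with the value t.
states = [[float(t)] * HIDDEN_DIM for t in range(ENCODER_STEPS)]
tokens = downsample_by_reshape(states, REDUCTION)

print(len(tokens), len(tokens[0]))
```

Under this assumption the time axis shrinks by the reduction factor while the feature axis grows by the same factor, so the MLP that follows would see wider inputs than the raw encoder outputs.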

Voice RAG. The novel piece for personalization. The raw speech is passed to a pre-trained speaker-identification module that extracts a voiceprint, and the voiceprint serves as the key into an external database of personal knowledge (e.g. "Voice 2: I have a blue block; I keep things in drawers"). The retrieved text is tokenized and concatenated alongside the speech and vision tokens as background context. Because the speaker-identification module is pre-trained, the retrieval path needs no from-scratch training.
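A minimal sketch of the voiceprint-keyed lookup, to make the retrieval step concrete. Everything here is illustrative: the 3-dimensional voiceprints, the cosine-similarity matching, the rejection threshold, and the names `DATABASE` and `retrieve_knowledge` are assumptions rather than the authors' implementation, which relies on a pre-trained speaker-identification module.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical enrolled speakers: voiceprint vector plus personal knowledge text.
DATABASE = {
    "voice_1": {"voiceprint": [0.9, 0.1, 0.0],
                "knowledge": "I own a green cup."},
    "voice_2": {"voiceprint": [0.1, 0.8, 0.3],
                "knowledge": "I have a blue block; I keep things in drawers."},
}


def retrieve_knowledge(query_voiceprint, database, threshold=0.7):
    """Return (speaker_id, knowledge) for the closest enrolled voiceprint.

    The threshold is an assumed safeguard: below it, no knowledge is returned.
    """
    best_id, best_score = None, -1.0
    for speaker_id, entry in database.items():
        score = cosine_similarity(query_voiceprint, entry["voiceprint"])
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, ""
    return best_id, database[best_id]["knowledge"]


# A query close to voice_2's enrolled voiceprint.
speaker, context = retrieve_knowledge([0.15, 0.75, 0.35], DATABASE)
print(speaker, "->", context)

# A voice unlike any enrolled speaker yields no retrieved context.
print(retrieve_knowledge([0.0, 0.0, 1.0], DATABASE))
```

The retrieved string would then be tokenized and placed next to the speech and vision tokens, as in Eq 1. The rejection branch is included only to show where an unseen-speaker policy would sit.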

Action tokenization. Continuous actions are discretized into 256 uniformly spaced bins, reusing the 256 least-frequent LLM vocabulary tokens as action tokens. A 7-DoF action [x, y, z, φ, θ, ψ, g] (Eq 3) is emitted as a space-separated string and de-tokenized to continuous values at deploy time.
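The discretize-then-de-tokenize round trip can be sketched as follows. The normalized range [-1, 1], the use of bin centers when decoding, and the use of plain bin indices as stand-ins for the 256 reused vocabulary tokens are all assumptions made for illustration; the function names are hypothetical.

```python
N_BINS = 256
LOW, HIGH = -1.0, 1.0              # assumed normalized range per action dimension
BIN_WIDTH = (HIGH - LOW) / N_BINS


def action_to_bins(action):
    """Map each continuous value to the index of its uniform bin."""
    bins = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        index = int((clipped - LOW) / BIN_WIDTH)
        bins.append(min(index, N_BINS - 1))   # HIGH itself falls in the last bin
    return bins


def bins_to_action(bins):
    """Recover continuous values as bin centers (assumed decoding rule)."""
    return [LOW + (index + 0.5) * BIN_WIDTH for index in bins]


def encode(action):
    """Emit the action as a space-separated string of bin indices."""
    return " ".join(str(index) for index in action_to_bins(action))


def decode(text):
    """Parse the space-separated string back into continuous values."""
    return bins_to_action([int(piece) for piece in text.split()])


# 7-DoF action [x, y, z, phi, theta, psi, g] with made-up values.
action = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 1.0]
encoded = encode(action)
recovered = decode(encoded)
max_error = max(abs(a - r) for a, r in zip(action, recovered))

print(encoded)
print(max_error <= BIN_WIDTH / 2)
```

With uniform bins and center decoding, the quantization error per dimension is bounded by half a bin width, which is the price paid for turning control into next-token prediction.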

Three-stage training (Fig 4). Stage I — Speech Alignment: coarse modality alignment via speech recognition on LibriSpeech-360; only the speech-encoder→LLM MLP and the LLM backbone update (speaker-ID module optionally co-trained). Stage II — Speech Question Answering: fine-tune on the curated SQA dataset plus LLaVA's original VQA and LibriSpeech-100, updating all components except the frozen image/speech encoders — yielding VLAS-Base, a multimodal LM handling both text-image and speech-image instructions. Stage III — Robot Manipulation Fine-tuning: behavior-cloning on the CSI dataset (image observations + speech/text instructions + manipulation trajectories) to get the final VLAS policy.

Setup

Datasets / benchmarks: Two new datasets the authors build with the ESPnet/VITS TTS tool (trained on LibriTTS). SQA: 185K image-audio pairs from 185K LLaVA image-text pairs, over 1,152 voices. CSI (CALVIN with Speech Instructions): CALVIN's 389 text instructions rendered into ~194K audio samples over 500 voices, across 23K episodes; half the training samples have text replaced by speech. Evaluated on the CALVIN benchmark (1,000 long-horizon tasks, chains of 5 sub-tasks) and a new self-built customized-task benchmark (Ownership / Preference / Compound tasks, 39 unseen voices). Also SGQA and LibriSpeech for foundation-model checks.
Hardware / simulator: CALVIN simulator for the main results; a real-world UR5 robot arm, with the policy fine-tuned on the Berkeley UR5 demo dataset plus the authors' own cup-picking dataset (Fig 7). Evaluation speech uses 39 novel TTS voices plus real recorded speech from 10 individuals.
Baselines: MCIL, HULC, RT-1, RoboFlamingo, and a text-only VLA built on the same LLaVA backbone. The speech-input baselines (VLA+ASR, RoboFlamingo+ASR) pipe Whisper large-v2 ASR transcripts into a text policy. Foundation-model baselines: LLaVA v1.5, BLIP-2, InstructBLIP, Qwen-VL (Tables 3–4).
Compute: not reported.

Results

CALVIN long-horizon (Table 1; success rate for completing 1 to 5 sub-tasks in a row, LH-1..LH-5, and average completed length out of 5; only LH-1 and LH-5 are reproduced here). VLAS with speech instructions roughly matches the text-only VLA and beats the cascaded ASR pipelines:

Model	Instr.	LH-1	LH-5	Avg len
HULC	text	89.2%	33.5%	2.90
RT-1	text	84.4%	22.7%	2.45
VLA (text baseline)	text	95.5%	58.2%	3.80
VLAS	text	94.5%	64.6%	3.74
RoboFlamingo+ASR	speech	89.8%	48.3%	3.41
VLA+ASR	speech	88.7%	40.2%	3.13
VLAS	speech	94.2%	54.6%	3.70
VLAS (real speech)	speech	93.6%	51.3%	3.61

End-to-end speech (3.70) beats both cascaded ASR baselines (3.13, 3.41), which the authors attribute to ASR errors amplifying into control. With real recorded speech VLAS only drops to 3.61, just 0.19 behind the text VLA baseline.
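For readers unfamiliar with the CALVIN metrics, the sketch below shows how the LH-k success rates and the average length relate, using made-up rollout outcomes. It assumes each evaluation chain is summarized by the number of consecutive sub-tasks completed before the first failure; the function name and the sample data are hypothetical.

```python
def calvin_metrics(completed_counts, chain_length=5):
    """Compute LH-1..LH-k success rates and average completed length.

    `completed_counts[i]` is how many sub-tasks chain i finished in a row.
    """
    n_chains = len(completed_counts)
    success_rates = [
        sum(1 for count in completed_counts if count >= k) / n_chains
        for k in range(1, chain_length + 1)
    ]
    average_length = sum(completed_counts) / n_chains
    return success_rates, average_length


# Made-up outcomes for 8 evaluation chains (not data from the paper).
outcomes = [5, 5, 3, 0, 2, 5, 1, 4]
rates, avg_len = calvin_metrics(outcomes)

print(rates)
print(avg_len, sum(rates))
```

The average length equals the sum of the five LH-k rates, which is why it cannot be recomputed from the LH-1 and LH-5 columns alone.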

Customized tasks (Table 2, avg success). Here the Voice RAG is enabled and the gap is dramatic: the text-only VLA scores 19.2% avg (it has no access to who is speaking), while VLAS hits 86.5% (100% on Compound and Compound-Multistage Stage-1). Real speech: 78.6%. Ablating RAG collapses VLAS to 16.0%; adding RAG to the text VLA (VLA+RAG) lifts it to 82.0% — isolating Voice RAG as the driver, with the speech path contributing the voiceprint key that makes RAG possible at all.
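The attribution argument rests on simple differences between the four average success rates quoted above. The sketch below only restates that arithmetic; it is not a formal ablation analysis, and the dictionary keys are labels chosen for this note.

```python
# Average success rates (%) on the customized tasks, as quoted above.
avg_success = {
    "VLA (text)": 19.2,
    "VLA + RAG": 82.0,
    "VLAS without RAG": 16.0,
    "VLAS": 86.5,
}

# Gain from adding retrieval, holding the instruction modality fixed.
rag_gain_text = round(avg_success["VLA + RAG"] - avg_success["VLA (text)"], 1)
rag_gain_speech = round(avg_success["VLAS"] - avg_success["VLAS without RAG"], 1)

# Gain from speech input, holding retrieval fixed.
speech_gain_with_rag = round(avg_success["VLAS"] - avg_success["VLA + RAG"], 1)
speech_gain_without_rag = round(
    avg_success["VLAS without RAG"] - avg_success["VLA (text)"], 1
)

print("RAG gain (text, speech):", rag_gain_text, rag_gain_speech)
print("Speech gain (with RAG, without RAG):",
      speech_gain_with_rag, speech_gain_without_rag)
```

Retrieval is worth roughly 60 to 70 points in either modality, while switching from text to speech moves the average by only a few points in either direction.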

Foundation-model sanity (Tables 3–4). VLAS-Base nearly matches LLaVA v1.5 on VQAv2/GQA/POPE (e.g. GQA 62.0, identical to LLaVA's score) and surpasses BLIP-2 on SGQA (50.8 vs 41.0), showing that grafting on speech doesn't degrade the VLM. On LibriSpeech ASR, VLAS-Base reaches 2.79% WER vs Whisper's 2.7%, which is comparable.
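As a reminder of what the WER figures measure, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. It is the standard textbook definition, not the authors' evaluation script; text normalization, which real ASR scoring usually applies first, is omitted, and the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


reference = "please pick up my cup"
print(word_error_rate(reference, "please pick up the cup"))  # one substitution
print(word_error_rate(reference, "pick up my cup"))          # one deletion
```

A single wrong word in a five-word command already gives 20% WER on that utterance, which illustrates why the cascaded ASR baselines can lose the one word ("my") that a customized task depends on.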

Limitations & open questions

From the authors:

The fixed downsampling reduction factor (5) on the speech spectrum is noted as suboptimal — LibriSpeech WER could improve with a better downsampling module.
Future work is framed as exploiting other auxiliary speech cues (emotion, intonation) and environmental sounds, implying the current model really only leverages voiceprint identity, not the richer paralinguistics the intro advertises.

What I noticed reading it:

The customized-task win is almost entirely a RAG result, not a speech-understanding result: VLA+RAG (text + retrieved knowledge) already reaches 82.0% vs VLAS's 86.5%. The speech modality's load-bearing contribution is narrow — supplying a voiceprint to key the database. The paper's framing ("auxiliary information in raw speech") oversells what's demonstrated; only speaker identity is actually used.
Voiceprint→knowledge retrieval assumes a clean, pre-populated per-speaker database and reliable speaker ID. No analysis of impostor voices, unseen speakers, noisy/overlapping speech, or retrieval errors — all likely in a real home.
The "customized" benchmark is self-constructed in simulation with TTS voices; Ownership/Preference semantics are injected as meta-knowledge. Whether the model reasons over preferences or just pattern-matches retrieved strings is untested.
Real-robot results (Fig 7) are qualitative success-case demonstrations on cup-picking; no success-rate table over trials, so the real-world claim is the weakest statistical part of the paper.
On standard CALVIN, end-to-end speech VLAS (3.70) is slightly below the text VLA baseline (3.80) — the speech modality costs a little when there's no personalization to exploit.

Why I care

This is a VLA / instruction-modality paper, not a contact-sensing paper; worth being honest about that up front. Unlike the touch/force/audio-vibration work in this batch, VLAS's "audio" is human speech, not contact sound; it does not touch the thesis that many manipulation predicates (is_grasped, is_inserted, is_screwed_tight) live in non-visual touch/force/sound channels. The audio here carries semantic and speaker-identity information, not physical-contact signal.

Where it is relevant to my line of work (BLADE, learning-from-language, long-horizon manipulation):

Instruction grounding. BLADE's test-time LLM translates a natural-language goal into a first-order-logic formula over learned predicates. VLAS shows that the instruction channel itself can carry sub-textual information (who is asking) that resolves otherwise-ambiguous goals — an angle BLADE's text-only goal parsing ignores. "pick up my cup" is a goal whose grounding depends on speaker identity, not just the scene.
Personalization as retrieved priors. The Voice RAG result (knowledge retrieval, not modality, does the heavy lifting) is a clean reminder that customized/long-horizon tasks often bottleneck on external knowledge rather than on perception or control — relevant to how an abstraction layer should incorporate user/world priors.
Contrast for the structure-vs-scale debate. VLAS is squarely end-to-end LLaVA-fine-tuned (the "scale-everything" pole that BLADE is a counterpoint to): no symbolic abstraction, action tokens emitted autoregressively. A useful foil.

Net: adjacent to my thesis — a modality-extension VLA whose real contribution is retrieval-augmented personalization, not non-visual physical sensing.

Quotable

VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. — Abstract

The transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. — Abstract

The raw speech command is processed by the speaker identification module to extract a voiceprint. This voiceprint serves as a key to query an external database, retrieving relevant information. — §3.1, Voice RAG / p.5

Papers cited that should likely be ingested next:

Radford et al. 2023 — Whisper (Robust Speech Recognition) — the frozen speech encoder VLAS builds on, and the ASR model used by the cascaded baselines. Forward-ref: PDF (not yet ingested).
Liu et al. 2023 — LLaVA (Visual Instruction Tuning) — the VLM backbone VLAS extends to speech.
Kim et al. 2024 — OpenVLA — the canonical open-source end-to-end VLA; closest design relative.
Brohan et al. 2023 — RT-2 — the web-knowledge VLA template VLAS positions itself against.
Shah et al. 2023 — MUTEX — unified policy for multimodal task specifications (incl. speech); the nearest prior on multi-modal instruction VLAs.
Fu et al. 2024 — VITA — cited as recent end-to-end speech-understanding LLM; relevant to the raw-speech-into-LLM trend.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

Beyond Sight (FuSe) — same cluster (multisensory VLA policies); fuses heterogeneous sensors with language. VLAS's speech-as-instruction is the language-side analogue to FuSe's sensor-side fusion.
Audio-VLA — the other "audio + VLA" paper in the batch, but its audio is contact sound for perception, where VLAS's audio is human speech for instruction — a clean contrast on what "audio in a VLA" can mean.
Tactile-VLA, OmniVTLA, ForceVLA, VLA-Touch, TaF-VLA, FAVLA, Bi-LAT — the rest of the multisensory-VLA cluster; all extend a VLA with a physical-contact modality (touch/force), whereas VLAS extends the instruction modality (speech). Together they map the space of "what non-text channel to add to a VLA."