VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang · Westlake University / Zhejiang University / Xi'an Jiaotong University · ICLR 2025 · arXiv:2502.13508

One-liner. VLAS is the first end-to-end VLA that takes raw speech (not ASR-transcribed text) as its instruction modality, so it can exploit non-semantic cues in the audio — chiefly the speaker's voiceprint — to retrieve personal knowledge and disambiguate otherwise under-specified commands like "pick up my cup."

Problem & motivation

Existing VLAs ride on vision-language models that only consume text instructions. Supporting spoken commands means bolting on an external ASR system, which (i) inflates the pipeline and propagates transcription errors into the policy, and (ii) discards everything in the speech beyond its words: identity, emotion, intonation. The authors argue that for customized home-care tasks, that discarded auxiliary information is exactly what lets the robot pick the right cup. The motivating example (Fig 1): given "Please pick up my cup," a text-only VLA fails because it cannot know which cup is meant, while VLAS recognizes the speaker, retrieves "This is Li's voice. He owns a green cup," and succeeds. The key question posed: how to integrate speech directly into a VLA "to produce a simpler and better end-user experience."

Method

VLAS extends the open-source LLaVA VLM (CLIP ViT vision encoder, Vicuna LLM) to accept speech, then fine-tunes it into a robot policy. Architecture in Fig 2.

Speech path. The speech instruction s is first turned into an 80-bin mel-spectrogram (STFT, padded to 3000 frames), which a frozen Whisper encoder converts into 1500 hidden representations. A time-dimension reshape with reduction factor 5 shortens this sequence to 300 speech tokens, and an MLP projects them into the shared LLM semantic space, mirroring how LLaVA's MLP projects vision tokens. Formally the LLM input is concat(MLP_s(Emb_s(s)), Tok_l(RAG(s)), MLP_v(Emb_v(O))) (Eq 1), and actions are produced autoregressively (Eq 2).
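A minimal sketch of the time-dimension reshape, assuming it folds each group of 5 adjacent encoder frames into one wider frame (the notes do not spell out the exact operation). The function name `fold_time` and the toy feature width of 4 are illustrative, not from the paper.

```python
from typing import List

Frame = List[float]


def fold_time(frames: List[Frame], factor: int = 5) -> List[Frame]:
    """Concatenate every `factor` consecutive frames into one wider frame."""
    if len(frames) % factor != 0:
        raise ValueError("sequence length must be divisible by the reduction factor")
    folded: List[Frame] = []
    for start in range(0, len(frames), factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)  # stack along the feature axis
        folded.append(merged)
    return folded


# Toy input: 1500 frames of width 4 (a real Whisper hidden width is much larger).
hidden = [[float(t)] * 4 for t in range(1500)]
tokens = fold_time(hidden, factor=5)
print(len(tokens), len(tokens[0]))  # 300 20
```

Under this assumption the sequence gets 5x shorter while each token gets 5x wider, so the projection MLP sees all the original features but the LLM spends far fewer context positions on speech.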

Voice RAG. The novel piece for personalization. The raw speech is passed to a pre-trained speaker-identification / voiceprint module, whose voiceprint keys into an external database of personal knowledge (e.g. "Voice 2: I have a blue block; I keep things in drawers"). The retrieved text is tokenized and concatenated alongside the speech and vision tokens as background context. This avoids from-scratch training of the retrieval module.
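The lookup can be sketched as a nearest-neighbour search over enrolled voiceprints. This is an illustration only: the 3-dimensional embeddings, the cosine-similarity matching, the 0.5 threshold, and the names `DATABASE` and `retrieve` are all assumptions, and the real speaker-identification module is a pre-trained network that is not shown.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Hypothetical database: speaker id -> (enrolled voiceprint, personal knowledge).
DATABASE: Dict[str, Tuple[List[float], str]] = {
    "voice_1": ([0.9, 0.1, 0.0], "This is Li's voice. He owns a green cup."),
    "voice_2": ([0.0, 0.2, 0.9], "Voice 2: I have a blue block; I keep things in drawers."),
}


def retrieve(voiceprint: List[float],
             database: Dict[str, Tuple[List[float], str]],
             threshold: float = 0.5) -> str:
    """Return the knowledge text of the best-matching speaker, or '' if none matches."""
    best_text, best_score = "", threshold
    for enrolled, text in database.values():
        score = cosine(voiceprint, enrolled)
        if score > best_score:
            best_text, best_score = text, score
    return best_text


query = [0.8, 0.2, 0.1]  # made-up voiceprint extracted from the spoken command
print(retrieve(query, DATABASE))
```

The returned string plays the role of RAG(s) in Eq 1: it would be tokenized and placed in the LLM context next to the speech and vision tokens.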

Action tokenization. Continuous actions are discretized into 256 uniformly spaced bins, reusing the 256 least-frequent LLM vocabulary tokens as action tokens. A 7-DoF action [x, y, z, φ, θ, ψ, g] (Eq 3) is emitted as a space-separated string and de-tokenized to continuous values at deploy time.
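A small sketch of the uniform 256-bin discretization and its inverse. The per-dimension range of [-1, 1], the bin-centre decoding, and the function names are assumptions for illustration; the mapping from bin index to the 256 least-frequent vocabulary tokens is omitted.

```python
from typing import List

N_BINS = 256


def tokenize(action: List[float], low: List[float], high: List[float]) -> List[int]:
    """Map each continuous dimension to a bin index in [0, N_BINS - 1]."""
    ids = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)                    # clip to the assumed range
        idx = int((a - lo) / (hi - lo) * N_BINS)   # uniform bins
        ids.append(min(idx, N_BINS - 1))           # the upper bound falls in the last bin
    return ids


def detokenize(ids: List[int], low: List[float], high: List[float]) -> List[float]:
    """Map bin indices back to continuous values at the bin centres."""
    return [lo + (i + 0.5) * (hi - lo) / N_BINS for i, lo, hi in zip(ids, low, high)]


# 7-DoF action [x, y, z, phi, theta, psi, g] with an assumed range of [-1, 1].
low, high = [-1.0] * 7, [1.0] * 7
action = [0.1, -0.5, 0.0, 0.25, -1.0, 1.0, 1.0]
ids = tokenize(action, low, high)
as_string = " ".join(str(i) for i in ids)          # space-separated, as in the notes
recovered = detokenize([int(t) for t in as_string.split()], low, high)
print(as_string)
print(recovered)
```

With this scheme the round-trip error per dimension is at most half a bin width, here 1/256 of the unit for a range of width 2.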

Three-stage training (Fig 4). Stage I — Speech Alignment: coarse modality alignment via speech recognition on LibriSpeech-360; only the speech-encoder→LLM MLP and the LLM backbone update (speaker-ID module optionally co-trained). Stage II — Speech Question Answering: fine-tune on the curated SQA dataset plus LLaVA's original VQA and LibriSpeech-100, updating all components except the frozen image/speech encoders — yielding VLAS-Base, a multimodal LM handling both text-image and speech-image instructions. Stage III — Robot Manipulation Fine-tuning: behavior-cloning on the CSI dataset (image observations + speech/text instructions + manipulation trajectories) to get the final VLAS policy.
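The freeze/train split of the first two stages can be written down as a small lookup table. The module names below are hypothetical labels, and Stage III is left out because these notes do not state which components it updates; the optional speaker-ID co-training in Stage I is also not modeled.

```python
from typing import Dict, List, Set

# Hypothetical component names for the architecture described above.
MODULES: List[str] = ["vision_encoder", "speech_encoder", "vision_mlp", "speech_mlp", "llm"]

# Trainable components per stage, as stated in the notes.
TRAINABLE: Dict[str, Set[str]] = {
    "stage_1_speech_alignment": {"speech_mlp", "llm"},
    "stage_2_speech_qa": {"vision_mlp", "speech_mlp", "llm"},
}


def frozen(stage: str) -> List[str]:
    """List the components that stay frozen in the given stage."""
    return [m for m in MODULES if m not in TRAINABLE[stage]]


for stage in TRAINABLE:
    print(stage, "frozen:", frozen(stage))
```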

Setup

Results

CALVIN long-horizon (Table 1, success per sub-task LH-1..LH-5, avg length out of 5). VLAS with speech instructions roughly matches the text-only VLA and beats cascaded ASR pipelines:

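| Model | Instr. | LH-1 | LH-5 | Avg len |
|---|---|---|---|---|
| HULC | text | 89.2% | 33.5% | 2.90 |
| RT-1 | text | 84.4% | 22.7% | 2.45 |
| VLA (text baseline) | text | 95.5% | 58.2% | 3.80 |
| VLAS | text | 94.5% | 64.6% | 3.74 |
| RoboFlamingo+ASR | speech | 89.8% | 48.3% | 3.41 |
| VLA+ASR | speech | 88.7% | 40.2% | 3.13 |
| VLAS | speech | 94.2% | 54.6% | 3.70 |
| VLAS (real speech) | speech | 93.6% | 51.3% | 3.61 |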

End-to-end speech (3.70) beats both cascaded ASR baselines (3.13, 3.41), which the authors attribute to ASR errors amplifying into control. With real recorded speech VLAS only drops to 3.61, just 0.19 behind the text VLA baseline.

Customized tasks (Table 2, avg success). Here the Voice RAG is enabled and the gap is dramatic: the text-only VLA scores 19.2% avg (it has no access to who is speaking), while VLAS hits 86.5% (100% on Compound and Compound-Multistage Stage-1). Real speech: 78.6%. Ablating RAG collapses VLAS to 16.0%; adding RAG to the text VLA (VLA+RAG) lifts it to 82.0% — isolating Voice RAG as the driver, with the speech path contributing the voiceprint key that makes RAG possible at all.

Foundation-model sanity (Tables 3–4). VLAS-Base nearly matches LLaVA v1.5 on VQAv2/GQA/POPE (e.g. GQA 62.0, the same as LLaVA) and surpasses BLIP-2 on SGQA (50.8 vs 41.0), showing the speech grafting doesn't degrade the VLM. On LibriSpeech ASR, VLAS-Base reaches 2.79% WER vs Whisper's 2.7%, which is comparable.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is a VLA / instruction-modality paper, not a contact-sensing paper, and it is worth being honest about that up front. Unlike the touch/force/audio-vibration work in this batch, VLAS's "audio" is human speech, not contact sound; it does not touch the thesis that many manipulation predicates (is_grasped, is_inserted, is_screwed_tight) live in non-visual touch/force/sound channels. The audio here carries semantic and speaker-identity information, not physical-contact signal.

Where it is relevant to my line of work (BLADE, learning-from-language, long-horizon manipulation):

Net: adjacent to my thesis — a modality-extension VLA whose real contribution is retrieval-augmented personalization, not non-visual physical sensing.

Quotable

VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. — Abstract
The transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. — Abstract
The raw speech command is processed by the speaker identification module to extract a voiceprint. This voiceprint serves as a key to query an external database, retrieving relevant information. — §3.1, Voice RAG / p.5

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: