One-liner. VLAS is the first end-to-end VLA that takes raw speech (not ASR-transcribed text) as its instruction modality, so it can exploit non-semantic cues in the audio — chiefly the speaker's voiceprint — to retrieve personal knowledge and disambiguate otherwise under-specified commands like "pick up my cup."
Existing VLAs ride on vision-language models that only consume text instructions. Supporting spoken commands means bolting on an external ASR system, which (i) inflates the pipeline and propagates transcription errors into the policy, and (ii) discards everything in the speech beyond its words: identity, emotion, intonation. The authors argue that for customized home-care tasks, that discarded auxiliary information is exactly what lets the robot pick the right cup. The motivating example (Fig 1): given "Please pick up my cup," a text-only VLA fails (it can't know which cup), while VLAS succeeds by recognizing the speaker and retrieving the matching knowledge ("This is Li's voice. He owns a green cup."). The key question posed: how to integrate speech directly into a VLA "to produce a simpler and better end-user experience."
VLAS extends the open-source LLaVA VLM (CLIP ViT vision encoder, Vicuna LLM) to accept speech, then fine-tunes it into a robot policy. Architecture in Fig 2.
Speech path. The speech instruction s is first turned into an 80-bin mel-spectrogram (STFT, padded to 3000 frames), which a frozen Whisper encoder converts into 1500 hidden representations. A time-dimension reshape with reduction factor 5 shrinks this to 1500 / 5 = 300 speech tokens, and an MLP projects them into the shared LLM semantic space, mirroring how LLaVA's MLP projects vision tokens. Formally the LLM input is concat(MLP_s(Emb_s(s)), Tok_l(RAG(s)), MLP_v(Emb_v(O))) (Eq 1), and actions are produced autoregressively (Eq 2).
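The shape bookkeeping in the speech path can be sketched as follows. This is a toy illustration rather than the authors' code: it assumes the reduction concatenates every 5 consecutive encoder frames along the feature axis, uses a single random linear layer as a stand-in for the speech MLP, and picks small made-up widths and sequence lengths (`D_ENC`, `D_LLM`, 12 retrieved-text tokens, 20 vision tokens) purely for readability.

```python
import numpy as np


def stack_frames(hidden: np.ndarray, factor: int = 5) -> np.ndarray:
    """Fold every `factor` consecutive time steps into one wider token."""
    t, d = hidden.shape
    if t % factor != 0:
        raise ValueError("time dimension must be divisible by the reduction factor")
    # Row i of the result is frames [i*factor, (i+1)*factor) laid end to end.
    return hidden.reshape(t // factor, factor * d)


rng = np.random.default_rng(0)
D_ENC, D_LLM = 8, 16  # toy widths (assumption), not the real model sizes

# 1500 encoder hidden states, as produced from the 3000-frame spectrogram.
hidden = rng.standard_normal((1500, D_ENC))
stacked = stack_frames(hidden)  # 1500 / 5 = 300 tokens, each 5x wider

# One linear layer stands in for MLP_s (assumption; the real projector is an MLP).
w_speech = rng.standard_normal((stacked.shape[1], D_LLM))
speech_tokens = stacked @ w_speech

# Toy placeholders for Tok_l(RAG(s)) and MLP_v(Emb_v(O)) in Eq 1.
rag_tokens = rng.standard_normal((12, D_LLM))
vision_tokens = rng.standard_normal((20, D_LLM))

# Eq 1 ordering: speech tokens, then retrieved text, then vision tokens.
llm_input = np.concatenate([speech_tokens, rag_tokens, vision_tokens], axis=0)
print(stacked.shape, llm_input.shape)  # (300, 40) (332, 16)
```

The reshape trades sequence length for token width, which keeps the speech prefix short without discarding any encoder output.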
Voice RAG. The novel piece for personalization. The raw speech is passed to a pre-trained speaker-identification / voiceprint module, and the extracted voiceprint serves as the key into an external database of personal knowledge (e.g. "Voice 2: I have a blue block; I keep things in drawers"). The retrieved text is tokenized and concatenated alongside the speech and vision tokens as background context. Because the speaker-identification module is pre-trained, the retrieval path needs no from-scratch training.
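A minimal sketch of the voiceprint-keyed lookup, under several assumptions that are not in the paper summary: voiceprints are represented as small toy vectors, matching is done by cosine similarity against enrolled speakers, and a hypothetical `threshold` rejects unknown voices. The names `DATABASE`, `cosine`, and `retrieve` are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy enrolled voiceprints and personal knowledge (vectors are made up).
DATABASE = {
    "voice_1": {"voiceprint": [0.9, 0.1, 0.0],
                "knowledge": "I own a green cup."},
    "voice_2": {"voiceprint": [0.1, 0.9, 0.2],
                "knowledge": "I have a blue block; I keep things in drawers."},
}


def retrieve(voiceprint, database=DATABASE, threshold=0.7):
    """Return the knowledge text of the closest enrolled speaker, or ''."""
    best_id, best_score = None, -1.0
    for speaker_id, entry in database.items():
        score = cosine(voiceprint, entry["voiceprint"])
        if score > best_score:
            best_id, best_score = speaker_id, score
    # Hypothetical rejection rule: unknown voices retrieve nothing.
    if best_score < threshold:
        return ""
    return database[best_id]["knowledge"]


query = [0.15, 0.85, 0.25]  # a voiceprint close to voice_2
print(retrieve(query))
```

In the full model the returned string would be tokenized and placed between the speech and vision tokens as background context.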
Action tokenization. Continuous actions are discretized into 256 uniformly spaced bins, reusing the 256 least-frequent LLM vocabulary tokens as action tokens. A 7-DoF action [x, y, z, φ, θ, ψ, g] (Eq 3) is emitted as a space-separated string and de-tokenized to continuous values at deployment time.
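The discretize and de-tokenize round trip can be sketched as below. The assumptions here are mine: every action dimension is bounded to [-1, 1], bin indices 0..255 stand in for the reused vocabulary tokens (the mapping onto the 256 least-frequent LLM tokens is omitted), and de-tokenization returns the bin center.

```python
NUM_BINS = 256


def tokenize(action, low, high):
    """Map each continuous value to one of 256 uniform bins; join with spaces."""
    ids = []
    for value, lo, hi in zip(action, low, high):
        value = min(max(value, lo), hi)  # clip into the valid range
        index = int((value - lo) / (hi - lo) * NUM_BINS)
        ids.append(min(index, NUM_BINS - 1))  # the upper bound falls in the last bin
    return " ".join(str(i) for i in ids)


def detokenize(text, low, high):
    """Recover continuous values as the center of each bin."""
    return [
        lo + (int(tok) + 0.5) * (hi - lo) / NUM_BINS
        for tok, lo, hi in zip(text.split(), low, high)
    ]


# Assumed bounds for the 7-DoF action [x, y, z, phi, theta, psi, g].
LOW, HIGH = [-1.0] * 7, [1.0] * 7

action = [0.0, -1.0, 1.0, 0.5, -0.5, 0.25, 1.0]
tokens = tokenize(action, LOW, HIGH)
recovered = detokenize(tokens, LOW, HIGH)

print(tokens)  # 128 0 255 192 64 160 255
print(max(abs(a - r) for a, r in zip(action, recovered)))
```

With uniform bins the round-trip error is bounded by half a bin width, here (2 / 256) / 2 per dimension.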
Three-stage training (Fig 4). Stage I, Speech Alignment: coarse modality alignment via speech recognition on LibriSpeech-360; only the MLP connecting the speech encoder to the LLM and the LLM backbone are updated (the speaker-ID module is optionally co-trained). Stage II, Speech Question Answering: fine-tuning on the curated SQA dataset plus LLaVA's original VQA data and LibriSpeech-100, updating all components except the frozen image and speech encoders. This yields VLAS-Base, a multimodal LM handling both text-image and speech-image instructions. Stage III, Robot Manipulation Fine-tuning: behavior cloning on the CSI dataset (image observations + speech/text instructions + manipulation trajectories) to get the final VLAS policy.
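The freeze schedule described above can be written down as a small lookup table. This is only a bookkeeping sketch: the module names (`vision_encoder`, `speech_mlp`, and so on) are hypothetical labels, the optionally co-trained speaker-ID module is left out, and the Stage III trainable set is recorded as `None` because these notes do not state which components update in that stage.

```python
# Hypothetical component names for the modules mentioned in the notes.
MODULES = {"vision_encoder", "speech_encoder", "vision_mlp", "speech_mlp", "llm"}
FROZEN_ENCODERS = {"vision_encoder", "speech_encoder"}

STAGES = {
    "I": {
        "name": "Speech Alignment",
        "data": ["LibriSpeech-360"],
        "trainable": {"speech_mlp", "llm"},
    },
    "II": {
        "name": "Speech Question Answering",
        "data": ["SQA", "LLaVA VQA", "LibriSpeech-100"],
        "trainable": MODULES - FROZEN_ENCODERS,
    },
    "III": {
        "name": "Robot Manipulation Fine-tuning",
        "data": ["CSI"],
        "trainable": None,  # not specified in these notes; left unfilled
    },
}


def frozen(stage):
    """Return the set of frozen modules for a stage, or None if unspecified."""
    trainable = STAGES[stage]["trainable"]
    if trainable is None:
        return None
    return MODULES - trainable


for key, cfg in STAGES.items():
    frozen_set = frozen(key)
    shown = "unspecified" if frozen_set is None else sorted(frozen_set)
    print(f"Stage {key} ({cfg['name']}): frozen = {shown}")
```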
CALVIN long-horizon (Table 1, success per sub-task LH-1..LH-5, avg length out of 5). VLAS with speech instructions roughly matches the text-only VLA and beats cascaded ASR pipelines:
| Model | Instr. | LH-1 | LH-5 | Avg len |
|---|---|---|---|---|
| HULC | text | 89.2% | 33.5% | 2.90 |
| RT-1 | text | 84.4% | 22.7% | 2.45 |
| VLA (text baseline) | text | 95.5% | 58.2% | 3.80 |
| VLAS | text | 94.5% | 64.6% | 3.74 |
| RoboFlamingo+ASR | speech | 89.8% | 48.3% | 3.41 |
| VLA+ASR | speech | 88.7% | 40.2% | 3.13 |
| VLAS | speech | 94.2% | 54.6% | 3.70 |
| VLAS (real speech) | speech | 93.6% | 51.3% | 3.61 |
End-to-end speech (3.70) beats both cascaded ASR baselines (3.13, 3.41), which the authors attribute to ASR errors amplifying into control. With real recorded speech VLAS only drops to 3.61, just 0.19 behind the text VLA baseline.
Customized tasks (Table 2, avg success). Here the Voice RAG is enabled and the gap is dramatic: the text-only VLA scores 19.2% avg (it has no access to who is speaking), while VLAS hits 86.5% (100% on Compound and Compound-Multistage Stage-1). Real speech: 78.6%. Ablating RAG collapses VLAS to 16.0%; adding RAG to the text VLA (VLA+RAG) lifts it to 82.0% — isolating Voice RAG as the driver, with the speech path contributing the voiceprint key that makes RAG possible at all.
Foundation-model sanity (Tables 3–4). VLAS-Base nearly matches LLaVA v1.5 on VQAv2/GQA/POPE (e.g. GQA 62.0, identical to LLaVA's score) and surpasses BLIP-2 on SGQA (50.8 vs 41.0), showing the speech grafting doesn't degrade the VLM. On LibriSpeech ASR, VLAS-Base reaches 2.79% WER vs Whisper's 2.7%, which is comparable.
From the authors:
What I noticed reading it:
This is a VLA / instruction-modality paper, not a contact-sensing paper, and it is worth being honest about that up front. Unlike the touch/force/audio-vibration work in this batch, VLAS's "audio" is human speech, not contact sound; it does not touch the thesis that many manipulation predicates (is_grasped, is_inserted, is_screwed_tight) live in non-visual touch/force/sound channels. The audio here carries semantic and speaker-identity information, not physical-contact signal.
Where it is relevant to my line of work (BLADE, learning-from-language, long-horizon manipulation):
Net: adjacent to my thesis — a modality-extension VLA whose real contribution is retrieval-augmented personalization, not non-visual physical sensing.
VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. — Abstract
The transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. — Abstract
The raw speech command is processed by the speaker identification module to extract a voiceprint. This voiceprint serves as a key to query an external database, retrieving relevant information. — §3.1, Voice RAG / p.5
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: