Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization

Jialei Huang*, Shuo Wang*, Fanqi Lin, Yihang Hu, Chuan Wen, Yang Gao · Tsinghua University / UESTC / Shanghai Jiao Tong University · 2025 · arXiv:2507.09160

One-liner. Tactile-VLA fuses tactile sensing as a native input modality into a pretrained VLA and adds a hybrid position-force action expert, arguing that a VLM already knows that "gently" means low force and a fragile pitaya needs a soft grip — so with only a few demonstrations you can connect that latent physical commonsense to real force control and get zero-shot generalization in contact-rich tasks.

Problem & motivation

Current VLAs excel at high-level reasoning and at deciding what action to take, but stumble when grounding decisions in the fine-grained physical realities of contact — they answer "pick up the apple" but not "pick up the apple softly." Vision and language give high-level semantic information; tactile sensing gives the rich, local, temporally dynamic feedback (friction, compliance, material sensitivity) that contact-rich tasks need. Prior haptics-in-robotics work treats touch as a supplementary perceptual modality, not something that directly drives the policy's action generation. The paper's thesis: a VLM's pretrained prior already contains semantic understanding of physical interaction, and the missing piece is a bridge from that latent knowledge to the robot's tactile sensors and to force as an explicit action output.

Method

Tactile-VLA fuses vision, language, tactile, and proprioception, and outputs force-aware actions via a hybrid position-force controller. The architecture is in Fig 2.

Token-level multimodal fusion. Inputs are encoded separately, then concatenated into a single prefix that is fed to a pretrained VLM trunk (Gemma 2.6B). A ViT encoder E'vis encodes the last H frames (as in π0); language goes through a common tokenizer Elang; a simple MLP E'ψ encodes the concatenated history of H tactile measurements into a single fused token. The prefix is S_t = [E'vis(I_{t-H+1}), …, E'vis(I_t), Elang(L_t), E'ψ([T_{t-H+1}, …, T_t])] (Eq 1). A non-causal attention mask over the prefix lets vision, language, and tactile tokens cross-attend freely.
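To make the shape of Eq 1 concrete, here is a minimal pure-Python sketch of the prefix construction and the non-causal mask. It is not the authors' code: `build_prefix`, `full_attention_mask`, and `toy_tactile_encoder` are hypothetical names, embeddings are plain lists of floats, and the "MLP" is replaced by a trivial stand-in so the example runs without any ML library.

```python
from typing import Callable, List

Token = List[float]  # one embedding vector


def build_prefix(
    vision_tokens_per_frame: List[List[Token]],
    language_tokens: List[Token],
    tactile_history: List[List[float]],
    tactile_encoder: Callable[[List[float]], Token],
) -> List[Token]:
    """Concatenate per-modality tokens into one prefix, in the order of Eq 1."""
    prefix: List[Token] = []
    for frame_tokens in vision_tokens_per_frame:  # oldest frame first
        prefix.extend(frame_tokens)
    prefix.extend(language_tokens)
    # The H tactile readings are concatenated and mapped to ONE fused token.
    flat = [value for reading in tactile_history for value in reading]
    prefix.append(tactile_encoder(flat))
    return prefix


def full_attention_mask(n: int) -> List[List[bool]]:
    """Non-causal mask: every prefix token may attend to every other one."""
    return [[True] * n for _ in range(n)]


def toy_tactile_encoder(flat: List[float], dim: int = 4) -> Token:
    """Stand-in for the tactile MLP (assumption): mean of the inputs, repeated."""
    mean = sum(flat) / len(flat)
    return [mean] * dim


# H = 2 frames with 3 vision tokens each, 5 language tokens, 2 tactile readings.
vision = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
language = [[1.0] * 4 for _ in range(5)]
tactile = [[0.2, 0.4], [0.6, 0.8]]
prefix = build_prefix(vision, language, tactile, toy_tactile_encoder)
mask = full_attention_mask(len(prefix))
```

The point to notice is that the whole tactile history contributes a single token, so the prefix grows with the number of frames and words but not with the number of tactile samples.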

Tactile-aware action expert. A 300M action expert consumes the prefix and outputs an augmented action vector a_t that explicitly specifies a target position Ptarget and a target contact force Ftarget. Both targets come from the imitation demonstrations; by putting force directly in the action space, the model learns to control interaction intensity. Shared components initialize from π0; the tactile encoder and modified action expert are randomly initialized. The whole model is finetuned end-to-end with a Conditional Flow Matching (CFM) objective that penalizes deviation in both kinematic and force dimensions, which is what forces the model to map linguistic nuance ("gently") to a force magnitude (e.g., 0.5N).
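The sketch below illustrates what a flow-matching loss over an augmented action looks like when force is one of the action dimensions. It assumes a simple linear noise-to-action path with a constant target velocity, which is a common CFM choice but is not confirmed here as the paper's exact formulation; `cfm_loss` and `predict_velocity` are hypothetical names, and the example action values are illustrative.

```python
import random
from typing import Callable, List


def cfm_loss(
    action: List[float],
    predict_velocity: Callable[[List[float], float], List[float]],
    rng: random.Random,
) -> float:
    """One-sample flow-matching loss on an augmented action [position..., force]."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # flow time in [0, 1)
    # Assumed linear path from noise (t = 0) to the demonstrated action (t = 1).
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # Along that path the target velocity is constant: action - noise.
    target = [a - n for n, a in zip(noise, action)]
    pred = predict_velocity(x_t, t)
    # The mean squared error covers kinematic AND force dimensions alike.
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(action)


# Illustrative augmented action: x, y, z position (m) plus a 0.5 N target force.
demo_action = [0.10, 0.02, 0.30, 0.5]


def zero_model(x_t: List[float], t: float) -> List[float]:
    """An untrained stand-in that always predicts zero velocity."""
    return [0.0] * len(x_t)


loss_untrained = cfm_loss(demo_action, zero_model, random.Random(0))
```

Because the force entry sits in the same vector as the position entries, an error in predicted force is penalized by the same term as an error in predicted position.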

Hybrid position-force controller. The strategy is position-dominant (most manipulation is kinematic; force matters only during contact), following Raibert & Craig 1981, with an indirect force-control adjustment inspired by impedance control (Hogan 1985). Unlike classic impedance control aiming for passive compliance, the goal here is active tracking of a target force. The controller measures force error ΔF = Ftarget − Fmeasured and applies a corrective positional adjustment only when ‖ΔF‖ > τ (a threshold, for smoothness): Phybrid = Ptarget + K·ΔF if over threshold, else Ptarget (Eq 2), with K a gain matrix. A PID controller then actuates the joints to Phybrid. Two force channels are decoupled: the gripper's Cartesian position regulates the net external force on the object, while the gripper width regulates the internal grasping force — i.e., how firmly the object is held.
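Eq 2 is small enough to write out directly. The following is a minimal sketch, assuming 3-D Cartesian vectors, a Euclidean norm for ‖ΔF‖, and a gain matrix given as nested lists; `hybrid_target` and all numeric values are hypothetical, and the sign of K depends on axis conventions that the summary does not specify.

```python
import math
from typing import List


def hybrid_target(
    p_target: List[float],
    f_target: List[float],
    f_measured: List[float],
    gain: List[List[float]],
    tau: float,
) -> List[float]:
    """Return Phybrid per Eq 2: Ptarget + K*dF if ||dF|| > tau, else Ptarget."""
    delta_f = [ft - fm for ft, fm in zip(f_target, f_measured)]
    if math.sqrt(sum(d * d for d in delta_f)) <= tau:
        return list(p_target)  # within the deadband: pure position control
    # Matrix-vector product K * dF gives the positional correction.
    correction = [sum(k * d for k, d in zip(row, delta_f)) for row in gain]
    return [p + c for p, c in zip(p_target, correction)]


# Illustrative numbers (assumptions): 1 mm of correction per newton of error.
K = [[0.001, 0.0, 0.0], [0.0, 0.001, 0.0], [0.0, 0.0, 0.001]]
p_hybrid = hybrid_target(
    p_target=[0.0, 0.0, 0.100],
    f_target=[0.0, 0.0, 2.0],
    f_measured=[0.0, 0.0, 0.5],
    gain=K,
    tau=0.2,
)
```

With a 1.5 N shortfall along z, the commanded z position shifts by 1.5 mm; a shortfall below τ leaves the position target untouched, which is what keeps the controller position-dominant outside of contact.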

Tactile-VLA-CoT. A reasoning-augmented variant that runs a chain-of-thought via the VLM's own pretrained decoder. Force and tactile feedback become reasoning cues, not just policy inputs: triggered at fixed intervals, the model first decides whether the task succeeded, and on failure analyzes the cause from sensory feedback (e.g., "grasping force is sufficient, but normal force is too low") and emits a new corrective instruction (e.g., "wipe the board again, but apply more downward force"). It is finetuned on a small targeted dataset where each failure sample (e.g., wiping with slippage) is paired with a language annotation of the failure's cause — this both preserves the VLM's general reasoning (mitigating catastrophic forgetting) and extends reasoning to the tactile modality.
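The control flow described above (execute, check success at intervals, diagnose, re-instruct) can be sketched as a plain loop. Everything here is hypothetical scaffolding rather than the paper's implementation: `run_with_reflection` and the three callables are invented names, the toy stubs replace the policy and the VLM, and the 5.0 N success threshold is an assumption chosen only so the example terminates. The 3.5 N and 6.7 N values echo the wiping numbers reported in the results.

```python
from typing import Callable, Dict, List, Tuple


def run_with_reflection(
    instruction: str,
    execute_segment: Callable[[str], Dict[str, float]],
    judge: Callable[[str, Dict[str, float]], Tuple[bool, str]],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[bool, List[str]]:
    """Execute, judge success from sensory feedback, and re-instruct on failure."""
    history = [instruction]
    for _ in range(max_rounds):
        feedback = execute_segment(instruction)  # forces seen during the segment
        success, diagnosis = judge(instruction, feedback)
        if success:
            return True, history
        instruction = revise(instruction, diagnosis)  # corrective instruction
        history.append(instruction)
    return False, history


def toy_execute(instruction: str) -> Dict[str, float]:
    """Stand-in policy: pushes harder only when told to."""
    force = 6.7 if "more downward force" in instruction else 3.5
    return {"normal_force_n": force}


def toy_judge(instruction: str, feedback: Dict[str, float]) -> Tuple[bool, str]:
    """Stand-in for the VLM's success check (threshold is an assumption)."""
    if feedback["normal_force_n"] >= 5.0:
        return True, "ok"
    return False, "normal force is too low"


def toy_revise(instruction: str, diagnosis: str) -> str:
    """Stand-in for the VLM emitting a corrective instruction."""
    return instruction + ", but apply more downward force"


ok, instructions = run_with_reflection(
    "wipe the board", toy_execute, toy_judge, toy_revise
)
```

In this toy run the first attempt fails the force check, the instruction is rewritten once, and the second attempt succeeds.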

Data collection (Fig 4). Built on UMI (a portable handheld gripper, Chi et al. 2024), augmented with dual high-resolution tactile sensors capturing both normal and shear forces, so human operators give demonstrations that are explicitly guided by force. Tactile is captured at 100Hz, visual at 20Hz, then tactile is down-sampled to align with visual frames; timestamps are aligned across streams before each session.
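One simple way to down-sample a 100Hz tactile stream onto 20Hz frame timestamps is nearest-timestamp matching. The summary does not say which scheme the authors use, so treat this as an assumed illustration; `align_tactile_to_frames` is a hypothetical helper, and timestamps are integer milliseconds to avoid floating-point drift.

```python
from bisect import bisect_left
from typing import List, Sequence


def align_tactile_to_frames(
    tactile_ts: Sequence[int],
    tactile_vals: Sequence[float],
    frame_ts: Sequence[int],
) -> List[float]:
    """For each frame timestamp, pick the tactile sample nearest in time.

    Assumes tactile_ts is sorted ascending and both streams share one clock.
    """
    aligned = []
    for t in frame_ts:
        i = bisect_left(tactile_ts, t)
        # The nearest sample is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(tactile_ts)]
        best = min(candidates, key=lambda j: abs(tactile_ts[j] - t))
        aligned.append(tactile_vals[best])
    return aligned


# One second of data: tactile every 10 ms (100Hz), frames every 50 ms (20Hz).
tactile_times = list(range(0, 1000, 10))
tactile_values = [float(k) for k in range(100)]
frame_times = list(range(0, 1000, 50))
per_frame = align_tactile_to_frames(tactile_times, tactile_values, frame_times)
```

Each visual frame ends up paired with exactly one tactile reading, so 100 tactile samples per second reduce to 20.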

Setup

Results

Across all three RQs Tactile-VLA beats the force-blind π0 baselines, often by large margins. RQ1 (force-language instruction following). On USB/Charger insertion success (Table 1):

Model | USB (%) | Charger (%)
π0-base | 5 | 40
π0-fast | 0 | 25
Tactile-VLA | 35 | 90

On applied force (Table 2), Tactile-VLA cleanly separates "softly" (0.51N) from "hard" (2.57N) on the trained USB task, interpolates unseen adverbs ("gently" 0.75N, "firmly" 1.98N), and even extrapolates ("harder" 2.94N > "hard"). Crucially it transfers this zero-shot to the language-force-free charger task (4.68N for "softly" vs 9.13N for "hard"), whereas the π0 baselines show no correlation between adverb and applied force.

RQ2 (tactile commonsense, Table 3, grasping success without deformation, 10 trials/object). Tactile-VLA reaches 80–100% on both in-domain and OOD objects across Solid&Heavy, Solid&Light, and Fragile&Light categories. The starkest gap is fragile items: on OOD BlueBerry / PaperBox it scores 100 / 90 while both baselines score 0 / 0 — the baselines crush fragile objects. Fig 6 shows it applies hard / medium / soft force to heavy / light / fragile objects respectively, even for unseen ones.

RQ3 (reasoning-based adaptation, Table 4, success over ID whiteboard vs OOD blackboard).

Type | In-Domain (Whiteboard) | Out-of-Domain (Blackboard)
π0-base | 40 | 0
π0-fast | 45 | 0
Tactile-VLA | 80 | 15
Tactile-VLA-CoT | 75 | 80

On the zero-shot blackboard, the CoT variant recognizes the failure from force feedback, reasons that more force is needed, and autonomously raises the applied force from the 3.5N default (trained at 5N) to 6.7N (~34% above training), succeeding where plain Tactile-VLA (15%) and the baselines (0%) fail. Note the CoT variant slightly under-performs plain Tactile-VLA in-domain (75 vs 80) — the reasoning loop helps OOD but is not free on the familiar task.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This paper is a direct, concrete instance of the batch's central thesis, and a sharp foil to BLADE. BLADE's abstraction layer is purely categorical: continuous parameters like grasp force live entirely inside the diffusion policy, and one of BLADE's flagged open questions was exactly "force/continuous parameters." Tactile-VLA shows the complementary move — lift force to a first-class, language-conditioned action output (Ftarget), so that an adverb like "gently" maps to 0.5N. That is precisely the kind of predicate that is not visually evaluable: is_grasped, grip_is_gentle, surface_pressure_sufficient, is_screwed_tight live in touch/force, not pixels. The wiping-CoT result — "normal force is too low, apply more downward pressure" — is a tactile predicate evaluation dressed as chain-of-thought, which is suggestive for predicate-from-touch invention on top of a BLADE-style abstraction layer.

Where it diverges from my line: there is no symbolic planning, no operator library, no long-horizon composition — it's a single end-to-end force-aware policy with a reasoning wrapper. So it's relevant as the low-level force-grounding substrate a BLADE-style planner could sit on top of, not as a planning-abstractions paper itself. The "VLM prior already contains physical semantics" claim is also the same structure-vs-scale debate BLADE sits on the other side of: here scale (a pretrained VLM) supplies the force commonsense that BLADE would want to make explicit and symbolic.

Quotable

A key finding is that the VLM's prior knowledge already contains semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks. — Abstract / p.1
This capability is essential for differentiating commands that share the same motion but differ in force, such as "insert the USB firmly" versus "insert the USB gently". — §2.1 / p.3
The gripper's Cartesian position is used to exclusively regulate the net external force applied to an object, while the gripper width is used in parallel to control the internal grasping force, thus dictating how firmly the object is held. — §2.2 / p.4

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: