Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization

Jialei Huang*, Shuo Wang*, Fanqi Lin, Yihang Hu, Chuan Wen, Yang Gao · Tsinghua University / UESTC / Shanghai Jiao Tong University · 2025 · arXiv:2507.09160

One-liner. Tactile-VLA adds tactile sensing as a native input modality to a pretrained VLA and adds a hybrid position-force action expert. The argument is that a VLM already knows that "gently" means low force and that a fragile pitaya needs a soft grip, so a few demonstrations are enough to connect that latent physical commonsense to real force control and obtain zero-shot generalization in contact-rich tasks.

Problem & motivation

Current VLAs excel at high-level reasoning and at deciding what action to take, but stumble when grounding decisions in the fine-grained physical realities of contact — they answer "pick up the apple" but not "pick up the apple softly." Vision and language give high-level semantic information; tactile sensing gives the rich, local, temporally dynamic feedback (friction, compliance, material sensitivity) that contact-rich tasks need. Prior haptics-in-robotics work treats touch as a supplementary perceptual modality, not something that directly drives the policy's action generation. The paper's thesis: a VLM's pretrained prior already contains semantic understanding of physical interaction, and the missing piece is a bridge from that latent knowledge to the robot's tactile sensors and to force as an explicit action output.

Method

Tactile-VLA fuses vision, language, tactile, and proprioception, and outputs force-aware actions via a hybrid position-force controller. The architecture is in Fig 2.

Token-level multimodal fusion. Inputs are separately encoded then concatenated into a single prefix fed to a pretrained VLM trunk (Gemma 2.6B). A ViT encoder E'_vis encodes the last H frames (as in π₀); language goes through a common tokenizer E_lang; a simple MLP E'_ψ encodes the concatenated history of H tactile measurements into a single fused token. The prefix is S_t = [E'_vis(I_{t-H+1}), …, E'_vis(I_t), E_lang(L_t), E'_ψ([T_{t-H+1}, …, T_t])] (Eq 1). A non-causal attention mask over the prefix lets vision, language, and tactile tokens cross-attend freely.
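The prefix construction in Eq 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the vision and language tokens are already embedded, stands in a hypothetical two-layer ReLU MLP for the tactile encoder, and uses toy dimensions (three vision tokens per frame, an 8-dimensional embedding) that are not from the paper.

```python
import numpy as np

def build_prefix(vis_tokens, lang_tokens, tactile_history, w1, b1, w2, b2):
    """Concatenate vision, language, and one fused tactile token into a prefix.

    vis_tokens:      (N_vis, D) tokens from the last H frames (already encoded)
    lang_tokens:     (N_lang, D) embedded instruction tokens
    tactile_history: (H, T) raw tactile readings for the last H steps
    w1, b1, w2, b2:  weights of a hypothetical two-layer MLP tactile encoder
    """
    flat = tactile_history.reshape(-1)           # concatenate the H readings
    hidden = np.maximum(flat @ w1 + b1, 0.0)     # ReLU hidden layer
    tactile_token = (hidden @ w2 + b2)[None, :]  # (1, D): single fused token
    prefix = np.concatenate([vis_tokens, lang_tokens, tactile_token], axis=0)
    # Non-causal prefix mask: every prefix token may attend to every other.
    mask = np.ones((prefix.shape[0], prefix.shape[0]), dtype=bool)
    return prefix, mask

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
H, T, D, hidden_dim = 4, 6, 8, 16
prefix, mask = build_prefix(
    vis_tokens=rng.normal(size=(H * 3, D)),
    lang_tokens=rng.normal(size=(5, D)),
    tactile_history=rng.normal(size=(H, T)),
    w1=rng.normal(size=(H * T, hidden_dim)), b1=np.zeros(hidden_dim),
    w2=rng.normal(size=(hidden_dim, D)), b2=np.zeros(D),
)
```

The point of the sketch is the shape bookkeeping: however long the tactile history is, it contributes exactly one row to the prefix, and the all-true mask is what lets that row attend to (and be attended by) every vision and language token.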

Tactile-aware action expert. A 300M action expert consumes the prefix and outputs an augmented action vector a_t that explicitly specifies a target position P_target and a target contact force F_target. Both targets come from the imitation demonstrations — by putting force directly in the action space, the model learns to control interaction intensity. Shared components initialize from π₀; the tactile encoder and modified action expert are randomly initialized. The whole model is finetuned end-to-end with a Conditional Flow Matching (CFM) objective that penalizes deviation in both kinematic and force dimensions — this is what forces the model to map linguistic nuance ("gently") to a force magnitude (e.g., 0.5N).
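To make the "penalizes deviation in both kinematic and force dimensions" point concrete, here is a minimal flow-matching loss over an augmented action vector. It is a sketch under stated assumptions: the linear interpolation path and its sign convention are one common choice and may differ from the paper's parameterization, the 7-D pose plus 1-D force layout is hypothetical, and `velocity_fn` stands in for the action expert.

```python
import numpy as np

def cfm_loss(velocity_fn, action, noise, tau):
    """Flow-matching loss on an augmented action [P_target, F_target].

    Uses the linear path x_tau = (1 - tau) * noise + tau * action, whose
    target velocity is (action - noise). The sign/time convention is an
    assumption; the paper's exact parameterization may differ.
    """
    x_tau = (1.0 - tau) * noise + tau * action
    target_velocity = action - noise
    predicted = velocity_fn(x_tau, tau)
    # Mean squared error over every dimension, kinematic and force alike.
    return float(np.mean((predicted - target_velocity) ** 2))

# Hypothetical 7-D pose target plus a 1-D force target (0.5 N for "gently").
action = np.array([0.1, 0.0, 0.3, 0.0, 0.0, 0.0, 0.04, 0.5])
noise = np.zeros_like(action)

def perfect(x, tau):
    """Oracle velocity field that matches the target exactly."""
    return action - noise

def force_blind(x, tau):
    """Same oracle, but with the force dimension zeroed out."""
    return np.append((action - noise)[:-1], 0.0)

loss_perfect = cfm_loss(perfect, action, noise, tau=0.3)
loss_force_blind = cfm_loss(force_blind, action, noise, tau=0.3)
```

A predictor that gets the kinematics right but ignores force still pays a nonzero loss here, which is the mechanism by which the objective pushes the model to tie an adverb to a force magnitude.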

Hybrid position-force controller. The strategy is position-dominant (most manipulation is kinematic; force matters only during contact), following Raibert & Craig 1981, with an indirect force-control adjustment inspired by impedance control (Hogan 1985). Unlike classic impedance control aiming for passive compliance, the goal here is active tracking of a target force. The controller measures force error ΔF = F_target − F_measured and applies a corrective positional adjustment only when ‖ΔF‖ > τ (a threshold, for smoothness): P_hybrid = P_target + K·ΔF if over threshold, else P_target (Eq 2), with K a gain matrix. A PID controller then actuates the joints to P_hybrid. Two force channels are decoupled: the gripper's Cartesian position regulates the net external force on the object, while the gripper width regulates the internal grasping force — i.e., how firmly the object is held.
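The thresholded correction in Eq 2 is small enough to write out directly. The sketch below assumes a diagonal gain (a per-axis list standing in for the gain matrix K) and uses illustrative values for the gain and threshold; neither is reported in the summary above.

```python
def hybrid_position(p_target, f_target, f_measured, gain, threshold):
    """Return P_hybrid per Eq 2, with a diagonal gain for simplicity.

    p_target, f_target, f_measured: lists of equal length (per-axis values)
    gain: per-axis gains (a diagonal stand-in for the gain matrix K)
    threshold: tau, the force-error magnitude below which no correction applies
    """
    delta_f = [ft - fm for ft, fm in zip(f_target, f_measured)]
    error_norm = sum(d * d for d in delta_f) ** 0.5
    if error_norm <= threshold:
        return list(p_target)  # small error: pure position tracking
    return [p + k * d for p, k, d in zip(p_target, gain, delta_f)]

# Illustrative numbers only: gain in m/N and threshold in N are assumptions.
p_target = [0.10, 0.20, 0.30]
gain = [0.001, 0.001, 0.001]

# Force error of 0.5 N exceeds the 0.2 N threshold, so z is nudged.
corrected = hybrid_position(p_target, [0.0, 0.0, 2.0], [0.0, 0.0, 1.5], gain, 0.2)

# Force error of about 0.1 N is under the threshold, so the target is unchanged.
unchanged = hybrid_position(p_target, [0.0, 0.0, 2.0], [0.0, 0.0, 1.9], gain, 0.2)
```

The dead band is the design choice worth noticing: below the threshold the controller is an ordinary position tracker, so sensor noise around the target force does not translate into position jitter.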

Tactile-VLA-CoT. A reasoning-augmented variant that runs a chain-of-thought via the VLM's own pretrained decoder. Force and tactile feedback become reasoning cues, not just policy inputs: triggered at fixed intervals, the model first decides whether the task succeeded, and on failure analyzes the cause from sensory feedback (e.g., "grasping force is sufficient, but normal force is too low") and emits a new corrective instruction (e.g., "wipe the board again, but apply more downward force"). It is finetuned on a small targeted dataset where each failure sample (e.g., wiping with slippage) is paired with a language annotation of the failure's cause — this both preserves the VLM's general reasoning (mitigating catastrophic forgetting) and extends reasoning to the tactile modality.
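The fixed-interval trigger can be pictured as a plain control loop around the policy. Everything in this sketch is a hypothetical stand-in: `policy_step` and `assess` replace the model's action generation and its chain-of-thought assessment, and the toy environment simply jumps from 3.5N to 6.7N once the corrective instruction arrives, with a made-up 6.0N success threshold.

```python
def run_with_reasoning(policy_step, assess, instruction, total_steps, interval):
    """Run a policy, pausing every `interval` steps to reassess the task.

    policy_step(instruction) -> None   executes one control step
    assess() -> (succeeded, new_instruction_or_None)
    Both callables are hypothetical stand-ins for the model's components.
    """
    history = [instruction]
    for step in range(1, total_steps + 1):
        policy_step(instruction)
        if step % interval == 0:
            succeeded, correction = assess()
            if succeeded:
                break
            if correction is not None:
                instruction = correction
                history.append(instruction)
    return history

# Toy environment: required force is an invented number for illustration.
state = {"force": 3.5, "required": 6.0}

def policy_step(instruction):
    # A "more downward force" instruction raises the applied force.
    if "more downward force" in instruction:
        state["force"] = 6.7

def assess():
    if state["force"] >= state["required"]:
        return True, None
    return False, "wipe the board again, but apply more downward force"

history = run_with_reasoning(policy_step, assess, "wipe the board",
                             total_steps=30, interval=10)
```

In this toy run the first check fails, the instruction is rewritten once, and the second check succeeds, so `history` holds the original and the corrected instruction.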

Data collection (Fig 4). Built on UMI (a portable handheld gripper, Chi et al. 2024), augmented with dual high-resolution tactile sensors capturing both normal and shear forces, so human operators give demonstrations that are explicitly guided by force. Tactile is captured at 100Hz, visual at 20Hz, then tactile is down-sampled to align with visual frames; timestamps are aligned across streams before each session.
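One simple way to down-sample the 100Hz tactile stream onto 20Hz camera frames is nearest-timestamp matching, sketched below. The summary does not say which method the authors use (averaging or interpolation are equally plausible), so treat this as an assumption; timestamps are integer milliseconds to keep the example exact.

```python
from bisect import bisect_left

def align_to_frames(frame_times, tactile_times, tactile_values):
    """For each camera frame, pick the tactile sample nearest in time.

    tactile_times must be sorted in ascending order.
    """
    aligned = []
    for t in frame_times:
        i = bisect_left(tactile_times, t)
        # Candidates: the sample just before and just after the frame time.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(tactile_times)]
        best = min(candidates, key=lambda j: abs(tactile_times[j] - t))
        aligned.append(tactile_values[best])
    return aligned

# 100 Hz tactile and 20 Hz video over one second, timestamps in milliseconds.
tactile_times = list(range(0, 1000, 10))
tactile_values = [t / 1000.0 for t in tactile_times]  # toy signal
frame_times = list(range(0, 1000, 50))
aligned = align_to_frames(frame_times, tactile_times, tactile_values)
```

With these rates every fifth tactile sample lines up with a frame, so the aligned stream has one tactile reading per image; with jittered real timestamps the same function would pick whichever neighbor is closer.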

Setup

Datasets / benchmarks: Self-collected "VLA-T" dataset on three contact-rich tasks. USB/Charger Insertion & Extraction: 100 demos each for "soft" and "hard" USB manipulations + 100 for the charger (kinematic-only, no force language). Tabletop Grasping: 50 demos per object over 6 training objects (categorized Solid&Heavy / Solid&Light / Fragile&Light), evaluated on 6 additional unseen (OOD) objects. Wiping the Board: 100 successful + 100 failed wiping demos on a whiteboard; blackboard never seen in training (zero-shot transfer).
Hardware / simulator: Real-world only. UMI-based handheld 3D-printed gripper with dual high-resolution tactile sensors (normal + shear) and a GoPro camera (Fig 4). VLM backbone Gemma 2.6B; tactile-aware action expert 300M.
Baselines: π₀-base (a VLA flow model for general robot control); π₀-fast (the autoregressive π₀ variant that uses FAST action tokenization). Proposed models: Tactile-VLA and Tactile-VLA-CoT (Tactile-VLA with reasoning). The baselines lack the tactile-fusion architecture.
Compute: not reported.

Results

Across all three research questions (RQs), Tactile-VLA beats the force-blind π₀ baselines, often by large margins. RQ1 (force-language instruction following). On USB/Charger insertion success (Table 1):

Model	USB (%)	Charger (%)
π₀-base	5	40
π₀-fast	0	25
Tactile-VLA	35	90

On applied force (Table 2), Tactile-VLA cleanly separates "softly" (0.51N) from "hard" (2.57N) on the trained USB task, interpolates to unseen adverbs ("gently" 0.75N, "firmly" 1.98N), and even extrapolates ("harder" 2.94N, above "hard"). Crucially, it transfers this zero-shot to the charger task, whose demonstrations carried no force language (4.68N for "softly" vs 9.13N for "hard"), whereas the π₀ baselines show no correlation between adverb and applied force.

RQ2 (tactile commonsense, Table 3, grasping success without deformation, 10 trials/object). Tactile-VLA reaches 80–100% on both in-domain and OOD objects across Solid&Heavy, Solid&Light, and Fragile&Light categories. The starkest gap is fragile items: on OOD BlueBerry / PaperBox it scores 100 / 90 while both baselines score 0 / 0 — the baselines crush fragile objects. Fig 6 shows it applies hard / medium / soft force to heavy / light / fragile objects respectively, even for unseen ones.

RQ3 (reasoning-based adaptation, Table 4, success over ID whiteboard vs OOD blackboard).

Model	In-Domain Whiteboard (%)	Out-of-Domain Blackboard (%)
π₀-base	40	0
π₀-fast	45	0
Tactile-VLA	80	15
Tactile-VLA-CoT	75	80

On the zero-shot blackboard, the CoT variant recognizes the failure from force feedback, reasons that more force is needed, and autonomously raises the applied force from the 3.5N default to 6.7N, about 34% above the 5N used in training. It succeeds where plain Tactile-VLA (15%) and the baselines (0%) fail. Note that the CoT variant slightly underperforms plain Tactile-VLA in-domain (75 vs 80): the reasoning loop helps OOD but is not free on the familiar task.

Limitations & open questions

From the authors:

The contribution is framed as unlocking a VLM's existing prior, so generalization is bounded by what the pretrained VLM already "knows" about physical interaction — objects/forces far outside that prior are not addressed.
CoT reasoning is triggered at fixed intervals, a "simple and effective" heuristic rather than a learned decision of when to reason.
The hybrid controller applies a force correction only above a threshold τ for smoothness; tuning of τ, K, and PID gains is not analyzed.

What I noticed reading it:

Tiny-N statistics. USB/Charger success rates appear to be over 20 trials (rates like 5%, 35%, 90% are multiples of 5); grasping is 10 trials/object; force numbers in Table 2 are single reported values with no variance. These are raw success counts, not rates with confidence intervals, which is far weaker statistical support than the strong "semantic generalization" framing suggests.
No tactile-foundation-model baseline. Baselines are only force-blind π₀ variants. There is no comparison against an ablation that adds tactile input but not the force action output, nor against other tactile-VLA contemporaries (FuSe, ForceVLA, TLA) — so it's unclear how much of the win is the architecture vs. simply having force in the action space + a force controller.
"VLM already knows" is asserted, not isolated. The headline claim is that the VLM's prior contains the force semantics, but there's no probe (e.g., freeze the VLM, randomize its weights, swap backbones) showing the zero-shot adverb interpolation degrades without the pretrained prior. The transfer could partly be the action expert generalizing over a continuous force regression target.
All evaluation is real-world single-arm UMI gripper; no sim, no cross-embodiment, and "force" is mediated by two specific high-res tactile sensors — sensor-dependence of the learned force semantics is untested.
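To put a number on the tiny-N concern, a Wilson score interval shows how wide the uncertainty is at these sample sizes. The sketch below assumes 20 trials per condition, which is inferred from the rates being multiples of 5 and is not reported in the paper; the success counts 7 and 1 correspond to the 35% and 5% USB rates in Table 1.

```python
from math import sqrt

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials
                    + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# USB insertion: 35% (Tactile-VLA) and 5% (pi0-base), ASSUMING 20 trials each.
tactile_lo, tactile_hi = wilson_interval(7, 20)
base_lo, base_hi = wilson_interval(1, 20)
```

Under that assumed trial count the intervals come out at roughly 18% to 57% for Tactile-VLA and roughly 1% to 24% for π₀-base, so they overlap slightly even for the headline USB comparison. The larger gaps (for example 90% vs 40% on the charger) are on firmer ground.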

Why I care

This paper is a direct, concrete instance of the batch's central thesis, and a sharp foil to BLADE. BLADE's abstraction layer is purely categorical: continuous parameters like grasp force live entirely inside the diffusion policy, and one of BLADE's flagged open questions was exactly "force/continuous parameters." Tactile-VLA shows the complementary move — lift force to a first-class, language-conditioned action output (F_target), so that an adverb like "gently" maps to 0.5N. That is precisely the kind of predicate that is not visually evaluable: is_grasped, grip_is_gentle, surface_pressure_sufficient, is_screwed_tight live in touch/force, not pixels. The wiping-CoT result — "normal force is too low, apply more downward pressure" — is a tactile predicate evaluation dressed as chain-of-thought, which is suggestive for predicate-from-touch invention on top of a BLADE-style abstraction layer.

Where it diverges from my line: there is no symbolic planning, no operator library, no long-horizon composition — it's a single end-to-end force-aware policy with a reasoning wrapper. So it's relevant as the low-level force-grounding substrate a BLADE-style planner could sit on top of, not as a planning-abstractions paper itself. The "VLM prior already contains physical semantics" claim is also the same structure-vs-scale debate BLADE sits on the other side of: here scale (a pretrained VLM) supplies the force commonsense that BLADE would want to make explicit and symbolic.

Quotable

A key finding is that the VLM's prior knowledge already contains semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks. — Abstract / p.1

This capability is essential for differentiating commands that share the same motion but differ in force, such as "insert the USB firmly" versus "insert the USB gently". — §2.1 / p.3

The gripper's Cartesian position is used to exclusively regulate the net external force applied to an object, while the gripper width is used in parallel to control the internal grasping force, thus dictating how firmly the object is held. — §2.2 / p.4

Papers cited that should likely be ingested next:

Hao et al. 2025 — TLA: Tactile-Language-Action model (arXiv:2503.08548) — the closest contemporary; tactile + language + action for contact-rich manipulation. Not in this batch's cross-ref list; worth flagging for a future ingest.
Yu et al. 2025 — ForceVLA (arXiv:2505.22159) — concurrent force-aware MoE VLA; cited as the modality-specific-routing counterpoint to Tactile-VLA's token fusion.
Jones et al. 2025 — FuSe (Beyond Sight) (arXiv:2501.04693) — concurrent; finetunes generalist policies on heterogeneous sensors via language grounding, cited as the auxiliary-loss counterpoint.
Black et al. 2024 — π₀ (arXiv:2410.24164) — the VLA flow model the architecture and baselines build on. Foundational dependency.
Chi et al. 2024 — UMI (Universal Manipulation Interface) — the handheld data-collection device augmented with tactile sensors.
Xue et al. 2025 — Reactive Diffusion Policy (arXiv:2503.02881) — slow-fast visuo-tactile policy; cited as a hierarchical decouple-planning-and-control tactile approach.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

FuSe — same cluster C; the closest sibling on fusing heterogeneous (incl. tactile) sensors into a generalist VLA via language, but via auxiliary contrastive/generative losses rather than force in the action space.
ForceVLA — same cluster C; force-aware mixture-of-experts VLA, the architectural alternative to Tactile-VLA's token-level fusion + hybrid controller.
OmniVTLA, VLA-Touch, TaF-VLA, FAVLA, Bi-LAT — the rest of cluster C; the multisensory tactile/force VLA-policy landscape this paper lives in.
Hybrid Position/Force Control (Raibert & Craig 1981) — the control-theory classic Tactile-VLA's hybrid controller is built directly on.
Towards Forceful Robotic Foundation Models (survey) — situates Tactile-VLA in the broader push to put force into foundation-model policies.