One-liner. Tactile-VLA fuses tactile sensing as a native input modality into a pretrained VLA and adds a hybrid position-force action expert, arguing that a VLM already knows that "gently" means low force and a fragile pitaya needs a soft grip — so with only a few demonstrations you can connect that latent physical commonsense to real force control and get zero-shot generalization in contact-rich tasks.
Current VLAs excel at high-level reasoning and at deciding what action to take, but stumble when grounding decisions in the fine-grained physical realities of contact — they answer "pick up the apple" but not "pick up the apple softly." Vision and language give high-level semantic information; tactile sensing gives the rich, local, temporally dynamic feedback (friction, compliance, material sensitivity) that contact-rich tasks need. Prior haptics-in-robotics work treats touch as a supplementary perceptual modality, not something that directly drives the policy's action generation. The paper's thesis: a VLM's pretrained prior already contains semantic understanding of physical interaction, and the missing piece is a bridge from that latent knowledge to the robot's tactile sensors and to force as an explicit action output.
Tactile-VLA fuses vision, language, tactile, and proprioception, and outputs force-aware actions via a hybrid position-force controller. The architecture is in Fig 2.
Token-level multimodal fusion. Inputs are separately encoded then concatenated into a single prefix fed to a pretrained VLM trunk (Gemma 2.6B). A ViT encoder $E'_{\mathrm{vis}}$ encodes the last H frames (as in π0); language goes through a common tokenizer $E_{\mathrm{lang}}$; a simple MLP $E'_{\psi}$ encodes the concatenated history of H tactile measurements into a single fused token. The prefix is $S_t = [E'_{\mathrm{vis}}(I_{t-H+1}), \dots, E'_{\mathrm{vis}}(I_t), E_{\mathrm{lang}}(L_t), E'_{\psi}([T_{t-H+1}, \dots, T_t])]$ (Eq 1). A non-causal attention mask over the prefix lets vision, language, and tactile tokens cross-attend freely.
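A minimal sketch of how the Eq 1 prefix could be assembled, assuming every encoder already emits tokens of a shared width. The token width `D`, the toy sizes, the one-layer `tanh` stand-in for the tactile MLP, and all function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

D = 8  # hypothetical token width; a real model would use the VLM embedding size

def encode_tactile(history: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Stand-in for the tactile MLP: fuse H readings into ONE token."""
    flat = history.reshape(-1)             # concatenate the H measurements
    return np.tanh(flat @ w + b)[None, :]  # shape (1, D)

def build_prefix(vis_tokens, lang_tokens, tactile_token) -> np.ndarray:
    """Concatenate per-modality tokens into a single prefix S_t."""
    # vis_tokens: list of H arrays, each (n_patches, D), oldest frame first
    return np.concatenate([*vis_tokens, lang_tokens, tactile_token], axis=0)

def prefix_mask(n_tokens: int) -> np.ndarray:
    """Non-causal mask: every prefix token may attend to every other one."""
    return np.ones((n_tokens, n_tokens), dtype=bool)

rng = np.random.default_rng(0)
H, n_patches, n_lang, tactile_dim = 2, 4, 3, 6   # toy sizes (assumed)
vis = [rng.normal(size=(n_patches, D)) for _ in range(H)]
lang = rng.normal(size=(n_lang, D))
tactile_hist = rng.normal(size=(H, tactile_dim))
w = rng.normal(size=(H * tactile_dim, D))
b = np.zeros(D)

prefix = build_prefix(vis, lang, encode_tactile(tactile_hist, w, b))
mask = prefix_mask(prefix.shape[0])
```

With these toy sizes the prefix holds H·n_patches visual tokens, the language tokens, and exactly one tactile token, which is the point of the "single fused token" design: touch adds almost nothing to sequence length.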
Tactile-aware action expert. A 300M action expert consumes the prefix and outputs an augmented action vector $a_t$ that explicitly specifies a target position $P_{\mathrm{target}}$ and a target contact force $F_{\mathrm{target}}$. Both targets come from the imitation demonstrations; by putting force directly in the action space, the model learns to control interaction intensity. Shared components initialize from π0; the tactile encoder and modified action expert are randomly initialized. The whole model is finetuned end-to-end with a Conditional Flow Matching (CFM) objective that penalizes deviation in both kinematic and force dimensions. This is what forces the model to map linguistic nuance ("gently") to a force magnitude (e.g., 0.5N).
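To make the "penalizes deviation in both kinematic and force dimensions" point concrete, here is a one-sample flow-matching loss on an augmented action vector. It assumes the common straight-line path between noise and action; the paper's exact path and time convention may differ, and the 4-D action layout and both velocity fields are hypothetical.

```python
import numpy as np

def cfm_loss(velocity_fn, action, rng):
    """One-sample conditional flow matching loss on an augmented action.

    action: 1-D array laid out as [position dims..., force dims...].
    velocity_fn(x_t, t) -> predicted velocity with the same shape as action.
    """
    noise = rng.normal(size=action.shape)
    t = rng.uniform()                        # t in [0, 1)
    x_t = (1.0 - t) * noise + t * action     # point on the straight noise-to-action path
    target_velocity = action - noise         # derivative of x_t with respect to t
    pred = velocity_fn(x_t, t)
    # The mean runs over position AND force dims, so force errors are penalized too.
    return float(np.mean((pred - target_velocity) ** 2))

# Hypothetical action: 3 position dims plus 1 force dim ("gently" at about 0.5 N).
action = np.array([0.10, -0.05, 0.30, 0.5])

def oracle(x_t, t):
    """Velocity field that knows the demonstrated action."""
    return (action - x_t) / (1.0 - t)

def force_blind(x_t, t):
    """Same field, but it ignores the force dimension."""
    v = oracle(x_t, t)
    v[-1] = 0.0
    return v

loss_oracle = cfm_loss(oracle, action, np.random.default_rng(0))
loss_blind = cfm_loss(force_blind, action, np.random.default_rng(0))
```

Because force sits inside the regressed vector, a field that gets the kinematics right but ignores force still pays a loss, which is the mechanism the paragraph above credits for tying adverbs to force magnitudes.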
Hybrid position-force controller. The strategy is position-dominant (most manipulation is kinematic; force matters only during contact), following Raibert & Craig 1981, with an indirect force-control adjustment inspired by impedance control (Hogan 1985). Unlike classic impedance control aiming for passive compliance, the goal here is active tracking of a target force. The controller measures force error $\Delta F = F_{\mathrm{target}} - F_{\mathrm{measured}}$ and applies a corrective positional adjustment only when $\|\Delta F\| > \tau$ (a threshold, for smoothness): $P_{\mathrm{hybrid}} = P_{\mathrm{target}} + K \cdot \Delta F$ if over threshold, else $P_{\mathrm{hybrid}} = P_{\mathrm{target}}$ (Eq 2), with $K$ a gain matrix. A PID controller then actuates the joints to $P_{\mathrm{hybrid}}$. Two force channels are decoupled: the gripper's Cartesian position regulates the net external force on the object, while the gripper width regulates the internal grasping force, i.e., how firmly the object is held.
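Eq 2 is small enough to write out directly. The sketch below implements only the thresholded position correction; the gain `K`, the threshold `tau`, the axis and sign conventions, and the numeric values are all illustrative assumptions, and the downstream PID loop is omitted.

```python
import numpy as np

def hybrid_target(p_target, f_target, f_measured, K, tau):
    """Thresholded position correction following Eq 2.

    Returns p_target + K @ delta_f when the force error exceeds tau,
    otherwise returns p_target unchanged.
    """
    delta_f = f_target - f_measured
    if np.linalg.norm(delta_f) > tau:
        return p_target + K @ delta_f
    return p_target

# Hypothetical values: 1 mm of correction per newton of error, 0.2 N dead band.
K = 0.001 * np.eye(3)
tau = 0.2
p_target = np.array([0.0, 0.0, 0.100])
f_target = np.array([0.0, 0.0, 2.0])

# Large error (1.5 N): the commanded position shifts by K @ delta_f.
p_far = hybrid_target(p_target, f_target, np.array([0.0, 0.0, 0.5]), K, tau)

# Small error (0.1 N): inside the dead band, so the position is untouched.
p_near = hybrid_target(p_target, f_target, np.array([0.0, 0.0, 1.9]), K, tau)
```

The dead band is what keeps the controller position-dominant: away from contact, or once the measured force is close to the target, the policy's kinematic output passes through unmodified.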
Tactile-VLA-CoT. A reasoning-augmented variant that runs a chain-of-thought via the VLM's own pretrained decoder. Force and tactile feedback become reasoning cues, not just policy inputs: triggered at fixed intervals, the model first decides whether the task succeeded, and on failure analyzes the cause from sensory feedback (e.g., "grasping force is sufficient, but normal force is too low") and emits a new corrective instruction (e.g., "wipe the board again, but apply more downward force"). It is finetuned on a small targeted dataset where each failure sample (e.g., wiping with slippage) is paired with a language annotation of the failure's cause — this both preserves the VLM's general reasoning (mitigating catastrophic forgetting) and extends reasoning to the tactile modality.
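The check-then-correct cycle described above can be sketched as a plain loop. Everything here is a hypothetical stand-in: `execute` plays the role of running the policy for one interval, `judge` plays the role of the VLM's success check and failure analysis, and the 6.0 N success threshold in the toy stubs is invented for illustration.

```python
from typing import Callable, List, Tuple

def run_with_reasoning(
    execute: Callable[[str], dict],
    judge: Callable[[dict], Tuple[bool, str]],
    instruction: str,
    max_rounds: int = 3,
) -> Tuple[bool, List[str]]:
    """Execute, check success, and on failure retry with a corrective instruction."""
    history = [instruction]
    for _ in range(max_rounds):
        feedback = execute(instruction)        # sensory feedback from one interval
        success, corrective = judge(feedback)  # success check plus failure analysis
        if success:
            return True, history
        instruction = corrective               # the corrective text becomes the new command
        history.append(instruction)
    return False, history

# Toy stubs (assumed behavior, not the paper's models).
def toy_execute(instruction: str) -> dict:
    force = 6.7 if "more downward force" in instruction else 3.5
    return {"normal_force": force}

def toy_judge(feedback: dict) -> Tuple[bool, str]:
    if feedback["normal_force"] >= 6.0:
        return True, ""
    return False, "wipe the board again, but apply more downward force"

ok, instructions = run_with_reasoning(toy_execute, toy_judge, "wipe the board")
```

The design point is that the correction travels through language: the reasoning step never touches the controller directly, it only rewrites the instruction that the force-aware policy already knows how to follow.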
Data collection (Fig 4). Built on UMI (a portable handheld gripper, Chi et al. 2024), augmented with dual high-resolution tactile sensors capturing both normal and shear forces, so human operators give demonstrations that are explicitly guided by force. Tactile is captured at 100Hz, visual at 20Hz, then tactile is down-sampled to align with visual frames; timestamps are aligned across streams before each session.
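The down-sampling step can be sketched as nearest-timestamp matching. The notes do not say which method the authors use, so nearest-neighbor selection is an assumption here, as are the millisecond timestamps and the toy readings.

```python
import numpy as np

def align_tactile_to_frames(tactile_t, tactile_x, frame_t):
    """For each camera timestamp, pick the tactile sample nearest in time.

    tactile_t must be sorted in ascending order.
    """
    idx = np.searchsorted(tactile_t, frame_t)   # first tactile index with t >= frame time
    idx = np.clip(idx, 1, len(tactile_t) - 1)   # keep idx - 1 and idx both valid
    left_closer = (frame_t - tactile_t[idx - 1]) <= (tactile_t[idx] - frame_t)
    idx = np.where(left_closer, idx - 1, idx)
    return tactile_x[idx]

# One second of data, timestamps in integer milliseconds.
tactile_t = np.arange(0, 1000, 10)         # 100 Hz tactile stream
tactile_x = np.arange(100, dtype=float)    # toy readings: value equals sample index
frame_t = np.arange(0, 1000, 50)           # 20 Hz camera stream

aligned = align_tactile_to_frames(tactile_t, tactile_x, frame_t)
```

At these rates each camera frame lines up with every fifth tactile sample; with real, jittery timestamps the same function still returns one tactile reading per frame.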
Across all three RQs Tactile-VLA beats the force-blind π0 baselines, often by large margins. RQ1 (force-language instruction following). On USB/Charger insertion success (Table 1):
| Model | USB (%) | Charger (%) |
|---|---|---|
| π0-base | 5 | 40 |
| π0-fast | 0 | 25 |
| Tactile-VLA | 35 | 90 |
On applied force (Table 2), Tactile-VLA cleanly separates "softly" (0.51N) from "hard" (2.57N) on the trained USB task, interpolates unseen adverbs ("gently" 0.75N, "firmly" 1.98N), and even extrapolates ("harder" 2.94N > "hard"). Crucially, it transfers this zero-shot to the charger task, whose demonstrations carried no force-related language (4.68N for "softly" vs 9.13N for "hard"), whereas the π0 baselines show no correlation between adverb and applied force.
RQ2 (tactile commonsense, Table 3, grasping success without deformation, 10 trials/object). Tactile-VLA reaches 80–100% on both in-domain and OOD objects across Solid&Heavy, Solid&Light, and Fragile&Light categories. The starkest gap is fragile items: on OOD BlueBerry / PaperBox it scores 100 / 90 while both baselines score 0 / 0 — the baselines crush fragile objects. Fig 6 shows it applies hard / medium / soft force to heavy / light / fragile objects respectively, even for unseen ones.
RQ3 (reasoning-based adaptation, Table 4, success over ID whiteboard vs OOD blackboard).
| Type | In-Domain (Whiteboard) | Out-of-Domain (Blackboard) |
|---|---|---|
| π0-base | 40 | 0 |
| π0-fast | 45 | 0 |
| Tactile-VLA | 80 | 15 |
| Tactile-VLA-CoT | 75 | 80 |
On the zero-shot blackboard, the CoT variant recognizes the failure from force feedback, reasons that more force is needed, and autonomously raises the applied force from the 3.5N default (trained at 5N) to 6.7N (~34% above training), succeeding where plain Tactile-VLA (15%) and the baselines (0%) fail. Note the CoT variant slightly under-performs plain Tactile-VLA in-domain (75 vs 80) — the reasoning loop helps OOD but is not free on the familiar task.
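A quick arithmetic check of the force figures quoted above, using only the three numbers in that paragraph (variable names are mine):

```python
trained_n = 5.0   # force level seen in training
default_n = 3.5   # force applied by default on the blackboard
adapted_n = 6.7   # force after the CoT correction

pct_above_training = (adapted_n / trained_n - 1.0) * 100.0
pct_above_default = (adapted_n / default_n - 1.0) * 100.0

print(f"{pct_above_training:.0f}% above training, {pct_above_default:.0f}% above default")
```

The first ratio reproduces the ~34% figure relative to the 5N training level; relative to the 3.5N default, the same adjustment is a much larger jump, roughly 91%.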
From the authors:
τ for smoothness; tuning of τ, K, and PID gains is not analyzed.
What I noticed reading it:
This paper is a direct, concrete instance of the batch's central thesis, and a sharp foil to BLADE. BLADE's abstraction layer is purely categorical: continuous parameters like grasp force live entirely inside the diffusion policy, and one of BLADE's flagged open questions was exactly "force/continuous parameters." Tactile-VLA shows the complementary move: lift force to a first-class, language-conditioned action output ($F_{\mathrm{target}}$), so that an adverb like "gently" maps to 0.5N. That is precisely the kind of predicate that is not visually evaluable: is_grasped, grip_is_gentle, surface_pressure_sufficient, is_screwed_tight live in touch/force, not pixels. The wiping-CoT result ("normal force is too low, apply more downward pressure") is a tactile predicate evaluation dressed as chain-of-thought, which is suggestive for predicate-from-touch invention on top of a BLADE-style abstraction layer.
Where it diverges from my line: there is no symbolic planning, no operator library, no long-horizon composition — it's a single end-to-end force-aware policy with a reasoning wrapper. So it's relevant as the low-level force-grounding substrate a BLADE-style planner could sit on top of, not as a planning-abstractions paper itself. The "VLM prior already contains physical semantics" claim is also the same structure-vs-scale debate BLADE sits on the other side of: here scale (a pretrained VLM) supplies the force commonsense that BLADE would want to make explicit and symbolic.
A key finding is that the VLM's prior knowledge already contains semantic understanding of physical interaction; by connecting it to the robot's tactile sensors with only a few demonstrations, we can activate this prior knowledge to achieve zero-shot generalization in contact-rich tasks. — Abstract / p.1
This capability is essential for differentiating commands that share the same motion but differ in force, such as "insert the USB firmly" versus "insert the USB gently". — §2.1 / p.3
The gripper's Cartesian position is used to exclusively regulate the net external force applied to an object, while the gripper width is used in parallel to control the internal grasping force, thus dictating how firmly the object is held. — §2.2 / p.4
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: