One-liner. OmniVTLA bolts a semantically-aligned tactile encoder onto a π0-style VLA: instead of treating touch as a raw low-level signal, it pretrains a tactile ViT via cross-modal contrastive learning (on a new 135K-sample tri-modal dataset, ObjTac) so that tactile tokens land in the same latent space as vision and language — and that grounding, not just more parameters, is what lifts contact-rich pick-place and peg-insertion success.
VLA models inherit CLIP/SigLIP vision backbones that are already contrastively aligned to language, but they rely on vision and language alone and stumble on contact-rich tasks where the decisive feedback lives in touch (material, roughness, hardness, contact state). Prior attempts to bolt touch onto VLAs (Tactile-VLA, VTLA) treat tactile data as a low-level signal and do not align it semantically with vision and language. The paper's wager (Fig 1): the tactile encoder's architecture and alignment objective matter more than raw parameter count, and a vision-language-tactile semantic alignment can be engineered the way CLIP aligned vision and language.
OmniVTLA is built on π0 (Black et al. 2024) with a Gemma-2.6B backbone,
a PaliGemma text tokenizer, a SigLIP-400M image encoder, and a flow-matching
action expert. The action model targets p(A_t | o_t) over an
action chunk A_t of length H=50. The VTLA variant extends the π0
observation o_t with tactile tokens: A_t ~ M(fφ(I_t), fθ(T_t), l_t),
where fφ encodes the image I_t, fθ encodes the tactile input T_t, and l_t
is the language instruction. T_t is tactile data remapped to a ViT-like
3-channel image (via max-min force normalization and tensor reshaping);
stitched/multi-sensor inputs are resized to 224×224, yielding 256 tokens.
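
The tactile-to-image remapping can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's code: the 8×8 taxel grid with three force components per taxel, the global (rather than per-channel) min-max scaling, the nearest-neighbour resize, and the 14-pixel patch size are all hypothetical. The patch size is chosen only because 224/14 = 16 gives the 16×16 = 256 tokens quoted above.

```python
def min_max_normalize(forces):
    """Scale a nested [H][W][3] force grid to [0, 1] using its global min and max."""
    flat = [v for row in forces for taxel in row for v in taxel]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant frame
    return [[[(v - lo) / span for v in taxel] for taxel in row] for row in forces]


def resize_nearest(img, size=224):
    """Nearest-neighbour resize of an [H][W][C] grid to [size][size][C]."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def num_patch_tokens(size=224, patch=14):
    """Number of non-overlapping square ViT patches in a size x size image."""
    return (size // patch) ** 2


# Hypothetical 8x8 taxel grid; each taxel reports an assumed 3-axis force.
frame = [[[float(r - c), float(r + c), float(r * c)] for c in range(8)]
         for r in range(8)]
image = resize_nearest(min_max_normalize(frame))
tokens = num_patch_tokens()  # 256 for a 224x224 image with 14-pixel patches
```

The three force components map directly onto the three image channels here, which is one plausible reading of "ViT-like 3-channel image"; the source does not spell out the channel layout.
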
Dual-encoder tactile path (Fig 2). The core design is a dual-ViT tactile encoder: one path is the SigLIP image encoder reused on the tactile-as-image input (knowledge transfer from large-scale vision pretraining), the other is a Semantic-Aligned tactile ViT (SA-ViT) trained explicitly via cross-modal contrastive learning to push tactile embeddings toward the matching vision and text concepts. This dual path addresses the heterogeneity between vision and touch and across different tactile sensors (Touch-and-Go, SSVTP, ObjectFolder, ObjTac all look different).
SA-ViT training. Built on the second-stage alignment
pipeline of AnyTouch.
Because ObjTac is tri-modal, the paper uses a symmetric all-pairs alignment
loss (Eq. 3): L_align = α_VL(L_{V→L}+L_{L→V})/2 +
α_VT(L_{V→T}+L_{T→V})/2 + α_TL(L_{T→L}+L_{L→T})/2,
plus a cross-sensor matching loss (binary cross-entropy) so embeddings align
across sensors. The SA-ViT has an encoder/decoder structure; alignment grounds
tactile signals (material, roughness, hardness) in visual and linguistic
context.
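
A minimal sketch of how the weighted all-pairs combination in Eq. 3 could be computed. The source gives only the weighted sum; the per-direction terms are assumed here to be CLIP-style InfoNCE losses, and the temperature of 0.07 and unit weights are placeholder values. The cross-sensor matching term is omitted.

```python
import math


def normalize(v):
    """Return v scaled to unit L2 norm (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(q)


def symmetric_pair(a, b):
    """(L_{a->b} + L_{b->a}) / 2 for one modality pair."""
    return 0.5 * (info_nce(a, b) + info_nce(b, a))


def alignment_loss(vision, language, tactile, a_vl=1.0, a_vt=1.0, a_tl=1.0):
    """Weighted sum over the three modality pairs, following the shape of Eq. 3."""
    return (a_vl * symmetric_pair(vision, language)
            + a_vt * symmetric_pair(vision, tactile)
            + a_tl * symmetric_pair(tactile, language))


# Toy batch of two samples with 2-D embeddings per modality.
vision = [[1.0, 0.0], [0.0, 1.0]]
language = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss(vision, language, tactile=[[1.0, 0.0], [0.0, 1.0]])
shuffled = alignment_loss(vision, language, tactile=[[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, swapping the two tactile embeddings leaves the vision-language term unchanged but drives both tactile terms up, so `shuffled` is much larger than `aligned`.
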
ObjTac dataset. 56 objects across 10 material categories (plastic, glass, wood, brick, metal, fabric, leather, ceramic, paper, others), labeled by surface roughness (rough/smooth) and hardness (rigid/soft). Sensor is a Paxini Gen2 magnetic Hall-effect tactile sensor that recovers 6D pose of embedded magnets and inverts to a multi-dimensional force distribution. Per object: 2–5 interaction trials, 10–60 s each at 60 Hz, plus 720p first-person video at 30 FPS, temporally synchronized by timestamp. Totals: 135K tri-modal samples, 270K force recordings, 252 video sequences. Tri-modal text annotations give object name, material, roughness, hardness, and descriptions.
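
Timestamp-based synchronization of the 60 Hz tactile stream with the 30 FPS video could look something like the sketch below. The nearest-timestamp pairing rule and the function names are assumptions; the source says only that the streams are temporally synchronized by timestamp. The 2:1 ratio of force recordings to tri-modal samples (270K vs 135K) would be consistent with two tactile readings per video frame, though the notes do not state that this is the reason.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the entry in sorted_ts closest to t (ties go to the earlier one)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1


def pair_frames_with_tactile(video_ts, tactile_ts):
    """For each video frame index, the index of the nearest tactile reading."""
    return [(vi, nearest_index(tactile_ts, t)) for vi, t in enumerate(video_ts)]


# One second of synthetic timestamps: 30 FPS video, 60 Hz tactile.
video_ts = [i / 30 for i in range(30)]
tactile_ts = [i / 60 for i in range(60)]
pairs = pair_frames_with_tactile(video_ts, tactile_ts)
```

With these idealized clocks every video frame `k` pairs with tactile reading `2k`; real recordings would have jitter, which is why the lookup searches for the nearest timestamp instead of assuming a fixed stride.
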
Action representation. For two-finger grippers: 10 tokens (3 relative positions, 6 relative angles, 1 gripper state). For the four-finger hand: 25 tokens (3 relative positions, 6 relative angles, 16 absolute joint positions).
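
The two action layouts can be written down as a small sketch to check that the component counts add up (3 + 6 + 1 = 10 and 3 + 6 + 16 = 25). The class and field names are hypothetical, and treating each of the source's "tokens" as one scalar slot is an assumption; the chunk length H=50 is taken from the model description.

```python
from dataclasses import dataclass
from typing import List


def _check(name: str, values: List[float], expected: int) -> None:
    """Raise if a component does not have the expected number of values."""
    if len(values) != expected:
        raise ValueError(f"{name} needs {expected} values, got {len(values)}")


@dataclass
class GripperAction:
    rel_position: List[float]  # 3 relative positions
    rel_angles: List[float]    # 6 relative angles
    gripper_state: float       # 1 gripper state

    def to_vector(self) -> List[float]:
        _check("rel_position", self.rel_position, 3)
        _check("rel_angles", self.rel_angles, 6)
        return [*self.rel_position, *self.rel_angles, self.gripper_state]


@dataclass
class HandAction:
    rel_position: List[float]     # 3 relative positions
    rel_angles: List[float]       # 6 relative angles
    joint_positions: List[float]  # 16 absolute joint positions

    def to_vector(self) -> List[float]:
        _check("rel_position", self.rel_position, 3)
        _check("rel_angles", self.rel_angles, 6)
        _check("joint_positions", self.joint_positions, 16)
        return [*self.rel_position, *self.rel_angles, *self.joint_positions]


H = 50  # action chunk length
gripper_step = GripperAction([0.0] * 3, [0.0] * 6, 1.0).to_vector()
hand_step = HandAction([0.0] * 3, [0.0] * 6, [0.0] * 16).to_vector()
gripper_chunk = [list(gripper_step) for _ in range(H)]  # H x 10
```
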
Headline real-world numbers: on pick-and-place with the gripper, OmniVTLA reaches a 96.9% average success rate (SR) vs 75.0% for π0 (+21.9 percentage points); with the dexterous hand, 100% vs 93.8% (+6.2 points); on peg-insertion, 83.3% vs 50.0% (+33.3 points). It also cuts average completion time (CT) from 657 to 498 steps on gripper pick-and-place and smooths trajectories.
Pick-and-place, two-finger gripper (Table 3):
| Model | Tactile encoder | Success rate, avg (%) | Completion time, avg (steps) |
|---|---|---|---|
| VLA (π0) | none | 75.0 | 657 |
| VTLA-FS | from scratch | 81.2 | 537 |
| VTLA-Pre | pretrained vision | 84.4 | 586 |
| VTLA-SA | semantic-aligned | 87.5 | 484 |
| VTLA-Pre-Pre | SigLIP+CLIP | 81.3 | 501 |
| VTLA-Pre-Tac | SigLIP+Tac-ViT | 84.4 | 550 |
| OmniVTLA | SigLIP+SA-ViT | 96.9 | 498 |
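
The gains over the π0 baseline can be recomputed from the table with a few lines. The numbers are copied from the rows above; the helper name `compare` is just for illustration.

```python
# (success rate %, completion time in steps), copied from the table above.
results = {
    "VLA (pi0)": (75.0, 657),
    "VTLA-FS": (81.2, 537),
    "VTLA-Pre": (84.4, 586),
    "VTLA-SA": (87.5, 484),
    "VTLA-Pre-Pre": (81.3, 501),
    "VTLA-Pre-Tac": (84.4, 550),
    "OmniVTLA": (96.9, 498),
}

baseline_sr, baseline_ct = results["VLA (pi0)"]


def compare(name):
    """Return (SR gain in percentage points, CT reduction in percent) vs pi0."""
    sr, ct = results[name]
    sr_gain = round(sr - baseline_sr, 1)
    ct_cut = round(100 * (baseline_ct - ct) / baseline_ct, 1)
    return sr_gain, ct_cut


omni = compare("OmniVTLA")   # -> (21.9, 24.2)
sa_only = compare("VTLA-SA")  # -> (12.5, 26.3)
```

The success-rate gain is an absolute difference in percentage points, while the completion-time figure is a relative reduction. On these numbers VTLA-SA finishes in slightly fewer steps than OmniVTLA (484 vs 498), even though OmniVTLA has the highest success rate.
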
Other findings:
From the authors:
What I noticed reading it:
OmniVTLA is a clean data point for the thesis that many manipulation
predicates are not visually evaluable: `is_grasped`, `is_inserted`,
`surface_is_rough`, and `is_soft` live in force/contact, exactly the signals
OmniVTLA shows decide pick-place and peg-insertion. Relative to BLADE:
BLADE's contact predicates (`turned-on(faucet)`, `is-grasped`) are learned as
visual classifiers, and BLADE itself flags that gripper-state segmentation
can't handle caging grasps or contact-rich tasks, and that continuous force
sits opaquely inside the diffusion policy. OmniVTLA is on the opposite end of
the structure-vs-scale axis (a monolithic flow-matching VLA with no symbolic
abstraction layer), but its tactile encoder is precisely the kind of grounded
perceptual front-end a force-aware predicate would need. The provocative
connection: could SA-ViT-style semantically-aligned tactile embeddings serve
as the classifier input for tactile predicates (`is_inserted`,
`is_screwed_tight`) in a BLADE-style bilevel planner, giving the abstraction
layer access to non-visual state? The "architecture > parameters" result also
rhymes with BLADE's "structure as prior" stance against pure end-to-end
scaling.
Caveat on relevance: OmniVTLA contributes no planning, no compositionality, and no predicate learning — it is a perception/representation + end-to-end-policy paper. Its value to my line is as a tactile grounding module, not as a planning method.
Existing attempts to incorporate touch into VLA frameworks often treat tactile data as low-level signals, failing to align them semantically with visual and linguistic contexts. — §1 Introduction / p.1
This indicates the performance gains are attributed to the choice of tactile encoder architecture rather than the parameter count. — §4.2 Real-World Pick and Place Results / p.8
… smoother trajectories that adhere to the intuitive principle of “move quickly when clear, only slow down during contact approach.” — §1 / p.2
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant: