KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation

Di Zhang*, Chengbo Yuan, Chuan Wen, Hai Zhang, Junqiao Zhao, Yang Gao† · Tongji / Tsinghua / SJTU / HKU / Shanghai Qi Zhi / Shanghai AI Lab · 2025 · arXiv preprint · arXiv:2505.01974

One-liner. KineDex collects tactile-rich dexterous-hand demos by hand-over-hand kinesthetic teaching (the operator literally "wears" the robot hand via finger straps), inpaints the operator's body out of the camera view, and trains a diffusion policy that predicts both joint targets and fingertip forces — turning force into an explicit action channel so contact-rich tasks like squeezing toothpaste actually apply pressure instead of just touching.

Problem & motivation

Dexterous manipulation benefits enormously from tactile sensing, but the bottleneck is collecting high-fidelity tactile demonstrations. The two dominant data-collection paradigms, (i) teleoperation (VR headsets, data gloves) and (ii) video retargeting, both fall short here: each suffers from kinematic mismatch between the human and robot hand, and, critically, the operator receives no real-time tactile feedback, so demo quality hinges on operator expertise and the recorded contact forces are unreliable. Exoskeleton rigs add haptic proxies, but the interaction still differs from direct contact and collection stays slow. KineDex's pitch: let the operator physically guide the robot hand hand-over-hand, so contact forces transmit directly to the human and the recorded tactile data is physically grounded. The remaining catch is that the human hand occludes the camera view during teaching, a train/inference domain shift the paper resolves with video inpainting.

Method

1. Kinesthetic data collection (Fig 1, §3.2). Ring-shaped straps are attached to the dorsal sides of the four non-thumb fingers of the robot hand; the operator slips fingers through them and guides the hand "as if wearing a glove," so contact forces during motion are immediately felt. Because of the human/robot thumb morphology mismatch, the operator controls the thumb separately with their left hand while the right hand drives the other four fingers. Two RGB cameras record (front workspace view + wrist-mounted close-range view). Recorded per-demo modalities: visual frames, proprioception (arm end-effector pose + hand joint positions), per-finger dense tactile matrix, and an aggregated 3D fingertip force vector f = (fx, fy, fz) per fingertip.

2. Inpainting away the human hand (§3.3). The front camera inevitably captures the operator's body, creating a severe out-of-distribution shift at inference when the human is gone (the paper shows w/o-inpainting collapses to 0% everywhere). Pipeline: Grounded-SAM [51] extracts masks of the operator's body parts, then ProPainter [30] inpaints the occluded regions (Fig 2). The inpainting model is not robot-specific and removal isn't perfect, but it is shown sufficient for policy training — and is positioned as a more scalable alternative to prior trajectory-replay methods (DexForce [28]).
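The mask-then-inpaint flow in step 2 can be sketched as below. This is only a structural sketch: `segment_fn` and `inpaint_fn` are hypothetical callables standing in for Grounded-SAM and ProPainter, whose real APIs are not reproduced here, and the toy stand-ins (a red-pixel mask and a per-pixel temporal mean) are deliberately naive substitutes chosen so the example runs with NumPy alone.

```python
import numpy as np

def remove_operator(frames, segment_fn, inpaint_fn):
    """frames: (T, H, W, 3) uint8 video.
    segment_fn(frames) -> (T, H, W) bool masks of the operator (hypothetical interface).
    inpaint_fn(frames, masks) -> (T, H, W, 3) filled video (hypothetical interface)."""
    masks = segment_fn(frames)
    filled = inpaint_fn(frames, masks)
    # Only masked pixels are replaced; everything else keeps its recorded value.
    return np.where(masks[..., None], filled, frames)

def toy_segment(frames):
    # Stand-in for a segmentation model: pure-red pixels count as "operator".
    return (frames[..., 0] == 255) & (frames[..., 1] == 0) & (frames[..., 2] == 0)

def toy_inpaint(frames, masks):
    # Stand-in for a video inpainter: per-pixel mean over frames where the pixel is unmasked.
    valid = ~masks[..., None]
    count = np.maximum(valid.sum(axis=0), 1)
    background = (frames * valid).sum(axis=0) / count
    return np.broadcast_to(background.astype(frames.dtype), frames.shape)

# Synthetic clip: gray background with a red "hand" patch in the first frame only.
clip = np.full((3, 4, 4, 3), 100, dtype=np.uint8)
clip[0, 1:3, 1:3] = (255, 0, 0)
cleaned = remove_operator(clip, toy_segment, toy_inpaint)
```

The point of the composite step is that imperfect inpainting only touches the masked region, which fits the note that removal need not be perfect to be useful for policy training.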

3. Policy learning (§3.3). Backbone is Diffusion Policy [44], conditioned on inpainted visual observations, tactile sensing, and proprioception. The observation at step $t$ comprises RGB images $o_t$ from $N_o$ views, tactile vectors $q_t$ from $N_q$ sensing points per fingertip, and proprioceptive joint signals $x_t$. The policy models $p(x^d, f^d \mid o_t, q_t, x_t)$, predicting both target joint positions $x^d$ and $N_f$-dimensional target fingertip forces $f^d$. Force supervision uses only the normal component $f_z$ (the axis the fingertip can actively exert along). The policy emits action chunks [52] for smoother control.
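To make the "force as an action channel" idea concrete, here is a minimal sketch of how supervised action targets could be assembled from a demo: joint positions concatenated with the normal force of each fingertip, then windowed into chunks. The array shapes, the chunk horizon, and the function names are my assumptions for illustration, not the paper's data format.

```python
import numpy as np

def build_action_chunks(joint_pos, tip_force, horizon):
    """joint_pos: (T, J) recorded joint positions.
    tip_force: (T, F, 3) aggregated (fx, fy, fz) per fingertip.
    Returns (T - horizon + 1, horizon, J + F) action chunks."""
    fz = tip_force[..., 2]                              # keep only the normal component
    actions = np.concatenate([joint_pos, fz], axis=1)   # (T, J + F)
    starts = np.arange(actions.shape[0] - horizon + 1)[:, None]
    idx = starts + np.arange(horizon)[None, :]          # sliding-window indices
    return actions[idx]

def split_action(action, num_joints):
    """Split a predicted action into joint targets and fingertip force targets."""
    return action[..., :num_joints], action[..., num_joints:]

# Hypothetical demo: 10 steps, 6 joints, 5 fingertips, chunk horizon 4.
rng = np.random.default_rng(0)
joint_pos = rng.normal(size=(10, 6))
tip_force = rng.normal(size=(10, 5, 3))
chunks = build_action_chunks(joint_pos, tip_force, horizon=4)
x_d, f_d = split_action(chunks, num_joints=6)
```

With this layout a diffusion policy would denoise one `(horizon, J + F)` tensor per inference call, and the force targets come out of the same head as the joint targets.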

4. Force control at inference ("force-informed action", §3.4). This is the key trick. A pure position-controlled PD law $u = K_p (x^d - x) + K_d (\dot{x}^d - \dot{x})$ (Eq. 1) merely tracks the recorded joint positions and thus only touches the object surface without applying meaningful force. KineDex instead exploits a physical property: a nonzero position error against a rigid object generates pressure, as if the target lies inside the object. Using the predicted forces $f^d$ (orthogonal to the contact surface), it computes force-informed target positions for the fingertip and base joints (Eq. 2): $x^d_{\text{tip}} = x_{\text{tip}} + K_{\text{tip}} f^d$ and $x^d_{\text{base}} = x_{\text{base}} + K_{\text{base}} f^d$, with stiffness hyperparameters $K_{\text{tip}}$, $K_{\text{base}}$ fixed across tasks. The fingers then actively track the desired contact forces.
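A small numeric sketch of why the offset matters, following Eq. 1 and Eq. 2 as summarized above. All gains and values are hypothetical, and I treat the tip and base positions as plain inputs since the summary does not pin down exactly which joint positions the offset is applied to.

```python
import numpy as np

def pd_command(x_d, x, xdot_d, xdot, kp, kd):
    """Eq. 1 as summarized: u = Kp (x_d - x) + Kd (xdot_d - xdot)."""
    return kp * (x_d - x) + kd * (xdot_d - xdot)

def force_informed_targets(x_tip, x_base, f_d, k_tip, k_base):
    """Eq. 2 as summarized: shift each joint target in proportion to the predicted force."""
    return x_tip + k_tip * f_d, x_base + k_base * f_d

# Hypothetical scenario: a finger stalled against a rigid surface.
x_tip = np.array([0.50])    # fingertip joint position where contact stops motion
x_base = np.array([0.30])   # base joint position
f_d = np.array([2.0])       # predicted normal force target
k_tip, k_base = 0.10, 0.05  # assumed gains, not the paper's values
kp, kd = 10.0, 0.5
still = np.zeros(1)         # zero velocity and zero target velocity

# Position-only: the target equals the stalled position, so the command vanishes.
u_position_only = pd_command(x_tip, x_tip, still, still, kp, kd)

# Force-informed: the target sits "inside" the object, so the PD law keeps pushing.
x_tip_d, x_base_d = force_informed_targets(x_tip, x_base, f_d, k_tip, k_base)
u_force_informed = pd_command(x_tip_d, x_tip, still, still, kp, kd)
```

In this toy case the position-only command is zero while the force-informed command scales with `kp * k_tip * f_d`, which is the mechanism the section describes: a larger predicted force pushes the target deeper and yields more pressure.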

Setup

Results

Headline: KineDex averages 74.4% success across the nine tasks, 57.7 percentage points above the no-force-control variant (which averages 16.7%). Tactile input alone is worth 26.7 percentage points on the three most contact-intensive tasks (Cap Twisting, Toothpaste Squeezing, Syringe Pressing). Inference-time success, successful trials out of 20 (Table 1):

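| Task | KineDex | w/o Force Ctrl | w/o Tactile | w/o Inpaint |
|---|---|---|---|---|
| Bottle Picking | 17 | 0 | 15 | 0 |
| Cup Picking | 20 | 16 | 17 | 0 |
| Egg Picking | 17 | 5 | 18 | 0 |
| Cap Twisting | 15 | 2 | 10 | 0 |
| Nut Tightening | 16 | 7 | 12 | 0 |
| Peg Insertion | 15 | 0 | 16 | 0 |
| Charger Plugging | 12 | 0 | 10 | 0 |
| Toothpaste Squeezing | 9 | 0 | 3 | 0 |
| Syringe Pressing | 13 | 0 | 8 | 0 |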
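As a sanity check, the headline numbers can be recomputed from the Table 1 counts. The sketch below assumes my reading of the columns (KineDex, w/o Force Ctrl, w/o Tactile, w/o Inpaint) and 20 trials per task; the variable names are mine.

```python
# Successful trials out of 20 per task, in column order:
# (KineDex, w/o Force Ctrl, w/o Tactile, w/o Inpaint)
results = {
    "Bottle Picking": (17, 0, 15, 0),
    "Cup Picking": (20, 16, 17, 0),
    "Egg Picking": (17, 5, 18, 0),
    "Cap Twisting": (15, 2, 10, 0),
    "Nut Tightening": (16, 7, 12, 0),
    "Peg Insertion": (15, 0, 16, 0),
    "Charger Plugging": (12, 0, 10, 0),
    "Toothpaste Squeezing": (9, 0, 3, 0),
    "Syringe Pressing": (13, 0, 8, 0),
}
TRIALS = 20
CONTACT_HEAVY = ("Cap Twisting", "Toothpaste Squeezing", "Syringe Pressing")

def success_rate(column, tasks=None):
    """Average success in percent for one column over the given tasks."""
    tasks = list(results) if tasks is None else tasks
    wins = sum(results[t][column] for t in tasks)
    return 100.0 * wins / (TRIALS * len(tasks))

full = success_rate(0)
no_force = success_rate(1)
tactile_gain = success_rate(0, CONTACT_HEAVY) - success_rate(2, CONTACT_HEAVY)
print(f"{full:.1f} {no_force:.1f} {tactile_gain:.1f}")  # prints "74.4 16.7 26.7"
```

The counts reproduce 74.4%, 16.7%, and the 26.7-point tactile gain. One small wrinkle: the unrounded gap between the first two averages is closer to 57.8, so the quoted 57.7 appears to come from subtracting the rounded averages.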

Where it wins and loses:

Limitations & open questions

From the authors (§6):

What I noticed reading it:

Why I care

This sits squarely on the thesis I keep coming back to from BLADE: many manipulation predicates — is_screwed_tight(cap), is_inserted(peg), is_full(syringe), tube_is_squeezed — are not visually evaluable; they live in touch and force. KineDex is a clean, concrete demonstration of that gap at the control layer: its central empirical result is that a position-only policy looks like it's doing the task (it touches the object) but applies no force and fails, while making force an explicit predicted action channel recovers 57.7%. For BLADE specifically, this is the missing continuous / force-modulated layer I flagged in BLADE's "Why I care": BLADE's abstraction is purely categorical and shoves all continuous parameters (pour amount, grasp pose, force) inside the diffusion policy. KineDex shows how to make force a first-class, supervised output of exactly such a diffusion-policy body — a candidate recipe for a force-aware behavior body whose effects (is_screwed_tight) could then be checked by a tactile predicate classifier rather than a visual one. It also strengthens the learning-from-demonstration angle: hand-over-hand teaching is a data source where the force ground truth is physically real, not retargeted.

Off the BLADE planning axis, this is a perception/control paper, not a language/planning one — no symbolic abstraction, no language. Its relevance is to the sensing substrate that future predicate-invention work would need.

Quotable

Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. — Abstract / p.1
Relying solely on position control … results in merely contacting the object's surface without applying meaningful forces, as the recorded fingertip positions remain unchanged regardless of the forces applied during kinesthetic teaching. This discrepancy often leads to unstable grasps, slipping, or ineffective manipulation. — §3.4 / p.5
Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. — Abstract / p.1

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: