KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation

Di Zhang*, Chengbo Yuan, Chuan Wen, Hai Zhang, Junqiao Zhao, Yang Gao† · Tongji / Tsinghua / SJTU / HKU / Shanghai Qi Zhi / Shanghai AI Lab · 2025 · arXiv preprint · arXiv:2505.01974 · PDF · project page

One-liner. KineDex collects tactile-rich dexterous-hand demos by hand-over-hand kinesthetic teaching (the operator literally "wears" the robot hand via finger straps), inpaints the operator's body out of the camera view, and trains a diffusion policy that predicts both joint targets and fingertip forces — turning force into an explicit action channel so contact-rich tasks like squeezing toothpaste actually apply pressure instead of just touching.

Problem & motivation

Dexterous manipulation benefits enormously from tactile sensing, but the bottleneck is collecting high-fidelity tactile demonstrations. The two dominant data-collection paradigms, (i) teleoperation (VR headsets, data gloves) and (ii) video retargeting, both fall short here: each suffers from kinematic mismatch between the human and robot hand, and, critically, the operator receives no real-time tactile feedback, so demo quality hinges on operator expertise and the recorded contact forces are unreliable. Exoskeleton rigs add haptic proxies, but the interaction still differs from direct contact and collection stays slow. KineDex's pitch: let the operator physically guide the robot hand hand-over-hand, so contact forces transmit directly to the human and the recorded tactile data is physically grounded. The remaining catch is that the operator's hand and body appear in the camera view during teaching but not at inference, a train/inference domain shift the paper resolves with video inpainting.

Method

1. Kinesthetic data collection (Fig 1, §3.2). Ring-shaped straps are attached to the dorsal sides of the four non-thumb fingers of the robot hand; the operator slips fingers through them and guides the hand "as if wearing a glove," so contact forces during motion are immediately felt. Because of the human/robot thumb morphology mismatch, the operator controls the thumb separately with their left hand while the right hand drives the other four fingers. Two RGB cameras record (front workspace view + wrist-mounted close-range view). Recorded per-demo modalities: visual frames, proprioception (arm end-effector pose + hand joint positions), per-finger dense tactile matrix, and an aggregated 3D fingertip force vector f = (f_x, f_y, f_z) per fingertip.

2. Inpainting away the human hand (§3.3). The front camera inevitably captures the operator's body, creating a severe out-of-distribution shift at inference when the human is gone (the paper shows w/o-inpainting collapses to 0% everywhere). Pipeline: Grounded-SAM [51] extracts masks of the operator's body parts, then ProPainter [30] inpaints the occluded regions (Fig 2). The inpainting model is not robot-specific and removal isn't perfect, but it is shown sufficient for policy training — and is positioned as a more scalable alternative to prior trajectory-replay methods (DexForce [28]).

3. Policy learning (§3.3). Backbone is Diffusion Policy [44], conditioned on inpainted visual observations, tactile sensing, and proprioception. The observation at step t comprises RGB from N_o views o_t, tactile vectors from N_q sensing points per fingertip q_t, and proprioceptive joint signals x_t. The policy models p(x_d, f_d | o_t, q_t, x_t), predicting both target joint positions x_d and N_f-dimensional target fingertip forces f_d. Force supervision uses only the normal component f_z (the axis the fingertip can actively exert along). The policy emits action chunks [52] for smoother control.
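
To make the joint action space concrete, here is a minimal sketch of how one predicted action chunk could be split into position targets x_d and force targets f_d. The flat per-step layout, the chunk length, and the count of one normal-force target per fingertip are assumptions for illustration; the paper summary above does not specify them, and the arm pose is left out for brevity.

```python
# Hypothetical layout: each step of a chunk is [joint targets..., force targets...].
HORIZON = 8    # assumed chunk length, not taken from the paper
N_JOINTS = 12  # XHand1 joint count from the Setup section (arm pose omitted here)
N_FORCES = 5   # assumed: one normal-force (f_z) target per fingertip


def split_action_chunk(chunk):
    """Split a chunk of flat action vectors into (x_d, f_d) per step."""
    if len(chunk) != HORIZON:
        raise ValueError(f"expected {HORIZON} steps, got {len(chunk)}")
    joint_targets, force_targets = [], []
    for step in chunk:
        if len(step) != N_JOINTS + N_FORCES:
            raise ValueError("unexpected action dimension")
        joint_targets.append(list(step[:N_JOINTS]))   # x_d for this step
        force_targets.append(list(step[N_JOINTS:]))   # f_d for this step
    return joint_targets, force_targets


# Dummy chunk: constant joint targets and a constant 1.5 N force target.
chunk = [[0.1] * N_JOINTS + [1.5] * N_FORCES for _ in range(HORIZON)]
x_d, f_d = split_action_chunk(chunk)
print(len(x_d), len(x_d[0]), len(f_d[0]))
```

The point of the sketch is only that force is a supervised output alongside position, so the controller downstream receives both per step.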

4. Force control at inference ("force-informed action", §3.4). This is the key trick. A pure position-controlled PD law u = K_p(x_d - x) + K_d(ẋ_d - ẋ) (Eq. 1) merely tracks the recorded joint positions and thus only touches the object surface without applying meaningful force. KineDex instead exploits a physical property: a nonzero position error against a rigid object generates pressure, as if the target lies inside the object. Using the predicted forces f_d (orthogonal to the contact surface), it computes force-informed target positions for the fingertip and base joints (Eq. 2): x_d^tip = x^tip + K^tip·f_d and x_d^base = x^base + K^base·f_d, with stiffness hyperparameters K^tip, K^base fixed across tasks. The fingers then actively track the desired contact forces.
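
The interplay of Eq. 1 and Eq. 2 can be sketched as follows. This is an illustrative reading of the two equations as summarized above, not the authors' controller: the gain values, units, the sign convention (positive offset pushes into the object), and the function names are all assumptions.

```python
def force_informed_targets(x, f_d, k):
    """Eq. 2 style offset: x_d = x + k * f_d, elementwise over joints.

    x   : current joint positions (assumed radians)
    f_d : predicted normal forces mapped to those joints (assumed newtons)
    k   : stiffness-like hyperparameter (K^tip or K^base), assumed rad/N
    """
    if len(x) != len(f_d):
        raise ValueError("x and f_d must have the same length")
    return [xi + k * fi for xi, fi in zip(x, f_d)]


def pd_command(x_d, x, xdot_d, xdot, kp, kd):
    """Eq. 1: u = K_p (x_d - x) + K_d (xdot_d - xdot), elementwise."""
    return [kp * (a - b) + kd * (c - d)
            for a, b, c, d in zip(x_d, x, xdot_d, xdot)]


# Two joints resting on a rigid surface; only the first should press.
x = [0.50, 0.40]
f_d = [2.0, 0.0]
x_d = force_informed_targets(x, f_d, k=0.05)   # hypothetical gain
u = pd_command(x_d, x, [0.0, 0.0], [0.0, 0.0], kp=10.0, kd=1.0)
print(x_d, u)
```

With a zero predicted force the target equals the current position and the PD law commands nothing, which is the "touches but does not press" behavior of the position-only variant. A nonzero f_d places the target "inside" the object, so the standing position error turns into sustained effort.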

Setup

Datasets / benchmarks: Self-collected kinesthetic demonstrations on a suite of nine contact-rich dexterous tasks: Bottle Picking, Cup Picking, Egg Picking, Cap Twisting, Nut Tightening, Peg Insertion, Charger Plugging, Toothpaste Squeezing, Syringe Pressing. No public benchmark; task descriptions in Appendix A. Number of demos per task: not reported in main text.
Hardware / simulator: Real-world only. Franka Emika Panda 7-DoF arm + Robotera XHand1 dexterous hand (12 DoF: 2 joints per finger, plus an extra rotational joint on thumb and index; 120 tactile sensing points per finger). Two RGB cameras (front + wrist). Efficiency experiments use an Inspire Hand to replicate the Open-TeleVision [15] / Meta Quest 3 teleop setup for a fair comparison.
Baselines: Three ablated variants of KineDex itself — w/o Force Control (position control only at inference), w/o Tactile Input (tactile dropped from policy inputs, still predicts/executes forces), w/o Inpainting (skip the body-removal preprocessing). For data-collection efficiency, the baseline is teleoperation. No external learned-policy baselines.
Compute: not reported.

Results

Headline: KineDex averages 74.4% success across the nine tasks, 57.7 percentage points above the no-force-control variant (which averages 16.7%). Tactile input alone is worth 26.7 percentage points on the three most contact-intensive tasks (Cap Twisting, Toothpaste Squeezing, Syringe Pressing). Inference-time success, successful trials out of 20 (Table 1; the w/o Inpainting column is omitted because it is 0/20 on every task):

Task	KineDex	w/o Force Ctrl	w/o Tactile
Bottle Picking	17	0	15
Cup Picking	20	16	17
Egg Picking	17	5	18
Cap Twisting	15	2	10
Nut Tightening	16	7	12
Peg Insertion	15	0	16
Charger Plugging	12	0	10
Toothpaste Squeezing	9	0	3
Syringe Pressing	13	0	8
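
As a sanity check, the headline averages can be recomputed from the counts in the table above. This is a small sketch over the numbers as transcribed in these notes; the variable names are mine.

```python
TRIALS = 20  # trials per task per variant

# task -> (KineDex, w/o Force Ctrl, w/o Tactile) successes out of 20
successes = {
    "Bottle Picking": (17, 0, 15),
    "Cup Picking": (20, 16, 17),
    "Egg Picking": (17, 5, 18),
    "Cap Twisting": (15, 2, 10),
    "Nut Tightening": (16, 7, 12),
    "Peg Insertion": (15, 0, 16),
    "Charger Plugging": (12, 0, 10),
    "Toothpaste Squeezing": (9, 0, 3),
    "Syringe Pressing": (13, 0, 8),
}


def rate(col, tasks=None):
    """Pooled success rate (in %) for one variant over the given tasks."""
    tasks = list(successes) if tasks is None else tasks
    wins = sum(successes[t][col] for t in tasks)
    return 100.0 * wins / (TRIALS * len(tasks))


contact_heavy = ["Cap Twisting", "Toothpaste Squeezing", "Syringe Pressing"]
full, no_force = rate(0), rate(1)
tactile_gain = rate(0, contact_heavy) - rate(2, contact_heavy)

print(f"{full:.1f}")             # 74.4
print(f"{no_force:.1f}")         # 16.7
print(f"{full - no_force:.1f}")  # 57.8
print(f"{tactile_gain:.1f}")     # 26.7
```

The unrounded gap is 104/180, about 57.8 points; the reported 57.7 matches subtracting the two rounded averages (74.4 and 16.7).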

Where it wins and loses:

Force control is decisive on contact-rich tasks. w/o Force Control drops to 0/20 on Bottle Picking, Peg Insertion, Charger Plugging, Toothpaste Squeezing, Syringe Pressing — the hand contacts but never applies pressure. Fig 3 shows the force-informed policy tracks predicted contact forces in magnitude and timing, while the position-only variant's executed force stays flat.
Tactile is not always net-positive. On Egg Picking (18 vs 17) and Peg Insertion (16 vs 15), the w/o-Tactile variant slightly beats full KineDex — visual info suffices there, and tactile adds noise. Tactile matters for occlusion-heavy / contact-reliant tasks.
Inpainting is load-bearing. w/o Inpainting is 0/20 on every task — raw demos with the human body present are unusable for training.
Data-collection efficiency (Table 2): across five teleop-feasible tasks, KineDex hits near-100% collection success (e.g. 20/20 on Bottle/Cup/Cap, 18/20 Charger) vs teleoperation's 39% average (Charger Plugging 0/20). Per Fig 4, KineDex collects >2× faster — ~half the time on Syringe Pressing, <one-third on Bottle Picking. A 5-participant user study (Fig 5) reports 100% agreement that KineDex collects more accurate tactile data and suits complex tasks better; 80% found it easier to use.

Limitations & open questions

From the authors (§6):

Inpainting is sufficient here but may degrade under more severe occlusions; fine-tuning the inpainter on robot-specific data could help.
The current rig needs two human hands to control one dexterous hand (because of the thumb morphology mismatch), so it does not scale to bimanual demonstrations; better biomimetic hardware could enable single-handed kinesthetic teaching.

What I noticed reading it:

All results are count-of-success out of 20, single setting — no seeds, no variance, no confidence intervals. The headline "74.4% / +57.7%" rests on 9×20 = 180 trials with no statistical error bars; the small Egg/Peg reversals (1 trial) are within plausible noise.
No external baselines. Every comparison is an ablation of KineDex; there's no head-to-head against a teleoperation-trained policy or a competing tactile-policy method (e.g. DexForce [28], which they call the closest prior). The efficiency study compares data collection success, not final policy success of teleop-trained policies.
Force supervision uses only the normal component f_z; shear/tangential forces (critical for slip detection, twisting) are discarded from the action target even though the recorded fingertip force vector f = (f_x, f_y, f_z) already includes them.
The force-control law (Eq. 2) is an open-loop position offset proportional to predicted force, with hand-tuned fixed stiffness — it assumes rigid contact and won't obviously transfer to soft/deformable objects where the error→force mapping is nonlinear.
"Demos per task" and training compute are not reported, so sample-efficiency claims can't be assessed.

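To put a rough number on the "no error bars" point in the list above, here is a sketch that computes 95% Wilson score intervals for a few of the reported counts. Treating trials as independent Bernoulli draws is my assumption; the paper reports no such analysis.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials (z=1.96 ~ 95%)."""
    if not 0 <= k <= n or n <= 0:
        raise ValueError("need 0 <= k <= n and n > 0")
    p = k / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Egg Picking: full KineDex (17/20) vs w/o Tactile (18/20)
egg_full = wilson_interval(17, 20)
egg_no_tactile = wilson_interval(18, 20)
# Pooled headline: 134 successes over 9 x 20 = 180 trials
pooled = wilson_interval(134, 180)

for name, (lo, hi) in [("egg full", egg_full),
                       ("egg w/o tactile", egg_no_tactile),
                       ("pooled", pooled)]:
    print(f"{name}: [{100 * lo:.1f}%, {100 * hi:.1f}%]")
```

At n = 20 each per-task interval spans roughly 25 to 30 percentage points, so the 17 vs 18 Egg Picking reversal is well inside the overlap, while the pooled KineDex rate is pinned down much more tightly.
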
Why I care

This sits squarely on the thesis I keep coming back to from BLADE: many manipulation predicates (is_screwed_tight(cap), is_inserted(peg), is_full(syringe), tube_is_squeezed) are not visually evaluable; they live in touch and force. KineDex is a clean, concrete demonstration of that gap at the control layer: its central empirical result is that a position-only policy looks like it's doing the task (it touches the object) but applies no force and fails, while making force an explicit predicted action channel recovers 57.7 percentage points of average success. For BLADE specifically, this is the missing continuous / force-modulated layer I flagged in BLADE's "Why I care": BLADE's abstraction is purely categorical and shoves all continuous parameters (pour amount, grasp pose, force) inside the diffusion policy. KineDex shows how to make force a first-class, supervised output of exactly such a diffusion-policy body: a candidate recipe for a force-aware behavior body whose effects (is_screwed_tight) could then be checked by a tactile predicate classifier rather than a visual one. It also strengthens the learning-from-demonstration angle: hand-over-hand teaching is a data source where the force ground truth is physically real, not retargeted.

Off the BLADE planning axis, this is a perception/control paper, not a language/planning one — no symbolic abstraction, no language. Its relevance is to the sensing substrate that future predicate-invention work would need.

Quotable

Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. — Abstract / p.1

Relying solely on position control … results in merely contacting the object's surface without applying meaningful forces, as the recorded fingertip positions remain unchanged regardless of the forces applied during kinesthetic teaching. This discrepancy often leads to unstable grasps, slipping, or ineffective manipulation. — §3.4 / p.5

Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. — Abstract / p.1

Papers cited that should likely be ingested next:

[28] Chen et al. 2025 — DexForce — the closest prior: also extracts force-informed actions from kinesthetic demonstrations, on simpler/fewer-DoF hardware; KineDex's direct conceptual predecessor and the method it most distinguishes itself from. High-priority ingest.
[44] Chi et al. 2024 — Diffusion Policy — the backbone; foundational dependency (shared with BLADE).
[45] Ze et al. 2024 — 3D Diffusion Policy — point-cloud-conditioned DP variant; natural extension of KineDex's perception.
[15] Cheng et al. 2024 — Open-TeleVision — the teleop system KineDex replicates as its efficiency baseline.
[29] Liu et al. 2024 — ForceMimic — force-centric imitation with a force-motion capture system for contact-rich manipulation; sibling line.
[46] Sun et al. 2025 — VTAO-BiManip — masked visual-tactile-action pretraining for bimanual dexterous manipulation; relevant to the bimanual limitation.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

MimicTouch — same cluster F; learning from human tactile demonstrations, the closest cousin in data-collection philosophy (humans as the tactile teacher).
FoAR and Tactile-Conditioned Diffusion Policy — force-aware diffusion policies; same backbone + force-as-signal design, alternative ways to fold force into the policy.
Reactive Diffusion Policy — slow-fast visuotactile control; complementary take on real-time tactile reactivity that KineDex's action-chunking sidesteps.
FACTR / FACTR 2 — force-attending policy training and external force sensing on commodity arms; adjacent on the force-as-input axis.
Making Sense of Vision and Touch — foundational visuo-tactile fusion; the representation-learning ancestor of KineDex's tactile-augmented inputs.
ForceVLA and Tactile-VLA — cluster C; force/tactile pushed into the VLA paradigm, the larger-model counterpart to KineDex's task-specific diffusion policy.