One-liner. KineDex collects tactile-rich dexterous-hand demos by hand-over-hand kinesthetic teaching (the operator literally "wears" the robot hand via finger straps), inpaints the operator's body out of the camera view, and trains a diffusion policy that predicts both joint targets and fingertip forces — turning force into an explicit action channel so contact-rich tasks like squeezing toothpaste actually apply pressure instead of just touching.
Dexterous manipulation benefits enormously from tactile sensing, but the bottleneck is collecting high-fidelity tactile demonstrations. The two dominant data-collection paradigms, (i) teleoperation (VR headsets, data gloves) and (ii) video retargeting, both fall short: they suffer from kinematic mismatch between the human and robot hand, and, critically, the operator receives no real-time tactile feedback, so demo quality hinges on operator expertise and the recorded contact forces are unreliable. Exoskeleton rigs add haptic proxies, but the interaction still differs from direct contact and collection stays slow. KineDex's pitch: let the operator physically guide the robot hand hand-over-hand, so contact forces transmit directly to the human and the recorded tactile data is physically grounded. The remaining catch is that the operator's hand and body appear in the camera view during teaching, a train/inference domain shift the paper resolves with video inpainting.
1. Kinesthetic data collection (Fig 1, §3.2). Ring-shaped
straps are attached to the dorsal sides of the four non-thumb fingers of the robot
hand; the operator slips fingers through them and guides the hand "as if wearing a
glove," so contact forces during motion are immediately felt. Because of the
human/robot thumb morphology mismatch, the operator controls the thumb separately
with their left hand while the right hand drives the other four fingers.
Two RGB cameras record each demonstration: a front workspace view and a wrist-mounted close-range view.
Recorded per-demo modalities: visual frames, proprioception (arm end-effector pose
+ hand joint positions), a dense tactile matrix per finger, and an aggregated 3D
force vector $f = (f_x, f_y, f_z)$ per fingertip.
2. Inpainting away the human hand (§3.3). The front camera
inevitably captures the operator's body, creating a severe out-of-distribution
shift at inference when the human is gone (the paper's w/o-inpainting ablation scores
0 of 20 on every task). Pipeline: Grounded-SAM [51] extracts masks of the
operator's body parts, then ProPainter [30] inpaints the occluded
regions (Fig 2). The inpainting model is not robot-specific and removal isn't
perfect, but it is shown sufficient for policy training — and is positioned
as a more scalable alternative to prior trajectory-replay methods (DexForce [28]).
3. Policy learning (§3.3). Backbone is Diffusion Policy [44],
conditioned on inpainted visual observations, tactile sensing, and proprioception.
The observation at step $t$ comprises RGB images $o_t$ from $N_o$ views,
tactile vectors $q_t$ from $N_q$ sensing points per fingertip,
and proprioceptive joint signals $x_t$.
The policy models $p(x_d, f_d \mid o_t, q_t, x_t)$,
predicting both target joint
positions $x_d$ and $N_f$-dimensional target
fingertip forces $f_d$. Force supervision uses only the normal
component $f_z$ (the axis along which the fingertip can actively exert force).
The policy emits action chunks [52] for smoother control.
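To make the action layout concrete, here is a minimal sketch of how one predicted chunk could be split into joint targets and fingertip force targets and then executed. Everything numeric here is an assumption for illustration: the sizes `N_JOINTS` and `N_FORCES`, the chunk length, the number of steps executed before re-planning, and the `fake_policy` stand-in are hypothetical and not taken from the paper. The receding-horizon execution scheme is likewise an assumption about how chunks are consumed.

```python
from typing import Callable, List, Tuple

# Hypothetical sizes, chosen only for illustration (not from the paper).
N_JOINTS = 12    # dimension of the joint-target vector x_d
N_FORCES = 5     # N_f: one normal-force target per fingertip
CHUNK_LEN = 8    # actions predicted per policy call
EXEC_STEPS = 4   # actions executed before asking the policy again

Action = List[float]


def split_action(action: Action) -> Tuple[List[float], List[float]]:
    """Split one flat action into (joint targets x_d, force targets f_d)."""
    if len(action) != N_JOINTS + N_FORCES:
        raise ValueError("unexpected action dimension")
    return action[:N_JOINTS], action[N_JOINTS:]


def fake_policy(step: int) -> List[Action]:
    """Stand-in for the diffusion policy: returns CHUNK_LEN flat actions."""
    return [
        [float(step + k)] * N_JOINTS + [0.5] * N_FORCES
        for k in range(CHUNK_LEN)
    ]


def rollout(policy: Callable[[int], List[Action]], total_steps: int):
    """Run EXEC_STEPS actions of each chunk, then re-plan from the new step."""
    executed = []
    step = 0
    while step < total_steps:
        chunk = policy(step)
        for action in chunk[:EXEC_STEPS]:
            if step >= total_steps:
                break
            executed.append(split_action(action))
            step += 1
    return executed


trajectory = rollout(fake_policy, total_steps=10)
x_d, f_d = trajectory[0]
print(len(trajectory), len(x_d), len(f_d))
```

The point of the sketch is only that force targets travel in the same action vector as joint targets, so they are chunked and executed together.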
4. Force control at inference ("force-informed action", §3.4).
This is the key trick. A pure position-controlled PD law
$u = K_p (x_d - x) + K_d (\dot{x}_d - \dot{x})$ (Eq. 1) merely tracks the
recorded joint positions and thus only touches the object surface without
applying meaningful force. KineDex instead exploits a physical property: a nonzero
position error against a rigid object generates pressure, as if the target lies
inside the object. Using the predicted forces $f_d$
(orthogonal to the contact surface), it computes force-informed target
positions for the fingertip and base joints (Eq. 2):
$x_d^{\mathrm{tip}} = x^{\mathrm{tip}} + K_{\mathrm{tip}} f_d$
and $x_d^{\mathrm{base}} = x^{\mathrm{base}} + K_{\mathrm{base}} f_d$,
with stiffness hyperparameters $K_{\mathrm{tip}}$, $K_{\mathrm{base}}$ fixed
across tasks. The fingers then actively track the desired contact forces.
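A minimal sketch of Eq. 1 and Eq. 2 together, treating each finger as one tip joint and one base joint with scalar values. The gain values below are made up for illustration (the paper fixes the two K gains across tasks, but these numbers are not from it), and the notes do not say whether the position on the right-hand side of Eq. 2 is the measured or the predicted one, so the function simply takes it as a reference position.

```python
from typing import List, Tuple

# Hypothetical gains for illustration only; not values from the paper.
K_TIP = 0.02    # position offset per unit force, fingertip joints
K_BASE = 0.01   # position offset per unit force, base joints
KP = 5.0        # proportional gain of the PD law
KD = 0.1        # derivative gain of the PD law


def force_informed_targets(
    x_tip: List[float], x_base: List[float], f_d: List[float]
) -> Tuple[List[float], List[float]]:
    """Eq. 2: shift each reference position in proportion to the desired force."""
    x_tip_d = [x + K_TIP * f for x, f in zip(x_tip, f_d)]
    x_base_d = [x + K_BASE * f for x, f in zip(x_base, f_d)]
    return x_tip_d, x_base_d


def pd_command(
    x_d: List[float], x: List[float], xdot_d: List[float], xdot: List[float]
) -> List[float]:
    """Eq. 1: u = Kp (x_d - x) + Kd (xdot_d - xdot), applied per joint."""
    return [
        KP * (xd - xi) + KD * (vd - vi)
        for xd, xi, vd, vi in zip(x_d, x, xdot_d, xdot)
    ]


# Two fingers resting on a rigid surface: the first asks for no force,
# the second asks for a normal force of 2.0 (arbitrary units).
x_tip, x_base = [0.3, 0.3], [0.1, 0.1]
x_tip_d, x_base_d = force_informed_targets(x_tip, x_base, f_d=[0.0, 2.0])
u_tip = pd_command(x_tip_d, x_tip, [0.0, 0.0], [0.0, 0.0])
print(x_tip_d, u_tip)
```

With zero desired force the target equals the reference position and the command vanishes, which is the "touches but does not press" behavior of plain position control. A positive desired force moves the target past the surface, and because the rigid object prevents the joint from reaching it, the PD loop keeps pushing.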
Headline: KineDex averages 74.4% success across the nine tasks, 57.7 percentage points above the no-force-control variant (which averages 16.7%). On the three most contact-intensive tasks (Cap Twisting, Toothpaste Squeezing, Syringe Pressing), adding tactile input is worth 26.7 percentage points. Inference-time results, as successful trials out of 20 (Table 1):
| Task | KineDex | w/o Force Ctrl | w/o Tactile | w/o Inpaint |
|---|---|---|---|---|
| Bottle Picking | 17 | 0 | 15 | 0 |
| Cup Picking | 20 | 16 | 17 | 0 |
| Egg Picking | 17 | 5 | 18 | 0 |
| Cap Twisting | 15 | 2 | 10 | 0 |
| Nut Tightening | 16 | 7 | 12 | 0 |
| Peg Insertion | 15 | 0 | 16 | 0 |
| Charger Plugging | 12 | 0 | 10 | 0 |
| Toothpaste Squeezing | 9 | 0 | 3 | 0 |
| Syringe Pressing | 13 | 0 | 8 | 0 |
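The headline averages can be re-derived from the table. The short check below uses only the counts transcribed above; the variable and function names are mine. One detail it surfaces: the gap computed from raw counts rounds to 57.8 points, so the quoted 57.7 is the difference of the two already-rounded averages (74.4 and 16.7).

```python
TRIALS = 20  # trials per task, per the table caption
COLUMNS = ("KineDex", "w/o Force Ctrl", "w/o Tactile", "w/o Inpaint")

# Successes out of 20, transcribed from the table above.
results = {
    "Bottle Picking": (17, 0, 15, 0),
    "Cup Picking": (20, 16, 17, 0),
    "Egg Picking": (17, 5, 18, 0),
    "Cap Twisting": (15, 2, 10, 0),
    "Nut Tightening": (16, 7, 12, 0),
    "Peg Insertion": (15, 0, 16, 0),
    "Charger Plugging": (12, 0, 10, 0),
    "Toothpaste Squeezing": (9, 0, 3, 0),
    "Syringe Pressing": (13, 0, 8, 0),
}


def mean_success(tasks, column):
    """Average success rate (percent) of one column over the given tasks."""
    idx = COLUMNS.index(column)
    total = sum(results[task][idx] for task in tasks)
    return 100.0 * total / (TRIALS * len(tasks))


all_tasks = list(results)
contact_tasks = ["Cap Twisting", "Toothpaste Squeezing", "Syringe Pressing"]

full = mean_success(all_tasks, "KineDex")
no_force = mean_success(all_tasks, "w/o Force Ctrl")
tactile_gain = (
    mean_success(contact_tasks, "KineDex")
    - mean_success(contact_tasks, "w/o Tactile")
)

print(f"KineDex average:       {full:.1f}%")             # 74.4%
print(f"w/o force control:     {no_force:.1f}%")         # 16.7%
print(f"gap from raw counts:   {full - no_force:.1f}")   # 57.8
print(f"tactile gain (3 tasks): {tactile_gain:.1f}")     # 26.7
```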
Where it wins and loses:
From the authors (§6):
What I noticed reading it:
This sits squarely on the thesis I keep coming back to from
BLADE:
many manipulation predicates — is_screwed_tight(cap),
is_inserted(peg), is_full(syringe),
tube_is_squeezed — are not visually evaluable;
they live in touch and force. KineDex is a clean, concrete demonstration of
that gap at the control layer: its central empirical result is that a
position-only policy looks like it's doing the task (it touches the object)
but applies no force and fails, while making force an explicit predicted action
channel recovers 57.7 percentage points. For BLADE specifically, this is the missing continuous /
force-modulated layer I flagged in BLADE's "Why I care": BLADE's abstraction is
purely categorical and shoves all continuous parameters (pour amount, grasp pose,
force) inside the diffusion policy. KineDex shows how to make force a
first-class, supervised output of exactly such a diffusion-policy body — a
candidate recipe for a force-aware behavior body whose effects
(is_screwed_tight) could then be checked by a tactile predicate
classifier rather than a visual one. It also strengthens the
learning-from-demonstration angle: hand-over-hand teaching is a data source where
the force ground truth is physically real, not retargeted.
Off the BLADE planning axis, this is a perception/control paper, not a language/planning one — no symbolic abstraction, no language. Its relevance is to the sensing substrate that future predicate-invention work would need.
Based on these demonstrations, we then train a visuomotor policy using tactile-augmented inputs and implement force control during deployment for precise contact-rich manipulation. — Abstract / p.1
Relying solely on position control … results in merely contacting the object's surface without applying meaningful forces, as the recorded fingertip positions remain unchanged regardless of the forces applied during kinesthetic teaching. This discrepancy often leads to unstable grasps, slipping, or ineffective manipulation. — §3.4 / p.5
Across these tasks, KineDex achieves an average success rate of 74.4%, representing a 57.7% improvement over the variant without force control. — Abstract / p.1
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: