MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation

Kelin Yu*, Yunhai Han*, Qixian Wang, Vaibhav Saxena, Danfei Xu, Ye Zhao · Georgia Tech / Zhejiang Technology · CoRL 2024 · arXiv:2310.16917

One-liner. MimicTouch lets a human demonstrate contact-rich insertion with their bare hand — wearing a fingertip GelSight + contact mic instead of teleoperating a robot — then learns a tactile-and-audio-only policy by non-parametric imitation and closes the human–robot embodiment gap with residual RL, so the robot inserts using the same touch-guided strategy a person would, with no vision at execution time.

Problem & motivation

Contact-rich tasks like peg insertion and assembly need real-time tactile feedback because vision is occluded at the contact interface and tolerances are tight. Prior tactile imitation methods collect data by teleoperating a sensorized robot — but the human teleoperator steers using visual feedback, so the modality that controls the demonstration (vision) differs from the modality the policy is supposed to learn from (touch). MimicTouch names this the "sensing gap": teleoperated tactile demos never actually encode a human's tactile-guided control strategy, they only record touch as a passenger. Teleoperation is also slow, requires expertise, and tends to restrict the action space to low-DoF (e.g. 3D translation) to keep dynamic motions tractable. The paper's bet: collect demos directly from the human hand so the touch signal is the thing driving the motion, then transfer that strategy to the robot.

Method

Four stages, shown in Fig 2: (a) human tactile data collection, (b) self-supervised representation learning, (c) non-parametric offline policy, (d) online residual RL fine-tuning on the real robot.

1. Human tactile data collection (Sec 3.1). A human performs the insertion with their hand. Fingertip pose is tracked with a RealSense camera reading an ArUco marker, then calibrated/filtered into end-effector poses. A GelSight Mini vision-based tactile sensor is mounted on the fingertip via a custom fixture for contact images (only one tactile sensor is used, not two). A HOYUJI TD-11 piezo-electric contact microphone captures audio/vibration; crucially it is placed at the base of the insertion hole (not on the hand) to keep vibration signals consistent between the human and robot embodiments. The action is the 6D delta pose of the end-effector — a full 6-DoF space rather than the restricted 3D of prior teleop work.
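A minimal sketch of how tracked fingertip poses could be turned into 6D delta-pose actions, assuming each pose is (x, y, z, roll, pitch, yaw) and that consecutive frames are close enough for a component-wise difference to be a fair approximation. The function names and the Euler-angle representation are my assumptions; the paper's calibration and filtering pipeline is not reproduced here.

```python
import math

def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def delta_pose(prev, curr):
    """Return the 6D delta (dx, dy, dz, droll, dpitch, dyaw) from prev to curr.

    Poses are (x, y, z, roll, pitch, yaw). Angular differences are wrapped so
    that a step across the +/-pi boundary is not read as a near-full turn.
    """
    d_pos = [c - p for p, c in zip(prev[:3], curr[:3])]
    d_rot = [wrap_angle(c - p) for p, c in zip(prev[3:], curr[3:])]
    return d_pos + d_rot

def poses_to_actions(poses):
    """Turn a pose sequence of length T into T-1 delta-pose actions."""
    return [delta_pose(a, b) for a, b in zip(poses, poses[1:])]

# Hypothetical filtered fingertip poses (metres, radians).
poses = [
    (0.400, 0.000, 0.200, 0.0, 0.00, 3.10),
    (0.400, 0.001, 0.195, 0.0, 0.00, -3.10),  # yaw crosses the +/-pi boundary
    (0.401, 0.001, 0.190, 0.0, 0.01, -3.09),
]
actions = poses_to_actions(poses)
```

For large rotations between frames a proper relative rotation (R_prev^T R_curr) would be the safer choice; the wrap step only protects against the ±π discontinuity.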

2. Tactile representation learning (Sec 3.2). Raw tactile images and audio spectra are high-dimensional and noisy, and human-vs-robot contact forces differ, so MimicTouch learns compact embeddings with self-supervision. BYOL is trained on tactile images and BYOL-A on 2 Hz-segmented audio spectra; both embedding spaces are 2048-dimensional. This maps augmented views of the same signal to nearby embeddings, making them robust to task-irrelevant sensor noise. Dataset: 100 demonstration trajectories (successful, failed, and sub-optimal), 7657 tactile images and 1000 audio segments.
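For reference, a sketch of the two pieces that define BYOL-style training: the normalized regression loss between the online network's prediction and the target network's projection of another augmented view, and the exponential-moving-average target update. Encoders and augmentations are omitted; the function names and the tau default are illustrative, not the paper's settings.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def byol_loss(online_pred, target_proj):
    """Squared distance between unit vectors, equal to 2 - 2 * cosine similarity.

    online_pred: prediction head output for one augmented view.
    target_proj: target-network projection of a different view of the same
    sample (treated as a constant, i.e. no gradient flows through it).
    """
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))

def ema_update(target_weights, online_weights, tau=0.99):
    """Move target weights slowly toward the online weights."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]
```

The loss is zero when the two views map to the same direction in embedding space, which is what pulls augmented views of one tactile image or audio segment together.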

3. Non-parametric offline policy (Sec 3.3). Built on the VINN/NN framework (Pari et al. [12]) but extended to tactile + audio without visual input. Each demonstration step i is stored as the tuple (o_i^T, o_i^A, o_i^EE, a_i): tactile observation, audio observation, end-effector pose, and action. Pre-trained encoders give features (y_i^T, y_i^A); together with pose o_i^EE these form a normalized demonstration library (max inter-feature distance scaled to unity). At test time the live observation is encoded into a query feature and a nearest-neighbor lookup retrieves the action. Non-parametric lookup is chosen deliberately over parametric nets to avoid covariate shift and compounding error from out-of-domain human demos: it constrains the robot to demonstrated behaviors, which the authors argue is safer for real contact-rich tasks.
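The lookup can be sketched as follows, assuming features are plain lists of floats and that each modality is scaled so the largest distance inside the library equals one. The class name, dictionary keys, and the softmax-style weighting over the k nearest neighbors (borrowed from VINN) are my assumptions; the notes above only establish that a nearest-neighbor lookup retrieves the action.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def max_pairwise(feats):
    """Largest distance between any two vectors (0.0 if fewer than two)."""
    return max((dist(a, b) for i, a in enumerate(feats) for b in feats[i + 1:]),
               default=0.0)

class NNPolicy:
    MODALITIES = ("tactile", "audio", "pose")

    def __init__(self, library, k=1):
        # library: list of dicts with keys "tactile", "audio", "pose", "action".
        self.library = library
        self.k = k
        # One scale per modality so library-internal distances lie in [0, 1].
        self.scale = {m: max_pairwise([e[m] for e in library]) or 1.0
                      for m in self.MODALITIES}

    def distance(self, query, entry):
        """Sum of per-modality distances after normalization."""
        return sum(dist(query[m], entry[m]) / self.scale[m]
                   for m in self.MODALITIES)

    def act(self, query):
        """Average the actions of the k nearest entries, weighted by exp(-d)."""
        ranked = sorted(self.library, key=lambda e: self.distance(query, e))[:self.k]
        weights = [math.exp(-self.distance(query, e)) for e in ranked]
        total = sum(weights)
        dim = len(ranked[0]["action"])
        return [sum(w * e["action"][j] for w, e in zip(weights, ranked)) / total
                for j in range(dim)]

# Hypothetical two-entry library with tiny feature vectors.
library = [
    {"tactile": [0.0, 0.0], "audio": [0.0], "pose": [0.0, 0.0, 0.0],
     "action": [0.0, 0.0, -1.0]},
    {"tactile": [1.0, 0.0], "audio": [2.0], "pose": [0.0, 0.0, 0.1],
     "action": [0.5, 0.0, 0.0]},
]
query = {"tactile": [0.9, 0.0], "audio": [1.8], "pose": [0.0, 0.0, 0.09]}
action = NNPolicy(library, k=1).act(query)
```

With k=1 the policy can only ever emit an action that appears in the library, which is the "constrained to demonstrated behaviors" property described above; larger k interpolates between neighbors.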

4. Online residual RL fine-tuning (Sec 3.4). The offline NN policy alone doesn't guarantee success on the robot due to (i) hand-vs-gripper morphology, (ii) noisy fingertip tracking, and (iii) underexplored contact effects. A residual policy π_r is trained with SAC; the robot action is the sum of the (frozen) offline policy output and the residual output, with tight action limits so the residual only makes slight adjustments. The reward combines an expert-aligned term, the KL divergence between the human-expert trajectory distribution and the robot's executed trajectory (which encourages mimicking the demos), with a task-specific term that drives exploration toward in-domain success. Each trajectory has around 70 actions, and transitions are stored in a replay buffer.
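A sketch of the action composition and reward combination described in this stage, assuming per-dimension symmetric limits on the residual and a simple weighted sum for the reward. The limit values, the sign convention (lower KL gives higher reward), and kl_weight are illustrative assumptions; the SAC learner and the KL estimator themselves are not shown.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))

def compose_action(base_action, residual_action, residual_limit):
    """Frozen offline action plus a bounded residual correction, per dimension."""
    return [b + clip(r, -lim, lim)
            for b, r, lim in zip(base_action, residual_action, residual_limit)]

def total_reward(kl_to_expert, task_reward, kl_weight=1.0):
    """Penalize divergence from the expert trajectories, then add the task term."""
    return -kl_weight * kl_to_expert + task_reward

# Hypothetical 6D actions: translation in metres, rotation in radians.
base = [0.0, 0.0, -0.002, 0.0, 0.0, 0.0]       # from the offline NN policy
residual = [0.01, -0.0001, 0.0, 0.0, 0.0, 0.2]  # raw residual policy output
limits = [0.001, 0.001, 0.001, 0.01, 0.01, 0.01]
action = compose_action(base, residual, limits)
```

Because the residual is clipped before it is added, the executed action stays within a small box around the offline policy's action no matter what the residual policy outputs early in training.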

Setup

Datasets / benchmarks: Self-collected human tactile dataset (100 trajectories; 7657 tactile images, 1000 audio segments) on a 3D-printed cylinder-and-hole insertion task. Seven zero-shot generalization tasks across five domains (Fig 5): shifting positions, tilting angles (10°/20°), two-stage dense packing, multi-material packing (rigid/soft), and furniture assembly (FurnitureBench-style leg threading).
Hardware / simulator: Franka Emika Panda arm; learned policy outputs 6-DoF pose commands mapped to 7-DoF joint torques via IK + low-level controller. GelSight Mini fingertip tactile sensor, HOYUJI TD-11 piezo contact microphone at the hole base, RealSense + ArUco fingertip tracking. Real-world only; no simulator.
Baselines: Data collection — Spacemouse teleoperation and Hand-guided (kinesthetic) teleoperation. Offline policy — MULSA [6] (parametric multisensory IL). Generalization — Openloop Policy (replays 5 successful trajectories from the initial setting).
Compute: not reported (RL fine-tuning evaluated every 20 min / ~13 epochs; full policy reaches its peak by 3 hours).

Results

Data collection throughput (Table 1). Human tactile demos dominate teleoperation on both speed and quality:

Method	Frequency	Usable success rate
Spacemouse teleoperation	19 traj/hr	38.5% (20/52)
Hand teleoperation	44 traj/hr	58.8% (20/34)
Human tactile demonstrations	104 traj/hr	83.3% (20/24)

Offline policy (Sec 4.2). The NN-based policy reaches MSE 0.21 vs MULSA's 1.53, and 40% (10/25) real task success vs MULSA's 16% (4/25) — non-parametric lookup produces more reliable 6D actions for downstream RL. Trained on human tactile demos, the NN policy hits 40% (10/25) vs 12% (3/25) from Spacemouse-teleop demos and 28% (7/25) from hand-guided teleop demos. Action-index trends (Fig 3) show human-tactile rollouts are near-linear with low variance, while teleop policies are nonlinear and high-variance during the insertion phase — teleop lacks the human tactile feedback that captures contact events.

Online RL (Sec 4.3). With human tactile demos as the offline prior, residual RL reaches 96% (24/25) success in 3 hours (88%, 22/25, at 2 hours), versus 32% (8/25, Spacemouse) and 60% (15/25, hand-guided) for teleop priors after the same training time: the better prior makes RL both higher-ceiling and more sample-efficient.

Zero-shot generalization (Table 2), MimicTouch final policy vs Openloop replay:

Policy	Shift	10°	20°	Two-stage	Rigid	Soft	Assem(I)	Assem
Openloop	24/40	14/25	10/25	13/25	13/25	9/25	8/25	3/25
MimicTouch	37/40	23/25	20/25	22/25	20/25	16/25	19/25	13/25

MimicTouch beats Openloop in every domain, including a previously-unseen furniture-assembly task — evidence the learned policy reacts to contact rather than replaying a fixed trajectory.

Limitations & open questions

From the authors (Sec 5):

Still needs several hours of on-robot RL to bridge the embodiment gap; better representation learning could shrink this.
Task-specific — cannot directly transfer the tactile-guided control strategy to a new task; a generalizable tactile dynamics model is proposed as future work.
Only demonstrated on two-piece insertion/assembly; extension to dexterous, bimanual, and soft-object manipulation is open.

What I noticed reading it:

Almost every headline number is a success count out of N (10/25, 24/25, 20/24) with no seeds or confidence intervals, which is weaker evidence than a rate ± std. The 2-hour and 3-hour results (22/25 vs 24/25) differ by only two episodes.
The contact mic sits at the hole base, not on the robot/hand. This elegantly sidesteps the human-vs-robot vibration mismatch, but it means audio only works when you can instrument the receptacle — a strong assumption that wouldn't hold for in-the-wild or mobile insertion. An ablation isolating audio's contribution is deferred to Appendix J.
The offline-policy comparison is partly confounded: NN-vs-MULSA conflates "non-parametric" with "this representation pipeline." MULSA at 16% is a low bar; no parametric-on-the-same-embeddings control.
Fingertip pose comes from a single-camera ArUco marker, which the authors admit is noisy under fast motion — part of why RL is needed rather than a clean offline transfer. Demo "throughput" advantage is partly bought back as RL fine-tuning time.
100 demos, one peg geometry for training. Generalization is impressive but all within insertion-like contact topology; no claim about categorically different contact modes (e.g. screwing, sliding).
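To put a rough size on the small-N concern in the first note above, here is a sketch that computes 95% Wilson score intervals for a few of the reported counts. The choice of the Wilson interval and z = 1.96 is mine, not something the paper reports.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (default ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Counts taken from the results reported above.
reported = {
    "residual RL, 3 h": (24, 25),
    "residual RL, 2 h": (22, 25),
    "offline NN policy": (10, 25),
}
for name, (k, n) in reported.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> [{lo:.2f}, {hi:.2f}]")
```

The intervals for 24/25 and 22/25 come out at roughly 0.80 to 0.99 and 0.70 to 0.96, so the 2-hour and 3-hour results are not statistically distinguishable at this sample size, while both sit clearly above the offline policy's 10/25.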

Why I care

This is squarely on the batch thesis that many manipulation predicates aren't visually evaluable. is_inserted, is_seated, "the peg caught the chamfer," "two surfaces are flush" — MimicTouch shows these are read off touch and contact-vibration, not pixels, to the point that its execution policy uses no vision at all. For the BLADE line this is a concrete instance of the gap I keep flagging: BLADE's predicate classifiers are visual (open-vocab detector crops + neural classifier), and BLADE explicitly lists contact-rich tasks and continuous force parameters as out of scope. MimicTouch is what the controller body of a contact-rich behavior looks like when you take touch seriously — and its audio-at-the-hole trick is a reminder that some predicates have a natural non-visual evaluator if you instrument the right place.

Two things to carry forward. (1) The human-hand-demo + residual-RL recipe is a data-collection answer to BLADE's reliance on robot teleop demos; collecting tactile demos at 104 traj/hr without a robot in the loop is a real throughput unlock if the embodiment gap can be closed offline. (2) The expert-aligned KL reward is a clean way to keep an RL-tuned controller faithful to a demonstrated strategy — relevant if BLADE-style behaviors ever get force-modulated controllers fine-tuned on hardware. This is an execution-layer / low-level-policy paper, not a planning-abstraction paper: no symbols, no composition, single behavior. Its relevance to my thesis is as a tactile controller + tactile predicate-evaluator existence proof, not as a planning method.

Quotable

However, to provide the demonstration, human demonstrators often rely on visual feedback to control the robot. This creates a gap between the sensing modality used for controlling the robot (visual) and the modality of interest (tactile). — Abstract

Humans exhibit fine-grained manipulation skills through tactile sensing, which allows for successful insertions by solely using tactile feedback to generate complex, continuous, and precise motions. — §1, Introduction / p.2

We combine an expert-aligned reward, which is measured by the KL divergence between the human expert trajectory and the robot executed trajectory, with a task-specific reward. — §3.4 / p.5

Papers cited that should likely be ingested next:

[6] Li et al. 2022 — See, Hear, and Feel (MULSA) (CoRL) — the parametric multisensory-fusion baseline; in this batch as see_hear_feel_sensory_fusion.
[7] Guzey et al. 2023 — Dexterity from Touch (T-DEX) (CoRL) — self-supervised tactile pretraining + tactile play; in this batch as dexterity_from_touch_self_supervised_pretraining.
[10] Chi et al. 2024 — Universal Manipulation Interface (UMI) (RSS) — the in-the-wild hand-held data-collection cousin; closest peer in collecting demos without a robot in the loop.
[12] Pari et al. 2022 — The Surprising Effectiveness of Representation Learning for Visual Imitation (VINN) (RSS) — the non-parametric NN framework MimicTouch extends to tactile+audio.
[13] Yuan et al. 2017 — GelSight (Sensors) — the tactile-sensor foundation; in this batch as gelsight_high_resolution_tactile_sensors.
[28] Thankaraj & Pinto 2023 — That Sounds Right (CoRL) — auditory self-supervision for manipulation; in this batch as that_sounds_right_auditory_self_supervision.
[47] Du et al. 2022 — Play it by Ear (RSS) — audio-visual imitation through occlusion; in this batch as play_it_by_ear_audio_visual_imitation.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

KineDex — same cluster F; another tactile + kinesthetic-teaching route to contact-rich policies. Direct methodological peer on how to source tactile demos.
See, Hear, and Feel (MULSA) — the exact parametric baseline MimicTouch beats; the multisensory-fusion counterpart it positions itself against.
Dexterity from Touch (T-DEX) — the self-supervised tactile-representation recipe (BYOL-on-touch) MimicTouch's representation stage builds on.
That Sounds Right and Play it by Ear — the contact-audio-for-manipulation line MimicTouch's piezo-mic channel sits in.
Reactive Diffusion Policy and Tactile-Conditioned Diffusion Policy — cluster-F parametric (diffusion) alternatives to MimicTouch's non-parametric NN controller; useful contrast on policy class for contact-rich insertion.
BLADE — the planning-abstraction anchor whose visual predicate classifiers and diffusion-policy bodies this paper complements at the contact-rich, non-visual-evaluator extreme.