One-liner. MimicTouch lets a human demonstrate contact-rich insertion with their bare hand — wearing a fingertip GelSight + contact mic instead of teleoperating a robot — then learns a tactile-and-audio-only policy by non-parametric imitation and closes the human–robot embodiment gap with residual RL, so the robot inserts using the same touch-guided strategy a person would, with no vision at execution time.
Contact-rich tasks like peg insertion and assembly need real-time tactile feedback because vision is occluded at the contact interface and tolerances are tight. Prior tactile imitation methods collect data by teleoperating a sensorized robot — but the human teleoperator steers using visual feedback, so the modality that controls the demonstration (vision) differs from the modality the policy is supposed to learn from (touch). MimicTouch names this the "sensing gap": teleoperated tactile demos never actually encode a human's tactile-guided control strategy, they only record touch as a passenger. Teleoperation is also slow, requires expertise, and tends to restrict the action space to low-DoF (e.g. 3D translation) to keep dynamic motions tractable. The paper's bet: collect demos directly from the human hand so the touch signal is the thing driving the motion, then transfer that strategy to the robot.
Four stages, shown in Fig 2: (a) human tactile data collection, (b) self-supervised representation learning, (c) non-parametric offline policy, (d) online residual RL fine-tuning on the real robot.
1. Human tactile data collection (Sec 3.1). A human performs
the insertion with their hand. Fingertip pose is tracked with a
RealSense camera reading an ArUco marker, then calibrated/filtered into
end-effector poses. A GelSight Mini vision-based tactile sensor
is mounted on the fingertip via a custom fixture for contact images
(only one tactile sensor is used, not two). A HOYUJI TD-11 piezo-electric
contact microphone captures audio/vibration; crucially it is placed
at the base of the insertion hole (not on the hand) to keep vibration
signals consistent between the human and robot embodiments. The action is the
6D delta pose of the end-effector — a full 6-DoF space rather
than the restricted 3D of prior teleop work.
2. Tactile representation learning (Sec 3.2). Raw tactile images and audio spectra are high-dimensional and noisy, and human-vs-robot contact forces differ, so MimicTouch learns compact embeddings with self-supervision. BYOL is trained on tactile images and BYOL-A on 2 Hz-segmented audio spectra; both embedding spaces are 2048-dimensional. Both objectives map augmented views of the same signal to nearby embeddings, which makes the features robust to task-irrelevant sensor noise. Dataset: 100 demonstration trajectories (successful, failed, and sub-optimal), yielding 7657 tactile images and 1000 audio segments.
3. Non-parametric offline policy (Sec 3.3). Built on the
VINN/NN framework (Pari et al. [12]) but extended to tactile + audio without
visual input. At step i the observation is
(o_i^T, o_i^A, o_i^EE, a_i) — tactile, audio, end-effector
pose, action. Pre-trained encoders give features (y_i^T, y_i^A);
together with pose o_i^EE these form a normalized demonstration
library (max inter-feature distance scaled to unity). At test time the live
observation is encoded into a query feature and a
nearest-neighbor lookup retrieves the action. Non-parametric
lookup is chosen deliberately over parametric nets to avoid covariate
shift and compounding error from out-of-domain human demos; it constrains the
robot to demonstrated behaviors, which the authors argue is safer for real
contact-rich tasks.
4. Online residual RL fine-tuning (Sec 3.4). The offline NN
policy alone doesn't guarantee success on the robot due to (i) hand-vs-gripper
morphology, (ii) noisy fingertip tracking, and (iii) underexplored contact
effects. A residual policy π_r is trained with
SAC; the robot action is the sum of the (frozen) offline policy
output and the residual output, with tight action limits so the residual only
makes slight adjustments. Reward combines an expert-aligned term
— the KL divergence between the human-expert trajectory distribution and
the robot's executed trajectory (encourages mimicking the demos) — with a
task-specific term that drives exploration toward in-domain
success. Around 70 actions/trajectory with a replay buffer.
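The four stages above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the authors' implementation: poses are taken to be 4x4 homogeneous transforms, the delta is expressed in the previous pose's frame as translation plus an axis-angle vector, each modality in the library is scaled separately, and the KL term uses diagonal Gaussians. All function names, `k`, `residual_limit`, and `kl_weight` are hypothetical. The BYOL loss shown is the standard normalized regression term only; the momentum-updated target network and the symmetric second pass are omitted.

```python
import numpy as np

# Stage 1: 6D delta-pose action from two tracked poses (assumed 4x4 transforms).
def rotvec_from_matrix(R):
    """Axis-angle vector of a 3x3 rotation matrix (not valid for angles near pi)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta * axis / (2.0 * np.sin(theta))

def delta_pose(T_prev, T_next):
    """Motion from T_prev to T_next in T_prev's frame: [dx, dy, dz, rx, ry, rz]."""
    T_rel = np.linalg.inv(T_prev) @ T_next
    return np.concatenate([T_rel[:3, 3], rotvec_from_matrix(T_rel[:3, :3])])

# Stage 2: BYOL regression loss (normalized MSE, i.e. 2 - 2 * cosine similarity).
def byol_loss(online_pred, target_proj):
    q = online_pred / np.linalg.norm(online_pred, axis=-1, keepdims=True)
    z = target_proj / np.linalg.norm(target_proj, axis=-1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(q * z, axis=-1)))

# Stage 3: nearest-neighbor lookup over a normalized demonstration library.
def max_pairwise_distance(x):
    """Largest Euclidean distance between any two rows (O(N^2) memory, toy sizes only)."""
    diffs = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1)).max()
    return d if d > 0 else 1.0

def build_library(parts):
    """parts: (tactile, audio, pose) arrays, one row per demo step.
    Assumption: each modality is scaled separately to max pairwise distance 1."""
    scales = [max_pairwise_distance(p) for p in parts]
    feats = np.concatenate([p / s for p, s in zip(parts, scales)], axis=1)
    return feats, scales

def nn_action(query_parts, feats, scales, actions, k=1):
    """Average the actions of the k closest library entries (k is hypothetical)."""
    q = np.concatenate([p / s for p, s in zip(query_parts, scales)])
    nearest = np.argsort(np.linalg.norm(feats - q, axis=1))[:k]
    return actions[nearest].mean(axis=0)

# Stage 4: frozen base action plus a tightly clipped residual, and a combined reward.
def compose_action(base_action, residual_action, residual_limit):
    return base_action + np.clip(residual_action, -residual_limit, residual_limit)

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for diagonal Gaussians; the Gaussian form is an assumption."""
    return float(0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0))

def total_reward(kl, task_reward, kl_weight=1.0):
    """Penalize divergence from the expert, add the task term (weight is hypothetical)."""
    return -kl_weight * kl + task_reward
```

At execution time the pieces would chain as: encode the live tactile and audio readings, call `nn_action` to get the base 6D action, then pass it through `compose_action` with the SAC residual. Clipping the residual rather than the summed action is what keeps the robot close to demonstrated behavior.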
Data collection throughput (Table 1). Human tactile demos dominate teleoperation on both speed and quality:
| Method | Frequency | Usable success rate |
|---|---|---|
| Spacemouse teleoperation | 19 traj/hr | 38.5% (20/52) |
| Hand teleoperation | 44 traj/hr | 58.8% (20/34) |
| Human tactile demonstrations | 104 traj/hr | 83.3% (20/24) |
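
The two columns of Table 1 can be folded into a single figure of merit: usable trajectories per hour. The sketch below assumes that "Frequency" counts every collected trajectory and that the success fraction is the share of those that are usable; that reading is mine, not something the table states. The helper name `usable_per_hour` is hypothetical.

```python
# (trajectories per hour, usable trajectories, collected trajectories) from Table 1
TABLE_1 = {
    "Spacemouse teleoperation": (19, 20, 52),
    "Hand teleoperation": (44, 20, 34),
    "Human tactile demonstrations": (104, 20, 24),
}

def usable_per_hour(traj_per_hour, usable, collected):
    """Collection rate discounted by the fraction of trajectories that are usable."""
    return traj_per_hour * usable / collected

for method, row in TABLE_1.items():
    print(f"{method}: {usable_per_hour(*row):.1f} usable traj/hr")
```

Under that assumption the speed and quality advantages multiply, so the gap between human tactile demos and Spacemouse teleoperation is wider than either column shows alone.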
Offline policy (Sec 4.2). The NN-based policy reaches MSE 0.21 vs MULSA's 1.53, and 40% (10/25) real task success vs MULSA's 16% (4/25) — non-parametric lookup produces more reliable 6D actions for downstream RL. Trained on human tactile demos, the NN policy hits 40% (10/25) vs 12% (3/25) from Spacemouse-teleop demos and 28% (7/25) from hand-guided teleop demos. Action-index trends (Fig 3) show human-tactile rollouts are near-linear with low variance, while teleop policies are nonlinear and high-variance during the insertion phase — teleop lacks the human tactile feedback that captures contact events.
Online RL (Sec 4.3). With human tactile demos as the offline prior, residual RL reaches 96% (24/25) success after 3 hours of training (88%, 22/25, at 2 hours), versus 32% (8/25, Spacemouse) and 60% (15/25, hand-guided) for teleop priors over the same training time. The better prior makes RL both higher-ceiling and more sample-efficient.
Zero-shot generalization (Table 2), MimicTouch final policy vs Openloop replay:
| Policy | Shift | 10° | 20° | Two-stage | Rigid | Soft | Assem(I) | Assem |
|---|---|---|---|---|---|---|---|---|
| Openloop | 24/40 | 14/25 | 10/25 | 13/25 | 13/25 | 9/25 | 8/25 | 3/25 |
| MimicTouch | 37/40 | 23/25 | 20/25 | 22/25 | 20/25 | 16/25 | 19/25 | 13/25 |
MimicTouch beats Openloop in every domain, including a previously-unseen furniture-assembly task — evidence the learned policy reacts to contact rather than replaying a fixed trajectory.
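
Because the Shift column uses 40 trials and the others 25, the raw counts in Table 2 are easier to compare as rates. A small sketch that converts each cell and computes the per-domain gap; the column labels are copied as they appear in the table, and the helper name `rate` is hypothetical.

```python
from fractions import Fraction

DOMAINS = ["Shift", "10°", "20°", "Two-stage", "Rigid", "Soft", "Assem(I)", "Assem"]
TABLE_2 = {
    "Openloop": ["24/40", "14/25", "10/25", "13/25", "13/25", "9/25", "8/25", "3/25"],
    "MimicTouch": ["37/40", "23/25", "20/25", "22/25", "20/25", "16/25", "19/25", "13/25"],
}

def rate(cell):
    """Turn a 'successes/trials' cell into a success rate in [0, 1]."""
    return float(Fraction(cell))

# Per-domain difference in success rate, MimicTouch minus Openloop.
gaps = {
    domain: rate(mimic) - rate(openloop)
    for domain, mimic, openloop in zip(DOMAINS, TABLE_2["MimicTouch"], TABLE_2["Openloop"])
}

for domain, gap in gaps.items():
    print(f"{domain}: {gap * 100:+.0f} points")
```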
From the authors (Sec 5):
What I noticed reading it:
This is squarely on the batch thesis that many manipulation
predicates aren't visually evaluable. is_inserted,
is_seated, "the peg caught the chamfer," "two surfaces are flush"
— MimicTouch shows these are read off touch and contact-vibration, not
pixels, to the point that its execution policy uses no vision at all.
For the BLADE line this is a concrete instance of the gap I keep flagging:
BLADE's predicate classifiers are visual (open-vocab detector crops + neural
classifier), and BLADE explicitly lists contact-rich tasks and continuous force
parameters as out of scope. MimicTouch is what the controller body of a
contact-rich behavior looks like when you take touch seriously — and its
audio-at-the-hole trick is a reminder that some predicates have a natural
non-visual evaluator if you instrument the right place.
Two things to carry forward. (1) The human-hand-demo + residual-RL recipe is a data-collection answer to BLADE's reliance on robot teleop demos; collecting tactile demos at 104 traj/hr without a robot in the loop is a real throughput unlock if the embodiment gap can be closed offline. (2) The expert-aligned KL reward is a clean way to keep an RL-tuned controller faithful to a demonstrated strategy — relevant if BLADE-style behaviors ever get force-modulated controllers fine-tuned on hardware. This is an execution-layer / low-level-policy paper, not a planning-abstraction paper: no symbols, no composition, single behavior. Its relevance to my thesis is as a tactile controller + tactile predicate-evaluator existence proof, not as a planning method.
However, to provide the demonstration, human demonstrators often rely on visual feedback to control the robot. This creates a gap between the sensing modality used for controlling the robot (visual) and the modality of interest (tactile). — Abstract
Humans exhibit fine-grained manipulation skills through tactile sensing, which allows for successful insertions by solely using tactile feedback to generate complex, continuous, and precise motions. — §1, Introduction / p.2
We combine an expert-aligned reward, which is measured by the KL divergence between the human expert trajectory and the robot executed trajectory, with a task-specific reward. — §3.4 / p.5
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: