Taxim: An Example-based Simulation Model for GelSight Tactile Sensors

Zilin Si, Wenzhen Yuan (CMU Robotics Institute) · 2021 · IEEE RA-L 2022 (arXiv v2, Dec 2021)

One-liner. A data-driven GelSight simulator that runs in real time on a CPU: it fits a position-dependent polynomial reflectance table from fewer than 100 real contact examples and adds the first integrated marker-motion field model, so you can synthesize realistic tactile images (both the optical geometry signal and the force/shear marker flow) without expensive physics-based rendering.

Problem & motivation

Mainstream robot simulators (PyBullet, MuJoCo, Isaac Gym, Drake, SOFA) model rigid/soft bodies and vision but have no native tactile sensing, yet vision-based tactile sensors like GelSight give high-resolution contact geometry and force. Simulating them is hard because it requires modeling both the mechanical response of the soft gelpad and the optical response (LED illumination + embedded camera). Prior optical sims were physics-based (Phong shading, ray tracing, TACTO's pyrender): computationally heavy, hard to migrate to a new sensor, and unable to reproduce the intrinsic noise of real sensors. No prior work simulated the marker-motion field (the shear-force signal) jointly with the optical image.

Method

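Taxim has two calibrated components, the optical simulation (items 1 and 2 below, the second covering shadows) and the marker motion field (item 3). Both are fed by a contact height map (collision detection → local height map → pyramid-Gaussian-kernel soft-body approximation of the gelpad).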
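The height-map step can be sketched as a coarse-to-fine Gaussian smoothing that keeps the contact region pinned to the indenter geometry. This is a minimal numpy-only sketch, not Taxim's code: the function names, the kernel sizes (21, 11, 5), and the sigma = size / 4 rule are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel_1d(size, sigma):
    """Normalized 1-D Gaussian kernel with an odd number of taps."""
    r = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (r / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, size, sigma):
    """Separable Gaussian blur with edge padding; output has the input shape."""
    k = gaussian_kernel_1d(size, sigma)
    padded = np.pad(img, size // 2, mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding again.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def approximate_gel_deformation(height_map, contact_mask, kernel_sizes=(21, 11, 5)):
    """Smooth from coarse to fine, resetting contact pixels after each pass.

    kernel_sizes and the sigma rule are hypothetical values, not calibrated ones.
    """
    deformed = height_map.astype(float).copy()
    for size in kernel_sizes:
        deformed = gaussian_blur(deformed, size, sigma=size / 4.0)
        # The gel must follow the indenter exactly where they touch.
        deformed[contact_mask] = height_map[contact_mask]
    return deformed

# Toy example: a flat square indenter pressed 1.0 mm into a 64x64 gel patch.
height = np.zeros((64, 64))
mask = np.zeros(height.shape, dtype=bool)
mask[24:40, 24:40] = True
height[mask] = 1.0
gel = approximate_gel_deformation(height, mask)
```

The result keeps the indenter's shape inside the contact mask and decays smoothly around it, which is the qualitative behaviour a soft gelpad approximation needs before normals are computed.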

1. Optical simulation via example-based photometric stereo. The diffuse gelpad makes reflectance spatially invariant, so intensity is a function of the local surface normal. A naive linear lookup table $I = \sum_l a_l n$ assumes parallel, uniform light, but GelSight's LEDs are close to the gel and non-uniform. Taxim instead fits a polynomial table that is also a function of image position: $f_{nl}(x, y) = w_{nl}\, b$ for normal bin $n$ and light source $l$, where $b = [x^2, y^2, xy, x, y, 1]^T$ (a 2nd-order polynomial sufficed). The table is indexed by a discretized 125×125 surface-normal grid (magnitude × direction), per RGB light source, and solved by least squares from ball-indenter contacts. Per-point normals are mapped through this table to synthesize the image (Fig 2, 3).

2. Shadow simulation by superposition of "unit" shadows. Shadows from the red/green/blue LED groups are simulated by collecting a "unit" shadow mask (a single standing pin at varying depths, ~10 examples). Arbitrary geometry is approximated as side-by-side accumulated pin shadows; since beams travel independently (no inter-reflection), shadows are linearly accumulated and attached where neighbors are lower (Fig 5).

3. Marker motion field via linear displacement + superposition. Markers move because the elastomer surface stretches under normal + shear load. Taxim meshes the surface densely; nodes are active (in contact, externally loaded) or passive (only internal elastic forces). Mutual influence between nodes $n_i$ and $n_j$ is a 3×3 tensor $T_{n_i n_j}$; any node's displacement is the superposition over the active nodes $k_i$, $u_j = \sum_i T_{k_i n_j} u_{k_i}$ (Eq 4). Because active nodes also influence each other, initial displacements are first amended to virtual displacements by inverting a matrix of inter-node tensors (Eq 5–7), then superposed. The tensor T is calibrated offline in ANSYS FEM (unit-node loads on the dense gelpad mesh, sampling the 2nd-layer mesh 0.5 mm below the surface); online simulation is just matrix operations. The whole model calibrates from <100 real contacts in ~1 hour.
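For item 1 above, the per-bin fit reduces to an ordinary least-squares problem over the six-term basis b. The sketch below fits the weights for a single normal bin and a single colour channel on synthetic data; the function names and the synthetic weights are assumptions for illustration, and a full table would repeat this fit for every bin and channel.

```python
import numpy as np

def poly_basis(x, y):
    """Second-order basis b = [x^2, y^2, xy, x, y, 1] at each image position."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.stack([x**2, y**2, x * y, x, y, np.ones_like(x)], axis=-1)

def fit_bin(x, y, intensity):
    """Least-squares weights w for one (normal bin, colour channel) pair."""
    w, *_ = np.linalg.lstsq(poly_basis(x, y), intensity, rcond=None)
    return w

def predict(w, x, y):
    """Evaluate the fitted polynomial at position (x, y)."""
    return poly_basis(x, y) @ w

# Synthetic calibration samples: positions where this normal bin was observed
# (hypothetical data standing in for ball-indenter contacts).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
y = rng.uniform(-1.0, 1.0, 200)
w_true = np.array([0.3, -0.2, 0.1, 0.5, -0.4, 0.8])
intensity = poly_basis(x, y) @ w_true + rng.normal(0.0, 1e-3, 200)

w_fit = fit_bin(x, y, intensity)
center_intensity = predict(w_fit, 0.0, 0.0)
```

Normalizing pixel coordinates to roughly [-1, 1] before fitting, as assumed here, keeps the quadratic terms well conditioned.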
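For item 3 above, the two-stage computation (amend active displacements to virtual ones, then superpose) can be sketched as a linear solve followed by a tensor contraction. This is an illustrative sketch under assumptions: the influence tensors here come from a made-up exponential decay on a 1-D row of nodes rather than from FEM calibration, and the function name and array layout are hypothetical.

```python
import numpy as np

def superpose_displacements(T, active, u_active):
    """Displacement of every node from prescribed active-node displacements.

    T[i, j] is the 3x3 tensor mapping a displacement at node i to node j.
    active holds the indices of the contact nodes; u_active is (A, 3).
    """
    a = len(active)
    # blocks[i, j] = T[active[i], active[j]]: influence among active nodes.
    blocks = T[np.ix_(active, active)]
    # Row block j, column block i, so that M @ v stacks sum_i T[k_i, k_j] v_i.
    M = blocks.transpose(1, 2, 0, 3).reshape(3 * a, 3 * a)
    # Virtual displacements: superposing them must reproduce u_active.
    v = np.linalg.solve(M, u_active.reshape(-1)).reshape(a, 3)
    # u_j = sum_i T[k_i, n_j] v_i for every node j, active or passive.
    return np.einsum("ijrc,ic->jr", T[active], v)

# Toy setup (assumed, not FEM-calibrated): 9 nodes on a line, isotropic decay.
pos = np.linspace(0.0, 4.0, 9)  # node positions in mm
decay = np.exp(-np.abs(pos[:, None] - pos[None, :]) / 1.0)
T = decay[:, :, None, None] * np.eye(3)  # shape (9, 9, 3, 3)

active = np.array([3, 4, 5])
u_active = np.array([[0.1, 0.0, -0.3]] * 3)  # shear in x, press in z (mm)
u = superpose_displacements(T, active, u_active)
```

Solving for the virtual displacements first is what keeps the active nodes at their prescribed values; superposing the raw displacements directly would double-count the influence that contact nodes exert on each other.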

Setup

Results

Optical: best on all four image metrics (lowest L1 and MSE, highest SSIM and PSNR) against all three baselines (Table I), and fastest on CPU (Table II).

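| Method | L1 ↓ | MSE ↓ | SSIM ↑ | PSNR ↑ | fps (CPU) |
| --- | --- | --- | --- | --- | --- |
| TACTO | 10.861 | 215.861 | 0.808 | 25.495 | 1.9 |
| Phong's | 8.163 | 123.249 | 0.832 | 27.763 | 3.8 |
| Physics | 7.409 | 90.623 | 0.759 | 28.687 | 0.1 |
| Taxim (ours) | 5.565 | 58.358 | 0.882 | 30.974 | 18.1 (9.6 w/ shadows) |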
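For reference, three of the four image metrics in the table are simple pixel-wise statistics. The sketch below shows one common way to compute them, assuming 8-bit images with a peak value of 255; the paper's exact evaluation protocol (colour space, masking, averaging order) is not stated in these notes, and SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np

def l1_error(sim, real):
    """Mean absolute pixel difference; cast first to avoid uint8 wraparound."""
    return float(np.mean(np.abs(sim.astype(float) - real.astype(float))))

def mse(sim, real):
    """Mean squared pixel difference."""
    return float(np.mean((sim.astype(float) - real.astype(float)) ** 2))

def psnr(sim, real, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit images."""
    err = mse(sim, real)
    return float("inf") if err == 0 else float(10.0 * np.log10(peak**2 / err))

# Toy pair: identical 4x4 RGB images except for one value that is off by 10.
real = np.full((4, 4, 3), 100, dtype=np.uint8)
sim = real.copy()
sim[0, 0, 0] = 110

print(l1_error(sim, real), mse(sim, real), psnr(sim, real))
```

Lower is better for L1 and MSE and higher is better for PSNR, which matches the arrows in the table header.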

Generalizes across 4 GelSight sensors and a DIGIT sensor; handles fine textures and varying indentation depth/location (MSE grows with depth and distance from center). Marker field: vs. FEM, interpolated L1 errors are ~3–5×10⁻³ mm per axis. Marker-magnitude L1: 1.00×10⁻² mm (real&FEM), 1.02×10⁻² mm (real&ours), 3.96×10⁻³ mm (FEM&ours). Weighted angular L1: 12.94° (real&FEM), 14.57° (real&ours), 4.89° (FEM&ours); that is, Taxim tracks FEM tightly but inherits FEM's sim-to-real gap.

Limitations & open questions

(a) Author-stated. Only quasi-static contact is simulated; dynamic phenomena like slip are not modeled, and the model cannot capture partial slip under shear (a common real case). Real–sim gap attributed to: hand-manufactured gelpad not matching the ANSYS FEM model, marker-tracking noise in real data, and the no-partial-slip assumption. Polynomial table must be recalibrated per sensor (and whenever a component is replaced). GPU acceleration left for future work.

(b) What I noticed reading it. Evaluation is small and self-designed: the marker study reports a handful of load cases (0.3–0.8 mm displacement) with no variance/CI; metrics are dataset means over a modest object set, so statistical confidence is weak. Optical ground truth required manual alignment in GIMP because the real rig isn't precise enough — the reported pixel errors partly reflect alignment quality, not just simulation fidelity. The optical table is fit on spherical-indenter normals but evaluated on textured objects, so the out-of-distribution normal coverage is untested. Crucially, there is no downstream task evaluation: the paper never trains a policy or perception model on Taxim images and shows sim-to-real transfer (the headline use case for a tactile simulator) — that is deferred to "future work." The DIGIT result is qualitative only. Speed comparison is CPU-only and arguably unfair since TACTO and the physics model are GPU-accelerable.

Why I care

Off the central long-horizon-planning thesis, but squarely on the batch thesis that many manipulation predicates (is_grasped, is_inserted, surface_is_rough, is_screwed_tight) live in touch/force, not vision. Learning a tactile predicate classifier (the touch analogue of BLADE's visual predicate classifiers) needs either real tactile data at scale or a faithful simulator. Taxim is that simulator for GelSight: cheap to calibrate, CPU-real-time, and it uniquely produces the marker/shear signal that encodes contact force, which is exactly what an is_slipping or grasp_is_stable predicate would key off. It is infrastructure, not a learning method: it lets a BLADE-style pipeline generate tactile training data for predicates that are not visually evaluable. Caveat for that use: Taxim is quasi-static and can't render slip, so dynamic contact predicates would need a different generator. Pairs naturally with the simulators and datasets in this batch's Cluster I.

Quotable

To the best of our knowledge, Taxim is the first model that simulates all functions of vision-based tactile sensors, including the optical response for geometry measurement and marker motion field for force/torque measurement. — §II / p.2
The simulation model is calibrated with less than 100 data points from a real sensor. The example-based approach enables the model to easily migrate to other GelSight sensors or its variations. — Abstract / p.1

Related

Cited here, worth ingesting next:

Newly ingested in 2026-06-24 batch — directly relevant: