The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects

Ruohan Gao*, Yiming Dou*†, Hao Li*, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu · Stanford University · CVPR 2023 · arXiv:2306.00956 · PDF

One-liner. A 10-task benchmark suite for multisensory object-centric learning across sight, sound, and touch, paired with ObjectFolder Real — the first real-world dataset to capture all three modalities (3D meshes, impact sounds, GelSight tactile readings) for 100 household objects — so that the field can finally measure which modality carries which information, and how badly sim2real bites for non-visual sensing.

Problem & motivation

Vision dominates object-centric learning, but real interaction is multisensory: objects make sounds when struck and deform tactile sensors when touched. The predecessors, ObjectFolder and ObjectFolder 2.0, provided neural objects (1,000 in 2.0, each an implicit visual/acoustic/tactile "Object File") but left two gaps the authors call out: (1) no real objects, so all data is simulated with no sim2real calibration and we don't know whether models trained on neural objects transfer; and (2) only a couple of tasks were demonstrated. The paper closes both gaps by (a) building ObjectFolder Real, real multisensory captures of 100 objects, and (b) standardizing 10 benchmark tasks with defined metrics and baselines, spanning recognition, reconstruction, and manipulation. The motivating thesis is degeneracy from cognitive science: fused multisensory perception is robust to the loss of any one modality (§1), and different modalities carry structurally different information (vision = global geometry/pose; audio = scale/material; touch = precise local contact geometry).

Method

This is a dataset + benchmark paper. The "method" is the real-data capture pipeline (Fig 2) plus the 10 standardized tasks and their baselines.

ObjectFolder Real capture pipeline (100 household objects, each with vision + audio + touch):

Vision (§3.1): EinScan Pro HD 2020 handheld scanner for a high-quality 3D mesh + color texture (min point distance 0.2 mm). Three mesh resolutions provided: 16K / 64K / Full triangles. HD video of each object rotating in a lightbox captures appearance.
Audio (§3.2): professional anechoic studio. 30–50 surface points per object (scaled to size), prioritizing points with specific geometry/texture (rim, handle). Each point struck along its normal with a PCB 086C01 impact hammer (force-transducer tip giving ground-truth contact force), audio recorded by a PCB 376A32 phantom-powered field microphone; 5-second clips, plus a RealSense RGBD video per strike.
Touch (§3.3): Franka Emika Panda arm + R1.5 GelSight tactile finger (32×24 mm sensing area). RealSense RGBD at each frame corner; point cloud of target object from scanned mesh registered to the RealSense point cloud via four-point manual init + ICP (with manual fallback). Tactile readings collected at the same surface points where impact sounds were collected, position-controlled to reach each point along its normal; video of the gel deformation captured.
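
The registration step in the touch pipeline (four manually picked points for initialization, then ICP) can be sketched as follows. This is a minimal point-to-point ICP in NumPy under stated assumptions, not the authors' implementation: the function names `kabsch` and `icp` are hypothetical, correspondences are brute-force nearest neighbors, and a toy grid stands in for the scanned mesh and the RealSense point cloud.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with dst ~= src @ R.T + t for paired rows."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return R, dst_c - R @ src_c

def icp(src, dst, R, t, iters=20):
    """Point-to-point ICP refinement starting from an initial guess (R, t)."""
    for _ in range(iters):
        moved = src @ R.T + t
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        matches = dst[d2.argmin(axis=1)]         # brute-force nearest neighbors
        R, t = kabsch(src, matches)
    return R, t

# Toy data: a 5x5x5 grid stands in for mesh points (hypothetical, not OF-Real data).
g = np.linspace(0.0, 1.0, 5)
mesh_pts = np.array([[x, y, z] for x in g for y in g for z in g])
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.10, -0.20, 0.05])
scene_pts = mesh_pts @ R_true.T + t_true

# Four manually picked correspondences, with small click noise on the scene side.
rng = np.random.default_rng(0)
picked = [0, 4, 20, 100]                         # four non-coplanar grid corners
noisy_clicks = scene_pts[picked] + rng.normal(scale=0.005, size=(4, 3))
R0, t0 = kabsch(mesh_pts[picked], noisy_clicks)  # coarse manual initialization
R_est, t_est = icp(mesh_pts, scene_pts, R0, t0)  # ICP refinement
```

The coarse four-point fit only needs to land within roughly half the point spacing for the nearest-neighbor matches to be correct, which is presumably why a manual fallback is needed when ICP starts too far off.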

The 10 tasks (Fig 1), grouped in three families:

Recognition (3): Cross-Sensory Retrieval (any modality → any modality, 9 sub-tasks, mAP metric, baselines CCA / PLSCA / DSCMR / DAR); Contact Localization (predict the mesh vertex where contact happens from vision/audio/touch; Normalized-Distance metric; baselines Point Filtering + a new differentiable Multisensory Contact Regression, MCR); Material Classification (7 material classes; top-1 accuracy; baselines ResNet, FENet).

Reconstruction (3): 3D Shape Reconstruction (Chamfer Distance; baselines MDN, PCN, and a new Multisensory Reconstruction Transformer, MRT); Sound Generation of Dynamic Objects (video→sound of a falling object; STFT/Envelope/CDPAM metrics; baselines RegNet, SpecVQGAN); Visuo-Tactile Cross-Generation (Vision↔Touch image translation; PSNR/SSIM; baselines Pix2Pix, VisGel).

Manipulation (4): Grasp-Stability Prediction (will the grasp hold? accuracy); Contact Refinement, Surface Traversal, and Dynamic Pushing — the latter three use a shared Multisensory-MPC baseline (SVG future-frame prediction + MPPI / CEM control). Table 1 shows which tasks are feasible in sim, real, or both: recognition + 3D reconstruction have real results; sound generation, visuo-tactile cross-gen, and the four manipulation tasks are sim-only (sim2real for manipulation needs per-task robot calibration; see Appendix L).
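
To make the control half of Multisensory-MPC concrete, here is a minimal sketch of the cross-entropy method (CEM) planning over an action sequence. Everything specific is an assumption for illustration: `cem_plan` and `rollout_cost` are hypothetical names, and a toy additive 2D dynamics stands in for the learned SVG future-frame predictor, which in the paper would supply predicted visual/tactile frames to score. MPPI is not shown.

```python
import numpy as np

def cem_plan(cost_fn, horizon, act_dim, iters=30, pop=200, n_elite=20, seed=0):
    """Cross-entropy method: refit a diagonal Gaussian to the lowest-cost samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((pop, horizon, act_dim))
        costs = np.array([cost_fn(a) for a in samples])
        elite = samples[np.argsort(costs)[:n_elite]]   # keep the best sequences
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6                 # floor to avoid collapse
    return mean

# Toy stand-in for the learned predictor: state_{t+1} = state_t + action_t.
goal = np.array([1.0, -0.5])

def rollout_cost(actions):
    final_state = actions.sum(axis=0)                  # start state is the origin
    return float(np.sum((final_state - goal) ** 2))

plan = cem_plan(rollout_cost, horizon=5, act_dim=2)
final_state = plan.sum(axis=0)
```

In an MPC loop only the first action of `plan` would be executed before replanning from the next observation.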

Setup

Datasets / benchmarks: ObjectFolder Real (100 real household objects: meshes at 16K/64K/Full triangles, lightbox HD video, 30–50 impact-sound points/object, GelSight readings at matching points) + the 1,000 neural objects from ObjectFolder 2.0. Per-task object/instance splits vary (e.g., retrieval splits neural objects 800/100/100 and OF-Real 8/1/1 per object; contact localization 800/190/10 neural, 53 real objects 8:1:1).
Hardware / simulator: EinScan Pro HD 2020 3D scanner; anechoic studio with PCB 086C01 impact hammer + PCB 376A32 microphone; Franka Emika Panda + R1.5 GelSight finger; RealSense RGBD cameras; PyBullet used to simulate falling objects for the sound-generation task; Blender to render videos.
Baselines: per task — CCA, PLSCA, DSCMR, DAR (retrieval); Point Filtering + MCR (contact localization); ResNet, FENet (material); MDN, PCN, MRT (3D shape); RegNet, SpecVQGAN (sound gen); Pix2Pix, VisGel (visuo-tactile); TACTO/ResNet-18 (grasp stability); Multisensory-MPC = SVG + MPPI/CEM (the three remaining manipulation tasks).
Compute: not reported.

Results

Headline cross-cutting finding (§1, conclusion): vision and audio are reliable for recognition; touch is too local to recognize an object alone; fusing modalities is consistently best; and you can sometimes hallucinate one modality from another. For reconstruction, vision gives coarse global shape, audio gives scale, touch gives precise local geometry — fusion wins.

Cross-sensory retrieval (neural objects, Table 2, mAP): global-information modalities (vision, audio) retrieve far better than the local one (touch). Best same-modality results: vision→vision 89.28 (DAR); audio→audio 80.77 (PLSCA); touch→touch only 54.80 (DAR). Cross-modal retrieval into touch is weak (vision→touch 7.03, audio→touch 6.96), confirming touch's locality.
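
For reference, the mAP numbers above follow the usual retrieval recipe: rank the gallery by similarity to each query, average the precision at every relevant hit, then average over queries. A minimal sketch, assuming cosine similarity between embeddings and object identity as the relevance label (both are my assumptions; the function names are hypothetical and the toy embeddings are made up):

```python
import numpy as np

def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k taken at each relevant rank."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(query_emb, query_labels, gallery_emb, gallery_labels):
    """mAP over all queries, ranking the gallery by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                   # most similar first
        aps.append(average_precision(gallery_labels[order] == query_labels[i]))
    return float(np.mean(aps))

# Toy example: two queries (e.g. vision) against three gallery items (e.g. touch).
query_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
query_labels = np.array([0, 1])
gallery_emb = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.3]])
gallery_labels = np.array([1, 1, 0])
toy_map = mean_average_precision(query_emb, query_labels, gallery_emb, gallery_labels)
# Query 0 finds its only match at rank 2 (AP 0.5); query 1 at ranks 1 and 3 (AP 5/6).
```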

Contact localization (Table 4, neural, Normalized Distance %, lower better):

Method	Vision	Touch	Audio	V+T+A
RANDOM	47.32	47.32	47.32	47.32
Point Filtering	–	4.21	1.45	3.73 (T+A)
MCR (ours)	5.03	23.59	4.85	1.84

Vision/audio localize contact far better than touch (touch is locally ambiguous — different vertices share tactile patterns). Point Filtering is strongest but relies on knowing relative pose between consecutive contacts; the new end-to-end MCR is close with a simpler architecture and no pose assumption. On OF-Real (Table 5), fusion (ND 12.00) again beats any single modality.
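
A minimal sketch of the Normalized Distance metric, assuming it is the Euclidean error between predicted and ground-truth contact positions divided by the largest vertex-to-vertex distance on the mesh (that is how I read the metric; the function name and the unit-cube mesh are hypothetical):

```python
import numpy as np

def normalized_distance(pred_xyz, true_xyz, vertices):
    """Contact error as a fraction of the object's largest vertex-to-vertex extent."""
    diffs = vertices[:, None, :] - vertices[None, :, :]
    max_extent = np.sqrt((diffs ** 2).sum(axis=-1)).max()   # O(N^2), fine for a sketch
    return float(np.linalg.norm(pred_xyz - true_xyz) / max_extent)

# Toy mesh: the 8 corners of a unit cube, so the largest extent is sqrt(3).
cube = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])
nd = normalized_distance(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]), cube)
nd_percent = 100.0 * nd   # one edge length off on a unit cube is about 57.7%
```

Under this reading, the RANDOM row at 47.32% means a random guess lands nearly half the object's extent away, which is what makes single-digit values for vision and audio meaningful.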

Material classification (Table 6, neural, top-1 acc): FENet 96.60 > ResNet 96.28 with full fusion; touch alone is much weaker (75–76%) vs vision (~92%) / audio (~95%); fusion best. Sim2real transfer (Table 7): pre-training on neural objects lifts ResNet accuracy on real objects by about 5.8 points (51.02 vs 45.25 without pre-training).

3D shape reconstruction (Table 8, neural, Chamfer Distance cm, lower better): vision alone is the strongest single modality (PCN 2.36); full V+T+A fusion best (PCN 2.25, MDN 2.91). On OF-Real (Table 9), MRT achieves CD 0.95 with all three modalities vs 1.17 vision-only.
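
For readers unfamiliar with the metric, here is a minimal sketch of a symmetric Chamfer Distance between two point sets. Conventions differ across papers (squared vs. unsquared distances, sum vs. mean of the two directions); this is one common variant with unsquared distances and is not necessarily the exact definition behind Tables 8 and 9. The function name and toy points are hypothetical.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance: mean nearest-neighbor distance a->b plus b->a."""
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1))  # pairwise distances
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy example: the second predicted point is offset by 0.1 along z.
gt_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred_points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.1]])
cd = chamfer_distance(pred_points, gt_points)
# Each direction contributes (0 + 0.1) / 2 = 0.05, so cd is 0.1.
```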

Manipulation (sim). Vision and touch are complementary and fusion wins across all four tasks:

Grasp stability (Fig 8, wine glass): V 87.4%, T 93.4%, V+T 99.4%.
Contact refinement (Fig 9, wooden cup): SR V 0.86 / T 0.83 / V+T 0.88; angle error 0.38° / 0.56° / 0.34°.
Surface traversal (Fig 10, iron pan): SR V 0.26 / T 0.54 / V+T 0.80 — the largest fusion gain, and a case where touch alone beats vision alone.
Dynamic pushing (Fig 11, rinsing cup): position error V 23.81 / T 21.76 / V+T 17.63 cm.
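
Because the four tasks use different metrics, the fusion gains are easier to compare as relative improvements over the better single modality. The short calculation below uses only the numbers listed above; the dictionary layout and the `relative_gain` helper are my own scaffolding.

```python
# Values copied from the four result lines above; pushing error is lower-is-better.
results = {
    "grasp stability":    {"V": 87.4,  "T": 93.4,  "V+T": 99.4,  "higher_better": True},
    "contact refinement": {"V": 0.86,  "T": 0.83,  "V+T": 0.88,  "higher_better": True},
    "surface traversal":  {"V": 0.26,  "T": 0.54,  "V+T": 0.80,  "higher_better": True},
    "dynamic pushing":    {"V": 23.81, "T": 21.76, "V+T": 17.63, "higher_better": False},
}

def relative_gain(r):
    """Improvement of V+T over the better single modality, as a fraction of that baseline."""
    if r["higher_better"]:
        best_single = max(r["V"], r["T"])
        return (r["V+T"] - best_single) / best_single
    best_single = min(r["V"], r["T"])
    return (best_single - r["V+T"]) / best_single

gains = {task: relative_gain(r) for task, r in results.items()}
largest = max(gains, key=gains.get)
# Roughly: grasp 6%, contact refinement 2%, surface traversal 48%, pushing 19%.
```

On this scale surface traversal is indeed the largest fusion gain, and contact refinement is the smallest.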

Where modalities lose: touch alone is a poor recognizer (retrieval, localization, material) and a poor global-shape reconstructor; vision alone fails when contact is occluded or when fine local geometry matters (surface traversal). Notably, for grasp stability, pre-training on ObjectFolder Real beats ImageNet, OF 2.0, and Touch and Go for tactile representation transfer (Table 13: 84.9% vs 73.0% / 69.4% / 78.1%).

Limitations & open questions

From the authors:

Sim2real is only achieved for 4 of 10 tasks (recognition + 3D recon); sound generation and visuo-tactile cross-gen would need new real datasets, and the four manipulation tasks need nontrivial per-task robot deployment and physics/optical/elastic calibration (Table 1, Appendix L). They offer only "tentative guidelines," not a calibrated procedure.
OF-Real is modest: 100 objects total (53 used for contact localization, 50 for cross-gen, 5 for several manipulation tasks).
Tactile data is collected at the same points as audio, biasing coverage toward geometrically-special points (rims, handles) rather than uniform surface sampling.

What I noticed reading it:

Several manipulation "results" are teasers on a single object (wine glass, wooden cup, iron pan, rinsing cup) — Fig 8–11 are single-object demonstrations, not aggregate success rates over an object distribution. The appendix Tables 14–17 widen to 5 objects but still no seeds/variance, so the manipulation claims are far weaker statistically than the recognition tables.
The MCR baseline is introduced as the paper's own contribution yet loses to the prior Point Filtering baseline on neural objects (Table 4); the framing ("promising results... great potential") soft-pedals that it's worse, justified by Point Filtering being inapplicable to OF-Real (it needs arbitrary-point touch/audio).
No statistical significance anywhere: the recognition/reconstruction tables carry no error bars either. With so few objects, several fusion-vs-single-modality gaps (e.g., V+T 0.88 vs T 0.83 success rate on contact refinement) could be within noise.
"Touch is too local to recognize" is asserted from these specific GelSight captures; it may be a property of single-frame static touch rather than touch in general (dynamic/sliding touch could carry more). The benchmark doesn't test that, which later tactile-representation work (AnyTouch, Sparsh-X) directly contests.

Why I care

This is an adjacent infrastructure paper, not a manipulation policy paper, but it is the canonical empirical backing for the thesis behind the entire 2026-06-24 batch, and a direct foil for BLADE. BLADE learns visual predicate classifiers (turned-on(faucet), open(drawer)) by cropping to the relevant object and training a neural classifier on pixels. The big idea this batch is built around is that many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) are not visually evaluable; they live in touch, force, and sound.

ObjectFolder Benchmark is the cleanest quantitative evidence that these three modalities carry structurally different, non-redundant information: vision = global geometry/pose, audio = scale/material, touch = precise local contact geometry, and fusion beats any single channel on grasp stability, contact localization, material, and 3D shape. That is exactly the argument for why a BLADE-style abstraction layer would need multisensory predicate classifiers, not vision-only ones. The material-classification result (touch/audio resolve surface_is_rough-type material properties that vision guesses) and contact-localization (touch is locally precise but globally ambiguous, the inverse of vision) map almost one-to-one onto which predicates each sensor should ground.

Concretely relevant downstream: the ObjectFolder simulators feed tactile/acoustic simulators in cluster I (TACTO, SoundSpaces 2.0), and its sim2real-gap measurement frames every "train in sim, deploy real" tactile/audio policy in clusters C/D. It is not a planning or abstraction paper, so I'm tagging it as an evidence/dataset anchor rather than a method to build on directly.

Quotable

For recognition, vision and audio tend to be more reliable compared to touch, where the information is too local to recognize. ... it is possible to hallucinate one modality from the other. This agrees with the notion of degeneracy in cognitive studies. — §1 / p.2

Touch, often as a good complement to vision, is especially useful to capture the accurate local geometry of the contact point. — §1 / p.2

Our ObjectFolder Real dataset is the first dataset that contains all three modalities with rich annotations to facilitate multisensory learning research with real object data. — §2 / p.3

Papers cited that should likely be ingested next:

Gao et al. 2021 — ObjectFolder [25] — the original implicit-representation dataset this extends. Forward-ref PDF.
Gao et al. 2022 — ObjectFolder 2.0 [28] — the 1,000 neural objects + sim2real predecessor; supplies the neural-object half of every table here. PDF.
Wang et al. — TACTO — GelSight tactile simulator underpinning tactile rendering / sim2real. PDF.
Chen et al. — SoundSpaces 2.0 — visual-acoustic simulation counterpart for the audio modality. PDF.
Yuan/Dong/Adelson — GelSight / improved GelSight [20] — the tactile sensor used for the touch captures. PDF.
Touch and Go [76] — the human-collected vision+touch dataset compared against for tactile pre-training (Table 13). PDF.
Lee et al. 2019 — Making Sense of Vision and Touch [41] — cited multisensory representation-learning precursor. PDF.

Newly ingested in the 2026-06-24 batch — directly relevant to this work:

ObjectFolder and ObjectFolder 2.0 — the same authors' neural-object datasets this benchmark builds on and measures sim2real against. The trilogy's third entry.
Touch and Go — human-collected vision+touch dataset; the tactile-pretraining baseline OF-Real beats in Table 13 (84.9% vs 78.1%).
TACTO, Taxim, TacEx — cluster-I GelSight simulators; the tactile-rendering machinery whose sim2real gap this paper quantifies for downstream tactile policies.
SoundSpaces 2.0 — visual-acoustic simulator; the audio-modality analogue to the tactile simulators above.
Making Sense of Vision and Touch — foundational visuo-tactile representation learning for contact-rich tasks; conceptual precursor to fusing touch into manipulation.
VTDexManip and Kaiwu — sibling cluster-I multimodal manipulation benchmarks/datasets; complementary coverage (dexterous / large-scale real manipulation vs object-centric).
Sparsh and AnyTouch — tactile foundation models that contest this paper's "touch is too local to recognize" claim by learning richer (and dynamic) tactile representations.
SonicSense and See, Hear, and Feel — cluster-D acoustic/multisensory manipulation that operationalizes the "audio carries material/scale" finding into closed-loop policies.