The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects

Ruohan Gao*, Yiming Dou*†, Hao Li*, Tanmay Agarwal, Jeannette Bohg, Yunzhu Li, Li Fei-Fei, Jiajun Wu · Stanford University · CVPR 2023 · arXiv:2306.00956

One-liner. A 10-task benchmark suite for multisensory object-centric learning across sight, sound, and touch, paired with ObjectFolder Real — the first real-world dataset to capture all three modalities (3D meshes, impact sounds, GelSight tactile readings) for 100 household objects — so that the field can finally measure which modality carries which information, and how badly sim2real bites for non-visual sensing.

Problem & motivation

Vision dominates object-centric learning, but real interaction is multisensory: objects make sounds when struck and deform tactile sensors when touched. The predecessors ObjectFolder (100 neural objects) and ObjectFolder 2.0 (1,000 neural objects) provided implicit visual/acoustic/tactile "Object Files" but left two gaps the authors call out: (1) there are no real objects, so all data is simulated without sim2real calibration and it is unknown whether models trained on neural objects transfer; and (2) only a couple of tasks were demonstrated. The paper closes both gaps by (a) building ObjectFolder Real, real multisensory captures of 100 objects, and (b) standardizing 10 benchmark tasks with defined metrics and baselines, spanning recognition, reconstruction, and manipulation. The motivating thesis is degeneracy from cognitive science: fused multisensory perception is robust to the loss of any one modality (§1), and different modalities carry structurally different information (vision = global geometry/pose; audio = scale/material; touch = precise local contact geometry).

Method

This is a dataset + benchmark paper. The "method" is the real-data capture pipeline (Fig 2) plus the 10 standardized tasks and their baselines.

ObjectFolder Real capture pipeline (100 household objects, each with vision + audio + touch):

The 10 tasks (Fig 1), grouped in three families:

Recognition (3): Cross-Sensory Retrieval (any modality → any modality, 9 sub-tasks, mAP metric, baselines CCA / PLSCA / DSCMR / DAR); Contact Localization (predict the mesh vertex where contact happens from vision/audio/touch; Normalized-Distance metric; baselines Point Filtering + a new differentiable Multisensory Contact Regression, MCR); Material Classification (7 material classes; top-1 accuracy; baselines ResNet, FENet).

Reconstruction (3): 3D Shape Reconstruction (Chamfer Distance; baselines MDN, PCN, and a new Multisensory Reconstruction Transformer, MRT); Sound Generation of Dynamic Objects (video→sound of a falling object; STFT/Envelope/CDPAM metrics; baselines RegNet, SpecVQGAN); Visuo-Tactile Cross-Generation (Vision↔Touch image translation; PSNR/SSIM; baselines Pix2Pix, VisGel).

Manipulation (4): Grasp-Stability Prediction (will the grasp hold? accuracy); Contact Refinement, Surface Traversal, and Dynamic Pushing — the latter three use a shared Multisensory-MPC baseline (SVG future-frame prediction + MPPI / CEM control). Table 1 shows which tasks are feasible in sim, real, or both: recognition + 3D reconstruction have real results; sound generation, visuo-tactile cross-gen, and the four manipulation tasks are sim-only (sim2real for manipulation needs per-task robot calibration; see Appendix L).
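The Multisensory-MPC baseline named above pairs a learned future predictor with a sampling-based planner. The sketch below illustrates only the planner half, using the cross-entropy method (CEM) in a receding-horizon loop. It is a minimal illustration under stated assumptions, not the paper's implementation: `toy_dynamics` is a hypothetical hand-written stand-in for the learned SVG predictor, the state is a 2D point rather than sensory frames, and all names and hyperparameters are invented for this example.

```python
import numpy as np

def toy_dynamics(state, action):
    """Hypothetical stand-in for a learned predictor: the object shifts by a clipped action."""
    return state + 0.1 * np.clip(action, -1.0, 1.0)

def cem_plan(state, goal, horizon=5, samples=128, elites=16, iters=6, seed=0):
    """Return the first action of the action sequence found by CEM."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, 2))
    std = np.ones((horizon, 2))
    for _ in range(iters):
        # Sample candidate action sequences around the current distribution.
        actions = mean + std * rng.standard_normal((samples, horizon, 2))
        states = np.repeat(state[None, :], samples, axis=0)
        costs = np.zeros(samples)
        for t in range(horizon):
            states = toy_dynamics(states, actions[:, t])
            # Accumulate distance to the goal at every predicted step.
            costs += np.linalg.norm(states - goal, axis=1)
        # Refit the sampling distribution to the lowest-cost sequences.
        elite = actions[np.argsort(costs)[:elites]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0) + 1e-6
    return mean[0]

def run_mpc(start, goal, steps=20):
    """Receding-horizon loop: replan, apply only the first action, repeat."""
    state = np.asarray(start, dtype=float)
    goal = np.asarray(goal, dtype=float)
    for _ in range(steps):
        state = toy_dynamics(state, cem_plan(state, goal))
    return state
```

Calling `run_mpc([0.0, 0.0], [0.5, -0.3])` drives the toy state toward the goal. In the benchmark setting the cost would be computed on predicted visual and tactile observations rather than on a known 2D position, and MPPI would replace the elite refit with a cost-weighted average of the samples.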

Setup

Results

Headline cross-cutting finding (§1, conclusion): vision and audio are reliable for recognition; touch is too local to recognize an object alone; fusing modalities is consistently best; and you can sometimes hallucinate one modality from another. For reconstruction, vision gives coarse global shape, audio gives scale, touch gives precise local geometry — fusion wins.

Cross-sensory retrieval (neural objects, Table 2, mAP): global-information modalities (vision, audio) retrieve far better than the local one (touch). Best single cell: vision→vision 89.28 (DAR); audio→audio 80.77 (PLSCA); touch→touch only 54.80 (DAR). Cross-modal into touch is weak (vision→touch 7.03, audio→touch 6.96), confirming touch's locality.
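As a rough illustration of the mAP metric behind these numbers, the sketch below scores retrieval from precomputed embeddings. The function name `retrieval_map`, the cosine-similarity ranking, and the rule that a gallery item counts as relevant when it shares the query's object label are assumptions made for this example; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def retrieval_map(query_emb, gallery_emb, query_labels, gallery_labels):
    """Mean average precision for retrieval ranked by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T  # shape: (num_queries, num_gallery)
    aps = []
    for i in range(sims.shape[0]):
        order = np.argsort(-sims[i])  # most similar gallery item first
        relevant = gallery_labels[order] == query_labels[i]
        if not relevant.any():
            continue  # a query with no relevant gallery item is skipped
        hits = np.cumsum(relevant)
        ranks = np.arange(1, len(relevant) + 1)
        # Precision evaluated at each rank where a relevant item appears.
        aps.append((hits[relevant] / ranks[relevant]).mean())
    return float(np.mean(aps))
```

For a cross-modal cell such as vision→touch, `query_emb` would hold embeddings from the vision encoder and `gallery_emb` those from the touch encoder, both projected into the shared space learned by a method like CCA or DAR.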

Contact localization (Table 4, neural, Normalized Distance %, lower better):

Method          | Vision | Touch | Audio | V+T+A
RANDOM          | 47.32  | 47.32 | 47.32 | 47.32
Point Filtering | -      | 4.21  | 1.45  | 3.73 (T+A)
MCR (ours)      | 5.03   | 23.59 | 4.85  | 1.84

Vision/audio localize contact far better than touch (touch is locally ambiguous — different vertices share tactile patterns). Point Filtering is strongest but relies on knowing relative pose between consecutive contacts; the new end-to-end MCR is close with a simpler architecture and no pose assumption. On OF-Real (Table 5), fusion (ND 12.00) again beats any single modality.
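To make the Normalized Distance numbers concrete, here is a minimal sketch of one plausible definition: the Euclidean error between the predicted and true contact vertices, divided by the largest vertex-to-vertex distance on the object and expressed as a percentage. That definition and the function name `normalized_distance` are assumptions for illustration; check the paper for the exact normalization it uses.

```python
import numpy as np

def normalized_distance(vertices, pred_idx, true_idx):
    """Contact localization error as a percentage of the object's extent."""
    # Pairwise vertex distances; O(N^2) memory, so only suitable for small meshes.
    diffs = vertices[:, None, :] - vertices[None, :, :]
    diameter = np.linalg.norm(diffs, axis=-1).max()
    error = np.linalg.norm(vertices[pred_idx] - vertices[true_idx])
    return 100.0 * error / diameter
```

Under this definition a score of 47.32 (the RANDOM row) means the guess lands, on average, nearly half an object-diameter away from the true contact.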

Material classification (Table 6, neural, top-1 acc): with full fusion, FENet reaches 96.60 and ResNet 96.28; touch alone is much weaker (75–76%) than vision (~92%) or audio (~95%), and fusion is best. Sim2real transfer (Table 7): pre-training on neural objects raises ResNet accuracy on real objects by about 5.8 points (51.02 vs 45.25 without pre-training).

3D shape reconstruction (Table 8, neural, Chamfer Distance cm, lower better): vision alone is the strongest single modality (PCN 2.36); full V+T+A fusion best (PCN 2.25, MDN 2.91). On OF-Real (Table 9), MRT achieves CD 0.95 with all three modalities vs 1.17 vision-only.
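For reference, Chamfer Distance between two point clouds can be sketched as below. Several conventions exist (squared vs. unsquared distances, sum vs. average of the two directions); this version uses unsquared Euclidean distances and sums the two directional means, which is an assumption and may not match the paper's exact variant.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3)."""
    # Full pairwise distance matrix of shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance from a to b, plus the same from b to a.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

The two terms penalize different failures: the a→b term grows when the prediction contains points far from the target surface, and the b→a term grows when parts of the target are not covered by the prediction.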

Manipulation (sim). Vision and touch are complementary and fusion wins across all four tasks:

Where modalities lose: touch alone is a poor recognizer (retrieval, localization, material) and a poor global-shape reconstructor; vision alone fails when contact is occluded or when fine local geometry matters (surface traversal). Notably for grasp stability, pre-training on ObjectFolder Real beats ImageNet, OF 2.0, and Touch-and-Go for tactile representation transfer (Table 13: 84.9% vs 73.0% / 69.4% / 78.1%).

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is an adjacent infrastructure paper, not a manipulation policy paper, but it is the canonical empirical backing for the thesis behind the entire 2026-06-24 batch, and a direct foil for BLADE. BLADE learns visual predicate classifiers (turned-on(faucet), open(drawer)) by cropping to the relevant object and training a neural classifier on pixels. The big idea this batch is built around is that many manipulation predicates (is_grasped, is_inserted, is_full, surface_is_rough, is_screwed_tight) are not visually evaluable; they live in touch, force, and sound.

ObjectFolder Benchmark is the cleanest quantitative evidence that vision, audio, and touch carry structurally different, non-redundant information: vision = global geometry/pose, audio = scale/material, touch = precise local contact geometry, and fusion beats any single channel on grasp stability, contact localization, material, and 3D shape. That is exactly the argument for why a BLADE-style abstraction layer would need multisensory predicate classifiers, not vision-only ones. The material-classification result (touch/audio resolve surface_is_rough-type material properties that vision guesses) and the contact-localization result (touch is locally precise but globally ambiguous, the inverse of vision) map almost one-to-one onto which predicates each sensor should ground.

Concretely relevant downstream: the ObjectFolder simulators feed tactile/acoustic simulators in cluster I (TACTO, SoundSpaces 2.0), and its sim2real-gap measurement frames every "train in sim, deploy real" tactile/audio policy in clusters C/D. It is not a planning or abstraction paper, so I'm tagging it as an evidence/dataset anchor rather than a method to build on directly.

Quotable

For recognition, vision and audio tend to be more reliable compared to touch, where the information is too local to recognize. ... it is possible to hallucinate one modality from the other. This agrees with the notion of degeneracy in cognitive studies. — §1 / p.2
Touch, often as a good complement to vision, is especially useful to capture the accurate local geometry of the contact point. — §1 / p.2
Our ObjectFolder Real dataset is the first dataset that contains all three modalities with rich annotations to facilitate multisensory learning research with real object data. — §2 / p.3

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant to this work: