One-liner. A 10-task benchmark suite for multisensory object-centric learning across sight, sound, and touch, paired with ObjectFolder Real, the first real-world dataset to capture all three modalities (3D meshes, impact sounds, GelSight tactile readings) for 100 household objects. Together they let the field measure which modality carries which information, and how badly sim2real bites for non-visual sensing.
Vision dominates object-centric learning, but real interaction is multisensory: objects make sounds when struck and deform tactile sensors when touched. The predecessors, ObjectFolder and ObjectFolder 2.0, provided up to 1,000 neural objects (implicit visual/acoustic/tactile "Object Files") but left two gaps the authors call out: (1) no real objects, since all data is simulated with no sim2real calibration, so it is unknown whether models trained on neural objects transfer; and (2) only a couple of tasks were demonstrated. The paper closes both gaps by (a) building ObjectFolder Real, real multisensory captures of 100 objects, and (b) standardizing 10 benchmark tasks with defined metrics and baselines, spanning recognition, reconstruction, and manipulation. The motivating thesis is degeneracy from cognitive science: fused multisensory perception is robust to the loss of any one modality (§1), and different modalities carry structurally different information (vision = global geometry/pose; audio = scale/material; touch = precise local contact geometry).
This is a dataset + benchmark paper. The "method" is the real-data capture pipeline (Fig 2) plus the 10 standardized tasks and their baselines.
ObjectFolder Real capture pipeline (100 household objects, each with vision + audio + touch):
The 10 tasks (Fig 1), grouped in three families:
Recognition (3): Cross-Sensory Retrieval (any modality → any modality, 9 sub-tasks, mAP metric, baselines CCA / PLSCA / DSCMR / DAR); Contact Localization (predict the mesh vertex where contact happens from vision/audio/touch; Normalized-Distance metric; baselines Point Filtering + a new differentiable Multisensory Contact Regression, MCR); Material Classification (7 material classes; top-1 accuracy; baselines ResNet, FENet).
Reconstruction (3): 3D Shape Reconstruction (Chamfer Distance; baselines MDN, PCN, and a new Multisensory Reconstruction Transformer, MRT); Sound Generation of Dynamic Objects (video→sound of a falling object; STFT/Envelope/CDPAM metrics; baselines RegNet, SpecVQGAN); Visuo-Tactile Cross-Generation (Vision↔Touch image translation; PSNR/SSIM; baselines Pix2Pix, VisGel).
Manipulation (4): Grasp-Stability Prediction (will the grasp hold?; accuracy metric); Contact Refinement, Surface Traversal, and Dynamic Pushing. The latter three share a Multisensory-MPC baseline (SVG future-frame prediction + MPPI / CEM control). Table 1 shows which tasks are feasible in sim, real, or both: recognition and 3D shape reconstruction have real results; sound generation, visuo-tactile cross-generation, and the four manipulation tasks are sim-only (sim2real for manipulation needs per-task robot calibration; see Appendix L).
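To make the "predict, then optimize actions" loop behind the Multisensory-MPC baseline concrete, here is a minimal sketch of a cross-entropy method (CEM) planner. It is not the paper's implementation: the learned SVG future-frame predictor is replaced by a hypothetical hand-written 1-D pushing model (`rollout`), and every name, cost, and hyperparameter below is an assumption chosen for illustration.

```python
import random

def clip(u, lo=-1.0, hi=1.0):
    """Keep a push command inside the (assumed) actuator limits."""
    return max(lo, min(hi, u))

def rollout(x0, actions):
    """Hypothetical stand-in for the learned predictor: each push shifts a 1-D object."""
    x = x0
    for u in actions:
        x += u
    return x

def cem_plan(x0, goal, horizon=5, samples=64, elites=8, iters=20, seed=0):
    """Cross-entropy method: sample action sequences, keep the best, refit the Gaussian."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current diagonal Gaussian.
        candidates = [
            [clip(rng.gauss(m, s)) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        # Score each candidate by squared distance of the predicted final state to the goal.
        candidates.sort(key=lambda acts: (rollout(x0, acts) - goal) ** 2)
        best = candidates[:elites]
        # Refit mean and standard deviation to the elite set (with a small std floor).
        mean = [sum(a[t] for a in best) / elites for t in range(horizon)]
        std = [
            max(1e-3, (sum((a[t] - mean[t]) ** 2 for a in best) / elites) ** 0.5)
            for t in range(horizon)
        ]
    return mean

plan = cem_plan(x0=0.0, goal=2.0)
```

In a receding-horizon (MPC) setting, only `plan[0]` would be executed before replanning from the newly observed state. MPPI differs mainly in replacing the hard elite cut with an exponentially weighted average over all samples.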
Headline cross-cutting finding (§1, conclusion): vision and audio are reliable for recognition; touch is too local to recognize an object alone; fusing modalities is consistently best; and you can sometimes hallucinate one modality from another. For reconstruction, vision gives coarse global shape, audio gives scale, touch gives precise local geometry — fusion wins.
Cross-sensory retrieval (neural objects, Table 2, mAP): global-information modalities (vision, audio) retrieve far better than the local one (touch). Best within-modality scores: vision→vision 89.28 (DAR); audio→audio 80.77 (PLSCA); touch→touch only 54.80 (DAR). Retrieval into touch from another modality is weak (vision→touch 7.03, audio→touch 6.96), consistent with touch's locality.
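For readers unfamiliar with the mAP metric used in the retrieval numbers above, a minimal sketch follows. It assumes a retrieved item counts as relevant when it comes from the same object as the query; that relevance rule and the function names are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence

def average_precision(ranked_labels: Sequence[int], query_label: int) -> float:
    """AP for one query: mean of precision@k over the ranks k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings: Sequence[Sequence[int]],
                           query_labels: Sequence[int]) -> float:
    """mAP: average the per-query AP over all queries."""
    aps = [average_precision(r, q) for r, q in zip(rankings, query_labels)]
    return sum(aps) / len(aps)

# Two toy queries for object 1; each list holds the object IDs of the ranked results.
score = mean_average_precision([[1, 0, 1, 0], [0, 1, 1]], [1, 1])
```

Here the first query has AP (1 + 2/3) / 2 = 5/6 and the second (1/2 + 2/3) / 2 = 7/12, so `score` is 17/24.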
Contact localization (Table 4, neural, Normalized Distance %, lower better):
| Method | Vision | Touch | Audio | V+T+A |
|---|---|---|---|---|
| RANDOM | 47.32 | 47.32 | 47.32 | 47.32 |
| Point Filtering | – | 4.21 | 1.45 | 3.73 (T+A) |
| MCR (ours) | 5.03 | 23.59 | 4.85 | 1.84 |
With MCR, vision (5.03) and audio (4.85) localize contact far better than touch (23.59): touch is locally ambiguous because different vertices share tactile patterns. Point Filtering is strongest but relies on knowing the relative pose between consecutive contacts; the new end-to-end MCR is close (1.84 with all three modalities fused) with a simpler architecture and no pose assumption. On OF-Real (Table 5), fusion (ND 12.00) again beats any single modality.
Material classification (Table 6, neural, top-1 accuracy %): with full fusion, FENet reaches 96.60 and ResNet 96.28; touch alone is much weaker (75–76) than vision (~92) or audio (~95). Sim2real transfer (Table 7): pre-training on neural objects improves ResNet accuracy on real objects by about 6 percentage points (51.02 vs 45.25 without pre-training).
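The Normalized Distance numbers above can be read with the following sketch in mind. It assumes ND is the Euclidean distance between the predicted and true contact positions, divided by the largest distance between any two mesh vertices; these notes do not spell the definition out, so treat that as an assumption. The brute-force diameter search is only suitable for small toy meshes.

```python
import math
from itertools import combinations

def normalized_distance(pred, target, vertices) -> float:
    """Distance between predicted and true contact, scaled by the mesh's largest extent."""
    # Largest vertex-to-vertex distance (O(n^2); fine for a toy mesh only).
    diameter = max(math.dist(a, b) for a, b in combinations(vertices, 2))
    return math.dist(pred, target) / diameter

# Toy "mesh": the eight corners of a unit cube, whose diameter is sqrt(3).
cube = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
nd = normalized_distance((0, 0, 0), (1, 0, 0), cube)
nd_percent = 100.0 * nd  # the table reports ND as a percentage
```

Under this definition a perfect prediction scores 0 and a prediction at the opposite end of the object scores 100%.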
3D shape reconstruction (Table 8, neural, Chamfer Distance cm, lower better): vision alone is the strongest single modality (PCN 2.36); full V+T+A fusion best (PCN 2.25, MDN 2.91). On OF-Real (Table 9), MRT achieves CD 0.95 with all three modalities vs 1.17 vision-only.
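As a reminder of what the Chamfer Distance above measures, here is a minimal sketch of one common variant: the mean nearest-neighbour distance from each point set to the other, summed over both directions. Conventions differ (squared vs. unsquared distances, sum vs. mean), and the notes do not say which one the paper uses, so the exact form here is an assumption.

```python
import math

def chamfer_distance(a, b) -> float:
    """Symmetric Chamfer distance between two point sets (unsquared, mean-reduced)."""
    def one_way(src, dst):
        # For each source point, take the distance to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, ground_truth)
```

In this toy case one point matches exactly and the other is off by 1 unit, so each direction averages 0.5 and `cd` is 1.0. Identical point sets give 0.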
Manipulation (sim). Vision and touch are complementary and fusion wins across all four tasks:
Where modalities lose: touch alone is a poor recognizer (retrieval, localization, material) and a poor global-shape reconstructor; vision alone fails when contact is occluded or when fine local geometry matters (surface traversal). Notably for grasp stability, pre-training on ObjectFolder Real beats ImageNet, OF 2.0, and Touch-and-Go for tactile representation transfer (Table 13: 84.9% vs 73.0% / 69.4% / 78.1%).
From the authors:
What I noticed reading it:
This is an adjacent infrastructure paper, not a manipulation
policy paper — but it is the canonical empirical backing for the thesis
behind the entire 2026-06-24 batch, and a direct foil for
BLADE.
BLADE learns visual predicate classifiers (turned-on(faucet),
open(drawer)) by cropping to the relevant object and training a
neural classifier on pixels. The big idea this batch is built around is that many
manipulation predicates — is_grasped, is_inserted,
is_full, surface_is_rough, is_screwed_tight
— are not visually evaluable; they live in touch, force, and sound.
ObjectFolder Benchmark is the cleanest quantitative evidence that these three
modalities carry structurally different, non-redundant
information: vision = global geometry/pose, audio = scale/material, touch =
precise local contact geometry, and fusion beats any single channel on grasp
stability, contact localization, material, and 3D shape. That is exactly the
argument for why a BLADE-style abstraction layer would need multisensory
predicate classifiers, not vision-only ones. The material-classification result
(touch/audio resolve surface_is_rough-type material properties that
vision guesses) and contact-localization (touch is locally precise but globally
ambiguous — the inverse of vision) map almost one-to-one onto which
predicates each sensor should ground. Concretely relevant downstream: the
ObjectFolder simulators feed tactile/acoustic simulators in cluster I
(TACTO,
SoundSpaces 2.0),
and its sim2real-gap measurement frames every "train in sim, deploy real"
tactile/audio policy in clusters C/D. It is not a planning or
abstraction paper, so I'm tagging it as an evidence/dataset anchor rather than a
method to build on directly.
For recognition, vision and audio tend to be more reliable compared to touch, where the information is too local to recognize. ... it is possible to hallucinate one modality from the other. This agrees with the notion of degeneracy in cognitive studies. — §1 / p.2
Touch, often as a good complement to vision, is especially useful to capture the accurate local geometry of the contact point. — §1 / p.2
Our ObjectFolder Real dataset is the first dataset that contains all three modalities with rich annotations to facilitate multisensory learning research with real object data. — §2 / p.3
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: