ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations

Ruohan Gao, Yen-Yu Chang*, Shivani Mall*, Li Fei-Fei, Jiajun Wu · Stanford University · CoRL 2021 · arXiv:2109.07991

One-liner. ObjectFolder packages 100 virtualized 3D objects, each as a single Object File: a small triplet of neural networks (VisionNet / AudioNet / TouchNet) that you query with extrinsic parameters (viewpoint, contact location) to get that object's rendered image, impact sound, and tactile reading. This turns multisensory object perception into a shareable, storage-cheap implicit-representation benchmark instead of a pile of raw renders and recordings.

Problem & motivation

Multisensory, object-centric perception is bottlenecked by data. Synthetic object datasets (ShapeNet, ModelNet) carry geometry but little realistic texture and no sound or touch; real manipulation object sets (YCB, BigBIRD) are expensive, unstable to acquire (shipping, inventory), and hard to virtualize across sensory modes. The world gets modeled as "silent and untouchable" because we only have looking-based data. ObjectFolder's goal is a community-accessible standard benchmark of 3D objects that are (1) easy to share, (2) high-quality in visual texture, and (3) augmented with realistic auditory and tactile sensory data — all under one uniform, object-centric, implicit representation.

Method

Each object is an Object File: a compact implicit neural representation with three sub-networks (Fig 3). You feed extrinsics; the networks emit sensory data. The "object file" framing is borrowed from Kahneman et al. [22] — the information received about an object at a location.

Objects. 100 high-quality 3D objects from online repositories: 20 from 3D Model Haven, 28 from YCB, 52 from Google Scanned Objects. Each annotated with a material type (ceramic, glass, wood, plastic, iron, polycarbonate, steel) that drives audio simulation. Common household categories (bowl, mug, cabinet, television, shelf, fork, spoon).

VisionNet. A NeRF-style object-centric neural scattering function (following object-centric NeRF [27]). The input is a 7D coordinate (x, y, z, φ_i, φ_o): a 3D location plus incoming and outgoing light directions, two angles each. An MLP Fv maps it to volume density σ and per-channel scatter fractions (r, g, b). Classic volume rendering (Eq 1–2) produces images at arbitrary viewpoint and lighting. Training images are rendered in Blender Cycles with the object normalized to a unit cube, a point light on a unit sphere, a white background, and viewpoints on a full sphere. Training uses stratified sampling, positional encoding, and hierarchical volume sampling.
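The volume-rendering step can be sketched with the standard NeRF quadrature rule. This is a minimal illustration, not the paper's Eq 1–2 verbatim: the function name `composite_ray`, its arguments, and the use of plain per-sample colours in place of scattered radiance are all assumptions made for the example. The white background matches the training renders described above.

```python
import numpy as np

def composite_ray(sigma, rgb, t):
    """Composite samples along one ray with the standard NeRF quadrature rule.

    sigma: (N,) volume densities at the samples
    rgb:   (N, 3) per-sample colour (stand-in for the scattered radiance)
    t:     (N + 1,) sorted bin edges along the ray; sample i covers [t[i], t[i + 1]]
    """
    delta = np.diff(t)                        # segment lengths, shape (N,)
    alpha = 1.0 - np.exp(-sigma * delta)      # opacity of each segment
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = trans * alpha                   # contribution of each sample
    colour = (weights[:, None] * rgb).sum(axis=0)
    background = 1.0 - weights.sum()          # leftover light sees the white background
    return colour + background * np.ones(3), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    colour, weights = composite_ray(
        sigma=rng.uniform(0.0, 5.0, size=16),
        rgb=rng.uniform(0.0, 1.0, size=(16, 3)),
        t=np.linspace(0.0, 1.0, 17),
    )
    print(colour, weights.sum())
```

The per-sample `weights` are also what hierarchical sampling would reuse to place finer samples where the density is concentrated.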

AudioNet. Physics-based rigid-body impact sound via linear modal analysis [23]. The surface mesh is converted to a volumetric hexahedron mesh (N voxels); the linear elastic dynamics Mü + Cu̇ + Ku = f (Eq 3) is solved by generalized eigenvalue decomposition into a bank of damped sinusoids, one per vibration mode (Eq 4–6), with frequency and damping set by material parameters (density, Young's modulus, Poisson's ratio, Rayleigh damping). Modes are excited by unit forces f1, f2, f3 at every vertex. For each force direction a separate MLP branch encodes the modes signal: input is vertex coordinate (x, y, z) + spectrogram location (f, t); it predicts the real and imaginary parts of the complex spectrogram, inverted via ISTFT. (They found predicting the spectrogram works better than predicting the modes signal directly.) At test time an arbitrary force is decomposed into the three unit-force amplitudes and combined per vertex.
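To make the "bank of damped sinusoids" and the force decomposition concrete, here is a minimal sketch of the modal signal that AudioNet is trained to encode. It works on the analytic signal directly, whereas AudioNet itself predicts a spectrogram. The function name `synthesize_impact`, the `gains` layout, and the example frequencies and decay rates are hypothetical; in the paper these quantities come from the eigen-decomposition and the material parameters.

```python
import numpy as np

def synthesize_impact(freqs_hz, dampings, gains, force, sr=44100, duration=1.0):
    """Sum of exponentially damped sinusoids, one per vibration mode.

    freqs_hz: (M,) frequency of each mode in Hz
    dampings: (M,) decay rate of each mode in 1/s
    gains:    (3, M) mode amplitudes excited by unit forces along x, y, z
              at the struck vertex (hypothetical layout)
    force:    (3,) contact force; the system is linear, so the response is the
              force-weighted combination of the three unit-force responses
    """
    t = np.arange(int(sr * duration)) / sr
    amps = np.asarray(force, dtype=float) @ np.asarray(gains, dtype=float)  # (M,)
    envelope = np.exp(-np.outer(dampings, t))                               # (M, T)
    carrier = np.sin(2.0 * np.pi * np.outer(freqs_hz, t))                   # (M, T)
    return amps @ (envelope * carrier)                                      # (T,)


if __name__ == "__main__":
    freqs = np.array([440.0, 1210.0, 2890.0])      # illustrative values only
    decay = np.array([6.0, 14.0, 30.0])
    gains = np.array([[1.0, 0.4, 0.1],
                      [0.3, 0.8, 0.2],
                      [0.0, 0.2, 0.6]])
    wave = synthesize_impact(freqs, decay, gains, force=[0.5, 0.0, 1.0])
    print(wave.shape, float(np.abs(wave).max()))
```

Linearity is what makes the three-branch design sufficient: any strike direction is a weighted sum of the three unit-force responses at that vertex.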

TouchNet. Tactile readings are simulated with the vision-based touch simulator TACTO [25], configured to a DIGIT [24] sensor (chosen for compactness and high resolution, with relevance to in-hand manipulation). DIGIT is pressed at each surface vertex along the normal, force constrained to a small range so renderings don't vary much across forces, yielding an RGB tactile image capturing local contact geometry. TouchNet is an MLP taking vertex (x,y,z) + tactile image location (w,h) and predicting the per-pixel RGB of the tactile image.
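The query pattern (one contact vertex in, a full tactile image out, one MLP evaluation per pixel) can be sketched as follows. Everything here is illustrative: the weights are random rather than trained, and the positional encoding, layer sizes, sigmoid output, and the names `positional_encoding`, `query_tactile_image`, and `random_weights` are assumptions, not details taken from the paper.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """NeRF-style encoding: append sin/cos of each coordinate at octave frequencies."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi             # (F,)
    scaled = x[..., None] * freqs                             # (..., D, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return np.concatenate([x, enc.reshape(*x.shape[:-1], -1)], axis=-1)

def random_weights(sizes, seed=0):
    """Untrained stand-in for learned MLP parameters."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def query_tactile_image(vertex, weights, height=8, width=8):
    """Evaluate a per-pixel MLP at every (w, h) location for one contact vertex."""
    hs, ws = np.meshgrid(np.linspace(0.0, 1.0, height),
                         np.linspace(0.0, 1.0, width), indexing="ij")
    pixels = np.stack([ws.ravel(), hs.ravel()], axis=-1)      # (H*W, 2)
    inputs = np.concatenate(
        [np.tile(vertex, (len(pixels), 1)), pixels], axis=-1)  # (H*W, 5)
    h = positional_encoding(inputs)
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)                        # ReLU hidden layers
    W, b = weights[-1]
    rgb = 1.0 / (1.0 + np.exp(-(h @ W + b)))                  # sigmoid keeps RGB in [0, 1]
    return rgb.reshape(height, width, 3)


if __name__ == "__main__":
    in_dim = 5 + 5 * 2 * 4                                    # raw coords + sin/cos features
    weights = random_weights([in_dim, 32, 32, 3])
    image = query_tactile_image(np.array([0.1, -0.2, 0.3]), weights)
    print(image.shape)
```

Because the image resolution is just the size of the pixel grid, the same network can in principle be sampled at any resolution.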

Setup

Results

The paper's headline is a storage argument plus a suite of small benchmark demonstrations that multisensory beats single-modality.

Storage (Table 1). Implicit Object Files are far smaller than raw data: Vision 7.2 MB (vs. effectively infinite for all-viewpoint/lighting renders), Audio 6.3 MB (vs. 12.6 GB), Touch 2.1 MB (vs. 2.9 GB).
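The size gap is easier to read as a ratio. A quick back-of-the-envelope check, using only the Table 1 numbers quoted above and assuming decimal units (1 GB = 1000 MB):

```python
# Sizes quoted from Table 1 above; the decimal GB -> MB conversion is an assumption.
MB_PER_GB = 1000.0

raw_vs_implicit_mb = {
    "audio": (12.6 * MB_PER_GB, 6.3),
    "touch": (2.9 * MB_PER_GB, 2.1),
}

def compression_ratio(raw_mb, implicit_mb):
    """How many times smaller the implicit network is than the raw data."""
    return raw_mb / implicit_mb

ratios = {name: compression_ratio(raw, small)
          for name, (raw, small) in raw_vs_implicit_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:,.0f}x smaller")
```

That is roughly 2,000x for audio and 1,400x for touch, so about three orders of magnitude in both cases. Vision has no finite ratio because the raw side is unbounded.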

Multisensory instance recognition (Table 2, Acc %). Chance is 1.00% (100 instances).

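| Modality | V | A | T | V+A | V+T | A+T | V+A+T |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Acc (%) | 94.8 | 98.3 | 72.4 | 99.5 | 97.2 | 99.0 | 99.8 |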

Surprise: audio alone (98.3) beats vision alone (94.8) — a single impact sound encodes shape/size/material well. Touch is weakest (a single touch is locally informative but not globally discriminative). All three fused is best.

Cross-sensory retrieval (Table 3, mAP; chance 0.05). ObjectFolder embeddings beat CCA across all six directions, e.g. Vision→Audio 0.90 (CCA 0.57), Audio→Vision 0.92 (CCA 0.59). Touch-involving pairs are lower (Vision→Touch 0.50, Touch→Vision 0.48), echoing touch's locality.
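For reference, this is how a retrieval mAP of this kind is typically computed: rank the gallery modality by cosine similarity to each query embedding, then average the precision at every rank where a correct item appears. It is a generic sketch, not the paper's evaluation code; the function name and array layout are assumptions.

```python
import numpy as np

def mean_average_precision(queries, gallery, query_labels, gallery_labels):
    """mAP for retrieval across modalities using cosine similarity.

    queries:        (Q, D) embeddings from the source modality
    gallery:        (G, D) embeddings from the target modality
    query_labels:   (Q,) object id of each query
    gallery_labels: (G,) object id of each gallery item
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                                        # (Q, G) cosine similarities
    gallery_labels = np.asarray(gallery_labels)
    aps = []
    for i in range(len(q)):
        order = np.argsort(-sims[i])                      # most similar first
        hits = (gallery_labels[order] == query_labels[i]).astype(float)
        if hits.sum() == 0:
            continue                                      # no relevant item for this query
        precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
        aps.append((precision_at_k * hits).sum() / hits.sum())
    return float(np.mean(aps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vision = rng.normal(size=(20, 16))
    audio = vision + 0.1 * rng.normal(size=(20, 16))      # toy aligned embeddings
    labels = np.arange(20)
    print(mean_average_precision(vision, audio, labels, labels))
```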

Audio-visual 3D reconstruction (Table 4). Image+Audio2Mesh is best (IoU 0.8906, Chamfer-L1 0.0043, Normal Consistency 0.9535), edging Image2Mesh (0.8809 / 0.0046 / 0.9522): audio adds acoustic spatial cues on top of single-view vision. Audio alone (Audio2Mesh) is already meaningfully better (IoU 0.8729) than the Average baseline (0.0675). The Image2Mesh model also transfers to real-world images (Fig 7), though it fails on out-of-distribution shapes (a teapot).
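As a reminder of what the IoU column measures, here is a minimal volumetric IoU on boolean occupancy grids. The grid-based formulation and the function name are assumptions for illustration; the paper's exact evaluation protocol (for example, how occupancy is sampled) is not reproduced here.

```python
import numpy as np

def volumetric_iou(occ_pred, occ_true):
    """Intersection over union of two boolean occupancy grids of equal shape."""
    pred = np.asarray(occ_pred, dtype=bool)
    true = np.asarray(occ_true, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return 1.0                      # both empty: treat as a perfect match
    return float(np.logical_and(pred, true).sum() / union)


if __name__ == "__main__":
    pred = np.zeros((4, 4, 4), dtype=bool)
    true = np.zeros((4, 4, 4), dtype=bool)
    pred[:2] = True                     # slabs 0-1 occupied
    true[1:3] = True                    # slabs 1-2 occupied
    print(volumetric_iou(pred, true))
```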

Robotic grasping (§4.4, Fig 9). Touch reaches high grasp-stability accuracy with far less data than vision; vision+touch is best. A TRPO touch-based grasping policy gets 75.5% success vs. 53.0% for random. A Meta-World reach task on three objects (cup, bowl, dice) hits 100%.

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is an infrastructure / dataset anchor, not a manipulation-policy or abstraction-learning paper — so its relevance to BLADE is foundational rather than methodological. The thesis it serves: many manipulation predicates (is_grasped, is_inserted, surface_is_rough, is_screwed_tight) are not visually evaluable — they live in touch, force, and sound. ObjectFolder is the original argument that the touch and sound channels can be virtualized and learned object-centrically at all. BLADE learns visual predicate classifiers; the obvious extension is multisensory predicate classifiers, and ObjectFolder (and its 2.0 / Benchmark successors) is the canonical asset for prototyping that without a real multisensory rig.

Concretely useful angles: (1) its instance-recognition result that audio alone out-predicts vision is a clean motivation slide for "looking isn't enough"; (2) the implicit-representation packaging (one small network = full multisensory object profile) is an interesting design pattern for storing learned per-object sensory models; (3) it is the head of the ObjectFolder line (2.0 sim-to-real, the real-object Benchmark) that the rest of cluster I tracks. I should not overclaim: there is no language, no long-horizon planning, and no predicate learning here.

Quotable

While there has been significant progress by "looking" — recognizing objects based on glimpses of their visual appearance or 3D shape — objects in the world are often modeled as silent and untouchable entities. — §1, Introduction / p.2
Surprisingly, the audio classifier achieves the best results. This confirms that our audio simulation pipeline well models the shape, size, and material property of the object instance. — §4.1, Multisensory Instance Recognition / p.6

Related

Papers cited that should likely be ingested next:

Newly ingested in the 2026-06-24 batch — directly relevant: