ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations

Ruohan Gao, Yen-Yu Chang*, Shivani Mall*, Li Fei-Fei, Jiajun Wu · Stanford University · CoRL 2021 · arXiv:2109.07991

One-liner. ObjectFolder packages 100 virtualized 3D objects each as a single Object File — a small triplet of neural networks (VisionNet / AudioNet / TouchNet) you query with extrinsic parameters (viewpoint, contact location) to get that object's rendered image, impact sound, and tactile reading — turning multisensory object perception into a shareable, storage-cheap implicit-representation benchmark instead of a pile of raw renders and recordings.

Problem & motivation

Multisensory, object-centric perception is bottlenecked by data. Synthetic object datasets (ShapeNet, ModelNet) carry geometry but little realistic texture and no sound or touch; real manipulation object sets (YCB, BigBIRD) are expensive, unstable to acquire (shipping, inventory), and hard to virtualize across sensory modes. The world gets modeled as "silent and untouchable" because we only have looking-based data. ObjectFolder's goal is a community-accessible standard benchmark of 3D objects that are (1) easy to share, (2) high-quality in visual texture, and (3) augmented with realistic auditory and tactile sensory data — all under one uniform, object-centric, implicit representation.

Method

Each object is an Object File: a compact implicit neural representation with three sub-networks (Fig 3). You feed extrinsics; the networks emit sensory data. The "object file" framing is borrowed from Kahneman et al. [22] — the information received about an object at a location.

Objects. 100 high-quality 3D objects from online repositories: 20 from 3D Model Haven, 28 from YCB, 52 from Google Scanned Objects. Each annotated with a material type (ceramic, glass, wood, plastic, iron, polycarbonate, steel) that drives audio simulation. Common household categories (bowl, mug, cabinet, television, shelf, fork, spoon).

VisionNet. A NeRF-style object-centric neural scattering function (following object-centric NeRF [27]). The input is a 7D coordinate (x,y,z, φ_i,θ_i, φ_o,θ_o), i.e. a 3D location plus incoming and outgoing light directions; an MLP F_v maps it to volume density σ and per-channel scatter fractions (ρ_r,ρ_g,ρ_b). Classic volume rendering (Eq 1–2) produces images at arbitrary viewpoint and lighting. Training images are rendered in Blender Cycles with the object normalized to a unit cube, a point light on a unit sphere, a white background, and viewpoints on a full sphere. Uses stratified sampling, positional encoding, and hierarchical volume sampling.
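The volume-rendering step can be illustrated with the standard NeRF-style quadrature. This is a minimal sketch, not the paper's Eq 1–2: it assumes the usual discretization (segment opacity 1 - exp(-σδ), weighted by accumulated transmittance) and collapses the light-dependent scattering into a single per-sample radiance. The function name `composite_ray` and all sample values are hypothetical.

```python
import math

def composite_ray(sigmas, radiances, deltas):
    """Composite per-sample densities and RGB radiances along one camera ray.

    sigmas:    volume density at each sample
    radiances: (r, g, b) radiance leaving each sample toward the camera
    deltas:    distance between adjacent samples
    Returns the accumulated RGB colour and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that survives up to the current sample
    color = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(sigmas, radiances, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this ray segment
        w = transmittance * alpha
        weights.append(w)
        for c in range(3):
            color[c] += w * rgb[c]
        transmittance *= 1.0 - alpha
    return color, weights

# Hypothetical ray: empty space, then a dense red surface, then a green sample behind it.
color, weights = composite_ray(
    sigmas=[0.0, 50.0, 50.0],
    radiances=[(0, 0, 1), (1, 0, 0), (0, 1, 0)],
    deltas=[0.1, 0.1, 0.1],
)
```

The first sample contributes nothing because its density is zero, and the red surface absorbs almost all of the weight, so the occluded green sample barely shows in the result.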

AudioNet. Physics-based rigid-body impact sound via linear modal analysis [23]. The surface mesh is converted to a volumetric hexahedron mesh (N voxels); the linear elastic dynamics Mü + Cu̇ + Ku = f (Eq 3) is solved by generalized eigenvalue decomposition into a bank of damped sinusoids, one per vibration mode (Eq 4–6), with frequency and damping set by material parameters (density, Young's modulus, Poisson's ratio, Rayleigh damping). Modes are excited by unit forces f₁,f₂,f₃ at every vertex. For each force direction a separate MLP branch encodes the modes signal: input is vertex coordinate (x,y,z) + spectrogram location (f,t); it predicts the real and imaginary parts of the complex spectrogram, inverted via ISTFT. (They found predicting the spectrogram works better than predicting the modes signal directly.) At test time an arbitrary force is decomposed into the three unit-force amplitudes and combined per vertex.
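The two ideas in this paragraph, a bank of damped sinusoids and the per-axis force decomposition, can be sketched as follows. This assumes the common form s(t) = Σ_k g_k · exp(-d_k t) · sin(2π f_k t) and a plain linear superposition of the three unit-force responses; the exact parametrization of Eq 4–6 may differ, and all mode frequencies, dampings, and gains below are made-up illustrative numbers.

```python
import math

def modal_signal(freqs_hz, dampings, gains, sample_rate=44100, duration_s=0.05):
    """Sum of exponentially damped sinusoids, one per vibration mode."""
    n = int(sample_rate * duration_s)
    out = []
    for i in range(n):
        t = i / sample_rate
        out.append(sum(
            g * math.exp(-d * t) * math.sin(2.0 * math.pi * f * t)
            for f, d, g in zip(freqs_hz, dampings, gains)
        ))
    return out

def combine_unit_responses(force, unit_signals):
    """Scale each unit-force response by the force component along its axis and add them."""
    n = len(unit_signals[0])
    return [sum(force[a] * unit_signals[a][i] for a in range(3)) for i in range(n)]

# Hypothetical modes and per-axis gains at one vertex (illustrative numbers only).
freqs = [800.0, 2100.0]
damps = [60.0, 150.0]
unit = [modal_signal(freqs, damps, g) for g in ([1.0, 0.2], [0.3, 0.5], [0.0, 0.1])]
sound = combine_unit_responses((0.5, 0.0, 2.0), unit)
```

In the paper the per-axis responses come from the AudioNet branches as predicted spectrograms rather than from closed-form sinusoids; the sketch only shows why three unit-force responses per vertex are enough to synthesize the sound for any force direction under a linear model.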

TouchNet. Tactile readings are simulated with the vision-based touch simulator TACTO [25], configured to a DIGIT [24] sensor (chosen for compactness and high resolution, with relevance to in-hand manipulation). DIGIT is pressed at each surface vertex along the normal, force constrained to a small range so renderings don't vary much across forces, yielding an RGB tactile image capturing local contact geometry. TouchNet is an MLP taking vertex (x,y,z) + tactile image location (w,h) and predicting the per-pixel RGB of the tactile image.
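The query pattern described here, one network evaluation per tactile pixel with the contact vertex held fixed, can be sketched as below. The callable `dummy_touchnet` is a placeholder for a trained MLP, and normalizing pixel coordinates to [0, 1] is an assumption of this sketch rather than something stated in the paper.

```python
def render_tactile_image(touchnet, vertex, height, width):
    """Query an implicit tactile model once per pixel and assemble an RGB image.

    touchnet: any callable (x, y, z, w, h) -> (r, g, b); stands in for the trained MLP
    vertex:   contact location (x, y, z) on the object surface
    """
    x, y, z = vertex
    image = []
    for row in range(height):
        h = row / (height - 1)  # pixel row normalized to [0, 1] (assumed convention)
        image.append([touchnet(x, y, z, col / (width - 1), h) for col in range(width)])
    return image

def dummy_touchnet(x, y, z, w, h):
    """Placeholder for a trained network: a smooth function of its five inputs."""
    return (w, h, min(1.0, abs(x) + abs(y) + abs(z)))

tactile = render_tactile_image(dummy_touchnet, vertex=(0.1, -0.2, 0.3), height=4, width=6)
```

Because the image is produced pixel by pixel from a function of (x, y, z, w, h), nothing is stored per contact location, which is where the storage savings of the implicit form come from.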

Setup

Datasets / benchmarks: ObjectFolder itself — 100 implicitly-represented objects (20 3D Model Haven, 28 YCB, 52 Google Scanned Objects) with paired vision/audio/touch. Evaluated on four benchmark tasks: multisensory instance recognition, cross-sensory retrieval, audio-visual 3D reconstruction, robotic grasping (grasp-stability prediction + a reach task).
Hardware / simulator: Blender Cycles path tracer for vision; linear modal analysis for audio; TACTO simulator with a DIGIT vision-based tactile sensor for touch. Robotic grasping uses a simulated arm with left/right fingers; the reach experiment uses Meta-World [55]. No physical robot hardware — all in virtual environments.
Baselines: Per task — instance recognition: Chance, single-modality (V/A/T) and fused ResNet-18 classifiers; cross-sensory retrieval: Chance and CCA [53]; 3D reconstruction: an Average-mesh baseline, Image2Mesh [30] (Occupancy Networks), Audio2Mesh, Image+Audio2Mesh; grasping: vision-only / touch-only / vision+touch classifiers, plus a random grasping policy vs. a TRPO [54] touch policy.
Compute: not reported.

Results

The paper's headline is a storage argument plus a suite of small benchmark demonstrations that multisensory beats single-modality.

Storage (Table 1). Implicit Object Files are far smaller than raw data: Vision 7.2 MB (vs. effectively infinite for all-viewpoint/lighting renders), Audio 6.3 MB (vs. 12.6 GB), Touch 2.1 MB (vs. 2.9 GB).
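The implied compression factors are worth working out. The sketch below uses only the Table 1 numbers quoted above and assumes 1 GB = 1000 MB and that the three network sizes add up to one Object File; vision is left out of the ratios because its raw size is unbounded.

```python
# Sizes as quoted from Table 1; 1 GB is taken as 1000 MB (an assumption about the units).
MB_PER_GB = 1000.0
sizes_mb = {
    "audio": {"implicit": 6.3, "raw": 12.6 * MB_PER_GB},
    "touch": {"implicit": 2.1, "raw": 2.9 * MB_PER_GB},
}

# How many times smaller the implicit network is than the raw data it replaces.
ratios = {m: s["raw"] / s["implicit"] for m, s in sizes_mb.items()}

# Vision + audio + touch networks together (assumed to make up one Object File).
object_file_mb = 7.2 + 6.3 + 2.1

for modality, ratio in ratios.items():
    print(f"{modality}: ~{ratio:.0f}x smaller")
print(f"one Object File: ~{object_file_mb:.1f} MB")
```

That works out to roughly 2,000x for audio and about 1,400x for touch, with all three networks together at about 15.6 MB.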

Multisensory instance recognition (Table 2, Acc %). Chance is 1.00% (1 in 100 objects).

Modality	V	A	T	V+A	V+T	A+T	V+A+T
Acc (%)	94.8	98.3	72.4	99.5	97.2	99.0	99.8

Surprise: audio alone (98.3) beats vision alone (94.8) — a single impact sound encodes shape/size/material well. Touch is weakest (a single touch is locally informative but not globally discriminative). All three fused is best.

Cross-sensory retrieval (Table 3, mAP; chance 0.05). ObjectFolder embeddings beat CCA across all six directions, e.g. Vision→Audio 0.90 (CCA 0.57), Audio→Vision 0.92 (CCA 0.59). Touch-involving pairs are lower (Vision→Touch 0.50, Touch→Vision 0.48), echoing touch's locality.
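For reference, mAP over ranked retrieval lists can be computed as below. This is the textbook definition (mean over queries of the average precision at each correct hit), assumed here rather than taken from the paper's evaluation code; the labels and rankings are invented.

```python
def average_precision(ranked_labels, query_label):
    """AP for one query: mean of precision@k over the ranks k where a correct item appears."""
    hits, precisions = 0, []
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """queries: list of (query_label, ranked_labels) pairs."""
    return sum(average_precision(r, q) for q, r in queries) / len(queries)

# Hypothetical vision -> audio retrieval results for two queries.
queries = [
    ("mug",  ["mug", "bowl", "mug", "fork"]),   # correct items at ranks 1 and 3
    ("bowl", ["fork", "bowl", "mug", "mug"]),   # correct item at rank 2
]
map_score = mean_average_precision(queries)
```

In this toy case the first query scores (1/1 + 2/3) / 2 and the second scores 1/2, giving a mAP of 2/3.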

Audio-visual 3D reconstruction (Table 4). Image+Audio2Mesh is best (IoU 0.8906, Chamfer-L₁ 0.0043, Normal Consistency 0.9535), edging Image2Mesh (0.8809 / 0.0046 / 0.9522): audio adds acoustic spatial cues on top of single-view vision. Audio alone (Audio2Mesh) is already informative (IoU 0.8729) relative to the Average baseline (0.0675). The Image2Mesh model also transfers to real-world images (Fig 7), though it fails on out-of-distribution shapes (a teapot).
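Two of the reconstruction metrics can be sketched as follows. The definitions assume the Occupancy Networks convention (volumetric IoU over occupancy, and Chamfer-L₁ as the mean of the two one-way nearest-neighbour distances); the paper's evaluation samples points from meshes, whereas this sketch takes small hand-written voxel and point sets.

```python
import math

def voxel_iou(occ_a, occ_b):
    """IoU between two occupancy sets given as sets of (i, j, k) voxel indices."""
    union = occ_a | occ_b
    return len(occ_a & occ_b) / len(union) if union else 1.0

def chamfer_l1(points_a, points_b):
    """Symmetric Chamfer distance: average of the two one-way mean nearest-neighbour distances."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))

# Hypothetical predicted vs. ground-truth occupancy and surface samples.
pred_occ = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
gt_occ = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
iou = voxel_iou(pred_occ, gt_occ)

pred_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
chamfer = chamfer_l1(pred_pts, gt_pts)
```

Higher is better for IoU and lower is better for Chamfer, which is why the Image+Audio2Mesh row improves on Image2Mesh in both columns.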

Robotic grasping (§4.4, Fig 9). Touch reaches high grasp-stability accuracy with far less data than vision; vision+touch is best. A TRPO touch-based grasping policy gets 75.5% success vs. 53.0% for random. A Meta-World reach task on three objects (cup, bowl, dice) hits 100%.

Limitations & open questions

From the authors:

Future work is needed for richer object states with more accurate, fine-grained physics.
Additional sensory modalities and textual descriptions are not yet included (flagged as future).
Audio realism is supported only by a user study where simulated audio is preferred 42% of the time vs. real — i.e., not yet indistinguishable from real recordings.

What I noticed reading it:

Only 100 objects, and benchmark experiments often use small subsets (10 objects for grasp-stability, 3 objects for the reach task) — the manipulation claims rest on very small samples and are reported as single success rates, not seed-averaged with variance.
Touch is simulated from TACTO/DIGIT, then re-encoded by TouchNet — a simulation of a simulation. Sim-to-real fidelity of the tactile channel is never quantified (the audio channel at least has a user study; touch does not).
Force on the tactile sensor is deliberately constrained to a narrow range so renderings "do not vary significantly with different forces" — meaning the tactile representation is essentially a static local-geometry map and discards the force/pressure dimension that contact-rich manipulation cares about most.
The audio model is purely impact sound from rigid-body modal analysis — no scraping, rolling, liquid, or sustained-contact sounds, which limits its relevance to many manipulation audio cues.
Tasks are all evaluated in simulation; "robotic grasping" is grasp-outcome classification plus a short RL policy, not real-robot manipulation.

Why I care

This is an infrastructure / dataset anchor, not a manipulation-policy or abstraction-learning paper — so its relevance to BLADE is foundational rather than methodological. The thesis it serves: many manipulation predicates (is_grasped, is_inserted, surface_is_rough, is_screwed_tight) are not visually evaluable — they live in touch, force, and sound. ObjectFolder is the original argument that the touch and sound channels can be virtualized and learned object-centrically at all. BLADE learns visual predicate classifiers; the obvious extension is multisensory predicate classifiers, and ObjectFolder (and its 2.0 / Benchmark successors) is the canonical asset for prototyping that without a real multisensory rig.

Concretely useful angles: (1) its instance-recognition result that audio alone out-predicts vision is a clean motivation slide for "looking isn't enough"; (2) the implicit-representation packaging (one small network = full multisensory object profile) is an interesting design pattern for storing learned per-object sensory models; (3) it is the head of the ObjectFolder line (2.0 sim-to-real, the real-object Benchmark) that the rest of cluster I tracks. I should not overclaim: there is no language, no long-horizon planning, and no predicate learning here.

Quotable

While there has been significant progress by "looking" — recognizing objects based on glimpses of their visual appearance or 3D shape — objects in the world are often modeled as silent and untouchable entities. — §1, Introduction / p.2

Surprisingly, the audio classifier achieves the best results. This confirms that our audio simulation pipeline well models the shape, size, and material property of the object instance. — §4.1, Multisensory Instance Recognition / p.6

Papers cited that should likely be ingested next:

[23] Ren, Yeh & Lin 2013 — Example-guided physically based modal sound synthesis (TOG) — the modal-analysis backbone of AudioNet.
[25] Wang et al. 2020 — TACTO — the vision-based touch simulator TouchNet is built on. forward ref → tacto_tactile_simulator.
[24] Lambeta et al. 2020 — DIGIT — the simulated tactile sensor. forward ref → digit_low_cost_compact_tactile_sensor.
[49] Yuan, Dong & Adelson 2017 — GelSight — the alternative high-res tactile sensor. forward ref → gelsight_high_resolution_tactile_sensors.
[41] Lee et al. 2019 — Making Sense of Vision and Touch — self-supervised multimodal representations for contact-rich tasks. forward ref → making_sense_of_vision_and_touch.
[21] Chen et al. 2020 — SoundSpaces — audio-visual navigation; the 2.0 successor is in this batch. forward ref → soundspaces_2_visual_acoustic.
[26] Mildenhall et al. 2020 — NeRF and [27] Guo et al. 2020 — Object-Centric Neural Scene Rendering — the implicit-vision backbone.

Newly ingested in the 2026-06-24 batch — directly relevant:

ObjectFolder 2.0 — the direct follow-up: scales the dataset and pushes sim-to-real transfer of the multisensory simulations.
ObjectFolder Benchmark — extends the line to real-object multisensory data and standardized benchmark tasks; pairs with this paper's neural Object Files.
Touch and Go — human-collected paired vision-touch in the wild; the real-data counterpart to ObjectFolder's simulated touch.
TACTO and Taxim — the tactile simulators underpinning / competing with ObjectFolder's TouchNet pipeline.
SoundSpaces 2.0 — the acoustic-simulation sibling for audio-visual scenes; complementary modality-simulation infrastructure.
SonicSense and See, Hear, and Feel — downstream multisensory (audio/contact) manipulation work that consumes the kind of cross-modal signal ObjectFolder argues is learnable.