One-liner. SoundSpaces 2.0 is an on-the-fly, geometry-based acoustic renderer that, given any 3D mesh, produces physically-grounded room impulse responses (reflection, reverb, transmission, diffraction, air absorption, binaural spatialization) for arbitrary source/receiver placements — turning a static visual simulator into a joint audio-visual one fast enough for embodied RL training and faithful enough to transfer to the real world.
What we see and hear are tightly coupled: room geometry and surface materials transform every sound that reaches our ears (a marble museum vs. a carpeted bookshop). Embodied-AI and AR/VR systems (a rescue robot localizing a person calling for help, a service robot listening for whether the espresso machine is running) need that visual↔acoustic correspondence in simulation. Prior audio simulators were too rigid. The original SoundSpaces [14] precomputes room impulse responses (RIRs) on a fixed 0.5 m discrete grid for a fixed list of 100 environments, and the precomputed data takes on the order of terabytes of storage; agents can only hop between grid points, configurations (microphones, materials) can't be changed, and it doesn't generalize to new scenes. ThreeDWorld [21] renders continuously but only for oversimplified "shoebox" (rectangular) rooms, so it can't use real-scan meshes. The gap: no geometry-based acoustic simulator was simultaneously high-fidelity, configurable, generalizable to arbitrary meshes, and fast enough for embodied learning.
The core is the RLR-Audio-Propagation engine, integrated
into the Habitat-Sim [63] visual simulator (Fig 1). Given a scene mesh plus
user-specified source/receiver locations, it computes an RIR via a
bidirectional path-tracing algorithm [10] over M
logarithmically-spaced frequency bands. It models reflection, transmission,
and diffraction, plus binaural spatialization. Rays from each source carry
per-band energy and bounce through the scene (reflection via a Phong BRDF,
diffraction probabilistically at specially-constructed edge geometry,
transmission with probability set by material coefficients); rays are then
emitted from the listener and connected to source paths via multiple
importance sampling. Energy is accumulated into an energy-time histogram whose
spherical-harmonic coefficients per bin encode the
directional distribution of arriving sound, which is then spatialized to an
ambisonic or binaural pressure IR [69] and convolved with the source audio.
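The last step of that pipeline (convolving the rendered IR with the source audio) can be sketched in a few lines. This is a minimal NumPy illustration, not the simulator's API: `render_binaural` is a hypothetical helper, and the toy two-channel IR is made up.

```python
import numpy as np

def render_binaural(source: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Convolve mono source audio with a 2-channel IR (hypothetical helper).

    source: shape (n_samples,)
    ir:     shape (2, ir_len), one row per ear
    returns shape (2, n_samples + ir_len - 1)
    """
    return np.stack([np.convolve(source, ch) for ch in ir])

# Toy IR (made up): direct sound at sample 0 (left) and sample 3 (right),
# followed by one weaker reflection per ear.
ir = np.zeros((2, 8))
ir[0, 0], ir[0, 5] = 1.0, 0.3
ir[1, 3], ir[1, 7] = 0.8, 0.2

source = np.array([1.0, 0.0, -1.0])
received = render_binaural(source, ir)
print(received.shape)
```

The later arrival and lower amplitude in the right channel are what give the listener an interaural time and level difference.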
Improvements over original SoundSpaces. (i) Added acoustic
diffraction via the fast diffraction approach [68], fixing abrupt
occlusion artifacts. (ii) Fixed a direct-to-reverberant ratio (DRR) bias
of √(4π) present in the indirect sound pressure of the
original.
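To get a feel for the size of that bias: a pressure factor of √(4π) converts to decibels with the usual 20·log10 rule. The arithmetic below is only that conversion. Linking it to the 11.0 dB average DRR error quoted for the original SoundSpaces in the accuracy results is my reading, not something the notes above state.

```python
import math

bias_pressure = math.sqrt(4 * math.pi)      # amplitude (pressure) factor
bias_db = 20 * math.log10(bias_pressure)    # equivalently 10*log10(4*pi)
print(round(bias_db, 2))  # 10.99
```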
Continuity. Spatial continuity: render
R(s, r, θ) at arbitrary (non-grid) source s,
receiver r, heading θ; received sound is
Aⁿ = Aₛ ∗ R(s, r, θ). Acoustic
continuity: instead of convolving a fresh IR per discrete step (which made the
source effectively stop/restart each second), they index the source audio at
the right delayed sample (tₙ = tᵢ − R(s, xᵢ,
θ) + 1) and apply linear crossfading between successive listener
positions over a window T to smooth transitions.
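The crossfading step can be sketched as follows, assuming the two segments are the same stretch of source audio rendered at the previous and the current listener position. `crossfade` is a hypothetical helper, and treating the window as exactly one segment length is my simplification.

```python
import numpy as np

def crossfade(prev_seg: np.ndarray, next_seg: np.ndarray) -> np.ndarray:
    """Linearly blend two equal-length audio segments.

    The weight on prev_seg ramps 1 -> 0 and the weight on next_seg ramps
    0 -> 1 across the window, so the output starts as prev_seg and ends
    as next_seg.
    """
    if prev_seg.shape != next_seg.shape:
        raise ValueError("segments must have the same shape")
    w = np.linspace(0.0, 1.0, prev_seg.shape[-1])
    return (1.0 - w) * prev_seg + w * next_seg

T = 5                      # window length in samples (toy value)
old = np.ones(T)           # audio rendered at the previous position
new = np.zeros(T)          # audio rendered at the current position
mixed = crossfade(old, new)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```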
Configurability. Exposed parameters: sampling rate, number
of frequency bands, ray counts, whether reflection/transmission/diffraction
are enabled. Microphone types: mono, binaural, ambisonics, or a user-specified
array. Custom HRTFs can be loaded. Material modeling: 29
built-in acoustic materials, each with frequency-dependent absorption,
scattering, and transmission coefficients of the form [f₁,
c₁, ..., fₙ, cₙ], plus distance-dependent air
absorption [8]. This fixes the original's rigid object-category→material
map (e.g., all floors forced to "carpet").
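A small sketch of how a flat coefficient list of that form could be mapped onto logarithmically spaced bands. The list format comes from the text above; the interpolation scheme (linear in log-frequency, clamped at the ends), the helper names, and the example absorption values are all assumptions.

```python
import numpy as np

def parse_coeffs(flat):
    """Split [f1, c1, ..., fn, cn] into frequency and coefficient arrays."""
    arr = np.asarray(flat, dtype=float)
    if arr.size % 2 != 0:
        raise ValueError("expected an even number of values")
    return arr[0::2], arr[1::2]

def coeffs_at_bands(flat, band_centers_hz):
    """Interpolate coefficients onto band centers, linearly in log-frequency.

    Values outside the tabulated range are clamped to the end points
    (this is np.interp's default behavior).
    """
    freqs, coeffs = parse_coeffs(flat)
    return np.interp(np.log(band_centers_hz), np.log(freqs), coeffs)

# Hypothetical absorption curve, not one of the 29 built-in materials.
absorption = [125, 0.10, 500, 0.30, 2000, 0.50]
bands = np.geomspace(125, 2000, num=5)   # 125, 250, 500, 1000, 2000 Hz
band_coeffs = coeffs_at_bands(absorption, bands)
print(band_coeffs)
```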
Two rendering modes (Sec 3.5): high-quality (all parameters maxed, temporal coherence off, gold standard) and high-speed (fewer rays, reuses previously-computed IRs under a locally-continuous-motion assumption [66]) for RL training that needs millions of steps.
SoundSpaces-PanoIR dataset. To serve users who want visual-acoustic data without running the simulator, they render an offline dataset of 10M panoramic image + IR pairs across 750 environments from Matterport3D, Gibson, and HM3D, each with the source's polar coordinate relative to the panorama center (Fig 2).
Simulator capability (Table 1): SoundSpaces 2.0 is the only platform checking all four boxes — Audio-Visual, Geometric, Configurable, and Arbitrary-Env. SoundSpaces 1.0 lacks Configurable and Arbitrary-Env; ThreeDWorld and Pyroomacoustics lack Geometric/Arbitrary-Env; GWA lacks Audio-Visual.
Accuracy vs. real measurements (Sec 5.2). Against the 7 FRL-apartment measurements, SoundSpaces 2.0 reduces average DRR error from 11.0 dB (original SoundSpaces) to 0.98 dB while keeping relative RT60 error at 12.4%; its energy-decay curves track the real measurements much more closely (Fig 3b,c).
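For reference, both metrics can be estimated from a single IR. The notes above do not say how the paper computed them, so the sketch below rests on common conventions that I am assuming: Schroeder backward integration with a line fit between -5 and -25 dB for RT60, and a 2.5 ms window around the strongest peak for the direct sound in DRR.

```python
import numpy as np

def rt60_from_ir(ir, fs, lo_db=-5.0, hi_db=-25.0):
    """Estimate RT60 from the Schroeder energy-decay curve (EDC)."""
    energy = np.asarray(ir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]           # energy remaining after each sample
    edc = np.maximum(edc, np.finfo(float).tiny)   # avoid log10(0) in a silent tail
    edc_db = 10.0 * np.log10(edc / edc[0])
    idx = np.where((edc_db <= lo_db) & (edc_db >= hi_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)   # dB per second (negative)
    return -60.0 / slope

def drr_db(ir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant ratio: energy near the strongest peak vs. what follows."""
    energy = np.asarray(ir, dtype=float) ** 2
    peak = int(np.argmax(energy))
    half = int(round(direct_window_ms * 1e-3 * fs))
    lo, hi = max(0, peak - half), peak + half + 1
    return 10.0 * np.log10(energy[lo:hi].sum() / energy[hi:].sum())

# Synthetic check: an exponential envelope that is 60 dB down at true_rt60.
fs = 16000
t = np.arange(fs) / fs
true_rt60 = 0.5
ir = np.exp(-3.0 * np.log(10.0) * t / true_rt60)
est = rt60_from_ir(ir, fs)
rel_err = abs(est - true_rt60) / true_rt60      # relative RT60 error
```

A relative RT60 error like the 12.4% above would be `rel_err` computed between a simulated and a measured IR rather than against a known envelope.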
Speed/quality tradeoff (Table 2): high-speed mode is 8× (1 thread) or 33× (5 threads) faster than high-quality while losing only 9.5% relative RT60 accuracy; a navigation model trained in high-speed mode and evaluated in high-quality mode differs by <1%.
Benchmark 1 — Continuous Audio-Visual Navigation (Table 3):
| Train | Test | Success % | SPL % | DTG (m) |
|---|---|---|---|---|
| SoundSpaces [14] | Continuous space | 64.2 ± 0.8 | 27.5 ± 0.4 | 5.6 ± 0.2 |
| SoundSpaces [14] | Continuous space & sound | 0.9 ± 0.2 | 0.3 ± 0.1 | 12.9 ± 0.1 |
| SoundSpaces 2.0 | Continuous space & sound | 64.7 ± 3.9 | 49.3 ± 3.0 | 5.9 ± 0.5 |
An agent trained on discrete-sound SoundSpaces collapses (0.9% success) when tested under acoustically-continuous audio, because it leans on an always-present (but unrealistic) direct-sound cue; SoundSpaces 2.0 training restores 64.7%. Note the discrete agent retains decent success but much worse SPL (27.5 vs. 49.3) even on continuous space alone — spatial discretization mainly hurts path efficiency.
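The gap between success and SPL is easier to see with the metric written out. The sketch below assumes the standard definition of SPL (Success weighted by Path Length: success times shortest-path length divided by the larger of actual and shortest path length, averaged over episodes); the three episodes are hypothetical.

```python
import numpy as np

def spl(success, shortest_len, path_len):
    """Success weighted by Path Length, averaged over episodes.

    success:      1 if the episode succeeded, else 0
    shortest_len: shortest-path length from start to goal
    path_len:     length of the path the agent actually took
    """
    s = np.asarray(success, dtype=float)
    l = np.asarray(shortest_len, dtype=float)
    p = np.asarray(path_len, dtype=float)
    return float(np.mean(s * l / np.maximum(p, l)))

# Three hypothetical episodes: optimal success, success with a 2x detour, failure.
success = [1, 1, 0]
shortest = [10.0, 10.0, 10.0]
taken = [10.0, 20.0, 15.0]
success_rate = float(np.mean(success))     # 2/3
spl_value = spl(success, shortest, taken)  # 0.5
```

Two agents can therefore share a success rate while the one that wanders more scores a much lower SPL, which is the pattern in the first and third rows of the table.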
Benchmark 2 — Far-field ASR (Table 4, Word Error Rate):
| Finetuning IRs | WER % |
|---|---|
| Pretrained (none) | 29.10 |
| Real IRs [39] | 13.32 |
| Pyroomacoustics [64] | 16.24 |
| SoundSpaces 1.0 [14] | 18.48 |
| SoundSpaces 2.0 | 12.48 |
Finetuning ASR on SoundSpaces 2.0 synthetic IRs beats finetuning on real IRs (12.48 vs. 13.32 WER) — the headline sim2real result. Adding their acoustic randomization (random per-category material + Gaussian coefficient noise) pushes WER further to 12.04%, whereas naive uniform coefficient sampling hurts (12.58%), showing structured randomization matters.
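A rough sketch of what "random per-category material + Gaussian coefficient noise" could look like. Everything concrete here is an assumption for illustration: the material library, its values, the noise scale `sigma`, and the clipping to [0, 1] are mine, not the paper's settings.

```python
import numpy as np

# Hypothetical material library: name -> absorption coefficient per band.
MATERIALS = {
    "carpet":  np.array([0.10, 0.30, 0.60]),
    "wood":    np.array([0.15, 0.10, 0.07]),
    "plaster": np.array([0.02, 0.03, 0.05]),
}

def randomize_materials(categories, rng, sigma=0.05):
    """Assign each object category one random material, then jitter its
    coefficients with Gaussian noise and clip back to the valid [0, 1] range."""
    names = sorted(MATERIALS)
    assignment = {}
    for cat in categories:
        base = MATERIALS[names[rng.integers(len(names))]]
        noisy = base + rng.normal(0.0, sigma, size=base.shape)
        assignment[cat] = np.clip(noisy, 0.0, 1.0)
    return assignment

rng = np.random.default_rng(0)
scene = randomize_materials(["floor", "wall", "ceiling"], rng)
```

The contrast with the naive baseline is that each perturbed curve stays near a plausible material, whereas uniform sampling of coefficients can produce spectra no real surface has.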
From the authors (Sec 6):
What I noticed reading it:
Off-theme flag: this is an infrastructure/dataset paper for
audio-visual navigation and far-field ASR, not robot manipulation,
and there is no robot, gripper, or contact-rich task in it. I should
not overstate its relevance to BLADE's
long-horizon manipulation / planning-abstraction agenda. That said, it is the
canonical acoustic-rendering substrate for the broader thesis I keep
circling: many manipulation predicates —
is_full (pouring), is_screwed_tight,
is_inserted, surface_is_rough — are not visually
evaluable and live in sound/vibration. If a future BLADE-style system
wants to learn a poured-enough(cup) or contact-made
predicate classifier from audio, it needs (a) realistic acoustic data at scale
and (b) sim2real acoustic transfer. SoundSpaces 2.0 is the cleanest existence
proof that geometry-based acoustic sim can out-transfer real recordings
(the ASR result), which de-risks the "can we even train acoustic predicate
classifiers in sim?" question.
Concrete tangential relevance to my themes:
"To our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning." (Abstract / p.1)
"Finetuning on real IRs also reduces the error substantially, but still not as much as our simulated data, which can be generated at scale across a wide variety of environments. Our simulation generates realistic IRs that help machine learning models generalize better to reality." (§5.4 / p.9)
"At the object level, we can anticipate the sounds an object makes based on how it looks, and vice versa (a dog barks, a door slams, a baby cries). At the environment level, materials and geometry of the surrounding 3D space ... transform the sounds that reach our ears." (§1 / p.1)
Papers cited that should likely be ingested next:
Newly ingested in the 2026-06-24 batch — directly relevant to this work: