One-liner. The defining survey of Interactive Perception
(IP): a position paper arguing that robots should perceive by acting,
because forceful interaction (i) creates novel, otherwise-absent sensory signals
(haptic, audio, motion correlated with action) and (ii) exposes a regularity in
the joint space of sensor data, action parameters, and time
(S × A × t) that makes perception simpler and more
robust — then taxonomizes ~120 papers along these two axes.
Mainstream computer vision treats perception as a disembodied, passive mapping from static images to labels, which renders many perception problems under-constrained and data-hungry. But biological perception is intrinsically active and exploratory — the paper opens with the Held & Hein kitten-carousel experiment (only the actively-moving kitten developed visually-guided behavior) and Gibson's finding that subjects identify irregular 3D objects far better when allowed to touch and rotate them (49% from a single image → 99% with touch). Robots, unlike vision algorithms, are embodied agents that can physically interact, which generates richer, action-correlated sensory signals that "would otherwise not be present." The paper postulates exploiting this as a principle for robot perception, and surveys the field to define it and seed benchmarks.
The conceptual core is two postulated benefits of forceful interaction (any action
that exerts a possibly time-varying force on the environment). The first, CNS, is
that interaction creates novel sensory signals (haptic, audio, motion correlated
with action) that would otherwise not be present. The second, APR (Action
Perception Regularity), is that interaction reveals a regularity in the joint space
S × A × t of sensory data S, action
parameters A, over time t. Although that space is
much higher-dimensional than S alone, the signal it carries has
more structure. Knowing this regularity amounts to having a causal action→
sensory-response model, which lets a robot (i) predict the sensory signal given
an action + environment properties, (ii) update beliefs about latent
environment state by comparing prediction to observation, and (iii) infer the
applied action from the observed signal, which also enables optimal
action selection.

Distinction from Active Perception (AP). §III relates IP to
the older Active Perception tradition (Bajcsy 1988; "an agent is an active
perceiver if it knows why it wishes to sense, then chooses what/how/when/
where to achieve that perception"). AP largely manipulates the sensor
(camera viewpoint, zoom) within a vision-centric frame; IP manipulates the
environment itself with forceful, often non-visual interaction, and
cares about goal-directed manipulation, not just simplifying a vision problem.
Fig. 3 places approaches in an S × A × t grid:
sensorless manipulation (action only), perception of images/video (sensing only),
AP (sensing + sensor-only action), and Active Haptic / Interactive Perception
(sensing + environment-changing action over time).
Application areas (§IV, Fig. 4). The survey groups IP work into nine areas: object segmentation, articulation-model estimation, object dynamics learning & haptic-property estimation, object recognition/categorization, multimodal object-model learning, object pose estimation, grasp planning, manipulation-skill learning, and state-representation learning. Running Lego-block and sphere examples (Figs. 5–6) show the same percept (count objects; estimate weight) going from impossible-when-passive to reliable once the robot pushes/lifts and observes the resulting signal.
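The predict / update / infer uses of the regularity can be made concrete on the weight-estimation example. The following is a minimal sketch, not the survey's method: it assumes a frictionless point mass pushed with a known force, a Gaussian observation model, and a hypothetical grid of candidate masses. All function names and numbers are illustrative.

```python
import math


def predict_accel(force, mass):
    """(i) Predict: assumed forward model, a = F / m for a frictionless point mass."""
    return force / mass


def update_belief(belief, force, observed_accel, noise_std=0.2):
    """(ii) Update: Bayes rule over candidate masses, weighting each hypothesis
    by how well its predicted acceleration matches the observation."""
    posterior = {}
    for mass, prior in belief.items():
        err = observed_accel - predict_accel(force, mass)
        posterior[mass] = prior * math.exp(-0.5 * (err / noise_std) ** 2)
    total = sum(posterior.values())
    return {m: p / total for m, p in posterior.items()}


def infer_force(observed_accel, mass):
    """(iii) Infer: invert the same model to recover the applied action."""
    return observed_accel * mass


# Uniform prior over hypothetical candidate masses (kg).
belief = {0.5: 0.25, 1.0: 0.25, 2.0: 0.25, 4.0: 0.25}
true_mass = 2.0  # unknown to the robot; used here only to simulate observations

for force in (1.0, 2.0, 4.0):
    observed = predict_accel(force, true_mass)  # noise-free, for illustration
    belief = update_belief(belief, force, observed)

best_mass = max(belief, key=belief.get)
```

In this toy setup the predicted accelerations of the candidate masses spread further apart as the push gets stronger, which is one simple way to see why choosing the action well can make the percept easier.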
Taxonomy axes (§V, Tables 8–9). Every surveyed paper is
classified by: (1) how the S × A × t signal is leveraged
(CNS vs APR — APR implies CNS); (2) what priors are employed (rigid /
articulated / deformable objects, planar motion, action primitives, simple
dynamics); (3) whether it performs action selection (myopic / planning-horizon /
global-policy via POMDP/MDP/RL); (4) objective (perception, manipulation, or
both); (5) whether multiple sensor modalities are exploited; and (6) how
uncertainty is modeled — four labels DDM/SDM (deterministic/stochastic
dynamics) × DOM/SOM (deterministic/stochastic observations), plus EU
(explicitly estimates uncertainty). Fig. 7 plots approaches on a spectrum from
"weak priors, general" (optical flow, brightness constancy) to "strong priors,
specific" (joint dynamics models) by how much of APR vs prior knowledge they lean
on.
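One way to hold these six axes as a record per paper is sketched below. This is an assumed encoding for note-keeping, not a structure from the survey: the class and field names are hypothetical, the example entry is invented rather than a real classification, and it assumes every entry carries exactly one dynamics label and one observation label.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    CNS = "CNS"  # only the novel signal created by interaction is used
    APR = "APR"  # the action-perception regularity is used as well


class Selection(Enum):
    NONE = "none"
    MYOPIC = "myopic"
    PLANNING_HORIZON = "planning-horizon"
    GLOBAL_POLICY = "global-policy"


class Objective(Enum):
    PERCEPTION = "perception"
    MANIPULATION = "manipulation"
    BOTH = "both"


@dataclass(frozen=True)
class TaxonomyEntry:
    paper: str
    signal: Signal                 # axis 1
    priors: frozenset              # axis 2, e.g. {"rigid", "planar motion"}
    selection: Selection           # axis 3
    objective: Objective           # axis 4
    multimodal: bool               # axis 5
    stochastic_dynamics: bool      # axis 6: SDM if True, DDM otherwise
    stochastic_observations: bool  # axis 6: SOM if True, DOM otherwise
    estimates_uncertainty: bool    # axis 6: adds the EU label

    def signal_labels(self):
        # APR implies CNS, so an APR entry carries both labels.
        return {"CNS", "APR"} if self.signal is Signal.APR else {"CNS"}

    def uncertainty_labels(self):
        labels = [
            "SDM" if self.stochastic_dynamics else "DDM",
            "SOM" if self.stochastic_observations else "DOM",
        ]
        if self.estimates_uncertainty:
            labels.append("EU")
        return labels


# Invented entry for illustration only; not a classification from the survey.
example = TaxonomyEntry(
    paper="hypothetical-push-to-segment",
    signal=Signal.APR,
    priors=frozenset({"rigid", "planar motion"}),
    selection=Selection.MYOPIC,
    objective=Objective.PERCEPTION,
    multimodal=False,
    stochastic_dynamics=True,
    stochastic_observations=True,
    estimates_uncertainty=True,
)
```

Deriving the CNS label from the APR flag, rather than storing both, keeps the "APR implies CNS" convention from being violated by a hand-entered record.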
As a survey, the "result" is the framework itself plus the organized evidence base. Key synthesized claims:
From the authors (§VI):
What I noticed reading it (2017 vintage):
"S × A × t" is exactly the structure a
modern language-conditioned multisensory policy (FuSe, Tactile-VLA) or a
learned change-predicate would try to capture, but here it is framed as
a model to hand-design or estimate, not to learn from web-scale data.

Time (t) is in the formalism, but the surveyed work is mostly
about static-property recovery (mass, material, articulation) via a
single probing interaction. Dynamic state-transition monitoring
("became slippery", "now boiling") is named in spirit (sensory signals
correlated over time) but no surveyed system continuously tracks a semantic
state-change to drive replanning — the closed-loop semantic
correction loop is left implicit.

This is the principled, conceptual root of the whole "use forceful action to create informative, often non-visual signals that reveal object state" idea. It predates the learned change-predicates and LLM/VLM grounding by the better part of a decade and gives them a precise vocabulary. For the dynamic state-change concept grounding direction (grounding "hot enough now", "became slippery after wetting", "coming to a boil" on tactile/audio/thermal/force sensing, monitored over time, to drive closed-loop semantic replanning), this paper is the foundation on several axes:
The S × A × t signal is structured and predictable.
BLADE's visual predicate classifiers and contact-primitive controllers are a
learned, data-driven realization of this regularity, but BLADE reads it
visually and grounds static predicates from first/last (well,
every) frames. This survey is the conceptual mandate to push those predicates
onto non-visual modalities and onto state transitions tracked over
t.

Where it threatens / falls short of the idea (the honest gap):
the surveyed systems do passive-recognition-after-probing of mostly
static properties; almost none do continuous monitoring of a semantic
state-transition feeding back into symbolic replanning. It supplies
the philosophy (act to perceive; non-visual signals; the
S × A × t regularity) but not the closed-loop
semantic-correction machinery that REFLECT, Inner Monologue, and BLADE's
replan-after-each-behavior loop start to provide. So: cite it as the
principled foundation, then position the new idea as "instantiate APR for
dynamic, language-named state transitions on multimodal sensing, and close the
loop into a planner" — the step the 2017 survey explicitly flags as open
(balancing perception and manipulation; which modality, when).
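To make "continuous monitoring of a state transition feeding back into replanning" concrete, here is a minimal sketch of a monitor for a "became slippery" predicate. It is an illustration only, not something from the survey or from any of the cited systems. It assumes the contact is sliding, so that the ratio of tangential to normal force approximates the kinetic friction coefficient, and the thresholds, smoothing weight, and force readings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlipMonitor:
    """Tracks a hypothetical 'slippery' predicate from force readings over time."""
    enter_below: float = 0.3  # smoothed friction below this sets the predicate
    exit_above: float = 0.5   # it must rise above this to clear (hysteresis)
    smoothing: float = 0.5    # weight given to each new sample
    estimate: Optional[float] = None
    slippery: bool = False

    def step(self, tangential_force, normal_force):
        """Consume one sliding-contact sample; return True if the predicate flipped."""
        if normal_force <= 0:
            raise ValueError("normal force must be positive during contact")
        mu = tangential_force / normal_force  # assumed Coulomb sliding friction
        if self.estimate is None:
            self.estimate = mu
        else:
            self.estimate = (1 - self.smoothing) * self.estimate + self.smoothing * mu
        was_slippery = self.slippery
        if not self.slippery and self.estimate < self.enter_below:
            self.slippery = True
        elif self.slippery and self.estimate > self.exit_above:
            self.slippery = False
        return self.slippery != was_slippery


# Hypothetical (tangential, normal) force readings in newtons; the surface gets wet midway.
readings = [(6.0, 10.0), (5.8, 10.0), (2.0, 10.0), (1.0, 10.0), (1.0, 10.0)]

monitor = SlipMonitor()
replan_at = []  # time steps at which a planner would be asked to replan
for t, (f_t, f_n) in enumerate(readings):
    if monitor.step(f_t, f_n):
        replan_at.append(t)
```

The hysteresis band is a design choice to keep a noisy estimate from toggling the predicate, and with it the planner, on every sample.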
Recent approaches in robot perception follow the insight that perception is facilitated by interaction with the environment. These approaches are subsumed under the term Interactive Perception (IP). — Abstract
First, interaction with the environment creates a rich sensory signal that would otherwise not be present. Second, knowledge of the regularity in the combined space of sensory data and action parameters facilitates the prediction and interpretation of the sensory signal. — Abstract
Forceful interactions reveal regularities in the combined space (S × A × t) of sensor information (S) and action parameters (A) over time (t). … Therefore despite S × A × t being much higher dimensional, the signal represented in this space has more structure. — §II-B, Action Perception Regularity
When performing manipulation tasks, humans aptly combine different sources of information … visual information, haptic feedback, and acoustic signals. Research in Interactive Perception is currently mostly concerned with visual information. — §VI-A, Remaining Challenges
Papers cited in this survey that are candidates to ingest (not yet in corpus):
Related papers already in the corpus: