Interactive Perception: Leveraging Action in Perception and Perception in Action

J. Bohg*, K. Hausman*, B. Sankaran*, O. Brock, D. Kragic, S. Schaal, G. Sukhatme · 2017 · IEEE Transactions on Robotics (T-RO) · arXiv:1604.03670

*Authors contributed equally and are listed alphabetically. Bohg & Schaal at Max Planck Institute for Intelligent Systems; Hausman, Sankaran, Schaal & Sukhatme at USC; Bohg also at Stanford; Kragic at KTH; Brock at TU Berlin.

One-liner. The defining survey of Interactive Perception (IP): a position paper arguing that robots should perceive by acting, because forceful interaction (i) creates novel, otherwise-absent sensory signals (haptic, audio, motion correlated with action) and (ii) exposes a regularity in the joint space of sensor data, action parameters, and time (S × A × t) that makes perception simpler and more robust. It then classifies ~120 papers by which of these two benefits they exploit, plus further axes such as priors, action selection, and uncertainty modeling.

Problem & motivation

Mainstream computer vision treats perception as a disembodied, passive mapping from static images to labels, which renders many perception problems under-constrained and data-hungry. But biological perception is intrinsically active and exploratory — the paper opens with the Held & Hein kitten-carousel experiment (only the actively-moving kitten developed visually-guided behavior) and Gibson's finding that subjects identify irregular 3D objects far better when allowed to touch and rotate them (49% from a single image → 99% with touch). Robots, unlike vision algorithms, are embodied agents that can physically interact, which generates richer, action-correlated sensory signals that "would otherwise not be present." The paper postulates exploiting this as a principle for robot perception, and surveys the field to define it and seed benchmarks.

Framework / taxonomy

The conceptual core is two postulated benefits of forceful interaction (any action exerting a possibly time-varying force on the environment):

Create Novel Signals (CNS). Forceful interaction creates rich sensory signals that would otherwise be absent — e.g. motion cues that segment objects, haptic/audio data correlated over time that reveal object weight, surface material, or rigidity. These signals are useful precisely for quantities tied to manipulation (mass, friction, articulation, softness) that a static image cannot expose.
Action–Perception Regularity (APR). Forceful interaction reveals a repeatable, multi-modal regularity in the joint space S × A × t of sensory data S and action parameters A over time t. Although that space is much higher-dimensional than S alone, the signal it carries has more structure. Knowing this regularity amounts to having a causal action → sensory-response model, which lets a robot (i) predict the sensory signal given an action and the environment's properties, (ii) update beliefs about latent environment state by comparing prediction to observation, and (iii) infer the applied action from the observed signal. Having such a model also enables optimal action selection.
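
The three uses of the regularity listed above can be illustrated with a toy example. This sketch is not taken from the paper: it assumes a frictionless 1-D push in which acceleration equals force divided by mass, Gaussian sensor noise, and a discrete belief over a few candidate masses. The function names, the noise level, and the numbers are all hypothetical.

```python
import math

def predict_acceleration(force, mass):
    """(i) Predict the sensory signal from the action and an environment property (toy model a = F/m)."""
    return force / mass

def gaussian_likelihood(observed, predicted, sigma):
    """Likelihood of an observation under Gaussian sensor noise (assumed noise model)."""
    z = (observed - predicted) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def update_belief(belief, force, observed_acc, sigma=0.2):
    """(ii) Bayes update of a discrete belief {mass: probability} by comparing prediction to observation."""
    posterior = {
        m: p * gaussian_likelihood(observed_acc, predict_acceleration(force, m), sigma)
        for m, p in belief.items()
    }
    total = sum(posterior.values())
    return {m: p / total for m, p in posterior.items()}

def infer_force(mass, observed_acc):
    """(iii) Invert the same model to infer the applied action from the observed signal."""
    return mass * observed_acc

# Uniform prior over three hypothetical candidate masses (kg).
belief = {0.5: 1 / 3, 1.0: 1 / 3, 2.0: 1 / 3}
# Push with 2 N and observe about 2 m/s^2, which is close to what a 1 kg object would produce.
belief = update_belief(belief, force=2.0, observed_acc=2.05)
best_mass = max(belief, key=belief.get)
print(best_mass, round(belief[best_mass], 3))
```

The same forward model serves all three purposes, which is the point of the APR postulate: the structure lives in the joint (sensor, action) relationship rather than in the sensor reading alone.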

Distinction from Active Perception (AP). §III relates IP to the older Active Perception tradition (Bajcsy 1988; "an agent is an active perceiver if it knows why it wishes to sense, then chooses what/how/when/where to achieve that perception"). AP largely manipulates the sensor (camera viewpoint, zoom) within a vision-centric frame; IP manipulates the environment itself with forceful, often non-visual interaction, and cares about goal-directed manipulation, not just simplifying a vision problem. Fig. 3 places approaches in an S × A × t grid: sensorless manipulation (action only), perception of images/video (sensing only), AP (sensing + sensor-only action), and Active Haptic / Interactive Perception (sensing + environment-changing action over time).

Application areas (§IV, Fig. 4). The survey groups IP work into nine areas: object segmentation, articulation-model estimation, object dynamics learning & haptic-property estimation, object recognition/categorization, multimodal object-model learning, object pose estimation, grasp planning, manipulation-skill learning, and state-representation learning. Running Lego-block and sphere examples (Figs. 5–6) show the same percept (count objects; estimate weight) going from impossible-when-passive to reliable once the robot pushes/lifts and observes the resulting signal.
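
The object-counting case can be made concrete with a toy sketch. It is not the method of any surveyed paper: it assumes tracked feature points with known positions before and after a push, pure translation, and a greedy grouping of points whose displacement matches that of a group's first member within a tolerance. The function name, the tolerance, and the coordinates are hypothetical.

```python
def group_by_displacement(before, after, tol=0.5):
    """Greedily group tracked points whose frame-to-frame displacement agrees within tol."""
    groups = []  # each group: {"motion": (dx, dy) of its first member, "points": [indices]}
    for i, ((x0, y0), (x1, y1)) in enumerate(zip(before, after)):
        d = (x1 - x0, y1 - y0)
        for g in groups:
            gx, gy = g["motion"]
            if abs(d[0] - gx) <= tol and abs(d[1] - gy) <= tol:
                g["points"].append(i)
                break
        else:
            # No existing group moves like this point, so it starts a new one.
            groups.append({"motion": d, "points": [i]})
    return groups

# Two adjacent blocks that could look like one blob in a static image (hypothetical pixel coordinates).
before = [(0, 0), (1, 0), (0, 1), (2, 0), (3, 0), (2, 1)]
# The robot pushes the right-hand block 4 px along +x; the left-hand block stays put.
after = [(0, 0), (1, 0), (0, 1), (6, 0), (7, 0), (6, 1)]

groups = group_by_displacement(before, after)
print(len(groups), [g["points"] for g in groups])
```

Before the push every point has the same (zero) motion and the scene is ambiguous; after it, the action-correlated motion splits the points into one group per object.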

Taxonomy axes (§V, Tables 8–9). Every surveyed paper is classified by: (1) how the S × A × t signal is leveraged (CNS vs APR — APR implies CNS); (2) what priors are employed (rigid / articulated / deformable objects, planar motion, action primitives, simple dynamics); (3) whether it performs action selection (myopic / planning-horizon / global-policy via POMDP/MDP/RL); (4) objective (perception, manipulation, or both); (5) whether multiple sensor modalities are exploited; and (6) how uncertainty is modeled — four labels DDM/SDM (deterministic/stochastic dynamics) × DOM/SOM (deterministic/stochastic observations), plus EU (explicitly estimates uncertainty). Fig. 7 plots approaches on a spectrum from "weak priors, general" (optical flow, brightness constancy) to "strong priors, specific" (joint dynamics models) by how much of APR vs prior knowledge they lean on.

Setup

Datasets / benchmarks: not reported — this is a survey; it explicitly calls for benchmarks to be established (it contrasts IP with the static-image benchmarks PASCAL VOC, ImageNet, MS COCO that drove passive vision).
Hardware / simulator: not reported (survey). Across the cited work it discusses manipulators with vision, tactile/force-torque, proximity, and contact-audio sensing; the motivating application is the DARPA Robotics Challenge (whole-body, multi-contact, unstructured environments).
Baselines: not reported (survey). The implicit foil throughout is passive / disembodied computer vision and sensorless manipulation.
Compute: not reported.

Results

As a survey, the "result" is the framework itself plus the organized evidence base. Key synthesized claims:

The two postulates (CNS and APR) cleanly separate IP from Active Perception, sensorless manipulation, and passive vision, and serve as inclusion criteria for whether a paper "is" IP.
Across application areas, interaction repeatedly converts an under-constrained perception problem into a tractable one: object segmentation via motion cues (Fitzpatrick & Metta; Kenney et al.; Hausman et al.), articulation-model estimation from observed relative motion, inertial/material-property estimation from push/lift signals (Atkeson et al.; Sinapov et al.), and pose estimation via touch/manifold particle filters (Koval et al.; Javdani et al.).
Multi-modal IP (vision + tactile + force/torque + audio + proprioception) is flagged as relatively rare: most surveyed work is still vision-only (Tables 8–9), with non-visual modalities concentrated in haptic-property and dynamics-learning clusters.
Action selection ranges from myopic (one-step, e.g. next-best-touch) to planning-horizon (MLE-observation look-ahead) to global policies (POMDP/RL), with a trade-off: richer uncertainty models cope better with noisy sensing/dynamics but cost more to compute.
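
The myopic end of that range can be sketched as a one-step look-ahead over a discrete belief. This example is an illustration, not the survey's algorithm: it swaps in expected posterior entropy (an information-gain criterion) as the selection rule, and the hypotheses, actions, and observation probabilities are invented for the example.

```python
import math

def entropy(belief):
    """Shannon entropy in bits of a discrete belief {hypothesis: probability}."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)

def expected_entropy(belief, obs_model):
    """Expected posterior entropy after one action.

    obs_model maps each possible observation to {hypothesis: p(observation | hypothesis, action)}.
    """
    expected = 0.0
    for likelihood in obs_model.values():
        p_obs = sum(belief[h] * likelihood[h] for h in belief)
        if p_obs == 0:
            continue  # this observation cannot occur under the current belief
        post = {h: belief[h] * likelihood[h] / p_obs for h in belief}
        expected += p_obs * entropy(post)
    return expected

def select_action_myopic(belief, actions):
    """One-step look-ahead: pick the action with the lowest expected posterior entropy."""
    return min(actions, key=lambda a: expected_entropy(belief, actions[a]))

# Hypothetical problem: is the object rigid or soft?
belief = {"rigid": 0.5, "soft": 0.5}
actions = {
    # Looking yields the same image under both hypotheses, so it is uninformative.
    "look": {"same_image": {"rigid": 1.0, "soft": 1.0}},
    # Squeezing yields a force response that depends strongly on the hypothesis.
    "squeeze": {
        "deforms": {"rigid": 0.1, "soft": 0.9},
        "resists": {"rigid": 0.9, "soft": 0.1},
    },
}

print(select_action_myopic(belief, actions))
```

A planning-horizon or global-policy method would extend this by scoring sequences of actions or learning a policy, at correspondingly higher computational cost.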

Limitations & open questions

From the authors (§VI):

No unifying framework yet. The authors conclude there is no single formalism that addresses all IP challenges; candidates include Object-Action Complexes (OACs), affordances, and decision-theoretic frameworks (MDP/POMDP/PSR/bandits), but each only covers a slice.
Balancing perception and manipulation. How to find manipulation actions that simultaneously achieve a task and gather information (the exploration/exploitation trade-off) is open — closely analogous to RL.
Right representations/features for IP. Existing visual features (edges/corners) and tracking were designed for passive, static scenes; what the analogous "fundamental" features are for dynamic, interaction-driven, multi-modal (haptic/acoustic) perception is unanswered.
Deciding which modality to use, when, and at what cost (passive vs active vs interactive sensing each carry different acquisition cost and expected information gain) is unresolved.

What I noticed reading it (2017 vintage):

The paper precedes the LLM/VLM/foundation-model era entirely. Its "regularity in S × A × t" is exactly the structure a modern language-conditioned multisensory policy (FuSe, Tactile-VLA) or a learned change-predicate would try to capture — but here it is framed as a model to hand-design or estimate, not to learn from web-scale data.
"Time" (t) is in the formalism but the surveyed work is mostly about static-property recovery (mass, material, articulation) via a single probing interaction. Dynamic state-transition monitoring ("became slippery", "now boiling") is named in spirit (sensory signals correlated over time) but no surveyed system continuously tracks a semantic state-change to drive replanning — the closed-loop semantic correction loop is left implicit.
The framework is descriptive, not predictive: it organizes a decade of work but offers no algorithm, which is why it lives or dies on whether the CNS/APR distinction is genuinely load-bearing for designing new systems (I think it is — "which novel signal does this action create, and what regularity lets me read it?" is a good design question).

Why I care

This is the principled, conceptual root of the whole "use forceful action to create informative, often non-visual signals that reveal object state" idea — it predates the learned change-predicates and LLM/VLM grounding by the better part of a decade and gives them a precise vocabulary. For the dynamic state-change concept grounding direction (grounding "hot enough now", "became slippery after wetting", "coming to a boil" on tactile/audio/thermal/force sensing, monitored over time, to drive closed-loop semantic replanning), this paper is the foundation on several axes:

The CNS argument = why non-visual sensing is needed at all. "Forceful interactions create novel, rich sensory signals that would otherwise not be present" is the first-principles justification for grounding manipulation predicates on touch/audio/thermal rather than vision. Octopi, Tactile-VLA, Audio-VLA, and the pouring/boiling acoustic work all instantiate CNS for specific modalities.
The APR regularity = what BLADE-style predicate classifiers implicitly learn. APR says the joint S × A × t signal is structured and predictable. BLADE's visual predicate classifiers and contact-primitive controllers are a learned, data-driven realization of this regularity — but BLADE reads it visually and grounds static predicates from first/last (well, every) frames. This survey is the conceptual mandate to push those predicates onto non-visual modalities and onto state transitions tracked over t.
Direct authorship lineage. Oliver Brock and Jeannette Bohg are both co-authors; Bohg is also senior author on Grounding Predicates through Actions (Migimatsu & Bohg), BLADE's direct ablation target. This survey is the worldview those predicate-grounding papers grew out of.

Where it threatens / falls short of the idea (the honest gap): the surveyed systems do passive-recognition-after-probing of mostly static properties; almost none do continuous monitoring of a semantic state-transition feeding back into symbolic replanning. It supplies the philosophy (act to perceive; non-visual signals; the S × A × t regularity) but not the closed-loop semantic-correction machinery that REFLECT, Inner Monologue, and BLADE's replan-after-each-behavior loop start to provide. So: cite it as the principled foundation, then position the new idea as "instantiate APR for dynamic, language-named state transitions on multimodal sensing, and close the loop into a planner" — the step the 2017 survey explicitly flags as open (balancing perception and manipulation; which modality, when).

Quotable

Recent approaches in robot perception follow the insight that perception is facilitated by interaction with the environment. These approaches are subsumed under the term Interactive Perception (IP). — Abstract

First, interaction with the environment creates a rich sensory signal that would otherwise not be present. Second, knowledge of the regularity in the combined space of sensory data and action parameters facilitates the prediction and interpretation of the sensory signal. — Abstract

Forceful interactions reveal regularities in the combined space (S × A × t) of sensor information (S) and action parameters (A) over time (t). … Therefore despite S × A × t being much higher dimensional, the signal represented in this space has more structure. — §II-B, Action Perception Regularity

When performing manipulation tasks, humans aptly combine different sources of information … visual information, haptic feedback, and acoustic signals. Research in Interactive Perception is currently mostly concerned with visual information. — §VI-A, Remaining Challenges

Papers cited in this survey that are candidates to ingest (not yet in corpus):

Bajcsy, Aloimonos & Tsotsos — "Revisiting Active Perception" (CoRR 2016, [22]) — the companion modern statement of the older AP tradition that IP defines itself against.
Krüger et al. — "Object-Action Complexes (OACs)" (RAS 2011, [131]) — named by the authors as a leading candidate framework for symbolically representing continuous sensorimotor experience; directly relevant to grounding symbolic predicates in sensorimotor signals.
Gibson — The Ecological Approach to Visual Perception ([1]) and Noë — Action in Perception ([3]) — the affordance / sensorimotor-contingency philosophical roots.
Sinapov, Schenck & Stoytchev — "Grounding semantic categories in behavioral interactions" / "Learning relational object categories" ([59], [60]) — multi-modal property learning over 100 objects via interaction; close to the "ground object-state concepts on multisensory probing" thread.
Chu et al. — "Robotic learning of haptic adjectives through physical interaction" (RAS 2015, [42]) — grounds language haptic adjectives (an object-state-concept analogue) on tactile signals from interaction.