Interactive Perception: Leveraging Action in Perception and Perception in Action

J. Bohg*, K. Hausman*, B. Sankaran*, O. Brock, D. Kragic, S. Schaal, G. Sukhatme · 2017 · IEEE Transactions on Robotics (T-RO) · arXiv:1604.03670

*Authors contributed equally and are listed alphabetically. Bohg & Schaal at Max Planck Institute for Intelligent Systems; Hausman, Sankaran, Schaal & Sukhatme at USC; Bohg also at Stanford; Kragic at KTH; Brock at TU Berlin.

One-liner. The defining survey of Interactive Perception (IP): a position paper arguing that robots should perceive by acting, because forceful interaction (i) creates novel, otherwise-absent sensory signals (haptic, audio, motion correlated with action) and (ii) exposes a regularity in the joint space of sensor data, action parameters, and time (S × A × t) that makes perception simpler and more robust — then taxonomizes ~120 papers along these two axes.

Problem & motivation

Mainstream computer vision treats perception as a disembodied, passive mapping from static images to labels, which renders many perception problems under-constrained and data-hungry. But biological perception is intrinsically active and exploratory — the paper opens with the Held & Hein kitten-carousel experiment (only the actively-moving kitten developed visually-guided behavior) and Gibson's finding that subjects identify irregular 3D objects far better when allowed to touch and rotate them (49% from a single image → 99% with touch). Robots, unlike vision algorithms, are embodied agents that can physically interact, which generates richer, action-correlated sensory signals that "would otherwise not be present." The paper postulates exploiting this as a principle for robot perception, and surveys the field to define it and seed benchmarks.

Framework / taxonomy

The conceptual core is two postulated benefits of forceful interaction (any action exerting a possibly time-varying force on the environment):

Distinction from Active Perception (AP). §III relates IP to the older Active Perception tradition (Bajcsy 1988; "an agent is an active perceiver if it knows why it wishes to sense, then chooses what/how/when/where to achieve that perception"). AP largely manipulates the sensor (camera viewpoint, zoom) within a vision-centric frame; IP manipulates the environment itself through forceful, often non-visual interaction, and cares about goal-directed manipulation, not just about simplifying a vision problem. Fig. 3 places approaches in an S × A × t grid: sensorless manipulation (action only), perception of images/video (sensing only), AP (sensing + sensor-only action), and Active Haptic / Interactive Perception (sensing + environment-changing action over time).
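The four cells of that grid can be read as a small decision rule. The following is a minimal sketch of that reading as described in this note, not a reproduction of Fig. 3; the function name, the three boolean flags, and the "unclassified" fallback are hypothetical.

```python
def classify_approach(senses: bool, moves_sensor: bool, changes_environment: bool) -> str:
    """Map an approach onto the four cells described above (hypothetical helper)."""
    if changes_environment and not senses:
        # Acts on the world without using sensor feedback.
        return "sensorless manipulation"
    if senses and changes_environment:
        # Sensing combined with environment-changing action over time.
        return "interactive perception"
    if senses and moves_sensor:
        # Only the sensor (viewpoint, zoom) is actuated.
        return "active perception"
    if senses:
        # Sensing only, e.g. perception of images or video.
        return "passive perception"
    return "unclassified"


print(classify_approach(senses=True, moves_sensor=True, changes_environment=False))
```

Environment-changing action takes precedence over sensor motion here, an assumption that matches the note's framing that IP is defined by acting on the environment rather than on the sensor.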

Application areas (§IV, Fig. 4). The survey groups IP work into nine areas: object segmentation, articulation-model estimation, object dynamics learning & haptic-property estimation, object recognition/categorization, multimodal object-model learning, object pose estimation, grasp planning, manipulation-skill learning, and state-representation learning. Running Lego-block and sphere examples (Figs. 5–6) show the same percept (count objects; estimate weight) going from impossible-when-passive to reliable once the robot pushes/lifts and observes the resulting signal.
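The weight example can be made concrete with a toy calculation. This is a minimal sketch, not the paper's method, assuming a frictionless one-dimensional push so that commanded force and observed acceleration obey F = m·a; the function names, noise level, and true mass are all hypothetical. The point is only that mass, which a static image does not reveal, falls out of a least-squares fit over paired action parameters and sensor readings.

```python
import random


def estimate_mass(forces, accelerations):
    """Least-squares fit of m in F = m * a (a line through the origin)."""
    num = sum(f * a for f, a in zip(forces, accelerations))
    den = sum(a * a for a in accelerations)
    if den == 0.0:
        # No observed motion means no action-correlated signal to fit.
        raise ValueError("mass is unobservable without motion")
    return num / den


def simulate_pushes(true_mass, forces, noise_std=0.05, seed=0):
    """Simulated accelerations for each push, with Gaussian sensor noise (assumed model)."""
    rng = random.Random(seed)
    return [f / true_mass + rng.gauss(0.0, noise_std) for f in forces]


forces = [1.0, 2.0, 3.0, 4.0, 5.0]            # action parameters A (newtons)
accels = simulate_pushes(2.0, forces)         # sensor signal S (m/s^2)
mass_hat = estimate_mass(forces, accels)
print(f"estimated mass: {mass_hat:.2f} kg")
```

With no interaction (all accelerations zero) the estimator raises instead of guessing, which mirrors the impossible-when-passive case in the running example.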

Taxonomy axes (§V, Tables 8–9). Every surveyed paper is classified by: (1) how the S × A × t signal is leveraged: CNS (Create Novel Signals) or APR (Action Perception Regularity), where APR implies CNS; (2) what priors are employed (rigid / articulated / deformable objects, planar motion, action primitives, simple dynamics); (3) whether it performs action selection (myopic / planning-horizon / global-policy via POMDP/MDP/RL); (4) objective (perception, manipulation, or both); (5) whether multiple sensor modalities are exploited; and (6) how uncertainty is modeled, using four labels, DDM/SDM (deterministic/stochastic dynamics model) × DOM/SOM (deterministic/stochastic observation model), plus EU (explicitly estimates uncertainty). Fig. 7 plots approaches on a spectrum from "weak priors, general" (optical flow, brightness constancy) to "strong priors, specific" (joint dynamics models), according to how heavily each leans on prior knowledge versus APR.
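For re-using the taxonomy when tagging other papers, the six axes above can be encoded as a small record type. This is a sketch of one possible encoding, not a structure from the survey; the class and field names are hypothetical, the example entry is invented for illustration, and only the label vocabulary (CNS, APR, DDM, SDM, DOM, SOM, EU) comes from the note.

```python
from dataclasses import dataclass
from enum import Enum


class Signal(Enum):
    CNS = "create novel signals"
    APR = "action perception regularity"


class ActionSelection(Enum):
    NONE = "none"
    MYOPIC = "myopic"
    PLANNING_HORIZON = "planning horizon"
    GLOBAL_POLICY = "global policy"


class Objective(Enum):
    PERCEPTION = "perception"
    MANIPULATION = "manipulation"
    BOTH = "both"


UNCERTAINTY_LABELS = frozenset({"DDM", "SDM", "DOM", "SOM", "EU"})


@dataclass(frozen=True)
class TaxonomyEntry:
    paper: str
    signal: Signal                      # axis 1
    priors: frozenset                   # axis 2, free-text tags
    action_selection: ActionSelection   # axis 3
    objective: Objective                # axis 4
    multimodal: bool                    # axis 5
    uncertainty: frozenset              # axis 6, subset of UNCERTAINTY_LABELS

    def __post_init__(self):
        unknown = set(self.uncertainty) - UNCERTAINTY_LABELS
        if unknown:
            raise ValueError(f"unknown uncertainty labels: {sorted(unknown)}")

    def leverages_cns(self) -> bool:
        # APR implies CNS, so every entry leverages CNS.
        return self.signal in (Signal.CNS, Signal.APR)


# Invented entry for illustration only; not a paper from the survey.
entry = TaxonomyEntry(
    paper="hypothetical-pushing-paper",
    signal=Signal.APR,
    priors=frozenset({"rigid objects", "planar motion"}),
    action_selection=ActionSelection.MYOPIC,
    objective=Objective.PERCEPTION,
    multimodal=False,
    uncertainty=frozenset({"SDM", "SOM", "EU"}),
)
print(entry.signal.name, sorted(entry.uncertainty))
```

Priors are kept as free-text tags because the note lists them as an open set; the uncertainty labels are validated because the note gives a closed vocabulary for them.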

Setup

Results

As a survey, the "result" is the framework itself plus the organized evidence base. Key synthesized claims:

Limitations & open questions

From the authors (§VI):

What I noticed reading it (2017 vintage):

Why I care

This is the principled, conceptual root of the whole "use forceful action to create informative, often non-visual signals that reveal object state" idea — it predates the learned change-predicates and LLM/VLM grounding by the better part of a decade and gives them a precise vocabulary. For the dynamic state-change concept grounding direction (grounding "hot enough now", "became slippery after wetting", "coming to a boil" on tactile/audio/thermal/force sensing, monitored over time, to drive closed-loop semantic replanning), this paper is the foundation on several axes:

Where it threatens / falls short of the idea (the honest gap): the surveyed systems do passive-recognition-after-probing of mostly static properties; almost none do continuous monitoring of a semantic state-transition feeding back into symbolic replanning. It supplies the philosophy (act to perceive; non-visual signals; the S × A × t regularity) but not the closed-loop semantic-correction machinery that REFLECT, Inner Monologue, and BLADE's replan-after-each-behavior loop start to provide. So: cite it as the principled foundation, then position the new idea as "instantiate APR for dynamic, language-named state transitions on multimodal sensing, and close the loop into a planner" — the step the 2017 survey explicitly flags as open (balancing perception and manipulation; which modality, when).

Quotable

Recent approaches in robot perception follow the insight that perception is facilitated by interaction with the environment. These approaches are subsumed under the term Interactive Perception (IP). — Abstract
First, interaction with the environment creates a rich sensory signal that would otherwise not be present. Second, knowledge of the regularity in the combined space of sensory data and action parameters facilitates the prediction and interpretation of the sensory signal. — Abstract
Forceful interactions reveal regularities in the combined space (S × A × t) of sensor information (S) and action parameters (A) over time (t). … Therefore despite S × A × t being much higher dimensional, the signal represented in this space has more structure. — §II-B, Action Perception Regularity
When performing manipulation tasks, humans aptly combine different sources of information … visual information, haptic feedback, and acoustic signals. Research in Interactive Perception is currently mostly concerned with visual information. — §VI-A, Remaining Challenges

Related

Papers cited in this survey that are candidates to ingest (not yet in corpus):

Related papers already in the corpus: