Grounding Predicates through Actions

Toki Migimatsu, Jeannette Bohg · Stanford · ICRA 2022 · arXiv:2109.14718

Direct ablation target for BLADE. This is the paper that BLADE (Liu et al., CoRL 2024) explicitly compares against as the closest prior on predicate-grounding from action segments. Migimatsu & Bohg use only the first and last frame of each segment as supervision; BLADE labels every frame within the segment by propagating effects forward. BLADE reports +20.7% F1 overall, +16.3% on object states, and +38.6% on spatial relations from this single change.

One-liner. Trains a visual predicate classifier from weak supervision (just an action label per video): PDDL pre- and post-conditions yield partial symbolic state labels for the first and last frame of each clip, and the classifier is optimized with a DNF cross-entropy loss. The paper demonstrates near-fully-supervised F1 in Gridworld and closed-loop pick-and-place planning on a real robot using a classifier trained only on 20BN human-activity videos.

Problem & motivation

Long-horizon robot planning needs grounded symbols ("drawer is open", "cup on table"), but state-of-the-art predicate classifiers [3, 6–8] require dense per-image symbolic-state labels, which are prohibitively expensive: the authors estimate manually labeling 132,853 20BN videos would take 2,760 8-hour work days. Furthermore, labels don't transfer across planning domains, because different problems use different predicates. The paper reframes the supervision problem: action labels are cheap to collect; PDDL pre/post-conditions are written once per action class; together they yield partial symbolic-state labels at the start and end of each video clip, for free.

Method

1. From PDDL to partial state labels. Given an action a with pre-condition formula and post-condition formula in PDDL, the authors convert each formula to disjunctive normal form (DNF): a disjunction of conjunctions of positive and negative propositions. Each conjunction is a candidate partial state. They then collapse each DNF to a single partial state by intersecting positive states and intersecting negative states across all disjuncts (Eq. 6). Theorem 1.1 (Appx. B) proves this collapsed set is the largest set of propositions whose truth values are fully determined by the DNF; any proposition outside the collapsed set could be either true or false without violating the DNF. The before-frame I_pre is labeled with the collapsed pre-condition; the after-frame I_post with the collapsed post-condition.
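The collapse step can be sketched in a few lines of Python. This is a minimal illustration of the intersection rule described above, not the authors' code: the `Conjunction` type, the function name `collapse_dnf`, and the example pre-condition (including the predicate `lidless(b)`) are all hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

# A conjunction is a pair (positive propositions, negative propositions).
Conjunction = Tuple[FrozenSet[str], FrozenSet[str]]


def collapse_dnf(dnf: Iterable[Conjunction]) -> Conjunction:
    """Keep only the propositions that have the same truth value in every disjunct."""
    disjuncts = list(dnf)
    if not disjuncts:
        raise ValueError("a DNF needs at least one conjunction")
    pos = frozenset.intersection(*(p for p, _ in disjuncts))
    neg = frozenset.intersection(*(n for _, n in disjuncts))
    return pos, neg


# Hypothetical pre-condition with two disjuncts:
# (holding(a) AND open(b) AND NOT in(a, b)) OR (holding(a) AND lidless(b) AND NOT in(a, b))
pre_dnf = [
    (frozenset({"holding(a)", "open(b)"}), frozenset({"in(a, b)"})),
    (frozenset({"holding(a)", "lidless(b)"}), frozenset({"in(a, b)"})),
]
s_pos, s_neg = collapse_dnf(pre_dnf)
print(sorted(s_pos), sorted(s_neg))
```

In this toy case only holding(a) survives as a positive label and in(a, b) as a negative label; open(b) and lidless(b) differ across disjuncts, so they stay unlabeled.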

2. Predicate classifier. Network takes an RGB image plus M bounding boxes (ordered predicate arguments) and outputs a length-P vector of predicate probabilities (Eq. 4). Querying with different argument orderings yields different propositions, e.g., in(spoon, cup) vs in(cup, spoon). Architecture (Appx. C) is based on Inayoshi et al.'s bounding-box channel net [25]: ResNet-50 backbone, RoIAlign-normalized image features (7×7×1024) on the union-box, plus binary spatial-mask features (7×7×256M) and per-argument object features (7×7×1024M), concatenated and passed through a ResNet-50 head. The bounding-box choice is presented as orthogonal to the labeling method.
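To make the role of argument order concrete, here is a small sketch that enumerates the propositions a set of detected objects gives rise to. It only illustrates the querying idea; the helper `enumerate_propositions`, the predicate names, and their arities are assumptions for the example, and no network is involved.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Proposition = Tuple[str, Tuple[str, ...]]


def enumerate_propositions(predicates: Dict[str, int], objects: List[str]) -> List[Proposition]:
    """List every (predicate, ordered arguments) pair; order matters, so permutations."""
    props: List[Proposition] = []
    for name, arity in predicates.items():
        for args in permutations(objects, arity):
            props.append((name, args))
    return props


# Hypothetical predicates with their arities, and two detected objects.
predicates = {"in": 2, "visible": 1}
objects = ["spoon", "cup"]

for name, args in enumerate_propositions(predicates, objects):
    print(f"{name}({', '.join(args)})")
```

Each ordered argument tuple corresponds to one query of the classifier, so in(spoon, cup) and in(cup, spoon) are scored separately.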

3. DNF cross-entropy loss. For each training instance (I_pre, I_post, action a): collapse the pre- and post-condition DNFs to (s⁺, s⁻) pairs; apply CE_DNF(y, š) = −s⁺ log σ(y) − s⁻ log σ(−y) (Eq. 7), which only contributes loss on the propositions the DNF constrains. Total loss is the sum over the pre- and post-images (Eq. 8). A class-balanced variant (DNF WCE) using Cui et al. [35] addresses the heavy class imbalance, which otherwise works against rare predicates.
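A plain-Python sketch of the masked loss in Eq. 7 as written above may help. It assumes the loss is summed over propositions and that s⁺ and s⁻ are 0/1 indicator vectors; the function names are illustrative, and the class-balanced weighting is left out.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dnf_cross_entropy(logits: Sequence[float], s_pos: Sequence[int], s_neg: Sequence[int]) -> float:
    """CE_DNF = -s_pos * log sigma(y) - s_neg * log sigma(-y), summed over propositions."""
    if not (len(logits) == len(s_pos) == len(s_neg)):
        raise ValueError("logits and label masks must have the same length")
    loss = 0.0
    for y, p, n in zip(logits, s_pos, s_neg):
        loss -= p * log_sigmoid(y) + n * log_sigmoid(-y)
    return loss


# Three propositions: the first is labeled true, the second false, the third is unconstrained.
logits = [2.0, -1.0, 0.3]
loss = dnf_cross_entropy(logits, s_pos=[1, 0, 0], s_neg=[0, 1, 0])
print(round(loss, 4))
```

Because both masks are zero for the third proposition, changing its logit leaves the loss unchanged, which is the "only contributes loss on the propositions the DNF constrains" property.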

Setup

Datasets / benchmarks:
- 20BN Something Something v2 [9]: 220,847 short human manipulation videos of 174 action classes. Authors define pre/post-conditions for 172 actions and 35 predicates (151 total propositions); bounding boxes come from Something-Else [34]. Used subset: 132,853 videos.
- Gridworld: synthetic toy with 8 objects, 6 predicates, 8 actions involving universal/existential/conditional quantifiers (172 propositions). Used to compare partial-label vs full-label training because ground-truth state is available.
Hardware / simulator: Real-robot pick-and-place (arm not reported) using YOLOv5 [36] to detect the bounding boxes of predicate arguments and Mediapipe [37] for inter-frame tracking; the full perception-to-planner pipeline runs at 10 Hz.
Baselines: On 20BN there is no oracle (no ground-truth full state is available), so DNF CE and DNF WCE are compared against a random-classifier expectation. In Gridworld: Oracle (full supervision), DNF CE, and Half DNF (an ablation that sees only one image of each pre/post pair, with 2× the data for fairness). Related work by Ahmadzadeh et al. [31] is acknowledged, but it learns only a single predicate (far(a)).
Compute: not reported.

Results

20BN (Table I): Both DNF CE and DNF WCE reach 0.92 test F1 overall (0.96 train F1), 0.93 test accuracy. The interesting delta is per-predicate average F1: DNF CE 0.60 → DNF WCE 0.65, showing class-balanced loss helps rare predicates (e.g., broken 0.35→0.53, deformed 0.09→0.34) at slight cost to common ones. Random baseline would yield 0.36 F1 (positives appear at 36% rate). Predicates like visible, is-holdable, fits, close, touching, onsurface, in exceed 0.89 F1.
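The 0.36 random-baseline figure is easy to sanity-check. The sketch below assumes the baseline predicts "positive" independently of the true label and at the same 36% rate as the data, and it computes F1 from expected counts rather than as an expectation of F1; both are assumptions of this example, since the notes do not spell out how the baseline is defined.

```python
def expected_f1_random(positive_rate: float, predict_rate: float) -> float:
    """F1 from expected counts for a label-independent random classifier."""
    tp = positive_rate * predict_rate
    fp = (1.0 - positive_rate) * predict_rate
    fn = positive_rate * (1.0 - predict_rate)
    denom = 2.0 * tp + fp + fn
    return 0.0 if denom == 0.0 else 2.0 * tp / denom


# Predicting positives at the data rate gives precision = recall = 0.36.
print(round(expected_f1_random(0.36, 0.36), 2))
# A fair coin flip instead gives precision 0.36 but recall 0.5.
print(round(expected_f1_random(0.36, 0.50), 2))
```

Under these assumptions the expression simplifies to 2pq / (p + q), so matching the prediction rate q to the positive rate p returns p itself.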

Gridworld (Fig. 4):

Method	Train Size	Test F1
Oracle (full labels)	10,000	1.00
DNF CE (partial labels)	10,000	0.96
Half DNF (pre OR post only)	20,000	0.94

At 100,000 training examples, all DNF variants reach perfect F1 (reported, not shown). Key takeaway: weak partial labels match Oracle with enough data; the ablation that loses paired before/after frames is worse even with twice the data — visual change between frames is more informative than additional single-frame instances.

Closed-loop real-robot planning (Fig. 3): The 20BN-trained DNF WCE classifier is zero-shot transferred to a tabletop fruit-into-container scene. Goal in(banana, drawer) is achieved through open(drawer) → pick(banana) → put-into(banana, drawer) → close(drawer), with the planner re-planning at 10 Hz from updated predicate predictions. Qualitative only (no real-robot success-rate table).
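The re-planning loop can be illustrated with a toy STRIPS-style planner. Everything in this sketch is an assumption for illustration: the pre-conditions and effects are hand-written guesses (for example, that the drawer can only be opened or closed with an empty hand), the goal additionally requires the drawer to end up closed so that the plan ends with close(drawer), and the breadth-first search stands in for whatever planner the authors used.

```python
from collections import deque
from typing import FrozenSet, Iterable, List, NamedTuple, Optional


class Action(NamedTuple):
    name: str
    pre_pos: FrozenSet[str]   # must be true
    pre_neg: FrozenSet[str]   # must be false
    add: FrozenSet[str]
    delete: FrozenSet[str]


def fs(*props: str) -> FrozenSet[str]:
    return frozenset(props)


# Hypothetical domain; a state is the set of propositions predicted true.
ACTIONS = [
    Action("open(drawer)", fs(), fs("open(drawer)", "holding(banana)"), fs("open(drawer)"), fs()),
    Action("pick(banana)", fs(), fs("holding(banana)", "in(banana, drawer)"), fs("holding(banana)"), fs()),
    Action("put-into(banana, drawer)", fs("holding(banana)", "open(drawer)"), fs(),
           fs("in(banana, drawer)"), fs("holding(banana)")),
    Action("close(drawer)", fs("open(drawer)"), fs("holding(banana)"), fs(), fs("open(drawer)")),
]
GOAL_POS = fs("in(banana, drawer)")
GOAL_NEG = fs("open(drawer)")  # assumed: the drawer should end up closed


def plan(state: Iterable[str]) -> Optional[List[str]]:
    """Breadth-first search for a shortest action sequence that reaches the goal."""
    start = frozenset(state)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, steps = queue.popleft()
        if GOAL_POS <= current and not (GOAL_NEG & current):
            return steps
        for action in ACTIONS:
            if action.pre_pos <= current and not (action.pre_neg & current):
                nxt = (current - action.delete) | action.add
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [action.name]))
    return None


# Each control tick re-plans from the classifier's latest predicted state.
print(plan(set()))               # drawer closed, banana on the table
print(plan({"open(drawer)"}))    # someone already opened the drawer
```

Re-planning from the second state drops the open(drawer) step, which is the behavior a closed-loop planner relies on when the scene changes between ticks.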

Annotation cost (Appx. A): Defining pre/post-conditions for 172 actions at ~10 min each takes ~30 hours, or four 8-hour days. Manually labeling the same partial states would take 690 days, a 172× reduction.
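The arithmetic behind these figures can be checked directly. The sketch assumes 172 authored actions (the count given under Setup), 10 minutes per action, 8-hour days, and rounding up to whole days; the function name is illustrative.

```python
import math
from typing import Tuple


def authoring_cost(n_actions: int, minutes_per_action: float = 10.0,
                   hours_per_day: float = 8.0) -> Tuple[float, int]:
    """Return (total hours, whole work days rounded up) for writing pre/post-conditions."""
    hours = n_actions * minutes_per_action / 60.0
    return hours, math.ceil(hours / hours_per_day)


hours, days = authoring_cost(172)
reduction = 690 / days  # 690 days is the manual-labeling estimate quoted above

print(f"{hours:.1f} hours, {days} days, {reduction:.1f}x fewer days")
```

The exact total is a little under 29 hours, which the notes round to ~30 hours; 690 / 4 gives 172.5, matching the quoted 172× after truncation.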

Limitations & open questions

From the authors:

- Predicates that aren't manipulated by any action ("sky is blue") can't be learned this way: the method requires symbolic states that actions change.
- Predicates from a general dataset like 20BN may not match the symbol vocabulary of a custom planning domain; the framework's biggest value is when the user defines their own actions.
- Noisy/skewed predicate distributions hurt rare predicates; weak supervision techniques like Snorkel [32] / Snuba [33] are flagged as future work.
- Natural-language acquisition of pre/post-conditions [38, 39] is proposed as future work, since manual PDDL authoring is still a bottleneck per new domain.

What I noticed reading it:

- First/last frame only. Every other frame in the clip is unused supervision, which is precisely the gap BLADE later exploits by forward-propagating effects through every observation, yielding +20.7% F1 (BLADE Table 2).
- Pre/post collapsing throws away information when the DNF has many disjuncts; the low rare-predicate F1 suggests the bottleneck isn't the loss function alone.
- Real-robot evaluation is a single qualitative trajectory, with no multi-task or perturbation table comparable to BLADE/Text2Motion.
- The DNF→single-state collapse yields the largest provably-determined set (Theorem 1.1), but it implicitly assumes pre- and post-conditions were authored to be informative enough. In practice, predicates outside pre/post receive no gradient signal except through co-occurrence.
- The bounding-box conditioning sidesteps the open-vocabulary detection problem but presumes object tracking is already solved; BLADE integrates an open-vocabulary detector [68] to handle this in the wild.
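To make the first point concrete, the toy sketch below contrasts the two labeling schemes on a 10-frame clip. It is only an illustration of the idea as described in these notes, not BLADE's implementation: the `Segment` type and the rule that an effect holds from the end of its segment up to the start of the next segment that changes the same proposition are assumptions.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    start: int               # index of the segment's first frame
    end: int                 # index of the segment's last frame
    pre: Dict[str, bool]     # collapsed pre-condition labels
    eff: Dict[str, bool]     # collapsed post-condition (effect) labels


def frame_pair_labels(segments: List[Segment], n_frames: int) -> List[Dict[str, bool]]:
    """Label only the first and last frame of each segment."""
    labels: List[Dict[str, bool]] = [{} for _ in range(n_frames)]
    for seg in segments:
        labels[seg.start].update(seg.pre)
        labels[seg.end].update(seg.eff)
    return labels


def propagated_labels(segments: List[Segment], n_frames: int) -> List[Dict[str, bool]]:
    """Additionally carry each effect forward until a later segment changes it (assumed rule)."""
    labels = frame_pair_labels(segments, n_frames)
    ordered = sorted(segments, key=lambda s: s.start)
    for i, seg in enumerate(ordered):
        for prop, value in seg.eff.items():
            stop = n_frames
            for later in ordered[i + 1:]:
                if prop in later.eff:
                    stop = later.start + 1  # still holds on the later segment's first frame
                    break
            for t in range(seg.end, stop):
                labels[t][prop] = value
    return labels


segments = [
    Segment(0, 2, {"open(drawer)": False}, {"open(drawer)": True}),   # open the drawer
    Segment(6, 8, {"open(drawer)": True}, {"open(drawer)": False}),   # close it again
]
pair = frame_pair_labels(segments, 10)
dense = propagated_labels(segments, 10)
print(sum(1 for frame in pair if frame), sum(1 for frame in dense if frame))
```

On this clip the frame-pair scheme labels 4 of the 10 frames, while forward propagation labels 8; only the frames in the middle of each action stay unlabeled.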

Why I care

This is the direct ablation target for BLADE's predicate-grounding component. BLADE explicitly quotes Migimatsu & Bohg [66] as the closest prior approach and frames its own contribution as a near-drop-in replacement: instead of supervising the classifier only with first/last frames of each demo segment, BLADE auto-labels every frame by propagating eff_a forward until another behavior alters it. The empirical gap is large and load-bearing — +20.7% F1 overall, +16.3% on object states, +38.6% on spatial relations — and BLADE's CALVIN planning-success gains flow directly from this richer classifier supervision. So this paper matters to me for three reasons:

1. It validates the DNF-from-PDDL labeling primitive that BLADE inherits: symbolic supervision from action labels works at scale (20BN, ~133k videos).
2. It defines the baseline that BLADE beats. Any future predicate-grounding work I do has to clear this bar; any contrast I write up should cite Migimatsu & Bohg as the canonical "frame-pair supervision" reference.
3. It surfaces residual problems BLADE doesn't fully solve: rare-predicate F1, predicate invention (still PDDL-authored here and in BLADE), and predicates that no action manipulates. These map cleanly onto follow-up directions I've been tracking (see Predicate Invention for Bilevel Planning and INTERPRET).

Topic-page worthy: this paper anchors the "predicate grounding from action pre/post-conditions" subtree of the predicate learning topic (if/when written) alongside Konidaris [32 in BLADE refs] and Chitnis et al.'s bilevel planning.

Quotable

Rather than learning visual groundings from direct labels of symbolic state, we propose to learn them indirectly from visual examples of symbolic actions. Actions change the symbolic state in a predefined manner according to their pre- and post-conditions. — §I, Introduction

š_DNF is the largest set of propositions that is fully determined by a DNF. — Theorem 1.1, Appx. B

Half DNF's worse performance indicates that seeing the visual changes induced by each action is more beneficial than simply receiving more data (twice the amount). — §V-B.3, Gridworld results

In many cases, predicates learned from a general large-scale dataset may not be applicable to custom task planning domains, where accuracy is critical. However, our labeling framework would perhaps be most useful for these very applications where collecting new datasets is necessary. — §VI, Conclusion

Related

- BLADE (Liu et al., CoRL 2024): direct successor that supersedes the frame-pair supervision scheme. BLADE propagates effects through every intermediate frame, reports +20.7% predicate F1 over this paper's first/last-frame approach, and integrates the learned classifiers into a full bi-level planner with diffusion-policy controllers.
- Learning Neuro-Symbolic Relational Transition Models for Bilevel Planning (Chitnis et al.): complementary. This paper learns predicates from actions; Chitnis et al. learn transition models over predicates for bilevel planning. Together they cover the perception-to-symbolic-dynamics stack.
- Predicate Invention for Bilevel Planning (Silver et al.): addresses the limitation Migimatsu & Bohg flag, namely that predicates are PDDL-authored here. Silver et al. invent the predicate vocabulary itself.
- PARL: Planning Abstractions from Language. Another angle on acquiring predicate/operator structure automatically, this time from language guidance.
- INTERPRET: Interactive Predicate Learning from Language Feedback. Interactive variant: instead of weak supervision from pre/post, learns predicates from language feedback over rollouts.
- Ahmadzadeh et al. (ICRA 2015) [31]: closest prior in spirit cited by this paper; learns a single predicate (far(a)) from push/pull demonstrations. Migimatsu & Bohg generalize to arbitrary PDDL operators.
- Konidaris, Kaelbling, Lozano-Pérez (JAIR 2018) [2]: "From skills to symbols," the canonical classical reference for learning grounded predicates from skill options.
- Mao et al. (ICLR 2019) [3]: Neuro-Symbolic Concept Learner, cited as the natural-supervision counterpoint; BLADE also cites this lineage.