Grounding Predicates through Actions

Toki Migimatsu, Jeannette Bohg · Stanford · ICRA 2022 · arXiv:2109.14718

Direct ablation target for BLADE. This is the paper that BLADE (Liu et al., CoRL 2024) explicitly compares against as the closest prior on predicate-grounding from action segments. Migimatsu & Bohg use only the first and last frame of each segment as supervision; BLADE labels every frame within the segment by propagating effects forward. BLADE reports +20.7% F1 overall, +16.3% on object states, and +38.6% on spatial relations from this single change.

One-liner. Trains a visual predicate classifier from weak supervision — just an action label per video — by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and last frame of each clip, then optimizing a DNF cross-entropy loss; demonstrates near-fully-supervised F1 in Gridworld and closed-loop pick-and-place planning on a real robot using a classifier trained only on 20BN human-activity videos.

Problem & motivation

Long-horizon robot planning needs grounded symbols ("drawer is open", "cup on table"), but state-of-the-art predicate classifiers [3, 6–8] require dense per-image symbolic-state labels, which are prohibitively expensive: the authors estimate manually labeling 132,853 20BN videos would take 2,760 8-hour work days. Furthermore, labels don't transfer across planning domains, because different problems use different predicates. The paper reframes the supervision problem: action labels are cheap to collect; PDDL pre/post-conditions are written once per action class; together they yield partial symbolic-state labels at the start and end of each video clip, for free.

Method

1. From PDDL to partial state labels. Given an action a with pre-condition formula and post-condition formula in PDDL, the authors convert each formula to disjunctive normal form (DNF): a disjunction of conjunctions of positive and negative propositions. Each conjunction is a candidate partial state. They then collapse each DNF to a single partial state by intersecting positive states and intersecting negative states across all disjuncts (Eq. 6). Theorem 1.1 (Appx. B) proves this collapsed set is the largest set of propositions whose truth values are fully determined by the DNF; any proposition outside the collapsed set could be either true or false without violating the DNF. The before-frame Ipre is labeled with the collapsed pre-condition; the after-frame Ipost with the collapsed post-condition.

2. Predicate classifier. Network takes an RGB image plus M bounding boxes (ordered predicate arguments) and outputs a length-P vector of predicate probabilities (Eq. 4). Querying with different argument orderings yields different propositions, e.g., in(spoon, cup) vs in(cup, spoon). Architecture (Appx. C) is based on Inayoshi et al.'s bounding-box channel net [25]: ResNet-50 backbone, RoIAlign-normalized image features (7×7×1024) on the union-box, plus binary spatial-mask features (7×7×256M) and per-argument object features (7×7×1024M), concatenated and passed through a ResNet-50 head. The bounding-box choice is presented as orthogonal to the labeling method.

3. DNF cross-entropy loss. For each training instance (Ipre, Ipost, action a): collapse the pre- and post-condition DNFs to (s^+, s^-) pairs, the propositions forced true and forced false; apply CE_DNF(y, š) = −s^+ log σ(y) − s^- log σ(−y) (Eq. 7), which only contributes loss on the propositions the DNF constrains. Total loss is the sum over the pre- and post-images (Eq. 8). A class-balanced variant (DNF WCE) using Cui et al. [35] addresses the heavy class imbalance that leaves rare predicates underrepresented.
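Steps 1 and 3 above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the set-based DNF representation, the function names, and the example pre-condition for a put-into style action are all assumptions made here, and the logits are made-up numbers standing in for classifier outputs.

```python
import math
from typing import Dict, FrozenSet, List, Tuple

# One conjunction = (propositions required true, propositions required false).
# This representation is an assumption of this sketch.
Conjunction = Tuple[FrozenSet[str], FrozenSet[str]]


def collapse_dnf(dnf: List[Conjunction]) -> Conjunction:
    """Keep only the propositions that every disjunct agrees on (cf. Eq. 6)."""
    if not dnf:
        raise ValueError("DNF must contain at least one conjunction")
    s_pos = frozenset.intersection(*(pos for pos, _ in dnf))
    s_neg = frozenset.intersection(*(neg for _, neg in dnf))
    return s_pos, s_neg


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dnf_cross_entropy(logits: Dict[str, float],
                      s_pos: FrozenSet[str],
                      s_neg: FrozenSet[str]) -> float:
    """Sum of -s^+ log sigma(y) - s^- log sigma(-y) over propositions (cf. Eq. 7)."""
    loss = 0.0
    for prop, y in logits.items():
        if prop in s_pos:      # label says "true": push the logit up
            loss -= log_sigmoid(y)
        if prop in s_neg:      # label says "false": push the logit down
            loss -= log_sigmoid(-y)
    return loss                # unconstrained propositions add nothing


# Hypothetical pre-condition with two disjuncts (not taken from the paper).
PRE_DNF = [
    (frozenset({"holding(x)", "open(y)"}), frozenset({"in(x, y)"})),
    (frozenset({"holding(x)"}), frozenset({"in(x, y)", "has-lid(y)"})),
]
s_pos, s_neg = collapse_dnf(PRE_DNF)

# Made-up classifier logits for the pre-image.
logits = {"holding(x)": 2.0, "in(x, y)": -1.0, "open(y)": 0.5, "has-lid(y)": -3.0}
loss = dnf_cross_entropy(logits, s_pos, s_neg)
```

In this example only holding(x) survives as a positive label and in(x, y) as a negative one; open(y) and has-lid(y) differ between the disjuncts, so they stay unlabeled and their logits do not affect the loss. The total training loss would add the same term for the post-image.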

Setup

Results

20BN (Table I): Both DNF CE and DNF WCE reach 0.92 test F1 overall (0.96 train F1), 0.93 test accuracy. The interesting delta is per-predicate average F1: DNF CE 0.60 → DNF WCE 0.65, showing class-balanced loss helps rare predicates (e.g., broken 0.35→0.53, deformed 0.09→0.34) at slight cost to common ones. Random baseline would yield 0.36 F1 (positives appear at 36% rate). Predicates like visible, is-holdable, fits, close, touching, onsurface, in exceed 0.89 F1.
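One way to see where a 0.36 random baseline could come from: if a classifier guesses "positive" independently of the image, its expected precision equals the positive rate and its expected recall equals its guess rate. The sketch below works this out from expected confusion counts. That the baseline guesses at the base rate is an assumption made here, not something these notes confirm about the paper.

```python
def expected_f1(prevalence: float, guess_rate: float) -> float:
    """F1 from expected confusion fractions for image-independent guessing."""
    tp = prevalence * guess_rate              # positive and guessed positive
    fp = (1.0 - prevalence) * guess_rate      # negative but guessed positive
    fn = prevalence * (1.0 - guess_rate)      # positive but guessed negative
    return 2.0 * tp / (2.0 * tp + fp + fn)


# Guessing positive at the same 36% rate at which positives occur.
base_rate_f1 = expected_f1(prevalence=0.36, guess_rate=0.36)

# For comparison: always guessing positive scores higher on F1.
always_positive_f1 = expected_f1(prevalence=0.36, guess_rate=1.0)
```

Under this assumption the base-rate guesser lands at 0.36, matching the quoted figure, while an always-positive guesser would score about 0.53, so the choice of baseline matters when reading the 0.92 headline number.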

Gridworld (Fig. 4):

Method | Train Size | Test F1
Oracle (full labels) | 10,000 | 1.00
DNF CE (partial labels) | 10,000 | 0.96
Half DNF (pre OR post only) | 20,000 | 0.94

At 100,000 training examples, all DNF variants reach perfect F1 (reported, not shown). Key takeaway: weak partial labels match Oracle with enough data; the ablation that loses paired before/after frames is worse even with twice the data — visual change between frames is more informative than additional single-frame instances.

Closed-loop real-robot planning (Fig. 3): The 20BN-trained DNF WCE classifier is zero-shot transferred to a tabletop fruit-into-container scene. Goal in(banana, drawer) is achieved through open(drawer) → pick(banana) → put-into(banana, drawer) → close(drawer), with the planner re-planning at 10 Hz from updated predicate predictions. Qualitative only (no real-robot success-rate table).
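The closed-loop behavior described above can be sketched as a replan-every-tick loop over a toy symbolic domain. Everything here is hypothetical: the action table is not the paper's PDDL, the goal adds "drawer not open" so that the plan ends with close(drawer), a breadth-first search stands in for the planner, and a simulated world stands in for the camera and the predicate classifier. Each loop iteration plays the role of one planning tick.

```python
from collections import deque

# Hypothetical toy domain:
# action -> (must be true, must be false, becomes true, becomes false)
ACTIONS = {
    "open(drawer)": (set(), {"open(drawer)", "holding(banana)"},
                     {"open(drawer)"}, set()),
    "pick(banana)": (set(), {"holding(banana)", "in(banana, drawer)"},
                     {"holding(banana)"}, set()),
    "put-into(banana, drawer)": ({"holding(banana)", "open(drawer)"}, set(),
                                 {"in(banana, drawer)"}, {"holding(banana)"}),
    "close(drawer)": ({"open(drawer)"}, {"holding(banana)"},
                      set(), {"open(drawer)"}),
}


def apply(state, action):
    """Return the symbolic state after the action's effects."""
    _, _, add, delete = ACTIONS[action]
    return (set(state) - delete) | add


def plan(state, goal_true, goal_false):
    """Breadth-first search for a shortest action sequence reaching the goal."""
    start = frozenset(state)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        current, steps = queue.popleft()
        if goal_true <= current and not (goal_false & current):
            return steps
        for name, (pre_true, pre_false, _, _) in ACTIONS.items():
            if pre_true <= current and not (pre_false & current):
                nxt = frozenset(apply(current, name))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None


def run_closed_loop(observe, execute, goal_true, goal_false, max_ticks=20):
    """Each tick: read predicted state, replan, execute only the first step."""
    executed = []
    for _ in range(max_ticks):
        steps = plan(observe(), goal_true, goal_false)
        if steps is None:
            raise RuntimeError("no plan from the observed state")
        if not steps:          # goal already holds
            break
        execute(steps[0])
        executed.append(steps[0])
    return executed


# Simulated world standing in for camera + classifier predictions.
world = {"state": set(), "bumped": False}


def observe():
    return set(world["state"])


def execute(action):
    world["state"] = apply(world["state"], action)
    # One-off disturbance: the drawer slides shut right after it is first opened.
    if action == "open(drawer)" and not world["bumped"]:
        world["state"].discard("open(drawer)")
        world["bumped"] = True


GOAL_TRUE, GOAL_FALSE = {"in(banana, drawer)"}, {"open(drawer)"}
open_loop_plan = plan(set(), GOAL_TRUE, GOAL_FALSE)
executed = run_closed_loop(observe, execute, GOAL_TRUE, GOAL_FALSE)
```

The point of the design is that the loop never commits to a full plan: because it replans from the latest observed predicates, the simulated disturbance simply causes open(drawer) to be issued a second time before the rest of the sequence proceeds.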

Annotation cost (Appx. A): Defining pre/post for 171 actions at ~10 min each: ~30 hours, or 4 8-hour days. Manually labeling the same partial states would take 690 days — a 172× reduction.
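The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above; the variable names are mine, and rounding up to 4 whole days before dividing is an assumption about how the 172× figure was obtained.

```python
num_actions = 171
minutes_per_action = 10

definition_hours = num_actions * minutes_per_action / 60   # 28.5 hours
definition_days = definition_hours / 8                     # 3.5625 eight-hour days
rounded_days = 4                                           # figure quoted in the notes

manual_days = 690
reduction_rounded = manual_days / rounded_days             # 172.5
reduction_unrounded = manual_days / definition_days        # larger, since 3.5625 < 4
```

So "~30 hours" is 28.5 hours rounded up, and the 172× reduction follows from dividing 690 by the rounded 4 days; using the unrounded day count would give a somewhat larger ratio.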

Limitations & open questions

From the authors:

What I noticed reading it:

Why I care

This is the direct ablation target for BLADE's predicate-grounding component. BLADE explicitly cites Migimatsu & Bohg [66] as the closest prior approach and frames its own contribution as a near-drop-in replacement: instead of supervising the classifier only with the first and last frames of each demo segment, BLADE auto-labels every frame by propagating the effects eff_a of behavior a forward until another behavior alters them. The empirical gap is large and load-bearing (+20.7% F1 overall, +16.3% on object states, +38.6% on spatial relations), and BLADE's CALVIN planning-success gains flow directly from this richer classifier supervision. So this paper matters to me for three reasons:

Topic-page worthy: this paper anchors the "predicate grounding from action pre/post-conditions" subtree of the predicate learning topic (if/when written) alongside Konidaris [32 in BLADE refs] and Chitnis et al.'s bilevel planning.

Quotable

Rather than learning visual groundings from direct labels of symbolic state, we propose to learn them indirectly from visual examples of symbolic actions. Actions change the symbolic state in a predefined manner according to their pre- and post-conditions. — §I, Introduction
š_DNF is the largest set of propositions that is fully determined by a DNF. — Theorem 1.1, Appx. B
Half DNF's worse performance indicates that seeing the visual changes induced by each action is more beneficial than simply receiving more data (twice the amount). — §V-B.3, Gridworld results
In many cases, predicates learned from a general large-scale dataset may not be applicable to custom task planning domains, where accuracy is critical. However, our labeling framework would perhaps be most useful for these very applications where collecting new datasets is necessary. — §VI, Conclusion

Related