One-liner. Trains a visual predicate classifier from weak supervision — just an action label per video — by using PDDL pre- and post-conditions to derive partial symbolic state labels for the first and last frame of each clip, then optimizing a DNF cross-entropy loss; demonstrates near-fully-supervised F1 in Gridworld and closed-loop pick-and-place planning on a real robot using a classifier trained only on 20BN human-activity videos.
Long-horizon robot planning needs grounded symbols ("drawer is open", "cup on table"), but state-of-the-art predicate classifiers [3, 6–8] require dense per-image symbolic-state labels, which are prohibitively expensive: the authors estimate manually labeling 132,853 20BN videos would take 2,760 8-hour work days. Furthermore, labels don't transfer across planning domains, because different problems use different predicates. The paper reframes the supervision problem: action labels are cheap to collect; PDDL pre/post-conditions are written once per action class; together they yield partial symbolic-state labels at the start and end of each video clip, for free.
1. From PDDL to partial state labels. Given an action
a with a pre-condition formula and a post-condition formula in
PDDL, the authors convert each formula to disjunctive normal form (DNF):
a disjunction of conjunctions of positive and negative propositions. Each
conjunction is a candidate partial state. They then collapse
each DNF to a single partial state by intersecting the sets of positive
propositions and, separately, the sets of negative propositions across all
disjuncts (Eq. 6). Theorem 1.1
(Appx. B) proves this collapsed set is the largest set of
propositions whose truth values are fully determined by the DNF; any
proposition outside the collapsed set could be either true or false
without violating the DNF. The before-frame I_pre is
labeled with the collapsed pre-condition; the after-frame
I_post with the collapsed post-condition.
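The collapse step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `collapse_dnf`, the representation of a DNF as a list of `(positive, negative)` set pairs, and the example propositions are all assumptions made for this note.

```python
def collapse_dnf(dnf):
    """Collapse a DNF into one partial state (hypothetical helper).

    `dnf` is a list of conjunctions; each conjunction is a pair
    (positive_props, negative_props) of sets of proposition strings.
    A proposition survives only if every disjunct agrees on its truth value.
    """
    if not dnf:
        raise ValueError("a DNF needs at least one conjunction")
    positives = set.intersection(*(set(pos) for pos, _ in dnf))
    negatives = set.intersection(*(set(neg) for _, neg in dnf))
    return positives, negatives


# Illustrative DNF with two disjuncts (propositions are made up):
#   (in(a, b) AND touching(a, b) AND NOT holding(a))
#   OR (in(a, b) AND NOT holding(a) AND NOT broken(a))
example_dnf = [
    ({"in(a, b)", "touching(a, b)"}, {"holding(a)"}),
    ({"in(a, b)"}, {"holding(a)", "broken(a)"}),
]
s_pos, s_neg = collapse_dnf(example_dnf)
print(sorted(s_pos), sorted(s_neg))
```

Here `touching(a, b)` and `broken(a)` drop out because only one disjunct mentions them, so the DNF does not fix their truth value.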
2. Predicate classifier. The network takes an RGB image
plus M bounding boxes (the ordered predicate arguments) and outputs
a length-P vector of predicate probabilities (Eq. 4). Querying
with different argument orderings yields different propositions, e.g.,
in(spoon, cup) vs. in(cup, spoon). The architecture
(Appx. C) is based on Inayoshi et al.'s bounding-box channel net [25]:
a ResNet-50 backbone, RoIAlign-normalized image features (7×7×1024)
on the union box, plus binary spatial-mask features (7×7×256M)
and per-argument object features (7×7×1024M),
concatenated and passed through a ResNet-50 head. The choice of bounding-box inputs
is presented as orthogonal to the labeling method.
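To make the querying scheme concrete, the sketch below enumerates ordered argument tuples for a set of detected objects and tallies the feature channels listed above. Both helper names are hypothetical, and treating the concatenation as channel-wise is an assumption of this note, inferred only from the matching 7×7 spatial sizes.

```python
from itertools import permutations


def enumerate_queries(objects, arity):
    """All ordered argument tuples the classifier could be queried with."""
    return list(permutations(objects, arity))


def concat_channels(num_args, image_ch=1024, mask_ch=256, object_ch=1024):
    """Channel count if the 7x7 feature maps are concatenated channel-wise
    (an assumption of this sketch, not stated in the source)."""
    return image_ch + mask_ch * num_args + object_ch * num_args


queries = enumerate_queries(["spoon", "cup"], arity=2)
print(queries)             # each ordering is a separate forward pass
print(concat_channels(2))  # channels for M = 2 arguments
```

For a binary predicate such as `in`, the two orderings of `spoon` and `cup` are distinct queries and yield distinct propositions.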
3. DNF cross-entropy loss. For each training instance (I_pre, I_post, action a): collapse the pre- and post-condition DNFs to (s+, s−) pairs, then apply $\mathrm{CE}_{\mathrm{DNF}}(y, \check{s}) = -s^{+} \log \sigma(y) - s^{-} \log \sigma(-y)$ (Eq. 7), which contributes loss only on the propositions the DNF constrains: s+ selects those labeled true and s− those labeled false. The total loss is the sum over the pre- and post-images (Eq. 8). A class-balanced variant (DNF WCE), using the weighting of Cui et al. [35], counteracts the heavy class imbalance that disadvantages rare predicates.
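A minimal pure-Python sketch of the masked loss in Eq. 7 follows. It assumes s+ and s− are 0/1 indicator lists over propositions and that y holds raw logits; the function names are mine, and the class-balanced weighting of DNF WCE is not included.

```python
import math


def log_sigmoid(y):
    """Numerically stable log(sigmoid(y))."""
    if y >= 0:
        return -math.log1p(math.exp(-y))
    return y - math.log1p(math.exp(y))


def dnf_cross_entropy(logits, s_pos, s_neg):
    """Sum of -s+ * log sigma(y) - s- * log sigma(-y) over propositions.

    Propositions with s+ = s- = 0 are unconstrained and contribute nothing.
    """
    total = 0.0
    for y, sp, sn in zip(logits, s_pos, s_neg):
        total -= sp * log_sigmoid(y) + sn * log_sigmoid(-y)
    return total


# Three propositions: first labeled true, second labeled false, third unknown.
loss = dnf_cross_entropy([2.0, -1.0, 5.0], s_pos=[1, 0, 0], s_neg=[0, 1, 0])
print(loss)
```

Changing the third logit leaves the loss untouched, which is the sense in which the partial labels only supervise what the DNF determines.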
far(a)).
20BN (Table I): Both DNF CE and DNF WCE reach 0.92
test F1 overall (0.96 train F1), 0.93 test accuracy. The interesting
delta is per-predicate average F1: DNF CE 0.60 → DNF WCE 0.65,
showing class-balanced loss helps rare predicates (e.g.,
broken 0.35→0.53, deformed
0.09→0.34) at slight cost to common ones. Random baseline would
yield 0.36 F1 (positives appear at 36% rate). Predicates like
visible, is-holdable, fits,
close, touching, onsurface,
in exceed 0.89 F1.
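The gap between overall F1 and per-predicate average F1 is easier to see with numbers. The sketch below uses made-up confusion counts (not the paper's) to contrast pooled F1 with the unweighted mean over predicates. It also checks the random-baseline figure under one assumption of mine: a classifier that guesses "positive" independently at the 36% base rate.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-predicate counts (tp, fp, fn): one common, one rare.
counts = {"visible": (900, 50, 50), "broken": (10, 20, 30)}

# Overall F1 pools counts, so the common predicate dominates.
overall_f1 = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
# Per-predicate average weights every predicate equally.
average_f1 = sum(f1(*c) for c in counts.values()) / len(counts)


def random_guess_f1(positive_rate, guess_rate):
    """Expected-count F1 of independent random guessing:
    precision equals the positive rate, recall equals the guess rate."""
    return 2 * positive_rate * guess_rate / (positive_rate + guess_rate)


print(overall_f1, average_f1, random_guess_f1(0.36, 0.36))
```

With these toy counts the pooled score stays high while the per-predicate average is pulled down by the rare predicate, which is the pattern the DNF CE vs. DNF WCE comparison probes.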
Gridworld (Fig. 4):
| Method | Train Size | Test F1 |
|---|---|---|
| Oracle (full labels) | 10,000 | 1.00 |
| DNF CE (partial labels) | 10,000 | 0.96 |
| Half DNF (pre OR post only) | 20,000 | 0.94 |
At 100,000 training examples, all DNF variants reach perfect F1 (reported, not shown). Key takeaway: weak partial labels match Oracle with enough data; the ablation that loses paired before/after frames is worse even with twice the data — visual change between frames is more informative than additional single-frame instances.
Closed-loop real-robot planning (Fig. 3): The
20BN-trained DNF WCE classifier is zero-shot transferred to a tabletop
fruit-into-container scene. Goal
in(banana, drawer) is achieved through
open(drawer) → pick(banana) → put-into(banana, drawer) → close(drawer),
with the planner re-planning at 10 Hz from updated predicate predictions.
Qualitative only (no real-robot success-rate table).
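The closed-loop behavior can be mimicked with a toy symbolic planner. Everything in this sketch is an assumption for illustration: the STRIPS-style operator definitions, the proposition names, the inclusion of closed(drawer) in the goal (so that the plan ends with a close action), and the simulated state update that stands in for the classifier's predictions. The 10 Hz rate is not modeled.

```python
from collections import deque

# Hypothetical operators: name -> (preconditions, add effects, delete effects).
OPERATORS = {
    "open(drawer)": ({"closed(drawer)", "handempty"},
                     {"open(drawer)"}, {"closed(drawer)"}),
    "pick(banana)": ({"on(banana, table)", "handempty"},
                     {"holding(banana)"}, {"on(banana, table)", "handempty"}),
    "put-into(banana, drawer)": ({"holding(banana)", "open(drawer)"},
                                 {"in(banana, drawer)", "handempty"},
                                 {"holding(banana)"}),
    "close(drawer)": ({"open(drawer)", "handempty"},
                      {"closed(drawer)"}, {"open(drawer)"}),
}


def apply_action(state, name):
    """Symbolic successor state after applying an operator."""
    _, add, delete = OPERATORS[name]
    return frozenset((state - delete) | add)


def plan(state, goal):
    """Breadth-first search for a shortest action sequence reaching the goal."""
    start = frozenset(state)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        current, actions = queue.popleft()
        if goal <= current:
            return actions
        for name, (pre, _, _) in OPERATORS.items():
            if pre <= current:
                nxt = apply_action(current, name)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, actions + [name]))
    return None


def run_closed_loop(initial_state, goal, max_steps=20):
    """Re-plan from the latest state estimate before every action."""
    state, executed = frozenset(initial_state), []
    for _ in range(max_steps):
        actions = plan(state, goal)
        if not actions:  # empty plan (goal reached) or no plan found
            break
        # Stand-in for executing on the robot and re-running the classifier.
        state = apply_action(state, actions[0])
        executed.append(actions[0])
    return executed


initial = {"closed(drawer)", "on(banana, table)", "handempty"}
goal = {"in(banana, drawer)", "closed(drawer)"}
executed = run_closed_loop(initial, goal)
print(executed)
```

In a real system the simulated update would be replaced by predicate predictions on the current camera image, so a failed or perturbed action simply shows up as a different state at the next re-plan.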
Annotation cost (Appx. A): Defining pre/post for 171 actions at ~10 min each: ~30 hours, or 4 8-hour days. Manually labeling the same partial states would take 690 days — a 172× reduction.
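The arithmetic behind these figures can be checked directly. The inputs below are the numbers quoted above; the rounding to whole work days is my reading of how "~30 hours, or 4 8-hour days" was obtained.

```python
NUM_ACTIONS = 171
MINUTES_PER_ACTION = 10
HOURS_PER_WORK_DAY = 8
MANUAL_LABELING_DAYS = 690

definition_hours = NUM_ACTIONS * MINUTES_PER_ACTION / 60
definition_days = definition_hours / HOURS_PER_WORK_DAY
# The reduction factor uses the rounded day count.
reduction = MANUAL_LABELING_DAYS / round(definition_days)

print(definition_hours, definition_days, reduction)
```

The exact values come out slightly under the rounded ones in the text (28.5 hours rather than 30), which does not change the order of magnitude of the savings.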
From the authors:
What I noticed reading it:
This is the direct ablation target for BLADE's predicate-grounding component. BLADE explicitly quotes Migimatsu & Bohg [66] as the closest prior approach and frames its own contribution as a near-drop-in replacement: instead of supervising the classifier only with the first and last frames of each demo segment, BLADE auto-labels every frame by propagating eff_a (the effects of action a) forward until another behavior alters it. The empirical gap is large and load-bearing (+20.7% F1 overall, +16.3% on object states, +38.6% on spatial relations), and BLADE's CALVIN planning-success gains flow directly from this richer classifier supervision. So this paper matters to me for three reasons:
Topic-page worthy: this paper anchors the "predicate grounding from action pre/post-conditions" subtree of the predicate learning topic (if/when written) alongside Konidaris [32 in BLADE refs] and Chitnis et al.'s bilevel planning.
Rather than learning visual groundings from direct labels of symbolic state, we propose to learn them indirectly from visual examples of symbolic actions. Actions change the symbolic state in a predefined manner according to their pre- and post-conditions. — §I, Introduction
š_DNF is the largest set of propositions that is fully determined by a DNF. — Theorem 1.1, Appx. B
Half DNF's worse performance indicates that seeing the visual changes induced by each action is more beneficial than simply receiving more data (twice the amount). — §V-B.3, Gridworld results
In many cases, predicates learned from a general large-scale dataset may not be applicable to custom task planning domains, where accuracy is critical. However, our labeling framework would perhaps be most useful for these very applications where collecting new datasets is necessary. — §VI, Conclusion
far(a)) from push/pull
demonstrations. Migimatsu & Bohg generalize to arbitrary
PDDL operators.