ACID: Action Consistency via Inverse Dynamics for Planning with World Models

The Realization Gap

Search-based planning samples candidate action sequences, simulates them through a world model, and executes the one that minimizes a planning cost. But that cost, defined solely by the distance between the goal and the predicted terminal state, cannot see whether each transition is realizable, that is, whether the conditioning action could actually produce it in the environment. A visually plausible terminal state does not imply realizability: a planner can commit to an action sequence whose predicted goal-reaching cannot be reproduced when the actions are executed.

What makes realizability measurable is a property specific to embodied control: unlike a caption that admits many valid images, a pair of consecutive observations strongly constrains the action between them. So an inverse dynamics model (IDM) can tell whether each predicted step is consistent with its conditioning action — and we repurpose the IDM as a decision-time verifier rather than the offline action decoder or pseudo-labeler it has conventionally been.

Method

ACID modifies only the planning cost. The world model, the IDM, and the CEM optimizer are left untouched, so ACID composes with any action-conditioned world model. The new cost scores the whole predicted trajectory, not only its terminal state.

Cycle Action Consistency

If a predicted transition ẑ_{t+1} = F_θ(ẑ_t, a_t) truly reflects a_t, then the action that an IDM G_φ infers from (ẑ_t, ẑ_{t+1}) should recover it. The residual between a_t and the inferred action vanishes exactly when the transition is the one a_t produces, and grows as the prediction drifts.

\( c_a(a_{0:H-1}) = \frac{1}{H}\sum_{t=0}^{H-1}\bigl\| a_t - G_\phi(\hat{z}_t, \hat{z}_{t+1}) \bigr\|_2^2 \)

This cost reuses the rollout already computed for the goal cost, so it requires no extra world-model rollout.
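The cost above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `action_consistency_cost`, the array shapes, and the toy additive dynamics (for which the exact inverse dynamics is the latent difference) are all assumptions made for the example.

```python
import numpy as np

def action_consistency_cost(actions, latents, idm):
    """Mean squared residual between conditioning and IDM-inferred actions.

    actions: (N, H, A) candidate action sequences
    latents: (N, H+1, D) predicted latents z_0 ... z_H from one rollout
    idm:     callable mapping (z_t, z_{t+1}) batches to inferred actions (..., A)
    """
    inferred = idm(latents[:, :-1], latents[:, 1:])         # (N, H, A)
    residual = np.sum((actions - inferred) ** 2, axis=-1)   # squared L2 per step, (N, H)
    return residual.mean(axis=1)                            # average over the horizon, (N,)

# Hypothetical toy setting: z_{t+1} = z_t + a_t, so the exact IDM is z_{t+1} - z_t.
def toy_idm(z, z_next):
    return z_next - z

rng = np.random.default_rng(0)
N, H, A = 4, 5, 2
actions = rng.normal(size=(N, H, A))
z0 = np.zeros((N, 1, A))
latents = np.concatenate([z0, np.cumsum(actions, axis=1)], axis=1)  # action-consistent rollout

consistent = action_consistency_cost(actions, latents, toy_idm)
# Perturbing the predicted latents mimics a rollout that drifts from its actions.
noisy_latents = latents + rng.normal(scale=0.5, size=latents.shape)
drifted = action_consistency_cost(actions, noisy_latents, toy_idm)
```

In this toy case the consistent rollout has a residual of essentially zero, while the perturbed rollout is penalized, which is the behavior the cost relies on.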

Scale-Invariant Adaptive Weight

CEM selects elites by rank, so what matters is the spread of each cost across candidates, not its absolute scale. The relative spread of the goal cost c_g and the action consistency cost c_a varies across world models, tasks, and even CEM iterations. We equalize their spread at every iteration so neither term dominates the ranking, leaving a single λ to tune.

\( w_a = \lambda \cdot \dfrac{\sigma_g}{\sigma_a}, \qquad \sigma_g = \mathrm{std}_n\!\bigl(c_g^{(n)}\bigr),\; \sigma_a = \mathrm{std}_n\!\bigl(c_a^{(n)}\bigr) \)

λ is tuned once per world model and transfers across tasks.
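A minimal sketch of the adaptive weight, assuming the per-candidate costs are available as 1-D arrays. The helper name `adaptive_weight`, the small `eps` guard against a zero standard deviation, and the value of `lam` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def adaptive_weight(goal_costs, action_costs, lam, eps=1e-8):
    """w_a = lam * std(c_g) / std(c_a), computed over the N candidates."""
    sigma_g = np.std(goal_costs)
    sigma_a = np.std(action_costs)
    return lam * sigma_g / (sigma_a + eps)  # eps is an assumed numerical guard

rng = np.random.default_rng(0)
c_g = rng.uniform(0.0, 1.0, size=64)   # stand-in goal costs
c_a = rng.uniform(0.0, 1.0, size=64)   # stand-in action consistency costs
lam = 0.05                             # illustrative value

# Rescaling c_a by 1000x leaves the spread of the weighted term unchanged.
spread_small = np.std(adaptive_weight(c_g, c_a, lam) * c_a)
spread_large = np.std(adaptive_weight(c_g, 1000.0 * c_a, lam) * (1000.0 * c_a))
```

Both spreads equal λ·σ_g up to the `eps` guard, so the ratio between the two terms seen by the ranking is set by λ alone.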

The augmented cost simply reranks the goal cost by realizability:

\( c(a_{0:H-1}) = \underbrace{\bigl\| \hat{z}_H - z_g \bigr\|_2^2}_{\text{goal cost } c_g} \; + \; w_a \cdot c_a(a_{0:H-1}) \)

The planner then prefers sequences whose predicted trajectory both reaches a goal-like state and remains step-by-step realizable.
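Putting the pieces together, one way to plug the augmented cost into a CEM loop is sketched below. Everything here is a hedged toy: `rollout`, `augmented_cost`, `cem_plan`, the hyperparameters, and the toy models are hypothetical. The toy world model moves freely by any action, while the toy IDM is assumed to have been trained on an environment whose steps are clipped to [-1, 1], so oversized actions produce a residual.

```python
import numpy as np

def rollout(world_model, z0, actions):
    """Roll N action sequences (N, H, A) from z0 (D,); returns latents (N, H+1, D)."""
    z = np.broadcast_to(z0, (actions.shape[0], z0.shape[0])).copy()
    latents = [z]
    for t in range(actions.shape[1]):
        z = world_model(z, actions[:, t])
        latents.append(z)
    return np.stack(latents, axis=1)

def augmented_cost(actions, latents, z_goal, idm, lam, eps=1e-8):
    """Goal cost plus adaptively weighted action consistency cost, per candidate."""
    c_g = np.sum((latents[:, -1] - z_goal) ** 2, axis=-1)
    inferred = idm(latents[:, :-1], latents[:, 1:])
    c_a = np.sum((actions - inferred) ** 2, axis=-1).mean(axis=1)
    w_a = lam * np.std(c_g) / (np.std(c_a) + eps)   # recomputed every iteration
    return c_g + w_a * c_a

def cem_plan(world_model, idm, z0, z_goal, horizon, action_dim, lam,
             num_samples=300, num_elites=30, num_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(num_iters):
        noise = rng.normal(size=(num_samples, horizon, action_dim))
        actions = mean + std * noise
        latents = rollout(world_model, z0, actions)   # single rollout, used by both costs
        costs = augmented_cost(actions, latents, z_goal, idm, lam)
        elites = actions[np.argsort(costs)[:num_elites]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6               # keep the sampling std positive
    return mean

# Hypothetical toy models (not from the paper).
def toy_world_model(z, a):
    return z + a

def toy_idm(z, z_next):
    return np.clip(z_next - z, -1.0, 1.0)

z0 = np.zeros(2)
z_goal = np.array([2.0, -1.5])
plan = cem_plan(toy_world_model, toy_idm, z0, z_goal, horizon=4, action_dim=2, lam=0.05)
predicted_final = z0 + plan.sum(axis=0)
```

Only `augmented_cost` differs from a goal-cost-only planner; the sampling, rollout, and elite refit are the standard CEM steps.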

Main Results

Success rate (%) on rigid manipulation (Cube), articulated control (Reacher), and contact-rich pushing (PushT) with the Le-WM and PLDM JEPA-style world models. Original plans with the goal cost only; Ours adds the action consistency cost. Values in parentheses give the change over Original. Higher is better.

World model	Planning cost	Cube	Reacher	PushT
Le-WM	Original	70.0	76.0	96.0
Le-WM	Ours	74.0(+4.0)	88.0(+12.0)	100.0(+4.0)
PLDM	Original	58.0	76.0	72.0
PLDM	Ours	68.0(+10.0)	90.0(+14.0)	76.0(+4.0)

Deformable manipulation with DINO-WM (Chamfer distance) and goal-conditioned visual navigation with NWM trained with CompACT (ATE / RPE). Lower is better for both.

DINO-WM	Rope	Granular
Original	1.38	0.49
Ours	0.56(−0.82)	0.30(−0.19)

NWM w/ CompACT	ATE	RPE (trans.)
Original	1.3141	0.3831
Ours	1.2835(−2.3%)	0.3773(−1.5%)

Gains hold across four world models — three JEPA-style latent predictors (Le-WM, PLDM, DINO-WM) and one video generative model (NWM) — and six tasks, from rigid and deformable object manipulation to articulated control and visual navigation.

Why It Works: Filtering Unrealizable Trajectories

The complex dynamics of deformable and granular media make many CEM candidate trajectories non-realizable — precisely the failure mode the action consistency cost is designed for. Under the Original cost, the planned action sequence reaches the goal in the model's predicted trajectory but drifts away from it in the actual environment rollout, leaving the goal cost satisfied by an unreachable future. Adding the action consistency cost (Ours) removes this discrepancy: the real rollout closely tracks the predicted trajectory and reaches the goal.

[Figure panels: Baseline vs. Ours rollouts for Le-WM (OGBench-Cube, Reacher, Push-T), PLDM (OGBench-Cube, Reacher, Push-T, each shown next to its goal image), and DINO-WM (Granular, Rope). In every panel the Baseline rollout is labeled Fail and the Ours rollout is labeled Success.]

Real vs. imagined rollouts. Executing the planned action sequence in the environment, the Baseline (goal cost only) drifts away from the goal, while Ours (with action consistency) keeps the real rollout on track and reaches the goal — across three JEPA-style latent world models (Le-WM, PLDM, DINO-WM) and their respective tasks.

Robust to Budget, λ, and Cost Scale

Sweeping the CEM budget from 30 to 300 samples, Ours outperforms Original at every budget. Sweeping λ over more than an order of magnitude (0.005–0.1) yields consistent gains, so no careful per-task tuning is needed. In the constant-weight comparison below, no constant value is competitive everywhere: each one either fails to improve at least one task (w_a = 1 and w_a = 5) or improves all three by only half as much as the adaptive weight (w_a = 10), whereas the scale-invariant adaptive weight yields the largest total improvement.

CEM budget sweep on Reacher (success rate %)

N =	30	50	150	300
Le-WM — Original	68	62	70	76
Le-WM — Ours	82	76	78	88
PLDM — Original	58	66	70	76
PLDM — Ours	78	84	82	84

On both Le-WM and PLDM, Ours at the smallest budget (N=30; 82 and 78) already exceeds the full-budget Original (N=300; 76 for both).
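The comparison can be checked directly from the budget sweep table. The short script below only re-reads the reported numbers; the helper name `smallest_matching_budget` is made up for this illustration.

```python
budgets = [30, 50, 150, 300]
# Reacher success rates (%) copied from the budget sweep table above.
success = {
    ("Le-WM", "Original"): [68, 62, 70, 76],
    ("Le-WM", "Ours"): [82, 76, 78, 88],
    ("PLDM", "Original"): [58, 66, 70, 76],
    ("PLDM", "Ours"): [78, 84, 82, 84],
}

def smallest_matching_budget(model):
    """Smallest N at which Ours reaches the full-budget (N=300) Original."""
    target = success[(model, "Original")][-1]
    for n, rate in zip(budgets, success[(model, "Ours")]):
        if rate >= target:
            return n
    return None

result = {model: smallest_matching_budget(model) for model in ("Le-WM", "PLDM")}
# result == {"Le-WM": 30, "PLDM": 30}
```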

Constant vs. adaptive weight on Le-WM (Δ over Original)

Task	Constant w_a=1	=5	=10	Ours
Cube	+0.0	+0.0	+2.0	+4.0
PushT	+0.0	−2.0	+2.0	+4.0
Reacher	+14.0	−2.0	+6.0	+12.0
Total Δ	+14.0	−4.0	+10.0	+20.0

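No constant weight is competitive on all three tasks: w_a = 1 helps only Reacher, w_a = 5 hurts two tasks, and w_a = 10 improves every task but by half as much as the adaptive weight. The adaptive weight removes this dependence on the cost scale and yields the largest total gain.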
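As a quick arithmetic check, the totals and per-task winners can be recomputed from the table entries. The dictionary layout is just a convenient encoding for this example.

```python
# Change in success rate over Original on Le-WM, copied from the table above.
deltas = {
    "w_a=1": {"Cube": 0.0, "PushT": 0.0, "Reacher": 14.0},
    "w_a=5": {"Cube": 0.0, "PushT": -2.0, "Reacher": -2.0},
    "w_a=10": {"Cube": 2.0, "PushT": 2.0, "Reacher": 6.0},
    "Ours": {"Cube": 4.0, "PushT": 4.0, "Reacher": 12.0},
}

# Sum over tasks for each weighting scheme.
totals = {name: sum(per_task.values()) for name, per_task in deltas.items()}

# Which scheme gives the largest gain on each task.
tasks = ["Cube", "PushT", "Reacher"]
best_per_task = {task: max(deltas, key=lambda name: deltas[name][task]) for task in tasks}
```

The totals reproduce the last table row (+14, −4, +10, +20). The adaptive weight gives the largest gain on Cube and PushT, and w_a = 1 gives the largest gain on Reacher.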

Less Total Compute to Target Quality

The verifier is lightweight: a single Euler step of the flow-matching IDM adds only a small fraction of the world-model forward latency per CEM iteration, and reuses the trajectory already rolled out for the goal cost. Despite this modest per-step overhead, ACID reaches the baseline's final Chamfer distance in less than half the planning steps — a net compute of roughly 0.7× to match baseline quality. Because total planning compute scales with both the number of CEM samples and planning steps, reducing either factor directly lowers cost, and Ours reaches a given quality with strictly less compute along both axes.

Mean Chamfer distance over planning steps on Granular and Rope. Ours reaches Original's late-step plateau several steps earlier on Granular, and already at the very first planning step on Rope.

BibTeX

@article{seo2026acid,
  title     = {ACID: Action Consistency via Inverse Dynamics for Planning with World Models},
  author    = {Seo, Gawon and Kim, Dongwon and Kwak, Suha},
  journal   = {arXiv preprint arXiv:2607.02403},
  year      = {2026}
}
