Thanks for the great question!

To clarify: all post-training rows in Extended Data Table 1 (except the SFT row) use closed-loop RL, including "post-training on common logs". The difference between rows is not open-loop vs. closed-loop, but the source of scenarios and the behaviour model used for the other agents. The reward structure is the same across all configurations: a hard-constraint gate for collision / drivable-area compliance, multiplied by a weighted average of soft objectives (progress, TTC, comfort).

Here is what each row means:

- Supervised fine-tuning on rare logs — the only non-RL row. Standard imitation learning on rare-event logs, no reward signal. Raw data.
- Post-training on common logs — closed-loop RL, but scenarios are drawn from ordinary driving logs (not long-tail). Raw data.
- Post-training on rare logs — closed-loop RL on failure-prone scenarios discovered from real logs. Raw data.
- Post-training on rare synthetic replays — RL on rare scenarios where other agents' behaviour is faithfully replayed from logs (non-reactive). Synthetic data (3DGS).
- Post-training on rare rollouts w/o Behaviour WM — RL on rare scenarios with reactive other agents (IDM). This is the first fully interactive closed-loop setting. Synthetic data (3DGS).
- Post-training with World Engine (full) — builds on the above by adding the Behaviour World Model, which generates diverse counterfactual traffic variations via goal conditioning and optimization guidance. Synthetic data (3DGS).

Once the arXiv preprint is ready, you will find more details there. Coming soon.
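The gated reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function signature, weights, and the assumption that each soft objective is normalized to [0, 1] are all mine.

```python
def gated_reward(collision_free, on_drivable_area,
                 progress, ttc_score, comfort,
                 weights=(0.5, 0.3, 0.2)):
    """Hard-constraint gate multiplied by a weighted average of soft objectives.

    Illustrative sketch: soft objectives are assumed normalized to [0, 1],
    and the weights here are placeholder values, not from the paper.
    """
    # Hard-constraint gate: any collision or drivable-area violation
    # zeroes out the reward regardless of the soft scores.
    gate = 1.0 if (collision_free and on_drivable_area) else 0.0

    w_progress, w_ttc, w_comfort = weights
    soft = w_progress * progress + w_ttc * ttc_score + w_comfort * comfort
    return gate * soft
```

The multiplicative gate means a rollout that collides earns zero reward no matter how good its progress or comfort scores are, e.g. `gated_reward(False, True, 1.0, 1.0, 1.0)` returns `0.0`.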
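The IDM mentioned in the "w/o Behaviour WM" row is the Intelligent Driver Model, a standard reactive car-following model. A minimal sketch of its longitudinal acceleration law, with typical default parameter values (the paper's actual parameters are not stated here):

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4):
    """Intelligent Driver Model longitudinal acceleration (m/s^2).

    v: ego speed (m/s), v_lead: lead-vehicle speed (m/s),
    gap: bumper-to-bumper distance to the lead vehicle (m).
    Parameter values are common textbook defaults, not from the paper.
    """
    dv = v - v_lead  # closing speed
    # Desired dynamic gap: jam distance + time headway + braking term.
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))
    # Free-road term minus interaction term.
    return a_max * (1 - (v / v0) ** delta - (s_star / max(gap, 1e-6)) ** 2)
```

Because the interaction term reacts to the gap and closing speed, IDM agents brake for the ego vehicle instead of blindly replaying logged trajectories, which is what makes this row the first fully interactive closed-loop setting.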