Thanks for the great question!

To clarify: all post-training rows in Extended Data Table 1 (except the SFT row) use closed-loop RL, including "post-training on common logs". The difference between rows is not open-loop vs. closed-loop, but the source of scenarios and the behaviour model used for the other agents. The reward structure is the same across all configurations: a hard-constraint gate for collision / drivable-area compliance, multiplied by a weighted average of soft objectives (progress, TTC, comfort).

Here is what each row means:

- Supervised fine-tuning on rare logs — the only non-RL row. Standard imitation learning on rare-event logs, no reward signal. Raw data.
- Post-training on common logs — closed-loop RL, but scenarios are drawn from ordinary driving logs (not long-tail). Raw data.
- Post-training on rare logs — closed-loop RL on failure-prone scenarios discovered from real logs. Raw data.
- Post-training on rare synthetic replays — RL on rare scenarios where other agents' behaviour is faithfully replayed from logs (non-reactive). Synthetic data (3DGS).
- Post-training on rare rollouts w/o Behaviour WM — RL on rare scenarios with reactive other agents (IDM). This is the first fully interactive closed-loop setting. Synthetic data (3DGS).
- Post-training with World Engine (full) — builds on the above by adding the Behaviour World Model, which generates diverse counterfactual traffic variations via goal conditioning and optimization guidance. Synthetic data (3DGS).

Once the arXiv preprint is ready, you will find more details there. Coming soon.
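The gated reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function signature, weights, and the assumption that each soft objective is normalized to [0, 1] are all mine.

```python
def gated_reward(collision_free, on_drivable_area,
                 progress, ttc_score, comfort,
                 weights=(0.5, 0.3, 0.2)):
    """Hard-constraint gate multiplied by a weighted average of soft objectives.

    Illustrative sketch: soft objectives are assumed normalized to [0, 1],
    and the weights here are placeholder values, not from the paper.
    """
    # Hard-constraint gate: any collision or drivable-area violation
    # zeroes out the reward regardless of the soft scores.
    gate = 1.0 if (collision_free and on_drivable_area) else 0.0

    w_progress, w_ttc, w_comfort = weights
    soft = w_progress * progress + w_ttc * ttc_score + w_comfort * comfort
    return gate * soft
```

The multiplicative gate means a rollout that collides earns zero reward no matter how good its progress or comfort scores are, e.g. `gated_reward(False, True, 1.0, 1.0, 1.0)` returns `0.0`.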
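The IDM mentioned in the "w/o Behaviour WM" row is the Intelligent Driver Model, a standard reactive car-following model. A minimal sketch of its longitudinal acceleration law, with typical default parameter values (the paper's actual parameters are not stated here):

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=30.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4):
    """Intelligent Driver Model longitudinal acceleration (m/s^2).

    v: ego speed (m/s), v_lead: lead-vehicle speed (m/s),
    gap: bumper-to-bumper distance to the lead vehicle (m).
    Parameter values are common textbook defaults, not from the paper.
    """
    dv = v - v_lead  # closing speed
    # Desired dynamic gap: jam distance + time headway + braking term.
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a_max * b))
    # Free-road term minus interaction term.
    return a_max * (1 - (v / v0) ** delta - (s_star / max(gap, 1e-6)) ** 2)
```

Because the interaction term reacts to the gap and closing speed, IDM agents brake for the ego vehicle instead of blindly replaying logged trajectories, which is what makes this row the first fully interactive closed-loop setting.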