Thanks for the great question!

To clarify: all post-training rows in Extended Data Table 1 (except the SFT row) are closed-loop RL, including "post-training on common logs". The rows differ not in open-loop vs. closed-loop training, but in the source of scenarios and the behaviour model used for the other agents. The reward structure is identical across all configurations: a hard-constraint gate for collision / drivable-area compliance, multiplied by a weighted average of soft objectives (progress, TTC, comfort).
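For concreteness, the reward structure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, weights, and the assumption that each soft objective is pre-normalized to [0, 1] are all hypothetical.

```python
def reward(collision_free: bool, on_drivable_area: bool,
           progress: float, ttc: float, comfort: float,
           weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Hard-constraint gate multiplied by a weighted average of soft objectives.

    Soft objectives (progress, ttc, comfort) are assumed normalized to [0, 1];
    the weights here are illustrative placeholders, not values from the paper.
    """
    # Hard-constraint gate: any collision or drivable-area violation
    # zeroes out the entire reward.
    gate = 1.0 if (collision_free and on_drivable_area) else 0.0

    # Weighted average of the soft objectives.
    w_progress, w_ttc, w_comfort = weights
    soft = (w_progress * progress + w_ttc * ttc + w_comfort * comfort) / sum(weights)

    return gate * soft
```

The multiplicative gate makes the hard constraints non-tradable: no amount of progress or comfort can compensate for a collision, which an additive penalty formulation would allow.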

Here is what each row means:

Supervised fine-tuning on rare logs — the only non-RL row. Standard imitation learning on rare-event logs, with no reward signal; the model is trained directly on the raw data.

Post-training on commo…

Answer selected by WCJ-BERT