motivation: diffusion model for planning
Diffusion models are now widely applied to traffic simulation, but no prior work uses a diffusion model for planning.
(a): this paradigm does not account for the inherent uncertainty and multi-modal nature of driving behaviors.
Cannot represent uncertainty.
(b): this large fixed-vocabulary paradigm is fundamentally constrained by the number and quality of anchor trajectories, often failing in out-of-vocabulary scenarios. Furthermore, managing a large number of anchors presents significant computational challenges for real-time applications.[VAD-v2]
Depends on the number and quality of anchors; the action space is discrete.
(c): inference latency is too high.
contribution
- apply the diffusion model to end-to-end autonomous driving planning
- design an efficient transformer-based diffusion decoder that interacts with the conditional information in a cascaded manner for better trajectory reconstruction; improves vanilla diffusion --> Truncated Diffusion
- SOTA performance with high trajectory diversity
overall
TransfuserDP (vanilla scheme)
We begin from the representative deterministic end-to-end planner Transfuser and turn it into a generative model, TransfuserDP, by simply replacing the regression MLP layers with a conditional diffusion UNet, following vanilla diffusion policy.
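A minimal sketch of this replacement, with a small conditional MLP standing in for the diffusion-policy UNet to keep the code self-contained (all class and argument names here are hypothetical, not the paper's):

```python
import torch
import torch.nn as nn

class TransfuserDPHead(nn.Module):
    """Stand-in for the TransfuserDP planning head: the deterministic
    regression MLP is replaced by a conditional denoiser that predicts the
    noise added to a trajectory, as in vanilla diffusion policy. The actual
    model uses a conditional UNet; an MLP keeps this sketch short."""

    def __init__(self, cond_dim: int, horizon: int = 8, point_dim: int = 2):
        super().__init__()
        traj_dim = horizon * point_dim
        self.denoiser = nn.Sequential(
            nn.Linear(traj_dim + cond_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, traj_dim),
        )

    def forward(self, noisy_traj, t, cond):
        # noisy_traj: (B, traj_dim); t: (B, 1) normalized timestep;
        # cond: (B, cond_dim) scene feature from the Transfuser backbone.
        return self.denoiser(torch.cat([noisy_traj, t, cond], dim=-1))
```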
Problems with this scheme:
- Mode collapse: very low diversity
- Heavy denoising overhead: very slow inference
Truncated Diffusion
Forward process: We first construct the diffusion process by adding Gaussian noise to anchors $\{\mathbf{a}^i\}_{i=1}^{N_a}$ clustered by K-Means on the training set, where $N_a$ is the number of anchors. We truncate the diffusion noise schedule to diffuse the anchors to the anchored Gaussian distribution:

$$\mathbf{x}^i_t = \sqrt{\bar{\alpha}_t}\,\mathbf{a}^i + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where $t \in \{1, \dots, T_{\text{trunc}}\}$ and $T_{\text{trunc}}$ is the number of truncated diffusion steps. This samples $N_a$ noisy trajectories.
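A sketch of this truncated forward process, assuming scikit-learn's `KMeans` for anchor clustering and a precomputed noise-schedule value $\bar{\alpha}_t$ (variable names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_anchors(train_trajs: np.ndarray, n_anchors: int = 20) -> np.ndarray:
    """Cluster ground-truth trajectories (N, horizon*2) into N_a anchors."""
    km = KMeans(n_clusters=n_anchors, n_init=10).fit(train_trajs)
    return km.cluster_centers_  # (N_a, horizon*2)

def diffuse_anchors(anchors: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """Truncated forward step: x_t = sqrt(a_bar)*a + sqrt(1-a_bar)*eps.
    Because t stays small (truncated schedule), x_t remains a Gaussian
    around each anchor instead of collapsing to pure noise."""
    eps = np.random.randn(*anchors.shape)
    return np.sqrt(alpha_bar_t) * anchors + np.sqrt(1.0 - alpha_bar_t) * eps
```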
Reverse process (diffusion decoder): the diffusion decoder $f_\theta$ takes as input the noisy trajectories $\{\mathbf{x}^i_t\}_{i=1}^{N_a}$ and predicts classification scores $\{\hat{s}^i\}_{i=1}^{N_a}$ and denoised trajectories $\{\hat{\boldsymbol{\tau}}^i\}_{i=1}^{N_a}$:

$$(\hat{s}^i, \hat{\boldsymbol{\tau}}^i) = f_\theta(\mathbf{x}^i_t, t, \mathbf{z}),$$

where $\mathbf{z}$ represents the conditional information. The decoder thus recovers the $N_a$ noisy trajectories into $N_a$ candidate trajectories.
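A rough sketch of such a decoder, collapsing the cascaded conditional interaction into a single cross-attention layer and omitting the timestep embedding for brevity (names and layer choices are illustrative, not the paper's actual architecture):

```python
import torch
import torch.nn as nn

class DiffusionDecoder(nn.Module):
    """Sketch of f_theta: each noisy trajectory becomes a query that
    cross-attends to the conditional tokens z (e.g., BEV/scene features),
    then is decoded into a confidence score and denoised coordinates."""

    def __init__(self, d_model: int = 256, horizon: int = 8, point_dim: int = 2):
        super().__init__()
        traj_dim = horizon * point_dim
        self.embed = nn.Linear(traj_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8,
                                                batch_first=True)
        self.score_head = nn.Linear(d_model, 1)         # classification logit
        self.coord_head = nn.Linear(d_model, traj_dim)  # denoised trajectory

    def forward(self, noisy_trajs, cond_tokens):
        # noisy_trajs: (B, N_a, traj_dim); cond_tokens: (B, L, d_model)
        q = self.embed(noisy_trajs)
        q, _ = self.cross_attn(q, cond_tokens, cond_tokens)
        return self.score_head(q).squeeze(-1), self.coord_head(q)
```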
Loss function: We assign the noisy trajectory around the anchor closest to the ground-truth trajectory $\boldsymbol{\tau}_{\text{gt}}$ as the positive sample $\hat{\boldsymbol{\tau}}_{\text{pos}}$ and the others as negative samples. The training objective combines trajectory reconstruction and classification:

$$\mathcal{L} = \mathcal{L}_{\text{L1}}(\hat{\boldsymbol{\tau}}_{\text{pos}}, \boldsymbol{\tau}_{\text{gt}}) + \lambda\,\mathcal{L}_{\text{BCE}}(\hat{s}, y),$$

where $\lambda$ balances the simple L1 reconstruction loss and the binary cross-entropy (BCE) classification loss.
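A minimal sketch of this training objective, assuming the index of the positive sample has already been computed from anchor-to-ground-truth distances (function and argument names are hypothetical):

```python
import torch
import torch.nn.functional as F

def truncated_diffusion_loss(scores, pred_trajs, gt_traj, pos_idx, lam=1.0):
    """scores: (B, N_a) classification logits; pred_trajs: (B, N_a, D)
    denoised trajectories; gt_traj: (B, D); pos_idx: (B,) index of the noisy
    trajectory whose anchor is closest to the ground truth (positive)."""
    B = scores.shape[0]
    pos_traj = pred_trajs[torch.arange(B), pos_idx]           # (B, D)
    rec = F.l1_loss(pos_traj, gt_traj)                        # L1 reconstruction
    target = torch.zeros_like(scores)
    target[torch.arange(B), pos_idx] = 1.0                    # one-hot positives
    cls = F.binary_cross_entropy_with_logits(scores, target)  # BCE classification
    return rec + lam * cls
```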
Inference. We use a truncated denoising process that starts with noisy trajectories sampled from the anchored Gaussian distribution and progressively denoises them into the final predictions. At each denoising timestep, the estimated trajectories from the previous step are passed to the diffusion decoder $f_\theta$, which predicts classification scores $\hat{s}^i$ and coordinates $\hat{\boldsymbol{\tau}}^i$. After obtaining the current timestep's predictions, we apply the DDIM update rule to sample the trajectories for the next timestep.
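A sketch of this truncated denoising loop, under the assumption that the decoder predicts the clean trajectory (i.e., $x_0$) directly and using the deterministic DDIM update ($\eta = 0$); `alpha_bars` (cumulative schedule, indexable by timestep) and the descending timestep list `steps` are assumed precomputed:

```python
import torch

@torch.no_grad()
def truncated_ddim_inference(decoder, x_t, cond_tokens, alpha_bars, steps):
    """Denoise x_t (B, N, D), drawn from the anchored Gaussian, over a short
    truncated schedule, e.g. steps = [T_trunc, ..., 1]."""
    for i, t in enumerate(steps):
        scores, x0_hat = decoder(x_t, cond_tokens)   # predict scores and x0
        if i == len(steps) - 1:
            return scores, x0_hat                    # final predictions
        # Deterministic DDIM (eta = 0): recover the implied noise estimate,
        # then re-noise the predicted x0 down to the next (smaller) timestep.
        a_t, a_next = alpha_bars[t], alpha_bars[steps[i + 1]]
        eps_hat = (x_t - a_t.sqrt() * x0_hat) / (1 - a_t).sqrt()
        x_t = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps_hat
```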
Inference flexibility. A key advantage of this approach lies in its inference flexibility: while the model is trained with $N_a$ trajectories, the inference process can accommodate an arbitrary number of trajectory samples $N_{\text{infer}}$, which can be dynamically adjusted based on computational resources or application requirements.
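One way this flexibility could be realized, assuming the $N_{\text{infer}}$ starting trajectories are drawn from the anchored Gaussian by sampling anchors with replacement (this resampling scheme is my assumption, not taken from the source):

```python
import numpy as np

def sample_initial_trajs(anchors, alpha_bar_t, n_infer):
    """Draw an arbitrary number n_infer of starting trajectories from the
    anchored Gaussian: pick anchors (with replacement when n_infer > N_a)
    and add truncated-schedule noise around each."""
    idx = np.random.choice(len(anchors), size=n_infer, replace=True)
    chosen = anchors[idx]
    eps = np.random.randn(*chosen.shape)
    return np.sqrt(alpha_bar_t) * chosen + np.sqrt(1.0 - alpha_bar_t) * eps
```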
experiments
dataset: NAVSIM navtest split