image-20241223062443182

motivation: diffusion model for planning

Diffusion models are now widely applied to traffic simulation, but so far none have been used for planning.

image-20241223062619336

(a): this paradigm does not account for the inherent uncertainty and multi-mode nature of driving behaviors.

Cannot represent uncertainty.

(b): this large fixed-vocabulary paradigm is fundamentally constrained by the number and quality of anchor trajectories, often failing in out-of-vocabulary scenarios. Furthermore, managing a large number of anchors presents significant computational challenges for real-time applications.[VAD-v2]

Depends on the number and quality of anchors; the action space is discrete.

(c): inference latency is too high.

contribution

  1. Apply a diffusion model to end-to-end autonomous driving planning.
  2. Design an efficient transformer-based diffusion decoder that interacts with the conditional information in a cascaded manner for better trajectory reconstruction. This improves on conventional diffusion --> Truncated Diffusion.
  3. SOTA performance and diversity.

overall

image-20241223091420956

TransfuserDP(vanilla scheme)

We begin from the representative deterministic end-to-end planner Transfuser and turn it into a generative model, TransfuserDP, by simply replacing the regression MLP layers with the conditional diffusion model UNet, following the vanilla diffusion policy.

Problems

  1. Mode collapse: diversity is very low.
  2. Heavy denoising overhead: inference is very slow.

Truncated Diffusion

image-20241223101555781

Forward process: We first construct the diffusion process by adding Gaussian noise to anchors $\{\mathbf{a}_k\}_{k=1}^{N_{\text{anchor}}}$ clustered by K-Means on the training set, where $\mathbf{a}_k = \{(x_t, y_t)\}_{t=1}^{T_f}$. We truncate the diffusion noise schedule to diffuse the anchors to the anchored Gaussian distribution:

$$\tau_k^i = \sqrt{\bar{\alpha}^i}\, \mathbf{a}_k + \sqrt{1 - \bar{\alpha}^i}\, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I}),$$

where $i \in [1, T_{\text{trunc}}]$ and $T_{\text{trunc}} \ll T$ is the number of truncated diffusion steps.

This samples $N_{\text{anchor}}$ noisy trajectories.
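The forward noising step can be sketched in a few lines of plain Python. This is only an illustration under assumptions the notes do not state: a linear beta schedule with $T = 1000$, $T_{\text{trunc}} = 50$, and two made-up anchors with $T_f = 3$ waypoints. The helper names `alpha_bar_schedule` and `noise_anchors` are hypothetical, and the anchors are given directly instead of being clustered with K-Means.

```python
import math
import random

def alpha_bar_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an ASSUMED linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars  # alpha_bars[i - 1] is alpha-bar at step i

def noise_anchors(anchors, i, alpha_bars, rng):
    """tau_k^i = sqrt(abar_i) * a_k + sqrt(1 - abar_i) * eps, for every anchor."""
    abar = alpha_bars[i - 1]
    scale, sigma = math.sqrt(abar), math.sqrt(1.0 - abar)
    return [
        [(scale * x + sigma * rng.gauss(0.0, 1.0),
          scale * y + sigma * rng.gauss(0.0, 1.0)) for (x, y) in anchor]
        for anchor in anchors
    ]

# Two toy anchors (made-up values), each a list of (x, y) waypoints.
anchors = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # roughly straight ahead
    [(1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],  # roughly a left turn
]
T, T_trunc = 1000, 50            # T_trunc << T
alpha_bars = alpha_bar_schedule(T)
rng = random.Random(0)
i = rng.randint(1, T_trunc)      # sample a step from the truncated range only
noisy = noise_anchors(anchors, i, alpha_bars, rng)
```

Under this assumed schedule, alpha-bar stays close to 1 throughout the truncated range, so the noisy trajectories remain near their anchors instead of collapsing to a standard Gaussian.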

Reverse process (diffusion decoder): The diffusion decoder $f_{\theta}$ takes as input $N_{\text{anchor}}$ noisy trajectories $\{\tau_{k}^{i}\}_{k=1}^{N_{\text{anchor}}}$ and predicts classification scores $\{\hat{s}_{k}\}_{k=1}^{N_{\text{anchor}}}$ and denoised trajectories $\{\hat{\tau}_{k}\}_{k=1}^{N_{\text{anchor}}}$.

$$\{\hat{s}_{k}, \hat{\tau}_{k}\}_{k=1}^{N_{\text{anchor}}} = f_{\theta}\left(\{\tau_{k}^{i}\}_{k=1}^{N_{\text{anchor}}}, z\right),$$

where $z$ represents the conditional information.

This recovers $N_{\text{anchor}}$ trajectories from the $N_{\text{anchor}}$ noisy trajectories.

Loss function: We assign the noisy trajectory around the anchor closest to the ground-truth trajectory $\tau_{gt}$ as the positive sample ($y_{k}=1$) and the others as negative samples ($y_{k}=0$). The training objective combines trajectory reconstruction and classification:

$$\mathcal{L} = \sum_{k=1}^{N_{\text{anchor}}} \left[ y_{k} \mathcal{L}_{\text{rec}}(\hat{\tau}_{k}, \tau_{gt}) + \lambda\, \text{BCE}(\hat{s}_{k}, y_{k}) \right], \quad (6)$$

where $\lambda$ balances the simple L1 reconstruction loss $\mathcal{L}_{\text{rec}}$ and the binary cross-entropy (BCE) classification loss.
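A minimal sketch of this objective for a single training sample follows. Several details are assumptions, not taken from the notes: scores are treated as probabilities in (0, 1), "closest anchor" is measured with mean L1 distance, the L1 loss is averaged over coordinates, and $\lambda = 1$. All function names are hypothetical.

```python
import math

def l1_loss(pred, target):
    """Mean absolute error over all waypoint coordinates."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / (2 * len(target))

def bce(score, label, eps=1e-7):
    """Binary cross-entropy for one score, ASSUMED to be a probability."""
    p = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def closest_anchor(anchors, gt):
    """Index of the anchor nearest to the ground truth (mean L1 is an assumption)."""
    return min(range(len(anchors)), key=lambda k: l1_loss(anchors[k], gt))

def truncated_diffusion_loss(scores, trajs, anchors, gt, lam=1.0):
    """Sum over k of y_k * L_rec(tau_hat_k, tau_gt) + lam * BCE(s_hat_k, y_k)."""
    pos = closest_anchor(anchors, gt)
    loss = 0.0
    for k, (s, traj) in enumerate(zip(scores, trajs)):
        y = 1.0 if k == pos else 0.0
        loss += y * l1_loss(traj, gt) + lam * bce(s, y)
    return loss

# Toy example with made-up numbers: the second anchor is closest to gt.
anchors = [
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],
    [(1.0, 0.5), (2.0, 1.5), (3.0, 3.0)],
]
gt = [(1.0, 0.4), (2.0, 1.4), (3.0, 2.9)]
scores = [0.1, 0.9]          # decoder confidence per trajectory
trajs = [anchors[0], gt]     # pretend the positive sample is denoised perfectly
loss = truncated_diffusion_loss(scores, trajs, anchors, gt)
```

Only the positive sample contributes a reconstruction term; every sample contributes a classification term, which is what lets the scores rank candidates at inference time.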

Inference. We use a truncated denoising process that starts with noisy trajectories sampled from the anchored Gaussian distribution and progressively denoises them to final predictions. At each denoising timestep, the estimated trajectories from the previous step are passed to the diffusion decoder $f_{\theta}$, which predicts classification scores $\{\hat{s}_{k}\}_{k=1}^{N_{\text{infer}}}$ and coordinates $\{\hat{\tau}_{k}\}_{k=1}^{N_{\text{infer}}}$. After obtaining the current timestep's predictions, we apply the DDIM update rule to sample trajectories for the next timestep.
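The notes do not spell out the DDIM update. As a sketch, assuming the deterministic variant ($\eta = 0$) and treating the decoder output $\hat{\tau}_k$ as the clean-trajectory estimate, one step from timestep $i$ to an earlier timestep $i' < i$ would read:

$$\hat{\boldsymbol{\epsilon}}_k = \frac{\tau_k^{i} - \sqrt{\bar{\alpha}^{i}}\, \hat{\tau}_k}{\sqrt{1 - \bar{\alpha}^{i}}}, \qquad \tau_k^{i'} = \sqrt{\bar{\alpha}^{i'}}\, \hat{\tau}_k + \sqrt{1 - \bar{\alpha}^{i'}}\, \hat{\boldsymbol{\epsilon}}_k .$$

Here $\hat{\boldsymbol{\epsilon}}_k$ is the noise implied by the current sample and the decoder's estimate; at the last denoising step, $\hat{\tau}_k$ itself is taken as the prediction.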

Inference flexibility. A key advantage of our approach lies in its inference flexibility. While the model is trained with $N_{\text{anchor}}$ trajectories, the inference process can accommodate an arbitrary number of trajectory samples $N_{\text{infer}}$, where $N_{\text{infer}}$ can be dynamically adjusted based on computational resources or application requirements.
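The inference loop, including the freedom to choose $N_{\text{infer}}$, can be sketched as below. Everything except the overall flow is an assumption made for illustration: `toy_decoder` is a hypothetical stand-in for $f_{\theta}$ that ignores the conditioning $z$, the beta schedule is linear, the two denoising steps `[50, 25]` are arbitrary, the DDIM update is the deterministic one, and the $N_{\text{infer}}$ starting points are chosen by cycling through the anchors. Trajectories are flattened to `[x1, y1, x2, y2, ...]` to keep the code short.

```python
import math
import random

def alpha_bar(i, T=1000, beta_start=1e-4, beta_end=2e-2):
    """alpha-bar at step i (1-indexed) for an ASSUMED linear beta schedule."""
    prod = 1.0
    for t in range(i):
        prod *= 1.0 - (beta_start + (beta_end - beta_start) * t / (T - 1))
    return prod

def toy_decoder(noisy, anchors):
    """Hypothetical stand-in for f_theta: snaps each sample to its nearest
    anchor and scores it by closeness. A real decoder would also use z."""
    scores, denoised = [], []
    for traj in noisy:
        dists = [sum(abs(a - b) for a, b in zip(traj, anc)) for anc in anchors]
        k = dists.index(min(dists))
        denoised.append(list(anchors[k]))
        scores.append(1.0 / (1.0 + dists[k]))
    return scores, denoised

def ddim_step(x_t, x0_hat, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0) from step t to an earlier step."""
    out = []
    for xt, x0 in zip(x_t, x0_hat):
        eps_hat = (xt - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)
        out.append(math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * eps_hat)
    return out

def truncated_inference(anchors, n_infer, steps, rng):
    """Denoise n_infer samples over the given descending steps; return the
    top-scoring trajectory and all scores."""
    abar = alpha_bar(steps[0])
    # ASSUMPTION: starting points are picked by cycling through the anchors.
    picks = [anchors[k % len(anchors)] for k in range(n_infer)]
    x = [[math.sqrt(abar) * v + math.sqrt(1.0 - abar) * rng.gauss(0.0, 1.0)
          for v in a] for a in picks]
    for idx, step in enumerate(steps):
        scores, x0_hat = toy_decoder(x, anchors)
        if idx + 1 == len(steps):
            break  # last step: keep the decoder's estimate as the output
        abar_t, abar_prev = alpha_bar(step), alpha_bar(steps[idx + 1])
        x = [ddim_step(xt, x0, abar_t, abar_prev) for xt, x0 in zip(x, x0_hat)]
    best = scores.index(max(scores))
    return x0_hat[best], scores

anchors = [[1.0, 0.0, 2.0, 0.0, 3.0, 0.0], [1.0, 0.5, 2.0, 1.5, 3.0, 3.0]]
best_traj, scores = truncated_inference(
    anchors, n_infer=5, steps=[50, 25], rng=random.Random(0))
```

Note that `n_infer=5` differs from the number of anchors (2): nothing in the loop depends on the two being equal, which is the flexibility described above.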

Experiments

Dataset: NAVSIM navtest split

image-20241223102346691