image-20241216032457914

motivation

  1. Most existing end-to-end autonomous driving models are composed of several modules and follow a pipeline of perception, motion prediction, and planning. However, the serial design of prediction and planning in existing pipelines ignores the possible future interactions between the ego car and the other traffic participants.
  2. Also, future trajectories are highly structured and share a common prior (e.g., most trajectories are continuous and straight lines). Still, most existing methods fail to consider this structural prior, leading to inaccurate prediction and planning.

First, prediction and planning should not be performed serially; second, trajectories themselves should carry prior knowledge.

contribution

models autonomous driving as a trajectory generation problem

  1. Our GenAD can simultaneously perform motion prediction and planning using the unified future trajectory generation model.

  2. To model the structural prior of future trajectories, we learn a variational autoencoder to map ground-truth trajectories to Gaussian distributions, considering the uncertain nature of motion prediction and driving planning.

  3. Achieves state-of-the-art (SOTA) performance.

First, prediction and planning are performed simultaneously; second, a generative model learns the prior distribution of the trajectories themselves; third, strong experimental results.

overall

image-20241216033518371

input: surrounding camera images

3 tasks: 3D detection, map segmentation, and planning

The goal of end-to-end autonomous driving can be formulated as obtaining a planned $f$-frame future trajectory $\mathbf{T}(T, f) = \{\mathbf{w}^{T+1}, \mathbf{w}^{T+2}, \ldots, \mathbf{w}^{T+f}\}$ for the ego vehicle, given the current and past $p$-frame sensor inputs $\mathbf{S} = \{\mathbf{s}^T, \mathbf{s}^{T-1}, \ldots, \mathbf{s}^{T-p}\}$ and the past trajectory $\mathbf{T}(T-p, p+1) = \{\mathbf{w}^T, \mathbf{w}^{T-1}, \ldots, \mathbf{w}^{T-p}\}$.

$$\mathbf{T}(T-p, p+1),\ \mathbf{S} \rightarrow \mathbf{T}(T, f),$$

where $\mathbf{T}(T, f)$ denotes an $f$-frame trajectory starting from the $T$-th frame, $\mathbf{w}^t$ denotes the waypoint at the $t$-th frame, and $\mathbf{s}^t$ denotes the sensor input at the $t$-th frame.
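To make the inputs and outputs concrete, here is a tiny shape sketch; the frame counts, camera count, and image resolution are illustrative assumptions rather than values from the paper.

```python
import torch

p, f, num_cams = 2, 6, 6        # past frames, future frames, surround cameras (assumed)
H, W = 256, 704                 # input image resolution (assumed)

sensors = torch.zeros(p + 1, num_cams, 3, H, W)  # S = {s^T, ..., s^{T-p}}
past_traj = torch.zeros(p + 1, 2)                # T(T-p, p+1): past BEV waypoints (x, y)
future_traj = torch.zeros(f, 2)                  # T(T, f): planned waypoints to be produced
```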

Instance-Centric Scene Representation

  1. Image to BEV: Given the surrounding camera signals as inputs, we first employ an image backbone to extract multi-scale image features $F$ and then use deformable attention to transform them into the BEV space. We align the BEV features from the past $p$ frames to the current ego coordinates to obtain the final BEV feature $B$.

$$B = \mathrm{DA}(B_0, F, F)$$
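A minimal sketch of this image-to-BEV step, using standard multi-head cross-attention as a stand-in for the deformable attention in the paper; the class name, BEV grid size, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class BEVLifter(nn.Module):
    """Lift flattened multi-scale image features F into a BEV feature B via
    learnable BEV queries B_0, i.e. B = DA(B_0, F, F) (approximated here with
    ordinary cross-attention)."""
    def __init__(self, dim: int = 256, bev_h: int = 50, bev_w: int = 50):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))  # B_0
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, num_tokens, dim), multi-view/multi-scale features flattened
        b = img_feats.shape[0]
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)   # (batch, H*W, dim)
        bev, _ = self.cross_attn(q, img_feats, img_feats)     # B = DA(B_0, F, F)
        return bev                                            # BEV feature B
```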

  2. BEV to map/agents: We perform global cross-attention and deformable attention to refine a set of map tokens $M$ and agent tokens $A$, respectively.

$$M = \mathrm{CA}(M_0, B, B), \quad A = \mathrm{DA}(A_0, B, B)$$
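A similar sketch for refining the map and agent tokens against the BEV feature; plain cross-attention again stands in for both the global cross-attention (maps) and the deformable attention (agents), and the token counts are assumptions.

```python
import torch
import torch.nn as nn

class TokenRefiner(nn.Module):
    """Refine learnable map tokens M_0 and agent tokens A_0 using the BEV feature B:
    M = CA(M_0, B, B) and A = DA(A_0, B, B) (approximated)."""
    def __init__(self, dim: int = 256, num_map: int = 100, num_agents: int = 300):
        super().__init__()
        self.map_q = nn.Parameter(torch.randn(num_map, dim))       # M_0
        self.agent_q = nn.Parameter(torch.randn(num_agents, dim))  # A_0
        self.map_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.agent_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, bev: torch.Tensor):
        # bev: (batch, H*W, dim) BEV feature B
        b = bev.shape[0]
        m, _ = self.map_attn(self.map_q.unsqueeze(0).expand(b, -1, -1), bev, bev)
        a, _ = self.agent_attn(self.agent_q.unsqueeze(0).expand(b, -1, -1), bev, bev)
        return m, a  # map tokens M, agent tokens A
```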

  3. Feature fusion: To model the high-order interactions between traffic agents and the ego vehicle, we combine the agent tokens with an ego token and perform self-attention among them to construct a set of instance tokens $I$. We also use cross-attention to inject semantic map information into the instance tokens $I$ to facilitate further prediction and planning.

$$I \leftarrow \mathrm{SA}(I, I, I), \quad I \leftarrow \mathrm{CA}(I, M, M)$$
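And a sketch of the fusion step: the ego token is concatenated with the agent tokens, self-attention lets the ego and agents interact, and cross-attention to the map tokens injects semantic map information; all sizes are again illustrative.

```python
import torch
import torch.nn as nn

class InstanceFusion(nn.Module):
    """Build instance tokens I = [agent tokens; ego token], apply I <- SA(I, I, I),
    then I <- CA(I, M, M)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ego_token = nn.Parameter(torch.randn(1, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, agents: torch.Tensor, maps: torch.Tensor) -> torch.Tensor:
        # agents: (batch, N_a, dim) agent tokens A; maps: (batch, N_m, dim) map tokens M
        b = agents.shape[0]
        ego = self.ego_token.unsqueeze(0).expand(b, -1, -1)
        inst = torch.cat([agents, ego], dim=1)        # instance tokens I = [A; ego]
        inst, _ = self.self_attn(inst, inst, inst)    # agent-ego interaction
        inst, _ = self.map_attn(inst, maps, maps)     # inject semantic map information
        return inst
```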

Trajectory Prior Modeling and Latent Future Trajectory Generation

Different from existing methods which directly output the trajectory using a simple decoder, we model it as a trajectory generation problem $ \mathbf{T} \sim p(\mathbf{T}|\mathbf{I}) $ considering its uncertain nature.

image-20241216043922428
  • Trajectory prior modeling: a variational autoencoder encodes the ground-truth future trajectory $\mathbf{T}(T, f)$ into a Gaussian latent space, and the instance-conditioned distribution $p(\mathbf{z}|\mathbf{I})$ is pushed toward this trajectory-conditioned distribution with a KL term:

$$J_{\text{plan}} = D_{KL}\big(p(\mathbf{z} \mid \mathbf{I}) \,\|\, p(\mathbf{z} \mid \mathbf{T}(T, f))\big)$$
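A minimal sketch of this prior-matching term, assuming both $p(\mathbf{z}|\mathbf{I})$ and $p(\mathbf{z}|\mathbf{T}(T,f))$ are predicted as diagonal Gaussians (mean and log-variance); the heads that produce these parameters are omitted and the latent size is an assumption.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over latent dimensions and averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

# J_plan: match p(z|I) (from instance tokens) to p(z|T(T,f)) (from the GT future trajectory)
mu_i, logvar_i = torch.randn(4, 64), torch.randn(4, 64)   # predicted from I
mu_t, logvar_t = torch.randn(4, 64), torch.randn(4, 64)   # encoded from the GT trajectory
j_plan = gaussian_kl(mu_i, logvar_i, mu_t, logvar_t)
```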

  • **Generation:** We then adopt a gated recurrent unit (GRU) as the future trajectory generator to model the temporal evolution of instances. Specifically, the GRU model $g$ takes the current latent representation $\mathbf{z}_t$ as input and transforms it into the next state $g(\mathbf{z}_t) = \mathbf{z}_{t+1}$. We can then decode the waypoint $\mathbf{w}^{t+1}$ at the $(t+1)$-th time stamp using the waypoint decoder $\mathbf{w}^{t+1} = d_w(\mathbf{z}_{t+1})$, i.e., we model

$$p(\mathbf{w}^{t+1} \mid \mathbf{w}^{T+1}, \ldots, \mathbf{w}^{t}, \mathbf{z}) \text{ with } d_w(g(\mathbf{z}_t)).$$
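A hedged sketch of this generation step: a GRU cell advances the latent state and a linear decoder reads out one BEV waypoint per future frame. Treating $\mathbf{z}_t$ as both the GRU input and hidden state is a simplification, and the latent size and horizon are assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryGenerator(nn.Module):
    """Unroll a GRU in latent space and decode one waypoint per future frame:
    z_{t+1} = g(z_t), w^{t+1} = d_w(z_{t+1})."""
    def __init__(self, latent_dim: int = 64, num_future: int = 6):
        super().__init__()
        self.gru = nn.GRUCell(latent_dim, latent_dim)  # g: z_t -> z_{t+1}
        self.waypoint_dec = nn.Linear(latent_dim, 2)   # d_w: z -> (x, y)
        self.num_future = num_future

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, latent_dim), sampled from the learned Gaussian p(z|I)
        waypoints = []
        for _ in range(self.num_future):
            z = self.gru(z, z)                         # z_{t+1} = g(z_t) (z as input and hidden)
            waypoints.append(self.waypoint_dec(z))     # w^{t+1} = d_w(z_{t+1})
        return torch.stack(waypoints, dim=1)           # (batch, f, 2) planned trajectory
```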

experiments

image-20241216040748149

questions

  1. The fusion of trajectory prior knowledge? Not done well.