image-20241216032457914

motivation

  1. Most existing end-to-end autonomous driving models are composed of several modules and follow a pipeline of perception, motion prediction, and planning. However, the serial design of prediction and planning in existing pipelines ignores the possible future interactions between the ego car and the other traffic participants.
  2. Also, future trajectories are highly structured and share a common prior (e.g., most trajectories are continuous and straight lines). Still, most existing methods fail to consider this structural prior, leading to inaccurate prediction and planning.

First, prediction and planning should not be performed serially; second, trajectories themselves should carry prior knowledge.

contribution

models autonomous driving as a trajectory generation problem

  1. Our GenAD can simultaneously perform motion prediction and planning using the unified future trajectory generation model.

  2. To model the structural prior of future trajectories, we learn a variational autoencoder to map ground-truth trajectories to Gaussian distributions, considering the uncertain nature of motion prediction and driving planning.

  3. Achieves state-of-the-art (SOTA) performance.

First, prediction and planning are performed simultaneously; second, a generative model learns the prior distribution of the trajectories themselves; third, strong experimental results.

overall

image-20241216033518371

input: surrounding camera images

3 tasks: 3D detection, map segmentation, and planning

The goal of end-to-end autonomous driving can be formulated as obtaining a planned $f$-frame future trajectory $\mathbf{T}(T, f) = \{\mathbf{w}^{T+1}, \mathbf{w}^{T+2}, \ldots, \mathbf{w}^{T+f}\}$ for the ego vehicle, given the current and past $p$-frame sensor inputs $\mathbf{S} = \{\mathbf{s}^T, \mathbf{s}^{T-1}, \ldots, \mathbf{s}^{T-p}\}$ and the past trajectory $\mathbf{T}(T-p, p+1) = \{\mathbf{w}^T, \mathbf{w}^{T-1}, \ldots, \mathbf{w}^{T-p}\}$.

$$\mathbf{T}(T-p, p+1),\ \mathbf{S} \rightarrow \mathbf{T}(T, f),$$

where $\mathbf{T}(T, f)$ denotes an $f$-frame trajectory starting from the $T$-th frame, $\mathbf{w}^t$ denotes the waypoint at the $t$-th frame, and $\mathbf{s}^t$ denotes the sensor input at the $t$-th frame.
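To make the inputs and outputs concrete, here is a tiny shape sketch; the frame counts, camera count, and image resolution are illustrative assumptions rather than values from the paper.

```python
import torch

p, f, num_cams = 2, 6, 6        # past frames, future frames, surround cameras (assumed)
H, W = 256, 704                 # input image resolution (assumed)

sensors = torch.zeros(p + 1, num_cams, 3, H, W)  # S = {s^T, ..., s^{T-p}}
past_traj = torch.zeros(p + 1, 2)                # T(T-p, p+1): past BEV waypoints (x, y)
future_traj = torch.zeros(f, 2)                  # T(T, f): planned waypoints to be produced
```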

Instance-Centric Scene Representation

  1. Image to BEV: Given the surrounding camera signals as inputs, we first employ an image backbone to extract multi-scale image features $F$ and then use deformable attention to transform them into the BEV space. We align the BEV features from the past $p$ frames to the current ego coordinates to obtain the final BEV feature $B$.

$$B = \mathrm{DA}(B_0, F, F)$$
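A minimal sketch of this image-to-BEV step, using standard multi-head cross-attention as a stand-in for the deformable attention in the paper; the class name, BEV grid size, and feature dimension are assumptions.

```python
import torch
import torch.nn as nn

class BEVLifter(nn.Module):
    """Lift flattened multi-scale image features F into a BEV feature B via
    learnable BEV queries B_0, i.e. B = DA(B_0, F, F) (approximated here with
    ordinary cross-attention)."""
    def __init__(self, dim: int = 256, bev_h: int = 50, bev_w: int = 50):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))  # B_0
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, img_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, num_tokens, dim), multi-view/multi-scale features flattened
        b = img_feats.shape[0]
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)   # (batch, H*W, dim)
        bev, _ = self.cross_attn(q, img_feats, img_feats)     # B = DA(B_0, F, F)
        return bev                                            # BEV feature B
```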

  2. BEV to map/agents: We perform global cross-attention and deformable attention to refine a set of map tokens $M$ and agent tokens $A$, respectively.

$$M = \mathrm{CA}(M_0, B, B), \quad A = \mathrm{DA}(A_0, B, B)$$
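A similar sketch for refining the map and agent tokens against the BEV feature; plain cross-attention again stands in for both the global cross-attention (maps) and the deformable attention (agents), and the token counts are assumptions.

```python
import torch
import torch.nn as nn

class TokenRefiner(nn.Module):
    """Refine learnable map tokens M_0 and agent tokens A_0 using the BEV feature B:
    M = CA(M_0, B, B) and A = DA(A_0, B, B) (approximated)."""
    def __init__(self, dim: int = 256, num_map: int = 100, num_agents: int = 300):
        super().__init__()
        self.map_q = nn.Parameter(torch.randn(num_map, dim))       # M_0
        self.agent_q = nn.Parameter(torch.randn(num_agents, dim))  # A_0
        self.map_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.agent_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, bev: torch.Tensor):
        # bev: (batch, H*W, dim) BEV feature B
        b = bev.shape[0]
        m, _ = self.map_attn(self.map_q.unsqueeze(0).expand(b, -1, -1), bev, bev)
        a, _ = self.agent_attn(self.agent_q.unsqueeze(0).expand(b, -1, -1), bev, bev)
        return m, a  # map tokens M, agent tokens A
```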

  3. Feature fusion: To model the high-order interactions between traffic agents and the ego vehicle, we combine the agent tokens with an ego token and perform self-attention among them to construct a set of instance tokens $I$. We also use cross-attention to inject semantic map information into the instance tokens $I$ to facilitate further prediction and planning.

$$I \leftarrow \mathrm{SA}(I, I, I), \quad I \leftarrow \mathrm{CA}(I, M, M)$$
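And a sketch of the fusion step: the ego token is concatenated with the agent tokens, self-attention lets the ego and agents interact, and cross-attention to the map tokens injects semantic map information; all sizes are again illustrative.

```python
import torch
import torch.nn as nn

class InstanceFusion(nn.Module):
    """Build instance tokens I = [agent tokens; ego token], apply I <- SA(I, I, I),
    then I <- CA(I, M, M)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.ego_token = nn.Parameter(torch.randn(1, dim))
        self.self_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.map_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, agents: torch.Tensor, maps: torch.Tensor) -> torch.Tensor:
        # agents: (batch, N_a, dim) agent tokens A; maps: (batch, N_m, dim) map tokens M
        b = agents.shape[0]
        ego = self.ego_token.unsqueeze(0).expand(b, -1, -1)
        inst = torch.cat([agents, ego], dim=1)        # instance tokens I = [A; ego]
        inst, _ = self.self_attn(inst, inst, inst)    # agent-ego interaction
        inst, _ = self.map_attn(inst, maps, maps)     # inject semantic map information
        return inst
```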

Trajectory Prior Modeling and Latent Future Trajectory Generation

Different from existing methods which directly output the trajectory using a simple decoder, we model it as a trajectory generation problem $ \mathbf{T} \sim p(\mathbf{T}|\mathbf{I}) $ considering its uncertain nature.

image-20241216043922428
  • Trajectory prior modeling: a variational autoencoder encodes the ground-truth future trajectory $\mathbf{T}(T, f)$ into a Gaussian latent space, and the instance-conditioned distribution $p(\mathbf{z}|\mathbf{I})$ is pushed toward this trajectory-conditioned distribution with a KL term:

$$J_{\text{plan}} = D_{KL}\big(p(\mathbf{z} \mid \mathbf{I}) \,\|\, p(\mathbf{z} \mid \mathbf{T}(T, f))\big)$$
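A minimal sketch of this prior-matching term, assuming both $p(\mathbf{z}|\mathbf{I})$ and $p(\mathbf{z}|\mathbf{T}(T,f))$ are predicted as diagonal Gaussians (mean and log-variance); the heads that produce these parameters are omitted and the latent size is an assumption.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    summed over latent dimensions and averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

# J_plan: match p(z|I) (from instance tokens) to p(z|T(T,f)) (from the GT future trajectory)
mu_i, logvar_i = torch.randn(4, 64), torch.randn(4, 64)   # predicted from I
mu_t, logvar_t = torch.randn(4, 64), torch.randn(4, 64)   # encoded from the GT trajectory
j_plan = gaussian_kl(mu_i, logvar_i, mu_t, logvar_t)
```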

  • **Generation:** We then adopt a gated recurrent unit (GRU) as the future trajectory generator to model the temporal evolution of instances. Specifically, the GRU model $g$ takes the current latent representation $\mathbf{z}_t$ as input and transforms it into the next state $g(\mathbf{z}_t) = \mathbf{z}_{t+1}$. We can then decode the waypoint $\mathbf{w}^{t+1}$ at the $(t+1)$-th time stamp using the waypoint decoder $\mathbf{w}^{t+1} = d_w(\mathbf{z}_{t+1})$, i.e., we model

$$p(\mathbf{w}^{t+1} \mid \mathbf{w}^{T+1}, \ldots, \mathbf{w}^{t}, \mathbf{z}) \text{ with } d_w(g(\mathbf{z}_t)).$$
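A hedged sketch of this generation step: a GRU cell advances the latent state and a linear decoder reads out one BEV waypoint per future frame. Treating $\mathbf{z}_t$ as both the GRU input and hidden state is a simplification, and the latent size and horizon are assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryGenerator(nn.Module):
    """Unroll a GRU in latent space and decode one waypoint per future frame:
    z_{t+1} = g(z_t), w^{t+1} = d_w(z_{t+1})."""
    def __init__(self, latent_dim: int = 64, num_future: int = 6):
        super().__init__()
        self.gru = nn.GRUCell(latent_dim, latent_dim)  # g: z_t -> z_{t+1}
        self.waypoint_dec = nn.Linear(latent_dim, 2)   # d_w: z -> (x, y)
        self.num_future = num_future

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, latent_dim), sampled from the learned Gaussian p(z|I)
        waypoints = []
        for _ in range(self.num_future):
            z = self.gru(z, z)                         # z_{t+1} = g(z_t) (z as input and hidden)
            waypoints.append(self.waypoint_dec(z))     # w^{t+1} = d_w(z_{t+1})
        return torch.stack(waypoints, dim=1)           # (batch, f, 2) planned trajectory
```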

experiments

image-20241216040748149

questions

  1. The fusion of trajectory prior knowledge? Not done well.