motivation
- Most existing end-to-end autonomous driving models are composed of several modules and follow a serial pipeline of perception, motion prediction, and planning. However, this serial design of prediction and planning ignores the possible future interactions between the ego car and the other traffic participants.
- Also, future trajectories are highly structured and share a common prior (e.g., most trajectories are continuous and roughly straight). Still, most existing methods fail to exploit this structural prior, leading to inaccurate prediction and planning.
First, prediction and planning should not be performed serially; second, trajectories themselves carry prior structure.
contribution
- models autonomous driving as a trajectory generation problem
- GenAD can simultaneously perform motion prediction and planning using a unified future trajectory generation model
- to model the structural prior of future trajectories, a variational autoencoder is learned to map ground-truth trajectories to Gaussian distributions, accounting for the uncertain nature of motion prediction and driving planning
- SOTA results
First, prediction and planning are performed simultaneously; second, a generative model learns the prior distribution of trajectories themselves; third, strong experimental results.
overall
input: surrounding camera inputs
3 tasks: 3D detection, map segmentation, planning
The goal of end-to-end autonomous driving can be formulated as obtaining a planned $f$-frame future trajectory $\mathbf{T}(T, f) = \{\mathbf{w}^{T+1}, \mathbf{w}^{T+2}, \ldots, \mathbf{w}^{T+f}\}$ for the ego vehicle given the current and past $p$-frame sensor inputs $\mathbf{S} = \{\mathbf{s}^T, \mathbf{s}^{T-1}, \ldots, \mathbf{s}^{T-p}\}$ and the past trajectory $\mathbf{T}(T-p, p+1) = \{\mathbf{w}^T, \mathbf{w}^{T-1}, \ldots, \mathbf{w}^{T-p}\}$,
where $\mathbf{T}(T, f)$ denotes an $f$-frame trajectory starting from the $T$-th frame, $\mathbf{w}^t$ denotes the waypoint at the $t$-th frame, and $\mathbf{s}^t$ denotes the sensor input at the $t$-th frame.
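The windowing above can be sketched in a few lines of numpy; the horizon lengths and 2-D waypoints here are placeholder assumptions, not values from the paper:

```python
import numpy as np

# Slice past / future windows from a waypoint sequence (shapes are assumptions).
p, f = 3, 6       # p past frames, f future frames
T = 10            # current frame index
waypoints = np.cumsum(np.random.randn(T + f + 1, 2), axis=0)  # a smooth-ish 2-D track

# Past trajectory T(T-p, p+1) = {w^T, ..., w^{T-p}}: p+1 waypoints, newest first
past_traj = waypoints[T - p : T + 1][::-1]
# Planning target T(T, f) = {w^{T+1}, ..., w^{T+f}}: f future waypoints
future_traj = waypoints[T + 1 : T + f + 1]

assert past_traj.shape == (p + 1, 2)
assert future_traj.shape == (f, 2)
```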
Instance-Centric Scene Representation
- Image to BEV: Given surrounding camera signals as inputs, we first employ an image backbone to extract multi-scale image features and then use deformable attention to transform them into the BEV space. We align the BEV features from the past $p$ frames to the current ego coordinate frame to obtain the final BEV feature.
- BEV to map/agents: We perform global cross-attention and deformable attention to refine a set of map tokens and agent tokens, respectively.
- feature fusion: To model the high-order interactions between traffic agents and the ego vehicle, we combine the agent tokens with an ego token and perform self-attention among them to construct a set of instance tokens. We also use cross-attention to inject semantic map information into the instance tokens to facilitate subsequent prediction and planning.
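The fusion step can be illustrated with single-head scaled dot-product attention; the token counts, feature dimension, and residual wiring below are illustrative assumptions, not the paper's actual multi-head/deformable implementation:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

d, n_agents, n_map = 32, 5, 8                 # dims/counts are assumptions
agent_tokens = np.random.randn(n_agents, d)
ego_token = np.random.randn(1, d)
map_tokens = np.random.randn(n_map, d)

# 1) self-attention over [agents; ego] -> instance tokens (agent-ego interaction)
inst = np.concatenate([agent_tokens, ego_token], axis=0)
inst = inst + attention(inst, inst, inst)

# 2) cross-attention: instance tokens query the map tokens (inject map semantics)
inst = inst + attention(inst, map_tokens, map_tokens)
assert inst.shape == (n_agents + 1, d)
```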
Trajectory Prior Modeling and Latent Future Trajectory Generation
Unlike existing methods, which directly output the trajectory using a simple decoder, we model planning as a trajectory generation problem $\mathbf{T} \sim p(\mathbf{T}|\mathbf{I})$, where $\mathbf{I}$ denotes the instance tokens, to account for its uncertain nature.
- **Trajectory prior modeling:** We learn a variational autoencoder that maps ground-truth trajectories to Gaussian distributions in a latent space, capturing the structural prior of future trajectories under uncertainty.
- **Generation:** We then adopt a gated recurrent unit (GRU) as the future trajectory generator to model the temporal evolution of instances. Specifically, the GRU takes the current latent representation as input and transforms it into the next state. We can then decode the waypoint at the $t$-th time stamp using the waypoint decoder.
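The two steps above (sample a latent from a learned Gaussian, then unroll a GRU and decode waypoints) can be sketched as follows; all weights here are random stand-ins for learned parameters, and the linear condition encoder, latent size, and 2-D waypoint decoder are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_lat, f = 16, 6                  # latent dim / planning horizon (assumptions)

# --- Trajectory prior (VAE-style): map a condition vector to a Gaussian, sample z.
W_mu, W_logvar = rng.standard_normal((2, d_lat, d_lat)) * 0.1
cond = rng.standard_normal(d_lat)                 # instance token (placeholder)
mu, logvar = W_mu @ cond, W_logvar @ cond
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(d_lat)  # reparameterization

# --- Generation: a minimal GRU cell unrolled f steps; each hidden state is
# --- decoded to a 2-D waypoint by a linear "waypoint decoder".
Wz, Wr, Wh = rng.standard_normal((3, d_lat, 2 * d_lat)) * 0.1
W_dec = rng.standard_normal((2, d_lat)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h, x = z, cond
waypoints = []
for _ in range(f):
    xh = np.concatenate([x, h])
    u = sigmoid(Wz @ xh)                          # update gate
    r = sigmoid(Wr @ xh)                          # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))
    h = (1 - u) * h + u * h_tilde                 # next latent state
    waypoints.append(W_dec @ h)                   # decode waypoint w^{T+t}
traj = np.stack(waypoints)
assert traj.shape == (f, 2)
```

Sampling several `z` values from the same Gaussian and unrolling each would yield a distribution over future trajectories rather than a single point estimate.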
experiments
questions
- how exactly is the trajectory prior knowledge fused into the model? this part is not handled well