1. motivation
irregular asynchronous setting:
the timestamps of the collaboration messages from other agents are not aligned
the time interval between two consecutive messages from the same agent is irregular
Problem formulation:
$$
\max_{P,\theta}\sum_{n=1}^{N}g(\widehat{Y}_{n}^{t_{n}^{i}},Y_{n}^{t_{n}^{i}})
\quad\text{subject to}\quad
\widehat{Y}_{n}^{t_{n}^{i}}=c_{\theta}(X_{n}^{t_n^i},\{P_{m\to n}^{t_m^j},P_{m\to n}^{t_m^{j-1}},\dots,P_{m\to n}^{t_m^{j-k+1}}\}_{m=1}^{N})
$$
$t_n^i$: the $i$-th timestamp of agent $n$
$g(\cdot,\cdot)$: perception evaluation metric, used to compare the perception $\widehat{Y}_{n}^{t_{n}^{i}}$ with the ground-truth perception $Y_{n}^{t_{n}^{i}}$
$\widehat{Y}_{n}^{t_{n}^{i}}$: the perception of agent $n$ at time $t_n^i$
$X_{n}^{t_n^i}$: the raw observation of agent $n$ at time $t_n^i$
$P_{m\to n}^{t_m^j}$: the collaboration message sent from agent $m$ at time $t_m^j$
The perception $\widehat{Y}_{n}^{t_{n}^{i}}$ is obtained by the function $c_{\theta}$, which leverages agent $n$'s own raw observation at $t_n^i$ and the stored $k$ frames of collaboration messages from the other agents.
standard well-synchronized setting:
$t_n^i=t_m^i$ for all agent pairs $m,n$
$t_n^{i}-t_n^{i-1}$ is a constant for all agents $n$
regular asynchronous setting (SyncNet):
$t_n^i=t_m^i$ cannot be guaranteed
$t_n^{i}-t_n^{i-1}$ is a constant for all agents $n$
irregular asynchronous setting:
$t_n^i=t_m^i$ cannot be guaranteed
$t_n^{i}-t_n^{i-1}$ is irregular
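The three settings can be contrasted with a toy timestamp generator (a minimal sketch; the function name, interval, and jitter values are illustrative, not from the paper):

```python
import random

def timestamps(n_frames, interval=0.1, offset=0.0, jitter=0.0, seed=0):
    """Generate one agent's sampling timestamps (in seconds).

    offset: start-time shift of a non-ego agent (breaks t_n^i == t_m^i).
    jitter: max random perturbation of each interval (breaks constant gaps).
    """
    rng = random.Random(seed)
    t, out = offset, []
    for _ in range(n_frames):
        out.append(t)
        t += interval + rng.uniform(-jitter, jitter)
    return out

# well-synchronized: zero offset and no jitter for every agent
sync = timestamps(5)
# regular asynchronous: per-agent offset, but constant interval
regular = timestamps(5, offset=0.03)
# irregular asynchronous: per-agent offset AND per-frame jitter
irregular = timestamps(5, offset=0.03, jitter=0.02, seed=1)
```

The second and third cases differ only in whether consecutive gaps stay constant, which is exactly the gap between the SyncNet setting and the irregular setting above.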
2. contribution
IRregular V2V (IRV2V): the first synthetic asynchronous collaborative perception dataset with irregular time delays, simulating various real-world scenarios
CoBEVFlow, an asynchrony-robust collaborative perception system based on bird’s eye view (BEV) flow
3. details
3.1 Overall
3.2 Preparation for transmission
$$
\mathbf{F}_{n}^{t_{n}^{i}} = f_{\mathrm{enc}}(\mathbf{X}_n^{t_n^{i}}),\qquad
\widetilde{\mathbf{F}}_n^{t_n^i},\,\mathcal{R}_n^{t_n^i} = f_{\mathrm{sparse}}(\mathbf{F}_n^{t_n^{i}})
$$
$\mathbf{F}_n^{t_n^i}\in\mathbb{R}^{H\times W\times D}$ is the BEV perceptual feature map of agent $n$ at time $t_n^i$, with $H,W$ the size of the BEV map and $D$ the number of channels.
$\widetilde{\mathbf{F}}_n^{t_n^i}$: the sparse version of $\mathbf{F}_n^{t_n^i}$, which only contains features inside $\mathcal{R}_n^{t_n^i}$ and zero-padding outside
$\mathcal{R}_n^{t_n^i}$: the set of regions of interest (ROIs)
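A minimal sketch of the sparsification step, treating each ROI as an axis-aligned index range for brevity (the actual ROIs are rotated boxes; function and variable names are illustrative):

```python
import numpy as np

def sparsify(feat, rois):
    """Keep BEV features inside the ROIs, zero-pad everywhere else.

    feat: (H, W, D) BEV feature map.
    rois: list of (h0, h1, w0, w1) index ranges standing in for ROI regions.
    """
    mask = np.zeros(feat.shape[:2], dtype=bool)
    for h0, h1, w0, w1 in rois:
        mask[h0:h1, w0:w1] = True
    # broadcast the 2-D mask over the channel axis
    return np.where(mask[..., None], feat, 0.0)

feat = np.ones((8, 8, 4))
sparse = sparsify(feat, [(2, 4, 2, 4)])  # only a 2x2 patch survives
```

Transmitting the sparse map plus the ROI set, instead of the dense map, is what keeps the message compact.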
3.21 the generation of ROIs $\mathcal{R}_m^{t_m^j}$
$$
\mathbf{O}_m^{t_m^j}=\Phi_{\mathrm{roi\_gen}}(\mathbf{F}_m^{t_m^j})\in\mathbb{R}^{H\times W\times7}
$$
$\Phi_{\mathrm{roi\_gen}}$: the ROI generation network with a detection-decoder structure
$(\mathbf{O}_m^{t_m^j})_{h,w}=(c,x,y,h,w,\cos\alpha,\sin\alpha)$: one detected ROI with its class confidence, position, size, and orientation
We threshold the class confidence, apply non-max suppression, and obtain a set of detected boxes, whose occupied spaces form the set of ROIs $\mathcal{R}_m^{t_m^j}$.
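The threshold-then-NMS step can be sketched as follows, with axis-aligned boxes for brevity (the paper's boxes are rotated via the $\cos\alpha,\sin\alpha$ parameterization; all names and thresholds here are illustrative):

```python
import numpy as np

def nms(boxes, scores, conf_thres=0.5, iou_thres=0.5):
    """Confidence filtering + greedy non-max suppression.

    boxes: (N, 4) as (x0, y0, x1, y1).
    Returns the surviving boxes, highest-confidence first.
    """
    keep_conf = scores >= conf_thres          # drop low-confidence detections
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    order = np.argsort(-scores)               # process by descending confidence
    kept = []
    while order.size:
        i = order[0]
        kept.append(i)
        # IoU of the kept box against all remaining candidates
        xx0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx1 - xx0, 0, None) * np.clip(yy1 - yy0, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thres]   # suppress heavy overlaps
    return boxes[kept]

boxes = np.array([[0, 0, 2, 2], [0.1, 0, 2.1, 2], [5, 5, 6, 6], [0, 0, 1, 1]])
scores = np.array([0.9, 0.8, 0.7, 0.3])
kept = nms(boxes, scores)  # low-confidence box dropped, near-duplicate suppressed
```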
3.31 BEV flow map generation
$$
\mathbf{M}_m^{t_m^j\to t_n^i} =f_{\mathrm{flow\_gen}}(t_n^i,\{\mathcal{R}_m^{t_m^q}\}_{q=j-k+1,j-k+2,\cdots,j})
$$
* Adjacent frames’ ROI matching
The goal is to match the ROIs in two consecutive messages sent by the same agent so that we can track each ROI's locations across frames.
3 steps:
cost matrix construction based on the distance between the ROIs
greedy matching: pair each ROI with its nearest counterpart
post-processing
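The first two steps can be sketched with ROI centers only (a simplified cost; the names, the distance gate, and the omission of the post-processing step are my assumptions):

```python
import numpy as np

def greedy_match(prev_centers, curr_centers, max_dist=3.0):
    """Match ROIs across two consecutive frames by center distance.

    Builds the pairwise distance cost matrix, then greedily pairs the
    closest remaining ROIs; pairs farther apart than max_dist stay
    unmatched (e.g. objects entering or leaving the scene).
    """
    cost = np.linalg.norm(
        prev_centers[:, None, :] - curr_centers[None, :, :], axis=-1
    )
    matches = []
    while cost.size and cost.min() <= max_dist:
        i, j = np.unravel_index(cost.argmin(), cost.shape)
        matches.append((int(i), int(j)))
        cost[i, :] = np.inf   # each ROI matches at most once
        cost[:, j] = np.inf
    return matches

prev = np.array([[0.0, 0.0], [10.0, 10.0]])
curr = np.array([[0.5, 0.0], [10.0, 10.2], [100.0, 100.0]])
pairs = greedy_match(prev, curr)  # the far-away third ROI stays unmatched
```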
* BEV flow estimation
Let $\mathbf{V}_{m,r}=\{\mathbf{v}_r^{t_m^j},\mathbf{v}_r^{t_m^{j-1}},\cdots,\mathbf{v}_r^{t_m^{j-k+1}}\}$ be the historical sequence of the $r$-th ROI's attributes sent by the $m$-th agent
$\mathbf{v}_r^{t_m^j}=(x_r^{t_m^j},y_r^{t_m^j},\alpha_r^{t_m^j})$ is the location and orientation
We need to estimate $\mathbf{v}_r^{t_n^i}$ (a time-series forecasting problem)
encode the irregular timestamps: $\mathbf{u}(t_n^i)$ is the encoding of the target time, and $\mathbf{U}_k$ stacks the encodings of the $k$ historical timestamps
$$
\widehat{\mathbf{v}}_r^{t_n^i}=\mathrm{MHA}(\mathbf{u}(t_n^i),\mathrm{MLP}(\mathbf{V}_{m,r})+\mathbf{U}_k,\mathrm{MLP}(\mathbf{V}_{m,r})+\mathbf{U}_k)
$$
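The core of this forecast is an attention query: the target-time encoding attends over the encoded history. A single-head numpy sketch (the real model uses multi-head attention with learned MLP encoders; every shape and value here is illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Single-head scaled dot-product attention.

    query:  encoding of the target timestamp t_n^i.
    keys:   encoded historical ROI attributes plus timestamp encodings.
    values: the same encoded history; the output is their weighted mix.
    """
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)
    return weights @ values

# if the query aligns with the first key, the prediction follows its value
keys = np.array([[10.0, 0.0], [0.0, 10.0]])
values = np.array([[1.0], [2.0]])
pred = attend(np.array([1.0, 0.0]), keys, values)
```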
Finally, we generate the BEV flow map: the motion vector at each grid cell is computed by an affine transformation of the associated ROI's motion, and these per-cell vectors constitute the whole BEV flow map.
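A minimal sketch of painting per-ROI motion into a dense flow map (axis-aligned cell ranges stand in for the rotated ROIs, and the pure translation here omits the rotation part of the affine transform; names are illustrative):

```python
import numpy as np

def bev_flow(H, W, rois, motions):
    """Fill an (H, W, 2) flow map: every grid cell inside an ROI gets
    that ROI's predicted motion vector; cells outside all ROIs stay zero.

    rois:    list of (h0, h1, w0, w1) occupied index ranges.
    motions: list of (dh, dw) displacements from the forecasting step.
    """
    flow = np.zeros((H, W, 2))
    for (h0, h1, w0, w1), (dh, dw) in zip(rois, motions):
        flow[h0:h1, w0:w1] = (dh, dw)  # broadcast over the patch
    return flow

# one ROI occupying a 2x2 patch, predicted to move 2 cells along h
flow = bev_flow(6, 6, [(1, 3, 1, 3)], [(2.0, 0.0)])
```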
3.32 Feature warp and aggregation
$$
\widehat{\mathbf{F}}_m^{t_n^i} =f_{\mathrm{warp}}(\widetilde{\mathbf{F}}_m^{t_m^j},\mathbf{M}_m^{t_m^j\to t_n^i})\\
\widehat{\mathbf{H}}_n^{t_n^i} =f_{\mathrm{agg}}(\widetilde{\mathbf{F}}_n^{t_n^i},\{\widehat{\mathbf{F}}_m^{t_n^i}\}_{m\in\mathcal{N}_n})
$$
$$
\widehat{\mathbf{F}}_{m}^{t_{n}^{i}}\left[h+\mathbf{M}_{m}^{t_{m}^{j}\to t_{n}^{i}}[h,w,0],\;w+\mathbf{M}_{m}^{t_{m}^{j}\to t_{n}^{i}}[h,w,1]\right]=\widetilde{\mathbf{F}}_{m}^{t_{m}^{j}}[h,w]
$$
$\widehat{\mathbf{F}}_{m}^{t_{n}^{i}}$ is the realigned feature map from the $m$-th agent at timestamp $t_n^i$ after motion compensation
$\widehat{\mathbf{H}}_n^{t_n^i}$ is the aggregated feature from all of the agents
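The warp equation and the aggregation can be sketched directly (a naive per-cell scatter; the max fusion in `aggregate` is my assumption, as the paper's $f_{\mathrm{agg}}$ may be a learned operator):

```python
import numpy as np

def warp(feat, flow):
    """Scatter each cell's feature to its flow-displaced location,
    i.e. out[h + flow[h,w,0], w + flow[h,w,1]] = feat[h,w]."""
    H, W, _ = feat.shape
    out = np.zeros_like(feat)
    for h in range(H):
        for w in range(W):
            if not feat[h, w].any():
                continue  # skip zero-padded (non-ROI) cells
            nh, nw = h + int(flow[h, w, 0]), w + int(flow[h, w, 1])
            if 0 <= nh < H and 0 <= nw < W:  # drop features leaving the map
                out[nh, nw] = feat[h, w]
    return out

def aggregate(ego_feat, warped_feats):
    """Element-wise max fusion of ego and realigned collaborator features."""
    return np.maximum.reduce([ego_feat, *warped_feats])

feat = np.zeros((4, 4, 1)); feat[1, 1, 0] = 1.0
flow = np.zeros((4, 4, 2)); flow[1, 1] = (1, 0)   # move one cell along h
warped = warp(feat, flow)
fused = aggregate(np.zeros((4, 4, 1)), [warped])
```

Because only ROI cells carry features, the scatter touches few cells, which is what makes the flow-based compensation cheap.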
4. experiment
5. the dataset IRV2V
The ideal sampling interval of the sensor is 100 ms
There is a time offset at the sampling starting point of non-ego vehicles
all non-ego vehicles' collaborative messages are sampled with time turbulence, so that $t_n^i=t_m^i$ cannot be guaranteed
we sample the frame intervals of received messages from a binomial distribution to obtain random irregular time intervals
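One way to realize such binomially distributed irregular intervals (a sketch only; the trial count and probability are illustrative, not the dataset's actual parameters):

```python
import random

def irregular_times(n, base=0.1, p=0.5, trials=4, seed=0):
    """Draw each inter-message gap as (1 + Binomial(trials, p)) ideal
    intervals, so messages arrive after a random integer multiple of the
    100 ms base interval."""
    rng = random.Random(seed)
    t, out = 0.0, [0.0]
    for _ in range(n - 1):
        skip = sum(rng.random() < p for _ in range(trials))  # binomial draw
        t += base * (1 + skip)
        out.append(round(t, 3))
    return out

ts = irregular_times(5)  # e.g. gaps of 1-5 base intervals
```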