1. motivation

irregular asynchronous setting:

  1. the timestamps of the collaboration messages from other agents are not aligned
  2. the time interval between two consecutive messages from the same agent is irregular

Problem formulation:

$$\max_{P,\theta}\sum_{n=1}^{N}g(\widehat{Y}_{n}^{t_{n}^{i}},Y_{n}^{t_{n}^{i}})\quad\text{subject to }\widehat{Y}_{n}^{t_{n}^{i}}=c_{\theta}(X_{n}^{t_n^i},\{P_{m\rightarrow n}^{t_m^j},P_{m\rightarrow n}^{t_m^{j-1}},\ldots,P_{m\rightarrow n}^{t_m^{j-k+1}}\}_{m=1}^{N})$$

  1. $t_n^i$: the $i$-th timestamp of agent $n$

  2. $g(\cdot,\cdot)$: the perception evaluation metric, used to compare the estimated perception $\widehat{Y}_{n}^{t_{n}^{i}}$ against the ground-truth perception $Y_{n}^{t_{n}^{i}}$

  3. $\widehat{Y}_{n}^{t_{n}^{i}}$: the perception of agent $n$ at time $t_n^i$

  4. $X_{n}^{t_n^i}$: the raw observation of agent $n$ at time $t_n^i$

  5. $P_{m\rightarrow n}^{t_m^j}$: the collaboration message sent from agent $m$ to agent $n$ at time $t_m^j$

  6. The perception $\widehat{Y}_{n}^{t_{n}^{i}}$ is obtained by the function $c_\theta$, which leverages agent $n$'s own raw observation at $t_n^i$ together with the stored $k$ frames of collaboration messages from each of the other agents (see the buffering sketch below).
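
To make the notation concrete, here is a minimal buffering sketch; all names are hypothetical, and `c_theta` is a stub standing in for the learned collaborative perception model:

```python
from collections import deque

def c_theta(x_n, t_n_i, buffers):
    # placeholder for the learned model c_theta; a real system would decode
    # boxes from the ego observation plus the buffered messages
    return {"t": t_n_i, "collaborators": sorted(buffers)}

class AgentBuffer:
    """Each agent keeps the k latest messages per collaborator, whose
    timestamps are irregular and not aligned with the ego's own t_n^i."""
    def __init__(self, k):
        self.k = k
        self.buffers = {}  # m -> deque of (t_m^j, P_{m->n}^{t_m^j})

    def receive(self, m, t_m_j, payload):
        self.buffers.setdefault(m, deque(maxlen=self.k)).append((t_m_j, payload))

    def perceive(self, t_n_i, x_n):
        # Y_hat_n^{t_n^i} = c_theta(X_n^{t_n^i}, {P_{m->n}^{t_m^j}, ...}_{m=1}^N)
        return c_theta(x_n, t_n_i, self.buffers)
```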

standard well-synchronized setting:

  1. $t_n^i = t_m^i$ for all agent pairs $m,n$
  2. $t_n^i - t_n^{i-1}$ is a constant for all agents $n$

regular asynchronous setting (SyncNet):

  1. $t_n^i = t_m^i$ cannot be guaranteed
  2. $t_n^i - t_n^{i-1}$ is a constant for all agents $n$

irregular asynchronous setting:

  1. $t_n^i = t_m^i$ cannot be guaranteed
  2. $t_n^i - t_n^{i-1}$ is irregular (the sketch below contrasts the three settings)
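
A tiny sketch contrasting the three timestamp patterns; the 100 ms ideal interval matches the IRV2V setup in section 5, while the offset and turbulence magnitudes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, T, dt = 3, 10, 0.1   # 100 ms ideal sampling interval

# standard well-synchronized: identical, evenly spaced timestamps for everyone
sync = np.tile(np.arange(T) * dt, (n_agents, 1))

# regular asynchronous (SyncNet): per-agent start offsets, but constant intervals
regular = sync + rng.uniform(0, dt, (n_agents, 1))

# irregular asynchronous: offsets plus per-frame turbulence, so the intervals
# t_n^i - t_n^{i-1} themselves vary from frame to frame
irregular = regular + rng.uniform(-0.02, 0.02, (n_agents, T))
```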

2. contribution

  1. IRregular V2V (IRV2V): the first synthetic asynchronous collaborative perception dataset with irregular time delays, simulating various real-world scenarios
  2. CoBEVFlow, an asynchrony-robust collaborative perception system based on bird’s eye view (BEV) flow

3. details

3.1 Overall

(figure: overall architecture of CoBEVFlow)

3.2 Preparation for transmission

$$\mathbf{F}_{n}^{t_{n}^{i}} = f_{\mathrm{enc}}(\mathbf{X}_n^{t_n^{i}}),\qquad \widetilde{\mathbf{F}}_n^{t_n^i},\,\mathcal{R}_n^{t_n^i} = f_{\mathrm{sparse}}(\mathbf{F}_n^{t_n^{i}})$$

  1. $\mathbf{F}_n^{t_n^i}\in\mathbb{R}^{H\times W\times D}$ is the BEV perceptual feature map of agent $n$ at time $t_n^i$, with $H,W$ the size of the BEV map and $D$ the number of channels.

  2. $\widetilde{\mathbf{F}}_n^{t_n^i}$: the sparse version of $\mathbf{F}_n^{t_n^i}$, which only contains features inside $\mathcal{R}_n^{t_n^i}$ and zero-padding outside (see the sketch after this list);

  3. $\mathcal{R}_n^{t_n^i}$: the set of regions of interest (ROIs)
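
A minimal sketch of the sparsification step, assuming axis-aligned ROIs for brevity (the paper's ROIs are rotated detection boxes, and `sparsify` is my name for the function above):

```python
import numpy as np

def sparsify(F, rois):
    """Keep features inside the ROIs and zero-pad outside, mimicking the
    sparse message; rois are axis-aligned (h0, h1, w0, w1) slices here."""
    F_tilde = np.zeros_like(F)
    for h0, h1, w0, w1 in rois:
        F_tilde[h0:h1, w0:w1] = F[h0:h1, w0:w1]
    return F_tilde

H, W, D = 64, 64, 16
F = np.random.rand(H, W, D)              # BEV feature map F_n^{t_n^i}
R = [(10, 20, 30, 44), (40, 52, 5, 15)]  # ROI set R_n^{t_n^i}
F_tilde = sparsify(F, R)                 # sparse feature to transmit
```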

3.2.1 the generation of ROIs $\mathcal{R}_m^{t_m^j}$

$$\mathbf{O}_m^{t_m^j}=\Phi_{\mathrm{roi\_gen}}(\mathbf{F}_m^{t_m^j})\in\mathbb{R}^{H\times W\times 7}$$

$\Phi_{\mathrm{roi\_gen}}$: the ROI generation network, with a detection decoder structure

$(\mathbf{O}_m^{t_m^j})_{h,w}=(c,x,y,h,w,\cos\alpha,\sin\alpha)$: one detected ROI with its class confidence, position, size, and orientation

We threshold the class confidence, apply non-maximum suppression, and obtain a set of detected boxes whose occupied spaces form the set of ROIs $\mathcal{R}_m^{t_m^j}$.
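
A sketch of this decoding step, with a simple center-distance NMS standing in for the usual IoU-based non-maximum suppression (thresholds are illustrative assumptions):

```python
import numpy as np

def decode_rois(O, conf_thr=0.3, nms_dist=2.0):
    """Threshold class confidence, then suppress near-duplicate boxes.
    O has shape (H, W, 7) = (c, x, y, h, w, cos a, sin a) per BEV cell."""
    cand = O.reshape(-1, 7)
    cand = cand[cand[:, 0] > conf_thr]        # confidence threshold
    cand = cand[np.argsort(-cand[:, 0])]      # highest confidence first
    kept = []
    for box in cand:
        # keep a box only if its center is far from every already-kept box
        if all(np.hypot(box[1] - k[1], box[2] - k[2]) > nms_dist for k in kept):
            kept.append(box)
    return kept  # occupied spaces of these boxes form R_m^{t_m^j}

O = np.random.rand(64, 64, 7)   # dummy decoder output
rois = decode_rois(O, conf_thr=0.9)
```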

3.3 Collaborative information alignment

3.3.1 BEV flow map generation

$$\mathbf{M}_m^{t_m^j\to t_n^i} =f_{\mathrm{flow\_gen}}(t_n^i,\{\mathcal{R}_m^{t_m^q}\}_{q=j-k+1,j-k+2,\cdots,j})$$

* Adjacent frames’ ROI matching

The goal is to match the ROIs in two consecutive messages sent by the same agent, so that each ROI's successive locations can be tracked across frames.

3 steps (a sketch of steps 1 and 2 follows this list):

  1. cost matrix construction, based on the distances between the ROIs of the two frames
  2. greedy matching: repeatedly pair the nearest remaining ROIs
  3. post-processing
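
A minimal numpy sketch of the matching, interpreting "greedy" as repeatedly taking the globally nearest pair; the `max_dist` cut-off is my assumption standing in for the post-processing step:

```python
import numpy as np

def greedy_match(rois_prev, rois_curr, max_dist=3.0):
    """Step 1: pairwise center-distance cost matrix. Step 2: greedily take
    the nearest (prev, curr) pair until none remain within max_dist."""
    cost = np.linalg.norm(rois_prev[:, None, :2] - rois_curr[None, :, :2], axis=-1)
    matches = []
    while cost.size and cost.min() < max_dist:
        i, j = np.unravel_index(cost.argmin(), cost.shape)
        matches.append((int(i), int(j)))
        cost[i, :] = np.inf   # each ROI is matched at most once
        cost[:, j] = np.inf
    return matches

prev = np.array([[0.0, 0.0], [10.0, 5.0]])   # (x, y) centers at t_m^{j-1}
curr = np.array([[0.5, 0.2], [10.4, 5.1]])   # (x, y) centers at t_m^j
print(greedy_match(prev, curr))              # -> [(1, 1), (0, 0)]
```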

* BEV flow estimation


Let $\mathbf{V}_{m,r}=\{\mathbf{v}_r^{t_m^j},\mathbf{v}_r^{t_m^{j-1}},\cdots,\mathbf{v}_r^{t_m^{j-k+1}}\}$ be the historical sequence of the $r$-th ROI's attributes sent by the $m$-th agent

$\mathbf{v}_r^{t_m^j}=(x_r^{t_m^j},y_r^{t_m^j},\alpha_r^{t_m^j})$ is the ROI's location and orientation at $t_m^j$

We need to estimate $\mathbf{v}_r^{t_n^i}$, a time-series forecasting problem on irregular timestamps:

  1. encode each timestamp into a continuous embedding $\mathbf{u}(t)$, with $\mathbf{U}_k$ the stacked embeddings of the $k$ historical timestamps
  2. $\widehat{\mathbf{v}}_r^{t_n^i}=\mathrm{MHA}(\mathbf{u}(t_n^i),\mathrm{MLP}(\mathbf{V}_{m,r})+\mathbf{U}_k,\mathrm{MLP}(\mathbf{V}_{m,r})+\mathbf{U}_k)$, i.e., multi-head attention whose query is the target-time embedding and whose keys and values are the encoded historical attributes plus their time embeddings
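
A single-head numpy sketch of this forecasting step, assuming a sinusoidal form for $\mathbf{u}(t)$ and omitting the MLP (both are stand-ins for the paper's learned modules; here the values are the raw attributes for brevity):

```python
import numpy as np

def time_embed(t, d=16):
    """Sinusoidal embedding of a continuous, irregular timestamp: u(t)."""
    freqs = 1.0 / (10000 ** (np.arange(d // 2) / (d // 2)))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def attend(q, K, V):
    """Scaled dot-product attention, MHA reduced to a single head."""
    w = np.exp(q @ K.T / np.sqrt(K.shape[1]))
    w /= w.sum()
    return w @ V

# k historical attributes v_r = (x, y, alpha) of one ROI at irregular times
ts = np.array([0.00, 0.09, 0.21])                  # t_m^{j-k+1..j}, irregular
V_mr = np.array([[0.0, 0.0, 0.10], [0.9, 0.1, 0.10], [2.1, 0.2, 0.12]])
U_k = np.stack([time_embed(t) for t in ts])        # historical time embeddings

t_n_i = 0.30                                       # ego's target timestamp
q = time_embed(t_n_i)                              # query: u(t_n^i)
K = U_k                                            # stands in for MLP(V)+U_k
v_hat = attend(q, K, V_mr)                         # estimated (x, y, alpha)
```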

Finally, we can generate the BEV flow map: the motion vector at each grid cell is calculated by an affine transformation of the associated ROI's motion, and these per-cell vectors constitute the whole BEV flow map (see the sketch below).
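
A sketch of that last step, using a rigid transform (rotation plus translation, a special case of the affine map named above) to turn one ROI's estimated motion into per-cell flow vectors:

```python
import numpy as np

def roi_flow(cells, v_old, v_new):
    """Flow for the grid cells inside one ROI: rotate about the old ROI
    center by the heading change, translate to the new center, and take
    each cell's displacement."""
    (x0, y0, a0), (x1, y1, a1) = v_old, v_new
    da = a1 - a0
    Rm = np.array([[np.cos(da), -np.sin(da)], [np.sin(da), np.cos(da)]])
    moved = (cells - [x0, y0]) @ Rm.T + [x1, y1]
    return moved - cells  # per-cell motion vectors

# fill the BEV flow map M only at cells covered by this ROI (zeros elsewhere)
H, W = 64, 64
M = np.zeros((H, W, 2))
cells = np.array([[12.0, 20.0], [13.0, 20.0]])   # cells occupied by the ROI
flow = roi_flow(cells, v_old=(12.5, 20.0, 0.0), v_new=(14.0, 20.5, 0.05))
M[cells[:, 0].astype(int), cells[:, 1].astype(int)] = flow
```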

3.3.2 Feature warp and aggregation

$$\widehat{\mathbf{F}}_m^{t_n^i} =f_{\mathrm{warp}}(\widetilde{\mathbf{F}}_m^{t_m^j},\mathbf{M}_m^{t_m^j\to t_n^i})\\ \widehat{\mathbf{H}}_n^{t_n^i} =f_{\mathrm{agg}}(\widetilde{\mathbf{F}}_n^{t_n^i},\{\widehat{\mathbf{F}}_m^{t_n^i}\}_{m\in\mathcal{N}_n})$$

$$\widehat{\mathbf{F}}_{m}^{t_{n}^{i}}\left[h+\mathbf{M}_{m}^{t_{m}^{j}\to t_{n}^{i}}[h,w,0],\ w+\mathbf{M}_{m}^{t_{m}^{j}\to t_{n}^{i}}[h,w,1]\right]=\widetilde{\mathbf{F}}_{m}^{t_{m}^{j}}[h,w]$$

$\widehat{\mathbf{F}}_{m}^{t_{n}^{i}}$ is the realigned feature map from the $m$-th agent at timestamp $t_n^i$, after motion compensation

$\widehat{\mathbf{H}}_n^{t_n^i}$ is the feature aggregated from all of the agents (a warp-and-aggregate sketch follows)
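
A minimal sketch of the warp equation above, moving each non-empty cell's feature to its flow-predicted location; the elementwise-max aggregation is my stand-in, since the paper's $f_{\mathrm{agg}}$ is a learned fusion:

```python
import numpy as np

def warp(F_tilde, M):
    """F_hat[h + M[h,w,0], w + M[h,w,1]] = F_tilde[h, w] for occupied cells."""
    H, W, _ = F_tilde.shape
    F_hat = np.zeros_like(F_tilde)
    for h, w in zip(*np.nonzero(F_tilde.any(axis=-1))):
        dh, dw = np.round(M[h, w]).astype(int)
        if 0 <= h + dh < H and 0 <= w + dw < W:
            F_hat[h + dh, w + dw] = F_tilde[h, w]
    return F_hat

def aggregate(F_ego, F_warped_list):
    """Simple stand-in for f_agg: elementwise max over ego and warped maps."""
    return np.max(np.stack([F_ego, *F_warped_list]), axis=0)

H, W, D = 64, 64, 8
F_t = np.zeros((H, W, D)); F_t[12, 20] = 1.0     # one occupied cell
M = np.zeros((H, W, 2)); M[12, 20] = (2.0, 1.0)  # its predicted motion
F_hat = warp(F_t, M)                             # feature moved to (14, 21)
```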

4. experiment


5. the dataset IRV2V

The ideal sampling interval of the sensors is 100 ms.

  1. there is a random time offset at the sampling start of each non-ego vehicle
  2. all non-ego vehicles' collaborative messages are sampled with time turbulence

so that $t_n^i = t_m^i$ cannot be guaranteed.

We sample the frame intervals of the received messages from a binomial distribution to get random irregular time intervals (a sampling sketch follows).
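
A sketch of this sampling scheme; the dataset text names only the distribution family and the 100 ms interval, so the turbulence magnitude and binomial parameters below are my assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dt, n_frames = 0.1, 20                            # 100 ms ideal interval

# random offset of the non-ego vehicle's sampling start, plus per-frame turbulence
offset = rng.uniform(0, dt)
turbulence = rng.uniform(-0.01, 0.01, n_frames)   # magnitude is an assumption
timestamps = offset + np.arange(n_frames) * dt + turbulence

# frame intervals of *received* messages: binomially sampled skip counts
# (n and p are assumptions) give random irregular gaps between messages
skips = 1 + rng.binomial(n=3, p=0.5, size=n_frames)
received_idx = np.cumsum(skips)
received_idx = received_idx[received_idx < n_frames]
received_ts = timestamps[received_idx]
```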