1. motivation
irregular asynchronous setting:
the timestamps of the collaboration messages from other agents are not aligned
the time interval between two consecutive messages from the same agent is irregular
Problem formulation:
$$
\max_{P,\theta}\sum_{n=1}^{N}g(\widehat{Y}_{n}^{t_{n}^{i}},Y_{n}^{t_{n}^{i}})
\quad\text{subject to}\quad
\widehat{Y}_{n}^{t_{n}^{i}}=c_{\theta}(X_{n}^{t_n^i},\{P_{m\to n}^{t_m^j},P_{m\to n}^{t_m^{j-1}},\dots,P_{m\to n}^{t_m^{j-k+1}}\}_{m=1}^{N})
$$
$t_n^i$: the $i$-th timestamp of agent $n$
$g(\cdot,\cdot)$: perception evaluation metric, used to compare the perception $\widehat{Y}_{n}^{t_{n}^{i}}$ with the ground-truth perception $Y_{n}^{t_{n}^{i}}$
$\widehat{Y}_{n}^{t_{n}^{i}}$: the perception of agent $n$ at time $t_n^i$
$X_{n}^{t_n^i}$: the raw observation of agent $n$ at time $t_n^i$
$P_{m\to n}^{t_m^j}$: the collaboration message sent from agent $m$ at time $t_m^j$
The perception $\widehat{Y}_{n}^{t_{n}^{i}}$ is obtained by the function $c_{\theta}$, which leverages agent $n$'s own raw observation at $t_n^i$ and the stored $k$ frames of collaboration messages from the other agents.
standard well-synchronized setting:
$t_n^i=t_m^i$ for all agent pairs $m,n$
$t_n^{i}-t_n^{i-1}$ is a constant for all agents $n$
regular asynchronous setting (SyncNet):
$t_n^i=t_m^i$ cannot be guaranteed
$t_n^{i}-t_n^{i-1}$ is a constant for all agents $n$
irregular asynchronous setting:
$t_n^i=t_m^i$ cannot be guaranteed
$t_n^{i}-t_n^{i-1}$ is irregular
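The three settings can be contrasted with a toy timestamp generator (a minimal sketch; the function name, interval, and jitter values are illustrative, not from the paper):

```python
import random

def timestamps(n_frames, interval=0.1, offset=0.0, jitter=0.0, seed=0):
    """Generate one agent's sampling timestamps (in seconds).

    offset: start-time shift of a non-ego agent (breaks t_n^i == t_m^i).
    jitter: max random perturbation of each interval (breaks constant gaps).
    """
    rng = random.Random(seed)
    t, out = offset, []
    for _ in range(n_frames):
        out.append(t)
        t += interval + rng.uniform(-jitter, jitter)
    return out

# well-synchronized: zero offset and no jitter for every agent
sync = timestamps(5)
# regular asynchronous: per-agent offset, but constant interval
regular = timestamps(5, offset=0.03)
# irregular asynchronous: per-agent offset AND per-frame jitter
irregular = timestamps(5, offset=0.03, jitter=0.02, seed=1)
```

The second and third cases differ only in whether consecutive gaps stay constant, which is exactly the gap between the SyncNet setting and the irregular setting above.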
2. contribution
IRregular V2V (IRV2V): the first synthetic asynchronous collaborative perception dataset with irregular time delays, simulating various real-world scenarios
CoBEVFlow, an asynchrony-robust collaborative perception system based on bird’s eye view (BEV) flow
3. details
3.1 Overall
3.2 Preparation for transmission
$$
\mathbf{F}_{n}^{t_{n}^{i}} = f_{\mathrm{enc}}(\mathbf{X}_n^{t_n^{i}}),\qquad
\widetilde{\mathbf{F}}_n^{t_n^i},\,\mathcal{R}_n^{t_n^i} = f_{\mathrm{sparse}}(\mathbf{F}_n^{t_n^{i}})
$$
$\mathbf{F}_n^{t_n^i}\in\mathbb{R}^{H\times W\times D}$ is the BEV perceptual feature map of agent $n$ at time $t_n^i$, with $H,W$ the size of the BEV map and $D$ the number of channels.
$\widetilde{\mathbf{F}}_n^{t_n^i}$: the sparse version of $\mathbf{F}_n^{t_n^i}$, which only contains features inside $\mathcal{R}_n^{t_n^i}$ and zero-padding outside
$\mathcal{R}_n^{t_n^i}$: the set of regions of interest (ROIs)
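A minimal sketch of the sparsification step, treating each ROI as an axis-aligned index range for brevity (the actual ROIs are rotated boxes; function and variable names are illustrative):

```python
import numpy as np

def sparsify(feat, rois):
    """Keep BEV features inside the ROIs, zero-pad everywhere else.

    feat: (H, W, D) BEV feature map.
    rois: list of (h0, h1, w0, w1) index ranges standing in for ROI regions.
    """
    mask = np.zeros(feat.shape[:2], dtype=bool)
    for h0, h1, w0, w1 in rois:
        mask[h0:h1, w0:w1] = True
    # broadcast the 2-D mask over the channel axis
    return np.where(mask[..., None], feat, 0.0)

feat = np.ones((8, 8, 4))
sparse = sparsify(feat, [(2, 4, 2, 4)])  # only a 2x2 patch survives
```

Transmitting the sparse map plus the ROI set, instead of the dense map, is what keeps the message compact.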
3.21 the generation of ROIs $\mathcal{R}_m^{t_m^j}$
$$
\mathbf{O}_m^{t_m^j}=\Phi_{\mathrm{roi\_gen}}(\mathbf{F}_m^{t_m^j})\in\mathbb{R}^{H\times W\times7}
$$
$\Phi_{\mathrm{roi\_gen}}$: the ROI generation network with a detection-decoder structure
$(\mathbf{O}_m^{t_m^j})_{h,w}=(c,x,y,h,w,\cos\alpha,\sin\alpha)$: one detected ROI with its class confidence, position, size, and orientation
We threshold the class confidence, apply non-max suppression, and obtain a set of detected boxes, whose occupied spaces form the set of ROIs $\mathcal{R}_m^{t_m^j}$.
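The threshold-then-NMS step can be sketched as follows, with axis-aligned boxes for brevity (the paper's boxes are rotated via the $\cos\alpha,\sin\alpha$ parameterization; all names and thresholds here are illustrative):

```python
import numpy as np

def nms(boxes, scores, conf_thres=0.5, iou_thres=0.5):
    """Confidence filtering + greedy non-max suppression.

    boxes: (N, 4) as (x0, y0, x1, y1).
    Returns the surviving boxes, highest-confidence first.
    """
    keep_conf = scores >= conf_thres          # drop low-confidence detections
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    order = np.argsort(-scores)               # process by descending confidence
    kept = []
    while order.size:
        i = order[0]
        kept.append(i)
        # IoU of the kept box against all remaining candidates
        xx0 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy0 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx1 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy1 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx1 - xx0, 0, None) * np.clip(yy1 - yy0, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[order[1:]]) - inter)
        order = order[1:][iou <= iou_thres]   # suppress heavy overlaps
    return boxes[kept]

boxes = np.array([[0, 0, 2, 2], [0.1, 0, 2.1, 2], [5, 5, 6, 6], [0, 0, 1, 1]])
scores = np.array([0.9, 0.8, 0.7, 0.3])
kept = nms(boxes, scores)  # low-confidence box dropped, near-duplicate suppressed
```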
3.31 BEV flow map generation
$$
\mathbf{M}_m^{t_m^j\to t_n^i} =f_{\mathrm{flow\_gen}}(t_n^i,\{\mathcal{R}_m^{t_m^q}\}_{q=j-k+1,j-k+2,\cdots,j})
$$
* Adjacent frames’ ROI matching
The goal is to match the ROIs in two consecutive messages sent by the same agent so that we can track each ROI's locations across frames.
3 steps:
cost matrix construction based on the distance between the ROIs
greedy matching: pair each ROI with its nearest counterpart
post-processing
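The first two steps can be sketched with ROI centers only (a simplified cost; the names, the distance gate, and the omission of the post-processing step are my assumptions):

```python
import numpy as np

def greedy_match(prev_centers, curr_centers, max_dist=3.0):
    """Match ROIs across two consecutive frames by center distance.

    Builds the pairwise distance cost matrix, then greedily pairs the
    closest remaining ROIs; pairs farther apart than max_dist stay
    unmatched (e.g. objects entering or leaving the scene).
    """
    cost = np.linalg.norm(
        prev_centers[:, None, :] - curr_centers[None, :, :], axis=-1
    )
    matches = []
    while cost.size and cost.min() <= max_dist:
        i, j = np.unravel_index(cost.argmin(), cost.shape)
        matches.append((int(i), int(j)))
        cost[i, :] = np.inf   # each ROI matches at most once
        cost[:, j] = np.inf
    return matches

prev = np.array([[0.0, 0.0], [10.0, 10.0]])
curr = np.array([[0.5, 0.0], [10.0, 10.2], [100.0, 100.0]])
pairs = greedy_match(prev, curr)  # the far-away third ROI stays unmatched
```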
* BEV flow estimation
Let $\mathbf{V}_{m,r}=\{\mathbf{v}_r^{t_m^j},\mathbf{v}_r^{t_m^{j-1}},\cdots,\mathbf{v}_r^{t_m^{j-k+1}}\}$ be the historical sequence of the $r$-th ROI's attributes sent by the $m$-th agent
$\mathbf{v}_r^{t_m^j}=(x_r^{t_m^j},y_r^{t_m^j},\alpha_r^{t_m^j})$ is the location and orientation
We need to estimate $\mathbf{v}_r^{t_n^i}$ (a time-series forecasting problem)
encode the irregular timestamps: $\mathbf{u}(t_n^i)$ is the encoding of the target time, and $\mathbf{U}_k$ stacks the encodings of the $k$ historical timestamps
$$
\widehat{\mathbf{v}}_r^{t_n^i}=\mathrm{MHA}(\mathbf{u}(t_n^i),\mathrm{MLP}(\mathbf{V}_{m,r})+\mathbf{U}_k,\mathrm{MLP}(\mathbf{V}_{m,r})+\mathbf{U}_k)
$$
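The core of this forecast is an attention query: the target-time encoding attends over the encoded history. A single-head numpy sketch (the real model uses multi-head attention with learned MLP encoders; every shape and value here is illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Single-head scaled dot-product attention.

    query:  encoding of the target timestamp t_n^i.
    keys:   encoded historical ROI attributes plus timestamp encodings.
    values: the same encoded history; the output is their weighted mix.
    """
    scores = keys @ query / np.sqrt(len(query))
    weights = softmax(scores)
    return weights @ values

# if the query aligns with the first key, the prediction follows its value
keys = np.array([[10.0, 0.0], [0.0, 10.0]])
values = np.array([[1.0], [2.0]])
pred = attend(np.array([1.0, 0.0]), keys, values)
```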
Finally, we generate the BEV flow map: the motion vector at each grid cell is computed by an affine transformation of the associated ROI's motion, and these per-cell vectors constitute the whole BEV flow map.
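A minimal sketch of painting per-ROI motion into a dense flow map (axis-aligned cell ranges stand in for the rotated ROIs, and the pure translation here omits the rotation part of the affine transform; names are illustrative):

```python
import numpy as np

def bev_flow(H, W, rois, motions):
    """Fill an (H, W, 2) flow map: every grid cell inside an ROI gets
    that ROI's predicted motion vector; cells outside all ROIs stay zero.

    rois:    list of (h0, h1, w0, w1) occupied index ranges.
    motions: list of (dh, dw) displacements from the forecasting step.
    """
    flow = np.zeros((H, W, 2))
    for (h0, h1, w0, w1), (dh, dw) in zip(rois, motions):
        flow[h0:h1, w0:w1] = (dh, dw)  # broadcast over the patch
    return flow

# one ROI occupying a 2x2 patch, predicted to move 2 cells along h
flow = bev_flow(6, 6, [(1, 3, 1, 3)], [(2.0, 0.0)])
```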
3.32 Feature warp and aggregation
$$
\widehat{\mathbf{F}}_m^{t_n^i} =f_{\mathrm{warp}}(\widetilde{\mathbf{F}}_m^{t_m^j},\mathbf{M}_m^{t_m^j\to t_n^i})\\
\widehat{\mathbf{H}}_n^{t_n^i} =f_{\mathrm{agg}}(\widetilde{\mathbf{F}}_n^{t_n^i},\{\widehat{\mathbf{F}}_m^{t_n^i}\}_{m\in\mathcal{N}_n})
$$
$$
\widehat{\mathbf{F}}_{m}^{t_{n}^{i}}\left[h+\mathbf{M}_{m}^{t_{m}^{j}\to t_{n}^{i}}[h,w,0],\;w+\mathbf{M}_{m}^{t_{m}^{j}\to t_{n}^{i}}[h,w,1]\right]=\widetilde{\mathbf{F}}_{m}^{t_{m}^{j}}[h,w]
$$
$\widehat{\mathbf{F}}_{m}^{t_{n}^{i}}$ is the realigned feature map from the $m$-th agent at timestamp $t_n^i$ after motion compensation
$\widehat{\mathbf{H}}_n^{t_n^i}$ is the aggregated feature from all of the agents
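The warp equation and the aggregation can be sketched directly (a naive per-cell scatter; the max fusion in `aggregate` is my assumption, as the paper's $f_{\mathrm{agg}}$ may be a learned operator):

```python
import numpy as np

def warp(feat, flow):
    """Scatter each cell's feature to its flow-displaced location,
    i.e. out[h + flow[h,w,0], w + flow[h,w,1]] = feat[h,w]."""
    H, W, _ = feat.shape
    out = np.zeros_like(feat)
    for h in range(H):
        for w in range(W):
            if not feat[h, w].any():
                continue  # skip zero-padded (non-ROI) cells
            nh, nw = h + int(flow[h, w, 0]), w + int(flow[h, w, 1])
            if 0 <= nh < H and 0 <= nw < W:  # drop features leaving the map
                out[nh, nw] = feat[h, w]
    return out

def aggregate(ego_feat, warped_feats):
    """Element-wise max fusion of ego and realigned collaborator features."""
    return np.maximum.reduce([ego_feat, *warped_feats])

feat = np.zeros((4, 4, 1)); feat[1, 1, 0] = 1.0
flow = np.zeros((4, 4, 2)); flow[1, 1] = (1, 0)   # move one cell along h
warped = warp(feat, flow)
fused = aggregate(np.zeros((4, 4, 1)), [warped])
```

Because only ROI cells carry features, the scatter touches few cells, which is what makes the flow-based compensation cheap.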
4. experiment
5. the dataset IRV2V
The ideal sampling interval of the sensor is 100 ms
There is a time offset at the sampling starting point of non-ego vehicles
all non-ego vehicles' collaborative messages are sampled with time turbulence, so that $t_n^i=t_m^i$ cannot be guaranteed
we sample the frame intervals of received messages from a binomial distribution to obtain random irregular time intervals
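One way to realize such binomially distributed irregular intervals (a sketch only; the trial count and probability are illustrative, not the dataset's actual parameters):

```python
import random

def irregular_times(n, base=0.1, p=0.5, trials=4, seed=0):
    """Draw each inter-message gap as (1 + Binomial(trials, p)) ideal
    intervals, so messages arrive after a random integer multiple of the
    100 ms base interval."""
    rng = random.Random(seed)
    t, out = 0.0, [0.0]
    for _ in range(n - 1):
        skip = sum(rng.random() < p for _ in range(trials))  # binomial draw
        t += base * (1 + skip)
        out.append(round(t, 3))
    return out

ts = irregular_times(5)  # e.g. gaps of 1-5 base intervals
```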