Motivation

Scene flow aims to model the correspondence between adjacent RGB or LiDAR frames in order to estimate 3D motion.

RGB and LiDAR are intrinsically heterogeneous, so fusing them directly is inappropriate.

We discover that events are homogeneous with both RGB and LiDAR in the visual and motion spaces.


visual space complementarity

  • RGB camera: absolute value of luminance vs. event camera: relative change of luminance
  • LiDAR: global shape vs. event camera: local boundary

motion space complementarity

  • RGB camera: spatial-dense 2D features
  • event camera: temporal-dense 2D features
  • LiDAR: spatiotemporal-sparse 3D features

How to fuse

We need to build a homogeneous space.

  • In visual luminance fusion, we transform the event and RGB data into the luminance space and fuse the complementary (relative vs. absolute) knowledge for high-dynamic imaging under the constraint of similar spatiotemporal gradients.

  • In visual structure fusion, we represent the event and LiDAR data in the spatial structure space and fuse the complementary (local boundary vs. global shape) knowledge for physical structure integrity using a self-similarity clustering strategy.

  • In motion correlation fusion, we map the visual features of RGB, event, and LiDAR into the same correlation space and fuse the complementary (x, y-axis spatial-dense, x, y-axis temporal-dense, and x, y, z-axis sparse correlation) knowledge for 3D motion spatiotemporal continuity via motion distribution alignment.

Visual Fusion (Homogeneous Luminance Space)


RGB image → YUV image $[I^Y, I^U, I^V]$

$I^X=\sum p_i C$ (relative luminance change accumulated from event polarities $p_i$, each scaled by the contrast threshold $C$)

$H^Y=\mathcal{W}(I^Y,I^X)=(\omega_y I^Y+\omega_x I^X)/(\omega_y+\omega_x)$

$H=\mathcal{C}(\mathcal{W}(I^{Y},I^{X}),I^{U},I^{V})$
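A minimal sketch of this step, assuming $\mathcal{W}$ reduces to a per-pixel weighted average and $\mathcal{C}$ to a channel concatenation; the contrast threshold `C` and the weights `w_y`, `w_x` are placeholder values, not the paper's learned parameters:

```python
import numpy as np
import cv2

def fuse_luminance(rgb, events, C=0.2, w_y=0.5, w_x=0.5):
    """rgb: HxWx3 uint8 image; events: iterable of (x, y, polarity) inside the window."""
    h, w = rgb.shape[:2]
    yuv = cv2.cvtColor(rgb, cv2.COLOR_RGB2YUV).astype(np.float32) / 255.0
    I_Y, I_U, I_V = yuv[..., 0], yuv[..., 1], yuv[..., 2]

    # I^X = sum_i p_i * C : relative luminance change accumulated from events
    I_X = np.zeros((h, w), dtype=np.float32)
    for x, y, p in events:
        I_X[y, x] += p * C

    # H^Y = W(I^Y, I^X) = (w_y * I^Y + w_x * I^X) / (w_y + w_x)
    H_Y = np.clip((w_y * I_Y + w_x * I_X) / (w_y + w_x), 0.0, 1.0)

    # H = C(W(I^Y, I^X), I^U, I^V) : recombine the fused luminance with the chroma
    fused_yuv = (np.stack([H_Y, I_U, I_V], axis=-1) * 255.0).astype(np.uint8)
    return cv2.cvtColor(fused_yuv, cv2.COLOR_YUV2RGB)
```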

  • make it natural

adversarial training

  • keep the continuity (the color and luminance remain the same)

basic optical flow model:

$I_{pos}'U+I_{t}'=0$

$I_t'=U\cdot\sum_{e_i\in\Delta t}p_i C$ and $I'_{pos}=\Delta I_{pos}$
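A literal transcription of the two formulas above into a residual that could be penalized during training; the tensor names, shapes, and contrast threshold `C` are assumptions rather than the paper's implementation:

```python
import torch

def continuity_residual(delta_I_pos, U, polarity_sum, C=0.2):
    """
    delta_I_pos:  HxW tensor, I'_pos = delta(I_pos), change of the fused luminance.
    U:            HxW tensor, optical flow along the sampled direction.
    polarity_sum: HxW tensor, sum of event polarities p_i inside the window dt.
    Returns the residual of I'_pos * U + I'_t, which should be close to zero
    when the fused image stays consistent with the observed motion.
    """
    I_t = U * polarity_sum * C       # I'_t = U * sum(p_i) * C, as written above
    return delta_I_pos * U + I_t     # I'_pos * U + I'_t

# one possible penalty: loss = continuity_residual(dI, U, pol_sum).abs().mean()
```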

Visual Fusion (Homogeneous Structure Space)


event camera image $P_{e}=\{u_{j}=x_{j},\,v_{j}=y_{j},\,p_{j}\mid\{x_{j},y_{j},p_{j}\}\in ev\}$

LiDAR to event camera coordinate system:

$u_i=f\cdot x_i/z_i+c_x,\quad v_i=f\cdot y_i/z_i+c_y,\quad d_i=z_i \quad \text{s.t. } \{x_i,y_i,z_i\}\in pc,\ 0<u_i<w,\ 0<v_i<h$
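A small sketch of this projection, assuming the point cloud has already been transformed into the event-camera frame and that `f`, `cx`, `cy`, `w`, `h` come from that camera's calibration:

```python
import numpy as np

def project_lidar_to_event(pc, f, cx, cy, w, h):
    """pc: Nx3 array of LiDAR points already expressed in the event-camera frame."""
    pc = pc[pc[:, 2] > 0]                       # keep points in front of the camera
    x, y, z = pc[:, 0], pc[:, 1], pc[:, 2]
    u = f * x / z + cx                          # u_i = f * x_i / z_i + c_x
    v = f * y / z + cy                          # v_i = f * y_i / z_i + c_y
    keep = (u > 0) & (u < w) & (v > 0) & (v < h)

    depth = np.zeros((h, w), dtype=np.float32)  # sparse depth map with d_i = z_i
    depth[v[keep].astype(int), u[keep].astype(int)] = z[keep]
    return depth
```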

After finding the corresponding points using clustering, KNN, and attention, the LiDAR depth map needs the boundary information from the event boundary map to complement itself (see the sketch below).
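The notes do not spell out the clustering/attention modules, so the sketch below is only a rough stand-in: a plain KNN lookup that lets pixels on the event boundary map borrow depth from their nearest projected LiDAR samples via inverse-distance weighting:

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_boundary_depth(depth, boundary_mask, k=4):
    """depth: HxW sparse LiDAR depth map; boundary_mask: HxW bool event boundary map."""
    known = np.argwhere(depth > 0)                     # pixels that already carry depth
    query = np.argwhere(boundary_mask & (depth == 0))  # boundary pixels without depth
    if len(known) == 0 or len(query) == 0:
        return depth

    k = min(k, len(known))
    dist, idx = cKDTree(known).query(query, k=k)
    if k == 1:                                         # query() drops the K axis when k == 1
        dist, idx = dist[:, None], idx[:, None]
    neighbor_depth = depth[known[idx][..., 0], known[idx][..., 1]]

    # inverse-distance weighting of the K nearest LiDAR depths
    weights = 1.0 / (dist + 1e-6)
    filled = (neighbor_depth * weights).sum(axis=1) / weights.sum(axis=1)

    out = depth.copy()
    out[query[:, 0], query[:, 1]] = filled
    return out
```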

Motion Correlation Fusion

  1. Randomly sample N 3D coordinates from the LiDAR and find the corresponding 2D coordinates in the RGB and event frames.
  2. Using the sampled coordinates as centers, perform 2D spatial sampling on the RGB correlation to obtain the corresponding spatial-dense x, y-axis correlations.

For the event branch, we temporally sample the correlation features to get the temporal-dense x, y-axis correlations.

For the LiDAR, we spatially sample the correlations into relatively sparse x, y, z-axis correlations.

  3. Apply KL divergence to align the multimodal motion correlation distributions:

$\mathcal{L}_{corr}^{kl}=\sum\Phi(cv_{l}^{x,y})\cdot\log\frac{\Phi(cv_{l}^{x,y})}{\Phi(cv_{r}^{x,y})}+\Phi(cv_{l}^{x,y})\cdot\log\frac{\Phi(cv_{l}^{x,y})}{\Phi(cv_{e}^{x,y})}$

  4. Concatenate the z-axis correlations of the LiDAR (see the sketch after this list):

$corr=\mathrm{Concat}\{\frac{1}{T}\sum_{i=0}^{T}(cv_{r}^{x}+cv_{e}^{x,i}+cv_{l}^{x})/3,\ \frac{1}{T}\sum_{i=0}^{T}(cv_{r}^{y}+cv_{e}^{y,i}+cv_{l}^{y})/3,\ cv_{l}^{z}\}$
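A hedged PyTorch sketch of steps 3 and 4, assuming $\Phi$ is a softmax over the correlation bins and using illustrative shapes (N sampled centers, K bins per axis, T temporal slices for the event branch):

```python
import torch
import torch.nn.functional as F

def kl_corr_loss(cv_l_xy, cv_r_xy, cv_e_xy):
    """KL(Phi(cv_l) || Phi(cv_r)) + KL(Phi(cv_l) || Phi(cv_e)); all inputs are NxK."""
    p_l = F.softmax(cv_l_xy, dim=-1)
    log_p_l = F.log_softmax(cv_l_xy, dim=-1)
    log_p_r = F.log_softmax(cv_r_xy, dim=-1)
    log_p_e = F.log_softmax(cv_e_xy, dim=-1)
    return (p_l * (log_p_l - log_p_r)).sum() + (p_l * (log_p_l - log_p_e)).sum()

def fuse_correlations(cv_r_x, cv_r_y, cv_e_x, cv_e_y, cv_l_x, cv_l_y, cv_l_z):
    """
    cv_r_*, cv_l_*: NxK correlations from RGB and LiDAR.
    cv_e_*:         TxNxK correlations from the event branch (one slice per time bin).
    Averages the x- and y-axis correlations over modalities and event time bins,
    then appends the z-axis correlation that only LiDAR provides.
    """
    fused_x = ((cv_r_x + cv_l_x).unsqueeze(0) + cv_e_x).div(3).mean(dim=0)
    fused_y = ((cv_r_y + cv_l_y).unsqueeze(0) + cv_e_y).div(3).mean(dim=0)
    return torch.cat([fused_x, fused_y, cv_l_z], dim=-1)
```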

Experiments

Comparison Methods

Ablation Study

Thinking

too complex

multimodal feature fusion in collaborative perception:

  • camera vs. LiDAR?
  • RGB features, point cloud features, and Bevflow features?
  • multi-view fusion (multi-vehicle views)?