Motivation
Scene flow aims to model the correspondence between adjacent RGB or LiDAR visual features to estimate 3D motion.
RGB and LiDAR are intrinsically heterogeneous, so fusing them directly is inappropriate.
We observe that the event modality is homogeneous with both RGB and LiDAR in the visual space and in the motion space.
visual space complementarity
- RGB camera: absolute value of luminance
- event camera: relative change of luminance
- LiDAR: global shape
- event camera: local boundary
motion space complementarity
- RGB camera: spatially dense 2D features
- event camera: temporally dense 2D features
- LiDAR: spatiotemporally sparse 3D features
how to fuse
We need to build a homogeneous space.
- In visual luminance fusion, we transform the event and RGB data into a shared luminance space and fuse their complementary (relative vs. absolute) knowledge for high-dynamic-range imaging, under the constraint that their spatiotemporal gradients stay similar.
- In visual structure fusion, we represent the event and LiDAR data in a shared spatial-structure space and fuse their complementary (local boundary vs. global shape) knowledge for physical structure integrity, using a self-similarity clustering strategy.
- In motion correlation fusion, we map the visual features of RGB, event, and LiDAR into the same correlation space and fuse their complementary correlations (spatially dense x, y-axis, temporally dense x, y-axis, and sparse x, y, z-axis) for 3D motion spatiotemporal continuity, via motion distribution alignment. (A high-level sketch of the three stages follows.)
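A minimal PyTorch sketch of how the three fusion stages could be wired together. The class name, placeholder convolutions, and tensor shapes are my own assumptions for illustration, not the paper's actual implementation; real feature extractors and correlation volumes are omitted.

```python
import torch
import torch.nn as nn

class HomogeneousFusionSceneFlow(nn.Module):
    """Skeleton of the three-stage fusion: luminance space (RGB + event),
    structure space (LiDAR + event), correlation space (all three)."""

    def __init__(self, corr_dim=64):
        super().__init__()
        # event + RGB luminance -> fused high-dynamic luminance (placeholder layer)
        self.luminance_fusion = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # event boundary map + sparse LiDAR depth -> structure-complete depth (placeholder layer)
        self.structure_fusion = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # fused per-point correlations -> 3D scene flow
        self.flow_head = nn.Linear(3 * corr_dim, 3)

    def forward(self, rgb_y, event_frame, event_boundary, lidar_depth,
                corr_rgb, corr_event, corr_lidar):
        # 1) homogeneous luminance space (absolute RGB + relative event luminance)
        fused_y = self.luminance_fusion(torch.cat([rgb_y, event_frame], dim=1))
        # 2) homogeneous structure space (global LiDAR shape + local event boundary)
        fused_depth = self.structure_fusion(torch.cat([event_boundary, lidar_depth], dim=1))
        # In the full method, fused_y / fused_depth would feed the feature extractors
        # that build the per-point correlation volumes used below.
        # 3) homogeneous correlation space: align, then concatenate per-point correlations
        fused_corr = torch.cat([corr_rgb, corr_event, corr_lidar], dim=-1)  # (B, N, 3*corr_dim)
        flow = self.flow_head(fused_corr)                                   # (B, N, 3)
        return fused_y, fused_depth, flow
```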
Visual Fusion (Homogeneous Luminance Space)
- RGB image → YUV image (separates the luminance channel Y from chrominance)
- make the fused result look natural: adversarial training
- keep continuity (the color and luminance remain the same); a sketch of this idea follows below
- optical flow base model:
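A rough sketch of the luminance-space pieces, assuming the fused output is a single-channel luminance map and that accumulated event polarity approximates the relative (temporal) change of luminance. The BT.601 YUV coefficients are standard; the function names and loss weighting are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def rgb_to_yuv(rgb):
    """(B, 3, H, W) RGB in [0, 1] -> YUV using the standard BT.601 weights;
    Y isolates the absolute luminance that the RGB camera measures."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b
    v = 0.615 * r - 0.515 * g - 0.100 * b
    return torch.cat([y, u, v], dim=1)

def spatial_gradients(x):
    """Forward-difference spatial gradients of a (B, 1, H, W) luminance map."""
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return dx, dy

def gradient_consistency_loss(fused_y, rgb_y, prev_fused_y, event_frame):
    """Constrain the fused luminance to keep the RGB spatial gradients and to
    follow the event-measured temporal change (accumulated event polarity is
    used here as a proxy for relative luminance change; a learned scale is omitted)."""
    fdx, fdy = spatial_gradients(fused_y)
    rdx, rdy = spatial_gradients(rgb_y)
    spatial_term = F.l1_loss(fdx, rdx) + F.l1_loss(fdy, rdy)
    temporal_term = F.l1_loss(fused_y - prev_fused_y, event_frame)
    return spatial_term + temporal_term

# toy usage
rgb_y = rgb_to_yuv(torch.rand(1, 3, 64, 64))[:, 0:1]
fused_y, prev_fused_y = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
events = torch.randn(1, 1, 64, 64)
loss = gradient_consistency_loss(fused_y, rgb_y, prev_fused_y, events)
```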
Visual Fusion (Homogeneous Structure Space)
- event camera image: provides the local boundary map
- LiDAR → event camera coordinate system: project the point cloud into the event image plane to obtain a sparse depth map
- after finding corresponding points via clustering, KNN, and attention, the LiDAR depth map is complemented with boundary information from the event boundary map
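A sketch of the geometric part, assuming calibrated LiDAR-to-event extrinsics (R, t) and event camera intrinsics K. The KNN "complement" step below is a simplified stand-in for the clustering/KNN/attention correspondence described above, not the actual method.

```python
import numpy as np
from scipy.spatial import cKDTree

def project_lidar_to_event_frame(points, R, t, K, H, W):
    """Project LiDAR points (N, 3) into the event camera image plane using the
    LiDAR->event extrinsics (R, t) and intrinsics K, and rasterise a sparse
    depth map (0 where no LiDAR return lands; z-buffering omitted)."""
    cam = points @ R.T + t                      # points in event-camera coordinates
    cam = cam[cam[:, 2] > 1e-3]                 # keep points in front of the camera
    uvw = cam @ K.T                             # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = np.zeros((H, W), dtype=np.float32)
    depth[v[inside], u[inside]] = cam[inside, 2]
    return depth

def complement_depth_at_boundaries(depth, boundary_mask, k=3):
    """Simplified stand-in for the correspondence step: event-boundary pixels
    without a LiDAR return borrow depth from their k nearest projected LiDAR
    pixels (inverse-distance weighted), so the sparse depth map inherits the
    local boundary structure of the events."""
    known = np.argwhere(depth > 0)                           # (M, 2) as (row, col)
    query = np.argwhere((boundary_mask > 0) & (depth == 0))  # boundary pixels to fill
    if len(known) < k or len(query) == 0:
        return depth
    dist, idx = cKDTree(known).query(query, k=k)
    w = 1.0 / (dist + 1e-6)
    vals = depth[known[idx][..., 0], known[idx][..., 1]]     # (Q, k) neighbour depths
    out = depth.copy()
    out[query[:, 0], query[:, 1]] = (w * vals).sum(1) / w.sum(1)
    return out
```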
Motion Correlation Fusion
- randomly sample N 3D coordinates from the LiDAR point cloud and find the corresponding 2D coordinates in the RGB and event frames
- using the sampled coordinates as centers, spatially sample the RGB correlation volume to obtain the corresponding spatially dense x, y-axis correlations
- for the event stream, temporally sample the correlation features to obtain temporally dense x, y-axis correlations
- for the LiDAR, spatially sample the correlations into relatively sparse x, y, z-axis correlations
- apply K-L divergence to align the multimodal motion correlation distributions
- concatenate the z-axis correlations of LiDAR to recover full 3D motion correlation (a sketch of the sampling and alignment follows this list)
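A sketch of the per-point correlation sampling and the K-L alignment, assuming each modality's correlations have already been pooled/projected to the same number of bins before alignment; shapes and function names are illustrative only.

```python
import torch
import torch.nn.functional as F

def sample_rgb_correlation_patches(corr_rgb, coords_2d, radius=3):
    """Gather a (2r+1) x (2r+1) patch of the dense RGB correlation volume around
    each sampled 2D coordinate, giving spatially dense x, y-axis correlations.
    corr_rgb: (B, C, H, W); coords_2d: (B, N, 2) in pixels, ordered (x, y)."""
    B, C, H, W = corr_rgb.shape
    r = torch.arange(-radius, radius + 1, device=corr_rgb.device, dtype=corr_rgb.dtype)
    dy, dx = torch.meshgrid(r, r, indexing="ij")
    offsets = torch.stack([dx, dy], dim=-1).reshape(1, 1, -1, 2)      # (1, 1, K, 2)
    pts = coords_2d.unsqueeze(2) + offsets                            # (B, N, K, 2)
    grid = torch.stack([2 * pts[..., 0] / (W - 1) - 1,                # normalise to [-1, 1]
                        2 * pts[..., 1] / (H - 1) - 1], dim=-1)
    patches = F.grid_sample(corr_rgb, grid, align_corners=True)       # (B, C, N, K)
    return patches.permute(0, 2, 1, 3).reshape(B, coords_2d.shape[1], -1)

def kl_alignment_loss(corr_a, corr_b, tau=1.0):
    """Symmetric K-L divergence between two per-point correlation distributions
    (B, N, C) after softmax normalisation; both inputs are assumed projected to
    the same number of correlation bins C beforehand."""
    p = F.log_softmax(corr_a / tau, dim=-1)
    q = F.log_softmax(corr_b / tau, dim=-1)
    return 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                  + F.kl_div(p, q, log_target=True, reduction="batchmean"))

def fuse_correlations(fused_xy, lidar_corr_z):
    """Concatenate the LiDAR z-axis correlations onto the aligned x, y-axis
    correlations so the fused feature covers full 3D motion."""
    return torch.cat([fused_xy, lidar_corr_z], dim=-1)
```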
Experiments
Comparison Methods
Ablation Study
Thinking
too complex
multimodal feature fusion in collaborative perception:
- camera vs. LiDAR?
- RGB features, point cloud features, and BEV flow features?
- multi-view fusion (multi-vehicle views)?