Motivation
Scaling up is all you need.
Scale: the size of datasets, the number of model parameters, the range of the effective receptive field, and the available computing power.
Scaling principle: efficiency (simplicity, scalability) vs. accuracy.
Unlike the advances made in 2D vision and NLP, previous work in 3D vision has had to focus on improving model accuracy, owing to the limited size and diversity of point cloud data available in separate domains.
The time consumption of Point Transformer V1/V2:
The KNN query and relative positional encoding (RPE) together occupy 54% of the forward time, evidence that too much effort has been spent on intricate, accuracy-oriented designs that overfit individual 3D tasks.
This paper reconsiders the trade-off between efficiency and accuracy, leveraging the power of scalability on comparatively "weak" datasets to improve accuracy.
- remove the restriction of permutation invariance on point clouds (serialize them instead)
- replace the time-consuming positional encoding with a simpler scheme
- exploit the potential of the datasets (data augmentation)
The idea that **any initial accuracy gap can be effectively bridged by harnessing the potential of scalability** dominates the whole paper.
Overall architecture
Point Cloud Serialization
Space-filling curves
A space-filling curve is a bijective function $\varphi: \mathbb{Z} \mapsto \mathbb{Z}^n$, where $n$ is the dimensionality of the space; $n = 3$ within the context of point clouds, and the definition also extends to higher dimensions.
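To make the bijection concrete, here is a minimal sketch (my illustration, not code from the paper) of the Z-order (Morton) curve in 3D: `morton_encode` interleaves the bits of a grid coordinate into a single integer, and `morton_decode` inverts it, so the pair realizes the mapping between $\mathbb{Z}$ and $\mathbb{Z}^3$ in both directions on a bounded grid.

```python
# Minimal sketch of a 3D Z-order (Morton) curve: a bijection between
# integer grid coordinates (x, y, z) and a single integer code,
# for coordinates that fit in `bits` bits per axis.

def morton_encode(x: int, y: int, z: int, bits: int = 16) -> int:
    """Interleave the bits of (x, y, z) into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def morton_decode(code: int, bits: int = 16) -> tuple[int, int, int]:
    """Invert morton_encode: recover (x, y, z) from a Morton code."""
    x = y = z = 0
    for i in range(bits):
        x |= ((code >> (3 * i)) & 1) << i
        y |= ((code >> (3 * i + 1)) & 1) << i
        z |= ((code >> (3 * i + 2)) & 1) << i
    return x, y, z

assert morton_decode(morton_encode(3, 5, 7)) == (3, 5, 7)  # bijective on the grid
```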
Serialized encoding
The serialization of a point cloud is accomplished by sorting the points according to the codes produced by the serialized encoding: each point's position is quantized onto a grid, and its grid coordinate is mapped through the space-filling curve to a single integer code.
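A minimal sketch of that pipeline (NumPy; the `grid_size` value and the choice of Morton encoding are my assumptions for illustration):

```python
import numpy as np

def morton_encode_batch(coords: np.ndarray, bits: int = 16) -> np.ndarray:
    """Interleave the bits of integer grid coords (N, 3) into Morton codes (N,)."""
    codes = np.zeros(len(coords), dtype=np.int64)
    for i in range(bits):
        for axis in range(3):
            codes |= ((coords[:, axis] >> i) & 1) << (3 * i + axis)
    return codes

def serialize(points: np.ndarray, grid_size: float = 0.05) -> np.ndarray:
    """Return the permutation that orders points along the Z-order curve."""
    grid_coords = np.floor((points - points.min(0)) / grid_size).astype(np.int64)
    return np.argsort(morton_encode_batch(grid_coords))

points = np.random.rand(1000, 3)   # toy point cloud
order = serialize(points)
serialized_points = points[order]  # points adjacent on the curve are now adjacent in memory
```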
Serialized Attention
Image transformers, benefiting from the structured and regular grid of pixel data, naturally favor window and dot-product attention mechanisms. This advantage vanishes when confronting the unstructured nature of point clouds.
With serialized point clouds, we can now revisit and adopt these efficient window and dot-product attention mechanisms as our foundational approach.
Evolving from window attention, this paper defines patch attention: a mechanism that groups points into non-overlapping patches and performs attention within each individual patch.
Patch grouping
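A minimal sketch of grouping plus in-patch attention (PyTorch >= 2.0 for `scaled_dot_product_attention`; the patch size and the tail padding by repetition are illustrative simplifications, and the q/k/v projections of a real block are omitted):

```python
import torch
import torch.nn.functional as F

def patch_attention(x: torch.Tensor, K: int = 1024) -> torch.Tensor:
    """Dot-product attention within non-overlapping patches of a serialized cloud.

    x: (N, C) features already sorted by serialization code, so points that
    are adjacent along the space-filling curve fall into the same patch.
    """
    N, C = x.shape
    pad = (-N) % K                                        # make N divisible by K
    if pad:
        x = torch.cat([x, x[-1:].expand(pad, C)], dim=0)  # simplistic tail padding
    patches = x.view(-1, K, C)                            # (num_patches, K, C)
    out = F.scaled_dot_product_attention(patches, patches, patches)
    return out.reshape(-1, C)[:N]                         # drop the padding

x = torch.randn(5000, 64)
out = patch_attention(x)  # (5000, 64); attention never crosses patch boundaries
```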
Patch interaction
The serialized order of the point cloud is dynamically varied between attention blocks.
This prevents the model from overfitting to a single traversal pattern and promotes a more robust integration of features across the data.
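A minimal sketch of this shuffling (the four curve variants match the paper's z / z-trans / hilbert / hilbert-trans patterns; the random permutations are stand-ins for real serialization orders):

```python
import numpy as np

N = 1000  # number of points

# One permutation of point indices per space-filling-curve variant.
# Random stand-ins here; in practice each comes from argsorting the codes
# of the corresponding curve (z, transposed z, Hilbert, transposed Hilbert).
orders = {name: np.random.permutation(N)
          for name in ["z", "z-trans", "hilbert", "hilbert-trans"]}
schedule = list(orders)

def order_for_block(block_idx: int) -> np.ndarray:
    """Round-robin: each attention block serializes with a different curve."""
    return orders[schedule[block_idx % len(schedule)]]

# Block 0 forms patches in z-order, block 1 in transposed z-order, and so on,
# so successive blocks mix features across different patch neighborhoods.
```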
Simpler positional encoding
Conditional positional encoding
This paper presents an enhanced conditional positional encoding (xCPE), implemented by directly prepending a sparse convolution layer with a skip connection before the attention layer.
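A minimal sketch of xCPE, assuming the spconv v2 library (the class name, channel handling, and `indice_key` are illustrative; any submanifold sparse convolution backend would serve):

```python
import spconv.pytorch as spconv
import torch.nn as nn

class XCPE(nn.Module):
    """Sparse conv + skip connection, applied to features before attention."""

    def __init__(self, channels: int):
        super().__init__()
        # Submanifold conv keeps the sparsity pattern (indices) unchanged.
        self.conv = spconv.SubMConv3d(channels, channels, kernel_size=3,
                                      padding=1, bias=True, indice_key="xcpe")

    def forward(self, x: spconv.SparseConvTensor) -> spconv.SparseConvTensor:
        out = self.conv(x)
        # Skip connection on the per-point feature matrix.
        return out.replace_feature(out.features + x.features)
```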