MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Shaocong Dong, Lihe Ding, Haiyang Wang, Tingfa Xu*, Xinli Xu, Ziyang Bian, Ying Wang, Jie Wang, Jianan Li*

*Corresponding author for this work

School of Optics and Photonics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

23 Citations (Scopus)

Abstract

3D object detection from LiDAR point clouds is fundamental to autonomous driving. Large-scale outdoor scenes usually feature significant variance in instance scales, and therefore require features rich in both long-range and fine-grained information to support accurate detection. Recent detectors leverage window-based transformers to model long-range dependencies but tend to blur out fine-grained details. To bridge this gap, we present a novel Mixed-scale Sparse Voxel Transformer, named MsSVT, which captures both types of information simultaneously through a divide-and-conquer approach. Specifically, MsSVT explicitly divides attention heads into multiple groups, each in charge of attending to information within a particular range. The outputs of all groups are merged to obtain the final mixed-scale features. Moreover, we provide a novel chessboard sampling strategy to reduce the computational complexity of applying a window-based transformer in 3D voxel space. To improve efficiency, we also implement the voxel sampling and gathering operations sparsely with a hash map. Owing to its capability and efficiency in modeling mixed-scale information, our single-stage detector built on top of MsSVT surprisingly outperforms state-of-the-art two-stage detectors on Waymo. Our project page: https://github.com/dscdyc/MsSVT.
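The sampling and gathering ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the cubic neighborhood, and the 2×2 parity pattern used for chessboard sampling are all assumptions made for illustration, and the paper's actual pattern and window shapes may differ. The sketch shows a hash map from voxel coordinates to feature-row indices, a lookup that gathers only non-empty voxels within a given range, one key set per attention-head group, and a parity-based choice of query voxels.

```python
from typing import Dict, List, Sequence, Tuple

Voxel = Tuple[int, int, int]  # integer (x, y, z) grid coordinate


def build_voxel_hash(coords: Sequence[Voxel]) -> Dict[Voxel, int]:
    """Map each non-empty voxel coordinate to its row index in a feature table."""
    return {c: i for i, c in enumerate(coords)}


def gather_neighbors(table: Dict[Voxel, int], center: Voxel, radius: int) -> List[int]:
    """Return sorted indices of non-empty voxels in a cube of half-width `radius`.

    Empty voxels are skipped by the hash lookup, so no dense grid is needed.
    (Hypothetical helper; the cubic range is an assumption.)
    """
    cx, cy, cz = center
    found = []
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            for dz in range(-radius, radius + 1):
                idx = table.get((cx + dx, cy + dy, cz + dz))
                if idx is not None:
                    found.append(idx)
    return sorted(found)


def mixed_scale_keys(
    table: Dict[Voxel, int], center: Voxel, radii: Sequence[int]
) -> List[List[int]]:
    """One key set per head group, each group attending within its own range."""
    return [gather_neighbors(table, center, r) for r in radii]


def chessboard_queries(coords: Sequence[Voxel], phase: int) -> List[Voxel]:
    """Keep voxels whose (x, y) parity matches `phase` in {0, 1, 2, 3}.

    Assumed pattern: cycling `phase` over four passes visits every voxel once,
    so each pass handles roughly a quarter of the queries on a dense grid.
    """
    px, py = phase % 2, phase // 2
    return [c for c in coords if c[0] % 2 == px and c[1] % 2 == py]
```

With such a layout, a small-radius group sees only nearby voxels while a large-radius group sees a wider context, and the per-group attention outputs would then be merged (for example, concatenated along the channel dimension) to form mixed-scale features.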

Original language	English
Title of host publication	Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Editors	S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
Publisher	Neural information processing systems foundation
ISBN (Electronic)	9781713871088
Publication status	Published - 2022
Event	36th Conference on Neural Information Processing Systems, NeurIPS 2022, New Orleans, United States (28 Nov 2022 → 9 Dec 2022)

Publication series

Name	Advances in Neural Information Processing Systems
Volume	35
ISSN (Print)	1049-5258

Conference

Conference	36th Conference on Neural Information Processing Systems, NeurIPS 2022
Country/Territory	United States
City	New Orleans
Period	28/11/22 → 9/12/22

Cite this

Dong, S., Ding, L., Wang, H., Xu, T., Xu, X., Bian, Z., Wang, Y., Wang, J., & Li, J. (2022). MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022 (Advances in Neural Information Processing Systems; Vol. 35). Neural information processing systems foundation.

@inproceedings{237661cb5c884c2c8056f21e92407783,

title = "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds",

author = "Shaocong Dong and Lihe Ding and Haiyang Wang and Tingfa Xu and Xinli Xu and Ziyang Bian and Ying Wang and Jie Wang and Jianan Li",

note = "Publisher Copyright: {\textcopyright} 2022 Neural information processing systems foundation. All rights reserved.; 36th Conference on Neural Information Processing Systems, NeurIPS 2022 ; Conference date: 28-11-2022 Through 09-12-2022",

year = "2022",

language = "English",

series = "Advances in Neural Information Processing Systems",

publisher = "Neural information processing systems foundation",

editor = "S. Koyejo and S. Mohamed and A. Agarwal and D. Belgrave and K. Cho and A. Oh",

booktitle = "Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022",

}

TY - GEN

T1 - MsSVT

T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022

AU - Dong, Shaocong

AU - Ding, Lihe

AU - Wang, Haiyang

AU - Xu, Tingfa

AU - Xu, Xinli

AU - Bian, Ziyang

AU - Wang, Ying

AU - Wang, Jie

AU - Li, Jianan

PY - 2022

Y1 - 2022

UR - http://www.scopus.com/inward/record.url?scp=85152943929&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85152943929

T3 - Advances in Neural Information Processing Systems

BT - Advances in Neural Information Processing Systems 35 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022

A2 - Koyejo, S.

A2 - Mohamed, S.

A2 - Agarwal, A.

A2 - Belgrave, D.

A2 - Cho, K.

A2 - Oh, A.

PB - Neural information processing systems foundation

Y2 - 28 November 2022 through 9 December 2022

ER -
