In recent years, camera-only 3D object detectors have made significant progress, largely driven by the adoption of Bird's-Eye-View (BEV) representations. However, a notable limitation remains: BEV representations weaken height information, especially in operations such as voxel pooling, which flatten 3D voxel features directly onto a 2D plane to build the BEV feature. To address this issue, we propose GaussBEV, a novel and effective method that departs from the conventional construction of BEV features. It first introduces slice-voxel-pooling to preserve height information and categorizes objects into groups based on statistics so that each group can be processed differently. Exploiting the distinct spatial distribution within each group, we design a Gaussian Weight Generator (GWG) module that reweights voxel features using learnable Gaussian parameters, generating group features that largely retain the corresponding group-wise height information. Subsequently, an Efficient Channel Attention (ECA) FPN is introduced to provide a global feature, which is combined with the group features to capture both group spatial information and global semantics. This combination strategy yields a comprehensive and detailed representation of the 3D environment. With the combined features, we use multiple detection heads, one per group, where each head focuses on the feature-constructing procedure of the corresponding group. Extensive experiments and thorough analysis on the nuScenes dataset validate the effectiveness of GaussBEV.
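The contrast between flat voxel pooling and Gaussian reweighting along the height axis can be illustrated with a small sketch. This is not the GaussBEV implementation: the function names, the height-bin centers, the two example groups, and their (mean, standard deviation) values are all hypothetical, and the Gaussian parameters are fixed constants here, whereas the abstract describes them as learnable.

```python
import math


def gaussian_height_weights(heights, mu, sigma):
    """Return Gaussian weights over height-bin centers, normalized to sum to 1."""
    raw = [math.exp(-((h - mu) ** 2) / (2.0 * sigma ** 2)) for h in heights]
    total = sum(raw)
    return [r / total for r in raw]


def pool_column(column, weights=None):
    """Collapse one voxel column (Z height bins x C channels) into a C-dim feature.

    With weights=None every bin gets weight 1, i.e. plain sum pooling,
    which discards where along the height axis the response came from.
    """
    if weights is None:
        weights = [1.0] * len(column)
    num_channels = len(column[0])
    return [
        sum(w * voxel[c] for w, voxel in zip(weights, column))
        for c in range(num_channels)
    ]


# Hypothetical height-bin centers (meters) for a single BEV cell.
heights = [-1.0, 0.0, 1.0, 2.0, 3.0]

# Toy voxel column: channel 0 responds only in the 0 m and 1 m bins,
# channel 1 is constant across height.
column = [[0.0, 1.0], [4.0, 1.0], [4.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

# Assumed per-group Gaussian parameters (mu, sigma); fixed here for illustration.
group_params = {"low": (0.5, 0.5), "tall": (2.5, 0.5)}

flat = pool_column(column)
grouped = {
    name: pool_column(column, gaussian_height_weights(heights, mu, sigma))
    for name, (mu, sigma) in group_params.items()
}

print("flat pooling:", flat)
for name, feature in grouped.items():
    print(name, "group feature:", [round(v, 3) for v in feature])
```

In this toy column, the "low" group feature keeps most of the channel-0 response while the "tall" group feature suppresses it, so the two group features differ even though flat pooling would produce a single shared vector.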