Multi-camera BEV (bird's-eye view) object detection has become a mainstream paradigm in autonomous driving. However, many existing methods perform poorly on extremely large or extremely small objects. To address this issue, this paper proposes SS-BEV, a new framework for multi-camera BEV detection. First, we design a more expressive feature extraction module, A-WBFP, which cascades parallel atrous convolutions and weighted bidirectional feature pyramid blocks into the backbone network. This design prevents small-object information from being lost in deeper network layers and enlarges the receptive field, producing feature maps enriched with contextual information. Second, we introduce the MORD module, which leverages the accurate depth information from radar point clouds to improve the model's understanding of the spatial structure of both large and small objects. The module learns the relative depth between the internal structures of objects and selected reference points, and a corresponding loss function is constructed to supervise the final detection. On the challenging nuScenes validation set, SS-BEV outperforms the baseline model by 2.1 points in NDS (nuScenes Detection Score). On the nuScenes test set, it achieves 66.1% and 22.3% mAP for obstacles and construction vehicles, respectively, surpassing some methods based on multi-camera and LiDAR fusion.
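
The abstract does not spell out the internals of A-WBFP, so the following is only a rough illustration of the two ingredients it names. It assumes that the atrous branches are standard dilated convolutions and that the weighted pyramid blocks use a fast normalized fusion of the kind popularized by BiFPN; both are assumptions, not details confirmed here. The function names, the dilation rates, and the `eps` value are hypothetical.

```python
from typing import List, Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial span covered by one dilated (atrous) kernel.

    A k-tap kernel with dilation d covers k + (k - 1) * (d - 1) positions,
    which is why atrous branches widen the receptive field without
    adding parameters.
    """
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,  # hypothetical stabilizer value
) -> List[float]:
    """Fuse same-shaped feature vectors with learnable non-negative weights.

    Assumed form: out = sum_i(w_i * x_i) / (eps + sum_i(w_i)),
    where each w_i is first clamped to be non-negative (ReLU).
    """
    clamped = [max(0.0, w) for w in weights]
    total = sum(clamped) + eps
    length = len(features[0])
    return [
        sum(w * f[j] for w, f in zip(clamped, features)) / total
        for j in range(length)
    ]


if __name__ == "__main__":
    # Hypothetical dilation rates for three parallel 3x3 atrous branches.
    for rate in (1, 2, 6):
        print(rate, effective_kernel_size(3, rate))

    # Two toy feature vectors standing in for two pyramid levels.
    print(fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0]))
```

In a real network the fused inputs would be multi-channel feature maps resized to a common resolution, and the weights would be learned parameters; plain lists are used here only to keep the sketch dependency-free.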
