Many applications of video object detection require real-time detections, but they cannot leverage the high accuracy of current state-of-the-art (SOTA) video object detection models because of those models' high computational requirements. A popular workaround is to reduce the frame sampling rate, but this discards important temporal information from the skipped frames. As a result, the object detection models most widely used in real-time applications are image-based single-stage models. There is therefore a need for a model that can capture temporal information from the other frames in a video to boost detection results while remaining real-time. To this end, we propose YOLO-SWINF, a YOLOX-based video object detection model. Specifically, we introduce a 3D-attention-based module that uses a three-dimensional window to capture information across the temporal dimension. We integrate this module with the YOLOX backbone to take advantage of the single-stage nature of YOLOX. We extensively test this module on the ImageNet-VID dataset and show that it improves over the baseline by 3 AP points while increasing inference time by less than 1 ms. Our model is comparable in accuracy to current real-time SOTA models while being the fastest. Our YOLO-SWINF-X model achieves 80.4% AP at 38 FPS on an NVIDIA 1080Ti GPU.
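
To illustrate what attention over a three-dimensional window can look like, the following is a minimal NumPy sketch, not the YOLO-SWINF implementation. It assumes a feature volume of shape (T, H, W, C), non-overlapping windows, and single-head attention without learned projections. The function names `window_partition_3d` and `window_self_attention`, the window size, and the tensor sizes are all hypothetical choices made for illustration.

```python
import numpy as np

def window_partition_3d(x, window):
    """Split a (T, H, W, C) feature volume into non-overlapping 3D windows.

    Returns an array of shape (num_windows, wt * wh * ww, C), where each
    window holds tokens from several frames as well as several positions.
    """
    T, H, W, C = x.shape
    wt, wh, ww = window
    assert T % wt == 0 and H % wh == 0 and W % ww == 0, "sizes must divide evenly"
    # Expose the window axes, then move them next to each other.
    x = x.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, C)

def window_self_attention(tokens):
    """Scaled dot-product self-attention computed independently per window.

    Assumption: queries, keys, and values are the tokens themselves
    (no learned projections), to keep the sketch short.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Hypothetical sizes: 4 frames, an 8x8 feature map, 16 channels.
feats = np.random.default_rng(0).standard_normal((4, 8, 8, 16))
windows = window_partition_3d(feats, window=(2, 4, 4))
out = window_self_attention(windows)
print(windows.shape, out.shape)  # (8, 32, 16) (8, 32, 16)
```

Because each window spans two frames here, every token attends to tokens from a neighbouring frame as well as its spatial neighbours, which is how a 3D window can carry temporal information at a cost that grows with window size rather than with the full video volume.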
*** Title, author list and abstract as seen in the Camera-Ready version of the paper that was provided to Conference Committee. Small changes that may have occurred during processing by Springer may not appear in this window.