Enhancing Ocean Scene Video Captioning with Multimodal Pre-Training and Video-Swin-Transformer | IEEE Conference Publication | IEEE Xplore