Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding (IEEE Conference Publication)