STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding | IEEE Conference Publication | IEEE Xplore