Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning | IEEE Conference Publication | IEEE Xplore