Multi-Granularity Aggregation Transformer for Joint Video-Audio-Text Representation Learning | IEEE Journals & Magazine | IEEE Xplore