Impact Statement:
Artificial intelligence (AI) has unlocked myriad possibilities to help people with disabilities, e.g., giving voice to nonverbal people, translating sign language, and assisting people with autism and motor disabilities. Recently, the integration of vision and language has further enabled AI to assist the nearly 2.2 billion people with vision impairment. Such AI models must comprehend both the visual and language domains to provide solutions for the daily-life challenges of the visually impaired, e.g., navigation, reading, and understanding of surrounding events. Dense video captioning (DVC) is one of the challenges that the vision and language research communities jointly tackle to describe visual events in natural language. Our algorithm leverages both modalities to enhance the comprehension capability of a DVC framework.
Abstract:
Dense video captioning requires localization and description of multiple events in long videos. Prior works detect events in videos relying solely on the visual content and completely ignore the semantics (captions) related to the events. This is undesirable because human-provided captions often also describe events that are not visually present or are too subtle to detect. In this research, we propose to capitalize on this natural kinship between events and their human-provided descriptions. We propose a semantic contextualization network to encode the visual content of videos by representing it in a semantic space. The representation is further refined to incorporate temporal information and is transformed into event descriptors using a hierarchical application of the short Fourier transform. Our proposal network exploits the fusion of semantic and visual content, enabling it to generate semantically meaningful event proposals. For each proposed event, we attentively fuse its hidden state and descrip...
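The abstract mentions transforming temporally refined features into event descriptors through a hierarchical application of the short Fourier transform. Below is a minimal sketch of that general idea, not the paper's implementation: the function name `hierarchical_sft`, the segmentation scheme, and the parameters `levels` and `keep` are illustrative assumptions.

```python
import numpy as np

def hierarchical_sft(features, levels=3, keep=4):
    """Sketch: encode a (T, D) sequence of per-frame features into a
    fixed-length descriptor by applying a short Fourier transform over
    progressively finer segments of the sequence.

    `levels` and `keep` (low-frequency coefficients retained per segment)
    are illustrative parameters, not values from the paper.
    """
    descriptor = []
    for level in range(levels):
        n_segments = 2 ** level                       # 1, 2, 4, ... segments
        for seg in np.array_split(features, n_segments, axis=0):
            # Fourier transform along the temporal axis of this segment.
            spectrum = np.abs(np.fft.rfft(seg, axis=0))
            coeffs = spectrum[:keep]                  # keep lowest frequencies
            if coeffs.shape[0] < keep:                # pad very short segments
                coeffs = np.pad(coeffs, ((0, keep - coeffs.shape[0]), (0, 0)))
            descriptor.append(coeffs.reshape(-1))
    return np.concatenate(descriptor)                 # fixed-length event descriptor

# Example: 120 frames of 512-D (e.g., semantically contextualized) features.
desc = hierarchical_sft(np.random.randn(120, 512))
print(desc.shape)
```

The hierarchy here simply halves the segments at each level so that coarser levels capture slow temporal dynamics and finer levels capture local ones; the actual clip partitioning used by the authors may differ.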
Published in: IEEE Transactions on Artificial Intelligence (Volume: 3, Issue: 5, October 2022)