DOI: 10.1145/2983563
iV&L-MM '16: Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion
ACM 2016 Proceeding
Publisher: Association for Computing Machinery, New York, NY, United States
Conference:
MM '16: ACM Multimedia Conference, Amsterdam, The Netherlands, 16 October 2016
ISBN:
978-1-4503-4519-4
Published:
16 October 2016

Abstract

It is our great pleasure to welcome you to the ACM Multimedia 2016 Workshop on Vision and Language Integration Meets Multimedia Fusion (iV&L-MM 2016) in Amsterdam, The Netherlands, on October 16, 2016.

Multimodal information fusion, at both the signal and the semantic level, is a core component of most multimedia applications, including multimedia indexing, retrieval and summarization. Early and late fusion of modality-specific processing results has been addressed in multimedia prototypes since their very early days, through various methodologies including rule-based approaches, information-theoretic models and machine learning. Vision and language are two of the predominant modalities being fused, and they have attracted special attention in international challenges with a long history of results, such as TRECVid and ImageCLEF. During the last decade, vision-language semantic integration has also attracted attention from traditionally non-interdisciplinary research communities, such as Computer Vision and Natural Language Processing, because one modality can greatly assist the processing of another by providing cues for disambiguation, complementary information and noise/error filtering. The recent boom of deep learning methods has opened up new directions in the joint modelling of visual and co-occurring verbal information in multimedia discourse.
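
As an illustrative aside, the distinction between early and late fusion referred to above can be sketched in a few lines of Python; the feature arrays, the logistic-regression classifiers and the equal score weighting below are assumptions made purely for this example, not a method from any workshop paper.

    # Minimal sketch: early vs. late fusion of visual and textual features (toy data).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_visual = rng.normal(size=(100, 512))   # e.g. CNN image features (assumed)
    X_text = rng.normal(size=(100, 300))     # e.g. word-embedding features (assumed)
    y = rng.integers(0, 2, size=100)         # toy binary labels

    # Early fusion: concatenate modality-specific features, train a single model.
    early_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_visual, X_text]), y)

    # Late fusion: train one model per modality, then combine their output scores.
    clf_v = LogisticRegression(max_iter=1000).fit(X_visual, y)
    clf_t = LogisticRegression(max_iter=1000).fit(X_text, y)
    late_scores = 0.5 * clf_v.predict_proba(X_visual) + 0.5 * clf_t.predict_proba(X_text)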

The proceedings contain seven selected long papers, which were presented orally at the workshop, and three abstracts of the invited keynote speeches. The papers and abstracts discuss data collection, representation learning, deep learning approaches, matrix and tensor factorization methods, and graph-based clustering with regard to the fusion of multimedia data. A variety of applications are presented, including image captioning, summarization of news, video hyperlinking, sub-shot segmentation of user-generated video, cross-modal classification, cross-modal question answering, and the detection of misleading metadata of user-generated video.

The call for papers attracted submissions from Europe, Asia, Australia and the United States. We received 15 long papers, of which the program committee reviewed and accepted 7, an acceptance rate of about 47%. The accepted long papers are presented orally at the workshop. We also encourage attendees to attend the keynote presentations; these valuable and insightful talks can guide us toward a better understanding of the field's future:

  • Explain and Answer: Relating Natural Language and Visual Recognition, Marcus Rohrbach (University of California Berkeley, USA)

  • Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings, Frank Keller (University of Edinburgh, UK)

  • Beyond Language and Vision, Towards Truly Multimedia Integration, Tat-Seng Chua (National University of Singapore, Singapore)

SESSION: Paper Session 1
research-article
Exploiting Scene Context for Image Captioning

This paper presents a framework for image captioning by exploiting the scene context. To date, most captioning models have relied on the combination of Convolutional Neural Networks (CNN) and the Long Short-Term Memory (LSTM) model, trained ...
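
As a hedged illustration of the CNN-plus-LSTM pattern this abstract mentions (not the paper's actual model), a minimal decoder can be sketched in PyTorch; all layer names and sizes are assumptions for the example.

    # Minimal sketch of a CNN-encoder / LSTM-decoder captioning head (assumed sizes).
    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.img_proj = nn.Linear(feat_dim, embed_dim)    # project CNN features
            self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)      # next-word scores

        def forward(self, img_feats, captions):
            # Prepend the projected image feature as the first step of the sequence.
            img_tok = self.img_proj(img_feats).unsqueeze(1)
            seq = torch.cat([img_tok, self.embed(captions)], dim=1)
            hidden, _ = self.lstm(seq)
            return self.out(hidden)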

SESSION: Paper Session 2
research-article
News Event Understanding by Mining Latent Factors From Multimodal Tensors

We present a novel and efficient constrained tensor factorization algorithm that first represents a video archive, of multimedia news stories concerning a news event, as a sparse tensor of order 4. The dimensions correspond to extracted visual memes, ...
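
For readers unfamiliar with the data structure the abstract describes, the sketch below shows how an order-4 count tensor might be assembled; only the "visual memes" dimension comes from the abstract, while the remaining dimensions, sizes and toy observations are assumptions for illustration.

    # Minimal sketch: filling an order-4 co-occurrence tensor over a news archive.
    import numpy as np

    n_memes, n_dim2, n_dim3, n_dim4 = 50, 200, 10, 30   # sizes are assumed
    tensor = np.zeros((n_memes, n_dim2, n_dim3, n_dim4))

    # Each observation is one (meme, index2, index3, index4) co-occurrence (toy data).
    observations = [(3, 17, 2, 5), (3, 17, 2, 5), (8, 42, 0, 11)]
    for i, j, k, l in observations:
        tensor[i, j, k, l] += 1

    print(tensor.shape, tensor.sum())   # (50, 200, 10, 30) 3.0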

research-article
Cross-modal Classification by Completing Unimodal Representations

We argue that cross-modal classification, where models are trained on data from one modality (e.g. text) and applied to data from another (e.g. image), is a relevant problem in multimedia retrieval. We propose a method that addresses this specific ...
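
One common way to relate representations across modalities, shown here only as a hedged sketch rather than the paper's method, is to project paired text and image features into a shared space with canonical correlation analysis; the feature arrays and component count are assumptions.

    # Minimal sketch: a shared text/image space via CCA on paired training data.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    X_text = rng.normal(size=(200, 300))    # paired text features (assumed)
    X_image = rng.normal(size=(200, 512))   # paired image features (assumed)

    cca = CCA(n_components=10).fit(X_text, X_image)
    Z_text, Z_image = cca.transform(X_text, X_image)   # both modalities, one space
    # A classifier trained on Z_text can now be applied to Z_image, and vice versa.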

research-article
Semantic Indexing of Wearable Camera Images: Kids'Cam Concepts

In order to provide content-based search over image media, including images and video, such media are typically accessed via manually or automatically assigned concepts or tags, or sometimes via image-to-image similarity, depending on the use case. While ...

SESSION: Keynote 2
invited-talk
Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings

The amount of image data available on the web is growing rapidly: on Facebook alone, 350 million new images are uploaded every day. Making sense of this data requires new ways of efficiently indexing, annotating, and querying such enormous collections. ...

SESSION: Paper Session 3
research-article
Multimodal and Crossmodal Representation Learning from Textual and Visual Features with Bidirectional Deep Neural Networks for Video Hyperlinking

Video hyperlinking represents a classical example of multimodal problems. Common approaches to such problems are early fusion of the initial modalities and crossmodal translation from one modality to the other. Recently, deep neural networks, especially ...

research-article
User Video Summarization Based on Joint Visual and Semantic Affinity Graph

Automatically generating summaries of user-generated videos is very useful but challenging. User-generated videos are unedited and usually only contain a long single shot which makes traditional video temporal segmentation methods such as shot boundary ...
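
As a hedged sketch of the general idea of grouping sub-shots over a joint affinity graph (not the authors' algorithm), per-modality similarities can be mixed and handed to an off-the-shelf spectral clustering; the features, mixing weight and cluster count are assumptions.

    # Minimal sketch: joint visual/semantic affinity graph + spectral clustering.
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    visual = rng.normal(size=(40, 128))    # one visual feature per sub-shot (assumed)
    semantic = rng.normal(size=(40, 50))   # e.g. concept scores per sub-shot (assumed)

    # Joint affinity: weighted sum of per-modality similarities, clipped to non-negative.
    affinity = np.clip(0.5 * cosine_similarity(visual) + 0.5 * cosine_similarity(semantic), 0, None)
    labels = SpectralClustering(n_clusters=5, affinity="precomputed", random_state=0).fit_predict(affinity)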

research-article
Disinformation in Multimedia Annotation: Misleading Metadata Detection on YouTube

The popularity of online videos is increasing at a rapid rate. Not only can users access these videos online, they can also upload video content to platforms like YouTube and Myspace. These videos are indexed by user-generated multimedia annotation, ...

SESSION: Keynote 3
invited-talk
Beyond Language and Vision, Towards Truly Multimedia Integration

Text has been the dominant medium for understanding the world around us. More recently, because of the increasing amount of visual content without text, computer vision technology has been making waves in a number of visually oriented ...

Contributors
  • Institute for Language and Speech Processing
  • Boston University

Acceptance Rates

iV&L-MM '16 Paper Acceptance Rate: 7 of 15 submissions, 47%
Overall Acceptance Rate: 7 of 15 submissions, 47%

Year           Submitted   Accepted   Rate
iV&L-MM '16    15          7          47%
Overall        15          7          47%