Big multimodal multimedia data can transform our understanding of humans and society, but the massive volumes of data arriving from different media sources require feasible methods for extracting the information they contain. Recent advances in deep learning can help researchers better handle big multimedia data analytics in order to study user behavior patterns and understand their practical implications for various applications. This summary reports some recent advancements in this area. The special issue presents nine articles selected after a careful peer-review process. The topics addressed range from cross-modal retrieval to deep model design, representation learning, clustering, and image processing, together with a comprehensive survey of big multimodal multimedia data analytics.
The purpose of cross-modal retrieval is to discover the relationships between samples of different modalities and, given a query sample in one modality, to retrieve semantically similar samples in another. However, existing methods often ignore the semantic correlation within the same modality across different multimodal samples. To overcome this limitation, Zhang et al. propose HCMSL, a novel hybrid cross-modal similarity learning model that captures sufficient semantic information from both labeled and unlabeled cross-modal pairs and from intramodal pairs sharing the same class label. In addition, two Siamese convolutional neural networks are employed to learn intramodal similarity from samples of the same modality. These intramodal similarities are fused with the cross-modal similarity to construct a hybrid cross-modal similarity loss, thereby transferring intramodal semantic correlation into cross-modal similarity when training the common subspace learning model.
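As a concrete illustration, the sketch below shows one plausible way to fuse intramodal Siamese similarities with a cross-modal term into a single hybrid loss; the function name, the margin hinge, and the weighting factor `alpha` are our assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hybrid_similarity_loss(img_emb, txt_emb, labels, margin=0.2, alpha=0.5):
    """img_emb, txt_emb: (N, d) projections into the common subspace;
    labels: (N,) class ids. All names and weights here are illustrative."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    n = img_emb.size(0)
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()  # (N, N) label-match mask

    # Cross-modal term: a matched image-text pair should be closer than
    # any mismatched pair by at least `margin` (hinge on cosine similarity).
    cross = img_emb @ txt_emb.t()
    pos = cross.diagonal().unsqueeze(1)
    off_diag = 1 - torch.eye(n, device=cross.device)
    cross_loss = (F.relu(margin + cross - pos) * off_diag).mean()

    # Intramodal Siamese terms: same-label pairs within one modality
    # should also be similar; this injects intramodal semantics.
    intra_img = ((1 - img_emb @ img_emb.t()) * same).mean()
    intra_txt = ((1 - txt_emb @ txt_emb.t()) * same).mean()
    return cross_loss + alpha * (intra_img + intra_txt)
```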
Xu et al. address two shortcomings of existing Zero-Shot Cross-Modal Retrieval (ZS-CMR) models: poor generalization ability in the zero-shot setting and relatively inferior performance. To this end, they propose AAEGAN (Assembling AutoEncoder and Generative Adversarial Network), a novel method for the more realistic ZS-CMR scenario that combines the strengths of the AutoEncoder and the Generative Adversarial Network to jointly perform common latent space learning, knowledge transfer, and feature synthesis. In addition, a novel distribution-alignment constraint preserves the semantic compatibility between modalities while enhancing common latent space learning, which helps learn a more robust common space.
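The following PyTorch sketch illustrates the general assembly the paper describes, under our own assumptions about dimensions and losses (4096-d image features, 300-d text features, a mean-matching surrogate for distribution alignment); it is not the AAEGAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent = 128
enc_img = nn.Sequential(nn.Linear(4096, 512), nn.ReLU(), nn.Linear(512, latent))
enc_txt = nn.Sequential(nn.Linear(300, 512), nn.ReLU(), nn.Linear(512, latent))
dec_img = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(), nn.Linear(512, 4096))
disc = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(), nn.Linear(64, 1))

def losses(img, txt):
    """Return the three loss terms; in training, the discriminator maximizes
    the adversarial term while the encoders minimize it (alternating updates)."""
    z_i, z_t = enc_img(img), enc_txt(txt)
    rec = F.mse_loss(dec_img(z_i), img)                      # autoencoder term
    logits = torch.cat([disc(z_i), disc(z_t)])               # image = 1 vs. text = 0
    target = torch.cat([torch.ones(img.size(0), 1), torch.zeros(txt.size(0), 1)])
    adv = F.binary_cross_entropy_with_logits(logits, target)
    align = (z_i.mean(0) - z_t.mean(0)).pow(2).sum()         # crude distribution alignment
    return rec, adv, align
```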
Fu et al. show that most existing graph convolutional network-based methods depend on static structural relationships in the data, so the features extracted during convolution lack representativeness. To resolve this, they develop the Dynamic Graph Learning Convolutional Network (DGLCN), a semisupervised model that learns the graph structure dynamically. Its single-layer propagation rule is obtained by optimizing a spectral dynamic graph convolution; by fusing the optimized structural information, the multilayer DGLCN extracts richer sample features and improves classification performance.
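A minimal sketch of such a dynamic graph-convolution layer is given below: the adjacency is re-estimated from the current features at every forward pass and plugged into the standard propagation rule H' = ReLU(D^{-1/2} A D^{-1/2} H W). The k-nearest-neighbor construction and cosine similarity are illustrative assumptions, not the DGLCN derivation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGraphConv(nn.Module):
    def __init__(self, in_dim, out_dim, k=10):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)
        self.k = k  # keep the k strongest neighbors per node

    def forward(self, h):
        # Re-estimate the graph from the current features (cosine similarity).
        z = F.normalize(h, dim=1)
        sim = z @ z.t()
        topk = sim.topk(self.k, dim=1).indices
        adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
        eye = torch.eye(h.size(0), device=h.device)
        adj = ((adj + adj.t() + eye) > 0).float()   # symmetrize, add self-loops
        d = adj.sum(1).rsqrt().diag()               # D^{-1/2}
        return torch.relu(d @ adj @ d @ self.lin(h))
```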
Despite the good performance of low-rank coding-based representation learning in discovering and recovering the subspace structures in data, its single-layer structure prevents it from capturing deep hidden information. Zhang et al. propose DLRF-Net, a new progressive deep latent low-rank fusion network that uncovers the deep features and clustering structures embedded in latent subspaces; by obtaining deep hidden information, it ensures that the representation learning in deeper layers discovers the underlying clean subspaces. In addition, as the authors indicate, DLRF-Net is general and applicable to most existing latent low-rank representation models.
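To make the low-rank building block concrete, the sketch below applies singular-value thresholding (the proximal operator of the nuclear norm) progressively, so that each stage feeds a cleaner low-rank output to the next; this is a generic illustration of progressive low-rank refinement, not the DLRF-Net architecture.

```python
import numpy as np

def svt(x, tau):
    """Proximal step for tau * nuclear norm: shrink the singular values."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(np.maximum(s - tau, 0.0)) @ vt

rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))  # rank-5 data
noisy = clean + 0.3 * rng.standard_normal(clean.shape)

# Progressive refinement: each stage's low-rank output feeds the next stage.
h = noisy
for tau in (5.0, 2.0, 1.0):
    h = svt(h, tau)
print(np.linalg.matrix_rank(h), np.linalg.norm(h - clean) / np.linalg.norm(clean))
```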
To effectively handle missing features in Gaussian Mixture Model (GMM) clustering, Zhang et al. propose to integrate imputation and GMM clustering into a unified learning procedure. Specifically, the missing data is filled in using the result of GMM clustering, and the imputed data is then used for the next round of GMM clustering. The two steps alternately negotiate with each other to reach an optimal result. Furthermore, a two-step alternating algorithm with proven convergence is designed to solve the resulting optimization problem.
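A minimal sketch of this alternation, assuming mean imputation for initialization and most-likely-component means for refilling (both our choices for illustration), might look as follows.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def impute_and_cluster(x, n_components=3, n_iters=10, seed=0):
    """x: (N, d) array with np.nan marking missing entries."""
    x = x.copy()
    miss = np.isnan(x)
    col_means = np.nanmean(x, axis=0)
    x[miss] = np.take(col_means, np.where(miss)[1])  # initial mean imputation
    for _ in range(n_iters):
        gmm = GaussianMixture(n_components, random_state=seed).fit(x)
        labels = gmm.predict(x)
        # Negotiate: refill each missing entry from its component's mean,
        # then re-cluster the newly imputed data in the next iteration.
        x[miss] = gmm.means_[labels][miss]
    return labels, x
```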
The study by Zhang et al. focuses on the user credit grading problem, with the goals of anomaly detection, risk early warning, and personalized information and service recommendation for privileged users. First, three naturally ordered categories defined from user registration and behavior information allow user credit grading to be formulated as an ordinal regression problem. To avoid the fragility of KDLOR (Kernel Discriminant Learning for Ordinal Regression), they adopt a robust sampling model with a triplet metric constraint to balance the distribution while addressing overfitting and underfitting. Finally, the improved sampling method mines hard negative samples to enhance the robustness and effectiveness of the ordinal regression.
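The sketch below shows one plausible form of such a triplet metric constraint for ordered grades, where the margin grows with the ordinal distance between grades; this ordinal-aware margin is our assumption for illustration, not necessarily the constraint used in the paper.

```python
import torch
import torch.nn.functional as F

def ordinal_triplet_loss(anchor, positive, negative, g_anchor, g_negative,
                         base_margin=0.2):
    """anchor/positive share a credit grade; negative has a different grade.
    g_anchor, g_negative: integer grade tensors, one entry per triplet."""
    # Margin scales with ordinal distance: confusing grade 1 with grade 3
    # is penalized more heavily than confusing grade 1 with grade 2.
    margin = base_margin * (g_anchor - g_negative).abs().float()
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```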
Low-light images captured in nonuniform illumination environments are degraded as a function of scene depth and the corresponding environment lights, resulting in severe loss of object information in the degraded image. Unlike existing Salient Object Detection (SOD) methods that operate directly on the original degraded images, Xu et al. eliminate the effect of low illumination by explicitly modeling the physical lighting of the environment for image enhancement. Specifically, they propose an image enhancement approach that facilitates SOD in low-light images, together with a Non-Local-Block Layer that captures the difference between the local content of an object and its local neighborhood in the favored regions. Moreover, they create a low-light image dataset for evaluating SOD performance.
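As an illustration of what inverting a physical lighting model can look like, the sketch below assumes a generic degradation of the form I = J·t + A·(1 − t), with environment light A and transmission t; the authors' actual model may differ.

```python
import numpy as np

def enhance(img, ambient, transmission, t_min=0.1):
    """img: HxWx3 float array in [0, 1]; ambient: length-3 environment light;
    transmission: HxW map (depth-dependent). All assumed estimated upstream."""
    t = np.clip(transmission, t_min, 1.0)[..., None]  # avoid dividing by ~0
    radiance = (img - ambient * (1.0 - t)) / t        # invert I = J*t + A*(1-t)
    return np.clip(radiance, 0.0, 1.0)
```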
Li et al. design a lightweight Dense Connection Distillation Network for single image super-resolution by combining feature fusion units with dense connection distillation blocks that contain selective cascading and dense distillation components. In each dense connection distillation block, the distillation mechanism reduces the number of training parameters and improves training efficiency, while the contrast-aware channel attention layer further improves the model's performance. The network exploits more useful features (edges, angles, textures, etc.) for image restoration. Experimental results on several benchmark datasets show that the proposed method achieves a better tradeoff between accuracy and efficiency.
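The sketch below illustrates a contrast-aware channel attention unit in its commonly used form, where channels are re-weighted by a summary combining the per-channel mean and standard deviation rather than mean-only pooling; the layer sizes and reduction ratio here are assumptions.

```python
import torch
import torch.nn as nn

class ContrastChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        # "Contrast" summary: per-channel mean plus standard deviation,
        # so high-variance (texture/edge-rich) channels can be emphasized.
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True)
        return x * self.fc(mean + std)
```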
Finally, Wang offers a comprehensive overview of the state of the art in multimodal multimedia data analytics, from shallow to deep spaces. Wang argues that the critical issue for the existing state of the art is how to perform multimodal collaboration, including adversarial deep multimodal collaboration, so as to better fuse complementary multimodal information. Throughout the survey, Wang further indicates that the critical components of this field are collaboration, adversarial competition, and fusion over multimodal spaces. Experimental results of state-of-the-art deep multimodal/cross-modal architectures on benchmark multimodal datasets are also summarized.