Cross-modal social image clustering and tag cleansing

doi:10.1016/j.jvcir.2013.06.004

Journal of Visual Communication and Image Representation

Volume 24, Issue 7, October 2013, Pages 895-910

https://doi.org/10.1016/j.jvcir.2013.06.004

Highlights

• Semantic image clustering for exploiting image topics of interest.
• Kernel combination for cross-modal contexts among weakly-tagged images.
• K-way min–max cut for social image clustering.

Abstract

In this paper, a cross-modal approach is developed for social image clustering and tag cleansing. First, a semantic image clustering algorithm is developed for assigning large-scale weakly-tagged social images into a large number of image topics of interest. Spam tags are detected automatically via sentiment analysis and multiple synonymous tags are merged as one super-topic according to their inter-topic semantic similarity contexts. Second, multiple base kernels are seamlessly combined by maximizing the correlations between the visual similarity contexts and the semantic similarity context, which can achieve more precise characterization of cross-modal (semantic and visual) similarity contexts among weakly-tagged social images. Finally, a K-way min–max cut algorithm is developed for social image clustering by minimizing the cumulative inter-cluster cross-modal similarity contexts while maximizing the cumulative intra-cluster cross-modal similarity contexts. The optimal weights for base kernel combination are simultaneously determined by minimizing the cumulative within-cluster variances. The polysemous tags and their ambiguous images are further split into multiple sub-topics for reducing their within-topic visual diversity. Our experiments on large-scale weakly-tagged Flickr images have provided very positive results.

Introduction

Collaborative image tagging systems, such as Flickr.com, have become very popular for tagging large-scale social images by relying on the collaborative efforts of a large population of Internet users [1], [2]. In a collaborative image tagging system, people may tag social images according to their social or cultural backgrounds, personal expertise and perception. We call such collaboratively-tagged social images weakly-tagged social images because their social tags may not correspond exactly to the underlying image semantics. With the exponential growth of weakly-tagged social images, it has become increasingly attractive to develop new algorithms for more effective organization and summarization of large-scale social images.

Image clustering, which can assign large amounts of images into different clusters with some common semantics or visual properties [2], [33], [34], [35], [41], [42], [43], [44], [45], is very attractive for achieving more effective organization, summarization and visualization of large-scale image collections. Clustering the search results into multiple semantic groups helps users assess the relevance between large amounts of returned images (which are returned by the same query terms) and their query intentions [33], [34], [35], [44]. Image clustering has also been used to achieve more precise organization and summarization of large-scale Flickr images according to their visual similarities [2]. Most existing algorithms for image clustering focus only on low-level visual features [44], so their effectiveness is doubtful because of the semantic gap [55], [56], [57], [58]. Some researchers have recently integrated visual and textual features for Web image clustering and reported higher accuracy rates [33], [34], [35], [41], [42], [43].

In a collaborative image tagging space, there are two inter-related information sources that can be used to support image clustering: (1) visual properties of weakly-tagged social images; and (2) their social tags. It is worth noting that the visual properties of weakly-tagged social images and their social tags can offer complementary strengths, thus it is very attractive to develop new frameworks that are able to integrate the visual features with the social tags for achieving more precise clustering of large-scale weakly-tagged social images. Unfortunately, this is not a trivial task because of the following issues:

(a) Social Tag Ambiguity: Different people may use different social tags (i.e., text terms), which have the same or close meaning (synonyms), to tag their social images [5], [26], [27], [28], [29], [30], [31]. The weakly-tagged social images, which belong to a set of synonymous tags, may share some common visual properties and semantics. The appearance of synonymous tags may prevent most existing clustering algorithms from deriving more representative image clusters. On the other hand, collaborative image tagging is an ambiguous process [22], [23], [24], [25]. Different people may apply the same social tag in different ways (i.e., the polysemous tag may have different meanings under different contexts), which may result in large amounts of ambiguous images with diverse visual properties. Because the effectiveness of most existing clustering algorithms largely depends on the accuracy of the underlying functions for data similarity characterization, the appearance of polysemous tags and their ambiguous images may bring new challenges for social image clustering, e.g., it is hard to design suitable similarity functions for characterizing the diverse image similarity contexts accurately. In a collaborative image tagging space, dishonest users may use spam tags to tag their social images, so that they can drive traffic to their social images for fun or profit [37], [38], [39]. The appearance of spam tags may result in large amounts of junk images, which may mislead most existing clustering algorithms into deriving less representative clusters from large-scale weakly-tagged social images.

(b) Visual Ambiguity and Semantic Gap: Multiple types of visual features are usually extracted to achieve more sufficient characterization of various visual properties of the images, thus the distributions of the images could be very sparse and the visual similarity contexts among the images could be very diverse in the high-dimensional feature space (i.e., visual ambiguity). As a result, it is very hard to use a single type of base kernel, such as the RBF kernel [7], [8], [9], [10], [11], [12], to achieve precise characterization of the diverse visual similarity contexts among the images. In addition, there may be large amounts of outliers in the high-dimensional feature space and most existing clustering algorithms may seriously suffer from the problem of skewed cuts. Another challenging problem for image clustering is the semantic gap [55], [56], [57], [58] between the low-level visual features and the image semantics, e.g., it is very hard to achieve semantic clustering of large-scale weakly-tagged social images by using only the low-level visual features. It is worth noting that the visual properties of weakly-tagged social images and their social tags can offer complementary strengths, thus they can be integrated to achieve more precise clustering of large-scale weakly-tagged social images. Because the visual features and the social tags belong to different spaces, it is unsatisfactory to combine them directly for social image clustering.

As shown in Fig. 1, a cross-modal approach is developed in this paper to achieve social image clustering and tag cleansing: (a) a semantic image clustering algorithm is developed for extracting the image topics of interest from large amounts of social tags; (b) spam tags are identified automatically via sentiment analysis and multiple synonymous tags are merged as one super-topic according to their inter-topic semantic similarity contexts; (c) a mixture-of-kernels algorithm is developed to achieve more accurate characterization of the cross-modal similarity contexts among the weakly-tagged social images; (d) a K-way min–max cut algorithm is extended for supporting cross-modal social image clustering and tag cleansing, where the polysemous tags and their ambiguous images are split into multiple sub-topics for reducing their intra-topic visual diversity; (e) a topic network is constructed to achieve more effective organization and summarization of large-scale weakly-tagged social images at the semantic level.

The rest of this paper is organized as follows. In Section 2, a brief review of some relevant work is presented; in Section 3, a semantic image clustering algorithm is introduced to assign large-scale social images into a large number of image topics of interest; in Section 4, a mixture-of-kernels algorithm is developed for achieving more precise characterization of the diverse cross-modal image similarity contexts among the social images; in Section 5, a K-way min–max cut algorithm is presented for achieving cross-modal social image clustering; in Section 6, a topic network is constructed to enable summarization and organization of large-scale weakly-tagged social images at the semantic level; our experimental results on algorithm evaluation are given in Section 7 and we conclude this paper in Section 8.

Section snippets

Related work

Clustering, which is one of the fundamental problems in machine learning and data mining, has received a significant amount of attention in the last three decades [19]. Spectral clustering has recently become very popular because it is more effective in finding representative clusters [13], [14], [15], [16], [17], [18], and one popular objective function (which is used in most spectral clustering approaches) is to minimize the normalized cuts [13] (i.e., minimizing the inter-connections among

Semantic image clustering

As shown in Fig. 2, each image in a collaborative tagging system is associated with the image holder’s tags of the image semantics and other users’ tags or comments. Because multiple social tags are given individually in a collaborative image tagging space, entity extraction can be done more effectively. In this paper, a semantic image clustering algorithm is developed for: (a) automatically extracting the social tags for image topic interpretation; and (b) assigning large-scale weakly-tagged

Cross-modal similarity characterization for social images

As shown in Fig. 5, four grid resolutions are used for image partition and feature extraction [3]. As shown in Fig. 6, three types of visual features are extracted for characterizing various visual properties of weakly-tagged social images: (a) grid-based color histograms; (b) Gabor texture features; (c) SIFT features.
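The grid partition can be illustrated with a minimal sketch. It assumes that resolution level r splits the image into 2^r × 2^r cells for r = 0, ..., 3 and that each cell gets a normalized joint RGB histogram. The function names, the pure-Python image representation (rows of (R, G, B) tuples) and the bin count are hypothetical choices for illustration, not details taken from the paper.

```python
def grid_cells(height, width, levels=4):
    """Yield (level, row0, row1, col0, col1) for every cell of every level.

    Assumption: level r partitions the image into 2**r x 2**r cells.
    """
    for r in range(levels):
        n = 2 ** r
        for i in range(n):
            for j in range(n):
                yield (r,
                       i * height // n, (i + 1) * height // n,
                       j * width // n, (j + 1) * width // n)


def color_histogram(image, r0, r1, c0, c1, bins=4):
    """Normalized joint RGB histogram of the cell image[r0:r1][c0:c1].

    The number of bins per channel is a hypothetical choice.
    """
    hist = [0.0] * (bins ** 3)
    count = 0
    for row in image[r0:r1]:
        for red, green, blue in row[c0:c1]:
            idx = ((red * bins // 256) * bins
                   + (green * bins // 256)) * bins + (blue * bins // 256)
            hist[idx] += 1.0
            count += 1
    # Leave an empty cell as an all-zero histogram.
    return [h / count for h in hist] if count else hist


def grid_color_histograms(image, levels=4, bins=4):
    """Return one color histogram per grid cell, over all levels."""
    height, width = len(image), len(image[0])
    return [color_histogram(image, r0, r1, c0, c1, bins)
            for _, r0, r1, c0, c1 in grid_cells(height, width, levels)]
```

Under these assumptions the four levels contribute 1 + 4 + 16 + 64 cells, so `grid_color_histograms` returns 85 histograms for any image that is at least 8 × 8 pixels.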

For the color features, one color histogram is extracted for each image grid, thus there are $\sum_{r=0}^{3} 2^{r} \times 2^{r} = 85$ grid-based color histograms. Each grid-based color histogram consists

Cross-modal social image clustering and tag cleansing

To achieve more effective social image clustering and automatic kernel weight determination, a K-way min–max cut algorithm is developed, where the cumulative inter-cluster cross-modal similarity contexts are minimized while the cumulative intra-cluster cross-modal similarity contexts (summation of the pairwise image similarity contexts among the social images within the same cluster) are maximized.
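The objective described above can be sketched as follows. The code uses the min–max cut criterion in its commonly stated form (sum over clusters of inter-cluster similarity divided by intra-cluster similarity), which may differ in detail from the paper's exact formulation. The convex kernel combination, the function names and the toy similarity values are assumptions for illustration only.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of base kernel matrices (hypothetical mixture form).

    Assumption: weights are non-negative and sum to one.
    """
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels))
             for j in range(n)] for i in range(n)]


def min_max_cut_objective(similarity, labels):
    """Sum over clusters of (inter-cluster similarity / intra-cluster similarity).

    Lower is better: weak links between clusters, strong links within them.
    """
    n = len(labels)
    total = 0.0
    for k in sorted(set(labels)):
        inside = [i for i in range(n) if labels[i] == k]
        outside = [i for i in range(n) if labels[i] != k]
        intra = sum(similarity[i][j] for i in inside for j in inside)
        inter = sum(similarity[i][j] for i in inside for j in outside)
        if intra == 0.0:
            return float("inf")  # degenerate cluster with no internal similarity
        total += inter / intra
    return total


# Toy similarity matrix: images 0 and 1 are alike, images 2 and 3 are alike.
S = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.9],
     [0.1, 0.1, 0.9, 1.0]]

good = min_max_cut_objective(S, [0, 0, 1, 1])  # groups the similar pairs
bad = min_max_cut_objective(S, [0, 1, 0, 1])   # splits the similar pairs
```

On this toy matrix the partition that keeps similar images together scores lower than the one that separates them, which is the behaviour an iterative clustering procedure would exploit when reassigning images.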

Our K-way min–max cut algorithm takes the following steps iteratively for social image clustering

Topic Network Generation for Large-Scale Image Summarization and Navigation

To support interactive visualization and exploration of large-scale weakly-tagged social images, it is very attractive to enable graph-based representation of a large number of image topics of interest and their inter-topic similarity contexts. As illustrated in Fig. 14, a new algorithm is developed for determining the inter-topic similarity contexts. The inter-topic similarity context $\gamma(C_{i}, C_{j})$ between two image topics $C_{i}$ and $C_{j}$ can be determined by: $\gamma(C_{i}, C_{j}) = \max_{\theta, \vartheta} \frac{\theta^{T} \kappa(S_{i}) \kappa(S_{j}) \vartheta}{\sqrt{\theta^{T} \kappa^{2}(S_{i}) \theta \cdot \vartheta^{T} \kappa^{2}(S_{j}) \vartheta}}$
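The ratio above does not change when θ or ϑ is rescaled, so it can be restated as a constrained problem, the form in which kernel canonical correlation analysis is usually written. The restatement below is a standard reformulation rather than something taken from the paper. It assumes that the kernel matrices κ(S_i) and κ(S_j) are symmetric and of the same size, and that the second factor in the denominator is the full quadratic form in ϑ.

```latex
\gamma(C_{i}, C_{j})
  = \max_{\theta, \vartheta} \; \theta^{T} \kappa(S_{i}) \kappa(S_{j}) \vartheta
  \quad \text{subject to} \quad
  \theta^{T} \kappa^{2}(S_{i}) \theta = 1, \;\;
  \vartheta^{T} \kappa^{2}(S_{j}) \vartheta = 1 .

% With u = \kappa(S_{i}) \theta and v = \kappa(S_{j}) \vartheta, the ratio is a cosine:
\frac{\theta^{T} \kappa(S_{i}) \kappa(S_{j}) \vartheta}
     {\sqrt{\theta^{T} \kappa^{2}(S_{i}) \theta \cdot \vartheta^{T} \kappa^{2}(S_{j}) \vartheta}}
  = \frac{u^{T} v}{\lVert u \rVert \, \lVert v \rVert}
  \;\le\; 1 .
```

Read this way, the inter-topic similarity is the largest cosine attainable between a vector in the range of κ(S_i) and a vector in the range of κ(S_j), so by the Cauchy–Schwarz inequality it never exceeds 1.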

Algorithm Evaluation

Our experiments on algorithm evaluation are performed on 5 million Flickr images. To assess the effectiveness of our proposed algorithms, our algorithm evaluation work focuses on: (1) comparing the performance differences of our social image clustering algorithm when using a single base kernel or a mixture of kernels for image similarity characterization; (2) comparing the performance differences between various approaches for social image clustering (i.e., our K-way min–max cut algorithm, normalized

Conclusions

In this paper, a new algorithm is developed for achieving cross-modal social image clustering and tag cleansing. A semantic image clustering algorithm is developed to assign large-scale weakly-tagged social images into a large number of image topics of interest. A K-way min–max cut algorithm is developed for social image clustering by minimizing the cumulative inter-cluster cross-modal similarity contexts while maximizing the cumulative intra-cluster cross-modal similarity contexts. To tackle

Acknowledgment

The authors would like to thank the reviewers for their insightful comments and suggestions to make this paper more readable. This research is partly supported by National Science Foundation of China under Grants 61272285, 61103062 and 61075014, Doctoral Program of Higher Education of China (Grant No. 20126101110022, 20116102110027, 20116102120031) and Program for New Century Excellent Talents in University under NCET-10-0071.

References (58)

H. Bay et al., SURF: speeded up robust features, Comput. Vision Image Understand. (CVIU) (2008)
K. Barnard et al., Word sense disambiguation with pictures, Art. Intell. (2005)
Flickr....
J. Fan et al., JustClick: personalized image recommendation via exploratory search from large-scale Flickr images, IEEE Trans. CSVT (2009)
Y.G. Jiang, C.W. Ngo, J. Yang, Towards optimal bag-of-features for object categorization and semantic video retrieval,...
C. Fellbaum, WordNet: An Electronic Lexical Database (1998)
N. Cristianini et al., Latent semantic kernels, J. Intell. Inf. Syst. (2002)
S. Sonnenburg et al., Large scale multiple kernel learning, J. Mach. Learn. Res. (2006)
M. Varma, D. Ray, Learning the discriminative power-invariance trade-off, in: IEEE ICCV,...
A. Frome, Y. Singer, F. Sha, J. Malik, Learning globally-consistent local distance functions for shape-based image...
A. Bosch, A. Zisserman, X. Munoz, Representing shape with a spatial pyramid kernel, in: ACM CIVR,...
J. Zhang et al., Local features and kernels for classification of texture and object categories: a comprehensive study, Int. J. Comput. Vision (2007)
J. Fan et al., Integrating concept ontology and multi-task learning to achieve more effective classifier training for multi-level image annotation, IEEE Trans. Image Process. (2008)
J. Shi et al., Normalized cuts and image segmentation, IEEE Trans. PAMI (2000)
C. Ding, X. He, H. Zha, M. Gu, H. Simon, A min–max cut algorithm for graph partitioning and data clustering, in: ICDM,...
S. Yu, J. Shi, Multiclass spectral clustering, in: ICCV,...
I. Dhillon, Y. Guan, B. Kulis, Kernel k-mean, spectral clustering and normalized cut, in: KDD,...
D. Yuan, L. Huang, M.J. Jordan, Fast approximate spectral clustering, in: KDD,...
M. Gu, H. Zha, C. Ding, X. He, H. Simon, J. Xia, Spectral relaxation models and structure analysis for K-way graph...
J. Han et al., Data Mining: Concepts and Techniques (2006)
K. Grauman, T. Darrell, The pyramid match kernel: discriminative classification with sets of image features, in: ICCV,...
D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning...
M. Sussna, Word sense disambiguation for free-text indexing using a massive semantic network, in: ACM CIKM, pp. 67–74,...
J. Fan, H. Luo, Y. Shen, C. Yang, Integrating visual and semantic contexts for topic network generation and word sense...
Y. Jing, S. Baluja, PageRank for product image search, in: ACM WWW, 2008, pp....
C.H. Brooks, N. Montanez, Improved annotation of the blogosphere via autotagging and hierarchical clustering, in: ACM...
S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu, Optimizing web search using social annotations, in: WWW, 2007,...
G. Begelman, P. Keller, F. Smadja, Automated tag clustering: improving search and exploration in the tag space, in: ACM...
J. Gemmell, A. Shepitsen, B. Mobasher, R. Burke, Personalized navigation in folksonomies using hierarchical tag...

