Journal of Visual Communication and Image Representation
Cross-modal social image clustering and tag cleansing
Introduction
Collaborative image tagging systems, such as Flickr.com, have become very popular for tagging large-scale social images through the collaborative efforts of a large population of Internet users [1], [2]. In a collaborative image tagging system, people may tag social images according to their social or cultural backgrounds, personal expertise and perception. We refer to such collaboratively tagged images as weakly-tagged social images because their social tags may not correspond exactly to the underlying image semantics. With the exponential growth of weakly-tagged social images, it has become increasingly attractive to develop new algorithms for more effective organization and summarization of large-scale social images.
Image clustering, which assigns large numbers of images to clusters that share common semantics or visual properties [2], [33], [34], [35], [41], [42], [43], [44], [45], is very attractive for more effective organization, summarization and visualization of large-scale image collections. Clustering search results into multiple semantic groups helps users assess the relevance between the large number of images returned for the same query terms and their query intentions [33], [34], [35], [44]. Image clustering has also been used to achieve more precise organization and summarization of large-scale Flickr images according to their visual similarities [2]. Most existing image clustering algorithms use only low-level visual features [44], so their effectiveness is doubtful because of the semantic gap [55], [56], [57], [58]. Some researchers have recently integrated visual and textual features for Web image clustering and reported higher accuracy rates [33], [34], [35], [41], [42], [43].
In a collaborative image tagging space, two inter-related information sources can be used to support image clustering: (1) the visual properties of weakly-tagged social images; and (2) their social tags. Because the visual properties of weakly-tagged social images and their social tags offer complementary strengths, it is very attractive to develop new frameworks that integrate the visual features with the social tags to achieve more precise clustering of large-scale weakly-tagged social images. Unfortunately, this is not a trivial task because of the following issues:
(a) Social Tag Ambiguity: Different people may use different social tags (i.e., text terms) that have the same or a close meaning (synonyms) to tag their social images [5], [26], [27], [28], [29], [30], [31]. The weakly-tagged social images that belong to a set of synonymous tags may share common visual properties and semantics, so the presence of synonymous tags may prevent most existing clustering algorithms from deriving more representative image clusters. On the other hand, collaborative image tagging is an ambiguous process [22], [23], [24], [25]. Different people may apply the same social tag in different ways (i.e., a polysemous tag may have different meanings in different contexts), which may result in large numbers of ambiguous images with diverse visual properties. Because the effectiveness of most existing clustering algorithms depends largely on the accuracy of the underlying functions for data similarity characterization, polysemous tags and their ambiguous images bring new challenges for social image clustering, e.g., it is hard to design similarity functions that characterize the diverse image similarity contexts accurately. In addition, dishonest users in a collaborative image tagging space may use spam tags to tag their social images, so that they can drive traffic to those images for fun or profit [37], [38], [39]. Spam tags may result in large numbers of junk images, which may mislead most existing clustering algorithms into deriving less representative clusters from large-scale weakly-tagged social images.
(b) Visual Ambiguity and Semantic Gap: Multiple types of visual features are usually extracted to characterize the various visual properties of the images more sufficiently, so the distributions of the images can be very sparse and the visual similarity contexts among the images can be very diverse in the high-dimensional feature space (i.e., visual ambiguity). As a result, it is very hard to use one single type of base kernel, such as the RBF kernel [7], [8], [9], [10], [11], [12], to characterize the diverse visual similarity contexts among the images precisely. In addition, there may be a large number of outliers in the high-dimensional feature space, and most existing clustering algorithms may suffer seriously from the problem of skewed cuts. Another challenging problem for image clustering is the semantic gap [55], [56], [57], [58] between the low-level visual features and the image semantics, e.g., it is very hard to achieve semantic clustering of large-scale weakly-tagged social images by using only the low-level visual features. Because the visual properties of weakly-tagged social images and their social tags offer complementary strengths, they can be integrated to achieve more precise clustering of large-scale weakly-tagged social images. However, because the visual features and the social tags belong to different spaces, it is unsatisfactory to combine them directly for social image clustering.
As shown in Fig. 1, a cross-modal approach is developed in this paper to achieve social image clustering and tag cleansing: (a) a semantic image clustering algorithm is developed for extracting the image topics of interest from large amounts of social tags; (b) spam tags are identified automatically via sentiment analysis, and multiple synonymous tags are merged into one super-topic according to their inter-topic semantic similarity contexts; (c) a mixture-of-kernels algorithm is developed to achieve more accurate characterization of the cross-modal similarity contexts among the weakly-tagged social images; (d) a K-way min–max cut algorithm is extended to support cross-modal social image clustering and tag cleansing, where the polysemous tags and their ambiguous images are split into multiple sub-topics to reduce their intra-topic visual diversity; (e) a topic network is constructed to achieve more effective organization and summarization of large-scale weakly-tagged social images at the semantic level.
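To make the idea in item (c) concrete, the following is a minimal sketch of a mixture of base kernels that combines visual and tag similarity into one cross-modal score. It is an illustration under stated assumptions, not the authors' formulation: the choice of base kernels (RBF for texture, histogram intersection for color, set-based cosine for tags), the function names, the dictionary layout of an image, and the fixed weights are all hypothetical. In the paper the kernel weights are determined automatically, which this sketch does not attempt.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel on two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def histogram_intersection(h1, h2):
    """Histogram intersection kernel; inputs are assumed L1-normalized."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def tag_cosine(tags_a, tags_b):
    """Cosine similarity between two tag sets (binary bag of tags)."""
    a, b = set(tags_a), set(tags_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def mixture_of_kernels(img_a, img_b, weights):
    """Convex combination of base kernels (hypothetical layout and weights)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("kernel weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("kernel weights must sum to 1")
    base = {
        "color": histogram_intersection(img_a["color"], img_b["color"]),
        "texture": rbf_kernel(img_a["texture"], img_b["texture"]),
        "tags": tag_cosine(img_a["tags"], img_b["tags"]),
    }
    return sum(weights[name] * value for name, value in base.items())


# Toy data, invented for illustration only.
beach_1 = {"color": [0.7, 0.3], "texture": [0.2, 0.1], "tags": ["beach", "sea"]}
beach_2 = {"color": [0.6, 0.4], "texture": [0.25, 0.1], "tags": ["beach", "sunset"]}
office = {"color": [0.1, 0.9], "texture": [0.9, 0.8], "tags": ["desk"]}
weights = {"color": 0.4, "texture": 0.3, "tags": 0.3}

similar = mixture_of_kernels(beach_1, beach_2, weights)
dissimilar = mixture_of_kernels(beach_1, office, weights)
print(similar > dissimilar)
```

Keeping the weights non-negative and summing to one means the combined score stays in the same range as the base kernels, so scores from different image pairs remain comparable.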
The rest of this paper is organized as follows. In Section 2, a brief review of relevant work is presented. In Section 3, a semantic image clustering algorithm is introduced to assign large-scale social images to a large number of image topics of interest. In Section 4, a mixture-of-kernels algorithm is developed for more precise characterization of the diverse cross-modal similarity contexts among the social images. In Section 5, a K-way min–max cut algorithm is presented for cross-modal social image clustering. In Section 6, a topic network is constructed to enable summarization and organization of large-scale weakly-tagged social images at the semantic level. Our experimental results on algorithm evaluation are given in Section 7, and we conclude this paper in Section 8.
Section snippets
Related work
Clustering, which is one of the fundamental problems in machine learning and data mining, has received a significant amount of attention in the last three decades [19]. Spectral clustering has recently become very popular because it is more effective in finding representative clusters [13], [14], [15], [16], [17], [18], and one popular objective function (which is used in most spectral clustering approaches) is to minimize the normalized cuts [13] (i.e., minimizing the inter-connections among
Semantic image clustering
As shown in Fig. 2, each image in a collaborative tagging system is associated with the image holder’s tags of the image semantics and other users’ tags or comments. Because multiple social tags are given individually in a collaborative image tagging space, entity extraction can be done more effectively. In this paper, a semantic image clustering algorithm is developed for: (a) automatically extracting the social tags for image topic interpretation; and (b) assigning large-scale weakly-tagged
Cross-modal similarity characterization for social images
As shown in Fig. 5, four grid resolutions are used for image partition and feature extraction [3]. As shown in Fig. 6, three types of visual features are extracted for characterizing various visual properties of weakly-tagged social images: (a) grid-based color histograms; (b) Gabor texture features; (c) SIFT features.
For the color features, one color histogram is extracted for each image grid, thus there are 85 grid-based color histograms. Each grid-based color histogram consists
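The count of 85 grids can be checked with a short sketch. The source states only that four grid resolutions are used and that they yield 85 grid-based histograms; the specific split into 1×1, 2×2, 4×4 and 8×8 partitions used below is an assumption chosen because it reproduces that total (1 + 4 + 16 + 64 = 85). The function name and the tuple layout are hypothetical.

```python
def grid_cells(width, height, levels=(1, 2, 4, 8)):
    """Enumerate pixel boxes for each n x n partition of an image.

    Returns tuples (n, left, top, right, bottom) with right/bottom exclusive.
    The levels (1, 2, 4, 8) are an assumption, not taken from the source.
    """
    cells = []
    for n in levels:
        for row in range(n):
            for col in range(n):
                left = col * width // n
                right = (col + 1) * width // n
                top = row * height // n
                bottom = (row + 1) * height // n
                cells.append((n, left, top, right, bottom))
    return cells


cells = grid_cells(256, 256)
print(len(cells))  # one color histogram would be extracted per cell
```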
Cross-modal social image clustering and tag cleansing
To achieve more effective social image clustering and automatic kernel weight determination, a K-way min–max cut algorithm is developed, where the cumulative inter-cluster cross-modal similarity contexts are minimized while the cumulative intra-cluster cross-modal similarity contexts (summation of the pairwise image similarity contexts among the social images within the same cluster) are maximized.
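The objective described above can be illustrated with a small sketch that scores a given partition. It assumes the commonly used min–max cut form, in which each cluster contributes the ratio of its inter-cluster similarity to its intra-cluster similarity and the ratios are summed over the K clusters; the paper's exact formulation, including its kernel weight update, may differ. The function name and the toy similarity matrix are hypothetical.

```python
def minmax_cut_objective(similarity, labels):
    """Sum over clusters of (inter-cluster similarity / intra-cluster similarity).

    similarity: square symmetric matrix as a list of lists.
    labels: cluster label for each item. Lower values indicate a better partition.
    """
    total = 0.0
    for cluster in sorted(set(labels)):
        inside = [i for i, label in enumerate(labels) if label == cluster]
        outside = [i for i, label in enumerate(labels) if label != cluster]
        intra = sum(similarity[i][j] for i in inside for j in inside)
        inter = sum(similarity[i][j] for i in inside for j in outside)
        if intra == 0:
            return float("inf")  # degenerate cluster with no internal similarity
        total += inter / intra
    return total


# Toy cross-modal similarity matrix with two tight pairs (invented values).
S = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]

good = minmax_cut_objective(S, [0, 0, 1, 1])  # keeps the tight pairs together
bad = minmax_cut_objective(S, [0, 1, 0, 1])   # splits both pairs
print(good < bad)
```

Because each ratio is normalized by the cluster's own internal similarity, a partition that isolates a few outliers is penalized, which is one reason this family of objectives is less prone to skewed cuts than a plain minimum cut.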
Our K-way min–max cut algorithm takes the following steps iteratively for social image clustering
Topic Network Generation for Large-Scale Image Summarization and Navigation
To support interactive visualization and exploration of large-scale weakly-tagged social images, it is very attractive to enable graph-based representation of a large number of image topics of interest and their inter-topic similarity contexts. As illustrated in Fig. 14, a new algorithm is developed for determining the inter-topic similarity contexts. The inter-topic similarity context between two image topics and can be determined by:
Algorithm Evaluation
Our experiments on algorithm evaluation are performed on 5 million Flickr images. To assess the effectiveness of our proposed algorithms, our algorithm evaluation work focuses on: (1) comparing the performance differences of our social image clustering algorithm by using single base kernel or mixture-of-kernels for image similarity characterization; (2) comparing the performance differences between various approaches for social image clustering (i.e., our K-way min–max cut algorithm, normalized
Conclusions
In this paper, a new algorithm is developed for achieving cross-modal social image clustering and tag cleansing. A semantic image clustering algorithm is developed to assign large-scale weakly-tagged social images into a large number of image topics of interest. A K-way min–max cut algorithm is developed for social image clustering by minimizing the cumulative inter-cluster cross-modal similarity contexts while maximizing the cumulative intra-cluster cross-modal similarity contexts. To tackle
Acknowledgment
The authors would like to thank the reviewers for their insightful comments and suggestions to make this paper more readable. This research is partly supported by National Science Foundation of China under Grants 61272285, 61103062 and 61075014, Doctoral Program of Higher Education of China (Grant No. 20126101110022, 20116102110027, 20116102120031) and Program for New Century Excellent Talents in University under NCET-10-0071.
References (58)
- et al., SURF: speeded up robust features, Comput. Vision Image Understand. (CVIU) (2008)
- et al., Word sense disambiguation with pictures, Art. Intell. (2005)
- Flickr....
- et al., JustClick: personalized image recommendation via exploratory search from large-scale Flickr images, IEEE Trans. CSVT (2009)
- Y.G. Jiang, C.W. Ngo, J. Yang, Towards optimal bag-of-features for object categorization and semantic video retrieval,...
- WordNet: An Electronic Lexical Database (1998)
- et al., Latent semantic kernels, J. Intell. Inf. Syst. (2002)
- et al., Large scale multiple kernel learning, J. Mach. Learn. Res. (2006)
- M. Varma, D. Ray, Learning the discriminative power-invariance trade-off, in: IEEE ICCV,...
- A. Frome, Y. Singer, F. Sha, J. Malik, Learning globally-consistent local distance functions for shape-based image...
- Local features and kernels for classification of texture and object categories: a comprehensive study, Int. J. Comput. Vision
- Integrating concept ontology and multi-task learning to achieve more effective classifier training for multi-level image annotation, IEEE Trans. Image Process.
- Normalized cuts and image segmentation, IEEE Trans. PAMI
- Data Mining: Concepts and Techniques
Cited by (6)
- Deep cross-modal subspace clustering with Contrastive Neighbour Embedding, 2024, Neurocomputing
- An image-text consistency driven multimodal sentiment analysis approach for social media, 2019, Information Processing and Management
  Citation excerpt: Sentiment analysis aims to automatically uncover the underlying attitude of the posts. Due to the rich sentiment cues that can be found in images, sentiment analysis of visual content can contribute more towards extracting user sentiments and understand user behavior, stock market forecasting and voting for politicians (Jiang et al., 2017; Nie, Peng, Wang, Zhao, & Su, 2017; Peng, Shen, & Fan, 2013). Taking the examples of some popular posters, as illustrated in Fig. 1, it can be seen that some posters record their time and express their expectations for the next period.
- MapReduce-based clustering for near-duplicate image identification, 2017, Multimedia Tools and Applications
- Partially tagged image clustering, 2015, Proceedings - International Conference on Image Processing, ICIP
- Tagged image clustering via topic models, 2015, Proceedings of the 2015 27th Chinese Control and Decision Conference, CCDC 2015