An automatic image-text alignment method for large-scale web image retrieval

Multimedia Tools and Applications

Abstract

To reduce the large uncertainty about how web images relate to their auxiliary text terms, an automatic image-text alignment algorithm is developed. It supports more accurate indexing and retrieval of large-scale web images by assigning each web image to its most relevant visual text terms. First, a large number of web pages are crawled, and the informative images and their most relevant auxiliary text blocks are extracted. Second, parallel image clustering is performed to partition the informative web images into a large number of clusters. By grouping visually similar web images into the same cluster, the parallel image clustering algorithm significantly reduces the uncertainty about the relatedness between the web images and their auxiliary text terms, which provides a good starting point for automatic image-text alignment. Finally, a relevance re-ranking algorithm is developed to identify the most relevant text terms for characterizing the semantics of the visually similar web images in each cluster, i.e., to assign the web images to their most relevant visual text terms. Our experiments on large-scale web images have obtained very positive results.
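The cluster-then-re-rank idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not specify the clustering method or the relevance score, so the greedy cosine-threshold clustering, the frequency-based term score, the function names (`cluster_images`, `rank_terms`), and the toy data below are all assumptions chosen only to show the flow of data from image features and text terms to ranked terms per cluster.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_images(features, threshold=0.9):
    """Hypothetical stand-in for the clustering step (sequential, not parallel).

    Each image joins the first cluster whose seed image is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of image indices; index 0 is the seed
    for i, feat in enumerate(features):
        for members in clusters:
            if cosine(features[members[0]], feat) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def rank_terms(clusters, terms_per_image):
    """Hypothetical stand-in for relevance re-ranking.

    A term scores high in a cluster when many images of that cluster carry it
    and few other clusters do (a TF-IDF-like heuristic, assumed for illustration).
    """
    cluster_terms = [
        Counter(t for i in members for t in set(terms_per_image[i]))
        for members in clusters
    ]
    n_clusters = len(clusters)
    clusters_with_term = Counter(t for counts in cluster_terms for t in counts)
    ranked = []
    for members, counts in zip(clusters, cluster_terms):
        scores = {
            t: (c / len(members)) * math.log(1 + n_clusters / clusters_with_term[t])
            for t, c in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t)))
    return ranked


# Toy data: 2-D "visual features" and the text terms found near each image.
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
terms = [["tiger", "photo"], ["tiger", "zoo"], ["beach", "photo"], ["beach", "sunset"]]

clusters = cluster_images(features)
for members, ranking in zip(clusters, rank_terms(clusters, terms)):
    print(members, ranking)
```

In this toy run, the generic term "photo" appears in both clusters and therefore ranks below the cluster-specific terms, which is the kind of uncertainty reduction that grouping visually similar images is meant to provide.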

Acknowledgments

This research is partly supported by the National Science Foundation of China (Grant Nos. 61272285 and 61373077), the National High-Technology Program of China (No. 2014AA012301), the National Key Technology Support Program of China (No. 2014BAH24F02), the Program for Changjiang Scholars and Innovative Research Team in University (No. IRT13090), and the Program of Shaanxi Province Innovative Research Team (No. 2014KCT-17).

Author information

Corresponding author

Correspondence to Baopeng Zhang.

About this article

Cite this article

Zhang, B., Qu, Y., Peng, J. et al. An automatic image-text alignment method for large-scale web image retrieval. Multimed Tools Appl 76, 21401–21421 (2017). https://doi.org/10.1007/s11042-016-4059-x

  • DOI: https://doi.org/10.1007/s11042-016-4059-x
