Abstract
In this paper, an effective algorithm is developed for near-duplicate image identification in large-scale image sets. The LLC (locality-constrained linear coding) method is integrated with the maxIDF cut model to obtain more discriminative image representations. By incorporating the MapReduce framework for image clustering and pairwise merging, near-duplicate images can be identified effectively from large-scale image sets. An intuitive strategy is also introduced to guide parameter selection. Experimental results on large-scale image sets show that the algorithm achieves significant improvements in both accuracy and computational efficiency compared with baseline methods.
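The clustering and pairwise-merging idea can be illustrated with a small single-machine sketch. It is not the authors' implementation: the abstract does not define the maxIDF cut model, so the key selection below (keep the `top_k` highest-IDF visual words per image) is only an assumed stand-in, and all function names, the Jaccard test, and the `threshold` value are hypothetical. Each image is assumed to be already quantized into a list of visual-word ids; the map and reduce phases are simulated with plain Python.

```python
import math
from collections import defaultdict
from itertools import combinations


def idf_weights(images):
    """Inverse document frequency of each visual word over the image set."""
    n = len(images)
    df = defaultdict(int)
    for words in images.values():
        for w in set(words):
            df[w] += 1
    return {w: math.log(n / c) for w, c in df.items()}


def map_phase(images, idf, top_k=2):
    """Emit (visual_word, image_id) for the top_k highest-IDF words per image.

    Assumption: this key choice merely stands in for the paper's maxIDF cut.
    """
    for image_id, words in images.items():
        ranked = sorted(set(words), key=lambda w: (-idf[w], w))
        for w in ranked[:top_k]:
            yield w, image_id


def reduce_phase(pairs):
    """Group image ids by key, as a shuffle/reduce step would."""
    buckets = defaultdict(set)
    for key, image_id in pairs:
        buckets[key].add(image_id)
    return buckets


def merge_pairwise(images, buckets, threshold=0.5):
    """Union images sharing a bucket when their Jaccard similarity passes threshold."""
    parent = {i: i for i in images}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            sa, sb = set(images[a]), set(images[b])
            if len(sa & sb) / len(sa | sb) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for i in images:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


def near_duplicate_clusters(images, top_k=2, threshold=0.5):
    """Run the simulated map, reduce, and pairwise-merge steps in sequence."""
    idf = idf_weights(images)
    buckets = reduce_phase(map_phase(images, idf, top_k))
    return merge_pairwise(images, buckets, threshold)
```

Only images that share a bucket key are compared, which is what keeps the pairwise step from growing quadratically with the size of the whole collection. In a real MapReduce job, `map_phase` and `reduce_phase` would run as distributed tasks rather than as local generators.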
Acknowledgments
This research is partly supported by the National Science Foundation of China under Grant 61272285, the National High-Technology Program of China (863 Program, Grant No. 2014AA015201), the Program for Changjiang Scholars and Innovative Research Team in University (No. IRT13090), and the Program of Shaanxi Province Innovative Research Team (No. 2014KCT-17).
Cite this article
Zhao, W., Luo, H., Peng, J. et al. MapReduce-based clustering for near-duplicate image identification. Multimed Tools Appl 76, 23291–23307 (2017). https://doi.org/10.1007/s11042-016-4060-4
DOI: https://doi.org/10.1007/s11042-016-4060-4