Abstract
In this paper, we identify and solve a multi-join optimization problem for Arbitrary Feature-based social image Similarity JOINs (AFS-JOINs). Given two collections (i.e., R and S) of social images that carry visual, spatial and textual (i.e., tag) information, multiple joins based on arbitrary features retrieve the pairs of images that are visually or textually similar, or spatially close, from different users. To address this problem, we propose three methods to facilitate multi-join processing: 1) two baseline approaches (i.e., a naïve join approach and a maximal threshold (MT)-based approach), and 2) a Batch Similarity Join (BSJ) method. In the BSJ method, m users’ join requests are first converted and grouped into m″ clusters, which correspond to m″ join boxes, where m > m″. To speed up BSJ processing, the feature distance space is first partitioned into cubes based on four segmentation schemes, and the image pairs falling in the cubes are indexed by a cube tree index. BSJ processing is thus transformed into searching, with the aid of the index, for the image pairs that fall in the cubes affected by the m″ AFS-JOINs. An extensive experimental evaluation using real and synthetic datasets shows that our proposed BSJ technique outperforms the state-of-the-art solutions.
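The two baseline approaches named above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the distance functions, the record layout, or the exact behaviour of the MT-based approach, so everything below is an assumption made for illustration. Here a join request is a hypothetical triple of thresholds (visual, textual, spatial), an unused feature is given the threshold `inf`, a pair matches when every distance is within its threshold, visual and spatial distances are Euclidean, and textual distance is the Jaccard distance on tag sets. The MT-based function reflects one plausible reading of "maximal threshold": scan R × S once with the largest threshold per feature, then refine the candidates for each request.

```python
from itertools import product
from math import dist, inf

# Hypothetical record layout (an assumption, not from the paper):
# (image_id, visual_vector, tag_set, (x, y) location)

def distances(r, s):
    """Return the (visual, textual, spatial) distances for one image pair."""
    visual = dist(r[1], s[1])  # Euclidean distance on visual feature vectors
    union = r[2] | s[2]
    # Jaccard distance on tag sets; two empty tag sets count as identical
    textual = 1.0 - len(r[2] & s[2]) / len(union) if union else 0.0
    spatial = dist(r[3], s[3])  # Euclidean distance on locations
    return (visual, textual, spatial)

def within(d, thresholds):
    """True when every feature distance is within its threshold."""
    return all(di <= ti for di, ti in zip(d, thresholds))

def naive_multi_join(R, S, requests):
    """Naive baseline: one full R x S scan per join request."""
    results = []
    for thresholds in requests:
        matched = {(r[0], s[0]) for r, s in product(R, S)
                   if within(distances(r, s), thresholds)}
        results.append(matched)
    return results

def max_threshold_multi_join(R, S, requests):
    """Assumed MT-style baseline: scan once with per-feature maximal thresholds, then refine."""
    max_t = tuple(max(col) for col in zip(*requests))
    candidates = []
    for r, s in product(R, S):
        d = distances(r, s)
        if within(d, max_t):
            candidates.append(((r[0], s[0]), d))
    # Each request only re-checks the shared candidate list
    return [{pair for pair, d in candidates if within(d, t)} for t in requests]

R = [("r1", (0.0, 0.0), {"cat", "pet"}, (0.0, 0.0)),
     ("r2", (1.0, 1.0), {"car"}, (5.0, 5.0))]
S = [("s1", (0.1, 0.0), {"cat"}, (0.0, 1.0)),
     ("s2", (1.0, 0.9), {"car", "road"}, (5.0, 5.5))]
requests = [(0.2, inf, inf),   # visual-only request
            (inf, 0.5, 0.8)]   # textual and spatial request

print(naive_multi_join(R, S, requests))
print(max_threshold_multi_join(R, S, requests))
```

Both functions return the same answer sets; the second computes the pairwise distances once instead of once per request. When any request leaves a feature unconstrained, the maximal threshold for that feature is `inf` and the shared scan prunes nothing on it, which hints at why grouping requests into a smaller number of join boxes, as the BSJ method does, may be preferable.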
Notes
Note that the total number of features in the AFS-JOIN is three in this paper (i.e., visual, textual and spatial features); it can be easily extended to support AFS-JOINs over other feature combinations, such as textual, shape and temporal features.
Acknowledgments
This paper is partially supported by the Program of National Natural Science Foundation of China under Grant No. 61272188; the Program of Natural Science Foundation of Zhejiang Province under Grant Nos. LY13F020008 and LY13F020010; the Ministry of Education of Humanities and Social Sciences Project under Grant No. 14YJCZH235; and the National Center for International Joint Research on E-Business Information Processing (2013B01035).
Cite this article
Zhuang, Y., Jiang, N., Wu, ZA. et al. Efficient batch similarity join processing of social images based on arbitrary features. World Wide Web 19, 725–753 (2016). https://doi.org/10.1007/s11280-015-0355-z
DOI: https://doi.org/10.1007/s11280-015-0355-z