Abstract
Computing set similarity has become an increasingly important tool in many real-world applications, such as near-duplicate detection, data cleaning, and record linkage, in which sets are often uncertain due to missing data, imprecision, and noise. The challenge of evaluating similarity between probabilistic sets stems mainly from the exponential blowup in the number of possible worlds induced by uncertainty. In this paper, we define the probability threshold set similarity (PTSS) between two probabilistic sets based on possible-world semantics and propose an exact solution that computes PTSS via dynamic programming. To speed up the probability threshold set query (PTSQ), we derive an efficient and effective pruning rule. Finally, we conduct extensive experiments on both real and synthetic datasets to verify the effectiveness and efficiency of our algorithms.
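To make the possible-world blowup concrete, the following is a minimal brute-force sketch, not the dynamic program proposed in the paper. The abstract does not spell out the PTSS definition, so the sketch rests on assumptions: each element exists independently with a given probability, similarity is Jaccard, two empty sets are treated as having similarity 1, and the function names (`possible_worlds`, `jaccard`, `threshold_probability`) are hypothetical. It computes the probability that the similarity of two probabilistic sets reaches a threshold `tau` by enumerating every pair of possible worlds.

```python
from itertools import product


def possible_worlds(prob_set):
    """Yield (world, probability) pairs for a probabilistic set.

    Assumption: prob_set maps element -> existence probability, and
    elements exist independently of one another.
    """
    items = list(prob_set.items())
    for mask in product([False, True], repeat=len(items)):
        world = frozenset(e for (e, _), keep in zip(items, mask) if keep)
        prob = 1.0
        for (_, p), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
        yield world, prob


def jaccard(a, b):
    """Jaccard similarity; two empty sets are treated as identical (assumption)."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def threshold_probability(s, t, tau):
    """Pr[Jaccard(S, T) >= tau], enumerating all 2^|s| * 2^|t| world pairs."""
    total = 0.0
    for world_s, p_s in possible_worlds(s):
        for world_t, p_t in possible_worlds(t):
            if jaccard(world_s, world_t) >= tau:
                total += p_s * p_t
    return total


if __name__ == "__main__":
    s = {"a": 0.9, "b": 0.5}
    t = {"a": 0.8, "c": 0.4}
    print(threshold_probability(s, t, tau=0.5))
```

The nested enumeration visits a number of world pairs that is exponential in the total number of uncertain elements, which is the cost that an exact dynamic program and a pruning rule aim to avoid.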
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wang, L., Gao, M., Zhang, R., Jin, C., Zhou, A. (2015). Computing Probability Threshold Set Similarity on Probabilistic Sets. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_30
DOI: https://doi.org/10.1007/978-3-319-21042-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21041-4
Online ISBN: 978-3-319-21042-1
eBook Packages: Computer Science, Computer Science (R0)