Computing Probability Threshold Set Similarity on Probabilistic Sets

Wang, Lei; Gao, Ming; Zhang, Rong; Jin, Cheqing; Zhou, Aoying

doi:10.1007/978-3-319-21042-1_30

Lei Wang¹⁷,
Ming Gao¹⁷,
Rong Zhang¹⁷,
Cheqing Jin¹⁷ &
…
Aoying Zhou¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9098))

Included in the following conference series:

International Conference on Web-Age Information Management

2697 Accesses

Abstract

Currently, the computation of set similarity has become an increasingly important tool in many real-world applications, such as near-duplicate detection, data cleaning and record linkage, etc., in which sets often are uncertain due to date missing, imprecise and noise, etc. The challenge of evaluating similarity between probabilistic sets mainly stems from the exponential blowup in the number of possible worlds induced by uncertainty. In this paper, we define the probability threshold set similarity (PTSS) between two probabilistic sets based on the possible world semantics and propose an exact solution to compute PTSS via the dynamic programming. To speed up the computation of the probability threshold set query (PTSQ), we derive an efficient and effective pruning rule for PTSQ. Finally, we conduct extensive experiments to verify the effectiveness and efficiency of our algorithms using both real and synthetic datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929,(2006)
Google Scholar
Bharambe, A.R., Agrawal, M., Seshan, S.: Mercury: supporting scalable multi-attribute range queries. In: SIGCOMM, pp. 353–366 (2004)
Google Scholar
Börzsönyi, S., Kossmann, D., Stocker, K.: The skyline operator. In: ICDE, pp. 421–430 (2001)
Google Scholar
Brinkhoff, T., Kriegel, H.-P., Seeger, B.: Efficient processing of spatial joins using r-trees. In: SIGMOD Conference, pp. 237–246 (1993)
Google Scholar
Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5 (2006)
Google Scholar
Cheema, M.A., Lin, X., Wang, W., Zhang, W., Pei, J.: Probabilistic reverse nearest neighbor queries on uncertain data. IEEE Trans. Knowl. Data Eng. 22(4), 550–564 (2010)
Article Google Scholar
Chum, O., Philbin, J., Isard, M., Zisserman, A.: Scalable near identical image and shot detection. In: Proc. of CIVR (2007)
Google Scholar
Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. The VLDB Journal 16(4), 523–544 (2007)
Article Google Scholar
Dalvi, N.N., Suciu, D.: Management of probabilistic data: foundations and challenges. In: Proc. of ACM PODS, pp. 1–12 (2007)
Google Scholar
Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. In: VLDB, pp. 687–698 (2007)
Google Scholar
Gao, M., Jin, C., Wang, W., Lin, X., Zhou, A.: Similarity query processing for probabilistic sets. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8–12, pp. 913–924 (2013)
Google Scholar
Hua, M., Pei, J., Zhang, W., Lin, X.: Ranking queries on uncertain data: a probabilistic threshold approach. In: SIGMOD Conference, pp. 673–686 (2008)
Google Scholar
Ilyas, I.F., Aref, W.G., Elmagarmid, A.K.: Supporting top-k join queries in relational databases. VLDB J. 13(3), 207–221 (2004)
Article Google Scholar
Jestes, J., Li, F., Yan, Z., Yi, K.: Probabilistic string similarity joins. In: SIGMOD Conference, pp. 327–338 (2010)
Google Scholar
Jin, C., Yi, K., Chen, L., Yu, J.X., Lin, X.: Sliding-window top-k queries on uncertain streams. PVLDB 1(1), 301–312 (2008)
Google Scholar
Kriegel, H.-P., Kunath, P., Pfeifle, M., Renz, M.: Probabilistic similarity join on uncertain data. In: Li Lee, M., Tan, K.-L., Wuwongse, V. (eds.) DASFAA 2006. LNCS, vol. 3882, pp. 295–309. Springer, Heidelberg (2006)
Chapter Google Scholar
Lian, X., Chen, L.: Set similarity join on probabilistic data. In: Proc. of VLDB (2010)
Google Scholar
Ljosa, V., Singh, A.K.: Top-k spatial joins of probabilistic objects. In: ICDE, pp. 566–575 (2008)
Google Scholar
Pei, J., Jiang, B., Lin, X., Yuan, Y.: Probabilistic skylines on uncertain data. In: VLDB, pp. 15–26 (2007)
Google Scholar
Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
Article MATH Google Scholar
Theobald, M., Siddharth, J., Paepcke, A.: Spotsigs: robust and efficient e detection in large web collections. In: Proc. of ACM SIGIR, pp. 563–570 (2008)
Google Scholar
Xiao, C., Wang, W., Lin, X.: Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)
Google Scholar
Xu, J., Zhang, Z., Tung, A.K.H., Yu, G.: Efficient and effective similarity search over probabilistic data based on earth mover’s distance. PVLDB 3(1), 758–769 (2010)
Google Scholar
Yi, K., Lian, X., Li, F., Chen, L.: A concise representation of range queries. In: ICDE, pp. 1179–1182 (2009)
Google Scholar

Download references

Author information

Authors and Affiliations

Institute for Data Science and Engineering and Shanghai Key Lab for Trustworthy Computing, East China Normal University, No. 3663, Zhongshan Rd. (N), Shanghai, 200062, China
Lei Wang, Ming Gao, Rong Zhang, Cheqing Jin & Aoying Zhou

Authors

Lei Wang
View author publications
You can also search for this author in PubMed Google Scholar
Ming Gao
View author publications
You can also search for this author in PubMed Google Scholar
Rong Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Cheqing Jin
View author publications
You can also search for this author in PubMed Google Scholar
Aoying Zhou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Rong Zhang .

Editor information

Editors and Affiliations

Google, CA, USA
Xin Luna Dong
Postdoc Apartments (Hong Lou) 4-1-4, Shandong University, Li Cheng, Jinan, China
Xiaohui Yu
Tsinghua University, Beijing, China
Jian Li
Northeastern University, BOSTON, USA
Yizhou Sun

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, L., Gao, M., Zhang, R., Jin, C., Zhou, A. (2015). Computing Probability Threshold Set Similarity on Probabilistic Sets. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science(), vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_30

Download citation

DOI: https://doi.org/10.1007/978-3-319-21042-1_30
Published: 06 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21041-4
Online ISBN: 978-3-319-21042-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics