Computing Probability Threshold Set Similarity on Probabilistic Sets

  • Conference paper
  • In: Web-Age Information Management (WAIM 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9098)

Abstract

Set similarity computation has become an increasingly important tool in many real-world applications, such as near-duplicate detection, data cleaning, and record linkage, in which sets are often uncertain because of missing data, imprecision, and noise. The challenge of evaluating similarity between probabilistic sets stems mainly from the exponential blowup in the number of possible worlds induced by uncertainty. In this paper, we define the probability threshold set similarity (PTSS) between two probabilistic sets based on possible-world semantics and propose an exact solution that computes PTSS via dynamic programming. To speed up the probability threshold set query (PTSQ), we derive an efficient and effective pruning rule. Finally, we conduct extensive experiments on both real and synthetic datasets to verify the effectiveness and efficiency of our algorithms.
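To illustrate how dynamic programming can sidestep possible-world enumeration, the following is a minimal sketch and not the paper's algorithm. It assumes an element-level uncertainty model in which each element belongs to each set independently with a given probability, uses Jaccard similarity, and treats two empty sets as having similarity 1. The function name `similarity_threshold_probability` and the dictionary-based inputs are hypothetical, and the exact PTSS definition in the paper may differ.

```python
from collections import defaultdict


def similarity_threshold_probability(p_a, p_b, tau):
    """Return Pr[Jaccard(A, B) >= tau] for two probabilistic sets.

    Assumptions (not taken from the paper): p_a and p_b map each element
    to its independent membership probability, and Jaccard(empty, empty) = 1.
    """
    elements = set(p_a) | set(p_b)

    # dist[(i, u)] = probability that the elements processed so far give
    # an intersection of size i and a union of size u.
    dist = {(0, 0): 1.0}

    for e in elements:
        pa = p_a.get(e, 0.0)
        pb = p_b.get(e, 0.0)
        both = pa * pb                          # e in A and in B
        one = pa * (1 - pb) + (1 - pa) * pb     # e in exactly one set
        none = (1 - pa) * (1 - pb)              # e in neither set

        nxt = defaultdict(float)
        for (i, u), pr in dist.items():
            nxt[(i + 1, u + 1)] += pr * both
            nxt[(i, u + 1)] += pr * one
            nxt[(i, u)] += pr * none
        dist = nxt

    # Sum the probability mass of all (i, u) states that meet the threshold.
    total = 0.0
    for (i, u), pr in dist.items():
        sim = 1.0 if u == 0 else i / u
        if sim >= tau:
            total += pr
    return total


# Two elements; each set contains one element for certain and the other
# with probability 0.5.
p_a = {"x": 0.5, "y": 1.0}
p_b = {"x": 1.0, "y": 0.5}
print(similarity_threshold_probability(p_a, p_b, 0.5))
```

With n distinct elements the table holds at most O(n^2) states, so this sketch runs in O(n^3) time, whereas enumerating possible worlds takes time exponential in n.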

Author information

Corresponding author

Correspondence to Rong Zhang.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, L., Gao, M., Zhang, R., Jin, C., Zhou, A. (2015). Computing Probability Threshold Set Similarity on Probabilistic Sets. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_30

  • DOI: https://doi.org/10.1007/978-3-319-21042-1_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-21041-4

  • Online ISBN: 978-3-319-21042-1

  • eBook Packages: Computer Science, Computer Science (R0)
