Efficient and effective similarity search over probabilistic data based on Earth Mover’s Distance

  • Regular Paper
  • The VLDB Journal

Abstract

Advances in geographical tracking, multimedia processing, information extraction, and sensor networks have created a deluge of probabilistic data. While similarity search is an important tool for manipulating probabilistic data, it poses new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics employed by existing database systems. On the other hand, several more complicated distance operators have proven their value by distinguishing distributions more effectively in specific probabilistic domains. In this paper, we discuss the similarity search problem with respect to Earth Mover’s Distance (EMD). EMD is the most successful distance metric for comparing probability distributions, but it is an expensive operator because it has cubic time complexity. We present a new database indexing approach to answer EMD-based similarity queries, including range queries and k-nearest neighbor queries, on probabilistic data. Our solution uses primal-dual theory from linear programming and employs a group of B+ trees for effective candidate pruning. We also apply our filtering technique to the processing of continuous similarity queries, with particular application to frame copy detection in real-time videos. Extensive experiments show that our proposals dramatically improve the usefulness and scalability of probabilistic data management.
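To make the pruning idea in the abstract concrete, the following is a minimal sketch of filter-and-refine for an EMD range query, not the authors' implementation. It assumes 1-D histograms of equal total mass with ground distance |i - j|, for which the exact EMD equals the sum of absolute differences of the cumulative distributions. The lower bound comes from weak duality in linear programming: for any 1-Lipschitz function f, the pair phi_i = f(i), pi_j = -f(j) is dual feasible, so sum_i f(i) * (p_i - q_i) never exceeds EMD(p, q). The function names (`emd_1d`, `dual_lower_bound`, `range_query`) and the choice of potentials are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Exact EMD for equal-mass 1-D histograms with ground distance |i - j|."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    # Running difference of the two cumulative distributions.
    cdf_diff = accumulate(a - b for a, b in zip(p, q))
    return sum(abs(d) for d in cdf_diff)


def dual_lower_bound(p, q, f):
    """Weak-duality lower bound on EMD(p, q) for a 1-Lipschitz potential f.

    The term sum_i f(i) * p_i depends only on the stored record, so in an
    index it could serve as a precomputed one-dimensional key (assumption:
    this is how a B+ tree could be used; the paper's construction may differ).
    """
    return sum(f(i) * (a - b) for i, (a, b) in enumerate(zip(p, q)))


def range_query(database, q, theta, potentials):
    """Return names of records whose EMD to q is at most theta."""
    results = []
    for name, p in database.items():
        # Filter: the best lower bound over all potentials.
        bound = max(dual_lower_bound(p, q, f) for f in potentials)
        if bound > theta:
            continue  # pruned without computing the exact EMD
        # Refine: exact EMD only for surviving candidates.
        if emd_1d(p, q) <= theta:
            results.append(name)
    return results


# Hypothetical potentials; each one is 1-Lipschitz on the bin index.
POTENTIALS = [lambda i: i, lambda i: -i, lambda i: abs(i - 1)]

database = {
    "a": [0.5, 0.5, 0.0, 0.0],
    "b": [0.25, 0.25, 0.25, 0.25],
    "c": [0.0, 0.0, 0.5, 0.5],
}
query = [0.0, 0.0, 0.5, 0.5]
matches = range_query(database, query, 1.5, POTENTIALS)
```

In this toy run, record "a" is discarded by the filter alone, since its lower bound already exceeds the threshold, while "b" and "c" pass the filter and are confirmed by the exact computation. Using several potentials mirrors the idea of maintaining a group of indexes, each contributing its own bound.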

Author information

Corresponding author

Correspondence to Jia Xu.

Additional information

Zhenjie Zhang and Anthony K. H. Tung were supported by Singapore NRF grant R-252-000-376-279. This work was also supported by the National Natural Science Foundation of China (Nos. 60933001 and 61003058), the Fundamental Research Funds for the Central Universities (No. N100704001), and the National Basic Research Program of China (973 Program) under grant 2012CB316201.

About this article

Cite this article

Xu, J., Zhang, Z., Tung, A.K.H. et al. Efficient and effective similarity search over probabilistic data based on Earth Mover’s Distance. The VLDB Journal 21, 535–559 (2012). https://doi.org/10.1007/s00778-011-0258-2

  • DOI: https://doi.org/10.1007/s00778-011-0258-2
