DOI: 10.1145/3503823.3503899
research-article

Entity Resolution in Dissimilarity Spaces

Published: 22 February 2022

ABSTRACT

In this paper, we propose a dissimilarity-based entity resolution framework built on a new, efficient object representation scheme. The representation embeds the dissimilarity space of object pairs into the space of distances between objects and a set of prototypes. The prototypes are selected from the input objects as the centers of clusters identified by an efficient clustering technique. To overcome the curse of dimensionality, we use an accurate object similarity metric based on the rank correlation of distances from the prototypes. For cases where only partially ranked data is available in the representation domain of objects, we use the generalized Hausdorff distance metric. Finally, a locality-sensitive hashing approach for partially ranked data reduces the high complexity of the similarity search for approximate duplicates.
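
The first steps of the pipeline described above (prototype selection, embedding into the space of distances from prototypes, and rank-correlation similarity) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: objects are plain strings, the dissimilarity is Levenshtein edit distance, prototypes are chosen by farthest-first traversal as a simple stand-in for the unspecified clustering technique, and the rank correlation is Kendall's tau-a. The function names are hypothetical. The generalized Hausdorff distance and the locality-sensitive hashing step are not covered.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed dissimilarity), two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def pick_prototypes(objects, k, dist=edit_distance):
    """Farthest-first traversal: an assumed stand-in for cluster-center selection."""
    prototypes = [objects[0]]
    while len(prototypes) < min(k, len(objects)):
        # Pick the object whose nearest prototype is farthest away.
        nxt = max(objects, key=lambda o: min(dist(o, p) for p in prototypes))
        prototypes.append(nxt)
    return prototypes


def embed(obj, prototypes, dist=edit_distance):
    """Represent an object by its vector of distances from the prototypes."""
    return [dist(obj, p) for p in prototypes]


def kendall_tau(u, v):
    """Kendall tau-a: (concordant - discordant) / pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(u)), 2):
        s = (u[i] - u[j]) * (v[i] - v[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(u) * (len(u) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


if __name__ == "__main__":
    records = ["john smith", "jon smith", "maria garcia", "mary garcia", "li wei"]
    protos = pick_prototypes(records, k=3)
    vectors = {r: embed(r, protos) for r in records}
    # Records whose distance rankings over the prototypes agree score close to 1.
    for a, b in combinations(records, 2):
        print(a, "|", b, "->", round(kendall_tau(vectors[a], vectors[b]), 2))
```

With only a handful of prototypes, many distance vectors contain ties and yield the same ranking, which is one reason a real system would presumably use more prototypes and a measure designed for partially ranked data, as the abstract suggests.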


Published in

PCI '21: Proceedings of the 25th Pan-Hellenic Conference on Informatics
November 2021
499 pages

Copyright © 2021 ACM

© 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 190 of 390 submissions, 49%