research-article

Entity Resolution in Dissimilarity Spaces

Authors:
Vassilios Verykios, Hellenic Open University, Greece
Dimitrios Karapiperis, International Hellenic University, Greece

PCI '21: Proceedings of the 25th Pan-Hellenic Conference on Informatics, November 2021, Pages 413–418. https://doi.org/10.1145/3503823.3503899

Published: 22 February 2022

ABSTRACT

In this paper, we propose a dissimilarity-based entity resolution framework that introduces a new, efficient object representation scheme. This representation embeds the dissimilarity space of object pairs into the space of distances between objects and a set of prototypes. The prototypes are selected from the input objects as the centers of clusters identified by an efficient clustering technique. To overcome the curse of dimensionality, we use an accurate object similarity metric that takes into account the rank correlation of distances from the prototypes. For cases where only partially ranked data is available in the representation domain of objects, our methodology proposes the generalized Hausdorff distance metric. Finally, a locality-sensitive hashing approach for partially ranked data is applied to reduce the high complexity of the similarity search for approximate duplicates.
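
The core representation idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prototypes are simply given by hand rather than chosen as cluster centers, edit distance stands in for whatever dissimilarity measure the paper uses, and the names `edit_distance`, `embed`, and `kendall_tau` are hypothetical. The sketch maps each object to its vector of distances from the prototypes and compares two objects by the Kendall rank correlation of those vectors. The generalized Hausdorff distance and the locality-sensitive hashing step are not covered.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed dissimilarity measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def embed(obj, prototypes, dist=edit_distance):
    """Represent an object by its distances from each prototype."""
    return [dist(obj, p) for p in prototypes]


def kendall_tau(u, v):
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties add zero."""
    n = len(u)
    score = 0
    for i, j in combinations(range(n), 2):
        prod = (u[i] - u[j]) * (v[i] - v[j])
        score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical prototypes; the paper selects them as cluster centers instead.
prototypes = ["smith", "johnson", "garcia", "lee"]
x = embed("smyth", prototypes)
y = embed("smithe", prototypes)
similarity = kendall_tau(x, y)  # closer to 1 means more similar rankings
```

Comparing rank orderings rather than raw distance values is what makes the similarity depend only on how the prototypes are ordered by proximity, which is the property the abstract relies on.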

Index Terms

Entity Resolution in Dissimilarity Spaces
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems

Index terms have been assigned to the content through auto-classification.

Published in
PCI '21: Proceedings of the 25th Pan-Hellenic Conference on Informatics
November 2021
499 pages
ISBN: 9781450395557
DOI: 10.1145/3503823
Editors: Michael Gr. Vassilakopoulos, Nikitas N. Karanikolas, George Stamoulis, Vassilios S. Verykios, Cleo Sgouropoulou
Copyright © 2021 ACM
© 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 190 of 390 submissions, 49%