ABSTRACT
In this paper we propose a dissimilarity-based entity resolution framework built on a new, efficient object representation scheme. The representation embeds the dissimilarity space of object pairs into the space of distances between objects and a set of prototypes. The prototypes are selected from the input objects as the centers of clusters identified by an efficient clustering technique. To overcome the curse of dimensionality, an accurate object similarity metric based on the rank correlation of the distances from the prototypes is used. For cases where only partially ranked data is available in the representation domain of objects, our methodology uses the generalized Hausdorff distance metric. Finally, a locality-sensitive hashing approach for partially ranked data is applied to reduce the high complexity of the similarity search for approximate duplicates.
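The prototype-based representation and the rank-correlation comparison can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the paper's implementation: the prototypes are hard-coded rather than chosen as cluster centers, the dissimilarity is a Jaccard distance on character bigrams, and Kendall's tau-a stands in for the rank-correlation similarity. The generalized Hausdorff distance for partially ranked data and the locality-sensitive hashing step are not shown.

```python
from itertools import combinations


def bigrams(s):
    """Return the set of character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard_distance(a, b):
    """Jaccard distance between bigram sets (assumed dissimilarity; 0 = identical)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


def embed(obj, prototypes, dist=jaccard_distance):
    """Map an object to its vector of distances from each prototype."""
    return [dist(obj, p) for p in prototypes]


def kendall_tau(u, v):
    """Kendall tau-a between two equal-length vectors; tied pairs contribute 0."""
    concordant = discordant = 0
    for i, j in combinations(range(len(u)), 2):
        s = (u[i] - u[j]) * (v[i] - v[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(u) * (len(u) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical prototypes; the paper selects them as cluster centers instead.
prototypes = ["john smith", "maria garcia", "wei zhang", "acme corporation"]

a = embed("jon smith", prototypes)
b = embed("john smyth", prototypes)   # likely duplicate of "jon smith"
c = embed("acme corp", prototypes)    # unrelated record

print("tau(a, b) =", kendall_tau(a, b))
print("tau(a, c) =", kendall_tau(a, c))
```

In this toy setup the two spellings of the same name order the prototypes more consistently than the unrelated record does, which is the intuition behind comparing objects through the ranking of their prototype distances rather than through raw high-dimensional distances. The many ties in these short vectors also hint at why partially ranked data needs separate handling.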
Index Terms
- Entity Resolution in Dissimilarity Spaces