Discriminative and deterministic approaches towards entity resolution

On, Byung-Won; Lee, Ingyu; Choi, Gyu Sang; Park, Ho-Sik

doi:10.1007/s10844-014-0308-5

Published: 01 March 2014

Volume 43, pages 101–127, (2014)

Journal of Intelligent Information Systems

Abstract

To address the entity resolution problem, existing approaches usually consist of two steps. Given two lists of records, the first step selects a small set of potentially duplicate records (a candidate set) using index structures and algorithms for efficient entity resolution. In the second step, a given similarity function is applied to quantify the similarity of the records in the candidate set. In real applications, however, selecting appropriate indexing techniques and similarity functions is a non-trivial task. In this paper, we tackle the problem of identifying indexing techniques and similarity functions using both discriminative and deterministic approaches that select the best indexing techniques and similarity measures. According to our experimental results, the proposed solution, which combines the discriminative and deterministic approaches, achieves an average accuracy of more than 90% within hundreds of seconds.
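The two-step process described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' method: the blocking key (first letter of the last token), the character-bigram Jaccard similarity, the threshold of 0.5, and all function names are assumptions chosen only to make the candidate-selection and similarity-scoring steps concrete.

```python
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of lowercase character q-grams of `text`."""
    s = text.lower().strip()
    if len(s) < q:
        return {s} if s else set()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b, q=2):
    """Jaccard similarity of the q-gram sets of two strings (assumed measure)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def blocking_key(record):
    """Hypothetical blocking key: first letter of the last token."""
    tokens = record.lower().split()
    return tokens[-1][0] if tokens else ""


def candidate_pairs(list_a, list_b):
    """Step 1: index list_b by blocking key and pair only records that share a key."""
    blocks = defaultdict(list)
    for rec_b in list_b:
        blocks[blocking_key(rec_b)].append(rec_b)
    for rec_a in list_a:
        for rec_b in blocks.get(blocking_key(rec_a), []):
            yield rec_a, rec_b


def resolve(list_a, list_b, threshold=0.5):
    """Step 2: score each candidate pair and keep those at or above the threshold."""
    return [
        (a, b)
        for a, b in candidate_pairs(list_a, list_b)
        if jaccard(a, b) >= threshold
    ]


if __name__ == "__main__":
    names_a = ["Byung-Won On", "Ingyu Lee"]
    names_b = ["Byung Won On", "I. Lee", "Gyu Sang Choi"]
    for pair in resolve(names_a, names_b):
        print(pair)
```

In this sketch the index keeps the number of compared pairs well below the full cross product, and the similarity function decides which candidates count as matches. Swapping in a different key or measure changes both cost and accuracy, which is the selection problem the paper addresses.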

Notes

Most similarity measures and algorithms are described in detail in Elmagarmid et al. (2007) and Koudas et al. (2006); see those papers for details.

References

Bekkerman, R., & McCallum, A. (2005). Disambiguating web appearances of people in a social network. In Proceedings of the 14th international world wide web conference (WWW’05). Chiba, Japan, 10–14 May 2005.
Benjelloun, O., Garcia-Molina, H., Su, Q., Widom, J. (2005). Swoosh: a generic approach to entity resolution. Technical Report 2005-5, InfoLab, Stanford University.
Bennett, C., Gacs, P., Li, M., Vitanyi, P., Zurek, W. (2002). Information distance. IEEE Transactions on Information Theory, 44(4), 1407–1423.
Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 1–36.
Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., Fienberg, S. (2003). Adaptive name-matching in information integration. IEEE Intelligent Systems, 18(5), 16–23.
Chaudhuri, S., Chen, B., Ganti, V., Kaushik, R. (2007). Example-driven design of efficient record matching queries. In Proceedings of the 33rd international conference on very large data bases (VLDB’07). Vienna, Austria, 23–27 September 2007.
Cochinwala, M., Kurien, V., Lalk, G., Shasha, D. (2001). Efficient data reconciliation. Information Sciences, 137(1), 1–15.
Cohen, W., Ravikumar, P., Fienberg, S. (2003). A comparison of string distance metrics for name-matching tasks. In Proceedings of the 8th international joint conference on artificial intelligence (IJCAI’03). Acapulco, Mexico, 9–15 August 2003.
Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 1537–1555.
Doan, A., Lu, Y., Lee, Y., Han, J. (2003). Profile-based object matching for information integration. IEEE Intelligent Systems, 18(5), 54–59.
Dong, X., Halevy, A., Madhavan, J. (2005). Reference reconciliation in complex information spaces. In Proceedings of the 24th ACM SIGMOD international conference on management of data (SIGMOD’05). Baltimore, Maryland, USA, 13–16 June 2005.
Elmagarmid, A., Ipeirotis, P., Verykios, V. (2007). Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1–16.
Fan, W., Jia, X., Li, J., Ma, S. (2009). Reasoning about record matching rules. In Proceedings of the 35th international conference on very large data bases (VLDB’09). Lyon, France, 24–28 August 2009.
Fellegi, I., & Sunter, A. (1968). A theory for record linkage. Journal of American Statistical Association, 63(324), 1321–1332.
Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Pietarinen, L., Srivastava, D. (2001). Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4), 90–101.
Gravano, L., Ipeirotis, P., Koudas, N., Srivastava, D. (2003). Text joins in an RDBMS for web data integration. In Proceedings of the 12th international world wide web conference (WWW’03). Budapest, Hungary, 20–24 May 2003.
Guo, S., Dong, X., Srivastava, D., Zajac, R. (2010). Record linkage with uniqueness constraints and erroneous values. In Proceedings of the 36th international conference on very large data bases (VLDB’10). Singapore, 13–17 September 2010.
Halbert, D. (2008). Record linkage. American Journal of Public Health, 36(12), 1412–1416.
Hammouda, K., & Kamel, M. (2004). Document similarity using a phrase indexing graph model. Knowledge and Information Systems, 6, 710–727.
Han, H., Zha, H., Lee Giles, C. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of ACM/IEEE joint conference on digital libraries (JCDL’05). Denver, 7–11 June 2005.
Hernandez, M., & Stolfo, S. (1995). The Merge/purge problem for large databases. In Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD’95). San Jose, 22–25 May 1995.
Herranz, J., Nin, J., Sole, M. (2010). Optimal symbol alignment distance: a new distance for sequences of symbols. IEEE Transactions on Knowledge and Data Engineering, 23(10), 1541–1554.
Hong, Y., On, B., Lee, D. (2004). System support for name authority control problem in digital libraries: OpenDBLP approach. In Proceedings of the 8th European conference on digital libraries (ECDL’04). Bath, 12–17 September 2004.
Jaro, M. (1989). Advances in record linkage methodology as applied to matching the 1985 census of Tampa Florida. Journal of American Statistical Association, 84(406), 414–420.
Kalashnikov, D., Mehrotra, S., Chen, Z. (2005). Exploiting relationships for domain-independent data cleaning. In Proceedings of the SIAM data mining conference (SDM’05). Newport Beach, 21–23 April 2005.
Kim, H., & Lee, D. (2010). HARRA: fast iterative hashed record linkage for large-scale data collections. In Proceedings of the 13th international conference on extending database technology (EDBT’10). Lausanne, Switzerland, 22–26 March 2010.
Koudas, N., Sarawagi, S., Srivastava, D. (2006). Record linkage: Similarity measures and algorithms. In Proceedings of the 25th ACM SIGMOD international conference on management of data (SIGMOD’06). Chicago, 26–29 June 2006.
Lawrence, S., Lee Giles, C., Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67–71.
Lee, D., On, B., Kang, J., Park, S. (2005). Effective and scalable solutions for mixed and split citation problems in digital libraries. In Proceedings of ACM SIGMOD workshop on information quality in information systems (IQIS’05). Baltimore, 13–16 June 2005.
Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P. (2004). The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250–3264.
Li, P., Dong, X., Maurino, A., Srivastava, D. (2011). Linking temporal records. In Proceedings of the 37th international conference on very large data bases (VLDB’11). Seattle, 29 August–3 September 2011.
Lim, E., Srivastava, J., Prabhakar, S., Richardson, J. (1993). Entity identification in database integration. In Proceedings of international conference on data engineering (ICDE’93). Vienna, 19–23 April 1993.
Lu, W., Milios, J., Japkowicz, M., Zhang, Y. (2006). Node similarity in the citation graph. Knowledge and Information Systems, 11, 105–129.
Monge, A., & Elkan, C. (1996). The field matching problem: Algorithms and applications. In Proceedings of international conference on knowledge discovery and data mining (KDD’96). Portland.
On, B., & Choi, G. (2012). A case study of understanding the nature of redundant entities in bibliographic digital libraries. Technical Report (2012–001), Public Data Research Center, Advanced Institutes of Convergence Technology, Seoul National University, Suwon, Korea.
On, B., Koudas, N., Lee, D., Srivastava, D. (2007). Group linkage. In Proceedings of international conference on data engineering (ICDE’07). Istanbul, 15–20 April 2007.
On, B., Lee, D., Kang, J., Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of ACM/IEEE joint conference on digital libraries (JCDL’05). Denver, 7–11 June 2005.
Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I. (2003). Identity uncertainty and citation matching. Advances in neural information processing (Vol. 15). Cambridge: MIT Press.
Rastogi, V., Dalvi, N., Garofalakis, M. (2011). Large-scale collective entity matching. In Proceedings of the 37th international conference on very large data bases (VLDB’11). Seattle, 29 August–3 September 2011.
Sarawagi, S., & Bhamidipaty, A. (2002). Interactive deduplication using active learning. In Proceedings of international conference on knowledge discovery and data mining (KDD’02). Edmonton, 23–26 July 2002.
Shen, W., Li, X., Doan, A. (2005). Constraint-based entity matching. In Proceedings of the 25th national conference on artificial intelligence (AAAI’05). Pittsburgh, 9–13 July 2005.
Verykios, V., Elmagarmid, A., Houstis, E. (2000). Automating the approximate record matching process. Information Sciences, 126(1), 83–98.
Wang, J., Li, G., Yu, J., Feng, J. (2011). Entity matching: How similar is similar. In Proceedings of the 37th international conference on very large data bases (VLDB’11). Seattle, 29 August–3 September 2011.
Whang, S., & Garcia-Molina, H. (2010). Entity resolution with evolving rules. In Proceedings of the 36th international conference on very large data bases (VLDB’10). Singapore, 13–17 September 2010.
Xiao, C., Wang, W., Lin, X. (2008). Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. In Proceedings of the 34th international conference on very large data bases (VLDB’08). Auckland, 24–30 August 2008.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013-012524) for the first author; by the Energy Efficiency & Resources of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government Ministry of Knowledge Economy (No. 20132010101800) for the first and second authors; and by the 2012 Yeungnam University Research Grant for the corresponding author.

Author information

Authors and Affiliations

Advanced Institutes of Convergence Technology, Seoul National University, 145 Gwanggyo-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 443-270, Korea
Byung-Won On & Ingyu Lee
Department of Information and Communication Engineering, Yeungnam University, 214-1, Dae-dong, Gyeongsan, Gyeongsangbuk, 712-749, Korea
Gyu Sang Choi
Division of Information and Computer Engineering, Ajou University, 206 World cup-ro, Yeongtong-gu, Suwon-si, Gyeonggi-do, 443-749, Korea
Ho-Sik Park

Corresponding author

Correspondence to Gyu Sang Choi.

About this article

Cite this article

On, BW., Lee, I., Choi, G.S. et al. Discriminative and deterministic approaches towards entity resolution. J Intell Inf Syst 43, 101–127 (2014). https://doi.org/10.1007/s10844-014-0308-5

Received: 15 March 2012
Revised: 20 January 2014
Accepted: 20 January 2014
Published: 01 March 2014
Issue Date: August 2014
DOI: https://doi.org/10.1007/s10844-014-0308-5
