Discriminative and deterministic approaches towards entity resolution

Journal of Intelligent Information Systems

Abstract

Existing approaches to the entity resolution problem usually proceed in two steps. Given two lists of records, the first step selects a small set of candidate duplicate records (a candidate set) using index structures and algorithms for efficient entity resolution. The second step applies a given similarity function to quantify the similarity of the records in the candidate set. In real applications, however, selecting appropriate indexing techniques and similarity functions is a non-trivial task. In this paper, we tackle the problem of identifying indexing techniques and similarity functions using both discriminative and deterministic approaches that select the best indexing techniques and similarity measures. According to our experimental results, the proposed solution, which combines the discriminative and deterministic approaches, achieves an average accuracy of more than 90% within hundreds of seconds.
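
The two-step pipeline described in the abstract can be sketched as follows. This is a minimal illustration under assumed choices, not the authors' method: the blocking key (first letter of the last name token), the character-bigram Jaccard similarity, the 0.5 threshold, and all function names are hypothetical stand-ins for the indexing techniques and similarity functions that the paper selects among.

```python
from collections import defaultdict


def qgrams(text, q=2):
    """Return the set of character q-grams of a lower-cased, '#'-padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a), qgrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def blocking_key(record):
    """Hypothetical blocking key: first letter of the last token."""
    tokens = record.lower().split()
    return tokens[-1][0] if tokens else ""


def candidate_pairs(left, right):
    """Step 1 (indexing): pair only records that share a blocking key."""
    index = defaultdict(list)
    for r in right:
        index[blocking_key(r)].append(r)
    return [(l, r) for l in left for r in index.get(blocking_key(l), [])]


def resolve(left, right, threshold=0.5):
    """Step 2 (similarity): score candidates, keep those at or above threshold."""
    matches = []
    for l, r in candidate_pairs(left, right):
        score = jaccard(l, r)
        if score >= threshold:
            matches.append((l, r, score))
    return matches


if __name__ == "__main__":
    # Illustrative records only; they are not taken from the paper's data sets.
    left = ["Jeffrey Ullman", "Wei Wang"]
    right = ["Jeff Ullman", "W. Wang", "Alon Halevy"]
    for l, r, score in resolve(left, right):
        print(f"{l!r} ~ {r!r}: {score:.3f}")
```

With these toy lists, blocking compares 2 pairs instead of all 6, which is the efficiency gain the first step is meant to provide. The quality of the result depends on both the key and the similarity function, which is the selection problem the paper addresses.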

Notes

  1. Most similarity measures and algorithms are described in detail in Elmagarmid et al. (2007) and Koudas et al. (2006); see those papers for details.

References

  • Bekkerman, R., & McCallum, A. (2005). Disambiguating web appearances of people in a social network. In Proceedings of the 14th international world wide web conference (WWW’05). Chiba, Japan, 10–14 May 2005.

  • Benjelloun, O., Garcia-Molina, H., Su, Q., Widom, J. (2005). Swoosh: a generic approach to entity resolution. Technical Report 2005-5, InfoLab, Stanford University.

  • Bennett, C., Gacs, P., Li, M., Vitanyi, P., Zurek, W. (2002). Information distance. IEEE Transactions on Information Theory, 44(4), 1407–1423.

  • Bhattacharya, I., & Getoor, L. (2007). Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1(1), 1–36.

  • Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., Fienberg, S. (2003). Adaptive name-matching in information integration. IEEE Intelligent Systems, 18(5), 16–23.

  • Chaudhuri, S., Chen, B., Ganti, V., Kaushik, R. (2007). Example-driven design of efficient record matching queries. In Proceedings of the 33rd international conference on very large data bases (VLDB’07). Vienna, Austria, 23–27 September 2007.

  • Cochinwala, M., Kurien, V., Lalk, G., Shasha, D. (2001). Efficient data reconciliation. Information Sciences, 137(1), 1–15.

  • Cohen, W., Ravikumar, P., Fienberg, S. (2003). A comparison of string distance metrics for name-matching tasks. In Proceedings of the 8th international joint conference on artificial intelligence (IJCAI’03). Acapulco, Mexico, 9–15 August 2003.

  • Christen, P. (2012). A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(9), 1537–1555.

  • Doan, A., Lu, Y., Lee, Y., Han, J. (2003). Profile-based object matching for information integration. IEEE Intelligent Systems, 18(5), 54–59.

  • Dong, X., Halevy, A., Madhavan, J. (2005). Reference reconciliation in complex information spaces. In Proceedings of the 24th ACM SIGMOD international conference on management of data (SIGMOD’05). Baltimore, Maryland, USA, 13–16 June 2005.

  • Elmagarmid, A., Ipeirotis, P., Verykios, V. (2007). Duplicate record detection: a survey. IEEE Transactions on Knowledge and Data Engineering, 19(1), 1–16.

  • Fan, W., Jia, X., Li, J., Ma, S. (2009). Reasoning about record matching rules. In Proceedings of the 35th international conference on very large data bases (VLDB’09). Lyon, France, 24–28 August 2009.

  • Fellegi, I., & Sunter, A. (1968). A theory for record linkage. Journal of American Statistical Association, 63(324), 1321–1332.

  • Gravano, L., Ipeirotis, P., Jagadish, H., Koudas, N., Muthukrishnan, S., Pietarinen, L., Srivastava, D. (2001). Using q-grams in a DBMS for approximate string processing. IEEE Data Engineering Bulletin, 24(4), 90–101.

  • Gravano, L., Ipeirotis, P., Koudas, N., Srivastava, D. (2003). Text joins in an RDBMS for web data integration. In Proceedings of the 12th international world wide web conference (WWW’03). Budapest, Hungary, 20–24 May 2003.

  • Guo, S., Dong, X., Srivastava, D., Zajac, R. (2010). Record linkage with uniqueness constraints and erroneous values. In Proceedings of the 36th international conference on very large data bases (VLDB’10). Singapore, 13–17 September 2010.

  • Halbert, D. (2008). Record linkage. American Journal of Public Health, 36(12), 1412–1416.

  • Hammouda, K., & Kamel, M. (2004). Document similarity using a phrase indexing graph model. Knowledge and Information Systems, 6, 710–727.

  • Han, H., Zha, H., Lee Giles, C. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of ACM/IEEE joint conference on digital libraries (JCDL’05). Denver, 7–11 June 2005.

  • Hernandez, M., & Stolfo, S. (1995). The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD’95). San Jose, 22–25 May 1995.

  • Herranz, J., Nin, J., Sole, M. (2010). Optimal symbol alignment distance: a new distance for sequences of symbols. IEEE Transactions on Knowledge and Data Engineering, 23(10), 1541–1554.

  • Hong, Y., On, B., Lee, D. (2004). System support for name authority control problem in digital libraries: OpenDBLP approach. In Proceedings of the 8th European conference on digital libraries (ECDL’04). Bath, 12–17 September 2004.

  • Jaro, M. (1989). Advances in record linkage methodology as applied to matching the 1985 census of Tampa Florida. Journal of American Statistical Association, 84(406), 414–420.

  • Kalashnikov, D., Mehrotra, S., Chen, Z. (2005). Exploiting relationships for domain-independent data cleaning. In Proceedings of the SIAM data mining conference (SDM’05). Newport Beach, 21–23 April 2005.

  • Kim, H., & Lee, D. (2010). HARRA: fast iterative hashed record linkage for large-scale data collections. In Proceedings of the 13th international conference on extending database technology (EDBT’10). Lausanne, Switzerland, 22–26 March 2010.

  • Koudas, N., Sarawagi, S., Srivastava, D. (2006). Record linkage: Similarity measures and algorithms. In Proceedings of the 25th ACM SIGMOD international conference on management of data (SIGMOD’06). Chicago, 26–29 June 2006.

  • Lawrence, S., Lee Giles, C., Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67–71.

  • Lee, D., On, B., Kang, J., Park, S. (2005). Effective and scalable solutions for mixed and split citation problems in digital libraries. In Proceedings of ACM SIGMOD workshop on information quality in information systems (IQIS’05). Baltimore, 13–16 June 2005.

  • Li, M., Chen, X., Li, X., Ma, B., Vitanyi, P. (2004). The similarity metric. IEEE Transactions on Information Theory, 50(12), 3250–3264.

  • Li, P., Dong, X., Maurino, A., Srivastava, D. (2011). Linking temporal records. In Proceedings of the 37th international conference on very large data bases (VLDB’11). Seattle, 29 August–3 September 2011.

  • Lim, E., Srivastava, J., Prabhakar, S., Richardson, J. (1993). Entity identification in database integration. In Proceedings of international conference on data engineering (ICDE’93). Vienna, 19–23 April 1993.

  • Lu, W., Milios, J., Japkowicz, M., Zhang, Y. (2006). Node similarity in the citation graph. Knowledge and Information Systems, 11, 105–129.

  • Monge, A., & Elkan, C. (1996). The field matching problem: Algorithms and applications. In Proceedings of international conference on knowledge discovery and data mining (KDD’96). Portland.

  • On, B., & Choi, G. (2012). A case study of understanding the nature of redundant entities in bibliographic digital libraries. Technical Report (2012–001), Public Data Research Center, Advanced Institutes of Convergence Technology, Seoul National University, Suwon, Korea.

  • On, B., Koudas, N., Lee, D., Srivastava, D. (2007). Group linkage. In Proceedings of international conference on data engineering (ICDE’07). Istanbul, 15–20 April 2007.

  • On, B., Lee, D., Kang, J., Mitra, P. (2005). Comparative study of name disambiguation problem using a scalable blocking-based framework. In Proceedings of ACM/IEEE joint conference on digital libraries (JCDL’05). Denver, 7–11 June 2005.

  • Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I. (2003). Identity uncertainty and citation matching. Advances in neural information processing (Vol. 15). Cambridge: MIT Press.

  • Rastogi, V., Dalvi, N., Garofalakis, M. (2011). Large-scale collective entity matching. In Proceedings of the 37th international conference on very large data bases (VLDB’11). Seattle, 29 August–3 September 2011.

  • Sarawagi, S., & Bhamidipaty, A. (2002). Interactive deduplication using active learning. In Proceedings of international conference on knowledge discovery and data mining (KDD’02). Edmonton, 23–26 July 2002.

  • Shen, W., Li, X., Doan, A. (2005). Constraint-based entity matching. In Proceedings of the 25th national conference on artificial intelligence (AAAI’05). Pittsburgh, 9–13 July 2005.

  • Verykios, V., Elmagarmid, A., Houstis, E. (2000). Automating the approximate record matching process. Information Sciences, 126(1), 83–98.

  • Wang, J., Li, G., Yu, J., Feng, J. (2011). Entity matching: How similar is similar. In Proceedings of the 37th international conference on very large data bases (VLDB’11). Seattle, 29 August–3 September 2011.

  • Whang, S., & Garcia-Molina, H. (2010). Entity resolution with evolving rules. In Proceedings of the 36th international conference on very large data bases (VLDB’10). Singapore, 13–17 September 2010.

  • Xiao, C., Wang, W., Lin, X. (2008). Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. In Proceedings of the 34th international conference on very large data bases (VLDB’08). Auckland, 24–30 August 2008.

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (2013-012524) for the first author, by the Energy Efficiency & Resources program of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government Ministry of Knowledge Economy (No. 20132010101800) for the first and second authors, and by the 2012 Yeungnam University Research Grant for the corresponding author.

Author information

Corresponding author

Correspondence to Gyu Sang Choi.

About this article

Cite this article

On, BW., Lee, I., Choi, G.S. et al. Discriminative and deterministic approaches towards entity resolution. J Intell Inf Syst 43, 101–127 (2014). https://doi.org/10.1007/s10844-014-0308-5

  • DOI: https://doi.org/10.1007/s10844-014-0308-5
