A node resistance-based probability model for resolving duplicate named entities

Scientometrics

Abstract

Duplicate entities seriously degrade data quality. Despite remarkable recent progress, existing methods still produce a large number of false positives (i.e., entities judged to be duplicates when they are not), which impairs accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities linked to each other via relationships, and then compute a probability value between two entities to measure how similar they are. In particular, each node in the graph has its own resistance value, equal to 1 − confidence (normalized to 0–1), and an amount equal to resistance \(\cdot\) probability is filtered out at each node while the probability value is computed. To evaluate the proposed model, we performed extensive experiments on several data sets, including ACM (https://dl.acm.org), DBLP (https://dblp.uni-trier.de), and IMDB (https://imdb.com). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.
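
The per-node filtering step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulation, so the following are assumptions made only for illustration: the function names `filter_at_node` and `path_probability`, the uniform split of probability over a node's neighbours, and the choice to filter at intermediate nodes only. It should not be read as the authors' implementation.

```python
def filter_at_node(p, confidence):
    """Remove resistance * p from the probability arriving at a node.

    resistance = 1 - confidence, with confidence assumed to lie in [0, 1].
    """
    resistance = 1.0 - confidence
    return p - resistance * p


def path_probability(path, graph, confidence):
    """Probability mass that survives a walk along `path`.

    Assumptions (not stated in the abstract): each step splits the mass
    uniformly over the current node's neighbours, and filtering is
    applied at intermediate nodes only.
    """
    p = 1.0
    for i in range(len(path) - 1):
        u, v = path[i], path[i + 1]
        if v not in graph[u]:
            return 0.0  # no such edge, so no mass can travel this way
        p *= 1.0 / len(graph[u])  # uniform transition (assumption)
        if i + 1 < len(path) - 1:  # v is an intermediate node
            p = filter_at_node(p, confidence[v])
    return p


# Hypothetical toy graph: two author entities linked through one paper.
graph = {"a1": ["p1"], "p1": ["a1", "a2"], "a2": ["p1"]}
confidence = {"a1": 1.0, "p1": 0.8, "a2": 1.0}
score = path_probability(["a1", "p1", "a2"], graph, confidence)
```

In this toy graph the paper node has confidence 0.8, so it filters out 20% of the mass passing through it, and `score` comes out at roughly 0.4. With confidence 1.0 at every node the same path would yield 0.5, which shows how low-confidence nodes weaken the link between two candidate duplicates.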

Acknowledgements

This research was partially supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. NRF-2016R1A2 B1014843). This work was also supported by a National Research Foundation of Korea grant funded by the Korean government (NRF-2019R1F1 A1060752). Byung-Won On is the corresponding author of this paper.

Author information

Corresponding author

Correspondence to Byung-Won On.

About this article

Cite this article

Kang, N., Kim, JJ., On, BW. et al. A node resistance-based probability model for resolving duplicate named entities. Scientometrics 124, 1721–1743 (2020). https://doi.org/10.1007/s11192-020-03585-4

  • DOI: https://doi.org/10.1007/s11192-020-03585-4
