A node resistance-based probability model for resolving duplicate named entities

Kang, Namyong; Kim, Jeong-Jae; On, Byung-Won; Lee, Ingyu

doi:10.1007/s11192-020-03585-4

Published: 13 July 2020

Scientometrics, Volume 124, pages 1721–1743 (2020)

Namyong Kang¹, Jeong-Jae Kim², Byung-Won On³ (ORCID: orcid.org/0000-0001-6929-3188) & Ingyu Lee⁴

Abstract

Duplicate entities can seriously degrade data quality. Despite remarkable recent progress, existing methods still produce a large number of false positives (i.e., entities judged to be duplicates when they are not), which impair accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities linked to each other via relationships, and then compute a probability value between two entities to measure how similar they are. Specifically, each node in the graph has its own resistance value, equal to 1 − confidence (normalized to the range 0–1), and a portion resistance \(\cdot\) probability is filtered out at each node while the probability value is computed. To evaluate the proposed model, we performed extensive experiments on different data sets, including ACM (https://dl.acm.org), DBLP (https://dblp.uni-trier.de), and IMDB (https://imdb.com). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.
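The abstract gives only a high-level description of the model, so the following Python is a minimal sketch of the general idea rather than the authors' algorithm. It assumes that a unit of probability starts at one entity, is split evenly among unvisited neighbours, and loses the share resistance \(\cdot\) probability at every intermediate node it passes through, with resistance = 1 − confidence. The function name `connection_probability`, the even split, the `max_hops` limit, and the toy graph are all illustrative assumptions that do not come from the paper.

```python
def connection_probability(graph, confidence, source, target, max_hops=3):
    """Probability mass reaching `target` from `source` over simple paths.

    Assumption (not from the paper): mass is split evenly among unvisited
    neighbours, and each intermediate node filters out resistance * mass,
    where resistance = 1 - confidence[node].
    """
    def walk(node, mass, visited, hops):
        if node == target:
            return mass          # mass that arrived at the target
        if hops == max_hops:
            return 0.0           # path too long, mass is discarded
        if node != source:
            resistance = 1.0 - confidence[node]
            mass -= resistance * mass   # filter out resistance * probability
        neighbours = [n for n in graph[node] if n not in visited]
        if not neighbours:
            return 0.0           # dead end
        share = mass / len(neighbours)
        return sum(walk(n, share, visited | {n}, hops + 1) for n in neighbours)

    return walk(source, 1.0, frozenset({source}), 0)


# Hypothetical toy graph: entities A and B are linked through X and Y.
graph = {"A": ["X", "Y"], "X": ["A", "B"], "Y": ["A", "B"], "B": ["X", "Y"]}
confidence = {"A": 1.0, "B": 1.0, "X": 0.9, "Y": 0.5}

score = connection_probability(graph, confidence, "A", "B")
print(round(score, 2))  # 0.7
```

In this toy case the path through X keeps 0.45 of its 0.5 share and the path through Y keeps 0.25, so low-confidence nodes contribute less evidence that A and B are the same entity. With every confidence set to 1 (zero resistance), the score would be 1.0, which corresponds to a model without node resistance under these assumptions.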

Acknowledgements

This research was partially supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. NRF-2016R1A2B1014843). This work was also supported by a National Research Foundation of Korea grant funded by the Korean Government (NRF-2019R1F1A1060752). Byung-Won On is the corresponding author of this paper.

Author information

Authors and Affiliations

School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, Korea
Namyong Kang
Department of Computer Science, Kyonggi University, Suwon, Korea
Jeong-Jae Kim
Department of Software Convergence Engineering, Kunsan National University, Gunsan, Korea
Byung-Won On
Sorrell College of Business, Troy University, Troy, USA
Ingyu Lee

Corresponding author

Correspondence to Byung-Won On.

About this article

Cite this article

Kang, N., Kim, JJ., On, BW. et al. A node resistance-based probability model for resolving duplicate named entities. Scientometrics 124, 1721–1743 (2020). https://doi.org/10.1007/s11192-020-03585-4

Received: 19 April 2018
Published: 13 July 2020
Issue Date: September 2020
DOI: https://doi.org/10.1007/s11192-020-03585-4
