Abstract
Duplicate entities can seriously degrade data quality. Despite remarkable recent progress, existing methods still produce a large number of false positives (i.e., entities judged to be duplicates when they are not), which impairs accuracy. To address this challenge, we propose a novel node resistance-based probability model in which we view a given data set as a graph of entities linked to each other via relationships, and then compute a probability value between two entities to measure how similar they are. In particular, each node in the graph has its own resistance value, equal to 1 − confidence (normalized to the range 0–1), and an amount equal to resistance \(\cdot\) probability is filtered out at each node while the probability value is computed. To evaluate the proposed model, we performed extensive experiments with different data sets, including ACM (https://dl.acm.org), DBLP (https://dblp.uni-trier.de), and IMDB (https://imdb.com). Our experimental results show that the proposed probability model outperforms the existing probability model, improving average F1 scores by up to 14% and never worsening them.
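The filtering step described above can be illustrated with a small sketch. The abstract gives only the per-node rule (resistance = 1 − confidence, and resistance \(\cdot\) probability is removed at each node), so everything else here is an assumption made for illustration: the function name `connection_probability`, the uniform split of probability among unvisited neighbors, the restriction of filtering to intermediate nodes, and the `max_hops` limit are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def connection_probability(
    graph: Dict[str, List[str]],
    confidence: Dict[str, float],
    source: str,
    target: str,
    max_hops: int = 3,
) -> float:
    """Probability mass that reaches `target` from `source` along simple paths.

    Assumptions (not from the paper): probability is split uniformly among
    unvisited neighbors, and only intermediate nodes filter probability.
    """
    total = 0.0
    # Each stack entry: (current node, probability arriving there, nodes on the path).
    stack: List[Tuple[str, float, Tuple[str, ...]]] = [(source, 1.0, (source,))]
    while stack:
        node, prob, visited = stack.pop()
        if node == target:
            total += prob
            continue
        if len(visited) - 1 >= max_hops:
            continue  # path already uses the maximum number of edges
        neighbors = [n for n in graph.get(node, []) if n not in visited]
        if not neighbors:
            continue
        if node != source:
            # Node resistance = 1 - confidence; resistance * probability is filtered out.
            resistance = 1.0 - confidence.get(node, 1.0)
            prob -= resistance * prob
        share = prob / len(neighbors)
        for n in neighbors:
            stack.append((n, share, visited + (n,)))
    return total


# Toy graph: two paths from "a" to "c", one through a fully trusted node "b"
# and one through a half-trusted node "d" (confidence values are made up).
toy_graph = {"a": ["b", "d"], "b": ["a", "c"], "d": ["a", "c"], "c": ["b", "d"]}
toy_confidence = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.5}
print(connection_probability(toy_graph, toy_confidence, "a", "c"))  # 0.75
```

In this toy case the path through `b` delivers its full share of 0.5, while the path through `d` loses half of its share to resistance and delivers 0.25. Under these assumptions, a low-confidence node weakens the evidence that two entities are connected, which is the intuition behind reducing false positives.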
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-020-03585-4/MediaObjects/11192_2020_3585_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-020-03585-4/MediaObjects/11192_2020_3585_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-020-03585-4/MediaObjects/11192_2020_3585_Fig3_HTML.png)
Acknowledgements
This research was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. NRF-2016R1A2B1014843). This work was also supported by the National Research Foundation of Korea grant funded by the Korean Government (NRF-2019R1F1A1060752). Byung-Won On is the corresponding author of this paper.
Cite this article
Kang, N., Kim, JJ., On, BW. et al. A node resistance-based probability model for resolving duplicate named entities. Scientometrics 124, 1721–1743 (2020). https://doi.org/10.1007/s11192-020-03585-4
DOI: https://doi.org/10.1007/s11192-020-03585-4