Abstract
Entity matching maps the records in a database to their corresponding entities. It is a well-known problem in databases and artificial intelligence. In digital libraries and similar repositories such as DBLP, ArnetMiner, Google Scholar, Scopus, Web of Science, AllMusic, and IMDB, some attributes evolve over time, i.e., they take different values at different instants. For example, in bibliographic databases such as DBLP and ArnetMiner, which maintain the publication details of authors, an author's affiliation and email address may change. Likewise, a taxpayer's address may change over time, and people sometimes change their surnames after marriage. When a database contains such records and grows large, the lack of a proper key makes it challenging to identify which records belong to which entity. In this paper, the automatic partitioning of records is posed as an optimization problem, and a genetic algorithm based technique is proposed to solve the resulting entity matching problem. The proposed approach automatically determines the number of partitions present in a bibliographic dataset. A comparative analysis against two existing systems, DBLP and ArnetMiner, over sixteen bibliographic datasets demonstrates the efficacy of the proposed approach.
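To make the idea of partitioning as optimization concrete, the following is a minimal illustrative sketch, not the GAEMTBD method itself, whose encoding, objective, and operators are not given in this abstract. It assumes a label-vector encoding (one partition label per record), a hypothetical pairwise similarity built from co-author overlap and affiliation equality, an assumed threshold-based fitness, and a plain elitist genetic algorithm. The toy records, the weights, and the names `similarity`, `fitness`, and `evolve` are all illustrative assumptions.

```python
import random
from itertools import combinations

# Hypothetical toy records for one ambiguous author name:
# (set of co-author names, affiliation). Records 0 and 1 describe one person
# whose affiliation changed; records 2 and 3 describe a different person.
RECORDS = [
    ({"coauthor_a", "coauthor_b"}, "Univ. X"),
    ({"coauthor_a", "coauthor_c"}, "Univ. Y"),
    ({"coauthor_d"}, "Inst. Z"),
    ({"coauthor_d", "coauthor_e"}, "Inst. Z"),
]
THRESHOLD = 0.2  # assumed: pairs more similar than this gain from sharing a partition


def similarity(r, s):
    """Weighted mix of co-author Jaccard overlap and affiliation equality (assumed weights)."""
    union = r[0] | s[0]
    jaccard = len(r[0] & s[0]) / len(union) if union else 0.0
    same_affiliation = 1.0 if r[1] == s[1] else 0.0
    return 0.7 * jaccard + 0.3 * same_affiliation


def fitness(labels, records=RECORDS):
    """Sum of (similarity - THRESHOLD) over all record pairs placed in the same partition."""
    return sum(
        similarity(records[i], records[j]) - THRESHOLD
        for i, j in combinations(range(len(records)), 2)
        if labels[i] == labels[j]
    )


def evolve(records=RECORDS, pop_size=20, generations=40, mutation_rate=0.1, seed=0):
    """Elitist GA over label vectors; the partition count is whatever the best vector uses."""
    rng = random.Random(seed)
    n = len(records)  # assumes at least two records

    def score(individual):
        return fitness(individual, records)

    # Start from "every record is its own entity" plus random labelings.
    population = [list(range(n))] + [
        [rng.randrange(n) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        best = max(population, key=score)
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            # Binary tournament selection for each parent.
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randrange(1, n)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: reassign a record to a random partition label.
            child = [rng.randrange(n) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=score)
    return best, score(best)


if __name__ == "__main__":
    labels, value = evolve()
    print("partition labels:", labels)
    print("number of partitions:", len(set(labels)), "fitness:", round(value, 4))
```

Because labels are drawn from as many values as there are records, the number of partitions is not fixed in advance; it is simply the number of distinct labels in the fittest vector. Under this assumed objective, the labeling `[0, 0, 1, 1]` scores higher than either merging all four toy records or keeping each one separate, even though records 0 and 1 carry different affiliations.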
Cite this article
Mishra, S., Saha, S. & Mondal, S. GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases. Appl Intell 47, 197–230 (2017). https://doi.org/10.1007/s10489-016-0874-z
DOI: https://doi.org/10.1007/s10489-016-0874-z