GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases

Mishra, Sumit; Saha, Sriparna; Mondal, Samrat

doi:10.1007/s10489-016-0874-z

Published: 02 March 2017

Volume 47, pages 197–230 (2017)
Applied Intelligence

Abstract

Entity matching maps the records in a database to their corresponding entities. It is a well-known problem in the database and artificial intelligence communities. In digital libraries such as DBLP, ArnetMiner, Google Scholar, Scopus, Web of Science, AllMusic, and IMDB, some attributes evolve over time, i.e., their values change from one instant to another. For example, in bibliographic databases such as DBLP and ArnetMiner, which maintain the publication details of authors, an author's affiliation and email ID may change. Similarly, a taxpayer's address can change over time, and people sometimes change their surnames after marriage. When a database contains such records and the number of records grows large, identifying which records belong to which entity becomes challenging because no proper key is available. In this paper, the automatic partitioning of records is posed as an optimization problem, and a genetic algorithm based technique is proposed to solve the entity matching problem. The proposed approach automatically determines the number of partitions in a bibliographic dataset. A comparative analysis against two existing systems, DBLP and ArnetMiner, over sixteen bibliographic datasets demonstrates the efficacy of the proposed approach.
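The abstract does not specify the chromosome encoding, objective function, or genetic operators, so the following is only a minimal sketch of the general idea of treating record partitioning as an optimization problem solved by a genetic algorithm. Everything in it is an assumption made for illustration: the toy records, the `similarity` function, the `threshold` value, the label-based encoding, and the operator choices. It should not be read as the authors' method.

```python
import random
from itertools import combinations

# Toy records as (co-author set, affiliation). All values are invented.
RECORDS = [
    ({"A. Rao", "B. Sen"}, "IIT Patna"),
    ({"A. Rao", "C. Das"}, "IIT Patna"),
    ({"B. Sen", "C. Das"}, "IIT Patna"),
    ({"X. Li", "Y. Wu"}, "Univ. Z"),
    ({"X. Li", "Q. Hu"}, "Univ. Z"),
    ({"Y. Wu", "Q. Hu"}, "Univ. Z"),
]


def similarity(r1, r2):
    """Hypothetical similarity in [0, 1]: co-author Jaccard averaged with affiliation match."""
    co1, aff1 = r1
    co2, aff2 = r2
    union = co1 | co2
    jaccard = len(co1 & co2) / len(union) if union else 0.0
    return 0.5 * jaccard + 0.5 * (1.0 if aff1 == aff2 else 0.0)


def fitness(labels, records, threshold=0.5):
    """Sum of (similarity - threshold) over all record pairs sharing a partition label.

    Similar pairs reward being grouped and dissimilar pairs penalise it, so the
    number of partitions is not fixed in advance (an assumed objective).
    """
    return sum(
        similarity(records[i], records[j]) - threshold
        for i, j in combinations(range(len(records)), 2)
        if labels[i] == labels[j]
    )


def evolve(records, pop_size=30, generations=60, mutation_rate=0.2, seed=0):
    """Return a label vector: labels[i] is the partition assigned to record i."""
    rng = random.Random(seed)
    n = len(records)

    def score(individual):
        return fitness(individual, records)

    # One all-singletons individual plus random labelings.
    population = [list(range(n))] + [
        [rng.randrange(n) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [population[0][:]]  # elitism: carry the best individual over
        while len(next_gen) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            # Uniform crossover on the label vectors.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Mutation: reassign one record to a random partition label.
            if rng.random() < mutation_rate:
                child[rng.randrange(n)] = rng.randrange(n)
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)


best = evolve(RECORDS)
print("labels:", best)
print("partitions found:", len(set(best)))
print("fitness:", fitness(best, RECORDS))
```

Because the elite individual is always retained, the returned partition scores at least as well as the all-singletons starting point. Uniform crossover on raw labels ignores the fact that relabelled partitions are equivalent, which a more careful encoding would address.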

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology Patna, Patna, 801103, Bihar, India
Sumit Mishra, Sriparna Saha & Samrat Mondal

Corresponding author

Correspondence to Sumit Mishra.

About this article

Cite this article

Mishra, S., Saha, S. & Mondal, S. GAEMTBD: Genetic algorithm based entity matching techniques for bibliographic databases. Appl Intell 47, 197–230 (2017). https://doi.org/10.1007/s10489-016-0874-z

Published: 02 March 2017
Issue Date: July 2017
DOI: https://doi.org/10.1007/s10489-016-0874-z
