Abstract
The paper proposes a method for matching short forms (abbreviated titles from the citation report) with their corresponding long forms (journal titles in the digital library). The main problem is that the same title often appears in the citation report under several syntactically different abbreviated forms. We use character- and token-based similarity metrics to identify duplicate records. We further improve the identification of syntactically different data by automatically discovering ontological knowledge representations, such as thesauri, from correctly matched data.
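The two metric families named in the abstract can be illustrated with a minimal sketch. It assumes Levenshtein edit distance as the character-based metric and Jaccard overlap as the token-based one; the abstract does not say which metrics, weights, or thresholds the chapter actually uses, so the function names, the equal weighting, and the sample titles below are all illustrative assumptions.

```python
import re


def levenshtein(a, b):
    """Character-based edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def tokens(title):
    """Lower-case the title and split it on non-alphanumeric characters."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def token_similarity(a, b):
    """Jaccard overlap of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def best_match(query, candidates):
    """Return the candidate with the highest averaged similarity (assumed weighting)."""
    def score(candidate):
        return 0.5 * char_similarity(query, candidate) + 0.5 * token_similarity(query, candidate)
    return max(candidates, key=score)


if __name__ == "__main__":
    # Hypothetical abbreviated forms, not taken from the chapter's data.
    variants = ["J. Am. Stat. Assoc.", "IEEE Comput."]
    print(best_match("J AM STAT ASSOC", variants))
```

Token-based comparison ignores punctuation and word order, so "J AM STAT ASSOC" and "J. Am. Stat. Assoc." are treated as the same token set, while the character-based score still separates variants that differ inside a token (for example "Amer" versus "Am").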
Notes
- 1. Febrl, Freely Extensible Biomedical Record Linkage, http://sourceforge.net/projects/febrl
- 2.
- 3. The author is the first author.
- 4.
- 5.
- 6. Note: titles are grouped in clusters.
Copyright information
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Kovačević, A. (2010). Ontology-Based Data Mining in Digital Libraries. In: Devedžić, V., Gašević, D. (eds) Web 2.0 & Semantic Web. Annals of Information Systems, vol 6. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1219-0_7
DOI: https://doi.org/10.1007/978-1-4419-1219-0_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-1218-3
Online ISBN: 978-1-4419-1219-0
eBook Packages: Computer Science, Computer Science (R0)