Clustering-Based Schema Matching of Web Data for Constructing Digital Library

Song, Hui; Ma, Fanyuan; Wang, Chen

doi:10.1007/11424826_116

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3481)

Abstract

The abundance of information on the web has attracted much research on reusing valuable web data in other information applications, such as digital libraries. Because web information is published by various contributors in different ways, schema matching is a basic problem in integrating heterogeneous data sources. Web information integration raises new challenges in two respects: web data lack a complete schema definition, and schema matching between web data cannot be reduced to a 1-1 mapping problem. In this paper we propose an algorithm, COSM, to automate the schema matching process for web data. The matching process is transformed into a clustering problem: data elements grouped into the same cluster are regarded as mapping to one another. COSM is mainly an instance-level matching approach, combined with a partial name matcher when calculating the distance metrics between elements. The data are preprocessed before the clustering step so that the distance metrics between elements are reasonable. Experiments testing the algorithm, together with its application to the construction of a Chinese folk music digital library, demonstrate the algorithm's efficiency.
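The general idea of treating schema matching as clustering can be illustrated with a minimal sketch. This is not the COSM algorithm itself, whose details are not given in the abstract. Everything below is an assumption made for illustration: the token-overlap (Jaccard) instance distance, the string-ratio name distance, the 0.3 name weight, the 0.6 merge threshold, the single-linkage clustering, and the sample elements.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_distance(a, b):
    """Distance in [0, 1] between two element names (assumed string-ratio matcher)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def instance_distance(values_a, values_b):
    """Jaccard distance between the token sets of two elements' instance values."""
    tokens_a = {t for v in values_a for t in v.lower().split()}
    tokens_b = {t for v in values_b for t in v.lower().split()}
    if not tokens_a and not tokens_b:
        return 1.0  # no instance evidence: treat as maximally distant
    return 1.0 - len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def element_distance(e1, e2, name_weight=0.3):
    """Weighted mix of name and instance distance (the weight is an assumption)."""
    (name1, vals1), (name2, vals2) = e1, e2
    return (name_weight * name_distance(name1, name2)
            + (1.0 - name_weight) * instance_distance(vals1, vals2))


def cluster_elements(elements, threshold=0.6):
    """Single-linkage agglomerative clustering of (name, values) elements.

    The two closest clusters are merged repeatedly until the closest
    remaining pair is farther apart than the threshold.
    """
    clusters = [[i] for i in range(len(elements))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = min(element_distance(elements[i], elements[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(elements[i][0] for i in c) for c in clusters]


# Hypothetical elements extracted from two web sources: (element name, instance values).
elements = [
    ("title", ["Jasmine Flower", "Yellow River"]),
    ("song_name", ["Jasmine Flower", "Yellow River Ballad"]),
    ("region", ["Jiangsu", "Shaanxi"]),
    ("area", ["Jiangsu", "Shaanxi", "Hebei"]),
]

matches = cluster_elements(elements)
print(matches)
```

Each resulting cluster is read as a set of mutually mapped elements. Because a cluster may hold more than two elements, the sketch is not restricted to 1-1 mappings.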

Author information

Authors and Affiliations

Department of Computer Information Technology, Donghua University, 200051, Shanghai, China
Hui Song
Department of Computer Science and Engineering, Shanghai Jiao Tong University, 200030, Shanghai, China
Hui Song, Fanyuan Ma & Chen Wang

Editor information

Editors and Affiliations

Department of Mathematics and Computer Science, University of Perugia, via Vanvitelli, 1, I-06123, Perugia, Italy
Osvaldo Gervasi
Department of Computer Science, University of Calgary, 2500 University Drive N.W., T2N 1N4, Calgary, AB, Canada
Marina L. Gavrilova
William Norris Professor, Head of the Computer Science and Engineering Department, University of Minnesota, USA
Vipin Kumar
Department of Chemistry, University of Perugia, Via Elce di Sotto, 8, P.O. Box, I-06123, Perugia, Italy
Antonio Laganà
Institute of High Performance Computing, IHCP, 1 Science Park Road, 01-01 The Capricorn, Singapore Science Park II, 117528, Singapore
Heow Pueh Lee
School of Computing, Soongsil University, Seoul, Korea
Youngsong Mun
Clayton School of IT, Monash University, 3800, Clayton, Australia
David Taniar
OptimaNumerics Ltd, P.O. Box, Belfast, United Kingdom
Chih Jeng Kenneth Tan

About this paper

Cite this paper

Song, H., Ma, F., Wang, C. (2005). Clustering-Based Schema Matching of Web Data for Constructing Digital Library. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424826_116

DOI: https://doi.org/10.1007/11424826_116
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25861-2
Online ISBN: 978-3-540-32044-9
eBook Packages: Computer Science, Computer Science (R0)
