Abstract
We propose a framework for extracting topic maps from a set of Web pages. We apply a clustering method to the Web pages and extract topic map prototypes. We make two additions to the existing clustering method. First, we merge only linked Web pages, which extracts the underlying relationships between topics. Second, we introduce a weighting based on the content similarity of the Web pages and on the relevance between the topics of the pages. The relevance is determined by the types of links with respect to the directory structure of the Web site and by the distance between the directories in which the pages are located. We generate the topic map prototypes from the clustering results. Finally, users complete a prototype by labeling the topics and associations and removing unnecessary items. In this paper, as a first step, we implemented the proposed clustering method and used it to extract a prototype.
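The two additions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give the actual formulas, so several choices here are assumptions made only for illustration: content similarity is taken as cosine similarity over raw term counts, relevance as 1 / (1 + directory distance), the two are combined by multiplication, and merging stops at a hypothetical `threshold`. Link types are not modeled, and all pages are assumed to be on one site. The function names (`directory_distance`, `link_weight`, `cluster_linked_pages`) are likewise hypothetical.

```python
import math
from collections import Counter
from posixpath import dirname
from urllib.parse import urlparse


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def directory_distance(url_a: str, url_b: str) -> int:
    """Steps up and down the directory tree between the two pages' directories."""
    a = [p for p in dirname(urlparse(url_a).path).split("/") if p]
    b = [p for p in dirname(urlparse(url_b).path).split("/") if p]
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)


def link_weight(url_a: str, url_b: str, texts: dict) -> float:
    """Assumed weighting: content similarity scaled by directory-based relevance."""
    sim = cosine(Counter(texts[url_a].lower().split()),
                 Counter(texts[url_b].lower().split()))
    relevance = 1.0 / (1.0 + directory_distance(url_a, url_b))
    return sim * relevance


def cluster_linked_pages(texts: dict, links: list, threshold: float = 0.1):
    """Merge clusters only along hyperlinks, strongest weighted link first."""
    clusters = {url: {url} for url in texts}   # cluster id -> member URLs
    owner = {url: url for url in texts}        # URL -> id of its cluster
    weighted = sorted(((link_weight(a, b, texts), a, b) for a, b in links),
                      reverse=True)
    merges = []                                # merged links: candidate associations
    for weight, a, b in weighted:
        if weight < threshold:
            break                              # remaining links are weaker still
        ca, cb = owner[a], owner[b]
        if ca == cb:
            continue                           # already in the same cluster
        clusters[ca] |= clusters.pop(cb)
        for url in clusters[ca]:
            owner[url] = ca
        merges.append((a, b, weight))
    return list(clusters.values()), merges
```

In this sketch each resulting cluster would stand for a candidate topic and each merged link for a candidate association, which a user would then label or discard as the abstract describes. Unlinked pages are never merged, no matter how similar their text is.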
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mase, M., Yamada, S., Nitta, K. (2009). Extracting Topic Maps from Web Pages. In: Chawla, S., et al. New Frontiers in Applied Data Mining. PAKDD 2008. Lecture Notes in Computer Science, vol 5433. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00399-8_15
DOI: https://doi.org/10.1007/978-3-642-00399-8_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00398-1
Online ISBN: 978-3-642-00399-8
eBook Packages: Computer Science, Computer Science (R0)