Abstract
The acquisition of explicit semantics remains a research challenge. Approaches to extracting semantics focus mostly on learning hierarchical hypernym-hyponym relations. Co-hyponym and co-meronym sibling semantics are extracted far less often, although they are no less important in ontology engineering.
In this paper we describe and evaluate the XTREEM-SG (Xhtml TREE Mining – for Sibling Groups) approach for finding sibling semantics in semi-structured Web documents. XTREEM exploits the added value of the mark-up available in Web content to group text siblings. We show that this grouping is semantically meaningful. The XTREEM-SG approach has the advantage of being domain- and language-independent; it does not rely on background knowledge, NLP software or training.
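The idea of using mark-up to group text siblings can be illustrated with a minimal sketch. It is not the XTREEM-SG algorithm itself, whose details are not given in this abstract; it assumes well-formed XHTML and simply collects the texts of leaf elements that share the same parent element. The function name `sibling_groups` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def sibling_groups(xhtml: str) -> list[list[str]]:
    """Group the texts of leaf elements that share one parent element.

    Simplifying assumption: the input is well-formed XHTML, so a plain
    XML parser can read it. Real Web pages usually need a tolerant parser.
    """
    root = ET.fromstring(xhtml)
    groups = []
    for parent in root.iter():
        texts = []
        for child in parent:
            # Only leaf elements (no nested elements) contribute a text span.
            if len(child) == 0 and child.text and child.text.strip():
                texts.append(child.text.strip())
        # A single text has no sibling, so it does not form a group.
        if len(texts) >= 2:
            groups.append(texts)
    return groups


# Hypothetical sample page: a list of accommodation types plus a paragraph.
SAMPLE = (
    "<html><body>"
    "<ul><li>Hotel</li><li>Hostel</li><li>Campsite</li></ul>"
    "<p>Contact us</p>"
    "</body></html>"
)

if __name__ == "__main__":
    print(sibling_groups(SAMPLE))
```

On the sample page the three list items form one group, while the lone paragraph is ignored because it has no text sibling under the same parent. No vocabulary, language model or training data is involved, which is the property the abstract emphasises.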
We apply the XTREEM-SG approach and evaluate it against the reference semantics of two gold standard ontologies. We investigate how variations in input, parameters and reference influence the results obtained when structuring a closed vocabulary by sibling relations. Earlier methods that evaluate sibling relations against a gold standard report an F-measure of 14.18%. Our method raises this value to 21.47%.
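For readers unfamiliar with the metric, the following sketch shows one common way to compute an F-measure for sibling groups against a reference. The exact evaluation protocol of the paper is not stated in this abstract, so the pairwise comparison used here is an assumption, and the function names and toy data are hypothetical.

```python
from itertools import combinations


def sibling_pairs(groups: list[list[str]]) -> set[tuple[str, str]]:
    """Expand sibling groups into a set of unordered term pairs."""
    pairs = set()
    for group in groups:
        # Sorting makes (a, b) and (b, a) map to the same tuple.
        for a, b in combinations(sorted(set(group)), 2):
            pairs.add((a, b))
    return pairs


def f_measure(found: list[list[str]], reference: list[list[str]]) -> float:
    """Harmonic mean of pairwise precision and recall (assumed protocol)."""
    found_pairs = sibling_pairs(found)
    ref_pairs = sibling_pairs(reference)
    if not found_pairs or not ref_pairs:
        return 0.0
    hits = len(found_pairs & ref_pairs)
    if hits == 0:
        return 0.0
    precision = hits / len(found_pairs)
    recall = hits / len(ref_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the paper.
FOUND = [["hotel", "hostel", "campsite"]]
REFERENCE = [["hotel", "hostel"], ["campsite", "caravan"]]

if __name__ == "__main__":
    print(f_measure(FOUND, REFERENCE))
```

In the toy data, one of the three extracted pairs appears among the two reference pairs, giving a precision of 1/3, a recall of 1/2 and an F-measure of about 0.4.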
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Brunzel, M., Spiliopoulou, M. (2006). Discovering Semantic Sibling Groups from Web Documents with XTREEM-SG. In: Staab, S., Svátek, V. (eds) Managing Knowledge in a World of Networks. EKAW 2006. Lecture Notes in Computer Science, vol 4248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11891451_15
DOI: https://doi.org/10.1007/11891451_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46363-4
Online ISBN: 978-3-540-46365-8
eBook Packages: Computer Science, Computer Science (R0)