Abstract
In recent years, a huge number of HTML documents have been published on the Internet, and together they constitute a treasury of information. HTML, however, is designed mainly for reading in browsers and is not well suited to machine processing; XML was proposed as a solution to this problem. In this paper, we present a case-based method for transforming HTML documents into XML documents. Actual Web sites contain many series of HTML pages, and the pages within a series usually have very similar structures. A case-based transformation is therefore a promising practical approach to semi-automatic transformation from HTML to XML. Through experimental evaluation, we show that the method achieves highly accurate transformation: 85% of 80 actual pages were transformed correctly.
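The abstract does not spell out the transformation algorithm, so the following is only a minimal sketch of the general idea it describes: one hand-annotated page serves as a case, and that case is reused on other pages of the same series. Everything here is an assumption made for illustration, not the authors' method: the use of tag paths as the structural key, the names `PathTextCollector`, `extract`, `build_case`, and `transform`, and the sample pages.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag; they are not pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs such as ("html/body/h1", "Title")."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def extract(html):
    """Return the (tag path, text) pairs found in an HTML string."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def build_case(case_html, labels):
    """Build a case from one annotated page.

    `labels` maps a text fragment of the case page to the XML element name
    a person assigned to it (hypothetical annotation format). The result
    maps each labelled fragment's tag path to that element name.
    """
    return {path: labels[text]
            for path, text in extract(case_html) if text in labels}


def transform(case, html, root_name="page"):
    """Apply a case to another page of the same series and emit XML."""
    root = ET.Element(root_name)
    for path, text in extract(html):
        if path in case:
            ET.SubElement(root, case[path]).text = text
    return ET.tostring(root, encoding="unicode")


# Hypothetical pages from one series: same structure, different content.
case_page = "<html><body><h1>Widget A</h1><p>Price: <b>10</b></p></body></html>"
new_page = "<html><body><h1>Widget B</h1><p>Price: <b>25</b></p></body></html>"

case = build_case(case_page, {"Widget A": "name", "10": "price"})
print(transform(case, new_page))
```

Matching on exact tag paths is deliberately strict: a page whose structure deviates from the case yields missing elements rather than wrong ones. That is one simple way to keep such a process semi-automatic, since incomplete results can be flagged for manual review.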
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Umehara, M., Iwanuma, K. (2000). A Case-Based Transformation from HTML to XML. In: Leung, K.S., Chan, LW., Meng, H. (eds) Intelligent Data Engineering and Automated Learning — IDEAL 2000. Data Mining, Financial Engineering, and Intelligent Agents. IDEAL 2000. Lecture Notes in Computer Science, vol 1983. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44491-2_60
DOI: https://doi.org/10.1007/3-540-44491-2_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41450-6
Online ISBN: 978-3-540-44491-6
eBook Packages: Springer Book Archive