Abstract
The self-describing feature of XML offers both challenges and opportunities in information retrieval, document management, and data mining. To process and manage XML documents effectively on XML data servers, databases, Electronic Document Management Systems (EDMS), and search engines, we need a new technique for automatically categorizing large XML documents. In this paper, we propose a new methodology for categorizing XML documents by page style that takes into account the meanings of the elements and the nested structures of XML. Accurate categorization of XML documents by page style provides an important basis for a variety of applications that manage and process XML. Experiments with Yahoo! pages show that our methodology achieves almost 100% accuracy in categorizing XML documents by page style.
This work was supported by the Ewha Womans University Research Grant of 2004.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, JW. (2004). Categorizing XML Documents Based on Page Styles. In: Chi, CH., Lam, KY. (eds) Content Computing. AWCC 2004. Lecture Notes in Computer Science, vol 3309. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30483-8_52
DOI: https://doi.org/10.1007/978-3-540-30483-8_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23898-0
Online ISBN: 978-3-540-30483-8
eBook Packages: Springer Book Archive