Categorizing XML Documents Based on Page Styles

  • Conference paper
Content Computing (AWCC 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3309)

Abstract

The self-describing nature of XML offers both challenges and opportunities in information retrieval, document management, and data mining. To process and manage XML documents effectively on XML data servers, databases, Electronic Document Management Systems (EDMS), and search engines, we need a new technique for categorizing large XML documents automatically. In this paper, we propose a new methodology for categorizing XML documents by page style that takes into account the meanings of the elements and the nested structure of XML. Accurate categorization of XML documents by page style provides an important basis for a variety of applications that manage and process XML. Experiments with Yahoo! pages show that our methodology achieves almost 100% accuracy in categorizing XML documents by page style.
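
To make the idea of comparing documents by nested structure concrete, the following is a minimal sketch and not the method proposed in the paper, whose details are not given in this abstract. It assumes that a page's style can be approximated by the set of root-to-element tag paths, and that two pages can be compared by the Jaccard similarity of those sets. The function names (`tag_paths`, `structural_similarity`) and the sample pages are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-element tag paths in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        # Extend the path with this element's tag, then recurse into children.
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def structural_similarity(xml_a, xml_b):
    """Jaccard similarity of the two documents' tag-path sets (0.0 to 1.0)."""
    a, b = tag_paths(xml_a), tag_paths(xml_b)
    # The union is never empty: every well-formed document has a root element.
    return len(a & b) / len(a | b)


# Hypothetical sample pages: two listing-style pages and one article-style page.
listing_1 = "<page><list><item><title/><link/></item></list></page>"
listing_2 = (
    "<page><list>"
    "<item><title/><link/></item>"
    "<item><title/><link/></item>"
    "</list></page>"
)
article = "<page><article><heading/><para/><para/></article></page>"

print(structural_similarity(listing_1, listing_2))  # same nesting pattern
print(structural_similarity(listing_1, article))    # only the root is shared
```

Under this assumption, pages that repeat the same nesting pattern score high regardless of how many items they contain, while pages with a different layout share only their outermost elements. A path set ignores element meanings, which the abstract names as a second ingredient of the proposed methodology.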

This work was supported by the Ewha Womans University Research Grant of 2004.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, JW. (2004). Categorizing XML Documents Based on Page Styles. In: Chi, CH., Lam, KY. (eds) Content Computing. AWCC 2004. Lecture Notes in Computer Science, vol 3309. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30483-8_52

  • DOI: https://doi.org/10.1007/978-3-540-30483-8_52

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23898-0

  • Online ISBN: 978-3-540-30483-8

  • eBook Packages: Springer Book Archive
