A semi-structured document model for text mining

Journal of Computer Science and Technology

Abstract

A semi-structured document carries more structural information than an ordinary document, and the relations among semi-structured documents can also be exploited. To take advantage of the structure and link information in semi-structured documents for better mining, this paper presents a structured link vector model (SLVM), in which a vector represents a document and the vector's elements are determined by terms, document structure, and neighboring documents. For brevity and clarity, text mining based on SLVM is described through the K-means procedure, which involves two steps: calculating document similarity and calculating cluster centers. In the experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model: the F value increases from 0.65–0.73 to 0.82–0.86.
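As a rough illustration of the two K-means steps named in the abstract (calculating document similarity and calculating cluster centers), the sketch below runs them over plain term-frequency vectors, that is, the conventional vector space model used as the baseline. It is not SLVM: the abstract does not specify how document structure and neighboring documents enter the vector elements, so that weighting is left out. Cosine similarity, the seeding with the first `k` documents, and all function names (`term_vector`, `cosine_similarity`, `cluster_center`, `k_means`) are assumptions made for this example.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one document (assumed weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_center(vectors):
    """Component-wise mean of the vectors assigned to one cluster."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    return {t: w / len(vectors) for t, w in total.items()}


def k_means(vectors, k, iterations=10):
    """Plain K-means; seeds are the first k vectors (an assumption)."""
    centers = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Step 1: assign each document to its most similar center.
        labels = [
            max(range(k), key=lambda c: cosine_similarity(v, centers[c]))
            for v in vectors
        ]
        # Step 2: recompute each center from its current members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centers[c] = cluster_center(members)
    return labels


docs = [
    "xml schema markup",
    "stock market price",
    "xml markup document",
    "market price trading",
]
vectors = [term_vector(d) for d in docs]
labels = k_means(vectors, k=2)
print(labels)
```

Under SLVM, only `term_vector` would need to change, since the similarity and center computations operate on whatever vector elements the model supplies.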

Author information

Corresponding author

Correspondence to Yang Jianwu.

Additional information

This research is supported by the National Technology Innovation Project and by the Peking University Graduate Student Development Foundation, as innovative doctoral dissertation research.

YANG Jianwu is a Ph.D. candidate in the Institute of Computer Science and Technology, Peking University, China, where he received his M.S. degree in 1999. His current research interests include SGML/XML and data mining.

CHEN Xiaoou obtained his B.S. degree from the Department of Computer Science and Technology, National Defense University in 1983. He has been a research staff member at the Institute of Computer Science and Technology, Peking University since 1990, and has been a professor since 2000. He is the president of Founder Research and Development Center. His current research interests are image processing and XML data exchange and representation.

About this article

Cite this article

Yang, J., Chen, X. A semi-structured document model for text mining. J. Comput. Sci. & Technol. 17, 603–610 (2002). https://doi.org/10.1007/BF02948828

  • DOI: https://doi.org/10.1007/BF02948828
