Abstract
This paper investigates learning to categorize semi-structured documents. We automatically discover low-level, short-range byte data structure patterns in a document data stream by extracting all byte sub-sequences within a sliding window, forming an augmented (or bounded-length) string spectrum feature map. A modified suffix trie data structure, called the coloured generalized suffix tree (CGST), stores and manipulates the feature map efficiently and lets us compute the stream's bounded-length sequence spectrum kernel efficiently. We compare two classifiers for categorizing the data streams: the support vector machine (SVM) and Naive Bayes (NB). Experiments show good classification performance on a variety of document byte streams, particularly with the NB classifier under certain parameter settings. The results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
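As a rough illustration of the feature map described in the abstract, the sketch below counts every contiguous byte sub-sequence up to a maximum length and takes the inner product of two such count vectors as the kernel value. It rests on several assumptions: a plain dictionary (`Counter`) stands in for the CGST, the counts are unweighted, and the names `bounded_spectrum`, `spectrum_kernel`, `normalized_kernel`, `max_len` and `min_len` are hypothetical. The paper's exact definition of the kernel (weighting, normalization, window handling) may differ.

```python
from collections import Counter
from math import sqrt


def bounded_spectrum(data: bytes, max_len: int, min_len: int = 1) -> Counter:
    """Count every contiguous byte sub-sequence of length min_len..max_len.

    Assumption: a plain Counter replaces the suffix-tree storage used in the
    paper; it is simpler but far less space-efficient.
    """
    counts = Counter()
    for start in range(len(data)):
        # Each window position contributes one sub-sequence per allowed length.
        for length in range(min_len, max_len + 1):
            if start + length > len(data):
                break
            counts[data[start:start + length]] += 1
    return counts


def spectrum_kernel(a: bytes, b: bytes, max_len: int, min_len: int = 1) -> int:
    """Inner product of the two sub-sequence count vectors."""
    fa = bounded_spectrum(a, max_len, min_len)
    fb = bounded_spectrum(b, max_len, min_len)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller feature map
    return sum(count * fb[seq] for seq, count in fa.items())


def normalized_kernel(a: bytes, b: bytes, max_len: int, min_len: int = 1) -> float:
    """Cosine-normalized kernel value (an illustrative choice, not the paper's)."""
    kab = spectrum_kernel(a, b, max_len, min_len)
    kaa = spectrum_kernel(a, a, max_len, min_len)
    kbb = spectrum_kernel(b, b, max_len, min_len)
    return kab / sqrt(kaa * kbb) if kaa and kbb else 0.0


# Bounded-length kernel over lengths 1..2 versus fixed-length kernel (length 2 only).
bounded = spectrum_kernel(b"abab", b"ab", max_len=2)
fixed = spectrum_kernel(b"abab", b"ab", max_len=2, min_len=2)
```

Setting `min_len` equal to `max_len` recovers a fixed-length spectrum in this sketch, which makes it easy to compare the two variants on the same byte streams.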
Acknowledgments
The author would like to thank Mr. Jim Bell of the DSTO for implementing the initial version of the suffix tree code.
Cite this article
de Vel, O. Learning Semi-Structured Document Categorization Using Bounded-Length Spectrum Sub-Sequence Kernels. Data Min Knowl Disc 13, 309–334 (2006). https://doi.org/10.1007/s10618-005-0037-z