Learning Semi-Structured Document Categorization Using Bounded-Length Spectrum Sub-Sequence Kernels

de Vel, Olivier

doi:10.1007/s10618-005-0037-z

Published: 26 May 2006

Data Mining and Knowledge Discovery, Volume 13, pages 309–334 (2006)

Abstract

In this paper we report an investigation into learning to categorize semi-structured documents. We automatically discover low-level, short-range byte data structure patterns in a document data stream by extracting all byte sub-sequences within a sliding window to form an augmented (or bounded-length) string spectrum feature map. A modified suffix trie data structure, called the coloured generalized suffix tree (CGST), stores and manipulates the feature map efficiently and lets us efficiently compute the stream's bounded-length sequence spectrum kernel. We compare the performance of two classifier algorithms for categorizing the data streams: the SVM and the Naive Bayes (NB) classifiers. Experiments yielded good classification performance on a variety of document byte streams, particularly when using the NB classifier under certain parameter settings. Results indicate that the bounded-length kernel is superior to the standard fixed-length kernel for semi-structured documents.
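To make the idea of a bounded-length spectrum concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the feature map counts every contiguous byte sub-sequence of length 1 up to a bound, and it stores the counts in a plain Python `Counter` instead of the CGST described in the abstract. The names `bounded_spectrum`, `fixed_spectrum`, `spectrum_kernel` and the parameter `max_len` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bounded_spectrum(data: bytes, max_len: int) -> Counter:
    """Count every contiguous byte sub-sequence of length 1..max_len.

    Assumption: this stands in for the bounded-length feature map;
    a dictionary replaces the suffix-tree storage used in the paper.
    """
    counts = Counter()
    n = len(data)
    for start in range(n):
        # The window anchored at `start` spans at most max_len bytes.
        for length in range(1, min(max_len, n - start) + 1):
            counts[data[start:start + length]] += 1
    return counts


def fixed_spectrum(data: bytes, k: int) -> Counter:
    """Count only the contiguous sub-sequences of length exactly k."""
    return Counter(data[i:i + k] for i in range(len(data) - k + 1))


def spectrum_kernel(a: Counter, b: Counter) -> int:
    """Inner product of two sparse count vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    return sum(count * b[key] for key, count in a.items() if key in b)


if __name__ == "__main__":
    x = bounded_spectrum(b"abab", 2)
    y = bounded_spectrum(b"ab", 2)
    print(spectrum_kernel(x, y))
```

The bounded-length map contains the fixed-length counts for every length up to the bound, so short patterns still contribute when longer ones do not match. A suffix-tree representation such as the CGST avoids materialising each sub-sequence separately, which this dictionary-based sketch does not attempt.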

Acknowledgments

The author would like to thank Mr. Jim Bell of the DSTO for implementing the initial version of the suffix tree code.

Author information

Authors and Affiliations

Information Assurance Branch, Information Networks Division, Defence Science and Technology Organisation, P.O. Box 1500, Edinburgh, SA 5111, Australia
Olivier de Vel

Corresponding author

Correspondence to Olivier de Vel.

About this article

Cite this article

de Vel, O. Learning Semi-Structured Document Categorization Using Bounded-Length Spectrum Sub-Sequence Kernels. Data Min Knowl Disc 13, 309–334 (2006). https://doi.org/10.1007/s10618-005-0037-z

Received: 18 April 2005
Accepted: 12 December 2005
Published: 26 May 2006
Issue Date: November 2006
DOI: https://doi.org/10.1007/s10618-005-0037-z
