Classifying High-Speed Text Streams

Fung, Gabriel Pui Cheong; Yu, Jeffrey Xu; Lu, Hongjun

doi:10.1007/978-3-540-45160-0_15

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2762)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Recently, a new class of data-intensive applications has become widely recognized, in which data is best modeled as transient, open-ended streams rather than as persistent tables on disk. This has led to a surge of research interest in data streams. However, most reported work concentrates on structured data, such as bit sequences, and seldom addresses unstructured data, such as text documents. In this paper, we propose an efficient approach for classifying high-speed text streams. The approach is based on sketches, so it can classify the streams efficiently by scanning them only once while consuming a small, bounded amount of memory for both model maintenance and operation. Extensive experiments using benchmarks and a real-life news article collection are conducted. The encouraging results indicate that the proposed approach is highly feasible.
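
The abstract does not describe the sketch structure itself, so the following is only an illustration of the general idea of one-pass classification in bounded memory, not the authors' method. It assumes a count-min sketch per class combined with naive-Bayes-style scoring; the class names CountMinSketch and StreamingTextClassifier, the width and depth parameters, and the use of the sketch width as a stand-in for vocabulary size are all hypothetical choices made for this example.

```python
import hashlib
import math


class CountMinSketch:
    """Fixed-size approximate counter: memory is depth * width integers."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]
        self.total = 0  # number of terms added so far

    def _buckets(self, term):
        # One deterministic hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{term}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, term):
        self.total += 1
        for row, col in self._buckets(term):
            self.rows[row][col] += 1

    def estimate(self, term):
        # Collisions can only inflate a counter, so the minimum is the tightest estimate.
        return min(self.rows[row][col] for row, col in self._buckets(term))


class StreamingTextClassifier:
    """Hypothetical one-pass classifier: one sketch per class, naive-Bayes-style scoring."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.sketches = {}    # class label -> CountMinSketch of term counts
        self.doc_counts = {}  # class label -> number of training documents seen

    def update(self, tokens, label):
        # Each labeled document is read once and then discarded.
        sketch = self.sketches.setdefault(
            label, CountMinSketch(self.width, self.depth)
        )
        self.doc_counts[label] = self.doc_counts.get(label, 0) + 1
        for token in tokens:
            sketch.add(token)

    def predict(self, tokens):
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, sketch in self.sketches.items():
            score = math.log(self.doc_counts[label] / n_docs)
            for token in tokens:
                # Add-one smoothing; self.width stands in for the unknown vocabulary size.
                score += math.log(
                    (sketch.estimate(token) + 1) / (sketch.total + self.width)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = StreamingTextClassifier()
clf.update(["stocks", "rally", "market"], "finance")
clf.update(["match", "goal", "league"], "sports")
label = clf.predict(["market", "stocks"])
```

In this sketch, memory per class is fixed at width * depth counters regardless of how many documents arrive, and each document is touched exactly once, which mirrors the two properties the abstract claims. Accuracy depends on how often distinct terms collide in the sketch, so width and depth trade memory against estimation error.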

Author information

Authors and Affiliations

Dept. of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
Gabriel Pui Cheong Fung & Jeffrey Xu Yu
Dept. of Computer Science, The Hong Kong University of Science and Technology, Hong Kong, China
Hongjun Lu

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Wright State University, USA
Guozhu Dong
School of Computer Science, Sichuan University, 610065, Chengdu, China
Changjie Tang
UNC Chapel Hill
Wei Wang

About this paper

Cite this paper

Fung, G.P.C., Yu, J.X., Lu, H. (2003). Classifying High-Speed Text Streams. In: Dong, G., Tang, C., Wang, W. (eds) Advances in Web-Age Information Management. WAIM 2003. Lecture Notes in Computer Science, vol 2762. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45160-0_15

DOI: https://doi.org/10.1007/978-3-540-45160-0_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40715-7
Online ISBN: 978-3-540-45160-0
eBook Packages: Springer Book Archive
