Integration of Manual and Automatic Text Categorization. A Categorization Workbench for Text-Based Email and Spam

Sun, Qin; Schommer, Christoph; Lang, Alexander

doi:10.1007/978-3-540-30221-6_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3238)

Included in the following conference series:

Annual Conference on Artificial Intelligence

Abstract

As a method for structuring the information and knowledge contained in texts, text categorization can be automated to a great extent. Automatic text classification systems implement machine learning algorithms and need training samples. In commercial applications, however, automatic categorization appears to come up against limiting factors. For example, it turns out to be difficult to reduce the sample complexity without degrading categorization quality in terms of recall and precision. Instead of trying to fully replace human work with machines, it could be more effective and ultimately more efficient to let human and machine cooperate. We have therefore developed a categorization workbench to realise synergy between manual and machine categorization. To compare the categorization workbench with common automatic classification systems, the automatic categorizer of the IBM DB2 Information Integrator for Content was chosen for tests. The test results show that, benefiting from the incorporation of the user's domain knowledge, the categorization workbench can improve recall by a factor of two to four with the same number of training samples as the automatic categorizer uses. Further, to reach comparable categorization quality, the categorization workbench needs only an eighth to a quarter of the training samples that the automatic categorizer needs.
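The abstract measures categorization quality by recall and precision. As a reminder of what these two figures mean, the following is a minimal sketch of the standard definitions for a single category. It is not taken from the paper: the function name `precision_recall` and the example labels are hypothetical and serve only as illustration.

```python
def precision_recall(predicted, actual, category):
    """Return (precision, recall) for one category.

    predicted, actual: equal-length sequences of category labels.
    precision = TP / (TP + FP), recall = TP / (TP + FN).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    # True positives: assigned to the category and really belonging to it.
    tp = sum(1 for p, a in pairs if p == category and a == category)
    # False positives: assigned to the category but belonging elsewhere.
    fp = sum(1 for p, a in pairs if p == category and a != category)
    # False negatives: belonging to the category but not assigned to it.
    fn = sum(1 for p, a in pairs if p != category and a == category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, invented for illustration only.
actual = ["spam", "spam", "ham", "spam", "ham", "spam"]
predicted = ["spam", "ham", "ham", "spam", "spam", "ham"]
precision, recall = precision_recall(predicted, actual, "spam")
```

In this made-up example, two of the four actual spam messages are found, so recall is 0.5. Doubling recall, in the sense used in the abstract, would mean finding all four.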

Author information

Authors and Affiliations

Department of Biology and Computer Science, Johann Wolfgang Goethe-University Frankfurt am Main, Germany
Qin Sun & Christoph Schommer
Data Management Development, IBM Development Laboratory Boeblingen, Germany
Alexander Lang

Editor information

Editors and Affiliations

Institute for Artificial Intelligence, Ulm University, Germany
Susanne Biundo
Fakultät für Ingenieurwissenschaften und Informatik, Universität Ulm, 89069, Ulm, Germany
Thom Frühwirth
Institute of Neural Information Processing, University of Ulm, D-89069, Ulm, Germany
Günther Palm

About this paper

Cite this paper

Sun, Q., Schommer, C., Lang, A. (2004). Integration of Manual and Automatic Text Categorization. A Categorization Workbench for Text-Based Email and Spam. In: Biundo, S., Frühwirth, T., Palm, G. (eds) KI 2004: Advances in Artificial Intelligence. KI 2004. Lecture Notes in Computer Science, vol 3238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30221-6_13

DOI: https://doi.org/10.1007/978-3-540-30221-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23166-0
Online ISBN: 978-3-540-30221-6
eBook Packages: Springer Book Archive
