Abstract
As a method for structuring the information and knowledge contained in texts, text categorization can be automated to a great extent. Automatic text classification systems implement machine learning algorithms and need training samples. In commercial applications, however, automatic categorization appears to come up against limiting factors. For example, it turns out to be difficult to reduce the sample complexity without the categorization quality, in terms of recall and precision, suffering. Instead of trying to fully replace human work with a machine, it could be more effective and ultimately more efficient to let human and machine cooperate. We have therefore developed a categorization workbench to realise synergy between manual and machine categorization. To compare the categorization workbench with common automatic classification systems, the automatic categorizer of the IBM DB2 Information Integrator for Content was chosen for tests. The test results show that, by incorporating the user's domain knowledge, the categorization workbench can improve recall by a factor of two to four with the same number of training samples as the automatic categorizer uses. Further, to reach comparable categorization quality, the categorization workbench needs only an eighth to a quarter of the training samples that the automatic categorizer requires.
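The abstract measures categorization quality by recall and precision. As a reminder of what these two figures mean, here is a minimal sketch using the standard per-category definitions. It is not the evaluation code used in the paper; the function name `precision_recall` and the toy document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for a single category.

    assigned: set of document ids the categorizer placed in the category
    relevant: set of document ids that truly belong to the category
    """
    true_positives = len(assigned & relevant)
    # Precision: share of assigned documents that are correct.
    precision = true_positives / len(assigned) if assigned else 0.0
    # Recall: share of truly relevant documents that were found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: ids of emails that belong to one category.
relevant = {1, 2, 3, 4, 5, 6, 7, 8}
# Hypothetical categorizer output for the same category.
assigned = {1, 2, 3, 9}

p, r = precision_recall(assigned, relevant)
print(f"precision={p:.3f} recall={r:.3f}")
```

In this toy case, three of the four assigned documents are correct, but only three of the eight relevant documents are found. Doubling recall at the same training set size, as the abstract reports for the workbench, would mean finding six of the eight.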
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sun, Q., Schommer, C., Lang, A. (2004). Integration of Manual and Automatic Text Categorization. A Categorization Workbench for Text-Based Email and Spam. In: Biundo, S., Frühwirth, T., Palm, G. (eds) KI 2004: Advances in Artificial Intelligence. KI 2004. Lecture Notes in Computer Science, vol 3238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30221-6_13
DOI: https://doi.org/10.1007/978-3-540-30221-6_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23166-0
Online ISBN: 978-3-540-30221-6
eBook Packages: Springer Book Archive