
Integration of Manual and Automatic Text Categorization. A Categorization Workbench for Text-Based Email and Spam

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3238)

Abstract

As a method of structuring the information and knowledge contained in texts, text categorization can be automated to a great extent. Automatic text classification systems implement machine learning algorithms and need training samples. In commercial applications, however, automatic categorization appears to run up against limiting factors. For example, it turns out to be difficult to reduce the sample complexity without degrading the categorization quality in terms of recall and precision. Instead of trying to fully replace human work with machines, it could be more effective and ultimately more efficient to let human and machine cooperate. We have therefore developed a categorization workbench to realise synergy between manual and machine categorization. To compare the categorization workbench with common automatic classification systems, the automatic categorizer of the IBM DB2 Information Integrator for Content was chosen for tests. The test results show that, benefiting from the incorporation of the user’s domain knowledge, the categorization workbench can improve recall by a factor of two to four with the same number of training samples as the automatic categorizer uses. Further, to achieve comparable categorization quality, the categorization workbench needs only an eighth to a quarter of the training samples that the automatic categorizer requires.
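
The abstract measures categorization quality in terms of recall and precision. As a reminder of what these two measures compute, here is a minimal sketch for a single category, assuming each document carries exactly one true label and one predicted label. The function name `precision_recall` and the sample labels are hypothetical illustrations, not taken from the paper or from the systems it evaluates.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Return (precision, recall) for one category.

    Hypothetical helper for illustration only; assumes single-label data.
    """
    pairs = list(zip(true_labels, predicted_labels))
    # Documents correctly assigned to the category
    tp = sum(1 for t, p in pairs if t == category and p == category)
    # Documents wrongly assigned to the category
    fp = sum(1 for t, p in pairs if t != category and p == category)
    # Documents of the category that the categorizer missed
    fn = sum(1 for t, p in pairs if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example labels (not data from the paper)
true_labels = ["spam", "spam", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "spam", "spam"]
p, r = precision_recall(true_labels, predicted, "spam")
print(f"precision={p:.2f} recall={r:.2f}")
```

Under this definition, improving recall by a factor of two to four means the categorizer finds two to four times as large a share of the documents that truly belong to the category.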




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sun, Q., Schommer, C., Lang, A. (2004). Integration of Manual and Automatic Text Categorization. A Categorization Workbench for Text-Based Email and Spam. In: Biundo, S., Frühwirth, T., Palm, G. (eds) KI 2004: Advances in Artificial Intelligence. KI 2004. Lecture Notes in Computer Science, vol 3238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30221-6_13


  • DOI: https://doi.org/10.1007/978-3-540-30221-6_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23166-0

  • Online ISBN: 978-3-540-30221-6

  • eBook Packages: Springer Book Archive
