Abstract
We developed an application that automates the assignment of emails received in a generic request inbox to one of fourteen predefined topic categories. To build it, we compared the performance of several classifiers in predicting the topic category, using a dataset of 8,841 emails extracted from this inbox over three years. The algorithms ranged from linear classifiers operating on n-gram features to deep learning techniques such as CNNs and LSTMs. For our objective, the best-performing individual model was a logistic regression classifier using n-grams with TF-IDF weights, achieving 90.9% accuracy. The traditional models outperformed the deep learning models on this dataset, likely in part because of the small dataset size, and because this particular classification task may not require the ordered sequence representation of tokens that deep learning models provide. Ultimately, we selected a bagged voting model that combines the predictive power of the top eight models, reaching 92.7% accuracy and surpassing any of the individual models.
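The two approaches the abstract highlights can be illustrated in a minimal scikit-learn sketch. This is not the authors' code: the category names, example emails, choice of n-gram range, and the pair of ensemble members shown here are all invented for illustration (the paper's ensemble combined its top eight models over fourteen categories).

```python
# Sketch of (1) a TF-IDF n-gram + logistic regression classifier and
# (2) a soft-voting ensemble over several models, analogous in spirit
# to the approaches described in the abstract. All data is synthetic.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the request-inbox dataset: (email text, topic label).
emails = [
    "please reset my account password",
    "invoice attached for last month",
    "cannot log in to the portal",
    "payment receipt request for order",
    "locked out after password change",
    "billing question about duplicate charge",
]
labels = ["access", "billing", "access", "billing", "access", "billing"]

# (1) Word uni- and bi-grams with TF-IDF weights, fed to a linear model.
logreg = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])
logreg.fit(emails, labels)

# (2) Soft voting averages the predicted class probabilities of its
# members; only two members are shown here for brevity.
ensemble = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("vote", VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
        ],
        voting="soft",
    )),
])
ensemble.fit(emails, labels)

pred = ensemble.predict(["question about an invoice charge"])[0]
```

On a real corpus, each ensemble member would be trained on a bootstrap sample (bagging) and tuned separately before voting; the pipeline structure above stays the same.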
© 2019 Springer Nature Switzerland AG
Cite this paper
Zhang, H., Rangrej, J., Rais, S., Hillmer, M., Rudzicz, F., Malikov, K. (2019). Categorizing Emails Using Machine Learning with Textual Features. In: Meurs, MJ., Rudzicz, F. (eds) Advances in Artificial Intelligence. Canadian AI 2019. Lecture Notes in Computer Science(), vol 11489. Springer, Cham. https://doi.org/10.1007/978-3-030-18305-9_1