Multi-label Text Classification Using Semantic Features and Dimensionality Reduction with Autoencoders

Alkhatib, Wael; Rensing, Christoph; Silberbauer, Johannes

doi:10.1007/978-3-319-59888-8_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10318)

Included in the following conference series:

International Conference on Language, Data and Knowledge

Abstract

Feature selection is a central concern in text classification because it reduces the high dimensionality of the feature space. The many statistical techniques proposed for weighting and selecting features lose the semantic relationships among concepts and ignore dependencies and ordering between adjacent words. In this work we propose two techniques for incorporating semantics into feature selection. Furthermore, we use autoencoders to transform the features into a reduced feature space in order to analyse the performance penalty of feature extraction. Our extensive experiments on the EUR-Lex dataset showed that semantic-based feature selection techniques significantly outperform the Bag-of-Words (BOW) frequency-based feature selection method with term frequency/inverse document frequency (TF-IDF) feature weighting. In addition, after an aggressive dimensionality reduction of the original features by a factor of 10, the autoencoders are still capable of producing better features than BOW with TF-IDF.
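The baseline and the reduction step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the TF-IDF variant, the autoencoder architecture, or the training settings, so the smoothed IDF formula, the single-layer linear autoencoder, the toy documents, and the names `tfidf_matrix`, `train_autoencoder`, and `encode` are all assumptions made for illustration. Only the compression factor of 10 comes from the text.

```python
import math
import random
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of term j in doc i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)
        # Smoothed IDF is an assumed variant, not taken from the paper.
        rows.append([
            (counts[t] / length) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
            for t in vocab
        ])
    return vocab, rows


def encode(enc, x):
    """Project one feature vector into the reduced space."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in enc]


def train_autoencoder(rows, hidden, epochs=200, lr=0.1, seed=0):
    """Train a linear autoencoder (assumed stand-in) with per-sample gradient descent."""
    rng = random.Random(seed)
    d = len(rows[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d)]
    for _ in range(epochs):
        for x in rows:
            h = encode(enc, x)
            y = [sum(v * hj for v, hj in zip(dec[i], h)) for i in range(d)]
            err = [yi - xi for yi, xi in zip(y, x)]
            # Back-propagate the squared reconstruction error to the hidden
            # units before the decoder weights are changed.
            grad_h = [sum(err[i] * dec[i][j] for i in range(d)) for j in range(hidden)]
            for i in range(d):
                for j in range(hidden):
                    dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden):
                for i in range(d):
                    enc[j][i] -= lr * grad_h[j] * x[i]
    return enc, dec


# Toy documents (hypothetical, loosely styled after legal titles).
docs = [
    "council regulation on agricultural import tariffs",
    "commission decision on fisheries quotas",
    "council directive on import of agricultural products",
]
vocab, rows = tfidf_matrix(docs)
hidden = max(1, len(vocab) // 10)  # reduce the feature space by a factor of 10
enc, dec = train_autoencoder(rows, hidden)
codes = [encode(enc, x) for x in rows]
print(len(vocab), "features reduced to", len(codes[0]))
```

A downstream multi-label classifier would then be trained on `codes` instead of `rows`, which is how the performance penalty of the reduction could be measured against the full TF-IDF baseline.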

Author information

Authors and Affiliations

Fachgebiet Multimedia Kommunikation, Technische Universität Darmstadt, S3/20, Rundeturmstr. 10, 64283, Darmstadt, Germany
Wael Alkhatib, Christoph Rensing & Johannes Silberbauer

Corresponding author

Correspondence to Wael Alkhatib.

Editor information

Editors and Affiliations

Universidad Politécnica de Madrid, Madrid, Spain
Jorge Gracia
Nanyang Technological University, Singapore, Singapore
Francis Bond
Insight Centre for Data Analytics, National University of Ireland, Galway, Galway, Ireland
John P. McCrae
Insight Centre for Data Analytics, National University of Ireland, Galway, Ireland
Paul Buitelaar
Goethe-University Frankfurt, Frankfurt, Germany
Christian Chiarcos
University of Leipzig, Leipzig, Germany
Sebastian Hellmann

About this paper

Cite this paper

Alkhatib, W., Rensing, C., Silberbauer, J. (2017). Multi-label Text Classification Using Semantic Features and Dimensionality Reduction with Autoencoders. In: Gracia, J., Bond, F., McCrae, J., Buitelaar, P., Chiarcos, C., Hellmann, S. (eds) Language, Data, and Knowledge. LDK 2017. Lecture Notes in Computer Science, vol 10318. Springer, Cham. https://doi.org/10.1007/978-3-319-59888-8_32

DOI: https://doi.org/10.1007/978-3-319-59888-8_32
Published: 27 May 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-59887-1
Online ISBN: 978-3-319-59888-8
eBook Packages: Computer Science, Computer Science (R0)
