short-paper

Combining Word Embeddings with Taxonomy Information for Multi-Label Document Classification

Authors:

Stefan Hirschmeier,

Detlef Schoder

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019

Article No.: 37, Pages 1 - 4

https://doi.org/10.1145/3342558.3345424

Published: 23 September 2019

Abstract

In business contexts, documents often need to be classified using company-specific taxonomies. Text-classification approaches based on word embeddings have become increasingly popular as they enable words, documents, and tags to be represented in a semantically robust way (as distributed representations of their contexts) and make documents and tags processable in an algebraic vector space. However, these distributed representations of contexts have their shortcomings when used for multi-label classification tasks: the more similar the contexts of two tags, the more difficult they are to separate in classification. In practice, this effect is intensified by poor training data, poor training, or inherent limitations of the word-embedding approach, and we find areas of indistinguishability that lead to false positive predictions (typically in leaf tags of a taxonomy tree). We contribute an approach to tackle the problem of indistinguishable areas for multi-label classification tasks based on word embeddings by including taxonomy information during prediction.
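
The abstract does not spell out how taxonomy information enters the prediction step, so the following is only a minimal sketch of one possible reading, not the authors' method. It assumes that documents and tags live in the same embedding space, that tags are scored by cosine similarity, and that a predicted tag is kept only if all of its ancestors in the taxonomy also score above a (lower) threshold. The toy vectors, the tag names, the thresholds, and the helper names `flat_predict` and `taxonomy_filter` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def flat_predict(doc_vec, tag_vecs, threshold):
    """Embedding-only prediction: every tag whose score reaches the threshold."""
    scores = {tag: cosine(doc_vec, vec) for tag, vec in tag_vecs.items()}
    predicted = sorted(tag for tag, s in scores.items() if s >= threshold)
    return predicted, scores

def taxonomy_filter(predicted, scores, parent, ancestor_threshold):
    """Keep a tag only if every ancestor in the taxonomy is plausible too."""
    kept = []
    for tag in predicted:
        node = parent.get(tag)
        plausible = True
        while node is not None:
            if scores[node] < ancestor_threshold:
                plausible = False
                break
            node = parent.get(node)
        if plausible:
            kept.append(tag)
    return kept

# Hypothetical toy taxonomy: two top-level tags, one leaf tag under each.
parent = {"sports": None, "finance": None, "soccer": "sports", "stocks": "finance"}

# Hypothetical toy embeddings; the two leaf tags lie close together,
# mimicking an "area of indistinguishability".
tag_vecs = {
    "sports": [1.0, 0.0, 0.0],
    "finance": [0.0, 1.0, 0.0],
    "soccer": [0.3, 0.1, 1.0],
    "stocks": [0.1, 0.3, 1.0],
}
doc_vec = [0.0, 0.6, 0.8]  # a document about finance

flat, scores = flat_predict(doc_vec, tag_vecs, threshold=0.8)
filtered = taxonomy_filter(flat, scores, parent, ancestor_threshold=0.5)

print(flat)      # embedding-only prediction includes the false positive leaf
print(filtered)  # the taxonomy check removes the leaf whose parent scores low
```

In this toy setting both leaf tags pass the similarity threshold, but only one of them has a parent that is itself plausible for the document, so the other is discarded as a likely false positive.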

Cited By

Arslan M, Cruz C (2022). Semantic taxonomy enrichment to improve business text classification for dynamic environments. 2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 1-6. Online publication date: 8-Aug-2022.
https://doi.org/10.1109/INISTA55318.2022.9894173
Fdez-Díaz M, Quevedo J, Montañés E (2022). Target inductive methods for zero-shot regression. Information Sciences, vol. 599, pp. 44-63. Online publication date: Jun-2022.
https://doi.org/10.1016/j.ins.2022.03.075

Index Terms

Combining Word Embeddings with Taxonomy Information for Multi-Label Document Classification
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Annotation
  2. Enterprise computing
    1. Enterprise ontologies, taxonomies and vocabularies
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Document topic models

Published In

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019

September 2019

254 pages

ISBN: 9781450368872

DOI: 10.1145/3342558

General Chairs:
Uwe Borghoff,
Sonja Schimmler

Copyright © 2019 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

In-Cooperation

SIGDOC: ACM Special Interest Group for Design of Communications

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

DocEng '19

Sponsor:

SIGWEB

DocEng '19: ACM Symposium on Document Engineering 2019

September 23 - 26, 2019

Berlin, Germany

Acceptance Rates

DocEng '19 Paper Acceptance Rate: 30 of 77 submissions, 39%

Overall Acceptance Rate: 194 of 564 submissions, 34%
