DOI: 10.1145/3342558.3345424
short-paper

Combining Word Embeddings with Taxonomy Information for Multi-Label Document Classification

Published: 23 September 2019

Abstract

In business contexts, documents often need to be classified using company-specific taxonomies. Text-classification approaches based on word embeddings have become increasingly popular because they represent words, documents, and tags in a semantically robust way (as distributed representations of their contexts) and make documents and tags processable in an algebraic vector space. However, these distributed representations have shortcomings in multi-label classification tasks: the more similar the contexts of two tags, the harder the tags are to separate during classification. In practice, poor training data, poor training, or inherent limitations of the word-embedding approach intensify this effect and produce areas of indistinguishability, which lead to false-positive predictions (typically in the leaf tags of a taxonomy tree). We contribute an approach that tackles the problem of indistinguishable areas in word-embedding-based multi-label classification by including taxonomy information during prediction.
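
As an illustration only, the idea of using taxonomy information at prediction time might be sketched as follows. The abstract does not specify the mechanism, so everything here is an assumption: tags are scored by cosine similarity to the document vector, and a tag is kept only if all of its ancestors in the taxonomy also pass the threshold. The function names, the toy vectors, the tag names, and the threshold are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict_tags(doc_vec, tag_vecs, parent, threshold, use_taxonomy=True):
    """Return the sorted tags whose similarity to the document passes the
    threshold. With use_taxonomy=True, a tag is kept only if every ancestor
    passes the threshold too (an assumed rule, not the paper's method)."""
    scores = {tag: cosine(doc_vec, vec) for tag, vec in tag_vecs.items()}
    accepted = []
    for tag, score in scores.items():
        node = tag
        ok = score >= threshold
        # Walk up the taxonomy and reject the tag if any ancestor fails.
        while ok and use_taxonomy and parent.get(node) is not None:
            node = parent[node]
            ok = scores.get(node, 0.0) >= threshold
        if ok:
            accepted.append(tag)
    return sorted(accepted)

# Hypothetical toy embeddings: the two leaf tags share most of their context
# (third dimension), so both score highly for the same document.
tag_vecs = {
    "finance": [1.0, 0.0, 0.0],
    "legal": [0.0, 1.0, 0.0],
    "finance/invoice": [0.3, 0.0, 1.0],
    "legal/invoice": [0.0, 0.3, 1.0],
}
parent = {
    "finance": None,
    "legal": None,
    "finance/invoice": "finance",
    "legal/invoice": "legal",
}
doc_vec = [0.6, 0.0, 0.8]

flat = predict_tags(doc_vec, tag_vecs, parent, threshold=0.5, use_taxonomy=False)
with_taxonomy = predict_tags(doc_vec, tag_vecs, parent, threshold=0.5)
```

In this toy setup the flat prediction accepts both leaf tags, because their vectors are too similar to separate, while the taxonomy-aware variant drops the leaf whose parent scores poorly for the document.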

Cited By

  • (2022) Semantic taxonomy enrichment to improve business text classification for dynamic environments. 2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 1-6. DOI: 10.1109/INISTA55318.2022.9894173. Online publication date: 8-Aug-2022
  • (2022) Target inductive methods for zero-shot regression. Information Sciences, vol. 599, pp. 44-63. DOI: 10.1016/j.ins.2022.03.075. Online publication date: Jun-2022

Published In

DocEng '19: Proceedings of the ACM Symposium on Document Engineering 2019
September 2019
254 pages
ISBN:9781450368872
DOI:10.1145/3342558
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. keyword identification
  2. multi-label document classification
  3. taxonomy
  4. text tagging
  5. word embeddings

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

DocEng '19
DocEng '19: ACM Symposium on Document Engineering 2019
September 23 - 26, 2019
Berlin, Germany

Acceptance Rates

DocEng '19 Paper Acceptance Rate 30 of 77 submissions, 39%;
Overall Acceptance Rate 194 of 564 submissions, 34%
