Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion

Tran, Minh-Triet; Trieu, Lap Q.; Tran, Huy Q.

doi:10.1007/s10732-019-09417-w

Published: 14 June 2019

Journal of Heuristics, volume 28, pages 211–233 (2022)

Abstract

Document vectorization with an appropriate encoding scheme is an essential component of many document processing tasks, including text document classification, retrieval, and generation. Training a dedicated document embedding for a specific domain may require a large amount of data and substantial computing resources. This motivates us to propose a novel document representation scheme with two main components. First, we train TD2V, a generic pre-trained document embedding for English documents, on more than one million tweets from Twitter. Second, we propose a domain adaptation process with adversarial training to adapt TD2V to different domains. To classify a document, we use the ranked list of its similar documents obtained with query expansion techniques, either Average Query Expansion or Discriminative Query Expansion. Experiments on datasets from different online sources show that, using TD2V alone, our method classifies documents with better accuracy than existing methods. Applying the adversarial adaptation process further boosts the accuracy on the BBC, BBCSport, Amazon4, and 20NewsGroup datasets. We also evaluate our method on the specific domain of sensitivity classification and achieve an accuracy higher than 95% even with a short text fragment of 1024 characters on 5 datasets: Snowden, Mormon, Dyncorp, TM, and Enron.
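The abstract does not spell out how the ranked list is built or turned into a label, so the following is only a minimal sketch of Average Query Expansion in its usual retrieval form: rank documents by cosine similarity, average the query with its top results, and rank again. Random-free toy vectors stand in for TD2V embeddings, and the majority vote over the re-ranked list is an assumed decision rule, not necessarily the one used in the paper. All function names and parameters (`top_k`, `vote_k`) are hypothetical. Discriminative Query Expansion is not covered here.

```python
from collections import Counter

import numpy as np


def l2_normalize(x):
    """Scale a vector (or each row of a matrix) to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, 1e-12)


def rank(query, corpus):
    """Return corpus indices sorted by descending cosine similarity.

    Both inputs are expected to be L2-normalized already, so the dot
    product equals the cosine similarity.
    """
    scores = corpus @ query
    return np.argsort(-scores, kind="stable")


def average_query_expansion(query, corpus, top_k=3):
    """Average the query with its top_k nearest documents, then renormalize."""
    top = rank(query, corpus)[:top_k]
    expanded = (query + corpus[top].sum(axis=0)) / (top_k + 1)
    return l2_normalize(expanded)


def classify(query, corpus, labels, top_k=3, vote_k=5):
    """Assumed decision rule: majority vote over the top vote_k documents
    retrieved with the expanded query."""
    query = l2_normalize(query)
    corpus = l2_normalize(corpus)
    expanded = average_query_expansion(query, corpus, top_k)
    top = rank(expanded, corpus)[:vote_k]
    return Counter(labels[i] for i in top).most_common(1)[0][0]


# Toy 2-D "embeddings" standing in for real document vectors.
corpus = np.array([
    [1.0, 0.1], [0.9, 0.2], [0.8, 0.0],   # labelled "sport"
    [0.1, 1.0], [0.0, 0.9], [0.2, 0.8],   # labelled "tech"
])
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]

print(classify(np.array([1.0, 0.3]), corpus, labels, top_k=2, vote_k=3))
```

Averaging with the top results pulls the query toward the dense region of its neighbourhood, which is why the second ranking tends to be more stable than the first for short or noisy inputs.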

Acknowledgements

This research is funded by the Department of Science and Technology, Ho Chi Minh City, under grant number 40/2015/HD-SKHCN.

Author information

Authors and Affiliations

Faculty of Information Technology, University of Science, VNU-HCM, 227 Nguyen Van Cu Street, District 5, Ho Chi Minh City, Vietnam
Minh-Triet Tran, Lap Q. Trieu & Huy Q. Tran

Corresponding author

Correspondence to Minh-Triet Tran.

About this article

Cite this article

Tran, MT., Trieu, L.Q. & Tran, H.Q. Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion. J Heuristics 28, 211–233 (2022). https://doi.org/10.1007/s10732-019-09417-w

Received: 20 March 2018
Revised: 26 March 2019
Accepted: 30 May 2019
Published: 14 June 2019
Issue Date: April 2022
DOI: https://doi.org/10.1007/s10732-019-09417-w
