Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion

Journal of Heuristics

Abstract

Document vectorization with an appropriate encoding scheme is an essential component of many document processing tasks, including text document classification, retrieval, and generation. Training a dedicated document embedding for a specific domain may require a large amount of data and substantial resources. This motivates us to propose a novel document representation scheme with two main components. First, we train TD2V, a generic pre-trained document embedding for English documents, on more than one million tweets from Twitter. Second, we propose a domain adaptation process with adversarial training to adapt TD2V to different domains. To classify a document, we use the rank list of its similar documents obtained with query expansion techniques, either Average Query Expansion or Discriminative Query Expansion. Experiments on datasets from different online sources show that, using TD2V alone, our method classifies documents with better accuracy than existing methods. Applying the adversarial adaptation process further boosts the accuracy on the BBC, BBCSport, Amazon4, and 20NewsGroup datasets. We also evaluate our method on the specific domain of sensitivity classification and achieve an accuracy higher than \(95\%\), even with short text fragments of 1024 characters, on 5 datasets: Snowden, Mormon, Dyncorp, TM, and Enron.
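
To illustrate the rank-list step, here is a minimal sketch of Average Query Expansion followed by a label vote. It is not the authors' implementation: the cosine similarity, the `top_k` value, the majority vote, the function names, and the toy 2-D vectors are all assumptions made for this example. In practice, document embeddings such as TD2V would take the place of the toy vectors.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, corpus):
    """Return corpus indices ordered from most to least similar to the query."""
    return sorted(range(len(corpus)), key=lambda i: -cosine(query, corpus[i]))


def average_query_expansion(query, corpus, top_k=3):
    """Average the query with its top_k neighbours, then rank again with the result."""
    neighbours = [corpus[i] for i in rank(query, corpus)[:top_k]]
    pooled = [query] + neighbours
    expanded = [sum(column) / len(pooled) for column in zip(*pooled)]
    return rank(expanded, corpus)


def classify(query, corpus, labels, top_k=3):
    """Assumed decision rule: majority vote over the top_k documents of the expanded rank list."""
    ranked = average_query_expansion(query, corpus, top_k)[:top_k]
    return Counter(labels[i] for i in ranked).most_common(1)[0][0]


# Toy 2-D "embeddings" (hypothetical data): three sport-like and three tech-like documents.
corpus = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.0],
          [0.1, 1.0], [0.2, 0.9], [0.0, 0.8]]
labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
query = [0.7, 0.3]

predicted = classify(query, corpus, labels)
print(predicted)
```

The expansion step pulls the query toward the centre of its nearest neighbours, so the second ranking is less sensitive to noise in a single short document than the first one.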




Acknowledgements

This research is funded by the Department of Science and Technology, Ho Chi Minh City, under grant number 40/2015/HD-SKHCN.

Author information

Corresponding author

Correspondence to Minh-Triet Tran.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tran, MT., Trieu, L.Q. & Tran, H.Q. Document representation and classification with Twitter-based document embedding, adversarial domain-adaptation, and query expansion. J Heuristics 28, 211–233 (2022). https://doi.org/10.1007/s10732-019-09417-w

  • DOI: https://doi.org/10.1007/s10732-019-09417-w
