Engineering doc2vec for automatic classification of product descriptions on O2O applications

Electronic Commerce Research

Abstract

In this paper, we develop an automatic product classifier that can become a vital part of a natural user interface for an integrated online-to-offline (O2O) service platform. We devise a novel feature extraction technique to represent product descriptions that are expressed in full natural language sentences. Specifically, we adapt the doc2vec algorithm, which implements the document embedding technique. Doc2vec learns a vector that predicts the salient contexts specific to a document. Our classifier is trained to classify a product description based on a doc2vec-based feature that is augmented in various ways. We trained and tested our classifier with up to 53,000 real product descriptions from Groupon, a popular social commerce site that also offers O2O commerce features such as online ordering for in-store pick-up. Compared to the baseline approaches of bag-of-words modeling and word-level embedding, our classifier showed a significant improvement in classification accuracy when our adapted doc2vec-based feature was used.

Notes

  1. https://radimrehurek.com/gensim/.

  2. A comprehensive explanation is available in [32]. A visual explanation is available at https://ronxin.github.io/wevi/.

References

  1. Abrahams, S. L. (2008). Handmade online: The crafting of commerce, aesthetics and community on Etsy.com. Chapel Hill: The University of North Carolina.

  2. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  3. Dai, A. M., Olah, C., & Le, Q. V. (2015). Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.

  4. Das, P., Xia, Y., Levine, A., Di Fabbrizio, G., & Datta, A. (2016). Large-scale taxonomy categorization for noisy product listings. In 2016 IEEE international conference on big data (big data) (pp. 3885–3894). IEEE.

  5. Ding, Y., Korotkiy, M., Omelayenko, B., Kartseva, V., Zykov, V., Klein, M., et al. (2002). Goldenbullet: Automated classification of product data in e-commerce. In Proceedings of the 5th international conference on business information systems.

  6. Dos Santos, C. N., & Gatti, M. (2014). Deep convolutional neural networks for sentiment analysis of short texts. In COLING (pp. 69–78).

  7. Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.

  8. Gottipati, S., & Vauhkonen, M. (2012). E-commerce product categorization. Stanford CS229 Final Projects.

  9. Hashimoto, K., Stenetorp, P., Miwa, M., & Tsuruoka, Y. (2015). Task-oriented learning of word embeddings for semantic relation classification. arXiv preprint arXiv:1503.00095.

  10. Hull, D. A., et al. (1996). Stemming algorithms: A case study for detailed evaluation. JASIS, 47(1), 70–84.

  11. Ju, R., Zhou, P., Li, C.H., & Liu, L. (2015). An efficient method for document categorization based on word2vec and latent semantic analysis. In 2015 IEEE international conference on computer and information technology; ubiquitous computing and communications; dependable, autonomic and secure computing; pervasive intelligence and computing (CIT/IUCC/DASC/PICOM) (pp. 2276–2283). IEEE.

  12. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

  13. Kim, Y. G., Lee, T., Chun, J., & Lee, S. G. (2006). Modified naïve bayes classifier for e-catalog classification. In Data engineering issues in e-commerce and services (pp. 246–257). Springer.

  14. Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In KDD (vol. 96, pp. 202–207). Citeseer.

  15. Kononenko, I. (1993). Inductive and bayesian learning in medical diagnosis. Applied Artificial Intelligence an International Journal, 7(4), 317–337.

  16. Kozareva, Z. (2015). Everyone likes shopping! multi-class product categorization for e-commerce. In HLT-NAACL (pp. 1329–1333).

  17. Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML (vol. 14, pp. 1188–1196).

  18. Lee, H., Lim, E., Cho, Y., & Yoon, Y. (2016). Automatic classification of product data for natural general-purpose o2o application user interface. In The 2016 fall conference of the KIPS (pp. 382–385).

  19. Lee, J. H., Ha, J., Jung, J. Y., & Lee, S. (2013). Semantic contextual advertising based on the open directory project. ACM Transactions on the Web (TWEB), 7(4), 24.

  20. Lee, Y. E., & Benbasat, I. (2003). Interface design for mobile commerce. Communications of the ACM, 46(12), 48–52.

  21. Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015). Topical word embeddings. In AAAI (pp. 2418–2424).

  22. Lu, S. H., Chiang, D. A., Keh, H. C., & Huang, H. H. (2010). Chinese text classification by the naïve bayes classifier and the associative classifier with multiple confidence threshold values. Knowledge-Based Systems, 23(6), 598–604.

  23. Ma, C., Xu, W., Li, P., & Yan, Y. (2015). Distributional representations of words for short text classification. In VS@ HLT-NAACL (pp. 33–38).

  24. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

  25. Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., et al. (2016). Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4), 694–707.

  26. Panetto, H., Dassisti, M., & Tursi, A. (2012). Onto-pdm: Product-driven ontology for product data management interoperability within manufacturing process environment. Advanced Engineering Informatics, 26(2), 334–348.

  27. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2825–2830.

  28. Perez, S. (2014). Etsy moves further into the offline world with launch of card reader for in-person payments. https://techcrunch.com/2014/10/23/etsy-moves-further-into-the-offline-world-with-launch-of-card-reader-for-in-person-payments/.

  29. Ren, Y., Wang, R., & Ji, D. (2016). A topic-enhanced word embedding for twitter sentiment classification. Information Sciences, 369, 188–198.

  30. Ren, Y., Zhang, Y., Zhang, M., & Ji, D. (2016). Context-sensitive twitter sentiment classification using neural network. In AAAI (pp. 215–221).

  31. Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for idf. Journal of Documentation, 60(5), 503–520.

  32. Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

  33. Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A bayesian approach to filtering junk e-mail. In Learning for text categorization: Papers from the 1998 workshop (vol. 62, pp. 98–105).

  34. Scholl, N. B., Crawford, J., & Puckett, J. (2013). Online ordering for in-shop service. US Patent App. 13/839,414.

  35. Staykova, K. S., & Damsgaard, J. (2016). Platform expansion design as strategic choice: The case of wechat and kakaotalk. http://aisel.aisnet.org/ecis2016_rp/78.

  36. Tang, D. (2015). Sentiment-specific representation learning for document-level sentiment analysis. In Proceedings of the eighth ACM international conference on web search and data mining (pp. 447–452). ACM.

  37. Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267.

  38. Yang, X., Macdonald, C., & Ounis, I. (2016). Using word embeddings in twitter election classification. arXiv preprint arXiv:1606.07006.

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2016R1D1A1B03931324), and by the 2017 Hongik University Research Fund.

Correspondence to Young Yoon.

Cite this article

Lee, H., Yoon, Y. Engineering doc2vec for automatic classification of product descriptions on O2O applications. Electron Commer Res 18, 433–456 (2018). https://doi.org/10.1007/s10660-017-9268-5

  • DOI: https://doi.org/10.1007/s10660-017-9268-5
