Abstract
In this paper, we develop an automatic product classifier that can become a vital part of a natural user interface for an integrated online-to-offline (O2O) service platform. We devise a novel feature extraction technique to represent product descriptions that are expressed in full natural language sentences. Specifically, we adapt the doc2vec algorithm, which implements the document embedding technique. Doc2vec is a way to predict a vector of salient contexts that are specific to a document. Our classifier is trained to classify a product description based on the doc2vec-based feature, which is augmented in various ways. We trained and tested our classifier with up to 53,000 real product descriptions from Groupon, a popular social commerce site that also offers O2O commerce features such as online ordering for in-store pick-up. Compared to the baseline approaches of bag-of-words modeling and word-level embedding, our classifier showed a significant improvement in classification accuracy when our adapted doc2vec-based feature was used.
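For orientation, the bag-of-words baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration only, not the classifier or the doc2vec-based feature developed in the paper: the class name `BagOfWordsNaiveBayes`, the whitespace tokenizer, the choice of multinomial naive Bayes with add-one smoothing, and the toy descriptions and category labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace (assumption: a real pipeline
    # would also strip punctuation and possibly stem the words).
    return text.lower().split()


class BagOfWordsNaiveBayes:
    """Hypothetical baseline: multinomial naive Bayes over word counts."""

    def fit(self, texts, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing: every vocabulary word gets one extra count.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy product descriptions and categories, invented for illustration only.
train_texts = [
    "relaxing full body massage at a downtown spa",
    "facial and spa package with aromatherapy massage",
    "three course dinner for two at an italian restaurant",
    "sushi dinner with drinks at a japanese restaurant",
]
train_labels = ["beauty", "beauty", "food", "food"]

clf = BagOfWordsNaiveBayes().fit(train_texts, train_labels)
prediction = clf.predict("deep tissue massage and spa day")
print(prediction)
```

A bag-of-words model of this kind treats each description as an unordered set of word counts, which is the limitation that document-level embeddings such as doc2vec are meant to address.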
















Notes
A comprehensive explanation is available in [32]. A visual explanation is available at https://ronxin.github.io/wevi/.
References
Abrahams, S. L. (2008). Handmade online: The crafting of commerce, aesthetics and community on Etsy.com. Chapel Hill: The University of North Carolina.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Dai, A. M., Olah, C., & Le, Q. V. (2015). Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.
Das, P., Xia, Y., Levine, A., Di Fabbrizio, G., & Datta, A. (2016). Large-scale taxonomy categorization for noisy product listings. In 2016 IEEE international conference on big data (big data) (pp. 3885–3894). IEEE.
Ding, Y., Korotkiy, M., Omelayenko, B., Kartseva, V., Zykov, V., Klein, M., et al. (2002). Goldenbullet: Automated classification of product data in e-commerce. In Proceedings of the 5th international conference on business information systems.
Dos Santos, C. N., & Gatti, M. (2014). Deep convolutional neural networks for sentiment analysis of short texts. In COLING (pp. 69–78).
Goldberg, Y., & Levy, O. (2014). word2vec explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
Gottipati, S., & Vauhkonen, M. (2012). E-commerce product categorization. Stanford CS229 Final Projects.
Hashimoto, K., Stenetorp, P., Miwa, M., & Tsuruoka, Y. (2015). Task-oriented learning of word embeddings for semantic relation classification. arXiv preprint arXiv:1503.00095.
Hull, D. A., et al. (1996). Stemming algorithms: A case study for detailed evaluation. JASIS, 47(1), 70–84.
Ju, R., Zhou, P., Li, C. H., & Liu, L. (2015). An efficient method for document categorization based on word2vec and latent semantic analysis. In 2015 IEEE international conference on computer and information technology; ubiquitous computing and communications; dependable, autonomic and secure computing; pervasive intelligence and computing (CIT/IUCC/DASC/PICOM) (pp. 2276–2283). IEEE.
Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Kim, Y. G., Lee, T., Chun, J., & Lee, S. G. (2006). Modified naïve bayes classifier for e-catalog classification. In Data engineering issues in e-commerce and services (pp. 246–257). Springer.
Kohavi, R. (1996). Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In KDD (vol. 96, pp. 202–207). Citeseer.
Kononenko, I. (1993). Inductive and bayesian learning in medical diagnosis. Applied Artificial Intelligence an International Journal, 7(4), 317–337.
Kozareva, Z. (2015). Everyone likes shopping! multi-class product categorization for e-commerce. In HLT-NAACL (pp. 1329–1333).
Le, Q. V., & Mikolov, T. (2014). Distributed representations of sentences and documents. In ICML (vol. 14, pp. 1188–1196).
Lee, H., Lim, E., Cho, Y., & Yoon, Y. (2016). Automatic classification of product data for natural general-purpose o2o application user interface. In The 2016 fall conference of the KIPS (pp. 382–385).
Lee, J. H., Ha, J., Jung, J. Y., & Lee, S. (2013). Semantic contextual advertising based on the open directory project. ACM Transactions on the Web (TWEB), 7(4), 24.
Lee, Y. E., & Benbasat, I. (2003). Interface design for mobile commerce. Communications of the ACM, 46(12), 48–52.
Liu, Y., Liu, Z., Chua, T. S., & Sun, M. (2015). Topical word embeddings. In AAAI (pp. 2418–2424).
Lu, S. H., Chiang, D. A., Keh, H. C., & Huang, H. H. (2010). Chinese text classification by the naïve bayes classifier and the associative classifier with multiple confidence threshold values. Knowledge-Based Systems, 23(6), 598–604.
Ma, C., Xu, W., Li, P., & Yan, Y. (2015). Distributional representations of words for short text classification. In VS@ HLT-NAACL (pp. 33–38).
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., et al. (2016). Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24(4), 694–707.
Panetto, H., Dassisti, M., & Tursi, A. (2012). Onto-pdm: Product-driven ontology for product data management interoperability within manufacturing process environment. Advanced Engineering Informatics, 26(2), 334–348.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12, 2825–2830.
Perez, S. (2014). Etsy moves further into the offline world with launch of card reader for in-person payments. https://techcrunch.com/2014/10/23/etsy-moves-further-into-the-offline-world-with-launch-of-card-reader-for-in-person-payments/.
Ren, Y., Wang, R., & Ji, D. (2016). A topic-enhanced word embedding for twitter sentiment classification. Information Sciences, 369, 188–198.
Ren, Y., Zhang, Y., Zhang, M., & Ji, D. (2016). Context-sensitive twitter sentiment classification using neural network. In AAAI (pp. 215–221).
Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for idf. Journal of Documentation, 60(5), 503–520.
Rong, X. (2014). word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998). A bayesian approach to filtering junk e-mail. In Learning for text categorization: Papers from the 1998 workshop (vol. 62, pp. 98–105).
Scholl, N. B., Crawford, J., & Puckett, J. (2013). Online ordering for in-shop service. US Patent App. 13/839,414.
Staykova, K. S., & Damsgaard, J. (2016). Platform expansion design as strategic choice: The case of wechat and kakaotalk. http://aisel.aisnet.org/ecis2016_rp/78.
Tang, D. (2015). Sentiment-specific representation learning for document-level sentiment analysis. In Proceedings of the eighth ACM international conference on web search and data mining (pp. 447–452). ACM.
Wang, Q., Garrity, G. M., Tiedje, J. M., & Cole, J. R. (2007). Naive bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy. Applied and Environmental Microbiology, 73(16), 5261–5267.
Yang, X., Macdonald, C., & Ounis, I. (2016). Using word embeddings in twitter election classification. arXiv preprint arXiv:1606.07006.
Acknowledgements
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (NRF-2016R1D1A1B03931324), and by the 2017 Hongik University Research Fund.
Cite this article
Lee, H., Yoon, Y. Engineering doc2vec for automatic classification of product descriptions on O2O applications. Electron Commer Res 18, 433–456 (2018). https://doi.org/10.1007/s10660-017-9268-5
DOI: https://doi.org/10.1007/s10660-017-9268-5