Abstract
With the rise of web services, huge volumes of customer transaction data are flooding cyberspace and are becoming increasingly useful for profiling users and making recommendations. Because web user transaction data are usually multi-modal, heterogeneous and large-scale, traditional data analysis methods face new challenges. One of these challenges is defining a distance between two transactions or between two web users. The distance definition plays an important role in downstream analysis, such as cluster analysis or k-nearest neighbor queries. In this paper we introduce a category tree distance, which uses product taxonomy information to convert user transaction data into vectors. The similarity between web users can then be evaluated from the vectors derived from their transaction data. We also give properties of the distance, such as its upper and lower bounds, together with a complexity analysis. To investigate the performance of the proposal, we conduct experiments on real web user transaction data. The results show that the proposed distance outperforms the other distances compared on user transaction analysis.
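The general idea of turning transactions into taxonomy-based vectors can be illustrated with a minimal sketch. This is not the paper's definition of the category tree distance, which the abstract does not spell out. The toy taxonomy `PARENT`, the ancestor-count weighting in `user_vector`, and the use of cosine distance are all assumptions chosen only to show how a product taxonomy lets two users who bought different items in the same category appear closer than users with no category in common.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy taxonomy (an assumption for illustration): child -> parent.
# None marks the root of the category tree.
PARENT = {
    "root": None,
    "electronics": "root",
    "grocery": "root",
    "phone": "electronics",
    "laptop": "electronics",
    "milk": "grocery",
    "bread": "grocery",
}


def ancestors_and_self(node):
    """Yield the node itself and then every ancestor up to the root."""
    while node is not None:
        yield node
        node = PARENT[node]


def user_vector(purchases):
    """Build a sparse vector over taxonomy nodes.

    Each purchased item adds 1 to its own node and to each of its
    ancestors (an assumed weighting, not the paper's).
    """
    counts = Counter()
    for item in purchases:
        for node in ancestors_and_self(item):
            counts[node] += 1
    return counts


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty purchase history as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


alice = user_vector(["phone", "laptop"])
bob = user_vector(["phone", "milk"])
carol = user_vector(["milk", "bread"])

print(cosine_distance(alice, bob))
print(cosine_distance(alice, carol))
```

In this toy setup, alice is closer to bob than to carol, because alice and bob share both an item and the electronics category, while alice and carol share only the root. A flat item-only representation would lose the partial credit that shared categories provide.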
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61972286, 62172301 and 61702371), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100), the Science and Technology Program of Shanghai, China (Grant Nos. 20ZR1460500 and 22511104300), and the Fundamental Research Funds for the Central Universities (No. 22120210545).
Additional information
Responsible editor: Fei Wang.
About this article
Cite this article
Zhang, Y., Zhao, Q., Shi, Y. et al. Category tree distance: a taxonomy-based transaction distance for web user analysis. Data Min Knowl Disc 37, 39–66 (2023). https://doi.org/10.1007/s10618-022-00874-9
DOI: https://doi.org/10.1007/s10618-022-00874-9