Category tree distance: a taxonomy-based transaction distance for web user analysis

Zhang, Yinjia; Zhao, Qinpei; Shi, Yang; Li, Jiangfeng; Rao, Weixiong

doi:10.1007/s10618-022-00874-9

Published: 13 October 2022

Data Mining and Knowledge Discovery, volume 37, pages 39–66 (2023)

Yinjia Zhang^1,2,
Qinpei Zhao^1 (ORCID: orcid.org/0000-0002-1765-1171),
Yang Shi^1,
Jiangfeng Li^1,3 &
Weixiong Rao^1

Abstract

With the emergence of webpage services, huge amounts of customer transaction data now flood cyberspace and are becoming increasingly useful for profiling users and making recommendations. Because web user transaction data are usually multi-modal, heterogeneous and large-scale, traditional data analysis methods face new challenges. One of these challenges is defining a distance between two transaction records or two web users. The distance definition plays an important role in further analysis, such as cluster analysis or k-nearest neighbor queries. In this paper we introduce a category tree distance, which uses product taxonomy information to convert user transaction data into vectors. The similarity between web users can then be evaluated from the vectors derived from their transaction data. Properties of the distance, such as its upper and lower bounds, and a complexity analysis are also given. To investigate the performance of the proposal, we conduct experiments on real web user transaction data. The results show that the proposed distance outperforms the other distances on user transaction analysis.
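
The general idea of turning transactions into taxonomy-based vectors can be illustrated with a minimal sketch. The full definition of the category tree distance is not reproduced on this page, so the code below is not the authors' method. It is only a generic example, under assumed choices, of how a product taxonomy might map purchases to vectors that are then compared. The toy taxonomy `PARENT`, the helpers `category_vector` and `euclidean`, the count propagation and the Euclidean comparison are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "root": None,
    "electronics": "root",
    "books": "root",
    "phones": "electronics",
    "laptops": "electronics",
    "fiction": "books",
}


def ancestors(category):
    """Yield the category itself and every ancestor up to the root."""
    while category is not None:
        yield category
        category = PARENT[category]


def category_vector(purchases):
    """Map a list of purchased leaf categories to a sparse vector.

    Each purchase adds one count to its category and to all ancestors;
    counts are divided by the number of purchases (an assumed normalisation).
    """
    if not purchases:
        return {}
    counts = Counter()
    for leaf in purchases:
        for node in ancestors(leaf):
            counts[node] += 1
    total = len(purchases)
    return {node: count / total for node, count in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))


alice = category_vector(["phones", "phones", "laptops"])
bob = category_vector(["laptops"])
carol = category_vector(["fiction"])

print(euclidean(alice, bob), euclidean(alice, carol))
```

In this toy example, alice and bob share the "electronics" branch, so their vectors end up closer than those of alice and carol, whereas a comparison on leaf categories alone would treat laptops and fiction as equally unrelated to phones.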

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61972286, 62172301 and 61702371), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100), the Science and Technology Program of Shanghai, China (Grant Nos. 20ZR1460500 and 22511104300), and the Fundamental Research Funds for the Central Universities (No. 22120210545).

Author information

Authors and Affiliations

School of Software Engineering, Tongji University, Shanghai, China
Yinjia Zhang, Qinpei Zhao, Yang Shi, Jiangfeng Li & Weixiong Rao
Department of Computer Science, School of Science, Aalto University, Espoo, Finland
Yinjia Zhang
Key Laboratory of Blockchain and Cyberspace Governance of Zhejiang Province, Hangzhou, China
Jiangfeng Li

Corresponding author

Correspondence to Qinpei Zhao.

Additional information

Responsible editor: Fei Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Zhao, Q., Shi, Y. et al. Category tree distance: a taxonomy-based transaction distance for web user analysis. Data Min Knowl Disc 37, 39–66 (2023). https://doi.org/10.1007/s10618-022-00874-9

Received: 14 September 2020
Accepted: 15 September 2022
Published: 13 October 2022
Issue Date: January 2023
DOI: https://doi.org/10.1007/s10618-022-00874-9
