
Category tree distance: a taxonomy-based transaction distance for web user analysis

Data Mining and Knowledge Discovery

Abstract

With the emergence of webpage services, huge amounts of customer transaction data are now generated online, and these data are increasingly useful for profiling users and making recommendations. Because web user transaction data are usually multi-modal, heterogeneous and large-scale, traditional data analysis methods face new challenges. One such challenge is defining a distance between two transactions or between two web users. The distance definition plays an important role in downstream analysis, such as cluster analysis and k-nearest neighbor queries. In this paper, we introduce a category tree distance, which uses product taxonomy information to convert user transaction data into vectors. The similarity between web users can then be evaluated from the vectors derived from their transaction data. We also give properties of the distance, including its upper and lower bounds, together with a complexity analysis. To investigate the performance of the proposal, we conduct experiments on real web user transaction data. The results show that the proposed distance outperforms the other distances compared on user transaction analysis.
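The general idea of turning transactions into taxonomy-based vectors can be illustrated with a minimal sketch. This is not the category tree distance defined in the paper, whose exact formulation is not given in the abstract. It assumes a hypothetical toy taxonomy (`PARENT`), propagates each purchased category's count to all of its ancestors, normalizes by the number of transactions, and compares users with the Euclidean distance. All names and the choice of Euclidean distance are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy taxonomy (assumption): child category -> parent category.
PARENT = {
    "root": None,
    "food": "root",
    "electronics": "root",
    "pizza": "food",
    "sushi": "food",
    "phones": "electronics",
}

# A fixed node order defines the coordinates of every user vector.
NODES = sorted(PARENT)


def ancestors_inclusive(category):
    """Yield the category itself and every ancestor up to the root."""
    while category is not None:
        yield category
        category = PARENT[category]


def to_vector(transactions):
    """Map a non-empty list of purchased categories to a normalized vector.

    Each transaction adds one count to its category and to all ancestors,
    so users who buy from sibling categories share mass at the parent node.
    """
    if not transactions:
        raise ValueError("at least one transaction is required")
    counts = Counter()
    for category in transactions:
        for node in ancestors_inclusive(category):
            counts[node] += 1
    total = len(transactions)
    return [counts[node] / total for node in NODES]


def distance(u, v):
    """Euclidean distance between two user vectors (illustrative choice)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


alice = to_vector(["pizza", "sushi"])
bob = to_vector(["pizza", "pizza"])
carol = to_vector(["phones"])
```

In this toy setup, `alice` and `bob` both buy within `food`, so their vectors agree on the `food` and `root` coordinates and `distance(alice, bob)` is smaller than `distance(alice, carol)`, even though `alice` and `bob` do not buy exactly the same items.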

Notes

  1. https://www.yelp.com/dataset/challenge.

  2. https://www.yelp.com/developers/documentation/v3/category_list.

  3. http://jmcauley.ucsd.edu/data/amazon/.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61972286, 62172301 and 61702371), the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100), the Science and Technology Program of Shanghai, China (Grant Nos. 20ZR1460500, 22511104300), and the Fundamental Research Funds for the Central Universities (No. 22120210545).

Author information

Corresponding author

Correspondence to Qinpei Zhao.

Additional information

Responsible editor: Fei Wang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, Y., Zhao, Q., Shi, Y. et al. Category tree distance: a taxonomy-based transaction distance for web user analysis. Data Min Knowl Disc 37, 39–66 (2023). https://doi.org/10.1007/s10618-022-00874-9

  • DOI: https://doi.org/10.1007/s10618-022-00874-9
