Abstract
Transaction analysis is an important part of studies that aim to understand consumer behaviour. The first step is to define a proper measure of similarity, or more specifically a distance metric, between transactions. Existing distance metrics on transactional data are built on retailer-specific information, such as extensive product hierarchies or a large product catalogue. In this paper we propose a new distance metric that is retailer-independent by design, allowing cross-retailer and cross-country analysis. The metric comes with a novel method for finding the importance of product categories, alternating between unsupervised learning techniques and importance calibration. We test our methodology on a real-world dataset and show how it can identify clusters of consumer behaviour.
Notes
1. The full article on this study is found at https://medium.com/icemobile/mindset-moments-a-new-way-to-pinpoint-purchasing-decisions-18c58a574c02.
2. Given two sets \(\mathcal {S}_1\) and \(\mathcal {S}_2\), the Jaccard distance [23] between them is defined as \(1-\frac{|\mathcal {S}_1\cap \mathcal {S}_2|}{|\mathcal {S}_1\cup \mathcal {S}_2|}\).
3. We order (super-)categories alphabetically in \(\vec {a}\), \(\vec {b}\) and s.
4. Without loss of generality, we discuss minimization of the internal evaluation measure; the problem and solution are similar for internal evaluation measures that need to be maximized.
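The Jaccard distance defined in note 2 can be illustrated with a minimal sketch. The function name `jaccard_distance`, the example baskets, and the convention that two empty sets have distance 0 are assumptions made here for illustration; they are not taken from the paper.

```python
def jaccard_distance(s1, s2):
    """Return 1 - |s1 ∩ s2| / |s1 ∪ s2| for two collections of items."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        # Assumption: the formula is undefined for two empty sets,
        # so we treat them as identical (distance 0).
        return 0.0
    return 1.0 - len(s1 & s2) / len(union)


# Hypothetical baskets: 2 shared items out of 4 distinct items in total.
basket_a = {"bread", "milk", "eggs"}
basket_b = {"milk", "eggs", "cheese"}
print(jaccard_distance(basket_a, basket_b))  # 0.5
```

The distance is 0 for identical sets and 1 for disjoint sets, so it depends only on which items overlap, not on item quantities or prices.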
References
Aggarwal, C., Procopiuc, C., Yu, P.: Finding localized associations in market basket data. IEEE Trans. KDE 14(1), 51–62 (2002)
Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA (2007)
Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 160–172. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_14
Chen, X., Fang, Y., Yang, M., Nie, F., Zhao, Z., Huang, J.Z.: PurTreeClust: a clustering algorithm for customer segmentation from massive customer transaction data. IEEE Trans. Knowl. Data Eng. 30(3), 559–572 (2018)
Chen, X., Huang, J.Z., Luo, J.: PurTreeClust: a purchase tree clustering algorithm for large-scale customer transaction data. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 661–672, May 2016
Chen, X., Peng, S., Huang, J.Z., Nie, F., Ming, Y.: Local PurTree spectral clustering for massive customer transaction data. IEEE Intell. Syst. 32(2), 37–44 (2017)
Chen, X., Sun, W., Wang, B., Li, Z., Wang, X., Ye, Y.: Spectral clustering of customer transaction data with a two-level subspace weighting method. IEEE Trans. Cybern. 49(9), 3230–3241 (2019)
Gower, J.C.: A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871 (1971)
Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. Inf. Syst. 25, 345–366 (2000)
Hamerly, G., Elkan, C.: Learning the k in K-means. In: Proceedings of the 16th International Conference on Neural Information Processing Systems. NIPS 2003, pp. 281–288. MIT Press, Cambridge (2003)
Hassani, M., Seidl, T.: Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J. Comput. Sci. 4(3), 171–183 (2016). https://doi.org/10.1007/s40595-016-0086-9
Hassani, M., Spaus, P., Seidl, T.: Adaptive multiple-resolution stream clustering. In: Perner, P. (ed.) MLDM 2014. LNCS (LNAI), vol. 8556, pp. 134–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08979-9_11
Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)
Lam, S.K., Pitrou, A., Seibert, S.: Numba. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC - LLVM 2015, pp. 1–6. ACM Press, New York (2015)
Lu, K., Furukawa, T.: Similarity of transactions for customer segmentation. In: Quirchmayr, G., Basl, J., You, I., Xu, L., Weippl, E. (eds.) CD-ARES 2012. LNCS, vol. 7465, pp. 347–359. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32498-7_26
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS 2001, pp. 849–856. MIT Press, Cambridge (2001)
Park, H.S., Jun, C.H.: A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36(2 PART 2), 3336–3341 (2009)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pelleg, D., Moore, A.W.: X-Means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the Seventeenth International Conference on Machine Learning. ICML 2000, pp. 727–734. Morgan Kaufmann Publishers Inc., San Francisco (2000)
Rokach, L., Maimon, O.: Clustering methods. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_15
Schubert, E., Rousseeuw, P.J.: Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 171–187. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_16
Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11(4), 341–359 (1997). https://doi.org/10.1023/A:1008202821328
Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining, 1st edn. Addison-Wesley Longman Publishing Co., Inc., Boston (2005)
Virtanen, P., et al.: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)
Wang, K., Xu, C., Liu, B.: Clustering transactions using large items. In: Proceedings of the Eighth International Conference on Information and Knowledge Management. CIKM 1999, pp. 483–490. ACM, New York (1999)
Wang, M.T., Hsu, P.Y., Lin, K.C., Chen, S.S.: Clustering transactions with an unbalanced hierarchical product structure. In: Song, I.Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 251–261. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74553-2_23
Acknowledgements
The authors kindly thank BrandLoyalty’s domain experts, in particular Anna Witteman, Hanneke van Keep, Lenneke van der Meijden, and Steven van den Boomen, for their contributions to the research presented in this paper.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Spenrath, Y., Hassani, M., Dongen, B.v., Tariq, H. (2021). Why Did My Consumer Shop? Learning an Efficient Distance Metric for Retailer Transaction Data. In: Dong, Y., Ifrim, G., Mladenić, D., Saunders, C., Van Hoecke, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12461. Springer, Cham. https://doi.org/10.1007/978-3-030-67670-4_20
DOI: https://doi.org/10.1007/978-3-030-67670-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67669-8
Online ISBN: 978-3-030-67670-4
eBook Packages: Computer Science (R0)