Why Did My Consumer Shop? Learning an Efficient Distance Metric for Retailer Transaction Data

Spenrath, Yorick; Hassani, Marwan; Dongen, Boudewijn van; Tariq, Haseeb

doi:10.1007/978-3-030-67670-4_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12461)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Transaction analysis is an important part of studies aiming to understand consumer behaviour. The first step is defining a proper measure of similarity, or more specifically a distance metric, between transactions. Existing distance metrics on transactional data are built on retailer-specific information, such as extensive product hierarchies or a large product catalogue. In this paper we propose a new distance metric that is retailer-independent by design, allowing cross-retailer and cross-country analysis. The metric comes with a novel method of finding the importance of categories of products, alternating between unsupervised learning techniques and importance calibration. We test our methodology on a real-world dataset and show how we can identify clusters of consumer behaviour.

Notes

1.
The full article on this study is found at https://medium.com/icemobile/mindset-moments-a-new-way-to-pinpoint-purchasing-decisions-18c58a574c02.
2.
Given two sets \(\mathcal {S}_1\) and \(\mathcal {S}_2\), the Jaccard distance [23] between them is defined as \(1-\frac{|\mathcal {S}_1\cap \mathcal {S}_2|}{|\mathcal {S}_1\cup \mathcal {S}_2|}\).
3.
We order (super-)categories alphabetically in \(\vec {a}\), \(\vec {b}\) and s.
4.
Without loss of generality, we discuss minimization of the internal evaluation measure; the problem and solution are similar for internal evaluation measures that need to be maximized.
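
As an illustration of Notes 2 and 4, the following is a minimal Python sketch, not the authors' implementation. The function names `jaccard_distance` and `as_minimization`, the example baskets, and the convention that two empty sets have distance 0 are assumptions made here for illustration only.

```python
from typing import Callable, Hashable, Iterable


def jaccard_distance(s1: Iterable[Hashable], s2: Iterable[Hashable]) -> float:
    """Jaccard distance 1 - |S1 n S2| / |S1 u S2| between two sets (Note 2)."""
    a, b = set(s1), set(s2)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical (distance 0);
        # the definition in Note 2 leaves this case undefined.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def as_minimization(measure: Callable[..., float]) -> Callable[..., float]:
    """Negate a measure that should be maximized so it can be minimized (Note 4)."""
    def negated(*args, **kwargs) -> float:
        return -measure(*args, **kwargs)
    return negated


# Hypothetical baskets, represented as sets of product categories.
basket_1 = {"bread", "milk", "eggs"}
basket_2 = {"bread", "milk", "beer"}

# Intersection has 2 elements, union has 4, so the distance is 1 - 2/4.
print(jaccard_distance(basket_1, basket_2))  # 0.5
```

Negating a measure is one common way to reuse a minimizer for a measure that must be maximized; whether the paper handles the maximization case this way is not stated on this page.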

References

Aggarwal, C., Procopiuc, C., Yu, P.: Finding localized associations in market basket data. IEEE Trans. KDE 14(1), 51–62 (2002)
Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA (2007)
Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 160–172. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_14
Chen, X., Fang, Y., Yang, M., Nie, F., Zhao, Z., Huang, J.Z.: PurTreeClust: a clustering algorithm for customer segmentation from massive customer transaction data. IEEE Trans. Knowl. Data Eng. 30(3), 559–572 (2018)
Chen, X., Huang, J.Z., Luo, J.: PurTreeClust: a purchase tree clustering algorithm for large-scale customer transaction data. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 661–672, May 2016
Chen, X., Peng, S., Huang, J.Z., Nie, F., Ming, Y.: Local PurTree spectral clustering for massive customer transaction data. IEEE Intell. Syst. 32(2), 37–44 (2017)
Chen, X., Sun, W., Wang, B., Li, Z., Wang, X., Ye, Y.: Spectral clustering of customer transaction data with a two-level subspace weighting method. IEEE Trans. Cybern. 49(9), 3230–3241 (2019)
Gower, J.C.: A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871 (1971)
Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. Inf. Syst. 25, 345–366 (2000)
Hamerly, G., Elkan, C.: Learning the k in K-means. In: Proceedings of the 16th International Conference on Neural Information Processing Systems. NIPS 2003, pp. 281–288. MIT Press, Cambridge (2003)
Hassani, M., Seidl, T.: Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J. Comput. Sci. 4(3), 171–183 (2016). https://doi.org/10.1007/s40595-016-0086-9
Hassani, M., Spaus, P., Seidl, T.: Adaptive multiple-resolution stream clustering. In: Perner, P. (ed.) MLDM 2014. LNCS (LNAI), vol. 8556, pp. 134–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08979-9_11
Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)
Lam, S.K., Pitrou, A., Seibert, S.: Numba. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC - LLVM 2015, pp. 1–6. ACM Press, New York (2015)
Lu, K., Furukawa, T.: Similarity of transactions for customer segmentation. In: Quirchmayr, G., Basl, J., You, I., Xu, L., Weippl, E. (eds.) CD-ARES 2012. LNCS, vol. 7465, pp. 347–359. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32498-7_26
Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS 2001, pp. 849–856. MIT Press, Cambridge (2001)
Park, H.S., Jun, C.H.: A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36(2 PART 2), 3336–3341 (2009)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Pelleg, D., Moore, A.W.: X-Means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the Seventeenth International Conference on Machine Learning. ICML 2000, pp. 727–734. Morgan Kaufmann Publishers Inc., San Francisco (2000)
Rokach, L., Maimon, O.: Clustering methods. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_15
Schubert, E., Rousseeuw, P.J.: Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 171–187. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_16
Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11(4), 341–359 (1997). https://doi.org/10.1023/A:1008202821328
Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining, 1st edn. Addison-Wesley Longman Publishing Co., Inc., Boston (2005)
Virtanen, P., et al.: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)
Wang, K., Xu, C., Liu, B.: Clustering transactions using large items. In: Proceedings of the Eighth International Conference on Information and Knowledge Management. CIKM 1999, pp. 483–490. ACM, New York (1999)
Wang, M.T., Hsu, P.Y., Lin, K.C., Chen, S.S.: Clustering transactions with an unbalanced hierarchical product structure. In: Song, I.Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 251–261. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74553-2_23

Acknowledgements

The authors kindly thank BrandLoyalty’s domain experts, in particular Anna Witteman, Hanneke van Keep, Lenneke van der Meijden, and Steven van den Boomen, for their contributions to the research presented in this paper.

Author information

Authors and Affiliations

Eindhoven University of Technology, Eindhoven, Netherlands
Yorick Spenrath, Marwan Hassani & Boudewijn van Dongen
BrandLoyalty, ’s-Hertogenbosch, Netherlands
Haseeb Tariq

Corresponding author

Correspondence to Yorick Spenrath.

Editor information

Editors and Affiliations

Microsoft Research, Redmond, WA, USA
Yuxiao Dong
University College Dublin, Dublin, Ireland
Georgiana Ifrim
Jožef Stefan Institute, Ljubljana, Slovenia
Dunja Mladenić
Amazon Alexa Knowledge, Cambridge, UK
Craig Saunders
Ghent University, Kortrijk, Belgium
Sofie Van Hoecke

About this paper

Cite this paper

Spenrath, Y., Hassani, M., Dongen, B.v., Tariq, H. (2021). Why Did My Consumer Shop? Learning an Efficient Distance Metric for Retailer Transaction Data. In: Dong, Y., Ifrim, G., Mladenić, D., Saunders, C., Van Hoecke, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12461. Springer, Cham. https://doi.org/10.1007/978-3-030-67670-4_20

DOI: https://doi.org/10.1007/978-3-030-67670-4_20
Published: 25 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67669-8
Online ISBN: 978-3-030-67670-4
eBook Packages: Computer Science, Computer Science (R0)
