Why Did My Consumer Shop? Learning an Efficient Distance Metric for Retailer Transaction Data

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track (ECML PKDD 2020)

Abstract

Transaction analysis is an important part of studies aiming to understand consumer behaviour. The first step is defining a proper measure of similarity, or more specifically a distance metric, between transactions. Existing distance metrics on transactional data are built on retailer-specific information, such as extensive product hierarchies or a large product catalogue. In this paper we propose a new distance metric that is retailer-independent by design, allowing cross-retailer and cross-country analysis. The metric comes with a novel method for learning the importance of product categories, alternating between unsupervised learning techniques and importance calibration. We test our methodology on a real-world dataset and show how we can identify clusters of consumer behaviour.
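The metric itself is not reproduced on this page. Purely as a hedged illustration of the kind of construction the abstract describes, the sketch below computes a weighted, category-level distance between two transactions; the category names, the spend-share representation, the weighted L1 form, and the equal weights (standing in for the importances the paper learns) are all assumptions made for illustration, not the paper's actual formulation.

    # Illustrative sketch only -- not the paper's metric. Transactions are
    # reduced to per-category spend shares and compared with a weighted L1
    # distance; the weights stand in for learned category importances.
    import numpy as np

    CATEGORIES = ["bakery", "dairy", "produce", "snacks"]  # hypothetical categories

    def to_shares(transaction):
        """Map a {category: spend} dict to a vector of spend shares over CATEGORIES."""
        v = np.array([transaction.get(c, 0.0) for c in CATEGORIES])
        total = v.sum()
        return v / total if total > 0 else v

    def transaction_distance(t1, t2, weights):
        """Weighted L1 distance between the spend-share vectors of two transactions."""
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()  # normalise so the weights express relative importance
        return float(np.sum(w * np.abs(to_shares(t1) - to_shares(t2))))

    if __name__ == "__main__":
        equal_weights = np.ones(len(CATEGORIES))  # placeholder for learned importances
        basket_a = {"bakery": 4.0, "dairy": 6.0}
        basket_b = {"produce": 5.0, "snacks": 5.0}
        # 0.5 for these fully disjoint baskets under equal weights
        print(transaction_distance(basket_a, basket_b, equal_weights))

Representing baskets at the category rather than the product level is what makes such a distance retailer-independent: it does not rely on any single retailer's product catalogue or hierarchy.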

Notes

  1. The full article on this study can be found at https://medium.com/icemobile/mindset-moments-a-new-way-to-pinpoint-purchasing-decisions-18c58a574c02.

  2. Given two sets \(\mathcal {S}_1\) and \(\mathcal {S}_2\), the Jaccard distance [23] between them is defined as \(1-\frac{|\mathcal {S}_1\cap \mathcal {S}_2|}{|\mathcal {S}_1\cup \mathcal {S}_2|}\); a worked example follows these notes.

  3. We order (super-)categories alphabetically in \(\vec {a}\), \(\vec {b}\) and s.

  4. Without loss of generality, we discuss minimization of the internal evaluation measure; the problem and solution are similar for internal evaluation measures that need to be maximized.
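As a minimal worked example of the Jaccard distance referenced in note 2 (the generic set-based definition, independent of how the paper applies it), assuming two hypothetical baskets of product identifiers:

    def jaccard_distance(s1, s2):
        """Jaccard distance: 1 - |intersection| / |union|; 0.0 for two empty sets."""
        union = s1 | s2
        if not union:
            return 0.0
        return 1.0 - len(s1 & s2) / len(union)

    # Two baskets sharing one of three distinct products: 1 - 1/3 = 0.667
    print(jaccard_distance({"milk", "bread"}, {"milk", "eggs"}))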

References

  1. Aggarwal, C., Procopiuc, C., Yu, P.: Finding localized associations in market basket data. IEEE Trans. Knowl. Data Eng. 14(1), 51–62 (2002)

  2. Arthur, D., Vassilvitskii, S.: K-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA 2007, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA (2007)

  3. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 160–172. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2_14

  4. Chen, X., Fang, Y., Yang, M., Nie, F., Zhao, Z., Huang, J.Z.: PurTreeClust: a clustering algorithm for customer segmentation from massive customer transaction data. IEEE Trans. Knowl. Data Eng. 30(3), 559–572 (2018)

  5. Chen, X., Huang, J.Z., Luo, J.: PurTreeClust: a purchase tree clustering algorithm for large-scale customer transaction data. In: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 661–672, May 2016

  6. Chen, X., Peng, S., Huang, J.Z., Nie, F., Ming, Y.: Local PurTree spectral clustering for massive customer transaction data. IEEE Intell. Syst. 32(2), 37–44 (2017)

  7. Chen, X., Sun, W., Wang, B., Li, Z., Wang, X., Ye, Y.: Spectral clustering of customer transaction data with a two-level subspace weighting method. IEEE Trans. Cybern. 49(9), 3230–3241 (2019)

  8. Gower, J.C.: A general coefficient of similarity and some of its properties. Biometrics 27(4), 857–871 (1971)

  9. Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25, 345–366 (2000)

  10. Hamerly, G., Elkan, C.: Learning the k in K-means. In: Proceedings of the 16th International Conference on Neural Information Processing Systems. NIPS 2003, pp. 281–288. MIT Press, Cambridge (2003)

  11. Hassani, M., Seidl, T.: Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J. Comput. Sci. 4(3), 171–183 (2016). https://doi.org/10.1007/s40595-016-0086-9

  12. Hassani, M., Spaus, P., Seidl, T.: Adaptive multiple-resolution stream clustering. In: Perner, P. (ed.) MLDM 2014. LNCS (LNAI), vol. 8556, pp. 134–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-08979-9_11

  13. Jain, A.K.: Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31(8), 651–666 (2010)

  14. Lam, S.K., Pitrou, A., Seibert, S.: Numba. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC - LLVM 2015, pp. 1–6. ACM Press, New York (2015)

  15. Lu, K., Furukawa, T.: Similarity of transactions for customer segmentation. In: Quirchmayr, G., Basl, J., You, I., Xu, L., Weippl, E. (eds.) CD-ARES 2012. LNCS, vol. 7465, pp. 347–359. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-32498-7_26

  16. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: analysis and an algorithm. In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS 2001, pp. 849–856. MIT Press, Cambridge (2001)

  17. Park, H.S., Jun, C.H.: A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 36(2, Part 2), 3336–3341 (2009)

  18. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

  19. Pelleg, D., Moore, A.W.: X-Means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the Seventeenth International Conference on Machine Learning. ICML 2000, pp. 727–734. Morgan Kaufmann Publishers Inc., San Francisco (2000)

  20. Rokach, L., Maimon, O.: Clustering methods. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer, Boston (2005). https://doi.org/10.1007/0-387-25465-X_15

  21. Schubert, E., Rousseeuw, P.J.: Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) SISAP 2019. LNCS, vol. 11807, pp. 171–187. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32047-8_16

  22. Storn, R., Price, K.: Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. J. Global Optim. 11(4), 341–359 (1997). https://doi.org/10.1023/A:1008202821328

  23. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining, 1st edn. Addison-Wesley Longman Publishing Co., Inc., Boston (2005)

  24. Virtanen, P., et al.: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)

  25. Wang, K., Xu, C., Liu, B.: Clustering transactions using large items. In: Proceedings of the Eighth International Conference on Information and Knowledge Management. CIKM 1999, pp. 483–490. ACM, New York (1999)

  26. Wang, M.T., Hsu, P.Y., Lin, K.C., Chen, S.S.: Clustering transactions with an unbalanced hierarchical product structure. In: Song, I.Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2007. LNCS, vol. 4654, pp. 251–261. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74553-2_23

Acknowledgements

The authors kindly thank BrandLoyalty’s domain experts, in particular Anna Witteman, Hanneke van Keep, Lenneke van der Meijden, and Steven van den Boomen, for their contributions to the research presented in this paper.

Author information

Corresponding author

Correspondence to Yorick Spenrath.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Spenrath, Y., Hassani, M., Dongen, B.v., Tariq, H. (2021). Why Did My Consumer Shop? Learning an Efficient Distance Metric for Retailer Transaction Data. In: Dong, Y., Ifrim, G., Mladenić, D., Saunders, C., Van Hoecke, S. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science and Demo Track. ECML PKDD 2020. Lecture Notes in Computer Science, vol. 12461. Springer, Cham. https://doi.org/10.1007/978-3-030-67670-4_20

  • DOI: https://doi.org/10.1007/978-3-030-67670-4_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67669-8

  • Online ISBN: 978-3-030-67670-4

  • eBook Packages: Computer Science, Computer Science (R0)
