DOI: 10.1145/3366423.3380183
research-article

Real-Time Clustering for Large Sparse Online Visitor Data

Published: 20 April 2020 Publication History

Abstract

Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to respond in real time to changes in user parameters, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment requires only a single scan of the points, naive pre-processing requires measuring all pairwise distances, which incurs quadratic computation overhead and is infeasible for even moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation, in terms of accuracy on real datasets, than a straightforward MinHash and LSH implementation, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory footprint. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.
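To make the MinHash and LSH idea in the abstract concrete, the following is a minimal single-machine sketch of the textbook technique. It is not the paper's sparse density peaks algorithm or its Spark implementation. Each visitor is assumed to be a set of non-zero column ids, and all names (`make_hash_params`, `minhash_signature`, `lsh_candidate_pairs`, the toy `visitors` data) are hypothetical choices made for illustration.

```python
import random
from itertools import combinations

# Mersenne prime used as the hash modulus; column ids are assumed to lie in [0, PRIME).
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(c) = (a * c + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_hashes)]


def minhash_signature(columns, params):
    """Signature of a non-empty set of column ids: the minimum hash per function."""
    if not columns:
        raise ValueError("a visitor needs at least one non-zero column")
    return [min((a * c + b) % PRIME for c in columns) for a, b in params]


def estimate_jaccard(sig_x, sig_y):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_x, sig_y)) / len(sig_x)


def exact_jaccard(x, y):
    """Exact Jaccard similarity, the quantity a pairwise scan would compute."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)


def lsh_candidate_pairs(signatures, bands):
    """Band the signatures; visitors sharing any full band become candidate pairs."""
    buckets = {}
    for visitor_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, []).append(visitor_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Toy data (made up): u1 and u2 behave identically, u3 shares nothing with them.
visitors = {"u1": {1, 5, 9, 42}, "u2": {1, 5, 9, 42}, "u3": {7, 8, 1000}}
params = make_hash_params(num_hashes=64)
signatures = {vid: minhash_signature(cols, params) for vid, cols in visitors.items()}
candidates = lsh_candidate_pairs(signatures, bands=16)
```

Only the pairs returned by `lsh_candidate_pairs` would then need a similarity estimate, which is how this family of techniques avoids comparing every pair of rows. The band count trades recall against the number of candidates, and the values used here are arbitrary.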

Cited By

  • (2023) Efficient Density-peaks Clustering Algorithms on Static and Dynamic Data in Euclidean Space. ACM Transactions on Knowledge Discovery from Data 18:1 (1-27). DOI: 10.1145/3607873. Online publication date: 10-Aug-2023
  • (2023) Adaptive load balancing in cluster computing environment. The Journal of Supercomputing 79:17 (20179-20207). DOI: 10.1007/s11227-023-05434-6. Online publication date: 10-Jun-2023
  • (2022) Scalable and Accurate Density-Peaks Clustering on Fully Dynamic Data. 2022 IEEE International Conference on Big Data (Big Data) (445-454). DOI: 10.1109/BigData55660.2022.10020690. Online publication date: 17-Dec-2022

    Published In

    WWW '20: Proceedings of The Web Conference 2020
    April 2020
    3143 pages
    ISBN:9781450370233
    DOI:10.1145/3366423

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Clustering
    2. Density peaks
    3. Sketching
    4. Spark
    5. Sparse binary data

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '20
    WWW '20: The Web Conference 2020
    April 20 - 24, 2020
    Taipei, Taiwan

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
