Real-Time Clustering for Large Sparse Online Visitor Data

Authors: Gromit Yeuk-Yin Chan, Cláudio T. Silva, Juliana Freire

WWW '20: Proceedings of The Web Conference 2020

Pages 1049 - 1059

https://doi.org/10.1145/3366423.3380183

Published: 20 April 2020

Abstract

Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to respond in real time to user parameter changes, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment requires only a single scan of the points, a naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for any moderately sized dataset. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a better approximation than a straightforward MinHash and LSH implementation in terms of accuracy on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory footprint. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.
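
To make the MinHash and LSH idea mentioned in the abstract more concrete, the following is a minimal Python sketch of the generic textbook technique, not the authors' sparse density peaks algorithm or their Spark implementation. It assumes each visitor row is a non-empty set of non-negative integer column ids. The function names, the linear hash family `h(x) = (a*x + b) mod p`, and the banding parameters are illustrative assumptions; a linear family is a common simple choice but is only approximately min-wise independent, so estimates can be slightly biased.

```python
import random

# Assumption: a large prime modulus for the illustrative hash family.
_PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """MinHash signature: the minimum hash value of the set under each function."""
    return [min((a * x + b) % _PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions estimates Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def lsh_buckets(signatures, bands, rows):
    """Banding LSH: rows whose signatures agree on every position of at least
    one band fall into the same bucket and become candidate pairs."""
    buckets = {}
    for row_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets.setdefault(key, set()).add(row_id)
    return buckets


if __name__ == "__main__":
    # Hypothetical toy data: visitor id -> set of active column ids.
    visitors = {
        "v1": {3, 17, 42, 58, 91},
        "v2": {3, 17, 42, 58, 77},
        "v3": {200, 305, 999},
    }
    params = make_hash_params(num_hashes=128, seed=42)
    sigs = {vid: minhash_signature(cols, params) for vid, cols in visitors.items()}
    print("estimated J(v1, v2):", estimate_jaccard(sigs["v1"], sigs["v2"]))
    candidates = [m for m in lsh_buckets(sigs, bands=32, rows=4).values() if len(m) > 1]
    print("candidate groups:", candidates)
```

The point of the sketch is that only rows sharing a bucket need to be compared, which avoids measuring all pairwise distances; how the paper turns such estimates into density annotations and a hierarchy is not covered here.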

Cited By

Amagata D, Hara T. (2023). Efficient Density-peaks Clustering Algorithms on Static and Dynamic Data in Euclidean Space. ACM Transactions on Knowledge Discovery from Data 18:1 (1-27). Online publication date: 10-Aug-2023.
https://dl.acm.org/doi/10.1145/3607873
Singh T, Gupta S, Satakshi, Kumar M. (2023). Adaptive load balancing in cluster computing environment. The Journal of Supercomputing 79:17 (20179-20207). Online publication date: 10-Jun-2023.
https://doi.org/10.1007/s11227-023-05434-6
Amagata D. (2022). Scalable and Accurate Density-Peaks Clustering on Fully Dynamic Data. 2022 IEEE International Conference on Big Data (Big Data) (445-454). Online publication date: 17-Dec-2022.
https://doi.org/10.1109/BigData55660.2022.10020690

Index Terms

Information systems → Information systems applications → Data mining

Index terms have been assigned to the content through auto-classification.

Published In

WWW '20: Proceedings of The Web Conference 2020

April 2020

3143 pages

ISBN:9781450370233

DOI:10.1145/3366423

Editors:
Yennun Huang, Academia Sinica, Taiwan
Irwin King, The Chinese University of Hong Kong, Hong Kong
Tie-Yan Liu, Microsoft Research Asia, China
Maarten van Steen, University of Twente, Netherlands

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '20

Sponsor:

SIGWEB

WWW '20: The Web Conference 2020

April 20 - 24, 2020

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
