Abstract
Efficient, high-quality clustering of extreme-scale datasets with complex clustering structures remains one of the most challenging problems in data analysis. An innovative use of the data cloud provides a unique opportunity to address this challenge. In this paper, we propose the CloudVista framework to address (1) the problems caused by sampling in existing approaches and (2) the latency that cloud-side processing introduces into interactive cluster visualization. CloudVista aims to explore the entire large dataset stored in the cloud with the help of the visual frame data structure and the previously developed VISTA visualization model. The latency of processing large data is addressed by the RandGen algorithm, which generates a series of related visual frames in the cloud without the user's intervention, and by a hierarchical exploration model supported by cloud-side subset processing. An experimental study shows that this framework is effective and efficient for visually exploring clustering structures in extreme-scale datasets stored in the cloud.
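To make the idea of a series of related visual frames more concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify how a visual frame is built or how RandGen varies it, so everything here is an assumption made for illustration: a frame is modeled as a grid of cell counts over a 2D projection, the projection is a star-coordinates-style mapping with one weight per dimension, and consecutive frames differ by a small random change to those weights. The names `project`, `visual_frame`, and `random_frame_series` are hypothetical, and the input values are assumed to be normalized to [0, 1].

```python
import math
import random
from collections import Counter


def project(point, alphas):
    """Map a k-dimensional point to 2D (assumed star-coordinates-style mapping).

    Each dimension i gets an axis at angle 2*pi*i/k, scaled by alphas[i].
    With values in [0, 1] and alphas in [-1, 1], both outputs lie in [-1, 1].
    """
    k = len(point)
    x = sum(a * v * math.cos(2 * math.pi * i / k)
            for i, (a, v) in enumerate(zip(alphas, point)))
    y = sum(a * v * math.sin(2 * math.pi * i / k)
            for i, (a, v) in enumerate(zip(alphas, point)))
    return x / k, y / k


def visual_frame(points, alphas, resolution=64):
    """Aggregate projected points into a sparse grid of (col, row) -> count.

    This is an assumed stand-in for a visual frame: the client would only
    need the cell counts, not the individual records.
    """
    frame = Counter()
    for p in points:
        x, y = project(p, alphas)
        col = min(resolution - 1, int((x + 1) / 2 * resolution))
        row = min(resolution - 1, int((y + 1) / 2 * resolution))
        frame[(col, row)] += 1
    return frame


def random_frame_series(points, k, n_frames=5, step=0.1, seed=0):
    """Produce related frames by nudging the weights a little each time.

    Hypothetical illustration of batch frame generation without user input;
    small steps keep consecutive frames visually similar.
    """
    rng = random.Random(seed)
    alphas = [rng.uniform(-1, 1) for _ in range(k)]
    frames = []
    for _ in range(n_frames):
        frames.append(visual_frame(points, alphas))
        alphas = [max(-1.0, min(1.0, a + rng.uniform(-step, step)))
                  for a in alphas]
    return frames
```

In a cloud setting, the per-point projection and the per-cell counting would map naturally onto a map step and a reduce step, but that mapping is likewise an assumption of this sketch and is not described in the abstract.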
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, K., Xu, H., Tian, F., Guo, S. (2011). CloudVista: Visual Cluster Exploration for Extreme Scale Data in the Cloud. In: Bayard Cushing, J., French, J., Bowers, S. (eds) Scientific and Statistical Database Management. SSDBM 2011. Lecture Notes in Computer Science, vol 6809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22351-8_21
DOI: https://doi.org/10.1007/978-3-642-22351-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22350-1
Online ISBN: 978-3-642-22351-8
eBook Packages: Computer Science (R0)