DOI:10.1145/1088149.1088162
Article

Design of a next generation sampling service for large scale data analysis applications

Published: 20 June 2005

Abstract

Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling to great effect to scale up performance on these data sets at a small cost to accuracy. However, existing techniques often ignore the cost of computing a sample. This cost is often linear in the size of the data set rather than the size of the sample, which makes it expensive. Furthermore, for data mining applications that leverage progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples.

To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation with cost linear in the size of the sample. We then present an efficient parallelization of this service. Our solution leverages high-speed interconnects (e.g., Myrinet, InfiniBand) for parallel I/O operations with pipelined data transfers. We export an interface that supports both ad hoc SQL-like querying for database applications and a stand-alone service for data mining applications. We evaluate our work using queries abstracted from a network monitoring and analysis application, which uses both database and progressive sampling queries. We demonstrate that our implementation achieves good load balance and realizes up to an order of magnitude speedup when compared with extant approaches.
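The cost distinction drawn in the abstract can be illustrated with a small sketch. The abstract does not describe the service's actual mechanism, so the following is not the authors' method. It only contrasts a scan-based sampler (reservoir sampling, which reads every record) with a random-access sampler that reads only the records it selects. The fixed-size record layout and the names `RECORD`, `write_records`, `reservoir_sample`, and `indexed_sample` are assumptions made for illustration.

```python
import random
import struct

# Assumption for illustration: each record is one fixed-size 64-bit integer,
# so record i starts at byte offset i * RECORD.size.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write values as fixed-size records to a binary file."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def reservoir_sample(path, k, rng):
    """Scan-based sampling (reservoir, Algorithm R): reads all N records."""
    sample, reads = [], 0
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            value = RECORD.unpack(chunk)[0]
            if reads < k:
                sample.append(value)       # fill the reservoir first
            else:
                j = rng.randint(0, reads)  # uniform over [0, reads], inclusive
                if j < k:
                    sample[j] = value      # replace a reservoir slot
            reads += 1
    return sample, reads


def indexed_sample(path, n, k, rng):
    """Random-access sampling: chooses k of n positions, reads only those."""
    sample, reads = [], 0
    with open(path, "rb") as f:
        # Sorting the chosen positions keeps the seeks moving forward.
        for i in sorted(rng.sample(range(n), k)):
            f.seek(i * RECORD.size)
            sample.append(RECORD.unpack(f.read(RECORD.size))[0])
            reads += 1
    return sample, reads
```

Both functions return the sample together with the number of records read. For a file of N records and a sample of size k, the scan-based version reads N records and the random-access version reads k, which is the gap that matters when progressive sampling or bootstrapping requests many samples in a row.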

Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN:1595931678
DOI:10.1145/1088149

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data centers
  2. data mining
  3. parallel I/O
  4. sampling

Qualifiers

  • Article

Conference

ICS05: International Conference on Supercomputing 2005
June 20 - 22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
