Article

Design of a next generation sampling service for large scale data analysis applications

Authors:

S. Parthasarathy,

J. Saltz

ICS '05: Proceedings of the 19th annual international conference on Supercomputing

Pages 91 - 100

https://doi.org/10.1145/1088149.1088162

Published: 20 June 2005

Abstract

Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling to great effect to scale up performance on these data sets at a small cost to accuracy. However, existing techniques often ignore the cost of computing a sample. This cost is often linear in the size of the data set rather than the size of the sample, which is expensive. Furthermore, for data mining applications that leverage progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples.

To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation with cost linear in the size of the sample. We then present an efficient parallelization of this service. Our solution leverages high-speed interconnects (e.g., Myrinet, InfiniBand) for parallel I/O operations with pipelined data transfers. We export an interface that supports both ad hoc SQL-like querying for database applications and a stand-alone service for data mining applications. We then evaluate our work using queries abstracted from a network monitoring and analysis application, which uses both database and progressive sampling queries. We demonstrate that our implementation achieves good load balance and realizes up to an order of magnitude speedup when compared with extant approaches.
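The cost distinction drawn in the abstract can be illustrated with a small sketch. It is not the service described in the paper, whose design the abstract does not detail. It only contrasts a scan-based sampler, which touches every record, with one that draws positions first and fetches only those records. It assumes the data set supports random access by position (here an in-memory list); the function names `scan_sample` and `indexed_sample` are hypothetical.

```python
import random


def scan_sample(records, k, rng):
    """Reservoir sampling (Algorithm R): reads every record once.

    Returns the sample and the number of records touched, which
    always equals the size of the data set.
    """
    reservoir = []
    touched = 0
    for i, rec in enumerate(records):
        touched += 1
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir, touched


def indexed_sample(records, k, rng):
    """Draw k distinct positions, then fetch only those records.

    Assumes random access by position; the number of records touched
    equals the sample size, not the data set size.
    """
    positions = rng.sample(range(len(records)), k)
    return [records[p] for p in positions], len(positions)


# Illustrative data: 100,000 integer "records" (an assumption, not from the paper).
data = list(range(100_000))
rng = random.Random(0)

scan_result, scan_touched = scan_sample(data, 100, rng)
index_result, index_touched = indexed_sample(data, 100, rng)

print("scan-based sampler touched", scan_touched, "records")
print("index-based sampler touched", index_touched, "records")
```

Both samplers return 100 records, but the scan-based one touches all 100,000 while the index-based one touches only 100. When several samples are needed, as in progressive sampling or bootstrapping, the scan cost is paid again for each sample.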




Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing

June 2005

414 pages

ISBN: 1595931678

DOI: 10.1145/1088149

General Chair: Arvind (MIT)

Program Chair: Larry Rudolph (MIT)

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Article

Conference

ICS05: International Conference on Supercomputing 2005

Sponsor: SIGARCH

June 20 - 22, 2005

Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 6

Total Downloads: 471

Downloads (Last 12 months): 2

Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Feb 2025

Cited By

Dayal J, Cao J, Eisenhauer G, Schwan K, Wolf M, Zheng F, Abbasi H, Klasky S, Podhorszki N, Lofstead J (2013). I/O Containers. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp. 2015-2024. DOI: 10.1109/IPDPSW.2013.198. Online publication date: 20-May-2013.
https://dl.acm.org/doi/10.1109/IPDPSW.2013.198

Zheng F, Abbasi H, Docan C, Lofstead J, Liu Q, Klasky S, Parashar M, Podhorszki N, Schwan K, Wolf M (2010). PreDatA – preparatory data analytics on peta-scale machines. 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-12. DOI: 10.1109/IPDPS.2010.5470454. Online publication date: Apr-2010.
https://doi.org/10.1109/IPDPS.2010.5470454

Zhang X, Kurc T, Saltz J, Parthasarathy S (2006). Design and analysis of a multi-dimensional data sampling service for large scale data analysis applications. Proceedings of the 20th international conference on Parallel and distributed processing, pp. 77-77. DOI: 10.5555/1898953.1899011. Online publication date: 25-Apr-2006.
https://dl.acm.org/doi/10.5555/1898953.1899011

Buehrer G, Ghoting A, Zhang X, Tatikonda S, Parthasarathy S, Kurc T, Saltz J (2006). I/O conscious algorithm design and systems support for data analysis on emerging architectures. Proceedings of the 20th international conference on Parallel and distributed processing, pp. 288-288. DOI: 10.5555/1898699.1898813. Online publication date: 25-Apr-2006.
https://dl.acm.org/doi/10.5555/1898699.1898813

Buehrer G, Ghoting A, Xi Zhang, Tatikonda S, Parthasarathy S, Kurc T, Saltz J (2006). I/O conscious algorithm design and systems support for data analysis on emerging architectures. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 8 pp. DOI: 10.1109/IPDPS.2006.1639586. Online publication date: 2006.
https://doi.org/10.1109/IPDPS.2006.1639586

Xi Zhang, Kurc T, Saltz J, Parthasarathy S (2006). Design and analysis of a multi-dimensional data sampling service for large scale data analysis applications. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 9 pp. DOI: 10.1109/IPDPS.2006.1639315. Online publication date: 2006.
https://doi.org/10.1109/IPDPS.2006.1639315
