Abstract
Knowledge discovery, that is, analyzing a given massive data set and deriving or discovering knowledge from it, has become an important subject in several fields, including computer science. Good software is in demand for various knowledge discovery tasks, and such software often requires efficient algorithms for handling huge data sets. Random sampling is one of the key algorithmic methods for processing huge data sets. In this paper, we explain some random sampling techniques for speeding up learning algorithms and making them applicable to large data sets [15], [16], [4], [3]. We also show some algorithms obtained by using these techniques.
A part of this work is supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Priority Areas (Discovery Science), 1998–2001.
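The sequential sampling idea behind several of the techniques surveyed here can be illustrated with a minimal sketch. This is not any of the paper's algorithms; the function name, batch size, and the simple Hoeffding-based stopping rule are illustrative assumptions. The sampler keeps drawing random examples until a Hoeffding bound certifies that the running estimate is within `eps` of the true fraction with probability at least 1 − `delta`, so the sample size needed is independent of the data set's size.

```python
import math
import random

def adaptive_estimate(data, predicate, eps=0.05, delta=0.05, batch=100):
    """Estimate the fraction of `data` satisfying `predicate` by random
    sampling.  Sampling stops as soon as Hoeffding's inequality guarantees
    the estimate is within `eps` of the true fraction with probability
    at least 1 - delta, so far fewer than len(data) reads may suffice."""
    hits, n = 0, 0
    while True:
        for _ in range(batch):
            hits += predicate(random.choice(data))
            n += 1
        # Hoeffding: P(|hits/n - p| > eps) <= 2 * exp(-2 * n * eps**2)
        if 2 * math.exp(-2 * n * eps ** 2) <= delta:
            return hits / n, n

random.seed(0)
data = list(range(100000))  # true fraction of multiples of 3 is ~1/3
est, used = adaptive_estimate(data, lambda x: x % 3 == 0)
# `used` is 800 here: the smallest multiple of `batch` with
# 2 * exp(-2 * n * 0.05**2) <= 0.05.
```

This fixed stopping rule is the non-adaptive baseline; the adaptive methods discussed in the paper (e.g., [24]) sharpen it by letting the stopping time depend on the observed estimate as well.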
References
N. Abe and H. Mamitsuka, Query learning strategies using boosting and bagging, in Proc. the 15th Int’l Conf. on Machine Learning (ICML’98), 1–9, 1998.
I. Adler and R. Shamir, A randomized scheme for speeding up algorithms for linear and convex programming with high constraints-to-variable ratio, Math. Programming 61, 39–52, 1993.
J. Balcázar, Y. Dai, and O. Watanabe, Provably fast training algorithms for support vector machines, in Proc. the first IEEE Int’l Conf. on Data Mining, to appear.
J. Balcázar, Y. Dai, and O. Watanabe, Random sampling techniques for training support vector machines: For primal-form maximal-margin classifiers, in Proc. the 12th Int’l Conf. on Algorithmic Learning Theory (ALT’01), to appear.
K.P. Bennett and E.J. Bredensteiner, Duality and geometry in SVM classifiers, in Proc. the 17th Int’l Conf. on Machine Learning (ICML’00), 57–64, 2000.
P.S. Bradley, O.L. Mangasarian, and D.R. Musicant, Optimization methods in massive datasets, in Handbook of Massive Datasets (J. Abello, P.M. Pardalos, and M.G.C. Resende, eds.), Kluwer Academic Pub., to appear.
L. Breiman, Pasting small votes for classification in large databases and on-line, Machine Learning 36, 85–103, 1999.
K.L. Clarkson, Las Vegas algorithms for linear and integer programming, J.ACM 42, 488–499, 1995.
M. Collins, R.E. Schapire, and Y. Singer, Logistic regression, AdaBoost and Bregman Distance, in Proc. the 13th Annual Conf. on Comput. Learning Theory (COLT’00), 158–169, 2000.
C. Cortes and V. Vapnik, Support-vector networks, Machine Learning 20, 273–297, 1995.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge Univ. Press, 2000.
P. Dagum, R. Karp, M. Luby, and S. Ross, An optimal algorithm for Monte Carlo estimation, SIAM J. Comput. 29(5), 1484–1496, 2000.
T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization, Machine Learning 32, 1–22, 1998.
C. Domingo, R. Gavaldà, and O. Watanabe, Practical algorithms for on-line selection, in Proc. the first Intl. Conf. on Discovery Science (DS’98), Lecture Notes in AI 1532, 150–161, 1998.
C. Domingo, R. Gavaldà, and O. Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms, in Proc. the 2nd Intl. Conf. on Discovery Science (DS’99), Lecture Notes in AI, 172–183, 1999. (The final version will appear in J. Knowledge Discovery and Data Mining.)
C. Domingo and O. Watanabe, MadaBoost: A modification of AdaBoost, in Proc. the 13th Annual Conf. on Comput. Learning Theory (COLT’00), 180–189, 2000.
C. Domingo and O. Watanabe, Scaling up a boosting-based learner via adaptive sampling, in Proc. of Knowledge Discovery and Data Mining (PAKDD’00), Lecture Notes in AI 1805, 317–328, 2000.
B. Gärtner and E. Welzl, A simple sampling lemma: Analysis and applications in geometric optimization, Discr. Comput. Geometry, to appear. (Also available from http://www.inf.ethz.ch/personal/gaertner/publications.html)
W. Feller, An Introduction to Probability Theory and its Applications (Third Edition), John Wiley & Sons, 1968.
Y. Freund, Boosting a weak learning algorithm by majority, Information and Computation 121(2), 256–285, 1995.
Y. Freund, An adaptive version of the boost by majority algorithm, in Proc. the 12th Annual Conf. on Comput. Learning Theory (COLT’99), 102–113, 1999.
J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting, Technical Report, 1998.
Y. Freund and R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55(1), 119–139, 1997.
B.K. Ghosh and P.K. Sen eds., Handbook of Sequential Analysis, Marcel Dekker, 1991.
R. Greiner, PALO: a probabilistic hill-climbing algorithm, Artificial Intelligence 84, 177–204, 1996.
P. Haas and A. Swami, Sequential sampling procedures for query size estimation, IBM Research Report RJ 9101(80915), 1992.
M. Kearns, Efficient noise-tolerant learning from statistical queries, in Proc. the 25th Annual ACM Sympos. on Theory of Comput. (STOC’93), 392–401, 1993.
R.J. Lipton, J.F. Naughton, D.A. Schneider, and S. Seshadri, Efficient sampling strategies for relational database operations, Theoret. Comput. Sci. 116, 195–226, 1993.
R.J. Lipton and J.F. Naughton, Query size estimation by adaptive sampling, J. Comput. and Syst. Sci. 51, 18–25, 1995.
J.F. Lynch, Analysis and application of adaptive sampling, in Proc. the 19th ACM Sympos. on Principles of Database Systems (PODS’99), 260–267, 1999.
O. Maron and A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, in Proc. Advances in Neural Information Process. Systems (NIPS’94), 59–66, 1994.
J. Platt, Fast training of support vector machines using sequential minimal optimization, in Advances in Kernel Methods — Support Vector Learning (B. Schölkopf, C.J.C. Burges, and A.J. Smola, eds.), MIT Press, 185–208, 1999.
R.E. Schapire, The strength of weak learnability, Machine Learning 5(2), 197–227, 1990.
T. Scheffer and S. Wrobel, A sequential sampling algorithm for a general class of utility criteria, in Proc. the 6th ACM Intl. Conf. on Knowledge Discovery and Data Mining (KDD’00), 2000.
A.J. Smola and B. Schölkopf, A tutorial on support vector regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, Univ. London, 1998.
A. Wald, Sequential Analysis, John Wiley & Sons, 1947.
© 2001 Springer-Verlag Berlin Heidelberg
Cite this paper
Watanabe, O. (2001). How Can Computer Science Contribute to Knowledge Discovery. In: Pacholski, L., Ružička, P. (eds) SOFSEM 2001: Theory and Practice of Informatics. SOFSEM 2001. Lecture Notes in Computer Science, vol 2234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45627-9_11
Print ISBN: 978-3-540-42912-8
Online ISBN: 978-3-540-45627-8