
How Can Computer Science Contribute to Knowledge Discovery

  • Conference paper
SOFSEM 2001: Theory and Practice of Informatics (SOFSEM 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2234)


Abstract

Knowledge discovery, that is, analyzing a given massive data set and deriving or discovering some knowledge from it, has become an important subject in several fields, including computer science. Good software tools are in demand for various knowledge discovery tasks, and for such tools we often need to develop efficient algorithms for handling huge data sets. Random sampling is one of the important algorithmic methods for processing huge data sets. In this paper, we explain some random sampling techniques for speeding up learning algorithms and making them applicable to large data sets [15], [16], [4], [3]. We also show some algorithms obtained by using these techniques.

Part of this work was supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Priority Areas (Discovery Science), 1998-2001.
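
To make the flavor of these sampling techniques concrete, here is a minimal sketch, in the spirit of the adaptive sampling of [15], of estimating a {0,1}-valued statistic (say, the error rate of a hypothesis on a huge data set) without fixing the sample size in advance: one keeps drawing random examples until a Hoeffding-style stopping condition certifies the current estimate. This is an illustrative sketch, not the algorithm of [15] itself; the names `adaptive_estimate` and `draw` are invented for this example.

```python
import math
import random

def adaptive_estimate(draw, eps=0.05, delta=0.05, batch=100, max_n=10**6):
    """Estimate the mean of a {0,1}-valued source to within +/- eps,
    with probability at least 1 - delta, by adaptive sampling: the
    sample size is grown until a Hoeffding-style bound certifies
    the current estimate, instead of being fixed in advance."""
    successes, n = 0, 0
    while n < max_n:
        successes += sum(draw() for _ in range(batch))
        n += batch
        k = n // batch  # index of this stopping check
        # Hoeffding: P(|p_hat - p| > t) <= 2 exp(-2 n t^2).
        # Charge check k a failure probability of delta / (k (k+1));
        # these sum to at most delta over all checks (union bound).
        t = math.sqrt(math.log(2.0 * k * (k + 1) / delta) / (2 * n))
        if t <= eps:
            break
    return successes / n, n

if __name__ == "__main__":
    # Toy stand-in for "test a hypothesis on one random example
    # drawn from a huge data set": a coin with unknown bias 0.7.
    coin = lambda: 1 if random.random() < 0.7 else 0
    p_hat, used = adaptive_estimate(coin, eps=0.02)
    print(f"estimate {p_hat:.3f} from {used} samples")
```

The point of the adaptive schedule is that easy instances (means far from the decision boundary, or loose accuracy demands) stop early, so the expected sample size can be far below the worst-case fixed-size bound.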


References

  1. N. Abe and H. Mamitsuka, Query learning strategies using boosting and bagging, in Proc. the 15th Int'l Conf. on Machine Learning (ICML'98), 1-9, 1998.

  2. I. Adler and R. Shamir, A randomized scheme for speeding up algorithms for linear and convex programming with high constraints-to-variable ratio, Math. Programming 61, 39-52, 1993.

  3. J. Balcázar, Y. Dai, and O. Watanabe, Provably fast training algorithms for support vector machines, in Proc. the First IEEE Int'l Conf. on Data Mining, to appear.

  4. J. Balcázar, Y. Dai, and O. Watanabe, Random sampling techniques for training support vector machines: For primal-form maximal-margin classifiers, in Proc. the 12th Int'l Conf. on Algorithmic Learning Theory (ALT'01), to appear.

  5. K.P. Bennett and E.J. Bredensteiner, Duality and geometry in SVM classifiers, in Proc. the 17th Int'l Conf. on Machine Learning (ICML'00), 57-64, 2000.

  6. P.S. Bradley, O.L. Mangasarian, and D.R. Musicant, Optimization methods in massive datasets, in Handbook of Massive Datasets (J. Abello, P.M. Pardalos, and M.G.C. Resende, eds.), Kluwer Academic Pub., to appear.

  7. L. Breiman, Pasting small votes for classification in large databases and on-line, Machine Learning 36, 85-103, 1999.

  8. K.L. Clarkson, Las Vegas algorithms for linear and integer programming, J. ACM 42, 488-499, 1995.

  9. M. Collins, R.E. Schapire, and Y. Singer, Logistic regression, AdaBoost and Bregman distances, in Proc. the 13th Annual Conf. on Comput. Learning Theory (COLT'00), 158-169, 2000.

  10. C. Cortes and V. Vapnik, Support-vector networks, Machine Learning 20, 273-297, 1995.

  11. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge Univ. Press, 2000.

  12. P. Dagum, R. Karp, M. Luby, and S. Ross, An optimal algorithm for Monte Carlo estimation, SIAM J. Comput. 29(5), 1484-1496, 2000.

  13. T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization, Machine Learning 32, 1-22, 1998.

  14. C. Domingo, R. Gavaldà, and O. Watanabe, Practical algorithms for on-line selection, in Proc. the First Int'l Conf. on Discovery Science (DS'98), Lecture Notes in AI 1532, 150-161, 1998.

  15. C. Domingo, R. Gavaldà, and O. Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms, in Proc. the 2nd Int'l Conf. on Discovery Science (DS'99), Lecture Notes in AI, 172-183, 1999. (The final version will appear in J. Knowledge Discovery and Data Mining.)

  16. C. Domingo and O. Watanabe, MadaBoost: A modification of AdaBoost, in Proc. the 13th Annual Conf. on Comput. Learning Theory (COLT'00), 180-189, 2000.

  17. C. Domingo and O. Watanabe, Scaling up a boosting-based learner via adaptive sampling, in Proc. of Knowledge Discovery and Data Mining (PAKDD'00), Lecture Notes in AI 1805, 317-328, 2000.

  18. B. Gärtner and E. Welzl, A simple sampling lemma: Analysis and applications in geometric optimization, Discr. Comput. Geometry, to appear. (Also available from http://www.inf.ethz.ch/personal/gaertner/publications.html)

  19. W. Feller, An Introduction to Probability Theory and Its Applications (Third Edition), John Wiley & Sons, 1968.

  20. Y. Freund, Boosting a weak learning algorithm by majority, Information and Computation 121(2), 256-285, 1995.

  21. Y. Freund, An adaptive version of the boost by majority algorithm, in Proc. the 12th Annual Conf. on Comput. Learning Theory (COLT'99), 102-113, 1999.

  22. J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a statistical view of boosting, Technical Report, 1998.

  23. Y. Freund and R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55(1), 119-139, 1997.

  24. B.K. Ghosh and P.K. Sen, eds., Handbook of Sequential Analysis, Marcel Dekker, 1991.

  25. R. Greiner, PALO: a probabilistic hill-climbing algorithm, Artificial Intelligence 84, 177-204, 1996.

  26. P. Haas and A. Swami, Sequential sampling procedures for query size estimation, IBM Research Report RJ 9101(80915), 1992.

  27. M. Kearns, Efficient noise-tolerant learning from statistical queries, in Proc. the 25th Annual ACM Sympos. on Theory of Comput. (STOC'93), 392-401, 1993.

  28. R.J. Lipton, J.F. Naughton, D.A. Schneider, and S. Seshadri, Efficient sampling strategies for relational database operations, Theoret. Comput. Sci. 116, 195-226, 1993.

  29. R.J. Lipton and J.F. Naughton, Query size estimation by adaptive sampling, J. Comput. Syst. Sci. 51, 18-25, 1995.

  30. J.F. Lynch, Analysis and application of adaptive sampling, in Proc. the 19th ACM Sympos. on Principles of Database Systems (PODS'99), 260-267, 1999.

  31. O. Maron and A. Moore, Hoeffding races: accelerating model selection search for classification and function approximation, in Proc. Advances in Neural Information Process. Systems (NIPS'94), 59-66, 1994.

  32. J. Platt, Fast training of support vector machines using sequential minimal optimization, in Advances in Kernel Methods — Support Vector Learning (B. Schölkopf, C.J.C. Burges, and A.J. Smola, eds.), MIT Press, 185-208, 1999.

  33. R.E. Schapire, The strength of weak learnability, Machine Learning 5(2), 197-227, 1990.

  34. T. Scheffer and S. Wrobel, A sequential sampling algorithm for a general class of utility criteria, in Proc. the 6th ACM Int'l Conf. on Knowledge Discovery and Data Mining (KDD'00), 2000.

  35. A.J. Smola and B. Schölkopf, A tutorial on support vector regression, NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, Univ. London, 1998.

  36. A. Wald, Sequential Analysis, John Wiley & Sons, 1947.


Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Watanabe, O. (2001). How Can Computer Science Contribute to Knowledge Discovery. In: Pacholski, L., Ružička, P. (eds) SOFSEM 2001: Theory and Practice of Informatics. SOFSEM 2001. Lecture Notes in Computer Science, vol 2234. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45627-9_11

  • DOI: https://doi.org/10.1007/3-540-45627-9_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42912-8

  • Online ISBN: 978-3-540-45627-8
