DOI: 10.1145/1014052.1014131

The IOC algorithm: efficient many-class non-parametric classification for high-dimensional data

Published: 22 August 2004

Abstract

This paper is about a variant of k-nearest-neighbor classification on large, many-class, high-dimensional datasets. K-nearest-neighbor remains a popular classification technique, especially in areas such as computer vision, drug activity prediction and astrophysics. Furthermore, many more modern classifiers, such as kernel-based Bayes classifiers or the prediction phase of SVMs, require computational regimes similar to k-NN. We believe that tractable k-NN algorithms therefore continue to be important.

This paper relies on the insight that even with many classes, the task of finding the majority class among the k nearest neighbors of a query need not require us to explicitly find those k nearest neighbors. This insight was previously used in (Liu et al., 2003) in two algorithms called KNS2 and KNS3, which dealt with fast classification in the case of two classes. In this paper we show how a different approach, IOC (standing for the International Olympic Committee), can apply to the case of n classes where n > 2.

IOC assumes a slightly different processing of the datapoints in the neighborhood of the query. This allows it to search a set of metric trees, one for each class. During the searches it is possible to quickly prune away classes that cannot possibly be the majority.

We give experimental results on datasets of up to 5.8 × 10^5 records and 1.5 × 10^3 attributes, frequently showing an order of magnitude acceleration compared with each of (i) conventional linear scan, (ii) a well-known independent SR-tree implementation of conventional k-NN and (iii) a highly optimized conventional k-NN metric tree search.
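
To make the per-class search setup concrete, the sketch below builds one search structure per class and takes a majority vote over the pooled k nearest neighbors. It is an illustrative reading of the abstract, not the paper's algorithm: the function and variable names are hypothetical, a brute-force per-class scan stands in for the per-class metric trees, and the central IOC idea of pruning away whole classes during the tree search without ever materializing the exact k neighbors is not reproduced here.

import numpy as np

def knn_majority_per_class(class_points, query, k):
    """class_points: dict mapping class label -> (n_c, d) array of points.
    Returns the majority class among the k points nearest to `query`."""
    candidates = []  # (distance, class label) pairs
    for label, pts in class_points.items():
        # Only a class's own k nearest points can possibly appear among the
        # overall k nearest neighbors of the query, so each per-class search
        # needs to return at most k candidates.
        dists = np.linalg.norm(pts - query, axis=1)
        candidates.extend((d, label) for d in np.sort(dists)[:k])
    candidates.sort(key=lambda pair: pair[0])
    top_k_labels = [label for _, label in candidates[:k]]
    # Majority vote over the pooled k nearest neighbors.
    values, counts = np.unique(top_k_labels, return_counts=True)
    return values[np.argmax(counts)]

# Example usage on random data: 3 classes, 100 points each, 10 dimensions.
rng = np.random.default_rng(0)
data = {c: rng.normal(loc=c, size=(100, 10)) for c in range(3)}
print(knn_majority_per_class(data, query=rng.normal(size=10), k=5))

In the paper's setting, each brute-force per-class scan would be replaced by a metric-tree search over that class, and distance bounds from the trees would let entire classes be discarded as soon as they can no longer win the vote.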

References

[1]
D. W. Aha. A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical and Psychological Evaluations. PhD thesis; Technical Report No. 90--42, University of California, Irvine, November 1990.
[2]
D. W. Aha, D. Kibler, and M. K. Albert. Instance-Based Learning Algorithms. Machine Learning, 6:37--66, 1991.
[3]
S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891--923, 1998.
[4]
Jock A. Blackard. Forest covertype database. http://kdd.ics.uci.edu/databases/covertype/covertype.data.html.
[5]
P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd VLDB International Conference, September 1997.
[6]
Ron Cole and Mark Fanty. Isolet spoken letter recognition database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/isolet/.
[7]
J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209--226, September 1977.
[8]
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. In Proc 25th VLDB Conference, 1999.
[9]
J. M. Hammersley. The Distribution of Distances in a Hypersphere. Annals of Mathematical Statistics, 21:447--452, 1950.
[10]
CMU Informedia Digital Video Library project. The TREC-2001 video track organized by NIST: shot boundary task. 2001.
[11]
IOC. International Olympic Committee: Candidature acceptance procedure. http://multimedia.olympic.org/pdf/en_report_711.pdf, 1999.
[12]
Norio Katayama and Shin'ichi Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 369--380, 1997.
[13]
Nicholas Kushmerick. Internet advertisements. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/internet_ads/.
[14]
Ting Liu, Andrew Moore, and Alexander Gray. Efficient exact k-NN and nonparametric classification in high dimensions. In Proceedings of Neural Information Processing Systems, 2003.
[15]
A. W. Moore. The Anchors Hierarchy: Using the Triangle Inequality to Survive High-Dimensional Data. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. AAAI Press, 2000.
[16]
S. M. Omohundro. Bumptrees for Efficient Function, Constraint, and Classification Learning. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, 1991.
[17]
F. P. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 1985.
[18]
Yanjun Qi, Alexander G. Hauptmann, and Ting Liu. Supervised classification for video shot segmentation. In Proceedings of the 2003 IEEE International Conference on Multimedia & Expo, 2003.
[19]
D. B. Skalak. Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms. In W. W. Cohen and H. Hirsh, editors, Machine Learning: Proceedings of the Eleventh International Conference. Morgan Kaufmann, 1994.
[20]
David J. Slate. Letter recognition database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/letter-recognition/.
[21]
J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175--179, 1991.
[22]
Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proc. 24th Int. Conf. Very Large Data Bases (VLDB), pages 194--205, 1998.



    Published In

    KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    August 2004
    874 pages
    ISBN: 1581138881
    DOI: 10.1145/1014052
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. classification
    2. high dimension
    3. k nearest neighbor
    4. metric tree

    Qualifiers

    • Article

    Conference

    KDD04

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%



    Cited By

    • (2023) Forest Cover Type Prediction using Automatic Machine Learning. 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT). DOI: 10.1109/ICCCNT56998.2023.10307426, pages 1-5. Online publication date: 6-Jul-2023.
    • (2022) Performance Assessment of K-Nearest Neighbor Algorithm for Classification of Forest Cover Type. Proceedings of Third International Conference on Sustainable Computing. DOI: 10.1007/978-981-16-4538-9_5, pages 43-51. Online publication date: 4-Jan-2022.
    • (2017) Data mining classification experiments with decision trees over the forest covertype database. 2017 21st International Conference on System Theory, Control and Computing (ICSTCC). DOI: 10.1109/ICSTCC.2017.8107040, pages 236-241. Online publication date: Oct-2017.
    • (2008) BoostMap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1):89-104. DOI: 10.1109/TPAMI.2007.1140. Online publication date: 1-Jan-2008.
    • (2006) New Algorithms for Efficient High-Dimensional Nonparametric Classification. The Journal of Machine Learning Research, 7:1135-1158. DOI: 10.5555/1248547.1248588. Online publication date: 1-Dec-2006.
