Safely selecting subsets of training data

Author: Henry S. Baird

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

Pages 433 - 440

https://doi.org/10.1145/1815330.1815386

Published: 09 June 2010

Abstract

Highly versatile classifiers for document analysis systems demand representative training sets, which can be dauntingly large and often challenge conventional trainable classifier technologies. We propose to select a small subset of training data, matched to each particular test set, in the hope of improving speed without loss of accuracy. Since selection must occur on-line, we cannot use classifiers that require off-line training. Fortunately, Nearest Neighbors classifiers support on-line training; we use a fast approximate kNN technology based on hashed k-D trees. The distribution of samples in k-D bins can be used to measure similarity between any two document images: we select the three most similar training images for any given test image. In experiments on a document image content extraction system, our algorithm pruned 118 training images to three, for a speedup of a factor of 17 with no loss of accuracy. Other experiments with an oracle and manual selection suggest that it may be possible to improve accuracy as well.
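
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how bins are formed or which similarity measure is used, so the sketch assumes uniform quantization of features scaled to [0, 1) as a stand-in for hashed k-D tree bins, and normalized histogram intersection as the similarity. The names `bin_key`, `bin_histogram`, `similarity`, and `select_training_images` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Sample = Sequence[float]  # one feature vector; values assumed scaled to [0, 1)


def bin_key(sample: Sample, cells_per_axis: int = 4) -> Tuple[int, ...]:
    """Map a feature vector to a bin by uniform quantization.

    Assumption: this stands in for the hashed k-D tree bins named in the
    abstract, whose exact construction is not given there.
    """
    return tuple(min(int(x * cells_per_axis), cells_per_axis - 1) for x in sample)


def bin_histogram(samples: List[Sample], cells_per_axis: int = 4) -> Counter:
    """Count how many of one image's samples fall into each bin."""
    return Counter(bin_key(s, cells_per_axis) for s in samples)


def similarity(a: Counter, b: Counter) -> float:
    """Normalized histogram intersection in [0, 1] (an assumed measure)."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    shared_bins = a.keys() & b.keys()
    return sum(min(a[k] / total_a, b[k] / total_b) for k in shared_bins)


def select_training_images(
    test_samples: List[Sample],
    training: Dict[str, List[Sample]],
    n: int = 3,
) -> List[str]:
    """Return the names of the n training images most similar to the test image."""
    test_hist = bin_histogram(test_samples)
    scored = [
        (similarity(test_hist, bin_histogram(samples)), name)
        for name, samples in training.items()
    ]
    # Highest similarity first; ties are broken by name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:n]]
```

With `n=3`, only the samples of the three returned images would then be loaded into the nearest-neighbor classifier for that test image, which is where the reported speedup would come from.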

Index Terms

Computing methodologies > Artificial intelligence > Computer vision > Computer vision problems > Interest point and salient region detections

Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems

June 2010

490 pages

ISBN:9781605587738

DOI:10.1145/1815330

General Chairs: David Doermann (University of Maryland, College Park), Venu Govindaraju (University at Buffalo, SUNY), Daniel Lopresti (Lehigh University), Prem Natarajan (Raytheon BBN Technologies)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

DAS '10: The 9th IAPR International Workshop on Document Analysis Systems

June 9 - 11, 2010

Boston, Massachusetts, USA
