DOI:10.1145/1815330.1815386
research-article

Safely selecting subsets of training data

Published: 09 June 2010

Abstract

Highly versatile classifiers for document analysis systems demand representative training sets, which can be dauntingly large and often challenge conventional trainable classifier technologies. We propose to select a small subset of training data, matched to each particular test set, in the hope of improving speed without loss of accuracy. Since selection must occur on-line, we cannot use classifiers that require off-line training. Fortunately, Nearest Neighbors classifiers support on-line training; we use a fast approximate kNN technology based on hashed k-D trees. The distribution of samples in k-D bins can be used to measure similarity between any two document images: we select the three most similar training images for any given test image. In experiments on a document image content extraction system, our algorithm was able to prune 118 training images to three, for a speedup of a factor of 17 with no loss of accuracy. Other experiments with an oracle and manual selection suggest that it may be possible to improve accuracy as well.
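The selection step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the hashing scheme or the similarity measure, so everything below is an assumption made for illustration: features are taken to lie in [0, 1), `bin_key` stands in for the hashed k-D bins by uniformly quantizing each dimension, and `similarity` uses histogram intersection. All function and parameter names are hypothetical and are not taken from the paper.

```python
from collections import Counter


def bin_key(sample, bits_per_dim=2):
    """Map a feature vector with components in [0, 1) to a coarse k-D bin key.

    Assumption: uniform quantization stands in for the paper's hashed k-D bins.
    """
    cells = 1 << bits_per_dim
    return tuple(min(int(x * cells), cells - 1) for x in sample)


def bin_histogram(samples, bits_per_dim=2):
    """Return the normalized bin occupancy for the samples of one image."""
    counts = Counter(bin_key(s, bits_per_dim) for s in samples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def similarity(hist_a, hist_b):
    """Histogram intersection (an assumed measure): 1.0 if identical, 0.0 if disjoint."""
    return sum(min(p, hist_b.get(key, 0.0)) for key, p in hist_a.items())


def select_training_images(test_samples, training_images, n_select=3):
    """Return the names of the n_select training images most similar to the test image."""
    test_hist = bin_histogram(test_samples)
    scored = [
        (similarity(test_hist, bin_histogram(samples)), name)
        for name, samples in training_images.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:n_select]]


if __name__ == "__main__":
    # Toy 2-D feature vectors; real systems would use per-pixel or per-patch features.
    test_image = [(0.1, 0.1), (0.12, 0.2), (0.6, 0.6)]
    training = {
        "a": [(0.1, 0.1), (0.15, 0.2), (0.6, 0.7)],
        "b": [(0.9, 0.9)],
        "c": [(0.1, 0.2)],
        "d": [(0.6, 0.6)],
        "e": [(0.9, 0.1)],
    }
    print(select_training_images(test_image, training))  # prints ['a', 'c', 'd']
```

Only the selected images would then be used to train the on-line kNN classifier for that test image; the classifier itself is outside the scope of this sketch.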

Published In

DAS '10: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
June 2010
490 pages
ISBN:9781605587738
DOI:10.1145/1815330
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adaptive recognition
  2. algorithms
  3. decimation
  4. document analysis systems
  5. document content extraction
  6. isogeny
  7. k nearest neighbors

Qualifiers

  • Research-article

Conference

DAS '10
