
Learning before Learning: Reversing Validation and Training

Published: 31 August 2017

Abstract

In the world of ground truthing, that is, the collection of highly valuable labeled training and validation data, there is a tendency to follow the path of first training on a set of data, then validating the data, and then testing the data. However, in many cases the labeled training data is of non-uniform quality, and thus of non-uniform value for assessing the accuracy and other performance indicators of analytics algorithms, systems, and processes. This means that one or more of the labeled classes is likely a mixture of two or more clusters or sub-classes. These data may inhibit our ability to assess the classifier to use for deployment. We argue that one must learn about the labeled data before it can be used for downstream machine learning; that is, we reverse the validation and training steps in building the classifier. This "learning before learning" is assessed using a CNN corpus (cnn.com) that was hand-labeled as comprising 12 classes. We show how suspect classes are identified during the initial validation, and how training proceeds after validation. Applying this process to the CNN corpus, we find that it consists of nine high-quality classes and three mixed-quality classes. The effects of this validation-training approach are then shown and discussed.
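The "learning before learning" idea, validating the labeled classes before any training, can be illustrated with a small sketch. This is not the authors' code: it simply runs a two-cluster (2-means) split inside each labeled class and flags classes where the split removes most of the within-class variance, which hints that the class is really a mixture of two sub-classes. The function names, the 2-means test, and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np

def two_means_gain(X, iters=20, seed=0):
    """Fraction of a class's within-class variance removed by a 2-means split.

    A high value suggests the labeled class is a mixture of two sub-classes.
    """
    rng = np.random.default_rng(seed)
    sse_one = ((X - X.mean(axis=0)) ** 2).sum()
    if sse_one == 0:
        return 0.0  # all points identical: nothing to split
    # initialise the two centres from two distinct random points
    centres = X[rng.choice(len(X), size=2, replace=False)].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):  # Lloyd's iterations
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centres[k] = X[labels == k].mean(axis=0)
    sse_two = sum(((X[labels == k] - centres[k]) ** 2).sum() for k in range(2))
    return 1.0 - sse_two / sse_one

def flag_suspect_classes(features, labels, threshold=0.5):
    """Return the class labels whose feature vectors look like a mixture."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return [c for c in np.unique(labels)
            if two_means_gain(features[labels == c]) > threshold]
```

On synthetic data, a class drawn from a single blob passes the check, while a class built from two well-separated blobs is flagged as suspect; flagged classes would then be inspected or re-labeled before the training step.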



Published In

DocEng '17: Proceedings of the 2017 ACM Symposium on Document Engineering
August 2017
242 pages
ISBN:9781450346894
DOI:10.1145/3103010

In-Cooperation

  • SIGDOC: ACM Special Interest Group on Systems Documentation

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. classification
  2. ensemble methods
  3. tf*idf
  4. training
  5. validation

Qualifiers

  • Short-paper

Conference

DocEng '17: ACM Symposium on Document Engineering 2017
September 4-7, 2017
Valletta, Malta

Acceptance Rates

DocEng '17 paper acceptance rate: 13 of 71 submissions (18%)
Overall acceptance rate: 194 of 564 submissions (34%)
