
Learning before Learning: Reversing Validation and Training

Published: 31 August 2017

Abstract

In the world of ground truthing, that is, the collection of highly valuable labeled training and validation data, there is a tendency to follow the path of first training on a set of data, then validating the data, and then testing the data. However, in many cases the labeled training data is of non-uniform quality, and thus of non-uniform value for assessing the accuracy and other performance indicators of analytics algorithms, systems, and processes. This means that one or more of the labeled classes is likely a mixture of two or more clusters or sub-classes. These data may inhibit our ability to assess the classifier to use for deployment. We argue that one must learn about the labeled data before it can be used for downstream machine learning; that is, we reverse the validation and training steps in building the classifier. This "learning before learning" is assessed using a CNN corpus (cnn.com) that was hand-labeled as comprising 12 classes. We show how suspect classes are identified during the initial validation, and how training proceeds after validation. Applying this process to the CNN corpus, we find that it consists of nine high-quality classes and three mixed-quality classes. The effects of this validation-training approach are then shown and discussed.
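The "learning before learning" idea, validating the labeled classes before any training, can be illustrated with a small sketch. This is not the authors' code: it simply runs a two-cluster (2-means) split inside each labeled class and flags classes where the split removes most of the within-class variance, which hints that the class is really a mixture of two sub-classes. The function names, the 2-means test, and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np

def two_means_gain(X, iters=20, seed=0):
    """Fraction of a class's within-class variance removed by a 2-means split.

    A high value suggests the labeled class is a mixture of two sub-classes.
    """
    rng = np.random.default_rng(seed)
    sse_one = ((X - X.mean(axis=0)) ** 2).sum()
    if sse_one == 0:
        return 0.0  # all points identical: nothing to split
    # initialise the two centres from two distinct random points
    centres = X[rng.choice(len(X), size=2, replace=False)].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):  # Lloyd's iterations
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if (labels == k).any():
                centres[k] = X[labels == k].mean(axis=0)
    sse_two = sum(((X[labels == k] - centres[k]) ** 2).sum() for k in range(2))
    return 1.0 - sse_two / sse_one

def flag_suspect_classes(features, labels, threshold=0.5):
    """Return the class labels whose feature vectors look like a mixture."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return [c for c in np.unique(labels)
            if two_means_gain(features[labels == c]) > threshold]
```

On synthetic data, a class drawn from a single blob passes the check, while a class built from two well-separated blobs is flagged as suspect; flagged classes would then be inspected or re-labeled before the training step.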



Published In

DocEng '17: Proceedings of the 2017 ACM Symposium on Document Engineering
August 2017
242 pages
ISBN:9781450346894
DOI:10.1145/3103010

In-Cooperation

  • SIGDOC: ACM Special Interest Group on Systems Documentation

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. classification
  2. ensemble methods
  3. tf*idf
  4. training
  5. validation

Qualifiers

  • Short-paper

Conference

DocEng '17: ACM Symposium on Document Engineering 2017
September 4-7, 2017
Valletta, Malta

Acceptance Rates

DocEng '17 paper acceptance rate: 13 of 71 submissions (18%)
Overall acceptance rate: 194 of 564 submissions (34%)
