
Neural Networks

Volume 18, Issues 5–6, July–August 2005, Pages 595-601

2005 Special issue
Training neural networks with heterogeneous data

https://doi.org/10.1016/j.neunet.2005.06.011

Abstract

Data pruning and ordered training are two methods that result from a small theory attempting to formalize neural network training with heterogeneous data. Data pruning is a simple process that attempts to remove noisy data. Ordered training is a more complex method that partitions the data into a number of categories and assigns training times to them, assuming that data size and training time have a polynomial relation. Both methods derive from a set of premises that form the ‘axiomatic’ basis of our theory. Both methods have been applied to a time-delay neural network, which is one of the main learners in Microsoft's Tablet PC handwriting recognition system. Their effect is presented in this paper along with a rough estimate of their effect on the overall multi-learner system. The handwriting data and the chosen language are Italian.1

Introduction

In handwriting recognition, we usually have to train our learners with heterogeneous data, or, more precisely, with handwriting samples of varying types, sizes, and distributions. For example, the Tablet PC handwriting data are a collection of dictionary words, phrases or sentences, telephone numbers, dates, times, personal names, geographical names, web and e-mail addresses, postal addresses, numbers, formulas, single-character data, etc. Such a multitude of types, and the consequently distinct and differing statistical properties, raises an obvious question: should training methods take data heterogeneity into account and, if so, how?

In this paper, we attempt to answer the above question. We develop a small theory that guides training and applies not only to handwriting recognition but also to every training problem with prevalent heterogeneity. We consider our theory not a complete or conclusive result but rather a first step in a greater effort to formalize and quantify neural network training with heterogeneous data.

Our approach consists of two main parts. In the first part, which we call data pruning, we try to remove low-quality data (e.g. data with a high level of noise that contaminates the training set).

Our ink data are stored in a number of ink files; each file contains a number of panels; each panel is a sequence of words; and words can be dictionary words, e-mail addresses, numbers, dates, single characters etc. To cleanse the training data, a combination of machine and human labeling is used. We label words as good or bad, and our pruning method merely discards the files with a high percentage of bad words.
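The file-level pruning rule just described can be sketched as follows. The `InkFile` structure and the 20% bad-word threshold are illustrative assumptions; the paper does not publish its exact cut-off.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class InkFile:
    name: str
    labels: List[bool]  # True = word labeled "good", False = "bad"

def prune_files(files, max_bad_fraction=0.2):
    """Discard ink files whose fraction of bad words exceeds the threshold.

    The threshold value is a placeholder, not a figure from the paper.
    """
    kept = []
    for f in files:
        bad = sum(1 for good in f.labels if not good)
        if f.labels and bad / len(f.labels) <= max_bad_fraction:
            kept.append(f)
    return kept
```

Note that the rule operates at file granularity: individual bad words in an otherwise good file survive, matching the paper's description of merely discarding files with a high percentage of bad words.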

The second part of our approach is the ordered training method: we initially partition the data into a number of categories that share some common properties and then we specify training times for those categories based on the ordered training model. The model itself derives from a single premise that training time and data size have a polynomial relation. (One can dispute this premise, of course, but we believe that whenever training time grows exponentially with data size the underlying problem should be truly hopeless.)
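Under this premise, a total training budget can be divided across categories in proportion to a polynomial of their data sizes. A minimal sketch follows; the exponent is a free parameter of the illustration, not a value taken from the paper.

```python
def allocate_training_times(sizes, total_time, exponent=1.5):
    """Split a total training budget across data categories, assuming
    training time grows polynomially with data size: T_i proportional
    to S_i ** exponent. The exponent here is illustrative."""
    weights = [s ** exponent for s in sizes]
    total_w = sum(weights)
    return [total_time * w / total_w for w in weights]
```

With `exponent=1.0` this reduces to allocation proportional to raw data size; larger exponents give disproportionately more time to larger categories.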

Furthermore, in order to define good or optimal data categories for our model, we resort to cooperative game theory (Rasmusen, 2001). We treat category partitioning as an n-player game and apply a standard hill-climbing method (Rich & Knight, 1992; Section 3.2).
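A minimal version of such a hill-climbing search over partitions might look like the sketch below. The `score` function stands in for the accuracy-gain criterion, which in the paper's setting would be estimated empirically; the move set (reassigning one item at a time) is an assumption of this illustration.

```python
import random

def hill_climb_partition(items, n_categories, score, iterations=1000, seed=0):
    """Greedy hill climbing over data partitions: repeatedly move one
    item to another category and keep the move only if it improves the
    score. `score` maps an assignment (list of category indices, one
    per item) to a number to be maximized."""
    rng = random.Random(seed)
    assignment = [rng.randrange(n_categories) for _ in items]
    best = score(assignment)
    for _ in range(iterations):
        i = rng.randrange(len(items))
        old = assignment[i]
        new = rng.randrange(n_categories)
        if new == old:
            continue
        assignment[i] = new
        s = score(assignment)
        if s > best:
            best = s          # keep the improving move
        else:
            assignment[i] = old  # revert
    return assignment, best
```

Like any hill climber, this finds a local optimum only; restarts or a richer move set would be natural extensions.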

Finally, to avoid catastrophic interference (McCloskey & Cohen, 1989; Ratcliff, 1990), we combine all data categories and train our learner using a single training stream that has the recommended distribution.
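One simple way to build such a combined stream is to sample a category according to the recommended distribution at each step and then draw a sample from it. This sketch assumes independent sampling with replacement, which the paper does not specify.

```python
import random

def mixed_stream(category_samples, proportions, n, seed=0):
    """Produce a single training stream of length n whose category mix
    follows the given proportions (a dict of category -> weight).
    Sampling is i.i.d. with replacement -- an assumption of this sketch."""
    rng = random.Random(seed)
    cats = list(category_samples)
    weights = [proportions[c] for c in cats]
    stream = []
    for _ in range(n):
        c = rng.choices(cats, weights=weights)[0]
        stream.append(rng.choice(category_samples[c]))
    return stream
```

Interleaving the categories in one stream, rather than training on them sequentially, is what guards against catastrophic interference here.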

Describing it in a more formal manner, our approach derives from the following premises:

  • (a)

    Pruning hypothesis. Data pruning may generate more effective training sets.

  • (b)

    Categorical hypothesis. Heterogeneous data can be partitioned into a number of categories so that data can be treated uniformly within each category while the various categories relate in a non-uniform way.

  • (c)

    Ordered training hypothesis. There is a polynomial relation between data size and training time.

  • (d)

    Retention hypothesis. Combining training categories into a single training stream increases retention and fights against catastrophic interference.

Of course, we can dispute, alter, or refine any of those hypotheses. The first part of the categorical hypothesis looks quite suspicious, and we are currently working on a method that treats data within categories in a non-uniform manner. The ordered training hypothesis apparently does not apply to problems that do not belong to P (assuming a network of polynomial size). For example, if P≠NP, it does not apply to any NP-complete problem. Finally, new learning algorithms or network architectures may improve neural network retention and thus weaken the retention hypothesis.

At this point, our primary goal is not to find the most refined and comprehensive set of hypotheses but rather to set up a framework so that we can formalize neural network training with heterogeneous data and derive it from a small and concise set of premises. Modifications or extensions of those premises can then directly suggest improvements in the overall training procedure.

There are, of course, other approaches to the data heterogeneity problem, such as the mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991), which attempts to train different learners for distinct data collections or categories; a posteriori methods like data emphasizing and boosting (Freund & Schapire, 1996); and growing cell structure/neural gas techniques (Fritzke, 1994), which gather local measures during the adaptation process and insert new units into the structure of the learners to adapt to different distributions. Although a posteriori methods challenge the first part of the categorical hypothesis and have some obvious advantages (indeed, we are currently extending our system into a hybrid that would take error rates into account during retraining), we believe that an a priori method, which trains with no prior knowledge of the learner's accuracy level, is necessary in order to obtain the most accurate initial solution. Our results indicate that our a priori methods can improve accuracy significantly for a single learner.

The rest of this paper is organized as follows. The next section describes our pruning procedure. Section 3 contains a formal definition of the ordered training model for single and multi-category data. Section 4 contains the hill-climbing algorithm for partitioning the data into categories while Section 5 contains the experimental results (derived from our Italian data and learner). Section 6 discusses the overall effect of our methods on a multi-learner system and presents some rough estimates. Finally, Section 7 addresses current and future extensions of our work.


Data pruning

A fundamental entity in our ink collection process is the ink file. Users of our collection system transcribe a certain script that is presented to them, and their ink is stored in a separate ink file. The script, and thus the stored ink, consists of a sequence of panels, each containing a sentence that is composed of a sequence of words, where by words we mean dictionary words, e-mail addresses, numbers, dates, etc. Subsequently, the ink words are labeled as bad or good depending on whether

Ordered training

We assume a training data set of size S, where S is expressed in fixed units (e.g. bytes) or units that do not vary significantly (e.g. ink segments). Furthermore, we assume that the set contains m samples and that we train for time T (which corresponds to E epochs, or N iterations, with one sample used per iteration), so that E=N/m. If we define the training (or data processing) speed u to be the amount of data processed in a time unit, i.e. u=SE/T, then it also follows that N=muT/S and E=uT/S.
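These relations can be checked with a small helper; the numeric values below are arbitrary and serve only to verify the algebra.

```python
def epochs_and_iterations(S, m, T, u):
    """Given data size S, sample count m, training time T, and training
    speed u (data processed per time unit, u = S*E/T), return the number
    of epochs E and iterations N via E = u*T/S and N = m*E, as in the
    text."""
    E = u * T / S
    N = m * E
    return E, N
```

For instance, with S=100, m=10, T=50, and u=4, we get E=2 epochs and N=20 iterations, consistent with E=N/m.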

The data partitioning game

The definition of ordered training does not specify how to split the data into categories. Data heterogeneity, on the other hand, may suggest that we split the data into multiple categories so that the derived training times are more accurate. The ultimate criterion should be to maximize the overall accuracy gain.

In our approach, we use data partitioning (that is we do not allow categories to overlap) and we treat the overall problem as an n-player game. We represent each category as a player

Experimental results

We measured the effect of pruning and ordered training on accuracy using one of our main handwriting classifiers, a time-delay neural network with three layers and 150 hidden units. To isolate the effect of each method, we used three different settings. In the plain setting, we merely trained on all the data uniformly (i.e. no pruning or ordered training). In the pruned setting, we pruned the data and then trained uniformly (i.e. no ordered training). Finally, in the ordered setting,

The effect on multi learner systems

In the previous sections, we demonstrated the effect of our methods on the chosen TDNN classifier. However, given that our classifier is a component of a multi-learner system (which includes two other classifiers), we may as well consider the question of the effect of our methods on the overall system.

Conceptually, both pruning and ordered training allow training the various classifiers of a multi-learner system with different data distributions, possibly emphasizing different parts in

Conclusion and future work

We have shown in this paper that data pruning and ordered training form an effective combination that can increase accuracy significantly. The methods derive from a small theory that attempts to formalize the training of classifiers with heterogeneous data. Although more research may be necessary in order to validate the theory unequivocally (and perhaps improve it), we believe that the observed accuracy gains offer significant evidence. Any corrections or refinements of the underlying premises

References (8)

  • B. Fritzke, Growing cell structures - a self-organizing network for unsupervised and supervised learning, Neural Networks (1994)

  • J.A. Drakopoulos et al., Training with heterogeneous data, Proceedings of the International Joint Conference on Neural Networks, July 31–August 4, Montreal, Canada (2005)

  • Y. Freund et al., Experiments with a new boosting algorithm, International Conference on Machine Learning (1996)

  • R.A. Jacobs et al., Adaptive mixtures of local experts, Neural Computation (1991)

1. An abbreviated version of some portions of this article appeared in Drakopoulos and Abdulkader (2005), published under the IEEE copyright.
