Introduction

With the growing use of digital devices and the rapid growth in the number of pages on the World Wide Web, text categorization is a key component in managing information. Text categorization (or, alternatively, text classification) can be defined simply as follows: given a set of documents D and a set of m classes (or labels) C, define a function F that assigns a value from the set C to each document in D. For example, D might consist of the set of all classified advertisements in a newspaper, and C would then be the set of headings in the classified section of that same newspaper. If an element of D (taken from http://classifieds.dailyrecond.com/cv3/dailyrecord) is:

CARPENTRY All home improvements including decks, siding,
wood restoration, waterprfg, roofs, kitch/bath remod
& repairs, etc. 973-328-6876

then it would be assigned the value of Business & Service from the set C.

Given a set of labeled examples, a supervised learning algorithm can be used to classify pieces of text whose category is unknown. The problem with the supervised learning approach to text classification is that often very many labeled examples (or “training examples”) must be used in order for the system to correctly classify new documents. These training examples must be hand-labeled, which might be quite a tedious and expensive process. As a result, the set of training examples might not contain enough data to accurately classify unknown examples.

Our approach to solving the problem of limited training data focuses on using some “related” corpus of data in addition to the labeled training set. The question that we address is as follows: Given a text categorization task, can we find some other data that can be incorporated into the learning process to improve accuracy on test examples while limiting the number of labeled training examples needed? We believe that the answer is most often “yes”. For example, suppose that we wish to classify the names of companies by the industry that they are part of. A company such as Watson Pharmaceuticals Inc would be classified with the label drug, and the company name Wal-Mart would be classified as type retail. Although we may not have numerous training examples, and the training examples are very short, we can find other data that is related to this task. Such data could be articles from the business section of an on-line newspaper or information from company home pages. As a result of the explosion in the amount of digital data that is available, it is often the case that text, databases, or other sources of knowledge that are related to a text classification problem are easily accessible. We term this readily available information “background knowledge”. Some of this background knowledge can be used in a supervised learning situation to improve accuracy rates, while keeping the number of hand-labeled training examples needed to a minimum.

In this paper we present a competitive, efficient approach to text categorization that allows for the incorporation of background knowledge, using the data integration system WHIRL (Cohen, 1998a,b).

Our paper is organized in the following manner. In the next section we review related work. The following section describes the WHIRL system in more detail, and discusses its usefulness for text categorization. We show that its performance as a text classifier is competitive with other known classifiers. The following sections introduce the idea of background knowledge, and describe how it can be elegantly incorporated into WHIRL queries. We present results on numerous data sets to show that WHIRL with background knowledge improves the WHIRL text classification method.

Related work

Machine learning for text classification is an active area of research (Sebastiani, 2002), encompassing a wide variety of different learning algorithms, systems, representations of data, and types of problems. In particular, the method introduced in this paper is concerned with two specific areas of this research, that of using related background information to aid in the text classification task, and that of short text classification problems.

A common problem when using machine learning for text classification is dealing with an insufficient number of training examples to create a classifier that has a reasonably low error rate on unseen test examples. There are a number of approaches that may be taken to aid in the creation of a more accurate classifier.

Many researchers have noted that although it is often the case that there are very few labeled examples, there are often many unlabeled examples readily available (Bennet and Demiriz, 1998; Lewis and Catlett, 1994; Blum and Mitchell, 1998; Nigam et al., 2000; Blum and Chawla, 2001; Goldman and Zhou, 2000; Li and Liu, 2003; Yu et al., 2003; Xu and Schuurmans, 2005). One approach taken by a number of researchers is to choose, in some intelligent way, a small number of additional training examples to be hand-labeled, in the hope that this selection will improve learning. Uncertainty sampling has been used in this way (Lewis and Catlett, 1994), where specific examples are chosen out of a large pool of unlabeled examples to be given to humans to be classified. These hand-labeled examples then become part of the training corpus. The examples are chosen based upon the inability of the current classifier to label them with high confidence. In this way fewer examples must be given to an expert to be labeled than if the examples were simply randomly sampled.

Even if we do not wish to give these unlabeled examples to experts to label, they can be used in various ways and in conjunction with a variety of classifiers to improve classification. Work using naive Bayes text classifiers uses the labeled training examples to assign probabilistic classes to many unlabeled examples. These newly classified examples are then used in the classifier creation process (Nigam et al., 2000; Zhang and Oles, 2000). Unlabeled examples have also been used to create new features for the set of labeled examples. In this way the set of labeled examples is enhanced by the information provided in the set of unlabeled examples (Szummer and Jaakkola, 2001). If a small set of training examples can be re-expressed using different views (Blum and Mitchell, 1998; Nigam and Ghani, 2000; Collins and Singer, 1999; Ghani, 2002), or if two distinct learning algorithms can be used on the same small set of training data (Goldman and Zhou, 2000), the combination of the labeled and unlabeled sets can be used to achieve a highly accurate classifier. Unlabeled examples have been used in conjunction with positive examples to improve classification (Li and Liu, 2003; Yu et al., 2003). Empirically it has been shown that combining labeled and unlabeled examples can improve accuracy on unseen test sets in certain situations (Blum and Mitchell, 1998; Mitchell, 1999; Joachims, 1999; Nigam et al., 2000; Jaakkola et al., 1999; Zelikovitz and Hirsh, 2001; Bruce, 2001; Zhang and Oles, 2000).

Our approach to solving the problem of limited training data is different in some important aspects from those described above. We do not assume the existence of unlabeled examples. Rather, our system uses a corpus of related information or background knowledge in conjunction with the training set to find and combine those training examples that can best classify a new test instance.

There are other methods of compensating for few or incorrect training examples that are related to the work in this paper (Zhu, 2005). Graph-based semi-supervised learning methods first construct a graph in which the nodes are examples, and the edges represent similarities between examples. Learning is then constrained to produce a hypothesis that gives the same labels to examples that are nearby in the graph. A number of methods of this sort have been investigated (Blum and Chawla, 2001; Zhu et al., 2004).

There have been studies of the incorporation of domain knowledge by the selection (Liu and Yu, 2005), creation, or reweighting of features using related information such as ontologies (Gabrilovich and Markovitch, 2005) or user feedback (Raghavan et al., 2005). Our approach differs from the view of these methods in that it does not deal directly with the features of training or test examples. Domain knowledge has also been incorporated into text classifiers by modification of the classifiers to include priors (Wu and Srihari, 2004; Dayanik et al., 2006; Schapire et al., 2002). There has also been work done using query-expansion type techniques to incorporate additional knowledge into text classifiers (Sahami and Heilman, 2006). The relationship of our method to query expansion is discussed in detail in Section 4.3.

Background knowledge is particularly useful for short text classification problems. Short text classification is a challenging type of classification because very little information (i.e. few words) is known for each example that is to be classified. Since short text examples tend to share few terms, it is particularly difficult to classify new instances, and direct comparisons between texts often yield no useful results. An example of short text classification that has been receiving interest lately is the web query (Sarawagi, 2005). Different approaches have been taken in these short text classification tasks to associate longer, related text with each example, by using web searches, synonyms, and statistical methods (Sahami and Heilman, 2006; Shen et al., 2005; Vogel et al., 2005).

Tools and datasets

WHIRL queries

WHIRL (Word-based Heterogeneous Information Retrieval Language) is a special type of database management system that allows for the manipulation of textual data. Let us consider the problem of exploratory analysis of data obtained from the Internet. Assuming that one has already narrowed the set of available information sources to a manageable size, and developed some sort of automatic procedure for extracting data from the sites of interest, what remains is still a difficult problem. There are several operations that one would like to support on the resulting data, including standard relational database management system (DBMS) operations, since some information is in the form of data; text retrieval, since some information is in the form of text; and text categorization. Since data can be collected from many sources, one must also support the (non-standard) DBMS operation of data integration (Monge and Elkan, 1997; Hernandez and Stolfo, 1995). As an example of data integration, suppose we have one relation \(Place(univ,state)\) containing university names together with the state in which they are located, and one relation \(Job(univ,dept)\) listing university departments that are hiring, and suppose further we are interested in job openings located in a particular state. Normally a user could join the two relations to answer this query. However, if the relations have been extracted from two different and unrelated Web sites, the same university may be referred to as “Rutgers University” in one relation, and “Rutgers, the State University of New Jersey” in another. In this case, some sort of key normalization or data cleaning will be necessary before the relations can be joined. Because of problems like this, integration of heterogeneous databases is currently an active topic of research.

WHIRL is a conventional DBMS that has been extended to include special mechanisms for manipulating text-valued fields. In particular, WHIRL uses statistical similarity metrics developed in the information retrieval community to compare and reason about the similarity of pieces of text. These similarity metrics can be used to implement “similarity joins”, an extension of regular joins in which tuples are constructed based on the similarity of field values, rather than on equality of values. The constructed tuples are then presented to the user in an ordered list, with tuples containing the most similar pairs of fields coming first.

As an example, the relations described above could be integrated using the query

$$(Q_1)\quad \texttt{?- Place(univ1,state)} \wedge \texttt{Job(univ2,dept)} \wedge \texttt{univ1} \sim \texttt{univ2}$$

The symbol ∼ specifies that Place.univ and Job.univ are “similar”. The result would be a table with columns labeled \(univ1, state, univ2\), and \(dept\), and rows sorted according to the similarity of \(univ1\) and \(univ2\). Thus rows in which \(univ1 = univ2\) (i.e., the rows that would be in a standard equijoin of the two relations) will appear first in this table, followed by rows for which \(univ1\) and \(univ2\) are similar, but not identical (such as the pair “Rutgers University” and “Rutgers, the State University of New Jersey”).
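To make the behavior of a similarity join concrete, the following Python sketch approximates it with TF-IDF cosine similarity (the score WHIRL attaches to the ∼ condition). It is only an illustration, not the WHIRL implementation; the toy relations follow the Place/Job example above.

```python
# Illustrative sketch of a similarity join over two text-valued columns,
# approximated with TF-IDF cosine similarity.  Not the WHIRL engine itself.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

place = [("Rutgers, the State University of New Jersey", "NJ"),
         ("Princeton University", "NJ")]
job = [("Rutgers University", "Computer Science"),
       ("Princeton University", "Physics")]

# Fit one vocabulary over all university names so the vectors are comparable.
vec = TfidfVectorizer().fit([u for u, _ in place] + [u for u, _ in job])
p_vecs = vec.transform([u for u, _ in place])
j_vecs = vec.transform([u for u, _ in job])
sims = cosine_similarity(p_vecs, j_vecs)   # sims[i, j] = sim(place_i, job_j)

# Joined tuples <univ1, state, univ2, dept, score>, most similar pairs first.
rows = [(place[i][0], place[i][1], job[j][0], job[j][1], sims[i, j])
        for i in range(len(place)) for j in range(len(job))]
rows.sort(key=lambda r: r[-1], reverse=True)
for row in rows:
    print(row)
```

The exact-match pair appears first with a score of one, followed by the partially matching “Rutgers” pair, which is the ordering described above.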

The major motivation for the development of similarity joins was for data integration; unlike normal joins, similarity joins can be performed across heterogeneous data sources without special data-cleaning operations. However, WHIRL can also be used for other purposes, including traditional text retrieval tasks, and traditional relational data operations.1 WHIRL thus provides a uniform approach to database querying, text retrieval, and data integration.

One important property of WHIRL queries is that the highest-scoring answers to a query can be found relatively quickly. (To do this, WHIRL uses a special-purpose algorithm that uses inverted indices and a variant of A* search (Cohen, 2000).) WHIRL thus provides an efficient mechanism for finding and propagating certain types of similarities between documents. One motivation of the work described in this paper is to explore ways in which this capability can be used in new contexts, beyond data integration and querying.

One example of such a novel use is for text categorization. To use WHIRL for text categorization, first note that one simple form of classification can be implemented with a conventional database management system. Assume we are given training data in the form of a relation \(Train(instance,label)\) associating instances (field instance) with labels (field label) from some fixed set. (For example, instances might be animal Web page titles, and labels might be animal categories from the set {dog, cat, horse, rodent, primate, cow, bird}.) To classify an unlabeled object X using a DBMS one might store X as the only element of a relation \(Test(instance)\) and use the query:

$$(Q_2)\quad \texttt{?- Train(instance1,label)} \wedge \texttt{Test(instance2)} \wedge \texttt{instance1} = \texttt{instance2}$$

In a conventional DBMS this retrieves the correct label for any X that has been stored in the training set \(Train\); thus the query implements rote learning.

WHIRL's ability to perform similarity joins suggests the following extension of this classification method. If one replaces the equality condition of \(Q_2\) with the corresponding similarity condition, the resulting query will be:

$$(Q_3)\quad \texttt{?- Train(instance1,label)} \wedge \texttt{Test(instance2)} \wedge \texttt{instance1} \sim \texttt{instance2}$$

This query finds training instances Y that are similar to X, and then associates their labels with X. In this way, this set of similar training instances can then be used to classify X.

Readers familiar with databases will recognize the clause above as a “soft” variant of the natural join of the \(Train\) and \(Test\) relations.

We can extend the soft join query \(Q_3\) so that the result is returned in a table with two fields, instance and label, as follows:

$$ \displaylines{ \texttt{Result(instance2,label) ?- Train(instance1,label)} \wedge \texttt{Test(instance2)}\cr \qquad \wedge\ \texttt{instance1} \sim \texttt{instance2} }$$

The result of the query will be a table of \(\langle X, L_i \rangle\) tuples.

Every tuple in this table associates a label \(L_i\) with \(X\). Each \(L_i\) appears in the table because there were some instances \(Y_{i,1}\),…,\(Y_{i,n}\) in the training data that had label \(L_i\) and were similar to \(X\). As we will describe below, every tuple in the table above also has an associated “score”, which depends on the number of these \(Y_{i,j}\)'s and their similarities to \(X\). This table can be viewed as a set of tentative classifications for X, obtained using a sort of nearest neighbor classification algorithm.

Given the query above and a user-specified parameter K, WHIRL will first find the set of K tuples \(\langle X_i, Y_j, L_j \rangle\) from the Cartesian product of \(Test\) and \(Train\) such that the similarity between \(X_i\) and \(Y_j\) is largest (where \(L_j\) is \(Y_j\)'s label in \(Train\)). We used K = 30, based on exploratory analysis of a small number of datasets, plus the experience of Yang and Chute (1994) with a related classifier, which we use for comparison with our WHIRL classifier below. Thus, for example, if \(Test\) contains a single element \(X\), then the resulting set of tuples corresponds directly to the K nearest neighbors of \(X\), plus their associated labels. Each of these tuples has a numeric score; in this case the score of \(\langle X_i, Y_j, L_j \rangle\) is simply the cosine similarity of \(X_i\) and \(Y_j\). An example of a table returned by WHIRL can be seen in Table 1. Here the columns correspond to the test example, the training example that is close to the test example, the label of that training example, and the score.

Table 1 Example of a selection as done by WHIRL

The next step for WHIRL is to select the first and third columns of the table of \(\langle X_i, Y_j, L_j \rangle\) tuples. In this projection step, tuples with equal \(X_i\) and \(L_j\) but different \(Y_j\) are combined. If \(p_1,\ldots,p_n\) are the scores of the tuples \(\langle X, Y_1, L \rangle, \ldots, \langle X, Y_n, L \rangle\), the n intermediate result tuples that contain both X and L, then the noisy-or operation computes the score:

$$ \mathrm{score}(\langle X, L \rangle) = 1 - \prod_{j=1}^{n} (1-p_j) \qquad (1)$$
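As a small worked example with hypothetical scores (not taken from our experiments): if a label \(L\) is supported by two intermediate tuples with \(p_1 = 0.6\) and \(p_2 = 0.5\), then

$$ \mathrm{score}(\langle X, L \rangle) = 1 - (1-0.6)(1-0.5) = 1 - 0.2 = 0.8,$$

so a label supported by several moderately similar neighbors can outscore a label supported by a single, only slightly more similar neighbor.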

An example of a projection table that uses the score table that is shown in Table 1 can be seen in Table 2. In this table, there is only one line per label, and the score for each label is the combination of scores from the larger table.

Table 2 Example of a projection as done by WHIRL
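To make the selection and projection steps concrete, here is a minimal Python sketch of the classifier just described. It is an illustrative approximation (using scikit-learn's TF-IDF vectors for the cosine similarity), not the WHIRL engine itself, and the toy training relation is hypothetical.

```python
# Minimal sketch of the WHIRL-style nearest-neighbor classification above:
# rank training instances by TF-IDF cosine similarity to the test instance,
# keep the top K tuples, and combine scores that share a label with the
# noisy-or of Eq. (1).  Not the actual WHIRL engine (which uses inverted
# indices and A* search).
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def whirl_nn(train, test_instance, K=30):
    """train: list of (instance_text, label).  Returns (best_label, scores)."""
    texts = [t for t, _ in train]
    vec = TfidfVectorizer().fit(texts + [test_instance])
    sims = cosine_similarity(vec.transform([test_instance]),
                             vec.transform(texts))[0]
    # Top-K tuples <X, Y_j, L_j>, scored by p_j = cos(X, Y_j).
    top = sorted(range(len(train)), key=lambda j: sims[j], reverse=True)[:K]
    # Projection onto <X, L>: noisy-or the scores that share a label (Eq. 1).
    prod = defaultdict(lambda: 1.0)
    for j in top:
        prod[train[j][1]] *= 1.0 - sims[j]
    scores = {label: 1.0 - p for label, p in prod.items() if p < 1.0}
    if not scores:              # no training instance shares a term with X
        return None, {}
    return max(scores, key=scores.get), scores

# Hypothetical toy data in the spirit of the NetVet titles.
train = [("mad cow disease update", "cow"),
         ("dog obedience training tips", "dog"),
         ("cat health and nutrition", "cat")]
print(whirl_nn(train, "Mad Cow Panic Goes Beyond Europe", K=3))
```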

Comparing WHIRL to other text categorization systems

We have tested the systems that we present with numerous different text-categorization problems, which vary in many of their characteristics, as well as in the nature of the background knowledge that is used to aid the task.2 We give a full description of the data sets in Appendix A. Table 3 gives a one line summary of each of the data sets that we have used. Sources of the data sets can be found in the appendix.

Table 3 Statistics of the different data sets

For the first four data sets in the table, a separate test set that is disjoint from the training set was used. In these cases, we ran up to 10 trials on training sets of different sizes, presenting accuracy results on the test set. For the remaining five data sets, we used five-fold cross validation to obtain accuracy results. We divided each dataset into five disjoint partitions, using four partitions each time for training and the fifth partition for testing. The results that we present are averages across all five runs. For some further experiments described in Section 4.4, we kept the test set for each cross-validation trial fixed and varied the number of training examples used for classification in the following manner. We used 100% of the training examples for each of the five runs, then only 80% of the training examples for each of the cross-validation trials, then 60%, 40%, and 20%. In this way we were able to see how accuracy changed when fewer training examples were used. The results that we present for each percentage of data are also averages across five runs.
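One way to read this protocol is sketched below. The `evaluate` callback is hypothetical (for example, train a classifier on the subset and return its accuracy on the held-out fold); the fractions follow the 100% to 20% schedule just described.

```python
# Sketch of the cross-validation protocol with varying training fractions.
import random

def cv_with_fractions(data, evaluate, folds=5,
                      fractions=(1.0, 0.8, 0.6, 0.4, 0.2)):
    data = list(data)
    random.shuffle(data)
    parts = [data[i::folds] for i in range(folds)]       # disjoint partitions
    results = {f: [] for f in fractions}
    for i in range(folds):
        test_fold = parts[i]
        train_pool = [x for j, part in enumerate(parts) if j != i for x in part]
        for f in fractions:
            subset = train_pool[:int(f * len(train_pool))]
            results[f].append(evaluate(subset, test_fold))
    # Average accuracy across the five runs for each training fraction.
    return {f: sum(accs) / len(accs) for f, accs in results.items()}
```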

For each of the data sets we compared WHIRL to several baseline learning algorithms, with multiple runs on the largest training set size that we used in our experiments. We used the rainbow package (http://www.cs.cmu.edu/~mccallum/bow/rainbow/) to obtain results of naive Bayes on each of these data sets. We also report results using support vector machines on these multi-class problems, using SVMstruct (Tsochantaridis et al., 2004) (http://svmlight.joachims.org/svm_struct.html). In addition, we used several nearest-neighbor methods as baselines. 1-NN simply finds the nearest item in the training-set table (using the vector-space cosine similarity measure) and gives the test item that training item's label. We also used Yang's distance-weighted k-NN method (Yang and Chute, 1994), hereafter called K-NN(sum). This is closely related to WHIRL, but uses a different method to combine the score of the K nearest neighbors; K-NN(sum) uses the label L that maximizes \(\sum_j p_{j}\), where the \(p_j\) are as in Eq. (1) in Section 3.1. Finally, K-NN(maj) is like Yang's method, but picks the label L that appears most frequently among the K nearest neighbors.
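The three neighbor-combination rules mentioned above differ only in how they aggregate the same K nearest neighbors. The sketch below contrasts them on a hypothetical neighbor list (the class names are made up); it is illustrative only, since the actual baselines were run with their own tooling (rainbow, SVMstruct).

```python
# Contrast of the neighbor-combination rules: noisy-or (WHIRL-nn),
# similarity sum (K-NN(sum)), and majority vote (K-NN(maj)).
from collections import Counter, defaultdict

neighbors = [("astro", 0.30), ("astro", 0.25), ("condensed", 0.40)]  # hypothetical

def noisy_or(nbrs):                       # WHIRL-nn, Eq. (1)
    prod = defaultdict(lambda: 1.0)
    for label, p in nbrs:
        prod[label] *= 1.0 - p
    return max(prod, key=lambda l: 1.0 - prod[l])

def knn_sum(nbrs):                        # K-NN(sum), Yang and Chute (1994)
    sums = defaultdict(float)
    for label, p in nbrs:
        sums[label] += p
    return max(sums, key=sums.get)

def knn_majority(nbrs):                   # K-NN(maj)
    return Counter(label for label, _ in nbrs).most_common(1)[0][0]

print(noisy_or(neighbors), knn_sum(neighbors), knn_majority(neighbors))
```

On this toy list all three rules pick the label supported by two neighbors, while plain 1-NN would pick the single most similar neighbor's label.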

Table 4 Comparing the accuracy results of different learning algorithms

The results of these methods are shown in Table 4. The table shows the accuracy for WHIRL and each of our baseline methods. As can be seen from this table, the WHIRL nearest-neighbor method, which we will refer to as WHIRL-nn, is comparable to the other methods. Thus any improvements beyond WHIRL-nn that we report below represent gains over an already credible, state-of-the-art classifier.

Incorporating background knowledge with WHIRL

What is background knowledge?

Suppose that you were maintaining a Web site about veterinary issues and resources. One of your design decisions has been to divide Web pages according to the animals that they are about (as in the Netvet data set used in the chart above). You might have a site devoted to primates, one for cows, one for dogs, etc., each of which has a list of Web page titles that are associated with that particular animal. In this way, a person browsing your site would have access to many different pages on her topic of interest. A text classification problem related to this task might be placing Web page titles in the appropriate list. For example, the Web page entitled “Mad Cow Panic Goes Beyond Europe” (http://archive.nandotimes.com/newsroom/nt/morecow.html), would clearly fit under the category “cow”.

Let us formulate this classification task as a supervised learning problem. Given a list of Web page titles that have been hand classified (these are the training examples), we wish to create a classifier that will automatically classify new titles. If there are not many training examples, this problem becomes very difficult. This is because the training examples are quite short and do not contain very many informative words. It is often the case that a new test example contains words that do not occur in the training examples at all. An automatic classifier based solely on the given set of training examples cannot use these new words in its classification decisions.

However, we can assume that the World Wide Web contains much information about all the topics and classes that we are dealing with. We can use some of this information as background knowledge for the task. An example of a piece of background knowledge for this task is in Fig. 1. This Web page, which advertises a pillow for pets, is clearly related to our text classification problem. However, it is important to note that it does not fit clearly into any one of our predefined categories. What, then, can a piece of background knowledge such as this one add to the text classification task? Background knowledge can give us information about the co-occurrences of words, as well as the frequency of different words in the domain. For example, from the Web page discussed above, we might learn that the word “pet” is often used in conjunction with “cat” or “dog”, but not with “primate” or “cow”. The background knowledge can also enhance the sometimes meager vocabulary of the domain that has been created by using only the training examples. Especially when there are very few training examples, the vocabulary created from the training examples is small as well, and the chance that test examples will contain words that have not been seen by the learning algorithm, and are hence not part of the model, is very high. In this case the background knowledge enriches the domain model. Depending upon the supervised learning algorithm that is used to classify new test instances, the information gleaned from the background knowledge may be used to improve classification accuracy.

Fig. 1 An example of background knowledge for the NetVet classification problem

WHIRL with background knowledge

A WHIRL query can have an arbitrary conjunction on its right hand side. The final score for a returned tuple is the product of all the scores of all the individual components. In this way, each of these component scores can be viewed as independent probabilities that are combined to produce a final probability that the returned tuple accurately answers the query. This gives a great amount of flexibility in the formulation of queries and the addition of alternative tables, or sources of knowledge, into the queries.

Such alternative sources of knowledge may provide us with a corpus of text that contains information both about importance of words (i.e. in terms of their total frequency and document frequency values in this large corpus), and joint probability of words (i.e. what percentage of the time do two words coexist in a document?). This gives us a large context in which to test the similarity of a training example with a new test example. We can use this context in conjunction with the training examples to label a new example.

A concrete example of the usefulness of adding background knowledge to the WHIRL text classification system can be seen in the task of assigning topic labels to technical papers. Assuming a supervised machine learning model, we are given a corpus of titles of papers, each with an associated label. The only information available to the learner is contained in this labeled corpus, which might be insufficient or incomplete. For example, in labeling the title of a physics article with its sub-specialty, any title containing a word such as galaxy should easily be classified correctly as an astrophysics paper, even if there are few training articles in that domain. This is the case because galaxy is an extremely common word that appears quite often in papers about astrophysics. However, an article on a more unusual topic, such as old white dwarfs, would only be classified correctly if a title with these words appears in the labeled training examples. Although the training set does not contain the words old white dwarfs in our experimental data, our system is able to correctly classify a title with these words as astrophysics, by utilizing a corpus of unlabeled paper abstracts from the same field, which is naturally available on the Web. In our second-order approach, our system finds those unlabeled paper abstracts that are most similar both to old white dwarfs and to various training titles. These training titles are then used to classify old white dwarfs correctly, although each of these titles is quite dissimilar to it when compared directly.

Because of WHIRL's expressive language, and the ability to create conjunctive queries simply by adding conditions to a query, WHIRL's queries for text classification can be expanded to allow for the use of background knowledge on a subject. In the example of the classification of physics paper titles discussed earlier, suppose that we had a fairly small set of labeled paper titles, and also a very large set of unlabeled titles, papers or abstracts (or Web pages resulting from a search), in a relation called Background with a single field, value. We can create the following query for classification:

$$\displaylines{ \texttt{Result(instance2,label) ?- Train(instance1,label)} \wedge \texttt{Test(instance2)} \cr \wedge\ \texttt{Background(value)} \wedge \texttt{instance1} \sim \texttt{value} \wedge \texttt{instance2} \sim \texttt{value} }$$

Given a query of this form, WHIRL will first find the set of K tuples \(\langle X_i, Y_j, Z_k, L_j \rangle\) from the Cartesian product of \(Train\), \(Test\), and \(Background\) such that the product of the two similarity scores is maximal. Here each of the two similarity comparisons in the query computes a score, and WHIRL multiplies them together to obtain the final score for each tuple in the intermediate-results table. The intermediate results table has the elements from all three of the tables that are in the query, plus the score:

〈Test.instance,Train.label,Train.instance,Background.value, score〉.

This table is then projected onto the \(instance\) and \(label\) fields as discussed before. Whichever label gives the highest score is returned as the label for the test example.
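The following sketch makes the bridging computation explicit: each candidate tuple's score is the product of the training-to-background and test-to-background cosine similarities, the top K tuples are kept, and the per-label scores are combined with the noisy-or of Eq. (1). As before, this is an illustrative Python approximation, not the WHIRL engine.

```python
# Illustrative sketch of the conjunctive background query: every
# (training instance, background piece) pair is a candidate "bridge" whose
# score is sim(instance1, value) * sim(instance2, value).
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def whirl_bg(train, background, test_instance, K=30):
    """train: list of (text, label); background: list of texts."""
    vec = TfidfVectorizer().fit([t for t, _ in train] + background
                                + [test_instance])
    tr = vec.transform([t for t, _ in train])
    bg = vec.transform(background)
    te = vec.transform([test_instance])
    sim_tr_bg = cosine_similarity(tr, bg)      # sim(instance1, value)
    sim_te_bg = cosine_similarity(te, bg)[0]   # sim(instance2, value)
    bridges = [(train[i][1], sim_tr_bg[i, k] * sim_te_bg[k])
               for i in range(len(train)) for k in range(len(background))]
    top = sorted(bridges, key=lambda b: b[1], reverse=True)[:K]
    # Projection onto <instance, label>: noisy-or by label (Eq. 1).
    prod = defaultdict(lambda: 1.0)
    for label, p in top:
        prod[label] *= 1.0 - p
    scores = {label: 1.0 - p for label, p in prod.items() if p < 1.0}
    if not scores:           # no bridge links the test instance to Train
        return None, {}
    return max(scores, key=scores.get), scores
```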

One way of thinking about this is that rather than trying to connect a test example directly with each training example, it instead tries to bridge them through the use of an element of the background table. Note that WHIRL combines the scores of tuples generated from different matches to the background table. A schematic view of this can be seen in Fig. 2. The rectangles in Fig. 2 represent individual documents in each of the corpora. If a line connects two documents then it represents the fact that those two documents are highly similar using the cosine similarity metric. If a path can be followed from the test document to a training document via a piece of background knowledge then those two documents are considered to be similar in our scheme. Our use of WHIRL in this fashion thus essentially conducts a search for a set of items in the background knowledge that are close neighbors of the test example, provided that there exists a training example that is a neighbor of the background knowledge as well. As can be seen from Fig. 2, training examples can be used multiple times with different background knowledge and a piece of background knowledge can be used multiple times as well, with different training examples. Training neighbors of a test example are defined differently when background knowledge is incorporated. If words in a test example are found in some background knowledge, then other words that are in that background knowledge can connect this test example to dissimilar (in terms of word overlap and direct cosine difference) training examples. The final classification thus integrates information from multiple training examples and the multiple “bridge” examples that lie between them in the background text.

Fig. 2 Schematic view of WHIRL with background knowledge

Note that this approach does not concern itself with which class (if any!) a background item belongs to. A background instance that is close to numerous training instances can be included more than once in the table returned by the WHIRL query—even if the training examples that it is close to have different classes. Similarly, a training example can also be included in the table multiple times, if it is close to numerous background instances.

Suppose that our classification task consists of labeling the first few words of a news article with a topic. If a test example belongs to the category sports, for instance, the cosine distance between the few words in the test example and each of the small number of training examples might be large. However, given a large corpus of unlabeled news articles, it is likely that there will be one or more articles that contain both the few words of the test example and the words of one of the training examples.

To make our classification system more robust, if the background query that we presented does not provide a classification label for a given test example, we then allow the simple text classification query (WHIRL-nn) to attempt to classify the test example. If this query fails as well, then the majority class is chosen as a final attempt at classification. Consider the test example in the NetVet domain:

Steller's eider (USFW)

USFW is an acronym for the U.S. Fish and Wildlife Service, which is not referred to in this way in the background corpus. The word eider, which is a type of duck, does not appear in the background corpus either. The WHIRL query with background knowledge therefore does not return anything. However, the training set contains the example:

Migratory Birds and Waterfowl – USFWS

so once WHIRL-nn is used, the correct class is returned.

We term this method of using background knowledge for the text classification task, WHIRL-bg.
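A minimal sketch of this fallback cascade, reusing the whirl_bg and whirl_nn sketches given earlier (which return None when no label can be assigned), might look as follows; the cascade order follows the description above.

```python
# Fallback cascade: try the background query first, then plain WHIRL-nn,
# and finally the majority class of the training set.
from collections import Counter

def classify(train, background, test_instance, K=30):
    label, _ = whirl_bg(train, background, test_instance, K)
    if label is None:
        label, _ = whirl_nn(train, test_instance, K)
    if label is None:
        label = Counter(l for _, l in train).most_common(1)[0][0]
    return label
```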

WHIRL-bg and the nature of background knowledge

In a sense, WHIRL-bg can be looked at as implementing a type of query expansion. In this context, we view the test example X as the original query. As opposed to traditional query expansion (Buckley et al., 1994), which adds new words to a query, WHIRL-bg replaces the actual query X with other, longer pieces of text, and associates a weight of importance with each of these new, longer queries.

Given a test example, X, and a set of background knowledge, \(BG_1,BG_2,{\ldots}BG_n\), a weighted query is formed from each element in this background set, where the query is the piece of background knowledge and the weight given to a query \(BG_i\) is the cosine similarity of \(X\) and \(BG_i\). Each of these new weighted queries is then used independently to find those training examples that will be used as nearest neighbors to \(X\) and combined for classification. The higher the weight of the given \(BG_i\), the more reliable that query is, and returned results are weighted based upon the reliability of the query.
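The same bridge computation, phrased as weighted query expansion, is sketched below; each background piece becomes a query against the training set, weighted by its similarity to the test example. The `vectorizer` argument is assumed to be a TF-IDF vectorizer already fitted on the relevant texts (this is an illustration, not part of WHIRL).

```python
# Form weighted expansion queries from the background set: query BG_i gets
# weight sim(X, BG_i), and its matches against Train are scaled by that
# weight before the noisy-or combination.
from sklearn.metrics.pairwise import cosine_similarity

def expanded_queries(background, test_instance, vectorizer):
    weights = cosine_similarity(vectorizer.transform([test_instance]),
                                vectorizer.transform(background))[0]
    # (query_text, weight) pairs, most reliable queries first.
    return sorted(zip(background, weights), key=lambda q: q[1], reverse=True)
```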

The strength of WHIRL-bg is in the combination of both the weight of the new query and the similarity of the returned training examples to this new query. Clearly, if a piece of background knowledge \(BG_i\) is exactly the same as the test example \(X\), the new query \(BG_i\) will have a weight of 1, and will be considered highly reliable. This makes sense, for we are sure that this query is related to that particular test example, and all of its words are relevant. However, given that the training examples are short text, although the query reliability is 1, the cosine similarities to the nearest neighbors in the training set tend to be low. A new query with a slightly lower weight that is very close to training examples will be more useful.

In general, a test example X that is correctly classified by WHIRL-bg but misclassified by WHIRL-nn often does not share any meaningful words with the training set. Our data sets are primarily short-text classification tasks, so this occurrence may be common. WHIRL-nn therefore returns nearest neighbors that are not of the same class as X. However, X may share domain-related words with a few or even many background pieces. This allows multiple queries to be formed from the background set that contain domain-related words and that find neighbors in the training set of the correct class. An example of this from the physics paper title domain is the title “Quasienergy Spectroscopy of Excitons.” This example, of class condensed materials, shares the word spectroscopy with many training set examples, all of which are in the wrong class. However, the word excitons occurs only once in the training set, but 13 times in the background set in pieces related to condensed materials, and this allows the correct classification once those background examples are used as queries into the training set.

Ideally we would like to be assured that the vocabulary of the background knowledge overlaps to a great extent with the training and test sets. It is important to note that the closer the set of background knowledge is to the classification task, the more useful this method will be. The most useful background sets are those where each piece of background knowledge is strongly connected to only one class, and where all classes are represented in a number of background pieces. Pieces of background knowledge that are related to only one class, and that contain words from particular training examples as well as many other words relevant to only that class, would be most useful in this system. With this understanding, simply using unlabeled examples would be ideal: each fits into a class and contains domain-related words. However, since we are dealing with short text, cosine similarity measures are often low, and we are forced to provide some other type of background knowledge that conforms at least partly with these criteria.

Background pieces that are made up of words that come from more than one class can connect a test example to training examples of many different classes, causing misclassification. If the background text does not contain information about one or more classes, this too will cause training examples from many classes to be returned.

As an illustration of this possible occurrence, we can look at the Business name data. One of the 124 classes is school, and the test and training examples that are names of universities are classified as school. WHIRL without background knowledge has no problem correctly classifying names of universities as class school, and this class has very low error. However, our background knowledge, which is taken from another business site, includes no universities. This causes a problem for WHIRL-bg in its attempt to classify test examples of class school, because the system is forced to use pieces of background knowledge that do not really match these test examples well. Hence, many of the test examples of this class are misclassified.

Results for WHIRL-nn and WHIRL-bg

The results that we present validate what many other researchers (Nigam et al., 2000) have found: unlabeled data or background knowledge is most useful when there is little training data. As the number of examples in the training sets increases, background knowledge does not provide the same advantage. The accuracy results for the 20 Newsgroups data, the WebKb data, the advertisements data, and the ASRS data are graphed in Figs. 3–6. The y axis represents the accuracy rate, and the x axis is the number of labeled examples given to the learner. In the first three of these graphs the same phenomenon is apparent: with few labeled examples per class the background knowledge is very helpful, but as the number of labeled examples increases, the usefulness of the background knowledge goes down, and it can even cause accuracy to degrade. In the fourth set, background knowledge helps most when there are more than 2 examples per class, because the very small training sets sometimes do not provide enough information (in terms of vocabulary in each class) to utilize the background knowledge fully.

Fig. 3 20 Newsgroups

Fig. 4 WebKb

Fig. 5 Advertisements problem

Fig. 6 ASRS problem

We present the results on the physics data in Fig. 7. Figure 7 clearly shows the effect that background knowledge can have on text data sets. The line representing WHIRL-bg remains almost horizontal as fewer training examples were used, indicating that the background knowledge compensated for the lack of data. In contrast, WHIRL-nn degraded as fewer training examples were used. The helpfulness of the background knowledge, therefore, also increased as fewer training examples were used.

Fig. 7 2-class physics title problem

Results for the NetVet domain are graphed in Fig. 8. Reductions in error rate were largest on the smallest training sets. The NetVet domain is unlike some of the other sets previously discussed in that there was overlap in topics in the background knowledge. A Web page that could be useful in classifying a test example as belonging to the category of dogs was quite likely to discuss cats, and vice versa. Some of the training and test examples, too, could have caused confusion. There were titles of Web pages on pet stores or animal care that were placed under one topic, but could just as easily have been placed in many other categories. We therefore were not surprised to see that the error rate did not decrease by a large percentage.

Fig. 8 NetVet problem

The results for the business name data set are graphed in Fig. 9. Once again, WHIRL-bg outperformed WHIRL-nn. Using 100 percent of the data, the decrease in error rate is substantial. However, when the percentage of training examples that was used is lower, the difference in error rate between the two systems is not much changed. This is unlike the results of the previous three domains. This might have been due to the fact that the training and test examples were company names, which often consisted of words that occurred only once (for example, Xerox) so that reducing the number of training examples actually reduced the dictionary of words in the training corpus substantially. There were therefore fewer words that could be used to find bridges in the background knowledge, so even though there was less training data, the background knowledge did not compensate for it. This same trend can be seen in Fig. 10, which graphs accuracy on the thesaurus data. In this graph the difference between the accuracy of WHIRL-nn and WHIRL-bg was reduced as the training data was reduced. Here too, the labeled data consists of single words, so with only 20% of the data, the usefulness of background knowledge is not as apparent as when more labeled data is added, although there are substantial improvements in accuracy rates.

Fig. 9 Business name problem

Fig. 10 Thesaurus problem

Results for the Clarinet news problem are in Fig. 11. The addition of background knowledge was useful when the training set size was small. When less than 60% of the data was used, background knowledge reduced the error rate. As the amount of training data increased the background knowledge no longer added enough new vocabulary to improve performance.

Fig. 11 Clarinet problem

We wished to determine whether the difference in accuracy obtained using WHIRL vs using WHIRL-bg is statistically significant. From Figs. 7–10 we can obtain five pairs of numbers. For each x value that is plotted on all of these graphs, the corresponding two y values create a pair of numbers. The first number in each pair represents the accuracy of running WHIRL while the second number is the accuracy of running WHIRL-bg on the same set. We used a paired t-test for each of these x values to see if addition of background knowledge caused a significant improvement in the accuracies. For x = 20, x = 40, and x = 60 improvements were significant. Hence we were able to conclude that improvements in accuracy are expected if background knowledge is added to the WHIRL query for smaller data sets. Figure 12 compares the accuracy of WHIRL-nn and WHIRL-bg on the smallest training sets. Points above the line y = x show those data sets that have improved accuracy with background knowledge. Figure 13 compares the accuracy of WHIRL-nn and WHIRL-bg when the complete training set for each data set was used.

Fig. 12 Small training set

Fig. 13 Full training set

Short text classification problems

There is another interesting point that has not yet been discussed, on the issue of when background knowledge could be most useful. As opposed to our discussion above, let us assume that we have a large number of training and test examples. However, if each training and test example itself consists of a very short string, co-occurrences of words may still not be learned properly and the vocabulary of the training set can still be small. This can cause the same problems as an insufficient number of training examples. We can intuitively understand that classification problems where the training and test data consist of short text strings will benefit most from the addition of background knowledge. As an example, we can look at the Business name data introduced earlier. Examples of training and test strings are:

$$\displaylines{ {\mbox{\tt Class BROADCAST: ABC Inc.}} \cr {\mbox{\tt Class DRUG: Watson Pharmaceutical, Inc.}} }$$

It is clear from the type of examples in the training and test set that in this particular domain often the training data will be insufficient to classify a new test example. When data points in a text classification problem consist of only a few words each, it becomes hard for a learning system to obtain accurate counts of co-occurrences of words, as well as a complete vocabulary of words related to each class in the domain.

The Clarinet newsgroup problem that was described earlier is a short-text classification problem that we created. Training, test, and background knowledge sets are all taken from news articles in the two classes of bank and sport from Clarinet news. Although the training, test, and background documents all come from the same source, we chose the training and test documents to be simply the first 9 words of the news articles. This reduces the size of the vocabulary in the training corpus, and also reduces the sharing of words between different examples. Since we chose each piece of background knowledge to consist of the first 100 words of a news article, each piece of background knowledge can overlap with many training or test examples, as well as with other background pieces.

To illustrate the effect that background knowledge has when the text classification problem consists of short-text strings, we have created three new problems from the Clarinet domain. Instead of taking the first nine words from the news articles for the training and test data, we take the first 7 words, 5 words and 3 words to create the three new problems. We expect the problems with the shorter text strings to be helped more by the inclusion of background knowledge. A test example from the Clarinet news problem with 9 words, such as:

$${\mbox{\tt Class SPORT: The online suspense in votes for the American League}}$$

will be classified correctly by WHIRL-nn, making the background knowledge unnecessary for this particular test example. However, when only the first 3 words of the article are present the test example is:

$${\mbox{\tt Class SPORT: The online suspense}}$$

and the words American and League are no longer part of the text.

We plot the results for WHIRL-nn and WHIRL-bg in Figs. 14–17, where the problems are named 3-words, 5-words, 7-words and 9-words, corresponding to the test and training sets consisting of the first 3, 5, 7, or 9 words of each article respectively.

Fig. 14 3 words problem

Fig. 15 5 words problem

Fig. 16 7 words problem

Fig. 17 9 words problem

As expected, the smaller the number of words in each training and test example, the worse both WHIRL-nn and WHIRL-bg performed. The addition of background knowledge was also most useful with the shorter strings in the test and training data. This is represented in Figs. 14–17 by the point at which the two lines intersect. For strings of length 3, background knowledge reduced the error rates even when the entire set of training data was used. As the number of words in the training and test examples increased, the point at which background knowledge became helpful changed. For strings of length 9, background knowledge reduced error rates only when less than 60 percent of the data was used. This gives empirical evidence that the less informative the training data is, the greater the advantage in having a corpus of background knowledge available for use during classification. The size of the reduction in error rate obtained by running WHIRL-bg was also greater when there were fewer words in each example.

Using background knowledge robustly

It is still the case that the conjunctive query that we presented (in Section 4.2) incorporates background knowledge in a way that overlooks the direct comparison between training and test examples. Depending upon the type of data and the strength of the background knowledge, this might be a dangerous approach. One of the strengths of WHIRL as a data retrieval engine is that if the test example exists in the training corpus, and the similarity function compares the test example to the training examples, the training example that is identical to the test example will be returned with a score equal to one. WHIRL-bg weakens WHIRL so that this property no longer holds. If the conjunctive background query returns a set of results, then the test example is never directly compared to the training examples and we can no longer access their direct similarity. If a training example is identical to the test example, but is not close to any element in the background knowledge database, it is possible that it will not even be returned in the top K results of the intermediate table. We wish to minimize the risk of such an anomalous event occurring. Additionally, if the background knowledge that is used is unrelated to the text classification task, WHIRL-bg can degrade drastically. Consider the ridiculous situation of the NetVet classification task using the background knowledge from the physics data set, i.e. a set of technical paper abstracts. If a test example consisting of a veterinary Web page title cannot be compared to any of the abstracts, then the background query will return nothing, and the system will fall through to WHIRL-nn to classify the instance. However, suppose that a test example can be compared to an abstract, as meaningless as the comparison might be. We would then have very misleading results. The test example:

russian horses in the UK

when run in this case returns as a result:

[WHIRL-bg result table not shown]

The scores are extremely low, since the background knowledge is not close to either the training or the test set. This example is misclassified as dog. If we run the NetVet data using this background knowledge from the physics domain, WHIRL-bg performs much worse than WHIRL-nn, as can be seen from Fig. 18. Figure 18 gives the average accuracy rates for 20, 40, 60, 80, and 100% of the NetVet data using WHIRL-nn, and using WHIRL-bg with the incorrect background knowledge.

Fig. 18 Comparison of accuracies of WHIRL-nn and WHIRL-bg with unrelated background knowledge

We wish to make our system more robust to the inclusion of misleading background knowledge. To do this we create a disjunctive query that combines the standard WHIRL text classification query with the WHIRL-bg approach.

$$\displaylines{ \texttt{Result(instance2,label) ?- Train(instance1,label)} \wedge \texttt{Test(instance2)} \wedge \texttt{instance1} \sim \texttt{instance2} \cr \texttt{Result(instance2,label) ?- Train(instance1,label)} \wedge \texttt{Test(instance2)} \wedge \texttt{Background(value)} \cr \qquad \wedge\ \texttt{instance1} \sim \texttt{value} \wedge \texttt{instance2} \sim \texttt{value} }$$

Using the two queries that we presented above we can create intermediate tables of their results separately, and project onto the test \(instance\) and \(label\) fields separately as well. These two sets of results are then combined by defining a disjunctive view. This query selects a test \(instance\) and \(label\) from the results of WHIRL-nn and also selects a test \(instance\) and \(label\) from the WHIRL-bg query.
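A sketch of this disjunctive combination, reusing the whirl_nn and whirl_bg sketches given earlier, appears below; it is an illustration, not the actual WHIRL view mechanism.

```python
# Sketch of WHIRL-dis: pool the projected per-label score tables from the
# plain query and from the background query, and noisy-or scores that
# share a label, as in Eq. (1).
def whirl_dis(train, background, test_instance, K=30):
    _, nn_scores = whirl_nn(train, test_instance, K)
    _, bg_scores = whirl_bg(train, background, test_instance, K)
    pooled = {}
    for scores in (nn_scores, bg_scores):
        for label, p in scores.items():
            pooled[label] = 1.0 - (1.0 - pooled.get(label, 0.0)) * (1.0 - p)
    return max(pooled, key=pooled.get) if pooled else None
```

Because the noisy-or of a union of tuples factors into the noisy-or of its parts, combining the two projected score tables this way matches applying Eq. (1) to the pooled intermediate results, as described in the text that follows.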

When this disjunctive query is materialized there may be multiple lines with the same label but with different scores. Equation (1) in Section 3.1 is then used on this combined table to arrive at a final result. This is equivalent to producing the two intermediate tables of size K and then projecting on all the results returned by those two tables. If either of the two queries returns elements with very high scores, then those will dominate the noisy-or operation. In empirical testing we have found that this query is comparable to the WHIRL-bg query defined above; it improves upon learning without background knowledge. The main advantage of this query is that when the background knowledge is misleading and returns results that are meaningless, the disjunction prevents the system from placing emphasis on these false comparisons. The test example:

russian horses in the UK

with the disjunctive query, returns a final result of:

[WHIRL-dis result table not shown]

This is a combination of the results from WHIRL-bg (given above) and the results from WHIRL-nn, which were:

[WHIRL-nn result table not shown]

Since the scores returned in the comparisons of WHIRL-nn were much larger, they dominate the result, which is exactly what we would like.

We present results with four cross-validated data sets to illustrate how the disjunctive query performs in the presence of misleading background knowledge in Figs. 19–22. The four data sets are the physics paper title problem, the NetVet problem, the business name problem, and the thesaurus words problem. These four problems have different types of background knowledge. The physics data has background knowledge that is of the same type as the data, and the NetVet background knowledge is from the same domain as the data but of a slightly different type. The thesaurus background knowledge is very different from the data, but each piece is about only one word, which is like the training and test set, and the business data has background knowledge that is of a totally different size and type than the training and test set.

Fig. 19 Physics

Fig. 20 NetVet

Fig. 21 Business

Fig. 22 Thesaurus

We present the cross-validated accuracy rates for WHIRL-bg and the disjunctive query, which we term WHIRL-dis. For the WHIRL-bg query and WHIRL-dis we present results for background knowledge that is related to the task, for background knowledge that is a mixture of the related set and an unrelated set (termed WHIRL-bg(mixed)), and for totally unrelated background knowledge (termed WHIRL-bg(wrong)). For the unrelated background knowledge we use the background set from the NetVet data for the other three tasks, and the physics abstracts for the NetVet task. The mixed background set consists of all documents in the related background set plus all documents in the unrelated set of background knowledge for each task.

Note that in all four of the data sets WHIRL-bg with the wrong set of background knowledge performs worst. For Figs. 19, 20, and 22, WHIRL-dis with the wrong set of background knowledge performs worse than both WHIRL-bg and WHIRL-dis with the correct set of background knowledge. In cases where the background knowledge is closely related to the problem, especially if the data consists of very short text strings, WHIRL-bg may be more useful than WHIRL-dis. This is partly because the direct comparison part of the disjunctive query has only one conjunct, whereas the background part of the query has two conjuncts. Since WHIRL multiplies the scores of the two conjuncts, and these scores are less than one, the background part of the query often has scores that are lower than those of the direct comparison part. This reduces the utility of the background knowledge when the two parts of the disjunctive query are combined. This phenomenon can be observed in the business name data set, where WHIRL-bg outperforms WHIRL-dis (see Fig. 21). The direct comparison part of WHIRL-dis often relies on words such as inc or company that are not very informative, yet it often provides higher scores than the background part of the WHIRL query, since the score of the background part is the product of two conjuncts. WHIRL-dis would therefore be a better choice if the background knowledge is not from a reliable source and is not known to be closely related to the problem; otherwise, WHIRL-bg would be the appropriate choice.

In all four data sets, for any number of training examples, there is a major discrepancy between the accuracy rate of WHIRL-bg with the wrong set of background knowledge and that of WHIRL-dis with the wrong set of background knowledge. In many cases, WHIRL-bg with the wrong set of background knowledge had accuracy rates that were substantially lower even than WHIRL-nn without any background knowledge. For example, with the thesaurus data set, WHIRL-nn has an accuracy rate of 36.3%. WHIRL-bg achieves 51.4% accuracy with the correct background knowledge and 51.7% with the mixed background knowledge; WHIRL-dis achieves about the same accuracy, roughly 53%, for both the correct and the mixed background knowledge. For the wrong set of background knowledge, the accuracy of WHIRL-dis is close to that of WHIRL-nn, which is what we would like to occur. However, for WHIRL-bg with the wrong set of background knowledge, accuracy actually degrades sharply, achieving a level of only 26.8%. This phenomenon can be seen in all the data sets, so we are convinced that WHIRL-dis minimizes the effects of misleading background knowledge. The overlap of the set of examples that are misclassified with the inclusion of background knowledge and the set of examples that are misclassified without the use of background knowledge is often not large either. Once again, we can look at the statistics in the physics paper title example. Only an average of 25% of the test examples that were misclassified by WHIRL-bg are also misclassified by WHIRL-nn.

Comparisons

Background knowledge vs. unlabeled examples

In Section 4.1 we defined what we mean by background knowledge. It is often the case that, although the background knowledge is of a different form and length than the training and test data, each piece of background knowledge still fits into a specific class. The physics paper titles domain, with physics abstracts used as background knowledge, is a good example of this type of classification problem. Although the abstracts are longer than the training data and often contain words that do not appear in short title strings, each abstract still fits into a specific area of physics. In this sense, the background knowledge can be treated as unlabeled examples.

To compare to previous work, we ran a Naive Bayes classifier with Expectation Maximization (Dempster et al., 1977), as in Nigam et al. (2000), treating the background knowledge as unlabeled examples. This method probabilistically classifies the unlabeled examples and uses them to re-estimate the parameters of the Naive Bayes classifier in an iterative manner. The WHIRL-bg system performs best on the problems where the form and size of the background knowledge are substantially different from the training and test data. In the business name data, the training and test data consist of short text strings that are names of businesses taken from the Hoovers web site (www.hoovers.com). Each piece of background knowledge, however, contains many different companies as well as descriptions of these companies, grouped together by Yahoo! business pages. These background pieces of data are not really classifiable, in the sense that they do not necessarily belong to any specific class in the Hoovers hierarchy. Since WHIRL-bg does not attempt to classify the background knowledge, but merely uses it to index into the training corpus, it makes the best use of this background knowledge. The same phenomenon is apparent in the advertisement data. The training and test data consist of individual advertisements taken from the Courier Post. In contrast, each piece of background knowledge consists of all advertisements under a specific topic from another web site (http://classifieds.dailyrecord.com). These two different sources do not use the same hierarchy, so not only are the background pieces of information of a different type, they are also not classifiable in the same way that the training and test data are. This strengthens our assertion that our method uses external corpora that need not consist of unlabeled examples and need not fit into the classification scheme or individual classes to be helpful.
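For reference, the EM procedure just described can be sketched as follows. This is a simplified illustration using scikit-learn, not the implementation used in our experiments; the function and variable names are our own, and the probabilistic labels of the unlabeled documents are encoded as per-class sample weights.

```python
# A simplified sketch of semi-supervised Naive Bayes with EM, in the style of
# Nigam et al. (2000). Not the implementation used in our experiments.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(labeled_docs, labels, unlabeled_docs, n_iter=10):
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(list(labeled_docs) + list(unlabeled_docs))
    X_l, X_u = X[: len(labeled_docs)], X[len(labeled_docs):]
    y_l = np.asarray(labels)

    clf = MultinomialNB()
    clf.fit(X_l, y_l)  # initialize the model from the labeled data only

    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled (background) documents.
        posteriors = clf.predict_proba(X_u)

        # M-step: re-estimate the parameters from the labeled data plus the
        # unlabeled data, replicating each unlabeled document once per class
        # and weighting it by the posterior probability of that class.
        X_all = vstack([X_l] + [X_u] * len(clf.classes_))
        y_all = np.concatenate(
            [y_l] + [np.full(X_u.shape[0], c) for c in clf.classes_]
        )
        w_all = np.concatenate(
            [np.ones(X_l.shape[0])]
            + [posteriors[:, i] for i in range(len(clf.classes_))]
        )
        clf = MultinomialNB()
        clf.fit(X_all, y_all, sample_weight=w_all)

    return vectorizer, clf
```

The classifier returned after the final M-step is then used to label the test documents with its predict method.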

Results

For the data sets where the background knowledge fits very closely to the training and test classification task, the EM method outperforms WHIRL-bg. This is consistent with the way EM makes use of background knowledge. Since EM actually classifies the background knowledge, and uses the background knowledge to decide on the parameters of its generative model, the closer the background knowledge is to the training and test sets, the better EM will perform. Ideally, for EM, we wish the background knowledge to be generated from the same model as the training and test sets. The data sets that EM performs best on include 20 Newsgroups and WebKb, where the background knowledge is unlabeled examples, and hence of the exact same form as the training and test set. This group also includes the physics paper titles problems.

Figures 23–26 show error rates using EM and WHIRL-bg for the four data sets that we used in the previous section. As can be seen from these graphs, EM achieves greater accuracy for the physics domain and for most data set sizes of the NetVet domain. Since the background of the physics data consists of abstracts that can be classified, this result is expected. Many of the NetVet Web pages, although not all, fit into one class as well. However, for the Business data, WHIRL-bg outperforms EM, as it is able to utilize the background knowledge that is not tailored to the specific classes or categorization task. At some data set sizes this same phenomenon can be seen with the Thesaurus task.

Fig. 23. Physics

Fig. 24. NetVet

Fig. 25. Business

Fig. 26. Thesaurus

In Section 4.6 we saw that our WHIRL-dis query avoids the degradation in accuracy caused by irrelevant background knowledge. To compare WHIRL and EM in the presence of incorrect background knowledge, we present the results of running WHIRL-dis and EM on four data sets with incorrect background knowledge. The graph in Fig. 27 compares the accuracy of WHIRL-nn, on different data sets and sizes, with that of WHIRL-dis given an incorrect set of background knowledge. Points that fall below the line y = x indicate that the wrong background knowledge caused a degradation in accuracy. As can be seen from these graphs, in many domains neither WHIRL nor EM is hurt by the wrong set of background knowledge. However, many of the points in Fig. 27(a) are closer to the line y = x than those in Fig. 27(b). This indicates that incorrect background knowledge is better tolerated by WHIRL-dis than by EM.

Fig. 27. These graphs show how much the inclusion of incorrect background knowledge hurts WHIRL (right) and EM (left)
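The convention behind these plots is a simple paired-accuracy scatter. The sketch below uses made-up accuracy values and matplotlib purely for illustration: each point pairs the accuracy of a method without background knowledge against the accuracy of the same method given the wrong background knowledge, and points below the y = x reference line indicate degradation.

```python
# A sketch (with made-up accuracy values) of the paired scatter convention in
# Fig. 27: points below the y = x line mean the wrong background knowledge hurt.
import matplotlib.pyplot as plt

acc_no_bg = [0.36, 0.52, 0.61, 0.70]      # hypothetical accuracies without background
acc_wrong_bg = [0.35, 0.50, 0.60, 0.71]   # hypothetical accuracies with wrong background

plt.scatter(acc_no_bg, acc_wrong_bg)
plt.plot([0, 1], [0, 1], linestyle="--")  # the y = x reference line
plt.xlabel("Accuracy without background knowledge")
plt.ylabel("Accuracy with the wrong background knowledge")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
```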

Conclusion

Our contribution in this paper is twofold. First, we have evaluated WHIRL, an information integration system, on inductive classification tasks, to which it can be applied in a very natural way: a single unlabeled example can be classified using one simple WHIRL query.

Second, we have presented the novel concept of “background knowledge” to aid in text categorization. This idea is strongly related to current work on the use of unlabeled examples in text classification. Rather than having a learner rely solely on labeled training data to create a classifier for new test examples, we show that combining the labeled training data with other forms of readily available related text allows for the creation of more accurate classifiers. We have empirically shown that a large body of potentially uncoordinated background knowledge can be used, in conjunction with WHIRL, to yield good results. In most of the data sets to which the system was applied, we saw substantial reductions in error rates, particularly when the set of labeled examples was small. In many cases the use of background knowledge allowed for only a small degradation in accuracy as the number of training examples was decreased.

Appendix A: Data set details

In this section we describe each of the data sets that we used, along with the source and nature of the background knowledge used with each classification task.

A.1. 20 Newsgroups

Two of the data sets that we used, the 20 Newsgroups data and the WebKb data set, have been widely studied by other researchers in machine learning (http://www.cs.cmu.edu/~textlearning).

The 20 Newsgroups data set consists of articles from 1993 that were collected by Ken Lang from 20 distinct newsgroups on UseNet (Lang, 1995). It is hard to distinguish between some of these newsgroups because some of their topics are closely related. There are three newsgroups that discuss politics, five that are computer related, four science newsgroups, three about religion, four on recreational activities, and one miscellaneous group. The vocabulary consists of 62258 words that are each used more than once (Nigam et al., 2000; Nigam, 2001). Although it might sometimes be easy for an automated machine learning system to distinguish between dissimilar newsgroups, the problem becomes very hard when trying to determine which of a few related newsgroups an article falls into. It is particularly challenging when there are very few labeled examples, as that makes it even harder to distinguish between classes.

The training set and test set consist of articles from the 20 newsgroups, and the set of background knowledge consists of individual articles from the 20 newsgroups as well. This background knowledge is therefore exactly of the same form and from the same source as both the training and test data. In this case we can alternatively use a more common term for the background knowledge, “unlabeled examples”. Although this is a very limited form of background knowledge, and we must make the assumption that unlabeled examples are readily available, this data set is a good test bed for the comparison of our work to that of other researchers.

We followed exactly the same train/test/unlabeled split as in the work of Nigam et al. (2000) and Nigam (2001). The UseNet headers of all examples, which contain the name of the newsgroup each article was posted to, are of course removed before any learning is done. A list of stop-words was also removed from all the examples. The latest 200 examples in each group were taken for the test set, creating a test set of 4000 examples; this closely mimics the natural scenario of training on older articles and testing on current ones. An unlabeled set of 10000 examples was created by randomly choosing from the remaining 80% of the examples. Training sets of different sizes were created by randomly choosing from the examples that were left. Ten non-overlapping training sets of each size were formed, and the results that we report are the averages over all of these runs.
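The splitting protocol just described can be sketched as follows. This is an illustration rather than the scripts we actually used; it assumes a hypothetical articles_by_group mapping from each newsgroup name to its articles, already sorted from oldest to newest.

```python
# A sketch (not our actual preprocessing scripts) of the 20 Newsgroups
# train/test/unlabeled split: the latest 200 articles per group form the test
# set, 10000 of the remaining articles become the unlabeled set, and ten
# non-overlapping labeled training sets are drawn from what is left.
import random

def split_20news(articles_by_group, n_unlabeled=10000, train_size=100, seed=0):
    rng = random.Random(seed)
    test, pool = [], []
    for group, articles in articles_by_group.items():
        test.extend((text, group) for text in articles[-200:])   # newest 200 per group
        pool.extend((text, group) for text in articles[:-200])   # older ~80% remain

    rng.shuffle(pool)
    unlabeled = [text for text, _ in pool[:n_unlabeled]]          # labels are discarded
    remaining = pool[n_unlabeled:]

    # Ten non-overlapping labeled training sets of the requested size.
    train_sets = [remaining[i * train_size:(i + 1) * train_size] for i in range(10)]
    return train_sets, unlabeled, test
```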

A.2. WebKb

The WebKb data set consists of pages that are part of computer science departments at many different universities, each placed in one of seven categories: student, faculty, staff, course, project, department, and other. Following the methods of other researchers, our data consists of only the 4199 pages that are in the four most common categories (excluding “other”): faculty, course, student, and project (Nigam et al., 2000; Joachims, 1999). Our text-categorization problem is to place a web page into one of these four classes.

For purposes of training and testing the data is divided by the department that the web page is associated with. There are five total sets: pages from Cornell, Washington, Texas, Wisconsin and a miscellaneous group that includes many other universities. In this way, four test sets are formed by taking pages from each of the four universities. For each test set, the training and unlabeled sets are taken from the pages of all the other universities. For example, the training and unlabeled examples used with the Wisconsin test set come from Cornell, Washington, Texas and the miscellaneous group. This helps avoid learning parameters that are specific to a university and testing on that same university, as opposed to learning parameters that are specific to the categorization task.

For each of these four test sets, a set of 2500 unlabeled examples is randomly chosen from the documents of all the other universities. Once again, for each of the four test sets, training sets of different sizes were formed from the remaining documents. The results reported in this paper are averages over the 40 runs (10 per data set size per 4 departments). Preprocessing of the data follows previous methodology: numbers and phone numbers are converted into tokens, and feature selection is done by keeping only the 300 words that have the highest mutual information with the classes.
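This feature-selection step can be approximated with standard tools. The sketch below, which uses scikit-learn and is not the original preprocessing code, keeps the 300 words whose estimated mutual information with the class labels is highest.

```python
# A sketch (not the original preprocessing code) of keeping only the 300 words
# with the highest estimated mutual information with the class labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def select_top_words(train_docs, train_labels, k=300):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(train_docs)

    selector = SelectKBest(mutual_info_classif, k=k)
    X_reduced = selector.fit_transform(X, train_labels)

    kept_words = vectorizer.get_feature_names_out()[selector.get_support()]
    return X_reduced, kept_words
```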

For this data set as well, our split of the articles into the train, test and unlabeled sets is exactly that of Nigam et al. (2000) and Nigam (2001).

A.3. Advertisements

Following the methodology of the 20 Newsgroups and WebKb data sets, we created a data set of short classified advertisements from the World Wide Web. In this set, and in the descriptions that follow, the background knowledge no longer consists simply of unlabeled examples. For the labeled set of examples, we downloaded the classified advertisements from one day in January 2001 from the Courier Post at http://www.southjerseyclassifieds.com. The Courier Post online advertisements are divided into 9 main categories: Employment, Real Estate for Sale, Real Estate for Rent, Dial-A-Pro, Announcements, Transportation, Pets, Employment, and Business Opportunity. As in the 20 Newsgroups and WebKb data sets, we created 10 random training sets of different sizes. For testing, we simply downloaded advertisements from the same paper from one day a month later, taking approximately 1000 (25%) of the examples for our test set.

The background knowledge for this problem came from another online newspaper, The Daily Record (http://classifieds.dailyrecord.com). The Daily Record advertisements online are divided into 8 categories: Announcements, Business and Service, Employment, Financial, Instructions, Merchandise, Real Estate and Transportation. We treated the union of the advertisements under each one of these categories as a separate piece of background knowledge. The average length of the training and test data is 31 words; the background pieces had an average length of 2147 words each. The vocabulary size of the full set of data without the background was 5367; with the background it became 6489.

A.4. ASRS

The Aviation Safety Reporting System (http://asrs.arc.nasa.gov/) is a combined effort of the Federal Aviation Administration (FAA) and the National Aeronautics and Space Administration (NASA). The purpose of the system is to provide a forum through which airline workers can anonymously report any problems that arise pertaining to flights and aircraft. The incident reports are read, classified, and diagnosed by analysts, who identify any emergencies, detect situations that compromise safety, and provide actions to be taken. These reports are then placed into a database for further research on safety and other issues. We obtained the data from http://nasdac.faa.gov/asp/; our database contains the incident reports from January 1990 through March 1999.

Since we are interested in text categorization tasks, we deal with two parts of each incident report: the “narrative” and the “synopsis”. The “narrative” is a long description of the incident, ranging in our training and test data from 1 to 1458 words with a median length of 171 words. The “synopsis” is a much shorter summary of the incident, ranging in length from 2 to 109 words with a median length of 24. It is interesting to note that many of the words are abbreviated, which makes the text classification task even harder.

Many different categorization problems can be derived from this data set. One feature associated with each incident is the consequence of the incident, which the analyst adds to the report. It can take on the values: aircraft damaged, emotional trauma, FAA investigatory follow-up, FAA assigned or threatened penalties, flight control/aircraft review, injury, none, and other. If more than one consequence was present we removed that incident from our training and test data. We also removed from the training and test data all incidents whose categories were none or other. This then became a six-class classification problem.

We chose the training and test sets to consist of the synopsis part of each incident. The test set consists of data from the year 1999, for a total of 128 examples. The training set consists of all data from 1997 and 1998, for a total of 591 examples. For the background knowledge, we chose all narratives from 1990–1996. In this case we did not remove any examples; thus the background knowledge contains those reports whose categories were other and none, as well as the six categories found in our training and test set. For this data set, therefore, the training and test examples are shorter than the background pieces of knowledge, and the background pieces do not all fit into the categories of the text classification problem. The total vocabulary size of the training and test data combined was 1872; when the background knowledge was added, the vocabulary had a total of 3771 words. Once again, as in the 20 Newsgroups and WebKb data sets, we created random training sets of different sizes from our training data.
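The filtering that produces this six-class problem can be sketched as follows. The field names are hypothetical, and this is an illustration rather than the scripts we actually used.

```python
# A sketch (with hypothetical field names) of the filtering that yields the
# six-class ASRS problem: incidents with more than one consequence, or whose
# consequence is "none" or "other", are dropped from the training and test data.
EXCLUDED = {"none", "other"}

def build_asrs_examples(incidents):
    """`incidents` is assumed to be a list of dicts, each with a "synopsis"
    text field and a "consequences" list assigned by the analyst."""
    examples = []
    for incident in incidents:
        consequences = incident["consequences"]
        if len(consequences) != 1:
            continue                          # keep single-consequence incidents only
        label = consequences[0]
        if label.lower() in EXCLUDED:
            continue                          # drop the "none" and "other" categories
        examples.append((incident["synopsis"], label))
    return examples
```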

A.5. Physics papers

One common text categorization task is assigning topic labels to technical papers. We created a data set from the physics papers archive (http://xxx.lanl.gov), from which we downloaded the titles of all technical papers in the first two areas of physics (astrophysics and condensed matter) for the month of March 1999. As background knowledge we downloaded the abstracts of all papers in these same areas from the two previous months, January and February 1999. In total there were 1530 pieces of knowledge in the background set and 953 examples in the combined training-test set. The abstracts were downloaded without their labels (i.e., without knowledge of which sub-discipline they were from), so that our learning programs had no access to this information.

The average length of the technical paper titles was 12.4 words while the average length of the abstracts was 141 words. The number of words in the vocabulary taken from the full set of titles was 1716; with the abstracts it went up to 6950.

A.6. NetVet

We have created other text classification tasks from the World Wide Web (Cohen and Hirsh, 1998; Zelikovitz and Hirsh, 2000). The NetVet site (http://www.netvet.wustl.edu) includes Web page headings for pages concerning cows, horses, cats, dogs, rodents, birds and primates. The text categorization task is to place a web page title into the appropriate class. For example, a training example in the class birds might be: “Wild Bird Center of Walnut Creek.” Each of these titles had a URL linking the title to its associated Web page. For the training/test corpus, we randomly chose half of these titles with their labels, 1789 examples in total. We discarded the other half of the titles, along with their labels, and simply kept the URL of the associated Web page. We used these URLs to download the first 100 words from each of these pages, which were placed into a corpus of background knowledge. URLs that were not reachable were ignored by the program that created the background knowledge. In total there were 1158 entries in the background knowledge database.
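The construction of this background corpus can be sketched roughly as follows. This is an illustration using the requests library, not the original program; the helper name and the crude HTML stripping are our own choices.

```python
# A rough sketch (not the original program) of building the NetVet background
# corpus: fetch each discarded title's URL, keep the first 100 words of the
# page, and silently skip URLs that cannot be reached.
import re
import requests

def build_background(urls, n_words=100, timeout=10):
    background = []
    for url in urls:
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
        except requests.RequestException:
            continue                                   # unreachable URLs are ignored
        text = re.sub(r"<[^>]+>", " ", response.text)  # crude removal of HTML tags
        words = text.split()[:n_words]
        if words:
            background.append(" ".join(words))
    return background
```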

In this case, the background data consisted of much longer entries than the training and test data. The titles had an average length of 4.9 words, while each piece of background data contained 100 words. The total vocabulary size of the titles was 1817 words; it jumped to 10399 words when the background data was added. However, the words in the background data were not as informative as those in the shorter title strings, because people tend to place important words in their titles. Many of the background pieces contained words that were entirely unrelated to the task.

A.7. Business names

Another data set consisted of a training set of company names, 2472 in all, taken from the Hoovers Web site (http://www.hoovers.com) and labeled with one of 124 industry names. The class retail had the most business names associated with it, at 303 examples, and a few classes had only one example each. The average number of examples per class was 20. We created background knowledge from an entirely different Web site, http://biz.yahoo.com. We downloaded the Web pages under each business category in the Yahoo! business hierarchy to create 101 pieces of background knowledge. The Yahoo! hierarchy had a different number of classes and a different way of dividing the companies, but this was irrelevant to our purposes since we treated it solely as a source of unlabeled background text. Each piece of background knowledge consisted of the combination of Web pages that were stored under a sub-topic in the Yahoo! hierarchy. Each example in the training and test set had an average of 4 words (for example: Midland Financial Group, Inc). The instances in the table of background knowledge had an average length of 6727 words, and each one was thus a much longer text string than the training or test examples. The vocabulary size for the business names was 2612 words; with the background knowledge it was 22,963 words.

A.8. Thesaurus

Roget's thesaurus places words in the English language into one of six major categories: space, matter, abstract relations, intellect, volition, and affection. For example, a word such as “superiority” falls into the category abstract relations, while the word “love” is placed into the category affection. From http://thesaurus.reference.com/, we created a labeled set of 1000 words, with each word associated with one category. The smallest category contained 135 labeled examples and the largest class contained 222 examples. The training and test sets had a combined vocabulary size of 1063 words (as most examples had only one word each), but when the background knowledge was added the total vocabulary size became 11607 words.

We obtained our background knowledge via http://www.thesaurus.com as well, by downloading the dictionary definitions of all 1000 words in the labeled set. We cleaned up the dictionary definitions by removing the sources that were returned (i.e., which dictionary the information was gleaned from) as well as other miscellaneous information (such as how many definitions were found). Each of these dictionary definitions became an entry in our background knowledge database. An interesting point about this data set is that the background knowledge contains information directly about the test set, i.e., definitions of the words in the test set. Since these definitions are not directly related to the classification task at hand, this poses no contradiction. As a matter of fact, we can treat new test examples given to the system as follows: given a word and its definition, place the definition into the background knowledge database, and then categorize the word using the total background knowledge.

A.9. Clarinet news

Another data set that we created was obtained from Clarinet news (www.clarinet.com). We downloaded all articles under the sports and banking headings on November 17, 1999, using the most recent ones for the training and test sets and the older ones for background knowledge. In total, our background knowledge consisted of a corpus of 1165 articles; each piece of background knowledge consisted of the first 100 words of one of these articles. Informal studies showed that including the entire articles did not improve accuracy substantially, probably because the most informative part of an article is usually the first few paragraphs. Our training-test data had 1033 data points, of which 637 belonged to the sports category and 406 to the banking category. We took the first nine words of each news article to create the 1033 examples in the training and test set; the results that we present are on these short text strings. The vocabulary size of the training and test set was 1984, but when the background knowledge is included there were 7381 words.
