1 Introduction

Topic modeling, similarly to text classification, is an established and thoroughly researched field in computer science. Traditionally, both techniques are based on a bag-of-words (BOW) document representation, where one feature corresponds to one word (its stem or lemma), i.e. word order is ignored and only the frequencies matter. As Gabrilovich and Markovitch [3] describe the state of the art in text classification in their 2005 paper, “after a decade of improvements, the performance of the best document categorization systems became more or less similar, and it appears as though a plateau has been reached [...]”. For this reason, researchers started to develop different approaches. Given the limitations of the BOW model, the most natural idea was to enhance the document representation.

Scott and Matwin [11] made one of the first efforts of feature-vector engineering for text classification by using WordNet, a lexical database for English, and converting documents to feature vectors based on this new representation. More recent papers, such as Garla and Brandt [4] and Zong et al. [14], employ semantic information during the feature engineering step and apply machine learning techniques to learn text categories.

These attempts inspired us to perform feature engineering in the context of topic modeling. We want to incorporate semantic information in order to extend the traditional bag-of-words approach into a novel bag-of-features approach when preparing feature vectors. We plan to consider not only words but also disambiguated Named Entities linked to DBpedia resources and several related entities.

The underlying idea and motivation for our work is that topic modeling algorithms draw their information from the frequencies and co-occurrences of tokens within single documents and across the whole corpus. We therefore hypothesize that, in thematically related documents, the entities and/or their types, hypernyms or the categories of the corresponding Wikipedia articles should also overlap, so the summed frequencies of these terms become more meaningful and lift their relevance in the discovered topics.

For example, consider a text snippet from a Spiegel Online article that a human would label as “politics”: “Barack Obama is only passing through Germany on his trip to Europe later this week and does not plan to hold substantial talks with Angela Merkel. The White House views the chancellor as difficult and Germany is increasingly being left out of the loop”. The word politics itself has a frequency of zero. But if we perform named entity recognition and disambiguation, the entities Barack Obama and Angela Merkel will be recognized as politicians thanks to the enrichment we perform.
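To make the enrichment concrete, the following minimal sketch (purely illustrative, not part of the pipeline described in this paper) queries the public DBpedia SPARQL endpoint for the ontology types of a disambiguated entity; the SPARQLWrapper package and the endpoint URL are assumptions on our side, and the exact classes returned depend on the DBpedia release.

```python
# Illustrative sketch: look up DBpedia ontology types of a disambiguated entity.
# Assumes the public endpoint https://dbpedia.org/sparql and the SPARQLWrapper package.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT DISTINCT ?type WHERE {
        <http://dbpedia.org/resource/Barack_Obama> a ?type .
        FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
    }
""")
sparql.setReturnFormat(JSON)

for binding in sparql.query().convert()["results"]["bindings"]:
    # Expected to include classes such as dbo:Person or dbo:OfficeHolder,
    # which is the kind of signal the enrichment step adds to the feature vector.
    print(binding["type"]["value"])
```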

In this work we present an approach for mining topic models enriched with background knowledge. We focus on the feature engineering aspect of topic modeling and leave the underlying generative statistical model intact. We assess the quality of this approach with an evaluation strategy that inspects the internal coherence of topics and the topic-document assignments in terms of human understanding.

2 Related Work

In contrast to the pure word-based LDA algorithm, its variations (such as different sampling techniques, or the online learning proposed by Hoffman et al. [5], which enables streaming model mining and is thus much less resource-hungry) and its applications (e.g. Gibbs-sampling-based LDA for gene prediction [10]), topic modeling approaches using enriched feature vectors have not received much research attention so far.

One of the first methods that contributes to topic modeling using entities instead of words as features was published by Newman et al. [8]. The authors propose five new models that modify LDA in order to learn pure entity-topic models. They evaluate them with regard to entity prediction rather than their information-theoretic value, e.g. by measuring perplexity.

Hu et al. [6] present an approach to taxonomy-based topic representation that focuses on entities from a knowledge base. They developed a probabilistic model, Latent Grounded Semantic Analysis, that infers both topics and entities from text corpora and grounds the information in a KB.

Todor et al. [12], our previously published work, approaches enriched topic models in a different way and treats them as predictors for multi-label classification tasks. The approach was evaluated on news articles, each of which was labeled with a category. After mining the topics, we let the model predict the topical coverage of every document and built, for every topic, a histogram of how many times it was most relevant for a particular label (e.g. topic 1 was most relevant for 100 documents about sport, 759 documents about politics, etc.). We then took the highest value of the label histogram and from that point on considered it the label of this topic. For the evaluation we measured classification accuracy; more specifically, we counted which (1st, 2nd or 3rd) most relevant topic was the correct one, i.e. associated with the article’s label. The results showed that for every dataset there was at least one enriched topic model (consisting of words + linked entities) that outperformed the classic model consisting solely of words when looking only at the single most relevant predicted topic, which was a very positive and important outcome. When we left words aside and only considered linked entities, we also had to take the 2nd and 3rd most relevant topics into account. We attribute this to the fact that the vocabulary of linked entities is much smaller compared to words, so an unambiguous prediction is harder to make. On the other hand, feature combinations whose cumulative accuracy within the top three predictions is comparable to words operate on a smaller vocabulary, which has the advantage of lower time and space complexity.

We already mentioned perplexity (or, equivalently, predictive likelihood) as the established quantitative method for expressing the quality of a topic model. To calculate perplexity on a held-out set D of test documents in LDA, we characterize the model by the topic matrix \(\varPhi \) and the hyperparameter \(\alpha \), the Dirichlet prior of the document-topic distribution. We then calculate the log-likelihood of generating every document \(d \in D\) given these two parameters:

$$\begin{aligned} LL(D) = log~p(D | \varPhi , \alpha ) = \sum _{d \in D} log~p(d | \varPhi , \alpha ) \end{aligned}$$

The computed log-likelihood can be used to compare topic models: the higher the value, the better the model.

For LDA, to make the measure independent of the size of the held-out corpus, we define perplexity as the exponential of the negative log-likelihood divided by the number of tokens (note that the quality of the model increases as perplexity decreases):

$$\begin{aligned} perplexity(D) = exp \bigg ( \frac{-LL(D)}{\#tokens} \bigg ) \end{aligned}$$
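As a minimal sketch of the two formulas above (assuming a per-document log-likelihood function is available from whatever LDA implementation is used; `log_likelihood` below is a hypothetical callable), perplexity can be computed as follows:

```python
import math
from typing import Callable, List

def corpus_perplexity(docs: List[List[str]],
                      log_likelihood: Callable[[List[str]], float]) -> float:
    """Perplexity of a held-out corpus D.

    log_likelihood(doc) is assumed to return log p(d | Phi, alpha)
    for a single tokenized document, as defined above.
    """
    total_ll = sum(log_likelihood(doc) for doc in docs)  # LL(D)
    n_tokens = sum(len(doc) for doc in docs)             # #tokens
    return math.exp(-total_ll / n_tokens)
```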

Wallach et al. [13] published an overview of evaluation methods for topic models. They address certain challenges, such as the difficulty of estimating \(p(d | \varPhi , \alpha )\), and propose sampling methods to overcome it.

Another interesting method of evaluating topic models is coherence, i.e. examining whether a single semantic concept enfolds the top words of a topic. This can be quantified, and there exist two state-of-the-art ways of calculating coherence: an intrinsic one (which does not use an external source of information) and an extrinsic one (which may employ external data or statistics to calculate the coherence score). Both methods are based on the same idea of summing scores over every pair of the top n words of a given topic t:

$$\begin{aligned} coherence_t = \sum _{i<j} score(w_{t,i}, w_{t,j}) \end{aligned}$$

The difference between the methods is the score function.

The most popular extrinsic measure is the UCI measure, proposed by Newman et al. [9]. Its pairwise score function is the pointwise mutual information:

$$\begin{aligned} uci(w_i, w_j) = log \frac{p(w_i, w_j)}{p(w_i)p(w_j)} \end{aligned}$$

The probabilities p(w) and \(p(w_i,w_j)\) denote the probabilities of seeing a word w and a co-occurring pair of words \(w_i\) and \(w_j\) in a random document. These probabilities are empirically estimated on a document corpus different from the one used to mine the topics (hence the name, extrinsic method), e.g. as the document frequencies of words and word pairs.
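A small sketch of the UCI score for one topic, assuming the reference statistics are document frequencies counted on an external corpus (the dictionaries `doc_freq` and `pair_doc_freq` and the count `n_docs` are hypothetical inputs of our own naming); a small constant avoids taking the log of zero:

```python
import math
from itertools import combinations

def uci_coherence(top_words, doc_freq, pair_doc_freq, n_docs, eps=1e-12):
    """UCI coherence: sum of PMI scores over all pairs of top words.

    doc_freq[w]       -- number of external documents containing w
    pair_doc_freq[p]  -- number of external documents containing both words
                         of the (alphabetically sorted) pair p
    """
    score = 0.0
    for wi, wj in combinations(top_words, 2):
        p_i = doc_freq.get(wi, 0) / n_docs
        p_j = doc_freq.get(wj, 0) / n_docs
        p_ij = pair_doc_freq.get(tuple(sorted((wi, wj))), 0) / n_docs
        # Pointwise mutual information, as in the UCI definition above
        score += math.log((p_ij + eps) / (p_i * p_j + eps))
    return score
```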

The most popular intrinsic coherence measure is UMass, proposed by Mimno et al. [7]. The proposed score function is a smoothed variant of the conditional log-probability:

$$\begin{aligned} umass(w_i, w_j) = log \frac{D(w_i, w_j) + 1}{D(w_i)} \end{aligned}$$

The UMass score measures how well a more common word predicts a less common one. Here we only consider the relations among the top n words of a given topic.
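Analogously, a sketch of the UMass score under the assumption that the document frequencies are counted on the same corpus the topics were mined from (the intrinsic setting described above); here the order of the words matters, since the formula conditions on the higher-ranked word \(w_i\):

```python
import math
from itertools import combinations

def umass_coherence(top_words, doc_freq, pair_doc_freq):
    """UMass coherence of one topic, following the formula above.

    top_words         -- top n words of the topic, ordered by relevance
    doc_freq[w]       -- number of training documents containing w
    pair_doc_freq[p]  -- number of training documents containing both words
                         of the (alphabetically sorted) pair p
    """
    score = 0.0
    for wi, wj in combinations(top_words, 2):  # all pairs with i < j
        d_ij = pair_doc_freq.get(tuple(sorted((wi, wj))), 0)
        # +1 smoothing in the numerator, as in the definition above
        score += math.log((d_ij + 1) / doc_freq[wi])
    return score
```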

Both measure names come from the institutions where the authors worked at the time of publication (the University of Massachusetts for David Mimno and the University of California, Irvine for David Newman).

Let us present another evaluation technique, proposed by Chang et al. [2], who wanted to qualitatively measure the coherence of estimated topic models. The state-of-the-art quantitative coherence measures, UMass and UCI, follow a similar high-level idea. The method proposed by Chang et al. abstracts from them and postulates a purely manual evaluation based on human judgement, which can be seen as understanding and coherence from the human perspective.

The authors define two tasks: word intrusion and topic intrusion. The first is expected to measure how strongly the most relevant words of a topic compose a coherent semantic concept, an unbreakable unit. For this they draw a random topic from the model, take its five most relevant words and add one top word from the remaining topics, creating a set of six words which they shuffle and present to a human who is asked to select the intruder. In order to quantitatively measure how well the topics match human understanding, Chang et al. introduced model precision, i.e. the extent to which the intruders selected by humans correspond to the “real” intruders. Model precision is defined as the fraction of correctly selected intruders (Eq. 1, where \(w_s\) is the word selected by the evaluation subject, w is the real intruding word and S is the number of answers), and thus ranges from 0 (worst) to 1 (best).

$$\begin{aligned} MP = \sum \mathbbm {1}(w_s = w)/S \end{aligned}$$
(1)

The second task measures how understandable the topics are in terms of assigning them to a given text. To prepare a topic intrusion question, the authors draw a random article and consider its topical coverage. They take the three most relevant topics and one irrelevant topic, each represented by its eight top words, shuffle them and present them to an evaluation subject who is asked to select the intruder. The results of this task are evaluated using topic log odds. This measure, also introduced by Chang et al., quantifies how good the human guesses were. In the topic intrusion task, every answer (a topic) has a certain probability of generating the given document. Topic log odds sums and normalizes the differences between the log probability of the real intruder belonging to the document and that of the intruder selected by the evaluation subject. Intuitively, this way of evaluating makes sense, since it does not simply count right/wrong answers in a binary fashion but works as a kind of error function.

$$\begin{aligned} TLO = \sum \big ( log~\hat{\theta }_{d,*} - log~\hat{\theta }_{d,s} \big ) / S \end{aligned}$$
(2)

Simplifying the notation of Chang et al., Eq. 2 defines topic log odds, where \(\hat{\theta }_{d,*}\) is the probability of the intruder topic belonging to the document, \(\hat{\theta }_{d,s}\) is the probability of the topic selected by the subject belonging to it, and S is again the number of answers. Because the latter is greater than or equal to the former, a topic model is better in terms of TLO when TLO is higher (closer to 0).
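A short sketch of both measures as defined in Eqs. 1 and 2, assuming the intrusion answers have already been collected (the argument names below are ours, not Chang et al.’s):

```python
import math

def model_precision(selected_intruders, true_intruders):
    """Eq. 1: fraction of answers in which the subject found the real intruder."""
    hits = sum(1 for s, w in zip(selected_intruders, true_intruders) if s == w)
    return hits / len(true_intruders)

def topic_log_odds(theta_true_intruder, theta_selected):
    """Eq. 2: mean log-probability difference between the true intruder topic
    and the topic selected by the subject, per answer.

    Both arguments are lists of document-topic probabilities; the result is
    at most 0, and values closer to 0 indicate a better model.
    """
    diffs = [math.log(t_star) - math.log(t_sel)
             for t_star, t_sel in zip(theta_true_intruder, theta_selected)]
    return sum(diffs) / len(diffs)
```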

3 Approach

Our approach differs in several ways from the state-of-the-art methods using entities in mining topic models. First, we do not modify the underlying probabilistic generative model of LDA and can therefore apply our method to any variation and implementation of the algorithm. Second, we mine topics that contain named entities linked to a knowledge base and might be used for knowledge acquisition purposes, e.g. taxonomy extraction or knowledge base population. Moreover, we only employ one KB, DBpedia, with DBpedia Spotlight as the NERD tool, and focus on finding the best topic models in this setup. Lastly, we combine two evaluation techniques: we concentrate on achieving low perplexity, and we also measure human perception and interpretability of the mined models.
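To make the NERD step more concrete, here is a hedged sketch of calling DBpedia Spotlight over its REST interface; the endpoint URL, parameter names and response fields follow the publicly documented demo service and may differ for a self-hosted installation or other Spotlight versions.

```python
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # public demo endpoint

def annotate(text, confidence=0.5, support=20):
    """Return (DBpedia URI, types) pairs for the entities linked in `text`."""
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence, "support": support},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    # Each resource carries the linked DBpedia URI and a comma-separated
    # list of its types, which we use as entity and type features.
    return [(r["@URI"], r.get("@types", ""))
            for r in response.json().get("Resources", [])]

# e.g. annotate("Barack Obama met Angela Merkel in Berlin.")
```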

4 Evaluation

We evaluated our approach on three datasets: BBC, the New York Times Annotated Corpus (NYT) and DBpedia Abstracts.

The BBC dataset is a collection of 2225 selected news articles from the BBC news website, corresponding to stories from five domains (business, entertainment, politics, sport and tech) published in 2004 and 2005.

NYT is a collection of over 1.8 million selected New York Times articles spanning 20 years, from 1987 to 2007. Over 650 thousand of them have been manually summarized by library scientists, 1.5 million have been manually tagged with regard to mentions of people, organizations, locations and topic descriptors, and over 250 thousand have been algorithmically tagged and manually verified. Additionally, Java utilities for parsing and processing the corpus, which is provided in XML format, are included in the download. For these reasons it is one of the most widely used datasets for natural language processing approaches. We did not take the full NYT dataset but reduced it to over 46,000 articles that have been pre-categorized into at least one of the following ten taxonomy classes: science, technology, arts, sports, business, health, movies, education, u.s., world.

We already introduced the small and medium-sized datasets used in the evaluation. We chose one more dataset that can be categorized as big, namely the corpus of DBpedia abstracts: the first paragraphs of Wikipedia articles extracted as free text. Abstracts are connected to DBpedia resources using the property abstract from the DBpedia ontology (http://dbpedia.org/ontology). The Long Abstracts dataset is available from the DBpedia downloads website.

After cleaning the documents and annotating the datasets (some articles contain characters which cause Spotlight to fail), we end up with the numbers of articles per data source displayed in Table 1.

Table 1. Sizes of datasets used in the evaluation.

We decided to use 10 feature combinations in our evaluation, i.e. every topic model is estimated 10 times, once per combination. First, we chose to mine models using words (w; an abbreviation in parentheses after each feature combination will occasionally be used to avoid repeating long descriptors) in order to be able to compare the models of our enriched approach to a classical one. Second, we decided to use words together with entities (we) and hypernyms (weh): we assume that linked named entities recognized in a text represent important concepts that can be characteristic for that text. For the same reason we also include hypernyms, which can generalize and “group together” semantically related entities.

The remaining seven models consist purely of DBpedia resources. They include models with entities (e), hypernyms (h) and both features together (eh); the justification for these three feature sets is analogous: we assume the entities to represent important concepts from a text and the hypernyms to generalize them. The next model consists of types only (t): we want to see whether the rdf:types of the entities recognized in a document are descriptive enough for its content. The last three models contain types together with entities, hypernyms and the combination of both (et, th, eth); here we want to investigate whether these three kinds of features combined contribute to the quality of the mined models.
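The ten combinations can be assembled by a simple per-document merge of token lists; a minimal sketch, assuming words, linked entities, hypernyms and rdf:types have already been extracted for every document (the dictionary layout is our illustrative choice, not a fixed format):

```python
from itertools import chain

# The ten feature combinations described above, keyed by their abbreviations.
COMBINATIONS = {
    "w":   ["words"],
    "we":  ["words", "entities"],
    "weh": ["words", "entities", "hypernyms"],
    "e":   ["entities"],
    "h":   ["hypernyms"],
    "eh":  ["entities", "hypernyms"],
    "t":   ["types"],
    "et":  ["entities", "types"],
    "th":  ["types", "hypernyms"],
    "eth": ["entities", "types", "hypernyms"],
}

def build_tokens(doc, combination):
    """Concatenate the selected token lists of one document.

    `doc` is assumed to look like
    {"words": [...], "entities": [...], "hypernyms": [...], "types": [...]}.
    The resulting token list is handed to LDA unchanged.
    """
    return list(chain.from_iterable(doc[feature] for feature in COMBINATIONS[combination]))
```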

We decided to omit models containing categories (subject properties from the Dublin Core terms vocabulary, which correspond to the Wikipedia categories of a given entity) from the evaluation, since we expected them to be too broad and not descriptive enough.

The goal of this work was not to find the best number of topics through exhaustive evaluation; rather, we want to measure changes in the human perception of enriched topics for different feature sets. However, we needed to determine k for every dataset and use it for generating the word/entity and topic intrusion tasks. For this reason we evaluated the perplexity of diverse models (feature set/number of topics combinations), as proposed in [1, 13].

For generating the study we wanted to choose sets of models with k’s for which the perplexity of the pure word-based models stops falling. We expected this to allow us to compare models mined with our enriched approach against the best model for the given dataset.
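A sketch of this model-selection loop, illustrated here with the gensim LDA implementation (any implementation that reports held-out perplexity would do; the call names are gensim’s, not part of our setup description):

```python
from gensim import corpora, models

def perplexity_per_k(train_docs, heldout_docs, candidate_ks):
    """Estimate one LDA model per candidate k and report held-out perplexity.

    Both document arguments are lists of token lists built from one of the
    feature combinations above. Returns {k: perplexity}.
    """
    dictionary = corpora.Dictionary(train_docs)
    train = [dictionary.doc2bow(d) for d in train_docs]
    heldout = [dictionary.doc2bow(d) for d in heldout_docs]

    results = {}
    for k in candidate_ks:
        lda = models.LdaModel(train, num_topics=k, id2word=dictionary, passes=5)
        # log_perplexity returns a per-word likelihood bound;
        # gensim defines perplexity as 2 ** (-bound).
        results[k] = 2 ** (-lda.log_perplexity(heldout))
    return results

# We keep the smallest k after which the word-based model's perplexity stops falling.
```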

We had trouble selecting the number of topics for the DBpedia Abstracts dataset, since generating models on such a huge corpus is very time consuming (10 h on an 8-core CPU). We had no guarantee that estimating more and more models with larger numbers of topics would, in a reasonable amount of time, yield one where the perplexity finally starts to converge. An overview of perplexity values for different combinations of k’s and feature sets on the DBpedia Abstracts corpus can be found in Table 4. We noticed that the perplexity of the traditional word-based model was almost the same for 3000 and 5000 topics.

Eyeballing the questionnaires generated from the NYT and DBpedia Abstracts models led us to conclude that improvements in perplexity for models with higher numbers of topics do not necessarily carry over into an improvement in the quality of the questions.

In the end we decided to use the following numbers of topics per dataset:

  • New York Times Annotated Corpus: 125 topics

  • BBC Dataset: 30 topics

  • DBpedia Abstracts Dataset: 1000 topics

To conduct the survey and manage its results, we first prepared a Google Doc with 60 entity and 60 topic intrusion tasks, but we quickly realized it was much too cumbersome and time consuming to complete: our first subject was not finished after 3 h. To address this, we implemented a web service that instead displays one entity and one topic intrusion task at a time. Each is chosen randomly from the set of tasks with the lowest number of answers, so that the answer distribution stays as balanced as possible. Conducting the evaluation in this manner is acceptable since Chang et al. did not profile their results per user either (Tables 2 and 3).
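The task-selection logic of the web service is simple; a sketch with in-memory counters standing in for whatever storage the service actually used:

```python
import random
from collections import Counter

class TaskPool:
    """Serve intrusion tasks so that the answer counts stay as balanced as possible."""

    def __init__(self, task_ids):
        self.answer_counts = Counter({t: 0 for t in task_ids})

    def next_task(self):
        # Among the tasks with the fewest answers so far, pick one at random.
        fewest = min(self.answer_counts.values())
        candidates = [t for t, c in self.answer_counts.items() if c == fewest]
        return random.choice(candidates)

    def record_answer(self, task_id):
        self.answer_counts[task_id] += 1
```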

Table 2. Perplexity of topic models mined on the BBC dataset using different feature types and numbers of clusters.
Table 3. Perplexity of topic models mined on the New York Times Annotated Corpus dataset using different feature types and numbers of clusters.
Table 4. Perplexity of topic models mined on the DBpedia Abstracts dataset using different feature types and numbers of clusters.

4.1 Results

We sent out the link to the questionnaire among fellow computer science and mathematics students, i.e. people with an above-average technical sense and understanding of algorithmic and mathematical concepts. This fact might be reflected in the final results. We closed the survey after receiving 600 answers, which corresponds to 300 answers per task type and thus 5 answers per single task. Our study had 10 participants, a number similar to the 8 in the experiments of Chang et al.

We started the evaluation by assessing the model precision and topic log odds overall, as well as for each dataset separately. Out of 300 entity intrusion answers, 260 were correct, which corresponds to a model precision of 0.87. We achieved an overall topic log odds of \(-4.23\). The value itself is not very expressive, but we see that the BBC dataset performs best. All topic log odds and model precision values are presented in Table 5.

Table 5. Quality of estimated models (BBC 30, NYT 125 and Abstracts 1000) in terms of model precision and topic log odds.
Table 6. Quality of estimated models in terms of model precision and topic log odds for every feature combination.

Next, we calculated and evaluated the model precision separately for every feature combination; the values can be found in Table 6 (left). We were very pleased to notice that the standard bag-of-words model is outperformed by seven enriched models, five of which are pure resource topic models. Also the fact that the two remaining word-based models occupy the first two places shows the meaningfulness of enriching topic models.

Next we took a look at topic log odds. Similarly to the entity intrusion case, in topic intrusion the baseline (bag-of-words model) is also outperformed, in this case by only five models out of nine, but all of them are pure resource topic models. The values can be found in Table 6 (right). These are very good outcomes that confirm our initial assumption that incorporating background knowledge into the feature vectors brings additional context and makes the topic model easier to understand.

Fig. 1. Histograms of the model precisions of all 30 topics (left) and the topic log odds of all 60 documents (right). On both histograms we present several example topics and documents for the corresponding values, mp and tlo, respectively.

Figure 1 depicts histograms of both measures. Alongside the accumulated values of mp and tlo, we present example topics and documents accounting for certain values. We can see that the topic about technology companies is easily separable from the entity London Stock Exchange, while University, as an intruder in a topic about fish species, is harder to spot and “infiltrates” the topic better. On the right side, in the topic log odds histogram, we see that the document about rugby and the protest against a bad referee is easy to decompose into a topical mixture, while the DBpedia Abstracts document about a 2007 America’s Cup sailing regatta is a very specific piece of text.

Apart from obtaining measurable, quantitative results from the intrusion-based evaluation, we also told the survey participants that we would be happy to receive feedback in free form if they had remarks about the tasks they saw. As expected in such cases, only a few (three) decided to make the extra effort and use this option. However, all three stated unanimously that the right answers to the topic intrusion tasks were often guessed not from the descriptiveness of the presented text itself, but rather by searching for an outlier within the group of four topics, with no regard to the text.

To illustrate this phenomenon with an example, let us first take the following topic intrusion task, generated from the model mined on the NYT Annotated Corpus dataset:

Of the 35 members in the Millrose Hall of Fame, all but one are renowned athletes. The only exception is Fred Schmertz, the meet director for 41 years.

Now, Schmertz will be joined by his son, Howard, 81, who succeeded him and directed the Millrose Games, the world’s most celebrated indoor track and field meet, for 29 years, until 2003.

The 100th Millrose Games, to be held Friday night at Madison Square Garden, will be the 73rd for Howard Schmertz, now the meet director emeritus. He will be honored tomorrow night as this year’s only Hall of Fame inductee.

FRANK LITSKY

SPORTS BRIEFING: TRACK AND FIELD

The generated answers can be seen below; the intruder (marked in the original survey) is the last one.

  • dbo:City, dbo:Settlement, dbo:AdministrativeRegion, dbo:Person, dbo:OfficeHolder, dbo:Company, dbo:Town, dbo:EthnicGroup

  • dbo:City, dbo:Settlement, dbo:AdministrativeRegion, dbo:Person, dbo:University, dbo:Disease, dbo:Town, dbo:Magazine

  • dbo:Person, dbo:Film, dbo:Settlement, dbo:City, dbo:Magazine, dbo:Company, dbo:TelevisionShow, dbo:Newspaper

  • dbo:Country, dbo:PoliticalParty, dbo:MilitaryUnit, dbo:Settlement, dbo:Person, dbo:OfficeHolder, dbo:Weapon, dbo:MilitaryConflict

Now let us explain the thought process that led to selecting the right answer; these are our own observations, confirmed by the survey participants. First, we see that the first two topics are very similar, which makes it very unlikely that one of them is the intruder. Several of the third topic’s top words are present either in the top words of the first (dbo:City, dbo:Company) or the second (dbo:Settlement, dbo:City, dbo:Magazine) topic, and the ontology class Newspaper seems related to the class Magazine in human understanding. Following this reasoning we conclude that the last topic is the intruder, as it contains entities related to politics and military strategy. Unfortunately, this strategy is the only way to choose the correct answer for this task, because all four answers seem unrelated to the presented document.

While evaluating and trying to understand the results, we came across another topic intrusion task, from the NYT dataset, which is a perfect example of how counter-intuitive and uninterpretable the topics can be:

BASEBALL

American League TEXAS RANGERS–Acquired RHP Brandon McCarthy and OF David Paisano from the Chicago White Sox for RHP John Danks, RHP Nick Masset and RHP Jacob Rasner.

National League MILWAUKEE BREWERS–Agreed to terms with RHP Jeff Suppan on a four-year contract.

FOOTBALL

National Football League MINNESOTA VIKINGS–Agreed to terms with DT Kevin Williams on a seven-year contract extension. PITTSBURGH STEELERS–Signed LB Richard Seigler from the practice squad. Released WR Walter Young.

HOCKEY

National Hockey League (...)

The possible answers can be seen below; the number of votes per answer is given in parentheses. The intruder (marked in the original survey) is the last one.

  • db:Agent, db:Preparation, db:Subtype, db:Epidemic, db:Birds, db:State, db:Agency, db:Fowl (2)

  • db:System, db:Company, db:Software, db:Computer, db:Product, db:Application, db:Device, db:Transmission (3)

  • db:Team, db:Player, db:Goaltender, db:Disk, db:Tournament, db:Position, db:Trophy, db:Hockey (0)

  • db:Club, db:Footballer, db:Sport, db:Cup, db:League, db:Team, db:Competition, db:Player (0)

To our great surprise, we noticed that the intruder is the last answer, one of only two topics that we would associate with sports at all. This example clearly shows how difficult a task it is to interpret associations between documents and enriched topics.

5 Conclusion and Future Work

The results we achieved are very satisfying and promising for the area of topic models enriched with background knowledge, which is still relatively new and has not been the subject of much open research so far. In our experimental setup we showed that our approach outperforms the established bag-of-words models, i.e. the enrichment step we perform and the injection of linked data while generating the feature vectors make the topic models more understandable for humans. This is a very meaningful result, since interpreting and labeling topics is an interesting and popular research area, and having topics that contain resources linked to knowledge graphs would open up completely new possibilities in this domain.

Automatic topic labeling would not be the only research and application area where enriched topic models could be used. We imagine scenarios where we apply enriched topic modeling to attach huge text corpora to knowledge bases and thus obtain systems that automatically “classify” unseen documents and place them as mixtures of subsets of the knowledge graph.

Assessing the quality of hierarchical topic models was out of scope for this work. However, if we define a measure for the goodness of topical hierarchies and manage to estimate enriched models that satisfy certain conditions (e.g. perform similarly well as their bag-of-words counterparts), we could attach the mined topical hierarchies to knowledge bases. In this way we could conduct fully automatic taxonomy mining from arbitrary document collections and thus offer support for experts in any domain.

Naturally, the evaluation did not bring only positive outcomes; we encountered some issues that hinder unlocking the full potential of our approach. First of all, a vital obstacle for enriched models and pure resource models is the length of the documents. This problem plays an even bigger role in our approach than in pure word-based models: when texts are too short and the NER system thus fails to deliver a reasonable number of resources, the inferencer is unable to predict a coverage that is proper in terms of human understanding, because the feature vector is far too sparse. To overcome this issue, we could try to mine a model from a Wikipedia dump, or at least a representative and reasonably sized subset of it, in order to obtain a universal, multi-domain topic model. Running a NER system such as DBpedia Spotlight on Wikipedia articles (and perhaps experimenting with tuning its parameters, such as support and confidence) could result in much denser feature vectors of higher quality.

Second, we find that the fact that the topic models mined on DBpedia Abstracts perform worse might, apart from the average document length, be influenced by the number of topics in the model. In this work we focused on feature engineering and feature selection, i.e. defining new features from background knowledge and investigating which subset of them performs best for a fixed number of topics. More patience when selecting the k’s and tailoring them better to every dataset should improve the quality of the topic models.

Another important thing to note is that we did not touch the topic modeling algorithm itself; we only changed the way a document is represented as a feature vector. Digging deeper into LDA in order to differentiate between words and resources, and perhaps applying graph-based reasoning, would be another idea worth experimenting with in the future in pursuit of good enriched topic models.

One more aspect we consider worth mentioning and discussing is the philosophical nature of this work. We used human judgement to assess the quality of topic models. Chang et al. concluded that human judgement does not correlate with the established information-theoretic measures of topic models. However, human judgement itself is never uniquely defined and depends on knowledge, cultural background, etc. It is very subjective, and humans can easily discuss and argue over a topic, its interpretation and its usefulness.

Unfortunately, performing a very thorough evaluation of this kind was beyond our possibilities. The generated tasks, especially the topic intrusion tasks, were very time consuming: not only did they require thorough readings of the presented text but also careful consideration of the answers. As we already mentioned, completing the full questionnaire (60 tasks of each type) took almost four hours. Our vision and a future direction to overcome this issue would be to arrange a setup with many more topic models (several k’s per dataset) and, given appropriate funding, use Amazon Mechanical Turk, a platform where human workers are paid a small cash reward for each completed task. Not only would this allow us to measure topic interpretability on a bigger, more representative group of subjects, but also, as already mentioned, to examine more models. Performing such an extended evaluation could confirm which dataset produces better topics; for now we only know that BBC performed best in this particular setup, but maybe the k’s we selected were not optimal for the chosen datasets. Also, we used a particular subset of the New York Times Annotated Corpus for the evaluation; maybe our choice was unfortunate and lowered the quality of the estimated models.

The results achieved in this work justify enriching topic models with background knowledge. Even though rather basic, our approach of extending the bag of words shows the potential that enriched topic models hold. Digging deeper into the algorithm and differentiating between words and resources could further improve the quality of the estimated models. Given more research, interest and funding, enriched topic models could in the near future develop into more sophisticated methods applied for numerous purposes in multiple domains.