1 Introduction

Multi-view text data consist of texts from different information sources. A view of an instance refers to the part that comes from one information source. For example, in an English-Japanese bilingual corpus, a document has two views: the English article and the Japanese article. Multi-view text data are considered comparable if the views of a document are descriptions of the same target. Multi-view topic modeling is the task of extracting aligned topics from comparable multi-view text data, i.e., tuples of semantically similar topics across views. Aligned topics facilitate the construction of bilingual lexicons of semantically related words, which are useful in cross-lingual information retrieval [16]. Aligned topics can also be used to transfer knowledge from one language to another in cross-lingual document classification [5]. Moreover, on data consisting of texts and social annotations, the complementary information in tags can be utilized with aligned topics to improve the performance of clustering tasks [14].

In a multi-view topic model, a view of a document is modeled as a mixture of topics, which are categorical distributions over words. The mixture weights, often called topic proportions, can be considered a low-dimensional representation of a document. Existing multi-view topic models often assume that different views of the same document are semantically consistent. Under this assumption, topic proportions are shared across all views of a document [11]. However, for multilingual corpora that are managed in a distributed fashion, this assumption does not necessarily hold. For example, since articles in different Wikipedia languages are usually managed by different communities, they often differ in details. Figure 6 shows an example of this. The bilingual document contains a Japanese article and a Finnish article about Orne, a province in France. Compared to the Japanese article, the Finnish article contains more information about the history of Orne, so it should have larger weights for history-related topics.

Documents whose views have inconsistent topic proportions can be regarded as multi-view anomalies [8]. Although inconsistency in content should lead to differences in topic proportions, existing models cannot capture it while learning low-dimensional representations of documents. In this paper we propose a multi-view topic model that models the data and detects anomalous instances simultaneously. An appropriate number of topic proportion variables is inferred for anomalous instances to model their inconsistent views. As a result, the proposed model is more robust to multi-view anomalies and is also applicable to the multi-view anomaly detection task. The proposed model is beneficial in at least two applications. In large enterprises with global business, managing information consistency in multilingual documents is an important but expensive task [6]. The cost of management can be reduced if anomalous documents are detected automatically. In cross-cultural analysis [9], documents with inconsistent views are used to analyze cultural differences. The cost of obtaining such samples can be reduced by using the proposed model to identify anomalous documents in large datasets automatically.

In the proposed model, documents that contain inconsistent views are regarded as anomalies, and such views have distinct topic proportions. Views of a non-anomalous document share the same topic proportions variable. We use the Dirichlet process as the prior for topic proportions variables, so that the appropriate number of topic proportions variables is inferred for each document. Based on collapsed Gibbs sampling, we derive an efficient inference procedure for the proposed model. To our knowledge, this is the first model that addresses the problem of multi-view anomaly detection in the topic modeling literature. The performance of the proposed model is examined on ten bilingual Wikipedia corpora. It is demonstrated that the proposed model is more robust than existing multi-view topic models in terms of held-out perplexity. In addition, compared to existing multi-view anomaly detection methods, the proposed model is more efficient and achieves higher anomaly detection performance on multi-view text data.

The rest of this paper is organized as follows. Section 2 reviews related work on topic modeling and multi-view anomaly detection. The proposed model and its inference method are presented in Sect. 3. Section 4 evaluates the models' generalization ability in terms of held-out perplexity on Wikipedia corpora. Section 5 evaluates multi-view anomaly detection. In Sect. 6, examples of aligned topics and multi-view anomalies in a Wikipedia corpus are presented. We conclude this paper in Sect. 7.

2 Related Work

Topic models, such as Latent Dirichlet Allocation (LDA) [4], are analysis tools for discrete data. The Polylingual Topic Model (PLTM) [11] is an extension of LDA to the comparable multi-view setting and has been demonstrated to be useful in various applications, such as cross-cultural understanding, cross-lingual semantic similarity calculation and cross-lingual document classification [16]. Based on the fact that the views of a document are information about the same target from different perspectives, PLTM aligns topics of different views by sharing the topic proportions variable among all views of a document. While information in different views is thus utilized jointly, this model assumption is not valid for data that contain multi-view anomalies. Correspondence LDA [3] and symmetric correspondence LDA [7] are other kinds of multi-view topic models, which extract direct correspondences between topics of different views. However, these models infer distinct topic proportions variables for the views of the same document regardless of view consistency. Hence they are not applicable to detecting multi-view anomalies or to obtaining low-dimensional representations of multi-view documents. Moreover, existing models align topics without considering view consistency, so their performance may degrade on noisy data that contain many multi-view anomalies.

Various methods can be applied to the task of multi-view anomaly detection. In probabilistic canonical correlation analysis (PCCA) [2], a latent vector shared among all views and a projection matrix for each view are estimated. The reconstruction error can be used as an anomaly score, based on the idea that a high reconstruction error indicates inconsistent views. In [10] the authors propose a robust version of PCCA that detects multi-view anomalies while estimating parameters. Nevertheless, this model assumes Gaussian noise, so it may not be suitable for textual data. Moreover, textual data have high-dimensional features, which leads to efficiency issues when applying that method.

3 Proposed Model

3.1 Generative Process

Suppose there are D documents, each of which contains L views. In the proposed model, the views of a document are grouped into clusters, and each document can have a countably infinite number of clusters. A topic proportions vector \(\theta _{dy}\) is generated for each cluster y in document d, and it is then used to generate the words in each view belonging to y. As a result, views in the same cluster share the same topic proportions vector, and views belonging to different clusters have distinct topic proportions vectors. Consequently, multi-view anomalies are identified by the number of clusters they have: a document is normal if it has only one cluster, and is an anomaly if it has more than one cluster.

Specifically, we use the stick-breaking process [15] to generate clusters and the cluster assignments of views. The probability that a view belongs to a cluster is related to the proportions of its words' topic assignments. In anomalous documents, these proportions differ across views, causing the views to be assigned to different clusters. Meanwhile, in normal documents the proportions of the views are similar, so the views are assigned to the same cluster.

The generative process of the proposed model is described as follows, and the graphical model representation is shown in Fig. 1.

Fig. 1. Graphical model representation of the proposed model.

For each \(\ell =1, 2,\dots , L\) and \(k=1, 2,\dots , K\), generate a topic \(\phi _{\ell k} \in R^{V_\ell }\) with a symmetric prior \(\beta _\ell e\in R^{V_\ell }\), where \(\beta _\ell \in R\) and \(e\in R^{V_\ell }\) is the all-ones vector. \(V_\ell \) is the number of unique words in view \(\ell \).

$$\begin{aligned} \phi _{\ell k} \sim \mathrm{Dirichlet}(\beta _\ell ). \end{aligned}$$
(1)

For each document d, generate mixture weights \(\pi _d\) by the stick-breaking process with concentration parameter \(\gamma \), which generates the mixture weights of the Dirichlet process.

$$\begin{aligned} \pi _d\sim \mathrm{Stick}(\gamma ). \end{aligned}$$
(2)

For each view \(\ell \) of document d, generate the cluster assignment \(s_{d\ell }\) from \(\pi _d\):

$$\begin{aligned} s_{d\ell } \sim \mathrm{Category}(\pi _d). \end{aligned}$$
(3)

Then generate topic proportions \(\theta _{dy}\) for each cluster y of document d using an asymmetric prior \(\alpha \in R^K\):

$$\begin{aligned} \theta _{dy} \sim \mathrm{Dirichlet}(\alpha ). \end{aligned}$$
(4)

Finally, generate the topic assignment \(z_{d\ell n}\) of the \(n^{th}\) word in view \(\ell \) of d, and the corresponding word \(w_{d\ell n}\), for \(n=1, \dots , N_{d\ell }\), where \(N_{d\ell }\) is the number of words in view \(\ell \) of document d.

$$\begin{aligned} z_{d\ell n} \sim \mathrm{Category}(\theta _{ds_{d\ell }}), \end{aligned}$$
(5)
$$\begin{aligned} w_{d\ell n} \sim \mathrm{Category}(\phi _{\ell z_{d\ell n}}). \end{aligned}$$
(6)
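To make the generative process concrete, the following is a minimal sketch in Python/NumPy under an illustrative truncated stick-breaking approximation. The truncation level `max_clusters`, the function names and the toy sizes are our own assumptions, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, max_clusters, rng):
    """Truncated stick-breaking weights pi_d (Eq. 2)."""
    b = rng.beta(1.0, gamma, size=max_clusters)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
    return b * remaining

def generate_document(phi, alpha, gamma, n_words, max_clusters=10, rng=rng):
    """Generate the views of one document; phi[l] is the K x V_l topic matrix of view l."""
    L, K = len(phi), phi[0].shape[0]
    pi = stick_breaking(gamma, max_clusters, rng)               # Eq. 2
    s = rng.choice(max_clusters, size=L, p=pi / pi.sum())       # Eq. 3
    theta = rng.dirichlet(alpha, size=max_clusters)             # Eq. 4
    views = []
    for ell in range(L):
        z = rng.choice(K, size=n_words[ell], p=theta[s[ell]])   # Eq. 5
        w = np.array([rng.choice(phi[ell].shape[1], p=phi[ell][k])
                      for k in z])                               # Eq. 6
        views.append((z, w))
    return s, theta, views

# Toy usage: 2 views, K = 3 topics, vocabularies of size 20 and 15.
K = 3
phi = [rng.dirichlet(np.full(V, 0.05), size=K) for V in (20, 15)]  # Eq. 1
s, theta, views = generate_document(phi, alpha=np.full(K, 0.05),
                                    gamma=0.05, n_words=[30, 25])
```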

3.2 Inference

Collapsed Gibbs Sampling. In the following inference procedure, \(\theta \), \(\pi \) and \(\phi \) are marginalized out by Dirichlet-multinomial conjugacy. Denote by Z the topic assignments of all words. Denote by \(S_d\) the cluster assignments of all views in document d, and by S the cluster assignments of all documents. To simplify expressions, denote the subscript \(d\ell n\) by J, and use \(\setminus J\) to refer to the counts remaining after removing \(z_{d\ell n}\). Similarly, use \(\setminus d\ell \) to refer to what remains of a cluster in d after view \(\ell \) is removed. For example, \(y\setminus d\ell \) refers to the rest of cluster y after removing view \(\ell \); if view \(\ell \) is not in y, then y is not modified.

Given S and \(Z_{\setminus J}\), Eq. 7 is used to sample a new value for \(z_{d\ell n}\). Denote by \(N_{\ell kt}\) the number of times word t in view \(\ell \) is assigned to topic k. Use \(N_{d\ell k}\) and \(N_{dyk}\) to denote the numbers of words in view \(\ell \) and in cluster y of document d, respectively, that are assigned to topic k. Denote by \(N_{\ell k}\) the number of words in view \(\ell \) that are assigned to topic k.

$$\begin{aligned} P(z_{d\ell n} = k \mid Z_{\setminus J}, S ) \propto (N_{ds_{d\ell }k \setminus J} + \alpha _k)\frac{N_{\ell kw_{d\ell n}\setminus J}+\beta _\ell }{N_{\ell k\setminus J}+\beta _\ell V_\ell }. \end{aligned}$$
(7)
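A possible implementation of the update in Eq. 7 for a single word is sketched below. It assumes the count arrays already exclude the current assignment (the \(\setminus J\) operation), and all array names are illustrative.

```python
import numpy as np

def sample_z(rng, word, n_cluster_topic, n_view_topic_word, n_view_topic,
             alpha, beta_l, V_l):
    """Sample a new topic for one word according to Eq. 7.

    n_cluster_topic:    length-K counts N_{d s_dl k}, current word removed
    n_view_topic_word:  K x V_l counts  N_{l k t},    current word removed
    n_view_topic:       length-K counts N_{l k},      current word removed
    """
    p = (n_cluster_topic + alpha) \
        * (n_view_topic_word[:, word] + beta_l) / (n_view_topic + beta_l * V_l)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```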

For each document d, given Z and \(S_{d\setminus d\ell }\), Eq. 8 is used to sample a new value for \(s_{d\ell }\). View \(\ell \) of document d can be assigned to an existing cluster y or to a new cluster \(\tilde{y}\). Denote by \(N_{dy}\) the number of words cluster y contains, and by \(N_{dyk}\) the number of words in y that are assigned to topic k. Denote by \(L_{dy}\) the number of views in cluster y of document d. \(\bar{\alpha }=\sum _{k=1}^{K}\alpha _{k}\), and \(\varGamma (\cdot )\) is the gamma function.

$$\begin{aligned} \begin{aligned}&P(s_{d\ell }=y \mid Z,S_{d\setminus d\ell }) \propto L_{dy\setminus d\ell } \\&\times \left[ \prod \limits _{k:N_{d\ell k}>0}\frac{\varGamma (N_{dyk\setminus d\ell } + N_{d\ell k} + \alpha _k)}{\varGamma (N_{dyk\setminus d\ell } + \alpha _k)}\right] \frac{\varGamma (N_{dy \setminus d\ell } +\bar{\alpha })}{\varGamma (N_{dy} + \bar{\alpha })},\\&P(s_{d\ell } = \tilde{y} \mid Z,S_{d\setminus d\ell }) \propto \gamma \\&\times \left[ \prod \limits _{k:N_{d\ell k}>0}\frac{\varGamma (N_{d\ell k} + \alpha _k)}{\varGamma ( \alpha _k)}\right] \frac{\varGamma (\bar{\alpha })}{\varGamma (N_{d\ell } +\bar{\alpha })}. \end{aligned} \end{aligned}$$
(8)
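The cluster assignment update of Eq. 8 can be evaluated in log space for numerical stability. A sketch follows; the data structure representing existing clusters (a view count and topic counts, both with view \(\ell \) already removed) is our own illustrative choice.

```python
import numpy as np
from scipy.special import gammaln

def sample_cluster(rng, n_dlk, clusters, gamma, alpha):
    """Sample s_{dl} for one view according to Eq. 8.

    n_dlk:    length-K topic counts of view l (N_{dl k})
    clusters: list of (L_dy, N_dyk) for existing clusters, view l excluded
    """
    alpha_bar, n_dl = alpha.sum(), n_dlk.sum()
    logp = []
    for l_dy, n_dyk in clusters:                      # existing cluster y
        logp.append(np.log(l_dy)
                    + np.sum(gammaln(n_dyk + n_dlk + alpha)
                             - gammaln(n_dyk + alpha))
                    + gammaln(n_dyk.sum() + alpha_bar)
                    - gammaln(n_dyk.sum() + n_dl + alpha_bar))
    logp.append(np.log(gamma)                         # new cluster
                + np.sum(gammaln(n_dlk + alpha) - gammaln(alpha))
                + gammaln(alpha_bar) - gammaln(n_dl + alpha_bar))
    logp = np.array(logp)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)                    # last index means a new cluster
```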

Hyper-parameter Estimation. The hyper-parameters \(\alpha \) and \(\beta \) smooth word counts in inference. They can either be set to small values or optimized by placing Gamma priors on them and using the fixed-point iteration method [12]. As demonstrated in [1], the latter approach reduces performance differences caused by the learning algorithm. Thus we optimize these hyper-parameters using the approach introduced in [12], as in Eq. 9. \(Y_d\) denotes the set of clusters in document d, and \(\varPsi (\cdot )\) is the digamma function.

$$\begin{aligned} \begin{aligned}&\alpha _k^\mathrm{new} = \alpha _k \frac{\sum \limits _{d=1}^{D}(\sum \limits _{y \in Y_d}{\varPsi (N_{dyk}+\alpha _k) - \left| {Y_d}\right| \varPsi (\alpha _k))}}{\sum \limits _{d=1}^{D}(\sum \limits _{y \in Y_d} \varPsi (N_{dy}+ \bar{\alpha }) - \left| {Y_d}\right| \varPsi (\bar{\alpha }))} \\&\beta _\ell ^\mathrm{new} = \beta _\ell \frac{ \sum \limits _{k=1}^{K} \sum \limits _{t=1}^{V_\ell } \varPsi (N_{\ell kt} +\beta _\ell ) - KV_\ell \varPsi (\beta _\ell )}{V_\ell \sum \limits _{k=1}^K \varPsi (N_{\ell k}+V_\ell \beta _\ell ) - KV_\ell \varPsi (V_\ell \beta _\ell )} \end{aligned} \end{aligned}$$
(9)
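For reference, one fixed-point step of the \(\alpha \) update in Eq. 9 could be implemented as below (a sketch in the standard Minka fixed-point form; the data structure holding the per-cluster topic counts is our own assumption).

```python
import numpy as np
from scipy.special import digamma

def update_alpha(alpha, cluster_topic_counts):
    """One fixed-point step for alpha (first line of Eq. 9).

    cluster_topic_counts: list over documents; entry d is a |Y_d| x K array
    holding the counts N_{dyk} for the clusters of document d.
    """
    alpha_bar = alpha.sum()
    num = np.zeros_like(alpha)
    den = 0.0
    for n_d in cluster_topic_counts:
        num += digamma(n_d + alpha).sum(axis=0) - n_d.shape[0] * digamma(alpha)
        den += np.sum(digamma(n_d.sum(axis=1) + alpha_bar) - digamma(alpha_bar))
    return alpha * num / den
```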

Estimation of \(\varvec{\varTheta }\) and \(\varvec{\varPhi }\). After iteratively sampling and updating hyper-parameters, point estimates for \(\varTheta \) and \(\varPhi \) are made:

$$\begin{aligned} \begin{aligned}&\theta _{dyk} = \frac{N_{dyk} + \alpha _k}{N_{dy} +\bar{\alpha }}, \\&\phi _{\ell kt} = \frac{\beta _\ell + N_{\ell kt}}{N_{\ell k} + V_\ell \beta _\ell }. \end{aligned} \end{aligned}$$
(10)
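As a small sketch, the point estimates of Eq. 10 can be read off the final count arrays directly; the array names below are illustrative.

```python
import numpy as np

def point_estimates(n_dyk, alpha, n_lkt, beta_l):
    """Point estimates from Eq. 10.

    n_dyk:  length-K topic counts of one cluster y in document d
    n_lkt:  K x V_l topic-word counts of view l
    """
    theta_dy = (n_dyk + alpha) / (n_dyk.sum() + alpha.sum())
    phi_l = (n_lkt + beta_l) / (n_lkt.sum(axis=1, keepdims=True)
                                + beta_l * n_lkt.shape[1])
    return theta_dy, phi_l
```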

Anomaly Score. Because view consistency is modeled stochastically using the Dirichlet process, we use the probability that a document has more than one cluster as its anomaly score. A high value indicates that the views of a document tend to diverge, so it is probably a multi-view anomaly. As shown in Eq. 11, this anomaly score is estimated from samples of S generated by the Gibbs sampler of Eq. 8. T is the total number of iterations in model training, \(\left| {Y_{d}^{(t)}}\right| \) is the number of clusters in document d at iteration t, and \(\mathrm{I}(\cdot )\) is the indicator function. In the experiments we use a sufficiently large T to ensure that \(\mathrm{score}_d\) converges.

$$\begin{aligned} \mathrm{score}_d = \frac{1}{T} \sum _{t=1}^{T}\mathrm{I}( \left| {Y_{d}^{(t)}}\right| >1). \end{aligned}$$
(11)
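The anomaly score of Eq. 11 is then a simple average over Gibbs samples; a minimal sketch, assuming the cluster counts have been collected into one array, is given below.

```python
import numpy as np

def anomaly_scores(n_clusters_per_iter):
    """Estimate score_d of Eq. 11 from T Gibbs samples.

    n_clusters_per_iter: T x D array whose (t, d) entry is |Y_d^(t)|,
    the number of clusters of document d at iteration t.
    """
    return (np.asarray(n_clusters_per_iter) > 1).mean(axis=0)
```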

4 Held-Out Perplexity Evaluation

4.1 Dataset

We collect 34024 articles in Japanese, German, French, Italian, English, and Finnish from Wikipedia. The data are preprocessed by removing general stop words and corpus stop words, i.e., words with frequency larger than 3402. We also remove words with frequency lower than 100 to reduce the size of the vocabulary. After preprocessing, the vocabulary sizes of the languages are 12148, 17375, 12813, 16291, 22500, and 7910, respectively. From this corpus we select ten bilingual corpora for the experiments: Japanese - Finnish, Japanese - German, Japanese - French, Japanese - Italian, Japanese - English, English - German, English - Finnish, English - Japanese, English - French and English - Italian. We filter out article pairs in which both views are shorter than five words. The numbers of documents in these ten bilingual corpora are 33652, 33668, 33658, 33653, 33854, 33829, 33813, 33854, 33822, and 33814. From each bilingual corpus, ten datasets of 5000 documents each are randomly sampled for the experiments.
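A sketch of the frequency-based filtering described above follows; the thresholds are as stated in the text, while tokenization and the general stop-word list are outside this sketch.

```python
from collections import Counter

def filter_vocabulary(documents, max_freq=3402, min_freq=100):
    """Keep only words whose corpus frequency lies in [min_freq, max_freq]."""
    freq = Counter(w for doc in documents for w in doc)
    keep = {w for w, c in freq.items() if min_freq <= c <= max_freq}
    return [[w for w in doc if w in keep] for doc in documents]
```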

To quantitatively examine the models' performance when multi-view anomalies are present, view-swapping is performed to generate multi-view anomalies, as in [8, 10]. Specifically, \(10\%, 20\%, 30\%, 40\%, 50\%\) of the documents in each dataset are randomly selected as anomalies, and their views are swapped. As a result, these datasets contain multi-view anomalies with ratios \(10\%, \dots , 50\%\). Because the data of each view are not modified, these datasets can be used to investigate the models' robustness against multi-view anomalies.
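The view-swapping procedure could be sketched as below, where we permute the second view among the selected documents. A random permutation may occasionally map a document to itself, which a strict derangement would avoid, so this is only an approximation of the procedure in [8, 10].

```python
import numpy as np

def inject_anomalies(second_views, ratio, rng):
    """Swap the second view among a randomly selected subset of documents.

    second_views: list of second-view documents.  Returns the corrupted list
    and the indices of the documents treated as multi-view anomalies.
    """
    n = len(second_views)
    idx = rng.choice(n, size=int(ratio * n), replace=False)
    perm = rng.permutation(idx)
    corrupted = list(second_views)
    for i, j in zip(idx, perm):
        corrupted[i] = second_views[j]
    return corrupted, idx
```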

4.2 Settings

Perplexity on a held-out corpus is used as the evaluation metric; low perplexity indicates good generalization ability. The proposed model is compared with PLTM and CorrLDA to examine the effect of anomaly detection on multi-view topic modeling.

Perplexity is calculated using Eq. 12. Since the perplexity of CorrLDA depends on the choice of the pivot view, we report the average over different choices of pivot view. The held-out corpus is constructed by randomly selecting \(20\%\) of the documents and then randomly selecting half of their words in each view. Denote the set of indices of the chosen documents by \(D^\mathrm{test}\), the set of words chosen in view \(\ell \) of document d by \(w_{d\ell }^\mathrm{test}\), and the total number of chosen words by \(N^\mathrm{test}\).

$$\begin{aligned} \mathrm{perplexity} = \mathrm{exp} \left( -\frac{\sum \limits _{d \in D^\mathrm{test}}\sum \limits _{\ell =1}^L\sum \limits _{t \in w_{d\ell }^\mathrm{test}}\mathrm{ln}(\sum \limits _{k=1}^K \theta _{ds_{d\ell }k}\phi _{\ell kt})}{N^\mathrm{test}}\right) \end{aligned}$$
(12)
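A sketch of evaluating Eq. 12 given the estimated parameters is shown below; the dictionary-based data layout is an illustrative assumption.

```python
import numpy as np

def held_out_perplexity(test_words, theta, phi, s):
    """Compute the held-out perplexity of Eq. 12.

    test_words[d][l]: held-out word ids in view l of test document d
    theta[d][y]:      topic proportions of cluster y in document d, shape (K,)
    phi[l]:           K x V_l topic-word matrix of view l
    s[d][l]:          cluster assignment of view l in document d
    """
    log_lik, n_test = 0.0, 0
    for d, doc in test_words.items():
        for ell, words in enumerate(doc):
            p_word = theta[d][s[d][ell]] @ phi[ell]   # mixture over topics
            log_lik += np.sum(np.log(p_word[words]))
            n_test += len(words)
    return np.exp(-log_lik / n_test)
```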

In all experiments, the initial values of \(\alpha _k\), \(\beta \) and \(\gamma \) are set to 0.05. Gibbs sampling is run for 1000 iterations. The proposed model is initialized by using a single cluster for every document in the first 256 iterations; after that, parameters are learned using the procedures described in Sect. 3.2.

4.3 Results

Fig. 2. Average held-out perplexities and their standard errors on the Japanese - Finnish dataset with 30% multi-view anomalies.

Fig. 3. Average held-out perplexities and their standard errors on the Japanese - Finnish dataset when the anomaly ratio varies. K = 700.

Figure 2 shows the average held-out perplexities and their standard errors on the Japanese - Finnish dataset containing \(30\%\) multi-view anomalies. The number of topics K varies from 100 to 700. For the same K, the proposed model always achieves the lowest perplexity. Since perplexities stop decreasing once K reaches 700, further increasing the number of topics provides no improvement in generalization ability. Thus, when multi-view anomalies exist, the proposed model outperforms all alternative methods irrespective of the number of topics.

Figure 3 shows the average held-out perplexities and their standard errors on the Japanese - Finnish corpus as the anomaly ratio varies from 0 to 0.5. As the anomaly ratio increases, the perplexities of CorrLDA and PLTM increase significantly. Because view-swapping does not modify the content of each view, this performance degradation can only result from inconsistency among views. Meanwhile, the perplexity of the proposed model increases very slowly as the anomaly ratio increases. Note that in Fig. 2, the proposed model has the lowest perplexity regardless of K. We conclude that the proposed model has the best generalization ability on this bilingual dataset when multi-view anomalies are present.

Table 1. Average held-out perplexities and their standard errors on 10 bilingual corpora

Table 1 shows the average held-out perplexities and their standard errors on all ten bilingual corpora with 30% multi-view anomalies and K = 700. As shown in Figs. 2 and 3, CorrLDA is not suitable for these corpora, so its results are not reported. On all corpora the held-out perplexities of the proposed model are significantly lower than those of PLTM. Hence the proposed model's superiority over PLTM on noisy multilingual corpora is language-independent.

5 Multi-view Anomaly Detection

5.1 Settings

The area under the ROC curve (AUC) is used as the evaluation metric for multi-view anomaly detection. A high AUC indicates that a method discriminates anomalous instances from non-anomalous instances well.

The proposed model is compared with the robust version of CCA proposed in [10] (RCCA), one-class SVM (OCSVM) and PLTM. RCCA is included in the comparison because it also uses the Dirichlet process and is reported to be effective on continuous data. OCSVM is a representative method for single-view anomaly detection; it is included in the experiments to investigate whether single-view anomaly detection methods are also applicable to detecting multi-view anomalies. In the experiments, the OCSVM implementation in the scikit-learn package [13] with a radial-basis function kernel is used. It is applied to the multilingual setting by using bag-of-words representations and appending one view to the end of the other.
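A sketch of how this baseline could be set up with scikit-learn is given below. The vectorizer settings and the default OneClassSVM parameters are our own assumptions, as the exact configuration is not specified here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import OneClassSVM

def ocsvm_anomaly_scores(view_a_texts, view_b_texts):
    """Concatenate bag-of-words features of the two views and score documents."""
    xa = CountVectorizer().fit_transform(view_a_texts).toarray()
    xb = CountVectorizer().fit_transform(view_b_texts).toarray()
    x = np.hstack([xa, xb])
    clf = OneClassSVM(kernel="rbf").fit(x)
    return -clf.decision_function(x)      # larger value = more anomalous
```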

We also report results obtained by using PLTM's perplexity on the training data as an anomaly score. Since the model assumption of PLTM is not valid on anomalous documents, the perplexities of such documents are higher than those of non-anomalous documents. Because the proposed model reduces to PLTM if the cluster assignments of views are fixed to be identical, the comparison between the proposed model and this method demonstrates the efficacy of using the Dirichlet process to model view consistency.

5.2 Dataset

As RCCA does not scale well to high-dimensional textual data, we carry out the comparison experiments on smaller data. On the Japanese - Finnish dataset, the vocabulary sizes are reduced to 100 by removing low-frequency words, and documents that have a view shorter than 50 words are removed. From the remaining documents, ten datasets are sampled, each containing 100 documents.

5.3 Results

Fig. 4. AUCs and their standard errors. The dimension of the latent space is set to 8 for RCCA; \(K=8\) is used for the proposed model and PLTM.

Fig. 5. AUCs and their standard errors with the anomaly ratio equal to \(20\%\).

Figure 4 shows the AUC of multi-view anomaly detection as the anomaly ratio varies. The AUCs of RCCA and OCSVM are around 0.5 in all cases, which means they barely discriminate anomalies from non-anomalies. The AUC of PLTM is around 0.6, while that of the proposed method is around 0.7. Thus the proposed model outperforms all alternative methods.

Figure 5 shows the AUC on datasets containing 20% anomalies for various numbers of topics K; K corresponds to the dimension of the latent space in RCCA. It is shown that \(K=4\) is enough for the proposed model and PLTM to achieve their best performance, and the proposed method outperforms all competing methods. Meanwhile, the AUC of RCCA is around 0.5 in all cases, which means that increasing the dimension of the latent space does not improve its anomaly detection performance.

6 Examples of Aligned Topics and Anomalies

In the previous sections we demonstrated the proposed model's efficacy in modeling multi-view text data with manually created multi-view anomalies and in detecting such anomalies. In this section we present topics extracted by the proposed model and an example of a multi-view anomaly detected in the original data.

Table 2. An example of aligned Finnish(fi) and Japanese(ja) topics.

Examples of the most probable words of aligned topics extracted from the original Japanese - Finnish corpus are presented in Table 2. The relatedness between Japanese topics and Finnish topics is observable; for example, the second topic is about business. With these aligned topics, information from the two views can be utilized jointly. For example, the most probable words for the fourth Finnish topic are “team”, “score”, “minutes”, “world” and “seconds”, which may not be as cohesive as the other topics. It can be better interpreted if the corresponding Japanese topic (“team”, “acting”, “competition”, “jump”, “skate”) is considered jointly. With this complementary information, one can figure out that the words in this topic are about sports competitions.

Fig. 6. Article for Orne in Finnish (left) and its counterpart in Japanese (right).

In addition, an example of a multi-view anomaly detected in the original Japanese - Finnish corpus is shown in Fig. 6. The screenshots were captured on February 11th, 2017. The two articles are about Orne, a province in France. While they contain common sections, the Finnish and Japanese articles differ significantly in the history section. For applications in which inconsistency is detrimental, the proposed model can be used to detect and process such documents automatically.

7 Conclusion

Since multi-view text data are often managed in a distributed fashion, they may contain multi-view anomalies that pose a challenge for topic modeling. In this paper, a probabilistic topic model is proposed for multi-view topic modeling that is capable of modeling the joint distribution of views and detecting anomalies simultaneously. In experiments on ten bilingual Wikipedia corpora, the proposed model is demonstrated to be more robust against multi-view anomalies than existing multi-view topic models. In addition, comparisons with other multi-view anomaly detection methods show that the proposed model is more effective on textual data. Future work includes applying the proposed model to multi-modal text data.