1 Introduction

Community question answering (cQA) websites, such as the general-purpose Baidu Zhidao (footnote 1) and Yahoo! Answers (footnote 2) and the vertical StackExchange (footnote 3) and GuoKe (footnote 4), are becoming increasingly popular because anyone can ask, answer, edit, and organize questions on them. Compared with traditional information retrieval techniques, cQA has made headway in solving complex, advice-seeking, and reasoning questions by drawing on user-generated content.

The fast-growing crowdsourced Q&A data has good application and development prospects for understanding complex, implicit, and self-organized answers. However, data quality problems [1,2,3] persist because answerers' backgrounds differ greatly, and low-quality data means that a portion of the data cannot be used directly. Hence, an automatic method for evaluating answer reliability is important for improving user experience and constructing a high-quality Q&A knowledge base.

However, existing supervised approaches to reliability evaluation need large amounts of annotated data, which is time-consuming to produce and limits their applicability to new domains [4]. Unsupervised methods, in turn, depend mainly on the answerer's reputation and achieve low accuracy because they consider few factors.

Therefore, high-accuracy unsupervised methods are needed. In this paper, we propose a novel unsupervised method for answer reliability evaluation that constructs an Answer-User association Network (AUnet). This network captures a variety of factors that affect the reliability of an answer. The contributions of this paper are as follows:

  • We construct AUnet to capture a variety of factors that affect the reliability of an answer. The answer reliability evaluation problem is then formalized as computing the reliability of node variables on a heterogeneous information network.

  • A mutual inference algorithm based on AUnet is proposed to calculate the answer reliability. The reliability of answers and users can be obtained simultaneously by an iterative process without any annotated data.

  • Experiments on four real datasets from StackExchange have been conducted to test the effectiveness of our method. The results show that AUnet achieves the best MRR and MAP among the compared methods in all four domains.

2 Related work

Our work relates to answer reliability evaluation and to network-based trust propagation algorithms.

Research on evaluating answer reliability is mainly divided into supervised and unsupervised methods. Supervised methods, such as Maximum Entropy [5], Logistic Regression [6], and Random Forests [7], evaluate and predict answer reliability by training a classifier on manually annotated data described by features of the answer, such as community, user, textual, and statistical features. Although supervised methods can achieve excellent results, the cost of labeling data is high. Rather than evaluating answer reliability directly, unsupervised methods resort to calculating user authority by mining the relations between users, as in the improved PageRank of [8] and the improved HITS of [9]. In addition, Wu et al. [10] achieved the best results among current unsupervised methods, based on the idea of minimizing the difference among answers. Unsupervised methods require no data annotation, but their accuracy is relatively low.

Network-based trust propagation algorithms are used to identify the trustworthiness of nodes in a network. At present, they are mainly applied to fraud detection, the selection of high-quality comments, and the discovery of authoritative and reliable users [11]. For example, Leman et al. [12] iteratively calculated user reliability with a trust propagation algorithm on a bipartite graph, Li et al. [13] used typed Markov Random Fields to detect campaign promoters on social media, and Ko et al. [2] took as the answer reliability the marginal probability of each answer, inferred from the maximum joint probability distribution on the answer association network. As far as we know, no existing method constructs a trust network that simultaneously models multiple factors affecting answer reliability and then calculates answer reliability by trust propagation on that network.

3 Approach Overview

3.1 Problem Definition and Data Observation

The problem of evaluating answer reliability is formalized as follows. We are given a set of questions \(Q=\{q_1,q_2,...,q_n\} \), a set of all answers \(A=\displaystyle \bigcup _{i=1}^{n} A_i \), where \(A_i=\{a_{i1},a_{i2},...,a_{im_i}\} \) is the set of \(m_i \) answers to the question \( q_i\in Q \), and a set of users \(U=\{u_1,u_2,...,u_k\} \). Our goal is to model the multiple factors that affect answer reliability in a network and to output the answer reliability \( \tau \left( a_{ij}\right) \) of each answer \( a_{ij} \).

Definition 1

(Answer Reliability). We let \( \tau \left( a_{ij}\right) \) denote the reliability of the answer \( a_{ij} \), which indicates the extent to which people trust it [14]. We take \( \tau \left( a_{ij}\right) \in \left[ 0,1\right] \); the closer the value is to 1, the more reliable the answer.

Definition 2

(User Reliability). The user reliability \(\omega (u_k) \) of a user \(u_k \) indicates the probability that the user provides reliable answers, with \(\omega (u_k) \in \left[ 0,1\right] \). The closer the value is to 1, the more reliable the user.

By observing the data, we found that two direct factors and two indirect factors affect the answer reliability.

Direct Factors

  • The number of votes an answer receives affects its reliability. Answers with more votes tend to be more reliable than those with fewer votes. To remove the effect of questions attracting different amounts of attention, we use the share of votes instead of the raw number of votes to represent how strongly the voters on a question support an answer. We let \( fvote_{ij}=\frac{vote_{ij}}{\displaystyle \sum _{j^\prime =1}^{m_i} vote_{ij^\prime }} \) denote the share of votes for the answer \( a_{ij} \), where \( vote_{ij} \) is the number of votes for the answer \( a_{ij} \) and \( m_i \) is the number of all answers to question \( q_i \). Our statistics showed that the average share of votes of the best answer is clearly higher than that of non-best answers.

  • The frequency of core words affects the answer reliability. An answer with clearer expression and more information is more likely to be reliable. We regard a sentence as consisting of uninformative stop words and informative core words, and use the frequency of core words to represent the amount of information an answer conveys: \( fcore_{ij}=\frac{\sum _{n=1}^{N_{ij}} I\left( w_n\right) }{N_{ij}} \), where \( N_{ij} \) is the number of words in the answer \( a_{ij} \) and \( I\left( w_n\right) \) is an indicator function that equals 1 if the word \( w_n \) is a core word and 0 otherwise. We found that the frequency of core words of the best answer is clearly higher than that of non-best answers.
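
The two direct factors can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the stop-word list, the whitespace tokenizer, the function names, and the equal-share fallback for questions without votes are all assumptions, and vote counts are assumed to be non-negative.

```python
from typing import List, Set

# Hypothetical, tiny stop-word list used only for illustration.
STOP_WORDS: Set[str] = {"a", "an", "the", "is", "are", "of", "to", "and", "it", "in"}


def vote_shares(votes: List[int]) -> List[float]:
    """fvote_ij for every answer of one question: votes divided by the question total."""
    if not votes:
        return []
    total = sum(votes)
    if total == 0:
        # Assumption: with no votes at all, every answer gets an equal share.
        return [1.0 / len(votes)] * len(votes)
    return [v / total for v in votes]


def core_word_frequency(answer: str, stop_words: Set[str] = STOP_WORDS) -> float:
    """fcore_ij: fraction of the words in the answer that are not stop words."""
    words = answer.lower().split()  # assumption: plain whitespace tokenization
    if not words:
        return 0.0
    core = sum(1 for w in words if w not in stop_words)
    return core / len(words)
```

For instance, `vote_shares([6, 3, 1])` divides each count by the question total of 10, and `core_word_frequency("the espresso is bitter")` counts two core words out of four.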

Indirect Factors

  • Correlation among answers to the same question affects the answer reliability. The reliability of similar answers is positively correlated and mutually reinforcing. If an answer is reliable, answers similar to it are more likely to be reliable, whereas answers that differ from it are more likely to be unreliable.

  • Correlation between answers and their corresponding users affects the answer reliability. An answer from a more reliable user is more likely to be reliable, and users who provide reliable answers are more likely to be reliable.

3.2 AUnet Model

Based on the four factors above, we constructed AUnet to model them in a unified framework, with reference to the concept of heterogeneous information networks. The network model, shown in Fig. 1, is defined as:

$$ G=\{V,E,W,P\} $$
Fig. 1. AUnet model.

  • \( V=A \cup U \) is the set of all nodes in AUnet, where \( A=\displaystyle \bigcup _{i=1}^{n} A_i \) is the set of answers to all questions, drawn as blue circles, and \( U=\{u_1,u_2,...,u_k\} \) is the set of all users, drawn as black squares.

  • \( E=E_p \cup E_s \) is the set of all edges in AUnet. The similarity relations between answers, \( E_s \subseteq A \times A \), are drawn as red undirected edges, and the provided relations between users and answers, \( E_p \subseteq A \times U \), are drawn as black undirected edges.

  • \( W=\{W_e|e \in E\} \) is the set of weights of the edges. \( w_s=sim(a_{ij},a_{ij^\prime }) \) is the weight of the similarity relation between answers \( a_{ij} \) and \( a_{ij^\prime } \), with \( w_s\in \left[ 0,1\right] \). In this paper, we adopted sen2vec [15] and cosine similarity to calculate the semantic similarity \( w_s \) between any two answers to the same question. The weight of the provided relation between the user \( u_k \) and the answer \( a_{ij} \) is set to \( w_p=prd(a_{ij},u_k)=1 \), which means that all answers provided by a user affect that user equally.

  • \( P=\{priori\left( v\right) |v\in V\} \) is the set of priori reliabilities of the nodes \( v\in V \), with \( priori\left( v\right) \in \left[ 0,1\right] \). The higher the priori reliability, the more reliable the node. The priori reliability of the answer \( a_{ij} \) is defined from the share of votes \( fvote_{ij} \) and the frequency of core words \( fcore_{ij} \) as \( priori\left( a_{ij}\right) =\alpha fvote_{ij}+\left( 1-\alpha \right) fcore_{ij} \), where \( \alpha \) is the coefficient that balances the influence of the share of votes against the frequency of core words. The priori reliability of the user \( priori\left( u_k\right) \) is defined from the user's reputation, upvotes, downvotes, and homepage views. Pearson correlation analysis showed that user authority is strongly correlated with the number of homepage views. Therefore, the normalized priori reliability of the user is defined as \( priori\left( u_k\right) =Norm\left( \frac{Reputation}{Views}+Upvote-Downvote\right) \).
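
The edge weights and priori reliabilities defined above can be sketched as follows. Several details are assumptions made only for illustration: the answer vectors are taken as given (any sentence embedding would do), negative cosine values are clamped to 0 so that \( w_s \) stays in \( [0,1] \), \( Norm \) is taken to be min-max normalization over all users, and a view count of 0 is replaced by 1 to avoid division by zero. The dictionary keys and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """w_s between two answer vectors, clamped to [0, 1] (the clamp is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return max(0.0, dot / (norm_u * norm_v))


def answer_prior(fvote: float, fcore: float, alpha: float) -> float:
    """priori(a_ij) = alpha * fvote_ij + (1 - alpha) * fcore_ij."""
    return alpha * fvote + (1.0 - alpha) * fcore


def user_priors(users: List[Dict[str, float]]) -> List[float]:
    """priori(u_k) = Norm(Reputation / Views + Upvote - Downvote), with min-max Norm assumed."""
    raw = [
        u["reputation"] / max(u["views"], 1) + u["upvotes"] - u["downvotes"]
        for u in users
    ]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        # Assumption: identical raw scores map to the neutral value 0.5.
        return [0.5] * len(raw)
    return [(r - lo) / (hi - lo) for r in raw]
```

With these pieces, an AUnet instance for one question is a similarity matrix over its answers, a mapping from each answer to its author, and the two lists of priori values.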

4 Mutual Inference Principle

After constructing the AUnet model, we use a trust propagation algorithm to iteratively update the user reliability and the answer reliability based on the mutual inference principle.

4.1 User Reliability Computing

Compared with reliable users, unreliable users have higher error rates, so the reliability of user \( u_k \) can be inferred from his or her error rate. We assume that the error rate \( \varepsilon \left( u_k\right) \) of the user \( u_k \) follows a normal distribution, \( \varepsilon \left( u_k\right) \sim N\left( 0,\sigma \left( u_k\right) ^2\right) \).

Our goal is to make the variance of the weighted combination of all users' errors, \( \varepsilon _{combine}=\frac{\sum _{u_k\in U} \omega \left( u_k\right) \varepsilon \left( u_k\right) }{\sum _{u_k\in U} \omega \left( u_k\right) }\), as small as possible. Assuming that the users' errors are independent, \( \varepsilon _{combine} \) also follows a normal distribution, \( \varepsilon _{combine}\sim N\left( 0,\frac{\sum _{u_k\in U} \left( \omega \left( u_k\right) \right) ^2\sigma ^2\left( u_k\right) }{\left( \sum _{u_k\in U} \omega \left( u_k\right) \right) ^2}\right) \). With the constraint \( \sum _{u_k\in U} \omega \left( u_k\right) =1 \), we formulate this goal as the following optimization problem:

$$\begin{aligned} \begin{aligned} \min \limits _{\{\omega \left( u_k\right) \}}\sum _{u_k\in U} \left( \omega \left( u_k\right) \right) ^2\sigma ^2\left( u_k\right) \\ s.t.\sum _{u_k\in U} \omega \left( u_k\right) =1,\omega \left( u_k\right) >0 \end{aligned} \end{aligned}$$
(1)

The optimization problem is convex and can be solved by the method of Lagrange multipliers with a multiplier \( \lambda \); the analytical solution is:

$$\begin{aligned} \omega \left( u_k\right) \propto \frac{1}{\sigma ^2\left( u_k\right) } \end{aligned}$$
(2)
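
The step from problem (1) to Eq. (2) can be spelled out as follows. This is the standard Lagrange-multiplier argument, given here as a sketch with the positivity constraint left implicit, since it holds automatically at the solution:

$$\begin{aligned} L\left( \omega ,\lambda \right)&=\sum _{u_k\in U} \left( \omega \left( u_k\right) \right) ^2\sigma ^2\left( u_k\right) +\lambda \left( 1-\sum _{u_k\in U} \omega \left( u_k\right) \right) \\ \frac{\partial L}{\partial \omega \left( u_k\right) }&=2\omega \left( u_k\right) \sigma ^2\left( u_k\right) -\lambda =0 \;\Rightarrow \; \omega \left( u_k\right) =\frac{\lambda }{2\sigma ^2\left( u_k\right) } \\ \sum _{u_k\in U} \omega \left( u_k\right) =1 \;&\Rightarrow \; \omega \left( u_k\right) =\frac{1/\sigma ^2\left( u_k\right) }{\sum _{u_j\in U} 1/\sigma ^2\left( u_j\right) } \end{aligned}$$

The normalizing denominator is the same for every user, which is why the result is stated only up to proportionality.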

In Eq. (2), the true variance \( \sigma ^2\left( u_k\right) \) of user \( u_k \) can be estimated by the maximum likelihood estimation as:

$$\begin{aligned} \hat{\sigma }^2\left( u_k\right) =\frac{1}{\left| Q\left( u_k\right) \right| }\sum _{q\in Q\left( u_k\right) } \left( x_q^{u_k}-x_q^*\right) ^2 \end{aligned}$$
(3)

Equation (3) is the mean squared error of the answers given by user \( u_k \), taken over the set \( Q\left( u_k\right) \) of questions that the user has answered. Here \( x_q^{u_k} \) is the answer of user \( u_k \) to the question q, and \( x_q^* \) is the best answer to the question q, computed as the average of the answers weighted by their reliability: \( x_q^*=\frac{\sum _{u_k\in U_q} \tau \left( a_q^{u_k}\right) \cdot x_q^{u_k}}{\sum _{u_k\in U_q} \tau \left( a_q^{u_k}\right) } \).
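
As a rough sketch of Eq. (3) and of the weighted best answer \( x_q^* \), the Python functions below treat each answer as a single numeric value, which is an assumption made purely for illustration (the text does not fix a representation for \( x_q^{u_k} \)). The dictionary layout and function names are hypothetical.

```python
from typing import Dict


def weighted_truth(claims: Dict[str, float], tau: Dict[str, float]) -> float:
    """x_q^*: average of the users' answers to one question, weighted by answer reliability.

    claims maps user id -> numeric answer; tau maps user id -> reliability of that answer.
    """
    total = sum(tau[u] for u in claims)
    if total == 0.0:
        raise ValueError("all answer reliabilities are zero")
    return sum(tau[u] * claims[u] for u in claims) / total


def estimated_variance(
    user: str,
    claims_by_question: Dict[str, Dict[str, float]],
    truths: Dict[str, float],
) -> float:
    """Eq. (3): mean squared deviation of the user's answers from x_q^*."""
    errors = [
        (claims[user] - truths[q]) ** 2
        for q, claims in claims_by_question.items()
        if user in claims
    ]
    if not errors:
        raise ValueError("user has not answered any question")
    return sum(errors) / len(errors)
```

A user whose answers stay close to the reliability-weighted consensus gets a small estimated variance and therefore, by Eq. (2), a large weight.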

According to our statistics, most users give few answers, so estimating a user's true variance \( \sigma ^2\left( u_k\right) \) by \( \hat{\sigma }^2\left( u_k\right) \) is inaccurate when the user has provided only a small number of answers. Following the work in [16], we handle this long-tail problem by using a confidence-interval bound instead of a single point estimate. Finally, the user reliability at a given confidence level can be computed as follows:

$$\begin{aligned} \omega ^\prime \left( u_k\right) \propto \frac{1}{\sigma ^2\left( u_k\right) }=\frac{\chi _{1-\frac{\alpha }{2}}^2\left( \left| Q\left( u_k\right) \right| \right) }{\sum _{q\in Q\left( u_k\right) } \left( x_q^{u_k}-x_q^*\right) ^2} \end{aligned}$$
(4)
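
The effect of Eq. (4) on long-tail users can be illustrated with a short sketch. It assumes that the chi-square subscript refers to the quantile at cumulative probability 0.025 (a significance level of 0.05), which is one reading of the notation and not something the text states. The small table holds standard chi-square table values for a few degrees of freedom, and the function name and the `eps` guard are hypothetical.

```python
from typing import Callable, Sequence

# Standard chi-square table values at cumulative probability 0.025,
# keyed by degrees of freedom. Only a few entries are listed for illustration.
CHI2_Q025 = {1: 0.000982, 2: 0.0506, 3: 0.216, 4: 0.484, 5: 0.831, 10: 3.247}


def confidence_weight(
    squared_errors: Sequence[float],
    quantile: Callable[[int], float],
    eps: float = 1e-12,
) -> float:
    """Unnormalized user weight of Eq. (4): chi-square quantile over the summed squared errors.

    quantile(n) must return the chi-square quantile for n degrees of freedom;
    eps (an assumption) avoids division by zero for a user with no error at all.
    """
    n = len(squared_errors)
    if n == 0:
        raise ValueError("user has not answered any question")
    return quantile(n) / (sum(squared_errors) + eps)
```

With one answer and a squared error of 0.01, the weight is far below that of a user with ten answers of the same per-answer error, although the point estimate of Eq. (3) is identical for both. A library routine such as `scipy.stats.chi2.ppf` could replace the lookup table.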

4.2 Answer Reliability Computing

The answer reliability is affected by the user reliability and by the peer answers to the same question [10]. For each question, we extract from AUnet an undirected subgraph consisting of the answers to that question and the corresponding users. We then transform the answer reliability problem into inferring the joint probability distribution of the nodes in this undirected probabilistic subgraph. For an undirected graph with n random variables \( X \) and set of cliques \( C \), the joint probability distribution can be represented as follows:

$$\begin{aligned} P(X)=\frac{1}{Z}\displaystyle \prod _{c\in C}\psi _c\left( X_c\right) \end{aligned}$$
(5)
$$\begin{aligned} Z=\displaystyle \sum _X\displaystyle \prod _{c\in C}\psi _c\left( X_c\right) \end{aligned}$$
(6)

In Eqs. (5) and (6), \( \psi _c\left( X_c\right) =exp\{-E\left( X_c\right) \} \) is the potential function of clique c, \( Z \) is the normalization constant, and the energy function \( E\left( X_c\right) \) represents the correlation between variables. Based on Boltzmann machines, the probability of the hidden variable \( y_{ij} \) of the answer \( a_{ij} \) and the probability of the hidden variable \( y_k \) of the user \( u_k \) are defined as follows:

$$\begin{aligned} P\left( y_{ij}\right)= & {} {\left\{ \begin{array}{ll} \tau \left( a_{ij}\right) , &{} \text{ if } y_{ij}=1 \\ 1-\tau \left( a_{ij}\right) , &{} \text{ if } y_{ij}=0 \end{array}\right. }\\ P\left( y_k\right)= & {} {\left\{ \begin{array}{ll} \omega \left( u_k\right) , &{} \text{ if } y_k=1 \\ 1-\omega \left( u_k\right) , &{} \text{ if } y_k=0 \end{array}\right. } \end{aligned}$$
(7)

Generally, obtaining the joint probability distribution on an undirected probabilistic graph is NP-hard [17]. Using iterated conditional modes (ICM) [18], we update the value of each answer node variable in the undirected subgraph step by step, based on the idea of gradient ascent, as follows:

$$\begin{aligned} P\left( y_{ij}=\eta \right) =P\left( y_k=\eta \right) +\displaystyle \sum _{y_{ij^\prime }\in N\left( y_{ij}\right) }m_{ij^\prime \rightarrow ij}\left( y_{ij}=\eta \right) \end{aligned}$$
(8)
$$\begin{aligned} m_{ij^\prime \rightarrow ij}\left( y_{ij}\right) =\displaystyle \sum _{y_{ij^\prime }}U\left( y_{ij^\prime },y_{ij}\right) P\left( y_{ij^\prime }\right) \end{aligned}$$
(9)
$$\begin{aligned} U\left( y_{ij^\prime },y_{ij}\right) =\left[ sim\left( a_{ij},a_{ij^\prime }\right) \right] ^{I\left( y_{ij^\prime },y_{ij}\right) }\cdot \left[ 1-sim\left( a_{ij},a_{ij^\prime }\right) \right] ^{1-I\left( y_{ij^\prime },y_{ij}\right) } \end{aligned}$$
(10)

We let \( y_{ij^\prime }\in \{0,1\} \) denote the trustiness transmitted by \( a_{ij^\prime } \) to \( a_{ij} \). \( U\left( y_{ij^\prime },y_{ij}\right) \) is the potential function and \( sim\left( a_{ij},a_{ij^\prime }\right) \) denotes the similarity between the two answers. When similar answers to the same question have consistent reliability, the energy needed for transmission is small and the transmission happens easily. In contrast, when similar answers to the same question have inconsistent reliability, the energy needed for transmission is large and the transmission is unlikely to happen.
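
One update of a single answer node following Eqs. (8) to (10) might look like the sketch below. It rests on several assumptions that the text does not spell out: the indicator \( I\left( y_{ij^\prime },y_{ij}\right) \) is read as 1 when the two labels agree and 0 otherwise, the neighborhood \( N\left( y_{ij}\right) \) is taken to be all other answers to the same question, and the two scores for \( \eta =1 \) and \( \eta =0 \) are renormalized so that the result can be used as \( \tau \left( a_{ij}\right) \). The function names are hypothetical.

```python
from typing import List


def potential(sim: float, same_label: bool) -> float:
    """Eq. (10): sim when the two labels agree, 1 - sim when they differ."""
    return sim if same_label else 1.0 - sim


def message(sim: float, tau_neighbor: float, eta: int) -> float:
    """Eq. (9): sum over the neighbor's label y' of U(y', eta) * P(y')."""
    total = 0.0
    for y in (0, 1):
        p_y = tau_neighbor if y == 1 else 1.0 - tau_neighbor  # Eq. (7)
        total += potential(sim, y == eta) * p_y
    return total


def update_answer(
    j: int, tau: List[float], omega_author: float, sims: List[List[float]]
) -> float:
    """One update of tau[j] for the answers to a single question, following Eq. (8).

    tau holds the current reliabilities of the question's answers, omega_author is
    the reliability of the user who wrote answer j, and sims[j][k] is sim(a_j, a_k).
    """
    score = {}
    for eta in (0, 1):
        p_user = omega_author if eta == 1 else 1.0 - omega_author
        neighbors = sum(
            message(sims[j][k], tau[k], eta) for k in range(len(tau)) if k != j
        )
        score[eta] = p_user + neighbors
    # Assumption: renormalize the two scores so that the result is a probability.
    return score[1] / (score[0] + score[1])
```

In this sketch, a reliable neighbor that is very similar to answer j pulls its reliability up, while a reliable neighbor that is very dissimilar pulls it down, matching the description of the indirect factors.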

5 Experiments

5.1 Datasets and Experimental Settings

To evaluate the effectiveness of the proposed algorithm, we conducted experiments on datasets from four domains of the vertical cQA site StackExchange (footnote 5): coffee, movie, music, and sports.

The statistics of the four datasets are shown in the first six columns of Table 1. To ensure the quality of our dataset, only questions with more than 3 answers were selected.

Table 1. Experimental data statistics.

The StackExchange dataset only marks the best answer to each question and makes no judgement on the reliability of the other answers. However, answers in cQA are often diverse, so it is not objective to treat all other answers directly as negative samples, and doing so would also cause an imbalance between positive and negative examples. Therefore, we randomly selected 50 questions from each of the four domains, for a total of 200 questions and 1037 answers, and asked two volunteers to annotate the answer reliability according to the best answer and relevant information. Each volunteer annotated 125 questions, and every answer was labeled “Yes” (reliable) or “No” (not reliable). After the consistency of the labeling results was verified, the final statistics for all domains are shown in the last two columns of Table 1.

All experiments were conducted on a server equipped with a quad-core Core i7-4790 CPU, 16 GB of RAM, and the 64-bit Windows 10 operating system.

5.2 Baseline and Metrics

Four methods, Vote, LR, TDM, and LQ, are selected as baselines for comparison in this paper.

  • Vote, the basic voting method, directly ranks answers according to the number of votes of the answer.

  • LR, proposed by Shah et al. [6], trains the Logistic regression model based on non-textual information of answers to evaluate and predict the answer reliability in cQA. The output of LR is a trust value of the answer between 0 and 1.

  • TDM, a method proposed in [19] based on the iterative idea of TruthDiscovery, estimates the trustworthiness of the answer. TDM smooths long-tail users with the priori reliability of the user, and it uses basic iterative methods to update the user reliability and the answer reliability.

  • LQ, an unsupervised answer reliability evaluation method proposed in [10], detects low-quality answers using the relation between peer answers and label answers by minimizing the variance of the question. We represented each answer by 121 relevant features in 5 categories, including statistical and textual features of the answer, user features, and similarity features between peer answers.

For the evaluation of answer reliability, we focused on whether the model can effectively filter and return reliable answers, that is, whether the top few answers in the answer list presented to the user are more reliable. Therefore, we evaluated the performance of five models using the two indicators MRR and MAP which are commonly used in information retrieval and question-answering.

MRR (Mean Reciprocal Rank) measures the average of the reciprocal of the position of the best answer in the answer list, which is defined as follows:

$$\begin{aligned} MRR=\frac{1}{\left| Q\right| } \displaystyle \sum _{q\in Q} \frac{1}{bp_q} \end{aligned}$$
(11)

where \( \left| Q\right| \) is the total number of questions, and \( bp_q \) is the position of the most reliable answer in the answer list. MRR can evaluate whether the algorithm can effectively filter out the best answer.

MAP (Mean Average Precision) measures the average accuracy of the ranking of answers for each question. That is, it measures not only the position of the most reliable answer but also the positions of the other reliable answers in the final ranking result. MAP is defined as follows:

$$\begin{aligned} MAP=\frac{1}{\left| Q\right| } \displaystyle \sum _{q\in Q}\left( \frac{1}{TN_q}\displaystyle \sum _{i=1}^{TN_q} \frac{i}{p_i}\right) \end{aligned}$$
(12)

where \( TN_q \) is the number of reliable answers labeled as positive samples of the question q, and \( p_i \) is the position of the ith reliable answer in the final ranking result.
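
Both metrics are straightforward to compute from rank positions. The sketch below follows Eqs. (11) and (12) directly; the function names and the input layout (1-based positions, one list per question) are assumptions made for illustration, and every question is assumed to have at least one reliable answer.

```python
from typing import List


def mean_reciprocal_rank(best_positions: List[int]) -> float:
    """Eq. (11): best_positions[q] is the 1-based rank bp_q of the best answer to question q."""
    return sum(1.0 / p for p in best_positions) / len(best_positions)


def average_precision(reliable_positions: List[int]) -> float:
    """Inner term of Eq. (12) for one question.

    reliable_positions are the 1-based ranks of the TN_q reliable answers;
    after sorting, the i-th reliable answer sits at rank p_i and contributes i / p_i.
    """
    ranked = sorted(reliable_positions)
    return sum(i / p for i, p in enumerate(ranked, start=1)) / len(ranked)


def mean_average_precision(positions_per_question: List[List[int]]) -> float:
    """Eq. (12): mean of the per-question average precision."""
    scores = [average_precision(p) for p in positions_per_question]
    return sum(scores) / len(scores)
```

For example, a question whose reliable answers are ranked first and third has an average precision of (1/1 + 2/3) / 2.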

5.3 Performance and Results Analysis

The main parameters of our AUnet method are the window size of sen2vec and the coefficient \( \alpha \) used in calculating the priori reliability of the answer. The DM model of sen2vec is adopted to represent each answer as a 300-dimensional vector, with a window size of 5. In our experiments, the best value of \( \alpha \) is 0.6 in the coffee domain, 0.7 in the movie domain, 0.6 in the music domain, and 0.5 in the sports domain.

Fig. 2. The number of iterations in four domains.

We first verified the convergence of the algorithm. Figure 2 shows how the cumulative value of the answer reliability changes with the number of iterations. When the reliability change of each answer between two iterations is less than 0.001, the algorithm is considered to have reached a steady state.
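
The stopping rule can be written as a small generic loop. This is only a sketch of the criterion described above: the `update` callback stands in for one full round of mutual inference and is hypothetical, as are the function name and the `max_iter` safety cap. The list of reliabilities is assumed to be non-empty.

```python
from typing import Callable, List, Tuple


def iterate_until_stable(
    update: Callable[[List[float]], List[float]],
    tau: List[float],
    tol: float = 1e-3,
    max_iter: int = 100,
) -> Tuple[List[float], int]:
    """Repeat update() until every answer reliability changes by less than tol.

    Returns the final reliabilities and the number of iterations performed.
    """
    for iteration in range(1, max_iter + 1):
        new_tau = update(tau)
        change = max(abs(a - b) for a, b in zip(new_tau, tau))
        tau = new_tau
        if change < tol:
            return tau, iteration
    return tau, max_iter
```

Checking the largest per-answer change, rather than the change in the cumulative value, matches the criterion that each answer's reliability must move by less than 0.001.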

It can be seen in Fig. 2 that the data of all four domains reach a steady state after 15 iterations in the experiment. Among them, the Music domain converges noticeably faster than the other domains, because both the number of answers per user and the number of answers per question are relatively large in this domain.

The MRR and MAP of five models in four domains are shown in Table 2.

Table 2. MRR(%) and MAP of five models in four domains

From the MRR results, we can see that the voting method alone can filter out about 70% of the best answers. After user information and statistical information are added, the trained LR method can filter out about 80% of the best answers. To improve, to a certain extent, the ability to filter the best answers when votes are few, TDM smooths the long-tail users and LQ introduces the similarity relation between peer answers. AUnet achieves the best screening ability in all four domains and returns the best answer for more than 86% of the questions.

The MAP results reflect the average ranking accuracy over all questions in each domain and measure how accurately the algorithm returns reliable answers.

On the whole, Vote, which considers only the number of votes, has the worst average ranking performance. The number of votes can be affected by factors such as posting time and malicious voting, so the reliability of an answer cannot be evaluated effectively without considering other factors. LR improves slightly on Vote because it considers the statistical features of the answer and user features in addition to community features. However, because a large proportion of answers come from long-tail users and receive few votes, LR predicts poorly for answers with sparse features. This can cause some reliable answers to be ranked low, so the MAP of LR is below 80% in every domain except Movie. TDM evaluates answer reliability iteratively, and its MAP is about 83%, which is stable and unaffected by community information. LQ adds similarity features and textual features of the answer to those used by LR, which lets it filter low-quality answers effectively, so its MAP is much higher than that of the other three baselines. AUnet simultaneously models the relations among answers and between users and answers, and uses the priori reliability based on community information and statistical information, achieving the highest average ranking accuracy in all four domains. For 90% of the questions, the top three answers returned by AUnet are all reliable. In addition, AUnet achieves its largest performance improvement in the Music domain. That is, when the number of answers per question and the number of answers per user are large, AUnet evaluates answer reliability significantly better than feature-based methods.

6 Conclusion

To alleviate the high cost of labeling data in supervised methods and the low performance of unsupervised methods, we proposed an unsupervised method based on AUnet to evaluate answer reliability. Building on the probabilistic graphical model and the mutual inference algorithm, our AUnet method calculates the answer reliability and the user reliability simultaneously without supervision and automatically ranks the answers in cQA. Experiments on four domains of StackExchange verified the convergence and effectiveness of our algorithm and showed that our method is superior to the other methods both in screening the best answer and in discriminating between reliable and unreliable answers. A potential direction for future research is evaluating answer reliability under multi-source conflict.