1 Introduction

Collaborative question answering (CQA) systems, such as Yahoo! Answers, Baidu Knows, and Naver, are becoming popular online information services. One useful by-product of this popularity is the resulting large archive of questions, answers, and ratings, which in turn can be a good resource for automatic question answering. For example, Yahoo! Answers (2011) alone has acquired an archive of more than 40 million questions and 500 million answers, according to 2008 estimates.

However, being unaware of the previously accumulated questions and answers, or unwilling to spend time searching for them, people often ask repeated or very similar questions. As a result, the same answers must be provided again and again, which wastes considerable effort. Hence, a new challenge is to develop a method that can effectively reuse these accumulated questions and answers.

Currently, there are quite a number of text similarity methods, but most of them focus on questions with similar syntax or topics. Due to the complexity and variety of language, CQA archives contain a large number of questions that are syntactically different yet semantically similar. For example, “how can I lose weight in a few month?” and “are there any ways of losing pound in a short period?” are two similar questions asking for ways to lose weight, but they neither share many common words nor follow the same syntactic structure. This gap makes the similar-question matching task difficult. Similarity measures based purely on the bag-of-words approach may perform poorly and become ineffective in these circumstances (Wang et al. 2009).

Since identifying all such groups of questions is vital, we propose exploiting the existing CQA archives to first identify a small set of clearly equivalent question groups, and then use these groups to learn and extend equivalent patterns that match more questions. Our approach is based on the following assumption: in CQA systems, an asker often chooses one posted answer as “best” if it fulfills the information need expressed by the question. Therefore, when the best answers chosen for different questions in the same domain are exactly the same, these questions express the same information need and are thus semantically similar. This assumption is also adopted by other researchers, such as Wang et al. (2009).

Based on this assumption, we propose an automatic question answering method over a CQA archive that generates patterns from equivalent questions which may be syntactically different. “Equivalent” question groups are first retrieved from the CQA archive by grouping questions by the text of their “best” answers and their domains. However, certain questions with different semantic meanings may share the same answer by chance. To avoid generating spurious equivalent question groups, we propose an equivalent question filter that keeps only genuinely equivalent question groups, based on an estimate of the topical diversity (TD) within each group. For extracting equivalent question patterns, we explore three methods of syntactic pattern generation: chunk-based phrase level (C-PL) and chunk-based lexical level (C-LL) are chunk-based pattern generation methods, while tree-based incremental generation (T-IG) is a syntactic tree-based pattern generation method. The patterns generated for the same questions are then automatically evaluated, by matching them against the whole question set, to select the best pattern specificity for a given equivalent group. The result of this step is a set of equivalent pattern groups, and these patterns are then used to match all questions in the CQA archive. By comparing question similarity and answer similarity, a large number of additional equivalent patterns can be extracted, round by round, by a new bootstrapping-based pattern extension method. Given a new question, it is compared against the set of available equivalent patterns. In case of a match, the best answer of a previously submitted question in the matched group can be returned.

Experiments over a dataset of more than 200,000 questions retrieved from Yahoo! Answers are performed to test the effectiveness of the proposed method. We initially detect 1,349 equivalent patterns in 452 groups, which are then used to learn more equivalent patterns. The final set of 16,991 extended patterns is applied to automatically seek a best answer for a new (held-out) set of questions. Our method correctly suggests an answer to a new question 54.5% of the time, outperforming the previously reported state-of-the-art translation-based method for similar question finding.

The rest of this paper is organized as follows. Section 2 introduces related work. In Sect. 3, the three equivalent pattern generation methods, C-PL, C-LL, and T-IG, are presented. Section 4 describes the bootstrapping-based pattern extension method in detail, and maximum pattern matching is presented in Sect. 5. Section 6 introduces the experimental setup and the results with evaluation, and Sect. 7 discusses the results and concludes the paper.

2 Related work

Our work builds on the long tradition of research in automatic question answering (QA). Automatic QA systems attempt to find the most relevant parts of long documents (usually a short paragraph or just one or two sentences) with respect to user queries. Auto-FAQ relied on a shallow, surface-level analysis for similar question retrieval (Whitehead 1995). FAQ-Finder adopted two major components, i.e., concept expansion using the hypernyms defined in WordNet and a TFIDF-weighted score in the retrieval process (Hammond et al. 1995). In FAQ-Finder, certain question types may not be detected correctly, for example when interrogative words like “what” and “how” are substrings of interrogative phrases such as “for what” and “how large”, respectively. To eliminate this problem in FAQ-Finder, Tomuro (2004) combined lexical and semantic features to automatically extract interrogative words from a question corpus. Besides WordNet, Lenz (1998) retrieved FAQs via case-based reasoning (CBR). Sneiders (2002) used question templates with entity slots, which are replaced by data instances from an underlying database, to interpret the structure of queries or questions. Berger et al. (2000) proposed a statistical lexicon correlation method for FAQ retrieval.

With respect to pattern usage, some QA systems attempted to learn patterns that help identify potential answers. For example, Ion (1999) gave three different linguistic patterns to extract relevant information. There has also been much prior effort on automatic pattern extraction, most of it focused on extracting patterns from human-labeled training corpora. Ravichandran and Hovy (2002) proposed a surface text pattern generation algorithm to find answers to new questions. Zhang and Lee (2002) introduced a pattern learning algorithm to extract answer patterns for a given question. The essential idea was to find one answer instance and generalize the question target. However, the defined answer targets were too general to differentiate between answer types, so the generated patterns were usually too domain-specific to be applied efficiently to a new domain. Mark and Horacio (2004) extended Zhang and Lee’s patterns by using four answer instances instead of one to overcome the over-generalization problem. Hu and Liu (2006) utilized a kind of semantic pattern for QA, in which two granularity evaluation algorithms, SIIPU and DEXT, were used to control the granularity of the patterns and thereby increase their flexibility. More recent work by Hao et al. (2008) focused on learning semantic patterns. However, the computational time required to process semantic patterns directly was high, making such patterns impractical for instantly processing huge data archives.

The idea of finding similar questions in CQA is related to passage retrieval in traditional QA, with the exception that question-to-question matching is much stricter than question-to-passage matching. There have been significant new efforts focusing on CQA retrieval (e.g., Wu et al. 2005; Bian et al. 2008; Wang et al. 2009). Bernhard and Gurevych (2008) compared six different question similarity methods on WikiAnswers, which is a CQA site. The comparison showed that Lucene’s Extended Boolean Model achieved the best performance, but only slightly outperformed term vector similarity. Jijkoun and Rijke (2005) proposed to retrieve answers from frequently asked question pages on the Web and return a ranked list of QA pairs in response to a user’s question. They used the Lucene implementation of the vector space model as the core of their retrieval system and explored the performance of different models. However, the vector space similarity at the core of all these baselines only considers words shared between the user’s question and the Q/A pairs, while similar syntactic structures of questions are not taken into account. Jeon et al. (2005a, b) used word translation probabilities to find similar questions, which was shown to substantially outperform the Cosine similarity method. Kosseim and Yousefi (2008) tried to improve QA by retrieving equivalent answer patterns. However, all of their manually and automatically generated patterns were based on TREC 8 and 9 data, which are in quite uniform formats. Thus the patterns may not handle the common questions found in most CQA systems, even questions starting with “why” and “which”. Jeon et al. (2005a, b) extended this line of work by introducing word translation probabilities to find similar questions in CQA archives, and showed significant improvements over previous methods. We compare our approach with their method in this paper.

3 Equivalent pattern learning

We now present our approach and system implementation. Recall that we first group the questions into “equivalent groups”, whose members share exactly the same answer chosen by the question author (the asker) as the “best” answer among all submitted answers. Then, we filter the candidate groups to remove questions grouped together by chance, by estimating each group’s TD (described next). For the remaining groups, equivalent syntactic patterns are generated (Sect. 3.2). These patterns can further be used as seed patterns to discover more equivalent patterns for matching against new questions, so that answers can be suggested automatically.

3.1 Equivalent question filtering

While most questions that share exactly the same “best” answer are indeed semantically equivalent, some may share the same answer by chance. For example, Table 1 shows two questions that share the same best answer, “Antidisestablishmentarianism”, while being semantically quite different. To filter out such cases, we propose an estimate of TD, which is calculated from the topics shared by all pairs of questions in a group. If the diversity value is larger than a threshold, the questions in this group are considered not equivalent, and no patterns are generated.

Table 1 An example of two very different questions sharing the same answer

To calculate the TD of the questions, we define “notional words” (NW) as the head nouns and the heads of verb phrases identified by the OpenNLP parser (2011). These NW are regarded as topics, since they are more related to the topics than other types of words. Using these “topics”, we obtain the TD by computing the proportion of unshared topics among all topics of a group of similar questions. Concretely, we first compare the NW of every pair of questions in the group and calculate the proportion of topics they do not share. Since a group may contain more than two questions, this proportion is then averaged over all pairs in the group. The TD of a question group G is therefore given by (1); an example with NW can be found in Table 3.

$$ TD(G) = \frac{1}{n(n - 1)} \times \sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {\left( {1 - \frac{{|Q_{i} \cap Q_{j} |}}{{|Q_{i} \cup Q_{j} |}}} \right)\quad (i > j)} } $$
(1)

$Q_i$ and $Q_j$ represent the notional word sets of any two different questions in the same group $G$, which contains $n$ questions in total. From this equation, we can see that the TD is higher when fewer topics are shared within the question group.
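To make the computation concrete, the following is a minimal sketch of Eq. (1), assuming the notional words of each question have already been extracted (e.g., with the OpenNLP parser mentioned above); the example sets are hypothetical.

```python
from itertools import combinations

def topical_diversity(group):
    """Topical diversity (TD) of a question group, following Eq. (1).
    `group` is a list of sets, each holding the notional words (NW) of one question."""
    n = len(group)
    if n < 2:
        return 0.0
    # Jaccard-style distance over notional words for every unordered pair (i > j).
    pair_div = [1.0 - len(qi & qj) / (len(qi | qj) or 1)
                for qi, qj in combinations(group, 2)]
    # Keep the paper's 1/(n(n-1)) normalization over the n(n-1)/2 pairs with i > j.
    return sum(pair_div) / (n * (n - 1))

# Hypothetical example: two questions with disjoint notional words give a high TD,
# so the group would be filtered out; identical NW sets give TD = 0.
print(topical_diversity([{"change", "momentum", "help"}, {"word", "english", "language"}]))
```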

After equivalent question filtering, only the question groups with diversity values below a threshold (described further in the experiments section) are kept as equivalent question groups. These equivalent groups are then used to generate equivalent question patterns.

3.2 Pattern generation

Based on these filtered question groups, we can generate equivalent question patterns, i.e., patterns generated within the same equivalent question group. The resulting patterns, regarded as seed patterns, are then used to extract and extend more equivalent patterns. The extended patterns enlarge the matching coverage, thus retrieving more similar questions with different syntactic structures. To achieve this, we propose three pattern generation methods based on phrase chunking and syntactic trees: C-PL and C-LL are chunk-based pattern generation methods, while T-IG is a syntactic tree-based pattern generation method.

The major difference between them is that the chunk-based methods generate the most “detailed” patterns, which intuitively improves matching precision, while the tree-based method extracts patterns of the most appropriate generality and can thus match and answer more questions. The purpose of proposing three generation methods is to find which kind of pattern is more appropriate for question matching. We mainly describe T-IG in this section because of its high generation flexibility. A comparison of all the generation methods is presented in the experiments section.

3.2.1 Chunk-based pattern generation

By phrase chunking, each question can be tagged with a group of labels. One type of label is the phrase label of an independent chunk; the other is the lexical label of each word. Since a phrase label is more general than a lexical label, it can potentially match more questions. Based on these two kinds of labels, we propose two variants of chunk-based pattern generation:

C-PL After chunking, a question is split into chunks tagged with phrase labels. For example, “the first person” is an independent chunk with the label “NP”. We use all the phrase labels, in their original order in the question, to build the question’s structural pattern. The method is simple but fast, so it can be applied to large question archives.

For example, for the query “what book do you think everyone should have at home?”, the chunking result is “[NP what/WP book/NN] do/VBP [NP you/PRP] [VP think/VB] [NP everyone/NN] [VP should/MD have/VB] [PP at/IN] [NP home/NN]”. The corresponding pattern generated by C-PL is “NP NP VP NP VP PP NP”.

C-LL In chunk processing, the most detailed labels, namely the lexical (Part-of-Speech) labels of the question, are considered. We use all the lexical labels of the chunks, in their original order in the question, to generate the question pattern. Such a detailed pattern can match question structure well; the related evaluations are presented in Sect. 6.

For the same example question and chunking result, the pattern generated by C-LL is “WP NN PRP VB NN MD VB IN NN”, in which every lexical label of the chunks is kept.
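For illustration, both chunk-based patterns can be read directly off the bracketed chunker output shown above. The short sketch below is ours and assumes the OpenNLP-style bracket notation used in the example; it is not part of any toolkit.

```python
import re

def chunk_patterns(chunked):
    """Derive the C-PL and C-LL patterns from a bracketed chunk string."""
    c_pl, c_ll = [], []
    for phrase_label, body in re.findall(r"\[(\w+)([^\]]*)\]", chunked):
        c_pl.append(phrase_label)                              # phrase label, e.g. NP
        c_ll += [tok.split("/")[1] for tok in body.split()]    # lexical (POS) labels
    # Tokens outside any chunk (here "do/VBP") belong to no chunk and therefore
    # appear in neither pattern, matching the example in the text.
    return " ".join(c_pl), " ".join(c_ll)

chunked = ("[NP what/WP book/NN] do/VBP [NP you/PRP] [VP think/VB] "
           "[NP everyone/NN] [VP should/MD have/VB] [PP at/IN] [NP home/NN]")
print(chunk_patterns(chunked))
# ('NP NP VP NP VP PP NP', 'WP NN PRP VB NN MD VB IN NN')
```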

3.2.2 Syntactic tree-based pattern generation

Chunk-based pattern generation mainly reflects the original question structure. However, some questions may be very long or may contain subordinate clauses, which in turn can hurt pattern matching. The syntactic tree-based method is therefore proposed to use different levels of the syntactic tree to extract the “core” structure of the question. By sentence parsing, the parsed structure of a question can be converted into a syntactic tree. Based on this, we propose a syntactic tree-based incremental pattern generation method as follows:

T-IG On a syntactic tree, the T-IG method tries to extract all potentially “valuable” patterns incrementally, from the root node towards the leaf nodes. If the question is very long, many incremental generation steps may be required and generation efficiency suffers. To reduce this computational cost, T-IG first preprocesses the tree to merge nodes that lie on a single chain, which is defined as follows:

Definition 1

Given a node $n_{x+1}$ and its parent node $n_x$, the pair $(n_x, n_{x+1})$ is a single chain if and only if $n_{x+1}$ has only one child and is the only child of $n_x$.

By this definition, each node in a single chain has only one child, so a single chain can be extended to contain more than two nodes. To merge them, all nodes in a single chain are compared by their priorities, and the node with the highest priority is selected to represent the others. The priority of the POS tag of a notional word (described in the previous section) is predefined to be higher than that of an interrogative word (“WDT”, “WP”, “WP$”), whose priority is in turn higher than that of any other type of POS tag.

The method starts by constructing an initial pattern from the root node of the syntactic tree. The initial pattern is then extended towards the leaf nodes level by level. Respecting the parent–child relation, the child nodes at each level are added from left to right. Each extension forms a new extended sub-tree from the current sub-tree; the difference between the two sub-trees is defined as the incremental part. A sub-tree is then judged to decide whether it is “valuable” enough to be a pattern, according to two constraints motivated by the following observations: (1) a pattern that is too short (only one tag) is meaningless for matching; (2) if the incremental part contains neither a notional word tag nor an interrogative word tag, the extended sub-tree is too similar to the current one to be a valuable pattern, since NW are regarded as topics and interrogative words indicate the question target. The two constraints are as follows:

Constraint 1

The total number of tags in the sub-tree is larger than 1.

Constraint 2

The incremental part contains either a notional word tag or an interrogative word tag.

The T-IG method is shown in detail as Algorithm 1. The parsed tree is first simplified by comparing node priorities (lines 4–12): nodes with only one child and one parent, i.e., nodes on a single chain (line 5), are merged with their parents according to priority to save subsequent computation (lines 6–10). The sub-tree candidates are then incrementally extracted to generate question patterns (lines 13–31). Line 24 acquires the incremental part by comparing two sub-trees, and the two “valuable” conditions mentioned above are checked at lines 22 and 25, respectively.

Algorithm 1 Tree-based incremental pattern generation (T-IG)

Consider, for example, the query “What book do you think everyone should have at home?”. Its syntactic tree, shown in Fig. 1, is first simplified by merging nodes within single chains according to priority. In this example, “NP” and “PRP” are merged into “NP” since the priority of “NP” is higher. After that, all sub-trees are extended incrementally down to the maximum level of the tree according to our algorithm. The sub-trees that fulfill the two constraints are selected as patterns, as shown in Table 2.

Fig. 1 Pattern generation on the question “what book do you think everyone should have at home?” using the T-IG

Table 2 Final generated patterns using the T-IG
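Algorithm 1 gives the full procedure; purely as a reading aid, the following is a condensed sketch of the two ideas behind it: single-chain merging by tag priority and level-by-level extension subject to Constraints 1 and 2. The node representation, the tag sets, and the level-at-a-time extension (rather than one child at a time) are our own simplifications.

```python
NOTIONAL_TAGS = {"NN", "NNS", "NNP", "NNPS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # assumed
INTERROG_TAGS = {"WDT", "WP", "WP$", "WRB"}                                            # assumed

def priority(tag):
    """Notional-word tags outrank interrogative tags, which outrank everything else."""
    return 2 if tag in NOTIONAL_TAGS else 1 if tag in INTERROG_TAGS else 0

class Node:
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, list(children)

def merge_single_chains(node):
    """Collapse unary chains, keeping the highest-priority tag (cf. Definition 1)."""
    while len(node.children) == 1:
        child = node.children[0]
        node.tag = max(node.tag, child.tag, key=priority)
        node.children = child.children
    for child in node.children:
        merge_single_chains(child)
    return node

def incremental_patterns(root):
    """Extend the pattern level by level from the root; keep a sub-tree as a pattern
    only if it has more than one tag (Constraint 1) and its incremental part contains
    a notional or interrogative tag (Constraint 2)."""
    patterns, level, tags = [], [merge_single_chains(root)], []
    while level:
        increment = [n.tag for n in level]               # tags added by this extension step
        tags = tags + increment
        if len(tags) > 1 and any(priority(t) > 0 for t in increment):
            patterns.append(" ".join(tags))
        level = [c for n in level for c in n.children]   # next level, children left to right
    return patterns
```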

After pattern generation, in order to reduce matching computation, the group of extracted “valuable” patterns is further compared to select a single best pattern, using the maximum pattern matching algorithm described in Sect. 5.2. The best patterns of the questions in the same group are grouped as equivalent patterns when the group contains more than one pattern. Table 3 shows an example with two questions, questions 1 and 2, which share the same answer “Watership Down by Richard Adams…”, together with the longest equivalent patterns and NW generated by the T-IG method.

Table 3 The EP and NW of two EQs that share the same answer “Watership Down by Richard Adams …”

4 Bootstrapping-based equivalent pattern extension

Though the CQA dataset is large, the question groups that share exactly the same answers are not numerous. In our investigation of 215,974 questions and 2,044,296 answers crawled from Yahoo! Answers, only 2,166 questions share exactly the same answers before equivalent question filtering. Thus, the patterns generated directly from this limited set of training questions are far from sufficient for answering newly posted questions, even with the incremental generation ability of the T-IG method. Therefore, we propose bootstrapping the learning process by automatically acquiring additional training questions. We call this algorithm “bootstrapping-based pattern extension”; it operates by pattern matching and by similarity comparison of questions and answers. The flowchart of this algorithm is shown in Fig. 2.

Fig. 2 Flowchart of bootstrapping-based EP extension

As shown in the figure, the algorithm first takes as input the seed patterns generated from the initial training equivalent question groups. These equivalent pattern groups are then matched against a large-scale QA archive to extract more candidate equivalent question groups. Each question pair in these groups is evaluated by calculating both question similarity and answer similarity. The similarity calculation uses the plain Cosine similarity method from the Code Project (2011) to extend the set of similar question cases and thus extract more equivalent patterns. We use plain Cosine similarity for three reasons: (1) it is efficient, which matters since the CQA archive to be processed is huge; (2) it can be applied to both questions and answers; (3) the similarity constraint should not be too strict, since the extended patterns are subsequently verified for performance anyway.

If the similarity is larger than a threshold, the question group is regarded as equivalent and is added to the equivalent question groups for pattern generation. In each round, the generated patterns are evaluated with the F1 score, which is described in detail in the evaluation section. When the average F1 score begins to drop, the extension loop stops, and the pattern groups generated in all previous rounds form the final extended equivalent patterns, which are then added to our pattern database for further use. The F1 criterion is a strict constraint that ensures the overall quality of the extracted equivalent patterns, even though some patterns in the next round might actually be equivalent.

Algorithm 2 The algorithm of bootstrapping-based equivalent pattern extension


The bootstrapping-based pattern extension method is shown in detail as Algorithm 2, in which lines 1 and 2 initialize the parameters. Lines 3–15 form the main loop of bootstrapping-based pattern generation. Pattern generation with T-IG on the equivalent question groups is shown in lines 4–7. Lines 8–14 match the equivalent patterns against the whole QA archive to calculate the average F1 score. If the F1 score begins to drop, the extension iteration stops. Otherwise, the equivalent patterns generated in the current step are sent, as seed patterns, to the pattern extension function (line 12) to acquire a new question set for the next round.

Algorithm 3 The algorithm of pattern_extension function

For clarity, the pattern extension function (line 12) is shown separately as Algorithm 3. This algorithm first matches each group of equivalent patterns against the whole QA archive (line 3). The similarities of the matched questions and of their answers are then calculated using Cosine similarity (lines 6 and 7, respectively). The groups with similarity above the thresholds are kept as candidate equivalent question groups and are returned for further processing in the next round (line 14).
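A compressed sketch of the extension loop (Algorithms 2 and 3) is given below. The helper callables (pattern generation, matching, Cosine similarity, and F1 evaluation) are passed in as parameters and stand in for the components described above; the default thresholds are the values later reported in Table 9. This is a reading aid, not the exact implementation.

```python
def extend_patterns(seed_groups, qa_archive, generate, match, cosine, avg_f1,
                    q_thresh=0.8, a_thresh=0.27):
    """Bootstrapping-based equivalent pattern extension (sketch of Algorithms 2 and 3).
    `generate` builds patterns for an EQ group (e.g., T-IG), `match` yields pairs of
    (question, best answer) records matched by a pattern group, `cosine` is plain
    Cosine similarity, and `avg_f1` evaluates a set of pattern groups on the archive."""
    all_patterns, eq_groups, best_f1 = [], list(seed_groups), 0.0
    while eq_groups:
        pattern_groups = [generate(g) for g in eq_groups]
        f1 = avg_f1(pattern_groups, qa_archive)
        if f1 < best_f1:                       # average F1 starts to drop: stop extending
            break
        all_patterns += pattern_groups
        best_f1 = f1
        # pattern_extension (Algorithm 3): match each pattern group against the whole
        # archive and keep candidate pairs whose questions or answers are similar enough.
        eq_groups = []
        for group in pattern_groups:
            for (q1, a1), (q2, a2) in match(group, qa_archive):
                if cosine(q1, q2) >= q_thresh or cosine(a1, a2) >= a_thresh:
                    eq_groups.append([q1, q2])
    return all_patterns
```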

An example of the first round, from the initial equivalent patterns (EP) to the final merged EP, is shown in Table 4. We start with an initial EP group learned by the equivalent question filtering described earlier and C-LL pattern generation. This pattern group is then matched against the whole QA database, and five questions with answers are extracted in total as matched questions. These question groups with their best answers are evaluated by question similarity as well as answer similarity. The groups whose question similarity or answer similarity exceeds the respective thresholds are added to the equivalent question (EQ) groups. The EP generated on these question groups become the extended EP, which are further checked for duplication and merged. If two or more patterns have questions (their generation sources) that share the same answer, these patterns are grouped as an equivalent pattern group. The final learned patterns, the merged and extended EP, are used for the next round of extension until the average F1 score begins to drop, as further described in Sect. 6.

Table 4 An example of first round pattern extension on an initial EP

5 Maximum pattern matching

5.1 Pattern matching

After incremental pattern generation, a group of patterns is extracted for each question. We then use these patterns to match against the CQA archive. The matching method compares the number of matched labels against the labels of both the original question’s pattern and the candidate questions in the archive.

For a given pattern p, the detailed matching method is as follows. We first preprocess all candidate questions using the same pattern generation method that produced p. For the T-IG method, the questions in the archive are processed by phrase chunking to reduce the time consumed. Our method then splits both the pattern p and the patterns generated from the candidate questions into labels and compares these labels one by one, in sequence. Since the questions are independent of each other, we use the joint probability of the matching probability with respect to p and that with respect to the candidate question. The matching score MS is given by the following equation:

$$ MS (p,q_{i} )= \frac{{\left| {{\text{Matched}}\_L} \right|}}{{\left| {{\text{Pattern}}\_L} \right|}} \times \frac{{\left| {{\text{Matched}}\_L} \right|}}{{\left| {{\text{Candidate}}\_L} \right|}}, $$
(2)

where Matched_L denotes the matched labels, Pattern_L all the labels in the pattern p, and Candidate_L the labels in the candidate question $q_i$. The ratio |Matched_L|/|Pattern_L| is the probability of the match with respect to the pattern p, while |Matched_L|/|Candidate_L| is the probability of the match with respect to the candidate question.

For example, suppose an original pattern “PP NP NP VP NP VP” is matched against the candidate question “Liver mass found on young woman if its cancer would’nt she be very very ill”. The pattern generated for the candidate is “NP VP PP NP PP NP NP VP ADJP”, as shown in Table 5. Our method compares the two patterns label by label from left to right. In each round, matching continues from the position of the previous match until reaching the end of either the original pattern or the candidate’s pattern. From the table, we can see that |Matched_L| is 4, |Pattern_L| is 6, and |Candidate_L| is 9; therefore, the matching score computed with (2) is 0.296.

Table 5 Pattern matching to question “liver mass found on young woman if its cancer would’nt she be very very ill”
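One plausible reading of this left-to-right, in-sequence comparison is sketched below; with the example pattern and candidate above it reproduces |Matched_L| = 4 and the score of 0.296, but it is our reconstruction rather than the exact matching code.

```python
def matching_score(pattern, candidate):
    """Eq. (2): sequential label matching between a pattern and a candidate's pattern."""
    p, c = pattern.split(), candidate.split()
    i = j = matched = 0
    while i < len(p) and j < len(c):
        if p[i] == c[j]:        # next pattern label found, in order, within the candidate
            matched += 1
            i += 1
        j += 1
    return (matched / len(p)) * (matched / len(c))

print(round(matching_score("PP NP NP VP NP VP",
                           "NP VP PP NP PP NP NP VP ADJP"), 3))   # 0.296
```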

The pattern matching method can be further applied to notional word matching, since a notional word can be matched in the same way as a single label. Supposing a question is matched with both the pattern and the NW, we define the weight of equivalent pattern matching as $W_{EP}$. The final matching score $MS_{final}$, combining equivalent pattern and notional word matching, is calculated as follows:

$$ MS_{\text{final}} (p + NW,q_{i} ) = W_{EP} \times MS_{EP} (p,q_{i} ) + (1 - W_{EP} ) \times MS_{nw} (NW,q_{i} ) $$
(3)

5.2 Maximum pattern matching

T-IG generates a group of validated patterns for each question, and the number of patterns grows as the question gets longer. To save the computational cost of matching a large number of patterns during QA, we further propose a maximum pattern matching method that seeks the most appropriate pattern in each generation procedure. The main idea is to find a generated pattern that matches its original question (the generation source) “better” than any other question. We quantify this “better” as a matching gap σ based on the matching score MS, where MS is computed from the proportion of tags in the pattern that are matched in sequence. The matching gap of a pattern p is then defined as follows:

$$ \sigma (p) = MS \, (p, q_{p} ) - {\text{Maximum}}(MS(p, q_{i} )),\quad q_{p} \ne q_{i} $$
(4)

Once we obtain σ, the best pattern can be selected according to its sub-tree level. The reason to consider the level is that we assume the higher the level (the root level being the highest), the more questions the pattern can match. However, from (4), the matching gap is larger when the sub-tree level is lower, which is the opposite of matching coverage. To balance the two factors, we define a matching threshold λ and find the patterns whose matching gap exceeds the threshold. Among these, the single pattern whose matching gap is closest to λ while still no smaller than λ is selected as the best pattern. The best pattern $p_{best}$ is defined as follows:

$$ p_{\text{best}} = \left\{ {p_{i} | \sigma (p_{i} )\ge \lambda ;\quad \sigma (p_{i} )\to \lambda } \right\} $$
(5)

The detailed maximum pattern matching algorithm is shown as Algorithm 4. All the patterns are first sorted by their levels in ascending order (line 2), and lines 4–9 obtain the maximum matching score of $p_i$ against all questions except $q_o$ and record the corresponding question group id as g. In line 10, the matching gap is computed from the maximum matching score obtained. The matching gap of each pattern is then compared with the threshold to select the best pattern: when the gap is equal to or larger than the threshold and the group index of the generation source equals the current group (line 11), the current pattern is taken as the best pattern $p_{best}$ and the iteration stops, returning this pattern.

Algorithm 4 The algorithm of the maximum pattern matching
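The essence of Algorithm 4 can be sketched as follows. The candidate patterns are assumed to be (level, pattern) pairs, questions are assumed to already be represented by their generated label sequences, and `matching_score` can be the Eq. (2) sketch above; the per-group bookkeeping of the full algorithm is omitted.

```python
def best_pattern(leveled_patterns, source_question, other_questions, lam, matching_score):
    """Maximum pattern matching (sketch of Algorithm 4): scan patterns in ascending
    sub-tree level and return the first whose matching gap sigma (Eq. 4) reaches lambda."""
    for level, pattern in sorted(leveled_patterns):
        own = matching_score(pattern, source_question)                    # MS(p, q_p)
        rival = max(matching_score(pattern, q) for q in other_questions)  # best competitor
        if own - rival >= lam:                                            # sigma(p) >= lambda
            return pattern
    return None
```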

Through maximum pattern matching, the best patterns of the questions in the same group can be grouped as EP whenever the group contains more than one pattern. Table 6 shows an EQ group with questions 1 and 2, and the EP generated by C-PL, C-LL, and T-IG, respectively, together with the NW.

Table 6 The group of EQs and the generated patterns based on the C-PL, C-LL, and T-IG

After the EP are extracted with any of the pattern generation methods described above, we can use them to answer new questions from users. The procedure is as follows: we first extract the pattern and notional words of the new question using the same generation method. The generated structure, which contains phrase or lexical labels, is then matched against the equivalent pattern set. If any pattern is matched exactly, all the patterns in the same group are used to match the question archive. The well-matched questions are then compared against a matching threshold; if the threshold is satisfied, the retrieved answers of the matched questions are returned to the user as the final answers.
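A minimal sketch of this online procedure, under the simplifying assumptions that the pattern database maps each pattern string to its equivalent-pattern group and that the archive stores each question's generated pattern with its best answer; the function and parameter names are ours.

```python
def answer_new_question(new_question, pattern_db, qa_archive, generate, matching_score, theta):
    """Answer a new question with the learned equivalent patterns (sketch).
    pattern_db: pattern string -> list of patterns in the same EP group.
    qa_archive: list of (question_pattern, best_answer) pairs."""
    new_pattern = generate(new_question)          # same generation method as the EPs
    group = pattern_db.get(new_pattern)           # an exact pattern match is required
    if group is None:
        return []
    answers = []
    for pattern in group:                         # every pattern in the matched EP group
        for q_pattern, best_answer in qa_archive:
            if matching_score(pattern, q_pattern) >= theta:   # well-matched archived question
                answers.append(best_answer)
    return answers
```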

6 Experiments and evaluations

This section describes the evaluation setup, the baselines, the datasets, and the experimental procedures, with discussion, used to tune and evaluate our system.

We adapt standard evaluation metrics from information retrieval, namely Precision, Recall, and F1-measure. The task is, for a given question, to retrieve its set of semantically equivalent questions (that is, questions from the same equivalent group, excluding itself). Specifically, if a retrieved question in the result set belongs to the same equivalent group as the “query” question, this question is regarded as correctly matched. Precision for a question is therefore defined as the number of correctly matched questions divided by the number of questions retrieved, as shown in (6). Similarly, Recall, shown in (7), is defined as the number of correctly matched questions divided by the number of questions in the original group. Finally, the F1 measure is computed in the standard way as 2 × Precision × Recall/(Precision + Recall).

$$ {\text{Precision}} = \frac{{\left| {\text{Correctly matched questions}} \right|}}{{\left| {\text{Retrieved questions}} \right|}} $$
(6)
$$ {\text{Recall}} = \frac{{\left| {\text{Correctly matched questions}} \right|}}{{\left| {\text{Questions in the original group}} \right|}} $$
(7)

6.1 Baselines

Baseline methods are implemented for comparison with the proposed methods. Traditional similarity-based methods are first employed to retrieve surface-similar questions (and to suggest the corresponding “best” answers for those questions). A Cosine model from the Code Project (2011) is selected as a classical similarity calculation method. In preprocessing, it uses a tokenizer to extract words and filter them against a stop word list. The filtered words are further deduplicated. Based on these words, term frequencies and weights are computed to obtain the Cosine similarity. A vector space model, TFIDF(NW), which keeps the NW identified by phrase chunking, is also implemented as an improvement of the traditional TFIDF method.

To compare against a stronger baseline, the translation model proposed by Jeon et al. (2005a, b) is also implemented. This method uses the IBM statistical machine translation model to estimate word translation probabilities. Previous experimental results show that it significantly outperforms the language model (LM) and Okapi methods. To implement this method, we use the GIZA++ toolkit (2011) to learn the model, with the smoothing parameter set to 0.01.

6.2 Datasets

Our dataset consists of 215,974 questions and 2,044,296 answers crawled from Yahoo! Answers (2011) in 2008. From these questions, we acquired 833 groups of similar questions distributed over 65 categories defined by Yahoo! Answers. After automatic filtering by TD calculation, 452 groups remain for parameter tuning and, with human verification, as seed data for equivalent pattern generation. These groups contain 1,349 questions, or 2.98 questions per group on average. Figure 3 reports the distribution of group sizes, in which groups containing fewer than 5 questions account for almost 90% of all the questions.

Fig. 3 Question quantity distribution of the whole dataset

In our experiments, the seed data is split into two parts: 603 questions for training (200 groups) and 746 questions for testing (the remaining groups). To make the experiments sound, we add a large set of additional questions to the testing part to form two testing datasets. Dataset-1 contains the 746 testing questions plus an additional 10,000 questions, and Dataset-2 is the whole archive of 215,974 questions. Since a full experiment on the large dataset is very time-consuming, we use the smaller one (Dataset-1) to evaluate the pattern generation methods and the larger one (Dataset-2) for comparison with the baseline methods.

6.3 Parameter tuning

The weight of the equivalent pattern (EP) in question matching is denoted $W_{EP}$, and the weight of the notional words (NW) is accordingly $1 - W_{EP}$. Parameter θ is a matching threshold used during pattern matching to find similar patterns. Parameter λ, defined in (4) and (5), is the threshold for matching gap comparison.

To train the weights and parameters on the 603 training questions, we first set $W_{EP}$ to 1, which means that at this stage the matching procedure uses only the EP and not the NW. Precision and recall are then calculated on the training dataset for different values of θ (between 0 and 1). The best value of θ is the one yielding the highest F1 score. This value of θ is then used to train $W_{EP}$ with EP and NW used together. To allow more questions to be matched, the matching gap σ is set to a very small value, 0.001, based on experimental experience. With the trained parameters $W_{EP}$, θ, and σ, performance is measured on the training dataset again to find the value of λ with the highest F1 score. Figure 4 shows the question matching performance for different values of the threshold λ using the T-IG method.

Fig. 4 Performance with different values of threshold λ using the T-IG method

The final trained parameter values for C-PL, C-LL, and T-IG are reported in Table 7 and are used in the subsequent experiments.

Table 7 Final trained parameter values

There are three further parameters: question similarity, answer similarity, and pattern-to-question matching similarity, used in the EP extension procedure (Sect. 4). All three parameters are obtained from statistics over human judgments and are then used as prior parameters for automatic pattern extension. The human judgment is based on the majority opinion of the annotators on the question pairs. Table 8 shows five examples of question pairs identified as “similar”. On these identified question pairs, question similarities are calculated using the plain Cosine method (the method and the reason for using it are described in Sect. 4). The median value, i.e., the middle similarity, or the average of the two middle similarities when there is an even number of values, is then taken as the learned parameter.

Table 8 The procedure of obtaining best question similarity threshold through human judgment

Currently, there are five annotators, and more than one hundred “similar” question pairs have been identified. On these pairs, the three parameters are obtained; the final values are listed in Table 9. From the result, we can see that the threshold for answer similarity is 0.27, which is much lower than the 0.8 threshold for question similarity. We believe the main reason is that in Yahoo! Answers the answers are usually considerably longer than the questions.

Table 9 The trained parameter values of question similarity, answer similarity, and pattern matching

6.4 Equivalent pattern (EP) extension

We now evaluate our bootstrapping-based pattern extension algorithm, which is used to match and extract more EQ groups in each round of bootstrapping.

The large dataset, Dataset-2, was used for this experiment, and the result is shown in Fig. 5. In the first two rounds of the extension, both the number of generated EP and the performance (F1 score) change only slightly. In rounds 3, 4, and 5, the number of generated patterns increases dramatically, and the performance keeps increasing until it drops in round 5. Therefore, the system takes round 4 as the best round and stops the extension procedure. As the result of this experiment, 16,991 EP are finally extracted by the pattern extension.

Fig. 5 The performance on Dataset-2 using the bootstrapping-based pattern extension

6.5 Results on the dataset-1

After parameter training and pattern extension, we first compare, on the testing dataset Dataset-1, the three methods (C-PL, C-LL, and T-IG) with their variants: EP only (EP), NW only (NW), and both EP and notional words (EP + NW) with the trained weight $W_{EP}$. Using the same matching method, the performances of the three methods and their variants are reported in Table 10. From the result, EP + NW achieves the highest performance for all the methods. T-IG with EP + NW has the highest precision and F1 score among all methods. C-PL with EP achieves the best recall, owing to its high generalization. However, the precision of C-PL with EP is 0.008, which is much lower than the others. This is determined by the nature of the C-PL generation method, which mainly uses chunk-based phrase labels: the patterns it generates are usually very short and brief but have high coverage. Without the assistance of NW, precision can be very low because too many questions match the patterns. Taking the F1 score as the main criterion, T-IG with EP + NW, denoted T-IG (EP + NW), is the best method in this experiment.

Table 10 Performance comparison of the three methods with their variants

Bold values indicate the highest value in each column (recall, precision, and F1)

6.6 Results on the dataset-2

Since T-IG (EP + NW) achieves the best performance in terms of F1 score in the previous experiment, we use this method to compare against the baseline methods described in Sect. 6.1. First, the three baseline methods, Cosine, TFIDF(NW), and the Translation Model, are implemented and applied to the testing dataset Dataset-2 to calculate their precision, recall, and F1 score. On the same dataset, the performance of T-IG (EP + NW) is also recorded; the results are shown in Fig. 6.

Fig. 6 Performance comparison with baseline methods on Dataset-2

From the results, TFIDF(NW) achieves 48.2% recall, 36.1% precision, and 41.3% F1 score; its F1 score is higher than the 38.2% of the Cosine model. However, the precision of the Cosine model, 39.6%, is higher than that of TFIDF(NW). The translation model achieves 52.2% recall, 37.1% precision, and 43.4% F1 score. The recall, precision, and F1 score of T-IG (EP + NW) reach 57.1%, 54.5%, and 55.8%, respectively. With the highest F1 score of 55.8%, our method T-IG (EP + NW) significantly outperforms Cosine, TFIDF(NW), and the Translation Model.

6.7 Discussion

In our experiments, the generated patterns can effectively match and answer a given question over CQA archives. Compared with traditional keyword search, our method can handle questions with different syntactic structures by using the EQ patterns. Even when questions are very “noisy”, such as those phrased in spoken language, our method can still extract patterns and thus answer those “noisy” questions. In our experiments, T-IG generates patterns stably unless the questions are too short.

However, if the questions are very short, the generated patterns could match noise questions and thus reduce performance. For example, for the question group “a change in momentum help?” and “change in momentum?”, the generated pattern for the first question is “NP PP NP; change momentum help”. It could also match unrelated questions such as “help with Physics” if the weight of pattern matching is very high. In such cases, notional word matching is more useful, especially as such cases become more frequent in larger datasets.

Though the experiments show that T-IG (EP + NW) has the best F1 score on Dataset-1, the number of matched questions, which represents how many questions a pattern can cover, should also be considered. This metric indicates the expected coverage of a pattern: with higher coverage, a generated pattern can potentially match and answer more questions. To explore this, we compare our three pattern generation methods, as shown in Fig. 7. From the results, C-PL has the highest coverage while C-LL has the lowest when λ is between 0.1 and 0.3. We can therefore select the pattern generation method according to the situation: for example, C-PL can be used to trade some performance for answering more questions, while T-IG offers a reasonable balance.

Fig. 7 Comparison of pattern coverage on Dataset-1

7 Summary

To automatically find answers to new questions in CQA sites, this paper proposes to identify “equivalent” questions that were submitted and answered in the past by automatically generating equivalent question patterns. Three new pattern generation methods, C-PL, C-LL, and T-IG, with their variants EP, NW, and EP + NW, are presented. The patterns generated automatically from the initial equivalent question groups, obtained after topical diversity filtering, serve as seed patterns. These seed patterns are further extended by a new bootstrapping-based pattern extension algorithm. The resulting patterns, combining syntactic patterns and notional words of the questions, can be used to answer new questions with existing answers. The experiments are conducted on a large collection of more than 200,000 real questions drawn from the Yahoo! Answers archive. Our method achieves over 57% recall and over 54% precision at finding questions similar to new questions, significantly outperforming the baseline models for this task.

However, in CQA archives, semantically similar questions may use completely different words, i.e., share no NW at all. In such cases, their TD values are very high, so they may be filtered out of the initial EQ groups. Consequently, these EQs are lost during pattern extension, and certain new questions cannot be retrieved and answered. Future improvements will focus on incorporating additional semantic information into the filtering and matching process. More popular evaluation measures such as MAP and MRR are also expected to further enhance the validation of the results.