1 Introduction

Domain terminology refers to the vocabulary of theoretical concepts in a specific domain. Through domain terminology, readers can quickly grasp the development of a subject, which is of great significance to scientific research. However, extracting domain terminology manually from massive text collections is prohibitively expensive. Therefore, automatic domain terminology extraction is in real demand in various domains.

The process flow of existing domain terminology extraction methods can be summarized in two steps: candidate term extraction and term evaluation [1]. First, the candidate term extractor extracts the terms that conform to the domain conditions. Second, the evaluation module evaluates each candidate term and, based on statistical measures, filters it out when necessary.

In order to enhance the accuracy of the extracted domain terms, in this paper we propose a novel parsing-based method. The contributions of this paper can be summarized as follows:

  (1) We propose to utilize dependency parsing to generate candidate domain terms.

  (2) We propose a multi-factor evaluator, which evaluates and filters the candidate terms based on linguistic rules, statistical methods, and domain-specific term characteristics.

We compared the performance of our proposed domain terminology extraction method with a frequency-based, POS-based term extraction method. In the experiments, our method identified a plentiful set of accurate candidates and improved the recall rate, and its ranking outperformed the counterpart in precision.

2 Related Work

Some automatic terminology recognition approaches have been proposed in recent years. The existing domain terminology extraction approaches can be classified into four categories [2]:

  (1) Dictionary-based methods. Domain terms can be extracted simply by matching words against a domain dictionary. However, domain terminology is constantly updated, so domain dictionaries are difficult to maintain [3].

  (2) Linguistic methods. They use surface grammatical analysis to recognize terminology [4]. However, the linguistic rules are difficult to summarize, and linguistic methods may generate considerable noise when identifying terms.

  (3) Statistical methods. They use the statistical properties of terms in a corpus to identify potential terminologies. Commonly used statistical measures include word frequency, TF-IDF [5], and C-Value [6]. Statistical methods may produce meaningless string combinations [7], common words (non-terminology), and other noise.

3 Parsing-Based Domain Terminology Extraction Method

In this paper, we propose to use dependency parsing in the process of candidate domain term identification. The proposed parsing-based domain terminology extraction method consists of three steps: dependency parsing establishment, candidate term generation, and candidate evaluation for ranking.

We provide the details of each step in the following sections. To make the ideas easier to follow, a Chinese corpus is used as the running example.

3.1 Dependency Parsing Establishment

Dependency parsing reveals the syntactic structure of a given sentence by analyzing the dependencies among the components of language units, and it can well explain the relationships between adjacent words. Typical dependency parsing methods include graph-based [8, 9] and transition-based [10, 11] approaches.

The very first step in establishing dependency parsing is word segmentation. Since CRF (Conditional Random Field)-based word segmentation has been proved to be one of the best segmentation algorithms [12], we adopt a CRF-based parser as our baseline word segmenter. A syntactic parse tree is then generated at the same time; the dependency parse represents the grammatical structure and the relationships between the words. Table 1 shows an example dependency parse.

Table 1. Dependency parsing of “边际收益等于物品的价格。” (The marginal revenue equals the price of the item.)
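As an illustration of this step, the following minimal sketch performs segmentation, POS tagging and dependency parsing of the example sentence with the open-source Stanza toolkit. Stanza is used here only as a readily available stand-in; it is not one of the parsers evaluated in Sect. 4, and the one-time model download is left commented out.

```python
# Minimal sketch: segmentation + dependency parsing of the example sentence.
# Stanza is an illustrative stand-in for the Stanford/LTP parsers used later.
import stanza

# stanza.download("zh")  # one-time Chinese model download
nlp = stanza.Pipeline("zh", processors="tokenize,pos,lemma,depparse")

doc = nlp("边际收益等于物品的价格。")
for sent in doc.sentences:
    for word in sent.words:
        # head == 0 marks the root of the dependency tree
        print(word.id, word.text, word.upos, word.head, word.deprel)
```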

3.2 Candidate Term Generation

In the example sentence in the previous section, 收益 (revenue) is a nominal subject, and 边际 (marginal) serves as an adjectival modifier of 收益 (revenue). By grouping words in particular roles together, we can obtain the expected “phrases”. For example, 边际收益 (marginal revenue) can be regarded as a candidate domain term.
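The grouping described above can be sketched as follows. This is only an illustrative reading of the generation step: the relation names (amod, compound, nmod) and the adjacency requirement are simplifying assumptions, and the input mirrors the parser output format of the previous sketch.

```python
# Group a head noun with an immediately preceding adjectival/nominal modifier.
MODIFIER_RELS = {"amod", "compound", "nmod"}  # assumed modifier relations

def candidate_phrases(words):
    """words: list of dicts with keys id, text, upos, head, deprel."""
    candidates = []
    for w in words:
        if w["upos"] != "NOUN":
            continue
        # modifiers attached to this noun that directly precede it
        mods = [m for m in words
                if m["head"] == w["id"]
                and m["deprel"] in MODIFIER_RELS
                and m["id"] == w["id"] - 1]
        if mods:
            candidates.append("".join(m["text"] for m in mods) + w["text"])
    return candidates

words = [
    {"id": 1, "text": "边际", "upos": "ADJ",  "head": 2, "deprel": "amod"},
    {"id": 2, "text": "收益", "upos": "NOUN", "head": 3, "deprel": "nsubj"},
]
print(candidate_phrases(words))  # ['边际收益'] (marginal revenue)
```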

Therefore, we propose to create grammatical rules to generate phrases, which can be regarded as domain terminologies. In this paper, we propose three grammatical rules, which may be widely accepted by different domains: Noun + Noun, (Adj | Noun) + Noun, and ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?)(Adj | Noun)*) Noun.
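One straightforward way to apply such rules is to encode each sentence's POS sequence as a string and match it with a regular expression. The sketch below is only an assumed encoding (single-letter codes over coarse tags) of the rules given above; it is not the exact implementation used in this paper.

```python
# Hedged sketch: the grammatical rules as a regex over a coded POS sequence.
import re

POS_CODE = {"NOUN": "N", "ADJ": "A", "ADP": "P"}  # P stands for a preposition

# (Adj|Noun)+ Noun  and  ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)* Noun,
# which together subsume the simpler Noun + Noun rule.
PATTERN = re.compile(r"((?:[AN])+|(?:[AN])*(?:NP)?(?:[AN])*)N")

def match_candidates(tagged):
    """tagged: list of (token, coarse POS tag) pairs for one sentence."""
    codes = "".join(POS_CODE.get(pos, "x") for _, pos in tagged)
    spans = []
    for m in PATTERN.finditer(codes):
        if m.end() - m.start() >= 2:  # keep multi-word matches only
            spans.append("".join(tok for tok, _ in tagged[m.start():m.end()]))
    return spans

print(match_candidates([("边际", "ADJ"), ("收益", "NOUN"), ("等于", "VERB"),
                        ("物品", "NOUN"), ("的", "PART"), ("价格", "NOUN")]))
# ['边际收益']
```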

3.3 Candidates Evaluation and Ranking

It is inevitable that the candidate terms generated in Sect. 3.2 contain noise. Therefore, in order to control the quality of the selected domain terminology, we propose a set of measures for candidate evaluation. The candidates are ranked in descending order by evaluation score for the purpose of filtering.

3.3.1 Linguistic Rule Based Filter

In this paper, we propose to filter the candidate terms in a “backward” manner, which filters out the candidate terms that obviously cannot be terminologies by checking their POS tags. For this purpose, word segmentation and POS tagging are performed on the candidates.
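A minimal sketch of such a backward filter is given below; the concrete disqualifying POS tags are assumptions chosen for illustration, not the rule set used in our experiments.

```python
# Backward POS filter: discard candidates that obviously cannot be terms.
DISALLOWED_ANY  = {"VERB", "PRON", "NUM", "ADV"}  # assumed: not allowed anywhere
DISALLOWED_LAST = {"ADP", "PART", "CCONJ"}        # assumed: candidate must not end here

def passes_pos_filter(tagged_candidate):
    """tagged_candidate: list of (token, POS) pairs after segmentation and tagging."""
    tags = [pos for _, pos in tagged_candidate]
    if any(t in DISALLOWED_ANY for t in tags):
        return False
    if tags[-1] in DISALLOWED_LAST:
        return False
    return True

print(passes_pos_filter([("边际", "ADJ"),  ("收益", "NOUN")]))  # True
print(passes_pos_filter([("等于", "VERB"), ("物品", "NOUN")]))  # False
```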

3.3.2 Multi-factor Evaluation

The traditional terminology evaluation method is frequency-based: it sorts the candidates in descending order by their frequencies in the corpus. However, although frequency is an important factor, other factors, such as adhesion, also play important roles in evaluation. Therefore, we propose a multi-factor evaluator. In addition to frequency, affixes (prefixes and suffixes) that often occur in phrases are considered as a factor. The affixes of hot words in a particular domain are often the same. For example, in the domain of economics, “固定成本 (constant cost)”, “可变成本 (variable cost)” and “总成本 (total cost)” all contain the suffix “成本 (cost)”. Table 2 shows some affixes of the hot words and of the non-terms in the candidate set of the economics corpus.

Table 2. Some affixes of the hot words and non-terms in economics

Based on the observations in Table 2, an affix can have either a positive or a negative impact on a candidate being domain terminology. Therefore, we propose influence factors that quantify the impact of the affixes.

Equation 1 gives the influence factor for non-terminology affixes in terms of the candidate's frequency, where \( a \) is an adjustment threshold.

$$ \alpha = \frac{f_{word}}{a} $$
(1)

Equation 2 gives the influence factor for hot-word affixes, based on the average frequency of the candidates sharing the affix. Candidate terms that occur only once, whose count is \( C_{(1)} \), are excluded; \( b \) is an adjustment threshold.

$$ \beta = \left\lceil b\,\frac{\sum\nolimits_{i = 2}^{n} f_{(i)}}{C - C_{(1)}} \right\rceil $$
(2)

Equation 3 combines the frequency with the two influence factors to give the evaluation score.

$$ v = f_{word} - \alpha + \beta $$
(3)

The candidates are ranked in descending order by their evaluation scores: the higher the score, the more consistent the candidate is with the characteristics of domain terminology. In our experiments, the best results were obtained with \( a = 1/2 \) and \( b = 2 \). The notations used in Eqs. 1, 2 and 3 are listed in Table 3.

Table 3. Notations used in the multi-factor evaluator
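A hedged sketch of the evaluator is given below. The exact definitions of \( C \), \( C_{(1)} \) and \( f_{(i)} \) follow Table 3, which is not reproduced here; the reading used in the sketch (statistics over the candidates sharing a hot-word affix) and the example affix lists are assumptions for illustration, while the thresholds \( a = 1/2 \) and \( b = 2 \) are the values reported above. Candidates are then sorted in descending order of the returned score.

```python
# Hedged sketch of the multi-factor evaluator (Eqs. 1-3).
import math

A, B = 0.5, 2  # adjustment thresholds a and b reported above

def evaluate(candidate, freq, hot_affixes, bad_affixes, affix_freqs):
    """
    freq        : corpus frequency of the candidate (f_word)
    hot_affixes : affixes typical of domain terms (e.g. "成本")
    bad_affixes : affixes typical of non-terms (hypothetical example below)
    affix_freqs : affix -> frequencies of all candidates sharing that affix
    """
    alpha = beta = 0
    if any(candidate.startswith(s) or candidate.endswith(s) for s in bad_affixes):
        alpha = freq / A                                    # Eq. 1
    for s in hot_affixes:
        if candidate.startswith(s) or candidate.endswith(s):
            freqs = affix_freqs[s]
            c, c1 = len(freqs), sum(1 for f in freqs if f == 1)
            if c > c1:                                      # Eq. 2, singletons excluded
                beta = math.ceil(B * sum(f for f in freqs if f > 1) / (c - c1))
            break
    return freq - alpha + beta                              # Eq. 3

score = evaluate("固定成本", freq=12, hot_affixes={"成本"}, bad_affixes={"一些"},
                 affix_freqs={"成本": [12, 7, 3, 1]})
print(score)  # 12 - 0 + ceil(2 * 22 / 3) = 27
```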

4 Performance Analysis

4.1 Datasets and Experiments Settings

For the purpose of evaluation, we use the well-known textbook Macroeconomics (Chinese Edition) [13] as the corpus, whose domain terminology has already been labeled by domain experts. The total number of the domain terms labeled is 349.

Two different parsers are explored for comparison: the Stanford parser [14] and the LTP parser [15]. In order to evaluate the performance of our proposed parsing-based terminology extraction method, we also implement the traditional POS-based method for a fair comparison. Four measures are studied in the experiments: precision (P), recall (R), n-precision (P(n)) and n-recall (R(n)), as defined in Eqs. 4, 5, 6 and 7; n-precision and n-recall consider only the top-n results.

$$ P = \frac{\text{total number of the extracted domain terms}}{\text{total number of extracted words}} \times 100\% $$
(4)
$$ R = \frac{\text{total number of the extracted domain terms}}{\text{total number of the labeled domain terms}} \times 100\% $$
(5)
$$ P(n) = \frac{\text{total number of extracted terminologies in top-}n\text{ results}}{n} \times 100\% $$
(6)
$$ R(n) = \frac{\text{total number of extracted terminologies in top-}n\text{ results}}{\text{total number of the labeled domain terms}} \times 100\% $$
(7)
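For completeness, the following small helper computes the four measures from a ranked extraction result and the expert-labelled gold terms; the inputs are hypothetical, and the denominator of P here counts distinct extracted candidates.

```python
# Compute P, R, P(n), R(n) for a ranked list of extracted candidates.
def metrics(extracted_ranked, gold_terms, n):
    gold = set(gold_terms)
    extracted = set(extracted_ranked)
    hits = extracted & gold
    top_hits = set(extracted_ranked[:n]) & gold
    return {
        "P":    100 * len(hits) / len(extracted),       # Eq. 4
        "R":    100 * len(hits) / len(gold),            # Eq. 5
        "P(n)": 100 * len(top_hits) / n,                # Eq. 6
        "R(n)": 100 * len(top_hits) / len(gold),        # Eq. 7
    }

print(metrics(["边际收益", "总成本", "物品"], {"边际收益", "总成本", "机会成本"}, n=2))
```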

4.2 Experimental Results and Discussion

Table 4 presents the precision and recall of the traditional POS-based method and of our proposed domain terminology extraction method when using different parsers, namely the Stanford parser and the LTP parser. The LTP parser contributes the best precision.

Table 4. Total precision and recall of different methods

In order to verify the effectiveness of the proposed multi-factor evaluator and the rationality of the ranking, n-precision and n-recall are used as the measures. The n-precision and n-recall of the extracted terms are shown in Table 5. With the multi-factor evaluator included for filtering and reordering, the n-precision rises significantly and the n-recall is higher than that of the POS-based method.

Table 5. Precision and recall of different methods in top-n results

5 Conclusion

Domain terminology is important to the study of every domain, so an automatic domain terminology extraction method is in real demand. In this paper, we presented a novel automatic domain terminology extraction method that generates candidate domain terms using dependency parsing. In addition, a multi-factor evaluator is proposed to evaluate the significance of each candidate term; it considers not only frequency but also the influence of other factors that characterize domain terminology. A Chinese corpus in economics is used in the performance evaluation. Experimental results demonstrate that the proposed domain terminology extraction method outperforms the traditional POS-based method in both precision and recall.