Abstract
Since domain terminology plays a crucial role in the study of every domain, automatic domain terminology extraction methods are in real demand. In this paper, we propose a novel parsing-based method that generates domain compound terms by exploiting the dependency relations between words; dependency parsing is used to identify these relations. In addition, a multi-factor evaluator is proposed to assess the significance of each candidate term, considering not only frequency but also the influence of other factors that characterize domain terminology. Experimental results demonstrate that the proposed domain terminology extraction method outperforms the traditional POS-based method in both precision and recall.
1 Introduction
Domain terminology refers to the vocabulary of theoretical concepts in a specific domain. Through domain terminology, people can quickly understand the development of a subject, which is of great significance to scientific research. However, manually extracting domain terminology from massive text collections is prohibitively expensive. Therefore, automatic domain terminology extraction is in real demand in various domains.
The process flow of existing domain terminology extraction methods can be summarized into two steps: candidate term extraction and term evaluation [1]. First, the candidate term extractor extracts terms that satisfy domain-related conditions. Second, the evaluation module scores each candidate term with statistical measures and filters it when necessary.
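To make this two-step flow concrete, the following is a minimal sketch of such a pipeline; the parameters `extract_candidates` and `score_candidate` are illustrative placeholders rather than components of any particular published system.

```python
from typing import Callable, Dict, List, Tuple

def run_extraction_pipeline(
    corpus: List[str],
    extract_candidates: Callable[[str], List[str]],
    score_candidate: Callable[[str, Dict[str, int]], float],
    threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Generic two-step terminology pipeline: extract candidates, then score and filter them."""
    # Step 1: candidate term extraction over every document in the corpus.
    counts: Dict[str, int] = {}
    for doc in corpus:
        for cand in extract_candidates(doc):
            counts[cand] = counts.get(cand, 0) + 1
    # Step 2: evaluate each candidate and keep those whose score reaches the threshold.
    scored = [(cand, score_candidate(cand, counts)) for cand in counts]
    return sorted(
        [cs for cs in scored if cs[1] >= threshold],
        key=lambda cs: cs[1],
        reverse=True,
    )
```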
To enhance the accuracy of the extracted domain terms, we propose a novel parsing-based method in this paper. The contributions of this paper can be summarized as follows:
(1) Dependency parsing is utilized to generate candidate domain terms.
(2) A multi-factor evaluator is proposed, which evaluates and filters the candidate terms based on linguistic rules, statistical measures, and domain-specific term characteristics.
We compared the performance of our proposed domain terminology extraction method with a frequency-based POS term extraction method. In the experiments, our method identified a large number of accurate candidates, improved the recall rate, and produced a ranking that outperformed the counterpart in precision.
2 Related Work
Several automatic terminology recognition approaches have been proposed in recent years. The existing domain terminology extraction approaches can be classified into the following categories [2]:
(1) Dictionary-based methods. It is simple and easy to extract domain terms by matching words against a domain dictionary. However, domain terminology is constantly updated, so domain dictionaries cannot be easily maintained [3].
(2) Linguistic methods. These use surface grammatical analysis to recognize terminology [4]. However, linguistic rules are difficult to summarize, and linguistic methods may generate considerable noise when identifying terms.
(3) Statistical methods. These use the statistical properties of terms in a corpus to identify potential terminologies. Commonly used statistical measures include word frequency, TF-IDF [5], and C-Value [6]. Statistical methods may produce meaningless string combinations [7], common words (non-terminology), and other noise.
3 Parsing-Based Domain Terminology Extraction Method
In this paper, we propose to use dependency parsing in the process of candidate domain term identification. The proposed parsing-based domain terminology extraction method consists of three steps: dependency parsing establishment, candidate term generation, and candidate evaluation for ranking.
We provide the details of each step in the following sections. To illustrate the ideas, a Chinese corpus is used as a running example.
3.1 Dependency Parsing Establishment
Dependency parsing reveals the syntactic structure of a given sentence by analyzing the dependencies among the components of language units, and it explains well the relationships between adjacent words. Typical dependency parsing methods include graph-based [8, 9] and transition-based [10, 11] approaches.
The very first step in establishing a dependency parse is word segmentation. Since the CRF (Conditional Random Field)-based word segmentation algorithm has been proved to be one of the best segmenters [12], we adopt a CRF-based segmenter as our baseline. A syntactic parse tree is then generated over the segmented words. The dependency parse represents the grammatical structure and the relationships between the words. Table 1 shows an example dependency parse.
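As an illustration, a dependency parse of this kind could be obtained with the pyltp bindings of the LTP parser [15] roughly as follows; the model paths and the example sentence are placeholders, and the exact API may differ between pyltp releases.

```python
# -*- coding: utf-8 -*-
from pyltp import Segmentor, Postagger, Parser

sentence = "边际收益随产量增加而递减。"  # illustrative sentence, not the one shown in Table 1

segmentor = Segmentor()
segmentor.load("ltp_data/cws.model")        # placeholder path to the segmentation model
words = list(segmentor.segment(sentence))   # CRF-based word segmentation

postagger = Postagger()
postagger.load("ltp_data/pos.model")        # placeholder path to the POS model
postags = list(postagger.postag(words))

parser = Parser()
parser.load("ltp_data/parser.model")        # placeholder path to the dependency model
arcs = parser.parse(words, postags)

for word, tag, arc in zip(words, postags, arcs):
    # arc.head is 1-based (0 denotes the pseudo root); arc.relation is the dependency label
    print(word, tag, arc.head, arc.relation)

segmentor.release(); postagger.release(); parser.release()
```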
3.2 Candidate Term Generation
In the example sentence in the previous section, 收益 (revenue) is a nominal subject, and 边际 (marginal) serves as an adjectival modifier of 收益 (revenue). By grouping words in particular roles together, we can obtain the expected “phrases”. For example, 边际收益 (marginal revenue) can be regarded as a candidate domain term.
Therefore, we propose to create grammatical rules that group such words into phrases, which can then be regarded as candidate domain terminologies. In this paper, we use three grammatical rules, which should be widely applicable across domains: Noun + Noun, (Adj | Noun) + Noun, and ((Adj | Noun)+ | ((Adj | Noun)*(NounPrep)?)(Adj | Noun)*) Noun.
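The sketch below shows how the first two rules can be applied on top of the dependency parse by grouping adjectival or nominal modifiers with their noun heads. The `Token` structure and the relation label "ATT" (attribute, as used by LTP) are assumptions for illustration; parses produced by the Stanford parser use different labels (e.g., amod, nn), so the check would need to be adapted.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    text: str   # surface form, e.g. "边际"
    pos: str    # POS tag, e.g. "n" (noun) or "a" (adjective)
    head: int   # 0-based index of the governing token, -1 for the root
    rel: str    # dependency relation to the head, e.g. "ATT" in LTP's scheme

def generate_candidates(tokens: List[Token]) -> List[str]:
    """Group adjectival/nominal modifiers with their noun head into candidate terms."""
    candidates = []
    for head_idx, head in enumerate(tokens):
        if not head.pos.startswith("n"):
            continue                     # the rules require the candidate to end with a noun
        span = [head_idx]
        for mod_idx in range(head_idx - 1, -1, -1):
            mod = tokens[mod_idx]
            # keep contiguous Adj/Noun modifiers that depend directly on the head noun
            if mod.head == head_idx and mod.rel == "ATT" and mod.pos[0] in ("a", "n"):
                span.insert(0, mod_idx)
            else:
                break
        if len(span) > 1:                # only compounds such as Noun+Noun or Adj+Noun
            candidates.append("".join(tokens[i].text for i in span))
    return candidates

# 边际 (marginal, Adj) modifies 收益 (revenue, Noun) -> candidate 边际收益 (marginal revenue)
print(generate_candidates([Token("边际", "a", 1, "ATT"), Token("收益", "n", -1, "SBV")]))
```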
3.3 Candidates Evaluation and Ranking
It is inevitable that the candidate terms generated in Sect. 3.2 contain noise. Therefore, to control the quality of the selected domain terminology, we propose a set of measures for candidate evaluation. The candidates are ranked in descending order by their evaluation scores for the purpose of filtering.
3.3.1 Linguistic Rule Based Filter
In this paper, we propose to filter the candidate terms in a “backward” manner: candidates that obviously cannot be terminologies are filtered out by checking their POS sequences. To this end, word segmentation and POS tagging are performed on each candidate.
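A minimal sketch of such a backward filter is given below. The set of disallowed tags (LTP-style labels for pronouns, verbs, numerals, and function words) is an assumption for illustration; the text only states that POS patterns that obviously cannot form a term are rejected.

```python
from typing import List

# LTP-style tags assumed for illustration: r = pronoun, v = verb, m = numeral,
# q = measure word, p = preposition, u = particle, c = conjunction, d = adverb
DISALLOWED_TAGS = {"r", "v", "m", "q", "p", "u", "c", "d"}

def passes_backward_filter(candidate_pos: List[str]) -> bool:
    """Return False for candidates whose POS sequence obviously cannot form a term."""
    if not candidate_pos or not candidate_pos[-1].startswith("n"):
        return False                       # a domain term is expected to end with a noun
    return not any(tag in DISALLOWED_TAGS for tag in candidate_pos)

print(passes_backward_filter(["a", "n"]))  # True: Adj + Noun
print(passes_backward_filter(["v", "n"]))  # False: contains a verb
```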
3.3.2 Multi-factor Evaluation
The traditional terminology evaluation method is based on frequency: it sorts the candidates in descending order by their frequencies in the corpus. However, although frequency is an important factor, other factors, such as adhesion, also play important roles in the evaluation. Therefore, we propose a multi-factor evaluator. In addition to frequency, affixes (prefixes and suffixes) that often occur in phrases are considered as a factor. The affixes of hot words in a particular domain are often the same. For example, in the domain of economics, “固定成本 (fixed cost)”, “可变成本 (variable cost)” and “总成本 (total cost)” all contain the suffix “成本 (cost)”. Table 2 shows some affixes of the hot words and of the non-terms in the candidate set of the economics corpus.
Based on the observations in Table 2, affixes can bring either positive or negative impacts to domain terminology. Therefore, we propose an influence factor that indicates the impact of the affixes.
Equation 1 defines the relationship between frequency and the influence factor of non-terminology affixes, where \(a\) is an adjustment threshold.
Equation 2 defines the relationship between the average frequency and the influence factor of hot-word affixes; candidate terms that occur only once, whose number is denoted \( C_{(1)} \), are excluded, and \(b\) is an adjustment threshold.
Equation 3 combines frequency with the other factors into a single value, referred to as the evaluation score.
The candidates are ranked in descending order by their evaluation scores; the higher the score, the more consistent the candidate is with the characteristics of domain terminology. In our experiments, the best results are obtained when \(a = 1/2\) and \(b = 2\). The notations used in Eqs. 1, 2 and 3 are listed in Table 3.
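Since Eqs. 1–3 are not reproduced in this text, the sketch below only illustrates the general shape of such a multi-factor scorer: a raw frequency adjusted downward by non-terminology affixes and upward by hot-word suffixes, using the adjustment thresholds a and b. The concrete functional form, the affix tables, and the weights are assumptions for illustration, not the exact formulas of the paper.

```python
from collections import Counter
from typing import Dict

def multi_factor_score(
    term: str,
    freq: Counter,
    hot_suffixes: Dict[str, float],      # suffix -> average frequency of terms carrying it
    nonterm_affixes: Dict[str, float],   # affix  -> penalty weight
    a: float = 0.5,                      # adjustment threshold of Eq. 1 (best reported value: 1/2)
    b: float = 2.0,                      # adjustment threshold of Eq. 2 (best reported value: 2)
) -> float:
    """Illustrative multi-factor evaluation score; the exact form of Eqs. 1-3 is not shown here."""
    score = float(freq[term])
    for affix, weight in nonterm_affixes.items():
        if term.startswith(affix) or term.endswith(affix):
            score -= a * weight          # stand-in for the negative influence of non-term affixes
    for suffix, avg_freq in hot_suffixes.items():
        if term.endswith(suffix):
            score += b * avg_freq        # stand-in for the positive influence of hot-word suffixes
    return score

# Example: "成本 (cost)" as a hot-word suffix boosts "固定成本 (fixed cost)"
freq = Counter({"固定成本": 12})
print(multi_factor_score("固定成本", freq, {"成本": 8.0}, {"这样": 5.0}))
```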
4 Performance Analysis
4.1 Datasets and Experiments Settings
For the purpose of evaluation, we use the well-known textbook Macroeconomics (Chinese Edition) [13] as the corpus, in which the domain terminology has already been labeled by domain experts. The total number of labeled domain terms is 349.
Two different parsers are explored for comparison: the Stanford parser [14] and the LTP parser [15]. To evaluate the performance of our proposed parsing-based terminology extraction method, we also implement the traditional POS-based method for a fair comparison. Four measures are studied in the experiments: precision (P), recall (R), n-precision (P(n)) and n-recall (R(n)), as defined in Eqs. 4, 5, 6 and 7. n-precision and n-recall consider only the top-n ranked entries.
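Eqs. 4–7 are not reproduced here; the measures are assumed to follow the standard definitions:

\[
P = \frac{N_{correct}}{N_{extracted}}, \qquad R = \frac{N_{correct}}{N_{labeled}}, \qquad
P(n) = \frac{N_{correct}(n)}{n}, \qquad R(n) = \frac{N_{correct}(n)}{N_{labeled}},
\]

where \(N_{extracted}\) is the number of extracted terms, \(N_{labeled}\) is the number of expert-labeled domain terms, \(N_{correct}\) is the number of extracted terms that match labeled terms, and \(N_{correct}(n)\) counts such matches among the top-n ranked candidates.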
4.2 Experimental Results and Discussion
Table 4 presents the precision and recall of the traditional POS-based method and of our proposed domain terminology extraction method when using different parsers, namely the Stanford parser and the LTP parser. The LTP parser contributes the best precision.
To verify the effectiveness of the proposed multi-factor evaluator and the soundness of the resulting ranking, n-precision and n-recall are used as the measures. The n-precision and n-recall of the extracted terms are shown in Table 5. When the multi-factor evaluator is included for filtering and reordering, the n-precision rises significantly and the n-recall is higher than that of the POS-based method.
5 Conclusion
Domain terminology is important in the study of every domain; thus, an automatic domain terminology extraction method is in real demand. In this paper, we presented a novel automatic domain terminology extraction method that generates candidate domain terms using dependency parsing. In addition, a multi-factor evaluator was proposed to evaluate the significance of each candidate term, considering not only frequency but also the influence of other factors that characterize domain terminology. A Chinese corpus in economics was used in the performance evaluation. Experimental results demonstrate that the proposed domain terminology extraction method outperforms the traditional POS-based method in both precision and recall.
References
Nakagawa, H., Mori, T.: Automatic term recognition based on statistics of compound nouns and their components. Terminology 9, 201–219 (2003)
Korkontzelos, I., Klapaftis, I.P., Manandhar, S.: Reviewing and evaluating automatic term recognition techniques. In: 6th International Conference on Advances in Natural Language Processing, pp. 248–259 (2008)
Krauthammer, M., Nenadic, G.: Term identification in the biomedical literature. J. Biomed. Inform. 37(6), 512–526 (2004)
Bourigault, D.: Surface grammatical analysis for the extraction of terminological noun phrases. In: Proceedings of the 14th Conference on Computational Linguistics, Stroudsburg, PA, USA, pp. 977–981 (1992)
Rezgui, Y.: Text-based domain ontology building using tf-idf and metric clusters techniques. Knowl. Eng. Rev. 22, 379–403 (2007)
Frantzi, K., Ananiadou, S., Mima, H.: Automatic recognition of multi-word terms: the c-value/nc-value method. Int. J. Digit. Libr. 3(2), 115–130 (2000)
Damerau, F.J.: Generating and evaluating domain-oriented multi-word terms from texts. Inf. Process. Manage. 29(4), 433–447 (1993)
Eisner, J.: Three new probabilistic models for dependency parsing: an exploration. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), Copenhagen, pp. 340–345 (1996). http://cs.jhu.edu/~jason/papers/#coling96
McDonald, R., Pereira, F., Ribarov, K., Hajic, J.: Non-projective dependency parsing using spanning tree algorithms. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 523–530. Association for Computational Linguistics, Vancouver (2005). http://www.aclweb.org/anthology/H/H05/H05-1066
Kubler, S., McDonald, R., Nivre, J.: Dependency parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool, San Rafael (2009). http://books.google.com/books?id=k3iiup7HB9UC
Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kubler, S., Marinov, S., Marsi, E.: MaltParser: a language-independent system for data-driven dependency parsing. Nat. Lang. Eng. 13, 95–135 (2007)
Yang, D., Pan, Y., Furui, S.: Automatic Chinese abbreviation generation using conditional random field. In: NAACL 2009, pp. 273–276 (2009)
Mankiw, N.G.: Macroeconomics, 4th ed. China Renmin University Press (2002)
Klein, D., Manning, C.D.: Fast exact inference with a factored model for natural language parsing. In: Advances in Neural Information Processing Systems (NIPS 2002), vol. 15, pp. 3–10. MIT Press, Cambridge (2003)
Che, W., Li, Z., Liu, T.: LTP: a Chinese language technology platform. In: Proceedings of the Coling 2010: Demonstrations, pp. 13–16, Beijing, China (2010)
Acknowledgements
This project was partially supported by grants from the Natural Science Foundation of China (#71671178, #91546201, #61202321) and by the open project of the Key Lab of Big Data Mining and Knowledge Management. It was also supported by the Hainan Provincial Department of Science and Technology under Grant No. ZDKJ2016021, and by Guangdong Provincial Science and Technology Project 2016B010127004.