1 Introduction

Domain terminology refers to the vocabulary of theoretical concepts in a specific domain. Through domain terminology, readers can quickly grasp the development of a subject, which is of great significance to scientific research. However, extracting domain terminology manually from massive text collections is prohibitively expensive. Therefore, automatic domain terminology extraction is in real demand in various domains.

The process flow of existing domain terminology extraction methods can be summarized in two steps: candidate term extraction and term evaluation [1]. First, the candidate term extractor extracts the terms that conform to the domain conditions. Second, the evaluation module evaluates each candidate term and, based on statistical measures, filters it out when necessary.

In order to enhance the accuracy of the extracted domain terms, in this paper we propose a novel parsing-based method. The contributions of this paper can be summarized as follows:

  (1) We propose to utilize dependency parsing to generate candidate domain terms.

  (2) We propose a multi-factor evaluator, which evaluates and filters the candidate terms based on linguistic rules, statistical methods, and domain-specific term characteristics.

We compared the performance of our proposed domain terminology extraction method with a frequency-based, POS-based term extraction method. In the experiments, our method identified a plentiful set of accurate candidates and improved the recall rate, and its ranking outperformed the counterpart in precision.

2 Related Work

Some automatic terminology recognition approaches have been proposed in recent years. The existing domain terminology extraction approaches can be classified into four categories [2]:

  (1) Dictionary-based methods. Domain terms can be extracted simply by matching words against a domain dictionary. However, domain terminology is constantly updated, so domain dictionaries are difficult to maintain [3].

  (2) Linguistic methods. They use surface grammatical analysis to recognize terminology [4]. However, the linguistic rules are difficult to summarize, and linguistic methods may generate considerable noise when identifying terms.

  (3) Statistical methods. They use the statistical properties of terms in a corpus to identify potential terminologies. Commonly used statistical measures include word frequency, TF-IDF [5], and C-Value [6]. Statistical methods may produce meaningless string combinations [7], common words (non-terminology), and other noise.

3 Parsing-Based Domain Terminology Extraction Method

In this paper, we propose to use dependency parsing in the process of candidate domain term identification. The proposed parsing-based domain terminology extraction method consists of three steps: dependency parsing establishment, candidate term generation, and candidate evaluation for ranking.

We provide the details of each step in the following sections. To make the ideas easier to follow, a Chinese corpus is used as the running example.

3.1 Dependency Parsing Establishment

Dependency parsing reveals the syntactic structure of a given sentence by analyzing the dependencies among the components of language units, and it can well explain the relationships between adjacent words. Typical dependency parsing methods include graph-based [8, 9] and transition-based [10, 11] approaches.

The very first step in establishing dependency parsing is word segmentation. Since CRF (Conditional Random Field)-based word segmentation has been proved to be one of the best segmentation algorithms [12], we adopt a CRF-based parser as our baseline word segmenter. A syntactic parse tree is then generated at the same time; the dependency parse represents the grammatical structure and the relationships between the words. Table 1 shows an example dependency parse.

Table 1. Dependency parsing of “边际收益等于物品的价格。” (The marginal revenue equals the price of the item.)
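As an illustration of this step, the following minimal sketch performs segmentation, POS tagging and dependency parsing of the example sentence with the open-source Stanza toolkit. Stanza is used here only as a readily available stand-in; it is not one of the parsers evaluated in Sect. 4, and the one-time model download is left commented out.

```python
# Minimal sketch: segmentation + dependency parsing of the example sentence.
# Stanza is an illustrative stand-in for the Stanford/LTP parsers used later.
import stanza

# stanza.download("zh")  # one-time Chinese model download
nlp = stanza.Pipeline("zh", processors="tokenize,pos,lemma,depparse")

doc = nlp("边际收益等于物品的价格。")
for sent in doc.sentences:
    for word in sent.words:
        # head == 0 marks the root of the dependency tree
        print(word.id, word.text, word.upos, word.head, word.deprel)
```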

3.2 Candidate Term Generation

In the example sentence in the previous section, 收益 (revenue) is a nominal subject, and 边际 (marginal) serves as an adjectival modifier of 收益 (revenue). By grouping words in particular roles together, we can obtain the expected “phrases”. For example, 边际收益 (marginal revenue) can be regarded as a candidate domain term.
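The grouping described above can be sketched as follows. This is only an illustrative reading of the generation step: the relation names (amod, compound, nmod) and the adjacency requirement are simplifying assumptions, and the input mirrors the parser output format of the previous sketch.

```python
# Group a head noun with an immediately preceding adjectival/nominal modifier.
MODIFIER_RELS = {"amod", "compound", "nmod"}  # assumed modifier relations

def candidate_phrases(words):
    """words: list of dicts with keys id, text, upos, head, deprel."""
    candidates = []
    for w in words:
        if w["upos"] != "NOUN":
            continue
        # modifiers attached to this noun that directly precede it
        mods = [m for m in words
                if m["head"] == w["id"]
                and m["deprel"] in MODIFIER_RELS
                and m["id"] == w["id"] - 1]
        if mods:
            candidates.append("".join(m["text"] for m in mods) + w["text"])
    return candidates

words = [
    {"id": 1, "text": "边际", "upos": "ADJ",  "head": 2, "deprel": "amod"},
    {"id": 2, "text": "收益", "upos": "NOUN", "head": 3, "deprel": "nsubj"},
]
print(candidate_phrases(words))  # ['边际收益'] (marginal revenue)
```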

Therefore, we propose to create grammatical rules to generate phrases, which can be regarded as domain terminologies. In this paper, we propose three grammatical rules, which may be widely accepted by different domains: Noun + Noun, (Adj | Noun) + Noun, and ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?)(Adj | Noun)*) Noun.
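One straightforward way to apply such rules is to encode each sentence's POS sequence as a string and match it with a regular expression. The sketch below is only an assumed encoding (single-letter codes over coarse tags) of the rules given above; it is not the exact implementation used in this paper.

```python
# Hedged sketch: the grammatical rules as a regex over a coded POS sequence.
import re

POS_CODE = {"NOUN": "N", "ADJ": "A", "ADP": "P"}  # P stands for a preposition

# (Adj|Noun)+ Noun  and  ((Adj|Noun)* (Noun Prep)?) (Adj|Noun)* Noun,
# which together subsume the simpler Noun + Noun rule.
PATTERN = re.compile(r"((?:[AN])+|(?:[AN])*(?:NP)?(?:[AN])*)N")

def match_candidates(tagged):
    """tagged: list of (token, coarse POS tag) pairs for one sentence."""
    codes = "".join(POS_CODE.get(pos, "x") for _, pos in tagged)
    spans = []
    for m in PATTERN.finditer(codes):
        if m.end() - m.start() >= 2:  # keep multi-word matches only
            spans.append("".join(tok for tok, _ in tagged[m.start():m.end()]))
    return spans

print(match_candidates([("边际", "ADJ"), ("收益", "NOUN"), ("等于", "VERB"),
                        ("物品", "NOUN"), ("的", "PART"), ("价格", "NOUN")]))
# ['边际收益']
```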

3.3 Candidates Evaluation and Ranking

It is inevitable that the candidate terms generated in Sect. 3.2 contain noise. Therefore, in order to control the quality of the selected domain terminology, we propose a set of measures for candidate evaluation. The candidates are ranked in descending order by evaluation score for the purpose of filtering.

3.3.1 Linguistic Rule Based Filter

In this paper, we propose to filter the candidate terms in a “backward” manner, which filters out the candidate terms that obviously cannot be terminologies by checking their POS tags. For this purpose, word segmentation and POS tagging are performed on the candidates.
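A minimal sketch of such a backward filter is given below; the concrete disqualifying POS tags are assumptions chosen for illustration, not the rule set used in our experiments.

```python
# Backward POS filter: discard candidates that obviously cannot be terms.
DISALLOWED_ANY  = {"VERB", "PRON", "NUM", "ADV"}  # assumed: not allowed anywhere
DISALLOWED_LAST = {"ADP", "PART", "CCONJ"}        # assumed: candidate must not end here

def passes_pos_filter(tagged_candidate):
    """tagged_candidate: list of (token, POS) pairs after segmentation and tagging."""
    tags = [pos for _, pos in tagged_candidate]
    if any(t in DISALLOWED_ANY for t in tags):
        return False
    if tags[-1] in DISALLOWED_LAST:
        return False
    return True

print(passes_pos_filter([("边际", "ADJ"),  ("收益", "NOUN")]))  # True
print(passes_pos_filter([("等于", "VERB"), ("物品", "NOUN")]))  # False
```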

3.3.2 Multi-factor Evaluation

The traditional terminology evaluation method is frequency-based: it sorts the candidates in descending order by their frequencies in the corpus. However, although frequency is an important factor, other factors, such as adhesion, also play important roles in evaluation. Therefore, we propose a multi-factor evaluator. In addition to frequency, affixes (prefixes and suffixes) that often occur in phrases are considered as a factor. The affixes of hot words in a particular domain are often the same. For example, in the domain of economics, “固定成本 (constant cost)”, “可变成本 (variable cost)” and “总成本 (total cost)” all contain the suffix “成本 (cost)”. Table 2 shows some affixes of the hot words and of the non-terms in the candidate set of the economics corpus.

Table 2. Some affixes of the hot words and non-terms in economics

Based on the observations in Table 2, an affix can have either a positive or a negative impact on a candidate being domain terminology. Therefore, we propose influence factors that quantify the impact of the affixes.

Equation 1 gives the influence factor for non-terminology affixes in terms of the candidate's frequency, where \( a \) is an adjustment threshold.

$$ \alpha = \frac{f_{word}}{a} $$
(1)

Equation 2 gives the influence factor for hot-word affixes, based on the average frequency of the candidates sharing the affix. Candidate terms that occur only once, whose count is \( C_{(1)} \), are excluded; \( b \) is an adjustment threshold.

$$ \beta = \left\lceil b\,\frac{\sum\nolimits_{i = 2}^{n} f_{(i)}}{C - C_{(1)}} \right\rceil $$
(2)

Equation 3 combines the frequency with the two influence factors to give the evaluation score.

$$ v = f_{word} - \alpha + \beta $$
(3)

The candidates are ranked in descending order by their evaluation scores: the higher the score, the more consistent the candidate is with the characteristics of domain terminology. In our experiments, the best results were obtained with \( a = 1/2 \) and \( b = 2 \). The notations used in Eqs. 1, 2 and 3 are listed in Table 3.

Table 3. Notations used in the multi-factor evaluator
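A hedged sketch of the evaluator is given below. The exact definitions of \( C \), \( C_{(1)} \) and \( f_{(i)} \) follow Table 3, which is not reproduced here; the reading used in the sketch (statistics over the candidates sharing a hot-word affix) and the example affix lists are assumptions for illustration, while the thresholds \( a = 1/2 \) and \( b = 2 \) are the values reported above. Candidates are then sorted in descending order of the returned score.

```python
# Hedged sketch of the multi-factor evaluator (Eqs. 1-3).
import math

A, B = 0.5, 2  # adjustment thresholds a and b reported above

def evaluate(candidate, freq, hot_affixes, bad_affixes, affix_freqs):
    """
    freq        : corpus frequency of the candidate (f_word)
    hot_affixes : affixes typical of domain terms (e.g. "成本")
    bad_affixes : affixes typical of non-terms (hypothetical example below)
    affix_freqs : affix -> frequencies of all candidates sharing that affix
    """
    alpha = beta = 0
    if any(candidate.startswith(s) or candidate.endswith(s) for s in bad_affixes):
        alpha = freq / A                                    # Eq. 1
    for s in hot_affixes:
        if candidate.startswith(s) or candidate.endswith(s):
            freqs = affix_freqs[s]
            c, c1 = len(freqs), sum(1 for f in freqs if f == 1)
            if c > c1:                                      # Eq. 2, singletons excluded
                beta = math.ceil(B * sum(f for f in freqs if f > 1) / (c - c1))
            break
    return freq - alpha + beta                              # Eq. 3

score = evaluate("固定成本", freq=12, hot_affixes={"成本"}, bad_affixes={"一些"},
                 affix_freqs={"成本": [12, 7, 3, 1]})
print(score)  # 12 - 0 + ceil(2 * 22 / 3) = 27
```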

4 Performance Analysis

4.1 Datasets and Experiments Settings

For the purpose of evaluation, we use the well-known textbook Macroeconomics (Chinese Edition) [13] as the corpus, whose domain terminology has already been labeled by domain experts. The total number of the domain terms labeled is 349.

Two different parsers are explored for comparison: the Stanford parser [14] and the LTP parser [15]. In order to evaluate the performance of our proposed parsing-based terminology extraction method, we also implement the traditional POS-based method for a fair comparison. Four measures are studied in the experiments: precision (P), recall (R), n-precision (P(n)) and n-recall (R(n)), as defined in Eqs. 4, 5, 6 and 7; n-precision and n-recall consider only the top-n results.

$$ P = \frac{\text{total number of the extracted domain terms}}{\text{total number of extracted words}} \times 100\% $$
(4)
$$ R = \frac{\text{total number of the extracted domain terms}}{\text{total number of the labeled domain terms}} \times 100\% $$
(5)
$$ P(n) = \frac{\text{total number of extracted terminologies in top-}n\text{ results}}{n} \times 100\% $$
(6)
$$ R(n) = \frac{\text{total number of extracted terminologies in top-}n\text{ results}}{\text{total number of the labeled domain terms}} \times 100\% $$
(7)
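For completeness, the following small helper computes the four measures from a ranked extraction result and the expert-labelled gold terms; the inputs are hypothetical, and the denominator of P here counts distinct extracted candidates.

```python
# Compute P, R, P(n), R(n) for a ranked list of extracted candidates.
def metrics(extracted_ranked, gold_terms, n):
    gold = set(gold_terms)
    extracted = set(extracted_ranked)
    hits = extracted & gold
    top_hits = set(extracted_ranked[:n]) & gold
    return {
        "P":    100 * len(hits) / len(extracted),       # Eq. 4
        "R":    100 * len(hits) / len(gold),            # Eq. 5
        "P(n)": 100 * len(top_hits) / n,                # Eq. 6
        "R(n)": 100 * len(top_hits) / len(gold),        # Eq. 7
    }

print(metrics(["边际收益", "总成本", "物品"], {"边际收益", "总成本", "机会成本"}, n=2))
```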

4.2 Experimental Results and Discussion

Table 4 presents the precision and recall of the traditional POS-based method and of our proposed domain terminology extraction method when using different parsers, namely the Stanford parser and the LTP parser. The LTP parser contributes the best precision.

Table 4. Total precision and recall of different methods

In order to verify the effectiveness of the proposed multi-factor evaluator and the rationality of the ranking, n-precision and n-recall are used as the measures. The n-precision and n-recall of the extracted terms are shown in Table 5. With the multi-factor evaluator included for filtering and reordering, the n-precision rises significantly and the n-recall is higher than that of the POS-based method.

Table 5. Precision and recall of different methods in top-n results

5 Conclusion

Domain terminology is important to the study of every domain, so an automatic domain terminology extraction method is in real demand. In this paper, we presented a novel automatic domain terminology extraction method that generates candidate domain terms using dependency parsing. In addition, a multi-factor evaluator is proposed to evaluate the significance of each candidate term; it considers not only frequency but also the influence of other factors that characterize domain terminology. A Chinese corpus in economics is used in the performance evaluation. Experimental results demonstrate that the proposed domain terminology extraction method outperforms the traditional POS-based method in both precision and recall.