Abstract
This report introduces several measures of term representativeness and a scheme, called the baseline method, for defining them. The representativeness of a term T is measured by a normalized characteristic value that indicates how biased the distribution of words is in D(T), the set of all documents that contain the term. Dist(D(T)), the distance between the distribution of words in D(T) and the distribution of words in the whole corpus, was found, after normalization, to be an effective characteristic value for this bias. Experiments showed that the measure based on the normalized value of Dist(D(∙)) strongly outperforms existing measures in evaluating the representativeness of terms in newspaper articles. The measure was also effective, in combination with term frequency, as a means of automatically extracting terms from abstracts of papers on artificial intelligence.
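The unnormalized quantity Dist(D(T)) can be illustrated with a minimal Python sketch. The abstract does not say which distance is used or how the value is normalized, so the sketch assumes Kullback-Leibler divergence as the distance and leaves normalization out entirely. The function names `word_distribution` and `dist` and the toy corpus are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def word_distribution(docs):
    """Relative frequency of each word over a list of tokenized documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def dist(term, corpus):
    """Unnormalized Dist(D(T)) under an ASSUMED distance: KL divergence
    between the word distribution in D(T) and in the whole corpus."""
    d_t = [doc for doc in corpus if term in doc]  # D(T): documents containing the term
    if not d_t:
        return 0.0  # term never occurs, so there is nothing to compare
    p = word_distribution(d_t)
    q = word_distribution(corpus)
    # Every word in D(T) also occurs in the corpus, so q[w] is never zero.
    return sum(p_w * math.log(p_w / q[w]) for w, p_w in p.items())


# Toy corpus (illustrative data, not from the report).
corpus = [
    ["neural", "network", "training", "the"],
    ["neural", "network", "layers", "the"],
    ["stock", "market", "prices", "the"],
    ["weather", "forecast", "rain", "the"],
]

for term in ("neural", "the"):
    print(term, dist(term, corpus))
```

In this toy corpus, "the" occurs in every document, so D("the") is the whole corpus and its distance is zero, whereas "neural" selects a topically biased subset and receives a positive value. The raw value also depends on the size of D(T), which is why the abstract speaks of a normalized value; that normalization step is not reproduced here.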
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hisamitsu, T., Tsujii, J. (2003). Measuring Term Representativeness. In: Pazienza, M.T. (ed.) Information Extraction in the Web Era. SCIE 2002. Lecture Notes in Computer Science, vol 2700. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45092-4_3
DOI: https://doi.org/10.1007/978-3-540-45092-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40579-5
Online ISBN: 978-3-540-45092-4
eBook Packages: Springer Book Archive