Abstract
This paper describes the rule-based classification of numerals and of strings that include numerals. Such strings consist of a number and one or more semantic units that indicate a SPEED, NUMBER, or other measure, and they are classified at three levels: morphological, syntactic, and semantic. The approach employs three interpretation processes: word trigram construction with a tokeniser, rule-based processing of number strings, and n-gram based classification. We extracted numeral strings from 378 online newspaper articles and found that, on average, they comprised about 2.2% of the words in the articles. We manually derived n-gram rules for disambiguating the meanings of the number strings from 886 numeral strings and tested the approach on the remaining 3251 strings. We implemented two heuristic disambiguation methods based on the frequency statistics of each category collected from the sample data; the two methods achieved precision of 86.8% and 86.3%, respectively. This paper focuses on the acquisition and performance of the different types of rules applied to numeral string classification.
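The three interpretation processes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token pattern, the rule tables, the sample frequency counts, and every category other than SPEED and NUMBER are assumptions made for illustration only. The sketch builds a word trigram around each numeral, applies hand-written context rules, and falls back on the most frequent category in the sample data, which is one plausible reading of a frequency-based heuristic.

```python
import re
from collections import Counter

# Hypothetical token pattern: numerals (with , or . groups), words, other symbols.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+|\S")

def tokenise(text):
    """Split text into numeral, word, and symbol tokens."""
    return TOKEN_RE.findall(text)

def numeral_trigrams(tokens):
    """Yield (left, numeral, right) for every token that starts with a digit."""
    padded = ["<s>"] + tokens + ["</s>"]
    for i in range(1, len(padded) - 1):
        if padded[i][0].isdigit():
            yield padded[i - 1].lower(), padded[i], padded[i + 1].lower()

# Illustrative hand-written rules, NOT the rule set acquired in the paper.
RIGHT_UNIT_RULES = {"mph": "SPEED", "km": "LENGTH", "percent": "PERCENT"}
LEFT_CONTEXT_RULES = {"$": "MONEY", "aged": "AGE"}

def classify(trigram, category_counts):
    """Apply context rules first, then fall back on the most frequent category."""
    left, _, right = trigram
    if right in RIGHT_UNIT_RULES:
        return RIGHT_UNIT_RULES[right]
    if left in LEFT_CONTEXT_RULES:
        return LEFT_CONTEXT_RULES[left]
    return category_counts.most_common(1)[0][0]

# Made-up category frequencies standing in for statistics from sample data.
sample_counts = Counter({"NUMBER": 5, "MONEY": 3, "SPEED": 1})

text = "The car reached 120 mph and cost $ 30,000 after 3 attempts."
labels = [(t[1], classify(t, sample_counts))
          for t in numeral_trigrams(tokenise(text))]
print(labels)
```

In this sketch the unit to the right of the numeral takes priority over the word to its left, and the frequency fallback fires only when no rule matches; both orderings are design choices of the example rather than claims about the paper.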
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Min, K., MacDonell, S., Moon, YJ. (2006). Heuristic and Rule-Based Knowledge Acquisition: Classification of Numeral Strings in Text. In: Hoffmann, A., Kang, Bh., Richards, D., Tsumoto, S. (eds) Advances in Knowledge Acquisition and Management. PKAW 2006. Lecture Notes in Computer Science, vol 4303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11961239_4
DOI: https://doi.org/10.1007/11961239_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68955-3
Online ISBN: 978-3-540-68957-7
eBook Packages: Computer Science, Computer Science (R0)