Definition Extraction with Balanced Random Forests

Kobyliński, Łukasz; Przepiórkowski, Adam

doi:10.1007/978-3-540-85287-2_23

Łukasz Kobyliński & Adam Przepiórkowski

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)

Abstract

We propose a novel machine learning approach to the task of identifying definitions in Polish documents. We take the specifics of the problem domain and the characteristics of the available dataset into account by carefully choosing and adapting a classification method to highly imbalanced and noisy data. We evaluate the performance of a Random Forest-based classifier in extracting definitional sentences from natural language text and compare it with previous work.
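
The abstract does not say how the forest is adapted to imbalanced data, so the following is only a minimal sketch of the general balanced random forest idea: each tree is trained on a bootstrap sample containing equally many examples of every class, and predictions are combined by majority vote. As a loudly stated simplification, decision stumps stand in for full trees, and labels are assumed to be binary (0 = non-definition, 1 = definition). All function names, parameters, and the feature representation are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def balanced_bootstrap(X, y, rng):
    """Draw, with replacement, as many rows from each class as the rarest class has."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    n_min = min(len(rows) for rows in by_class.values())
    sample_X, sample_y = [], []
    for label, rows in by_class.items():
        for _ in range(n_min):
            sample_X.append(rng.choice(rows))
            sample_y.append(label)
    return sample_X, sample_y


def fit_stump(X, y, feature_ids):
    """Return the (feature, threshold, polarity) with the lowest training error.

    Assumption: a one-level stump replaces a fully grown tree to keep the sketch short.
    The stump predicts `polarity` when row[feature] > threshold, else 1 - polarity.
    """
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for polarity in (0, 1):
                errors = sum(
                    (polarity if row[f] > t else 1 - polarity) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    return best[1:]


def predict_stump(stump, row):
    f, t, polarity = stump
    return polarity if row[f] > t else 1 - polarity


def fit_balanced_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train each stump on its own balanced bootstrap and random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        boot_X, boot_y = balanced_bootstrap(X, y, rng)
        feature_ids = rng.sample(range(len(X[0])), n_features)
        forest.append(fit_stump(boot_X, boot_y, feature_ids))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps in the forest."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]
```

Here `X` is a list of numeric feature vectors, one per sentence, and `y` the matching 0/1 labels. Because every stump sees the rare definition class as often as the majority class, the ensemble is less inclined to label everything as a non-definition, which is the motivation for balancing on highly imbalanced data.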

Author information

Authors and Affiliations

Institute of Computer Science, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665, Warszawa, Poland
Łukasz Kobyliński
Institute of Computer Science, Polish Academy of Sciences, ul. Ordona 21, 01-237, Warszawa, Poland
Adam Przepiórkowski
Institute of Informatics, University of Warsaw, ul. Banacha 2, 02-097, Warszawa, Poland
Adam Przepiórkowski

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Chalmers University of Technology, 41296, Göteborg, Sweden
Bengt Nordström & Aarne Ranta

Cite this paper

Kobyliński, Ł., Przepiórkowski, A. (2008). Definition Extraction with Balanced Random Forests. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_23

DOI: https://doi.org/10.1007/978-3-540-85287-2_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science (R0)