Abstract
The paper describes two approaches to modeling word normalization (such as replacing “wrote” or “writing” by “write”) based on recurring patterns in the word suffix and in the context of the word as obtained from texts. To collect patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to derive word context. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels, and then enlarge the training set with the examples with predicted labels for which the model is most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To gauge the influence of the number of labeled rather than unlabeled examples, we give a comparison with the situation in which simply more labeled data is provided.
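The procedure of enlarging the training set with the most confidently predicted examples can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the learner, so a simple majority vote over word suffixes is assumed here, and all names (`transformation`, `train`, `predict`, `self_train`, `max_suffix`, `per_round`, `min_conf`) are hypothetical. Only the suffix-based view is shown; the context-based view is omitted.

```python
from collections import Counter, defaultdict


def transformation(word, lemma):
    """Encode a (word, normalized word) pair as (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return word[i:], lemma[i:]


def apply_transformation(word, rule):
    """Apply a (strip, add) rule; leave the word unchanged if the rule does not fit."""
    strip, add = rule
    if strip and not word.endswith(strip):
        return word
    return word[:len(word) - len(strip)] + add


def train(pairs, max_suffix=4):
    """Count which rules co-occur with each word suffix of up to max_suffix characters."""
    model = defaultdict(Counter)
    for word, lemma in pairs:
        rule = transformation(word, lemma)
        for k in range(1, min(max_suffix, len(word)) + 1):
            model[word[-k:]][rule] += 1
    return model


def predict(model, word, max_suffix=4):
    """Return (normalized word, confidence) using the longest suffix seen in training.

    Confidence is the share of the majority rule among all rules seen with that suffix
    (an assumed certainty measure, chosen only for illustration).
    """
    for k in range(min(max_suffix, len(word)), 0, -1):
        counts = model.get(word[-k:])
        if counts:
            rule, n = counts.most_common(1)[0]
            return apply_transformation(word, rule), n / sum(counts.values())
    return word, 0.0


def self_train(labeled, unlabeled, rounds=3, per_round=2, min_conf=0.6):
    """Gradually move the most confidently labeled words from the pool into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        scored = [(predict(model, w), w) for w in pool]
        scored = [s for s in scored if s[0][1] >= min_conf]
        scored.sort(key=lambda s: s[0][1], reverse=True)
        chosen = scored[:per_round]
        if not chosen:
            break
        labeled += [(w, lemma) for (lemma, _), w in chosen]
        picked = {w for _, w in chosen}
        pool = [w for w in pool if w not in picked]
    return train(labeled), labeled


if __name__ == "__main__":
    seed = [("writing", "write"), ("walking", "walk"),
            ("walked", "walk"), ("talked", "talk")]
    model, grown = self_train(seed, ["jumped", "talking", "xyz"])
    print(grown)
```

In this sketch a word whose suffixes were never seen receives confidence 0.0 and stays in the unlabeled pool. Wrong predictions that pass the threshold are added as if they were correct, which is one way such a procedure can hurt a model.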
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mladenić, D. (2002). Modeling Information in Textual Data Combining Labeled and Unlabeled Data. In: Hand, D.J., Adams, N.M., Bolton, R.J. (eds) Pattern Detection and Discovery. Lecture Notes in Computer Science, vol 2447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45728-3_13
DOI: https://doi.org/10.1007/3-540-45728-3_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44148-9
Online ISBN: 978-3-540-45728-2
eBook Packages: Springer Book Archive