Modeling Information in Textual Data Combining Labeled and Unlabeled Data

Mladenić, Dunja

doi:10.1007/3-540-45728-3_13

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2447)

Abstract

The paper describes two approaches to modeling word normalization (such as replacing “wrote” or “writing” by “write”) based on recurring patterns in two sources: the word suffix and the context of the word obtained from texts. To collect the patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to obtain word contexts. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels, and then enlarge the training set with the examples with predicted labels for which the model is the most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of the number of labeled, as opposed to unlabeled, examples, we give a comparison with the situation in which simply more labeled data is provided.
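
The bootstrapping loop described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the encoding of a label as a (suffix to strip, suffix to add) rule, the suffix length limit, the confidence score, and the toy word lists are all assumptions made for illustration. The sketch trains a suffix-based model on labeled (word, normalized word) pairs, predicts labels for unlabeled words, and moves the most confidently labeled words into the training set.

```python
from collections import Counter, defaultdict


def transformation(word, lemma):
    """Encode a (word, lemma) pair as a rule: (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (word[i:], lemma[i:])


def apply_transformation(word, rule):
    """Apply a rule to a word; leave the word unchanged if the rule does not fit."""
    strip, add = rule
    if strip and not word.endswith(strip):
        return word
    return word[:len(word) - len(strip)] + add


def train(pairs, max_suffix=3):
    """Count which rule co-occurs with each word suffix of length 1..max_suffix."""
    model = defaultdict(Counter)
    for word, lemma in pairs:
        rule = transformation(word, lemma)
        for k in range(1, min(max_suffix, len(word)) + 1):
            model[word[-k:]][rule] += 1
    return model


def predict(model, word, max_suffix=3):
    """Return (rule, confidence) using the longest suffix seen in training.

    Confidence is count / (total + 1), an assumed score that grows with
    the amount of supporting evidence.
    """
    for k in range(min(max_suffix, len(word)), 0, -1):
        counts = model.get(word[-k:])
        if counts:
            rule, n = counts.most_common(1)[0]
            return rule, n / (sum(counts.values()) + 1)
    return ("", ""), 0.0


def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Repeatedly add the most confidently labeled unlabeled words to the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        scored = []
        for w in pool:
            rule, conf = predict(model, w)
            scored.append((conf, w, rule))
        scored.sort(key=lambda t: t[0], reverse=True)
        chosen = scored[:per_round]
        labeled += [(w, apply_transformation(w, rule)) for _, w, rule in chosen]
        picked = {w for _, w, _ in chosen}
        pool = [w for w in pool if w not in picked]
    return train(labeled), labeled


# Toy data, invented for this sketch.
seed = [("writing", "write"), ("making", "make"), ("taking", "take"),
        ("walked", "walk"), ("jumped", "jump")]
model, enlarged = self_train(seed, ["baking", "talked", "hiking"])
```

Because predicted labels are added without verification, any wrong prediction becomes training data for later rounds, which is one reason the choice of confidence threshold and the number of examples added per round matter in schemes of this kind.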

Author information

Authors and Affiliations

J. Stefan Institute, Ljubljana, Slovenia and Carnegie Mellon University, Pittsburgh, USA
Dunja Mladenić

Editor information

Editors and Affiliations

Department of Mathematics, Imperial College of Science, Technology and Medicine, Huxley Building, 180 Queen’s Gate, SW7 2BZ, London, UK
David J. Hand, Niall M. Adams & Richard J. Bolton

About this paper

Cite this paper

Mladenić, D. (2002). Modeling Information in Textual Data Combining Labeled and Unlabeled Data. In: Hand, D.J., Adams, N.M., Bolton, R.J. (eds) Pattern Detection and Discovery. Lecture Notes in Computer Science, vol 2447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45728-3_13

DOI: https://doi.org/10.1007/3-540-45728-3_13
Published: 02 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44148-9
Online ISBN: 978-3-540-45728-2
eBook Packages: Springer Book Archive
