Modeling Information in Textual Data Combining Labeled and Unlabeled Data

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2447)

Abstract

The paper describes two approaches to modeling word normalization (such as replacing “wrote” or “writing” by “write”) based on recurring patterns in two sources: the word suffix and the context of the word obtained from texts. To collect the patterns, we first represent the data using two independent feature sets and then find the patterns responsible for a particular word mapping. The modeling is based on a set of hand-labeled words of the form (word, normalized word) and on texts from 28 novels obtained from the Web, which are used to obtain word context. Since hand-labeling is a demanding task, we investigate the possibility of improving our modeling by gradually adding unlabeled examples. Namely, we use the initial model based on word suffix to predict the labels. Then we enlarge the training set with the examples with predicted labels for which the model is the most certain. The experiments show that this helps the context-based approach while largely hurting the suffix-based approach. To get an idea of the influence of the number of labeled as opposed to unlabeled examples, we give a comparison with the situation when simply more labeled data is provided.
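
The gradual-labeling procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the class encoding (a "strip k characters, append s" transformation), the count-based suffix model, the confidence measure (share of the majority transformation), and all function names and parameters (`train`, `predict`, `self_train`, `max_suffix`, `rounds`, `per_round`) are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def transformation(word, lemma):
    """Encode a (word, lemma) pair as (chars to strip, string to append).

    Assumed encoding for this sketch, not taken from the paper.
    """
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (len(word) - i, lemma[i:])


def apply_transformation(word, t):
    """Apply a (strip, append) transformation to a word."""
    strip, add = t
    return word[:len(word) - strip] + add


def train(pairs, max_suffix=3):
    """Count how often each transformation follows each word suffix."""
    model = defaultdict(Counter)
    for word, lemma in pairs:
        t = transformation(word, lemma)
        for k in range(1, min(max_suffix, len(word)) + 1):
            model[word[-k:]][t] += 1
    return model


def predict(model, word, max_suffix=3):
    """Return (transformation, confidence) using the longest known suffix."""
    for k in range(min(max_suffix, len(word)), 0, -1):
        counts = model.get(word[-k:])
        if counts:
            t, n = counts.most_common(1)[0]
            return t, n / sum(counts.values())
    return (0, ""), 0.0  # unknown suffix: leave the word unchanged


def self_train(labeled, unlabeled, rounds=3, per_round=2):
    """Repeatedly add the most confidently labeled unlabeled words."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        ranked = sorted(pool, key=lambda w: predict(model, w)[1], reverse=True)
        chosen = ranked[:per_round]
        for w in chosen:
            t, _ = predict(model, w)
            labeled.append((w, apply_transformation(w, t)))
        pool = [w for w in pool if w not in chosen]
    return train(labeled), labeled


seed = [("writing", "write"), ("making", "make"),
        ("walked", "walk"), ("talked", "talk")]
model, enlarged = self_train(seed, ["baking", "jumped", "xyz"],
                             rounds=1, per_round=2)
```

In this toy run the two words whose suffixes were seen in the seed set are added with their predicted normalizations, while the word with an unseen suffix stays in the unlabeled pool. A context-based model, as in the abstract, would be trained on the same enlarged set using a separate feature representation, which the sketch omits.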

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mladenić, D. (2002). Modeling Information in Textual Data Combining Labeled and Unlabeled Data. In: Hand, D.J., Adams, N.M., Bolton, R.J. (eds) Pattern Detection and Discovery. Lecture Notes in Computer Science, vol 2447. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45728-3_13

  • DOI: https://doi.org/10.1007/3-540-45728-3_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44148-9

  • Online ISBN: 978-3-540-45728-2

  • eBook Packages: Springer Book Archive
