Part of Speech (POS) Tag Sets Reduction and Analysis Using Rough Set Techniques

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5908)

Abstract

This work is motivated by an earlier study in which text was transformed into strings of syntactical structures, generated by a POS tagger, and compared using a sequence algorithm to calculate similarity. The performance of those computations was strongly affected by the length of the string, which in turn depends on the type of tags used. Tag sets range from a few general tags (a minimum of nine) to hundreds of detailed ones. An important issue is therefore determining which tags, and which combinations of tags, affect the realization of the meanings, dependencies, or relationships that exist in the text. Reducing the tag set using rough sets, and consequently shortening the strings, improved the efficiency of similarity calculations between documents while maintaining the same level of accuracy. This finding is very encouraging.
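The effect of tag set reduction on string length can be illustrated with a minimal sketch. It is not the authors' method: the coarse mapping `COARSE`, the helper names `reduce_tags` and `similarity`, and the toy tag sequences are all hypothetical, the kept tag classes are chosen by hand rather than derived with rough sets, and Python's standard `difflib.SequenceMatcher` stands in for whatever sequence algorithm the paper uses.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from detailed (Penn-Treebank-style) tags to coarse classes.
# The paper selects its reduced tag set with rough sets; this table is hand-made.
COARSE = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "JJ": "J", "JJR": "J", "JJS": "J",
    "RB": "R", "RBR": "R", "RBS": "R",
}


def reduce_tags(tags, keep=frozenset("NVJR")):
    """Map detailed tags to coarse classes and drop any class not in `keep`."""
    coarse = (COARSE.get(tag) for tag in tags)  # unmapped tags become None
    return [c for c in coarse if c in keep]


def similarity(a, b):
    """Similarity ratio in [0, 1] between two tag sequences."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Toy tag sequences standing in for two POS-tagged documents (assumed data).
doc_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc_b = ["DT", "NNS", "VBD", "DT", "JJ", "NNS"]

full = similarity(doc_a, doc_b)
reduced = similarity(reduce_tags(doc_a), reduce_tags(doc_b))

print(len(doc_a), "->", len(reduce_tags(doc_a)), "tags per document")
print("similarity on full tags:", full)
print("similarity on reduced tags:", reduced)
```

The reduced sequences are shorter, so each comparison touches fewer symbols. A hand-picked reduction like this one can also change the similarity scores, which is why a principled selection of tags matters.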




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Elhadi, M., Al-Tobi, A. (2009). Part of Speech (POS) Tag Sets Reduction and Analysis Using Rough Set Techniques. In: Sakai, H., Chakraborty, M.K., Hassanien, A.E., Ślęzak, D., Zhu, W. (eds) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. RSFDGrC 2009. Lecture Notes in Computer Science, vol 5908. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10646-0_27

  • DOI: https://doi.org/10.1007/978-3-642-10646-0_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-10645-3

  • Online ISBN: 978-3-642-10646-0

  • eBook Packages: Computer Science, Computer Science (R0)
