Part of Speech (POS) Tag Sets Reduction and Analysis Using Rough Set Techniques

Elhadi, Mohamed; Al-Tobi, Amjd

doi:10.1007/978-3-642-10646-0_27


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5908)

Abstract

This work is motivated by an earlier study in which text was transformed into strings of syntactic structures, generated by a POS tagger, and compared using a sequence algorithm to calculate similarity. The performance of those computations was strongly affected by the size of the string, which in turn depends on the type of tags used. Generated tags range from a few general ones (a minimum of nine) to hundreds of detailed ones. Determining which tags, and which combinations of tags, affect the realization of the meanings, dependencies, or relationships that exist in the text is therefore an important issue. Reducing the tag set with rough sets, and consequently shortening the strings, improved the efficiency of similarity calculations between documents while maintaining the same level of accuracy, an encouraging finding.
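
To make the idea of rough-set tag reduction concrete, here is a minimal sketch. The abstract does not say how the decision table was built or which reduct algorithm was used, so everything below is an assumption made for illustration: the four coarse tags, the discretized "high"/"low" frequency values, the class labels, and the greedy set-cover heuristic used to approximate a reduct. The function names `discernibility_sets` and `greedy_reduct` are hypothetical and do not come from the paper or from any library.

```python
from itertools import combinations


def discernibility_sets(rows, decisions):
    """For every pair of objects with different decisions, collect the
    attributes (tags) on which the two objects differ."""
    sets = []
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            diff = frozenset(a for a in rows[i] if rows[i][a] != rows[j][a])
            if diff:  # an empty set would mean an inconsistent table
                sets.append(diff)
    return sets


def greedy_reduct(rows, decisions):
    """Approximate a reduct: repeatedly pick the attribute that appears in
    the most uncovered discernibility sets (greedy set cover)."""
    remaining = discernibility_sets(rows, decisions)
    reduct = []
    while remaining:
        counts = {}
        for s in remaining:
            for a in s:
                counts[a] = counts.get(a, 0) + 1
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(counts), key=counts.get)
        reduct.append(best)
        remaining = [s for s in remaining if best not in s]
    return reduct


# Toy decision table (assumed data): each row is a document described by
# the discretized frequency of four coarse tags; decisions are class labels.
rows = [
    {"NOUN": "high", "VERB": "low",  "ADJ": "low",  "DET": "high"},
    {"NOUN": "high", "VERB": "high", "ADJ": "low",  "DET": "high"},
    {"NOUN": "low",  "VERB": "low",  "ADJ": "high", "DET": "high"},
    {"NOUN": "low",  "VERB": "high", "ADJ": "high", "DET": "high"},
]
decisions = ["A", "B", "A", "B"]

reduct = greedy_reduct(rows, decisions)
print(reduct)  # ['VERB']
```

In this toy table the `VERB` attribute alone separates every pair of documents with different labels, so the other three tags could be dropped without losing the ability to tell the classes apart. A smaller tag set of this kind is what would shorten the tag strings fed to the similarity computation.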



Author information

Authors and Affiliations

Department of Computer Science, Sultan Qaboos University, Oman
Mohamed Elhadi & Amjd Al-Tobi


Editor information

Editors and Affiliations

Department of Mathematics and Computer Aided Sciences, Kyushu Institute of Technology, 804-8550, Tobata, Kitakyushu, Japan
Hiroshi Sakai
Department of Pure Mathematics, University of Calcutta, 35 Ballygunge Circular Road, 700019, Kolkata, India
Mihir Kumar Chakraborty
Information Technology Department, University of Cairo, 5 Ahmed Zewal St. Orman, Giza, Egypt
Aboul Ella Hassanien
University of Warsaw & Infobright Inc., Poland
Dominik Ślęzak
University of Electronic Science and Technology of China, Chengdu, China
William Zhu


About this paper

Cite this paper

Elhadi, M., Al-Tobi, A. (2009). Part of Speech (POS) Tag Sets Reduction and Analysis Using Rough Set Techniques. In: Sakai, H., Chakraborty, M.K., Hassanien, A.E., Ślęzak, D., Zhu, W. (eds) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. RSFDGrC 2009. Lecture Notes in Computer Science, vol 5908. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10646-0_27


DOI: https://doi.org/10.1007/978-3-642-10646-0_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10645-3
Online ISBN: 978-3-642-10646-0
eBook Packages: Computer Science, Computer Science (R0)
