Abstract
This work is motivated by an earlier study in which text was transformed into strings of syntactic structures, produced by a part-of-speech (POS) tagger, and compared with a sequence algorithm to calculate similarity. The performance of these computations depended heavily on the length of the string, which in turn depends on the tag set used. Tag sets range from a few general tags (a minimum of nine) to hundreds of detailed ones. Determining which tags, and which combinations of tags, affect the realization of the meanings, dependencies, or relationships present in the text is therefore an important issue. Reducing the tag set with rough set techniques, and consequently shortening the strings, improved the efficiency of similarity calculations between documents while maintaining the same level of accuracy. This finding is encouraging.
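The link between tag-set size and string length can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `COARSE` mapping and the `keep` set are hand-picked rather than derived by rough set analysis, the tags are Penn Treebank-style labels, and a longest-common-subsequence (LCS) ratio stands in for the sequence algorithm, which the abstract does not specify.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical mapping from detailed tags to general classes (not from the paper).
COARSE: Dict[str, str] = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "JJ": "A", "JJR": "A", "JJS": "A",
    "DT": "D", "IN": "P",
}


def reduce_tags(tags: Sequence[str], keep: Set[str]) -> List[str]:
    """Map detailed tags to coarse classes and drop classes not in `keep`."""
    coarse = [COARSE.get(tag, "X") for tag in tags]  # "X" marks unmapped tags
    return [c for c in coarse if c in keep]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS-based similarity in [0, 1]; two empty strings count as identical."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


# Toy tag strings for two short sentences (invented for this example).
doc_a = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
doc_b = ["DT", "NNS", "VBD", "DT", "NN"]

keep = {"N", "V"}  # assumed reduced tag set, chosen by hand
short_a = reduce_tags(doc_a, keep)
short_b = reduce_tags(doc_b, keep)

print(len(doc_a), len(short_a))       # string length before and after reduction
print(similarity(doc_a, doc_b))       # similarity on detailed tags
print(similarity(short_a, short_b))   # similarity on the reduced tag set
```

Because the dynamic-programming table grows with the product of the two string lengths, halving each string cuts the comparison work to roughly a quarter. The sketch says nothing about accuracy; whether a given reduction preserves it is the question the rough set analysis addresses.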
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Elhadi, M., Al-Tobi, A. (2009). Part of Speech (POS) Tag Sets Reduction and Analysis Using Rough Set Techniques. In: Sakai, H., Chakraborty, M.K., Hassanien, A.E., Ślęzak, D., Zhu, W. (eds) Rough Sets, Fuzzy Sets, Data Mining and Granular Computing. RSFDGrC 2009. Lecture Notes in Computer Science(), vol 5908. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-10646-0_27
DOI: https://doi.org/10.1007/978-3-642-10646-0_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-10645-3
Online ISBN: 978-3-642-10646-0
eBook Packages: Computer Science, Computer Science (R0)