ABSTRACT
This paper explores the effect of noun lemmatization and stopword removal on Greek Web searching. A light lemmatizer is presented and applied in a retrieval experiment, and, in a second experiment, stopwords are removed from user queries. Both experiments report an increase in precision. The main purpose of our work is to adapt and apply some "ancient" information retrieval techniques to non-Latin queries and to measure their effect on the retrieval of relevant documents.
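To make the two query-side techniques concrete, the following is a minimal sketch of stopword removal followed by light noun lemmatization. It is not the lemmatizer or the stopword list used in the paper: the names `STOPWORDS`, `SUFFIX_RULES`, `strip_accents`, `light_lemma` and `preprocess_query` are hypothetical, the stopword list is a small illustrative sample, and the suffix rules cover only a few endings of masculine nouns in -ος, so they will mishandle other declensions.

```python
import unicodedata

# Illustrative stopword sample (assumption, not the paper's list):
# common Greek articles, prepositions and conjunctions, without accents.
STOPWORDS = {"ο", "η", "το", "οι", "τα", "του", "της", "των", "τον", "την",
             "και", "σε", "με", "για", "απο", "να"}

# Toy suffix rules (assumption): map a few inflected endings of masculine
# nouns in -ος back to the nominative singular. Longer suffixes come first.
SUFFIX_RULES = [("ους", "ος"), ("ου", "ος"), ("ων", "ος"), ("οι", "ος")]


def strip_accents(text):
    """Remove accent marks by decomposing and dropping combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def light_lemma(token, min_stem=3):
    """Apply the first matching suffix rule, keeping at least min_stem letters."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[:-len(suffix)] + replacement
    return token


def preprocess_query(query):
    """Lowercase, strip accents, drop stopwords, then lemmatize each term."""
    tokens = strip_accents(query.lower()).split()
    return [light_lemma(t) for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess_query("τα βιβλία του δασκάλου"))
```

Accents are stripped before matching because the stress position in Greek nouns can shift between cases, which would otherwise stop a bare suffix rule from mapping inflected forms to the same lemma. Stopwords are filtered before lemmatization so that an article such as "του" is never rewritten by a suffix rule.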
Lemmatization and stopword elimination in Greek web searching