Abstract
In the past several years, some aspects of Information Retrieval (IR) for Semitic languages, primarily Arabic, have garnered a significant amount of attention. Research has focused mainly on retrieval of formal language, mostly in the news domain, covering ad hoc retrieval, OCR document retrieval, and cross-language retrieval. The literature on other aspects of retrieval remains sparse or non-existent, though some of these aspects have been investigated by industry. The two main areas where literature is lacking are web search and social search. This survey covers two main areas: 1) a significant part of the literature pertaining to language-specific issues that affect retrieval; and 2) specialized retrieval problems, namely document image retrieval, cross-language search, web search, and social search.
Notes
- 4. Buckwalter encoding is used to Romanize Arabic text in this chapter.
- 6. This is based on communication with people working on different web search engines.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Darwish, K. (2014). Information Retrieval. In: Zitouni, I. (eds) Natural Language Processing of Semitic Languages. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45358-8_10
DOI: https://doi.org/10.1007/978-3-642-45358-8_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45357-1
Online ISBN: 978-3-642-45358-8
eBook Packages: Computer Science, Computer Science (R0)