Information Retrieval

Darwish, Kareem

doi:10.1007/978-3-642-45358-8_10

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

In the past several years, some aspects of information retrieval (IR) for Semitic languages, primarily Arabic, have garnered significant attention. Research has focused mainly on retrieval of formal language, mostly in the news domain, covering ad hoc retrieval, OCR document retrieval, and cross-language retrieval. The literature on other aspects of retrieval remains sparse or non-existent, though industry has investigated some of them. The two main areas where literature is lacking are web search and social search. This survey covers two main areas: (1) a significant part of the literature on language-specific issues that affect retrieval; and (2) specialized retrieval problems, namely document image retrieval, cross-language search, web search, and social search.

Notes

1. http://www.clef-initiative.eu
2. http://research.nii.ac.jp/ntcir/index-en.html
3. http://www.isical.ac.in/~fire/
4. Buckwalter encoding is used to Romanize Arabic text in this chapter.
5. http://en.wikipedia.org/wiki/Varieties_of_Arabic
6. This is based on communication with people working on different web search engines.

Author information

Authors and Affiliations

Qatar Computing Research Institute, Doha, Qatar
Kareem Darwish

Corresponding author

Correspondence to Kareem Darwish.

Editor information

Editors and Affiliations

Microsoft, Redmond, Washington, USA
Imed Zitouni

About this chapter

Cite this chapter

Darwish, K. (2014). Information Retrieval. In: Zitouni, I. (eds) Natural Language Processing of Semitic Languages. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-45358-8_10

DOI: https://doi.org/10.1007/978-3-642-45358-8_10
Published: 25 March 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-45357-1
Online ISBN: 978-3-642-45358-8
eBook Packages: Computer Science, Computer Science (R0)
