Abstract
Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods.
Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we briefly touch on the related problem of content-based image retrieval.
References
Adar E and Hylton J (1995) On-the-fly hyperlink creation for page images. In: Second Annual Conference on the Theory and Practice of Digital Libraries. http://csdl.tamu.edu/DL95/papers/adar/adar.html (visited March 20th, 2000).
Ballerini J, Buchel M, Domenig R, Knaus D, Mateev B, Mittendorf E, Schauble P, Sheridan P and Wechsler M (1997) SPIDER retrieval system at TREC-5. In: Voorhees E and Harman D (Eds.), The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500–238, pp. 217–228.
Ballesteros L and Croft W (1997) Phrasal translation and query expansion techniques for cross-language information retrieval. In: Belkin N, Narasimhalu A and Willett P (Eds.), Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, pp. 84–91.
Ballesteros L and Croft W (1998) Resolving ambiguity for cross-language information retrieval. In: Croft W, Moffat A and van Rijsbergen C (Eds.), Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, pp. 64–71.
Belkin N and Croft W (1987) Retrieval Techniques. In: Williams M (Ed.), Annual Review of Information Science and Technology. Elsevier Science, pp. 109–145.
Bloomberg D and Chen F (1996) Extraction of text-related features for condensing image documents. In: Vincent L and Hull J (Eds.), Proceedings of the SPIE–Document Recognition III. The International Society for Optical Engineering (SPIE), Vol. 2660, pp. 72–88. http://www.parc.xerox.com.istl/members/bloomberg/spie96dimsum.pdf (visited March 20th, 2000).
Bruckner T, Suda P, Block H and Maderlechner G (1995) Inhouse mail distribution by automatic address and content interpretation. In: Symposium on Document Analysis and Information Retrieval. Information Science Research Institute, University of Nevada, Las Vegas, pp. 67–75.
Buckley C, Salton G, Allan J and Singhal A (1995) Automatic query expansion using SMART: TREC-3. In: Harman D (Ed.), The Third Text REtrieval Conference (TREC-3). NIST Special Publication 500–225.
Buckley C, Singhal A and Mitra M (1997) Using query zoning and correlation within SMART: TREC5. In: Voorhees E and Harman D (Eds.), The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500–238, pp. 105–118.
Buckley C, Singhal A, Mitra M and Salton G (1996) New retrieval approaches using SMART: TREC-4. In: Harman D (Ed.), The Fourth Text REtrieval Conference (TREC-4). NIST Special Publication 500–236, pp. 25–48.
Casey R, Ferguson D, Mohiuddin K and Walach E (1992) Intelligent form processing system. Machine Vision and Applications, 5:143–155.
Chen F and Bloomberg D (1996) Extraction of thematically relevant text from images. In: Symposium on Document Analysis and Information Retrieval. Information Science Research Institute, University of Nevada, Las Vegas, pp. 163–178.
Chen F and Bloomberg D (1997) Extraction of indicative summary sentences from imaged documents. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 227–232.
Chen F and Bloomberg D (1998) Summarization of imaged documents without OCR. Computer Vision and Image Understanding, 70(3).
Chen F, Bloomberg D and Wilcox L (1995) Spotting phrases in lines of imaged text. In: Vincent L and Baird H (Eds.), Proceedings of the SPIE–Document Recognition II. The International Society for Optical Engineering (SPIE), Vol. 2422, pp. 256–269.
Chen F, Wilcox L and Bloomberg D (1993a) Detecting and locating partially specified keywords in scanned images using hidden Markov models. In: Proceedings of the Second International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 133–138.
Chen F, Wilcox L and Bloomberg D (1993b) Word spotting in scanned images using hidden Markov models. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE Computer Society Press, Vol. 5, pp. 1–4.
Croft W, Harding S, Taghva K and Borsack J (1994) An evaluation of information retrieval accuracy with simulated OCR Output. In: Symposium on Document Analysis and Information Retrieval, pp. 115–126. http://cobar.cs.umass.edu/pubfiles/ocr.ps.gz (visited March 20th, 2000).
Croft W and Harper D (1979) Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285–295.
Cullen J, Hull J and Hart P (1997) Document image database retrieval and browsing using texture analysis. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 718–721.
DeCurtins J (1997) Comparison of OCR vs. word shape recognition for keyword spotting. In: Symposium on Document Image Understanding Technology, pp. 205–213.
DeCurtins J and Chen E (1995) Keyword spotting via word shape recognition. In: Vincent L and Baird H (Eds.), Proceedings of the SPIE–Document Recognition II. The International Society for Optical Engineering (SPIE), Vol. 2422, pp. 270–277.
DeSilva G and Hull J (1994) Proper noun detection in document images. Pattern Recognition, 27(2):311–320.
Doermann D (1997) The retrieval of document images: A brief survey. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 945–949.
Doermann D (1998) The indexing and retrieval of document images: A survey. Computer Vision and Image Understanding, 70(3):287–298.
Doermann D, Li H and Kia O (1997) The detection of duplicates in document image databases. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 314–318.
Doermann D, Rivlin E and Weiss I (1996) Applying algebraic and differential invariants for logo recognition. Machine Vision and Applications, 9(2):73–86.
Efthimiadis E and Biron P (1994) UCLA-Okapi at TREC-2: Query Expansion Experiments. In: The Second Text REtrieval Conference (TREC-2). NIST Special Publication 500–215, pp. 279–290. http://trec.nist.gov/pubs/trec2/papers/txt/28.txt (visited March 20th, 2000).
Evans DA and Lefferts RG (1995) CLARIT-TREC experiments. Information Processing and Management, 31(3):385–395.
Flickner M, Sawhney H, Niblack W, Ashley J, Huang Q, Dom B, Gorkani M, Hafner J, Lee D, Petkovic D, Steele D and Yanker P (1995) Query by Image and Video Content: the QBIC System. IEEE Computer, 28(9):23–32.
Fox E, Betrabet S, Koushik M and Lee W (1992) Extended Boolean models. In: Frakes W and Baeza-Yates R (Eds.), Information Retrieval Data Structures and Algorithms, Prentice Hall, pp. 393–418.
Gudivada V and Raghavan V, Eds. (1995) Special issue on content-based image retrieval systems, IEEE Computer Society Press. IEEE Computer, 28(9).
Harding S, Croft W and Weir C (1997) Probabilistic retrieval of OCR degraded text using N-grams. In: Peters C and Thanos C (Eds.), Research and Advanced Technology for Digital Libraries, pp. 345–359.
Harman D (1996) Overview of the fourth Text REtrieval conference (TREC-4). In: Harman D (Ed.), The Fourth Text REtrieval Conference (TREC-4). NIST Special Publication 500–236, pp. 1–23.
Ho T, Hull J and Srihari S (1992) A word shape analysis approach to lexicon based word recognition. Pattern Recognition Letters, 13:821–826.
Huang J (1998) Color-spatial image indexing and applications. PhD Thesis, Department of Computer Science, Cornell University. http://www.cs.cornell.edu/Info/People/huang/thesis.pdf (visited March 20th, 2000).
Huang J, Kumar S, Mitra M, Zhu W and Zabih R (1997) Image indexing using color correlograms. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society Press, pp. 762–768.
Hull D and Grefenstette G (1996) Querying across languages: a dictionary-based approach to multilingual information retrieval. In: Frei H, Harman D, Schauble P and Wilkinson R (Eds.), Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, pp. 49–57.
Hull D, Grefenstette G, Schulze B, Gaussier E, Schutze H and Pedersen J (1997) Xerox TREC-5 site report: routing, filtering, NLP, and Spanish tracks. In: Voorhees E and Harman D (Eds.), The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500–238, pp. 167–180.
Hull J (1992a) A hidden Markov model for language syntax in text recognition. In: Proceedings of the International Conference on Pattern Recognition. IEEE Computer Society Press.
Hull J (1992b) Incorporation of a Markov model of language syntax in a text recognition algorithm. In: O'Gorman L and Kasturi R (Eds.), Document Image Analysis. IEEE Computer Society Press, pp. 287–297.
Hull J and Cullen J (1997) Document image similarity and equivalence detection. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 308–312.
Joseph S and Pridmore T (1992) Knowledge-directed interpretation of mechanical engineering drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(9):928–940.
Kantor PB and Voorhees EM (1997) Report on the TREC-5 confusion track. In: Voorhees E and Harman D (Eds.), The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500–238.
Kato T, Kurita T and Shimogaki H (1991) Intelligent visual interaction with image databases. Journal of Information Processing of Japan, 12(2):134–143.
Kelledy F and Smeaton A (1997) TREC-5 experiments at Dublin City University: query space reduction, Spanish & character shape encoding. In: Voorhees E and Harman D (Eds.), The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500–238, pp. 197–207.
Khoubyari S and Hull J (1996) Font and function word identification in document recognition. Computer Vision and Image Understanding, 63(1):66–74.
Kia O and Doermann D (1996) Structural compression for document analysis. In: Proceedings of the International Conference on Pattern Recognition. IEEE Computer Society Press, pp. 664–668.
Kuo S and Agazzi O (1994) Keyword spotting in poorly printed documents using pseudo-2D hidden Markov models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(8):842–848.
Lagoze C, Shaw E, Davis J and Krafft D (1995) Dienst: implementation reference manual. Technical Report 95–1514, Dept. of Computer Science, Cornell University. http://estr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR95–1514 (visited March 20th, 2000).
Liu F and Picard R (1996) Periodicity, directionality and randomness: Wold features for image modeling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):722–733.
Lorenz O and Monagan G (1995) A retrieval system for graphical documents. In: Symposium on Document Analysis and Information Retrieval. Information Science Research Institute, University of Nevada, Las Vegas, pp. 291–300.
Ma W (1997) NETRA: A toolbox for navigating large image databases. PhD Thesis, Department of Electrical and Computer Engineering, University of California, Santa Barbara. http://vivaldi.ece.ucsb.edu/users/wei/mypapers/thesis.html (visited March 20th, 2000).
Manjunath B and Ma W (1996) Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(11), Special issue on digital libraries.
Manmatha R (1997) Multimedia indexing and retrieval research at the Center for Intelligent Information Retrieval. In: Symposium on Document Image Understanding Technology, pp. 16–30.
Manmatha R and Croft W (1998) Word spotting: indexing handwritten archives. In: Maybury M (Ed.), Intelligent Multi-media Information Retrieval. AAAI/MIT Press.
Manmatha R, Han C and Riseman E (1996a) Word spotting: a new approach to indexing handwriting. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society Press, pp. 631–637.
Manmatha R, Han C, Riseman E and Croft W (1996b) Indexing handwriting using word matching. In: DL '96: Proceedings of the 1st ACM International Conference on Digital libraries, pp. 151–159.
Mitra M, Buckley C, Singhal A and Cardie C (1997) An analysis of statistical and syntactic phrases. In: Proceedings of the 5th RIAO Conference on Computer-Assisted Research of Information, pp. 200–214. http://www.research.att.com/~singhal/riao97.ps (visited March 20th, 2000).
MUC-6 (1995) In: Proceedings of the Sixth Message Understanding Conference (MUC-6). Defense Advanced Research Projects Agency, Morgan Kaufmann.
Myka A and Guntzer U (1997) Measuring the effects of OCR errors on similarity linking. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 968–973.
Oard D (1996) Adaptive vector space text filtering for monolingual and cross-language applications. PhD Thesis, University of Maryland. http://www.clis.umd.edu/dlrg/filter/papers/thesis.final.ps (visited March 20th, 2000).
Ohta M, Takasu A and Adachi J (1997) Retrieval methods for English text with misrecognized characters. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 950–956.
Park I, Yun I and Lee S (1997) Models and algorithms for efficient color image indexing. In: Proceedings of the IEEE Workshop on Content-Based Access of Image and Video Libraries. IEEE Computer Society Press, pp. 36–49.
Pass G and Zabih R (1999) Comparing images using joint histograms. ACM Journal of Multimedia Systems, 7(3):234–240.
Robertson S and Sparck Jones K (1976) Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129–146.
Robertson S and Walker S (1994) Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In: Croft W and van Rijsbergen C (Eds.), Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag, pp. 232–241.
Robertson S, Walker S, Jones S, Hancock-Beaulieu M and Gatford M (1995) Okapi at TREC-3. In: Harman D (Ed.), The Third Text REtrieval Conference (TREC-3). NIST Special Publication 500–225.
Salton G (1972) Experiments in multi-lingual information retrieval. Technical Report 72–154. Dept. of Computer Science, Cornell University. http://cstr.cs.cornell.edu:80/Dienst/UI/1.0/Display/ncstrl.cornell/TR72–154 (visited March 20th, 2000).
Salton G (1981) A blueprint for automatic indexing. ACM SIGIR Forum, 16(2):22–38.
Salton G (1989) Automatic text processing–the transformation, analysis and retrieval of information by computer. Addison-Wesley Publishing Co., Reading, MA.
Salton G and Buckley C (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523.
Salton G and McGill M (1983) Introduction to Modern Information Retrieval. McGraw Hill Book Co., New York.
Salton G, Wong A and Yang C (1975) A vector space model for information retrieval. Communications of the ACM, 18(11):613–620.
Samet H and Soffer A (1996) MARCO: map retrieval by content. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):783–798.
Schauble P and Sheridan P (1997) Cross-language information retrieval (CLIR) track overview. In: Voorhees E and Harman D (Eds.), The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500–240, pp. 31–43.
Sheridan P and Ballerini J (1996) Experiments in multilingual information retrieval using the SPIDER system. In: Frei H, Harman D, Schauble P and Wilkinson R (Eds.), Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, pp. 58–65.
Singhal A (1998) Question answering track at TREC-8. http://www.research.att.com/~singhal/qa-track.html (visited March 20th, 2000).
Singhal A, Buckley C and Mitra M (1996a) Pivoted document length normalization. In: Frei H, Harman D, Schauble P and Wilkinson R (Eds.), Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, pp. 21–29.
Singhal A, Salton G and Buckley C (1996b) Length normalization in degraded text collections. In: Symposium on Document Analysis and Information Retrieval, pp. 149–162. http://www.research.att.com/~singhal/ocr-norm.ps (visited March 20th, 2000).
Smeaton A and O'Connor J (1998) User-mediated word shape tokens for querying document images. In: Kay J and Milosavljevic M (Eds.), Proceedings of the Third Australian Document Computing Symposium. http://www.compapp.dcu.ie/~asmeaton/pubs/ADCS98–crc.ps.Z (visited March 20th, 2000).
Smeaton A and Spitz A (1997) Using character shape coding for information retrieval. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 974–978.
Soffer A (1997) Image categorization using texture features. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 233–237.
Spitz A (1995) Using character shape codes for word spotting in document images. In: Dori D and Bruckstein A (Eds.), Shape, Structure and Pattern Recognition, World Scientific, Singapore, pp. 382–389.
Srihari R (1995) Automatic indexing and content-based retrieval of captioned images. IEEE Computer, 28(9):49–56.
Stricker M and Dimai A (1996) Color indexing with weak spatial constraints. In: Sethi I and Jain R (Eds.), Proceedings of the SPIE. The International Society for Optical Engineering (SPIE), Vol. 2670, pp. 29–40.
Strzalkowski T, Lin F and Perez-Carballo J (1998) Natural language information retrieval TREC-6 report. In: Voorhees E and Harman D (Eds.), The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500–240, pp. 347–366.
Suda P, Bridoux C, Kammerer B and Maderlechner G (1997) Logo and word matching using a general approach to signal registration. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 61–65.
Swain M and Ballard D (1991) Color indexing. International Journal of Computer Vision, 7(1):11–32.
Taghva K, Borsack J and Condit A (1994) Expert system for automatically correcting OCR output. In: Vincent L and Pavlidis T (Eds.), Proceedings of the SPIE–Document Recognition. The International Society for Optical Engineering (SPIE), Vol. 2181, pp. 270–278.
Taghva K, Borsack J and Condit A (1997) Information retrieval and OCR. In: Bunke H and Wang P (Eds.), Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Co., pp. 755–777.
Takasu A (1997) An approximate string match for garbled text with various accuracy. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 957–961.
Tanaka Y and Torii H (1988) Transmedia machine and its keyword search over image texts. In: Proceedings of the 2nd RIAO Conference on Computer-assisted Research of Information, pp. 248–258.
Taylor S, Fritzson R and Pastor J (1992) Extraction of data from preprinted forms. Machine Vision and Applications, 5:211–222.
Tong X, Zhai C, Milic-Frayling N and Evans D (1997) OCR correction and query expansion for retrieval on OCR data–CLARIT TREC-5 confusion track report. In: Voorhees E and Harman D (Eds.), The Fifth Text REtrieval Conference (TREC-5). NIST Special Publication 500–238, pp. 341–345.
Tsuda K, Senda S, Minoh M and Ikeda K (1995) Clustering OCR-ed texts for browsing document image database. In: Proceedings of the Third International Conference on Document Analysis and Recognition. IEEE Computer Society Press, pp. 171–174.
Vaxiviere P and Tombre K (1992) Celesstin: CAD conversion of mechanical drawings. IEEE Computer, 25(7):46–54.
Voorhees E (1994) Query expansion using lexical-semantic relations. In: Croft W and van Rijsbergen C (Eds.), Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Springer-Verlag, pp. 61–69.
Voorhees E and Harman D (1998) Overview of the Sixth Text REtrieval conference (TREC-6). In: Voorhees E and Harman D (Eds.), The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500–240, pp. 1–24.
Wartik S (1992) Boolean operations. In: Frakes W and Baeza-Yates R (Eds.), Information Retrieval Data Structures and Algorithms. Prentice Hall, pp. 264–292.
Xu J and Croft W (1996) Query expansion using local and global document analysis. In: Frei H, Harman D, Schauble P and Wilkinson R (Eds.), Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, pp. 4–11.
Cite this article
Mitra, M., Chaudhuri, B. Information Retrieval from Documents: A Survey. Information Retrieval 2, 141–163 (2000). https://doi.org/10.1023/A:1009950525500