Abstract
A known-item search is a particular information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus whose true content was known were created by applying OCR techniques to page images. The first version of the corpus used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
Cite this article
Kantor, P.B., Voorhees, E.M. The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Information Retrieval 2, 165–176 (2000). https://doi.org/10.1023/A:1009902609570
DOI: https://doi.org/10.1023/A:1009902609570