The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text

Kantor, Paul B.; Voorhees, Ellen M.

doi:10.1023/A:1009902609570

Published: May 2000

Volume 2, pages 165–176 (2000)

Information Retrieval

Abstract

A known-item search is an information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus, whose true content was known, were created by applying OCR techniques to page images. The first version used the page images as scanned, resulting in an estimated character error rate of approximately 5%. The second version used page images that had been down-sampled, resulting in an estimated character error rate of approximately 20%. The true text and each of the corrupted versions were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
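
Because each known-item task has exactly one correct document, a run can be scored by the rank at which that document appears. The sketch below is one common way to do this, using reciprocal rank averaged over topics; the abstract does not say which measures the track actually reported, so the measure, the function names, and the toy document identifiers are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence


def known_item_rank(ranking: Sequence[str], target: str) -> Optional[int]:
    """Return the 1-based rank of the target document, or None if it is absent."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id == target:
            return position
    return None


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average 1/rank over all topics; a target that was not retrieved scores 0."""
    if not ranks:
        raise ValueError("need at least one topic")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


def score_run(run: Dict[str, List[str]], targets: Dict[str, str]) -> float:
    """Score one run (topic id -> ranked document ids) against the known items."""
    return mean_reciprocal_rank(
        [known_item_rank(run.get(topic, []), target) for topic, target in targets.items()]
    )


# Hypothetical data: two topics, searched against clean and degraded text.
targets = {"q1": "d7", "q2": "d4"}
clean_run = {"q1": ["d7", "d2", "d9"], "q2": ["d4", "d1", "d3"]}
degraded_run = {"q1": ["d2", "d7", "d9"], "q2": ["d1", "d3", "d8"]}

clean_score = score_run(clean_run, targets)
degraded_score = score_run(degraded_run, targets)
```

Scoring the same topics against the clean text and against each corrupted version gives directly comparable numbers, which is the kind of comparison the abstract describes.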

Author information

Authors and Affiliations

Department of Library and Information Science, Rutgers University, 4 Huntington St, New Brunswick, NJ, 08901, USA
Paul B. Kantor
National Institute of Standards and Technology (NIST), Gaithersburg, MD, 20899, USA
Ellen M. Voorhees

About this article

Cite this article

Kantor, P.B., Voorhees, E.M. The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text. Information Retrieval 2, 165–176 (2000). https://doi.org/10.1023/A:1009902609570

Issue Date: May 2000
DOI: https://doi.org/10.1023/A:1009902609570
