ABSTRACT
Finding documents similar to a given query document within a large document database is one of the important issues in the Big Data era, as most available data is in the form of unstructured text. Our test collection consists of two parts. In the first part, texts were produced by humans through an artificial plagiarism approach following a linear pipelined procedure. In the second part, texts are generated by software that inserts, deletes, and substitutes certain parts of an input document to produce a similar document. This document set is known as the Serially Evolved Documents (SED). We propose two new measures, Order Preserving Precision (OPP) and Order Preserving Recall (OPR), to compute how well the evolutionary order is kept among the output documents returned by the subject IR system. Using these test texts, we evaluated KONAN, a document retrieval system for Korean documents.
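The software-generated part of the collection can be pictured with a minimal sketch. It assumes word-level edits, a fixed number of edits per generation, and a tiny filler vocabulary; none of these details come from the abstract. The helper names `evolve`, `serially_evolve`, and `pairwise_order_agreement` are hypothetical. The last function is only a simple stand-in for the idea of checking whether a ranked result list keeps the evolutionary order; it is not the definition of OPP or OPR, which the abstract does not give.

```python
import random


def evolve(words, rng, n_edits=3, vocab=("alpha", "beta", "gamma")):
    """Apply n_edits random insert/delete/substitute steps to a word list.

    Assumption: edits act on single words; the real generator may edit
    larger units such as sentences or paragraphs.
    """
    out = list(words)
    for _ in range(n_edits):
        op = rng.choice(("insert", "delete", "substitute"))
        if op == "insert" or not out:
            # Also used as a fallback when the document has become empty.
            out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
        elif op == "delete":
            del out[rng.randrange(len(out))]
        else:
            out[rng.randrange(len(out))] = rng.choice(vocab)
    return out


def serially_evolve(seed_text, generations, seed=0):
    """Build a chain in which each document is derived from the previous one."""
    rng = random.Random(seed)
    chain = [seed_text.split()]
    for _ in range(generations):
        chain.append(evolve(chain[-1], rng))
    return [" ".join(doc) for doc in chain]


def pairwise_order_agreement(ranked_generations):
    """Fraction of result pairs listed in ascending generation order.

    Hypothetical illustration only, not the paper's OPP or OPR.
    """
    pairs = [
        (a, b)
        for i, a in enumerate(ranked_generations)
        for b in ranked_generations[i + 1:]
    ]
    if not pairs:
        return 1.0
    return sum(a < b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    chain = serially_evolve("the quick brown fox jumps over the lazy dog", 4)
    for generation, doc in enumerate(chain):
        print(generation, doc)
    # A retrieval system queried with generation 0 might return the
    # generations in some order; here generations 1 and 2 are swapped.
    print(pairwise_order_agreement([0, 2, 1, 3]))
```

In this sketch, querying with the seed document and comparing the generation numbers of the returned documents against their rank gives a rough sense of how well a system preserves the evolutionary order.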
Index Terms
- Evaluation of Full-Text Retrieval System Using Collection of Serially Evolved Documents