ABSTRACT
Text classifiers are frequently used for high-yield retrieval from large corpora, such as in e-discovery. The classifier is trained by annotating example documents for relevance. These examples may, however, be assessed by people other than those whose conception of relevance is authoritative. In this paper, we examine the impact that disagreement between the actual and the authoritative assessor has on classifier effectiveness, when the classifier is evaluated against the authoritative conception. We find that training on an alternative assessor's annotations leads to a significant decrease in binary classification quality, though less of a decrease in ranking quality. On average, a consumer of the ranking produced by alternative-assessor training would have to go 25% deeper in that ranking to achieve the same yield as with authoritative-assessor training.
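The experimental comparison described above can be sketched in a few lines: train one classifier on authoritative labels and one on an alternative assessor's (noisier) labels, rank documents by classifier score, and measure how deep a reviewer must read each ranking to reach a target yield under the authoritative labels. The sketch below is illustrative only, not the paper's code: it assumes a scikit-learn-style classifier, uses synthetic features and a synthetic 20% label-flip model of assessor disagreement, and the helper `depth_for_yield` is a hypothetical name.

```python
# Illustrative sketch (assumptions noted above, not the paper's method):
# compare depth-for-yield of rankings trained on authoritative vs.
# alternative-assessor labels, both evaluated against authoritative labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def depth_for_yield(ranking, authoritative, target_yield):
    """Smallest depth k at which the top-k of `ranking` contains
    `target_yield` of the authoritatively relevant documents."""
    rel = authoritative[ranking]                 # relevance in ranked order
    cum = np.cumsum(rel)
    needed = target_yield * rel.sum()
    return int(np.searchsorted(cum, needed) + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                  # synthetic document features
auth = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
alt = auth.copy()                                # alternative assessor:
flip = rng.random(1000) < 0.2                    # flip 20% of the labels
alt[flip] = 1 - alt[flip]

train, test = np.arange(500), np.arange(500, 1000)
for name, labels in [("authoritative", auth), ("alternative", alt)]:
    clf = LogisticRegression().fit(X[train], labels[train])
    scores = clf.predict_proba(X[test])[:, 1]
    ranking = np.argsort(-scores)                # rank test docs by score
    k = depth_for_yield(ranking, auth[test], target_yield=0.8)
    print(f"{name}-trained ranking: depth {k} for 80% yield")
```

Under this toy noise model, the alternative-trained ranking typically requires a deeper read to reach the same 80% yield, which is the effect the paper quantifies at roughly 25% on real assessment data.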