Abstract
This paper focuses on combining association measures using their corresponding receiver operating characteristic (ROC) curves. The approach is motivated by the problem of automatic bigram collocation extraction in computational linguistics. It is based on supervised machine learning techniques and on the fact that different association measures discover different collocation types. Clusters of equivalent ROC curves are first determined by a testing procedure. The paper’s major contribution is an investigation of whether representatives of the clusters of equivalent association measures can be combined into more complex models, thus improving the performance of collocation extraction.
Additional information
The authors acknowledge support from the grant GAČR 201/09/0755 and research network P7/13 of the Belgian Science Policy. The authors thank the Institute of Formal and Applied Linguistics for permission to use their collocation data, and two anonymous referees for stimulating comments and suggestions that considerably improved the content of the paper.
Cite this article
Antoch, J., Prchal, L. & Sarda, P. Combining Association Measures for Collocation Extraction Using Clustering of Receiver Operating Characteristic Curves. J Classif 30, 100–123 (2013). https://doi.org/10.1007/s00357-013-9123-x
DOI: https://doi.org/10.1007/s00357-013-9123-x