Combining Association Measures for Collocation Extraction Using Clustering of Receiver Operating Characteristic Curves

Journal of Classification

Abstract

This paper focuses on combining association measures by means of their receiver operating characteristic (ROC) curves. The approach is motivated by the problem of automatic bigram collocation extraction in computational linguistics. It relies on supervised machine learning techniques and on the fact that different association measures discover different types of collocations. Clusters of equivalent ROC curves are first determined by a testing procedure. The paper’s major contribution is an investigation of whether representatives of the clusters of equivalent association measures can be combined into more complex models, thereby improving the performance of collocation extraction.
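To make the starting point of the abstract concrete, the following minimal sketch scores candidate bigrams with two common association measures (pointwise mutual information and the t-score) and builds an empirical ROC curve for each against collocation labels. It is an illustration only, not the authors' procedure: the counts, labels, corpus size `N`, and all function names are invented for this example, and the equivalence test, the clustering of ROC curves, and the combination of cluster representatives described in the paper are not reproduced here.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information of a bigram, from raw corpus counts."""
    return math.log2((f_xy * n) / (f_x * f_y))

def t_score(f_xy, f_x, f_y, n):
    """t-score: observed minus expected joint count, scaled by sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

def roc_points(scores, labels):
    """Empirical ROC curve as (FPR, TPR) points, lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    i = 0
    while i < len(ranked):
        j = i
        # Candidates with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical toy data: (f_xy, f_x, f_y, is_collocation); N is an assumed corpus size.
N = 100_000
candidates = [
    (30, 40, 50, 1),
    (20, 1000, 800, 0),
    (8, 10, 12, 1),
    (50, 2000, 1500, 0),
    (15, 300, 20, 1),
    (5, 500, 600, 0),
]
labels = [c[3] for c in candidates]

# One ROC curve (and its AUC) per association measure.
results = {}
for name, measure in (("PMI", pmi), ("t-score", t_score)):
    scores = [measure(f_xy, f_x, f_y, N) for f_xy, f_x, f_y, _ in candidates]
    curve = roc_points(scores, labels)
    results[name] = (curve, auc(curve))
    print(name, "AUC:", round(results[name][1], 3))
```

Each measure yields its own curve, so a collection of measures yields a collection of curves; comparing those curves is the point at which the testing and clustering steps mentioned in the abstract would apply.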

Author information

Corresponding author

Correspondence to Jaromír Antoch.

Additional information

The authors acknowledge support from the grant GAČR 201/09/0755 and from research network P7/13 of the Belgian Science Policy. The authors thank the Institute of Formal and Applied Linguistics for permission to use their collocation data, and two anonymous referees for stimulating comments and suggestions that allowed them to improve the content of the paper considerably.

About this article

Cite this article

Antoch, J., Prchal, L. & Sarda, P. Combining Association Measures for Collocation Extraction Using Clustering of Receiver Operating Characteristic Curves. J Classif 30, 100–123 (2013). https://doi.org/10.1007/s00357-013-9123-x

  • DOI: https://doi.org/10.1007/s00357-013-9123-x
