Combining Association Measures for Collocation Extraction Using Clustering of Receiver Operating Characteristic Curves

Antoch, Jaromír; Prchal, Luboš; Sarda, Pascal

doi:10.1007/s00357-013-9123-x

Published: 19 January 2013

Volume 30, pages 100–123 (2013)

Journal of Classification

Jaromír Antoch¹,
Luboš Prchal¹ &
Pascal Sarda²

Abstract

This paper focuses on combining association measures using corresponding receiver operating characteristic curves. The approach is motivated by a problem of automatic bigram collocation extraction from the field of computational linguistics. It is based on supervised machine learning techniques and the fact that different association measures discover different collocation types. Clusters of equivalent ROC curves are first determined by a testing procedure. The paper’s major contribution is an investigation of the possibility of combining representatives of the clusters of equivalent association measures into more complex models, thus improving performance of the collocation extraction.
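As a rough illustration of the ROC-based evaluation the abstract refers to, the sketch below computes an empirical ROC curve and its area for two association measures scored on the same candidate bigrams. The toy labels, the scores, and the function names `roc_points` and `auc` are assumptions made for this example only; they are not taken from the paper, and the paper's equivalence test, clustering, and combined models are not reproduced here.

```python
def roc_points(scores, labels):
    """Empirical ROC curve as (FPR, TPR) pairs, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Rank candidates by decreasing score; tied scores are processed as one step.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: 1 = annotated as a collocation, 0 = not.
labels = [1, 1, 0, 1, 0, 0]
# Two made-up association-measure score columns for the same six bigrams.
measure_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
measure_b = [0.2, 0.9, 0.8, 0.4, 0.5, 0.1]

auc_a = auc(roc_points(measure_a, labels))
auc_b = auc(roc_points(measure_b, labels))
print(f"AUC measure_a = {auc_a:.3f}, AUC measure_b = {auc_b:.3f}")
```

Comparing whole curves rather than single areas is what the clustering step in the abstract relies on; the area is used here only because it is the simplest summary to compute.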

Author information

Authors and Affiliations

Charles University of Prague, Faculty of Mathematics and Physics, Department of Statistics, Sokolovská 83, CZ – 186 75, Praha 8, Czech Republic
Jaromír Antoch & Luboš Prchal
Université Paul Sabatier, Institut de Mathématiques de Toulouse, UMR 5219, 118 route de Narbonne, F – 310 62, Toulouse cedex, France
Pascal Sarda

Corresponding author

Correspondence to Jaromír Antoch.

Additional information

The authors acknowledge support from the grant GAČR 201/09/0755 and research network P7/13 of the Belgian Science Policy. The authors thank the Institute of Formal and Applied Linguistics for permission to use their collocation data, and two unknown referees for stimulating comments and suggestions that allowed them to improve considerably the contents of the paper.

About this article

Cite this article

Antoch, J., Prchal, L. & Sarda, P. Combining Association Measures for Collocation Extraction Using Clustering of Receiver Operating Characteristic Curves. J Classif 30, 100–123 (2013). https://doi.org/10.1007/s00357-013-9123-x

Published: 19 January 2013
Issue Date: April 2013
DOI: https://doi.org/10.1007/s00357-013-9123-x
