Abstract
Some text collections are more difficult to search or more complex to organize into topics than others. What properties of the data characterize this complexity? We use a variation of the Cox-Lewis statistic to measure the natural tendency of a set of points to fall into clusters. We compute this quantity for document collections that are represented as sets of term vectors. We consider applications of the Cox-Lewis statistic in three scenarios: comparing the clusterability of different text collections under the same representation, comparing different representations of the same text collection, and predicting query performance from the clusterability of the query's result set. Our experimental results show a correlation between the observed effectiveness and this statistic, thereby demonstrating the utility of such data analysis in text retrieval.
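The abstract does not spell out the statistic, so the following is only a minimal sketch of a nearest-neighbour distance ratio in the spirit of the Cox-Lewis test, not the variation used in the paper. For each randomly placed sampling origin it finds the closest data point, then that point's own nearest neighbour, and averages the ratio of the two distances. The function name `distance_ratio_clusterability`, the Euclidean metric, the bounding-box sampling window, and the use of a plain mean are all assumptions made for illustration.

```python
import numpy as np

def distance_ratio_clusterability(points, n_samples=200, seed=0):
    """Hypothetical helper: mean ratio d1/d2 over random sampling origins.

    d1 = distance from a random origin to its nearest data point
    d2 = distance from that data point to its own nearest neighbour
    Larger values suggest a stronger tendency to form clusters.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)

    # Assumption: sampling origins are drawn uniformly from the
    # axis-aligned bounding box of the data.
    lo, hi = points.min(axis=0), points.max(axis=0)
    origins = rng.uniform(lo, hi, size=(n_samples, points.shape[1]))

    ratios = []
    for origin in origins:
        d_to_points = np.linalg.norm(points - origin, axis=1)
        nearest = int(np.argmin(d_to_points))
        d1 = d_to_points[nearest]

        # Nearest neighbour of the selected data point, excluding itself.
        d_between = np.linalg.norm(points - points[nearest], axis=1)
        d_between[nearest] = np.inf
        d2 = d_between.min()

        if d2 > 0:  # skip exact duplicates to avoid division by zero
            ratios.append(d1 / d2)
    return float(np.mean(ratios))
```

Under these assumptions, tightly clustered data should score higher than uniformly scattered data, because random origins tend to fall in the empty space between clusters while the data points themselves sit close together. For term vectors, a cosine-based distance and a different sampling scheme would likely be more appropriate than the choices made here.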
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Vinay, V., Cox, I.J., Milic-Frayling, N., Wood, K. (2006). Measuring the Complexity of a Collection of Documents. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_11
DOI: https://doi.org/10.1007/11735106_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)