Skip to main content

Statistically Valid Links and Anti-links BetweenWords and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection

  • Chapter
Advances in Knowledge Discovery and Management

Part of the book series: Studies in Computational Intelligence ((SCI,volume 292))

  • 844 Accesses

Abstract

Neighborhood is a central concept in data mining, and a bunch of definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects × attributes binary table in order to establish which inter-attribute relations are fortuitous, and which ones are meaningful, without requiring any pre-defined statistical model, while taking into account the empirical distributions. It ensues a robust and statistically validated graph. We present a full-scale experiment on one of the public access Reuters test corpus. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative “counter-relations” between words, i.e. words which “steer clear” one from another. We characterize in the same way the counter-relation graph. At last we generate the couple of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 169.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  • Bavaud, F.: Modèles et données: une introduction à la Statistique uni-, bi- et trivariée. L’Harmattan (1998)

    Google Scholar 

  • Benzécri, J.: Construction d’une classification ascendante hiérarchique par la recherche en chaîne des voisins réciproques. Les Cahiers de l’Analyse des Données, pp. 208–218 (1982)

    Google Scholar 

  • Cadot, M.: A Simulation Technique for extracting Robust Association Rules. In: CSDA 2005 (2005)

    Google Scholar 

  • Cadot, M.: Extraire et valider les relations complexes en sciences humaines: statistiques, motifs et règles d’association. Ph.D. thesis, Université de Franche-Comté (2006)

    Google Scholar 

  • Cadot, M., Napoli, A.: Une optimisation de l’extraction d’un jeu de règles s’appuyant sur les caractéristiques statistiques des données. In: RSTI, série RIA-ECA, pp. 631–656 (2003)

    Google Scholar 

  • Cobb, G., Chen, Y.: An application of Markov chain Monte Carlo to community ecology. The American Mathematical Monthly, pp. 264–288 (2003)

    Google Scholar 

  • Connor, E., Simberloff, D.: The assembly of species communities: Chance or competition? Ecology, 1132–1140 (1979)

    Google Scholar 

  • Delaunay, B.: Sur la sphère vide. Izvestia Akademii Nauk SSSR, pp. 793–800 (1934)

    Google Scholar 

  • Droesbeke, J., Finne, J.: Inférence non-paramétrique - Les statistiques de rangs. Editions de l’Université de Bruxelles (1996)

    Google Scholar 

  • Fisher, R.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 179–188 (1936)

    Google Scholar 

  • Gabriel, K., Sokal, R.: A new statistical approach to geographic variation analysis. Systematic Zoology, 259–270 (1969)

    Google Scholar 

  • Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data (2007)

    Google Scholar 

  • Goodman, J., O’Rourke, J.: Handbook of Discrete and Computational Geometry. CRC Press, Boca Raton (2004)

    Book  MATH  Google Scholar 

  • Jensen, D., Cohen, P.: Multiple Comparisons in Induction Algorithms. Machine Learning, 309–338 (2000)

    Google Scholar 

  • Lelu, A.: Analyse en composantes locales et graphes de similarité entre textes. In: Purnelle, G. (ed.) JADT 2004 (2004)

    Google Scholar 

  • Lelu, A., Cuxac, P., Cadot, M.: Document stream clustering: an optimal and fine-grained incremental approach. In: COLLNET 2006 / International Workshop on Webometrics, Informetrics and Scientometrics (2006)

    Google Scholar 

  • Lerman, I.-C., Peter, P.: Indice probabiliste de vraisemblance du lien entre objets quelconques. Analyse comparative entre deux approches. Revue de Statistique Appliquée, pp. 5–35 (2003)

    Google Scholar 

  • Lewis, D., Yang, Y., Rose, T., Li, F.: RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 361–397 (2004)

    Google Scholar 

  • Manly, B.: Randomization, Bootstrap and Monte Carlo methods in Biology. Chapman and Hall/CRC (1997)

    Google Scholar 

  • Morineau, A., Nakache, J.-P., Krzyzanowski, C.: Le modèle log-linéaire et ses applications. Cisia-Ceresta (1996)

    Google Scholar 

  • Newman, M.: Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 323–351 (2005)

    Google Scholar 

  • Pons, P., Latapy, M.: Computing communities in large networks using random walks. Journal of Graph Algorithms and Applications, 191–218 (2006)

    Google Scholar 

  • Press, J.: The role of Bayesian and frequentist multivariate modeling in statistical Data Mining. Statistical Data Mining and Knowledge Discovery, 1–14 (2004)

    Google Scholar 

  • Ryser, H.: Recent Advances in Matrix Theory, Madison (1964)

    Google Scholar 

  • Scuturici, M., Clech, J., Scuturici, V.-M., Zighed, D.: Topological Representation Model for Image Database Query. Journal of Experimental and Theoretical Artificial Intelligence, 145–160 (2005)

    Google Scholar 

  • Snijders, T.: Enumeration and simulation methods for 0-1 matrices with given marginals. Psychometrika, 397–417 (2004)

    Google Scholar 

  • Toussaint, G.: The relative neighbourhood graph of a finite planar set. Pattern Recognition, 261–268 (1980)

    Google Scholar 

  • Watts, D., Strogatz, S.: Collective dynamics of ’small-world’ networks. Nature, 95–118 (1998)

    Google Scholar 

  • Yates, F.: Contingency table involving small numbers and the Chi2 test. Journal of the Royal statistical society (supplement), 217–235 (1934)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Lelu, A., Cadot, M. (2010). Statistically Valid Links and Anti-links BetweenWords and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-00580-0_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00579-4

  • Online ISBN: 978-3-642-00580-0

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics