Statistically Valid Links and Anti-links Between Words and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection

Lelu, Alain; Cadot, Martine

doi:10.1007/978-3-642-00580-0_18

Part of the book series: Studies in Computational Intelligence (SCI, volume 292)

Abstract

Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects × attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without requiring any predefined statistical model, while taking the empirical distributions into account. The result is a robust and statistically validated graph. We present a full-scale experiment on one of the publicly available Reuters test corpora. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative “counter-relations” between words, i.e. words which “steer clear” of one another. We characterize the counter-relation graph in the same way. Finally, we generate the pair of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories.
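
The abstract does not spell out the TourneBool procedure, so the following is only a minimal sketch of a generic randomization test on a binary table, not the authors' implementation. It assumes that "taking the empirical distributions into account" means keeping every row sum and column sum fixed while shuffling, which is done here with pairwise swaps. The function names, the toy data, the number of simulated tables and the threshold `alpha` are all hypothetical.

```python
import random

def swap_randomize(rows, n_attempts, rng):
    """Shuffle a binary table (list of attribute sets) by pairwise swaps.

    Assumption: row sums and column sums are the margins to preserve.
    Each successful swap exchanges one attribute between two rows.
    """
    table = [set(r) for r in rows]
    for _ in range(n_attempts):
        i, j = rng.sample(range(len(table)), 2)
        only_i = sorted(table[i] - table[j])  # sorted for reproducibility
        only_j = sorted(table[j] - table[i])
        if not only_i or not only_j:
            continue  # no exchange possible between these two rows
        a, b = rng.choice(only_i), rng.choice(only_j)
        table[i].remove(a)
        table[i].add(b)
        table[j].remove(b)
        table[j].add(a)
    return table

def cooccurrence(table, a, b):
    """Number of rows (objects) that contain both attributes."""
    return sum(1 for row in table if a in row and b in row)

def classify_pair(rows, a, b, n_tables=200, alpha=0.05, seed=0):
    """Label a pair of attributes as 'link', 'anti-link' or 'not significant'.

    The observed co-occurrence count is compared with the counts obtained
    on n_tables randomized versions of the table (hypothetical settings).
    """
    rng = random.Random(seed)
    observed = cooccurrence(rows, a, b)
    n_attempts = 20 * sum(len(r) for r in rows)  # arbitrary mixing length
    at_least = at_most = 0
    for _ in range(n_tables):
        simulated = cooccurrence(swap_randomize(rows, n_attempts, rng), a, b)
        at_least += simulated >= observed
        at_most += simulated <= observed
    p_link = (at_least + 1) / (n_tables + 1)  # one-sided empirical p-values
    p_anti = (at_most + 1) / (n_tables + 1)
    if p_link <= alpha:
        return "link"
    if p_anti <= alpha:
        return "anti-link"
    return "not significant"

# Toy table: 10 "documents" use words a and b, 10 others use c and d.
docs = [{"a", "b"} for _ in range(10)] + [{"c", "d"} for _ in range(10)]
print(classify_pair(docs, "a", "b"))
print(classify_pair(docs, "a", "c"))
```

In this sketch a pair is a link when its observed co-occurrence count is rarely reached in the randomized tables, and an anti-link when it is rarely that low. Applying the same test to the transposed table would give the document-level graphs; correcting for the large number of pairs tested is left out here.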

Author information

Authors and Affiliations

Kiwi team, Université de Franche-Comté / LORIA, bâtiment C, Campus scientifique, BP 239, F-54506, Vandoeuvre-lès-Nancy Cedex, France
Alain Lelu
ABC team, Université Henri Poincaré - Nancy 1 / LORIA, bâtiment A, Campus scientifique, BP 239, F-54506, Vandoeuvre-lès-Nancy Cedex, France
Martine Cadot

Editor information

Editors and Affiliations

Polytechnic School of Nantes University, Nantes, France
Fabrice Guillet & Henri Briand
Université de Genève, Genève, Switzerland
Gilbert Ritschard
Université Lumière Lyon 2, Bron, France
Djamel Abdelkader Zighed

About this chapter

Cite this chapter

Lelu, A., Cadot, M. (2010). Statistically Valid Links and Anti-links Between Words and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_18

DOI: https://doi.org/10.1007/978-3-642-00580-0_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00579-4
Online ISBN: 978-3-642-00580-0
eBook Packages: Engineering, Engineering (R0)