Abstract
The TermWatch system automatically extracts multi-word terms from scientific texts based on morphological analysis and relates them through linguistic variations. The resulting terminological network is clustered using a 3-level hierarchical graph algorithm and mapped onto a 2D space. Clusters are automatically labeled based on variation activity. After a detailed review of the methodology, this paper evaluates, in the context of querying a scientific textual database, the overlap of terms and cluster labels with the keywords selected by human indexers, as well as the set of possible queries based on the clustering output. The results show that the linguistic variation paradigm is a robust way of automatically extracting and structuring a user-comprehensible terminological resource for query refinement.
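The abstract does not say how the overlap with indexer keywords is measured. As a minimal sketch, assuming exact matching after simple normalization (an assumption, not necessarily the paper's procedure), the comparison could be computed as set overlap with precision and recall. The function names and the sample data below are hypothetical and invented for illustration.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(term.lower().split())


def overlap(candidates, keywords):
    """Return (shared terms, precision, recall) of candidates against indexer keywords.

    Assumption: a candidate matches a keyword only if the normalized strings are identical.
    """
    cand = {normalize(t) for t in candidates}
    gold = {normalize(k) for k in keywords}
    shared = cand & gold
    precision = len(shared) / len(cand) if cand else 0.0
    recall = len(shared) / len(gold) if gold else 0.0
    return shared, precision, recall


# Hypothetical data, not taken from the paper.
cluster_labels = ["Gene Expression", "text  mining", "query refinement"]
indexer_keywords = ["gene expression", "query refinement",
                    "information retrieval", "clustering"]

shared, precision, recall = overlap(cluster_labels, indexer_keywords)
print(sorted(shared), round(precision, 2), round(recall, 2))
```

Exact matching is the strictest choice. A system built on linguistic variation would plausibly also count variant forms of a keyword as matches, which this sketch does not attempt.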
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
SanJuan, E. (2005). Query Refinement Through Lexical Clustering of Scientific Textual Databases. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_23
DOI: https://doi.org/10.1007/11428817_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science (R0)