Query Refinement Through Lexical Clustering of Scientific Textual Databases

SanJuan, Eric

doi:10.1007/11428817_23

Eric SanJuan

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

The TermWatch system automatically extracts multi-word terms from scientific texts based on morphological analysis and relates them through linguistic variations. The resulting terminological network is clustered with a 3-level hierarchical graph algorithm and mapped onto a 2D space. Clusters are automatically labeled based on variation activity. After a precise review of the methodology, this paper evaluates, in the context of querying a scientific textual database, the overlap of terms and cluster labels with the keywords selected by human indexers, as well as the set of possible queries based on the clustering output. The results show that the linguistic variation paradigm is a robust way of automatically extracting and structuring a user-comprehensible terminological resource for query refinement.
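The overlap evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how overlap is measured, so the exact-match comparison after simple normalization, the function names `normalize` and `overlap`, and the sample data below are all assumptions made for illustration, not the paper's procedure.

```python
def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so surface differences do not hide a match."""
    return " ".join(term.lower().split())


def overlap(extracted, keywords):
    """Return (shared terms, recall against keywords, precision of extracted terms).

    Assumption: two terms match only if they are identical after normalization.
    """
    ext = {normalize(t) for t in extracted}
    ref = {normalize(k) for k in keywords}
    shared = ext & ref
    recall = len(shared) / len(ref) if ref else 0.0
    precision = len(shared) / len(ext) if ext else 0.0
    return shared, recall, precision


# Hypothetical data, not taken from the paper.
cluster_labels = ["Information Retrieval", "query  expansion", "term variation"]
indexer_keywords = ["information retrieval", "query expansion", "clustering", "terminology"]

shared, recall, precision = overlap(cluster_labels, indexer_keywords)
print(sorted(shared), recall, precision)
```

Exact matching is the strictest possible choice; a looser comparison (for example, matching on shared head words) would report higher overlap and may be closer to what a variation-based system would consider equivalent.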


Author information

Authors and Affiliations

LITA Université Paul Verlaine & URI-INIST/CNRS, Metz, France
Eric SanJuan

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
Andrés Montoyo
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Lab. CEDRIC, CNAM, Paris, France
Elisabeth Métais

About this paper

Cite this paper

SanJuan, E. (2005). Query Refinement Through Lexical Clustering of Scientific Textual Databases. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_23

DOI: https://doi.org/10.1007/11428817_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26031-8
Online ISBN: 978-3-540-32110-1
eBook Packages: Computer Science, Computer Science (R0)
