Query Refinement Through Lexical Clustering of Scientific Textual Databases

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3513)

Abstract

The TermWatch system automatically extracts multi-word terms from scientific texts using morphological analysis and relates them through linguistic variations. The resulting terminological network is clustered with a 3-level hierarchical graph algorithm and mapped onto a 2D space. Clusters are automatically labeled according to variation activity. After a detailed review of the methodology, this paper evaluates, in the context of querying a scientific textual database, the overlap of terms and cluster labels with the keywords selected by human indexers, as well as the set of possible queries based on the clustering output. The results show that the linguistic variation paradigm is a robust way of automatically extracting and structuring a user-comprehensible terminological resource for query refinement.
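
The pipeline outlined in the abstract (link terms by variation, cluster the network, label clusters, compare with indexer keywords) can be illustrated with a minimal sketch. This is not the TermWatch implementation: plain connected components stand in for the 3-level hierarchical graph algorithm, "variation activity" is approximated by the number of variation links a term has, and the function names (`cluster_terms`, `label_cluster`, `keyword_overlap`) and the sample terms are hypothetical.

```python
from collections import defaultdict


def cluster_terms(variation_links):
    """Build an undirected term graph and return it with its connected components.

    Assumption: connected components replace the paper's 3-level
    hierarchical graph clustering, which is not specified in the abstract.
    """
    graph = defaultdict(set)
    for a, b in variation_links:
        graph[a].add(b)
        graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(graph):  # sorted for a deterministic cluster order
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return graph, clusters


def label_cluster(graph, cluster):
    """Label a cluster with its most linked term (ties broken alphabetically).

    Assumption: link count is only a proxy for "variation activity".
    """
    return min(cluster, key=lambda term: (-len(graph[term]), term))


def keyword_overlap(terms, keywords):
    """Fraction of indexer keywords that also appear among the given terms."""
    terms = {t.lower() for t in terms}
    keywords = {k.lower() for k in keywords}
    return len(terms & keywords) / len(keywords) if keywords else 0.0


# Hypothetical variation links between extracted multi-word terms.
links = [
    ("query expansion", "automatic query expansion"),
    ("query expansion", "query refinement"),
    ("term extraction", "multi-word term extraction"),
]
graph, clusters = cluster_terms(links)
labels = [label_cluster(graph, c) for c in clusters]
overlap = keyword_overlap(labels, ["Query expansion", "information retrieval"])
```

Here `labels` could serve as candidate query refinements, and `overlap` is one simple way to compare them with indexer-assigned keywords; the paper's own evaluation measures may differ.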

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

SanJuan, E. (2005). Query Refinement Through Lexical Clustering of Scientific Textual Databases. In: Montoyo, A., Muñoz, R., Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2005. Lecture Notes in Computer Science, vol 3513. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11428817_23

  • DOI: https://doi.org/10.1007/11428817_23

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-26031-8

  • Online ISBN: 978-3-540-32110-1

  • eBook Packages: Computer Science, Computer Science (R0)
