Abstract
This chapter presents an overview of text mining approaches for science and technology studies that rely on assessing the content similarity between patent documents and/or scientific publications. We explain the rationale behind vector space models, latent semantic analysis, and probabilistic topic models, and we present several validation studies on patent documents and publications. These studies reveal that choices of algorithm, pre-processing, and calculation options have non-trivial consequences for outcomes and their validity. Scholars should therefore pay attention to these technicalities when engaging in text mining, so that outcomes are relevant and informative.
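To make the point about pre-processing and calculation options concrete, the following is a minimal sketch of a vector space model with cosine similarity. It is not the authors' implementation: the three toy texts, the tiny stop-word list, the smoothed IDF formula, and all function names are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical toy texts (not taken from the chapter).
DOCS = {
    "patent": "A method for the amplification of a nucleic acid sequence using a polymerase",
    "paper": "We report the amplification of nucleic acid sequences using a thermostable polymerase",
    "other": "A method for the control of a hydraulic valve using a pressure sensor",
}

# Deliberately tiny stop-word list (an assumption, for illustration only).
STOP_WORDS = {"a", "an", "the", "of", "for", "we", "using", "method"}


def tokenize(text, remove_stop_words):
    """Lowercase, split on non-letters, optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def build_vectors(docs, remove_stop_words, use_idf):
    """Return one sparse term-weight vector (a dict) per document."""
    counts = {name: Counter(tokenize(text, remove_stop_words))
              for name, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts.values() for term in c)
    vectors = {}
    for name, c in counts.items():
        vec = {}
        for term, tf in c.items():
            # Smoothed IDF (one common variant); weight 1.0 means raw counts.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 if use_idf else 1.0
            vec[term] = tf * idf
        vectors[name] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def patent_similarities(remove_stop_words, use_idf):
    """Similarity of the toy patent to each of the other two toy texts."""
    vecs = build_vectors(DOCS, remove_stop_words, use_idf)
    return {name: cosine(vecs["patent"], vecs[name])
            for name in ("paper", "other")}


if __name__ == "__main__":
    for stop in (False, True):
        for idf in (False, True):
            sims = patent_similarities(stop, idf)
            print(f"stop_words_removed={stop} idf={idf} "
                  f"paper={sims['paper']:.3f} other={sims['other']:.3f}")
```

In this toy setting, raw term counts without stop-word removal rate the patent as closer to the unrelated text than to the topically related paper, because shared function words dominate the vectors; removing stop words reverses the ranking. The sketch omits stemming, so "sequence" and "sequences" are treated as different terms, which is itself one of the pre-processing choices at stake.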
© 2019 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Van Looy, B., Magerman, T. (2019). Using Text Mining Algorithms for Patent Documents and Publications. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds) Springer Handbook of Science and Technology Indicators. Springer Handbooks. Springer, Cham. https://doi.org/10.1007/978-3-030-02511-3_38
DOI: https://doi.org/10.1007/978-3-030-02511-3_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02510-6
Online ISBN: 978-3-030-02511-3
eBook Packages: Economics and Finance (R0)