Using Text Mining Algorithms for Patent Documents and Publications

Springer Handbook of Science and Technology Indicators

Part of the book series: Springer Handbooks (SHB)

Abstract

In this chapter we present an overview of text mining approaches that can be used to conduct science and technology studies that rely on assessing the (content) similarity between patent documents and/or scientific publications. We highlight the rationale behind vector space models, latent semantic analysis, and probabilistic topic models. In addition, several validation studies pertaining to patent documents and publications are presented. These studies reveal that choices in terms of algorithms, pre-processing, and calculation options have non-trivial consequences in terms of outcomes and their validity. As such, scholars should pay attention to the technicalities implied when engaging in text mining efforts in order for outcomes to become relevant and informative.
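As an illustration of the vector space approach mentioned above, the following is a minimal sketch, not the chapter's own implementation. It assumes raw term counts weighted by a plain logarithmic inverse document frequency, a crude letters-only tokenizer, and three invented toy texts; the function names `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical helpers chosen for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only (a deliberately crude pre-processing step)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy texts standing in for a patent, a related paper, and an unrelated patent.
patent = "A method for amplifying DNA sequences using a thermostable polymerase enzyme"
paper = "We amplify DNA sequences with a thermostable polymerase in repeated thermal cycles"
unrelated = "An optical waveguide coupler for integrated photonic circuits"

vecs = tfidf_vectors([patent, paper, unrelated])
sim_related = cosine(vecs[0], vecs[1])
sim_unrelated = cosine(vecs[0], vecs[2])
print(f"patent vs paper:     {sim_related:.3f}")
print(f"patent vs unrelated: {sim_unrelated:.3f}")
```

Because this sketch applies no stemming, "amplifying" and "amplify" are treated as different terms, and no stop-word list is used, so common words such as "for" still contribute to the score. These are the kind of pre-processing and weighting choices that the abstract says affect outcomes.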

Author information

Corresponding author

Correspondence to Bart Van Looy.

Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Van Looy, B., Magerman, T. (2019). Using Text Mining Algorithms for Patent Documents and Publications. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds) Springer Handbook of Science and Technology Indicators. Springer Handbooks. Springer, Cham. https://doi.org/10.1007/978-3-030-02511-3_38
