Using Text Mining Algorithms for Patent Documents and Publications

Van Looy, Bart; Magerman, Tom

doi:10.1007/978-3-030-02511-3_38

Part of the book series: Springer Handbooks (SHB)

Abstract

In this chapter we present an overview of text mining approaches that can be used to conduct science and technology studies that rely on assessing the (content) similarity between patent documents and/or scientific publications. We highlight the rationale behind vector space models, latent semantic analysis, and probabilistic topic models. In addition, several validation studies pertaining to patent documents and publications are presented. These studies reveal that choices in terms of algorithms, pre-processing, and calculation options have non-trivial consequences in terms of outcomes and their validity. As such, scholars should pay attention to the technicalities implied when engaging in text mining efforts in order for outcomes to become relevant and informative.
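To make the vector space rationale concrete, the following is a minimal sketch and not the authors' implementation. It represents each document as a TF-IDF weighted term vector and compares documents by cosine similarity. The tokenizer, the smoothed IDF formula, the function names, and the three toy texts are all illustrative assumptions; the abstract's point that pre-processing and weighting choices affect outcomes applies to each of them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters (a crude stand-in for real pre-processing)."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Weight = raw term frequency * (log(N / df) + 1). This IDF variant is an
    assumption chosen for illustration; other weighting schemes are common.
    """
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    # Hypothetical toy texts standing in for a patent, a related paper,
    # and an unrelated paper.
    texts = [
        "A method for sequencing DNA using fluorescent labelled nucleotides",
        "Fluorescent nucleotides enable rapid DNA sequencing",
        "Exchange rates and monetary policy in small open economies",
    ]
    patent, paper, unrelated = tfidf_vectors(texts)
    print("patent vs related paper:  ", round(cosine(patent, paper), 3))
    print("patent vs unrelated paper:", round(cosine(patent, unrelated), 3))
```

In this sketch only exact term matches contribute to similarity, so synonyms and spelling variants count as unrelated terms. Stemming, stop-word removal, or a different weighting scheme would change the scores, and no such step is included here.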


Author information

Authors and Affiliations

Faculty of Economics and Business, KU Leuven, Leuven, Belgium
Bart Van Looy & Tom Magerman

Corresponding author

Correspondence to Bart Van Looy.

Editor information

Editors and Affiliations

ECOOM and Faculty of Economics and Business, KU Leuven, Leuven, Belgium
Wolfgang Glänzel
Amsterdam, The Netherlands
Henk F. Moed
Competence Center Policy – Industry – Innovation, Fraunhofer Institute for Systems and Innovation Research ISI, Karlsruhe, Germany
Ulrich Schmoch
Faculty of Science and Engineering, University of Wolverhampton, Wolverhampton, UK
Mike Thelwall

About this chapter

Cite this chapter

Van Looy, B., Magerman, T. (2019). Using Text Mining Algorithms for Patent Documents and Publications. In: Glänzel, W., Moed, H.F., Schmoch, U., Thelwall, M. (eds) Springer Handbook of Science and Technology Indicators. Springer Handbooks. Springer, Cham. https://doi.org/10.1007/978-3-030-02511-3_38

DOI: https://doi.org/10.1007/978-3-030-02511-3_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02510-6
Online ISBN: 978-3-030-02511-3
eBook Packages: Economics and Finance, Economics and Finance (R0)
