Abstract
In this study, we examine and validate the use of existing text mining techniques (based on the vector space model and latent semantic indexing) to detect similarities between patent documents and scientific publications. Experts involved in domain studies would benefit from techniques that detect such similarity and thereby facilitate mapping, categorization and classification efforts. In addition, given current debates on the relevance and appropriateness of academic patenting, the ability to assess content-relatedness between sets of documents (in this case, patents and publications) might become relevant and useful. We list several options available to arrive at content-based similarity measures. Different options of a vector space model and latent semantic indexing approach have been selected and applied to the publications and patents of a sample of academic inventors (n = 6). We also validated the outcomes against independently obtained scores from human raters. While we conclude that text mining techniques can be valuable for detecting similarities between patents and publications, our findings also indicate that the various options available to arrive at similarity measures vary considerably in terms of accuracy: some generally accepted text mining options, like dimensionality reduction and LSA, do not yield the best results when working with smaller document sets. Implications and directions for further research are discussed.
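To make the notion of "options" in a vector space model concrete, the following minimal sketch compares two term-weighting choices, raw term frequency and TF-IDF, using cosine similarity. It is an illustration under stated assumptions and not the procedure used in the study: the three one-line documents are hypothetical, the tokenizer is deliberately naive (no stemming, no stop-word removal), and the idf variant log(N/df) is only one of several in common use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs; no stemming or stop-word removal."""
    return re.findall(r"[a-z]+", text.lower())


def tf_vectors(docs):
    """One sparse vector (term -> raw count) per document."""
    return [Counter(tokenize(d)) for d in docs]


def tfidf_vectors(docs):
    """TF-IDF vectors with idf = log(N / df), an assumed weighting variant."""
    tf = tf_vectors(docs)
    n = len(docs)
    # Document frequency: each term is counted once per document it occurs in.
    df = Counter(term for vec in tf for term in vec)
    return [{t: c * math.log(n / df[t]) for t, c in vec.items()} for vec in tf]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy documents, invented for illustration only.
docs = [
    "gluten biopolymer for composites and foams",       # stands in for a patent
    "tough plastic from wheat gluten for composites",   # stands in for a related paper
    "melt viscosity of thermoplastic composites",       # stands in for an unrelated paper
]

raw = tf_vectors(docs)
weighted = tfidf_vectors(docs)

raw_related = cosine(raw[0], raw[1])
raw_unrelated = cosine(raw[0], raw[2])
tfidf_related = cosine(weighted[0], weighted[1])
tfidf_unrelated = cosine(weighted[0], weighted[2])

print("raw tf :", raw_related, raw_unrelated)
print("tf-idf :", tfidf_related, tfidf_unrelated)
```

In this toy case the only term the first document shares with the third, "composites", occurs in all three documents, so its idf is log(3/3) = 0 and the TF-IDF similarity for that pair drops to zero, whereas raw counts still give a positive score. The two weighting options therefore rank the same pairs on different scales, which is the kind of variation the validation against human raters is meant to assess.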
Notes
A more in-depth analysis of the performance and advantages and disadvantages of stemming (which are also language and corpus dependent) is outside the scope of this publication. The reader interested in this aspect is referred to Lennon et al. (1981), Harman (1991), Krovets (1995), and Porter (2001).
A more detailed description of these topics can be found in Moens (2006).
Note that for some academic inventors an R² of 0.80 was obtained.
References
Atherton, P., & Borko, H. (1965). A test of factor-analytically derived automated classification methods. AIP Report AIP-DRP 65-l.
Azoulay, P., Ding, W., & Stuart, T. (2006). The impact of academic patenting on the rate, quality and direction of (public) research. NBER Working Paper No. 11917. Cambridge, MA: National Bureau of Economic Research.
Bassecoulard, E., & Zitt, M. (2004). Patents and publications: The lexical connection. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 665–694). Dordrecht: Kluwer Academic Publishers.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: ACM Press.
Berry, M. W. (Ed.). (2003). Survey of text mining. New York: Springer.
Berry, M. W., & Browne, M. (1999). Understanding search engines: Mathematical modeling and text retrieval. Philadelphia: Society for Industrial and Applied Mathematics.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Borko, H., & Bernick, M. D. (1963). Automatic document classification. Journal of the ACM, 10, 151–162.
Calderini, M., Franzoni, C., & Vezzulli, A. (2005). If star scientists do not patent: An event history analysis of scientific eminence and the decision to patent in the academic world. CESPRI Working Paper No. 169.
Callon, M., Courtial, J. P., Turner, W. A., & Bauin, S. (1983). From translations to problematic networks—an introduction to co-word analysis. Social Science Information Sur Les Sciences Sociales, 22(2), 191–235.
Carroll, J. D., & Arabie, P. (1980). Multidimensional scaling. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (Vol. 31, pp. 607–649). Palo Alto, CA: Annual Reviews, Inc.
Courtial, J. P. (1994). A coword analysis of Scientometrics. Scientometrics, 31(3), 251–260.
Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211–218.
Engelsman, E. C., & van Raan, A. F. J. (1994). A patent based cartography of technology. Research Policy, 23, 1–26.
European Commission. (2003). Third European Report on S&T Indicators.
Fabrizio, K. R., & DiMinin, A. (2005). Commercializing the laboratory: Faculty patenting and the open science environment. Working paper.
Glänzel, W., et al. (2004). Biotechnology: An analysis of patents and publications. Report Steunpunt O&O Statistics (www.steunpuntoos.be).
Glenisson, P., Glänzel, W., Janssens, F., & De Moor, B. (2005a). Combining full-text and bibliometric information in mapping scientific disciplines. Information Processing & Management, 41(6), 1548–1572.
Glenisson, P., Glänzel, W., & Persson, O. (2005b). Combining full-text and bibliometric indicators: A pilot study. Scientometrics, 63(1), 163–180.
Grzybek, P., & Kelih, E. (2004). Anton S. Budilovic (1846–1908): A forerunner of quantitative linguistics in Russia? Glottometrics, 7(9), 4–97.
Harman, D. (1986). An experimental study of the factors important in document ranking. In F. Rabitti (Ed.), Association for Computing Machinery’s ninth conference on research and development in information retrieval. New York: Association for Computing Machinery.
Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7–15.
Hicks, D., Martin, B. R., & Irvine, J. (1986). Bibliometric techniques for monitoring performance in technologically oriented research: The case of integrated-optics. R&D Management, 16(3), 211–223.
Hinze, S., & Grupp, H. (1996). Mapping of R&D structures in transdisciplinary areas: New biotechnology in food sciences. Scientometrics, 37(2), 313–335.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference (pp. 50–57). New York: ACM Press.
Janssens, F., Leta, J., Glänzel, W., & De Moor, B. (2006). Towards mapping library and information science. Information Processing and Management, 42(6), 1614–1642.
Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.
Krovets, B. (1995). Word sense disambiguation for large text databases. Ph. D. Thesis. Department of Computer Science, University of Massachusetts Amherst.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Lennon, M., Pierce, D. S., Tarry, B. D., & Willett, P. (1981). An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3, 177–183.
Leopold, E., May, M., & Paaß, G. (2004). Data mining and text mining for science & technology research. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 187–213). Dordrecht: Kluwer Academic Publishers.
Leydesdorff, L. (2004). The university-industry knowledge relationship: analyzing patents and the science base of technologies. Journal of the American Society for Information Science and Technology, 55(11), 991–1001.
Manning, C. D., & Schütze, H. (2000). Foundations of statistical natural language processing. Cambridge: MIT Press.
Meyer, M. (2000). Patent citations in a novel field of technology: What can they tell about interactions of emerging communities of science and technology? Scientometrics, 48(2), 151–178.
Meyer, M. (2006). Knowledge integrators or weak links? An exploratory comparison of patenting researchers with their non-inventing peers in nano-science and technology. Scientometrics, 68(3), 545–560.
Moens, M. F. (2006). Information extraction: Algorithms and prospects in a retrieval context (The Information Retrieval Series 21). New York: Springer.
Murray, F., & Stern, S. (2005). Do formal intellectual property rights hinder the free flow of scientific knowledge? An empirical test of the anti-commons hypothesis. NBER Working Paper No. 11465. Cambridge, MA: National Bureau of Economic Research.
National Science Foundation (NSF). (2006). Science and Engineering Indicators.
Noyons, E. C. M., van Raan, A. F. J., Grupp, H., & Schmoch, U. (1994). Exploring the science and technology interface–inventor author relations in laser medicine. Research Policy, 23(4), 443–457.
Ossorio, P. G. (1966). Classification space: A multivariate procedure for automatic document indexing and retrieval. Multivariate Behavioral Research, 1, 479–524.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.
Porter, M. F. (2001). Snowball: A language for stemming algorithms. (www.snowball.tartarus.org/texts/introduction.html).
Porter, A. L., & Newman, N. C. (2004). Patent profiling for competitive advantage. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 587–612). Dordrecht: Kluwer Academic Publishers.
Rabeharisoa, V. (1992). A special mediation between science and technology: When inventors publish scientific articles in fuel cells. In H. Grupp (Ed.), Dynamics of science-based innovation (pp. 45–72). Berlin: Springer.
Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.
Salton, G., & Wu, H. (1981). A term weighting model based on utility theory. In R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, & R. W. Williams (Eds.), Information retrieval research (pp. 9–22). Boston: Butterworths.
Schmoch, U. (2004). The technological output of scientific institutions. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 717–731). Dordrecht: Kluwer Academic Publishers.
Sparck Jones, K. (1971). Automatic keyword classification for information retrieval. London: Butterworths.
Van Looy, B., Callaert, J., & Debackere, K. (2006). Publication and patent behavior of academic researchers: Conflicting, reinforcing or merely co-existing? Research Policy, 35(4), 596–608.
Van Looy, B., Ranga, M., Callaert, J., Debackere, K., & Zimmermann, E. (2004). Combining entrepreneurial and scientific performance in academia: Towards a compounded and reciprocal Matthew Effect? Research Policy, 33, 425–441.
van Rijsbergen, C. J., Robertson, S. E., & Porter, M. F. (1980). New models in probabilistic information retrieval. London: British Library (British Library Research and Development Report, No. 5587).
Vandromme, D., Magerman, T., Song, X., Van Looy, B., Hoskens, M., Glenisson, P., Thijs, B., Vertomme, J., De Moor, B., & Duflou, J. (2006). A comparative analysis of distance measures and text mining methods supporting domain studies. Paper presented at the Ninth STI indicator conference, Leuven, 2006.
Wong, S. K. M., & Yao, Y. Y. (1995). On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1), 69–99.
Wyllys, R. E. (1975). Measuring scientific prose with rank-frequency (“Zipf”) curves: A new use for an old phenomenon. Proceedings of the American Society for Information Science, 12, 30–31.
Zipf, G. K. (1949). Human behavior and the principle of least effort: An introduction to human ecology. Cambridge: Addison-Wesley.
Acknowledgments
The authors would like to express their gratitude to Julie Callaert and Mariette Du Plessis for their contribution to the independent assessment of the patent–paper pairs, and to Frizo Janssens for useful methodological comments and suggestions. We also wish to thank the participants of the Triple Helix Conference (Singapore, May 2007) for their helpful remarks in response to a previous version of this work, and two anonymous reviewers for their valuable comments.
Appendices
Appendix 1
See Table 4.
Appendix 2: Title and abstract of one patent document and two publications (highly related and unrelated) authored by the inventor
Seed patent: Gluten biopolymers
This invention consists of a modified gluten biopolymer for use in industrial applications, such as composites and foams. In the present work, the fracture toughness of the gluten polymer was improved with the addition of a thiol-containing modifying agent. This work also resulted in the development of a gluten biopolymer-modified fibre bundle, demonstrating the potential to process fully biodegradable composite materials. Qualitative analysis suggests that a reasonably strong interface between the natural fibres and biopolymer matrix can form spontaneously under the proper conditions. Therefore, this invention relates to a modified gluten biopolymer for use in industrial applications, such as composites, stabilized foams and moulded articles of manufactures. The present invention relates to a new gluten based biopolymer with modified properties, such as an increase in impact strength, and prepared by using thiol-containing molecules. The multifunctional activity of the polythiol-containing molecules generates the potential for the development of a new material base for commodity plastics. The invention furthermore relates to a new composite material comprising gluten-coated fibre, its use and the method for preparing the composite material.
Publication 1 (highly related to the patent document): designing new materials from wheat protein
We recently discovered that wheat gluten could be formed into a tough, plastic-like substance when thiol-terminated, star-branched molecules are incorporated directly into the protein structure. This discovery offers the exciting possibility of developing biodegradable high-performance engineering plastics and composites from renewable resources that are competitive with their synthetic counterparts. Wheat gluten powder is available at a cost of less than $0.5/lb, so if processing costs can be controlled, an inexpensive alternative to synthetic polymers may be possible. In the present work, we demonstrate the ability to toughen an otherwise brittle protein-based material by increasing the yield stress and strain-to-failure, without compromising stiffness. Water absorption results suggest that the cross-link density of the polymer is increased by the presence of the thiol-terminated, star-branched additive in the protein. Size-exclusion high performance liquid chromatography data of moulded tri-thiol-modified gluten are consistent with that of a polymer that has been further cross-linked when compared directly with unmodified gluten, handled under identical conditions. Remarkably, the mechanical properties of our gluten formulations stored in ambient conditions were found to improve with time.
Publication 2 (unrelated to the patent document): in situ polymerization of thermoplastic composites based on cyclic oligomers
The high melt viscosity of thermoplastics is the main issue when producing continuously reinforced thermoplastic composites. For this reason, production methods for thermoplastic and thermoset composites differ substantially. Lowering the viscosity of thermoplastics to a value below 1 Pa s enables the use of thermoset production methods such as resin transfer molding (RTM). In order to achieve these low viscosities, a low viscous mixture of prepolymers and catalyst can be infused into a mold where the polymerization reaction takes place. Only a limited number of polymerization reactions are compatible with a closed mold process. These polymerization reactions proceed rapidly compared to the curing reaction of thermosets used in RTM. Therefore, the processing window is narrow, and managing the processing parameters is crucial. This paper describes the production and properties of a glass fiber reinforced polyester produced from cyclic oligoesters.
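The three abstracts above can serve as a small worked example of a vector space comparison. The sketch below is illustrative only and does not reproduce the study's pipeline: it takes the first two sentences of each abstract verbatim, removes a short hand-picked stop-word list (an assumption, not the list used by the authors), applies no stemming, and computes cosine similarity on raw term counts.

```python
import math
import re
from collections import Counter

# Hand-picked stop words for this illustration only (an assumption).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "be", "could", "for", "from", "in",
    "into", "is", "of", "such", "that", "the", "their", "this", "was",
    "we", "when", "with",
}


def term_counts(text):
    """Lowercase, keep alphabetic runs, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# First two sentences of each abstract in Appendix 2.
patent = (
    "This invention consists of a modified gluten biopolymer for use in "
    "industrial applications, such as composites and foams. In the present "
    "work, the fracture toughness of the gluten polymer was improved with "
    "the addition of a thiol-containing modifying agent."
)
publication_1 = (
    "We recently discovered that wheat gluten could be formed into a tough, "
    "plastic-like substance when thiol-terminated, star-branched molecules "
    "are incorporated directly into the protein structure. This discovery "
    "offers the exciting possibility of developing biodegradable "
    "high-performance engineering plastics and composites from renewable "
    "resources that are competitive with their synthetic counterparts."
)
publication_2 = (
    "The high melt viscosity of thermoplastics is the main issue when "
    "producing continuously reinforced thermoplastic composites. For this "
    "reason, production methods for thermoplastic and thermoset composites "
    "differ substantially."
)

patent_vec = term_counts(patent)
sim_related = cosine(patent_vec, term_counts(publication_1))
sim_unrelated = cosine(patent_vec, term_counts(publication_2))

print("patent vs. publication 1:", round(sim_related, 3))
print("patent vs. publication 2:", round(sim_unrelated, 3))
```

On these excerpts the patent shares "gluten", "thiol" and "composites" with Publication 1 but only "composites" with Publication 2, so the first pair scores higher. Because no stemming is applied, "tough" and "toughness" are not matched, which illustrates why pre-processing choices matter on short texts.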
Cite this article
Magerman, T., Van Looy, B. & Song, X. Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications. Scientometrics 82, 289–306 (2010). https://doi.org/10.1007/s11192-009-0046-6
DOI: https://doi.org/10.1007/s11192-009-0046-6