Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications

Scientometrics

Abstract

In this study, we examine and validate the use of existing text mining techniques (based on the vector space model and latent semantic indexing) to detect similarities between patent documents and scientific publications. Clearly, experts involved in domain studies would benefit from techniques that allow similarity to be detected—and hence facilitate mapping, categorization and classification efforts. In addition, given current debates on the relevance and appropriateness of academic patenting, the ability to assess content-relatedness between sets of documents—in this case, patents and publications—might become relevant and useful. We list several options available to arrive at content based similarity measures. Different options of a vector space model and latent semantic indexing approach have been selected and applied to the publications and patents of a sample of academic inventors (n = 6). We also validated the outcomes by using independently obtained validation scores of human raters. While we conclude that text mining techniques can be valuable for detecting similarities between patents and publications, our findings also indicate that the various options available to arrive at similarity measures vary considerably in terms of accuracy: some generally accepted text mining options, like dimensionality reduction and LSA, do not yield the best results when working with smaller document sets. Implications and directions for further research are discussed.
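As a rough illustration of the latent semantic indexing idea mentioned in the abstract, the following minimal sketch builds a term-by-document count matrix, reduces it to k latent dimensions, and compares documents by cosine similarity. Everything in it is an assumption made for illustration: the five toy documents, the raw-count weighting, the function names, and the choice to obtain the truncated SVD through power iteration on A^T A so that only the standard library is needed. It is not the weighting scheme or implementation used in the study.

```python
import math

def term_doc_matrix(docs):
    """Rows are terms, columns are documents; entries are raw term counts."""
    vocab = sorted({t for d in docs for t in d.split()})
    return [[d.split().count(t) for d in docs] for t in vocab]

def gram(a):
    """Return A^T A, the document-by-document matrix of dot products."""
    n = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(n)]
            for i in range(n)]

def top_eigenpairs(g, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix (power iteration + deflation)."""
    n = len(g)
    g = [row[:] for row in g]  # work on a copy
    pairs = []
    for _component in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iters):
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue (a squared singular value of A).
        lam = sum(v[i] * sum(g[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        # Deflate so the next pass finds the next-largest component.
        for i in range(n):
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_doc_vectors(docs, k):
    """Document coordinates sigma_i * v_i[j] in the k-dimensional latent space."""
    pairs = top_eigenpairs(gram(term_doc_matrix(docs)), k)
    return [[math.sqrt(max(lam, 0.0)) * v[j] for lam, v in pairs]
            for j in range(len(docs))]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical toy corpus: three "protein" documents and two "molding" documents.
docs = ["gluten wheat", "wheat protein", "protein thiol",
        "viscosity mold", "mold resin"]
raw = list(zip(*term_doc_matrix(docs)))  # one raw count vector per document
vecs = lsa_doc_vectors(docs, k=2)

print("raw cosine, doc 0 vs doc 2:", cosine(raw[0], raw[2]))
print("LSA cosine, doc 0 vs doc 2:", cosine(vecs[0], vecs[2]))
print("LSA cosine, doc 0 vs doc 3:", cosine(vecs[0], vecs[3]))
```

In this toy corpus the first and third documents share no terms, so their raw cosine is zero, yet they fall on the same latent axis once the matrix is reduced to two dimensions. This shows the mechanism only; as the abstract notes, dimensionality reduction did not yield the best results on the small document sets studied. In practice a linear algebra library would replace the hand-written power iteration.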

Notes

  1. A more in-depth analysis of the performance, advantages and disadvantages of stemming (which also depend on language and corpus) is outside the scope of this publication. Interested readers are referred to Lennon et al. (1981), Harman (1991), Krovets (1995), and Porter (2001).

  2. A more detailed description of these topics can be found in Moens (2006).

  3. Other methods do not rely on a semantic representation as LSA does, but instead use topic models based on generative probabilistic models (probabilistic inference models, topic models and probabilistic latent semantic indexing; see e.g. Wong and Yao 1995; Hofmann 1999; Blei et al. 2003).

  4. Note that for some academic inventors an R² of 0.80 has been obtained.

References

  • Atherton, P., & Borko, H. (1965). A test of factor-analytically derived automated classification methods. AIP Report AIP-DRP 65-l.

  • Azoulay, P., Ding, W., & Stuart, T. (2006). The impact of academic patenting on the rate, quality and direction of (public) research. NBER Working Paper No. 11917. Cambridge MA: National Bureau of Economic Research.

  • Bassecoulard, E., & Zitt, M. (2004). Patents and publications: The lexical connection. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 665–694). Dordrecht: Kluwer Academic Publishers.

  • Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. New York: ACM Press.

  • Berry, M. W. (Ed.). (2003). Survey of text mining. New York: Springer.

  • Berry, M. W., & Browne, M. (1999). Understanding search engines: Mathematical modeling and text retrieval. Philadelphia: Society for Industrial and Applied Mathematics.

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Borko, H., & Bernick, M. D. (1963). Automatic document classification. Journal of the ACM, 10, 151–162.

  • Calderini, M., Franzoni, C., & Vezzulli, A. (2005). If star scientists do not patent: An event history analysis of scientific eminence and the decision to patent in the academic world. CESPRI Working Paper No. 169.

  • Callon, M., Courtial, J. P., Turner, W. A., & Bauin, S. (1983). From translations to problematic networks—an introduction to co-word analysis. Social Science Information Sur Les Sciences Sociales, 22(2), 191–235.

  • Carroll, J. D., & Arabie, P. (1980). Multidimensional scaling. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (Vol. 31, pp. 607–649). Palo Alto, CA: Annual Reviews, Inc.

  • Courtial, J. P. (1994). A coword analysis of Scientometrics. Scientometrics, 31(3), 251–260.

  • Deerwester, S., Dumais, S., Furnas, G., Landauer, T., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Eckart, C., & Young, G. (1936). The approximation of one matrix by another of lower rank. Psychometrika, 1, 211–218.

  • Engelsman, E. C., & van Raan, A. F. J. (1994). A patent based cartography of technology. Research Policy, 23, 1–26.

  • European Commission. (2003). Third European Report on S&T Indicators.

  • Fabrizio, K. R., & DiMinin, A. (2005). Commercializing the laboratory: Faculty patenting and the open science environment. Working paper.

  • Glänzel, W., et al. (2004). Biotechnology: An analysis of patents and publications. Report Steunpunt O&O Statistics (www.steunpuntoos.be).

  • Glenisson, P., Glänzel, W., Janssens, F., & De Moor, B. (2005a). Combining full-text and bibliometric information in mapping scientific disciplines. Information Processing & Management, 41(6), 1548–1572.

  • Glenisson, P., Glänzel, W., & Persson, O. (2005b). Combining full-text and bibliometric indicators: A pilot study. Scientometrics, 63(1), 163–180.

  • Grzybek, P., & Kelih, E. (2004). Anton S. Budilovic (1846–1908): A forerunner of quantitative linguistics in Russia? Glottometrics, 7(9), 4–97.

  • Harman, D. (1986). An experimental study of the factors important in document ranking. In F. Rabitti (Ed.), Association for Computing Machinery's ninth conference on research and development in information retrieval. New York: Association for Computing Machinery.

  • Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7–15.

  • Hicks, D., Martin, B. R., & Irvine, J. (1986). Bibliometric techniques for monitoring performance in technologically oriented research: The case of integrated-optics. R&D Management, 16(3), 211–223.

  • Hinze, S., & Grupp, H. (1996). Mapping of R&D structures in transdisciplinary areas: New biotechnology in food sciences. Scientometrics, 37(2), 313–335.

  • Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference (pp. 50–57). New York: ACM Press.

  • Janssens, F., Leta, J., Glänzel, W., & De Moor, B. (2006). Towards mapping library and information science. Information Processing and Management, 42(6), 1614–1642.

  • Jardine, N., & van Rijsbergen, C. J. (1971). The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7, 217–240.

  • Krovets, B. (1995). Word sense disambiguation for large text databases. Ph. D. Thesis. Department of Computer Science, University of Massachusetts Amherst.

  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.

  • Lennon, M., Pierce, D. S., Tarry, B. D., & Willett, P. (1981). An evaluation of some conflation algorithms for information retrieval. Journal of Information Science, 3, 177–183.

  • Leopold, E., May, M., & Paaß, G. (2004). Data mining and text mining for science & technology research. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 187–213). Dordrecht: Kluwer Academic Publishers.

  • Leydesdorff, L. (2004). The university-industry knowledge relationship: analyzing patents and the science base of technologies. Journal of the American Society for Information Science and Technology, 55(11), 991–1001.

  • Manning, C. D., & Schütze, H. (2000). Foundations of statistical natural language processing. Cambridge: MIT Press.

  • Meyer, M. (2000). Patent citations in a novel field of technology: What can they tell about interactions of emerging communities of science and technology? Scientometrics, 48(2), 151–178.

  • Meyer, M. (2006). Knowledge integrators or weak links? An exploratory comparison of patenting researchers with their non-inventing peers in nano-science and technology. Scientometrics, 68(3), 545–560.

  • Moens, M. F. (2006). Information extraction: Algorithms and prospects in a retrieval context (The Information Retrieval Series 21). New York: Springer.

  • Murray, F., & Stern, S. (2005). Do formal intellectual property rights hinder the free flow of scientific knowledge? An empirical test of the anti-commons hypothesis. NBER Working Paper No. 11465. Cambridge, MA: National Bureau of Economic Research.

  • National Science Foundation (NSF). (2006). Science and Engineering Indicators.

  • Noyons, E. C. M., van Raan, A. F. J., Grupp, H., & Schmoch, U. (1994). Exploring the science and technology interface–inventor author relations in laser medicine. Research Policy, 23(4), 443–457.

  • Ossorio, P. G. (1966). Classification space: A multivariate procedure for automatic document indexing and retrieval. Multivariate Behavioral Research, 1, 479–524.

  • Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

  • Porter, M. F. (2001). Snowball: A language for stemming algorithms. (www.snowball.tartarus.org/texts/introduction.html).

  • Porter, A. L., & Newman, N. C. (2004). Patent profiling for competitive advantage. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 587–612). Dordrecht: Kluwer Academic Publishers.

  • Rabeharisoa, V. (1992). A special mediation between science and technology: When inventors publish scientific articles in fuel cells. In H. Grupp (Ed.), Dynamics of science-based innovation (pp. 45–72). Berlin: Springer.

  • Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.

  • Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw Hill.

  • Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.

  • Salton, G., & Wu, H. (1981). A term weighting model based on utility theory. In R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, & R. W. Williams (Eds.), Information retrieval research (pp. 9–22). Boston: Butterworths.

  • Schmoch, U. (2004). The technological output of scientific institutions. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research. The use of publication and patent statistics in studies of S&T systems (pp. 717–731). Dordrecht: Kluwer Academic Publishers.

  • Sparck Jones, K. (1971). Automatic keyword classification for information retrieval. London: Butterworths.

  • Van Looy, B., Callaert, J., & Debackere, K. (2006). Publication and patent behavior of academic researchers: Conflicting, reinforcing or merely co-existing? Research Policy, 35(4), 596–608.

  • Van Looy, B., Ranga, M., Callaert, J., Debackere, K., & Zimmermann, E. (2004). Combining entrepreneurial and scientific performance in academia: Towards a compounded and reciprocal Matthew Effect? Research Policy, 33, 425–441.

  • van Rijsbergen, C. J., Robertson, S. E., & Porter, M. F. (1980). New models in probabilistic information retrieval. London: British Library (British Library Research and Development Report, No. 5587).

  • Vandromme, D., Magerman, T., Song, X., Van Looy, B., Hoskens, M., Glenisson, P., Thijs, B., Vertomme, J., De Moor, B., & Duflou, J. (2006). A comparative analysis of distance measures and text mining methods supporting domain studies. Paper presented at the Ninth STI indicator conference, Leuven, 2006.

  • Wong, S. K. M., & Yao, Y. Y. (1995). On modeling information retrieval with probabilistic inference. ACM Transactions on Information Systems, 13(1), 69–99.

  • Wyllys, R. E. (1975). Measuring scientific prose with rank-frequency (‘‘Zipf’’) curves: A new use for an old phenomenon. Proceedings of the American Society for Information Science, 12, 30–31.

  • Zipf, G. K. (1949). Human behavior and the principle of least effort: An introduction to human ecology. Cambridge: Addison-Wesley.

Acknowledgments

The authors would like to express their gratitude to Julie Callaert and Mariette Du Plessis for their contribution to the independent assessment of the patent–paper pairs and Frizo Janssens for useful methodological comments and suggestions. We also wish to thank the participants of the Triple Helix Conference (Singapore, May 2007) for their helpful remarks in response to a previous version of this work, and two anonymous reviewers for their valuable comments.

Author information

Corresponding author

Correspondence to Tom Magerman.

Appendices

Appendix 1

See Table 4.

Table 4 Basic distribution descriptions of all measures

Appendix 2: Title and abstract of one patent document and two publications (highly related and unrelated) authored by the inventor

Seed patent: Gluten biopolymers

This invention consists of a modified gluten biopolymer for use in industrial applications, such as composites and foams. In the present work, the fracture toughness of the gluten polymer was improved with the addition of a thiol-containing modifying agent. This work also resulted in the development of a gluten biopolymer-modified fibre bundle, demonstrating the potential to process fully biodegradable composite materials. Qualitative analysis suggests that a reasonably strong interface between the natural fibres and biopolymer matrix can form spontaneously under the proper conditions. Therefore, this invention relates to a modified gluten biopolymer for use in industrial applications, such as composites, stabilized foams and moulded articles of manufactures. The present invention relates to a new gluten based biopolymer with modified properties, such as an increase in impact strength, and prepared by using thiol-containing molecules. The multifunctional activity of the polythiol-containing molecules generates the potential for the development of a new material base for commodity plastics. The invention furthermore relates to a new composite material comprising gluten-coated fibre, its use and the method for preparing the composite material.

Publication 1 (highly related to the patent document): Designing new materials from wheat protein

We recently discovered that wheat gluten could be formed into a tough, plastic-like substance when thiol-terminated, star-branched molecules are incorporated directly into the protein structure. This discovery offers the exciting possibility of developing biodegradable high-performance engineering plastics and composites from renewable resources that are competitive with their synthetic counterparts. Wheat gluten powder is available at a cost of less than $0.5/lb, so if processing costs can be controlled, an inexpensive alternative to synthetic polymers may be possible. In the present work, we demonstrate the ability to toughen an otherwise brittle protein-based material by increasing the yield stress and strain-to-failure, without compromising stiffness. Water absorption results suggest that the cross-link density of the polymer is increased by the presence of the thiol-terminated, star-branched additive in the protein. Size-exclusion high performance liquid chromatography data of moulded tri-thiol-modified gluten are consistent with that of a polymer that has been further cross-linked when compared directly with unmodified gluten, handled under identical conditions. Remarkably, the mechanical properties of our gluten formulations stored in ambient conditions were found to improve with time.

Publication 2 (unrelated to the patent document): In situ polymerization of thermoplastic composites based on cyclic oligomers

The high melt viscosity of thermoplastics is the main issue when producing continuously reinforced thermoplastic composites. For this reason, production methods for thermoplastic and thermoset composites differ substantially. Lowering the viscosity of thermoplastics to a value below 1 Pa s enables the use of thermoset production methods such as resin transfer molding (RTM). In order to achieve these low viscosities, a low viscous mixture of prepolymers and catalyst can be infused into a mold where the polymerization reaction takes place. Only a limited number of polymerization reactions are compatible with a closed mold process. These polymerization reactions proceed rapidly compared to the curing reaction of thermosets used in RTM. Therefore, the processing window is narrow, and managing the processing parameters is crucial. This paper describes the production and properties of a glass fiber reinforced polyester produced from cyclic oligoesters.
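To make the contrast between the related and the unrelated publication concrete, the following minimal sketch computes a plain vector space cosine similarity on one sentence taken verbatim from each of the three abstracts above. The stop word list, the raw-count weighting, the absence of stemming, and the function names are assumptions made for illustration; they are not the settings evaluated in the study, and single sentences are far shorter than the full abstracts.

```python
import math
import re
from collections import Counter

# Small illustrative stop list (an assumption, not the list used in the study).
STOP_WORDS = {"a", "the", "of", "in", "was", "with", "is", "we", "that",
              "be", "into", "when", "are", "could"}

def term_counts(text):
    """Lowercase, split on non-letters, drop stop words, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine(c1, c2):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(c1[t] * c2[t] for t in c1.keys() & c2.keys())
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# One sentence from each abstract in this appendix.
patent = ("In the present work, the fracture toughness of the gluten polymer "
          "was improved with the addition of a thiol-containing modifying agent.")
pub1 = ("We recently discovered that wheat gluten could be formed into a tough, "
        "plastic-like substance when thiol-terminated, star-branched molecules "
        "are incorporated directly into the protein structure.")
pub2 = ("The high melt viscosity of thermoplastics is the main issue when "
        "producing continuously reinforced thermoplastic composites.")

sim_related = cosine(term_counts(patent), term_counts(pub1))
sim_unrelated = cosine(term_counts(patent), term_counts(pub2))
print("patent vs publication 1:", round(sim_related, 3))
print("patent vs publication 2:", round(sim_unrelated, 3))
```

On these excerpts the patent sentence and Publication 1 share the terms "gluten" and "thiol", while the patent sentence and Publication 2 share no terms once stop words are removed. Because no stemming is applied, variants such as "thermoplastics" and "thermoplastic" count as different terms.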

About this article

Cite this article

Magerman, T., Van Looy, B. & Song, X. Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications. Scientometrics 82, 289–306 (2010). https://doi.org/10.1007/s11192-009-0046-6

  • DOI: https://doi.org/10.1007/s11192-009-0046-6
