Measuring textual patent similarity on the basis of combined concepts: design decisions and their consequences

Scientometrics

Abstract

For certain tasks in patent management it makes sense to apply a quantitative measure of textual similarity between patents and/or parts thereof, be it the analysis of freedom to operate, the analysis of technology convergence, or the mapping of patents for strategic purposes. In this paper we outline the process of measuring textual patent similarity on the basis of elements referred to as ‘combined concepts’. We go through the operations of this process that lead to design decisions and provide guidance regarding these decisions. By way of two applications from patent management, namely the prioritization of patents and the analysis of convergence between two technological fields, we demonstrate the crucial importance of design decisions for the results of patent analysis.

Notes

  1. At the levels of root forms of words and of simple words, the first tag that can be applied is the syntactical class. Syntactical classes can be subdivided into lexical classes and phrasal classes. Verbs, nouns, prepositions and adjectives belong to the lexical classes. Verb phrases, noun phrases and prepositional phrases, to mention but a few, belong to the phrasal classes (Collins and Hollo 2010).

  2. At this level, or at the level of simple words, the syntactical function is a second tag that can be applied. Syntactical functions indicate the grammatical role of a concept within a clause (Collins and Hollo 2010). They represent clause elements such as subject, predicate and object.

  3. Additional information about the selected patents can be found in the “Application for prioritization” section.

  4. If the size of the window exceeds that of the combined concepts, it makes sense to avoid building more combined concepts than necessary. For this reason, the algorithm should initially build all possible combined concepts in the first window. After moving the window to the next position, the algorithm should build only combined concepts between the new solitary concept in the window and the remaining old solitary concepts in the window. This pattern should be adhered to throughout the concept building process.

  5. The basic model has already been introduced by several authors, sometimes with differing variable names (Sepkoski 1974; Batagelj and Bren 1995).

  6. Corpus-based similarity calculations (e.g. Point-wise Mutual Information and Latent Semantic Analysis) have already been applied to calculate the textual similarity between words, sentences, paragraphs or whole texts (e.g. Landauer et al. 1998; Foltz et al. 1998; Turney 2001).

  7. The abovementioned coefficients have already been adopted in different fields of application. For example, Qin (2000) shows the adaptation of the cosine coefficient and the Jaccard coefficient for the comparison of documents. Quite early on, Braam et al. (1988) used these coefficients for co-citation cluster analysis, and Rip and Courtial (1984) described their application in the construction of co-word maps.

  8. For detailed information about the FVA: http://www.fva-net.de/.

  9. For example, n-grams can be used for the prediction of the next word in a word chain (see Manning and Schütze 2005), text categorization (see Cavnar and Trenkle 1994), malicious code detection (see Abou-Assaleh et al. 2004) and spam e-mail filtering (see Çıltık and Güngör 2008). They have also been applied in indexing, information retrieval, error correction, text compression, language identification, subject classification and speech recognition (Egghe 2000).
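
The incremental procedure described in note 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `combined_concepts` is hypothetical, the window is assumed to move by one solitary concept at a time, and a combined concept is represented as a tuple of solitary concepts in text order.

```python
from itertools import combinations

def combined_concepts(concepts, n, m):
    """Build combined concepts of size n with a sliding window of size m.

    Assumption: the window advances by one solitary concept per step.
    """
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= m")
    # First window: build every order-preserving combination of n concepts.
    built = list(combinations(concepts[:m], n))
    # Later windows: combine only the newly entered concept with the
    # m - 1 concepts that remain from the previous window.
    for end in range(m, len(concepts)):
        old = concepts[end - m + 1:end]
        new = concepts[end]
        for rest in combinations(old, n - 1):
            built.append(rest + (new,))
    return built

example = combined_concepts(["a", "b", "c", "d", "e"], n=2, m=3)
print(len(example))  # 7 combined concepts, none built twice
```

Because each window after the first contributes only the combinations that contain its new solitary concept, no combined concept is built for the same text positions more than once.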

References

  • Abou-Assaleh, T., Cercone, N., Keselj, V., & Sweidan, R. (2004). N-gram-based detection of new malicious code. In Proceedings of the 28th annual international computer software and applications conference. Hong-Kong.

  • Batagelj, V., & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12(1), 73–90.

  • Bonino, D., Ciaramella, A., & Corno, F. (2010). Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Patent Information, 32(1), 30–38.

  • Braam, R. R., Moed, H. F., & van Raan, A. F. J. (1988). Mapping of science: Critical elaboration and new approaches, a case study in agricultural biochemistry. In L. Egghe & R. Rousseau (Eds.), Informetrics 87/88 (pp. 15–28). Amsterdam: Elsevier Science.

  • Buehl, A. (2010). PASW 18: Einführung in die moderne Datenanalyse (12th ed.). München u.a.: Pearson Studium.

  • Carley, K. M. (1997). Extracting team mental models through textual analysis. Journal of Organizational Behavior, 18(S1), 533–558.

  • Cascini, G., & Russo, D. (2007). Computer-aided analysis of patents and search for TRIZ contradictions. International Journal of Product Development, 4(1/2), 52–67.

  • Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval. Las Vegas, NV.

  • Cepela, N., & Danowski, J. A. (2009). Automatic mapping of social networks of political actors from large collections of news stories. International conference on advances in social network analysis and mining. Athens.

  • Çıltık, A., & Güngör, T. (2008). Time-efficient spam e-mail filtering using n-gram models. Pattern Recognition Letters, 29(1), 19–33.

  • Collins, P., & Hollo, C. (2010). English grammar: An introduction. Basingstoke u.a.: Palgrave Macmillan.

  • Corman, S. R., Kuhn, T., McPhee, R. D., & Dooley, K. J. (2002). Studying complex discursive systems. Human Communication Research, 28(2), 157–206.

  • Curran, C., Bröring, S., & Leker, J. (2010). Anticipating converging industries using publicly available data. Technological Forecasting and Social Change, 77(3), 385–395.

  • Curran, C., & Leker, J. (2009). Seeing the next iphone coming your way: How to anticipate converging industries. Portland International Conference on Management of Engineering & Technology, 2009. PICMET 2009.

  • Curran, C., & Leker, J. (2011). Patent indicators for monitoring convergence—Examples from NFF and ICT. Technological Forecasting and Social Change, 78(2), 256–273.

  • Daga, R., & Pandey, G. (2008). US-Patent application 2008/0162455 A1. Determination of document similarity.

  • Doerfel, M. L., & Barnett, G. A. (1996). The use of Catpac for text analysis. Field Methods, 8(2), 4–7.

  • Dressler, A. (2006). Patente in technologieorientierten Mergers & Acquisitions: Nutzen, Prozessmodell, Entwicklung und Interpretation semantischer Patentlandkarten. Wiesbaden: Deutscher Universitäts-Verlag.

  • Egghe, L. (2000). The distribution of N-grams. Scientometrics, 47(2), 237–252.

  • Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2–3), 285–308.

  • Gerken, J. M., Walter, L., & Moehrle, M. G. (2010). Semantische Patentlandkarten. Einsatz semantischer Patentlandkarten im Anwendungsfeld der Antriebstechnik – Eine explorative Analyse am Beispiel der Planetengetriebe. Heft Nr. 924 der Forschungsvereinigung Antriebstechnik. Frankfurt/Main: VDMA.

  • Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5–48.

  • Jeong, B., Lee, D., Cho, H., & Lee, J. (2008). A novel method for measuring semantic similarity for XML schema matching. Expert Systems with Applications, 34(3), 1651–1658.

  • Kangasabai, R., & Pan, H. (2008). US-Patent 7,346,491 B2. Method of text similarity measurement.

  • Kim, Y. G., Suh, J. H., & Park, S. C. (2008). Visualization of patent analysis for emerging technology. Expert Systems with Applications, 34(3), 1804–1812.

  • Kondrak, G. (2005). N-gram similarity and distance. Lecture Notes in Computer Science, 3772, 115–126.

  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2), 259–284.

  • Lee, S., Yoon, B., & Park, Y. (2009). An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation, 29(6–7), 481–497.

  • Manning, C. D., & Schütze, H. (2005). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

  • Moehrle, M. G. (2010). Measures for textual patent similarities: A guided way to select appropriate approaches. Scientometrics, 85(1), 95–109.

  • Moehrle, M. G., & Geritz, A. (2007). Developing acquisition strategies based on patent maps. In T. Khalil & Y. Hosni (Eds.), Management of technology: New directions in technology management (pp. 19–29). Oxford: Elsevier.

  • Moehrle, M. G., Walter, L., Bergmann, I., Bobe, S., & Skrzipale, S. (2010). Patinformatics as a business process: A guideline through patent research tasks and tools. World Patent Information, 32(4), 291–299.

  • Moens, M. (2006). Information extraction: Algorithms and prospects in a retrieval context. Dordrecht: Springer.

  • Peters, H. P. F., & van Raan, A. F. J. (1993). Co-word-based science maps of chemical engineering. Part I: Representations by direct multidimensional scaling. Research Policy, 22(1), 23–45.

  • Qin, J. (2000). Semantic similarities between a keyword database and a controlled vocabulary database: An investigation in the antibiotic resistance literature. Journal of the American Society for Information Science, 51(2), 166–180.

  • Ranganathan, A., & Ronen, R. (2008). US-Patent application 2008/0243809 A1. Information-theory based measure of similarity between instances in ontology.

  • Rip, A., & Courtial, P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.

  • Ryley, J. F., Saffer, J., & Gibbs, A. (2008). Advanced document retrieval techniques for patent research. World Patent Information, 30(3), 238–243.

  • Sepkoski, J. J. (1974). Quantified coefficients of association and measurement of similarity. Mathematical Geology, 6(2), 135–152.

  • Sternitzke, C. (2008). Betriebswirtschaftliche Patentportfoliobewertung: Eine informationswissenschaftliche Perspektive [dissertation]. Bremen: Universität Bremen.

  • Sternitzke, C., & Bergmann, I. (2009). Similarity measures for document mapping: A comparative study on the level of an individual scientist. Scientometrics, 78(1), 113–130.

  • Trajtenberg, M. (1990). A penny for your quotes: Patent citations and the value of innovations. The Rand Journal of Economics, 21(1), 172–187.

  • Trippe, A. J. (2003). Patinformatics: Tasks to tools. World Patent Information, 25(3), 211–221.

  • Tseng, Y., Lin, C., & Lin, Y. (2007). Text mining techniques for patent analysis. Information Processing and Management, 43(5), 1216–1247.

  • Tsourikov, V. M., Batchilo, L. S., & Sovpel, I. V. (2000). US-Patent 6,167,370. Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures.

  • Turney, P. D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167, 491–502.

  • von Wartburg, I., Teichert, T., & Rost, K. (2005). Inventive progress measured by multi-stage patent citation analysis. Research Policy, 34(10), 1591–1607.

  • Wanner, L., Baeza-Yates, R., Brügmann, S., Codina, J., Diallo, B., Escorsa, E., et al. (2008). Towards content-oriented patent document processing. World Patent Information, 30(1), 21–33.

  • Wen, G., Jiang, L., & Shadbolt, N. R. (2006). Ontology-based similarity between text documents on manifold. Lecture Notes in Computer Science, 4185, 113–125.

  • Yang, Y., Akers, L., Klose, T., & Barcelon Yang, C. (2008). Text mining and visualization tools—Impressions of emerging capabilities. World Patent Information, 30(4), 280–293.

Acknowledgments

The authors wish to give credit to Dr. Peter Roosen, g.o.e.the GbR, Aachen, for his constructive input regarding this paper. One of the included applications is based on the results of a joint project with the Forschungsvereinigung Antriebstechnik (FVA). We would like to thank the FVA and all industrial members, especially Dipl.-Ing. Thomas Bayer, for their contributions and their support.

Author information

Corresponding author

Correspondence to Martin G. Moehrle.

Appendices

Appendix 1: Calculating the quantity of combined concepts without regard to the window size

The count of combined concepts within a document depends on the concept size: it decreases linearly as the concept size grows. The relationship between the size of the text in patent i, measured as the count of solitary concepts ($c_{i1}$), the size of a combined concept (n) and the maximum quantity of extracted combined concepts of size n ($c_{in}$) is given by formula 1. The actual count can be smaller still if any combined concept occurs more than once in the text.

$$ c_{in} = c_{i1} + 1 - n $$
(1)
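
Formula 1 can be checked on a toy text. The sketch below assumes that, without a window, a combined concept of size n is a run of n consecutive solitary concepts; the function names and the sample text are hypothetical, and clamping the result at zero for texts shorter than n is an addition that is not part of formula 1.

```python
def max_combined_concepts(c_i1, n):
    """Formula 1: c_in = c_i1 + 1 - n (clamped at 0, an added assumption)."""
    return max(c_i1 + 1 - n, 0)

def extract_combined_concepts(concepts, n):
    """All runs of n consecutive solitary concepts, in text order."""
    return [tuple(concepts[k:k + n]) for k in range(len(concepts) - n + 1)]

text = ["gear", "shaft", "drives", "gear", "shaft", "housing"]
bigrams = extract_combined_concepts(text, 2)

print(max_combined_concepts(len(text), 2))  # 5, the upper bound from formula 1
print(len(bigrams))                         # 5 extracted combined concepts
print(len(set(bigrams)))                    # 4, since ("gear", "shaft") occurs twice
```

The last line illustrates why formula 1 is only a maximum: identical combined concepts reduce the number of distinct ones.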

Appendix 2: Calculating the quantity of combined concepts with regard to the window size

The maximum quantity of combined concepts is influenced by the size of the combined concepts as well as by the count of windows $c_{iw}$ into which a patent i can be divided, the count of combined concepts of size n that can be extracted from a window ($c_{wn}$) and the overlap of identical combined concepts in different windows $c_o$ (formula 2).

$$ c_{in} = c_{iw} \cdot c_{wn} - c_{o} $$
(2)

The count of windows in patent i ($c_{iw}$) depends on the window size m and the count of solitary concepts $c_{i1}$. It can be calculated by means of formula 3, which is quite similar to formula 1:

$$ c_{iw} = c_{i1} + 1 - m $$
(3)

Calculating the number of combined concepts inside a window is a typical problem from the field of combinatorics. Within a window, the solitary concepts are joined into combined concepts without changing their sequence and without repetition. Accordingly, the quantity of combined concepts of size n in a window, $c_{wn}$, can be calculated by means of formula 4:

$$ c_{wn} = \frac{m!}{n! \cdot (m - n)!} $$
(4)
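
As a quick illustration of formula 4, the order-preserving combinations inside one window can be enumerated with Python's standard library and compared with the binomial coefficient. The window contents and the function name are hypothetical and chosen only for this sketch.

```python
from itertools import combinations
from math import comb

def window_combined_concepts(window, n):
    """All combined concepts of size n in one window.

    itertools.combinations keeps the original sequence and never
    repeats an element, which matches the conditions of formula 4.
    """
    return list(combinations(window, n))

window = ["planetary", "gear", "carrier", "shaft"]  # window size m = 4
pairs = window_combined_concepts(window, 2)

print(len(pairs))            # 6
print(comb(len(window), 2))  # 6, i.e. m! / (n! * (m - n)!)
```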

The overlap of identical combined concepts in the windows depends on the count of identical solitary concepts in successive windows ($c_{w1}$), the combined concept size, the window size and the window offset $w_o$ (formula 5):

$$ c_{o} = \frac{c_{w1}!}{n! \cdot (c_{w1} - n)!} \cdot \left( c_{iw} - 1 \right) $$
(5)

with

$$ c_{w1} = m - w_{o} $$
(6)
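
Formulas 2 to 6 can be combined into one small function and compared with a brute-force count. The sketch rests on several assumptions: the window offset $w_o$ is read as the number of solitary concepts by which the window advances and is set to 1 (the case that is consistent with formula 3), all solitary concepts in the text are distinct, and the text holds at least m solitary concepts. The function names are hypothetical.

```python
from itertools import combinations
from math import comb

def max_combined_concepts_windowed(c_i1, n, m, w_o=1):
    """Maximum count of combined concepts of size n with window size m."""
    c_iw = c_i1 + 1 - m               # formula 3: count of windows
    c_wn = comb(m, n)                 # formula 4: combined concepts per window
    c_w1 = m - w_o                    # formula 6: concepts shared by successive windows
    c_o = comb(c_w1, n) * (c_iw - 1)  # formula 5: overlap between windows
    return c_iw * c_wn - c_o          # formula 2

def brute_force_count(c_i1, n, m):
    """Count distinct combined concepts by enumerating every window (offset 1)."""
    positions = range(c_i1)           # stands in for c_i1 distinct solitary concepts
    seen = set()
    for start in range(c_i1 + 1 - m):
        seen.update(combinations(positions[start:start + m], n))
    return len(seen)

print(max_combined_concepts_windowed(10, n=2, m=4))  # 24
print(brute_force_count(10, n=2, m=4))               # 24
```

Under these assumptions the closed form and the enumeration agree. With repeated solitary concepts in the text, the enumerated count of distinct combined concepts can fall below the value given by formula 2.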

About this article

Cite this article

Moehrle, M.G., Gerken, J.M. Measuring textual patent similarity on the basis of combined concepts: design decisions and their consequences. Scientometrics 91, 805–826 (2012). https://doi.org/10.1007/s11192-012-0682-0
