Abstract
For certain tasks in patent management it makes sense to apply a quantitative measure of textual similarity between patents and/or parts thereof, be it the analysis of freedom to operate, the analysis of technology convergence, or the mapping of patents for strategic purposes. In this paper we outline the process of measuring textual patent similarity on the basis of elements referred to as ‘combined concepts’. We walk through the operations of this process, each of which leads to design decisions, and provide guidance regarding these decisions. By way of two applications from patent management, namely the prioritization of patents and the analysis of convergence between two technological fields, we demonstrate the crucial importance of design decisions for patent analysis results.
Notes
On the levels of root forms of words and simple words the first possible tag that can be applied refers to syntactical class. Syntactical classes can be subdivided into lexical classes and phrasal classes. Verbs, nouns, prepositions and adjectives belong to the lexical classes. Verb phrases, noun phrases and prepositional phrases are part of the phrasal classes, to mention but a few (Collins and Hollo 2010).
On this level or on the level of simple words syntactical function is a second possible tag that can be applied. Syntactical functions point to the grammatical role of a concept within a clause (Collins and Hollo 2010). They represent clause elements such as subject, predicate and object.
Additional information about the selected patents can be found in the “Application for prioritization” section.
If the size of the window exceeds that of the combined concepts, it makes sense to avoid building more combined concepts than necessary. For this reason, the algorithm should initially build all possible combined concepts in the first window. After moving the window to the next position, the algorithm should build only combined concepts between the new solitary concept in the window and the remaining old solitary concepts in the window. This pattern should be adhered to throughout the concept building process.
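The incremental pattern described in this note can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `build_combined_concepts`, a window offset of one solitary concept, and the handling of texts shorter than the window are all assumptions made for the example.

```python
from itertools import combinations


def build_combined_concepts(concepts, n, m):
    """Build combined concepts of size n in a sliding window of size m.

    Assumptions (not taken from the paper): the window moves by one
    solitary concept per step, and a text shorter than the window is
    treated as a single, shorter window.
    """
    if not 1 <= n <= m:
        raise ValueError("concept size n must satisfy 1 <= n <= m")
    # First window: build every possible combined concept once.
    result = list(combinations(concepts[:m], n))
    # Later windows: only combine the newly entered solitary concept
    # with n - 1 of the m - 1 solitary concepts that stayed in the window.
    for end in range(m, len(concepts)):
        old = concepts[end - m + 1:end]
        new = concepts[end]
        result.extend(combo + (new,) for combo in combinations(old, n - 1))
    return result


pairs = build_combined_concepts(["gear", "shaft", "bearing", "housing"], n=2, m=3)
```

Because each step only adds combinations that contain the new solitary concept, a combination that lies fully inside two overlapping windows is built once rather than twice.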
The abovementioned coefficients have already been adopted for different fields of application. For example, Qin (2000) shows the adaptation of the cosine coefficient and the Jaccard coefficient for the comparison of documents. Quite early on, Braam et al. (1988) used these coefficients for co-citation cluster analysis. And Rip and Courtial (1984) described their application in the construction of co-word maps.
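As a rough illustration of the two coefficients named in this note, the following sketch computes their common set-based (binary) forms for two collections of combined concepts. Treating each patent as a plain set is an assumption made for brevity; frequency-weighted variants exist and may be what a given study uses. The function names are hypothetical.

```python
from math import sqrt


def jaccard(a, b):
    """Jaccard coefficient: shared elements divided by all distinct elements."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Binary cosine coefficient: shared elements over the geometric mean of the set sizes."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


patent_a = {("gear", "shaft"), ("shaft", "bearing"), ("bearing", "housing")}
patent_b = {("shaft", "bearing"), ("bearing", "housing"), ("housing", "seal")}

j = jaccard(patent_a, patent_b)
c = cosine(patent_a, patent_b)
```

For sets of equal size the cosine value is never below the Jaccard value, since the shared elements are counted once in the Jaccard denominator but the denominator still grows with every unshared element.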
For detailed information about the FVA: http://www.fva-net.de/.
For example, n-grams can be used for the prediction of the next word in a word chain (see Manning and Schütze 2005), text categorization (see Cavanar and Trenkle 1994), malicious code detection (see Abou-Assaleh et al. 2004) and spam e-mail filtering (see Çıltık and Güngör 2008). They have also been applied in indexing, information retrieval, error correction, text compression, language identification, subject classification and speech recognition (Egghe 2000).
References
Abou-Assaleh, T., Cercone, N., Keselj, V., & Sweidan, R. (2004). N-gram-based detection of new malicious code. In Proceedings of the 28th annual international computer software and applications conference. Hong Kong.
Batagelj, V., & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12(1), 73–90.
Bonino, D., Ciaramella, A., & Corno, F. (2010). Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Patent Information, 32(1), 30–38.
Braam, R. R., Moed, H. F., & van Raan, A. F. J. (1988). Mapping of science: Critical elaboration and new approaches, a case study in agricultural biochemistry. In L. Egghe & R. Rousseau (Eds.), Informetrics 87/88 (pp. 15–28). Amsterdam: Elsevier Science.
Buehl, A. (2010). PASW 18: Einführung in die moderne Datenanalyse (12th ed.). München u.a.: Pearson Studium.
Carley, K. M. (1997). Extracting team mental models through textual analysis. Journal of Organizational Behavior, 18(S1), 533–558.
Cascini, G., & Russo, D. (2007). Computer-aided analysis of patents and search for TRIZ contradictions. International Journal of Product Development, 4(1/2), 52–67.
Cavanar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval. Las Vegas, NV.
Cepela, N., & Danowski, J. A. (2009). Automatic mapping of social networks of political actors from large collections of news stories. International conference on advances in social network analysis and mining. Athens.
Çıltık, A., & Güngör, T. (2008). Time-efficient spam e-mail filtering using n-gram models. Pattern Recognition Letters, 29(1), 19–33.
Collins, P., & Hollo, C. (2010). English grammar: An introduction. Basingstoke u.a.: Palgrave Macmillan.
Corman, S. R., Kuhn, T., McPhee, R. D., & Dooley, K. J. (2002). Studying complex discursive systems. Human Communication Research, 28(2), 157–206.
Curran, C., Bröring, S., & Leker, J. (2010). Anticipating converging industries using publicly available data. Technological Forecasting and Social Change, 77(3), 385–395.
Curran, C., & Leker, J. (2009). Seeing the next iphone coming your way: How to anticipate converging industries. Portland International Conference on Management of Engineering & Technology, 2009. PICMET 2009.
Curran, C., & Leker, J. (2011). Patent indicators for monitoring convergence—Examples from NFF and ICT. Technological Forecasting and Social Change, 78(2), 256–273.
Daga, R., & Pandey, G. (2008). US-Patent application 2008/0162455 A1. Determination of document similarity.
Doerfel, M. L., & Barnett, G. A. (1996). The use of Catpac for text analysis. Field Methods, 8(2), 4–7.
Dressler, A. (2006). Patente in technologieorientierten Mergers & Acquisitions: Nutzen, Prozessmodell, Entwicklung und Interpretation semantischer Patentlandkarten. Wiesbaden: Deutscher Universitäts-Verlag.
Egghe, L. (2000). The distribution of N-grams. Scientometrics, 47(2), 237–252.
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2–3), 285–308.
Gerken, J. M., Walter, L., & Moehrle, M. G. (2010). Semantische Patentlandkarten. Einsatz semantischer Patentlandkarten im Anwendungsfeld der Antriebstechnik—Eine explorative Analyse am Beispiel der Planetengetriebe. Heft Nr. 924 der Forschungsvereinigung Antriebstechnik. Frankfurt/Main: VDMA.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5–48.
Jeong, B., Lee, D., Cho, H., & Lee, J. (2008). A novel method for measuring semantic similarity for XML schema matching. Expert Systems with Applications, 34(3), 1651–1658.
Kangasabai, R., & Pan, H. (2008). US-Patent 7,346,491 B2. Method of text similarity measurement.
Kim, Y. G., Suh, J. H., & Park, S. C. (2008). Visualization of patent analysis for emerging technology. Expert Systems with Applications, 34(3), 1804–1812.
Kondrak, G. (2005). N-gram similarity and distance. Lecture Notes in Computer Science, 3772, 115–126.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2), 259–284.
Lee, S., Yoon, B., & Park, Y. (2009). An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation, 29(6–7), 481–497.
Manning, C. D., & Schütze, H. (2005). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Moehrle, M. G. (2010). Measures for textual patent similarities: A guided way to select appropriate approaches. Scientometrics, 85(1), 95–109.
Moehrle, M. G., & Geritz, A. (2007). Developing acquisition strategies based on patent maps. In T. Khalil & Y. Hosni (Eds.), Management of technology: New directions in technology management (pp. 19–29). Oxford: Elsevier.
Moehrle, M. G., Walter, L., Bergmann, I., Bobe, S., & Skrzipale, S. (2010). Patinformatics as a business process: A guideline through patent research tasks and tools. World Patent Information, 32(4), 291–299.
Moens, M. (2006). Information extraction: Algorithms and prospects in a retrieval context. Dordrecht: Springer.
Peters, H. P. F., & van Raan, A. F. J. (1993). Co-word-based science maps of chemical engineering. Part I: Representations by direct multidimensional scaling. Research Policy, 22(1), 23–45.
Qin, J. (2000). Semantic similarities between a keyword database and a controlled vocabulary database: An investigation in the antibiotic resistance literature. Journal of the American Society for Information Science, 51(2), 166–180.
Ranganathan, A., & Ronen, R. (2008). US-Patent application 2008/0243809 A1. Information-theory based measure of similarity between instances in ontology.
Rip, A., & Courtial, P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.
Ryley, J. F., Saffer, J., & Gibbs, A. (2008). Advanced document retrieval techniques for patent research. World Patent Information, 30(3), 238–243.
Sepkoski, J. J. (1974). Quantified coefficients of association and measurement of similarity. Mathematical Geology, 6(2), 135–152.
Sternitzke, C. (2008). Betriebswirtschaftliche Patentportfoliobewertung: Eine informationswissenschaftliche Perspektive [dissertation]. Bremen: Universität Bremen.
Sternitzke, C., & Bergmann, I. (2009). Similarity measures for document mapping: A comparative study on the level of an individual scientist. Scientometrics, 78(1), 113–130.
Trajtenberg, M. (1990). A penny for your quotes: Patent citations and the value of innovations. The Rand Journal of Economics, 21(1), 172–187.
Trippe, A. J. (2003). Patinformatics: Tasks to tools. World Patent Information, 25(3), 211–221.
Tseng, Y., Lin, C., & Lin, Y. (2007). Text mining techniques for patent analysis. Information Processing and Management, 43(5), 1216–1247.
Tsourikov, V. M., Batchilo, L. S., & Sovpel, I. V. (2000). US-Patent 6,167,370. Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures.
Turney, P. D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167, 491–502.
von Wartburg, I., Teichert, T., & Rost, K. (2005). Inventive progress measured by multi-stage patent citation analysis. Research Policy, 34(10), 1591–1607.
Wanner, L., Baeza-Yates, R., Brügmann, S., Codina, J., Diallo, B., Escorsa, E., et al. (2008). Towards content-oriented patent document processing. World Patent Information, 30(1), 21–33.
Wen, G., Jiang, L., & Shadbolt, N. R. (2006). Ontology-based similarity between text documents on manifold. Lecture Notes in Computer Science, 4185, 113–125.
Yang, Y., Akers, L., Klose, T., & Barcelon Yang, C. (2008). Text mining and visualization tools—Impressions of emerging capabilities. World Patent Information, 30(4), 280–293.
Acknowledgments
The authors wish to give credit to Dr. Peter Roosen, g.o.e.the GbR, Aachen, for his constructive input regarding this paper. One of the included applications is based on the results of a joint project with the Forschungsvereinigung Antriebstechnik (FVA). We would like to thank the FVA and all industrial members, especially Dipl.-Ing. Thomas Bayer, for their contributions and their support.
Appendices
Appendix 1: Calculating the quantity of combined concepts without regard to the window size
The count of combined concepts within a document depends on the concept size: it decreases in proportion to the concept size. Formula 1 relates the size of the text in patent i, measured as the count of solitary concepts (ci1), and the size of a combined concept (n) to the maximum quantity of extracted combined concepts of size n (cwn). The actual count can be even smaller if any combined concepts occur identically in the text.
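The relationship described here can be illustrated with a short Python sketch. It assumes that, without a window, a combined concept is formed from n consecutive solitary concepts (as with word n-grams), so that the maximum count is ci1 − n + 1. That expression is an assumption consistent with the description above, not a quotation of formula 1, and the function names are hypothetical.

```python
def max_combined_concepts(c_i1, n):
    """Upper bound on combined concepts of size n in a text of c_i1 solitary concepts.

    Assumes combined concepts are runs of n consecutive solitary concepts.
    """
    return max(c_i1 - n + 1, 0)


def distinct_combined_concepts(concepts, n):
    """Set of distinct combined concepts; identical occurrences collapse into one."""
    return {tuple(concepts[k:k + n]) for k in range(len(concepts) - n + 1)}


text = ["gear", "shaft", "gear", "shaft", "gear"]
upper_bound = max_combined_concepts(len(text), 2)
distinct = distinct_combined_concepts(text, 2)
```

In this toy text the upper bound is four combined concepts of size two, but only two distinct ones remain because the same pairs recur.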
Appendix 2: Calculating the quantity of combined concepts with regard to the window size
The maximum quantity of combined concepts is influenced by the size of the combined concepts as well as by the count of windows (ciw) into which a patent i can be divided, the count of combined concepts of size n that can be extracted from a window (cwn), and the overlap of identical combined concepts in different windows (co) (formula 2).
The count of windows in patent i (ciw) depends on the window size m and the count of solitary concepts ci1. The total number of windows within a patent can be calculated by means of formula 3, which is quite similar to formula 1.
Calculating the number of combined concepts inside a window is a typical problem from the field of combinatorics. Within the window, the solitary concepts are joined into combined concepts without variation of their sequence and without repetition. Accordingly, the quantity of combined concepts with a concept size of n (cwn) can be calculated by means of formula 4.
The overlap of identical combined concepts in the windows depends on the counts of identical solitary concepts, the combined concept size, the window size and the window offset wo.
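The quantities discussed in this appendix can be sketched in Python under stated assumptions. The sketch assumes a window offset of one solitary concept, so that the window count is ci1 − m + 1, and it reads "without variation of sequence and without repetition" as the binomial coefficient C(m, n). Both readings are assumptions rather than quotations of formulas 3 and 4. The total only removes overlap that arises from the same positions falling into neighbouring windows; identical concepts at different text positions are not accounted for. All function names are hypothetical.

```python
from math import comb


def window_count(c_i1, m):
    """Number of windows of size m over c_i1 solitary concepts (offset 1 assumed)."""
    return max(c_i1 - m + 1, 0)


def concepts_per_window(m, n):
    """Order-preserving selections of n out of m solitary concepts."""
    return comb(m, n)


def max_total_by_position(c_i1, n, m):
    """Position-distinct combined concepts over all windows (requires n >= 1).

    The first window contributes C(m, n); each later window adds only the
    C(m - 1, n - 1) combinations that contain its newly entered concept.
    """
    windows = window_count(c_i1, m)
    if windows == 0:
        return 0
    return comb(m, n) + (windows - 1) * comb(m - 1, n - 1)


windows = window_count(4, 3)
per_window = concepts_per_window(3, 2)
total = max_total_by_position(4, 2, 3)
```

With four solitary concepts, a window of three and a concept size of two, the two windows hold three combinations each; one combination lies in both windows, which leaves five position-distinct combined concepts.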
Cite this article
Moehrle, M.G., Gerken, J.M. Measuring textual patent similarity on the basis of combined concepts: design decisions and their consequences. Scientometrics 91, 805–826 (2012). https://doi.org/10.1007/s11192-012-0682-0
DOI: https://doi.org/10.1007/s11192-012-0682-0
Keywords
- Patent
- Similarity measurement
- Similarity coefficients
- Prior art analysis
- Convergence analysis
- Patent mapping