Measuring textual patent similarity on the basis of combined concepts: design decisions and their consequences

Moehrle, Martin G.; Gerken, Jan M.

doi:10.1007/s11192-012-0682-0

Published: 13 March 2012

Scientometrics, volume 91, pages 805–826 (2012)

Abstract

For certain tasks in patent management it makes sense to apply a quantitative measure of textual similarity between patents and/or parts thereof, for example the analysis of freedom to operate, the analysis of technology convergence, or the mapping of patents for strategic purposes. In this paper we outline the process of measuring textual patent similarity on the basis of elements referred to as ‘combined concepts’. This process involves various operations that call for design decisions, and we provide guidance regarding these decisions. Using two applications from patent management, namely the prioritization of patents and the analysis of convergence between two technological fields, we demonstrate the crucial importance of design decisions for the results of patent analysis.

Notes

On the levels of root forms of words and simple words the first possible tag that can be applied refers to syntactical class. Syntactical classes can be subdivided into lexical classes and phrasal classes. Verbs, nouns, prepositions and adjectives belong to the lexical classes. Verb phrases, noun phrases and prepositional phrases are part of the phrasal classes, to mention but a few (Collins and Hollo 2010).
On this level or on the level of simple words syntactical function is a second possible tag that can be applied. Syntactical functions point to the grammatical role of a concept within a clause (Collins and Hollo 2010). They represent clause elements such as subject, predicate and object.
Additional information about the selected patents can be found in the “Application for prioritization” section.
If the size of the window exceeds that of the combined concepts, it makes sense to avoid building more combined concepts than necessary. For this reason, the algorithm should initially build all possible combined concepts in the first window. After moving the window to the next position, the algorithm should build only combined concepts between the new solitary concept in the window and the remaining old solitary concepts in the window. This pattern should be adhered to throughout the concept building process.
The basic model has already been introduced by different authors, sometimes using different variable names (Sepkoski 1974; Batagelj and Bren 1995).
Corpus-based similarity calculations (e.g. pointwise mutual information and latent semantic analysis) have already been applied to calculate textual similarity between words, sentences, paragraphs or whole texts (e.g. Landauer et al. 1998; Foltz et al. 1998; Turney 2001).
The abovementioned coefficients have already been adopted for different fields of application. For example, Qin (2000) shows the adaptation of the cosine coefficient and the Jaccard coefficient for the comparison of documents. Quite early on, Braam et al. (1988) used these coefficients for co-citation cluster analysis, and Rip and Courtial (1984) described their application in the construction of co-word maps.
For detailed information about the FVA: http://www.fva-net.de/.
For example, n-grams can be used for the prediction of the next word in a word chain (see Manning and Schütze 2005), text categorization (see Cavnar and Trenkle 1994), malicious code detection (see Abou-Assaleh et al. 2004) and spam e-mail filtering (see Çıltık and Güngör 2008). They have also been applied in indexing, information retrieval, error correction, text compression, language identification, subject classification and speech recognition (Egghe 2000).
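
The window procedure described in the fourth note (build all combined concepts in the first window, then only those involving the newly entered solitary concept) can be sketched as follows. This is a minimal Python illustration, not the authors' implementation. The function name `windowed_combined_concepts`, the representation of solitary concepts as a list of strings, a window that moves one concept at a time, and the use of order-preserving tuples as combined concepts are all assumptions made for the example.

```python
from itertools import combinations


def windowed_combined_concepts(concepts, n, m):
    """Collect combined concepts of size n from a sliding window of size m.

    Assumptions (not from the source): `concepts` is a list of solitary
    concepts in text order, the window moves one concept at a time, and a
    combined concept is an order-preserving tuple of n solitary concepts.
    """
    if n > m:
        raise ValueError("concept size n must not exceed window size m")
    found = set()
    if len(concepts) < m:
        return found  # text too short to hold a single full window
    # First window: build every possible combined concept.
    found.update(combinations(concepts[:m], n))
    # Later windows: combine only the newly entered solitary concept with
    # the m - 1 older concepts that are still inside the window.
    for end in range(m, len(concepts)):
        old = concepts[end - m + 1:end]
        for partial in combinations(old, n - 1):
            found.add(partial + (concepts[end],))
    return found


# Hypothetical solitary concepts, invented for illustration.
text = ["gear", "shaft", "planet", "carrier", "ring"]
pairs = windowed_combined_concepts(text, n=2, m=3)
print(len(pairs))  # 7
```

With a window of three, "gear" and "planet" form a combined concept, whereas "gear" and "carrier" never share a window and are therefore not combined.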

References

Abou-Assaleh, T., Cercone, N., Keselj, V., & Sweidan, R. (2004). N-gram-based detection of new malicious code. In Proceedings of the 28th annual international computer software and applications conference. Hong-Kong.
Batagelj, V., & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, 12(1), 73–90.
Bonino, D., Ciaramella, A., & Corno, F. (2010). Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics. World Patent Information, 32(1), 30–38.
Braam, R. R., Moed, H. F., & van Raan, A. F. J. (1988). Mapping of science: Critical elaboration and new approaches, a case study in agricultural biochemistry. In L. Egghe & R. Rousseau (Eds.), Informetrics 87/88 (pp. 15–28). Amsterdam: Elsevier Science.
Buehl, A. (2010). PASW 18: Einführung in die moderne Datenanalyse (12th ed.). München u.a.: Pearson Studium.
Carley, K. M. (1997). Extracting team mental models through textual analysis. Journal of Organizational Behavior, 18(S1), 533–558.
Cascini, G., & Russo, D. (2007). Computer-aided analysis of patents and search for TRIZ contradictions. International Journal of Product Development, 4(1/2), 52–67.
Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval. Las Vegas, NV.
Cepela, N., & Danowski, J. A. (2009). Automatic mapping of social networks of political actors from large collections of news stories. International conference on advances in social network analysis and mining. Athens.
Çıltık, A., & Güngör, T. (2008). Time-efficient spam e-mail filtering using n-gram models. Pattern Recognition Letters, 29(1), 19–33.
Collins, P., & Hollo, C. (2010). English grammar: An introduction. Basingstoke u.a.: Palgrave Macmillan.
Corman, S. R., Kuhn, T., McPhee, R. D., & Dooley, K. J. (2002). Studying complex discursive systems. Human Communication Research, 28(2), 157–206.
Curran, C., Bröring, S., & Leker, J. (2010). Anticipating converging industries using publicly available data. Technological Forecasting and Social Change, 77(3), 385–395.
Curran, C., & Leker, J. (2009). Seeing the next iphone coming your way: How to anticipate converging industries. Portland International Conference on Management of Engineering & Technology, 2009. PICMET 2009.
Curran, C., & Leker, J. (2011). Patent indicators for monitoring convergence—Examples from NFF and ICT. Technological Forecasting and Social Change, 78(2), 256–273.
Daga, R., & Pandey, G. (2008). US-Patent application 2008/0162455 A1. Determination of document similarity.
Doerfel, M. L., & Barnett, G. A. (1996). The use of Catpac for text analysis. Field Methods, 8(2), 4–7.
Dressler, A. (2006). Patente in technologieorientierten Mergers & Acquisitions: Nutzen, Prozessmodell, Entwicklung und Interpretation semantischer Patentlandkarten. Wiesbaden: Deutscher Universitäts-Verlag.
Egghe, L. (2000). The distribution of N-grams. Scientometrics, 47(2), 237–252.
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The measurement of textual coherence with latent semantic analysis. Discourse Processes, 25(2–3), 285–308.
Gerken, J. M., Walter, L., & Moehrle, M. G. (2010). Semantische Patentlandkarten. Einsatz semantischer Patentlandkarten im Anwendungsfeld der Antriebstechnik—Eine explorative Analyse am Beispiel der Planetengetriebe. Heft Nr. 924 der Forschungsvereinigung Antriebstechnik. Frankfurt/Main: VDMA.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1), 5–48.
Jeong, B., Lee, D., Cho, H., & Lee, J. (2008). A novel method for measuring semantic similarity for XML schema matching. Expert Systems with Applications, 34(3), 1651–1658.
Kangasabai, R., & Pan, H. (2008). US-Patent 7,346,491 B2. Method of text similarity measurement.
Kim, Y. G., Suh, J. H., & Park, S. C. (2008). Visualization of patent analysis for emerging technology. Expert Systems with Applications, 34(3), 1804–1812.
Kondrak, G. (2005). N-gram similarity and distance. Lecture Notes in Computer Science, 3772, 115–126.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2), 259–284.
Lee, S., Yoon, B., & Park, Y. (2009). An approach to discovering new technology opportunities: Keyword-based patent map approach. Technovation, 29(6–7), 481–497.
Manning, C. D., & Schütze, H. (2005). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Moehrle, M. G. (2010). Measures for textual patent similarities: A guided way to select appropriate approaches. Scientometrics, 85(1), 95–109.
Moehrle, M. G., & Geritz, A. (2007). Developing acquisition strategies based on patent maps. In T. Khalil & Y. Hosni (Eds.), Management of technology: New directions in technology management (pp. 19–29). Oxford: Elsevier.
Moehrle, M. G., Walter, L., Bergmann, I., Bobe, S., & Skrzipale, S. (2010). Patinformatics as a business process: A guideline through patent research tasks and tools. World Patent Information, 32(4), 291–299.
Moens, M. (2006). Information extraction: Algorithms and prospects in a retrieval context. Dordrecht: Springer.
Peters, H. P. F., & van Raan, A. F. J. (1993). Co-word-based science maps of chemical engineering. Part I: Representations by direct multidimensional scaling. Research Policy, 22(1), 23–45.
Qin, J. (2000). Semantic similarities between a keyword database and a controlled vocabulary database: An investigation in the antibiotic resistance literature. Journal of the American Society for Information Science, 51(2), 166–180.
Ranganathan, A., & Ronen, R. (2008). US-Patent application 2008/0243809 A1. Information-theory based measure of similarity between instances in ontology.
Rip, A., & Courtial, P. (1984). Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6), 381–400.
Ryley, J. F., Saffer, J., & Gibbs, A. (2008). Advanced document retrieval techniques for patent research. World Patent Information, 30(3), 238–243.
Sepkoski, J. J. (1974). Quantified coefficients of association and measurement of similarity. Mathematical Geology, 6(2), 135–152.
Sternitzke, C. (2008). Betriebswirtschaftliche Patentportfoliobewertung: Eine informationswissenschaftliche Perspektive [dissertation]. Bremen: Universität Bremen.
Sternitzke, C., & Bergmann, I. (2009). Similarity measures for document mapping: A comparative study on the level of an individual scientist. Scientometrics, 78(1), 113–130.
Trajtenberg, M. (1990). A penny for your quotes: Patent citations and the value of innovations. The Rand Journal of Economics, 21(1), 172–187.
Trippe, A. J. (2003). Patinformatics: Tasks to tools. World Patent Information, 25(3), 211–221.
Tseng, Y., Lin, C., & Lin, Y. (2007). Text mining techniques for patent analysis. Information Processing and Management, 43(5), 1216–1247.
Tsourikov, V. M., Batchilo, L. S., & Sovpel, I. V. (2000). US-Patent 6,167,370. Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures.
Turney, P. D. (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, 2167, 491–502.
von Wartburg, I., Teichert, T., & Rost, K. (2005). Inventive progress measured by multi-stage patent citation analysis. Research Policy, 34(10), 1591–1607.
Wanner, L., Baeza-Yates, R., Brügmann, S., Codina, J., Diallo, B., Escorsa, E., et al. (2008). Towards content-oriented patent document processing. World Patent Information, 30(1), 21–33.
Wen, G., Jiang, L., & Shadbolt, N. R. (2006). Ontology-based similarity between text documents on manifold. Lecture Notes in Computer Science, 4185, 113–125.
Yang, Y., Akers, L., Klose, T., & Barcelon Yang, C. (2008). Text mining and visualization tools—Impressions of emerging capabilities. World Patent Information, 30(4), 280–293.

Acknowledgments

The authors wish to give credit to Dr. Peter Roosen, g.o.e.the GbR, Aachen, for his constructive input regarding this paper. One of the included applications is based on the results of a joint project with the Forschungsvereinigung Antriebstechnik (FVA). We would like to thank the FVA and all industrial members, especially Dipl.-Ing. Thomas Bayer, for their contributions and their support.

Author information

Authors and Affiliations

IPMI – Institute of Project Management and Innovation, University of Bremen, Wilhelm-Herbst-Str. 12, 28359, Bremen, Germany
Martin G. Moehrle & Jan M. Gerken

Corresponding author

Correspondence to Martin G. Moehrle.

Appendices

Appendix 1: Calculating the quantity of combined concepts without regard to the window size

The count of combined concepts within a document depends on the concept size: it decreases linearly as the concept size grows. The relationship between the size of the text in patent i, measured as the count of solitary concepts (c_i1), the size of a combined concept (n) and the maximum quantity of extracted combined concepts of size n (c_in) is given by formula 1. The actual count can be even smaller if any combined concepts occur more than once in the text.

$$ c_{in} = c_{i1} + 1 - n $$

(1)
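
As a quick check of formula 1, the following Python sketch counts combined concepts built from n adjacent solitary concepts and compares the count with c_i1 + 1 − n. The function names, the sample concept list, the clamping of negative values to zero, and the reading that "without regard to the window size" means combining directly adjacent solitary concepts are assumptions made for illustration.

```python
def max_combined_concepts(c_i1, n):
    """Formula 1: upper bound on combined concepts of size n in patent i.

    Clamping to zero for texts shorter than n is an added convenience,
    not part of the formula.
    """
    return max(c_i1 + 1 - n, 0)


def consecutive_combined_concepts(concepts, n):
    """Build combined concepts from n adjacent solitary concepts (assumed reading)."""
    return [tuple(concepts[k:k + n]) for k in range(len(concepts) + 1 - n)]


# Hypothetical solitary concepts, invented for illustration.
concepts = ["planetary", "gear", "drive", "planetary", "gear", "housing"]
built = consecutive_combined_concepts(concepts, 2)

print(len(built), max_combined_concepts(len(concepts), 2))  # 5 5
print(len(set(built)))  # 4, because ("planetary", "gear") occurs twice
```

The second printed value illustrates the closing remark above: identical combined concepts reduce the count of distinct combined concepts below the maximum.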

Appendix 2: Calculating the quantity of combined concepts with regard to the window size

The maximum quantity of combined concepts is influenced by the size of the combined concepts as well as by the count of windows (c_iw) into which a patent i can be divided, the count of combined concepts of size n that can be extracted from a window (c_wn) and the overlap of identical combined concepts in different windows (c_o), as expressed in formula 2:

$$ c_{in} = c_{iw} \cdot c_{wn} - c_{o} $$

(2)

The count of windows in patent i (c_iw) depends on the window size m and the count of solitary concepts c_i1. The total number of windows within a patent can be calculated by means of formula 3, which is quite similar to formula 1:

$$ c_{iw} = c_{i1} + 1 - m $$

(3)

Calculating the number of combined concepts inside a window is a typical problem from the field of combinatorics. Within the window, the solitary concepts are connected to combined concepts without variation of their sequence and without repetition. Accordingly, the quantity of combined concepts with a concept size of n (c_wn) can be calculated by means of formula 4:

$$ c_{wn} = \frac{m!}{n! \cdot (m - n)!} $$

(4)
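
Formula 4 is the binomial coefficient "m choose n". A short Python sketch, assuming a window of m distinct solitary concepts (the sample words are invented for illustration), enumerates the order-preserving selections and confirms the count:

```python
from itertools import combinations
from math import comb

window = ["sun", "planet", "ring", "carrier"]  # m = 4, hypothetical concepts
n = 2

# combinations() keeps the text order and never repeats an element,
# matching "without variation of their sequence and without repetition".
in_window = list(combinations(window, n))

print(len(in_window), comb(len(window), n))  # 6 6
```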

The overlap of identical combined concepts in the windows depends on the count of identical solitary concepts, the combined concept size, the window size and the window offset w_o, and is given by formulas 5 and 6:

$$ c_{o} = \frac{c_{w1}!}{n! \cdot (c_{w1} - n)!} \cdot \left( c_{iw} - 1 \right) $$

(5)

with

$$ c_{w1} = m - w_{o} $$

(6)
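
Formulas 2 to 6 can be evaluated together and compared with a direct enumeration. The Python sketch below is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, c_w1 is read as the number of solitary concepts shared by two adjacent windows, and the enumeration assumes a window offset of w_o = 1 (the case in which formula 3 gives the number of windows) and a text in which all solitary concepts are distinct.

```python
from itertools import combinations
from math import comb


def max_combined_concepts_windowed(c_i1, n, m, w_o=1):
    """Evaluate formulas 2-6 for a patent with c_i1 solitary concepts."""
    if n > m or m > c_i1:
        raise ValueError("expected n <= m <= c_i1")
    c_iw = c_i1 + 1 - m               # formula 3: number of windows
    c_wn = comb(m, n)                 # formula 4: combined concepts per window
    c_w1 = m - w_o                    # formula 6: concepts shared by adjacent windows
    c_o = comb(c_w1, n) * (c_iw - 1)  # formula 5: overlap between windows
    return c_iw * c_wn - c_o          # formula 2


def brute_force(c_i1, n, m):
    """Count distinct combined concepts by enumerating every window (offset 1)."""
    positions = range(c_i1)           # treat every solitary concept as unique
    seen = set()
    for start in range(c_i1 + 1 - m):
        seen.update(combinations(positions[start:start + m], n))
    return len(seen)


print(max_combined_concepts_windowed(10, 2, 4), brute_force(10, 2, 4))  # 24 24
```

For ten solitary concepts, a window of four and pairs as combined concepts, there are 7 windows with 6 pairs each and an overlap of 18, which leaves 24 distinct combined concepts.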

About this article

Cite this article

Moehrle, M.G., Gerken, J.M. Measuring textual patent similarity on the basis of combined concepts: design decisions and their consequences. Scientometrics 91, 805–826 (2012). https://doi.org/10.1007/s11192-012-0682-0

Received: 29 August 2011
Published: 13 March 2012
Issue Date: June 2012
DOI: https://doi.org/10.1007/s11192-012-0682-0

Mathematics Subject Classification (2000)

68U15
