A deep learning based method for extracting semantic information from patent documents

Scientometrics

Abstract

Text-based patent analysis is grounded in information extraction techniques. However, existing techniques suffer from obvious defects, such as a low degree of automation and unsatisfactory extraction accuracy. To address these problems, we first pre-define an information schema containing 17 types of entities and 15 types of semantic relations, then annotate a dataset of 1010 patent abstracts and release it freely to the research community. Next, we propose a novel patent information extraction framework in which two deep-learning models, BiLSTM-CRF and BiGRU-HAN, are used for entity identification and semantic relation extraction, respectively. Finally, to demonstrate the advantages of the new framework, we conduct extensive experiments, taking the SAO method as the framework-level baseline and the PCNNs model as the module-level baseline. Experimental results show that our framework outperforms the traditional one in terms of automation and accuracy, and is capable of extracting fine-grained structured information from patent texts.
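To make the two-stage framework concrete, the following minimal Python sketch shows how the output of an entity tagger could be handed to a relation classifier. It is not the authors' implementation: the BIO tagging scheme, the entity labels `COMP` and `MAT`, and the function names `decode_bio` and `candidate_pairs` are assumptions introduced for illustration only.

```python
from itertools import permutations
from typing import List, Optional, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]


def decode_bio(tags: List[str]) -> List[Span]:
    """Collapse a BIO tag sequence (assumed scheme) into typed entity spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    etype: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def candidate_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Enumerate ordered entity pairs that a relation classifier would score."""
    return list(permutations(spans, 2))


# Hypothetical tagger output for a four-token sentence.
tags = ["B-COMP", "I-COMP", "O", "B-MAT"]
entities = decode_bio(tags)
pairs = candidate_pairs(entities)
```

Ordered pairs are enumerated here on the assumption that relations are directional, so that (head, tail) and (tail, head) are classified separately.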

Notes

  1. https://github.com/awesome-patent-mining/TFH_Annotated_Dataset.

  2. https://radimrehurek.com/gensim/.

Acknowledgements

This research received financial support from the National Natural Science Foundation of China under Grant Number 71704169 and the Social Science Foundation of Beijing Municipality under Grant Number 17GLB074. Our gratitude also goes to the anonymous reviewers for their valuable suggestions and comments.

Author information

Corresponding author

Correspondence to Shuo Xu.

Appendix

There are two types of errors in entity identification: (1) errors in entity boundary detection and (2) errors in entity type classification. An ordinary confusion matrix records only the second type. To capture the first type as well, an extra column is appended to the confusion matrix in Table 8, where rows indicate true entity types, columns indicate predicted ones, and the last column (ebd) counts the errors in boundary detection.

Table 8 The confusion matrix of entity identification
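The layout of such an extended confusion matrix can be sketched in Python as follows. This is a minimal illustration rather than the authors' evaluation script: the function name `confusion_with_ebd`, the span representation, and the rule that a gold entity counts as a boundary error whenever no predicted span has exactly the same boundaries are all assumptions. Spurious predictions that match no gold entity are not counted in this sketch.

```python
from typing import List, Tuple

# (start token index, end token index, entity type)
Span = Tuple[int, int, str]


def confusion_with_ebd(
    gold: List[Span], pred: List[Span], types: List[str]
) -> List[List[int]]:
    """Rows are true types; columns are predicted types plus a final 'ebd' column."""
    col = {t: j for j, t in enumerate(types)}
    ebd = len(types)  # index of the extra boundary-error column
    matrix = [[0] * (len(types) + 1) for _ in types]
    # Index predictions by their boundaries so types can be compared directly.
    pred_by_bounds = {(s, e): t for s, e, t in pred}
    for s, e, true_type in gold:
        pred_type = pred_by_bounds.get((s, e))
        if pred_type is None:
            # Assumption: no exact boundary match means a boundary-detection error.
            matrix[col[true_type]][ebd] += 1
        else:
            matrix[col[true_type]][col[pred_type]] += 1
    return matrix


# Hypothetical entity types and spans, for illustration only.
types = ["A", "B"]
gold = [(0, 2, "A"), (3, 4, "B"), (5, 7, "A")]
pred = [(0, 2, "A"), (3, 4, "A"), (5, 6, "A")]
matrix = confusion_with_ebd(gold, pred, types)
```

In this toy input, the first gold entity is correct, the second has the right boundaries but the wrong type, and the third is predicted with the wrong boundaries, so it lands in the ebd column of row A.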

Cite this article

Chen, L., Xu, S., Zhu, L. et al. A deep learning based method for extracting semantic information from patent documents. Scientometrics 125, 289–312 (2020). https://doi.org/10.1007/s11192-020-03634-y
