Abstract
The analysis of scientific and technical documents is crucial to establishing science and technology strategies. A common approach is for field experts to manually classify each document into one of several predefined technical categories. However, manual classification is error-prone and expensive, and it requires sustained effort to keep up with frequent data updates. In contrast, machine learning and text mining techniques enable cheaper and faster operations and can alleviate the burden on human resources. In this paper, we propose a method that extracts embedded feature vectors from the text of patent documents using a neural embedding approach and then automatically clusters those vectors using a deep embedding clustering method.
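As a rough illustration of the clustering step, the sketch below shows the soft-assignment and target-distribution computations commonly used in deep embedded clustering. It is not the authors' implementation: the function names, the Student's t kernel with `alpha=1.0`, and the toy two-dimensional "embeddings" are all assumptions made for the example, and the embedding network that would produce real document vectors is omitted.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Soft cluster assignments q[i, j] from a Student's t kernel (assumed form)."""
    # Squared Euclidean distance between every embedding and every centroid: shape (n, k).
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so the assignments for one document sum to 1.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets p[i, j]: square q and divide by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q), the quantity minimized when refining embeddings and centroids."""
    return float(np.sum(p * np.log(p / q)))

# Toy stand-in for document embeddings: two well-separated groups in 2-D.
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])  # hypothetical initial centroids

q = soft_assign(z, centroids)
p = target_distribution(q)
labels = q.argmax(axis=1)  # hard cluster label per document
loss = kl_divergence(p, q)
```

In a full pipeline the centroids would typically be initialized by running k-means on the embedded vectors, and the loss would be back-propagated through the embedding network rather than computed once as here.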
Acknowledgements
This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (No. NRF-2015R1C1A1A01056185 and 2018R1D1A1B07045825).
Additional information
Eunjeong Park and Sungchul Choi are co-corresponding authors.
About this article
Cite this article
Kim, J., Yoon, J., Park, E. et al. Patent document clustering with deep embeddings. Scientometrics 123, 563–577 (2020). https://doi.org/10.1007/s11192-020-03396-7
DOI: https://doi.org/10.1007/s11192-020-03396-7