Abstract
This study presents the first systematic disambiguation of Chinese patent inventors in the patent database of the State Intellectual Property Office of China (SIPO) from 1985 to 2016. Using a list of 66,248 inventors with rare names and hand-labeled data on 1465 inventors, our supervised learning algorithm identified 3.99 million unique inventors from 1.84 million Chinese names referring to 14.68 million patent-inventor records. We developed a method for constructing high-quality training data from a third-party rare name list and provide evidence of its reliability for settings where large-scale, representative hand-labeled data is crucial but expensive to obtain. To optimize clustering results on a large-scale dataset with a highly unbalanced distribution, we also modified robust single linkage by constraining the maximum distance within the generated clusters. Across different training and testing data and clustering parameters, our algorithm yielded F1 scores of up to 93.36% before clustering and 99.10% after clustering, with final splitting errors of 1.05–1.34% and lumping errors of 0.21–0.83%. We also applied this framework to standardizing applicants’ names according to their text similarity and geographical information, based on high-resolution geocoding data for all addresses within mainland China.
Notes
In our rare name dataset, only 0.2–0.25% of records have the synonym problem, and we expect the rate in the whole dataset to be below 0.2%, because rare names are more prone to being misspelled or miswritten and common names constitute a larger portion of the whole dataset.
Korean and Japanese names can be distinguished from Chinese ones if they are written in English. For names written in Chinese, however, distinguishing them by the names’ properties alone is difficult when no extra information is available.
We filtered the rare name list using the web search service offered by Guozhengtong (http://zhaoren.idtag.cn/samename/searchName!searchIndex.htm), although the service is currently suspended. Similar information is provided by websites such as http://www.sosuo.name/tong/. Some local governments, such as the Guangdong Provincial Government, also provide access to name-citizen ID counts at the provincial level, which can be used for preliminary filtering of non-rare names.
We compared results with and without this adjustment. Tests show that, without the adjustment, the F1 score of most classification models falls below 90%, and in some cases below 50%. Without the adjustment, even the best classifier yields a lumping error higher than all results obtained with it. In some models (e.g., Naïve Bayes and Random Forest), omitting the adjustment can produce lower splitting errors but results in significantly higher lumping errors.
Other popular models like K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Multi-Layer Perceptron [MLP, utilized by Tran et al. (2014)] were abandoned due to their long training time and insignificant performance gains. Our practice confirms the view of previous studies that classifiers like SVM and instance-based algorithms like KNN are inappropriate for disambiguation involving large-scale pairwise comparisons with high computational complexity.
While Wishart’s RSL takes an approach similar to density-based methods, it differs from the latter in its definition of connectivity and in that it assigns every point to a particular cluster rather than leaving noisy points unassigned. In this sense, RSL is more suitable than many density-based approaches for settings in which every inventor-patent record is meaningful and should therefore have a cluster label.
For details about RSL’s algorithms and parameters, please refer to Campello et al. (2013), Chaudhuri and Dasgupta (2010), Chaudhuri et al. (2014), and the hdbscan documentation at https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html.
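As a rough illustration of the quantities these references define, the following Python sketch computes core distances and mutual reachability distances from a precomputed distance matrix and then cuts the single-linkage hierarchy at a threshold τ. It is a simplified sketch, not the hdbscan implementation or the authors’ code: the function names and toy data are hypothetical, the α parameter of RSL is fixed at 1, the core distance is taken as the distance to the k-th nearest neighbour excluding the point itself, and every point receives a label.

```python
from itertools import combinations


def core_distances(dist, k):
    """Distance from each point to its k-th nearest neighbour (self excluded)."""
    n = len(dist)
    return [sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]


def mutual_reachability(dist, k):
    """d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)), with alpha assumed to be 1."""
    core = core_distances(dist, k)
    n = len(dist)
    return [[0.0 if i == j else max(core[i], core[j], dist[i][j]) for j in range(n)]
            for i in range(n)]


def cut_single_linkage(mreach, tau):
    """Label the connected components of the graph whose edges satisfy mreach <= tau."""
    n = len(mreach)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if mreach[i][j] <= tau:
            parent[find(i)] = find(j)

    roots = {}
    # Number the clusters in order of first appearance.
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy data (hypothetical): two well-separated groups of points on a line.
points = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
dist = [[abs(p - q) for q in points] for p in points]
labels_loose = cut_single_linkage(mutual_reachability(dist, k=2), tau=0.25)
labels_tight = cut_single_linkage(mutual_reachability(dist, k=2), tau=0.05)
```

With τ = 0.25 the two groups are recovered, while τ = 0.05 leaves every record in its own cluster, which shows how strongly the result depends on the threshold.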
Many points are very similar to a few points within other clusters although the majority of their nearest neighbors are from their own cluster (Gupta 2011).
Dynamically enlarging the threshold k used for computing τ based on block size may also be a good option. However, on the one hand, a larger k means more records would be treated as noise and the lumping error would increase. On the other hand, when k is adjusted to the block size, the mutual reachability distance τ should also be changed accordingly to find the best combination. Searching for the best combination of both thresholds at the same time would be challenging or even infeasible in practice, because training data on large blocks is extremely difficult to obtain.
For RSL, the F1 score as a function of τ has a skewed inverted-U shape: it peaks at τ = 0.15 and then drops dramatically. We show the result for τ = 0.5 only for comparison.
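By contrast, capping the maximum distance within a cluster adds a single extra parameter. The Python sketch below shows one way such a cap could be enforced during single-linkage merging: edges are processed in ascending order, and a merge is refused when it would create a cluster whose largest pairwise distance exceeds the cap. This is an assumption made for illustration, not the authors’ procedure; the function name, the `max_diameter` parameter, and the toy data are hypothetical.

```python
from itertools import combinations


def constrained_single_linkage(dist, tau, max_diameter):
    """Single linkage cut at tau, refusing merges that exceed max_diameter."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    label = list(range(n))                 # point index -> cluster id
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))

    for d, i, j in edges:
        if d > tau:
            break  # all remaining edges are longer than the linkage threshold
        a, b = label[i], label[j]
        if a == b:
            continue
        # Both clusters already respect the cap, so only cross-cluster
        # distances can push the merged diameter above it.
        cross = max(dist[p][q] for p in clusters[a] for q in clusters[b])
        if cross > max_diameter:
            continue
        for q in clusters[b]:
            label[q] = a
        clusters[a].extend(clusters.pop(b))

    remap = {}
    # Renumber the clusters in order of first appearance.
    return [remap.setdefault(c, len(remap)) for c in label]


# Toy chain (hypothetical): neighbouring points are close, the two ends are far apart.
points = [0, 1, 2, 3, 4]
dist = [[abs(p - q) for q in points] for p in points]
chained = constrained_single_linkage(dist, tau=1, max_diameter=float("inf"))
capped = constrained_single_linkage(dist, tau=1, max_diameter=2)
```

Without the cap, the chain collapses into one cluster; with `max_diameter=2` it is split into two, which is the kind of chaining-induced lumping a diameter constraint is meant to limit.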
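To make the evaluated quantities concrete, the Python sketch below computes pairwise precision, recall, and F1 together with splitting and lumping errors from true and predicted cluster labels. It assumes the pairwise definitions common in the inventor disambiguation literature (wrongly split or wrongly lumped record pairs divided by the number of true same-inventor pairs); this section does not spell out the formulas, so the normalization is an assumption, and the function name and toy labels are hypothetical.

```python
from itertools import combinations


def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision/recall/F1 plus splitting and lumping errors."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # correctly matched pair
        elif same_pred:
            fp += 1  # lumped: different inventors placed in one cluster
        elif same_true:
            fn += 1  # split: one inventor spread over several clusters
    true_pairs = tp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / true_pairs if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Assumed normalization: both errors are divided by the true pair count.
        "splitting": fn / true_pairs if true_pairs else 0.0,
        "lumping": fp / true_pairs if true_pairs else 0.0,
    }


# Toy labels (hypothetical): record 2 is split from "a" and lumped with "b".
scores = pairwise_scores(["a", "a", "a", "b", "b"], [0, 0, 1, 1, 1])
```

Sweeping τ and recomputing these scores at each value is one way to trace a curve of the kind described in this note.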
We sincerely thank the reviewers for recommending this method of statistical comparison.
References
Balcan, M.-F., Liang, Y., & Gupta, P. (2014). Robust hierarchical clustering. Journal of Machine Learning Research. Retrieved from https://arxiv.org/abs/1401.0247.
Balsmeier, B., Chavosh, A., Li, G. C., Fierro, G., Johnson, K., Kaulagi, A., et al. (2015). Automated disambiguation of US patent grants and applications. Fung Institute for Engineering Leadership Unpublished Working Paper.
Boeing, P., Mueller, E., & Sandner, P. (2016). China’s R&D explosion—Analyzing productivity effects across ownership types and over time. Research Policy,45, 159–176.
Campello, R. J. G. B., Moulavi, D., & Sander, J. (2013). Density-based clustering based on hierarchical density estimates. In J. Pei, V. S. Tseng, L. Cao, H. Motoda, & G. Xu (Eds.), Advances in knowledge discovery and data mining (pp. 160–172). Berlin: Springer.
Cassi, L., & Carayol, N. (2009). Who’s who in patents. A Bayesian approach. Retrieved July 7, 2009, from https://hal-paris1.archives-ouvertes.fr/hal-00631750/document.
Chaudhuri, K., & Dasgupta, S. (2010). Rates of convergence for the cluster tree. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems 23 (pp. 343–351). Red Hook: Curran Associates Inc.
Chaudhuri, K., Dasgupta, S., Kpotufe, S., & von Luxburg, U. (2014). Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory,60, 7900–7912.
Chin, W.-S., Zhuang, Y., Juan, Y.-C., Wu, F., Tung, H.-Y., Yu, T., et al. (2014). Effective string processing and matching for author disambiguation. The Journal of Machine Learning Research,15, 3037–3064.
Cuxac, P., Lamirel, J.-C., & Bonvallot, V. (2013). Efficient supervised and semi-supervised approaches for affiliations disambiguation. Scientometrics,97, 47–58.
Dang, J., & Motohashi, K. (2015). Patent statistics: A good indicator for innovation in China? Patent subsidy program impacts on patent quality. China Economic Review. https://doi.org/10.1016/j.chieco.2015.03.012.
Davidson, I., & Ravi, S. S. (2005). Agglomerative hierarchical clustering with constraints: Theoretical and empirical results. In A. M. Jorge, L. Torgo, P. Brazdil, R. Camacho, & J. Gama (Eds.), Knowledge discovery in databases: PKDD 2005 (pp. 59–70). Berlin: Springer.
Dehman, A. (2015). Spatial clustering of linkage disequilibrium blocks for genome-wide association studies (Ph.D. thesis). Université d’Evry Val d’Essonne; Université Paris-Saclay; Laboratoire de Mathématiques et Modélisation d’Evry. Retrieved September 21, 2018, from https://tel.archives-ouvertes.fr/tel-01288568/document.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7(Jan), 1–30.
Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster analysis. Hoboken: Wiley.
Fan, X., Wang, J., Pu, X., Zhou, L., & Lv, B. (2011). On graph-based name disambiguation. Journal of Data and Information Quality,2, 10:1–10:23.
Fegley, B. D., & Torvik, V. I. (2013). Has large-scale named-entity network analysis been resting on a flawed assumption? PLoS ONE,8, e70299.
Ferreira, A. A., Gonçalves, M. A., & Laender, A. H. F. (2012). A brief survey of automatic methods for author name disambiguation. ACM SIGMOD Record,41, 15–26.
Fleming, L., King, C., & Juda, A. I. (2007). Small worlds and regional innovation. Organization Science,18, 938–954.
Gagolewski, M., Bartoszuk, M., & Cena, A. (2016). Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences,363, 8–23.
Giles, C. L., Zha, H., & Han, H. (2005). Name disambiguation in author citations using a K-way spectral clustering method. In Proceedings of the 5th ACM/IEEE-CS joint conference on digital libraries (JCDL’05) (pp. 334–343).
Gupta, P. (2011). Robust clustering algorithms (Master Thesis). Georgia Institute of Technology.
Han, H., Yao, C., Fu, Y., Yu, Y., Zhang, Y., & Xu, S. (2017). Semantic fingerprints-based author name disambiguation in Chinese documents. Scientometrics,111, 1879–1896.
Hartigan, J. A. (1975). Clustering algorithms (99th ed.). New York: Wiley.
Hartigan, J. A. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374), 388–394.
He, Z.-L., Tong, T. W., Zhang, Y., & He, W. (2018). A database linking Chinese patents to China’s census firms. Scientific Data,5, 180042.
Hu, A. G. Z., Zhang, P., & Zhao, L. (2017). China as number one? Evidence from China’s most recent patenting surge. Journal of Development Economics,124, 107–119.
Huang, J., Ertekin, S., & Giles, C. L. (2006). Efficient name disambiguation for large-scale databases. In Knowledge discovery in databases: PKDD 2006 (pp. 536–544). Berlin: Springer.
Hussain, I., & Asghar, S. (2017). A survey of author name disambiguation techniques: 2010–2016. The Knowledge Engineering Review. https://doi.org/10.1017/S0269888917000182.
Ikeuchi, K., Motohashi, K., Tamura, R., & Tsukada, N. (2017). Measuring science intensity of industry using linked dataset of science, technology and industry. RIETI Discussion Paper Series, 17-E-056.
Jones, B. F. (2009). The burden of knowledge and the “death of the renaissance man”: Is innovation getting harder? The Review of Economic Studies, 76(1), 283–317.
Karami, A., & Johansson, R. (2014). Choosing DBSCAN parameters automatically using differential evolution. International Journal of Computer Applications,91, 1–11.
Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: An introduction to cluster analysis. Hoboken: Wiley.
Khabsa, M., Treeratpituk, P., & Giles, C. L. (2014). Large scale author name disambiguation in digital libraries. In 2014 IEEE international conference on big data (pp. 41–42).
Kim, K., Khabsa, M., & Giles, C. L. (2016). Inventor name disambiguation for a patent database using a random forest and DBSCAN. In 2016 IEEE/ACM joint conference on digital libraries (JCDL) (pp. 269–270).
Kriegel, H.-P., Kröger, P., Sander, J., & Zimek, A. (2011). Density-based clustering. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1, 231–240.
Lai, R., D’Amour, A., & Fleming, L. (2009). The careers and co-authorship networks of U.S. patent-holders, since 1975. Retrieved January 1, 2018, from https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12367.
Li, G.-C., Lai, R., D’Amour, A., Doolin, D. M., Sun, Y., Torvik, V. I., et al. (2014). Disambiguation and co-authorship networks of the U.S. patent inventor database (1975–2010). Research Policy,43, 941–955.
Liu, W., Islamaj Doğan, R., Kim, S., Comeau, D. C., Kim, W., Yeganova, L., et al. (2014). Author name disambiguation for PubMed. Journal of the Association for Information Science and Technology,65, 765–781.
Louppe, G., Al-Natsheh, H. T., Susik, M., & Maguire, E. J. (2016). Ethnicity sensitive author disambiguation using semi-supervised Learning. In Presented at the international conference on knowledge engineering and the semantic web (pp. 272–287). Cham: Springer.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. New York, NY: Cambridge University Press.
Monath, N., & McCallum, A. (2015). Discriminative hierarchical coreference for inventor disambiguation. In Presentation. Presented at the patentsview inventor disambiguation technical workshop.
Morrison, G., Riccaboni, M., & Pammolli, F. (2017). Disambiguation of patent inventors and assignees using high-resolution geolocation data. Scientific Data. https://doi.org/10.1038/sdata.2017.64.
Motohashi, K. (2008). Assessment of technological capability in science industry linkage in China by patent database. World Patent Information,30, 225–232.
Müller, M.-C. (2017). Semantic author name disambiguation with word embeddings. In Research and advanced technology for digital libraries (pp. 300–311). Cham: Springer.
On, B.-W., Lee, I., & Lee, D. (2012). Scalable clustering methods for the name disambiguation problem. Knowledge and Information Systems,31, 129–151.
Pezzoni, M., Lissoni, F., & Tarasconi, G. (2014). How to kill inventors: Testing the Massacrator© algorithm for inventor disambiguation. Scientometrics,101, 477–504.
Raffo, J., & Lhuillery, S. (2009). How to play the “Names Game”: Patent retrieval comparing different heuristics. Research Policy,38, 1617–1627.
Shin, D., Kim, T., Choi, J., & Kim, J. (2014). Author name disambiguation using a graph model with node splitting and merging based on bibliographic information. Scientometrics,100, 15–50.
Tang, L., & Walsh, J. P. (2010). Bibliometric fingerprints: Name disambiguation based on approximate structure equivalence of cognitive maps. Scientometrics,84, 763–784.
Torvik, V. I., & Smalheiser, N. R. (2009). Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data (TKDD),3(3), 11.
Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A probabilistic similarity metric for Medline records: A model for author name disambiguation. Journal of the American Society for Information Science and Technology,56, 140–158.
Trajtenberg, M., Shiff, G., & Melamed, R. (2006). The “Names Game”: Harnessing Inventors’ Patent Data for Economic Research (Working Paper No. 12479). National Bureau of Economic Research. Retrieved January 4, 2018, from http://www.nber.org/papers/w12479.
Tran, H. N., Huynh, T., & Do, T. (2014). Author name disambiguation by using deep neural network. In N. T. Nguyen, B. Attachoo, B. Trawiński, & K. Somboonviwat (Eds.), Intelligent information and database systems (pp. 123–132). Berlin: Springer.
Treeratpituk, P., & Giles, C. L. (2009). Disambiguating authors in academic publications using random forests. In Proceedings of the 9th ACM/IEEE-CS joint conference on digital libraries (pp. 39–48). New York, NY, USA: ACM.
Ventura, S. L., Nugent, R., & Fuchs, E. R. H. (2015). Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records. Research Policy,44, 1672–1701.
Wang, J., Berzins, K., Hicks, D., Melkers, J., Xiao, F., & Pinheiro, D. (2012). A boosted-trees method for name disambiguation. Scientometrics,93, 391–411.
Wishart, D. (1969). Mode analysis: A generalization of nearest neighbor which reduces chaining effects. In Numerical taxonomy (pp. 282–311). London: Academic Press.
Zhang, B., & Hasan, M. A. (2017). Name disambiguation in anonymized graphs using network embedding. Retrieved from http://arxiv.org/abs/1702.02287.
Zhang, G., Guan, J., & Liu, X. (2014). The impact of small world on patent productivity in China. Scientometrics,98, 945–960.
Zhao, Y., Karypis, G., & Fayyad, U. (2005). Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery,10, 141–168.
Acknowledgements
This work is mainly supported by the Research Institute of Economy, Trade and Industry (RIETI) under the project Empirical Analysis of Innovation Ecosystems in Advancement of the Internet of Things (IoT), the National Natural Science Foundation of China (NSFC, Nos. 71704025; 71503123), and the Scientific Cooperation Program between NSFC and the Japan Society for the Promotion of Science (No. 71711540044). We also appreciate the editors’ diligent work as well as the insightful and inspiring comments from two anonymous reviewers, Dr. Kenta Ikeuchi, and Mr. Zhao An.
Appendix: Standardization of applicants’ names
Applicant name standardization is an obstacle that nearly all researchers encounter when they use Chinese patent data or link it with external data sources. Since this is also a disambiguation task, our framework for disambiguating inventors can easily be transplanted to the harmonization of applicants’ names. However, instead of the common name problem, the harmonization of applicant names has to deal with the synonym problem: its goal is to cover name variants, changes of firm names, and typos of the same applicant as accurately as possible.
We harmonized Chinese applicant names according to their characteristics: a typical Chinese firm name is usually composed of four parts, namely province/city + name stem + industry + type (e.g., 深圳/TCL/数字技术/有限公司). While relying purely on the name or on geocoding could go wrong, high similarity in both dimensions indicates a higher probability of a match. After comparing rule-based and supervised learning methods for standardizing firm and university names, we found that the rule-based approach is more appropriate because of its simplicity and because empirical studies usually do not require such high accuracy, although the supervised learning method performs slightly better. The steps are as follows:
1. Preprocessing: remove spaces, brackets, and name suffixes such as “股份有限公司”, “有限公司”, “公司”, “所”, “院”, “中心”, as well as firms’ prefixes like “省”, “市”, “北京”, “(北京)”, “深圳”. Keep the address prefix of universities and research institutes, such as “北京大学” and “浙江大学”, because for this type of applicant the address prefix is the sole identifier.
2. Stemming: extract the most distinctive words (e.g., “万达”, “中兴通讯”, “TCL”, “ABB”) in a name with the TF-IDF algorithm.
3. Blocking by stem and applicant type and generating pairs to compare: from 0.38 million unique applicant names, we generated 5.54 million record pairs to compare.
4. Comparing the Levenshtein similarity of the preprocessed names in each record pair.
5. Clustering with thresholds directly, or
6. Predicting distances based on trained models and clustering based on distance matrices.

Two datasets were collected for training and testing our algorithm: one is our manually standardized set of applicant names of firms that filed patents and are listed on the National Equities Exchanges and Quotations (NEEQ); the other is SIPO’s data linked with China’s Annual Survey of Industrial Enterprises (ASIE), which is provided by the Chinese Patent Data Project (He et al. 2018).
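The rule-based route (preprocessing, Levenshtein comparison, and clustering with a threshold) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors’ implementation: the suffix and prefix lists are short illustrative subsets, bracketed segments are dropped entirely, the 0.85 threshold is arbitrary, TF-IDF stemming and blocking are omitted, and the name variants in the example are made up around the TCL example given above.

```python
import re

SUFFIXES = ["股份有限公司", "有限公司", "公司"]  # illustrative subset, longest first
PREFIXES = ["北京", "深圳"]                      # illustrative subset


def preprocess(name):
    """Strip whitespace, bracketed segments, one leading prefix and one trailing suffix."""
    name = re.sub(r"\s+", "", name)
    name = re.sub(r"[((][^))]*[))]", "", name)  # assumption: drop e.g. "(深圳)"
    for prefix in PREFIXES:
        if name.startswith(prefix) and len(name) > len(prefix):
            name = name[len(prefix):]
            break
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            name = name[:-len(suffix)]
            break
    return name


def levenshtein(a, b):
    """Edit distance computed by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Levenshtein similarity scaled to the range [0, 1]."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def cluster_names(names, threshold=0.85):
    """Link names whose preprocessed forms reach the threshold (single linkage)."""
    keys = [preprocess(n) for n in names]
    parent = list(range(len(names)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if similarity(keys[i], keys[j]) >= threshold:
                parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(names))]


# Made-up variants for illustration only.
names = ["深圳TCL数字技术有限公司", "TCL数字技术(深圳)有限公司", "万达集团股份有限公司"]
labels = cluster_names(names)
```

In a full pipeline the pairwise loop would run only within blocks that share a stem and applicant type, which is what keeps the number of comparisons manageable.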
Cite this article
Yin, D., Motohashi, K. & Dang, J. Large-scale name disambiguation of Chinese patent inventors (1985–2016). Scientometrics 122, 765–790 (2020). https://doi.org/10.1007/s11192-019-03310-w
DOI: https://doi.org/10.1007/s11192-019-03310-w