ABSTRACT
Inventor name disambiguation is the task of distinguishing each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries that retrieve information about a specific inventor, e.g. a list of all that inventor's patents. We adapt earlier work on author name disambiguation to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records refers to the same person. The DBSCAN algorithm is used to cluster inventor records, and its distance function is derived from the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and to enable parallelization, since each block can be processed simultaneously. In a test on the USPTO patent database, 12 million inventor records were disambiguated in 6.5 hours. Evaluation on the labeled datasets from the USPTO PatentsView competition shows that our algorithm outperforms all algorithms submitted to the competition.
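The pipeline described in the abstract (blocking, a pairwise same-person score, and DBSCAN clustering on a distance derived from that score) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the records, the `block_key` rule (last name plus first initial), the `eps` and `min_pts` values, and especially `same_person_probability` are hypothetical. The latter is a hand-written stand-in for the probability a trained random forest would output for a pair of records.

```python
from collections import defaultdict

# Hypothetical inventor records: (record_id, first_name, last_name, city).
RECORDS = [
    (0, "John", "Smith", "Boston"),
    (1, "John", "Smith", "Boston"),
    (2, "J.", "Smith", "Boston"),
    (3, "Jane", "Smith", "Denver"),
    (4, "Wei", "Chen", "Austin"),
    (5, "Wei", "Chen", "Austin"),
]


def block_key(record):
    """Assumed blocking function: last name plus first initial."""
    _, first, last, _ = record
    return (last.lower(), first[0].lower())


def same_person_probability(a, b):
    """Stand-in for the trained pairwise classifier's probability output."""
    score = 0.0
    fa, fb = a[1].rstrip(".").lower(), b[1].rstrip(".").lower()
    if fa == fb:
        score += 0.5
    elif fa.startswith(fb) or fb.startswith(fa):
        score += 0.3  # one first name abbreviates the other
    if a[3] == b[3]:
        score += 0.5  # same city
    return score


def distance(a, b):
    """Distance derived from the pairwise score: 1 - P(same person)."""
    return 1.0 - same_person_probability(a, b)


def dbscan(points, dist, eps, min_pts):
    """Plain DBSCAN; returns one label per point (-1 means noise)."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point: keep expanding
    return labels


def disambiguate(records, eps=0.5, min_pts=2):
    """Cluster each block independently; map record_id -> inventor id."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)

    result, next_id = {}, 0
    for key in sorted(blocks):  # blocks share no state across iterations
        block = blocks[key]
        labels = dbscan(block, distance, eps, min_pts)
        local = {}
        for rec, label in zip(block, labels):
            if label == -1 or label not in local:
                if label != -1:
                    local[label] = next_id
                result[rec[0]] = next_id  # noise becomes its own inventor
                next_id += 1
            else:
                result[rec[0]] = local[label]
    return result


if __name__ == "__main__":
    print(disambiguate(RECORDS))
```

Because pairs are compared only within a block, the number of comparisons scales with block sizes rather than with the whole database, and the per-block loop body could be distributed across workers since no block depends on another.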
Index Terms
- Inventor Name Disambiguation for a Patent Database Using a Random Forest and DBSCAN