poster

Inventor Name Disambiguation for a Patent Database Using a Random Forest and DBSCAN

Authors:
Kunho Kim, The Pennsylvania State University, University Park, PA, USA
Madian Khabsa, Microsoft Research, Redmond, WA, USA
C. Lee Giles, The Pennsylvania State University, University Park, PA, USA

JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, June 2016, Pages 269–270, https://doi.org/10.1145/2910896.2925465

Published: 19 June 2016

ABSTRACT

Inventor name disambiguation is the task of distinguishing each unique inventor from all other inventor records in a patent database. This task is essential for processing person name queries that retrieve information about a specific inventor, e.g. a list of all of that inventor's patents. We adapt earlier work on author name disambiguation to inventor name disambiguation. A random forest classifier is trained to classify whether each pair of inventor records refers to the same person. The DBSCAN algorithm is used for inventor record clustering, and its distance function is derived from the random forest classifier. For scalability, blocking functions are used to reduce the complexity of record matching and to enable parallelization, since each block can be processed independently. Tested on the USPTO patent database, the method disambiguated 12 million inventor records in 6.5 hours. Evaluation on the labeled datasets from the USPTO PatentsView competition shows that our algorithm outperforms all algorithms submitted to the competition.
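The pipeline described in the abstract (blocking, a pairwise same-person score, and DBSCAN over a distance derived from that score) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and not taken from the paper: the record fields, the blocking key (last name plus first initial), the `eps` and `min_pts` values, and the rule that DBSCAN noise points become single-record inventors. In particular, `same_person_probability` is a hand-written stand-in for the trained random forest, whose features are not given in the abstract, and the distance is assumed to be one minus that probability.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical inventor records: (record_id, first_name, last_name, city).
RECORDS = [
    (0, "John", "Smith", "Seattle"),
    (1, "Jon", "Smith", "Seattle"),
    (2, "John", "Smith", "Boston"),
    (3, "Jane", "Smyth", "Austin"),
    (4, "Wei", "Zhang", "Beijing"),
    (5, "Wei", "Zhang", "Beijing"),
]


def block_key(rec):
    """Assumed blocking function: lowercased last name plus first initial."""
    _, first, last, _ = rec
    return (last.lower(), first[0].lower())


def same_person_probability(a, b):
    """Stand-in for the random forest's P(same person); NOT a trained model."""
    name_sim = SequenceMatcher(None, a[1].lower(), b[1].lower()).ratio()
    city_sim = 1.0 if a[3] == b[3] else 0.0
    return 0.5 * name_sim + 0.5 * city_sim


def dbscan(points, distance, eps, min_pts):
    """Minimal DBSCAN over an arbitrary pairwise distance; -1 marks noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j in range(n) if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                queue.extend(j_nbrs)  # j is a core point, keep expanding
    return labels


def disambiguate(records, eps=0.4, min_pts=2):
    """Map each record_id to an inventor id, clustering block by block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)

    def dist(a, b):
        return 1.0 - same_person_probability(a, b)

    result, next_id = {}, 0
    # Blocks never share records, so each iteration could run in parallel.
    for key in sorted(blocks):
        block = blocks[key]
        labels = dbscan(block, dist, eps, min_pts)
        local = {}
        for rec, label in zip(block, labels):
            if label == -1 or label not in local:
                if label != -1:
                    local[label] = next_id
                result[rec[0]] = next_id
                next_id += 1
            else:
                result[rec[0]] = local[label]
    return result


if __name__ == "__main__":
    print(disambiguate(RECORDS))
```

In this toy data, records 0 and 1 fall in the same block and are close under the stand-in score, so they share an inventor id, while record 2 (same name, different city) is left as its own inventor. Only pairs inside a block are ever compared, which is what keeps the number of comparisons manageable and lets blocks be processed independently.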

Index Terms

1. Information systems
  1. Information retrieval


Published in
JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
June 2016
316 pages
ISBN:9781450342292
DOI:10.1145/2910896
General Chairs: Nabil R. Adam (Rutgers University), Boots Cassel (Villanova University), Yelena Yesha (University of Maryland, Baltimore County)
Program Chairs: Richard Furuta (Texas A&M University), Michele C. Weigle (Old Dominion University)
Copyright © 2016 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dbscan
name disambiguation
random forest
Qualifiers
- poster
Acceptance Rates
JCDL '16 Paper Acceptance Rate: 15 of 52 submissions, 29%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%