Identifying named entities in academic biographies with supervised learning

Kenekayoro, Patrick

doi:10.1007/s11192-018-2797-4

Published: 08 June 2018

Volume 116, pages 751–765 (2018)

Scientometrics

Patrick Kenekayoro ORCID: orcid.org/0000-0003-0021-6584¹

Abstract

Personal webpages of researchers and faculty members make up part of the academic web. These webpages contain semi-structured or plain text information, and research has shown the importance of combining information extracted from multiple academic websites into a unified database that can help in expert finding and thus improve information retrieval for end users. This research identifies the kinds of named entities that could be present in academic biographies by manually examining biographies extracted from ORCID public profiles, and describes a method that uses natural language processing techniques and supervised machine learning to automatically extract these named entities from plain text biographies. Up to 86% accuracy was achieved with support vector machines, demonstrating that the method can be suitable for creating a reusable trained model that extracts useful academic information from researchers’ personal profiles in webpages or other data sources.
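
As a rough illustration of the kind of preprocessing such a pipeline might involve, the sketch below turns a plain text biography sentence into per-token feature dictionaries that a supervised classifier (for instance a support vector machine) could be trained on. The feature set, the function names (tokenize, word_shape, token_features) and the example sentence are assumptions made for illustration; the abstract does not specify the features or tools actually used.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_shape(token):
    """Map a token to a coarse shape, e.g. 'Delta' -> 'Xx', '2018' -> 'd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Squeeze runs of the same symbol so that length does not matter.
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i):
    """Build an illustrative (hypothetical) feature dict for tokens[i]."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "is_title": token.istitle(),
        # Neighbouring tokens give the classifier some local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


# Made-up example sentence in the style of an academic biography.
sentence = "She is a lecturer at Niger Delta University."
tokens = tokenize(sentence)
features = [token_features(tokens, i) for i in range(len(tokens))]

for token, feats in zip(tokens, features):
    print(token, feats)
```

Each dictionary could then be vectorised and paired with a manually assigned entity label to train a classifier. Which features and labels work best is an empirical question that this sketch does not address.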

Acknowledgements

This research was supported by the Tertiary Education Trust Fund (TETFUND). The author would like to thank the anonymous reviewers for their constructive comments and suggestions to improve the quality of the paper.

Author information

Authors and Affiliations

Mathematics/Computer Science Department, Niger Delta University, PMB 581, Amassoma, Bayelsa State, Nigeria
Patrick Kenekayoro

Corresponding author

Correspondence to Patrick Kenekayoro.

About this article

Cite this article

Kenekayoro, P. Identifying named entities in academic biographies with supervised learning. Scientometrics 116, 751–765 (2018). https://doi.org/10.1007/s11192-018-2797-4

Received: 02 August 2017
Published: 08 June 2018
Issue Date: August 2018
DOI: https://doi.org/10.1007/s11192-018-2797-4
