Abstract
Personal webpages of researchers and faculty members make up part of the academic web. These webpages contain semi-structured or plain-text information, and prior research has shown the value of combining information extracted from multiple academic websites into a unified database that supports expert finding and thus improves information retrieval for end users. This research identifies the kinds of named entities that can appear in academic biographies by manually examining biographies extracted from ORCID public profiles, and describes a method that uses natural language processing techniques and supervised machine learning to extract these named entities automatically from plain-text biographies. Support vector machines achieved up to 86% accuracy, indicating that the method can be suitable for creating a reusable trained model that extracts useful academic information from researchers’ personal profiles on webpages or in other data sources.
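As a rough illustration of what token-level named entity extraction with a supervised classifier can involve, the sketch below builds per-token features and groups predicted labels into entity spans. It is not the method of this article: the feature set, the BIO labelling scheme, the entity labels (DEGREE, ORG, DATE), and the function names are all assumptions chosen for the example. Feature dictionaries of this kind could be vectorized and passed to a classifier such as a support vector machine; that training step is omitted here.

```python
import re


def word_shape(token):
    """Collapse a token to a coarse shape, e.g. 'University' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Squeeze runs of the same symbol so long and short words share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def token_features(tokens, i):
    """Hypothetical feature set for token i: the token itself plus a one-word context window."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "shape": word_shape(token),
        "is_title": token.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def bio_to_spans(tokens, labels):
    """Group BIO labels (B-X, I-X, O) into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        starts_new = label.startswith("B-") or (
            label.startswith("I-") and label[2:] != current_type
        )
        if starts_new:
            # Close any open span before starting a new one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-"):
            current_tokens.append(token)
        else:
            # An "O" label ends the current span, if there is one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Invented example sentence with hand-assigned labels (illustrative data only).
tokens = ["She", "received", "her", "PhD", "from", "Example", "University", "in", "2014"]
labels = ["O", "O", "O", "B-DEGREE", "O", "B-ORG", "I-ORG", "O", "B-DATE"]

features = [token_features(tokens, i) for i in range(len(tokens))]
spans = bio_to_spans(tokens, labels)
```

In a full pipeline the labels would come from the trained classifier rather than being written by hand, and `bio_to_spans` would turn its per-token predictions into the extracted entities.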
Acknowledgements
This research was supported by the Tertiary Education Trust Fund (TETFUND). The author would like to thank the anonymous reviewers for their constructive comments and suggestions, which improved the quality of the paper.
Cite this article
Kenekayoro, P. Identifying named entities in academic biographies with supervised learning. Scientometrics 116, 751–765 (2018). https://doi.org/10.1007/s11192-018-2797-4
DOI: https://doi.org/10.1007/s11192-018-2797-4
Keywords
- Named entity recognition
- Supervised learning
- Natural language processing
- Support vector machines
- Random forests
- Conditional random fields