Abstract
Domain-specific classification schemas (or subject heading vocabularies) are often used to identify, classify, and disambiguate concepts that occur in scholarly articles. In this work, we develop, apply, and evaluate a human-in-the-loop workflow that first extracts an initial category tree from crowd-sourced Wikipedia data, and then combines community detection, machine learning, and hand-crafted heuristics or rules to prune the initial tree. This work resulted in WikiCSSH, a large-scale, hierarchically organized subject heading vocabulary for the domain of computer science (CS). Our evaluation suggests that WikiCSSH outperforms alternative CS vocabularies in terms of coverage of CS terms that occur in research articles. WikiCSSH can further distinguish between coarse-grained and fine-grained CS concepts. The outlined workflow can serve as a template for building hierarchically organized subject heading vocabularies for other domains that are covered in Wikipedia.
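The extract-then-prune idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy category graph, the depth limit, and the keyword-based pruning rule are all hypothetical, and the community detection and machine learning steps are omitted entirely. The sketch only shows a breadth-first extraction of a category tree followed by a simple rule that drops a category together with its descendants.

```python
from collections import deque

# Hypothetical toy category graph: parent category -> subcategories.
TOY_CATEGORIES = {
    "Computer science": ["Algorithms", "Computer scientists", "Machine learning"],
    "Algorithms": ["Sorting algorithms", "Graph algorithms"],
    "Machine learning": ["Neural networks", "Machine learning researchers"],
    "Computer scientists": ["Turing Award laureates"],
}

# Hypothetical pruning rule: drop categories about people rather than concepts.
BLOCKED_WORDS = ("scientists", "researchers", "laureates")


def extract_tree(graph, root, max_depth):
    """Breadth-first walk mapping each category to (first-seen parent, depth)."""
    tree = {root: (None, 0)}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        depth = tree[node][1]
        if depth == max_depth:
            continue  # do not expand below the depth limit
        for child in graph.get(node, []):
            if child not in tree:  # keep the shallowest path; also guards against cycles
                tree[child] = (node, depth + 1)
                queue.append(child)
    return tree


def prune(tree, is_blocked):
    """Remove blocked categories and every descendant of a removed category."""
    kept = {}
    # Process shallow nodes first so each parent is decided before its children.
    for node, (parent, depth) in sorted(tree.items(), key=lambda kv: kv[1][1]):
        if is_blocked(node):
            continue
        if parent is not None and parent not in kept:
            continue
        kept[node] = (parent, depth)
    return kept


def is_person_category(name):
    """Hypothetical heuristic: flag category names that refer to people."""
    return any(word in name.lower() for word in BLOCKED_WORDS)


initial_tree = extract_tree(TOY_CATEGORIES, "Computer science", max_depth=2)
pruned_tree = prune(initial_tree, is_person_category)
```

In this sketch the depth stored with each category gives a rough proxy for how coarse or fine a concept is; a real pipeline would read the category graph from a Wikipedia dump or API and replace the keyword rule with the combination of methods named in the abstract.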
This material is based upon work supported by the Korea Institute of Science and Technology Information under Grant No. C17031. We thank Kehan Li for assistance with data annotation, and anonymous reviewers for their feedback.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Han, K., Yang, P., Mishra, S., Diesner, J. (2020). WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia. In: Bellatreche, L., et al. ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium. TPDL ADBIS 2020 2020. Communications in Computer and Information Science, vol 1260. Springer, Cham. https://doi.org/10.1007/978-3-030-55814-7_17
DOI: https://doi.org/10.1007/978-3-030-55814-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55813-0
Online ISBN: 978-3-030-55814-7
eBook Packages: Computer Science, Computer Science (R0)