WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia

Han, Kanyao; Yang, Pingjing; Mishra, Shubhanshu; Diesner, Jana

doi:10.1007/978-3-030-55814-7_17


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1260)


Abstract

Domain-specific classification schemas (or subject heading vocabularies) are often used to identify, classify, and disambiguate concepts that occur in scholarly articles. In this work, we develop, apply, and evaluate a human-in-the-loop workflow that first extracts an initial category tree from crowd-sourced Wikipedia data, and then combines community detection, machine learning, and hand-crafted heuristics or rules to prune the initial tree. This work resulted in WikiCSSH, a large-scale, hierarchically organized subject heading vocabulary for the domain of computer science (CS). Our evaluation suggests that WikiCSSH outperforms alternative CS vocabularies in terms of coverage of CS terms that occur in research articles. WikiCSSH can further distinguish between coarse-grained and fine-grained CS concepts. The outlined workflow can serve as a template for building hierarchically organized subject heading vocabularies for other domains that are covered in Wikipedia.
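
The extract-then-prune idea in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy category graph, the depth limit `max_depth`, and the single name-based rule `looks_like_people_category` are hypothetical and are not taken from the paper. The workflow described above combines community detection, machine learning, and hand-crafted rules with human review; this sketch only shows a breadth-first extraction followed by one rule-based pruning pass.

```python
from collections import deque

# Hypothetical toy mapping of category -> subcategories; not real Wikipedia data.
TOY_SUBCATEGORIES = {
    "Computer science": ["Algorithms", "Computer scientists", "Machine learning"],
    "Algorithms": ["Sorting algorithms", "Graph algorithms"],
    "Machine learning": ["Neural networks"],
    "Computer scientists": ["Turing Award laureates"],
    # Crowd-sourced category graphs can contain cycles; this edge simulates one.
    "Graph algorithms": ["Algorithms"],
}


def extract_tree(root, subcategories, max_depth):
    """Breadth-first traversal that keeps each category once, at its shallowest depth."""
    depth = {root: 0}
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand below the assumed depth limit
        for child in subcategories.get(node, []):
            if child not in depth:  # skip already-visited nodes, which also breaks cycles
                depth[child] = depth[node] + 1
                parent[child] = node
                queue.append(child)
    return depth, parent


def prune(depth, parent, is_off_topic):
    """Drop every category flagged by the rule, together with all of its descendants."""
    kept = {}
    for node in sorted(depth, key=depth.get):  # visit parents before their children
        p = parent[node]
        if is_off_topic(node) or (p is not None and p not in kept):
            continue
        kept[node] = depth[node]
    return kept


def looks_like_people_category(name):
    """Hypothetical hand-crafted rule: treat categories about people as off-topic."""
    return "scientists" in name.lower()


depth, parent = extract_tree("Computer science", TOY_SUBCATEGORIES, max_depth=3)
pruned = prune(depth, parent, looks_like_people_category)
for name in pruned:
    print("  " * pruned[name] + name)
```

Recording the shallowest depth for each category turns the cyclic category graph into a tree, and the depth value gives a rough proxy for how coarse-grained or fine-grained a heading is. Pruning a flagged category together with its subtree is a design choice of this sketch, not a claim about how WikiCSSH was built.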

This material is based upon work supported by the Korea Institute of Science and Technology Information under Grant No. C17031. We thank Kehan Li for assistance with data annotation, and anonymous reviewers for their feedback.




Author information

Authors and Affiliations

School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA
Kanyao Han, Pingjing Yang, Shubhanshu Mishra & Jana Diesner


Corresponding author

Correspondence to Shubhanshu Mishra.

Editor information

Editors and Affiliations

ISAE-ENSMA, Poitiers, France
Ladjel Bellatreche
Slovak University of Technology, Bratislava, Slovakia
Mária Bieliková
Université Lumière Lyon 2, Lyon, France
Omar Boussaïd
University of Genova, Genova, Italy
Barbara Catania
Université Lumière Lyon 2, Lyon, France
Jérôme Darmont
Leibniz University of Hannover, Hannover, Niedersachsen, Germany
Elena Demidova
Université Claude Bernard Lyon 1, Lyon, France
Fabien Duchateau
The Open University, Milton Keynes, UK
Mark Hall
University of Ljubljana, Ljubljana, Slovenia
Tanja Merčun
National Research University Higher School of Economics, St. Petersburg, Russia
Boris Novikov
Ionian University, Corfu, Greece
Christos Papatheodorou
Goethe University Frankfurt, Frankfurt am Main, Hessen, Germany
Thomas Risse
Universitat Politècnica de Catalunya, Barcelona, Spain
Oscar Romero
AgroParisTech, Montpellier, France
Lucile Sautot
University of Lyon, Lyon, France
Guilaine Talens
Poznań University of Technology, Poznań, Poland
Robert Wrembel
University of Ljubljana, Ljubljana, Slovenia
Maja Žumer


About this paper

Cite this paper

Han, K., Yang, P., Mishra, S., Diesner, J. (2020). WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia. In: Bellatreche, L., et al. ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium. TPDL ADBIS 2020. Communications in Computer and Information Science, vol 1260. Springer, Cham. https://doi.org/10.1007/978-3-030-55814-7_17


DOI: https://doi.org/10.1007/978-3-030-55814-7_17
Published: 18 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55813-0
Online ISBN: 978-3-030-55814-7
eBook Packages: Computer Science, Computer Science (R0)
