Abstract
In this article, we present the conceptual design and report on the implementation of Capisco, a low-cost approach to concept-based access to digital libraries. Capisco avoids the need for complete semantic document markup using ontologies by leveraging an automatically generated Concept-in-Context (CiC) network. The network is seeded by a priori analysis of Wikipedia texts and identification of semantic metadata. Capisco disambiguates the terms in the documents using their semantics and context, and identifies the relevant CiC concepts. In addition, search queries are disambiguated interactively, to fully utilize the domain knowledge of the scholar. For established digital library systems, completely replacing, or even significantly changing, the document retrieval mechanism (document analysis, indexing strategy, query processing, and query interface) would require major technological effort and would most likely be disruptive. In addition to presenting Capisco, we describe ways to harness the results of our semantic analysis and disambiguation while retaining the existing keyword-based search and lexicographic index. We engineer this so that the output of semantic analysis (performed off-line) is suitable for direct import into existing digital library metadata and index structures, and can thus be incorporated without architecture modifications.
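To make the idea of importing off-line semantic analysis into an existing keyword index more concrete, the following is a minimal sketch and not Capisco's actual implementation. It assumes a toy CiC network (the `CIC_NETWORK` dictionary, its concept identifiers, and its context cues are all hypothetical), picks a concept for an ambiguous term by simple overlap with the page context, and stores the chosen concept as an extra term next to the ordinary keywords in an inverted index.

```python
from collections import defaultdict

# Hypothetical toy CiC network: each surface term maps to candidate concepts,
# and each concept lists context cues that tend to appear near it.
CIC_NETWORK = {
    "jaguar": {
        "Jaguar_(animal)": {"cat", "jungle", "prey", "habitat"},
        "Jaguar_Cars": {"car", "engine", "factory", "coventry"},
    },
}


def disambiguate(term, context_terms):
    """Return the candidate concept with the largest cue overlap, or None."""
    best, best_score = None, 0
    for concept, cues in CIC_NETWORK.get(term, {}).items():
        score = len(cues & context_terms)
        if score > best_score:
            best, best_score = concept, score
    return best


def build_index(pages):
    """Build a keyword inverted index and add concept ids as extra terms."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        tokens = text.lower().split()
        context = set(tokens)
        for tok in tokens:
            index[tok].add(page_id)  # ordinary keyword posting
            concept = disambiguate(tok, context - {tok})
            if concept is not None:
                # The concept is stored like any other index term, so the
                # existing keyword search machinery needs no changes.
                index["concept:" + concept].add(page_id)
    return index


PAGES = {
    "p1": "the jaguar stalks its prey in the jungle",
    "p2": "the jaguar factory builds each engine by hand",
}
INDEX = build_index(PAGES)
```

In this sketch, a keyword query for `jaguar` still returns both pages, whereas a query for one of the `concept:` terms returns only the page whose context matched that sense. The overlap score is a deliberately simple stand-in for whatever disambiguation the real system performs.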

Notes
Technical non-experts are users who are domain experts but are not familiar with the technical details of semantic concepts [30].
These documents and other test collections have been provided by the HathiTrust.
For simplicity, we abstract from the precise locations in which the terms appear on each page.
Such as the advanced search for HathiTrust items at catalog.hathitrust.org/Search/Advanced.
The references link to the publications in which the corpora were first introduced.
References
Cunningham, S.J., Hinze, A.M., Bainbridge, D., Taube-Schock, C., Ryan, T.: Building heritage document collections for Pacific Island nations using semantic-enriched search. In: Proceedings of the Samoa Conference III. Sāmoa: National University of Sāmoa (2014)
Duineveld, A.J., Stoter, R., Weiden, M.R., Kenepa, B., Benjamins, V.R.: Wondertools? A Comparative Study of Ontological Engineering Tools
Airio, E., Järvelin, K., Saatsi, P., Kekäläinen, J., Suomela, S.: Ciri-an ontology-based query interface for text retrieval. In: Web Intelligence: Proceedings of the 11th Finnish Artificial Intelligence Conference, Citeseer (2004)
Apperley, M., Cunningham, S.J., Keegan, T.T., Witten, I.H.: Niupepa: a historical newspaper collection. Commun. ACM 44(5), 86–87 (2001)
Baeza-Yates, R.A., Ribeiro-Neto, B.A.: Modern Information Retrieval—The Concepts and Technology Behind Search, 2nd edn. Addison-Wesley, Reading (2011)
Bainbridge, D., Don, K.J., Buchanan, G.R., Witten, I.H., Jones, S., Jones, M., Barr, M.I.: Dynamic digital library construction and configuration. In: Heery, R., Lyon, L. (eds.) Proceedings of the Research and Advanced Technology for Digital Libraries: 8th European Conference, ECDL 2004, Bath, UK, September 12–17, 2004, pp. 1–13. Springer, Berlin (2004)
Berrios, D.C.: Methods for Semi-automated Index Generation for High Precision Information Retrieval. PhD thesis, Stanford University (2001)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
Bunescu, R.C., Pasca, M.: Using encyclopedic knowledge for named entity disambiguation. In: 11th Conference of the European Chapter of the Association for Computational Linguistics, ACL, pp. 9–16 (2006)
Campbell, I.: The Ostensive Model of Developing Information-Needs. PhD thesis, University of Glasgow (2000)
Carpineto, C., Romano, G.: A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44(1), 1:1–1:50 (2012). https://doi.org/10.1145/2071389.2071390
Churchill, W.: Niue: a reconnaissance. Bull. Am. Geogr. Soc. 40(3), 150–156 (1908)
Cimiano, P., Schultz, A., Sizov, S., Sorg, P., Staab, S.: Explicit versus latent concept models for cross-language information retrieval. IJCAI 9, 1513–1518 (2009)
Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Association for Computational Linguistics, Prague, Czech Republic, pp. 708–716 (2007). http://www.aclweb.org/anthology/D07-1074
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
Downie, J.S., Cole, T., Senseney, M., Jett, J., Page, K., Hinze, A., Muñoz, T., Audenaert, N.: Workset Creation for Scholarly Analysis: Recommendations and Prototyping Project Reports. University of Illinois at Urbana-Champaign, Tech. rep. (2015)
Dugan, J.M., Berrios, D.C., Liu, X., Kim, D.K., Kaizer, H., Fagan, L.M.: Automation and integration of components for generalized semantic markup of electronic medical texts. In: Proceedings of the AMIA Symposium, American Medical Informatics Association, pp. 736–740 (1999)
Efthimiadis, E.N.: Interactive query expansion: a user-based evaluation in a relevance feedback environment. J. Am. Soc. Inf. Sci. 51(11), 989–1003 (2000)
El-Beltagy, S.R., Rafea, A.: KP-Miner: a keyphrase extraction system for English and Arabic documents. Inf. Syst. 34(1), 132–144 (2009)
Fellbaum, C.: WordNet. Wiley, New York (1998)
Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G.: Ontology change: classification and survey. Knowl. Eng. Rev. 23(02), 117–152 (2008)
Furnas, G.W., Landauer, T.K., Gomez, L.M., Dumais, S.T.: The vocabulary problem in human–system communication. Commun. ACM 30(11), 964–971 (1987). https://doi.org/10.1145/32206.32212
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1606–1611. Morgan Kaufmann (2007)
Ganea, O.E., Ganea, M., Lucchi, A., Eickhoff, C., Hofmann, T.: Probabilistic bag-of-hyperlinks model for entity linking. In: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 927–938 (2016)
Griffiths, T.L., Steyvers, M., Blei, D.M., Tenenbaum, J.B.: Integrating topics and syntax. In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Proceedings of the 17th International Conference on Neural Information Processing Systems (NIPS’04), pp. 537–544. MIT Press, Cambridge, MA, USA (2004)
Grishman, R., Sundheim, B.: Message understanding conference—6: a brief history. In: Proceedings of the 16th Conference on Computational Linguistics, ACL, COLING ’96, pp. 466–471 (1996). https://doi.org/10.3115/992628.992709
Guha, R., McCool, R., Miller, E.: Semantic search. In: Proceedings of the 12th International Conference on World Wide Web. ACM, pp. 700–709 (2003)
Guppy, H.B.: Coral Islands and Savage Myths. Victoria Institute and Philosophical Society of Great Britain, London (1889)
Harris, P., Matamua, R., Smith, T., Kerr, H., Waaka, T.: A review of Māori astronomy in Aotearoa-New Zealand. J. Astron. Hist. Herit. 16(3), 325–336 (2013)
Hinze, A., Heese, R., Luczak-Rösch, M., Paschke, A.: Semantic enrichment by non-experts: usability of manual annotation tools. In: The Semantic Web—ISWC 2012, pp. 165–181. Springer, Berlin (2012)
Hinze, A., Heese, R., Schlegel, A., Luczak-Rösch, M.: User-defined semantic enrichment of full-text documents: experiences and lessons learned. In: Theory and Practice of Digital Libraries, pp. 209–214. Springer, Berlin (2012)
Hinze, A., Taube-Schock, C., Bainbridge, D., Cunningham, S.J., Downie, J.S.: Introducing Capisco: A semantically-enhanced search and discovery system for large-scale text corpora. ACM SIGWEB Newsl. Autumn 2015, 4:1–4:14 (2015). https://doi.org/10.1145/2833219.2833223
Hinze, A., Taube-Schock, C., Bainbridge, D., Matamua, R., Downie, J.S.: Improving access to large-scale digital libraries through semantic-enhanced search and disambiguation. In: Proceedings of the ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 147–156. ACM (2015)
Hinze, A., Bainbridge, D., Cunningham, S.J., Downie, J.S.: Low-cost semantic enhancement to digital library metadata and indexing: simple yet effective strategies. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 93–102. ACM (2016). https://doi.org/10.1145/2910896.2910910
Hinze, A., Coleman, M., Cunningham, S.J., Bainbridge, D.: Semantic bookworm: mining literary resources revisited. In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, pp. 227–228. ACM (2016b). https://doi.org/10.1145/2910896.2925444
Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 782–792 (2011)
Hovy, E., Navigli, R., Ponzetto, S.P.: Collaboratively built semi-structured content and artificial intelligence: the story so far. Artif. Intell. 194, 2–27 (2013)
Huang, A., Milne, D., Frank, E., Witten, I.H.: Clustering documents using a Wikipedia-based concept representation. In: Proceedings of 13th Pacific-Asia Conference, pp. 628–636. Springer, Berlin (2009)
Jean-Louis, L., Zouaq, A., Gagnon, M., Ensan, F.: An assessment of online semantic annotators for the keyword extraction task. In: PRICAI 2014: Trends in Artificial Intelligence, pp. 548–560. Springer, Berlin (2014)
Johnes, A.J.: Johnes on the causes which have produced dissent from the established church in the principality of Wales. Henry Hooper, London (1870)
Don, K.J., Bainbridge, D., Witten, I.H.: The Design of Greenstone 3: An Agent Based Dynamic Digital Library. Tech. rep., Department of Computer Science, University of Waikato (2002)
Karger, D.: Unference: UI (Not AI) as Key to the Semantic Web. Panel on Interaction Design Grand Challenges and the Semantic Web, at the 3rd International Semantic Web User Interaction Workshop (2006)
Karger, D., Schraefel, M.: The pathetic fallacy of RDF. In: International Workshop on the Semantic Web and User Interaction (SWUI) 2006 (2006). http://eprints.soton.ac.uk/id/eprint/262911
Kim, D.K., Fagan, L.M., Jones, K.T., Berrios, D.C., Yu, V.L.: MYCIN II: design and implementation of a therapy reference with complex content-based indexing. In: Proceedings of the AMIA Symposium, pp. 175–179. American Medical Informatics Association (1998)
Köhncke, B., Balke, W.T.: Context-sensitive ranking using cross-domain knowledge for chemical digital libraries. In: International Conference on Theory and Practice of Digital Libraries, pp. 285–296. Springer, Berlin (2013)
Köhncke, B., Siehndel, P., Balke, W.T.: Bridging the gap–using external knowledge bases for context-aware document retrieval. In: International Conference on Asian Digital Libraries, pp. 11–20. Springer, Berlin (2013)
Kohomban, U.S., Lee, W.S.: Learning semantic classes for word sense disambiguation. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, pp. 34–41 (2005)
Kulkarni, S., Singh, A., Ramakrishnan, G., Chakrabarti, S.: Collective annotation of Wikipedia entities in web text. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 457–466. ACM (2009)
Lei, Y., Uren, V., Motta, E.: Semsearch: a search engine for the semantic web. In: International Conference on Knowledge Engineering and Knowledge Management, pp. 238–245. Springer, Berlin (2006)
Leonard, P.: Mining large datasets for the humanities. In: World Library and Information Congress. International Federation of Library Associations (2014)
Lin, Y., Michel, J.B., Aiden, E.L., Orwant, J., Brockman, W., Petrov, S.: Syntactic annotations for the Google books ngram corpus. In: Proceedings of the ACL 2012 System Demonstrations, pp. 169–174. ACL (2012)
Lytras, M., Sicilia, M., Davies, J., Kashyap, V., Stojanovic, N.: On the conceptualisation of the query refinement task. Library Manag. 26(4/5), 281–294 (2005)
Mäkelä, E.: Survey of semantic search research. In: Proceedings of the Seminar on Knowledge Management on the Semantic Web. Department of Computer Science, University of Helsinki, Helsinki (2005)
Mangold, C.: A survey and classification of semantic search approaches. Int. J. Metadata Semant. Ontol. 2(1), 23–34 (2007)
Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1318–1327. ACL (2009)
Mihalcea, R., Csomai, A.: Wikify! Linking documents to encyclopedic knowledge. In: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 233–242. ACM (2007)
Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: Proceedings of the ACM Conference on Information and Knowledge Management, pp. 509–518. ACM (2008)
Milne, D., Witten, I.H.: An open-source toolkit for mining Wikipedia. Artif. Intell. 194, 222–239 (2013)
Milne, D., Medelyan, O., Witten, I.H.: Mining domain-specific thesauri from Wikipedia: a case study. In: Proceedings IEEE/WIC/ACM International Conference on Web Intelligence, pp. 442–448. IEEE (2006)
Milne, D.N., Witten, I.H., Nichols, D.M.: A knowledge-based search engine powered by Wikipedia. In: Proceedings of the ACM Conference on Information and Knowledge Management, pp. 445–454. ACM (2007)
Moldovan, D.I., Mihalcea, R.: Using WordNet and lexical operators to improve internet searches. IEEE Internet Comput. 4(1), 34–43 (2000)
Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in domain-specific information retrieval. In: Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, pp. 219–226. Springer, Berlin (2009)
Nakayama, K., Hara, T., Nishio, S.: A thesaurus construction method from large scale web dictionaries. In: 21st International Conference on Advanced Information Networking and Applications, 2007 (AINA’07), pp. 932–939. IEEE (2007)
Navigli, R.: Word sense disambiguation: a survey. ACM Comput. Surv. (CSUR) 41(2), 10:1–10:69 (2009)
O’Brien, R.B. (ed.): Home Rule, Speeches by John Redmond. T. F. Unwin, London (1910)
Peat, H.J., Willett, P.: The limitations of term co-occurrence data for query expansion in document retrieval systems. J. Am. Soc. Inf. Sci. 42, 378–383 (1991)
Plale, B., Prakash, A., McDonald, R.: The Data Capsule for Non-consumptive Research: Final report. Tech. rep., Indiana University (2015). https://scholarworks.iu.edu/dspace/handle/2022/19277
Potthast, M., Stein, B., Anderka, M.: A Wikipedia-based multilingual retrieval model. In: European Conference on Information Retrieval, pp. 522–530. Springer, Berlin (2008)
Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and global algorithms for disambiguation to Wikipedia. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1375–1384. ACL (2011)
Rito, J.S.T., Healy, S.M. (eds.): Proceedings of the Traditional Knowledge Conference 2008: Traditional Knowledge and Gateways to Balanced Relationships. New Zealand’s Māori Centre of Research Excellence (2008)
Rizzo, G., Troncy, R.: Nerd: evaluating named entity recognition tools in the web of data. In: ISWC’11, Workshop on Web Scale Knowledge Extraction (WEKEX’11) (2011). http://porto.polito.it/2440793/1/wekex2011_submission_6.pdf
Scheau, C., Rebedea, T., Chiru, C., Trausan-Matu, S.: Improving the relevance of search engine results by using semantic information from Wikipedia. In: 9th RoEduNet IEEE International Conference, pp. 151–156. IEEE (2010)
Shapira, B., Ofek, N., Makarenkov, V.: Exploiting Wikipedia for information retrieval tasks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’15, pp. 1137–1140. ACM (2015). https://doi.org/10.1145/2766462.2767879
Silverstein, C., Henzinger, M., Marais, H., Moricz, M.: Analysis of a very large altavista query log. ACM SIGIR Forum 33, 6–12 (1998)
Sinkkilä, R., Suominen, O., Hyvönen, E.: Automatic semantic subject indexing of web documents in highly inflected languages. In: The Semantic Web: Research and Applications, pp. 215–229. Springer, Berlin (2011)
Soderland, S., Aronow, D., Fisher, D., Aseltine, J., Lehnert, W.: Machine Learning of Text Analysis Rules for Clinical Records. Tech. rep., Dept. of Computer Science, University of Massachusetts (1995)
Sorg, P., Cimiano, P.: Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl. Eng. 74, 26–45 (2012)
Sowa, J.F.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley Longman, Reading (1984)
Steyvers, M., Griffiths, T.: Probabilistic topic models. Handb. Latent Semant. Anal. 427(7), 424–440 (2007)
Stojanovic, N.: Information-need driven query refinement. Web Intell. Agent Syst. 3(3), 155–169 (2005)
Stojanovic, N., Studer, R., Stojanovic, L.: An approach for step-by-step query refinement in the ontology-based information retrieval. In: International Conference on Web Intelligence, WI’04, pp. 36–43. IEEE (2004). https://doi.org/10.1109/WI.2004.21
Sykes, W.R.: Contributions to the Flora of Niue. Department of Scientific and Industrial Research, Christchurch (1970)
Tregear, E.: The Maori Race. AD Willis, Wanganui (1904)
Voorhees, E.M.: Query expansion using lexical-semantic relations. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 61–69. Springer, Berlin (1994)
Wallach, H.M.: Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 977–984. ACM (2006)
Wei, W., Barnaghi, P.M., Bargiela, A.: Search with meanings: an overview of semantic search systems. Int. J. Commun. SIWN 3, 76–82 (2008)
Witten, I., Milne, D.: An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In: Proceedings of the AAAI Workshop on Wikipedia and Artificial Intelligence: An Evolving Synergy, pp. 25–30. AAAI Press, Chicago (2008)
Witten, I.H., Boddie, S.J., Bainbridge, D., McNab, R.J.: Greenstone: a comprehensive open-source digital library software system. In: Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 113–121. ACM, New York (2000)
Witten, I.H., Bainbridge, D., Nichols, D.M.: How to Build a Digital Library, 2nd edn. Morgan Kaufmann, San Francisco (2009)
Yeh, E., Ramage, D., Manning, C.D., Agirre, E., Soroa, A.: Wikiwalk: random walks on Wikipedia for semantic relatedness. In: Proceedings of the 2009 Workshop on Graph-Based Methods for Natural Language Processing, pp. 41–49. Association for Computational Linguistics (2009)
Yesilada, Y., Bechhofer, S., Horan, B.: Cohse: Dynamic Linking of Web Resources. Tech. rep., Sun Microsystems Inc. (2007)
Zhang, L.: Interactive Retrieval Based on Wikipedia Concepts (2014). arXiv preprint arXiv:1412.8281
Acknowledgements
The authors thank the Andrew W. Mellon Foundation for their support of this work (Grant Reference Numbers 21300666 and 41500672). We also thank the staff at the HathiTrust Research Center for their assistance, and Tom Ryan, a humanities scholar at the University of Waikato.
Cite this article
Hinze, A., Bainbridge, D., Cunningham, S.J. et al. Capisco: low-cost concept-based access to digital libraries. Int J Digit Libr 20, 307–334 (2019). https://doi.org/10.1007/s00799-018-0232-3
DOI: https://doi.org/10.1007/s00799-018-0232-3