Abstract
Digital Libraries represent the commitment of research communities to preserve authoritative and well-structured sources of knowledge, and to share archival organisations, methods and resources through systems that rely on standard metadata formats. This chapter describes some natural language processing techniques used to automatically extract structural information from documents stored in Digital Libraries, based on the metadata they expose. The most prominent results achieved in this area are surveyed and discussed. As an example of an infrastructure for integrating, structuring and searching Digital Libraries based on natural language processing and semantic web techniques, we discuss the MANENT system. MANENT is a working prototype offering services for Digital Library content management and for record classification and retrieval. It is hosted on a server at the Computer Science Department of Genova University and, starting from 2011, will become publicly available. A total of 475,000 records drawn from 138 repositories around the world that expose OAI-PMH services have been downloaded and stored, and their automatic classification is under way.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Locoro, A., Grignani, D., Mascardi, V. (2011). MANENT: An Infrastructure for Integrating, Structuring and Searching Digital Libraries. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_15
DOI: https://doi.org/10.1007/978-3-642-22913-8_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22912-1
Online ISBN: 978-3-642-22913-8
eBook Packages: Engineering, Engineering (R0)