Capturing Document Semantics for Ontology Generation and Document Summarization

Baxter, David; Klimt, Bryan; Grobelnik, Marko; Schneider, David; Witbrock, Michael; Mladenić, Dunja

doi:10.1007/978-3-540-88845-1_11



When dealing with a document collection, it is important to identify repeated information. In multi-document summarization, for example, widely repeated content should be retained even when its wording differs from one document to the next. Simplistic approaches look only for identical strings, or identical syntactic structures (including words), across documents. Here we investigate semantic matching, which applies background knowledge from a large, general knowledge base (KB) to identify such repeated information in texts.
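To make the contrast between string matching and semantic matching concrete, the following is a minimal sketch, not the chapter's method. It assumes a tiny hand-written dictionary (`TOY_KB`, hypothetical) standing in for a real knowledge base, and it measures overlap with a simple Jaccard score chosen here purely for illustration.

```python
# Toy stand-in for a knowledge base. These word-to-concept mappings are
# illustrative assumptions and are not taken from any real KB.
TOY_KB = {
    "bought": "Buying",
    "acquired": "Buying",
    "firm": "Company",
    "company": "Company",
}

STOPWORDS = {"the", "a", "an"}


def tokens(sentence):
    """Lowercase the sentence, strip punctuation, and drop stopwords."""
    words = (w.strip(".,;:!?").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOPWORDS]


def concepts(sentence):
    """Map each token to its KB concept, falling back to the token itself."""
    return {TOY_KB.get(t, t) for t in tokens(sentence)}


def string_match(a, b):
    """Surface-level comparison: the token sequences must be identical."""
    return tokens(a) == tokens(b)


def semantic_overlap(a, b):
    """Jaccard overlap between the concept sets of two sentences."""
    ca, cb = concepts(a), concepts(b)
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0


s1 = "The firm bought a rival."
s2 = "The company acquired a rival."
s3 = "The firm hired a lawyer."

print(string_match(s1, s2))      # differently worded, so no string match
print(semantic_overlap(s1, s2))  # same concepts after KB lookup
print(semantic_overlap(s1, s3))  # only the Company concept is shared
```

In this sketch the first two sentences share no verb or subject string, yet they map to the same concept set, which is the kind of repetition a purely string-based comparison would miss.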

Automatic document summarization is the problem of creating a surrogate for a document that adequately represents its full content. Automatic ontology generation requires information about candidate types, roles and relationships gathered from across a document or document collection. We aim at a summarization system that can replicate the quality of summaries created by humans, and at ontology creation systems that significantly reduce the human effort required for construction. Both applications depend for their success on extracting the essence of a collection of text. The work reported here demonstrates the utility of deep knowledge from Cyc for identifying redundant information in texts using both semantic and syntactic information.
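One way to picture combining syntactic and semantic information is sketched below, under stated assumptions. Subject-verb-object triples are assumed to come from an upstream parser (here they are written by hand), each element is normalized through a hypothetical toy dictionary (`TOY_KB`), and a triple counts as redundant when it appears in at least `min_docs` documents. The names, the threshold, and the counting scheme are illustrative and are not drawn from the chapter or from Cyc.

```python
from collections import defaultdict

# Hypothetical concept mappings, for illustration only.
TOY_KB = {
    "bought": "Buying",
    "acquired": "Buying",
    "purchased": "Buying",
    "firm": "Company",
    "company": "Company",
}


def normalize(triple):
    """Replace each element of a (subject, verb, object) triple with its
    KB concept when one is known, otherwise keep the lowercased word."""
    return tuple(TOY_KB.get(part.lower(), part.lower()) for part in triple)


def repeated_triples(docs, min_docs=2):
    """Return normalized triples found in at least `min_docs` documents,
    each mapped to the sorted list of documents that contain it."""
    support = defaultdict(set)
    for doc_id, triples in docs.items():
        for triple in triples:
            support[normalize(triple)].add(doc_id)
    return {t: sorted(ids) for t, ids in support.items() if len(ids) >= min_docs}


# Hand-written triples standing in for parser output.
docs = {
    "doc1": [("firm", "bought", "rival"), ("firm", "hired", "lawyer")],
    "doc2": [("company", "acquired", "rival")],
    "doc3": [("Company", "purchased", "rival"), ("court", "fined", "firm")],
}

repeated = repeated_triples(docs)
for triple, doc_ids in repeated.items():
    print(triple, doc_ids)
```

Triples that survive the threshold are candidates for content worth retaining in a multi-document summary, while triples seen in a single document are left out of the result.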



Author information

Authors and Affiliations

Cycorp, Inc., Austin, TX, 78731
David Baxter, Bryan Klimt, David Schneider & Michael Witbrock
Jozef Stefan Institute, Ljubljana, 1000, Slovenia
Dr. Marko Grobelnik & Dunja Mladenić


Corresponding author

Correspondence to Michael Witbrock.

Editor information

Editors and Affiliations

BTexact Technologies, 5/12 Orion, Ipswich, IP5 3RE Adastral Park, United Kingdom
John Davies
Jožef Stefan Institute, Jamova 39, SI-1000, Ljubljana, Slovenia
Marko Grobelnik & Dunja Mladenić


About this chapter

Cite this chapter

Baxter, D., Klimt, B., Grobelnik, M., Schneider, D., Witbrock, M., Mladenić, D. (2009). Capturing Document Semantics for Ontology Generation and Document Summarization. In: Davies, J., Grobelnik, M., Mladenić, D. (eds) Semantic Knowledge Management. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88845-1_11


DOI: https://doi.org/10.1007/978-3-540-88845-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88844-4
Online ISBN: 978-3-540-88845-1
eBook Packages: Computer Science, Computer Science (R0)
