Extracting knowledge from XML document repository: a semantic Web-based approach

Information Technology and Management

Abstract

XML plays an important role as the standard language for representing structured data on the traditional Web, and hence many Web-based knowledge management repositories store data and documents in XML. If the semantics of the data are formally represented in an ontology, then knowledge can be extracted: ontology definitions and axioms are applied to the XML data to automatically infer knowledge that is not explicitly represented in the repository. Ontologies also play a central role in realizing the burgeoning vision of the semantic Web, wherein data will be more sharable because their semantics will be represented in Web-accessible ontologies. In this paper, we demonstrate how an ontology can be used to extract knowledge from an exemplar XML repository of Shakespeare’s plays. We then implement an architecture for this ontology using de facto languages of the semantic Web, including OWL and RuleML, thus preparing the ontology for use in data sharing. It has been predicted that the early adopters of the semantic Web will develop ontologies that leverage XML, provide intra-organizational value (such as knowledge extraction capabilities) irrespective of the semantic Web, and have the potential for inter-organizational data sharing over the semantic Web. The contribution of our proof-of-concept application, KROX, is that it serves as a blueprint for other ontology developers who believe that the growth of the semantic Web will unfold in this manner.

Notes

  1. These ontology models are described in greater detail in [28].

  2. → denotes parent-of.

  3. Only some of the axioms needed to answer this competency question are shown in this section. The remaining axioms, as well as a walk-through of answering the competency question, are shown in the Appendix.

References

  1. H. Alani, S. Kim, D.E. Millard, M.J. Weal, W. Hall, P.H. Lewis and N.R. Shadbolt, Automatic ontology-based knowledge extraction from web documents, IEEE Intelligent Systems 18 (2003) 14–21.

  2. B. Amann, C. Beeri, I. Fundulaki and M. Scholl, Querying XML sources using an ontology-based mediator. Lecture Notes in Computer Science 2519 (2002) 429–448.

  3. J.C. Arpírez, O. Corcho, M. Fernández-López and A. Gómez-Pérez, WebODE in a nutshell. AI Magazine 24 (2003) 37–47.

  4. T. Berners-Lee, J. Hendler and O. Lassila, The semantic web, Scientific American 284 (2001) 34–43.

  5. S. Boag, D. Chamberlin, M. Fernandez, D. Florescu, J. Robie and J. Simeon, XQuery 1.0: An XML query language – W3C working draft, 29 October 2004. http://www.w3.org/tr/xquery, W3C, 2004 (Updated: October 29).

  6. H. Boley, S. Tabet and G. Wagner, Design rationale of RuleML: A markup language for semantic web rules, in Proceedings of First Semantic Web Working Symposium (SWWS’01), Stanford, CA, 2001.

  7. J. Bosak, The plays of Shakespeare. http://www.oasis-open.org/cover/bosakShakespeare200.html, Open Oasis.org, 1999 (last updated: July).

  8. C. Brewster, F. Ciravegna and Y. Wilks, User-centred ontology learning for knowledge management, Lecture Notes in Computer Science 2553 (2002) 203–207.

  9. A.E. Campbell and S.C. Shapiro, Ontological mediation: An overview, in: Proceedings of IJCAI Workshop on Basic Ontological Issues in Knowledge Sharing, Menlo Park, CA (1995).

  10. V. Christophides, G. Karvounarakis, I. Koffina, G. Kokkinidis, A. Magkanaraki, D. Plexousakis, G. Serfiotis and V. Tannen, The ICS-FORTH SWIM: A powerful semantic web integration middleware, in: Proceedings of the First International Workshop on Semantic Web and Databases (SWDB), Humboldt-Universitat, Berlin, Germany, 2003.

  11. J. Clark, XSL Transformations (XSLT) Version 1.0. http://www.w3.org/tr/xslt, W3C, 1999 (Updated: November 16).

  12. CommerceOne, xCBL.org: XML Common Business Library. Commerce One Inc., Pleasanton, CA, 2003.

  13. M. Erdmann and R. Studer, How to structure and access XML documents with ontologies, Data and Knowledge Engineering 36 (2001) 317–335.

  14. D. Faure and C. Nedellec, Knowledge acquisition of predicate-argument structures from technical texts using machine learning, in Presented at EKAW, Dagstuhl Castle, Germany, 1999.

  15. M. Fox, F. Fadel and J. Chionglo, A common-sense model of the enterprise, in Proceedings of the Industrial Engineering Research Conference, Atlanta, GA, 1993.

  16. R.J. Glushko, J.M. Tenenbaum and B. Meltzer, An XML framework for agent-based E-commerce, Communications of the ACM 42 (1999) 106.

  17. C.H. Goh, S. Bressan, S. Madnick and M. Siegel, Context interchange: New features and formalisms for the intelligent integration of information, ACM Transactions on Information Systems 17 (1999) 270–293.

  18. A. Gómez-Pérez, M. Fernández-López and O. Corcho, Ontological Engineering with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web, Springer, 2004.

  19. T.R. Gruber, Towards principles for the design of ontologies used for knowledge sharing, in: Proceedings of the International Workshop on Formal Ontology, Padova, Italy, 1993.

  20. M. Gruninger and M.S. Fox, The role of competency questions in enterprise engineering, in Proceedings of the IFIP WG5.7 Workshop on Benchmarking – Theory and Practice, Trondheim, Norway, June 1994.

  21. S. Handschuh, S. Staab and F. Ciravegna, S-CREAM – Semi-automatic creation of metadata. Lecture Notes in Computer Science 2473 (2002) 358–372.

  22. I. Horrocks, P.F. Patel-Schneider and F. van Harmelen, From SHIQ and RDF to OWL: The making of a web ontology language, Journal of Web Semantics 1 (2003) 7–26.

  23. ISDA, FpML™: The XML Standard for Swaps, Derivatives, and Structured Products. http://www.fpml.org, International Swaps and Derivatives Association, 2004 (last updated: November 19).

  24. J.-U. Kietz, A. Maedche and R. Volz, A method for semi-automatic ontology acquisition from a corporate intranet, in Proceedings of EKAW’00 Workshop on Ontologies and Text, Juan-Les-Pins, France, 2000.

  25. H.M. Kim, Predicting how the semantic web will evolve, Communications of the ACM 45 (2002) 48–54.

  26. H.M. Kim and M.S. Fox, Towards a data model for quality management web services: An ontology of measurement for enterprise modeling, Lecture Notes in Computer Science 2348 (2002) 230–244.

  27. H.M. Kim, Integrating business process-oriented and data-driven approaches for ontology development, in: Proceedings of the AAAI Spring Symposium Series 2000 – Bringing Knowledge to Business Processes, Stanford, CA, 2000.

  28. H.M. Kim, XML-hoo! A prototype application for intelligent query of XML documents using domain-specific ontologies, in Proceedings of 35th Annual Hawaii International Conference on Systems Science (HICSS-35), Hawaii, HI, 2002.

  29. P. Lehti and P. Fankhauser, XML data integration with OWL: Experiences and challenges, in Proceedings of International Symposium on Applications and the Internet, Fraunhofer Inst., Darmstadt, Germany, 2004.

  30. A. Maedche, S. Staab, R. Studer, Y. Sure and R. Volz, SEAL – Tying up information integration and web site management by ontologies, IEEE Computer Society Data Engineering Bulletin 25 (2002) 10–17.

  31. G. Modica, A. Gal and H. Jamil, The use of machine-generated ontologies in dynamic information seeking, in Proceedings of Cooperative Information Systems (CoopIS ’01), Trento, Italy, 2001.

  32. L. Narens, Abstract Measurement Theory. (MIT Press, Cambridge, MA 1985).

  33. N.F. Noy, M.S. Decker, M. Crubezy, R.W. Fergerson and M.A. Musen, Creating semantic web contents with Protégé-2000, IEEE Intelligent Systems 16 (2001) 60–71.

  34. S. Philippi and J. Kohler, Using XML technology for the ontology-based semantic integration of life science databases, IEEE Transactions on Information Technology in Biomedicine 8 (2004) 154–160.

  35. W. Shen, X. Li and A. Doan, Constraint-Based Entity Matching, in Proceedings of the American AI Conference (AAAI-05), Pittsburgh, PA, July 2005.

  36. M. Sintek, M. Junker, L. Elst and A. Abecker, Using information extraction rules for extending domain ontologies, in Proceedings of IJCAI-2001 Workshop on Ontology Learning, Seattle, 2001.

  37. H. Smith and K. Poulter, Share the ontology in XML-based trading architectures. Communications of the ACM 42, 1999.

  38. Y. Sure, M. Erdmann, J. Angele, R. Studer, S. Staab and D. Wenke, OntoEdit: Collaborative ontology development for the semantic web, Lecture Notes in Computer Science 2342 (2002).

  39. A. Tomasic, L. Raschid and P. Valduriez, Scaling access to heterogeneous data sources with DISCO, IEEE Transactions on Knowledge and Data Engineering 10 (1998) 808–823.

  40. M. Vargas-Vera, E. Motta, J. Domingue, S. B. Shum and M. Lanzoni, Knowledge extraction by using an ontology-based annotation tool, in Proceedings of the First International Conference on Knowledge Capture (K-CAP’01), Victoria, BC, Canada, 2001.

  41. Y. Wand and R. Weber, Towards a theory of deep structure of information systems, Journal of Information Systems (1995) 203–223.

  42. G. Wiederhold, Mediation in information systems, ACM Computing Surveys 27 (1995) 265–267.

  43. M. Uschold and M. Grüninger, Ontologies: Principles, methods, and applications, The Knowledge Engineering Review 11(2) (1996) 93–115.

Author information

Corresponding author

Correspondence to Henry M. Kim.

Appendix

Defn-4. character_has_pseudonym(C,Ps)

A primitive description set that describes a single character can give both the character’s name and a pseudonym.

\[
\begin{aligned}
&\mathsf{character\_has\_pseudonym(C,Ps) = \{ C, Ps \mid \exists Pd\, [} \\
&\qquad \mathsf{primitive\_description\_set\_has\_character(Pd,C) \wedge {}} \\
&\qquad \mathsf{primitive\_description\_set\_has\_pseudonym(Pd,Ps) ] \}}
\end{aligned}
\]
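Read operationally, Defn-4 is a join of two primitive relations on the shared primitive description set Pd. The following minimal sketch illustrates that reading; the sample facts (`pd1`, `pd2`, the character names and the pseudonym) and the representation of predicates as Python sets of tuples are assumptions made for illustration, not instances taken from the ontology.

```python
# Hypothetical instances of the two primitive predicates, as (Pd, value) pairs.
primitive_description_set_has_character = {("pd1", "PUCK"), ("pd2", "OBERON")}
primitive_description_set_has_pseudonym = {("pd1", "Robin Goodfellow")}


def character_has_pseudonym(has_character, has_pseudonym):
    """Return every (C, Ps) pair whose character and pseudonym share a Pd."""
    return {
        (c, ps)
        for (pd_c, c) in has_character
        for (pd_ps, ps) in has_pseudonym
        if pd_c == pd_ps  # plays the role of the existential over Pd
    }


pairs = character_has_pseudonym(
    primitive_description_set_has_character,
    primitive_description_set_has_pseudonym,
)
print(pairs)
```

A character without a pseudonym in its description set, such as the second sample entry, simply contributes no pair.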

Defn-5. \({{\mathbf{(a) related\_characters(C_{1}, Rn,Rp,C_{2})}}}\)

C1 has a relationship, expressed as relation noun(Rn) + preposition(Rp), with C2, if:

  • C1 and C2 are characters in the same play, and

  • C1 is explicitly stated as related to C2 or C2’s pseudonym, and

  • C1 is a character introduced individually, or is any of the characters in a group that has a relationship to C2.

\[
\begin{aligned}
&\mathbf{related\_characters(C_{1},Rn,Rp,C_{2}) =} \\
&\mathsf{\{ C_{1}, Rn, Rp, C_{2} \mid} \\
&\quad \mathsf{\exists P\, (play\_has\_character(P,C_{1}) \wedge play\_has\_character(P,C_{2})) \wedge {}} \\
&\quad \mathsf{\exists D\, (} \\
&\qquad \mathsf{(description\_has\_relationship(D,Rn,Rp,C_{2}) \vee {}} \\
&\qquad\quad \mathsf{\exists Cr\, (description\_has\_relationship(D,Rn,Rp,Cr) \wedge character\_has\_pseudonym(C_{2},Cr))) \wedge {}} \\
&\qquad \mathsf{(primitive\_description\_set\_has\_character(D,C_{1}) \vee {}} \\
&\qquad\quad \mathsf{\exists Pe\, \exists Pd\, (group\_has\_character\_description(D,Pe) \wedge {}} \\
&\qquad\qquad \mathsf{character\_description\_has\_primitive\_description\_set(Pe,Pd) \wedge {}} \\
&\qquad\qquad \mathsf{primitive\_description\_set\_has\_character(Pd,C_{1})))} \\
&\quad \mathsf{) \}}
\end{aligned}
\]
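The definition of related_characters (a) can be evaluated over a small fact base, as in the sketch below. It is an illustration under stated assumptions: the dictionary `FACTS`, its identifiers (`d_romeo`, `g_servants`, `pe_s`, and so on) and the helper functions are hypothetical, and predicates are represented as sets of tuples keyed by the predicate names used above.

```python
def characters_described_by(d, facts):
    """Characters introduced by description d, individually or via a group."""
    pds_char = facts["primitive_description_set_has_character"]
    direct = {c for (pd, c) in pds_char if pd == d}
    members = {pe for (g, pe) in facts["group_has_character_description"] if g == d}
    member_pds = {
        pd
        for (pe, pd) in facts["character_description_has_primitive_description_set"]
        if pe in members
    }
    return direct | {c for (pd, c) in pds_char if pd in member_pds}


def in_same_play(c1, c2, facts):
    """True if some play P has both c1 and c2 as characters."""
    plays = facts["play_has_character"]
    return any((p, c2) in plays for (p, c) in plays if c == c1)


def related_characters_a(facts):
    """Enumerate (C1, Rn, Rp, C2) tuples explicitly stated in descriptions."""
    result = set()
    for (d, rn, rp, target) in facts["description_has_relationship"]:
        # The stated target is C2 itself, or a pseudonym of C2.
        c2_candidates = {target} | {
            c for (c, ps) in facts["character_has_pseudonym"] if ps == target
        }
        for c1 in characters_described_by(d, facts):
            for c2 in c2_candidates:
                if in_same_play(c1, c2, facts):
                    result.add((c1, rn, rp, c2))
    return result


# Hypothetical fact base: one individual description and one group description.
FACTS = {
    "play_has_character": {
        ("rj", "ROMEO"), ("rj", "MONTAGUE"), ("rj", "SAMPSON"),
        ("rj", "GREGORY"), ("rj", "CAPULET"),
    },
    "description_has_relationship": {
        ("d_romeo", "son", "to", "MONTAGUE"),
        ("g_servants", "servant", "to", "CAPULET"),
    },
    "character_has_pseudonym": set(),
    "primitive_description_set_has_character": {
        ("d_romeo", "ROMEO"), ("pd_s", "SAMPSON"), ("pd_g", "GREGORY"),
    },
    "group_has_character_description": {
        ("g_servants", "pe_s"), ("g_servants", "pe_g"),
    },
    "character_description_has_primitive_description_set": {
        ("pe_s", "pd_s"), ("pe_g", "pd_g"),
    },
}

related = related_characters_a(FACTS)
```

In this sketch a relationship stated once for a group is attributed to each member of the group, which mirrors the third condition of the definition.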

Defn-6. \({{\mathbf{(b) related\_characters(C_{1}, Rn,Rp,C_{2})}}}\)

C1 has a relationship, expressed as relation noun(Rn) + preposition(Rp), with C2 if:

  • C1 is a pseudonym for a character whose relationship with C2 can be inferred, or

  • C2 is a pseudonym for a character whose relationship with C1 can be inferred, or

  • C1 and C2 are pseudonyms for characters whose relationship with each other can be inferred.

\[
\begin{aligned}
&\mathbf{related\_characters(C_{1},Rn,Rp,C_{2}) =} \\
&\mathsf{\{ C_{1}, Rn, Rp, C_{2} \mid \exists C_{a}\, \exists C_{b}\, (} \\
&\quad \mathsf{(related\_characters(C_{1},Rn,Rp,C_{b}) \wedge character\_has\_pseudonym(C_{b},C_{2})) \vee {}} \\
&\quad \mathsf{(related\_characters(C_{a},Rn,Rp,C_{2}) \wedge character\_has\_pseudonym(C_{a},C_{1})) \vee {}} \\
&\quad \mathsf{(related\_characters(C_{a},Rn,Rp,C_{b}) \wedge character\_has\_pseudonym(C_{a},C_{1}) \wedge character\_has\_pseudonym(C_{b},C_{2}))} \\
&\mathsf{) \}}
\end{aligned}
\]
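Because this definition refers to related_characters on both sides, one way to evaluate it is as a fixpoint: start from the explicitly stated tuples and keep substituting pseudonyms until nothing new appears. The sketch below assumes that approach; the function name `extend_with_pseudonyms` and the sample names are hypothetical.

```python
def extend_with_pseudonyms(related, character_has_pseudonym):
    """Close a set of (C1, Rn, Rp, C2) tuples under pseudonym substitution."""
    result = set(related)
    changed = True
    while changed:
        changed = False
        for (ca, rn, rp, cb) in list(result):
            for (c, ps) in character_has_pseudonym:
                new = set()
                if c == cb:
                    new.add((ca, rn, rp, ps))  # C2 replaced by its pseudonym
                if c == ca:
                    new.add((ps, rn, rp, cb))  # C1 replaced by its pseudonym
                if not new <= result:
                    result |= new
                    changed = True
    return result


# Hypothetical input: one stated relationship and one pseudonym per character.
stated = {("A", "son", "to", "B")}
pseudonyms = {("A", "a2"), ("B", "b2")}
closed = extend_with_pseudonyms(stated, pseudonyms)
```

The case in which both arguments are pseudonyms is reached by two successive single substitutions, so it needs no separate branch. The loop terminates because the set of names is finite.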

Defn-7. \({{\mathbf{ (a) may\_be\_related\_characters(C_{1}, Rn, Rp, C_{2})}}}\)

C1 may have a relationship, expressed as relation noun(Rn) + preposition(Rp), with C2, if:

  • C1 and C2’s relationship (Rn + Rp) cannot be inferred with certainty, and

  • C1 and C2 are characters in the same play, and

  • C1 is explicitly stated as related to C2’s qualifying title or location qualifier, and

  • C2 is a character introduced individually, or is any of the characters in a group, and

  • C1 is a character introduced individually, or is any of the characters in a group that has a relationship to C2.

\[
\begin{aligned}
&\mathbf{may\_be\_related\_characters(C_{1},Rn,Rp,C_{2}) =} \\
&\mathsf{\{ C_{1}, Rn, Rp, C_{2} \mid \neg related\_characters(C_{1},Rn,Rp,C_{2}) \wedge {}} \\
&\quad \mathsf{\exists P\, (play\_has\_character(P,C_{1}) \wedge play\_has\_character(P,C_{2})) \wedge {}} \\
&\quad \mathsf{\exists C\, \exists D\, \exists D_{2}\, (} \\
&\qquad \mathsf{description\_has\_relationship(D,Rn,Rp,C) \wedge {}} \\
&\qquad \mathsf{(description\_has\_qualifying\_title(D_{2},C) \vee description\_has\_location\_qualifier(D_{2},C)) \wedge {}} \\
&\qquad \mathsf{(primitive\_description\_set\_has\_character(D_{2},C_{2}) \vee {}} \\
&\qquad\quad \mathsf{\exists Pe_{2}\, \exists Pd_{2}\, (group\_has\_character\_description(D_{2},Pe_{2}) \wedge {}} \\
&\qquad\qquad \mathsf{character\_description\_has\_primitive\_description\_set(Pe_{2},Pd_{2}) \wedge {}} \\
&\qquad\qquad \mathsf{primitive\_description\_set\_has\_character(Pd_{2},C_{2}))) \wedge {}} \\
&\qquad \mathsf{(primitive\_description\_set\_has\_character(D,C_{1}) \vee {}} \\
&\qquad\quad \mathsf{\exists Pe\, \exists Pd\, (group\_has\_character\_description(D,Pe) \wedge {}} \\
&\qquad\qquad \mathsf{character\_description\_has\_primitive\_description\_set(Pe,Pd) \wedge {}} \\
&\qquad\qquad \mathsf{primitive\_description\_set\_has\_character(Pd,C_{1})))} \\
&\quad \mathsf{) \}}
\end{aligned}
\]
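A possible relationship arises when a description points at a qualifier (a title or a location) rather than at a named character, and some other description carries that qualifier. The sketch below evaluates the definition over a hypothetical fact base; the identifiers (`d_paris`, `d_escalus`), the qualifier string `"prince"` and the helper functions are assumptions for illustration only.

```python
def characters_described_by(d, facts):
    """Characters introduced by description d, individually or via a group."""
    pds_char = facts["primitive_description_set_has_character"]
    direct = {c for (pd, c) in pds_char if pd == d}
    members = {pe for (g, pe) in facts["group_has_character_description"] if g == d}
    member_pds = {
        pd
        for (pe, pd) in facts["character_description_has_primitive_description_set"]
        if pe in members
    }
    return direct | {c for (pd, c) in pds_char if pd in member_pds}


def in_same_play(c1, c2, facts):
    """True if some play P has both c1 and c2 as characters."""
    plays = facts["play_has_character"]
    return any((p, c2) in plays for (p, c) in plays if c == c1)


def may_be_related_characters(facts, related):
    """Tuples linked only through a qualifying title or location qualifier."""
    qualifiers = (
        facts["description_has_qualifying_title"]
        | facts["description_has_location_qualifier"]
    )
    result = set()
    for (d, rn, rp, c) in facts["description_has_relationship"]:
        for (d2, qualifier) in qualifiers:
            if qualifier != c:
                continue
            for c1 in characters_described_by(d, facts):
                for c2 in characters_described_by(d2, facts):
                    candidate = (c1, rn, rp, c2)
                    # Exclude anything already inferred with certainty.
                    if in_same_play(c1, c2, facts) and candidate not in related:
                        result.add(candidate)
    return result


# Hypothetical fact base: PARIS is described as kinsman to "prince",
# and ESCALUS is described with the qualifying title "prince".
FACTS = {
    "play_has_character": {("rj", "PARIS"), ("rj", "ESCALUS")},
    "description_has_relationship": {("d_paris", "kinsman", "to", "prince")},
    "description_has_qualifying_title": {("d_escalus", "prince")},
    "description_has_location_qualifier": set(),
    "primitive_description_set_has_character": {
        ("d_paris", "PARIS"), ("d_escalus", "ESCALUS"),
    },
    "group_has_character_description": set(),
    "character_description_has_primitive_description_set": set(),
}

possible = may_be_related_characters(FACTS, related=set())
```

Passing the set of certain relationships as `related` implements the negated conjunct: a tuple that is already known for sure is never reported as merely possible.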

Defn-8. \(\mathbf{has\_son(C_{1},C_{2}) =}\)

\(\mathsf{\{ C_{1}, C_{2} \mid related\_characters(C_{2},\text{'son'},\text{'of'},C_{1}) \vee related\_characters(C_{2},\text{'son'},\text{'to'},C_{1}) \}}\)

Defn-9. \(\mathbf{has\_father(C_{1},C_{2}) =}\)

\(\mathsf{\{ C_{1}, C_{2} \mid related\_characters(C_{2},\text{'father'},\text{'of'},C_{1}) \vee related\_characters(C_{2},\text{'father'},\text{'to'},C_{1}) \}}\)

Defn-10. \(\mathbf{male(C) =}\)

\(\mathsf{\{ C \mid \exists C_{1}\, (has\_son(C_{1},C) \vee has\_father(C_{1},C)) \}}\)

Defn-11. \(\mathbf{has\_child(C_{1},C_{2}) =}\)

\(\mathsf{\{ C_{1}, C_{2} \mid has\_son(C_{1},C_{2}) \vee has\_father(C_{2},C_{1}) \}}\)
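Defn-8 to Defn-11 are simple projections and unions over related_characters tuples, which the following sketch makes explicit. The input set `related` and its two sample tuples are hypothetical; the functions follow the argument order of the definitions, so has_son(C1, C2) holds when C2 is stated to be the son of (or to) C1.

```python
def has_son(related):
    """(C1, C2) pairs where C2 is stated to be the son of/to C1."""
    return {
        (c1, c2)
        for (c2, rn, rp, c1) in related
        if rn == "son" and rp in ("of", "to")
    }


def has_father(related):
    """(C1, C2) pairs where C2 is stated to be the father of/to C1."""
    return {
        (c1, c2)
        for (c2, rn, rp, c1) in related
        if rn == "father" and rp in ("of", "to")
    }


def male(related):
    """Characters that appear as someone's son or someone's father."""
    return {c for (_, c) in has_son(related) | has_father(related)}


def has_child(related):
    """(C1, C2) pairs where C2 is a son of C1, or C1 is the father of C2."""
    return has_son(related) | {(c1, c2) for (c2, c1) in has_father(related)}


# Hypothetical related_characters tuples (C1, Rn, Rp, C2).
related = {
    ("ROMEO", "son", "to", "MONTAGUE"),
    ("CAPULET", "father", "to", "JULIET"),
}
```

Note that has_child reverses the has_father pairs, since has_father(C2, C1) reads "C2 has father C1".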

Many such relationship terms can be defined in the same way, e.g. daughter of, mother of, uncle of, or an additional definition of parent of. Possible familial relationships can likewise be defined using may_be_related_characters.

Definitions for answering CQ-2 and CQ-3 are straightforward, and so are not presented. The predicate play_has_character has been defined, so CQ-4 can be answered.

In the next section, these axioms are applied to answer competency questions (Fig. 11).

Fig. 11 Excerpt from the XML document of ‘Romeo and Juliet’ [7]

1.1 Demonstration of competency

Following are some primitive terms.

1.2 Relevant primitive term instances

With these instances, the following competency question can be answered:

CQ-1. Which character is the son of the Montague character?

Answering CQ-1
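As a minimal, hypothetical sketch of how CQ-1 could be answered mechanically, the code below extracts "X, relation to/of Y" statements from a hand-written XML fragment and then queries them. The fragment, the `PERSONAE` and `PERSONA` tag names, the regular expression and the function names are all assumptions made for illustration; this is a simplification that stands in for the ontology's primitive terms and axioms, not a reproduction of them.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written fragment; tag names and wording are assumptions for illustration.
XML = """<PERSONAE>
<PERSONA>MONTAGUE</PERSONA>
<PERSONA>ROMEO, son to Montague.</PERSONA>
<PERSONA>MERCUTIO, kinsman to the prince.</PERSONA>
</PERSONAE>"""

# Matches "NAME, relation to|of Target." and captures the four parts.
PATTERN = re.compile(
    r"^(?P<c1>[A-Z ]+), (?P<rn>\w+) (?P<rp>of|to) (?P<c2>[\w ]+?)\.?\s*$"
)


def extract_relationships(xml_text):
    """Return (C1, Rn, Rp, C2) tuples whose target is a named character."""
    root = ET.fromstring(xml_text)
    names = {p.text.split(",")[0].strip() for p in root.iter("PERSONA")}
    related = set()
    for persona in root.iter("PERSONA"):
        m = PATTERN.match(persona.text.strip())
        # Keep only relationships whose target names a listed character.
        if m and m["c2"].upper() in names:
            related.add((m["c1"], m["rn"], m["rp"], m["c2"].upper()))
    return related


def sons_of(parent, related):
    """Characters stated to be a son of/to the given parent."""
    return {
        c1
        for (c1, rn, rp, c2) in related
        if rn == "son" and rp in ("of", "to") and c2 == parent
    }


related = extract_relationships(XML)
answer = sons_of("MONTAGUE", related)
```

In this sketch the statement about "the prince" is dropped because its target is not a listed character name; handling such targets is the role of may_be_related_characters.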

Cite this article

Kim, H.M., Sengupta, A. Extracting knowledge from XML document repository: a semantic Web-based approach. Inf Technol Manage 8, 205–221 (2007). https://doi.org/10.1007/s10799-007-0017-7
