An ontology based approach to the integration of entity–relationship schemas
Introduction
Schema integration involves merging several schemas into an integrated schema. More precisely, Ref. [4] defines schema integration as “the activity of integrating the schemas of existing or proposed databases into a global, unified schema”. It is regarded as an important step in building a heterogeneous database system [5], [21] (also called a multidatabase system or federated database system), in integrating data in a data warehouse, and in integrating user views in database design. In schema integration, researchers have identified different kinds of semantic heterogeneity among component schemas: naming conflicts (homonyms and synonyms), key conflicts, structural conflicts [3], [15], and constraint conflicts [14], [19].
A less studied problem is schematic discrepancy, i.e., the same information is modelled as data in one database but as metadata in another. The following example illustrates schematic discrepancy in ER schemas. To focus our contribution and simplify the presentation, schematic discrepancy is the only kind of conflict among the schemas in the example below.

Example 1. Suppose we want to integrate the supply information of products from three databases DB1, DB2 and DB3 (Fig. 1). These databases record similar information, i.e., product numbers, product names, suppliers and the supplying prices in each month. In DB1, the supply relationships are modelled as a ternary relationship type SUP. In DB2, the entity type JAN_PROD models the products supplied in the month of January, and the attributes S1_PRICE, … , Sn_PRICE give the prices of the products from the suppliers S1, … , Sn. For example, the attribute S1_PRICE of the entity type JAN_PROD gives the prices of the products supplied in January by the supplier S1. In DB3, the relationship type JAN_SUP models the supply relationships between products and suppliers in January. Note that JAN_SUP is a selection of the ternary relationship type SUP of DB1 (when the value of M# is ‘JAN’).

In relational databases, these ER schemas correspond to the following relational schemas (i.e., each entity type having more than one attribute and each relationship type is transformed into a relation):

DB1: PROD(P#, PNAME), SUP(P#, S#, M#, PRICE)
DB2: JAN_PROD(P#, PNAME, S1_PRICE, … , Sn_PRICE), … , DEC_PROD(P#, PNAME, S1_PRICE, … , Sn_PRICE)
DB3: PROD(P#, PNAME), JAN_SUP(P#, S#, PRICE), … , DEC_SUP(P#, S#, PRICE)
The schemas of Fig. 1 are schematically discrepant from each other. For example, the values of the attribute M# in DB1 correspond to the metadata of the relationship types (i.e., the names of the relationship types) in DB3. The values of the attribute M# in DB1 correspond to the metadata of the entity types in DB2, and the values of the attribute S# in DB1 correspond to the metadata of the attributes S1_PRICE, … , Sn_PRICE in DB2.
In Section 4, we will resolve schematic discrepancies by transforming metadata into attribute values, e.g., transforming DB2 and DB3 into the form of DB1, and then merging the transformed schemas. The statements on the right side of Fig. 1 specify the meta information of the schemas using a shared ontology, which will be explained in Section 3.
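At the data level, the idea of turning metadata into attribute values can be illustrated with a small sketch. This is not the resolution algorithm of Section 4, which works on ER schemas and contexts; it only shows, for the relational forms of Example 1, how month names taken from relation names and supplier names taken from attribute names become values of M# and S#. The function names `unfold_db2` and `unfold_db3`, the dictionary-based row format and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# A tuple of DB1's SUP relation: (P#, S#, M#, PRICE)
SupRow = Tuple[str, str, str, float]


def unfold_db2(relations: Dict[str, List[dict]]) -> List[SupRow]:
    """Turn DB2-style relations (e.g. JAN_PROD) into SUP tuples.

    The month comes from the relation name and the supplier from the
    attribute name, so both pieces of metadata become attribute values.
    """
    rows: List[SupRow] = []
    for rel_name, tuples in relations.items():
        month = rel_name.split("_")[0]                 # 'JAN_PROD' -> 'JAN'
        for t in tuples:
            for attr, price in t.items():
                # Skip non-price attributes and missing prices.
                if attr.endswith("_PRICE") and price is not None:
                    supplier = attr[: -len("_PRICE")]  # 'S1_PRICE' -> 'S1'
                    rows.append((t["P#"], supplier, month, price))
    return rows


def unfold_db3(relations: Dict[str, List[dict]]) -> List[SupRow]:
    """Turn DB3-style relations (e.g. JAN_SUP) into SUP tuples."""
    rows: List[SupRow] = []
    for rel_name, tuples in relations.items():
        month = rel_name.split("_")[0]                 # 'JAN_SUP' -> 'JAN'
        for t in tuples:
            rows.append((t["P#"], t["S#"], month, t["PRICE"]))
    return rows


# Hypothetical sample data: the same fact stored in DB2 and in DB3.
db2 = {"JAN_PROD": [{"P#": "p1", "PNAME": "bolt",
                     "S1_PRICE": 10.0, "S2_PRICE": None}]}
db3 = {"JAN_SUP": [{"P#": "p1", "S#": "S1", "PRICE": 10.0}]}

sup_from_db2 = unfold_db2(db2)
sup_from_db3 = unfold_db3(db3)
```

With this sample data both calls produce the same SUP tuple, which is what makes a later merge of the transformed schemas straightforward.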
Schematic discrepancy arises frequently since the names of schema constructs often capture some intuitive semantic information. Real examples of such disparity abound [11], [12], [18]. Originally raised as a conflict to be resolved in schema integration, schematically discrepant structures have been used to solve some interesting problems:
- In [18], Miller identified three scenarios in which schematic discrepancies may occur, i.e., database integration, data publication on the web and physical data independence.
- In e-commerce, Agrawal et al. [2] argued that the new generation of e-commerce applications requires data schemas that are constantly evolving and sparsely populated. They believed that a vertical representation of objects (in which attribute names are modelled as data values) performs much better in storage and querying than the conventional horizontal row representation. On the other hand, to facilitate writing queries, they need to create horizontal views of the vertical tables.
- In data warehousing, users usually need to generate two-dimensional report tables which are schematically discrepant from the fact data.
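The relationship between a vertical representation and its horizontal view can be sketched as follows. This is a toy illustration, not the implementation of Agrawal et al. [2]; the triple layout `(object_id, attribute, value)`, the function name `horizontal_view` and the sample rows are assumptions.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple


def horizontal_view(
    vertical_rows: Iterable[Tuple[str, str, Any]]
) -> Dict[str, Dict[str, Any]]:
    """Pivot (object_id, attribute, value) triples into one record per object.

    In the vertical table the attribute name is a data value; in the
    resulting view it becomes a column name, i.e. metadata.
    """
    view: Dict[str, Dict[str, Any]] = defaultdict(dict)
    for object_id, attribute, value in vertical_rows:
        view[object_id][attribute] = value
    return dict(view)


# Hypothetical sparsely populated data: o2 has no 'size' attribute.
vertical = [
    ("o1", "color", "red"),
    ("o1", "size", "M"),
    ("o2", "color", "blue"),
]
records = horizontal_view(vertical)
```

Objects that lack an attribute simply have no entry for it in the view, which mirrors why the vertical form suits sparsely populated schemas.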
We adopt a semantic approach to resolve schematic discrepancies in the integration of ER schemas. One of the outstanding features of our proposal is that we preserve cardinality constraints in the transformation/integration of ER schemas. Cardinality constraints, in particular, functional dependencies (FDs) and multivalued dependencies (MVDs), are useful in verifying lossless schema transformation [8], schema normalization and semantic query optimization [9], [19] in multidatabase systems.
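To make the notion of a functional dependency concrete, the sketch below checks whether an FD holds in a set of tuples. It assumes, for illustration only, that in DB1's SUP relation the price is determined by product, supplier and month, i.e. {P#, S#, M#} → PRICE; this dependency is not stated explicitly above. The helper name `holds_fd` and the sample rows are likewise hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def holds_fd(rows: List[Dict[str, Hashable]],
             lhs: Sequence[str],
             rhs: Sequence[str]) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # Record the first rhs value for this lhs value; any later tuple
        # with the same lhs value must agree with it.
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical SUP tuples.
sup = [
    {"P#": "p1", "S#": "S1", "M#": "JAN", "PRICE": 10.0},
    {"P#": "p1", "S#": "S1", "M#": "FEB", "PRICE": 11.0},
    {"P#": "p1", "S#": "S2", "M#": "JAN", "PRICE": 9.5},
]

fd_ok = holds_fd(sup, ["P#", "S#", "M#"], ["PRICE"])

# A second January price for (p1, S1) violates the assumed FD.
fd_broken = holds_fd(
    sup + [{"P#": "p1", "S#": "S1", "M#": "JAN", "PRICE": 12.0}],
    ["P#", "S#", "M#"], ["PRICE"],
)
```

A check of this kind could be run on the instances before and after a transformation to see, on sample data, whether a constraint survives; it does not replace the proofs of Section 5.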
The rest of the paper is organized as follows. Section 2 introduces the ER approach and an ontology-based approach to the integration of ER schemas. Sections 3, 4 and 5 are the main contributions of this paper. In Section 3, we introduce the concepts of ontology and context, which are used to specify the meta information of ER schemas. In Section 4, we present algorithms to resolve different kinds of schematic discrepancies in schema integration. In Section 5, we show that our resolution algorithms preserve information and cardinality constraints in schema transformation. In Section 6, we compare our work with related work. Section 7 concludes the paper.
ER approach
In the ER model, an entity is an object in the real world that can be distinctly identified. An entity type is a collection of similar entities that have the same set of predefined common attributes. Attributes can be single-valued, i.e., 1:1 (one-to-one) or m:1 (many-to-one), or multivalued, i.e., 1:m (one-to-many) or m:m (many-to-many). A minimal set of attributes of an entity type E which uniquely identifies E is called a key of E. An entity type may have more than one key and we
Ontology and context
We treat an ontology as the specification of the representational vocabulary for a shared domain of discourse which includes the definitions of entity types, relationship types, attributes of entity types and attributes of relationship types. We present ontologies at a conceptual level, which could be implemented by ontology languages.
For example, suppose an ontology SupOnto describes the concepts in the universe of product supply. It includes entity types product, month, supplier, a ternary
Resolution of schematic discrepancies
In this section, we resolve schematic discrepancies in the integration of ER schemas. In particular, we present four algorithms to resolve schematic discrepancies for entity types, relationship types, attributes of entity types and attributes of relationship types respectively. This is done by transforming discrepant meta-attributes into attributes of entity types. The transformation keeps the cardinalities of attributes and entity types, and therefore preserves FDs and MVDs (Section 5). Note
Semantics preserving transformation
In this section, we will show that Algorithm ResolveEnt (Section 4.1), which resolves the schematic discrepancies of entity types, preserves information and cardinality constraints. The same property holds for the other three algorithms; their proofs are omitted as they are similar to that of Algorithm ResolveEnt.
Related work
Context is the key component in capturing the semantics related to the definition of an entity type, a relationship type or an attribute. The definition of context as a set of meta-attributes with values was originally adopted in [6], [20], but was used there to solve different kinds of semantic heterogeneities. Our work complements rather than competes with theirs. Further, their work is based on context at the attribute level only. We consider the contexts at different levels, and the inheritance
Conclusion and future work
Information integration provides a competitive advantage to businesses, and has become a major area of investment by software companies today [17]. In this paper, we resolve a common problem in schema integration, schematic discrepancy in general, using the paradigm of context. We define context as a set of meta-attributes with values, which could be at the levels of databases, entity types, relationship types, and attributes. We design algorithms to resolve schematic discrepancies by transforming
Qi He is a Ph.D. candidate in the Department of Computer Science, School of Computing at the National University of Singapore. He received his B.Sc. in Computer Science from Fudan University (Shanghai, China) in 2001. His research interests include schema integration, data integration, and data dependency.
References (23)
- et al., Formulating global integrity constraints during derivation of global schema, Data Knowledge Eng. (1995)
- et al., Foundations of Databases (1995)
- R. Agrawal, A. Somani, and Y.R. Xu, Storing and querying of e-commerce data, in: VLDB, 2001, pp....
- et al., A methodology for data schema integration in the entity–relationship model, IEEE Trans. Software Eng. (1984)
- et al., A comparative analysis of methodologies for schema integration, ACM Comput. Surveys (1986)
- et al., Management of Heterogeneous and Autonomous Database Systems (1999)
- et al., Context interchange: new features and formalisms for the intelligent integration of information, ACM Trans. Inform. Syst. (1999)
- G. Gottlob, Computing covers for embedded functional dependencies, in: SIGMOD,...
- Q. He, T.W. Ling, Extending and inferring functional dependencies in schema transformation, in: CIKM,...
- et al., Semantic query optimization for query plans of heterogeneous multidatabase systems, TKDE (2000)
- Semantic and schematic similarity between database objects: a context-based approach, VLDB J.
Tok Wang Ling is a Professor in the Department of Computer Science, School of Computing at the National University of Singapore, Singapore. His research interests include Data Modeling, Entity–Relationship Approach, Object-Oriented Data Model, Normalization Theory, Logic and Database, Integrity Constraint Checking, Semistructured Data Model, and Data Warehousing. He has published more than 150 international journal/conference papers and chapters in books, and co-authored a book, mainly in data modeling. He also co-edited 12 conference and workshop proceedings. He organized and served as program committee co-chair of DASFAA’95, DOOD’95, ER’98, WISE 2002, and ER 2003. He organized and served/serves as conference co-chair of the Human.Society@Internet conference (HSI) in 2001, 2003, and 2005, WAIM 2004, ER 2004, and DASFAA 2005. He is the Honorary Conference Chair of DASFAA 2006. He has served on the program committees of more than 100 international database conferences since 1985. He is the Advisor of the steering committee of the International Conference on Database Systems for Advanced Applications (DASFAA), and a member of the steering committees of the International Conference on Conceptual Modeling (ER) and the International Conference on Human.Society@Internet (HSI). He was chair and vice chair of the steering committees of the ER conference and the DASFAA conference, and was a member of the steering committee of the International Conference on Deductive and Object-Oriented Databases (DOOD). He is an editor of the journals Data & Knowledge Engineering, International Journal of Cooperative Information Systems, Journal of Database Management, Journal of Data Semantics, and World Wide Web: Internet and Web Information Systems. He is a member of ACM, IEEE, and the Singapore Computer Society.