Elsevier

Knowledge-Based Systems

Volume 75, February 2015, Pages 41-51
Schema matching based on position of attribute in query statement

https://doi.org/10.1016/j.knosys.2014.11.005

Abstract

Attribute-level schema matching is a critical step in numerous database applications, such as DataSpaces, ontology merging and schema integration. There has been much research on this topic, but existing work ignores evidence about the positions of attributes in query statements, which is crucial for finding high-quality matches between schema attributes. In this paper, we propose a novel matching technique based on the positions at which attributes appear in the schema structure of query results. The position of an attribute in query results reflects how important that attribute is to a user browsing the results. The core idea of our approach is to collect statistics about attribute positions from query logs and use them to find correspondences between attributes (matches). Our method works in three phases. First, we design a matrix to record the statistics about attribute positions. Then, we employ two scoring functions to measure the similarity between the collected statistics of the two schemas to be matched. Finally, we employ a traditional algorithm to find the optimal mapping. Furthermore, our approach can be combined with other existing matchers to obtain more accurate matching results. An experimental study shows that our approach is effective and has good performance.

Introduction

Schema matching is an essential building block for sharing multiple heterogeneous data sources through a unified access interface. The basic task of schema matching is to find correspondences, called matches, between attributes in source schemas and attributes in target schemas. Matches are important for creating a unified mediated schema over multiple source schemas, exchanging data from one schema to another, and sharing data in similar domains. This topic has received significant attention in the literature, and a rich body of techniques has been proposed [41], [40], [38], [24], [23], [21], [19], [13], [14], [5], [4].

The schemas to be matched are typically designed by different developers with different habits and experience, so they often have diverse structures and representations, which makes schema matching difficult. In addition, schemas may contain dozens of tables and thousands of attributes, which further increases the difficulty. Even when some domain expertise is available, schema matching may not be easy. As a result, schema matching remains a challenge and a valuable research problem in practice.

There is a multitude of techniques in the schema matching area, also called matchers, e.g., [38], [24], [23], [19], [14]. However, no matcher yet returns 100% accurate matches, and consequently further effort in this area is needed. In this paper, we propose a novel matching technique that uses the positions at which attributes appear in the schema structure of query results to find matches. People are used to reading text from left to right, and this habit shapes how they take in information. For example, given a spreadsheet listing records of phones, people begin reading with the first column, then the second column, and so on. This habit offers guidance to developers who work on applications that present structured information and who are designing a schema structure: the more important columns should be placed closer to the left side of the schema structure. For example, the column “phoneModel” in the above spreadsheet will be placed to the left of the column “phonePrice”; likewise, the column “departure time” is more likely to appear to the left of the column “arriving time” in a railway timetable. Sometimes a default ordering rule of an industry is also reflected in the schema structure. We browsed four websites selling mobile phones and searched for a phone of a specific brand; the schema structures of the returned results are shown in Fig. 1. We can see that the attributes of schemas from different websites have almost the same order in their respective query results. Evidently, the developers of these websites arrange the attributes close to the left side according to the reading habit above. The reading habit is of little use for ordering the attributes close to the right side, yet these attributes still have a similar order; the reason is the default ordering rules of the specific industry. Thus, these habits, comprising the reading habit and the default rules, to which schemas to be matched in similar domains typically conform, should be used to find matches.

As the discussion above shows, different attributes differ in their importance for structuring the query results shown to end users; that is, each attribute tends to hold its own position in the schema structure of query results. As a result, we can regard the statistics about the positions of an attribute in a large number of query results as an identification of its semantics. Each query result derives from a corresponding query statement in the query logs. Consequently, our core idea is to collect statistics about the positions of attributes from the query logs to find matches. Our approach has three phases. First, we collect statistics about the positions of attributes by scanning the queries in the query logs of the schemas to be matched, and record them in a matrix called the feature matrix. Second, we consider three kinds of cardinality constraints for mappings: one-to-one mapping, onto mapping and partial mapping. Based on these constraints, we employ two scoring functions to measure the similarity of the feature matrices of different schemas. Finally, given the scoring functions, the Ant Colony Optimization algorithm is used to find the optimal attribute mapping. Our approach can be used as an auxiliary technique for current mainstream matching techniques, such as text-based matching and instance-based matching. If our approach is combined with other existing matchers, the accuracy of matching results will be improved significantly. This paper makes the following contributions:

  1. We discover that the positions of an attribute in schema structures of query results contain some semantic information that can be used in schema matching.

  2. We exploit the feature matrices to record the statistics about positions of attributes, which are collected from query logs.

  3. We consider two kinds of scoring functions to measure the similarities of the feature matrices of different schemas.

  4. We perform an extensive experimental study and the experimental results show that the proposed algorithm has good performance.

The rest of this paper is organized as follows. Section 2 introduces how to extract features of attribute positions. Section 3 discusses the scoring functions and the search algorithm. The experimental results are given in Section 4. Related work is briefly reviewed in Section 5. Finally, we conclude in Section 6.

Section snippets

Features extraction of attribute position

This section introduces the first phase of our work. Given two schemas to be matched, we scan each query in the logs to collect statistics about the positions of attributes. We then design two types of matrices to record these statistics.
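The scanning step can be illustrated with a minimal sketch. It assumes the query log is a list of plain SQL SELECT statements with simple comma-separated attribute lists, and it counts how often each attribute occurs at each position of the SELECT list. The function names `select_list` and `position_counts` and the sample log are hypothetical, and the resulting count matrix is only an illustration; it is not claimed to be either of the two matrix types designed in this paper.

```python
import re

def select_list(query):
    """Return the attribute names of the SELECT list, in order (naive parser)."""
    m = re.search(r"select\s+(.*?)\s+from\b", query, flags=re.IGNORECASE | re.DOTALL)
    if m is None:
        return []
    return [a.strip() for a in m.group(1).split(",") if a.strip()]

def position_counts(queries, attributes):
    """counts[i][p] = number of queries where attributes[i] is at SELECT position p."""
    index = {a: i for i, a in enumerate(attributes)}
    width = len(attributes)
    counts = [[0] * width for _ in attributes]
    for q in queries:
        for pos, attr in enumerate(select_list(q)):
            # Ignore unknown attributes and positions beyond the matrix width.
            if attr in index and pos < width:
                counts[index[attr]][pos] += 1
    return counts

# Hypothetical query log over a phone schema (illustrative data only).
log = [
    "SELECT phoneModel, phonePrice FROM phones WHERE brand = 'X'",
    "SELECT phoneModel, phoneColor, phonePrice FROM phones",
    "SELECT phonePrice FROM phones WHERE phoneModel = 'Y'",
]
attrs = ["phoneModel", "phonePrice", "phoneColor"]
matrix = position_counts(log, attrs)
for name, row in zip(attrs, matrix):
    print(name, row)
```

In this toy log, "phoneModel" occurs only in the first position, which is the kind of positional signal the approach relies on. A real log would need a proper SQL parser to handle aliases, expressions and nested queries.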

As our motivation shows, the positions of an attribute in schema structures can be seen as the identification of its semantics in the same schema. Thus, the information of the positions can be collected to perform schema

Measuring distance between matrices and searching optimal mapping

This section has two parts: the first presents two scoring functions that measure the distance between feature matrices, and the second presents a search algorithm that finds the optimal mapping.
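To make the idea of scoring a candidate mapping concrete, the following is a minimal sketch under stated assumptions. It normalizes each row of a position-count matrix, scores a one-to-one mapping by the sum of Euclidean distances between matched rows, and finds the best mapping by exhaustive search over permutations. Both the Euclidean score and the exhaustive search are stand-ins chosen for brevity: the two scoring functions of this paper are not reproduced here, and the paper uses Ant Colony Optimization for the search. All function names and the sample matrices are hypothetical, and the sketch assumes both schemas have the same number of attributes.

```python
import math
from itertools import permutations

def normalize(row):
    """Turn position counts into relative frequencies (all-zero rows stay zero)."""
    total = sum(row)
    return [x / total for x in row] if total else list(row)

def row_distance(a, b):
    """Euclidean distance between two position-frequency rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mapping_cost(src, tgt, mapping):
    """Total distance when source attribute i is matched to target attribute mapping[i]."""
    return sum(row_distance(src[i], tgt[j]) for i, j in enumerate(mapping))

def best_one_to_one(src, tgt):
    """Exhaustive search over one-to-one mappings; feasible only for tiny schemas."""
    src = [normalize(r) for r in src]
    tgt = [normalize(r) for r in tgt]
    return min(permutations(range(len(tgt))),
               key=lambda m: mapping_cost(src, tgt, m))

# Hypothetical count matrices: rows are attributes, columns are positions.
source = [[2, 0, 0], [1, 1, 1], [0, 1, 0]]
target = [[0, 3, 0], [4, 0, 0], [2, 2, 2]]
best = best_one_to_one(source, target)
print(best)  # best[i] is the target row matched to source row i
```

Exhaustive search grows factorially with the number of attributes, which is why a heuristic search such as Ant Colony Optimization is a reasonable choice for schemas of realistic size.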

Experimental evaluation

In this section, the quality of the matching results of our proposed approach is evaluated on a real data set. First, we present the data set used in the experiments. Then, we show the experimental results on the performance of our matching algorithm under the three cardinality constraints (one-to-one, onto and partial). The effect of varying the parameters of the Ant cycle on the performance of our approach, for example ρ in Eq. (4), is also studied in the experiments. Finally, we study the

Related work

Schema matching plays an important role in many database-related applications, such as data exchange and data integration [17], [13], [15], [12], [11], [10], [9], [8], [7], [6], [1], [2], which have been studied for a long time. Works [40], [20], [18] survey approaches to automatic schema matching and present a taxonomy that covers many of these existing approaches. These surveys call the existing techniques matchers, which are mainly classified as schema-based and

Conclusion

In this paper, we employ the positions of attributes appearing in the query statements to perform schema matching. The position of an attribute embodies the extent of the importance of the attribute for the user examining the query results. The statistics about the positions of attributes from the query logs are collected as features of schemas to be matched. We design two types of matrices to structure the statistics about the positions of attributes. They are APMatrix and RPMatrix

Acknowledgments

This research was supported by the National Natural Science Foundation of China (Grant No. 61303016) and the Normal Project Foundation of Education Department of LiaoNing Province (Grant No. L2012045).

References (41)

  • H. Kohler, X. Zhou, S.W. Sadiq, Y. Shu, K.L. Taylor, Sampling dirty data for matching attributes, in: Proc. of Special...
  • G. Mecca, P. Papotti, S. Raunich, Core schema mappings, in: Proc. of Special Interest Group on Management Of Data...
  • A. Radwan, L. Popa, I.R. Stanoi, A. Younis, Top-K generation of integrated schemas based on directed and weighted...
  • Zeshui Xu, A method for multiple attribute decision making with incomplete weight information in linguistic setting, Knowl.-Based Syst. (2008)
  • B.T. Dai, N. Koudas, D. Srivastava, Anthony K.H. Tung, S. Venkatasubramanian, Validating multi-column schema...
  • C. Chan, H.V.J.H. Elmeleegy, M. Ouzzani, A. Elmagarmid, Usage-based schema matching, in: Proc. of International...
  • Yuhai Zhao et al., Maximal subspace coregulated gene clustering, IEEE Trans. Knowl. Data Eng. (2008)
  • A.D. Sarma, X. Dong, A. Halevy, Bootstrapping pay-as-you-go data integration systems, in: Proc. of Special Interest...
  • X. Dong, A.Y. Halevy, C. Yu, Data integration with uncertainty, in: Proc. of Very Large Data Bases (VLDB), 2007, pp....
  • R.H. Warren, F. Tompa, Multicolumn substring matching for database schema translation, in: Proc. of Very Large Data...