Schema matching based on position of attribute in query statement
Introduction
Schema matching is an essential building block for sharing multiple heterogeneous data sources through a unified access interface. Its basic task is to find correspondences, called matches, between attributes in source schemas and attributes in target schemas. Matches are important for creating a unified mediated schema over multiple source schemas, exchanging data from one schema to another, and sharing data in similar domains. The topic has received significant attention in the literature, and a rich body of techniques has been proposed [41], [40], [38], [24], [23], [21], [19], [13], [14], [5], [4].
The schemas to be matched are typically designed by different developers who have different habits and experience, so they often have diverse structures and representations, which makes schema matching difficult. Schemas with dozens of tables and thousands of attributes increase the difficulty further. Even when domain expertise is available, schema matching may not be easy. As a result, schema matching remains a challenging and practically valuable research problem.
There is a multitude of schema matching techniques, also called matchers, e.g., [38], [24], [23], [19], [14]. However, no existing matcher returns 100% accurate matches, so further work in this area is needed. In this paper, we propose a novel matching technique that uses the positions at which attributes appear in the schema structure of query results to find matches. People are used to reading from left to right, and this habit shapes how they take in information. For example, given a spreadsheet listing phone records, people usually read the first column first, then the second column, and so on. This habit offers guidance to developers who design schema structures for applications that present structured information: the more important columns should be placed closer to the left side of the schema structure. For example, the column “phoneModel” in the above spreadsheet will be placed to the left of the column “phonePrice”; likewise, the column “departure time” is more likely to appear to the left of the column “arriving time” in a railway timetable. Sometimes a schema structure also reflects a default ordering rule of an industry. We browsed four websites selling mobile phones and searched for a phone of a specific brand; the schema structures of the returned results are shown in Fig. 1. The attributes of the schemas from the different websites appear in almost the same order in their respective query results. The developers of these websites have most likely arranged the attributes near the left side according to the reading habit described above.
The reading habit, however, says little about how to order the attributes closer to the right side; these attributes still appear in a similar order, and the reason is the default ordering rules of the specific industry. Thus, these habits, namely the reading habit and the default rules, which schemas to be matched in similar domains typically follow, can be used to find matches.
From the discussion above, different attributes differ in how important they are when structuring the query results shown to end users; that is, an attribute tends to hold its own position in the schema structure of query results. As a result, we can regard the statistics about the positions of an attribute in a large number of query results as an identifier of its semantics. Each query result derives from a corresponding query statement in the query logs. Consequently, our core idea is to collect the statistics about attribute positions from the query logs and use them to find matches. Our approach has three phases. First, we collect the statistics about attribute positions by scanning the queries in the query logs of the schemas to be matched, and we record the statistics in a matrix called a feature matrix. Second, we consider three kinds of cardinality constraints for mappings: one-to-one mapping, onto mapping and partial mapping. Based on these constraints, we employ two scoring functions to measure the similarity of the feature matrices of different schemas. Finally, given the scoring functions, an Ant Colony Optimization algorithm is used to find the optimal attribute mapping. Our approach can serve as an auxiliary technique for current mainstream matching techniques, such as text-based matching and instance-based matching. If our approach is combined with existing matchers, the accuracy of the matching results can be improved significantly. This paper makes the following contributions:
1. We discover that the positions of an attribute in schema structures of query results contain some semantic information that can be used in schema matching.
2. We exploit feature matrices to record the statistics about positions of attributes, which are collected from query logs.
3. We consider two kinds of scoring functions to measure the similarities of the feature matrices of different schemas.
4. We perform an extensive experimental study, and the experimental results show that the proposed algorithm has good performance.
The rest of this paper is organized as follows. Section 2 introduces how to extract features of attribute positions. Section 3 discusses the scoring functions and the search algorithm. The experimental results are given in Section 4. Related work is briefly reviewed in Section 5. Finally, we conclude in Section 6.
Section snippets
Features extraction of attribute position
The first phase of our work is introduced in this section. Given two schemas to be matched, we scan each query in their logs to collect statistics about the positions of attributes. We then design two types of matrices to record these statistics.
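To make the first phase concrete, the following Python sketch counts how often each attribute appears at each position of a query's SELECT list. It is an illustration under stated assumptions, not the paper's definition of its two matrix types: the helper names `select_list` and `position_counts`, the flat `SELECT a, b FROM ...` query form, and the sample log are all hypothetical.

```python
import re

# Assumption: queries are flat "SELECT a, b, c FROM ..." statements
# without nested queries, functions or aliases.
SELECT_RE = re.compile(r"\s*select\s+(.*?)\s+from\s", re.IGNORECASE | re.DOTALL)


def select_list(query):
    """Return the attribute names of the SELECT list, in order."""
    match = SELECT_RE.match(query)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def position_counts(queries, attributes):
    """Build a hypothetical feature matrix of position counts.

    Entry [i][p] is the number of logged queries in which
    attributes[i] appears at position p of the SELECT list.
    """
    index = {name: i for i, name in enumerate(attributes)}
    size = len(attributes)
    matrix = [[0] * size for _ in range(size)]
    for query in queries:
        for pos, name in enumerate(select_list(query)):
            if name in index and pos < size:
                matrix[index[name]][pos] += 1
    return matrix


# Hypothetical query log for a phone schema.
log = [
    "SELECT phoneModel, phonePrice FROM phones",
    "SELECT phoneModel, brand, phonePrice FROM phones WHERE brand = 'X'",
    "select phonePrice from phones",
]
attributes = ["phoneModel", "brand", "phonePrice"]
matrix = position_counts(log, attributes)
# matrix == [[2, 0, 0], [0, 1, 0], [1, 1, 1]]
```

Each row is a position profile for one attribute: in this sample log `phoneModel` always appears first, while `phonePrice` is spread across all three positions.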
As our motivation shows, the positions of an attribute in schema structures can be seen as the identification of its semantics in the same schema. Thus, the information of the positions can be collected to perform schema
Measuring distance between matrices and searching optimal mapping
This section covers two parts: first, two scoring functions that measure the distance between feature matrices; second, a search algorithm that finds the optimal mapping.
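The two parts can be illustrated with a small Python sketch. The excerpt does not give the paper's scoring functions, so the distance below (Euclidean distance between row-normalized position counts of matched attributes) is an assumption chosen for illustration. Likewise, the exhaustive search over one-to-one mappings is a stand-in that is only feasible for tiny schemas; the paper itself uses Ant Colony Optimization. The function names and the sample matrices are hypothetical.

```python
from itertools import permutations
from math import sqrt


def normalize(matrix):
    """Scale each row to sum to 1 so logs of different sizes are comparable."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([value / total if total else 0.0 for value in row])
    return result


def distance(source, target, mapping):
    """Assumed scoring function: Euclidean distance over matched rows.

    mapping[i] is the index of the target attribute matched to
    source attribute i.
    """
    return sqrt(sum(
        (s - t) ** 2
        for i, j in enumerate(mapping)
        for s, t in zip(source[i], target[j])
    ))


def best_one_to_one(source, target):
    """Exhaustive stand-in for the search step (not Ant Colony Optimization).

    Tries every injective mapping from source rows to target rows and
    returns the one with the smallest distance.
    """
    src, tgt = normalize(source), normalize(target)
    candidates = permutations(range(len(tgt)), len(src))
    return min(candidates, key=lambda m: distance(src, tgt, m))


# Hypothetical position-count matrices for two phone schemas.
# Source rows: phoneModel, brand, phonePrice.
source = [[2, 0, 0], [0, 1, 0], [1, 1, 1]]
# Target rows: maker, model, price.
target = [[0, 3, 1], [5, 1, 0], [1, 2, 6]]

best = best_one_to_one(source, target)
# best == (1, 0, 2): phoneModel -> model, brand -> maker, phonePrice -> price
```

The exhaustive search evaluates every permutation, which grows factorially with the number of attributes; this is the reason a heuristic search such as Ant Colony Optimization is needed for schemas of realistic size.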
Experimental evaluation
In this section, the quality of the matching results of our proposed approach is evaluated on a real data set. First, we present the data set used in the experiments. Then, we show experimental results on the performance of our matching algorithm under the three cardinality constraints (one-to-one, onto and partial). We also study how varying the parameters of Ant cycle, for example ρ in Eq. (4), affects the performance of our approach. Finally, we study the
Related work
Schema matching plays an important role in many database applications, such as data exchange and data integration [17], [13], [15], [12], [11], [10], [9], [8], [7], [6], [1], [2], which have been studied for a long time. The works [40], [20], [18] survey approaches to automatic schema matching and present a taxonomy that covers many of the existing approaches. They call these existing techniques matchers, which are mainly classified as schema-based and
Conclusion
In this paper, we employ the positions of attributes appearing in the query statements to perform schema matching. The position of an attribute embodies the extent of the importance of the attribute for the user examining the query results. The statistics about the positions of attributes from the query logs are collected as features of schemas to be matched. We design two types of matrices to structure the statistics about the positions of attributes. They are APMatrix and RPMatrix
Acknowledgments
This research was supported by the National Natural Science Foundation of China (Grant No. 61303016) and the Normal Project Foundation of Education Department of LiaoNing Province (Grant No. L2012045).
References (41)
- et al., Data clustering with size constraints, Knowl.-Based Syst. (2010)
- et al., Discovering pattern-based subspace clusters by pattern tree, Knowl.-Based Syst. (2009)
- et al., Application-oriented purely semantic precision and recall for ontology mapping evaluation, Knowl.-Based Syst. (2008)
- et al., MAX-MIN ant system, Future Gener. Comput. Syst. (2000)
- et al., SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks, Data Knowl. Eng. (2000)
- et al., An accelerator for attribute reduction based on perspective of objects and attributes, Knowl.-Based Syst. (2012)
- et al., Semantically-grounded construction of centroids for datasets with textual attributes, Knowl.-Based Syst. (2012)
(2012) - M. Yakout, K. Ganjam, K. Chakrabarti, S. Chaudhuri, InfoGather: entity augmentation and attribute discovery by holistic...
- E. Peukert, J. Eberius, E. Rahm, A self-configuring schema matching system, in: Proc. of International Conference on...
- M. Zhang, B.C. Ooi, C.M. Procopiuc, D. Srivastava, Automatic discovery of attributes in relational databases, in: Proc....