Abstract
Data integration technology combines data from heterogeneous sources, making such data convenient and fast to use in big data processing; it therefore plays an important role in many industries. Recently, a growing body of work has applied data integration to relational data with the goal of mining its underlying knowledge. Through embedding techniques, the features of data can be extracted and expressed as low-dimensional vectors. Some existing methods treat records, attributes, and cell values in relational data as distinct research objects and compute embedding representations for each, but they train the three types of objects uniformly, ignoring the differences among the types. In this paper, we transform relational data into a heterogeneous graph in which data at different levels are treated as different node types. During training, a calculation method tailored to the characteristics of each node type is adopted, yielding more accurate embedding representations. The embeddings are then applied to specific data integration tasks. Experimental results show that the embeddings trained by the proposed model generalize well and achieve satisfying results on both schema matching and entity resolution tasks.
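The heterogeneous graph the abstract describes can be illustrated with a minimal sketch: records, attributes, and cell values become three distinct node types, with records linked to their cell values and cell values linked to their attributes. This is an illustrative assumption about the construction, not the paper's actual implementation; all names (`build_hetero_graph`, the `rec::`/`attr::`/`val::` prefixes) are hypothetical.

```python
# Illustrative sketch: one relational table -> a heterogeneous graph with
# three node types (record, attribute, cell value). Shared cell values map
# to a single node, which is what lets embeddings connect related records.
from collections import defaultdict

def build_hetero_graph(rows, attributes):
    """rows: list of tuples aligned with `attributes`.
    Returns (node_types, adjacency): node_types maps each node id to
    "record", "attribute", or "value"; adjacency is undirected."""
    node_types = {}
    adj = defaultdict(set)

    def connect(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for attr in attributes:
        node_types[f"attr::{attr}"] = "attribute"

    for i, row in enumerate(rows):
        rec = f"rec::{i}"
        node_types[rec] = "record"
        for attr, cell in zip(attributes, row):
            val = f"val::{cell}"
            node_types[val] = "value"       # identical values share one node
            connect(rec, val)               # record <-> its cell values
            connect(val, f"attr::{attr}")   # cell value <-> its attribute

    return node_types, dict(adj)

rows = [("Alice", "Berlin"), ("Bob", "Berlin")]
attrs = ["name", "city"]
types, adj = build_hetero_graph(rows, attrs)
```

In this toy example the shared value node `val::Berlin` connects both record nodes, so a type-aware aggregation over the graph would let the two records influence each other's embeddings through their common city.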
Acknowledgment
This work is supported by the National Natural Science Foundation of China (62072084, 62072086), the National Defense Basic Scientific Research Program of China (JCKY2018205C012) and the Fundamental Research Funds for the Central Universities (N2116008).
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Li, X., Wang, G., Shen, D., Nie, T., Kou, Y. (2021). Heterogeneous Embeddings for Relational Data Integration Tasks. In: Xing, C., Fu, X., Zhang, Y., Zhang, G., Borjigin, C. (eds) Web Information Systems and Applications. WISA 2021. Lecture Notes in Computer Science, vol. 12999. Springer, Cham. https://doi.org/10.1007/978-3-030-87571-8_59
DOI: https://doi.org/10.1007/978-3-030-87571-8_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-87570-1
Online ISBN: 978-3-030-87571-8