DOI: 10.1145/3340531.3412056

Learning from Textual Data in Database Systems

Published: 19 October 2020

Abstract

Relational database systems hold massive amounts of text that is valuable for many machine learning (ML) tasks. Since ML techniques depend on numerical input representations, pre-trained word embeddings are increasingly used to convert text values into meaningful numbers. However, a naïve one-to-one mapping of each word in a database to a word embedding vector fails to incorporate the rich context information given by the database schema. We therefore propose Retro, a novel relational retrofitting framework that learns numerical representations of text values in databases, capturing both the rich information encoded by pre-trained word embedding models and the context information provided by tabular and foreign-key relations in the database. We define relational retrofitting as an optimization problem, present an efficient algorithm for solving it, and investigate the influence of various hyperparameters. Furthermore, we develop both simple feed-forward and more complex graph convolutional neural network architectures that operate on these representations. Our evaluation shows that the proposed embeddings and models are ready to use for many ML tasks, such as text classification, imputation, and link prediction, and even outperform state-of-the-art techniques.
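
To make the optimization concrete, the following is a minimal sketch of a relational retrofitting objective, modeled on the classic retrofitting loss of Faruqui et al. (NAACL 2015) and extended with the relational context described above. The edge sets E_C and E_R, the weights alpha, beta, gamma, and the omission of any negative-sampling terms are illustrative assumptions; Retro's precise formulation is given in the authors' EDBT 2020 paper.

    % Hedged sketch, not Retro's exact loss: each retrofitted vector q_i stays
    % close to its pre-trained embedding \hat{q}_i while being pulled towards
    % text values in the same column (edges E_C) and towards values connected
    % by foreign keys (edges E_R).
    \Psi(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \,\lVert q_i - \hat{q}_i \rVert^2
            + \sum_{(i,j) \in E_C} \beta_{ij} \,\lVert q_i - q_j \rVert^2
            + \sum_{(i,j) \in E_R} \gamma_{ij} \,\lVert q_i - q_j \rVert^2 \Big]

For fixed weights, an objective of this shape is a convex least-squares problem, which is what makes an efficient solver possible: in the style of Faruqui et al.'s update rule, each q_i can be updated iteratively as a weighted average of its pre-trained vector and its current relational neighbors.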

Supplementary Material

MP4 File (3340531.3412056.mp4)


Cited By

  • (2022) Unsupervised Matching of Data and Text. 2022 IEEE 38th International Conference on Data Engineering (ICDE), pages 1058-1070. DOI: 10.1109/ICDE53745.2022.00084. Online publication date: May 2022.
  • (2021) Pre-Trained Web Table Embeddings for Table Discovery. Fourth Workshop in Exploiting AI Techniques for Data Management, pages 24-31. DOI: 10.1145/3464509.3464892. Online publication date: 20 June 2021.


Published In

CIKM '20: Proceedings of the 29th ACM International Conference on Information & Knowledge Management
October 2020
3619 pages
ISBN: 9781450368599
DOI: 10.1145/3340531


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. relational database
  2. retrofitting
  3. word embedding

Qualifiers

  • Research-article


Conference

CIKM '20

Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%

