Abstract
As database schemas grow larger and more complex, casual users find it increasingly difficult to understand the schema and contents of a database, which makes schema summarization an essential task. However, most prior approaches pay little attention to the topological relationships between tables, ignore user feedback, and fail to accurately predict the number of clusters in the output, all of which seriously limits their summarization accuracy. To address these problems, we propose a new schema summarization method based on a graph partition mechanism. First, we introduce a novel strategy for constructing a similarity matrix between tables based on topology compactness, content similarity, and query logs. Then we provide a formula for calculating table importance and a scheme for detecting the most important nodes in local areas. Both are used to select the initial cluster centers and to predict the number of clusters in the graph partition mechanism. Finally, we evaluate the proposed method on the TPC-E benchmark database, and the results demonstrate that it achieves high summarization accuracy.
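The pipeline outlined in the abstract can be illustrated with a small sketch. It is not the method from the paper: the abstract gives neither the similarity formula nor the importance formula, so the equal weights, the sum-of-similarities importance score, the neighbor threshold, and every function name below are assumptions made for illustration. The final step is a plain nearest-center assignment that stands in for the paper's graph partition mechanism.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def symmetric(n: int, pairs: Dict[Tuple[int, int], float]) -> Matrix:
    """Build an n x n symmetric matrix from a dict of (i, j) -> value."""
    m = [[0.0] * n for _ in range(n)]
    for (i, j), v in pairs.items():
        m[i][j] = m[j][i] = v
    return m


def combine_similarity(topology: Matrix, content: Matrix, query_log: Matrix,
                       weights=(1 / 3, 1 / 3, 1 / 3)) -> Matrix:
    """Blend the three table-to-table signals; the weights are assumed."""
    w_t, w_c, w_q = weights
    n = len(topology)
    return [[w_t * topology[i][j] + w_c * content[i][j] + w_q * query_log[i][j]
             for j in range(n)] for i in range(n)]


def importance(sim: Matrix) -> List[float]:
    """Placeholder importance: total similarity of a table to all others."""
    return [sum(v for j, v in enumerate(row) if j != i)
            for i, row in enumerate(sim)]


def local_centers(sim: Matrix, scores: List[float], threshold: float) -> List[int]:
    """Pick tables that outscore every neighbor (similarity >= threshold).

    A table with no neighbors becomes its own center. Ties are broken in
    favor of the lower index. The number of centers found is used as the
    predicted number of clusters.
    """
    centers = []
    for i, row in enumerate(sim):
        neighbors = [j for j, v in enumerate(row) if j != i and v >= threshold]
        if all((scores[i], -i) > (scores[j], -j) for j in neighbors):
            centers.append(i)
    return centers


def partition(sim: Matrix, centers: List[int]) -> Dict[int, List[int]]:
    """Assign each table to the center it is most similar to."""
    clusters: Dict[int, List[int]] = {c: [] for c in centers}
    for i in range(len(sim)):
        best = i if i in clusters else max(centers, key=lambda c: sim[i][c])
        clusters[best].append(i)
    return clusters


# Toy schema with five tables; all values are invented for the example.
topology = symmetric(5, {(0, 1): 1.0, (0, 2): 1.0, (3, 4): 1.0, (2, 3): 0.2})
content = symmetric(5, {(0, 1): 0.8, (0, 2): 0.6, (1, 2): 0.4, (3, 4): 0.9})
query_log = symmetric(5, {(0, 1): 0.9, (0, 2): 0.7, (3, 4): 0.6, (2, 3): 0.1})

sim = combine_similarity(topology, content, query_log)
centers = local_centers(sim, importance(sim), threshold=0.5)
clusters = partition(sim, centers)
print(centers)   # [0, 3]
print(clusters)  # {0: [0, 1, 2], 3: [3, 4]}
```

In this toy input, tables 0 and 3 are the local importance maxima, so two clusters are predicted and the remaining tables attach to whichever center they are most similar to.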
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 61772152 and 61502037, and by the Basic Research Project (Nos. JCKY2016206B001, JCKY2014206C002 and JCKY2017604C010).
About this article
Cite this article
Wang, Y., Zhou, L. & Wang, N. Summarizing database schema based on graph partition. Multimed Tools Appl 78, 10077–10096 (2019). https://doi.org/10.1007/s11042-018-6543-y
DOI: https://doi.org/10.1007/s11042-018-6543-y