Abstract
Schema summarization on large-scale databases is challenging. In a typical large database schema, a large proportion of the tables are closely connected through a few high-degree tables, which makes it difficult to separate the tables into clusters that represent different topics. Moreover, because a schema can be very large, the schema summary needs to be structured into multiple levels to further improve usability. In this paper, we introduce a new schema summarization approach that uses community detection techniques from social network analysis. Our approach consists of three steps. First, we use a community detection algorithm to divide a database schema into subject groups, each representing a specific subject. Second, we cluster the subject groups into abstract domains to form a multi-level navigation structure. Third, we discover representative tables in each cluster to label the schema summary. We evaluate our approach on Freebase, a real-world large-scale database. The results show that our approach can identify subject groups precisely and that the generated abstract schema layers are very helpful for users exploring the database.
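To make the first and third steps concrete, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes the schema is an undirected graph whose nodes are tables and whose edges are foreign-key links, uses a naive greedy modularity merge as a stand-in for the community detection step, and picks the most-linked table in each group as its representative. The table names, the `EDGES` list, and the helper functions are all hypothetical. The second step (grouping subject groups into abstract domains) is omitted.

```python
from itertools import combinations

# Hypothetical toy schema: nodes are tables, edges are foreign-key links.
EDGES = [
    ("artist", "album"), ("artist", "track"), ("album", "track"),
    ("film", "actor"), ("film", "director"), ("actor", "director"),
    ("track", "film"),  # a single link bridging the two subject areas
]

def modularity(edges, communities):
    """Newman-Girvan modularity Q of a partition of an undirected graph."""
    m = len(edges)
    label = {t: i for i, comm in enumerate(communities) for t in comm}
    internal = [0] * len(communities)      # edges inside each community
    degree_sum = [0] * len(communities)    # summed degree of each community
    for u, v in edges:
        degree_sum[label[u]] += 1
        degree_sum[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[i] / m - (degree_sum[i] / (2 * m)) ** 2
               for i in range(len(communities)))

def greedy_communities(edges):
    """Start from singletons; repeatedly apply the merge that raises Q most."""
    tables = sorted({t for e in edges for t in e})
    communities = [frozenset([t]) for t in tables]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(edges, merged)
            if q > best_q + 1e-12:         # keep only strict improvements
                best_q, best_merge = q, merged
        if best_merge is None:             # no merge improves Q: stop
            break
        communities = best_merge
    return communities

def representative(edges, community):
    """Assumed labeling rule: the table with the most links, ties by name."""
    degree = {t: 0 for t in community}
    for u, v in edges:
        if u in degree:
            degree[u] += 1
        if v in degree:
            degree[v] += 1
    return max(sorted(community), key=lambda t: degree[t])

groups = greedy_communities(EDGES)
labels = {representative(EDGES, g): sorted(g) for g in groups}
print(labels)
```

This naive version recomputes modularity for every candidate merge, so it only suits small illustrative graphs. A schema the size of Freebase would need an incremental method.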
Additional information
This work is partly supported by the “HGJ” National Science and Technology Major Project of China under Grant No. 2010ZX01042-001-002, the National Natural Science Foundation of China under Grant No. 61070054, the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01Z149, the Research Funds of Renmin University of China under Grant No. 10XNI018 and the Postgraduate Science & Research Funds of Renmin University of China under Grant No. 12XNH177.
Cite this article
Wang, X., Zhou, X. & Wang, S. Summarizing Large-Scale Database Schema Using Community Detection. J. Comput. Sci. Technol. 27, 515–526 (2012). https://doi.org/10.1007/s11390-012-1240-1
DOI: https://doi.org/10.1007/s11390-012-1240-1