Summarizing Large-Scale Database Schema Using Community Detection

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Summarizing the schema of a large-scale database is challenging. In a typical large database schema, a large proportion of the tables are closely connected through a few high-degree tables. It is thus difficult to separate these tables into clusters that represent different topics. Moreover, because a schema can be very large, its summary needs to be structured into multiple levels to remain usable. In this paper, we introduce a new schema summarization approach that applies community detection techniques from social network analysis. Our approach consists of three steps. First, we use a community detection algorithm to divide a database schema into subject groups, each representing a specific subject. Second, we cluster the subject groups into abstract domains to form a multi-level navigation structure. Third, we discover representative tables in each cluster to label the schema summary. We evaluate our approach on Freebase, a real-world large-scale database. The results show that our approach identifies subject groups precisely and that the generated abstract schema layers help users explore the database.
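To make the first and third steps concrete, the following is a minimal sketch, not the authors' implementation. It assumes the schema is modeled as an undirected graph whose nodes are tables and whose edges are foreign-key links, uses a naive greedy modularity agglomeration as a stand-in for the community detection algorithm (the abstract does not say which one is used), and picks the table with the most links inside its group as a stand-in for "representative". The toy schema and all function names are hypothetical. The second step, clustering subject groups into abstract domains, is not covered.

```python
from itertools import combinations

def modularity(edges, communities):
    """Newman-Girvan modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: i for i, com in enumerate(communities) for node in com}
    inside = [0] * len(communities)  # edges with both endpoints in community i
    degree = [0] * len(communities)  # sum of node degrees in community i
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(inside[i] / m - (degree[i] / (2 * m)) ** 2
               for i in range(len(communities)))

def greedy_communities(edges):
    """Start from singleton groups and repeatedly apply the merge that raises Q most.

    Naive version: Q is recomputed from scratch for every candidate merge,
    which is only practical for toy graphs.
    """
    nodes = sorted({n for edge in edges for n in edge})
    communities = [frozenset([n]) for n in nodes]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best_q:  # no merge improves Q any further
            break
        best_q, communities = q, merged
    return communities, best_q

def representative(edges, community):
    """Assumed heuristic: the table with the most links inside its own group."""
    deg = {n: 0 for n in community}
    for u, v in edges:
        if u in community and v in community:
            deg[u] += 1
            deg[v] += 1
    return max(sorted(deg), key=deg.get)

# Hypothetical toy schema: a film group and a music group joined by one link.
EDGES = [
    ("film", "actor"), ("film", "director"), ("film", "genre"),
    ("actor", "director"),
    ("album", "artist"), ("album", "track"), ("album", "label"),
    ("artist", "track"),
    ("actor", "artist"),  # single bridge between the two groups
]

if __name__ == "__main__":
    groups, q = greedy_communities(EDGES)
    for group in groups:
        print(sorted(group), "labelled by", representative(EDGES, group))
    print("modularity:", round(q, 3))
```

On a real schema the high-degree hub tables mentioned above would pull many groups together, so a plain modularity merge like this one is likely to need the additional handling the paper describes.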

Author information

Corresponding author

Correspondence to Xue Wang.

Additional information

This work is partly supported by the “HGJ” National Science and Technology Major Project of China under Grant No. 2010ZX01042-001-002, the National Natural Science Foundation of China under Grant No. 61070054, the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01Z149, the Research Funds of Renmin University of China under Grant No. 10XNI018 and the Postgraduate Science & Research Funds of Renmin University of China under Grant No. 12XNH177.

About this article

Cite this article

Wang, X., Zhou, X. & Wang, S. Summarizing Large-Scale Database Schema Using Community Detection. J. Comput. Sci. Technol. 27, 515–526 (2012). https://doi.org/10.1007/s11390-012-1240-1

  • DOI: https://doi.org/10.1007/s11390-012-1240-1
