Using Wide Table to manage web data: a survey

  • Review Article
Frontiers of Computer Science in China

Abstract

With the development of the World Wide Web (WWW), the storage and utilization of web data have become a major challenge for the data management research community. Web data are essentially heterogeneous and may change schema frequently, so the traditional relational data model is inappropriate for web data management. A new data model, called Wide Table (WT for short), was introduced for this task. The WT model has several characteristics. First, a WT is usually very sparsely populated, so that most data can fit into a single line or record. Second, queries are composed on only a small subset of the attributes. Thus, existing query processing and optimization techniques for relational databases with normalized tables no longer work efficiently. Furthermore, a WT is usually of extremely large volume, and it is thought that only large-scale distributed storage can accommodate the massive data set. In this paper, the requirements and challenges of web data management are discussed. Existing techniques for WT, including logical presentation, physical storage, and query processing, are introduced and analyzed in detail.
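
The two characteristics named in the abstract (sparse population, and queries that touch only a few attributes) can be illustrated with a toy in-memory table. This is a minimal sketch and not the storage format of any system covered by the survey: the class `SparseWideTable`, its methods, and the sample attributes are all hypothetical names chosen for illustration. Each row keeps only its non-null attributes, and a per-attribute map lets a query read just the columns it mentions.

```python
from collections import defaultdict


class SparseWideTable:
    """Toy wide table (hypothetical): each row stores only its non-null attributes."""

    def __init__(self):
        self.rows = {}                    # row_id -> {attribute: value}
        self.columns = defaultdict(dict)  # attribute -> {row_id: value}

    def insert(self, row_id, **attrs):
        # Assumption of this sketch: every row_id is inserted exactly once.
        if row_id in self.rows:
            raise ValueError(f"row {row_id!r} already exists")
        # A previously unseen attribute needs no schema change; it simply appears.
        self.rows[row_id] = dict(attrs)
        for name, value in attrs.items():
            self.columns[name][row_id] = value

    def select(self, wanted, where_attr, predicate):
        # Scan only the column named in the predicate, then project the wanted attributes.
        matches = [rid for rid, value in self.columns.get(where_attr, {}).items()
                   if predicate(value)]
        return [{a: self.rows[rid].get(a) for a in wanted} for rid in sorted(matches)]

    def sparsity(self):
        # Fraction of (row, attribute) cells that hold no value.
        cells = len(self.rows) * len(self.columns)
        filled = sum(len(row) for row in self.rows.values())
        return 1 - filled / cells if cells else 0.0


wt = SparseWideTable()
wt.insert(1, title="camera", price=199, megapixels=12)
wt.insert(2, title="novel", author="A. Writer", pages=320)
wt.insert(3, title="laptop", price=899, ram_gb=16)

# The query names two of the six attributes and scans only the "price" column.
cheap = wt.select(["title", "price"], "price", lambda p: p < 500)
print(cheap)
print(wt.sparsity())
```

Even with three rows and six attributes, half of the cells are empty. On web-scale data with far more attributes the empty fraction is typically much larger, which is why a layout that stores and scans only populated cells matters.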

Author information

Corresponding author

Correspondence to Weining Qian.

About this article

Cite this article

Yang, B., Qian, W. & Zhou, A. Using Wide Table to manage web data: a survey. Front. Comput. Sci. China 2, 211–223 (2008). https://doi.org/10.1007/s11704-008-0050-7

  • DOI: https://doi.org/10.1007/s11704-008-0050-7
