Abstract
With the development of the World Wide Web (WWW), the storage and utilization of web data have become a major challenge for the data management research community. Web data are essentially heterogeneous and may change schema frequently, so the traditional relational data model is inappropriate for web data management. A new data model, called Wide Table (WT for short), was introduced for this task. The WT model has several characteristics. First, a WT is usually very sparsely populated, so that most data can fit into a single line or record. Second, queries are composed on only a small subset of the attributes. Thus, existing query processing and optimization techniques for relational databases with normalized tables no longer work efficiently. Furthermore, a WT is usually of extremely large volume, and it is thought that only large-scale distributed storage can accommodate such a massive data set. In this paper, the requirements and challenges of web data management are discussed. Existing techniques for WT, including logical presentation, physical storage, and query processing, are introduced and analyzed in detail.
Yang, B., Qian, W. & Zhou, A. Using Wide Table to manage web data: a survey. Front. Comput. Sci. China 2, 211–223 (2008). https://doi.org/10.1007/s11704-008-0050-7