PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases

Torsten Grabs, Klemens Böhm & Hans-Jörg Schek

Abstract

Our objective is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of updates. Timely processing of updates is important with novel application domains such as e-commerce. These issues are challenging, given the additional requirement that the system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This article describes its design, implementation, and evaluation. We follow a three-tier architecture with a database cluster as the bottom layer for storage management. The rationale for a database cluster is to ‘scale out’, i.e., to add further cluster nodes, whenever necessary for better performance. The middle tier provides IR-specific retrieval and update services. We deploy state-of-the-art middleware software to coordinate the cluster and to invoke IR-specific components. PowerDB-IR extends the middleware layer with service decomposition and parallelisation. PowerDB-IR has the following features: It supports state-of-the-art retrieval models such as vector space retrieval. It allows documents to be inserted and retrieved concurrently and ensures up-to-date retrieval results with almost no overhead. PowerDB-IR ensures the correctness of global concurrency and recovery. Alternative physical data organisation schemes and respective query processing techniques provide adequate performance for different workloads and database sizes. Scaling out the database cluster yields higher throughput and lower response times. We have run extensive experiments with PowerDB-IR using several commercial database systems as well as different middleware products. Further experiments have quantified the effect of transactional guarantees on performance. The main result is that PowerDB-IR shows surprisingly good scalability and low response times.
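
The scale-out idea in the abstract can be illustrated with a minimal sketch. It is not the PowerDB-IR implementation: the names `Node` and `Coordinator`, the hash partitioning by document id, and the simplified tf-idf inner-product score (a stand-in for full vector space retrieval) are all assumptions made for illustration. In-memory dictionaries replace the database cluster, and transactional guarantees are not modelled. A coordinator decomposes a query into one subquery per node, runs them over the partitions, and merges the partial rankings.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class Node:
    """Hypothetical cluster node holding one partition of the collection."""

    def __init__(self):
        self.docs = {}  # doc_id -> Counter of term frequencies

    def insert(self, doc_id, text):
        self.docs[doc_id] = Counter(tokenize(text))

    def doc_freq(self, terms):
        """Local document frequency of each query term."""
        return {t: sum(1 for tf in self.docs.values() if t in tf) for t in terms}

    def top_k(self, terms, idf, k):
        """Local ranking: (score, doc_id) pairs for the k best local documents."""
        scored = []
        for doc_id, tf in self.docs.items():
            score = sum(tf[t] * idf[t] for t in terms)
            if score > 0:
                scored.append((score, doc_id))
        return heapq.nlargest(k, scored)


class Coordinator:
    """Hypothetical middle tier: routes inserts and decomposes queries."""

    def __init__(self, num_nodes):
        self.nodes = [Node() for _ in range(num_nodes)]

    def insert(self, doc_id, text):
        # Assumed placement rule: hash partitioning on an integer doc id.
        self.nodes[doc_id % len(self.nodes)].insert(doc_id, text)

    def search(self, query, k=3):
        terms = set(tokenize(query))
        # Step 1: gather global statistics so that scores from different
        # nodes are comparable.
        n_docs = sum(len(node.docs) for node in self.nodes)
        df = Counter()
        for node in self.nodes:
            df.update(node.doc_freq(terms))
        idf = {t: math.log(1 + n_docs / df[t]) if df[t] else 0.0 for t in terms}
        # Step 2: one subquery per node (sequential here; a real system
        # would run them in parallel), then merge the partial rankings.
        partial = [node.top_k(terms, idf, k) for node in self.nodes]
        return heapq.nlargest(k, (hit for hits in partial for hit in hits))


cluster = Coordinator(num_nodes=2)
cluster.insert(0, "database cluster scale out")
cluster.insert(1, "information retrieval with a database")
cluster.insert(2, "vector space retrieval retrieval")
cluster.insert(3, "cooking recipes")
print([doc_id for _, doc_id in cluster.search("retrieval", k=2)])
```

Because every query reads the current partition contents and recomputes the global term statistics, a document inserted just before a query is reflected in its result. The sketch pays for this with an extra statistics round per query; how PowerDB-IR keeps that overhead low is covered in the article itself, not here.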

Author information

Authors and Affiliations

Database Research Group, Institute of Information Systems, ETH Zürich, Zürich, Switzerland
Torsten Grabs
Department of Computer Science, Otto-von-Guericke-Universität Magdeburg, Magdeburg, Germany
Klemens Böhm
Institut für Informationssysteme, ETH Zentrum IFW C49.2, Haldeneggsteig 4, 8092 Zürich, Switzerland
Hans-Jörg Schek

Corresponding author

Correspondence to Hans-Jörg Schek.

About this article

Cite this article

Grabs, T., Böhm, K. & Schek, HJ. PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases. Know. Inf. Sys. 6, 465–505 (2004). https://doi.org/10.1007/s10115-003-0120-y

Received: 05 November 2001
Revised: 09 September 2002
Accepted: 12 February 2003
Published: 01 July 2004
Issue Date: July 2004
DOI: https://doi.org/10.1007/s10115-003-0120-y
