Abstract
Our objective is a scalable infrastructure for information retrieval (IR) with up-to-date retrieval results in the presence of updates. Timely processing of updates is important with novel application domains such as e-commerce. These issues are challenging, given the additional requirement that the system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This article describes its design, implementation, and evaluation. We follow a three-tier architecture with a database cluster as the bottom layer for storage management. The rationale for a database cluster is to ‘scale out’, i.e., to add further cluster nodes, whenever necessary for better performance. The middle tier provides IR-specific retrieval and update services. We deploy state-of-the-art middleware software to coordinate the cluster and to invoke IR-specific components. PowerDB-IR extends the middleware layer with service decomposition and parallelisation. PowerDB-IR has the following features: It supports state-of-the-art retrieval models such as vector space retrieval. It allows documents to be inserted and retrieved concurrently and ensures up-to-date retrieval results with almost no overhead. PowerDB-IR ensures the correctness of global concurrency and recovery. Alternative physical data organisation schemes and respective query processing techniques provide adequate performance for different workloads and database sizes. Scaling out the database cluster yields higher throughput and lower response times. We have run extensive experiments with PowerDB-IR using several commercial database systems as well as different middleware products. Further experiments have quantified the effect of transactional guarantees on performance. The main result is that PowerDB-IR shows surprisingly good scalability and low response times.
About this article
Cite this article
Grabs, T., Böhm, K. & Schek, HJ. PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases. Know. Inf. Sys. 6, 465–505 (2004). https://doi.org/10.1007/s10115-003-0120-y
DOI: https://doi.org/10.1007/s10115-003-0120-y