PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases

Knowledge and Information Systems

Abstract

Our objective is a scalable infrastructure for information retrieval (IR) that delivers up-to-date retrieval results in the presence of updates. Timely processing of updates is important in novel application domains such as e-commerce. This is challenging, given the additional requirement that the system must scale well. We have built PowerDB-IR, a system that has the characteristics sought. This article describes its design, implementation, and evaluation. We follow a three-tier architecture with a database cluster as the bottom layer for storage management. The rationale for a database cluster is to ‘scale out’, i.e., to add further cluster nodes whenever better performance is needed. The middle tier provides IR-specific retrieval and update services. We deploy state-of-the-art middleware software to coordinate the cluster and to invoke IR-specific components. PowerDB-IR extends the middleware layer with service decomposition and parallelisation. PowerDB-IR has the following features. It supports state-of-the-art retrieval models such as vector space retrieval. It allows documents to be inserted and retrieved concurrently and ensures up-to-date retrieval results with almost no overhead. It ensures the correctness of global concurrency and recovery. Alternative physical data organisation schemes and the corresponding query processing techniques provide adequate performance for different workloads and database sizes. Scaling out the database cluster yields higher throughput and lower response times. We have run extensive experiments with PowerDB-IR using several commercial database systems as well as different middleware products. Further experiments have quantified the effect of transactional guarantees on performance. The main result is that PowerDB-IR shows surprisingly good scalability and low response times.
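The combination of vector space retrieval, service decomposition, and parallelisation described in the abstract can be illustrated with a small sketch. It is not PowerDB-IR's implementation: it assumes documents are partitioned across nodes, uses plain tf-idf weights with a globally aggregated idf, and replaces the database cluster and middleware with in-memory objects and a thread pool. The names `Node`, `doc_freq`, `top_k`, and `search` are hypothetical.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical stand-in for one cluster node holding a document partition."""

    def __init__(self, docs):
        # docs maps doc_id -> text; store term frequencies per document.
        self.tf = {d: Counter(text.lower().split()) for d, text in docs.items()}

    def insert(self, doc_id, text):
        # An inserted document is visible to the next query on this node.
        self.tf[doc_id] = Counter(text.lower().split())

    def doc_freq(self, terms):
        # Local document frequencies plus the local collection size.
        df = {t: sum(1 for tf in self.tf.values() if t in tf) for t in terms}
        return df, len(self.tf)

    def top_k(self, idf, k):
        # Score local documents with length-normalised tf-idf.
        scores = []
        for doc_id, tf in self.tf.items():
            norm = math.sqrt(sum(c * c for c in tf.values())) or 1.0
            score = sum(tf[t] * w for t, w in idf.items()) / norm
            if score > 0:
                scores.append((score, doc_id))
        return heapq.nlargest(k, scores)


def search(nodes, query, k=3):
    """Decompose a query into per-node sub-requests and merge the results."""
    terms = set(query.lower().split())
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # Phase 1: gather statistics from all nodes in parallel.
        stats = list(pool.map(lambda node: node.doc_freq(terms), nodes))
        n_docs = sum(size for _, size in stats)
        df = Counter()
        for local_df, _ in stats:
            df.update(local_df)
        idf = {t: math.log(1 + n_docs / df[t]) for t in terms if df[t] > 0}
        # Phase 2: each node ranks its own partition in parallel.
        partials = list(pool.map(lambda node: node.top_k(idf, k), nodes))
    # Merge the partial rankings into one global top-k list.
    return heapq.nlargest(k, (hit for part in partials for hit in part))
```

In this sketch, adding a `Node` to the list is the analogue of scaling out: each node scores only its own partition, and the coordinator merges at most `k` hits per node. Transactional guarantees for concurrent inserts and queries, which the article treats in depth, are deliberately left out.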

Author information

Corresponding author

Correspondence to Hans-Jörg Schek.

About this article

Cite this article

Grabs, T., Böhm, K. & Schek, HJ. PowerDB-IR – Scalable Information Retrieval and Storage with a Cluster of Databases. Know. Inf. Sys. 6, 465–505 (2004). https://doi.org/10.1007/s10115-003-0120-y

  • DOI: https://doi.org/10.1007/s10115-003-0120-y
