Bulk Loading a Linear Hash File

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4081)

Abstract

We study the problem of bulk loading a linear hash file. A good hash function distributes records to random locations in the file, but performing a random disk access for each record is costly, and the cost grows with the size of the file. We propose a bulk loading algorithm that avoids random disk accesses by merging multiple accesses to the same location into a single access and reordering the accesses so that pages are accessed sequentially. Our analysis shows that the algorithm is near-optimal, with a cost roughly equal to that of sorting the dataset, so it scales to very large datasets. Our experiments show that our method improves on the Berkeley DB load utility in running time by two orders of magnitude, and that the improvement scales well with the size of the dataset.
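The idea of merging accesses to the same location and visiting pages in order can be illustrated with a small sketch. This is not the authors' algorithm, only a minimal in-memory illustration under stated assumptions: the final number of buckets is known up front, `zlib.crc32` stands in for the hash function, an in-memory sort stands in for an external sort, and all function names (`final_state`, `bucket_address`, `bulk_load`) are hypothetical.

```python
import zlib
from itertools import groupby


def final_state(num_buckets, n0=1):
    """Return the (level, split pointer) of a linear hash file with num_buckets buckets.

    Assumes the file started with n0 buckets and num_buckets >= n0.
    """
    level = 0
    while (n0 << (level + 1)) <= num_buckets:
        level += 1
    split = num_buckets - (n0 << level)
    return level, split


def bucket_address(key, level, split, n0=1):
    """Standard linear hashing address computation (crc32 is an assumed hash)."""
    h = zlib.crc32(key)
    addr = h % (n0 << level)
    if addr < split:
        # This bucket has already been split in the current round.
        addr = h % (n0 << (level + 1))
    return addr


def bulk_load(records, num_buckets, n0=1):
    """Group records by destination bucket and emit buckets in address order.

    Each bucket is produced exactly once, so a writer consuming the result
    touches every page a single time, in sequential order.
    """
    level, split = final_state(num_buckets, n0)
    tagged = [(bucket_address(k, level, split, n0), k, v) for k, v in records]
    # In-memory stand-in for an external sort on the bucket address.
    tagged.sort(key=lambda t: t[0])
    pages = []
    for addr, group in groupby(tagged, key=lambda t: t[0]):
        pages.append((addr, [(k, v) for _, k, v in group]))
    return pages


records = [(f"key{i}".encode(), i) for i in range(1000)]
pages = bulk_load(records, num_buckets=5)
```

Because the records are sorted by bucket address before anything is written, the dominant cost in this sketch is the sort, which mirrors the abstract's claim that the loading cost is roughly that of sorting the dataset.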

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Rafiei, D., Hu, C. (2006). Bulk Loading a Linear Hash File. In: Tjoa, A.M., Trujillo, J. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2006. Lecture Notes in Computer Science, vol 4081. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11823728_3

  • DOI: https://doi.org/10.1007/11823728_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37736-8

  • Online ISBN: 978-3-540-37737-5

  • eBook Packages: Computer Science, Computer Science (R0)
