Online maintenance of very large random samples on flash storage

  • Special Issue Paper
The VLDB Journal

Abstract

Recent advances in flash storage have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDAs, laptops, and even servers. However, flash storage has many unique characteristics that cause existing data management and analytics algorithms, designed for magnetic disks, to perform poorly on flash. For example, while random reads can be nearly as fast as sequential reads, random writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and the geometric file are not readily adapted to flash. Second, we propose B-File, an energy-efficient abstraction for flash storage to store self-expiring items, and show how a B-File can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small (RAM) memory footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-File and to query the large sample stored in a B-File for a subsample of an arbitrary size. Finally, we present an evaluation with flash storage that shows our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and the geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that “semi-random” writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.
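To make the difficulty concrete, the following is a minimal sketch of classic reservoir sampling (Vitter's Algorithm R), not of the B-File proposed in the paper. The function name `reservoir_sample` and the `overwrites` log are illustrative assumptions. The log records which sample slot each replacement touches, showing that once the reservoir is full, every accepted item overwrites a uniformly random slot in place. That is the random-write pattern the abstract identifies as costly on flash.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of size k from a stream (Algorithm R).

    Returns the sample and the list of slot indices overwritten in place.
    The overwrite log is an illustrative addition, not part of the algorithm.
    """
    rng = rng or random.Random()
    reservoir = []
    overwrites = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir sequentially.
            reservoir.append(item)
        else:
            # Item i is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # in-place update of a random slot
                overwrites.append(j)
    return reservoir, overwrites


if __name__ == "__main__":
    sample, overwrites = reservoir_sample(range(100_000), 1_000, random.Random(0))
    print(len(sample), "items kept;", len(overwrites), "random in-place overwrites")
```

If the reservoir lived on flash rather than in RAM, each entry in `overwrites` would correspond to one random in-place write. This is one way to see why the abstract states that reservoir sampling is not readily adapted to flash.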

Author information

Corresponding author

Correspondence to Suman Nath.

About this article

Cite this article

Nath, S., Gibbons, P.B. Online maintenance of very large random samples on flash storage. The VLDB Journal 19, 67–90 (2010). https://doi.org/10.1007/s00778-009-0164-z

  • DOI: https://doi.org/10.1007/s00778-009-0164-z
