Abstract
Recent advances in flash storage have made it an attractive alternative for data storage in a wide spectrum of computing devices, such as embedded sensors, mobile phones, PDAs, laptops, and even servers. However, flash storage has many unique characteristics that make existing data management/analytics algorithms designed for magnetic disks perform poorly with flash storage. For example, while random reads can be nearly as fast as sequential reads, random writes and in-place data updates are orders of magnitude slower than sequential writes. In this paper, we consider an important fundamental problem that would seem to be particularly challenging for flash storage: efficiently maintaining a very large random sample of a data stream (e.g., of sensor readings). First, we show that previous algorithms such as reservoir sampling and geometric file are not readily adapted to flash. Second, we propose B-File, an energy-efficient abstraction for flash storage to store self-expiring items, and show how a B-File can be used to efficiently maintain a large sample in flash. Our solution is simple, has a small (RAM) memory footprint, and is designed to cope with flash constraints in order to reduce latency and energy consumption. Third, we provide techniques to maintain biased samples with a B-File and to query the large sample stored in a B-File for a subsample of an arbitrary size. Finally, we present an evaluation with flash storage that shows our techniques are several orders of magnitude faster and more energy-efficient than (flash-friendly versions of) reservoir sampling and geometric file. A key finding of our study, of potential use to many flash algorithms beyond sampling, is that “semi-random” writes (as defined in the paper) on flash cards are over two orders of magnitude faster and more energy-efficient than random writes.
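To illustrate why the reservoir sampling baseline named in the abstract fits flash poorly, the sketch below implements the classic in-memory form of reservoir sampling (Algorithm R) and counts how often a uniformly chosen slot is overwritten in place. It is not the B-File and not the paper's flash-friendly variant; the function name `reservoir_sample` and the `overwrites` counter are hypothetical and used only for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Classic reservoir sampling (Algorithm R): uniform sample of size k.

    Returns the sample and the number of in-place overwrites performed.
    The overwrite counter is an illustrative addition, not part of the
    algorithm itself.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    overwrites = 0
    for i, item in enumerate(stream):
        if i < k:
            # The initial fill appends items in arrival order (sequential).
            reservoir.append(item)
        else:
            # Pick a position uniformly in [0, i]; keep the item only if
            # the position falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                # In-place update at a random slot: the access pattern
                # the abstract describes as slow on flash.
                reservoir[j] = item
                overwrites += 1
    return reservoir, overwrites


if __name__ == "__main__":
    sample, overwrites = reservoir_sample(range(100_000), k=1_000,
                                          rng=random.Random(0))
    print(len(sample), overwrites)
```

If the reservoir lived on flash rather than in RAM, each counted overwrite would be a random in-place write. For a stream of n items and a sample of size k, the expected number of such overwrites is roughly k ln(n/k).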
Cite this article
Nath, S., Gibbons, P.B. Online maintenance of very large random samples on flash storage. The VLDB Journal 19, 67–90 (2010). https://doi.org/10.1007/s00778-009-0164-z
DOI: https://doi.org/10.1007/s00778-009-0164-z