ABSTRACT
Multidimensional indexing is crucial for fast search over large-scale data. Owing to the unprecedented scale of data, extending such indexing technology to distributed environments has recently gained attention. Existing efforts in distributed indexing aim to localize each query to data residing at a small number of nodes (i.e., locality-preserving indexing) in order to minimize communication cost. However, because workloads often correlate with data locality, such indexing often generates hotspots. For example, location-based queries are typically skewed toward disaster areas during certain periods of time: during Hurricane Irene, search traffic increased by more than 2000%. To alleviate such hotspots, we propose workload balancing as an optimization goal. We first develop a cost model that analytically supports the need for load balancing, and then present a distributed index that evenly distributes the workload. Our empirical study suggests that hotspots degrading search performance can be effectively alleviated. Specifically, when deployed to Amazon EC2, our proposed scheme showed a maximum speed-up of 127.7%. Even in hostile settings where workload is not at all correlated with the search criteria, the proposed scheme's performance is comparable to that of existing approaches optimized for such settings.
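The hotspot effect described above can be illustrated with a toy simulation. The sketch below is not the index proposed in this work; it is a minimal one-dimensional example under assumed parameters (8 nodes, 90% of point queries falling in a hot region covering 10% of the key space, and the hypothetical helpers `range_node` and `scattered_node`). It compares the share of queries handled by the busiest node under a locality-preserving range partition and under a partition that deals many small cells out to nodes in round-robin order.

```python
import random
from collections import Counter

NUM_NODES = 8  # assumed cluster size, chosen only for illustration


def range_node(x: float) -> int:
    """Locality-preserving placement: node i owns one contiguous slice of [0, 1)."""
    return min(int(x * NUM_NODES), NUM_NODES - 1)


def scattered_node(x: float, cells_per_node: int = 64) -> int:
    """Load-spreading placement: split [0, 1) into small cells, deal them round-robin."""
    num_cells = NUM_NODES * cells_per_node
    cell = min(int(x * num_cells), num_cells - 1)
    return cell % NUM_NODES


def max_load_share(assign, queries) -> float:
    """Fraction of all queries that land on the single busiest node."""
    loads = Counter(assign(q) for q in queries)
    return max(loads.values()) / len(queries)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed skew: 90% of point queries hit the hot region [0.50, 0.60).
queries = [
    rng.uniform(0.50, 0.60) if rng.random() < 0.9 else rng.random()
    for _ in range(10_000)
]

range_share = max_load_share(range_node, queries)
scattered_share = max_load_share(scattered_node, queries)
print(f"busiest node, range partition:     {range_share:.1%}")
print(f"busiest node, scattered partition: {scattered_share:.1%}")
```

Under these assumptions the range partition sends almost the entire hot region to a single node, while the scattered partition keeps the busiest node close to an even share. The trade-off is that a range query over the scattered layout would have to contact many nodes, which is the communication cost that locality-preserving indexing is designed to avoid.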
Index Terms
- Robust distributed indexing for locality-skewed workloads