ABSTRACT
In databases, data replication is a way to improve query throughput. An ecosystem in which data is replicated can also offer increased parallelism and better fault tolerance. In some cases, replicating a data set on only a few nodes, for better space efficiency, could be the preferred choice. One data set could be replicated on many nodes and others on only a few, based on how often each is accessed. Today, the decision of which data to replicate on which nodes is based on a few presumptions made at the time of replication. Once the data is replicated, it remains on those nodes. Over time, the queries accessing a data set might change, so that the least replicated data becomes the most requested, and vice versa.
Another aspect to consider is the storage format of the replicas. From the data storage perspective, a columnar database could be a great choice for some applications, whereas a row-based store could be a better fit for others. Storing all the replicas in only one of the two formats would be inefficient. In this paper, we propose a framework, RepliSmart, in which a smart controller redirects incoming queries among the connected nodes to balance the workload. The framework employs learning-based on-demand replication, wherein the number of replicas of a data unit (at the table or database level) varies as data access patterns vary over time. Additionally, the smart controller dynamically defines the storage format of each replica, so that some replicas are columnar while the rest are row-based. The smart controller redirects each user request to an appropriate node based on whether the query would execute better on columnar or row-based data. The proposed framework results in higher query throughput and better space utilization for read-heavy query workloads.
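The controller behaviour described above can be illustrated with a minimal sketch. It is not the RepliSmart implementation: the class and method names, the keyword heuristic for classifying queries, and the proportional replica-count rule are all assumptions made for illustration, and the learning component is replaced by simple access counting.

```python
from collections import Counter
from dataclasses import dataclass
import itertools


@dataclass
class Replica:
    node: str    # hypothetical node identifier
    table: str   # data unit held by this replica
    layout: str  # "row" or "columnar"


class SmartController:
    """Toy controller: routes queries by layout and tracks table accesses."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.access_counts = Counter()
        self._next = itertools.count()  # shared round-robin counter

    def preferred_layout(self, query):
        # Assumed heuristic: aggregate-style queries favour columnar storage.
        q = query.lower()
        analytical = any(k in q for k in ("sum(", "avg(", "count(", "group by"))
        return "columnar" if analytical else "row"

    def route(self, table, query):
        """Pick a replica of `table`, preferring the layout suited to `query`."""
        self.access_counts[table] += 1
        candidates = [r for r in self.replicas if r.table == table]
        if not candidates:
            raise LookupError(f"no replica holds table {table!r}")
        layout = self.preferred_layout(query)
        # Fall back to any replica when no replica has the preferred layout.
        matching = [r for r in candidates if r.layout == layout] or candidates
        return matching[next(self._next) % len(matching)]

    def target_replica_counts(self, total_slots, min_replicas=1):
        """Assumed rule: share replica slots in proportion to observed accesses."""
        tables = {r.table for r in self.replicas}
        total = sum(self.access_counts[t] for t in tables)
        targets = {}
        for t in tables:
            share = self.access_counts[t] / total if total else 0.0
            targets[t] = max(min_replicas, round(total_slots * share))
        return targets
```

In this sketch, a table that receives most of the traffic is assigned most of the replica slots, and a table with no matching layout still gets served from whatever replica exists. A real controller would also need to handle writes, replica migration cost, and consistency, none of which is modelled here.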
Index Terms
- RepliSmart: A Smart Replication framework for optimal query throughput in read-heavy environments