EM-KDE: A locality-aware job scheduling policy with distributed semantic caches

https://doi.org/10.1016/j.jpdc.2015.06.002

Highlights

  • We propose a locality-aware scheduling policy for distributed query processing.

  • Load balance and data reuse are equally important for query processing throughput.

  • Distributed semantic caching needs a scheduler that balances load and data reuse.

Abstract

In modern query processing systems, the caching facilities are distributed and scale with the number of servers. To maximize overall system throughput, a distributed system should balance the query load among servers while also leveraging cached results. Leveraging distributed cached data is becoming particularly important as many systems are built by connecting many small heterogeneous machines rather than relying on a few high-performance workstations. Although many query scheduling policies exist, such as round-robin and load-monitoring, they are not sophisticated enough to both balance the load and leverage cached results. In this paper, we propose distributed query scheduling policies that take into account the dynamic contents of the distributed caching infrastructure and incorporate statistical prediction methods into the query scheduling policy.

We employ kernel density estimation derived from recent queries, together with the well-known exponential moving average (EMA), to predict the query distribution in a dynamically changing multi-dimensional problem space. Based on the estimated query distribution, the front-end scheduler assigns incoming queries so that query workloads are balanced and cached results are reused. Our experiments show that the proposed query scheduling policy outperforms existing policies in terms of both load balancing and cache hit ratio.

Introduction

Load balancing has been extensively investigated in various fields, including multiprocessor systems, computer networks, and distributed systems. Numerous scheduling algorithms have been introduced for this purpose; one of the simplest is round-robin scheduling, and more intelligent load-balancing algorithms have been proposed that take additional performance factors into account, such as the current system load, heterogeneous computational power, and the network connections of the servers.

In many computational domains, including scientific and business applications, application datasets have been growing in size. Moreover, a recent computing trend is to analyze massive volumes of data to identify patterns. Hence, many modern applications spend a large fraction of their execution time on I/O and data manipulation. The fundamental challenge in improving the performance of such data-intensive applications is managing massive amounts of data while reducing data movement and I/O.

To reduce I/O on large datasets, distributed data analysis frameworks place heavy demands on cluster-wide memory, but the aggregate memory of a cluster is often not large enough to hold all the datasets, making purely in-memory computing impossible. However, the caching facilities scale with the number of distributed servers, and leveraging the large distributed caches plays an important role in improving overall system throughput, as many large-scale systems are built by connecting small machines.

In distributed environments, orchestrating a large number of distributed caches to achieve a high cache-hit ratio is difficult. Traditional query scheduling policies such as load-monitoring [1] are not sophisticated enough to consider cached results on distributed servers; they only consider load balance. Without a high cache-hit ratio, the distributed caches will be underutilized, leading to slow query responses. Conversely, scheduling policies that are based solely on data reuse will fail to balance the system load: if a certain server holds very hot cached items, that single server will be flooded with a majority of the queries while the other servers sit idle. To maximize system throughput by achieving load balance as well as exploiting cached query results, query scheduling policies more intelligent than the traditional round-robin and load-monitoring policies are required.

In this paper, we propose novel distributed query scheduling policies for multi-dimensional scientific data-analysis applications. The proposed policies make scheduling decisions by interpreting queries as multi-dimensional points and grouping them so that similar queries are co-located for a high cache-hit ratio. The proposed scheduling policies also balance the load among the servers by leveling the cluster sizes in the caches. Our clustering algorithms differ from well-known clustering algorithms such as k-means or BIRCH [29] in that the complexity of the proposed query scheduling algorithms is very lightweight and independent of the number of cached data items, since scheduling decisions must be made dynamically at run time for incoming stream queries. Moreover, the goal of the well-known clustering methods is to minimize the distance of each object to its assigned cluster, whereas the distributed query scheduling policies try to balance the number of objects assigned to each cluster while increasing data locality.
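To illustrate the flavor of this approach (a simplified sketch with illustrative names and parameters, not the exact algorithm used by our framework), consider a scheduler that keeps one exponential-moving-average centroid per back-end server and routes each query, viewed as a point in the problem space, to the server with the nearest centroid:

import math

class CentroidScheduler:
    def __init__(self, num_servers, dims, alpha=0.1):
        self.alpha = alpha   # EMA weight given to the most recent query
        # Start from arbitrary, distinct centroids; they drift toward the actual
        # query distribution as queries are scheduled.
        self.centroids = [[float(s)] * dims for s in range(num_servers)]

    def schedule(self, query_point):
        # Pick the server whose EMA centroid is closest to the query point.
        def dist(c):
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, query_point)))
        server = min(range(len(self.centroids)),
                     key=lambda s: dist(self.centroids[s]))
        # Move that server's centroid toward the new query point (the EMA step).
        old = self.centroids[server]
        self.centroids[server] = [(1 - self.alpha) * c + self.alpha * q
                                  for c, q in zip(old, query_point)]
        return server

# Route a 2-D query (e.g., the center point of a range query) to a server.
scheduler = CentroidScheduler(num_servers=4, dims=2)
print(scheduler.schedule([0.3, 0.7]))   # prints the chosen server id

Because each centroid drifts toward the queries routed to its server, nearby queries tend to be assigned to the same server and thus hit the same semantic cache; the scheduling cost is a single pass over the server centroids, independent of the number of cached items.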

To evaluate the performance of the proposed scheduling policies, we implemented them on top of a component-based distributed query processing framework. Scientific data analysis application developers can implement the framework's user-defined operator interfaces to process scientific queries. We conducted extensive experimental studies using the framework and show that the proposed query scheduling policies significantly outperform conventional scheduling policies in terms of both load balancing and cache hit ratio.

The rest of the paper is organized as follows: In Section 2, we discuss other research efforts related to cache-aware query scheduling and query optimization. In Section 3, we describe the architecture of our distributed query processing framework. In Section 4, we discuss the BEMA (Balanced Exponential Moving Average) scheduling policy [15] and analyze its load balancing behavior. In Section 5, we propose a novel scheduling policy, EM-KDE (Exponential Moving Kernel Density Estimation), that improves load balancing. In Section 6, we present an extensive experimental evaluation, where we examine the performance impact of different scheduling policies, measuring query execution and waiting time as well as load balancing. Finally, we conclude in Section 7.

Section snippets

Related work

The scheduling problem of minimizing the makespan of multiple jobs in parallel systems is a well-known NP-hard optimization problem. This has led to a very large number of heuristic scheduling algorithms, ranging from low-level process scheduling algorithms on multiprocessor systems to high-level job scheduling algorithms in cluster, cloud, and grid environments [4], [5], [6], [9], [21], [27], [26].

Catalyurek et al. [6] investigated how to dynamically restore the balance in parallel scientific

Distributed and parallel query processing framework and distributed semantic caching

Many scientific data analysis applications share common features in the overall query processing workflow, although they process different types of raw datasets. Since scientific datasets are commonly represented in a multi-dimensional space, they are often accessed via multi-dimensional range queries.
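As a simple illustration (the class and field names below are ours, not part of the framework's API), such a range query can be represented by its per-dimension bounds, and its center point can serve as the query's location in the multi-dimensional problem space for scheduling purposes:

from dataclasses import dataclass
from typing import List

@dataclass
class RangeQuery:
    low: List[float]    # lower bound in each dimension (e.g., latitude, longitude, time)
    high: List[float]   # upper bound in each dimension

    def center(self) -> List[float]:
        # Center point used as the query's location in the problem space.
        return [(l + h) / 2.0 for l, h in zip(self.low, self.high)]

q = RangeQuery(low=[10.0, 120.0], high=[12.0, 124.0])
print(q.center())   # [11.0, 122.0]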

Fig. 1 shows the architecture of our distributed and parallel query processing middleware for scientific data analysis applications. This architecture aims to build an efficient

Multiple scientific query scheduling policy from geometric perspective

In this section, as background, we discuss an existing query scheduling policy called BEMA (Balanced Exponential Moving Average) [15], which has been shown to outperform DEMA (Distributed Exponential Moving Average) [16], and we identify its limitations.

Multiple query scheduling with Exponential Moving Kernel Density Estimation (EM-KDE)

In this section, we propose a novel scheduling policy, EM-KDE (Exponential Moving Kernel Density Estimation). Unlike BEMA, EM-KDE quickly adapts to sudden changes in the query distribution while achieving both load balance and a high cache-hit ratio.

We first introduce our KDE-based scheduling method, then address the dimensionality of scientific data analysis queries. Next, we describe the exponential moving method that reflects the recent query trend, and then provide the EM-KDE
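As a rough illustration of the approach described in this section, the following sketch assumes a one-dimensional query key in [0, 1) and a histogram approximation of the kernel density estimate; the bin count, the EMA weight, and the rebalancing details are illustrative simplifications rather than the exact EM-KDE algorithm. The estimated density is split into regions of equal probability mass, one per server, so that each server is expected to receive an equal share of the load while nearby queries keep landing on the same server:

NUM_BINS = 64

class EMKDEScheduler:
    def __init__(self, num_servers, alpha=0.05):
        self.num_servers = num_servers
        self.alpha = alpha                   # EMA weight given to recent queries
        self.hist = [1.0] * NUM_BINS         # estimated query density (uniform prior)
        self.boundaries = self._rebalance()  # per-server boundaries in key space

    def _rebalance(self):
        # Split the estimated density into num_servers regions of equal area so
        # that each server is expected to receive the same amount of query load.
        total = sum(self.hist)
        target, acc, bounds = total / self.num_servers, 0.0, []
        for i, h in enumerate(self.hist):
            acc += h
            while acc >= target and len(bounds) < self.num_servers - 1:
                bounds.append((i + 1) / NUM_BINS)
                acc -= target
        return bounds + [1.0]

    def schedule(self, key):
        # Route the query to the server whose key range contains it; nearby
        # (cache-friendly) queries therefore go to the same server.
        server = next(s for s, b in enumerate(self.boundaries) if key < b or b == 1.0)
        # Decay the old density estimate and boost the query's bin (the EMA step).
        bin_idx = min(int(key * NUM_BINS), NUM_BINS - 1)
        self.hist = [(1 - self.alpha) * h for h in self.hist]
        self.hist[bin_idx] += self.alpha * NUM_BINS
        self.boundaries = self._rebalance()
        return server

scheduler = EMKDEScheduler(num_servers=4)
print(scheduler.schedule(0.42))   # e.g. server 1 under an initially uniform density

After every query the histogram is decayed and the query's bin is boosted, which is how the exponential moving average lets the server boundaries track a changing query distribution.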

Experiments

In this section, we evaluate the EM-KDE, BEMA, and round-robin scheduling policies in terms of query response time, cache hit ratio (which indicates how well the queries are clustered), and the standard deviation of the number of queries processed across the back-end servers (which measures load balance; a lower standard deviation indicates better load balancing).
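For concreteness, the two aggregate metrics can be computed as in the following illustrative helpers (the function names are ours, not part of the framework):

import statistics

def load_balance_stddev(queries_per_server):
    # Standard deviation of the number of queries processed by each back-end
    # server; a lower value indicates better load balance.
    return statistics.pstdev(queries_per_server)

def cache_hit_ratio(cache_hits, total_queries):
    # Fraction of queries answered, at least in part, from a distributed semantic cache.
    return cache_hits / total_queries if total_queries else 0.0

print(load_balance_stddev([251, 248, 250, 251]))             # ~1.22
print(cache_hit_ratio(cache_hits=730, total_queries=1000))   # 0.73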

A satellite remote sensing data visualization tool is an example of the scientific data analysis applications implemented on top of our distributed query

Conclusion and future work

In distributed query processing systems where the caching infrastructure is distributed and scales with the number of servers, leveraging cached results and achieving load balance are equally important for improving the overall system throughput. Conventional scheduling policies that consider only load balancing fail to take advantage of data reuse, while scheduling policies that consider only data reuse may suffer from load imbalance.

In this paper we propose a novel intelligent distributed

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2014R1A1A2058843) and KEIT of Korea funded by the IT R&D program MKE/KEIT (No. 10041608).


References (30)

  • Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, Andrew Goldberg, Quincy: Fair scheduling for...
  • Manolis Katevenis et al., Weighted round-robin cell multiplexing in a general-purpose ATM switch chip, IEEE J. Sel. Areas Commun. (1991)
  • Solomon Kullback et al., On information and sufficiency, Ann. Math. Statist. (1951)
  • Daniel A. Menasce et al., Scaling for E-Business: Technologies, Models, Performance, and Capacity Planning (2000)
  • Bongki Moon et al., Analysis of the clustering properties of the Hilbert space-filling curve, IEEE Trans. Knowl. Data Eng. (2001)

Youngmoon Eom is a graduate student at Ulsan National Institute of Science and Technology (UNIST), Republic of Korea. He earned his B.S. in Electrical and Computer Engineering in 2013 from UNIST. He was a visiting intern student in the Computer Science Department at the University of Maryland, College Park, in the summer of 2011. His research interests are in high performance computing, virtual machine technologies, big data processing platforms, and cloud computing.

Deukyeon Hwang is an undergraduate student at Ulsan National Institute of Science and Technology (UNIST), Republic of Korea. He has worked on distributed query processing middleware systems as an intern in the UNIST data intensive computing lab. He also worked as a summer intern in the Computer Science Department at the University of Maryland, College Park, in the summer of 2013. His research interests are in supercomputing and distributed and parallel middleware systems.

Junyong Lee is a software engineer at Gala Lab, Republic of Korea. He earned his B.S. in Electrical and Computer Engineering in 2012 from Ulsan National Institute of Science and Technology (UNIST), Republic of Korea. He worked on distributed query processing middleware systems as an undergraduate intern in the UNIST data intensive computing lab in 2011. His research interests are in cloud computing platforms for multi-user games, and distributed and parallel middleware systems.

Jonghwan Moon is an undergraduate student at Ulsan National Institute of Science and Technology (UNIST), Republic of Korea. He has worked on distributed query processing middleware systems as an intern in the UNIST data intensive computing lab since his sophomore year. His research interests are in computer vision, image processing, and high performance computing.

Minho Shin is an assistant professor at Myongji University, Republic of Korea. He earned his M.S. and Ph.D. in Computer Science from the University of Maryland, College Park, USA, in 2003 and 2007, respectively. He earned his B.S. in Computer Science and Statistics from Seoul National University, Seoul, Korea, in 1998. His research interests are in wireless networks, wireless network security, and user privacy in people-centric sensing. He has filed several patents in the US, Korea, and India, and has refereed articles for many journals and conferences.

Beomseok Nam is an assistant professor in the School of Electrical and Computer Engineering at UNIST (Ulsan National Institute of Science and Technology), Republic of Korea. Before joining UNIST, he was a senior member of technical staff at Oracle in Redwood Shores, CA. He received his Ph.D. in Computer Science in 2007 from the University of Maryland, College Park, and obtained a B.S. (1997) and an M.S. (1999) from Seoul National University, Republic of Korea. His research interests are in the areas of data-intensive computing, multi-dimensional indexing, distributed and parallel high performance computing middleware, and cluster, cloud, and grid computing.
