Modeling throughput sampling size for a cloud-hosted data scheduling and optimization service

https://doi.org/10.1016/j.future.2013.01.003

Abstract

As big-data processing and analysis comes to dominate the usage of Cloud systems, the need for Cloud-hosted data scheduling and optimization services increases. One key component of such a service is the capability to estimate available bandwidth and achievable throughput, since all scheduling and optimization decisions are built on top of this information. The biggest challenge in providing these estimation capabilities is dynamically deciding what proportion of the actual dataset, when transferred, gives an accurate estimate of the bandwidth and throughput achievable by transferring the whole dataset. That proportion of data is called the sampling size (or the probe size). Although small fixed sample sizes worked well for high-latency, low-bandwidth networks in the past, high-bandwidth networks require much larger and more dynamic sample sizes, since an accurate estimation now also depends on how fast the transfer protocol can saturate that fat network link. In this study, we present a model to decide the optimal sampling size based on the data size and the estimated capacity of the network. Our results show that, in the majority of cases, the predicted sampling size closely matches the target best sampling size for a given file transfer.

Highlights

► We present a model to calculate the smallest amount of sampling data to transfer to find the file transfer throughput.
► An online sampling strategy is combined with the parallel stream optimization model.
► The proposed model is used in a Cloud-hosted data scheduler.
► The estimated optimal throughput calculated with the sampling transfers is accurate compared to the actual optimized file transfer throughput.

Introduction

The “data deluge” in the last decade has changed the way we understand and handle data dependencies for scientific as well as commercial applications. Large scientific experiments, such as environmental and coastal hazard prediction  [1], climate modeling  [2], genome mapping  [3], and high-energy physics simulations  [4], [5], generate data volumes reaching hundreds of terabytes per year  [6]. Data collected from remote sensors and satellites, dynamic data-driven applications, digital libraries, and preservation archives are also producing extremely large datasets for real-time or offline processing  [7], [8]. This data deluge in scientific applications necessitates collaboration and sharing among national and international education and research institutions, which results in frequent large-scale data movement across widely distributed sites. We see a very similar trend in commercial applications as well. According to a recent study by Forrester Research  [9], 77% of the 106 large organizations that operate two or more datacenters run regular backup and replication applications among three or more sites. Also, more than 50% of them have over one petabyte of data in their primary datacenter and expect their inter-datacenter throughput requirements to double or triple over the next couple of years  [10]. As a result, Google is now deploying a large-scale inter-datacenter copy service  [11], and background traffic has become dominant in Yahoo!’s aggregate inter-datacenter traffic  [12].

Several national and regional optical networking initiatives such as Internet2  [13], ESnet  [14], XSEDE/TeraGrid  [15], and LONI  [16] provide high-speed network connectivity to their users to mitigate the data bottleneck. Recent developments in networking technology provide scientists with high-speed optical links reaching 100 Gbps in capacity  [17]. However, a majority of users fail to obtain even a fraction of the theoretical speeds these networks promise, due to issues such as sub-optimal protocol tuning, inefficient end-to-end routing, disk performance bottlenecks on the sending and/or receiving ends, and server processor limitations. For example, Garfinkel reports that sending a 1 TB forensics dataset from Boston to the Amazon S3 storage system took several weeks  [18]. For this reason, many companies prefer sending their data through a shipment service provider such as UPS or FedEx rather than using the Internet  [19]. This has led some researchers to develop systems like Pandora (People and Networks Moving Data Around), which automates the creation of cooperative transfer plans  [19]. In these systems, information is gathered about available Internet and shipment links, and this data is used as the input to algorithms that solve for optimal data transfer plans. This means that having high-speed networks in place is important but not sufficient. Being able to use these high-speed networks effectively is becoming increasingly important for wide-area data replication as well as for federated Cloud computing in a distributed setting.

We are developing a Cloud-hosted data transfer and optimization service which will provide enhanced data management capabilities, such as data aggregation and connection caching; early error detection and recovery; scheduled storage management; and end-to-end throughput optimization for a broad range of data-intensive Cloud computing applications. This Cloud-hosted data transfer and optimization service is based on our experience with the Stork data scheduler  [20]. One of the key components of such a service is the ability to estimate the end-to-end data transfer throughput and delivery time, which can be fed into other scheduling and high-level planning tools for advanced reservations, provisioning, and co-scheduling purposes. In our previous work, we analyzed various factors that affect the end-to-end data transfer throughput in wide-area distributed environments, such as number of parallel streams, CPU speed, and disk I/O speed. We have shown the effects of CPU-, disk-, and network-level parallelism in removing the bottlenecks one by one and increasing the end-to-end data transfer throughput  [21].

In another study  [22], we showed that single-stream TCP can fail to utilize the available bandwidth, especially in the case of high-bandwidth, long-RTT network paths. We developed a model that uses an optimal number of parallel streams to get the highest possible throughput without congesting the network. This method performs three sampling transfers with different numbers of parallel streams and then calculates the optimal achievable throughput. The rest of the data transfer is completed using this optimal stream number. This feature is implemented as part of the latest version of the Stork data scheduler  [20], [23]. One of the biggest issues in this method was deciding the size of the sampling transfers, which should give an accurate snapshot of the bandwidth but at the same time not cause much overhead, so that the overall transfer could finish efficiently. Since this decision depended on many factors, such as the characteristics of the data transfers, the network bandwidth, and the end-systems, we left it as a parameter to be decided by the user.
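The three-sample fitting step can be sketched in code. The paper's actual model in  [22] is more elaborate (it also fits an exponent, solved via Newton's method); as a simplified stand-in, the sketch below assumes a quadratic-denominator throughput curve Th(n) = n / sqrt(a·n² + b·n + c), recovers a, b, c from three sampled (streams, throughput) points, and picks the stream count that maximizes the modeled curve. The function names and the search cap are illustrative, not from the paper.

```python
import numpy as np

def fit_stream_model(samples):
    """Fit Th(n) = n / sqrt(a*n^2 + b*n + c) from three (n, throughput) samples.

    Rearranging gives (n / Th)^2 = a*n^2 + b*n + c, which is linear in
    (a, b, c), so three samples determine the coefficients exactly.
    """
    A = np.array([[n**2, n, 1.0] for n, _ in samples])
    y = np.array([(n / th) ** 2 for n, th in samples])
    a, b, c = np.linalg.solve(A, y)
    return a, b, c

def optimal_streams(a, b, c, n_max=32):
    """Return the stream count in [1, n_max] that maximizes modeled throughput."""
    def th(n):
        return n / np.sqrt(a * n**2 + b * n + c)
    return max(range(1, n_max + 1), key=th)
```

With coefficients where b is negative (throughput eventually degrades as streams contend), the modeled curve peaks at a finite stream count, which is the value used for the remainder of the transfer.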

In this study, we develop a methodology to decide the optimal sampling (probe) size for file transfers using available information such as the estimated bandwidth capacity, RTT, and file size. We used data mining and statistical methods to analyze the data we generated, with several parameters carefully chosen, and to find a relationship between the optimal sampling size, the file size, and the Bandwidth-Delay Product (BDP). Our model provides very accurate results in predicting the throughput of the actual file transfer by using a smaller sampling size. This model can be used in data scheduling algorithms along with bandwidth and throughput prediction tools.

Section snippets

Related work

Globus Online  [24] provides data management capabilities to users as hosted “software as a service” (SaaS) and manages fire-and-forget file transfers for big data through thin clients (such as Web browsers) over the Internet. Globus Online does not provide any throughput prediction capabilities, and its transfer optimizations are mostly done manually. The developers mention in  [24] that they set the pipelining, parallelism, and concurrency parameters to specific values for three different

Proposed model

In this section, we describe how we generate the data for our mining and statistics strategies, define the relationships between the variables used in our equations, and give a detailed step-by-step description of our model.
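The central relationship (sampling size as a function of file size and BDP) can be illustrated with a sketch. This is not the paper's fitted model; it is only an illustrative heuristic under the assumption that a useful sample must be large enough for the transfer protocol to exit slow start and briefly saturate the path, while staying a small fraction of the file. The multiplier, floor, and cap values are hypothetical.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-Delay Product: bytes in flight on a saturated path."""
    return bandwidth_bps * rtt_s / 8.0

def sampling_size(file_size, bandwidth_bps, rtt_s,
                  bdp_multiple=20, max_fraction=0.1, floor=1 << 20):
    """Illustrative heuristic: sample several BDPs' worth of data so the
    protocol can ramp up, but never more than a fraction of the file."""
    bdp = bdp_bytes(bandwidth_bps, rtt_s)
    size = max(floor, bdp_multiple * bdp)
    return min(size, max_fraction * file_size)
```

For example, on a 10 Gbps path with 50 ms RTT the BDP is 62.5 MB, so a 20x-BDP sample is 1.25 GB; for small files the fraction cap dominates instead. The fat-link intuition from the abstract shows up directly: the higher the BDP, the larger the sample needed for an accurate throughput snapshot.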

Experimental results

We have used both real and emulation testbeds to evaluate our model. In emulation testbeds, we have the freedom of defining our own topology, bandwidth, and delays, as well as the operating system to run on virtual nodes. The Emulab  [44] and CRON  [45] testbeds are examples of such environments. The CRON testbed runs the same software as Emulab; however, it provides a reconfigurable 10 Gbps network, while Emulab supports up to 1 Gbps. Table 1 shows the different settings we have used to run our

Comparison to Iperf Quick mode

The model proposed in this study differs from existing bandwidth measurement tools and other historical mining techniques in terms of the goals it targets and its methodology, which makes it harder to develop a possible comparison scenario. These goals make our method more specific and at the same time unique. The first goal is to develop an online measurement technique that captures the changing characteristics of the network traffic. Therefore it would be unfair to compare it to algorithms that

A case study on Amazon EC2 Cloud with Stork Data Scheduler

To confirm the effectiveness and applicability of our model, a case study on the Amazon EC2 Cloud is designed and realized with Stork Data Scheduler  [20]. Stork is a batch scheduler specialized in data placement and movement. It understands the characteristics of data placement jobs and implements specific queuing, scheduling and optimization techniques. It acts like an I/O control system between the user application and the underlying protocols and storage servers. It has the ability to do

Conclusion

In this study, we presented a dynamic sampling model to be used in the estimation of the available bandwidth and throughput as part of a Cloud-hosted data scheduling and optimization service. Our strategy finds a relationship between file size, bandwidth, RTT and sampling size and predicts a sampling size that can be used to find the actual achievable throughput. This strategy works in the application level and takes into account specific data transfer characteristics and parallelism in


References (51)

  • N. Laoutaris, M. Sirivianos, X. Yang, P. Rodriguez, Inter-datacenter bulk transfers with netstitcher, in: ACM SIGCOMM,...
  • D. Ziegler, Distributed peta-scale data transfer....
  • Y. Chen, S. Jain, V.K. Adhikari, Z.-L. Zhang, K. Xu, A first look at inter-data center traffic characteristics via...
  • Internet2....
  • Energy sciences network (ESNet)....
  • TeraGrid/XSEDE....
  • Louisiana optical network initiative (LONI)....
  • ARRA/ANI testbed....
  • S. Garfinkel, An evaluation of Amazon's grid computing services: EC2, S3 and SQS, Tech. Rep. TR-08-07, August...
  • B. Cho, I. Gupta, Budget-constrained bulk data transfer via Internet and shipping networks, in: Proceedings of the 8th...
  • T. Kosar et al.

    Stork data scheduler: mitigating the data bottleneck in e-science

    Philosophical Transactions of the Royal Society A

    (2011)
  • E. Yildirim, T. Kosar, Network-aware end-to-end data throughput optimization, in: Proceedings of Network-Aware Data...
  • E. Yildirim et al.

    Prediction of optimal parallelism level in wide area data transfers

    IEEE Transactions on Parallel and Distributed Systems

    (2011)
  • D. Yin et al.

    A data throughput prediction and optimization service for widely distributed many-task computing

    IEEE Transactions on Parallel and Distributed Systems

    (2011)
  • B. Allen et al.

    Software as a service for data scientists

    Communications of the ACM

    (2012)

    Esma Yildirim received her B.S. degree from Fatih University and M.S. degree from Marmara University Computer Engineering Departments in Istanbul, Turkey. She worked for one year in Avrupa Software Company for the Development of ERP Software. She also worked as a Lecturer in Fatih University Vocational School until 2006. She received her Ph.D. from the Louisiana State University Computer Science Department in 2010. She has worked at the University at Buffalo (SUNY) as a researcher. She is currently an assistant professor at Fatih University, Istanbul, Turkey. Her research interests are data-intensive distributed computing, high performance computing, and Cloud computing.

    Jangyoung Kim received his B.S. degree in Computer Science from Yonsei University in Seoul, Korea, and his M.S. degree in Computer Science and Engineering from Pennsylvania State University in University Park. He worked as a Teaching Assistant at Pennsylvania State University. Earlier, he also participated in a programming internship at Samsung. Currently, he is pursuing his Ph.D. in Computer Science and Engineering at the University at Buffalo (SUNY). His research interests are data-intensive distributed computing, Cloud computing, and throughput optimization in high-speed networks.

    Tevfik Kosar is an Associate Professor in the Department of Computer Science and Engineering, University at Buffalo. Prior to joining UB, Kosar was with the Center for Computation and Technology (CCT) and the Department of Computer Science at Louisiana State University. He holds a B.S. degree in Computer Engineering from Bogazici University, Istanbul, Turkey and an M.S. degree in Computer Science from Rensselaer Polytechnic Institute, Troy, NY. Dr. Kosar has received his Ph.D. in Computer Science from the University of Wisconsin-Madison. Dr. Kosar’s main research interests lie in the cross-section of petascale distributed systems, eScience, Grids, Clouds, and collaborative computing with a focus on large-scale data-intensive distributed applications. He is the primary designer and developer of the Stork distributed data scheduling system, and the lead investigator of the state-wide PetaShare distributed storage network in Louisiana. Some of the awards received by Dr. Kosar include NSF CAREER Award, LSU Rainmaker Award, LSU Flagship Faculty Award, Baton Rouge Business Report’s Top 40 Under 40 Award, and 1012 Corridor’s Young Scientist Award.
