Modeling throughput sampling size for a cloud-hosted data scheduling and optimization service
Highlights
► We present a model to calculate the smallest amount of sampling data needed to estimate file transfer throughput. ► An online sampling strategy is combined with the parallel stream optimization model. ► The proposed model is used in a Cloud-hosted data scheduler. ► The optimal throughput estimated from the sampling transfers closely matches the actual optimized file transfer throughput.
Introduction
The “data deluge” in the last decade has changed the way we understand and handle data dependencies for scientific as well as commercial applications. Large scientific experiments, such as environmental and coastal hazard prediction [1], climate modeling [2], genome mapping [3], and high-energy physics simulations [4], [5], generate data volumes reaching hundreds of terabytes per year [6]. Data collected from remote sensors and satellites, dynamic data-driven applications, digital libraries, and preservation efforts are also producing extremely large datasets for real-time or offline processing [7], [8]. This data deluge in scientific applications necessitates collaboration and sharing among national and international education and research institutions, which results in frequent large-scale data movement across widely distributed sites. We see a very similar trend in commercial applications as well. According to a recent study by Forrester Research [9], 77% of the 106 large organizations that operate two or more datacenters run regular backup and replication applications among three or more sites. Also, more than 50% of them have over one petabyte of data in their primary datacenter and expect their inter-datacenter throughput requirements to double or triple over the next couple of years [10]. As a result, Google is now deploying a large-scale inter-datacenter copy service [11], and background traffic has become dominant in Yahoo!’s aggregate inter-datacenter traffic [12].
Several national and regional optical networking initiatives such as Internet2 [13], ESnet [14], XSEDE/TeraGrid [15], and LONI [16] provide high-speed network connectivity to their users to mitigate the data bottleneck. Recent developments in networking technology provide scientists with high-speed optical links reaching 100 Gbps in capacity [17]. However, a majority of users fail to obtain even a fraction of the theoretical speeds these networks promise, due to issues such as sub-optimal protocol tuning, inefficient end-to-end routing, disk performance bottlenecks on the sending and/or receiving ends, and server processor limitations. For example, Garfinkel reports that sending a 1 TB forensics dataset from Boston to the Amazon S3 storage system took several weeks [18]. For this reason, many companies prefer sending their data through a shipment service provider such as UPS or FedEx rather than using the Internet [19]. This has led some researchers to develop systems like Pandora (People and Networks Moving Data Around), which automates the creation of cooperative transfer plans [19]. In these systems, information is gathered about available Internet and shipment links, and this data is used as the input to algorithms that solve for optimal data transfer plans. This means that having high-speed networks in place is important but not sufficient. Being able to use these high-speed networks effectively is becoming increasingly important for wide-area data replication as well as for federated Cloud computing in a distributed setting.
We are developing a Cloud-hosted data transfer and optimization service which will provide enhanced data management capabilities, such as data aggregation and connection caching; early error detection and recovery; scheduled storage management; and end-to-end throughput optimization for a broad range of data-intensive Cloud computing applications. This Cloud-hosted data transfer and optimization service is based on our experience with the Stork data scheduler [20]. One of the key components of such a service is the ability to estimate the end-to-end data transfer throughput and delivery time, which can be fed into other scheduling and high-level planning tools for advanced reservations, provisioning, and co-scheduling purposes. In our previous work, we analyzed various factors that affect the end-to-end data transfer throughput in wide-area distributed environments, such as number of parallel streams, CPU speed, and disk I/O speed. We have shown the effects of CPU-, disk-, and network-level parallelism in removing the bottlenecks one by one and increasing the end-to-end data transfer throughput [21].
In another study [22], we showed that single-stream TCP can fail to utilize the available bandwidth, especially on high-bandwidth, long-RTT network paths. We developed a model that uses an optimal number of parallel streams to obtain the highest possible throughput without congesting the network. This method performs three sampling data transfers with different parallel stream numbers and then calculates the optimal achievable throughput. The rest of the data transfer is completed using this optimal stream number. This feature is implemented as part of the latest version of the Stork data scheduler [20], [23]. One of the biggest issues in this method was deciding the size of the sampling transfers: it should give an accurate snapshot of the bandwidth, but at the same time not cause so much overhead that the overall transfer finishes inefficiently. Since this decision depended on many factors, such as the characteristics of the data transfers, the network bandwidth, and the end systems, we left it as a parameter to be decided by the user.
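The three-sample optimization step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the parallel-stream throughput model Th(n) = n / sqrt(a·n^c + b) from our earlier work, and the bisection solver, search bounds, and sample stream counts (1, 2, 4) are choices made here for the sketch.

```python
import math

def fit_stream_model(samples):
    """Fit Th(n) = n / sqrt(a * n**c + b) to three (streams, throughput)
    sampling transfers.  Rearranging gives n**2 / Th**2 = a * n**c + b,
    so differences between samples isolate c, then a and b follow."""
    (n1, t1), (n2, t2), (n3, t3) = samples
    y1, y2, y3 = (n1 / t1) ** 2, (n2 / t2) ** 2, (n3 / t3) ** 2
    target = (y1 - y2) / (y2 - y3)

    # Solve (n1**c - n2**c) / (n2**c - n3**c) = target for c by bisection.
    def g(c):
        return (n1 ** c - n2 ** c) / (n2 ** c - n3 ** c) - target

    lo, hi = 2.01, 16.0   # c > 2 is required for throughput to have a peak
    for _ in range(200):
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    c = (lo + hi) / 2
    a = (y1 - y2) / (n1 ** c - n2 ** c)
    b = y1 - a * n1 ** c
    return a, b, c

def optimal_streams(a, b, c, max_streams=64):
    """Pick the integer stream count that maximizes predicted throughput."""
    def th(n):
        return n / math.sqrt(a * n ** c + b)
    return max(range(1, max_streams + 1), key=th)
```

The remainder of the transfer would then be run with `optimal_streams(...)` parallel streams; in practice the samples come from actual short transfers rather than synthetic data.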
In this study, we develop a methodology to decide the optimal sampling (probe) size for file transfers using available information such as the estimated bandwidth capacity, RTT, and file size. We used mining and statistics methods to analyze the data we generated, with several parameters carefully chosen, and to find a relationship between the optimal sampling size, the file size, and the Bandwidth-Delay Product (BDP). Our model provides very accurate predictions of the actual file transfer throughput while using a small sampling size. This model can be used in data scheduling algorithms along with bandwidth and throughput prediction tools.
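To make the quantities concrete: the BDP is the product of the path's bandwidth and its round-trip time, i.e. the amount of data in flight on the path. The sketch below computes the BDP and a sampling size from it; the specific heuristic (a small multiple of the BDP, capped at a fraction of the file size, with `bdp_multiple` and `max_fraction` values invented here) is purely illustrative and is not the fitted relationship derived in this study.

```python
def bandwidth_delay_product(bandwidth_bps, rtt_s):
    """BDP in bytes: bandwidth (bits/s) times RTT (s), divided by 8."""
    return bandwidth_bps * rtt_s / 8

def sampling_size(file_size, bandwidth_bps, rtt_s,
                  bdp_multiple=4, max_fraction=0.1):
    """Hypothetical heuristic: sample a few BDPs worth of data so TCP can
    ramp up, but never more than a fixed fraction of the file itself."""
    bdp = bandwidth_delay_product(bandwidth_bps, rtt_s)
    return min(int(bdp_multiple * bdp), int(max_fraction * file_size))
```

For a 1 Gbps path with a 100 ms RTT, the BDP is 12.5 MB, so this sketch would probe with 50 MB of a 1 GB file; the actual model replaces the fixed multiple with the mined relationship between sampling size, file size, and BDP.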
Section snippets
Related work
Globus Online [24] provides data management capabilities to users as hosted “software as a service” (SaaS) and manages fire-and-forget file transfers for big data through thin clients (such as Web browsers) over the Internet. Globus Online does not provide any throughput prediction capabilities, and its transfer optimizations are mostly done manually. The developers mention in [24] that they set the pipelining, parallelism, and concurrency parameters to specific values for three different
Proposed model
In this section, we describe how we generate the data for our mining and statistics strategies, define the relationships between the variables used in our equations, and give a detailed step-by-step description of our model.
Experimental results
We have used both real and emulation testbeds to evaluate our model. In emulation testbeds, we have the freedom of defining our own topology, bandwidth, and delays, as well as the operating system to run on virtual nodes. The Emulab [44] and CRON [45] testbeds are examples of such environments. The CRON testbed runs the same software as Emulab; however, it provides a reconfigurable 10 Gbps network, while Emulab supports up to 1 Gbps. Table 1 shows the different settings we have used to run our
Comparison to Iperf Quick mode
The model proposed in this study differs from existing bandwidth measurement tools and other historical mining techniques in both the goals it targets and its methodology, which makes it harder to devise a fair comparison scenario. These goals make our method more specific and at the same time unique. The first goal is to develop an online measurement technique that captures the changing characteristics of the network traffic. Therefore it is unfair to compare it to algorithms that
A case study on Amazon EC2 Cloud with Stork Data Scheduler
To confirm the effectiveness and applicability of our model, a case study on the Amazon EC2 Cloud is designed and realized with Stork Data Scheduler [20]. Stork is a batch scheduler specialized in data placement and movement. It understands the characteristics of data placement jobs and implements specific queuing, scheduling and optimization techniques. It acts like an I/O control system between the user application and the underlying protocols and storage servers. It has the ability to do
Conclusion
In this study, we presented a dynamic sampling model to be used in the estimation of the available bandwidth and throughput as part of a Cloud-hosted data scheduling and optimization service. Our strategy finds a relationship between file size, bandwidth, RTT and sampling size and predicts a sampling size that can be used to find the actual achievable throughput. This strategy works in the application level and takes into account specific data transfer characteristics and parallelism in
References (51)
- et al., Resilience to natural hazards: how useful is this concept?, Global Environmental Change Part B: Environmental Hazards (2003)
- et al., Basic local alignment search tool, Journal of Molecular Biology (1990)
- et al., The network weather service: a distributed resource performance forecasting service for metacomputing, Journal of Future Generation Computer Systems (1999)
- et al., The national center for atmospheric research community climate model: CCM3, Journal of Climate (1998)
- CMS: the US Compact Muon Solenoid project
- A Toroidal LHC ApparatuS project (ATLAS)
- et al., The data deluge: an e-science perspective
- E. Ceyhan, T. Kosar, Large scale data management in sensor networking applications, in: Proceedings of Secure...
- et al., Data management challenges in coastal applications, Journal of Coastal Research (2007)
- Forrester Research, The future of data center wide-area networking
- Stork data scheduler: mitigating the data bottleneck in e-science, Philosophical Transactions of the Royal Society A
- Prediction of optimal parallelism level in wide area data transfers, IEEE Transactions on Parallel and Distributed Systems
- A data throughput prediction and optimization service for widely distributed many-task computing, IEEE Transactions on Parallel and Distributed Systems
- Software as a service for data scientists, Communications of the ACM
Esma Yildirim received her B.S. degree from Fatih University and M.S. degree from Marmara University Computer Engineering Departments in Istanbul, Turkey. She worked for one year in Avrupa Software Company for the Development of ERP Software. She also worked as a Lecturer in Fatih University Vocational School until 2006. She received her Ph.D. from the Louisiana State University Computer Science Department in 2010. She has worked at the University at Buffalo (SUNY) as a researcher. She is currently an assistant professor at Fatih University, Istanbul, Turkey. Her research interests are data-intensive distributed computing, high performance computing, and Cloud computing.
Jangyoung Kim received his B.S. degree in Computer Science from Yonsei University in Seoul, Korea and his M.S. degree in Computer Science and Engineering from Pennsylvania State University in University Park. He worked as a Teaching Assistant at Pennsylvania State University. Earlier, he also participated in a programming internship at Samsung. Currently, he is pursuing his Ph.D. in Computer Science and Engineering at the University at Buffalo (SUNY). His research interests are data-intensive distributed computing, Cloud computing, and throughput optimization in high-speed networks.
Tevfik Kosar is an Associate Professor in the Department of Computer Science and Engineering, University at Buffalo. Prior to joining UB, Kosar was with the Center for Computation and Technology (CCT) and the Department of Computer Science at Louisiana State University. He holds a B.S. degree in Computer Engineering from Bogazici University, Istanbul, Turkey and an M.S. degree in Computer Science from Rensselaer Polytechnic Institute, Troy, NY. Dr. Kosar received his Ph.D. in Computer Science from the University of Wisconsin-Madison. Dr. Kosar’s main research interests lie at the intersection of petascale distributed systems, eScience, Grids, Clouds, and collaborative computing, with a focus on large-scale data-intensive distributed applications. He is the primary designer and developer of the Stork distributed data scheduling system, and the lead investigator of the state-wide PetaShare distributed storage network in Louisiana. Some of the awards received by Dr. Kosar include the NSF CAREER Award, LSU Rainmaker Award, LSU Flagship Faculty Award, Baton Rouge Business Report’s Top 40 Under 40 Award, and 1012 Corridor’s Young Scientist Award.