Data mining middleware for wide-area high-performance networks

https://doi.org/10.1016/j.future.2006.03.024

Abstract

In this paper, we describe two distributed, data intensive applications that were demonstrated at iGrid 2005 (iGrid Demonstration US109 and iGrid Demonstration US121). One involves transporting astronomical data from the Sloan Digital Sky Survey (SDSS) and the other involves computing histograms over multiple high-volume data streams. Both rely on newly developed data transport and data mining middleware. Specifically, we describe a new version of the UDT network protocol called Composable-UDT, a file transfer utility based on UDT called UDT-Gateway, and an application for building histograms on high-volume data flows called BESH (Best Effort Streaming Histogram). For both demonstrations, we include a summary of the experimental studies performed at iGrid 2005.

Introduction

High-speed (1 Gb/s and 10 Gb/s) wide-area networks provide us with the opportunity to deploy data intensive applications over large geographic areas. Until recently, distributed data intensive applications were usually designed to minimize inter-process data communications; if large data transfers could not be avoided, large data sets were sometimes loaded onto disks or tapes and physically sent to remote sites. As a consequence, there were usually substantial delays when analyzing large distributed data sets, especially when two or more such data sets had to be integrated.

For example, telescopes in the Sloan Digital Sky Survey (SDSS) [19] collect gigabytes of data per day. This data is currently stored locally, and a data release is made periodically, e.g., quarterly. The data is then sent to astronomers around the world via disks or tapes. Analyses that produce large data sets are difficult to exchange between astronomers. Also, overlaying a second data set on top of the SDSS data in order to discover astronomical objects that are too faint to be identified from one data set alone requires substantial effort.

With high-speed wide-area optical links connecting the observation stations, processing centers, and astronomers, these data sets and the analysis results can now be shared in near real time. Thus the processing delay can be reduced significantly and different data sets can be combined more easily.

However, existing applications cannot automatically take advantage of these emerging high-speed networks. First, TCP, the de facto Internet transport protocol as usually deployed, significantly under-utilizes the network bandwidth in high-speed, long-distance environments. Several alternatives and enhancements to TCP have been developed over the past several years [8], including UDT [12]. Second, the current generation of data mining software and middleware was not designed to process data at high speeds across distributed computing sites and data sources.

At iGrid 2005, we demonstrated three middleware applications designed to address these issues. One is a new version of the UDT protocol, which we have described previously [12], that is composable in the sense that it is designed to easily support different congestion control algorithms [10]. We call this version Composable-UDT. The remaining two middleware applications are built over Composable-UDT. The first of these is a file transfer utility called UDT-Gateway, which provides access to UDT-based data services from TCP-based applications over the “last mile”. This greatly expands the population of end users that can use UDT-based data services. The second is a best-effort online histogram application called BESH (Best Effort Streaming Histogram).
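The gateway idea can be sketched as a simple relay. This is our illustration only, not the actual UDT-Gateway code: the real gateway speaks UDT on the wide-area leg, whereas here both legs are TCP so the sketch stays self-contained, and all names are ours:

```python
import socket
import threading

def start_echo_server():
    """Stand-in for the remote UDT data service: echoes one request back."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)

    def run():
        conn, _ = srv.accept()
        with conn:
            conn.sendall(conn.recv(4096))

    threading.Thread(target=run, daemon=True).start()
    return srv.getsockname()[1]

def start_gateway(remote_port):
    """Accept one ordinary TCP client and relay one request/reply upstream.

    In the real UDT-Gateway the upstream leg would be a UDT connection.
    """
    gw = socket.socket()
    gw.bind(("127.0.0.1", 0))
    gw.listen(1)

    def run():
        client, _ = gw.accept()
        with client, socket.create_connection(("127.0.0.1", remote_port)) as up:
            up.sendall(client.recv(4096))   # TCP last mile -> wide-area leg
            client.sendall(up.recv(4096))   # wide-area leg -> TCP last mile

    threading.Thread(target=run, daemon=True).start()
    return gw.getsockname()[1]

# A TCP-only client reaches the "UDT" service through the local gateway:
gw_port = start_gateway(start_echo_server())
with socket.create_connection(("127.0.0.1", gw_port)) as c:
    c.sendall(b"request for SDSS data")
    reply = c.recv(4096)
```

The point of the design is that the client needs nothing beyond a standard TCP socket; all protocol translation is concentrated at the gateway.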

At iGrid 2005, we demonstrated Composable-UDT, UDT-Gateway and BESH in two demonstrations. The first was iGrid Demonstration US121, which transported data from the SDSS. The second was iGrid Demonstration US109, which computed streaming histograms using logs from the web servers that served results of the 1998 World Cup.

In Section 2, we describe the experimental setup. In Section 3, we describe the data transport middleware: UDT and UDT-Gateway. In Section 4, we describe the data mining middleware for computing histograms on high-volume streaming data. The experimental results are described in Section 5. Section 6 briefly reviews related work. Section 7 contains a summary and conclusion.

Section snippets

Experimental setup

In this section, we describe the hardware and network infrastructure used in our iGrid 2005 demonstrations. We also describe the data sets that we used.

Data transport middleware

It is well known that TCP substantially under-utilizes the network bandwidth in high-bandwidth-delay product environments. We have analyzed this problem and developed practical solutions since 2001 [9], [10], [11], [12], [13], [14].
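To make the under-utilization concrete, a standard back-of-envelope calculation (our illustration, not taken from the cited analyses) contrasts the bandwidth-delay product of a fast long-distance path with the steady-state throughput bound of Mathis et al. for loss-based TCP:

```python
# Back-of-envelope numbers for a high bandwidth-delay product (BDP) path.
# Parameter values below (1 Gb/s, 120 ms RTT, 1e-5 loss) are assumptions
# chosen for illustration, not measurements from the paper.

def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bps * rtt_s / 8

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. steady-state bound for loss-based TCP: MSS/(RTT*sqrt(p))."""
    return (mss_bytes * 8) / (rtt_s * loss_rate ** 0.5)

# A 1 Gb/s trans-Pacific path with 120 ms RTT:
print(bdp_bytes(1e9, 0.120))                      # ~15 MB in flight needed
print(mathis_throughput_bps(1460, 0.120, 1e-5))   # ~31 Mb/s, far below 1 Gb/s
```

Even a tiny loss rate thus caps standard TCP at a small fraction of the link capacity, which is the gap UDT-style protocols aim to close.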

As mentioned above, at iGrid 2005 we demonstrated the current version of UDT (called Composable-UDT) and a file transfer utility based on Composable-UDT called UDT-Gateway. In this section, we give a brief review of both.
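As a rough illustration of what "composable" means here (a Python sketch with names of our choosing; UDT itself is C++ and its actual callback interface differs), the transport engine stays fixed while the congestion control policy is a pluggable object whose event handlers adjust the window and sending rate:

```python
class CongestionControl:
    """Base controller; the transport engine invokes these callbacks."""
    def __init__(self):
        self.cwnd = 16.0             # congestion window, in packets
        self.snd_interval_us = 10.0  # inter-packet sending interval

    def on_ack(self, seqno):
        pass

    def on_loss(self, seqnos):
        pass

    def on_timeout(self):
        pass

class AIMDControl(CongestionControl):
    """A TCP-like additive-increase/multiplicative-decrease policy."""
    def on_ack(self, seqno):
        self.cwnd += 1.0 / self.cwnd            # additive increase per ACK

    def on_loss(self, seqnos):
        self.cwnd = max(self.cwnd / 2.0, 2.0)   # multiplicative decrease

# The engine only sees the base-class interface, so a new congestion
# control algorithm is added by subclassing, not by changing the engine.
cc = AIMDControl()
cc.on_ack(1)
cc.on_loss([2, 3])
```

Separating the packet-sending machinery from the control policy is what lets different algorithms be dropped in and compared on the same transport.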

Data mining middleware

In this section, we describe the data mining middleware that we demonstrated at iGrid 2005.

Simply developing high-speed data transport middleware will not by itself enable wide-area data intensive applications. We also need data mining middleware that scales to high-volume data flows. At iGrid, we demonstrated data mining middleware that supports a streaming model for processing data. This model places two requirements on the algorithm: first, the data is examined only once; second, only a fixed amount of memory may be used, independent of the length of the stream.
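A minimal example of this streaming model (our illustration, not the actual BESH code) is a fixed-bin histogram: each item is inspected exactly once, and memory is bounded by the number of bins regardless of how long the stream runs:

```python
class StreamingHistogram:
    """One-pass histogram over a bounded range, using fixed memory."""
    def __init__(self, lo, hi, nbins):
        self.lo, self.hi = lo, hi
        self.width = (hi - lo) / nbins
        self.counts = [0] * nbins   # memory independent of stream length

    def add(self, x):
        """Each stream item is examined once and then discarded."""
        if self.lo <= x < self.hi:
            self.counts[int((x - self.lo) / self.width)] += 1

h = StreamingHistogram(0.0, 10.0, 5)
for x in [0.5, 1.2, 3.3, 9.9, 4.7]:   # items arrive one at a time
    h.add(x)
print(h.counts)  # → [2, 1, 1, 0, 1]
```

Both requirements of the model are visible in the code: `add` never revisits an item, and `counts` never grows.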

Experimental results

For each of our demonstrations, we had two official time slots. We also tested our applications, especially the BESH algorithm, during the night.

US109 experimental results. The performance of US109 is recorded in Fig. 4, Fig. 6 and Fig. 7, which were captured from the real-time display during one of our demonstrations. Fig. 4 shows the real-time dynamic histogram computed over the four data streams. Fig. 6 shows the aggregate throughput of the streaming data mining application. The average speed is around

Related work

Moving large data sets over high-speed wide-area networks has been recognized as a challenging task for many years. During iGrid 2002, various groups demonstrated prototypes of several different tools for high-performance data transport [2], [3], [9], [16], [18], [21].

Since then, various new data transport protocols or related congestion control algorithms [8], [10], [12] have been designed and developed. Comparison between different protocols is now commonly regarded as a complicated topic, as

Conclusions

In this paper, we have described two demonstrations at iGrid 2005 that use data transport middleware and data mining middleware tools that we have developed.

For the first demonstration, we used the UDT-Gateway file transfer utility to transfer astronomical data from the iGrid 2005 conference to Korea. We transferred over 797 GB of data at a mean rate of 1027 Mb/s. As far as we are aware, this was the first time that astronomical data of this size had been transported across the Pacific.
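As a sanity check on these figures (our arithmetic, assuming decimal gigabytes), the implied transfer time is:

```python
# Implied duration of the reported transfer: 797 GB at 1027 Mb/s.
volume_bits = 797 * 1e9 * 8    # 797 GB, decimal gigabytes assumed
rate_bps = 1027 * 1e6          # mean rate of 1027 Mb/s
seconds = volume_bits / rate_bps
print(seconds, seconds / 3600)  # roughly 6200 s, i.e. about 1.7 hours
```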

For the

Acknowledgments

This work was supported in part by the National Science Foundation under grant ANI 9977868, the Department of Energy under grant DE-FG02-04ER25639, and the U.S. Army Pantheon Project.


References (23)

  • M. Goutelle et al., A Survey of transport protocols other than standard TCP


    Robert L. Grossman is the Director of the Laboratory for Advanced Computing and the National Center for Data Mining at the University of Illinois at Chicago, where he has been a faculty member since 1988. He is also the spokesperson for the Data Mining Group (DMG), an industry consortium responsible for the Predictive Model Markup Language (PMML), an XML language for data mining and predictive modeling. He is the President of Open Data Partners, which provides consulting and outsourced services focused on data. He has published over 100 papers in refereed journals and proceedings on internet computing, data mining, high-performance networking, business intelligence, and related areas, and lectured extensively at conferences and workshops.

    Yunhong Gu is a research scientist at the National Center for Data Mining of the University of Illinois at Chicago (UIC). He received a Ph.D. in Computer Science from UIC in 2005. His current research interests include computer networks and distributed systems. He is the developer of UDT, an application-level data transport protocol. UDT is deployed in distributed data intensive applications over high-speed wide-area networks (e.g., 1 Gb/s or above; both private and shared networks). UDT uses UDP to transfer bulk data and has its own reliability and congestion control mechanisms.

    David Hanley has a bachelor’s degree from UIC and is pursuing his master’s degree. He is the co-author of two computer books (‘C: Just the FAQ’s’ and ‘Visual J++ Unleashed’). He is co-author of numerous papers and has won a series of computer programming contests.

    Michal Sabala received his Bachelor’s Degree in Computer Engineering from the University of Illinois at Champaign-Urbana. He is a senior research programmer and systems administrator at the National Center for Data Mining at UIC.

    Joe Mambretti is the Director of the International Center for Advanced Internet Research at Northwestern University (iCAIR, www.icair.org), the Director of the Metropolitan Research and Education Network (MREN, www.mren.org), Co-Director of the StarLight international exchange (www.starlight.net), a member of the Executive Committee for I-WIRE, principal investigator for OMNInet and for the Distributed Optical Testbed, and a research participant for the OptIPuter initiative. The mission of iCAIR is to accelerate leading-edge innovation and enhanced digital communications through advanced Internet technologies, in partnership with the international community. iCAIR accomplishes its mission by undertaking large-scale (e.g., global, national, regional, metro) projects focused on high-performance resource intensive applications, advanced communications middleware, and optical and photonic networking. These initiatives include basic research, testbeds and prototype implementations. He is co-editor of the forthcoming book, “Grid Networks: Enabling Grids With Advanced Communications Technology”, which will be published by Wiley.

    Alex Szalay is the Alumni Centennial Professor of Astronomy at the Johns Hopkins University. He is also Professor in the Department of Computer Science. He is a cosmologist, working on the statistical measures of the spatial distribution of galaxies and galaxy formation. He was born and educated in Hungary. After graduation, he spent postdoctoral periods at University of California Berkeley and the University of Chicago, before accepting a faculty position at Johns Hopkins. In 1990 he was elected to the Hungarian Academy of Sciences as a Corresponding Member. He is the architect for the Science Archive of the Sloan Digital Sky Survey. He has been collaborating with Jim Gray of Microsoft to design an efficient system for performing data mining on the SDSS Terabyte sized archive, based on innovative spatial indexing techniques. He is leading a grass-roots standardization effort to bring the next-generation Terabyte-sized databases in astronomy to a common basis, so that they will be interoperable—the Virtual Observatory. He is Project Director of the US National Science Foundation (NSF)-funded National Virtual Observatory. He is involved in the GriPhyN and iVDGL projects, creating testbed applications for the Computational Grid. He has written over 340 papers in various scientific journals, covering areas from theoretical cosmology to observational astronomy, spatial statistics and computer science. In 2003 he was elected as a Fellow of the American Academy of Arts and Sciences. In 2004 he received one of the Alexander Von Humboldt Prizes in Physical Sciences.

    Ani Thakar is a Research Scientist in the Center for Astrophysical Sciences at the Johns Hopkins University. He was born in India and completed his college education in Canada and the US. He has a Bachelor’s degree in Physics and Computer Science (combined Honors) and a Ph.D. in Astronomy. He worked as a software engineer for several years in Canada prior to his Ph.D. His research interests are data intensive science, large scientific databases and computational astrophysics. He is the chief database scientist for the Sloan Digital Sky Survey (SDSS) and Project Manager for SDSS Science Archive software development at JHU. He is a member of the NSF-funded National Virtual Observatory (NVO) Project Team, Principal Investigator on NASA Applied Information Systems Research Program (AISRP) projects and a Co-Investigator on the Large Synoptic Survey Telescope (LSST) project. He is a Collaborator on the Digital Laboratory for Multi-scale Science (DLMS), a joint NSF-funded Major Research Initiative (MRI) between the Math, Mechanical Engineering, Computer Science and Astronomy departments at JHU to analyse data-cubes from large turbulence simulations. He is also a collaborator on an NSF-funded Science and Engineering Information Integration and Informatics (SEIII) project to investigate dynamic partitioning and caching of large datasets. He has collaborated with Jim Gray of Microsoft and Alex Szalay (JHU) on the design and development of the SDSS Science Archive. He co-designed and built an earlier object-oriented version of the Terabyte SDSS Science Archive database.

    Kazumi Kumazoe received B.E. and M.E. degrees in Computer Sciences and Electronics from Kyushu Institute of Technology, Iizuka, Japan in 1993 and 1995, respectively. She is currently a researcher of the NICT (National Institute of Information and Communications Technology, Japan) JGNII research project. Her research interests include high-speed transport protocols. She is a member of IEICE.

    Yuji Oie received B.E., M.E. and D.E. degrees from Kyoto University, Kyoto, Japan in 1978, 1980 and 1987, respectively. From 1983 to 1990, he was with the Department of Electrical Engineering, Sasebo College of Technology, Sasebo. From 1990 to 1995, he was an Associate Professor in the Department of Computer Science and Electronics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology, Iizuka. From 1995 to 1997, he was a Professor in the Information Technology Center, Nara Institute of Science and Technology. Since April 1997, he has been a Professor in the Department of Computer Science and Electronics, Faculty of Computer Science and Systems Engineering, Kyushu Institute of Technology. He is currently a leader of the NICT (National Institute of Information and Communications Technology, Japan) JGNII research project. His research interests include performance evaluation of computer communication networks, high-speed networks, and queuing systems. He is a fellow of IPSJ and IEICE, and a member of the IEEE.

    Minsun Lee is currently a senior researcher of Research Networking Team, Korea Institute of Science and Technology Information (KISTI), Daejeon, Korea. She received a B.S. degree in Physics from Sookmyung Women’s University, Seoul, Korea and an M.S. degree in Electrical engineering from University of Nebraska-Lincoln, Nebraska, US. Her research interests include data compression technique, global networking and sensor network engineering.

    Yoonjoo Kwon is working for the KISTI Supercomputing Center as a network researcher. She received B.E. and M.E. degrees in computer engineering from Sungkyunkwan University. Her research interests are TCP performance enhancement over long and fat networks, traffic monitoring/measurement/analysis, and network security.

    Woojin Seok is working for the KISTI Supercomputing Center. He received a B.E. degree in computer engineering from Kyungpook National University and an M.S. degree in computer science from University of North Carolina at Chapel Hill. His research interests are TCP moving over heterogeneous networks, TCP over long and fat networks, traffic monitoring/measurement/analysis, and network simulations.

    1. Also with Open Data Partners.
