DBStream: A holistic approach to large-scale network traffic monitoring and analysis

doi:10.1016/j.comnet.2016.04.020

Computer Networks

Volume 107, Part 1, 9 October 2016, Pages 5-19

https://doi.org/10.1016/j.comnet.2016.04.020

Abstract

In the last decade, many systems for the extraction of operational statistics from computer network interconnects have been designed and implemented. Those systems generate huge amounts of data of various formats and in various granularities, from packet level to statistics about whole flows. In addition, the complexity of Internet services has increased drastically with the introduction of cloud infrastructures, Content Delivery Networks (CDNs) and mobile Internet usage, and complexity will continue to increase in the future with the rise of Machine-to-Machine communication and ubiquitous wearable devices. Therefore, current and future network monitoring frameworks cannot rely only on information gathered at a single network interconnect, but must consolidate information from various vantage points distributed across the network.

In this paper, we present DBStream, a holistic approach to large-scale network monitoring and analysis applications. After a precise system introduction, we show how its Continuous Execution Language (CEL) can be used to automate several data processing and analysis tasks typical for monitoring operational ISP networks. We discuss the performance of DBStream as compared to MapReduce processing engines and show how intelligent job scheduling can increase its performance even further. Furthermore, we show the versatility of DBStream by explaining how it has been integrated to import and process data from two passive network monitoring systems, namely METAWIN and Tstat. Finally, multiple examples of network monitoring applications are given, ranging from simple statistical analysis to more complex traffic classification tasks applying machine learning techniques using the Weka toolkit.

Introduction

Since the introduction of computer networks in general and the Internet more specifically, networked computer systems have become more and more important to modern society. Today’s Internet is a highly complex, distributed system, spanning the globe and reaching even into outer space to the International Space Station. Human communication relies to a large extent on emails, (mobile) phone calls and social media. It has become normal to buy electronics, clothes or even cars, book flights and make bank transfers over the Internet. The financial market exchanges large amounts of stocks via interconnected high frequency trading systems. This shows that computer networks have become a cornerstone of today’s modern society.

Network operators are responsible for the proper functioning of those highly complex networks. They face the challenge of detecting and reacting very quickly to network anomalies and security breaches while, at the same time, planning ahead to adapt their networks to novel usage patterns. Network monitoring and analysis systems play a central role in supporting operators in these tasks. However, the above challenges place a wide range of requirements on the system in charge of collecting, storing, and processing the gathered monitoring data. Such a system should: (i) be able to store data over extended time periods, (ii) make analysis results available quickly, on the order of minutes or even seconds, and (iii) allow network experts to easily specify and extend typical analysis tasks. Whereas many isolated systems and approaches have been proposed to capture and analyze network data [1], [2], [3], [4], there is still a clear lack of open, comprehensive approaches for integrating, combining and post-processing data from multiple sources.

In this paper, we propose the open source system DBStream¹, a holistic approach to large-scale network data analysis (Fig. 1). DBStream is a Data Stream Warehouse (DSW) based on traditional database techniques, designed with comprehensive network monitoring in mind. We show that DBStream is performance-wise at least on par with the most recent large-scale data processing frameworks such as Hadoop and Spark. We report the use of DBStream for several network monitoring and analysis applications, and the experience from its deployment in a production mobile network. Finally, we show that a DBStream integration with the well-known Weka Machine Learning (ML) toolkit can be used for on-line detection of Machine-to-Machine (M2M) devices in mobile networks, using only high level statistical information.

The specific contributions of the paper are:

• We propose the open source DSW DBStream.
• We present the high level, microservice architecture of DBStream.
• We show the high performance of DBStream by comparing it to state-of-the-art large-scale data processing frameworks.
• We demonstrate how the Continuous Execution Language (CEL) empowers users to solve analytic challenges effectively.

The remainder of the paper is organized as follows. Section 2 presents the related work. In Sections 3 and 4, we describe the system architecture and the processing language of DBStream, respectively. In Section 5, the performance of DBStream is compared to the in-memory MapReduce framework Spark. We discuss the impact of job scheduling on DBStream performance in Section 6. In Section 7, we provide an extensive report of the DBStream usage in several network traffic monitoring and analysis projects, as well as in a nation-wide mobile network. A prototypical integration of DBStream with an ML library is presented in Section 8, along with its application to M2M traffic detection. Finally, Section 9 provides the overall conclusions and an outlook on future work.

Section snippets

Related work

The introduction of the term Big Data led to a new era in which many scientific and commercial organizations started designing and developing novel large-scale data processing systems. Most of them achieve increased performance by re-implementing the whole or parts of the data processing engine. They often relax Atomicity, Consistency, Isolation, Durability (ACID) constraints [5] and/or apply novel data processing paradigms. Still, a limitation of such systems is the inability to cope with

DBStream system design

The main purpose of DBStream is to store and analyze large amounts of network monitoring data. However, it might also be applied to data from other application domains, e.g., smart grids, smart cities, intelligent transportation systems, or any other use case that requires continuous processing of large amounts of heterogeneous data over time. DBStream is implemented as a middleware layer on top of PostgreSQL. Whereas all data processing is done in PostgreSQL, DBStream offers the ability

Continuous execution language (CEL)

In this section, we describe the batched stream processing language CEL originally introduced in [29] in full detail. Table 1 gives an overview of the important terms of CEL. We start with a simple example explaining the main functions of CEL. In the following Section 4.1 we detail the Continuous Tables (CTs) used in CEL. Section 4.2 describes how time windows are handled in DBStream. Finally, in Section 4.3 we explain multiple complex examples showing the full expressive power of the presented

Performance evaluation

In this section, we compare the performance of DBStream to the state-of-the-art Big Data processing framework Spark.

Improving performance with intelligent scheduling

In the setup considered for the performance comparison, the main bottleneck of the DBStream system is disk I/O. However, we will show that it is possible to minimize disk I/O overhead by intelligent task scheduling. In this section, we give an introduction to a more general scheduling problem found in disk-based continuous processing systems executing shared workflows. The automation of the scheduling presented here is part of future work and a first step towards this automation has already

Experience from NTMA projects

DBStream has been adopted in several research projects for running a number of NTMA applications. To provide a concrete example, we report in Section 7.1 several statistics from running DBStream in the network monitoring project DARWIN4 [33], where it has been used as the central analysis system. In addition, in Section 8 we present the M2M TRAffic Classification (MTRAC) approach [34] as one prominent advanced analytics application of DBStream.

Besides these illustrative examples, there exist a

MTRAC - M2M traffic classification

In this section, we describe MTRAC, one of the most important applications of DBStream that is not under Non-Disclosure Agreement (NDA) constraints.

Machine-to-Machine (M2M) traffic has become a major share of today’s mobile networks and will grow even further in the near future. The quickly increasing number of M2M devices introduces unprecedented traffic patterns and fosters the interest of mobile operators, who wish to discover and track those devices in their networks. MTRAC enables the

Conclusion and future work

In this paper, we presented DBStream, a Data Stream Warehouse (DSW) tailored for, but not limited to, Network Traffic Monitoring and Analysis (NTMA) applications. We have shown that, if instrumented correctly, a PostgreSQL database engine can process large amounts of data in a fast and efficient way.

In a performance study, we demonstrated that a single-node instance of DBStream can outperform a cluster of 10 Spark nodes by a factor of 2.6, running the same query workload on the same dataset.
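To put the reported factor in per-node terms, the following back-of-the-envelope sketch normalizes the speedup by node count. It is an illustration only: the function name is hypothetical, and it assumes that the throughput of the Spark cluster is spread evenly over its 10 nodes, which the study itself does not claim.

```python
def per_node_advantage(speedup: float, baseline_nodes: int, system_nodes: int = 1) -> float:
    """Ratio of per-node throughput between two systems.

    Assumption (not from the paper): each system's throughput is spread
    evenly across its nodes.
    """
    if speedup <= 0 or baseline_nodes <= 0 or system_nodes <= 0:
        raise ValueError("all arguments must be positive")
    # speedup = throughput(system) / throughput(baseline); dividing each
    # throughput by its node count yields the per-node ratio.
    return speedup * baseline_nodes / system_nodes


# Figures reported in the text: 1 DBStream node vs. 10 Spark nodes, factor 2.6.
ratio = per_node_advantage(speedup=2.6, baseline_nodes=10)
print(f"per-node advantage: {ratio:.1f}x")
```

Under this even-spread assumption, the single DBStream node does roughly the work of 26 Spark nodes for the tested workload; the figure should not be extrapolated to other workloads or cluster sizes.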

The

Acknowledgments

The research leading to these results has received funding from the European Union under the FP7 Grant Agreement n. 318627, mPlane project. The work has been partially performed within the framework of the projects Darwin 4 and N-0 at the Telecommunications Research Center Vienna (FTW), and has been partially funded by the Austrian Government and the City of Vienna through the program COMET, and by the Vienna Science and Technology Fund (WWTF) through project ICT15-129, BigDAMA. We would like

References (48)

F. Ricciato, Traffic monitoring and analysis for the optimization of a 3G network, IEEE Wireless Commun. (2006)
K. Keys et al., The architecture of CoralReef: an Internet traffic monitoring software suite, Passive and Active Network Measurement Workshop (PAM) (2001)
A. Finamore et al., Experiences of internet traffic monitoring with Tstat, IEEE Netw. (2011)
F. Fusco et al., High speed network traffic analysis with commodity multi-core systems, Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement 2010, Melbourne, Australia, November 1-3, 2010 (2010)
M. Stonebraker, SQL databases v. NoSQL databases, Commun. ACM (2010)
C.D. Cranor et al., Gigascope: a stream database for network applications, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9–12, 2003 (2003)
D.J. Abadi et al., Aurora: a new model and architecture for data stream management, VLDB J. (2003)
EsperTech Inc., Esper: Event Processing for Java (2016)
StreamBase Inc., Streambase: Real-time, Low Latency Data Processing with a Stream Processing Engine., 2016....
E. Liarou et al., MonetDB/DataCell: Online analytics in a streaming column-store, PVLDB (2012)
L. Golab et al., A sequence-oriented stream warehouse paradigm for network monitoring applications, Passive and Active Measurement - 13th International Conference, PAM 2012, Vienna, Austria, March 12–14th, 2012. Proceedings (2012)
J. Dean et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
T. White, Hadoop: The Definitive Guide (2012)
A. Thusoo et al., Hive - a warehousing solution over a map-reduce framework, PVLDB (2009)
S. Melnik et al., Dremel: interactive analysis of web-scale datasets, PVLDB (2010)
M. Zaharia et al., Spark: cluster computing with working sets, 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010 (2010)
W. Lam et al., Muppet: MapReduce-style processing of fast data, PVLDB (2012)
B. Li et al., SCALLA: a platform for scalable one-pass analytics using MapReduce, ACM Trans. Database Syst. (2012)
M. Zaharia et al., Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters, 4th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’12, Boston, MA, USA, June 12-13, 2012 (2012)
A. Dainotti et al., Issues and future directions in traffic classification, IEEE Netw. (2012)
S. Valenti et al., Reviewing traffic classification, Data Traffic Monitoring and Analysis (2013)
E. Schikuta, Grid-clustering: an efficient hierarchical clustering method for very large data sets, 13th International Conference on Pattern Recognition, ICPR 1996, Vienna, Austria, 25-29 August, 1996 (1996)
E. Schikuta et al., The BANG-clustering system: grid-based data analysis, Advances in Intelligent Data Analysis, Reasoning about Data, Second International Symposium, IDA-97, London, UK, August 4-6, 1997, Proceedings (1997)
T.T.T. Nguyen et al., A survey of techniques for internet traffic classification using machine learning, IEEE Commun. Surv. Tut. (2008)

Dr. Arian Baer received his PhD in computer science from the University of Vienna in 2015. During the course of his PhD, he was a researcher at the Telecommunications Research Center Vienna (FTW). In 2009, he received his Diploma degree in Computer Science from the Friedrich–Alexander Universitaet Erlangen–Nuernberg. His PhD topic was the application of database approaches to big and fast data streams common in network monitoring environments. His research interests include network monitoring and analytics, data stream warehousing, query scheduling, data mining and machine learning. Currently he is employed as a big data architect at BMW Group in Munich, Germany. He has (co)authored about 20 publications in international journals and conferences.

Dr. Pedro Casas received the electrical engineering degree from the University of the Republic, Uruguay, in 2005, and the Ph.D. degree in computer science from Telecom Bretagne, France, in 2010. He is a Scientist with the Austrian Institute of Technology (AIT), Vienna, Austria. He held Research and Teaching Assistant positions with the University of the Republic, between 2003 and 2012, and was at the French Research Laboratory LAAS-CNRS, Toulouse, France, as a Postdoctoral Research Fellow between 2010 and 2011. Between 2011 and 2015, he was a Senior Researcher with the Telecommunications Research Center Vienna (FTW), Vienna, Austria. His research interests include the monitoring and analysis of network traffic, network security and anomaly detection, QoE modeling and automatic assessment, as well as machine-learning and data mining-based approaches for Networking. He has authored more than 80 networking research papers (50 as main author) in major international conferences and journals. He is the recipient of seven best paper awards in the last 6 years.

Dr. Alessandro D’Alconzo received the M.Sc. degree in Electronic Engineering with honors in 2003, and the Ph.D. in Information and Telecommunication Engineering in 2007, from Polytechnic of Bari, Italy. He is a Scientist in the Digital Safety & Security department of AIT, Austrian Institute of Technology. From 2007 to 2015, he was a Senior Researcher in the Communication Networks Area of the Telecommunications Research Center Vienna (FTW). From 2008 to 2013, he was Management Committee representative for Austria and Secretary of the COST Action IC0703 "Traffic Monitoring and Analysis". He has extensive experience in contributing to and managing EU funded projects, as well as in applied research projects in the field of network traffic measurements in collaboration with national telecommunication operators. His research interests embrace network measurements and traffic monitoring, ranging from design and implementation of statistical based anomaly detection algorithms and root cause analysis, to Quality of Experience evaluation, and application of secure multiparty computation techniques to cross-domain network monitoring and troubleshooting.

Dr. Pierdomenico Fiadino received his BSc and MSc degrees in Computer Engineering from Sapienza University of Rome and a PhD in Electrical Engineering from the Institute of Telecommunications of TU Wien, in 2008, 2010, and 2015, respectively. Since 2010, he has been a Researcher at the Telecommunications Research Center Vienna (FTW), where he is involved in projects dealing with large scale network measurements for Internet traffic analysis and design of Intelligent Transport Systems. His research interests cover network traffic monitoring and analysis, anomaly detection and diagnosis, machine learning and data mining.

Prof. Dr. Lukasz Golab is an assistant professor and Canada Research Chair at the University of Waterloo. Prior to joining Waterloo, he was a Senior Member of Research Staff at AT&T Labs. He holds a B.Sc. from the University of Toronto and a PhD from the University of Waterloo. Lukasz’s research interests include data stream management, data quality and data analytics. He has published over 50 articles and has given tutorials on data stream warehousing at SIGMOD 2013 and ICDE 2014.

The research interests of Prof. Dr. Marco Mellia, PhD, are in the area of traffic monitoring and analysis, cyber monitoring in general, and Big Data analytics. He has co-authored over 250 papers published in international journals and presented in leading international conferences. He won the IRTF ANR Prize at IETF-88, and best paper awards at IEEE P2P’12, ACM CoNEXT’13, and IEEE ICDCS’15. He is part of the editorial board of ACM/IEEE Transactions on Networking, IEEE Transactions on Network and Service Management, and ACM Computer Communication Review. He holds a position as Associate Professor at Politecnico di Torino, Italy.

Prof. Dr. Erich Schikuta is a professor of computer science in the Workflow Systems and Technology group at the University of Vienna. He obtained a bachelor’s degree in mathematics and a master’s degree and PhD in computer science from the University of Technology of Vienna. His research interests are in the area of information and database systems, parallel and distributed computing, cloud computing and computational intelligence, which resulted in the (co)authorship of about 200 peer-reviewed papers.
