DBStream: A holistic approach to large-scale network traffic monitoring and analysis
Introduction
Since the introduction of computer networks in general, and the Internet in particular, networked computer systems have become increasingly important to modern society. Today’s Internet is a highly complex, distributed system, spanning the globe and reaching even into outer space to the International Space Station. Human communication relies to a large extent on emails, (mobile) phone calls and social media. It has become normal to buy electronics, clothes or even cars, book flights and make bank transfers over the Internet. Financial markets exchange large volumes of stocks via interconnected high-frequency trading systems. This shows that computer networks have become a cornerstone of today’s modern society.
Network operators are responsible for the proper functioning of those highly complex networks. They face the challenge of detecting and reacting very quickly to network anomalies and security breaches while, at the same time, planning ahead to adapt their networks to novel usage patterns. Network monitoring and analysis systems play a central role in supporting operators in these tasks. However, the above challenges impose a wide range of requirements on the system in charge of collecting, storing, and processing the gathered monitoring data. Such a system should (i) be able to store data over extended time periods, (ii) make analysis results available quickly, on the order of minutes or even seconds, and (iii) allow network experts to easily specify and extend typical analysis tasks. Although many isolated systems and approaches have been proposed to capture and analyze network data [1], [2], [3], [4], there is still a clear lack of open, comprehensive approaches for integrating, combining and post-processing data from multiple sources.
In this paper, we propose the open source system DBStream, a holistic approach to large-scale network data analysis (Fig. 1). DBStream is a Data Stream Warehouse (DSW) based on traditional database techniques, designed with comprehensive network monitoring in mind. We show that DBStream is performance-wise at least on par with recent large-scale data processing frameworks such as Hadoop and Spark. We report the use of DBStream for several network monitoring and analysis applications, and the experience from its deployment in a production mobile network. Finally, we show how an integration of DBStream with the well-known Weka Machine Learning (ML) toolkit can be used for on-line detection of Machine-to-Machine (M2M) devices in mobile networks, using only high-level statistical information.
The specific contributions of the paper are:
- We propose the open source DSW DBStream.
- We present the high-level, microservice architecture of DBStream.
- We show the high performance of DBStream by comparing it to state-of-the-art large-scale data processing frameworks.
- We demonstrate how the Continuous Execution Language (CEL) empowers users to solve analytic challenges effectively.
The remainder of the paper is organized as follows. Section 2 presents the related work. In Sections 3 and 4, we describe the system architecture and the processing language of DBStream, respectively. In Section 5, the performance of DBStream is compared to the in-memory MapReduce framework Spark. We discuss the impact of job scheduling on DBStream performance in Section 6. In Section 7, we provide an extensive report of the DBStream usage in several network traffic monitoring and analysis projects, as well as in a nation-wide mobile network. A prototypical integration of DBStream with an ML library is presented in Section 8, along with its application to M2M traffic detection. Finally, Section 9 provides the overall conclusions and an outlook on future work.
Section snippets
Related work
The introduction of the term Big Data led to a new era in which many scientific and commercial organizations started designing and developing novel large-scale data processing systems. Most of them achieve increased performance by re-implementing the whole or parts of the data processing engine. They often relax Atomicity, Consistency, Isolation, Durability (ACID) constraints [5] and/or apply novel data processing paradigms. Still, a limitation of such systems is the inability to cope with
DBStream system design
The main purpose of DBStream is to store and analyze large amounts of network monitoring data. However, it might also be applied to data from other application domains, e.g., smart grids, smart cities, intelligent transportation systems, or any other use case that requires continuous processing of large amounts of heterogeneous data over time. DBStream is implemented as a middleware layer on top of PostgreSQL. Whereas all data processing is done in PostgreSQL, DBStream offers the ability
Continuous execution language (CEL)
In this section, we describe the batched stream processing language CEL originally introduced in [29] in full detail. Table 1 gives an overview of the important terms of CEL. We start with a simple example explaining the main functions of CEL. In the following Section 4.1 we detail the Continuous Tables (CTs) used in CEL. Section 4.2 describes how time windows are handled in DBStream. Finally, in Section 4.3 we explain multiple complex examples showing the full expressive power of the presented
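As a rough illustration of the batched, time-windowed processing style described in this section, the following Python sketch groups timestamped rows into fixed-size windows and emits one aggregate per window. It is not CEL syntax and not DBStream code; the function names, the 60-second window length and the `(timestamp, key, value)` row layout are assumptions made only for this example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window length, chosen for illustration


def window_start(ts, size=WINDOW_SECONDS):
    """Return the start of the fixed-size window that contains timestamp ts."""
    return ts - (ts % size)


def process_batches(rows, size=WINDOW_SECONDS):
    """Aggregate rows window by window.

    rows: iterable of (timestamp, key, value), assumed sorted by timestamp.
    Yields (window_start, {key: summed value}) each time a window closes.
    """
    current = None
    totals = defaultdict(int)
    for ts, key, value in rows:
        w = window_start(ts, size)
        if current is not None and w != current:
            # A row from a later window arrived: the previous window is complete.
            yield current, dict(totals)
            totals = defaultdict(int)
        current = w
        totals[key] += value
    if current is not None:
        # Flush the last window at end of input (it may be only partially filled).
        yield current, dict(totals)


rows = [(0, "a", 10), (30, "b", 5), (59, "a", 1), (60, "a", 7), (125, "b", 2)]
for start, sums in process_batches(rows):
    print(start, sums)
```

The sketch only conveys the idea that results are produced per completed time window rather than per incoming row; in a real stream warehouse the input would arrive continuously instead of as a finite list.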
Performance evaluation
In this section, we compare the performance of DBStream to the state-of-the-art Big Data processing framework Spark.
Improving performance with intelligent scheduling
In the setup considered for the performance comparison, the main bottleneck of the DBStream system is disk I/O. However, we will show that it is possible to minimize disk I/O overhead by intelligent task scheduling. In this section, we give an introduction to a more general scheduling problem found in disk-based continuous processing systems executing shared workflows. The automation of the scheduling presented here is part of future work, and a first step towards this automation has already
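To see why the execution order of jobs that share inputs can affect disk I/O, consider the following toy model in Python. It is a sketch under strong assumptions, not the scheduler of DBStream: memory is modeled as a small least-recently-used cache of whole input tables, and every cache miss counts as one disk read. The function `disk_reads`, the job names and the table names are hypothetical.

```python
def disk_reads(schedule, cache_slots=1):
    """Count disk reads for a job schedule under a simple LRU cache model.

    schedule: list of (job_name, input_table) executed in the given order.
    cache_slots: number of input tables that fit in memory (assumed).
    """
    cache = []  # least recently used first, most recently used last
    reads = 0
    for _job, table in schedule:
        if table in cache:
            cache.remove(table)  # cache hit: only refresh recency
        else:
            reads += 1  # cache miss: table has to be read from disk
            if len(cache) >= cache_slots:
                cache.pop(0)  # evict the least recently used table
        cache.append(table)
    return reads


# Four jobs; j1 and j3 read table A, j2 and j4 read table B.
interleaved = [("j1", "A"), ("j2", "B"), ("j3", "A"), ("j4", "B")]
# Run jobs with the same input back to back (sorted() is stable).
grouped = sorted(interleaved, key=lambda job: job[1])

print(disk_reads(interleaved))  # 4
print(disk_reads(grouped))      # 2
```

In this model, running jobs that share an input consecutively halves the number of disk reads when only one table fits in memory. With two cache slots both orders need two reads, which shows that the benefit depends on how the working set compares to the available memory.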
Experience from NTMA projects
DBStream has been adopted in several research projects for running a number of NTMA applications. To provide a concrete example, we report in Section 7.1 several statistics from running DBStream in the network monitoring project DARWIN4 [33], where it has been used as the central analysis system. In addition, in Section 8 we present the M2M TRAffic Classification (MTRAC) approach [34] as one prominent advanced analytics application of DBStream.
Besides these illustrative examples, there exist a
MTRAC - M2M traffic classification
In this section, we describe MTRAC, one of the most important applications of DBStream not under Non-Disclosure Agreement (NDA) constraints.
Machine-to-Machine (M2M) traffic has become a major share of today’s mobile networks and will grow even further in the near future. The quickly increasing number of M2M devices introduces unprecedented traffic patterns and fosters the interest of mobile operators, who wish to discover and track those devices in their networks. MTRAC enables the
Conclusion and future work
In this paper, we presented DBStream, a Data Stream Warehouse (DSW) tailored for, but not limited to, Network Traffic Monitoring and Analysis (NTMA) applications. We have shown that, if instrumented correctly, a PostgreSQL database engine can process large amounts of data in a fast and efficient way.
In a performance study, we demonstrated that a single-node instance of DBStream can outperform a cluster of 10 Spark nodes by a factor of 2.6, running the same query workload on the same dataset.
The
Acknowledgments
The research leading to these results has received funding from the European Union under the FP7 Grant Agreement n. 318627, mPlane project. The work has been partially performed within the framework of the projects Darwin 4 and N-0 at the Telecommunications Research Center Vienna (FTW), and has been partially funded by the Austrian Government and the City of Vienna through the program COMET, and by the Vienna Science and Technology Fund (WWTF) through project ICT15-129, BigDAMA. We would like
References (48)
- Traffic monitoring and analysis for the optimization of a 3G network. IEEE Wireless Commun. (2006)
- The architecture of CoralReef: an Internet traffic monitoring software suite. Passive and Active Network Measurement Workshop (PAM) (2001)
- Experiences of Internet traffic monitoring with Tstat. IEEE Netw. (2011)
- High speed network traffic analysis with commodity multi-core systems. Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement 2010, Melbourne, Australia, November 1-3, 2010 (2010)
- SQL databases v. NoSQL databases. Commun. ACM (2010)
- Gigascope: a stream database for network applications. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9–12, 2003 (2003)
- Aurora: a new model and architecture for data stream management. VLDB J. (2003)
- Esper: Event Processing for Java (2016)
- StreamBase Inc., Streambase: Real-time, Low Latency Data Processing with a Stream Processing Engine, 2016....
- MonetDB/DataCell: online analytics in a streaming column-store. PVLDB (2012)
- A sequence-oriented stream warehouse paradigm for network monitoring applications. Passive and Active Measurement - 13th International Conference, PAM 2012, Vienna, Austria, March 12–14th, 2012. Proceedings
- MapReduce: simplified data processing on large clusters. Commun. ACM
- Hadoop: The Definitive Guide
- Hive - a warehousing solution over a map-reduce framework. PVLDB
- Dremel: interactive analysis of web-scale datasets. PVLDB
- Spark: cluster computing with working sets. 2nd USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’10, Boston, MA, USA, June 22, 2010
- Muppet: MapReduce-style processing of fast data. PVLDB
- SCALLA: a platform for scalable one-pass analytics using MapReduce. ACM Trans. Database Syst.
- Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. 4th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud’12, Boston, MA, USA, June 12-13, 2012
- Issues and future directions in traffic classification. IEEE Netw.
- Reviewing traffic classification. Data Traffic Monitoring and Analysis
- Grid-clustering: an efficient hierarchical clustering method for very large data sets. 13th International Conference on Pattern Recognition, ICPR 1996, Vienna, Austria, 25-29 August, 1996
- The bang-clustering system: grid-based data analysis. Advances in Intelligent Data Analysis, Reasoning about Data, Second International Symposium, IDA-97, London, UK, August 4-6, 1997, Proceedings
- A survey of techniques for Internet traffic classification using machine learning. IEEE Commun. Surv. Tut.
Dr. Arian Baer received his PhD in computer science from the University of Vienna in 2015. During the course of his PhD, he was a researcher at the Telecommunications Research Center Vienna (FTW). In 2009, he received his Diploma degree in Computer Science from the Friedrich–Alexander Universitaet Erlangen–Nuernberg. His PhD topic was the application of database approaches to big and fast data streams common in network monitoring environments. His research interests include network monitoring and analytics, data stream warehousing, query scheduling, data mining and machine learning. Currently he is employed as a big data architect at BMW Group in Munich, Germany. He has (co)authored about 20 publications in international journals and conferences.
Dr. Pedro Casas received the electrical engineering degree from the University of the Republic, Uruguay, in 2005, and the Ph.D. degree in computer science from Telecom Bretagne, France, in 2010. He is a Scientist with the Austrian Institute of Technology (AIT), Vienna, Austria. He held Research and Teaching Assistant positions with the University of the Republic, between 2003 and 2012, and was at the French Research Laboratory LAAS-CNRS, Toulouse, France, as a Postdoctoral Research Fellow between 2010 and 2011. Between 2011 and 2015, he was a Senior Researcher with the Telecommunications Research Center Vienna (FTW), Vienna, Austria. His research interests include the monitoring and analysis of network traffic, network security and anomaly detection, QoE modeling and automatic assessment, as well as machine-learning and data mining-based approaches for Networking. He has authored more than 80 networking research papers (50 as main author) in major international conferences and journals. He is the recipient of seven best paper awards in the last 6 years.
Dr. Alessandro D’Alconzo received the M.Sc. degree in Electronic Engineering with honors in 2003, and the Ph.D. in Information and Telecommunication Engineering in 2007, from Polytechnic of Bari, Italy. He is a Scientist in the Digital Safety & Security department of AIT, Austrian Institute of Technology. From 2007 to 2015, he was a Senior Researcher in the Communication Networks Area of the Telecommunications Research Center Vienna (FTW). From 2008 to 2013, he was Management Committee representative for Austria and Secretary of the COST Action IC0703 "Traffic Monitoring and Analysis". He has extensive experience in contributing to and managing EU funded projects, as well as in applied research projects in the field of network traffic measurements in collaboration with national telecommunication operators. His research interests embrace network measurements and traffic monitoring, ranging from the design and implementation of statistics-based anomaly detection algorithms and root cause analysis, to Quality of Experience evaluation, and the application of secure multiparty computation techniques to cross-domain network monitoring and troubleshooting.
Dr. Pierdomenico Fiadino received his BSc and MSc degrees in Computer Engineering from Sapienza University of Rome and a PhD in Electrical Engineering from the Institute of Telecommunications of TU Wien, in 2008, 2010, and 2015, respectively. Since 2010, he has been a Researcher at the Telecommunications Research Center of Vienna (FTW), where he is involved in projects dealing with large-scale network measurements for Internet traffic analysis and the design of Intelligent Transport Systems. His research interests cover network traffic monitoring and analysis, anomaly detection and diagnosis, machine learning and data mining.
Prof. Dr. Lukasz Golab is an assistant professor and Canada Research Chair at the University of Waterloo. Prior to joining Waterloo, he was a Senior Member of Research Staff at AT&T Labs. He holds a B.Sc. from the University of Toronto and a PhD from the University of Waterloo. Lukasz’s research interests include data stream management, data quality and data analytics. He has published over 50 articles and has given tutorials on data stream warehousing at SIGMOD 2013 and ICDE 2014.
Prof. Dr. Marco Mellia, PhD, has research interests in the area of traffic monitoring and analysis, cyber monitoring in general, and Big Data analytics. He has co-authored over 250 papers published in international journals and presented in leading international conferences. He won the IRTF ANR Prize at IETF-88, and best paper awards at IEEE P2P’12, ACM CoNEXT’13, and IEEE ICDCS’15. He is part of the editorial board of ACM/IEEE Transactions on Networking, IEEE Transactions on Network and Service Management, and ACM Computer Communication Review. He holds a position as Associate Professor at Politecnico di Torino, Italy.
Prof. Dr. Erich Schikuta is a professor of computer science in the Workflow Systems and Technology group at the University of Vienna. He obtained a bachelor’s degree in mathematics and a master’s degree and PhD in computer science from the University of Technology of Vienna. His research interests are in the area of information and database systems, parallel and distributed computing, cloud computing and computational intelligence, which resulted in the (co)authorship of about 200 peer-reviewed papers.