Elsevier

Computer Communications

Volume 34, Issue 3, 15 March 2011, Pages 502-514

Clustering botnet communication traffic based on n-gram feature selection

https://doi.org/10.1016/j.comcom.2010.04.007

Abstract

Recognized as one of the most serious security threats to current Internet infrastructure, botnets can be implemented not only over existing well-known applications (e.g. IRC, HTTP, or Peer-to-Peer) but also over unknown or novel applications, which makes botnet detection a challenging problem. Previous attempts at detecting botnets mostly examine traffic content for bot commands on selected network links or set up honeypots. Traffic content, however, can be encrypted as botnets evolve, which causes content-based detection approaches to fail. In this paper, we address this issue and propose a new approach for detecting and clustering botnet traffic on large-scale network application communities. We first classify the network traffic into different applications using traffic payload signatures. A novel decision tree model then classifies the traffic that the payload content leaves unknown (e.g. encrypted traffic) into known application communities. Within each community, network traffic is clustered based on n-gram features selected and extracted from the content of network flows, in order to differentiate the malicious botnet traffic created by bots from the normal traffic generated by human beings on each specific application. We evaluate our approach with seven different traffic traces collected on three different network links, and the results show that the proposed approach successfully detects two IRC botnet traffic traces with a high detection rate and an acceptably low false alarm rate.

Introduction

The recent growth of botnet activity in cyberspace has attracted significant attention from the research community. According to a recent Symantec research report, botnets have become one of the biggest security threats, responsible for a large volume of malicious activities ranging from distributed denial-of-service (DDoS) attacks to spamming, phishing, identity theft and DNS server spoofing [1]. A botnet is a group of compromised computers remotely controlled by one attacker, or by a small group of attackers working together, called a “botmaster”. A bot, also known as a zombie or drone, can refer either to a computer that has been compromised and is part of the botnet or to the malicious program used to compromise new machines into the botnet. The botmaster’s ability to carry out an attack from hundreds or even tens of thousands of computers means increased bandwidth, increased processing power, increased memory for storage and a large number of attack sources, making botnet attacks more malicious and more difficult to detect and defend against.

Unlike viruses, worms and other malware that work as individual entities, a botnet needs a communication architecture that the botmaster(s) can use to send out commands, receive responses to commands from bots, and perform all other tasks needed to manage the bots. This control architecture can be classified according to two criteria. First, it can be classified according to the underlying protocol into IRC-based, HTTP-based and P2P-based. Second, it can be classified according to communication topology and aggregate control architecture into centralized Command and Control (C&C) and Peer-to-Peer (P2P) botnets. In a centralized command and control structure there is a central location, called the botnet C&C server, where all communication between bots and botmaster takes place. The Internet Relay Chat (IRC) protocol, which is used to manage chat sessions, is described in RFC 1459 with the latest update in RFC 2813. In IRC-based C&C, the botmaster creates a channel on the C&C server on which to post commands, and the bots of this botnet must subscribe to this channel in order to access the posted commands. On secured IRC servers, bots must first provide a connection password. Bots join an IRC channel using a unique nickname; the nickname is authenticated and, if it is accepted, the bot has to provide more details, such as its hostname, using the USER command in order to be registered. A bot’s nickname may be rejected if it does not follow the botnet’s nickname pattern, if the client is suspected of being a spy, or to avoid overloading the IRC server. IRC messages are then exchanged as channel “TOPIC” messages or using the “PRIVMSG” or “NOTICE” commands. The bot stays connected to the IRC channel until it leaves the channel using the “PART” command, completely closes its connection with the IRC server using “QUIT”, is forcibly removed from the channel by the botmaster using the “KICK” command, or has its connection to the IRC server closed completely by the botmaster using the “KILL” command.
As a result, with at least one attempt to connect to an IRC server channel, denoted by resending the “NICK” and “USER” commands, the IRC communication pattern during a botnet attack can be as follows:

  • PASS -> NICK+ -> USER+ -> JOIN -> TOPIC -> (PRIVMSG | NOTICE) -> ⋯ -> (PRIVMSG | NOTICE) -> (PART | QUIT | KICK | KILL)
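The command sequence above can be read as a regular expression over command names. The following is only a minimal sketch, not the detection method of this paper: it assumes each session has already been reduced to an ordered list of IRC command names, it treats PASS as optional (only password-protected servers require it), and the names `SESSION_PATTERN` and `matches_session_pattern` are hypothetical.

```python
import re

# Assumption: a session is represented by its ordered IRC command names,
# joined with single spaces. PASS is optional in this sketch because only
# password-protected servers require it.
SESSION_PATTERN = re.compile(
    r"(?:PASS )?"                  # optional connection password
    r"(?:NICK )+"                  # one or more nickname attempts
    r"(?:USER )+"                  # one or more registration attempts
    r"JOIN "                       # join the C&C channel
    r"TOPIC "                      # channel topic message
    r"(?:(?:PRIVMSG|NOTICE) )+"    # one or more message exchanges
    r"(?:PART|QUIT|KICK|KILL)"     # any of the four ways a session ends
)


def matches_session_pattern(commands):
    """Return True if the command names follow the session pattern."""
    joined = " ".join(command.upper() for command in commands)
    return SESSION_PATTERN.fullmatch(joined) is not None


if __name__ == "__main__":
    session = ["PASS", "NICK", "NICK", "USER", "JOIN",
               "TOPIC", "PRIVMSG", "NOTICE", "QUIT"]
    print(matches_session_pattern(session))
```

A match only says that a session has the shape of an IRC client session; legitimate human IRC sessions follow the same command order, so the pattern alone does not separate bots from people.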

Examples of IRC-based bots include Phatbot, Sdbot, Rbot/Rxbot, and GTBot. More information on IRC-based botnets can be found in [2].

In practice, however, detecting and blocking such a centralized IRC botnet is not a difficult task, since the whole botnet can be taken down by blacklisting the IRC server. To overcome this weakness, botnets have evolved to allow more flexibility in the protocols they use, such as HTTP [3], [4], and they are now even moving from a centralized structure to an advanced distributed strategy that removes the single point of failure (i.e. the C&C server). Sinit [5], Phatbot with the WASTE command [6], Nugache [7] and the recent Peacomm (Storm worm) [8] are some successful examples of distributed botnets on the Internet. For instance, the Storm worm spreads using social engineering techniques, mainly emails with tempting subjects of public interest (e.g. politics and public holidays). These emails either contain malicious attachments or link to malicious websites. According to Holz et al. [9], after the user opens a malicious attachment or connects to a malicious server, a Storm binary is installed to infect the new machine. The Storm binary uses a rootkit to avoid detection, and it stores a configuration file containing hash values and IP/port number combinations for a list of peers to connect to after installation. The binary computes a global identifier and stores it. To join the network, it searches for keys that serve as peers to connect to in the network. After contacting these keys, the newly infected bot computes an IP address and TCP port number combination with which to contact the botmaster. The communication that follows takes place between the new bot and the botmaster. After completing the TCP handshake and successful authentication, the botmaster sends commands to the bot over a zlib-encoded communication channel. The botmaster commands they observed instruct bots to send out spam emails or start DDoS attacks. Details of how the Storm worm botnet works and how P2P botnets work can be found in [9].

Existing botnet detection mechanisms have produced a number of good ideas, but they are still far from complete because: (1) botnets are often hidden in existing applications, so their traffic volume is small and their behaviour is very similar to that of normal traffic; (2) classifying traffic into different applications is becoming more challenging and remains an open issue, owing to traffic content encryption and the unreliability of destination port labelling; and (3) botnets are evolving from traditional centralized communication (e.g. IRC) to an advanced distributed strategy (e.g. peer-to-peer). Previous attempts at detecting botnets are mainly based on honeypots, passive anomaly analysis and traffic application classification. Setting up and installing honeypots on the Internet is very helpful for capturing malware and understanding the basic behaviours of botnets. Passive anomaly analysis for detecting botnets in network traffic is usually independent of the traffic content and has the potential to find different types of botnets (e.g. HTTP-based, IRC-based or P2P-based botnets).

In this paper, we focus on traffic-classification-based botnet detection and propose an unsupervised detection framework for clustering new or unknown botnet communities. Self-learning techniques have been widely used for botnet detection in recent years because of their ability to automatically form a model of a subject’s normal or abnormal behaviour. Typical examples include [10], [11]. Depending on the learning paradigm, botnet detection schemes fall into two categories: supervised and unsupervised. In supervised botnet detection, profiles of systems or networks are established by training on a labelled dataset. Unsupervised botnet detection uses unlabelled data to identify bot behaviours. The main drawback of supervised botnet detection is the need to label the training data, which makes the process error-prone, costly and time-consuming. Unsupervised botnet detection addresses these issues by allowing training on an unlabelled dataset, which facilitates online learning and improves detection accuracy. By facilitating online learning, unsupervised approaches have a higher potential to find new botnet attacks; by removing the need for labelling, they create a greater potential for accurate detection.

Most of the unsupervised anomaly detection schemes proposed in the literature so far are based on clustering techniques. Clustering is the organization of data patterns into groups, or clusters, based on some measure of similarity. When applying clustering techniques to botnet detection, determining the number of clusters is difficult because the occurrence of intrusions is unknown. The general approach and current practice assume that data instances always divide into two categories, normal clusters and intrusive clusters, and that normal data instances largely outnumber intrusions. However, these assumptions do not always hold in practice. Normal data instances do not necessarily outnumber intrusions by a large margin; for example, when a highly distributed denial-of-service attack occurs, the assumption leads to a high false alert rate. In order to achieve efficient and effective detection, we propose in this paper a new unsupervised botnet detection framework which consists of: (1) a feature selection module to discriminate IRC traffic from other Internet traffic based on the selected n-gram features, and (2) an HCI (Human–Computer Interactive) metric to distinguish traffic created by bots from Internet traffic generated by human beings. We evaluate our metric with three different clustering algorithms, namely K-means, unmerged X-means and merged X-means, and the results show that the proposed HCI metric separates bot-generated traffic from human-generated traffic with a high classification rate and a low false positive rate.
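To make the two ingredients concrete, the sketch below computes byte n-gram relative frequencies for flow payloads and groups the resulting vectors with a basic K-means. It is an illustration under stated assumptions only: this excerpt does not give the paper's feature selection procedure or its HCI metric, so neither is reproduced here; the function names, the fixed vocabulary, the toy payloads and the deterministic "first k points" initialization are all hypothetical choices made for the example.

```python
def ngram_frequencies(payload: bytes, n: int = 1) -> dict:
    """Relative frequency of every byte n-gram that occurs in a payload."""
    total = len(payload) - n + 1
    if total <= 0:
        return {}
    counts = {}
    for i in range(total):
        gram = payload[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return {gram: count / total for gram, count in counts.items()}


def to_vector(freqs: dict, vocabulary: list) -> list:
    """Project a frequency table onto a fixed, ordered list of n-grams."""
    return [freqs.get(gram, 0.0) for gram in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Basic K-means. Assumption: the first k points seed the centroids,
    which keeps the example deterministic (real runs usually randomize)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Toy payloads standing in for flow contents (hypothetical data).
    payloads = [b"aaaa", b"bbbb", b"aaab", b"bbba"]
    vocabulary = [b"a", b"b"]  # hypothetical "selected" 1-grams
    vectors = [to_vector(ngram_frequencies(p, 1), vocabulary) for p in payloads]
    labels, _ = kmeans(vectors, k=2)
    print(labels)
```

K-means needs the number of clusters up front, which is exactly the difficulty noted above; X-means variants estimate it from the data instead.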

The rest of the paper is organized as follows. Section 2 introduces related work, in which we summarize existing botnet detection approaches. Section 3 presents our feature selection module for botnet detection. Section 4 describes botnet detection based on the HCI metric and a clustering algorithm with the selected features. Section 5 presents the experimental evaluation of our unsupervised detection model with three traffic traces collected on the real Internet and on our testbed network. Section 6 makes some concluding remarks and discusses future work.

Section snippets

Related work

Recent years have seen great interest in botnet detection techniques, which can be classified into three main categories: honeypot-based, passive-anomaly-analysis-based and traffic-application-classification-based. In this section, we discuss the different botnet detection techniques in the related literature.

Botnet detection techniques based on traffic application classification are usually guided by botnet C&C control protocol e.g. if one is only interested in IRC-based botnets

Traffic classification with selected n-gram features

Early common techniques for identifying network applications rely on the association of a particular port with a particular protocol. Such port-number-based traffic classification has been shown to be ineffective due to: (1) the constant emergence of new peer-to-peer networking applications for which IANA does not define corresponding port numbers [26], (2) the dynamic port number assignment used by some applications (e.g. FTP for data transfer), and (3) the encapsulation of different

Botnet detection based on traffic clustering

A general aim of intrusion detection is to find various attack types by modelling signatures of known intrusions (misuse detection) or profiles of normal behaviour (anomaly detection). Botnet detection, however, is more specific because it targets a given application domain. Clustering techniques have been widely used for anomaly detection in recent years and can also be applied to detect unknown (or zero-day) botnet traffic through online learning, without the need for labelling. Fig. 7

Experimental evaluation

We implement a prototype system for the approach and then evaluate it with seven traffic traces collected on three networks: an indoor testbed network, a honeynet with a public Internet connection, and a public network, as mentioned above. The botnet traffic consists of two traces: (1) the Kaiten IRC botnet traffic collected on our testbed network; and (2) the IRC botnet traffic collected on a honeypot deployed in a real network environment.

Our testbed network is composed of a 48-port Gigabit

Conclusion

This paper addresses the issue of botnet detection, since detection is the very first step in combating botnets. As claimed by Gu et al., the biggest limitation of existing botnet detection approaches is that they usually rely on the C&C structure employed, and thus an anomaly detection framework that is independent of the botnet structure is proposed for detecting botnets [10]. Addressing this limitation, we propose in this paper an unsupervised botnet detection framework in which we first identify

References (42)

  • J. Erman et al., Offline/realtime traffic classification using semi-supervised learning, Performance Evaluation (2007)
  • Symantec Internet Security Threat Report. <http://www.symantec.com/business/theme.jsp?themeid=threatreport,...
  • Taxonomy of Botnet Threats....
  • K. Chiang, L. Lloyd, A case study of the rustock rootkit and spam bot, in: Proceedings of USENIX HotBots,...
  • N. Daswani, M. Stoppelman, The anatomy of clickbot. A., in: Proceedings of USENIX HotBots,...
  • Sinit. <http://www.secureworks.com/research/threats/sinit/>,...
  • Phatbot. <http://www.secureworks.com/research/threats/phatbot/>,...
  • Nugache. <http://www.securityfocus.com/news/11390/>,...
  • Storm Worm Analysis....
  • T. Holz, M. Steiner, F. Dahl, E. Biersack, F. Freiling, Measurements and mitigation of peer-to-peer-based botnets: a...
  • G. Gu, R. Perdisci, J. Zhang, W. Lee, BotMiner: clustering analysis of network traffic for protocol- and...
  • T. Strayer, D. Lapsley, R. Walsh, C. Livadas, Botnet Detection: Countering the Largest Security Threat, vol. 36,...
  • C. Livadas, R. Walsh, D. Lapsley, T. Strayer, Using machine learning techniques to identify botnet traffic, in:...
  • W. Wang, B. Fang, Z. Zhang, C. Li, A novel approach to detect IRC-based botnets, in: International Conference on...
  • J. Goebel, T. Holz, Rishi: identify bot contaminated hosts by irc nickname evaluation, in: HotBots’07: Proceedings of...
  • P. Sroufe, S. Phithakkitnukoon, R. Dantu, J. Cangussu, Email shape analysis for spam botnet detection, in: Sixth IEEE...
  • A. Brodsky, D. Brodsky, A distributed content independent method for spam detection, in: Proceedings of the First...
  • Y. Zhao, Y.L. Xie, F. Yu, Q.F. Ke, Y. Yu, Y. Chen, E. Gillum, BotGraph: large-scale spamming botnet detection, in:...
  • G. Gu, J. Zhang, W. Lee, BotSniffer: Detecting botnet command and control channels in network traffic, in: Proceedings...
  • J.R. Binkley, S. Singh, An algorithm for anomaly-based botnet detection, in: USENIX SRUTI: 2nd Workshop on Steps to...
  • M.M. Masud, J. Gao, L. Khan, B. Thuraisingham, Peer to peer botnet detection for cyber-security: a data mining...