Automatic Log Analysis to Prevent Cyber Attacks

Chapter in: Advances in Intelligent Systems Research and Innovation

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 379)

Abstract

Organizations nowadays are more dependent than ever on the internet, making it necessary to provide better security for their networks. However, the rise and growing complexity of automated cyber attacks make this task harder for static approaches and manual examination of the logs, so automated ways of identifying these attacks need to be studied. In this paper we propose an effective Log-based Intrusion Detection System (LIDS) that predicts whether an attack is taking place, based on carefully selected features. The logs from various sources are aggregated into one dashboard and the most discriminative features are first determined. For the attack prediction, several machine learning techniques were comparatively tested, with the decision tree performing best. The proposed system is illustrated on KDD Cup 1999, the largest publicly available labelled log file dataset.
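As a rough, hypothetical illustration of the pipeline described above (not the authors' implementation), the following Python sketch trains a decision tree to separate attack records from normal ones in the KDD Cup 1999 data; the file name, the one-hot encoding of the symbolic features and the 70/30 split are assumptions.

# Minimal sketch (not the chapter's code): binary attack/normal prediction
# with a decision tree on KDD Cup 1999. File name and preprocessing are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each KDD Cup 1999 record has 41 features followed by a label such as "normal.".
data = pd.read_csv("kddcup.data_10_percent", header=None)  # assumed local copy
X = pd.get_dummies(data.iloc[:, :-1])            # one-hot encode symbolic features
y = (data.iloc[:, -1] != "normal.").astype(int)  # 1 = attack, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))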


References

  1. Portnoy, L.: Intrusion detection with unlabeled data using clustering. Ph.D. dissertation, Columbia University (2000)

  2. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning intrusion detection: supervised or unsupervised? In: International Conference on Image Analysis and Processing, pp. 50–57. Springer (2005)

  3. Yen, T.-F., Oprea, A., Onarlioglu, K., Leetham, T., Robertson, W., Juels, A., Kirda, E.: Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks. In: Proceedings of the 29th Annual Computer Security Applications Conference, pp. 199–208. ACM (2013)

  4. Stroeh, K., Madeira, E.R.M., Goldenstein, S.K.: An approach to the correlation of security events based on machine learning techniques. J. Internet Serv. Appl. 4(1), 7 (2013)

  5. Li, W.: Automatic log analysis using machine learning: awesome automatic log analysis version 2.0 (2013)

  6. Vasquez Villano, E.G.: Classification of logs using machine learning technique. Master’s thesis, NTNU (2018)

  7. Vigneswaran, K.R., Vinayakumar, R., Soman, K., Poornachandran, P.: Evaluating shallow and deep neural networks for network intrusion detection systems in cyber security. In: 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6. IEEE (2018)

  8. Dionísio, N., Alves, F., Ferreira, P.M., Bessani, A.: Cyberthreat detection from twitter using deep neural networks. arXiv preprint arXiv:1904.01127 (2019)

  9. Deokar, B., Hazarnis, A.: Intrusion detection system using log files and reinforcement learning. Int. J. Comput. Appl. 45(19), 28–35 (2012)

  10. Bulavas, V.: Investigation of network intrusion detection using data visualization methods. In: 2018 59th International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), pp. 1–6. IEEE (2018)

  11. Rampure, V., Tiwari, A.: A rough set based feature selection on KDD Cup 99 data set. Int. J. Database Theory Appl. 8(1), 149–156 (2015)

  12. Bozhkov, L., Georgieva, P.: Brain neural data analysis with feature space defined by descriptive statistics. In: Iberian Conference on Pattern Recognition and Image Analysis, pp. 415–422. Springer (2015)

  13. Tucker, L.R., MacCallum, R.C.: Exploratory factor analysis. Unpublished manuscript, Ohio State University, Columbus (1997)

  14. Stolfo, J., Fan, W., Lee, W., Prodromidis, A., Chan, P.K.: Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection. Results from the JAM Project by Salvatore, pp. 1–15 (2000)

Acknowledgements

This work was funded by National Funds through the FCT—Foundation for Science and Technology, in the context of the project UID/CEC/00127/2019.

Author information

Corresponding author: Petia Georgieva.

Appendices

Appendix 1: Log Structure and Formats

Before formulating attack hypotheses, it is important to learn as much as possible about the log structure and the systems involved. A better understanding of the environment and settings allows an educated judgement on which attacks actually had a chance of succeeding and which had no chance at all. A typical log structure is depicted in Fig. 6.

Fig. 6 Log structure

1.1 Access

The HTTP access log records all requests processed by the server. Storing the information in the access log is the start of log management. The next step is to analyze this information to produce useful statistics. The format of the access log is highly configurable.

Typical Access Log Format: <ClientIP> - <ClientID> <Timestamp> <Method> <RequestResource> <Protocol> <StatusCode> <RetObjSize> <Referer> <UserAgent>

  • ClientIP: This is the IP address of the client (remote host) which made the request to the server.

  • ClientID: This is the userid of the person requesting the document as determined by HTTP authentication.

  • Timestamp: The time that the request was received.

  • Method: HTTP request method (e.g. GET, POST, PUT, DELETE, etc.).

  • RequestResource: Path to the resource requested by the client.

  • Protocol: HTTP protocol used by the client.

  • StatusCode: This is the status code that the server sends back to the client.

  • RetObjSize: The size of the object returned to the client, not including the response headers.

  • Referer: This gives the site that the client reports having been referred from.

  • UserAgent: The User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.

Log Example: 222.95.39.192 - - [07/Feb/2005:19:52:35-0500] "GET http://ad.trafficmp.com/tmpad/banner/ad/tmp.asp?poID=el0w HTTP/1.0" 404 1187 "http://www.besteach.com/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"

  • ClientIP: 222.95.39.192.

  • ClientID: - (not provided).

  • Timestamp: [07/Feb/2005:19:52:35-0500].

  • Method: GET.

  • RequestResource: http://ad.trafficmp.com/tmpad/banner/ad/tmp.asp?poID=el0w.

  • Protocol: HTTP/1.0.

  • StatusCode: 404.

  • RetObjSize: 1187.

  • Referer: http://www.besteach.com/.

  • UserAgent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98).
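As a minimal sketch (assuming the combined format above, not the chapter's own tooling), the following Python snippet extracts these fields from one access log line with a regular expression:

import re

# Regular expression mirroring the access log fields listed above (ClientIP,
# ClientID, Timestamp, Method, RequestResource, Protocol, StatusCode,
# RetObjSize, Referer, UserAgent).
ACCESS_RE = re.compile(
    r'(?P<ClientIP>\S+) - (?P<ClientID>\S+) \[(?P<Timestamp>[^\]]+)\] '
    r'"(?P<Method>\S+) (?P<RequestResource>\S+) (?P<Protocol>[^"]+)" '
    r'(?P<StatusCode>\d{3}) (?P<RetObjSize>\S+) '
    r'"(?P<Referer>[^"]*)" "(?P<UserAgent>[^"]*)"'
)

def parse_access_line(line):
    """Return the named fields of one access log line, or None if it does not match."""
    match = ACCESS_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ('222.95.39.192 - - [07/Feb/2005:19:52:35-0500] '
               '"GET http://ad.trafficmp.com/tmpad/banner/ad/tmp.asp?poID=el0w HTTP/1.0" '
               '404 1187 "http://www.besteach.com/" '
               '"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')
    print(parse_access_line(example))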

1.2 Error

The error log contains information about errors that the web server encountered when processing requests, such as when files are missing. It looks something like this:

Log Format: <Timestamp> <LogLevel> <ClientIPAndPort> <LogMessage>

  • Timestamp: The current time.

  • LogLevel: Loglevel of the message (e.g. error, notice, warn, etc.).

  • ClientIPAndPort: Client IP address and port of the request.

  • LogMessage: The actual log message.

Log Example: [Tue Feb 22 11:04:40 2005] [error] [client 211.59.0.40] File does not exist: /var/www/html/scripts

  • Timestamp: [Tue Feb 22 11:04:40 2005].

  • LogLevel: error.

  • ClientIPAndPort: 211.59.0.40.

  • LogMessage: File does not exist: /var/www/html/scripts.

The Secure Sockets Layer (SSL) is used to create a secure connection between the client and the server over which data is transmitted. The information is encrypted using two keys, a private one and a public one. The SSL error log format is similar to the error log format above.
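A similarly hedged sketch for parsing the error log format above (the regex and the optional client field are assumptions):

import re

# Hypothetical parser for the error log format shown above:
# [Timestamp] [LogLevel] [client ClientIP] LogMessage
ERROR_RE = re.compile(
    r'\[(?P<Timestamp>[^\]]+)\] '
    r'\[(?P<LogLevel>[^\]]+)\] '
    r'(?:\[client (?P<ClientIPAndPort>[^\]]+)\] )?'  # client field may be absent
    r'(?P<LogMessage>.*)'
)

def parse_error_line(line):
    match = ERROR_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ("[Tue Feb 22 11:04:40 2005] [error] [client 211.59.0.40] "
               "File does not exist: /var/www/html/scripts")
    print(parse_error_line(example))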

1.3 IPtables

Log Format: <Timestamp> <MachineName> kernel: <TrafficDirection> <Protocol>: IN=<IN> PHYSIN=<PHYSIN> OUT=<OUT> PHYSOUT=<PHYSOUT> SRC=<SRC> DST=<DST> LEN=<LEN> TOS=<TOS> PREC=<PREC> TTL=<TTL> ID=<ID> PROTO=<PROTO> SPT=<SPT> DPT=<DPT>

  • Timestamp: The current time.

  • MachineName: Name of the machine.

  • TrafficDirection: Traffic direction.

  • Protocol: Protocol.

  • IN: This indicates the interface that was used for incoming packets.

  • PHYSIN: This indicates the physical interface that was used for incoming packets.

  • OUT: This indicates the interface that was used for outgoing packets.

  • PHYSOUT: This indicates the physical interface that was used for outgoing packets.

  • SRC: The source IP address from which the packet originated.

  • DST: The destination IP address to which the packet was sent.

  • LEN: Length of the packet.

  • TOS: Type of Service.

  • PREC: Precedence bits.

  • TTL: Time To Live.

  • ID: Packet identifier.

  • PROTO: Indicates the protocol (e.g. ICMP, TCP, etc.).

  • SPT: Indicates the source port.

  • DPT: Indicates the destination port.

Log Example: Feb 25 12:11:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=220.228.136.38 DST=11.11.79.83 LEN=64 TOS=0x00 PREC=0x00 TTL=47 ID=17159 PROTO=TCP SPT=1629 DPT=139

  • Timestamp: Feb 25 12:11:24

  • MachineName: bridge

  • TrafficDirection: INBOUND

  • Protocol: TCP

  • IN: br0

  • PHYSIN: eth0

  • OUT: br0

  • PHYSOUT: eth1

  • SRC: 220.228.136.38

  • DST: 11.11.79.83

  • LEN: 64

  • TOS: 0x00

  • PREC: 0x00

  • TTL: 47

  • ID: 17159

  • PROTO: TCP

  • SPT: 1629

  • DPT: 139
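One possible way to parse such a line in Python, splitting the syslog-style prefix from the KEY=VALUE pairs (a sketch, not the chapter's implementation):

import re

# Hypothetical parser for the iptables log format shown above: a syslog-style
# prefix (Timestamp, MachineName, "kernel:", TrafficDirection, Protocol)
# followed by KEY=VALUE pairs such as IN=, SRC=, DST=, SPT=, DPT=.
PREFIX_RE = re.compile(
    r'(?P<Timestamp>\w{3} +\d+ [\d:]+) (?P<MachineName>\S+) kernel: '
    r'(?P<TrafficDirection>\S+) (?P<Protocol>\S+): (?P<Rest>.*)'
)

def parse_iptables_line(line):
    match = PREFIX_RE.match(line)
    if not match:
        return None
    fields = match.groupdict()
    rest = fields.pop("Rest")
    # Keep every KEY=VALUE token that follows the prefix.
    for token in rest.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

if __name__ == "__main__":
    example = ("Feb 25 12:11:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 "
               "OUT=br0 PHYSOUT=eth1 SRC=220.228.136.38 DST=11.11.79.83 LEN=64 "
               "TOS=0x00 PREC=0x00 TTL=47 ID=17159 PROTO=TCP SPT=1629 DPT=139")
    print(parse_iptables_line(example))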

1.4 Mail

Log Format: <Date> <Host> sendmail[Pid]: <Qid>: <What>=<Value>

  • Date: Month, day and time that the line was logged.

  • Host: The name of the host that produced this information (may differ from the logging host).

  • sendmail: Literal, even if sendmail is invoked as mailq or newaliases, ‘sendmail’ is printed here.

  • Pid: The process id of the sendmail invocation that produced this log line.

  • Qid: The queue id, a message identifier unique on the host producing the log lines.

  • What=Value: A comma-separated list of equates. Which equate appears in which line depends on whether the line documents the sender or the recipient and whether delivery succeeded, failed, or was deferred.

    • Class: The queue class: the numeric value defined in the sendmail configuration file for the keyword given in the Precedence: header of the processed message.

    • Ctladdr: The “controlling user”, that is, the name of the user whose credentials we use for delivery.

    • Delay: The total message delay (the time difference between reception and final delivery or bounce). The format is delay=HH:MM:SS for a delay of less than one day and delay=days+HH:MM:SS otherwise.

    • From: The envelope sender. Format is from=addr, with addr defined in [2] by the “address” keyword. This can be an actual person, or a postmaster.

    • Mailer: The symbolic name (defined in the sendmail configuration file) for the program (known as delivery agent) that performed the message delivery.

    • Msgid: A world-unique message identifier. The msgid = equate is omitted if it (incorrectly) is not defined in the configuration file.

    • Nrcpts: The number of recipients for the message, after all aliasing has taken place.

    • pri: The initial priority assigned to the message. The priority changes each time the queued message is tried, but this equate only shows the initial value.

    • Proto: The protocol that was used when the message was received; this is either SMTP, ESMTP, or internal, or assigned with the -p command-line switch.

    • Relay: Shows which user or system sent/received the message; the format is one of relay=user@domain [IP], relay=user@localhost, or relay=fqdn host.

    • Size: The size of the incoming message in bytes during the DATA phase, including end-of-line characters. For messages received via sendmail's standard input, it is the count of the bytes received, including the newline characters.

    • Stat: The delivery status of the message. For successful delivery, stat=Sent (text) is printed, where text is the actual text that the other host printed when it accepted the message, transmitted via SMTP. For local delivery, stat=Sent is printed. Other possibilities are stat=Deferred: reason, stat=queued, or stat=User unknown. [complete list of possible values to be made]

    • to: Address of the final recipient, after all aliasing has taken place. The format is defined in [2] by the “address” keyword.

    • Xdelay: The total time the message took to be transmitted during final delivery. This differs from the delay= equate, in that the xdelay= equate only counts the time in the actual final delivery.

    • dsn: Delivery Status Notifications.

Log Example: Mar 15 04:04:36 combo sendmail[13337]: j2F94C6S013336:

to=<root@combo.honeypotbox.com>, ctladdr=<root@combo.honeypotbox.com> (0/0), delay=00:00:00, xdelay=00:00:00, mailer=local, pri=31702, dsn=2.0.0, stat=Sent

  • Date: Mar 15 04:04:36.

  • Host: combo.

  • sendmail: ‘sendmail’.

  • Pid: 13337.

  • Qid: j2F94C6S013336.

  • What=Value:

  • Ctladdr: root@combo.honeypotbox.com.

  • Delay: 00:00:00.

  • Mailer: local.

  • pri: 31702.

  • Stat: Sent.

  • to: root@combo.honeypotbox.com.

  • Xdelay: 00:00:00.

  • dsn: 2.0.0.
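A sketch of how such a sendmail line could be split into the header fields and the comma-separated equates (the field handling is deliberately simplified and not the chapter's code):

import re

# Hypothetical parser for the sendmail log format shown above:
# Date Host sendmail[Pid]: Qid: key=value, key=value, ...
MAIL_RE = re.compile(
    r'(?P<Date>\w{3} +\d+ [\d:]+) (?P<Host>\S+) sendmail\[(?P<Pid>\d+)\]: '
    r'(?P<Qid>[^:]+): (?P<Equates>.*)'
)

def parse_mail_line(line):
    match = MAIL_RE.match(line)
    if not match:
        return None
    fields = match.groupdict()
    equates = fields.pop("Equates")
    # The What=Value part is a comma-separated list of equates.
    for equate in equates.split(", "):
        if "=" in equate:
            key, value = equate.split("=", 1)
            fields[key] = value
    return fields

if __name__ == "__main__":
    example = ("Mar 15 04:04:36 combo sendmail[13337]: j2F94C6S013336: "
               "to=<root@combo.honeypotbox.com>, ctladdr=<root@combo.honeypotbox.com> (0/0), "
               "delay=00:00:00, xdelay=00:00:00, mailer=local, pri=31702, dsn=2.0.0, stat=Sent")
    print(parse_mail_line(example))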

1.5 Messages

Log Format: <Date> <Host> <Program> [Pid]: <Action>

  • Date: Date and time that the line was logged (no year is present, which is a syslog peculiarity).

  • Host: Executing program’s hostname.

  • Program: Name of the utility, program or daemon that caused the message.

  • Pid: The process id of the program that produced this log line.

  • Action: The action that occurred.

Log Example: Feb 1 11:50:36 combo sshd(pam_unix)[32603]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=208.29.180.3 user=nobody

  • Date: Feb 1 11:50:36

  • Host: combo.

  • Program: sshd(pam_unix).

  • Pid: 32603.

  • Action: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=208.29.180.3 user=nobody.
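A minimal sketch for splitting a messages line into these fields (the handling of an optional Pid is an assumption, not part of the chapter):

import re

# Hypothetical parser for the syslog-style messages format shown above:
# Date Host Program[Pid]: Action  (the Pid part may be absent for some programs).
MESSAGES_RE = re.compile(
    r'(?P<Date>\w{3} +\d+ [\d:]+) (?P<Host>\S+) '
    r'(?P<Program>[^\[:]+)(?:\[(?P<Pid>\d+)\])?: (?P<Action>.*)'
)

def parse_messages_line(line):
    match = MESSAGES_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ("Feb  1 11:50:36 combo sshd(pam_unix)[32603]: authentication failure; "
               "logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=208.29.180.3 user=nobody")
    print(parse_messages_line(example))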

1.6 Secure

Authentication messages, xinetd services, etc. are logged here.

Log Format: <Date> <Host> <Program> [Pid]: <Action>

  • Date: Date and time that the line was logged (no year is present, which is a syslog peculiarity).

  • Host: Executing program’s hostname.

  • Program: Name of the utility, program or daemon that caused the message.

  • Pid: The process id of the program that produced this log line.

  • Action: The action that occurred.

Log Example: Mar 13 22:50:55 combo sshd[9356]: Failed password for root from 67.103.15.70 port 55639 ssh2

  • Date: Mar 13 22:50:55.

  • Host: combo.

  • Program: sshd.

  • Pid: 9356.

  • Action: Failed password for root from 67.103.15.70 port 55639 ssh2.
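Since repeated failed logins in the secure log are a typical brute-force signal, the sketch below counts failed SSH password attempts per source IP; the second sample line is invented purely for illustration:

import re
from collections import Counter

# Hypothetical sketch: count failed SSH password attempts per source host in a
# secure log. The regex targets the "Failed password for ... from <IP> port <port>"
# action shown above.
FAILED_RE = re.compile(
    r'\w{3} +\d+ [\d:]+ \S+ sshd\[\d+\]: '
    r'Failed password for (?:invalid user )?(?P<User>\S+) '
    r'from (?P<SourceIP>\S+) port (?P<Port>\d+)'
)

def failed_logins_per_ip(lines):
    """Return a Counter mapping source IP to number of failed SSH logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.match(line)
        if match:
            counts[match.group("SourceIP")] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Mar 13 22:50:55 combo sshd[9356]: Failed password for root from 67.103.15.70 port 55639 ssh2",
        "Mar 13 22:51:02 combo sshd[9358]: Failed password for root from 67.103.15.70 port 55711 ssh2",  # invented
    ]
    print(failed_logins_per_ip(sample))  # Counter({'67.103.15.70': 2})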

Fig. 7 Data visualization: (a) number of requests by status code; (b) distribution of requests by time; (c) distribution of requests by client IP; (d) distribution of requests by IP network

Fig. 8 Class distribution of features (part 1)

1.7 Snort

Log Format: <Date> <Host> snort: [<GID>:<SID>:<RID>] <Msg> [Classification: <Class>] [Priority: <Priority>]: <PROTO> <SOURCE IP PORT> -> <DEST IP PORT>

  • Date: Date and time that the line was logged (no year is present, which is a syslog peculiarity).

  • Host: Executing program’s hostname.

  • GID: Generator ID, the component of Snort generated this alert.

  • SID: Snort ID (sometimes referred to as Signature ID).

  • RID: Revision ID.

  • Msg: Message.

  • Class: Classification.

  • Priority: Priority.

  • PROTO: Protocol.

  • SOURCE IP PORT: Source IP and port.

  • DEST IP PORT: Destination IP and port.

Log Example: Feb 25 12:23:54 bastion snort: [1:2003:8] MS-SQL Worm propagation attempt [Classification: Misc Attack] [Priority: 2]: UDP 61.185.28.41:1067 -> 11.11.79.89:1434

  • Date: Feb 25 12:23:54.

  • Host: bastion.

  • GID: 1.

  • SID: 2003.

  • RID: 8.

  • Msg: MS-SQL Worm propagation attempt.

  • Class: Misc Attack.

  • Priority: 2.

  • PROTO: UDP.

  • SOURCE IP PORT: 61.185.28.41:1067.

  • DEST IP PORT: 11.11.79.89:1434 (Fig. 7).
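A sketch of parsing such a Snort alert line into the fields above (the regex is an assumption derived from the example, not Snort's own output API):

import re

# Hypothetical parser for the Snort alert line shown above:
# Date Host snort: [GID:SID:RID] Msg [Classification: Class] [Priority: N]: PROTO SRC -> DST
SNORT_RE = re.compile(
    r'(?P<Date>\w{3} +\d+ [\d:]+) (?P<Host>\S+) snort: '
    r'\[(?P<GID>\d+):(?P<SID>\d+):(?P<RID>\d+)\] (?P<Msg>.+?) '
    r'\[Classification: (?P<Class>[^\]]+)\] \[Priority: (?P<Priority>\d+)\]: '
    r'(?P<PROTO>\S+) (?P<Source>\S+) -> (?P<Dest>\S+)'
)

def parse_snort_line(line):
    match = SNORT_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ("Feb 25 12:23:54 bastion snort: [1:2003:8] MS-SQL Worm propagation attempt "
               "[Classification: Misc Attack] [Priority: 2]: UDP 61.185.28.41:1067 -> 11.11.79.89:1434")
    print(parse_snort_line(example))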

Appendix 2: Data Class Distribution

Fig. 9 Class distribution of features (part 2)

Fig. 10 Class distribution of features (part 3)

Fig. 11 Class distribution of features (part 4)

Fig. 12 Class distribution of features (part 5)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Brandao, A., Georgieva, P. (2022). Automatic Log Analysis to Prevent Cyber Attacks. In: Sgurev, V., Jotsov, V., Kacprzyk, J. (eds) Advances in Intelligent Systems Research and Innovation. Studies in Systems, Decision and Control, vol 379. Springer, Cham. https://doi.org/10.1007/978-3-030-78124-8_14
