Automatic Log Analysis to Prevent Cyber Attacks

Chapter in: Advances in Intelligent Systems Research and Innovation

Part of the book series: Studies in Systems, Decision and Control (SSDC, volume 379)

Abstract

Organizations nowadays are more dependent than ever on the internet, making it necessary to provide better security for their networks. However, the rise and growing complexity of automated cyber attacks make this task harder for static approaches and manual examination of the logs, so automated ways of identifying these attacks need to be studied. In this paper we propose an effective Log-based Intrusion Detection System (LIDS) that predicts whether an attack is taking place, based on carefully selected features. The logs from various sources are aggregated into one dashboard and the most discriminative features are first determined. For the attack prediction, several machine learning techniques were comparatively tested, with the decision tree performing best. The proposed system is illustrated on KDD Cup 1999, the largest publicly available labelled log file dataset.
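As a rough, hypothetical illustration of the pipeline described above (not the authors' implementation), the following Python sketch trains a decision tree to separate attack records from normal ones in the KDD Cup 1999 data; the file name, the one-hot encoding of the symbolic features and the 70/30 split are assumptions.

# Minimal sketch (not the chapter's code): binary attack/normal prediction
# with a decision tree on KDD Cup 1999. File name and preprocessing are assumed.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each KDD Cup 1999 record has 41 features followed by a label such as "normal.".
data = pd.read_csv("kddcup.data_10_percent", header=None)  # assumed local copy
X = pd.get_dummies(data.iloc[:, :-1])            # one-hot encode symbolic features
y = (data.iloc[:, -1] != "normal.").astype(int)  # 1 = attack, 0 = normal

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))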


References

  1. Portnoy, L.: Intrusion detection with unlabeled data using clustering. Ph.D. dissertation, Columbia University (2000)

  2. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning intrusion detection: supervised or unsupervised? In: International Conference on Image Analysis and Processing, pp. 50–57. Springer (2005)

  3. Yen, T.-F., Oprea, A., Onarlioglu, K., Leetham, T., Robertson, W., Juels, A., Kirda, E.: Beehive: large-scale log analysis for detecting suspicious activity in enterprise networks. In: Proceedings of the 29th Annual Computer Security Applications Conference, pp. 199–208. ACM (2013)

  4. Stroeh, K., Madeira, E.R.M., Goldenstein, S.K.: An approach to the correlation of security events based on machine learning techniques. J. Internet Serv. Appl. 4(1), 7 (2013)

  5. Li, W.: Automatic log analysis using machine learning: awesome automatic log analysis version 2.0 (2013)

  6. Vasquez Villano, E.G.: Classification of logs using machine learning technique. Master’s thesis, NTNU (2018)

  7. Vigneswaran, K.R., Vinayakumar, R., Soman, K., Poornachandran, P.: Evaluating shallow and deep neural networks for network intrusion detection systems in cyber security. In: 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–6. IEEE (2018)

  8. Dionísio, N., Alves, F., Ferreira, P.M., Bessani, A.: Cyberthreat detection from twitter using deep neural networks. arXiv preprint arXiv:1904.01127 (2019)

  9. Deokar, B., Hazarnis, A.: Intrusion detection system using log files and reinforcement learning. Int. J. Comput. Appl. 45(19), 28–35 (2012)

  10. Bulavas, V.: Investigation of network intrusion detection using data visualization methods. In: 2018 59th International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), pp. 1–6. IEEE (2018)

  11. Rampure, V., Tiwari, A.: A rough set based feature selection on KDD Cup 99 data set. Int. J. Database Theory Appl. 8(1), 149–156 (2015)

  12. Bozhkov, L., Georgieva, P.: Brain neural data analysis with feature space defined by descriptive statistics. In: Iberian Conference on Pattern Recognition and Image Analysis, pp. 415–422. Springer (2015)

  13. Tucker, L.R., MacCallum, R.C.: Exploratory factor analysis. Unpublished manuscript, Ohio State University, Columbus (1997)

  14. Stolfo, J., Fan, W., Lee, W., Prodromidis, A., Chan, P.K.: Cost-based modeling and evaluation for data mining with application to fraud and intrusion detection. Results from the JAM Project by Salvatore, pp. 1–15 (2000)

Acknowledgements

This work was funded by National Funds through the FCT—Foundation for Science and Technology, in the context of the project UID/CEC/00127/2019.

Author information

Corresponding author: Petia Georgieva.

Appendices

Appendix 1: Log Structure and Formats

Before formulating attack hypotheses, it is important to learn as much as possible about the log structure and the systems involved. A better understanding of the environment and settings allows an educated judgement on which attacks actually had a chance of succeeding and which had no chance at all. A typical log structure is depicted in Fig. 6.

Fig. 6 Log structure

1.1 Access

The HTTP access log records all requests processed by the server. Storing the information in the access log is the start of log management. The next step is to analyze this information to produce useful statistics. The format of the access log is highly configurable.

Typical Access Log Format: <ClientIP> - <ClientID> <Timestamp> <Method> <RequestResource> <Protocol> <StatusCode> <RetObjSize> <Referer> <UserAgent>

  • ClientIP: This is the IP address of the client (remote host) which made the request to the server.

  • ClientID: This is the userid of the person requesting the document as determined by HTTP authentication.

  • Timestamp: The time that the request was received.

  • Method: HTTP request method (e.g. GET, POST, PUT, DELETE, etc.).

  • RequestResource: Path to the resource requested by the client.

  • Protocol: HTTP protocol used by the client.

  • StatusCode: This is the status code that the server sends back to the client.

  • RetObjSize: The size of the object returned to the client, not including the response headers.

  • Referer: This gives the site that the client reports having been referred from.

  • UserAgent: The User-Agent HTTP request header. This is the identifying information that the client browser reports about itself.

Log Example: 222.95.39.192 - - [07/Feb/2005:19:52:35-0500] "GET http://ad.trafficmp.com/tmpad/banner/ad/tmp.asp?poID=el0w HTTP/1.0" 404 1187 "http://www.besteach.com/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"

  • ClientIP: 222.95.39.192.

  • ClientID: - (not provided).

  • Timestamp: [07/Feb/2005:19:52:35-0500].

  • Method: GET.

  • RequestResource: http://ad.trafficmp.com/tmpad/banner/ad/tmp.asp?poID=el0w.

  • Protocol: HTTP/1.0.

  • StatusCode: 404.

  • RetObjSize: 1187.

  • Referer: http://www.besteach.com/.

  • UserAgent: Mozilla/4.0 (compatible; MSIE 5.5; Windows 98).
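As a minimal sketch (assuming the combined format above, not the chapter's own tooling), the following Python snippet extracts these fields from one access log line with a regular expression:

import re

# Regular expression mirroring the access log fields listed above (ClientIP,
# ClientID, Timestamp, Method, RequestResource, Protocol, StatusCode,
# RetObjSize, Referer, UserAgent).
ACCESS_RE = re.compile(
    r'(?P<ClientIP>\S+) - (?P<ClientID>\S+) \[(?P<Timestamp>[^\]]+)\] '
    r'"(?P<Method>\S+) (?P<RequestResource>\S+) (?P<Protocol>[^"]+)" '
    r'(?P<StatusCode>\d{3}) (?P<RetObjSize>\S+) '
    r'"(?P<Referer>[^"]*)" "(?P<UserAgent>[^"]*)"'
)

def parse_access_line(line):
    """Return the named fields of one access log line, or None if it does not match."""
    match = ACCESS_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ('222.95.39.192 - - [07/Feb/2005:19:52:35-0500] '
               '"GET http://ad.trafficmp.com/tmpad/banner/ad/tmp.asp?poID=el0w HTTP/1.0" '
               '404 1187 "http://www.besteach.com/" '
               '"Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')
    print(parse_access_line(example))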

1.2 Error

The error log contains information about errors that the web server encountered when processing requests, such as when files are missing. It looks something like this:

Log Format: <Timestamp> <LogLevel> <ClientIPAndPort> <LogMessage>

  • Timestamp: The current time.

  • LogLevel: Loglevel of the message (e.g. error, notice, warn, etc.).

  • ClientIPAndPort: Client IP address and port of the request.

  • LogMessage: The actual log message.

Log Example: [Tue Feb 22 11:04:40 2005] [error] [client 211.59.0.40] File does not exist: /var/www/html/scripts

  • Timestamp: [Tue Feb 22 11:04:40 2005].

  • LogLevel: error.

  • ClientIPAndPort: 211.59.0.40.

  • LogMessage: File does not exist: /var/www/html/scripts.

The Secure Sockets Layer (SSL) is used to create a secure connection between the client and the server over which data is transmitted. The information is encrypted using two keys, a private one and a public one. The SSL error log format is similar to the error log format above.
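A similarly hedged sketch for parsing the error log format above (the regex and the optional client field are assumptions):

import re

# Hypothetical parser for the error log format shown above:
# [Timestamp] [LogLevel] [client ClientIP] LogMessage
ERROR_RE = re.compile(
    r'\[(?P<Timestamp>[^\]]+)\] '
    r'\[(?P<LogLevel>[^\]]+)\] '
    r'(?:\[client (?P<ClientIPAndPort>[^\]]+)\] )?'  # client field may be absent
    r'(?P<LogMessage>.*)'
)

def parse_error_line(line):
    match = ERROR_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ("[Tue Feb 22 11:04:40 2005] [error] [client 211.59.0.40] "
               "File does not exist: /var/www/html/scripts")
    print(parse_error_line(example))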

1.3 IPtables

Log Format: <Timestamp> <MachineName> kernel: <TrafficDirection> <Protocol>: IN=<IN> PHYSIN=<PHYSIN> OUT=<OUT> PHYSOUT=<PHYSOUT> SRC=<SRC> DST=<DST> LEN=<LEN> TOS=<TOS> PREC=<PREC> TTL=<TTL> ID=<ID> PROTO=<PROTO> SPT=<SPT> DPT=<DPT>

  • Timestamp: The current time.

  • MachineName: Name of the machine.

  • TrafficDirection: Traffic direction.

  • Protocol: Protocol.

  • IN: This indicates the interface that was used for incoming packets.

  • PHYSIN: This indicates the physical interface that was used for incoming packets.

  • OUT: This indicates the interface that was used for outgoing packets.

  • PHYSOUT: This indicates the physical interface that was used for outgoing packets.

  • SRC: The source IP address from which the packet originated.

  • DST: The destination IP address to which the packet was sent.

  • LEN: Length of the packet.

  • TOS: Type of Service.

  • PREC: Precedence bits.

  • TTL: Time To Live.

  • ID: Packet identifier.

  • PROTO: Indicates the protocol (e.g. ICMP, TCP, etc.).

  • SPT: Indicates the source port.

  • DPT: Indicates the destination port.

Log Example: Feb 25 12:11:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=220.228.136.38 DST=11.11.79.83 LEN=64 TOS=0x00 PREC=0x00 TTL=47 ID=17159 PROTO=TCP SPT=1629 DPT=139

  • Timestamp: Feb 25 12:11:24

  • MachineName: bridge

  • TrafficDirection: INBOUND

  • Protocol: TCP

  • IN: br0

  • PHYSIN: eth0

  • OUT: br0

  • PHYSOUT: eth1

  • SRC: 220.228.136.38

  • DST: 11.11.79.83

  • LEN: 64

  • TOS: 0x00

  • PREC: 0x00

  • TTL: 47

  • ID: 17159

  • PROTO: TCP

  • SPT: 1629

  • DPT: 139
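One possible way to parse such a line in Python, splitting the syslog-style prefix from the KEY=VALUE pairs (a sketch, not the chapter's implementation):

import re

# Hypothetical parser for the iptables log format shown above: a syslog-style
# prefix (Timestamp, MachineName, "kernel:", TrafficDirection, Protocol)
# followed by KEY=VALUE pairs such as IN=, SRC=, DST=, SPT=, DPT=.
PREFIX_RE = re.compile(
    r'(?P<Timestamp>\w{3} +\d+ [\d:]+) (?P<MachineName>\S+) kernel: '
    r'(?P<TrafficDirection>\S+) (?P<Protocol>\S+): (?P<Rest>.*)'
)

def parse_iptables_line(line):
    match = PREFIX_RE.match(line)
    if not match:
        return None
    fields = match.groupdict()
    rest = fields.pop("Rest")
    # Keep every KEY=VALUE token that follows the prefix.
    for token in rest.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields

if __name__ == "__main__":
    example = ("Feb 25 12:11:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 "
               "OUT=br0 PHYSOUT=eth1 SRC=220.228.136.38 DST=11.11.79.83 LEN=64 "
               "TOS=0x00 PREC=0x00 TTL=47 ID=17159 PROTO=TCP SPT=1629 DPT=139")
    print(parse_iptables_line(example))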

1.4 Mail

Log Format: <Date> <Host> sendmail[Pid]: <Qid>: <What>=<Value>

  • Date: Month, day and time that the line was logged.

  • Host: The name of the host that produced this information (may differ from the logging host).

  • sendmail: Literal, even if sendmail is invoked as mailq or newaliases, ‘sendmail’ is printed here.

  • Pid: The process id of the sendmail invocation that produced this log line.

  • Qid: The queue id, a message identifier unique on the host producing the log lines.

  • What=Value: A comma-separated list of equates. Which equate appears in which line depends on whether the line documents the sender or the recipient and whether delivery succeeded, failed, or was deferred.

    • Class: The queue class: the numeric value defined in the sendmail configuration file for the keyword given in the Precedence: header of the processed message.

    • Ctladdr: The “controlling user”, that is, the name of the user whose credentials we use for delivery.

    • Delay: The total message delay (the time difference between reception and final delivery or bounce). The format is delay=HH:MM:SS for a delay of less than one day and delay=days+HH:MM:SS otherwise.

    • From: The envelope sender. Format is from=addr, with addr defined in [2] by the “address” keyword. This can be an actual person, or a postmaster.

    • Mailer: The symbolic name (defined in the sendmail configuration file) for the program (known as delivery agent) that performed the message delivery.

    • Msgid: A world-unique message identifier. The msgid = equate is omitted if it (incorrectly) is not defined in the configuration file.

    • Nrcpts: The number of recipients for the message, after all aliasing has taken place.

    • pri: The initial priority assigned to the message. The priority changes each time the queued message is tried, but this equate only shows the initial value.

    • Proto: The protocol that was used when the message was received; this is either SMTP, ESMTP, or internal, or assigned with the -p command-line switch.

    • Relay: Shows which user or system sent/received the message; the format is one of relay=user@domain [IP], relay=user@localhost, or relay=fqdn host.

    • Size: The size of the incoming message in bytes during the DATA phase, including end-of-line characters. For messages received via sendmail's standard input, it is the count of the bytes received, including the newline characters.

    • Stat: The delivery status of the message. For successful delivery, stat=Sent (text) is printed, where text is the actual text that the other host printed when it accepted the message, transmitted via SMTP. For local delivery, stat=Sent is printed. Other possibilities are stat=Deferred: reason, stat=queued, or stat=User unknown. [complete list of possible values to be made]

    • to: Address of the final recipient, after all aliasing has taken place. The format is defined in [2] by the “address” keyword.

    • Xdelay: The total time the message took to be transmitted during final delivery. This differs from the delay= equate, in that the xdelay= equate only counts the time in the actual final delivery.

    • dsn: Delivery Status Notifications.

Log Example: Mar 15 04:04:36 combo sendmail[13337]: j2F94C6S013336:

to=<root@combo.honeypotbox.com>, ctladdr=<root@combo.honeypotbox.com> (0/0), delay=00:00:00, xdelay=00:00:00, mailer=local, pri=31702, dsn=2.0.0, stat=Sent

  • Date: Mar 15 04:04:36.

  • Host: combo.

  • sendmail: ‘sendmail’.

  • Pid: 13337.

  • Qid: j2F94C6S013336.

  • What=Value:

  • Ctladdr: root@combo.honeypotbox.com.

  • Delay: 00:00:00.

  • Mailer: local.

  • pri: 31702.

  • Stat: Sent.

  • to: root@combo.honeypotbox.com.

  • Xdelay: 00:00:00.

  • dsn: 2.0.0.
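A sketch of how such a sendmail line could be split into the header fields and the comma-separated equates (the field handling is deliberately simplified and not the chapter's code):

import re

# Hypothetical parser for the sendmail log format shown above:
# Date Host sendmail[Pid]: Qid: key=value, key=value, ...
MAIL_RE = re.compile(
    r'(?P<Date>\w{3} +\d+ [\d:]+) (?P<Host>\S+) sendmail\[(?P<Pid>\d+)\]: '
    r'(?P<Qid>[^:]+): (?P<Equates>.*)'
)

def parse_mail_line(line):
    match = MAIL_RE.match(line)
    if not match:
        return None
    fields = match.groupdict()
    equates = fields.pop("Equates")
    # The What=Value part is a comma-separated list of equates.
    for equate in equates.split(", "):
        if "=" in equate:
            key, value = equate.split("=", 1)
            fields[key] = value
    return fields

if __name__ == "__main__":
    example = ("Mar 15 04:04:36 combo sendmail[13337]: j2F94C6S013336: "
               "to=<root@combo.honeypotbox.com>, ctladdr=<root@combo.honeypotbox.com> (0/0), "
               "delay=00:00:00, xdelay=00:00:00, mailer=local, pri=31702, dsn=2.0.0, stat=Sent")
    print(parse_mail_line(example))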

1.5 Messages

Log Format: <Date> <Host> <Program> [Pid]: <Action>

  • Date: Date and time that the line was logged (no year is present, which is a syslog peculiarity).

  • Host: Executing program’s hostname.

  • Program: Name of the utility, program or daemon that caused the message.

  • Pid: The process id of the program that produced this log line.

  • Action: The action that occurred.

Log Example: Feb 1 11:50:36 combo sshd(pam_unix)[32603]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=208.29.180.3 user=nobody

  • Date: Feb 1 11:50:36

  • Host: combo.

  • Program: sshd(pam_unix).

  • Pid: 32603.

  • Action: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=208.29.180.3 user=nobody.
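A minimal sketch for splitting a messages line into these fields (the handling of an optional Pid is an assumption, not part of the chapter):

import re

# Hypothetical parser for the syslog-style messages format shown above:
# Date Host Program[Pid]: Action  (the Pid part may be absent for some programs).
MESSAGES_RE = re.compile(
    r'(?P<Date>\w{3} +\d+ [\d:]+) (?P<Host>\S+) '
    r'(?P<Program>[^\[:]+)(?:\[(?P<Pid>\d+)\])?: (?P<Action>.*)'
)

def parse_messages_line(line):
    match = MESSAGES_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ("Feb  1 11:50:36 combo sshd(pam_unix)[32603]: authentication failure; "
               "logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=208.29.180.3 user=nobody")
    print(parse_messages_line(example))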

1.6 Secure

Authentication messages, xinetd services, etc. are logged here.

Log Format: <Date> <Host> <Program> [Pid]: <Action>

  • Date: Date and time that the line was logged (no year is present, which is a syslog peculiarity).

  • Host: Executing program’s hostname.

  • Program: Name of the utility, program or daemon that caused the message.

  • Pid: The process id of the program that produced this log line.

  • Action: The action that occurred.

Log Example: Mar 13 22:50:55 combo sshd[9356]: Failed password for root from 67.103.15.70 port 55639 ssh2

  • Date: Mar 13 22:50:55.

  • Host: combo.

  • Program: sshd.

  • Pid: 9356.

  • Action: Failed password for root from 67.103.15.70 port 55639 ssh2.
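Since repeated failed logins in the secure log are a typical brute-force signal, the sketch below counts failed SSH password attempts per source IP; the second sample line is invented purely for illustration:

import re
from collections import Counter

# Hypothetical sketch: count failed SSH password attempts per source host in a
# secure log. The regex targets the "Failed password for ... from <IP> port <port>"
# action shown above.
FAILED_RE = re.compile(
    r'\w{3} +\d+ [\d:]+ \S+ sshd\[\d+\]: '
    r'Failed password for (?:invalid user )?(?P<User>\S+) '
    r'from (?P<SourceIP>\S+) port (?P<Port>\d+)'
)

def failed_logins_per_ip(lines):
    """Return a Counter mapping source IP to number of failed SSH logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.match(line)
        if match:
            counts[match.group("SourceIP")] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "Mar 13 22:50:55 combo sshd[9356]: Failed password for root from 67.103.15.70 port 55639 ssh2",
        "Mar 13 22:51:02 combo sshd[9358]: Failed password for root from 67.103.15.70 port 55711 ssh2",  # invented
    ]
    print(failed_logins_per_ip(sample))  # Counter({'67.103.15.70': 2})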

Fig. 7 Data visualization: (a) number of requests by status code; (b) distribution of requests by time; (c) distribution of requests by client IP; (d) distribution of requests by IP network

Fig. 8 Class distribution of features (part 1)

1.7 Snort

Log Format: <Date> <Host> snort: [<GID>:<SID>:<RID>] <Msg> [Classification: <Class>] [Priority: <Priority>]: <PROTO> <SOURCE IP PORT> -> <DEST IP PORT>

  • Date: Date and time that the line was logged (no year is present, which is a syslog peculiarity).

  • Host: Executing program’s hostname.

  • GID: Generator ID, the component of Snort generated this alert.

  • SID: Snort ID (sometimes referred to as Signature ID).

  • RID: Revision ID.

  • Msg: Message.

  • Class: Classification.

  • Priority: Priority.

  • PROTO: Protocol.

  • SOURCE IP PORT: Source IP and port.

  • DEST IP PORT: Destination IP and port.

Log Example: Feb 25 12:23:54 bastion snort: [1:2003:8] MS-SQL Worm propagation attempt [Classification: Misc Attack] [Priority: 2]: UDP 61.185.28.41:1067 -> 11.11.79.89:1434

  • Date: Feb 25 12:23:54.

  • Host: bastion.

  • GID: 1.

  • SID: 2003.

  • RID: 8.

  • Msg: MS-SQL Worm propagation attempt.

  • Class: Misc Attack.

  • Priority: 2.

  • PROTO: UDP.

  • SOURCE IP PORT: 61.185.28.41:1067.

  • DEST IP PORT: 11.11.79.89:1434 (Fig. 7).
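A sketch of parsing such a Snort alert line into the fields above (the regex is an assumption derived from the example, not Snort's own output API):

import re

# Hypothetical parser for the Snort alert line shown above:
# Date Host snort: [GID:SID:RID] Msg [Classification: Class] [Priority: N]: PROTO SRC -> DST
SNORT_RE = re.compile(
    r'(?P<Date>\w{3} +\d+ [\d:]+) (?P<Host>\S+) snort: '
    r'\[(?P<GID>\d+):(?P<SID>\d+):(?P<RID>\d+)\] (?P<Msg>.+?) '
    r'\[Classification: (?P<Class>[^\]]+)\] \[Priority: (?P<Priority>\d+)\]: '
    r'(?P<PROTO>\S+) (?P<Source>\S+) -> (?P<Dest>\S+)'
)

def parse_snort_line(line):
    match = SNORT_RE.match(line)
    return match.groupdict() if match else None

if __name__ == "__main__":
    example = ("Feb 25 12:23:54 bastion snort: [1:2003:8] MS-SQL Worm propagation attempt "
               "[Classification: Misc Attack] [Priority: 2]: UDP 61.185.28.41:1067 -> 11.11.79.89:1434")
    print(parse_snort_line(example))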

Appendix 2: Data Class Distribution

Fig. 9 Class distribution of features (part 2)

Fig. 10 Class distribution of features (part 3)

Fig. 11 Class distribution of features (part 4)

Fig. 12 Class distribution of features (part 5)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Brandao, A., Georgieva, P. (2022). Automatic Log Analysis to Prevent Cyber Attacks. In: Sgurev, V., Jotsov, V., Kacprzyk, J. (eds) Advances in Intelligent Systems Research and Innovation. Studies in Systems, Decision and Control, vol 379. Springer, Cham. https://doi.org/10.1007/978-3-030-78124-8_14
