  • Malmenator

    Network Based Intelligent Malware Detection

    Final Year Project

    Interim Report

    February 2, 2020

    David B. Han (3035344211)

    Piyush Jha (3035342691)

    Supervisor: Dr. Dirk Schnieders

    University of Hong Kong

    github.com/piy0999/Malmenator

  • Abstract

    Strong security measures in the digital world are becoming increasingly important as indicated

    by the meteoric rise of spending on cybersecurity as well as the increased losses from successful

    cyberattacks. One important research area in cybersecurity is network anomaly detection, which is

    implemented in network-based intrusion detection systems (NIDSs) to identify and stop malicious

    network activity before it causes damage. Unfortunately, most open source NIDSs utilize a purely rules-based approach to detect anomalies, which renders them vulnerable to new exploits. On the other hand, many novel machine learning approaches have suboptimal performance due to their lack of interpretability and the absence of large, representative datasets for training. Malmenator takes a two-step approach to create a stronger anomaly detection

    engine. It first evaluates publicly available network traffic datasets and compiles a comprehensive

    dataset for evaluation. Malmenator then proceeds to implement a new hybrid approach that utilizes

    both machine learning and rules-based methods to create a custom NIDS. All of these features are

    then combined into a portable hardware component that functions both as an NIDS and a WiFi

    router. Malmenator has already implemented a basic NIDS on a WiFi router constructed out of a

    Raspberry Pi and has also researched and selected a comprehensive dataset, CIDDS-001, for building

    the anomaly detection model. Malmenator has just begun researching different methodologies for

    implementing the network anomaly detection model and is making decent progress in accordance

    with our projected timeline.


  • Acknowledgements

    We would like to thank our supervisor, Dr. Dirk Schnieders, for his support and advice regarding the

    usage of machine learning in our project. His input allowed us to appropriately define our project

    scope and focus our research in a direction with the most impact. Furthermore, we would like to

    extend a special token of gratitude to the professors from the Centre for Applied English Studies at the University of Hong Kong for helping us improve our writing skills to express the work done during

    the project. Lastly, we would like to thank the Department of Computer Science at The University

    of Hong Kong for reimbursing the costs incurred for buying various equipment during the project

    and making the project feasible.


  • Contents

    Abstract
    Acknowledgements
    List of Figures
    List of Tables
    Abbreviations
    1 Introduction
    1.1 Motivation
    1.2 Problem formulation
    1.3 Contributions
    1.4 Report organization
    2 A Primer on Cybersecurity
    2.1 What are Cyberattacks?
    2.2 Economic Impacts of Malware
    2.3 Categories of Cybersecurity Solutions
    2.4 Ethics of Malware Research
    2.5 Challenges of Malware Research
    3 Project background
    3.1 Overview
    3.1.1 What is an NIDS?

  • 3.2 NIDSs vs network-based intrusion prevention systems (NIPSs)
    3.3 NIDS implementation strategies
    3.4 Overview of NIDS techniques
    3.4.1 Signature-based detection
    3.4.2 Anomaly-based detection
    3.5 Packet-based and flow-based data formats
    3.5.1 NetFlow
    3.6 Nfdump
    3.7 Summary
    4 Previous works
    4.1 Overview
    4.2 Network traffic datasets
    4.2.1 Methods for network traffic dataset generation
    4.2.2 Old reference datasets
    4.2.3 Key datasets
    4.2.3.1 CIDDS-001
    4.2.3.2 UGR 16
    4.2.3.3 CICIDS 2017
    4.2.4 Other relevant datasets
    4.2.4.1 UNSW-NB15
    4.2.4.2 MAWILab
    4.2.4.3 UNB ISCX 2012
    4.2.4.4 CTU-13
    4.2.5 General Remarks
    4.3 Machine learning in network anomaly detection
    4.3.1 Classification methods
    4.3.2 Statistical methods
    4.3.3 Clustering methods
    4.3.4 Constraints of machine learning
    4.4 Summary

  • 5 Methodology
    5.1 Overview
    5.2 Network scanner
    5.2.1 Raspberry Pi hardware
    5.2.2 Hardware setup
    5.2.3 Utilizing nfdump
    5.3 Dataset selection
    5.3.1 CIDDS-001
    5.4 Network anomaly detection
    5.4.1 Hybrid network anomaly detection approach
    5.4.2 Model evaluation
    5.5 Web interface
    5.5.1 Dashboard
    5.5.2 Web architecture
    5.6 Summary
    6 Discussion and remarks
    6.1 Overview
    6.2 Current Status
    6.3 Discussion of Findings
    6.3.1 Future Works
    6.4 Preliminary anomaly detection model results
    6.5 Project timeline
    6.6 Engineering Risk mitigation
    6.7 Challenges
    6.7.1 Steep learning curve
    6.7.2 Scope identification
    6.7.3 Proprietary information
    Conclusion
    Bibliography

  • List of Figures

    2.1 Estimating financial losses to cyberattacks
    3.1 Inline and passive implementations of an NIDS
    3.2 Overview of packet-based headers by protocol
    4.1 Taxonomy of network anomaly detection methods
    5.1 Network setup with the Malmenator network scanner
    5.2 Malmenator web architecture

  • List of Tables

    6.1 Progress Evaluation
    6.2 Planned project timetable

  • Abbreviations

    ANN artificial neural network

    DDoS distributed denial-of-service

    ELK Elasticsearch Logstash Kibana

    ICMP Internet Control Message Protocol

    IDS intrusion detection system

    IP Internet Protocol

    KNN K-nearest neighbor

    LAN local area network

    NIDPS network-based intrusion detection and prevention system

    NIDS network-based intrusion detection system

    NIPS network-based intrusion prevention system

    SSH secure shell

    SVM support vector machine

    TCP Transmission Control Protocol

    UDP User Datagram Protocol

    WLAN wireless local area network


  • Chapter 1

    Introduction

    1.1 Motivation

    Over the past decade, spending on cybersecurity has been consistently increasing year over year, yet

    the losses due to cyberattacks only continue to grow [1]. With the scarcity of investment in network

    security research and the lack of effective open source network anomaly detection methods, this

    research project aims to empower open source network security technology [2]. Additionally, the

    Malmenator project aims to create an effective, affordable, and portable NIDS that can be adapted

    to any home or small office environment. By making cybersecurity more readily accessible, we aim

    to increase trust in a society that is undergoing an enormous digital transformation.

    1.2 Problem formulation

    From an industry perspective, network security solutions for individuals are inaccessible and

    expensive. Due to the confidential nature of data breaches and network traffic, individual parties

    such as governments and corporations have very little cooperation compared to other areas of

    computer science research [3]. This is especially true for network cybersecurity as network traffic

    data often contains highly sensitive information. Currently, most network security solutions are

    tailored towards large enterprises that can afford the implementation costs, while most individuals

    and small groups leave network security to their Internet Service Providers.

    From an academic perspective, network anomaly detection faces two large challenges. The

    foremost problem is the lack of a single widely accepted network traffic dataset. Without a comprehensive and representative dataset, the second problem follows: appropriately developing, testing, and deploying anomaly-based detection techniques using standardized norms and metrics. Malmenator aims to bridge the industry gap through a research-based project and, in doing so, to target the two large issues that plague research into network security: the lack of a comprehensive dataset for NIDS evaluation and the absence of powerful open source hybrid detection tools.

    1.3 Contributions

    There are several contributions that Malmenator hopes to make upon its completion. First,

    Malmenator hopes to contribute to the open source community in its development of new anomaly

    detection methodologies built on NetFlow data. Additionally, Malmenator aims to shed light on the

    difficulties of testing NIDSs with both simulated and real datasets and provide a more

    comprehensive solution to this issue. Lastly, Malmenator would like to introduce novel

    combinations of malicious network detection methodologies that have real-world and commercial

    implications.

    1.4 Report organization

    The remaining contents of this report are as follows. This report begins by providing a brief

    introduction to the cybersecurity industry as a whole. It then gives an introduction to NIDSs and

    NetFlow data format before proceeding to summarize previous works related to the major aspects

    of the Malmenator project, namely network traffic dataset creation and network anomaly

    detection. For traffic dataset creation, this report looks at the different types of network data

    formats before highlighting the key dataset Malmenator will be using for model training and evaluation. For network anomaly detection, this report first summarizes the

    different high-level machine learning approaches that have been explored by various researchers,

    and then proceeds to the project methodology and describes the steps by which Malmenator will

    implement its hardware, analyze its dataset, detect anomalies, and visualize its system. Lastly, this

    report concludes by discussing Malmenator’s project progress, challenges, and timeline.


  • Chapter 2

    A Primer on Cybersecurity

    This chapter offers an overview of the technology and economic conditions of various aspects of the cybersecurity industry. Reading this chapter is useful for understanding the basics of cybersecurity for the unfamiliar, and it may be skipped if one wishes to delve directly into the details of the project. This

    chapter first describes the nature of cyberattacks before giving an overview on various high level

    aspects of cybersecurity including its economic impacts, ethics, and challenges.

    2.1 What are Cyberattacks?

    A cyberattack is an intentional and malicious attempt by an individual or organization to disrupt

    or penetrate the information system of another individual or organization. There are many forms of

    cyberattacks; some common forms are described below. Note that the list is non-exhaustive;

    cyberattacks are constantly evolving and adapting to defensive measures as they are deployed.

    Malware: Malware is short for malicious software. Malware breaches a

    network through a vulnerability or exploit, and usually occurs when a user clicks a dangerous

    link or email attachment that installs the malware. Once activated, malware can perform

    any of a number of actions depending on the malware type. In the case of ransomware, the

    malware can block access to key files until a fee is paid. In the case of spyware, the malware

    can secretly obtain and transmit information from the system. In the case of a backdoor, the

    malware can allow hackers to gain remote access to the computer. There are a large number

    of malware variations, and new forms are discovered every day.


  • Man-in-the-middle attack : Man-in-the-middle attacks occur when attackers insert themselves

    into a communication line. Thus, instead of one party communicating directly with the other party, all communication traffic is unknowingly sent through the “man-in-the-middle”, which

    enables the attacker to monitor and steal data.

    Denial-of-service attack : Denial-of-service attacks overwhelm system networks with a huge amount of traffic to exhaust their resources and bandwidth, causing the system to be unable to fulfill legitimate requests. In the event that this attack is simultaneously launched from a

    number of compromised devices, the attack is known as a distributed-denial-of-service (DDoS)

    attack.

    Zero-day exploit : Zero-day exploits occur when a network vulnerability is discovered and

    announced, but before an update has been released to patch the problem. Attackers target

    the disclosed vulnerability during the window of time before that patch is released.

    Each form of cyberattack takes advantage of different system vulnerabilities and poses a variety of

    different risks. Furthermore, each form can be further subdivided into various types and families as

    is evident in the malware section. In this project, we focus on forms of attack that can be detected

    using network traffic patterns, which includes denial-of-service attacks, man-in-the-middle attacks,

    and certain types of malware such as backdoors.

    2.2 Economic Impacts of Malware

    Cyberattacks from malware are an increasingly large problem, and companies and governments

    are spending more year over year in order to properly protect themselves. From 2012 to 2018,

    average annual cybersecurity expenditures per employee doubled from $584 USD to $1,178 [4]. In

    2019, global spending on cybersecurity initiatives is expected to exceed $100 billion USD [5]. These

    numbers are only expected to continue to rise as the financial incentives to engage in malicious

    attacks only continue to rise.

    In 2018, a conservative estimate of financial losses caused by cyberattacks is at least $45 billion

    USD across millions of reported attacks such as DDoS and ransomware attacks [3]. Including the

    financial losses due to data breaches and the loss of more than 2 billion consumer data records, this

    number rises to a staggering $654 billion USD for US corporations in 2018 alone [2]. Cyberattacks

    to U.S. based financial services organizations in Q1 of 2019 alone cost more than $6.2 billion USD,

    a sharp rise from just $8 million USD in Q1 of 2018.


  • It must be noted that these costs are general estimates as the exact loss of value can be difficult

    to calculate. There is no centralized data set on the costs of cyberattacks and data breaches. Thus,

    many statistics in the cybersecurity industry come from surveys, which can suffer from

    non-representative and inaccurate reporting. Often, firms are reluctant to report negative

    information, which may cause these statistics to be biased downwards due to under-reporting. The

    White House published a report in 2018 containing Figure 2.1, which depicts the various economic

    impacts of cyberattacks and their respective difficulties in quantifying cost [1]. It can be seen that

    certain losses such as court fees and forensic costs are easy to quantify, but damage to reputation

    and loss of IP are difficult to quantify. Overall, cybersecurity breaches impact organizations across

    the world in various ways and magnitudes.

    Figure 2.1: Estimating financial losses to cyberattacks

    2.3 Categories of Cybersecurity Solutions

    Cybersecurity efforts are growing rapidly in order to respond to the quickly shifting landscape of

    malware attacks. Defensive solutions can take on a variety of forms and roles including anti-virus

    software, identity and access management tools, intrusion detection systems (IDSs), and others. The uses of most of these follow directly from their naming conventions. Anti-virus software is the bread and butter of the cybersecurity industry and includes software such as Windows Defender, Norton

    Antivirus, and Avast Antivirus. These tools can scan system files and downloads to check for virus


  • signatures that are then matched to a trusted database. Identity management tools include solutions

    like 2-factor authentication where users must confirm their identity through a verification code sent

    to an email or mobile device. These solutions enable organizations to more stringently verify a user’s

    online identity before granting access to sensitive information. IDSs are used to monitor network

    traffic for suspicious activity and issue alerts when such activity is discovered. Appropriate actions

    can then be taken such as blocking traffic from suspicious IP addresses or discarding undesirable

    packets. IDSs are at the heart of the Malmenator project.

    2.4 Ethics of Malware Research

    In order to defend against malware attacks, researchers first need to understand how malicious

    code works. Often, this requires preemptively creating malware in order to find vulnerabilities and

    appropriate solutions. Thus, not all malware is created with malicious intentions. People in the

    cybersecurity industry can be classified into one of four categories: white hat, black hat, grey hat,

    and red hat. The definitions of these types of hackers are listed below:

    White Hat : White hat hackers are people who create malware and attempt to break into

    computer systems for a good cause. These people could work for a cybersecurity firm, or could

    be a professional penetration testing consultant.

    Black Hat : Black hat hackers create malware for malicious causes in order to extort individuals

    and corporations for personal incentives such as money or power. These people are the ones

    responsible for much of the malware that cause tremendous economic losses.

    Grey Hat : Grey hat hackers do a mix of white hat and black hat activities and dabble in using

    their knowledge and skills for both good and bad depending on the circumstances.

    Red Hat : Red hat hackers are similar to black hat hackers, except they are employed by a

    government to initiate attacks on foreign powers.

    The key takeaway from this section is that ethics in malware research is not always

    straightforward. However, publicly published academic research is generally a white hat effort, this

    paper included.


  • 2.5 Challenges of Malware Research

    Performing a substantive literature review to understand the cutting-edge technology in malware

    research is a difficult task because there is little incentive to publicly publish anti-malware

    techniques. Any research that is published is also available to attackers who can choose to exploit

    other vulnerabilities in the system. Thus, the papers published by the top cybersecurity firms and

    government organizations are usually either outdated or extremely abstract without precise

    implementation and performance details. A cybersecurity report published by the Royal Society

    highlighted cybersecurity’s distinct characteristics including having multidisciplinary, global, and

    cross-sectoral interest, which cause research to take place across academic, commercial, and government sectors, further adding to these difficulties [6]. Information sharing across these

    fields is not transparent, and many corporations are hesitant to share their vulnerabilities in

    academic or government research as that may impact their reputation and harm their business.

    Thus, much of the recently published academic research in malware detection has challenges in

    practical implementation that must be taken into account.


  • Chapter 3

    Project background

    3.1 Overview

    This section provides an introduction to the concept of NIDSs. Given that cybersecurity is a highly specialized field of computer science, the information in this background is critical for

    understanding the relevant tools and technologies that the Malmenator project is based on. This

    section begins with a high level overview of what NIDSs are before delving into the details of their

    implementation and categorization.

    3.1.1 What is an NIDS?

    NIDSs monitor network traffic in order to detect when an unauthorized intrusion is being carried out by hostile entities. To do so, they provide some or all of the following:

    • Monitoring the condition of routers, firewalls, and servers

    • Providing system admins a way to tune, organize and understand relevant operating system

    audit trails and other logs that are often otherwise difficult to track or parse

    • Including an extensive attack signature database against which information from the system

    can be matched

    • Recognizing and reporting when the IDS detects that data files have been altered

    • Generating an alarm and notifying that security has been breached.


  • NIDSs offer organizations a number of benefits, starting with the ability to identify security

    incidents. NIDSs can be used to help analyze the quantity and types of attacks, and organizations

    can use this information to change their security systems or implement more effective controls.

    NIDSs can also help companies identify bugs or problems with their network device configurations.

    These metrics can then be used to assess future risks.

    NIDSs can also help the enterprise attain regulatory compliance. An IDS gives companies greater

    visibility across their networks, making it easier to meet security regulations. Additionally, businesses

    can use their NIDS logs as part of the documentation to show they are meeting certain compliance

    requirements.

    Although NIDSs monitor networks for potentially malicious activity, they are also prone to

    false alarms (false positives). Consequently, organizations need to fine-tune their NIDS products

    when they first install them. That means properly configuring their intrusion detection systems to

    recognize what normal traffic on their network looks like compared to potentially malicious activity.

    3.2 NIDSs vs NIPSs

    NIDSs are used to monitor and analyze network traffic in order to identify suspicious activity and

    alert system administrators in the event of an attack. An NIPS is similar to an NIDS, but differs in that

    an NIPS can also be configured to automatically block potential threats without the intervention of

    a system administrator. Historically, NIDSs were tailored to process network data more thoroughly

    and rapidly than NIPSs, but with the advent of increased processing power, the line between the

    two has become blurred. Today, most NIDSs provide configurations to allow for their capabilities

    to extend into the territory of NIPSs. Hence, the term network-based intrusion detection and

    prevention system (NIDPS) was coined. Network-based intrusion detection and prevention systems (NIDPSs) are systems that combine the capabilities of NIDSs and NIPSs. Although the prevention aspect of NIDPSs is critical in dealing with certain aspects of cybersecurity, Malmenator focuses

    on NIDSs implementation to allow for a more focused project. Further research should be done to

    determine and implement the optimal reaction to different forms of cyberattacks.

    3.3 NIDS implementation strategies

    NIDSs are deployed at strategic points within a system network where they can best capture traffic

    to and from all devices on the network. This usually involves being deployed directly within or in


    parallel to a router, switch, or access point in a network. Figure 3.1 depicts two implementation methods of an NIDS.

    Figure 3.1: Inline and passive implementations of an NIDS

    Inline (left) and passive (right) implementations of an NIDS as recommended by the National Institute of Standards and Technology, U.S. Department of Commerce [7]. In the inline implementation, all network traffic must pass through the NIDS, which may throttle network speeds. In the passive implementation, a copy of all network traffic is sent to the NIDS for processing, enabling network speeds to remain high.

    Inline implementation has the benefit of being able to directly respond to

    attacks by blocking network traffic. On the other hand, passive implementation must rely on a

    separate tool such as a firewall to secure traffic. Malmenator utilizes inline implementations of

    NIDSs since Malmenator focuses not only on detecting, but also on preventing network attacks.

    Note that in Figure 3.1, the NIDS is depicted as a separate hardware component. In some inline

    implementations, such as with Malmenator, the NIDS can be set up directly within the router.

    3.4 Overview of NIDS techniques

    This section describes the high-level methodologies that are employed in NIDSs to detect malicious

    traffic, namely signature-based detection and anomaly-based detection. These methodologies can

    be combined in a hybrid approach to balance each other’s weaknesses [8].


  • 3.4.1 Signature-based detection

    Also known as misuse-based techniques, signature-based techniques refer to the detection of attacks

    by looking for specific patterns within traffic data. The detected patterns are known as signatures,

    which are then matched against a trusted database to check if they have previously appeared in any

    malicious attacks. Signature-based techniques are effective for detecting previously known attack

    types with high accuracy and without raising a large number of false alarms, but their efficacy is

    only as good as the signature database [9]. Signature-based techniques rarely detect novel attacks

    (zero-day attacks) whose signature is not already inside the database. For example, one of the most

    popular NIDSs, Snort, conservatively captures 8.2% of zero-day attacks [10]. Overall, this technique

    is computationally fast, but not highly adaptable [11].
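To make the matching process concrete, the following Python sketch illustrates the core idea of signature-based detection: payloads are scanned for known byte patterns drawn from a signature database. The patterns, labels, and function names here are invented for illustration; a production NIDS such as Snort uses a much richer rule language and matching engine.

```python
# Toy "signature database": known byte patterns mapped to attack labels.
# These entries are illustrative examples, not real Snort rules.
SIGNATURES = {
    b"\x90\x90\x90\x90": "NOP sled (possible shellcode)",
    b"' OR '1'='1":      "SQL injection probe",
}

def match_signatures(payload: bytes) -> list[str]:
    """Return the labels of every known signature found in the payload."""
    return [label for pattern, label in SIGNATURES.items() if pattern in payload]

# A payload containing a known pattern triggers an alert; unseen (zero-day)
# patterns produce no match, which is the technique's central weakness.
alerts = match_signatures(b"GET /login?user=' OR '1'='1 HTTP/1.1")
```

Because detection reduces to substring lookups against a fixed database, the approach is fast and precise for known attacks but blind to anything not yet catalogued, exactly as described above.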

    3.4.2 Anomaly-based detection

    To solve the shortcomings of signature-based techniques in novel attacks, anomaly-based techniques

    model normal network behavior and identify deviations from the norm, enabling them to detect

    novel attacks whose signatures may be previously unknown. However, anomaly-based detection

    techniques may suffer from false positives - normal activity not yet seen before may be incorrectly

    classified as malicious. Furthermore, anomaly-based detection is a time consuming process in both

    training and execution, and many implementations suffer from excessive delay during the detection

    process that degrades their performance [12]. Anomaly-based detection is an area of ongoing and

    active research, and is one of the areas of focus for the Malmenator project.

    3.5 Packet-based and flow-based data formats

    Network traffic data is formatted in one of two ways: packet-based or flow-based. Packet-based

    data is captured in PCAP format and contains both metadata and payload information for each

    network packet. Metadata information for each packet depends on the transport protocol, and their

    differences are highlighted in Figure 3.2. There are a number of different protocols, but the most

    important ones are IP, TCP, UDP, and ICMP as they constitute the majority of internet traffic

    and are the core of packet-based datasets [13]. It is important to critically evaluate the impact of

various transport protocol data types on models built using packet-based data.
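As a concrete illustration of the packet metadata shown in Figure 3.2, the fixed 20-byte IPv4 header can be unpacked with Python's standard library. This is a simplified sketch that ignores IP options and assumes a well-formed header:

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Unpack the fixed 20-byte IPv4 header fields from raw packet bytes."""
    (version_ihl, tos, total_length, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": version_ihl >> 4,
        "header_length": (version_ihl & 0x0F) * 4,  # in bytes
        "total_length": total_length,
        "ttl": ttl,
        "protocol": proto,  # 6 = TCP, 17 = UDP, 1 = ICMP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }
```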

    Flow-based data is more compact compared to packet-based data, mainly containing metadata

about network connections. Flow-based data aggregates packets with similar properties within a time frame into one flow and discards payload information [13]. Since packet-based data is more detailed and thorough, it can be converted into flow-based data, but not vice versa.

    Figure 3.2: Overview of packet-based headers by protocol

Packet header formats for the IP, TCP, UDP, and ICMP transport protocols [13]. Each segment of the header is 32 bits in length. Note that there may be multiple data segments in a network packet. Packet data information can be used in payload analysis for anomaly detection, which is another branch of network anomaly detection.

    3.5.1 NetFlow

There are a number of variations of flow-based network data, including sFlow and jFlow, but the most widely used format of network flow data is called NetFlow. NetFlow was created by Cisco and defines a flow as a set of packets that share a common combination of key-fields in the packet.

This includes information such as source and destination IP addresses and port numbers, protocol type, type of service (ToS) byte, and logical interface. A packet is sorted into a flow record if it matches the

    combination of key-fields listed above [14].
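The aggregation of packets into flow records by key-fields can be sketched as follows. The dictionary field names here are illustrative, not NetFlow's actual export format:

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packets into flow records keyed on NetFlow-style key-fields.

    packets: iterable of dicts with src, dst, sport, dport, proto, bytes.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["bytes"]
    return dict(flows)
```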

There are a number of different versions of NetFlow, the most widely used being V4, V6, V9, and V10. At the time of writing, V10 is still relatively new to the market and the most widely implemented version is V9, which we will be using throughout the remainder of this project. The main differentiator between versions is the underlying methodology by which the packets are gathered and sorted, not the flows themselves. Thus, research done on V9 should be widely applicable to NetFlow data

    of both newer and older versions as the high level concept behind NetFlow data remains unchanged

    despite underlying engineering changes [14].

    3.6 Nfdump

Nfdump is a library containing a suite of NetFlow collector tools, which includes nfcapd, nfdump, nfreplay, and others. However, the entire suite of tools is often referred to simply as Nfdump. These

    tools are open source and can be configured from the command line to capture (via nfcapd) and

    export (via nfdump) netflow data. Importantly, netflow captured and stored by this suite of tools

    can be rebroadcast to a different port or machine using nfreplay. This suite of open source tools is

    integral to the data pipeline of Malmenator.
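As a rough sketch of how these tools fit together, the collector and reader can be driven from the command line as below. The port, rotation interval, and paths are illustrative defaults, not the project's actual configuration:

```shell
# Collect NetFlow records arriving on UDP port 9995 as a daemon,
# rotating capture files every 300 seconds into /var/cache/nfdump.
nfcapd -D -p 9995 -l /var/cache/nfdump -t 300

# Later, read the captured files and print the top 10 talkers by bytes.
nfdump -R /var/cache/nfdump -s ip/bytes -n 10
```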

    3.7 Summary

In this chapter, the concept of NIDSs and how they are used to detect and protect against malicious

    network data was introduced. Furthermore, inline and passive NIDS implementation techniques

    were covered, and a high level overview of their detection techniques, namely signature and anomaly

    based techniques, was also provided. Additionally, nfdump, a leading open source netflow library

    that Malmenator is built upon, was introduced. Building upon this knowledge, the following chapter

    will explore more advanced information and recent research pertaining to Malmenator.

Chapter 4

    Previous works

    4.1 Overview

    This chapter provides an in depth look into the context of the Malmenator project through a review

    of relevant literature. It first looks at research in the areas of network traffic dataset generation

    before proceeding with various methodologies in network anomaly detection.

    4.2 Network traffic datasets

    One of the major research challenges in training and evaluating an NIDS is the unavailability of

a single comprehensive network traffic dataset that reflects modern network traffic scenarios [8]. Since the effectiveness of an NIDS is evaluated based on its performance against a dataset that

    contains normal and abnormal behaviors, it is critically important to use a combination of datasets

    that feature different attack behaviours to serve as a metric for evaluation.

    4.2.1 Methods for network traffic dataset generation

    Testing NIDSs against realistic data has always been a challenge in the network security field as

    such data is sensitive and is not publicly available [15]. To date there are three main methods by

    which network data is generated and NIDSs are evaluated:

    1. Replaying real network attack traffic from publicly available datasets

    2. Generating artificial attack traffic from tools such as curl-loader

3. Testbed design strategies, which can be further broken down into three methods:

    (a) Simulation: all entities of the network are simulated

    (b) Emulation: attackers and targets are physically implemented, while the network structure

    is simulated with software

    (c) Physical Representation: all entities of the network are physically implemented

Each method has its own benefits and drawbacks that are more fully discussed in [15]. It is important to note that testing NIDSs is not a straightforward task, and that it is recommended to use a number

    of datasets for comprehensive evaluation [13]. In the network security field, it is important to be

    aware of the means by which datasets are generated as network dataset generation is a huge research

    area itself.

    4.2.2 Old reference datasets

    The most famous and classical benchmark network datasets are the KDD CUP 99, DARPA 1998, and

DARPA 1999 datasets, which were all gathered from simulated environments [16] [17] [18]. Unfortunately,

    since their inception nearly two decades ago, a number of studies have shown that evaluating an

    NIDS on these datasets is ineffective as they do not reflect realistic network data [19] [20] [21]

    [22]. These datasets have not been able to accurately reflect the changing landscape of network

    cybersecurity.

To solve these issues, the NSL-KDD dataset was created in 2009 by refining the KDD CUP 99 dataset and creating more sophisticated subsets [23]. Unfortunately, the

underlying network traffic of the NSL-KDD dataset dates back two decades, which renders it a poor representation of a modern low-footprint attack environment [24].

    4.2.3 Key datasets

    4.2.3.1 CIDDS-001

    The CIDDS-001 dataset was captured within an emulated environment of a small business in 2017

and is formatted as flow-based network traffic. Normal and malicious behaviours were generated

    via python scripts [25] [26].

This dataset also contains 14 features and captures over a million network flows, which makes it reasonably comprehensive. This dataset has also been cited by more than 50 researchers as of December 2019, which adds to its credibility. Its relatively recent release, widely cited creation

methodology, comprehensiveness, and flow-based format make it a top choice for this project.

    4.2.3.2 UGR 16

UGR’16 is another public flow-based dataset, which boasts 16,900 million unidirectional flows. This dataset was captured over four months in 2016 and aims to capture the behaviour of network traffic on an internet connection provided by an internet service provider. Some attacks were explicitly

    run against the system, while other attacks were later identified and labeled as such [27]. Each flow

    is labelled and categorised as normal, background, or attack, and includes a variety of attacks such

    as DoS attacks, botnet, and port scans.

    A potential issue that might limit its use in this project comes with the fact that most of the

traffic in this dataset is labelled as "background", which could be either a normal set of traffic flows or an attack scenario; the labelling does not differentiate between the two. This might cause

    the accuracy of a model trained on this dataset to drop. However, the comprehensiveness of this

flow-based data still makes it a trustworthy choice for this project.

    4.2.3.3 CICIDS 2017

The CICIDS 2017 dataset was also created using an emulated environment and comprises five days of network traffic in both a bidirectional flow-based format and the standard

    packet-based format. It also consists of more than 80 features for each flow and enriches the data

    about IP addresses and cyberattacks by providing extra metadata for both of them. Normal user

    behavior was executed via scripts, and the attacks have a wide range, including SSH brute force,

    DDoS, heartbleed, and botnet attacks. CICIDS 2017 is publicly available [28].

A major advantage of this dataset is that it comprises a wide variety of cyberattack scenarios

    such as SSH brute force, heartbleed, botnet, DoS, web and infiltration attacks. This dataset’s

    superiority in capturing the network traffic behaviour for a broad range of cyberattacks makes it a

    prime choice for its utilization in this project.

    4.2.4 Other relevant datasets

    The following subsection briefly summarizes some other datasets that have been the subject of some

    research and may be explored at a later time.

4.2.4.1 UNSW-NB15

The UNSW-NB15 dataset was created using the IXIA Perfect Storm tool in an emulated environment and is available in both packet-based and flow-based formats. It contains nine different families of

    attacks including backdoors and DDoS attacks [24].

    4.2.4.2 MAWILab

    The MAWILab repository contains real network traffic captured between two systems in the USA

and Japan. Since 2007, a 15-minute trace of packet-based data has been captured every day and labeled using anomaly detection methods [29].

    4.2.4.3 UNB ISCX 2012

    The ISCX dataset was created by emulating a network environment for one week to create a dataset

containing both normal and malicious network behaviours, including a variety of attacks such as

    SSH brute force and DDoS attacks [30]. The dataset exists in both a packet-based and bi-directional

flow-based format.

    4.2.4.4 CTU-13

    The CTU-13 dataset is available in both packet-based and flow-based formats and was captured in a

university network. Since the captured data is real, the authors of the dataset utilized their

    own anomaly detection methodologies to label the data [31].

    4.2.5 General Remarks

    According to the most recent survey on network datasets published earlier this year [13], the

CIDDS-001, CICIDS 2017, UNSW-NB15, and UGR16 datasets are the most suitable for use across

    generalized applications. UGR16 has a huge volume, CIDDS-001 contains detailed metadata for

    deeper analysis, and CICIDS 2017 and UNSW-NB15 have large variations in attack scenarios [13].

    Other datasets mentioned in this section are more appropriately utilized in specialized cases and

    for certain evaluation scenarios where they may excel.

4.3 Machine learning in network anomaly detection

    The anomaly detection problem boils down to the task of labeling a data point as being either a

    normal point or an outlier (i.e. harmless or malicious network traffic in this case). The following

    subsections each cover a different machine learning based approach to anomaly-based detection

that this project will explore and implement. As for signature-based techniques, the areas for improvement generally rely on having more comprehensive knowledge databases or faster signature matching algorithms, both of which are outside the scope of this project.

Figure 4.1: Taxonomy of network anomaly detection methods

Categorization of the four main approaches to network anomaly detection as shown in [32]. Note that this list is by no means exhaustive. Research is being done on the applications of new algorithms in anomaly detection such as deep learning, genetic algorithms, and artificial immune systems.

An overview of network anomaly detection techniques is depicted in Figure 4.1. Note that an exhaustive and in-depth discussion of the various subcategories of each methodology is not included in this report; they are discussed at length in [12] [8] [32] [9]. This section will center its discussion

    on the highest level of the taxonomy tree, namely classification, statistical, and clustering as general

    directions for exploration for the project. Models based on information theory also lie outside the

    scope of the project as they do not directly fall under the umbrella of machine learning.

    4.3.1 Classification methods

Classification methods are a subset of supervised machine learning that categorize data points into classes (normal and malicious) by building a baseline profile based on normal network traffic features in a set of labeled training data. There is a large variety of classification methods, but the

    most popular for network anomaly detection are support vector machines (SVMs), K-nearest

    neighbors (KNNs), and artificial neural networks (ANNs) [33]. SVMs are of special note because

this classification technique has been widely used with much success not only in network anomaly

    detection, but in the machine learning field as a whole [34] [35].

Other classification techniques, including fuzzy logic, regression models, and decision trees, have also been explored with some success [36] [37] [9]. However, classification techniques consume comparatively more resources than their statistical counterparts, which causes practical implementation challenges for their usage in NIDSs. Furthermore, classification techniques are hindered by their reliance on successfully modeling baseline profiles for standard network traffic, and are heavily influenced by the datasets they can be trained on. To date, a large number of classification methods have been evaluated using outdated datasets such as KDD CUP 99 (described in 4.2.2), and their performance would be worse when evaluated on newer datasets [12].
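To give a flavor of the classification approach, a minimal k-nearest-neighbors sketch over labeled flow feature vectors might look like the following. The feature vectors and labels are hypothetical (0 = normal, 1 = malicious):

```python
def knn_classify(query, examples, k=3):
    """Majority vote among the k training examples nearest to the query.

    examples: list of (feature_vector, label) pairs.
    """
    # Sort training examples by squared Euclidean distance to the query.
    by_distance = sorted(
        examples,
        key=lambda ex: sum((q - x) ** 2 for q, x in zip(query, ex[0])),
    )
    votes = [label for _, label in by_distance[:k]]
    return max(set(votes), key=votes.count)
```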

    4.3.2 Statistical methods

    Statistical methods fit a statistical model to the dataset and apply an inference test to determine

    whether a new data point belongs to the normal dataset or if it is an anomaly. Statistical models

    tend to be more straightforward to compute and test compared to classification models since they

    are closely linked to statistical tests. For example, a successfully implemented statistical model

    for network anomaly detection was based on a distance measure from the norm derived from the

    chi-square test statistic [38]. Since statistical methods are relatively lightweight, they can be used

    to great success in hybrid network anomaly detection methods on real-time traffic [39].
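A minimal sketch of this idea, assuming flows are summarized as fixed-length numeric feature vectors (the features themselves are hypothetical), fits a per-feature profile to normal traffic and measures a chi-square-style distance from it:

```python
def fit_profile(normal_flows):
    """Estimate per-feature mean and variance from normal traffic only."""
    n = len(normal_flows)
    dims = len(normal_flows[0])
    means = [sum(f[i] for f in normal_flows) / n for i in range(dims)]
    variances = [
        sum((f[i] - means[i]) ** 2 for f in normal_flows) / n or 1.0  # guard zero variance
        for i in range(dims)
    ]
    return means, variances

def chi_square_distance(flow, means, variances):
    """Distance of a new flow from the normal profile (larger = more anomalous)."""
    return sum((flow[i] - means[i]) ** 2 / variances[i] for i in range(len(flow)))

def is_anomalous(flow, means, variances, threshold):
    return chi_square_distance(flow, means, variances) > threshold
```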

    4.3.3 Clustering methods

Clustering methods are unsupervised machine learning methods which do not require prelabeled data. Their aim is to discover patterns within data and extract rules for assigning data points to

    groups based on characteristics such as distance or probability measure. Outliers or sparse clusters

    are then retrospectively identified and labeled as anomalous. Clustering methods generally rely on

    three key assumptions: (1) normal data falls into clusters while attacks do not, (2) normal data

    clusters are closer to the cluster centroid, while attack clusters are far away, and (3) normal data

    forms large and dense clusters, while attack data forms small and sparse ones [32] [12].
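Assumption (2) in particular suggests a simple detection rule once cluster centroids have been computed (for example by k-means): flag any point whose distance to its nearest centroid exceeds a threshold. A minimal sketch:

```python
def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its nearest cluster centroid."""
    return min(
        sum((p - c) ** 2 for p, c in zip(point, centroid)) ** 0.5
        for centroid in centroids
    )

def label_outliers(points, centroids, radius):
    """True for each point lying outside every cluster's radius (anomalous)."""
    return [nearest_centroid_distance(p, centroids) > radius for p in points]
```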

    Clustering techniques are advantageous because they do not require prelabeled data. Since

there is no single comprehensive and correctly labeled benchmark dataset, clustering techniques

    can proceed where traditional classification or statistical techniques may fail. Furthermore,

    clustering techniques decrease computational complexity and generally have better performance

than classification methods [32]. On the other hand, clustering methods fall short due to their high false positive rates and their inability to detect attacks that conceal themselves within a normal cluster [9].

    4.3.4 Constraints of machine learning

    Despite the huge variety of network anomaly detection techniques being researched, all machine

learning methodologies must keep several key points in mind in order to be functional in a real

    world implementation. According to one of the world’s leading cybersecurity firms, Kaspersky Labs,

    network anomaly detection models must: (1) be interpretable, (2) have relatively few false positives,

    (3) adapt to counteractions, and (4) be trained on a large and comprehensive dataset [40]. These

    key points are touted not only in industry, but also in academic research environments [41].

    4.4 Summary

    This section first explored methods for formatting and creating network data before proceeding

    to highlight several recent and comprehensive datasets. Lastly, this section discussed classification,

    statistical, and clustering techniques for applying machine learning to the network anomaly detection

    problem as well as their constraints. With ample background knowledge of the topic and relevant

    research, the next section will detail how Malmenator builds upon state-of-the-art technology to

    create a viable product.

Chapter 5

    Methodology

    5.1 Overview

    This chapter covers in detail the steps required to complete the project. The project consists of

    two phases over two corresponding semesters. The first phase focuses on completing the basic NIDS

    hardware implementation, processing selected datasets, and implementing a basic web interface

    for reporting. The second phase focuses on network anomaly detection, real-time network traffic

    evaluation and web interface completion to unify both phases into a single web platform.

    5.2 Network scanner

    This section describes the process of building a NIDS hardware device on Raspberry Pi to monitor

    network traffic. This will be completed in phase one of the project.

    5.2.1 Raspberry Pi hardware

    Raspberry Pis are low cost computers that can be customized for a variety of purposes such as web

servers, personal computers, or, in our case, a hybrid WiFi router and NIDS device. The recent release of the Raspberry Pi 4 Model B expanded its memory from 1 GB to 4 GB of RAM, making it an ideal tool for building Malmenator. Raspberry Pis can run a variety of open source

    operating systems, but Raspbian OS’s latest September 2019 distribution made it the ideal choice

    to use in Malmenator for stability and compatibility reasons [42].

5.2.2 Hardware setup

    The Raspberry Pi was used as illustrated in Figure 5.1, where it acted as a NIDS integrated into a

    router, similar to the inline NIDS implementation described in section 3.3. The NIDS device was

    connected to the internet through a wired Ethernet connection into the HKU local area network

    (LAN). From there, the device was configured to act as a wireless router by broadcasting a wireless

local area network (WLAN) and routing all connections through the Ethernet port using the iptables utility for Linux and intensive modification of a number of network configuration files.
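For reference, the core of such a configuration boils down to a handful of iptables rules of the following shape. The interface names are the common Raspberry Pi defaults and are assumptions, not a record of the exact rules used:

```shell
# Enable IPv4 forwarding so the Pi can route between interfaces.
sudo sysctl -w net.ipv4.ip_forward=1

# NAT traffic from WLAN clients out through the Ethernet uplink (eth0),
# and allow forwarding between the two interfaces.
sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
sudo iptables -A FORWARD -i eth0 -o wlan0 -m state --state RELATED,ESTABLISHED -j ACCEPT
sudo iptables -A FORWARD -i wlan0 -o eth0 -j ACCEPT
```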

To broadcast a wireless network, an external wireless USB adapter was originally utilized before we discovered that the Raspberry Pi's inbuilt wireless card could similarly be utilized to accomplish the

    same task. Further customization was performed to ensure that the Raspberry Pi’s internal internet

    usage was not impacted and that multiple devices could connect to the wireless network without

    interference. By accomplishing this setup, all network traffic sent and received from devices on the

    wireless network can be analyzed by the NIDS.

    Figure 5.1: Network setup with the Malmenator network scanner

Depiction of the utilization of the Raspberry Pi with integrated NIDS features. The Raspberry Pi is directly connected to the internet through an Ethernet channel and functions as a WiFi router.

    5.2.3 Utilizing nfdump

The Nfdump tools (described in section 3.6) were installed on the Raspberry Pi as the foundational NIDS upon which the rest of the project will be built. Preliminary trials on light network traffic revealed

    that nfcapd was able to capture 100% of traffic that went through the Raspberry Pi.

    Data captured by nfcapd and exported by nfdump will be periodically exported into an instance

    of Elasticsearch on AWS. Malmenator’s future works in network anomaly detection algorithms will

    be implemented on top of this system and deployed for testing and evaluation.

5.3 Dataset selection

    This section outlines the method for compiling Malmenator’s real-time network dataset for testing

    and evaluating the network anomaly detection system and also for contributing to the open source

    community. This part of the project will be completed in phase one, and is crucial for the overall

    success of Malmenator since many existing network datasets suffer from inaccurate labelling and

    poor attack diversity [12].

    5.3.1 CIDDS-001

Determining a suitable training dataset for anomaly-based detection was a two-phase process. First, it required looking into state-of-the-art network traffic capturing technology. From there, it required datasets that capture similar features on which to train the models.

    In accordance with the recommendations laid out in [13] and after careful deliberation and

debate with both our advisor and another professor specializing in cybersecurity, we opted to leverage

the CIDDS-001 dataset. The CIDDS-001 dataset was created in 2017 by capturing NetFlow data in an

    emulated small business environment. It contains four weeks of unidirectional flow-based network

traffic and encompasses an external server which was attacked from the internet [25]. In contrast to honeypots (dummy malware traps), this server was also regularly used by the clients from the

    emulated environment. The CIDDS-001 data set is publicly available and contains SSH brute force,

    DoS and port scan attacks as well as several attacks captured from the wild [26].

CIDDS-001 contains detailed metadata for deeper analysis and has a number of advantages over other datasets, including high accuracy for its labeled NetFlow entries, large volumes of data collected

over a long period, and recency and relevancy. Detailed analysis and high-level information about the dataset have already been produced by its creators and can be found in either its technical report describing its features [25] or its research paper describing its generation techniques [26].

    5.4 Network anomaly detection

    This project takes a hybrid approach to network anomaly detection using both signature-based and

    anomaly-based detection methods (introduced in section 3.4). This portion of the project will be

    completed in phase two only after the selection and processing of datasets have been completed.

5.4.1 Hybrid network anomaly detection approach

For the signature-based portion, the most effective open source rules, such as lists of known untrusted IPs or geolocation-based rules, will be retained and utilized in Malmenator. Much of the work in this

    section will arise from proposing and evaluating a novel ensemble of algorithms for evaluating network

    traffic data among those referred to in section 4.3 on anomaly-based methods. Work will begin by

    implementing standard versions of various statistical, classification, and clustering techniques with

    low training times on the dataset to see which methods have the best immediate results. From

    there, further research will be done on the more promising techniques. The selection of the final

    machine learning approach will be subject to further experiments and will be primarily evaluated

    on interpretability, real-time processing speeds, accuracy, and a low false positive rate as justified in

    section 4.3.4.

    5.4.2 Model evaluation

    The most critical aspect of creating an anomaly detection model is its evaluation - this is doubly

    true for the cybersecurity space for the reasons listed in the introduction. Thus, Malmenator has a

    3-fold approach to ensure its network anomaly detection model is competitive and reliable.

    1. Comparing to prior CIDDS-001 research

    2. Comparing to commercial netflow analyzers such as ManageEngine and SolarWinds

3. Comparing our model's performance on other NetFlow datasets as mentioned in 4.2.4 to

    research done on those datasets

    Having a 3-fold approach to model evaluation allows Malmenator to ensure consistency and

    quality in its performance.
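Regardless of the comparison target, each evaluation reduces to the same confusion-matrix metrics, which can be computed as follows (1 = attack flow, 0 = normal flow):

```python
def detection_metrics(y_true, y_pred):
    """Precision, recall, and false positive rate from labeled predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "fpr": false_positive_rate}
```

A low false positive rate is weighted heavily during model selection, per the constraints in section 4.3.4.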

    5.5 Web interface

    This section gives an overview of the implementation of Malmenator’s web interface for interacting

    with the NIDS. The web interface will be continuously refined throughout the course of the project.

5.5.1 Dashboard

    The main functionality of the web interface will be to monitor and control aspects of the Raspberry

    Pi NIDS. This includes features such as viewing the different devices connected to the network as

    well as seeing live updates and data analytics on the traffic flow. The network data analytics include

    network anomaly alerts, IP address origins, and network speeds.

    5.5.2 Web architecture

    The web interface is centered around the Elasticsearch Logstash Kibana (ELK) stack, three open

source tools that drive thousands of data analytics projects across the world. Elasticsearch is a powerful search and analytics engine. Logstash is a server-side data processing pipeline for ingesting

    data from multiple sources to send to Elasticsearch, and Kibana is used to visualize data with charts

    and graphs [43].

    Figure 5.2: Malmenator web architecture

The overall web architecture. The blue lines denote standard internet traffic generated by users, while the yellow line indicates the NetFlow log messages that are sent to the cloud server. Note that the network anomaly detection model may be deployed on the cloud or within the Raspberry Pi depending on processing speeds.

The ELK-powered web application will be hosted on Microsoft Azure Cloud Services as illustrated

    in Figure 5.2. Elasticsearch will be hosted on the cloud, and the NIDS will feed information into the

    Elasticsearch instance via Logstash. Kibana will then be responsible for monitoring and displaying

data to the end users. This enables us to use a modular approach in constructing the web interface, which aligns well with an agile software development methodology.

    5.6 Summary

This chapter first presented the detailed roadmap for the Malmenator project, beginning with the hardware NIDS setup using nfdump and the Raspberry Pi. The chapter proceeded to detail dataset

    selection and network anomaly detection approaches before finally providing an overview of the web

    architecture behind the web interface to control the NIDS. Armed with a roadmap to the project, the

    next and final section will discuss and evaluate Malmenator’s overarching progress and challenges.

Chapter 6

    Discussion and remarks

    6.1 Overview

    This chapter focuses on the current status and planned timeline of the project as well as self-

    evaluation regarding Malmenator’s progress.

    6.2 Current Status

    The project since its inception in August 2019 has been consistently making progress towards

    achieving its goals. Based on the methodology, there are 4 key components in this research-based

project: the Network Scanner, Dataset Curation, the Hybrid Network Anomaly Detection Model, and the Web Architecture. This section describes the current progress of the project and

    objectively evaluates the current standing of the project in terms of its success in delivering the

    key components.

    The project has finished working on all aspects of the Network Scanner. The Raspberry Pi

hardware has been successfully configured so that it can connect to any router and broadcast a WiFi

    network. Moreover, nfdump has been installed and configured on the Raspberry Pi hardware and

    it is able to scan network packets flowing through the system. This part of the project was finished

    in October 2019.

During November 2019, dataset curation was also completed following our extensive research on public network flow datasets. This included selecting public datasets as described in section 5.3, of which the CIDDS-001, UGR’16, and CICIDS 2017 datasets have already been extracted and analyzed. Further

work on analyzing and extracting more public datasets will be carried out in accordance with the results

    obtained and the time constraints.

In December 2019, work on the Hybrid Network Anomaly Detection Model began and is currently in progress. The team is currently trying to build the initial version of

    an anomaly detection model based on the analyzed CIDDS-001 public dataset. This also involves

    trying to train and test the anomaly detection model on the CIDDS-001 dataset using several basic

    classification-based methods. Further testing with other types of anomaly detection methods is also

    simultaneously being done by the team.

As of February 2020, our progress seems to be consistent with our planned project timetable

    provided in table 6.2 and we have been able to complete the deliverables due on September 29,

    October 10 and December 10 successfully. Based on our progress we are hopeful about meeting

    the deadlines for the remaining deliverables and achieving the goals set out for this research-based

    project. A summary of our current progress based on the 4 key components of the project detailed

    in section 5 can be found in the following table.

Table 6.1: Progress Evaluation

Component Name | Description | Status
Network Scanner | Hardware and software for logging network traffic flow | Finished
Dataset Curation | Researching, extracting and analyzing public network traffic datasets for model building | Finished
Network Anomaly Detection Model | Building, training and testing the model with different machine learning methods | Ongoing
Web Architecture | Setup of Elasticsearch, Kibana and the data pipeline from network scanner to Elasticsearch | Ongoing

    6.3 Discussion of Findings

The Wifi network broadcast by the Raspberry Pi has been tested by connecting several mobile devices and personal computers, all of which were able to join the Pi's network successfully. This has reduced a major feasibility risk for the project, since no existing resources described how to turn the new Raspberry Pi model into a router that broadcasts an internet-connected network and serves multiple devices simultaneously.

Another completed aspect is running nfdump on the Raspberry Pi. We found that nfdump functions efficiently without degrading the Raspberry Pi's normal operation. Moreover, it captured 100% of the data packets that passed through the Raspberry Pi's network during a 10-minute test run.

This result matches the expected behaviour: no alerts should be generated on a malware- and cyberattack-free network, and none of the devices connected during the test run contained malware. It was also observed that nfdump logs each network packet with 7 key parameters, namely the packet size, the receiver's and sender's IP addresses, the receiver's and sender's ports, the transfer protocol, and the packet's content in byte format.
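To illustrate how such flow records could be consumed downstream, the sketch below parses one CSV-formatted flow line into the 7 key parameters. The column order and field names here are illustrative assumptions, not nfdump's actual CSV layout.

```python
import csv
import io

# Illustrative column order -- an assumption, not nfdump's actual CSV layout.
FIELDS = ["src_ip", "src_port", "dst_ip", "dst_port", "protocol", "bytes", "payload"]

def parse_flow_line(line):
    """Parse one CSV-formatted flow record into a dict of the 7 key parameters."""
    row = next(csv.reader(io.StringIO(line)))
    record = dict(zip(FIELDS, row))
    # Numeric fields are converted so downstream models can use them directly.
    for key in ("src_port", "dst_port", "bytes"):
        record[key] = int(record[key])
    return record

# Hypothetical flow record: sender IP, sender port, receiver IP, receiver port,
# protocol, packet size in bytes, content in byte (hex) format.
sample = "192.168.1.5,51234,10.0.0.2,443,TCP,1500,0xdeadbeef"
flow = parse_flow_line(sample)
```

In practice a small adapter like this would map whichever columns nfdump is configured to emit onto the feature names used by the detection model.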

Moreover, exploration and analysis of the CIDDS-001 public dataset confirmed that the 7 parameters captured by nfdump for each data packet are also available in the public dataset. This makes it possible to train the network anomaly detection model on the CIDDS-001 dataset while evaluating and running the model on the network scanner using the parameters logged by nfdump.

This enables our project to analyze all the network packets flowing in and out of the system in approximately real time. Besides, the ability to capture the curated datasets' parameters from the Network Scanner is significantly helping the development of the hybrid network anomaly detection model, one of Malmenator's key goals.

    6.3.1 Future Works

While the current progress is promising, there is still a wide array of tasks to be completed for the project to succeed. These include further analyzing the UGR’16 and CICIDS-2017 public datasets and incorporating them, alongside CIDDS-001, into the network anomaly detection model.

This further data curation will enhance the comprehensiveness of the existing curated datasets and support better training and evaluation of the hybrid network anomaly detection model.

Following that, another task is to continue work on the hybrid network anomaly detection model itself, which will require significant time and experimentation with different machine learning algorithms to discover which set of algorithms fits the curated datasets effectively.

Moreover, experimenting with the model's accuracy under different algorithms will also help in creating a more comprehensive evaluation matrix to compare and contrast our methods with other research on network anomaly detection models.

Finally, we need to continue developing our web infrastructure, which will serve as the end deliverable through which users of the research project can interact with it and perform operations via a web-based dashboard.

    6.4 Preliminary anomaly detection model results

The initial anomaly detection model was built using two approaches based on the CIDDS-001 dataset, as described in section 5.3.1.

Approach 1: The initial application used a one-class SVM model on numeric and categorical fields from the netflow data. The features include the netflow duration, source port, destination port, number of packets and number of flows, and the label is a single boolean indicating whether the flow is anomalous. For this initial approach, the model treats flows labelled attacker or victim as anomalous and flows labelled normal as non-anomalous. Under this labelling, of the 8.45 million flows, 7.01 million were anomalous and 1.44 million were non-anomalous. The one-class SVM trained on this dataset produced fair results on the testing data, correctly classifying 89.12% of network flows with a false positive rate of 10.88%.
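For reference, the accuracy and false positive rate reported above can be derived from confusion-matrix counts; a minimal self-contained sketch with made-up toy labels (1 = anomalous, 0 = non-anomalous):

```python
def evaluate(y_true, y_pred):
    """Compute accuracy and false positive rate for binary anomaly labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    accuracy = (tp + tn) / len(y_true)
    # FPR: fraction of truly non-anomalous flows that were flagged as anomalous.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Toy example: 8 flows; 2 of the 4 normal flows are wrongly flagged -> FPR = 0.5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
acc, fpr = evaluate(y_true, y_pred)
```

The same counts computed over the CIDDS-001 test split yield the 89.12% / 10.88% figures quoted above.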

Approach 2: This involved applying KMeans unsupervised learning to the network flow data as a time series. The KMeans model uses the date-first-seen feature in addition to the features used by the one-class SVM approach; adding this feature turns the task into a time series anomaly detection problem on which the KMeans model was trained. On the training data, only 0.637% of predictions were incorrect, a correct prediction rate of over 99%. However, as the predictions were made on the training data, this may not precisely indicate the model's actual performance, which is susceptible to over-fitting. This measurement will be corrected in future versions of the model.
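One common way to obtain anomaly predictions from a fitted KMeans model, sketched below, is to flag points whose distance to the nearest learned centroid exceeds a threshold. The centroids, features and threshold here are illustrative assumptions, not values from the trained model.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Euclidean distance from a point to its nearest cluster centre."""
    return min(math.dist(point, c) for c in centroids)

def flag_anomalies(points, centroids, threshold):
    """Mark a point anomalous (True) if it lies farther than `threshold`
    from every cluster centre learned by KMeans."""
    return [nearest_centroid_distance(p, centroids) > threshold for p in points]

# Illustrative centroids in a 2-D (time, packet count) feature space.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.5), (10.2, 9.8), (5.0, 5.0)]
flags = flag_anomalies(points, centroids, threshold=2.0)
```

The threshold would in practice be tuned on held-out data rather than fixed by hand, which is also where the over-fitting concern noted above would be addressed.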

    6.5 Project timeline

As mentioned in section 5.1, Malmenator consists of two phases. In phase one, we aim to set up the NIDS hardware and generate a comprehensive dataset for evaluation. General statistical analysis of the datasets and the creation of a basic web interface for Malmenator will also be completed. In phase two, we aim to train, evaluate and deploy a hybrid network anomaly detection technique using customized hardware.

    Malmenator’s project schedule can be found in table 6.2. Phase one will be completed by the

    end of December and phase two will be completed by the end of March.

Table 6.2: Planned project timetable

Date | Description | Deliverable
Aug 22 | Initial meeting with supervisor |
Sep 20 | Second meeting with supervisor |
Sep 29 | Project proposal and website | Project plan and website
Oct 31 | Setup nfdump and Raspberry Pi router | Hardware setup completed
Nov 30 | Dataset compilation | Network dataset processed
Dec 31 | Web UI hosted on the cloud | ELK established
Jan 24 | First presentation |
Feb 2 | Preliminary implementation and interim report | Interim report
Feb 28 | Network classification model | Fully trained classifier
Mar 30 | Unified network scanner, classifier, and web UI | Functional product
Apr 1 | Project in production |
Apr 19 | Finalized implementation and final report | Final report
Apr 20-24 | Final presentation |
May 5 | Project exhibition |
Jun 3 | Project competition |

    6.6 Engineering Risk mitigation

There are two main approaches Malmenator uses to reduce the risk of wasting time or failing to attain desirable results. First, our team focuses on coding the skeleton framework for the NIDS and web interface first, without spending time on details that can be refined later; this can be seen in the deliverables for phase one of Malmenator. Second, our team spends a large amount of time researching previous related works to ensure that our research direction remains promising. This approach keeps our team from wasting time coding technology that will not be used, and allows our knowledge base to grow at a steady pace without rushing to acquire experimental results.

    6.7 Challenges

    6.7.1 Steep learning curve

Several major challenges have arisen during the Malmenator project, the first and foremost being our team's lack of prior knowledge in the cybersecurity field. This directly resulted in us setting unrealistic expectations for the project scope, which required us to reevaluate and narrow the scope several times and to modify our initial methodology to more accurately reflect feasibility and utility based on our research.

    6.7.2 Scope identification

Cybersecurity is a huge domain in itself, entailing specialisations from malware analysis to penetration testing. This made it harder to limit the scope of the project, as the team started with a broad set of goals, including cleaning malware from a network as well as malware file classification. This can again be attributed to the team's lack of prior knowledge in cybersecurity. The team eventually identified key problems and the need for improvement and innovation in the niche of network security, and resolved this challenge by limiting the scope of research to identifying anomalies and cyberattacks at the network level.

    6.7.3 Proprietary information

A further challenge comes from the inherent nature of the cybersecurity field. Unlike fields such as computer vision or natural language processing, there are large barriers to navigating and researching the field due to a lack of transparency and a general obfuscation of materials. For example, a universally used benchmark dataset does not exist because of privacy and legal concerns over data sharing (in contrast with ImageNet for image recognition problems) as well as the ever-changing nature of cyberattacks. Furthermore, industry research in cybersecurity tends to be far ahead of academic research, since academic research is accessible to malicious attackers. Lastly, experimenting with live viruses and malware poses serious safety hazards that require additional precautions, further slowing progress.

Conclusion

Malmenator is a research-based project that aims to deliver a powerful and adaptable network security

    tool to individuals and small organizations. By exploring powerful and comprehensive anomaly

    detection techniques from multiple approaches, our project thoroughly analyzes network traffic data

    for suspicious and unwanted activity. Malmenator’s research component will shed new light on the

    efficacy of various datasets in evaluating NIDSs and propose novel approaches to detect network

    anomalies. To date, the NIDS hardware has already been implemented on a heavily configured

    Raspberry Pi using a modular open source NIDS software, nfdump. Furthermore, research into the

    appropriate dataset for model creation has been completed, leading us to leverage the CIDDS-001

    dataset for building and evaluating a hybrid technique for network anomaly detection. Over the

    next month, our team will explore a variety of machine learning models on the dataset to determine

which class of techniques works most effectively.

