+ All Categories
Home > Documents > HPL-2012-206

HPL-2012-206

Date post: 03-Apr-2018
Category:
Upload: saryus2012
View: 213 times
Download: 0 times
Share this document with a friend

of 86

Transcript
  • 7/29/2019 HPL-2012-206

    1/86

    Security Analytics: Risk Analysis for an Organisation's Incident Management

    ProcessMarco Casassa Mont, Richard Brown, Simon Arnell, Neil Passingham

    HP Laboratories

    HPL-2012-206

    Keyword(s):Security Analytics; Risk Analysis; What-if Analysis; Incident Management Processes; SOC

    Abstract:This document is an example of the type of report an organisation would receive at the end of a HP Security

    Analytics engagement. The focus is on the analysis of the security risks and performance of the organisation's

    Security Incident Management Processes and related Security Operation Centre (SOC)'s activities. HP Labs carried

    out the underlying R&D work in collaboration with HP Enterprise Security Services (HP ESS) and involved

    analysis of processes, probabilistic modeling, simulation and "what-if" analysis for some of HP's key customers. The

    outcome of this was a set of case studies from which we have been able to create this more general anonymised

    report illustrating the richness of the risk assessment and "what-if" analysis that has been carried out. The lifecycle

    management of security is critical for organisations to protect their key assets, ensure a correct security posture and

    deal with emerging risks and threats. It involves various steps, usually carried out on an ongoing, regular basis,

    including: risk assessment; policy definition; deployment of controls within the IT infrastructure; monitoring and

    governance. In this context, Security Information & Events Management (SIEM) solutions play a key role. Even the

    best information security practices and investments in security controls cannot guarantee that intrusions

    accidentaland criminal activities and/or other malicious acts will not happen. Controls can fail, be bypassed or become

    inadequate over time; new threats emerge. Managing such incidents requires detective and corrective controls to

    minimise adverse impacts, gather evidence, and learn from previous situations in order to improve over time. These

    incident management processes are usually run in the context of a SOC and/or as part of specialised Computer

    Security Incident Response Teams (CSIRTS), built on top of SOCs. Even with SIEM solutions in place, a potential

    major risk for the organisation arises due to delays introduced in assessing and handling known incidents: this may

    postpone the successful resolution of critical security incidents (e.g. devices exposed on the Internet, exploitation of

    privileged accounts, deployed malware, etc.) and allow for further exploitation. Another related risk can be

    introduced by sudden and/or progressive changes of the threat landscape, due to changing economic and social

    scenarios, new business activities or process failings within the existing IT services. This might create unexpected

    volumes of new events and alerts to be processed by the security team and as such, introduce additional delays.

    Hence, it is important for an organisation to understand the risk exposure due to their Incident Management

    processes, explore potential future scenarios (e.g. changes in available resources or threats landscapes or adoption of

    Cloud solutions) and identify suitable ways to address related issues, e.g. by introducing process changes and/or

    making investments in security controls.

    External Posting Date: September 6, 2012 [Fulltext] Approved for External Publication

    Internal Posting Date: September 6, 2012 [Fulltext]

    Copyright 2012 Hewlett-Packard Development Company, L.P.

  • 7/29/2019 HPL-2012-206

    2/86

  • 7/29/2019 HPL-2012-206

    3/86

  • 7/29/2019 HPL-2012-206

    4/86

  • 7/29/2019 HPL-2012-206

    5/86

    HP Security Analytics Final Report

    Page 1 of 80

    Disclaimer

    This document is an example of the type of report a customer would receive at the end of a Security

    Analytics engagement [10]. The focus is on the analysis of the security risks and performance of thecustomersSecurity Events & Incident Management Processes.

    The report is loosely based around a real customer case study, but all contextual information, event

    and incident management processes, empirical data, analyses and conclusions have been modified in

    order to protect this customers identity and confidential information. As a result we believe this

    report reflects well the current situation many large organizations find themselves in with respect to

    the management of security events and related incidents.

    For this purpose, we introduce a fictitious company, ServiceBizWorld.

    This report focuses on two key processes in the incident management and remediation space - the

    process ofanalyzing and handling security events/alerts (to the point these events are either closed orclassified as false positives or turned into incidents) and the process of managing and remediating to

    security incidents - which have been generalized (and represented with templates) and should be of

    relevance to many customers. The report shows how Security Analytics can be applied in carrying out

    risk assessment and what-if analyses.

    The analysis discussed in this report can be applied also to those companies that currently do not have

    explicit event and incident management processes in place but are planning to create them. In this

    context, Security Analytics can be of help to explore the impact of making different process choices

    (in terms of security risk and productivity) along with the implications of making specific

    technological investments.

    Security Analytics [10] has also been successfully applied to the areas ofIdentity and Access

    Management, Vulnerability and Threat Management and Web Infection.

  • 7/29/2019 HPL-2012-206

    6/86

    HP Security Analytics Final Report

    Page 2 of 80

  • 7/29/2019 HPL-2012-206

    7/86

    HP Security Analytics Final Report

    Page 3 of 80

    Security AnalyticsRisk Analysis: Security Event & Incident Management Processes

    Executive Summary

    HP Labs conducted a detailed analysis of the current ServiceBizWorlds Security Events and Incident

    Management Processes. It was undertaken using HPs Security Analytics modelling methodology.

    The analysis focused on the Event/Alert Processing & Remediation Process and the Incident Management

    Process. The latter is triggered whenMajor (IT) Security Incidents are identified by the SOC team. The analysis

    included the examination of the risk exposure arising from current practice and process steps measured as the

    time to identify and remediate security incidents. Models have been developed to simulate the arrival of

    events/alerts and the execution of the involved processing steps. These models took into account the complexity

    of events/alerts and their rate, the interactions with Third Parties (TPs), TP behaviours and non compliance

    issues, various SOC teams management steps. They measured, among other things, the time required to:

    process events/alerts; identify minor and major security incidents and remediate them; identify false positives.

    A first outcome of the analysis highlighted the complexity of the current processes; the models provide an

    explicit representation of the involved steps.

    Experimental results confirmed that interactions with TPs constitute a major bottleneck. Specifically they

    account for 60-70% of the time required to process events/alerts (i.e. between 15-50 days). This is a critical

    issue, as it negatively affects the processing ofMajor Security Incidents. Two key aspects have been identified

    as root causes: the lack of specific SLAs with TPs about how information should be provided to the SOC Team;the lack of information availability for TPs.

    In this context, three what-if scenarios were explored using the model of the Event/Alert Processing &

    Remediation Process (the most critical of the two analysed processes) to assess the impact on risk exposure and

    performance, in case of: (1) improving SLAs with TPs; (2) changing threat environments; (3) improving

    information availability for TPs. The main findings show that:

    The definition of SLAs for TPs (e.g. about response time and number of iterations with the SOC Team) candramatically improve the overall performance and reduce the risk exposure by at least an order of

    magnitude;

    The current process might expose ServiceBizWorld to additional risks - in terms of delays to identifysecurity incidents and saturation of TP resources/personnel - if the daily number of new events to be

    processed increases by about 30%;

    Investments in new controls not only can improve the overall TP information availability and performance(estimated to be an order of magnitude) but also help to cope with worsening threat landscapes.

    Specifically, the adoption ofinformation request templates and manual scripts can ensure process resilience

    to a 60% increase of the daily number of new events/alerts to be processed. The introduction of automation

    of scripts and/or workflows enables process resilience to an increase of 90% of events/alerts.

    A what-if scenario for the Incident Management Process showed similar outcomes and improvements when

    templates and automation are introduced to deal with incident response & remediation tasks.

    These findings depend on the assumptions made and the information available at the time of the analysis. They

    provide a starting point for conversations between ServiceBizWorld, TPs and the SOC Team, on expectations,

    policy & SLA changes, as well as on potential new investments, to improve the current process performances,reduce risk exposure and deal with contingency planning for worsening threat environments.

  • 7/29/2019 HPL-2012-206

    8/86

    HP Security Analytics Final Report

    Page 4 of 80

    Content

    Executive Summary 3

    Content 4

    1 Glossary 6

    2 Overview of the Report 7

    3 Area Selected for Investigation: Goals and Scope 8

    4 Overview of Security Event & Incident Management Processes 10

    5 Analysis of ServiceBizWorld Security Event & Incident Management Processes 13

    6 Security Analytics Approach 15

    7 Models 17

    7.1 Model 1: Security Event/Alert Processing and Remediation 17

    7.1.1 Modelled Process Areas 19

    7.1.1.1 Events and Alerts 19

    7.1.1.2 Initial Assessment Phase 19

    7.1.1.3 SOC Team Interactions with TPs 20

    7.1.1.4 SOC Team Final Assessment 21

    7.1.1.5 Event Closure 22

    7.1.1.6 Security Incident 237.1.1.7 TP Non Compliance 23

    7.1.1.8 Handling False Positives and Known Problems 24

    7.1.2 Metrics 25

    7.1.3 What-if Scenarios 26

    7.1.3.1 Impact of Changing SLAs for TPs 26

    7.1.3.2 Impact of Changes of the Threat Environment 28

    7.1.3.3 Impact of Investments to Improve Information Availability for TPs 32

    7.2 Model 2: Incident Management Processes 36

    7.2.1 Modelled Process Areas 38

    7.2.1.1 Events 387.2.1.2 Initial Assessment Phase 38

    7.2.1.3 Information Gathering from Risk Management Team 38

    7.2.1.4 TP Handling of Incident Response 39

    7.2.1.5 TP Incident Remediation 40

    7.2.1.6 Risk Management 40

    7.2.2 Metrics 40

    7.2.3 What-If Scenarios 41

    7.2.3.1 Impact of Investments to Improve the Process Performance 41

    8 Results Regarding Current State 44

    8.1 Model 1: Events/Alerts Processing and Resolution 44

  • 7/29/2019 HPL-2012-206

    9/86

    HP Security Analytics Final Report

    Page 5 of 80

    8.1.1 Summary 48

    8.2 Model 2: Incident Management Process 49

    8.2.1 Summary 50

    9 Results for What-If Scenarios 52

    9.1 Model 1: Security Event/Alert Processing and Resolution 52

    9.1.1 Impact of Changing SLAs for TPs 52

    9.1.1.1 Summary 57

    9.1.2 Impact of Changes of the Threat Environment 58

    9.1.2.1 Summary 62

    9.1.3 Impact of Investments to Improve Information Availability for TPs 63

    9.1.3.1 Summary 71

    9.2 Model 2: Incident Management Process 72

    9.2.1 Impact of Investments to Improve the Process Performance 72

    9.2.1.1 Summary 75

    10 Conclusion 76

    11 References 77

    A.1 List of parameters - Model 1: Events/Alerts Processing and Resolution 78

    A.2 List of parameters - Model 2: Incident Management Process 80

  • 7/29/2019 HPL-2012-206

    10/86

    HP Security Analytics Final Report

    Page 6 of 80

    1 Glossary

    CSIRTS Computer Security Incident Response Teams

    ITIL Information Technology Infrastructure Library

    Major Security

    Incident

    Serious Security Incident with potential Major Security & Business Impacts

    SA Security Analytics

    SIEM Security Information and Event Management

    SIM/SEM Security Information Management/Security Event Management

    SLA Service Level Agreement

    SOC Security Operations Centre

    TP Third Party

  • 7/29/2019 HPL-2012-206

    11/86

    HP Security Analytics Final Report

    Page 7 of 80

    2 Overview of the Report

    This final report presents the findings of the security risk analysis for the ServiceBizWorlds Event &

    Incident Management Processes, based on the Security Analytics methodology [10] developed by HP

    Labs.

    Section 3 describes the problem space, the scope and the goals of this risk analysis.

    Section 4 provides a general overview of the Security Event and Incident Management area.

    Section 5 specifically analyses the current security event and incident management processes in place

    at ServiceBizWorld.

    Section 6 provides an overview of the HP Security Analytics approach and how it has been applied to

    this area, to explore the problems investigated in this report.

    Section 7 describes the two Security Analytics models developed by HP Labs, in collaboration with

    ServiceBizWorld, related to: (1) theirEvents/Alerts Processing & Remediation Process; (2) their

    Incident Management Process. This section also describes the agreed metrics and the what-ifscenarios that have been investigated.

    Section 8 provides a detailed discussion of the risk assessment results obtained from the current

    Security Analytics models.

    Section 9 presents and discusses the outcomes of the analysis of various, related what-if scenarios.

    Section 10 draws the final conclusions.

  • 7/29/2019 HPL-2012-206

    12/86

    HP Security Analytics Final Report

    Page 8 of 80

    3 Area Selected for Investigation: Goals and Scope

    The focus of the HP Labs security risk analysis is on the event & incident management processes that

    are currently deployed and run by ServiceBizWorld. These processes are critical for the organisation

    to identify potential security issues, analyse and mitigate them. They are complex and involve various

    stakeholders: the SOC Centre/Security Team; the Risk Management Team; various Third Parties(TPs). Specifically, these TPs include ServiceBizWorlds IT department, ServiceBizWorlds business

    groups and the IT departments of external companies, which provide (business and IT) services to

    ServiceBizWorld in the Cloud and/or in outsourcing.

    The issues arising in this area, explored with the HP Security Analytics methodology, are captured in

    the following statements:

    How to ensure that the right incident management processes are in place to handle both

    current security threats and potential future critical ones?

    What is the risk exposure for ServiceBizWorld? What is the impact of changing threat

    environments?

    The area of Security Incident Management is wide as it covers various types of incidents, including

    the ones related to: IT infrastructure; people behaviours; system failures; fraud; etc.

    The security risk analysis discussed in this report focuses on IT security events and incidents. The

    following priorities have been identified by ServiceBizWorld:

    the need to be confident that ServiceBizWorld is ready to handle current and future securitythreats;

    the need to be able to justify investments and process changes.

    The goals for the security risk analysis are: Analyse the risk exposure of ServiceBizWorld due to the current security incident

    management processes;

    Analyse the impact of introducing new SLAs for the involved stakeholders and/or deploying

    new controls within the incident management processes;

    Analyse the impact on current processes (in terms of risk exposure) of changes of the threat

    landscape (e.g. increase of number of new events/alerts to be processed);

    Analyse the impact of deploying new controls within the current processes to mitigate

    changes of the threat landscape.

    A secondary goal of this analysis is to identify potential productivity issues related to current Incident

    Management Processes and explore suitable ways for improvement.

    The security analytics methodology adopted for this investigation has been specifically designed to

    answer these types of questions. The methodology is based on system environment modelling and

    predictive simulations, which allow the experimental exploration of risk and productivity across the

    set of outcomes for the current incident management and remediation processes. It also provides the

    ability to examine the impact of deploying different types of controls and/or process changes along

    with exploring the impact of a changing threat landscape.

    Changes to the event & incident management processes need to have the support of the various

    involved stakeholders and potentially continued investments in a platform that can facilitate the

  • 7/29/2019 HPL-2012-206

    13/86

    HP Security Analytics Final Report

    Page 9 of 80

    overall processes. This requires strategic alignment with ServiceBizWorlds overall business

    objectives and risk posture.

    The final aim of the analysis discussed in this report is to provide evidence to ServiceBizWorld and its

    Security & Risk Management teams to support their decision making process, specifically in

    identifying suitable policies and SLAs changes and/or justify future investments in process changes

    and controls.

  • 7/29/2019 HPL-2012-206

    14/86

    HP Security Analytics Final Report

    Page 10 of 80

    4 Overview of Security Event & Incident Management Processes

    The lifecycle management of security is critical for organisations, to protect their key assets, ensure a

    correct security posture and deal with emerging risks and threats. It involves various steps, usually

    carried out on an ongoing, cyclic basis, including: risk assessment; policy definition; deployment of

    controls within the IT infrastructure; monitoring and governance.

    In this context, Security Event Monitoring and Incident Management play a key role. Even the best

    information security practice and security control investments cannot guarantee that intrusions,

    accidental and criminal activities and/or other malicious acts will not happen. Controls can fail, be

    bypassed or become inadequate over time. New threats could emerge.

    ITIL [1] defines an incident as "any event which is not part of the standard operation of a service and

    which causes, or may cause, an interruption to, or a reduction in, the quality of service". A similar

    definition could apply in the context of security incidents, which might not only cause a reduction in

    the quality of service but also expose the organisation to additional risks.

    This document specifically focuses on IT security incidents and the processes that organisations put inplace for their management and remediation.

    Various types of security incidents are of concerns to organisations, including: deployment of

    malicious software and malware within critical IT systems; misuse of systems and services by users;

    unauthorised access to resources; computer sabotage and damages; intrusion of critical

    systems/services; information theft and espionage.

    Managing incidents involves detective and corrective controls to minimise adverse impacts, gather

    evidence, learn from previous situations and improve over time. When security incidents occur, it is

    important for an organization to effectively manage and respond to them. The speed with which an

    organization can recognize, analyze, and respond to a security incident will limit the impact of the

    damage and potentially lower the cost of recovery [2]. This process carried out by organisations to

    collect, identify, analyze, and determine an organizational response to security incidents is called

    incident management. Processes are usually put in place to deal with incident management. The staff,

    resources, and infrastructure used to perform this function makeup the incident management

    capability.

    Various controls are currently available for organisations, including Security Information and Event

    Management (SIEM) solutions and Security Operations Centres (SOCs) [5].

    SIEM solutions, including HP ArcSight [3], RSA enVision [4], etc. provide real-time analysis of IT

    and security events generated from the IT infrastructure including network hardware, systems,

    applications and services. Various degrees of correlations and analysis are offered by these solutionsto focus on the root causes of events, identify potential critical alerts, store logs for forensic purposes

    and generate reports for compliance management.

    However, the events and alerts generated by these systems often require analysis and further

    processing by security specialists to fully understand their nature and implications, identify critical

    incidents and apply the right strategy to mitigate them.

    In this context, Security Event & Incident ManagementProcesses need to be put in place, along with

    security teams, to execute on the management of events and alerts and the resolution of related

    incidents.

    Figure 4-1 shows the key steps that are usually involved in these processes:

  • 7/29/2019 HPL-2012-206

    15/86

    HP Security Analytics Final Report

    Page 11 of 80

    Figure 4-1. Security Event & Incident Management Processes

    They include:

    Collection of events and alerts: the initial phase, where filtered/correlated events and alertsare provided by SIEM solutions to the security team;

    Triage and initial analysis: it involves an initial analysis and assessment of the events andalerts, to identify potential known problems and false positives. The output of this step iseither a decision to close the event/alert (in case of known issue or false positive) or toproceed with a deeper analysis;

    Collection of additional information from the involved stakeholders: it usually involves thecollection of additional information about events/alerts, potentially from the involvedstakeholders, i.e. the IT department, business areas, etc. This information is necessary togather further evidence and data to assess the seriousness of the event/alert, identify the root

    causes, assess the involved risks and support informed decision making;

    Final Assessment: it involves making a final decision about the nature of the events and alerts.The collected evidence might still indicate that it is a false positive or it should be closed asnot critical for the organisations. Alternatively, a security incident might be raised and/orescalated for remediation;

    Identification of Remediation steps for the Incident: it involves the identification ofremediation steps to deal with the incident. This activity is usually carried out jointly with theinvolved parties within the organisation;

    Incident Remediation: it involves the actual resolution and remediation of the securityincident by the involved entity, along with handling any residual risks due to failures to

    mitigate the security incident.

  • 7/29/2019 HPL-2012-206

    16/86

    HP Security Analytics Final Report

    Page 12 of 80

    It is important to notice that the management of false positives and known problems usually consume

    resources that could be used for the investigation of real security incidents. Ideally, it is desirable to

    minimise the number of false positives. This usually requires fine tuning activities between the SIEM

    solutions (raising events/alerts) and the initial assessment phase.

    The incident management processes are usually run in the context of Security Operation Centres

    (SOC) [5] and/or as part of specialised Computer Security Incident Response Teams (CSIRTS) [6],

    built on top of SOC centres.

    The work of the involved security team(s) can be supported by a variety of technical controls and

    solutions including workflow systems, incident tracking and support solutions, knowledge bases and

    collaborative tools, etc. However, quite often the involved processes are driven by human activities

    and the investigative skills of the security team members. As such, these processes might be subject to

    a variety of issues and potential failures, including:

    Delays due to lack of resources (personnel) to handle large volumes of events/alerts;

    Delays due to complex process steps and failures in human interactions and communications;

    Bottlenecks generated by some of the stakeholders to retrieve relevant information, e.g. by ITdepartments or business groups;

    Failures in assessing the severity of events/alerts and their correct categorization asincidents;

    Lack of tools, expertise and capabilities to remediate security incidents.

    These issues and failures can expose the organisation to additional risks.

    A potential major risk for the organisation is due to delays introduced in assessing and handling

    known incidents: this might postpone the successful resolution of critical security incidents (e.g.

    devices exposed on the Internet, exploitation of privileged accounts, deployed malware, etc.) andallow for further exploitation.

    Another related risk can be introduced by sudden or progressive changes of the threat landscape, due

    to changing economical and social scenarios and/or new business activities. This might create

    unexpected volumes of new events and alerts to be processed by the security team and as such,

    introduce additional delays.

    To summarise, the space of Security Event & Incident Management is complicated and with many

    uncertainty points: it is subject to changing threat environments; it is defined by a variety of processes

    and process steps relying on people, human interactions, controls and information flows; it is subject

    to technical issues and failures.

    It is important for an organisation to understand their risk exposure due to the Incident Management

    processes and identify suitable ways to address related issues, e.g. by introducing process changes

    and/or making investments in security controls.

    HP Security Analytics is uniquely positioned to provide the analysis of the involved risks, explore

    what-if scenarios and provide decision support for decision makers.

    This report discusses the findings of the risk analysis carried out by HP Labs for ServiceBizWorld

    event & incident management processes, along with the exploration of what-if scenarios aiming at

    providing specific decision support and enabling further discussions between the involved

    stakeholders.

  • 7/29/2019 HPL-2012-206

    17/86

    HP Security Analytics Final Report

    Page 13 of 80

    5 Analysis of ServiceBizWorld Security Event & Incident

    Management Processes

    The Security Operations Centre (SOC) team currently runs the core ServiceBizWorld processes for

    (IT) security event & incident management processes.

    ServiceBizWorld has a complex IT and business infrastructure. Part of its IT and service

    infrastructure is run by third parties, either in the context of outsourcing contracts and/or by using

    services in the Cloud.

    In order to assess the specific security event & incident management processes, identify potential risk

    exposures for ServiceBizWorld and explore relevant what-if scenarios, HP Labs requested

    information from key SOC and ServiceBizWorld personnel. Information and data have been provided

    to support this analysis along with detailed explanations of the various process steps and interactions.

    Various stakeholders are involved in these processes:

    The SOC team: it provides the key coordination, supervision and assessment capabilities forthe management of events/alerts and the processing of related incidents;

    The Security Risk Management team: it is the ServiceBizWorld team that provides furtherguidance on priorities and decisions about incident management. It has a longer term,strategic view on security aspects and related policies driven not only by securityrequirements but also by business needs;

    Various third parties, including ServiceBizWorlds IT department, ServiceBizWorldsbusiness groups and the IT departments of external companies, which provide (business and

    IT) services to ServiceBizWorld in the Cloud and/or in outsourcing.

    The relevant SOC Event & Security Incident Management activities can be classified in two types of

    processes, shown in Figure 5-1:

    Events/Alerts Processing

    and Resolution Process

    Security Incident

    Management Process

    Events/Alerts

    - Known Problems

    - False Positives

    - Event Closure

    - Minor Security

    Incidents

    Major Security Incidents

    - Remediated Security Incidents

    - Risk Management

    Figure 5-1. ServiceBizWorlds Security Event & Incident Management Processes

  • 7/29/2019 HPL-2012-206

    18/86

    HP Security Analytics Final Report

    Page 14 of 80

    The first process, referred in this report as Events/Alerts Processing and Resolution Process,

    primarily aims at: providing the initial assessment of events/alerts; gathering of relevant information

    from the involved TPs to further analyse the events/alerts; dealing with the final assessment of

    events/alerts and their classification as false positives, known problems, events to be closed, minor

    security incidents and major security incidents.

    This process is fed by a variety of events/alerts, originated by notifications sent to the SOC team by

    various SIEM tools and TPs. The SIEM tools include, among others, the HP ArcSight [3] and RSA

    enVision solutions [4].

    Specifically, the events/alerts collected from the SIEM system could relate to:

    Managed internal firewalls;

    Managed network devices;

    Domain authentication events;

    Malware security events;

    Authentication events.

    Based on information provided by the SOC team, on average 35-40 events/alerts are processed on

    daily basis, i.e. about 12000-15000 per year.

    Among all these events/alerts, a small part of them involve Major Security Incidents. These are

    critical security incidents for which a full report has to be provided to the Risk Management team and

    that trigger the second process. These are also referred to as Managed Security Incident as the SOC

    team is directly involved in the incident resolution process. On average, 800-900 major security

    incidents are raised on yearly basis.

    The category ofMajor (IT) Security Incidents include: discovery of exposure of ServiceWorldBizs

    devices and systems to the Internet; unauthorised privilege escalation; malware security incidents;

    security log cleaning; misuse of authentication tokens; incidents raised from internal firewalls.

    The second process, referred in this report as Security Incident Management Process is triggered

    when Major Security Incidents are raised by the SOC team.

    This process involves the steps of: identifying and agreeing remediation requirements for major

    security incidents, related to TPs (task usually carried out jointly by the SOC team and the Risk

    Management team); the actual phase involving TPs remediation to the incident.

    Next sections provide further detail of how the information about these two processes has been

    interpreted and captured in Security Analytics models to carry out the investigations described in

    Section 3.

    These models will primarily focus on the interactions between the SOC team and Third Parties (TPs).

    The key interactions with theRisk Management team have also been modelled.

  • 7/29/2019 HPL-2012-206

    19/86

    HP Security Analytics Final Report

    Page 15 of 80

    6 Security Analytics Approach

    Two Security Analytics models have been built, one for each process mentioned in Section 5, in order

    to carry out the analysis of ServiceBizWorlds risk exposure to the Security Event & Incident

    Management Processes:

    1. Model #1: Events/Alerts Processing and Resolution Process;

    2. Model #2: Security Incident Management Process.

    Each model represents the involved process steps, interactions with TPs, information flows and

    decision points.

    These models measure the time to carry out various activities and determine the potential process

    outcomes: false positives, known problems, events to be closed, minor security incidents and major

    security incidents. Additional metrics have been used to count the occurrence of such outcomes.

    These metrics, discussed in detail in Section 7, have been used to provide the required estimates of

    risk exposure and productivity.

    We anticipate that the interactions with TPs introduce a major bottleneck with a negative impact on

    risk exposure and productivity. As such, the related parts of the processes have been subject of

    specific analysis and risk assessment (as well as what-if analysis), as discussed in the next Sections.

    The Security Analytics models have been used to carry out the investigations presented in Section 3,

    i.e. to identify:

    The level of risk exposure for ServiceBizWorld, to manage and mitigate IT security incidents

    with current processes, based on the overall time required to handle these incidents;

    The impact of bottlenecks introduced by TPs, in the current processes;

    The predicted improvements of changing SLAs with TPs and/or introducing new controls withincurrent processes (what-if scenario);

    The level of risk exposure to manage and mitigate IT security incidents, in case of changing threat

    landscapes, based on the overall time required to handle these incidents (what-if scenario);

    The predicted improvements, in case of a changing threat landscape, by introducing new controls

    within current processes (what-if scenarios).

    Specifically, the two models have been built by using a flow chart-based graphical representation.

    To validate them, they have initially been used to provide an estimate of the risk exposure to and

    productivity for current processes. This phase was iterated over a few times with the SOC and RiskManagement teams to ensure that the right process steps and the correct empirical data were captured

    by the. After their validation, the two models have been used to explore what-if scenarios, as

    requested by ServiceBizWorld.

    The visual representations of the models were created by using the HP proprietary Gnosis modelling

    toolset, with the various shapes in the model instantiated with a set of parameters. The models were

    then transformed by the toolset into executable code, which was used to run Monte Carlo simulations.

    This whole analytic chain is shown in Figure 6-1.

  • 7/29/2019 HPL-2012-206

    20/86

    HP Security Analytics Final Report

    Page 16 of 80

    Figure 6-1. Visual Representation of the Model/Experiment/Results Chain

    The two models were drawn by using visual diagrams (shown in the next sections) together with

    detailed descriptions of the involved parameters. The complete list of parameters can be found in

    Appendices A.1 and A.2. The visual diagrams should be interpreted as follows:

    The star-shaped components represent the events that occur at regular intervals as defined

    with a probability distribution in a model (e.g. 7 events/alerts received per day), each star-

    shaped component starts a separate flow that runs in parallel to other flows, within

    simulations;

    The rectangular shapes correspond to the process steps each with an internal logic, that

    consists of time duration (e.g. a step takes 1 day to complete) or some manipulations of

    internal parameters (e.g. number of iterations with TPs);

    The rhombus shapes are if-type decision boxes, that return Boolean values i.e. true or false

    based on Bernoulli trial for a certain probability (e.g. the probability that an event/alert is a

    false positive has a probability of 0.1);

    The people-shaped icons represent resources that are used by different process steps (e.g.

    members/people of the TP security staff);

    The cylinder shape stores the values of certain parameters that, for example, represent the

    current state (e.g. the number of iterations with TPs, the number of incidents, false positives,

    etc.);

    Finally, the process dynamics of the model is captured by the arrows connecting events to

    processes steps and decision boxes.

  • 7/29/2019 HPL-2012-206

    21/86

    HP Security Analytics Final Report

    Page 17 of 80

    7 Models

    This section presents the two models built by HP Labs to carry out the risk and performance analysis

    for the ServiceBizWorlds Event & Incident Management processes. It also describes the set of

    metrics used to convey analytic results and the what-if scenarios that have been explored with the

    models.

    7.1 Model 1: Security Event/Alert Processing and Remediation

    The first model, related to the SOC Security Event/Alert Processing & Remediation Process is shown

    in Figure 7-1.

    The model captures the current process steps, specifically:

    1. Set of Events/Alerts that are analysed and managed within the process;

    2. Various event/alert processing phases, including: Initial Assessment Phase, SOC teams

    Interactions with TPs, SOC teams final assessments;

    3. Various possible outcomes of the execution of the process: event/alert closure, raisingsecurity incident, TP non compliance; handling false positives and known problems.

    The various process steps are annotated with the empirical data gathered from the SOC and Risk

    Management teams. They reflect the current situation in terms of frequency of events, duration of

    process steps and likelihood of making certain decision. The complete list of the empirical data used

    in this model is available in Appendix A.1.

    When executed, this model simulates the arrival of events and alerts and the various processing steps

    in order to determine the outcomes i.e. if an event/alert has to be closed, if it is a false positive/known

    problem or if it has to be escalated as a minor or major security incident. The model has been

    instrumented with various metrics, discussed in this section, to measure the impact in terms of riskexposure and performance.

    The next sections described in detail each of the key parts highlighted in the model.

  • 7/29/2019 HPL-2012-206

    22/86

    HP Security Analytics Final Report

    Page 18 of 80

    Figure 7-1. Model 1: SOC Security Event/Alert Processing and Remediation

  • 7/29/2019 HPL-2012-206

    23/86

    HP Security Analytics Final Report

    Page 19 of 80

    7.1.1 Modelled Process Areas

    7.1.1.1 Events and AlertsFigure 7-2 shows the set of events and alerts that are managed within this process:

    Figure 7-2. Model 1: Events and Alerts

    Specifically there are 2 types of events/alerts that are processed:

    SOC Events and Alerts received from SOC SIEM tools: these are the SOC events and alerts

    received from SIEM tools such as HP ArchSight and RSA enVision. On average, 20 of theseevents/alerts are generated on daily basis;

    Notification of IT Security Events/Alerts from TPs: these are specific notifications of ITsecurity events/alerts from third parties. On average, 15 of these events/alerts are generated ondaily basis.

    7.1.1.2 Initial Assessment PhaseFigure 7-3 illustrates the various process steps that are carried out by the SOC team to initially assess

    the new event/alert:

    Figure 7-3. Model 1: Initial Assessment Phase

    This includes the following steps:

    An initial assessment of the event/alert, usually taking between 1 hour and 2 hours;

    A decision if the event/alert is a known security problem and in case it is, if the threat levelhas been changed. The available empirical data suggests that 10% of cases are known issues.In 5% of them the threat level has changed. In case it is a known problem, it is handled assuch;

    In case it is not a known problem or the threat level changed, a decision is made if there is any

    need to notify the Risk Management team. This usually happens in 15% of cases. In this case,the SOC team has to prepare a briefing for the Risk Management team and agree on the

  • 7/29/2019 HPL-2012-206

    24/86

    HP Security Analytics Final Report

    Page 20 of 80

    relevant remediation. This usually takes between 5 hours and 1 day. An additional decision ismade if this event/alert is actually an incident and its category has been established. Thishappens in 80% of cases. The empirical data suggests that in 10% of cases the securityincident is raised as aMajor Security Incident;

    If there is no need to brief the Risk Management team and/or the incident category has not

    been established, then the process continues with the analysis of the event/alert. This mightrequire gathering information from TPs.

    7.1.1.3 SOC Team Interactions with TPsAdditional information might be required from TPs to identify the root causes of the event/alert or

    gather additional contextual data, audit logs, explanations, etc. Only in 20% of cases no additional

    information is required from TPs. In this case, the SOC team carries out their investigation (usually

    taking between 1 hour and 2 days).

    In 80% of cases, however, an interaction with TPs is required. Figure 7-4 illustrates the involved

    steps:

    Figure 7-4. Model 1: SOC Team Interactions with TPs

    NOTE: We anticipate that this part of the process is critical, both in terms of risk exposure and

    performance. It introduces a bottleneck due to fact that the SOC Team usually has to go through

    multiple iterations with TPs in order to gather the required information: the quality and quantity of

    the provided information is also another potential issue. A key problem is that there are no specific

    policies or SLAs in place, with TPs, mandating criteria and obligations on how to effectively retrieve

    and provide data to the SOC Team.

    This part of the process usually starts with a request for information to a TP by the SOC Team, along

    with an informal definition of the agreed timescale for responses.

    In 40% of cases (on aggregated bases, over the various iterations between the TP and the SOC Team)the TP does not respond in due time (usually after an agreed response period of 1-10 days, depending

  • 7/29/2019 HPL-2012-206

    25/86

    HP Security Analytics Final Report

    Page 21 of 80

    on the circumstances). In this case, a decision is made by the SOC Team to escalate the issue to the

    Risk Management team (10% of cases) and deal with the escalation process (1-3 hours). This situation

    triggers another process step, involving dealing with TP Non Compliance to agreed requests. The

    process continues with the SOC Team asking again for the required information.

    If the required information is provided by the TP (60% of cases, on aggregated basis) the model

    captures various TP behaviours, based on observation provided by the SOC team:

    Usually, 5-7 iterations are required with TPs, in order to gather the right information anddata;

    The first iteration with TPs is critical: it can take between 1 and 10 days to get someinformation back. This is due to various delays and overheads, for example the SOC teamhaving to identify the right TP people to interact with, delays in getting replays, various

    redirections, etc.

    The following iterations provide a better TP response time, usually between 1 and 3 days.

    After receiving some data from the TP, the SOC team carries out their investigation (duration: 1 hour-

    2 days). This might trigger the need to brief the Risk Management team, in 20% of cases (duration: 1-

    3 hours).

    Time is also spent to identify the recommended level category for the event/alert (1-2 hours).

    The provided information might not satisfy the required quantity and quality criteria. This happens in

    30% of cases: an additional request for information is made by the SOC Team.

    If it satisfies the criteria, an additional SOC team investigation is carried out (duration: 1 hour - 1

    day). This might raise the need to brief the Risk Management team again (10% of cases) and prepare

    for the briefing (duration: 3 hours - 1 day).

    If the incident category has been established (80% of cases), than a security incident is raised. In 10%of cases the incident is aMajor Security Incident.

    Otherwise, the SOC Team decides if additional information is still required from TPs. As previously

    mentioned, on average 5-7 iterations with the TP might need to take place before all the relevant

    information is retrieved.

    In case the required information has been provided, the process proceeds with the SOC Team final

    assessment phase.

    7.1.1.4 SOC Team Final AssessmentFigure 7-5 illustrates the steps involved in the SOC Team Final Assessment:

  • 7/29/2019 HPL-2012-206

    26/86

    HP Security Analytics Final Report

    Page 22 of 80

    Figure 7-5. Model 1: SOC Team Final Assessment

    This usually involves an additional analysis activity by the SOC Team (duration: 4 hours -2 days),

    followed by SOC team handling recommendations and agreements made with the Risk Management

    team, which make take between 1 and 5 days.

    The next process step involves the SOC Team providing recommendations about the level category

    for the event/alert (duration: 2-4 hours). In 20% of cases the conclusion is that the event/alert is afalse

    positive. In this case, the procedure for handling false positives is followed.

    Otherwise, in 80% of cases it is asecurity incident. In this case, in 95% of cases a security incident isactually raised. The empirical data suggests that in 10% of cases a Major Security Incident is actually

    raised.

    In all other situations the event/alert is closed.

    7.1.1.5 Event ClosureFigure 7-6 shows the involved steps:

    Figure 7-6. Model 1: Event/Alert Closure

    The process of closing an event/alert usually requires 1 hour2 days. It is followed by the writing of

    a Report for the Risk Management Team, that usually takes between 1 hour and 1 day.

  • 7/29/2019 HPL-2012-206

    27/86

    HP Security Analytics Final Report

    Page 23 of 80

    7.1.1.6 Security IncidentFigure 7-7 shows the various process steps involved in case the event/alert has been classified as a

    (IT) Security Incident:

    Figure 7-7. Model 1: Security Incident

    The actual process followed by the SOC team depends on the type of security incident:

    Major Security Incident (SOC teams Managed Incident): it is a critical incident that triggersthe second process, captured by Model 2 (Security Incident Management). This involves

    raising a major security incident with the Risk Management team (duration: 1-2 hours),recording the incident (0.5-1 hour) and producing an incident report (2-5 days), beforetriggering the second process. On average there are 800-900 Major Security Incidentper year;

    Minor Security Incident: it usually requires raising an internal (SOC Team) minor securityincident (duration: 0.5-1 hour), recording the incident (0.5-1 hour) abd producing a minorincident report (duration: 1-5 hours), usually requiring notifying the Risk Management Teamwith a short email/document.

    7.1.1.7 TP Non ComplianceThe TP Non Compliance process is usually triggered if TP does not provide the required

    information to the SOC Team in due time. Figure 7-8 shows the involved process steps:

  • 7/29/2019 HPL-2012-206

    28/86

    HP Security Analytics Final Report

    Page 24 of 80

    Figure 7-8. Model 1: TP Non Compliance

    The SOC TEAM raises a TP Non Compliance Warning (duration: 1-2 hours), followed by reportingthe TP to the Risk Management Team (duration: 1-5 hours). This might be followed by an escalation

    process to deal with the resolution of the problem or the acceptance of risk (this part is outside the

    scope of this report).

    7.1.1.8 Handling False Positives and Known ProblemsThis phase involves the handling of false positives and known problems, as shown in Figure 7-9:

    Figure 7-9. Model 1: Handling False Positives and Known Problems

  • 7/29/2019 HPL-2012-206

    29/86

    HP Security Analytics Final Report

    Page 25 of 80

    If the event/alert is a false positive, it will be closed in 0.5-1 hour.

    If initially the event/alert was classified as a known problem, a further assessment is made to

    discriminate if it is a known problem with an acceptable risk (20% of cases) or if it is a false positive.

    After this classification, the event/alert is closed, usually in 0.5-1 hour.

    7.1.2 Metrics

    This model has been built to carry out the investigations discussed in Section 3, specifically explore

    the risk exposure and performance issues due to the SOC Teams Event/Alert Processing &

    Remediation Process. Metrics need to be put in place to provide measurements both for the current

    situation and for what-if scenarios.

    As such, the model has been instrumented with a set of metrics to convey outcomes in terms of risk

    and performance. This includes:

    Risk Indicators:

    o Time to Process Events/Alerts in order to Identify (Minor and Major) SecurityIncident: the longer the time to identify the nature of the event/alert, the longer theorganisation is exposed to potential risks, such as exploitation of privileged accounts,spread of malware, exploitation of systems by attackers, etc;

    o Time Spent in Iterations with TPs: this is a subset of the previous time. It specificallymeasure the time that has been spent in interactions with TPs, being these abottleneck. Also in this case, the longer the time, the longer it takes to deal with theevent/alert, hence further exposing ServiceBizWorld to risks;

    o Time Spent for Event/Alerts Closures: this is the time required to close an event/alert,hence determining that there are no additional risks/issues for ServiceBizWorld;

    o Time Spent to deal with TP Non Compliance Situations: this is the time spent toprocess situations of TP non compliance i.e. when TPs do not provide the requestedinformation in due time. TP non compliance causes additional risk exposure for

    ServiceBizWorld due to the additional delays to process events/alerts as well asperformance issues, due to the extra work carried out by the SOC Team to handlethem;

    o Various Counters: counters have been introduced to count how many times anevent/alerts ends up as a (Major or Minor) Security Incident or as a ClosedEvent/Alert, as well as many times there has been a TP Non Compliance issue.

    Performance Indicators:

    o Time Spent in Dealing with False Positives: this is the time required to identify false

    positives and deal with them. The shorter it is the less time SOC Team resources areused for this type of event/alert instead of focusing on potential security issues. This

    time illustrates the impact on efficiency and performance;

    o Time Spent in dealing with Known Problems: this is the time required to identifyknown problems and dealing with them. The same comments made before, for false

    positives, apply.

    o Various Counters: counters have been introduced to count how many times anevent/alerts ends up as a false negative or a known problem.

    Section 8 discusses how these metrics have been used to describe the outcomes for various

    simulations.

  • 7/29/2019 HPL-2012-206

    30/86

    HP Security Analytics Final Report

    Page 26 of 80

    7.1.3 What-if Scenarios

    This model, instantiated with the current empirical data, has been used to run simulations. The results

    have been used to show the current state of risk exposure and performance. Section 8 presents these

    outcomes.

    A number of what-if scenarios have also been explored, based on ServiceBizWorlds requests, aiming

    to investigate the impact of making potential policy changes and/or additional investments in controls

    as well the impact of changing threat landscapes:

    1. Implications of making changes to Service Level Agreements (SLAs) with Third Parties (TPs);2. Impact on risk exposure due to changes of the Threat Environment;3. Impact of making (control) investments to improve Information Availability for TPs.

    The next section provides more details about these what-if scenarios. Section 9 illustrates the results

    obtained from simulations.

    7.1.3.1 Impact of Changing SLAs for TPsThe experimental results discussed in Section 8 show that the interactions with TPs create a major

    bottleneck within theEvent/Alert Processing & Remediation process: this introduces delays that have

    an overall negative impact on the performance and potentially increase the time required to make

    decisions, hence having a negative impact on exposures to risks. This is particularly relevant in case

    of Major Security Incident.

    This what-if scenario aims at exploring the implications of improving the behaviour of TPs by means

    of changing contractual SLAs, in particular about how they interact with the SOC Team to provide

    information. Two main aspects have been investigated:

    The impact of changing SLAs/Policies regulating interactions with TPs

    What the best possible outcome is, if the TP interactions are streamlined and optimised

    Ultimately, the desired outcome of this what-if analysis is to enable discussions with TPs to improve

    the current situation and explore trade-offs involving Performance vs. Costs vs. Security Risks.

    Various types of SLAs could have been explored and investigated. For the purpose of this scenario,

    being the focus on the SOC Team TP interactions, the following types of SLAs (and related

    constraints) have been explored:

    1. Provision of requested information to the SOC Team, within the agreed timeframe;2. Maximum number of allowed TP iterations, for each event/alert, to collect the requested

    information;

    3. Quality and quantity of the information provided in TP responses;4. Time to provide information to the SOC Team, for each iteration.

    It is important to notice that, in this analysis, we only considered the case where the same class of

    SLAs/Policies apply to all involved TPs. This might not necessarily be the case in other

    contexts/scenarios. Security Analytics and related models can help to capture multiple classes of SLAs

    and explore their implications.

    Figure 7-10 illustrates four cases of potential SLAs, along with the related parameters for the Security

    Analytics model, compared against the current situation:

  • 7/29/2019 HPL-2012-206

    31/86

    HP Security Analytics Final Report

    Page 27 of 80

    Cases TP

    Responding

    in Time

    (Aggregated)

    Quality/

    Quantity of Response

    # TP Iterations Time to Provide data to the

    SOC Team, per Iteration

    CURRENT 60%

    (single iteration:

    92%)

    70% 5-7 (average: 6) 1st: 1-10 days

    >1: 1-3 days

    CASE #1 70%

    (single iteration:

    93.1%)

    78% 4-6 (average: 5) 1st: 1-8 days

    >1: 1-3 days

    CASE #2 80%

    (single iteration:

    94.6%)

    86% 3-5 (average: 4) 1st: 1-6 days

    >1: 1-2 days

    CASE #3 90%

    (single iteration:

    96.5%)

    93% 2-4 (average: 3) 1st: 1-4 days

    >1: 0.5-1.5 days

    CASE #4 100%

    (single iteration:

    100%)

    100% 1-2 1st: 1-2 days

    >1: 1 hour-1 day

    Figure 7-10. What-if Analysis: Changing SLAs with TPs

    Each case introduces an improvement in the expected TP performance, compared to the current

    situation. Case #4 potentially describes the best type of SLA that could be agreed with TPs i.e. the one

    that should provide the best possible outcomes, all other aspects being the same.

    For each of these cases, simulations have been carried out to determine the impact on the risk

    exposure. Section 9 presents the results.

    It is beyond the scope of this what-if scenario to indicate which specific controls and investments

    should be put in place, to achieve these SLAs, from a TP perspective. We understand this involves

    complex contractual agreements between the parties and changes in the way work is currently

    organised with TPs.

    However, in general, some of the improvements described in the SLAs can be achieved by optimising

    the communications between the SOC Team and TPs and potentially deploying tools and solutions to

    simplify and automate the involved critical process steps. This includes:

    Defining templates about the type of requested information, for each specific managed eventand alert;

    Defining scripts, at the TP site, to describe how and where to collect the relevant information,to be provided to the SOC Team;

    Providing ways to automate these scripts, for example by using monitoring tools and SIEMsolutions;

    Deploying workflow management solutions spanning across the SOC Team and TPs, to track

    the overall process of handling events/alerts and the information gathering activities.

  • 7/29/2019 HPL-2012-206

    32/86

    HP Security Analytics Final Report

    Page 28 of 80

    7.1.3.2 Impact of Changes of the Threat EnvironmentThis what-if scenario aims at analysing the risk exposure implications in case of the worsening of the

    threat landscape. What if, the threat environment gets worst and worst? What would be the impact, in

    terms of risk exposure, if the number of new events/alerts to be handled by the process increases of

    10%? Or 30%? Or 50%?Various situations and scenarios might introduce sudden changes of the threat landscape:

    Major social and economical events: this might include events such as the Olympic Games inLondon 2012, major economical and financial crisis, etc. that could create a surge of socialengineering attacks and malware to exploit vulnerabilities and gather confidential informationfrom ServiceBizWorlds employees;

    Major changes in the ServiceBizWorlds business model: this might include increasing thenumber of services offered online and/or in the Cloud, hence potentially increasing the attacksurface and introducing new attack/exploitation opportunities for hackers.

    In this scenario we consider the current event/alert processing situation as the baseline.

    Simulations are carried out with the model to explore the implications, in terms of risk exposure, of

    increasing the number of new events/alerts. Specifically the focus is on exploring the impact on the

    time required to identify security incidents, includingMajor Security Incidents.

    A desired outcome of this analysis is to enable discussion with TPs for Contingency Planning as well

    as exploring suitable investment trade-offs between Costs (due to resources/personnel) vs. Security

    Risks.

    In this context it is important to identify critical thresholds, after which event/alert processing delays

    start accumulating, hence exposing ServiceBizWorld to additional security risks.

    The current bottleneck introduced by TP interactions has been the specific focus of this analysis.Whilst ServiceBizWorld has direct control of the SOC Team and its incident management processes

    and can flexibly adapt them to changes, this is not necessarily the case for TPs, especially for the

    external service providers.

    In this context, to effectively carry out the analysis, it has been necessary to introduce the concept of

    Resources/Personnelinto our model, related to the interactions between the SOC team and TPs. The

    number of TP people (potentially part of local security teams) involved in each interaction to address

    the SOC Team requests is limited: an increase in the workload might inevitably introduce delays.

    It has been proven difficult to gather the specific information about the number of involved resources

    (dealing with the event/alert processing activities) from the various TPs. This information is often

    business confidential.

    However, we can leverage Security Analytics and the model to make assumptions. The current model

    for the Event/Alert Processing & Resolution process has been validated by the SOC Team and teh

    Risk Management Team: it reflects the current solution.

    It is possible to use it, make assumptions and derive the required information about resources

    indirectly. Specifically, this involves:

    Modelling the collective impact of the various TPs;

    Modelling the TP allocation of resources/personnel in the context of their interactions withthe SOC Team;

  • 7/29/2019 HPL-2012-206

    33/86

    HP Security Analytics Final Report

    Page 29 of 80

    Doing a reverse engineering with the model, to indirectly determine a reasonable numberof TP resources/personnel that, once added to a new version of our model, produces the samekind of outcomes i.e. as in the current situation - in terms of risk exposure andperformance;

    Allow for some slack in the usage of the current TP resources.

    As previously mentioned, in the current modelled process, we focus on the interactions between TPs

    and SOC Team, see Section 7.1.1.3, to obtain the data requested by the SOC Team.

    The empirical data currently available suggests that the time required to get data from TPs is variable,

    depending if it is the first iteration with a TP or a follow-up iteration:

    1stIteration: 1-10 days;

    Follow-up Iterations: 1-3 days.

    However, not all this time is spent by the involved TP resources in actively doing the work. A major

    overhead is introduced to understand who to interact with, which information to retrieve and from

    where, miscommunications, etc.As such, the overall time required to provide a response by the TPs has been modelled in a more

    accurate way, by splitting it in two parts:

    Overhead time: this time does not really need resource allocations. It involves delays andwaiting time;

    Actual work: this is the time spent to carry out the information gathering work, requiring theutilisation of resources/people.

    Figure 7-11 illustrates the assumption made for the current model (and validated by the SOC Team

    and the Risk Management team) about splitting the current available time for providing data to the

    SOC Team into the overheadand actual worktimes:

    Time to Get Data from TPs in the 1 st

    Iteration

    Time to Get Data from TPs in the follow-

    up Iterations

    CURRENT MODEL 1-10 days 1-3 days

    MODEL

    with TP RESOURCES

    Overhead: 0.57.5 days

    Actual work: 0.52.5 days

    Overhead: 0.450.55 days

    Actual Work: 0.52.5 days

    Figure 7-11. What-if Analysis: Modelling Overhead and Actual Work time allocated to Resources

    Of course different assumptions could have been made, by having more precise information from TPs.

    It is beyond the scope of this report to explore the entire space of assumptions. The main purpose of

    this what-if analysis is to illustrate the potential impact of changes of the landscape by using

    reasonable assumptions and examples. The main goal is to use the simulation outcomes as additional

    scientific data to enable informed follow-up discussions with TPs, in terms of Contingency Planning.

    The current model of the Event/Alert Processing & Remediation Process has been extended, to factor

    in the introduction of resources and the time breakdown, described in Figure 7-11.

  • 7/29/2019 HPL-2012-206

    34/86

    HP Security Analytics Final Report

    Page 30 of 80

    The updated model is shown in Figure 7-12:

    Figure 7-12. What-if Analysis: Updated part of the Model to reflect Resources and Time Breakdown

  • 7/29/2019 HPL-2012-206

    35/86

    HP Security Analytics Final Report

    Page 31 of 80

    The only affected part of the model is the Information Gathering Phase phase, discussed in Section

    7.1.1.3. Figure 7-13 shows the changes:

    Figure 7-13. What-if Analysis: Specific updated part of the Model

    As previously mentioned, this model has been used to carry out various experiments, with different

    numbers of TP resources, to identify a reasonable number of resources that would produce results

    consistent to the ones observed with the current SOC event/alert processing & resolution process.

    Some slack in the utilisation of these TP resources has also been factored in.

    The outcome of these experiments is that if 250 resources (people), spread across the various TPs

    (along with the other assumptions mentioned before), are factored in, the updated Security Analytics

    model provides results comparable to the ones obtained with the initial model (Figure 7-1). More

    details are provided in Section 9, along with the outcomes of this what-if scenario. However, it is

    important to notice that, in this context, the number of identified resources should be purely

    considered as an internal Security Analytics mechanism to enable this type of analysis.

    The updated model has then been used to explore different cases, involving changes in the threat

    environment, in terms of an increase of the number of events/alerts received on daily basis, ranging

    from +10% up to +60%. Figure 7-14 shows the different scenarios that have been analysed:

    Case Events/Alerts per Day

    CURRENT 35

    CASE #1: +10% Events/Alerts 38

    CASE #2: +20% Events/Alerts 42

    CASE #3: +30% Events/Alerts 45

  • 7/29/2019 HPL-2012-206

    36/86

    HP Security Analytics Final Report

    Page 32 of 80

    CASE #4: +40% Events/Alerts 49

    CASE #5: +50% Events/Alerts 52

    CASE #6: +60% Events/Alerts 56

    Figure 7-14. Different Scenarios involving Changes of the Threat Landscape

    For each of these cases, simulations have been carried out to determine the impact on the time to

    identify a security incident (hence the impact on risk exposure) and the actual allocation of resources

    over time, including situations where resources saturates and delays are introduced. Section 9 presents

    the results.

    It is important to notice that in case of resource saturations, the delays are likely to increase over

    time. Hence results depend on the simulated time period. In the context of this what-if analysis, a

    period of time of 6 months has been considered. We wanted to explore the implications for theorganisation of taking no actions during that entire period of time. The analysis can be tuned on a

    different timeframe, based on needs.

    7.1.3.3 Impact of Investments to Improve Information Availability for TPsThis what-if scenario is based on the previous two scenarios. Indeed, TP interactions are a bottleneck

    and changes in the threat landscape can further negatively impact ServiceBizWorlds exposures to

    risks.

    In case of a worsening threat landscape, TPs could inject an increased number of resources

    (personnel) to cope with the additional workload: however, this does not make sense for a variety of

    reasons, including practical and economic ones. Specifically, it would makes no sense for TPs to keepinvesting in additional resources (personnel) without first addressing the root causes of the delays and

    bottlenecks introduced in their interactions with the SOC team.

    Discussions with the SOC Team and the Risk Management Team highlighted that one of the key root

    causes of this problem is the lack of information and/or complexity in retrieving it. In other

    words, TPs might need to spend considerable amounts of time in understanding how and where to

    retrieve the requested information. This would account for most of the delays and overheads discussed

    in the previous sections.

    This is a complex problem to be addressed, as it involves legal, economical and contractual aspects.

    New investments might need to be made by TPs and potentially the SOC Team; contracts and SLAsmight need to be modified.

    However, the main goal of this what-if scenario is to illustrate the implications of making these

    improvements and analyse the consequences, in particular in case of changes of the threat landscape

    (worsening threat environment). These outcomes can be used as the starting point, i.e. as additional

    evidence, for joint discussions, involving ServiceBizWorld, TPs and the SOC Team towards making

    changes.

    In this context, various controls have been considered as potential ways to improve the current

    situation:

  • 7/29/2019 HPL-2012-206

    37/86

    HP Security Analytics Final Report

    Page 33 of 80

    Definition ofSOC Teams templates about the type of requested information, for each specificmanaged event and alert. Definition of scripts, at the TP site, to describe how and where tocollect the relevant information, to be provided to the SOC Team;

    Script automation coupled with advanced monitoring and correlation tools (e.g. SIEMsolutions [3], [4]);

    Deployment of workflow management solutions (e.g. [9]) spanning across the SOC Team andTPs, to track the overall process of handling events/alerts and the information gatheringactivities.

    A few assumptions have been made about the impact of introducing these controls on the overall

    process, based on the technical capabilities and implications of the mentioned controls. Figure 7-15

    illustrates three different cases that have been explored with Security Analytics simulations:

    Investigation 1 -

    Cases

    Additional Capabilities/Controls Parameters in the Model

    (Information Gathering Phase)

    CURRENT NONE Overhead: 1st Iteration: [0.5 days, 7.5 days]

    Next Iterations: [0.45 days, 0.55 days]

    ctual work: [0.5 days, 2.5 days]

    CASE #1- SOC Teams templates defining the

    types of required information

    - Manual scripts, at the TP site,describing how to retrieve data

    Overhead: 1st Iteration: [0.4 days, 6 days]

    Next Iterations: [0.35 days, 0.4 days]

    ctual work: [0.4 days, 2 days]

    Note: Improvement of 20% compared to current situation

    CASE #2 As above. In addition:

    - Improved Monitoring Solutions at TPsites

    - Automated execution of scripts toretrieve data

    Overhead: 1st Iteration: [0.25 days, 3.75 days]

    Next Iterations: [0.225 days, 0.25 days]

    ctual work: [0.25 days, 1.25 days]

    Note: Improvement of 50% compared to current situation

    CASE #3 As above. In addition:

    - Workflow automation spanning acrossthe SOC Team and TPs

    Overhead: 1st Iteration: [0.1 days, 1.5 days]

    Next Iterations: [0.09 days, 0.1 days]

    ctual work: [0.1 days, 0.5 days]

    Note: Improvement of 80% compared to current situation

    Figure 7-15. What-if Investigation 1: Different Controls to Improve Information Availability

    It has been estimated that, in Case 1, the introduction of templates and manual scripts will provide an

    improvement of the overall TPs performances (see indicators in the table) of 20%, compared to the

    current situation. The additional automation of the scripts, coupled with monitoring solutions (Case 2)

    will introduce an improvement of 50%, compared to the current situation. Finally, the introduction of

    workflow automation (Case 3) could introduce an improvement of 80%, compared to the current

    situation.

  • 7/29/2019 HPL-2012-206

    38/86

    HP Security Analytics Final Report

    Page 34 of 80

    This what-if investigation, referred as investigation 1, still assumes that there will be 4-7 iterations

    with TPs, for each information request coming from the SOC Team. This is the worst case scenario,

    where no improvement happens in terms of reducing the number of iterations. This might be justified

    with the fact that the improvement in speed-up the retrieval of information is not matched by

    improvements in the communication with the SOC Team, hence still the need for multiple iterations.

    A variant of this investigation, referred as investigation 2, has factored in the situation where the

    number of required interactions with TPs is also affected, i.e. it progressively diminish by introducing

    more sophisticated controls. Figure 7-16 illustrates the updated three what-if cases that have been

    explored with Security Analytics simulations:

    Investigation

    2 - Cases

    Additional

    Capabilities/Controls

    Parameters in the Model

    (Information Gathering Phase)

    Number of TP

    Iterations (per

    event/alert)

    CURRENT NONE Overhead: 1st Iteration: [0.5 days, 7.5 days]

    Next Iterations: [0.45 days, 0.55 days]

    Actual work: [0.5 days, 2.5 days]

    5-7 iterations

    (Average: 6)

    CASE #1- SOC Teams templates

    defining the types ofrequired information

    - Manual scripts, at the TPsite, describing how toretrieve data

    Overhead: 1st Iteration: [0.4 days, 6 days]

    Next Iterations: [0.35 days, 0.4 days]

    Actual work: [0.4 days, 2 days]

    ote: Improvement of 20% compared to current situation

    4-6 iterations

    (Average: 5)

    CASE #2 As above. In addition:

    - Improved Monitoring

    Solutions at TP sites- Automated execution of

    scripts to retrieve data

    Overhead: 1st Iteration: [0.25 days, 3.75 days]

    Next Iterations: [0.225 days, 0.25 days]

    Actual work: [0.25 days, 1.25 days]

    ote: Improvement of 50% compared to current situation

    2-3 iterations

    (50% chances 2 or 3)

    CASE #3 As above. In addition:

    - Workflow automationspanning across the SOCTeam and TPs

    Overhead: 1st Iteration: [0.1 days, 1.5 days]

    Next Iterations: [0.09 days, 0.1 days]

    Actual work: [0.1 days, 0.5 days]

    ote: Improvement of 80% compared to current situation

    1-2 iterations

    (50% chances 1 or 2)

    Figure 7-16. What-if Investigation 2: Introducing Improvements in the Number of TP Iterations

    For each of the above 3 cases, both for Investigations 1 and 2, the parameters defined within the

    model have been updated and a new version of the model generated. Each updated version of the

    model has been used in simulations, to explore the specific implications and impact of threat

    environments getting worst and worst.

    Figure 7-17 shows the different threat environment situations that have been analysed, against each of

    the 3 what-if cases (for investigations 1 and 2) discussed above:

  • 7/29/2019 HPL-2012-206

    39/86

    HP Security Analytics Final Report

    Page 35 of 80

    Case Events/Alerts per

    Day

    CURRENT 35

    THREAT CASE #1: +10% Events/Alerts 38

    THREAT CASE #2: +20% Events/Alerts 42

    THREAT CASE #3: +30% Events/Alerts 45

    THREAT CASE #4: +40% Events/Alerts 49

    THREAT CASE #5: +50% Events/Alerts 52

    THREAT CASE #6: +60% Events/Alerts 56

    THREAT CASE #7: +70% Events/Alerts 59

    THREAT CASE #8: +80% Events/Alerts 63

    THREAT CASE #9: +90% Events/Alerts 66

    Figure 7-17. Different Scenarios involving Changes of the Threat Landscape

    Simulations have been carried out to determine the impact on the time to identify a security incident

    (hence the impact on risk exposure) and the actual allocation of resources over time. Section 9

    presents the results.

  • 7/29/2019 HPL-2012-206

    40/86

    HP Security Analytics Final Report

    Page 36 of 80

    7.2 Model 2: Incident Management Processes

    The second model, related to the Incident Management Processes is shown in Figure 7-18. This

    process is triggered by the process described in the first model, when Major Security Incidents are

    identified by the SOC team.

    The model captures the current process steps:

    1. Notification of Major (IT) Security Incidents;

    2. Various incident processing phases, including: Initial Assessment Phase, Information

    Gathering from the Risk Assessment Team, TP Handling of the Incident Response;

    3. Various possible outcomes of the execution of the process: TP Incident Remediation and Risk

    Management.

    It is important to notice that this process (and related model) is of secondary importance for the

    purpose of the risk assessment analysis. It presents a subset of the issues already investigated in the

    first process (e.g. TP interaction bottleneck): furthermore, it is not a critical process as it is only

    triggered, on average, 800-900 times a year, i.e. each time a Major Security Incident is identified.

    However, we believe that it is worth providing an explicit representation of this process and further

    analysing the implications in terms of the additional risk exposure for ServiceBizWorld, when dealing

    with the remediation of major security incidents. This analysis is very similar to the one carried out

    for the first process.

    Also in this case, the various process steps captured by the model are annotated with the empirical

    data gathered from the SOC and Risk Management teams. They reflect the current situation in terms

    of frequency of events, duration of process steps and likelihood of making certain decision. The

    complete list of the empirical data used in this model is available in Appendix A.2.

    When executed, this model simulates the arrival ofMajor (IT) Security Incidents and the various

    processing steps, in order to determine the outcomes. The model has been instrumented with various

    metrics, discussed in this section, to measure the impact in terms of risk exposure and performance.

    The next sections describes in detail each of the key parts highlighted in the model.

  • 7/29/2019 HPL-2012-206

    41/86

    HP Security Analytics Final Report

    Page 37 of 80

    Figure 7-18. Model 2: Security Incident Management Process

  • 7/29/2019 HPL-2012-206

    42/86

    HP Security Analytics Final Report

    Page 38 of 80

    7.2.1 Modelled Process Areas

    7.2.1.1 EventsFigure 7-19 shows that Major Security Incidents trigger the Incident Management Process, on average

    800-900 times per year (about 3 incidents per day):

    Figure 7-19. Model 2: Events for Security Incident Management Process

    7.2.1.2 Initial Assessment PhaseFigure 7-20 shows the process steps involved in the incident assessment phase for Major Security

    Incidents:

    Figure 7-20. Model 2: Incident Assessment Phase

    This phase involves an exchange of available information between the SOC Team and the Risk

    Management team (duration: 1 hour -1 day) along with a joint analysis of the incident (duration: 1

    hour1 day). These steps are followed by the SOC Team writing a summary report (duration: 1 hour-

    4 hours).

    This step might be followed by the need to collect additional information from the Risk Management

    Team, in terms of priorities, risk mitigation practices and directions.

    7.2.1.3 Information Gathering from Risk Management TeamFigure 7-21 describes the process steps involved in this phase:

  • 7/29/2019 HPL-2012-206

    43/86

    HP Security Analytics Final Report

    Page 39 of 80

    Figure 7-21. Model 2: Information Gathering from Risk Management Team

    In 70% of cases, additional information about the security incident needs to be collected from the Risk

    Management Team. In this case a request for additional information is made to the Risk Management

    Team (duration: 1-4 hours). The requested information is usually provided in a timeframe of 1 hour -8

    hours. This step is followed by an additional joint analysis involving the SOC Team and the Risk

    Management Team (duration: 1 hour1 day).

    This step is usually followed by an agreement, between the SOC Team and the Risk Management

    team, about the remediation requirements, to handle the security incident. This step usually requires 1-

    2 hours.

    7.2.1.4 TP Handling of Incident ResponseFigure 7-22 describes the process steps involved in this phase:

    Figure 7-22. Model 2: TP Handling of Incident Response

    This is the phase where a TP is in charge of handling their response to the security incident.

    The SOC Team provides the TP with recommendations on how to handle the incident (duration: 1

    hour2 days).

    This step is potentially followed by a set of interactions between the SOC Team and the TP:

    The TP provides a remediation document to the SOC Team (duration: 5 days -40 days);

    The SOC Team carries out an analysis of the outcomes and the proposed mitigations

    (duration: 4 hours -2 days);

  • 7/29/2019 HPL-2012-206

    44/86

  • 7/29/2019 HPL-2012-206

    45/86

    HP Security Analytics Final Report

    Page 41 of 80

    Time to Remediate an Incident: the longer the time to remediate the incident, the longer theorganisation is exposed to the potential risks, such as exploitation of privileged accounts,spread of malware, exploitation of devices accessible via the Internet, etc;

    Time Spent in Iterations with TPs: this is a subset of the previous time. It specificallymeasures the time that has been spent in interactions with TPs. Also in this case, the longer

    the time, the longer it takes to deal with the security incident, hence further exposingServiceBizWorld to risks;

    Time Spent to Deal with Risk Management: this is the time to process the incident anddetermine that it cannot be remediated, hence dealing with risk management;

    Various Counters: for completeness, counters have been introduced to count how many timesan incident ends up to be remediated or in a Risk Management situation. However, their valueis obvious, given the simple nature of this model, based on the known frequency of thesecurity incidents (800-900 major security incidents per year) and the fact that 90% of themwill be remediated.

    Section 8 discusses how these metrics have been used to describe the outcomes of the simulations.

    7.2.3 What-If Scenarios

    7.2.3.1 Impact of Investments to Improve the Process PerformanceThe current Security Incident Management Process is relatively simple, compared to the SOC

    Event/Alert Processing & Remediation process. In addition it is not very critical, as (on average) it is

    executed 800-900 times per year, when Major IT Security Incidents occurs.

    However this process (similarly to the other one) is affected by thepotential bottleneck introduced by

    TP interactions. Furthermore, the current empirical data reveals that it could take up to 40 days for

    TPs to remediate incidents. This is a very large period of time, that can potentially expose

    ServiceBizWorld to additional risks (such as exploitation of the root causes of the incident).

    What-if investments are made to improve the current situations, primarily to reduce the involved

    delays? Despite the process being simple and the outcomes intuitive, it still make sense to explore the

    types of investments and controls that could be deployed and assess their impact.

    Two kind of issues need to be addressed.

    The first issue is related to the Handling of the Incident Response phase. Various TP interactions

    might occur (up to 3), each of them potentially requiring long periods of time to get TPs remediation

    documents (potentially up to 40 days each).

    In this case various kinds of investments and solutions could be considered to potentially improve theperformance. For the purpose of this what-if analysis, the following types of solutions have been

    considered, each of them with an increasing level of impact and automation:

    Remediation Templates to describe common remediation activities for known, recurrentsecurity incidents;

    Knowledge-base and collaborative solutions (such as [7]) to provide guidance aboutacceptable remediation plans and track their evolution against agreed SLAs/deadlines;

    Incident Response Plan solutions (such as [8]) to automate the overall process of handling thedefinition of remediation plans.

    The second issue is related the TP Incident Remediation phase. As mentioned above, it might

    require a long period of time for TPs to remediate to incidents (up to 40 days).

  • 7/29/2019 HPL-2012-206

    46/86

    HP Security Analytics Final Report

    Page 42 of 80

    Also in this case various kinds of investments and solutions could be considered to potentially

    improve the performance. For the purpose of this what-if analysis, the following types of solutions

    have been considered, each of them with an increasing level of impact and automation:

    Manual templates and scripts to carry out remediation activities for known, recurrentsecurity incidents;

    Automated templates, knowledge-base and scripts to carry out remediation activities;

    Change management tools and solutions to track and automate the overall remediationactivities (such as [9]).

    The combined deployment of these controls and solutions at the TP sides is likely to affect the

    following aspects:

    Number of required iterations with TPs to obtain a satisfactory response in terms ofremediation plans/documents;

    Time required by TPs to provide the remediation plans/documents;

    Time required by TPs to remediate the security incidents.

    Figure 7-25 shows the cases and assumptions that have been explored in this what-if analysis

    involving the deployment of combinations of the above controls:

    Cases Additional Capabilities/Controls Parameters in the Model

    CURRENT NONE TP Handling of Incidence Response

    Time to Produce Remediation Documents:

    [5 days, 40 days]

    Num TP Iterations: (1,3)

    TP Incident Remediation

    Time to TP Incident Remediation:

    [5 days, 40 days]

    CASE #1 TP Handling of Incidence Response

    - Remediation Templates

    TP Incident Remediation

    - Manual templates and scripts

    TP Handling of Incidence Response

    Time to Produce Remediation Documents:

    [4 days, 32 days]

    Num TP Iterations: (1,


Recommended