SUCCESS D2.5 v1.0 The Resilience by Design concept, V2


The research leading to these results has received funding from the European Union’s Horizon 2020 Research and Innovation Programme, under Grant Agreement no 700416.

Project Name: SUCCESS
Contractual Delivery Date: April 30, 2018
Actual Delivery Date: April 30, 2018
Contributors: RWTH
Workpackage: WP2 – Security, Resilience and Survivability by Design
Security: PU
Nature: R
Version: 1.0
Total number of pages: 42

Abstract: Resilience is a key characteristic to be considered while designing critical infrastructures’ monitoring and automation network systems. An approach to improve resilience applying virtualization techniques to the use case of decentralised automation functions has been investigated in SUCCESS. This technique, called Double Virtualization, allows the automation system to dynamically reallocate the specific controlling functions in use and also enables scaling of the solution with respect to computational power. This report describes how we realised our Double Virtualization concept in the laboratory setup. We report on how Substation Automation Functions have been virtualized, run over the Double Virtualization implementation and tested on cloud platforms.

Keyword list: Security, Communication, Threat, Countermeasure, Double Virtualization, Cloud Computing, Virtual Instance

Disclaimer: All information provided reflects the status of the SUCCESS project at the time of writing and may be subject to change.


Executive Summary

Resilience is one of the key characteristics to be considered while designing a critical infrastructure system. For example, next-generation power systems, in which power generation based on renewable energy sources is increasingly integrated and in which distributed automation architectures are being highly digitalised to support the functioning of the system, will become vulnerable to failures and cyber-attacks. Hence, there is a need to provide enhanced resilience functionality to such systems. This is the challenge addressed in the scope of Task 2.4 of the SUCCESS project.

SUCCESS applies a new Double Virtualization concept to critical infrastructure management and automation architectures to increase their resilience. A joint reconfiguration logic has been implemented, based on separating the decentralised distribution automation functionality from the physical systems the applications run on. The functionality is virtualised and then dynamically relocated to different physical systems, so that an attacker has to know on which hardware it is currently running in order to attack it. This dynamic reallocation also makes the power system more resilient against single points of failure and cyber-attacks.

Moreover, a study on the criticality of a Distribution Automation system’s components has been carried out in order to identify which components or functions are the most critical in a reference architecture for power systems’ monitoring and management. A general methodology for such an analysis is proposed, which could also be applied to other domains.


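Authors

Partner | Name | e-mail
--- | --- | ---
Rheinisch-Westfaelische Technische Hochschule Aachen (RWTH) | Abhinav Sadu | [email protected]
Rheinisch-Westfaelische Technische Hochschule Aachen (RWTH) | Gianluca Lipari | [email protected]
Kungliga Tekniska Hoegskolan (KTH) | György Dán | [email protected]
Kungliga Tekniska Hoegskolan (KTH) | Peiyue Zhao | [email protected]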


Table of Contents

1. Introduction
  1.1 How to read this document
  1.2 Relation to previous Deliverables
2. Threats and Countermeasures
3. Functional (or operational) Resilience of Distribution Grid Automation Systems
  3.1 Introduction
  3.2 Adopted methodology for designing the functional resiliency of the DA
  3.3 Reference De-centralized DA architecture: IDE4L
    3.3.1 Introduction
    3.3.2 IDE4L architecture summary
      Overview
      Actors & use cases
  3.4 Identification of critical component of the IDE4L architecture
    3.4.1 Introduction
    3.4.2 Petri Net-based modelling
      Mathematical formulation
      Mapping of IDE4L architecture to CPNs
      Monitoring use cases
      Control use cases
      Protection use cases
      Complete system
    3.4.3 Identification of critical components: Test results
  3.5 Availability analysis of the automation system
    3.5.1 Introduction
    3.5.2 Stochastic Petri Net based failure modelling
      Failure models
    3.5.3 Performance evaluation
      Key performance index
      Test cases
      Test results
4. Resilient Virtual Process Control Function Placement Algorithm
  4.1 Introduction
  4.2 System Model and Problem Formulation
    4.2.1 Failure Scenarios
    4.2.2 Cost Model
    4.2.3 Problem Formulation
  4.3 Resilient VPF Placement Algorithm
  4.4 Numerical Results
5. Triggering Virtual Control Relocation
  5.1 Relocation Triggering by DSO Entity
  5.2 Relocation Triggered by the Edge Cloud
6. Implementation of flexible migration of SAU: Double virtualization with Calvin
    6.1.1 Introduction
    6.1.2 Calvin framework
    6.1.3 Test case implementation: Lab setup
      Test lab setup
    6.1.4 Implementation of the grid monitoring application with CALVIN
    6.1.5 Test case & results
7. Application of Double Virtualization on other critical infrastructures
8. Conclusion
9. References
10. List of Abbreviations
11. List of Figures
12. List of Tables


1. Introduction

This report describes the realization of the Double Virtualization concept in power systems to enable resilience against cyber-attacks. The resilience of a system describes its ability to continue functioning even if one or more of its components have failed. In the current transformation of power grids, power generation is becoming more decentralised, and Distribution Automation architectures should adapt to this situation by also becoming more decentralised. In such a scenario, resilience becomes a key characteristic for their seamless functioning. The power network is then built from the bottom up by aggregating control cells, each of which is a combination of virtualised automation functions and power infrastructure. This allows reconfiguration at any time to enable resilience in the system, which acts as a countermeasure against cyber-attacks.

Cyber-attacks are increasingly performed by highly sophisticated groups. The attacks can be directed at both power and communication resources. The resilience technique explored in the SUCCESS project is mainly based on the cloud computing paradigm. In principle, system resilience is ensured by enabling fast relocation of cloud virtual resources when a security incident (attack) is identified. This can be realized by an effective algorithm that allocates the cloud resources in response to component failures and cyber-attacks, with the objective of minimizing the expected operational cost.

With respect to communication technologies, system resilience can be increased by separating the functional layer from the data layer in the virtual environment. The functional layer consists of the decentralised distribution automation functions, and the data layer consists of the database used to store the measurement values to which these automation functions are applied. In this way the effectiveness of attacks is reduced: combined attacks targeting both layers simultaneously are complex to execute, since they require different approaches and methodologies to be applied at the same time on different targets. This approach is named Double Virtualization in SUCCESS.

The technologies used in our work have been well known in the ICT domain for almost a decade. Our innovation is to apply these techniques to the use case of critical infrastructure resilience. The objective of the presented work is to quantify the improvement in the overall availability of the key automation functions with and without Double Virtualization. The aim is not to determine how fast these functions are made available again after a failure of the automation components, but to determine their steady-state probability of availability.

1.1 How to read this document

This document builds on the concept of resilience, among other concepts such as security and survivability, which are introduced in D4.3 [1]. Therefore, the reader should be familiar with the contents of D4.3 before reading this document. The structure of this document is as follows:
• Threats to which Double Virtualization is applied as a countermeasure are listed in Chapter 2.
• A methodology for the analysis of the criticalities in a Critical Infrastructure monitoring and automation architecture, based on Petri Net analysis, is presented in Chapter 3.
• A Resilient Virtual Control Functions reallocation algorithm is presented in Chapter 4.
• The triggering mechanism of the control functions migration, and the interface between the Security Operations Centre and the Breakout Gateway, are described in Chapter 5.
• How Double Virtualization is realised in SUCCESS is described in Chapter 6, together with lab test results.
• Finally, a brief analysis of the possible application of Double Virtualization to other critical infrastructures is presented in Chapter 7.


1.2 Relation to previous Deliverables

This Deliverable builds on D2.4 [2], which is now deprecated. In comparison with D2.4, it presents the evolution of the Double Virtualization implementation, which is now based on the Calvin IoT platform, an actor-based open source virtualization environment that permits faster migration of the control functions in response to an attack. Additionally, general considerations on its applicability to other critical infrastructures are presented.

Moreover, a methodology based on Petri Nets for assessing the criticality of each component in a Distribution Automation architecture is presented. This methodology can be used to identify the most critical components and thus to prioritize them in the application of the Resilience by Design concept.

Finally, a new resilient control functions placement algorithm is presented. This optimal placement strategy aims at minimizing the expected operational cost of allocating Virtual Process Control Functions on the available computing resources, especially on distributed computational and storage resources such as Mobile Edge Computing (MEC) nodes.


2. Threats and Countermeasures

Today, cyber-attacks are directed at both power and communication resources. Deliverable 1.2 [3] contains the identified threats to the system, particularly in Chapter 3.2 “Success Centric Threats”. Those threats are mapped to Double Virtualization as a countermeasure in Deliverable 4.6 [4], in Chapter 3 “Mapping of Threats to Security Incidents and Countermeasures”. The threats shown in Table 2.1 are fully or partially covered by Double Virtualization as a countermeasure.

D4.6 lists the incidents to which Double Virtualization can be applied as a countermeasure. Relevant incidents are labelled as cyber-security related incidents (CS-1, CS-3 and CS-5) and physical-security incidents (PS-3, PS-4 and PS-5). How Double Virtualization mitigates the listed incidents is explained in detail in D4.4.

Table 2.1 - Threats where Double Virtualization can be applied as countermeasure

Threat No. | Threat
--- | ---
T001 | Distributed Denial of Service (DDoS)
T002 | Smurf attack
T003 | TCP/SYN flooding
T004 | UDP flooding
T005 | Teardrop attack
T101 | Man-in-the-middle
T102 | Eavesdropping
T103 | Masquerade
T105 | User Impersonation
T106 | Service spoofing
T107 | Traffic analysis
T109 | Disclosure
T110 | Replay attacks
T112 | Deception
T113 | Session hijacking
T201 | Malware
T202 | Virus/Worms
T203 | Trojan Horse
T204 | Trapdoor
T205 | Rootkits
T206 | Elevation of privileges
T207 | Spyware
T208 | Fake SSL certificate
T209 | Signed malware
T307 | Rogue hardware
T401 | IP hijacking
T403 | DNS poisoning
T404 | Falsification of record
T405 | Time synchronization attack
T410 | Brute Force
T501 | Theft of fixed hardware
T505 | Unauthorized physical access
T605 | Loss of information in the cloud
T701 | Stealing sensitive data
T702 | Injection of Viruses
T703 | Disturbing availability
T704 | Compromised actors
T705 | Accidental leaks/sharing of data by employees (naive insiders)
T706 | Malicious insiders (saboteur, disloyal employees)
T707 | Tech savvy insiders


3. Functional (or operational) Resilience of Distribution Grid Automation Systems

3.1 Introduction

A Distribution Automation (DA) system uses digital sensors and switches with advanced control and communication technologies to automate feeder switching; voltage and equipment health monitoring; and outage, voltage and reactive power management. Automation can improve the speed, cost and accuracy of these key distribution functions to deliver reliability improvements and cost savings to customers. The goal of Advanced Distribution Automation is real-time adjustment to changing loads, generation and failure conditions of the distribution system, usually without operator intervention. This necessitates control of field devices, which implies enough information technology (IT) development to enable automated decision making in the field and relaying of critical information to the utility control centre. The IT infrastructure includes real-time data acquisition and communication with utility databases and other automated systems. Accurate modelling of distribution operations supports optimal decision making at the control centre and in the field.

A DA architecture involves different entities (actors) responsible for the safe and reliable operation of the Distribution grid Automation (DA) system. The automation architectures of the DA vary depending upon the energy market unbundling, energy policies, advancement of the ICT infrastructure, cost of deploying the automation infrastructure, consumer behaviour, incentives given to DSOs for deploying energy-efficient infrastructure, etc. Hence, while designing countermeasures for improving the resiliency of such DA systems, a thorough analysis is required to identify the critical components of the DA system and to provide appropriate security measures that increase the availability of the key automation functionalities.
3.2 Adopted methodology for designing the functional resiliency of the DA

In this study, a generic workflow is proposed, as shown in the flow chart in Figure 3.1. A reference architecture is adopted for this study, but the workflow could be extended to any automation architecture deployed in the field by any DSO. With the automation architecture, including the actors and their semantics, as input, the critical components in the architecture are identified first. Once the critical components are identified, the availability of the automation functionalities when critical components fail is evaluated. Additionally, the improvement in the availability of these functions achieved by the security countermeasures adopted for the critical components is measured. Appropriate countermeasures are then shortlisted, implemented and tested.

[Flow chart: Assumptions of reference DA architecture → Identification of critical component in DA architecture → Availability analysis of the DA architecture (with & without countermeasure) → Implementation of countermeasure]

Figure 3.1 – Methodology for designing functional resiliency of DA


3.3 Reference De-centralized DA architecture: IDE4L

3.3.1 Introduction

In this section a reference Distribution Automation (DA) architecture is presented. The goal is to describe the general components of a DA architecture and to point out the criticalities that should be taken into account. We use the IDE4L architecture as a reference, since it is designed to be modular and hierarchical, which makes it easy to highlight the interdependencies between the architecture’s layers and functions. This set of features is also common to the typical structure of a Supervisory Control and Data Acquisition (SCADA) system, a solution widely used for the monitoring and control of critical infrastructures. The IDE4L architecture includes, and to some degree extends, the concept of a SCADA system, see Figure 3.2. Hence, the results presented in the following chapters can easily be generalised to SCADA systems. A brief description of the architecture is given here; further details can be found in [5] [6].

3.3.2 IDE4L architecture summary

Overview

The IDE4L automation concept dynamically integrates end-use energy services with real-time network operations. Measurement data and controls may be merged, analysed and utilized at different levels to integrate numerous measurement points and Distributed Energy Resources (DERs). Partly decentralizing the monitoring and decision-making process also supports management in fast-changing conditions. The automation concept revolves around three design points:
• Hierarchical and distributed control architecture in distribution network automation,
• Virtualization and aggregation of DERs via aggregator, and
• Large scale utilization of DERs in network management.

Figure 3.2 – IDE4L automation architecture for active network management

Actors & use cases

A High-level use case (HLUC) describes a general requirement, idea or concept independently of a specific architectural solution. Ten high-level use cases are defined in the IDE4L project; they are listed below:


• Monitoring cluster
  o Real time monitoring
  o Forecast
  o System updating

• Control cluster
  o Power control
  o Protection
  o Power quality
  o Network planning

• Business cluster
  o Interaction among Commercial Aggregator, DSO and TSO – Operation Domain
  o Interaction among Commercial Aggregator, DSO and TSO – Market Domain
  o Grid tariffs

Figure 3.3 – IDE4L HLUCs schematic, with synthetic representation of information exchange

Real time monitoring includes the collection, filtering and storage of information coming from measurement devices installed in the grid and from information services such as the Customer Information System (CIS) and the Network Information System (NIS). Power control use cases include three levels of control, starting from local device control, going on to substation automation actions and finally to control centres. Finally, the business cluster refers to the procedures for purchasing energy and flexibility services in the distribution grid. A list of Primary Use Cases (PUC) in IDE4L, clustered following the HLUC schema, is reported in Table 3.1:


Table 3.1 - IDE4L PUCs grouped by HLUCs and Clusters

Clusters | High Level Use Cases | Primary Use Cases
--- | --- | ---
Monitoring | Real-Time Monitoring | MV Real-Time Monitoring; LV Real-Time Monitoring; MV State Estimation; LV State Estimation; Dynamic Monitoring for TSO
Monitoring | Forecasting | MV Load and State Forecast; LV Load and State Forecast
Monitoring | System Updating | Network Description Update; Protection Configuration Update
Control | Power Control | MV Network Power Control; LV Network Power Control; Control Centre Network Power Control
Control | FLISR | Decentralized FLISR; Microgrid FLISR
Control | Power Quality | Power Quality Control
Control | Network Planning | Target Network Planning; Expansion Planning; Commercial Aggregator Asset Planning
Business | Interaction among Commercial Aggregator, DSO and TSO – Operation Domain | SRP and CRP Day-Ahead and Intra-Day Market Procurement; Conditional re-profiling activation (CRP Activation)
Business | Grid Tariffs | Day-Ahead Dynamic Tariff; Day-Ahead Demand Response

3.4 Identification of critical component of the IDE4L architecture

3.4.1 Introduction

As shown in Figure 3.2, the IDE4L automation architecture for active distribution grids is complex. It involves multiple actors exchanging heterogeneous data to ensure safe operation of the distribution grids. Each actor has its specific roles, and all the actors have to coordinate with each other to operate the grid. The significance of each actor is different. It is important to find the most critical actor/component in the automation architecture, that is, the one used by the majority of the functions needed for the automation of the distribution grid. These functions typically are the following:

• Monitoring of distribution grid
  o Real-time state estimation
  o Load forecasting
  o Network data update

• Control of distribution grids
  o Volt/VAr control
  o Power congestion management
  o Load balancing

• Protection & service restoration
  o Fault localisation and isolation
  o Optimal service restoration

There are different use cases pertaining to each functionality, and the automation system should be able to realize all of them. However, not all actors are involved in every use case. The actors need to be ranked according to the number of use cases they are involved in, since the unavailability of an actor that is involved in a large number of use cases can bring down the complete automation system. Appropriate levels of countermeasures against cyber-physical attacks should be considered to safeguard the actors. In order to identify the actors that need the highest level of protection, given their critical role in the overall system operation, the different use cases, with their actors and data exchanges, have been modelled using Coloured Petri Nets (CPN). The detailed modelling with CPN is presented in the following subsections. For this study, automation of the Medium Voltage (MV) distribution grid is considered. Nevertheless, the same analysis could also be applied to Low Voltage (LV) distribution grid automation.

3.4.2 Petri Net-based modelling

A Petri Net (PN) is a graphical and mathematical modelling language for describing the static structure and dynamic behaviour of a Discrete Event Dynamic System (DEDS). It is a structured description tool that is able to represent synchronization and parallel logical relationships. A Petri Net has both a rigorous mathematical expression and an intuitive graphical representation. As a graphical tool, it can clearly represent concurrent, asynchronous, conflicting, distributed, parallel and other behaviours. As a mathematical tool, it can build a model from state equations to perform reliable mathematical analysis. As a result, it can describe conflict and concurrency processes while clearly representing the static and transient states of the system.

A Petri Net includes several elements such as places, transitions and arcs. Input arcs connect places to transitions, and output arcs connect transitions to places. Other types of arcs, such as inhibitor (suppression) arcs, appear in more complex systems. An exemplary Petri Net with a token is shown in Figure 3.4.

Figure 3.4 – A simple Petri Net

• Place: a circle (P0, P1, P2)
• Transition: a rectangle (T0, T2)
• Arc: a directed arc between a place and a transition
• Token: a dynamic object in a place (represented by a dot in the place P) that can be moved from one place to another


Note:
• The arc is directional
• There is no arc between two places or between two transitions
• A place can hold any number of tokens
• Several transitions may be enabled at the same time, but only one transition can fire at a time

Mathematical formulation

3.4.2.1.1 Basic PN

Definition: The structure of a PN is a directed graph described by 4 elements:

PN = {P, T, I, O}

where:
• P = {P1, ..., Pn} is a finite set of places, and n > 0 is the number of places;
• T = {T1, ..., Tm} is a finite set of transitions, m > 0 is the number of transitions, and P ∩ T = ∅ (empty set);
• I: P × T → N is an input function that defines the multiplicity or weight of the directed arc from P to T, where N = {0, 1, ...} is the set of nonnegative integers;
• O: T × P → N is an output function that defines the multiplicity or weight of the directed arc from T to P.

A marked Petri Net is further defined by 6 elements:

PN = {P, T, F, K, W, M0}

where:
• P = {P1, ..., Pn} is a finite set of places;
• T = {T1, T2, ..., Tm} is a finite set of transitions;
• F ⊆ (P × T) ∪ (T × P) is the arc set;
• K: P → N+ is the capacity function of the places, where N+ = {1, 2, …} and K(P) = ω denotes that the capacity of P is infinite;
• W: F → N+ is the weighting function of the arcs;
• M0: P → N is the initial marking;
• M: P → N, with N = {0, 1, 2, ...}, is the marking of the PN;
with the requirements P ∩ T = ∅ and P ∪ T ≠ ∅.

In a PN, a transition T represents an event, and the enabling of the transition indicates that the event can happen because its prerequisites are satisfied. The input places of T (e.g. P0 for T0 in Figure 3.4) represent the preconditions for the occurrence of the event, and the number above an arc is the weight required by the input function. The number of tokens contained in a place represents the number of times the local state is realised. If all the input places of a transition hold at least as many tokens as the weights of their arcs require (one token each in the unweighted case), the transition is enabled. When a transition is enabled, it can fire: tokens are then consumed from the input places and generated in the output places.

3.4.2.1.2 Coloured Petri Nets (CPN)

The CPN is an advanced Petri Net, especially useful for the description of complex random discrete event systems. Its main difference from the standard Petri Net is that the tokens in a CPN may carry many defined attributes. Thus, unlike the single token attribute in the standard Petri Net, a CPN can define and distinguish different tokens in the same place. This makes it possible to build a more complex logical network architecture, including hierarchical definitions.
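The enabling and firing rule of the basic PN in 3.4.2.1.1 can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the tooling used in SUCCESS: the class name `PetriNet`, its methods and the three-place chain are hypothetical, place capacities K are ignored, and the net is not the one of Figure 3.4.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PetriNet:
    """Basic place/transition net PN = {P, T, I, O} with a marking M (hypothetical helper)."""
    inputs: Dict[str, Dict[str, int]]    # I: transition -> {input place: arc weight}
    outputs: Dict[str, Dict[str, int]]   # O: transition -> {output place: arc weight}
    marking: Dict[str, int] = field(default_factory=dict)  # M: place -> number of tokens

    def is_enabled(self, t: str) -> bool:
        # A transition is enabled when every input place holds at least
        # as many tokens as the weight of its input arc.
        return all(self.marking.get(p, 0) >= w for p, w in self.inputs[t].items())

    def enabled(self) -> List[str]:
        return [t for t in self.inputs if self.is_enabled(t)]

    def fire(self, t: str) -> None:
        # Firing consumes tokens from the input places and
        # generates tokens in the output places.
        if not self.is_enabled(t):
            raise ValueError(f"transition {t} is not enabled")
        for p, w in self.inputs[t].items():
            self.marking[p] -= w
        for p, w in self.outputs[t].items():
            self.marking[p] = self.marking.get(p, 0) + w


# Assumed example net (a simple chain): P0 -> T0 -> P1 -> T1 -> P2, one token in P0.
net = PetriNet(
    inputs={"T0": {"P0": 1}, "T1": {"P1": 1}},
    outputs={"T0": {"P1": 1}, "T1": {"P2": 1}},
    marking={"P0": 1, "P1": 0, "P2": 0},
)
net.fire("T0")  # moves the token from P0 to P1, which in turn enables T1
```

Keeping I and O as per-transition dictionaries mirrors the input and output functions of the definition; a coloured variant would store a collection of attributed tokens per place instead of a plain count.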


3.4.2.1.3 Generalised Stochastic Petri Nets (GSPN) SPNs can be defined as a five-tuple (P,T,F,M0, λ), where P = {P1,P2,P3…,Pk} is the place set, which describes the states of networks or the conditions for transitions, T = {T1,T2,T3…,Tk} is the finite set of transitions, the execution of which changes the states of networks, F = (P X T) Ս (T X P) is an arc set, connecting places and transitions, λ = { λ 1, λ 2, λ 3…, λ k} is the set of firing rates associated with the transitions. In SPNs, each firing rate λ i(i= 1…l) is exponentially distributed, and M0 = {M01,M02, . . .,M0k} is an initial marking, which depicts the initial state of networks. Mapping of IDE4L architecture to CPNs The CPNs are used to model the information flow between the different actors. The places represent the different actors. The arcs and transition show the information flow between the actors. The tokens represent the data exchanged. The individual colours of the tokens represent the data used in different automation use-cases. This is done to differentiate the data exchange between the actors for various automation use-cases. The use of coloured Petri Nets in this study is to simulate the dataflow between different actor simultaneously for all use-cases, thus, providing an opportunity to not only analyse the dataflow of specific use case but also for the whole automation architecture, therefore to help in determining the most critical actor in the automation architecture. However, for each use-case, the critical actor can also be identified by computing the Average Number of Tokens (ANT) per place. The place with the highest ANT would be considered the most critical actor for that specific use-case. For the analysis of the complete automation architecture, the complete set of use cases considered to develop the automation architecture, have been clustered and modelled in three different Coloured Petri Net (CPN), namely, Monitoring, Control and Protection use cases. 
A single CPN was modelled to represent all monitoring use cases; the different token colours in this CPN depict the dataflow of the individual use cases within the cluster. Similarly, separate CPNs representing the control and protection use cases were modelled. A weighted average of the ANT values obtained from the Monitoring, Control and Protection CPNs was calculated to identify the most critical actor of the complete automation architecture. The detailed CPN models are described below. For clarity, the CPNs corresponding to the different use cases are presented separately first; the complete Petri net, corresponding to the complete automation architecture, is presented last. The list of actors and a short description of their roles are tabulated in [6].

Monitoring use cases

The monitoring of the MV grid includes Real Time Monitoring (MVRTM), State Estimation (MVSE) and State Forecast (MVSF). The MVRTM is responsible for acquiring measurements and forwarding them to the MVSE. The MVSE then processes those measurements and calculates the state of the grid with a specific confidence level that depends on the uncertainty of the measurements. The MVSF is responsible for forecasting the MV network states, considering the forecasts of Distributed Energy Resources and the power consumption patterns. The CPNs corresponding to the MVRTM and MVSE use cases are shown in Figure 3.5 and Figure 3.6.
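The weighted-average ANT criterion described above can be illustrated with a short sketch. The per-CPN ANT values and the equal cluster weights below are hypothetical placeholders (the report does not state the weights it used), and treating a place that is absent from a CPN as having an ANT of zero is likewise an assumption made for this illustration.

```python
def weighted_ant(ant_per_cpn, weights):
    """Return the weighted-average ANT per place across several CPNs."""
    total_weight = sum(weights[cpn] for cpn in ant_per_cpn)
    places = {place for ant in ant_per_cpn.values() for place in ant}
    return {
        # A place missing from a CPN contributes 0 tokens (assumption).
        place: sum(weights[cpn] * ant.get(place, 0.0)
                   for cpn, ant in ant_per_cpn.items()) / total_weight
        for place in places
    }


def most_critical(scores):
    """Place with the highest (weighted) ANT."""
    return max(scores, key=scores.get)


# Hypothetical per-CPN ANT values; NOT the simulation results of the report.
ant_per_cpn = {
    "monitoring": {"PSAU.RDBMS": 3.0, "PSAU.SE": 1.5},
    "control": {"PSAU.RDBMS": 2.0, "PSAU.MVPC": 0.5},
    "protection": {"PSAU.RDBMS": 1.0, "PSAU.MMS": 2.0},
}
weights = {"monitoring": 1.0, "control": 1.0, "protection": 1.0}  # assumed equal

scores = weighted_ant(ant_per_cpn, weights)
print(most_critical(scores), scores)
```

With unequal weights, the same function would let one cluster (for example protection) dominate the ranking, which is why the choice of weights should be documented alongside the results.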


Figure 3.5 - CPN of MVRTM use-cases

Figure 3.6 - CPN of MVSE with MVSF

Control use cases

For the control of the MV grid, two major control applications are considered: the Medium Voltage Power Control (MVPC) and the Control Centre Power Control (CCPC). Both of these actors are responsible for congestion management of the MV distribution grid. Their respective CPNs are shown in Figure 3.7 and Figure 3.8.

Figure 3.7 - CPN of MVPC


Figure 3.8 - CPN of CCPC

Protection use cases

The protection use case mainly corresponds to the Fault Location, Isolation and Service Restoration function. To locate and isolate a fault, the breaker IEDs exchange information with each other to identify the fault and isolate the faulty section. The status of the breakers is then sent to the Primary Substation Automation Unit (PSAU), which decides the optimal reclose path. The PSAU then sends messages to the specific IEDs that reclose the required breakers.

Complete system

The complete system can be represented as shown in Figure 3.9.

Figure 3.9 - CPN model of the overall automation architecture

3.4.3 Identification of critical components: Test results

The CPN of the complete automation architecture was simulated and the weighted average of ANT was calculated for each place. The final results are tabulated in Table 3.2. The results show that the PSAU.RDBMS is the most critical actor/component of the automation architecture. This actor is responsible for storing all the streaming data from the sensors, the results of the grid state estimation algorithms, the results of the control and protection algorithms, the grid topology data and all the configuration data. The highest level of security, reliability and redundancy measures have to be taken for this component. However, the PSAU.MMS and DMS.MMS also need to be given due consideration because of their relative importance.

Table 3.2 - Average token number per place

Place         Average tokens
PSAU.RDBMS    2.01365
DMS.MMS       1.02142
PSAU.MMS      1.02132
PSAU.SE       0.7426
SSAU.RDBMS    0.6667
PSAU.SF       0.6535
CC.DXP        0.5025
CC.PC         0.4975
PSAU.TM       0.2500
PSAU.MVPC     0.2
Sensor        0.0296

The CPN-based identification of critical components described above can be applied irrespective of the architecture of the automation system; depending on the architecture, different components would be critical. In order to quantify the impact of losing any of these actors on the performance of the automation system, a detailed dependability and availability analysis is required. This is done using Generalised Stochastic Petri Nets (GSPN), as described in the following subsections. Furthermore, with the GSPN the effectiveness of specific countermeasures in improving the availability of the automation system can be evaluated.

3.5 Availability analysis of the automation system

3.5.1 Introduction

The main goal of the automation system is to keep the grid in a safe and reliable operating state. The availability of key automation functionalities determines the operational state of the power grid. Since each of these key functions is realized by different automation actors, the availability of the individual actors determines the availability of the key automation functionalities. When designing the automation system to be inherently resilient and survivable, it is important to evaluate the failure of the complete system due to the failure of the individual actors of the automation system. Given the statistical failure rates of the individual actors, the availability of the automation functions can be deduced using Generalised Stochastic Petri Nets (GSPN). Based on this evaluation, appropriate countermeasures for safeguarding the individual automation actors can be designed.
Furthermore, the effectiveness of the countermeasures in improving the availability of the system, even under concurrent failures of individual automation devices, can also be determined with GSPN. Assuming specific failure rates for the individual automation actors, the different failure modes of the automation actors are modelled using GSPN. The chain of failures triggered by the failure of individual components is also modelled. Thus, the overall statistical availability of the automation functionalities can be evaluated. In this study the availability of the automation functions (of the IDE4L architecture) is evaluated for failures of different automation devices. The degradation of the availability of the automation functions under the failure of the PSAU is analysed in detail. Finally, the improvement in the availability of the grid automation functionality with Double Virtualization of the PSAU is presented. For this study, the major automation functionalities considered are State Estimation (SE), Medium Voltage Power Control (MVPC), Control Centre Power Control (CCPC), Medium Voltage State Forecast (MVSF), Network Data Update (NDU) and Medium Voltage Real Time Monitoring (MVRTM). The basic components of the automation system for this analysis and the semantics between them are depicted in Figure 3.10.


Figure 3.10 - Automation components considered for GSPN

3.5.2 Stochastic Petri Net based failure modelling

Failure models

The failure models of the sensors/actuators, IEDs, PSAUs and the DMS are explained in the following subsections. The mean frequencies of occurrence of the different failure modes are tabulated in Table 3.3. TCLF denotes the mean frequency of communication link failure, TDC the mean frequency of data corruption, TDPD the mean frequency of the device being physically damaged, and TDH the mean frequency of the database being hacked. The time to occurrence of each of these failures is assumed to follow an exponential distribution whose rate is the mean frequency of failure given in Table 3.3. The specific values have been adopted from [7]. These values are used in the individual failure models of the sensors/actuators, IEDs, PSAUs and DMS, and in the model of the whole automation scheme. The failure models are then used to deduce the availability of the automation functions.

Table 3.3 - Mean frequency of failures

Failure mode    Symbol    Mean frequency
TCLF            λ1        2
TDPD            λ2        2
TDC             λ3        1
TDH             λ4        0.01

3.5.2.1.1 Sensor/actuator

The sensor/actuator failure mode model is depicted in Figure 3.11. Three different modes of failure are considered in parallel.
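As a worked illustration of failure modes acting in parallel, the sketch below combines the three sensor/actuator rates from Table 3.3. It assumes the three modes are independent and exponentially distributed, in which case the time to the first failure is again exponential with the summed rate. The time unit is not stated in the source, so results are in "time units"; the function names are illustrative only.

```python
import math

# Rates taken from Table 3.3 (time unit not stated in the source).
SENSOR_RATES = {
    "communication_link_failure": 2.0,  # TCLF, lambda1
    "physical_damage": 2.0,             # TDPD, lambda2
    "data_corruption": 1.0,             # TDC,  lambda3
}


def first_failure_stats(rates):
    """Combined rate, mean time to first failure, and P(mode fails first).

    Assumes independent, exponentially distributed failure modes.
    """
    total = sum(rates.values())
    first_prob = {mode: rate / total for mode, rate in rates.items()}
    return total, 1.0 / total, first_prob


def survival(rates, t):
    """Probability that no failure mode has fired by time t."""
    return math.exp(-sum(rates.values()) * t)


total_rate, mean_time, first_prob = first_failure_stats(SENSOR_RATES)
print(total_rate, mean_time, first_prob)
print(survival(SENSOR_RATES, mean_time))
```

Under these assumptions the combined rate is 5 per time unit, and data corruption is the first failure to occur in one case out of five.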


Figure 3.11 - Sensor/Actuator failure model

The sensor failure model is built considering three different failure modes: the sensor/actuator could be physically damaged, its data could be corrupted (cyber-attack), or the communication link between the sensor/actuator and the IEDs might fail.

3.5.2.1.2 IEDs

Figure 3.12 depicts the failure model of the Intelligent Electronic Devices (IEDs). An IED is a microprocessor-based device that receives measurements and status information from the sensors and computes the appropriate control command that is sent to the actuator. Like the sensor/actuator, the IED has three failure modes: first, failure of the communication interface, which results in communication link failure; second, compromise of the control logic of the IED, which results in data corruption; and third, physical damage to the IED.

Figure 3.12 - IED failure model

3.5.2.1.3 PSAU

The PSAU hosts both the services that connect to the database and the different functions such as the SE, MVPC, etc. Therefore, the failure of the PSAU also results in the failure of these functionalities. In addition to communication link failure, data corruption and physical breakdown of the device, the PSAU has a database hack failure mode, as depicted in Figure 3.13. Since the failure of the PSAU results in the immediate failure of these functionalities, the transitions between PSAU down and MVSE_down, MVSF_down, MVRT_down, MVPC_down and NDU_down are immediate.


Figure 3.13 - Failure model of PSAU

3.5.2.1.4 DMS

The failure model of the DMS is depicted in Figure 3.14. The failure of the DMS causes an immediate failure of NDU and CCPC.

Figure 3.14 - DMS failure model

3.5.2.1.5 Whole automation scheme

The complete automation system, with all the interdependent semantics shown in Figure 3.10, is depicted in Figure 3.15. The failure of a sensor/actuator results in the failure of the upstream IEDs responsible for the corresponding measurement/control message exchange. The failure of the IEDs results in the failure of the upstream PSAU. The failure of a PSAU immediately results in the failure of key grid automation functions such as the SE, MVPC and MVSF. The failure of the DMS also results in a partial failure of the automation functions. Since each PSAU manages a specific section of the distribution grid, its failure results in mismanagement of only that section. In this study 3 PSAUs are considered with 2 IEDs and corresponding sensors.


Figure 3.15 - GSPN model of automation system failure model without Double Virtualization

In a resilient automation system, the availability of key automation functions such as the SE and MVPC should be high. Therefore, the PSAU should have high availability. With Double Virtualization the PSAU can be virtualized and migrated to other hardware, which may or may not already have a PSAU instance running. This backup procedure is also modelled with Petri nets, in such a way that the PSAU failure transitions are inhibited after a specific time period corresponding to the migration delay. The overall automation system with Double Virtualization is modelled with GSPN as shown in Figure 3.16. With this model, the steady state availability of the automation functions can be calculated and the improvement in their availability with Double Virtualization can be studied.

Figure 3.16 - GSPN model of Distribution Automation system failure model with Double Virtualization

3.5.3 Performance evaluation

Key performance index

Different metrics can be calculated to analyse the performance of the automation system. For this study the steady state availability of the automation functions is evaluated. Availability describes the readiness of critical components to provide correct services [8]. The objective of the presented work is to quantify the improvement in the overall availability of the key automation functions with Double Virtualization compared to without it. The aim is not to determine how fast these functions are made available again after a failure of the automation components, but to determine their steady state probability of being available. The calculation of the steady state availability involves two main steps.


First, the equivalent Continuous Time Markov Chain (CTMC) representing the different reachable markings of the stochastic Petri net is calculated, where each marking represents a state of the modelled system. Second, the steady state availability is calculated. It has been proved in [9] that k-bounded SPNs are isomorphic to CTMCs. Thus, an SPN can be associated with a CTMC by constructing the reachability graph of the SPN and labelling each arc with the sum of the firing rates of the transitions that trigger the corresponding change of state in the CTMC [10]. The steady-state probabilities follow from the birth-death process, for which analytical solutions are easy to derive manually. They are calculated as follows. For any reachable marking $M_i$, with $M_j$ ranging over the markings that can be reached from $M_i$ and $M_k$ over the markings from which $M_i$ can be reached,

$$\pi_i \sum_{j} \lambda_{ij} = \sum_{k} \lambda_{ki}\,\pi_k \qquad \text{Eq (1)}$$

where $\lambda_{ij}$ corresponds to the mean firing rate of the transition leading from marking $M_i$ to marking $M_j$ (the mean failure rate of an automation component) and $\pi_i$ corresponds to the steady-state probability of the $i$th marking. In addition to the $n-1$ independent equations of this form (where $n$ is the number of markings reachable for a specific Petri net), the normalisation condition

$$\sum_{i=1}^{n} \pi_i = 1$$

holds. Among the states that are reachable with the Petri net, the states $M_A$ denote the states of the automation system in which the key automation functionalities are available, and the states $M_{1-A}$ correspond to the states in which these functionalities are unavailable. Given the states $M_A$ and $M_{1-A}$, the steady-state availability is given as

$$A_{\text{steady\_state}} = \sum_{i=1}^{|M_A|} w_i\,\pi_i, \quad M_i \in M_A \qquad \text{Eq (2)}$$

where $w_i$ is the weight corresponding to the different states for which the key automation functionalities are available, and $\pi_i$ is the probability of the system being in a state $i$ in which the key functions are available. For this study, equal weights are assigned to all system states (markings) in which the key automation function is available.

Test cases

The GSPN models depicted in Figure 3.15 and Figure 3.16 were simulated using the Oris tool.
The availability of the following key automation functions is calculated for failures of different components: Medium Voltage State Estimation (MVSE), Medium Voltage State Forecast (MVSF), Medium Voltage Power Control (MVPC), Control Centre Power Control (CCPC), Network Description Update (NDU) and Medium Voltage Real-Time Monitoring (MVRTM). The failures of the sensor/actuator, IEDs, PSAU and DMS are simulated individually, and the availability of the automation functionality is calculated in each case. The improvement in the availability of the key automation functions with Double Virtualization is presented.

Test results

The final availability of the key automation functionality with and without Double Virtualization of the PSAU is depicted in Figure 3.17.
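The two-step calculation described above (balance equations plus normalisation, then a weighted sum over the available markings) can be sketched for a minimal two-state chain. This is not the GSPN of Figure 3.15 or 3.16 and does not use the Oris tool: the states, the repair rate of 8 and the unit weights are assumptions chosen only to make the arithmetic easy to follow, while the failure rate of 2 is borrowed from Table 3.3.

```python
from fractions import Fraction


def steady_state(Q):
    """Solve pi * Q = 0 with sum(pi) = 1 for an irreducible CTMC.

    Q is the generator matrix: Q[i][j] is the rate from state i to j,
    and each diagonal entry makes its row sum to zero.
    """
    n = len(Q)
    # Balance equations are the rows of Q transposed; the last one is
    # redundant and is replaced by the normalisation condition.
    A = [[Fraction(Q[j][i]) for j in range(n)] for i in range(n)]
    A[-1] = [Fraction(1)] * n
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    # Gauss-Jordan elimination with a simple pivot search.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] = b[col] * inv
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] = b[r] - factor * b[col]
    return b


def availability(pi, available_states, weights=None):
    """Weighted sum of steady-state probabilities over available states."""
    if weights is None:
        weights = {i: 1 for i in available_states}  # unit weights (assumption)
    return sum(weights[i] * pi[i] for i in available_states)


# State 0: function available, state 1: function unavailable.
FAILURE_RATE = 2  # from Table 3.3 (communication link failure)
REPAIR_RATE = 8   # hypothetical, not given in the report
Q = [[-FAILURE_RATE, FAILURE_RATE],
     [REPAIR_RATE, -REPAIR_RATE]]

pi = steady_state(Q)
print(pi, availability(pi, [0]))
```

For this chain the result matches the closed form repair/(failure + repair). Shortening the effective restoration time, as migration is intended to do, corresponds to increasing the hypothetical repair rate.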


Figure 3.17 - a) Availability of key automation functions under sensor/actuator failure. b) Availability of key automation functions under IED failure. c) Availability of key automation functions under PSAU failure. d) Availability of key automation functions under DMS failure

The analysis shows that PSAU failures can be mitigated with Double Virtualization. However, under sensor/actuator, IED and DMS failures there is no improvement in the availability of the key automation functionality. This is because a backup strategy was modelled and applied only for the PSAU and not for the other components. From this analysis it can be concluded that PSAU Double Virtualization improves the availability of the key automation functions, in this specific case by 8%, but that additional backup strategies safeguarding the individual sensors/actuators and IEDs should be designed. Such additional features could be provided by the SUCCESS NORM security functions, described in [11].


4. Resilient Virtual Process Control Function Placement Algorithm

4.1 Introduction

In control processes of distribution automation systems, hardware controllers process real-time data collected by sensors and send control commands to actuators for execution. With softwarization, hardware controllers are replaced by software instances, which we call virtual process control functions (VPFs). VPFs are easier to upgrade and more flexible than hardware controllers. VPFs can be hosted by various computational and storage architectures, and mobile edge computing (MEC) is a promising architecture for hosting VPFs with high bandwidth and low latency. MEC provides distributed computational and storage resources in the proximity of end-users. One critical requirement when placing VPFs within MEC is resilience to both component failures (e.g., MEC node failures and communication link failures) and cyber-attacks (e.g., DoS attacks and advanced persistent threats). Resilience can be realized by executing redundant copies of VPFs on multiple MEC nodes, but at a high operational cost, since the computational and storage resources for executing the redundant copies need to be paid for. Alternatively, restoring instances of VPFs on different MEC nodes when failures occur is more attractive, since it consumes fewer computational and storage resources. The feasibility of restoring VPFs is restricted by the available resources on MEC nodes, and VPFs that are not affected by the failures may need to be migrated, which makes the problem of VPF placement challenging. Furthermore, the adoption of MEC for VPF placement has to consider the data transmission delay, which is critical for the performance of VPFs and depends on their placement. In practice, VPF placement for delay optimization is subject to the capacity of the MEC nodes and should take into account the cost of executing VPFs on different MEC nodes, which makes VPF placement difficult.
In what follows, we address the problem of resilient VPF placement with the objective of minimizing the expected operational cost, including the fees to make MEC nodes available, the fees for VPF execution, and the communication cost that models data transmission delay. We propose an effective solution based on Benders decomposition and evaluate its performance.

4.2 System Model and Problem Formulation

We consider a mobile network that consists of a set $B$ of base stations (BSs). A subset $M \subseteq B$ of the BSs is equipped with computational and storage resources and serves as MEC nodes. We denote by $\omega_m$ the computing capacity (e.g., the number of virtual machines) of MEC node $m \in M$. Within this infrastructure, we consider a set $F$ of VPFs that need to be allocated to MEC nodes for execution, and a set $S$ of sensors and a set $A$ of actuators that need to communicate with the VPFs. For each function $f \in F$, $y_{f,s}$ indicates whether sensor $s \in S$ captures data needed by $f$, and $z_{f,a}$ indicates whether actuator $a \in A$ receives commands from $f$. We consider that each VPF requires one unit of computing resource to satisfy the requirement of isolation, for the purpose of guaranteed performance and security. Sensors and actuators can send and receive data via BSs wirelessly, and the BSs are interconnected by a backhaul network (e.g., a software defined mobile backhaul network). Data communication through the wireless links and the backhaul network incurs a cost, which depends on the failure scenario, as discussed next.

4.2.1 Failure Scenarios

We consider that the communication and computing infrastructure is subject to occasional failure of its components. We use the term failure scenario to refer to the system with a set of component failures, and denote by $\mathcal{L}$ the set of all failure scenarios. Each failure scenario can include communication and computing resource failures. A communication failure, a hardware failure or a DoS attack may result in the failure of some MEC nodes and can make MEC nodes unsuitable for VPF placement. The binary indicator $h_{l,m}$ denotes the suitability of MEC node $m$ for VPF placement in scenario $l \in \mathcal{L}$. We assume that the system operator is able to estimate the occurrence probability of each failure scenario, and we denote the estimated occurrence probability of scenario $l$ by $\pi_l$. By definition $\sum_{l \in \mathcal{L}} \pi_l = 1$. This model allows correlated failures to be captured, and is thus able to capture various link layer, network layer and cloud failure recovery mechanisms.

Note that the failure of a wireless communication link, whether due to equipment failure, jamming or a denial of service attack, can also result in the failure of the communication between a sensor or an actuator and its associated BS. We consider that the system can recover from this kind of failure by re-associating the sensor or actuator with another BS, and that BS association is taken care of by the mobile network. Similarly, component failure in the backhaul network (e.g., of an SDN switch or an optical cable) is handled by the mobile network, but it could result in increased delay.

4.2.2 Cost Model

Our VPF placement cost model consists of the cost for storage and computing resources, and of the cost for the data transmission between the MEC nodes and the sensors and actuators. We denote by $F_m$ the availability cost of MEC node $m$. The availability cost $F_m$ includes the cost of storing the virtual machine images in the MEC nodes (storage cost), and the cost of reserving computational resources and memory for the execution of the VPFs (availability fee). We denote by $p_{m,f}$ the placement cost of an instance of VPF $f$ on node $m$. This cost corresponds to the computational and memory resources needed for the execution of the VPF. We denote by $c_{l,m,i}$ the data transmission cost between a sensor or an actuator $i \in A \cup S$ and a MEC node $m \in M$ in scenario $l$.

4.2.3 Problem Formulation

We are now ready to formulate the resilient VPF placement problem, which consists of choosing the set of MEC nodes to be made available, and of deciding the placement of the VPFs in each failure scenario, subject to resilience and MEC resource capacity constraints. We use the binary decision variable $v_m$ to denote whether MEC node $m$ is kept available, and let $\mathcal{V} = \{v_1, v_2, \dots, v_{|M|}\}$. Furthermore, we use the binary decision variable $x_{l,f,m}$ to denote whether VPF $f$ is placed in MEC node $m$ in scenario $l$, and let $\mathcal{X} = \{x_{1,1,1}, \dots, x_{|\mathcal{L}|,|F|,|M|}\}$. The resilient VPF placement problem is then to minimize the VPF placement cost,

$$
\begin{aligned}
\min_{\mathcal{V},\,\mathcal{X}} \quad & \sum_{m \in M} v_m F_m + \sum_{l \in \mathcal{L}} \pi_l \sum_{m \in M} \Bigg[ \sum_{s \in S} \Big( c_{l,m,s} \sum_{f \in F} y_{f,s}\, x_{l,f,m} \Big) + \sum_{a \in A} \Big( c_{l,m,a} \sum_{f \in F} z_{f,a}\, x_{l,f,m} \Big) + \sum_{f \in F} x_{l,f,m}\, p_{m,f} \Bigg] \\
\text{subject to} \quad & x_{l,f,m} \le h_{l,m}\, v_m, \quad \forall l, \forall m, \forall f && \text{(Availability Constraint)} \\
& \sum_{m \in M} x_{l,f,m} \ge 1, \quad \forall l, \forall f && \text{(Resilience Constraint)} \\
& \sum_{f \in F} x_{l,f,m} \le \omega_m v_m, \quad \forall l, \forall m && \text{(Capacity Constraint)}
\end{aligned}
\qquad \text{(P1)}
$$

The availability constraint requires that a VPF $f$ can only be placed on a MEC node that is available and is suitable for VPF placement in scenario $l$, and the resilience constraint ensures that each VPF $f$ is executed by a MEC node in each scenario. Finally, the capacity constraint ensures that the capacity of the MEC nodes is not exceeded.

4.3 Resilient VPF Placement Algorithm

Observe that the resilient VPF placement problem (P1) is an Integer Programming problem, which is computationally hard in general. However, (P1) has a constraint matrix with a so-called block-ladder structure. Constraint matrices with block-ladder structure typically arise in stochastic programming problems, for which an efficient solution method is the Generalized Benders Decomposition (GBD) [12]. We leverage this insight to propose the Resilient VPF Placement (RVP) algorithm to solve (P1). The overall structure of the proposed RVP algorithm is shown in Figure 4.1. Following the idea of GBD, the RVP algorithm decomposes the original problem into a master problem and a sub-problem, and solves them iteratively. In iteration $k$ the master problem makes a set of MEC nodes available, indicated by $\mathcal{V}^{(k)}$. If there exists a feasible VPF placement with respect to $\mathcal{V}^{(k)}$, the sub-problem computes the placement of the VPFs and adds a feasibility cut to the master problem. Otherwise, an infeasibility cut is added. Both the feasibility and the infeasibility cuts tighten the master problem. In each iteration, the master problem and the sub-problem generate a lower bound and an upper bound on the objective value of (P1), respectively. The iteration stops when the upper bound and the lower bound satisfy the termination condition. A more detailed description of the RVP algorithm can be found in [13].

Figure 4.1 - Overall Structure of Resilient VPF Placement Algorithm

4.4 Numerical Results

We validated the RVP algorithm for a system with 15 MEC nodes and up to 30 VPFs, and benchmarked it against a greedy algorithm proposed in [13]. Figure 4.2 shows the total cost as a function of the number of VPFs. The results show that the RVP algorithm outperforms the greedy algorithm by up to about 110 percent. The superior performance of the RVP algorithm arises because it jointly optimizes the availability cost, the placement cost and the data transmission cost to minimize the total cost. Clearly, in practice the performance gain of the RVP algorithm over the greedy algorithm depends on the system parameters and the workload. Nonetheless, the RVP algorithm is scalable enough to be used for the resilient placement of virtualized software instances in mobile edge clouds.

Figure 4.2 - Total cost vs. the number of VPFs for a system of 15 MEC nodes
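To make the structure of (P1) concrete, the following sketch solves a toy instance by exhaustive enumeration. It is not the RVP algorithm and not the greedy benchmark of [13]; it only evaluates the objective and the three constraints for every candidate solution. All numerical values are hypothetical, the name `suitable` stands for the node suitability indicator of Section 4.2.1, and the sketch assumes non-negative costs so that placing each VPF on exactly one node per scenario is sufficient.

```python
from itertools import product


def solve_p1(avail_cost, capacity, suitable, prob, place_cost,
             sensor_cost, actuator_cost, y, z):
    """Brute-force (P1): returns (expected cost, v, placement per scenario)."""
    nodes = range(len(avail_cost))
    n_vpf = len(place_cost[0])
    best = (float("inf"), None, None)
    for v in product((0, 1), repeat=len(avail_cost)):
        total = sum(v[m] * avail_cost[m] for m in nodes)
        placement, feasible = [], True
        for l in range(len(prob)):
            # Availability constraint: node must be available and suitable.
            allowed = [m for m in nodes if v[m] and suitable[l][m]]
            best_l = (float("inf"), None)
            # Resilience constraint: every VPF is assigned to one node.
            for assign in product(allowed, repeat=n_vpf):
                # Capacity constraint: one resource unit per VPF.
                if any(assign.count(m) > capacity[m] for m in allowed):
                    continue
                cost = sum(
                    place_cost[m][f]
                    + sum(sensor_cost[l][m][s] * y[f][s]
                          for s in range(len(y[f])))
                    + sum(actuator_cost[l][m][a] * z[f][a]
                          for a in range(len(z[f])))
                    for f, m in enumerate(assign)
                )
                if cost < best_l[0]:
                    best_l = (cost, assign)
            if best_l[1] is None:
                feasible = False
                break
            total += prob[l] * best_l[0]
            placement.append(best_l[1])
        if feasible and total < best[0]:
            best = (total, v, placement)
    return best


# Toy instance (all values hypothetical): 2 MEC nodes, 2 VPFs,
# 1 sensor, 1 actuator, 2 scenarios (scenario 1: node 0 unsuitable).
cost, v, placement = solve_p1(
    avail_cost=[3, 4],
    capacity=[2, 2],
    suitable=[[1, 1], [0, 1]],
    prob=[0.9, 0.1],
    place_cost=[[1, 1], [2, 2]],          # place_cost[m][f]
    sensor_cost=[[[1], [2]], [[1], [2]]],  # sensor_cost[l][m][s]
    actuator_cost=[[[1], [2]], [[1], [2]]],
    y=[[1], [1]],                          # y[f][s]
    z=[[1], [0]],                          # z[f][a]
)
print(cost, v, placement)
```

In this toy instance both nodes are kept available: the cheaper node 0 hosts both VPFs in the failure-free scenario, and node 1 takes over when node 0 is unsuitable. Enumeration grows exponentially with the number of nodes and VPFs, which is the motivation for a decomposition approach on instances of realistic size.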


5. Triggering Virtual Control Relocation

We envision that the Double Virtualization concept is implemented in the Breakout Gateway (BR-GW). The overall security monitoring solution, as well as the interfaces between its components, is described in deliverable D4.6 [4]. The following two components of the overall SUCCESS security solution are the most relevant in security incident identification:

• CI-SOC (Critical Infrastructure Security Operations Centre)
• Breakout Gateway

The relevant critical infrastructure entity in the context of security incident detection is the Critical Infrastructure Security Analytics Network, which is composed of:

• Security Data Concentrators (SDC)
• CI-SAN (Critical Infrastructure Analytics Node).

Figure 5.1 - SUCCESS Security Solution interfaces

A security incident can be identified in both of those fragments, initially by monitoring the network and the nodes in it, which is essentially done in the Breakout Gateway. The CI security entities are either informed about the security incident by the Breakout Gateway or are able to identify the incident themselves; in both cases a warning flag is raised and the security operator is informed, e.g., by issuing an alarm. In the SUCCESS security solution, a manual action by the operator is expected, which triggers the migration of the virtual instances as a countermeasure, although the complete process from incident identification to the triggering of virtual instance migration could be fully automated. SDCs will forward the incident indication further to the corresponding CI-SAN.

5.1 Relocation Triggering by DSO Entity

CI-SOC will continuously receive the following information from the BR-GW through the interface I2:


• Cloud resources status
• Virtual functions topology
   o Virtual function topology describes relations among virtual functions and physical grid resources (physical network segments)

In case that CI-SOC identifies an incident, it will issue an alarm notification. Accordingly, two cases are foreseen:

• the security operator will decide which action to take;
• CI-SOC will send instructions to the BR-GW on which virtual instance should be migrated or shut down.

The SDC entity will distribute the incident indication further to the upper instance of the CI-SAN.

5.2 Relocation Triggered by the Edge Cloud

In case of a cyber-attack or malfunction of physical equipment monitored by the BR-GW, the BR-GW might autonomously take an internal action and inform the CI-SAN. Such an action taken by the BR-GW can be fully automatic. Regardless of whether the BR-GW takes immediate action or not, it will send the following information to CI-SOC through interface I2, as described in [4]:

• The BR-GW virtual and controlled physical network resources' status
• Virtual function topology
   o Virtual function topology describes relations among virtual functions and physical grid resources (physical network segments)
• Eventual internal action taken by the BR-GW.
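The information exchange and the two triggering paths described above can be sketched as a small data structure and a decision function. This is purely illustrative: the field names, the "ok"/"compromised" status encoding and the rule of moving a function to the first healthy node are assumptions made here, not the I2 interface specification of [4].

```python
from dataclasses import dataclass


@dataclass
class StatusReport:
    """Hypothetical shape of the information sent by the BR-GW over I2."""
    resource_status: dict        # node name -> "ok" or "compromised" (assumed)
    function_topology: dict      # virtual function -> hosting node
    internal_action: str = ""    # action already taken by the BR-GW, if any


def relocation_instructions(report):
    """List (function, source node, target node) migrations.

    A function is relocated when its hosting node is not reported "ok";
    the target is simply the first healthy node (illustrative rule only).
    """
    healthy = sorted(node for node, status in report.resource_status.items()
                     if status == "ok")
    plan = []
    for function, node in sorted(report.function_topology.items()):
        if report.resource_status.get(node) != "ok" and healthy:
            plan.append((function, node, healthy[0]))
    return plan


report = StatusReport(
    resource_status={"edge-node-A": "ok", "edge-node-B": "compromised"},
    function_topology={"SE": "edge-node-B", "MVPC": "edge-node-A"},
)
print(relocation_instructions(report))
```

In the operator-driven case of Section 5.1 such a plan would be presented to the security operator for confirmation, whereas in the edge-triggered case of Section 5.2 the BR-GW could apply it directly and report it as its internal action.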


6. Implementation of flexible migration of SAU: Double virtualization with Calvin 6.1.1 Introduction As presented in the previous chapter, the availability of PSAU and DMS ensures the proper operation of the automation systems. Generally, these actors are deployed in dedicated machines and thus become a single point of failure in the automation system. Hence, the chances of blackouts are increased when any of these dedicated machines are compromised (due to either cyber-attacks or natural calamities). Therefore, special measures have to be taken that enable the migration of the functionalities of the DMS/PSAU to a clean new machine, when the previous machine is compromised due to cyber-physical attacks. This ensures the availability of the SAU and increases the resilience of the automation system. In this work, the capability of the CALVIN IoT framework to virtualize the functionalities of PSAU is investigated. The performance of CALVIN concerning latencies involved in the migration process are presented. Furthermore, the different migration strategies possible with CALVIN is evaluated. A test setup is built to show case the migration of specific functionalities of PSAU as a proof of concept. However, this setup can be scaled and extended for all functionalities with proper configuration of the Calvin framework as presented in [14]. 6.1.2 Calvin framework CALVIN is a distributed IoT framework made available as open source package by Ericsson. It combines the idea of the Actor model and flow-based programming [15]. CALVIN provides a simplified framework which eases the development of the application for distributed systems. The framework consists of the three-architecture layers including runtime, actors and an application. ActorRuntimeOS

ApplicationHardwareRuntime ActorRuntime

OSHardwareRuntimeIPC App 1: Grid Monitoring

Hardware + OS Hardware + OS Hardware + OSRuntime Runtime RuntimeActor 1 Actor 2 Actor 3 Figure 6.1 - The distributed execution of the application; (b) The software stack of CALVIN As shown in the Figure 6.1, an application in the CALVIN framework can be implemented as combination of the actor. Where each actor is responsible for the part of the logic. Unlike the sequential programming models, CALVIN automatically creates parallel executable processes when the application allows for it [16]. Thus, an application can be represented as a set of actors. The runtime shown in Figure 6.1 sits on the top of the Operating System. The runtime includes protocols over which actors can communicate and also it abstracts the platform independent functionalities. A scheduler is included in each runtime which orchestrates the actor execution. Furthermore, the users can control the runtimes using the REST-API to create an application, an actor or their migration. This flexible allocation of deployment of the actors provides resilience to the distributed automation. A meshed network of runtimes can facilitate the actor migration between the runtimes. However, latency for migration depends upon both the inherent latency of the communication infrastructure of the meshed runtimes and the processing power of the hardware where the runtime is realized. With a distributed hash-table Kademlia [17], the information of the actors and other streaming data can be stored in a distributed fashion, reducing the chances of single point failures.


6.1.3 Test case implementation: Lab setup

The test case includes three different physical devices, one Linux server and two Raspberry Pi 3, on which the CALVIN environment runs. Each device corresponds to a PSAU, and each PSAU hosts a specific application. This case study is a basic implementation of real-time monitoring of a distribution grid, in which a state estimation algorithm runs in one PSAU and receives real-time measurement data from the other PSAUs using the CALVIN framework. For this, the different components of the PSAU as described in Section II are realized as individual CALVIN actors. The actors are the individual monitoring application (SE) and the instance of the TCP server/client, representing the application layer and the interfacing layer of the PSAU respectively. The test lab setup for the performance evaluation of CALVIN and the migration strategies is presented in the following sections.

6.1.3.1 Test lab setup

The purpose of this lab setup is to test the performance of the CALVIN framework for the migration of the SAU. Performance is indicated by metrics such as the latency of migration, the maximum data rate achievable between actors, and the latency of data exchange between actors.

6.1.3.1.1 Hardware setup

The hardware setup consists of two Raspberry Pi 3 and a server running Linux, configured as listed in Table 6.1. They are connected through Ethernet. All devices run a CALVIN runtime. The server additionally hosts an IED App, which simulates a measurement device and control unit, and the Remote Actor Manager, which manages the actors in the different runtimes. This system setup is shown in Figure 6.2.

Table 6.1 - Hardware specifications

Component     Computing machines
              Linux Server    Raspberry Pi 3
CPU           2 x 2.7 GHz     4 x 1.2 GHz
RAM           8 GB DDR3       1 GB
OS            Ubuntu 16.04    Raspbian 4.9
CALVIN        Version 0.7     Version 0.7
Networking    100 Mbit LAN    100 Mbit LAN


Figure 6.2 - System Setup

6.1.3.1.2 Actor migration strategies

In this study we assume that attack detection schemes are already in place. When an attack is detected, it triggers the Remote Actor Manager to initiate the migration process. To migrate an actor, the Remote Actor Manager sends a request to the API of the runtime where the actor is currently running. Next, the connections to other actors are disconnected. After that, the actual migration process starts: the actor is deleted on the old runtime and migrated to the new runtime. Since the logic of the actor is already available at each runtime, the actual program code of the actor is not transmitted. Instead, only the type of the actor and the current actor state are transmitted. When the destination runtime receives the data, it creates a new actor and restores the state. The ports are reconnected and finally an initialization function of the actor is called. After the successful actor creation, the destination runtime acknowledges this to the source runtime, which then acknowledges the success to the Remote Actor Manager.

The two migration strategies possible with CALVIN are described below:

A) Restarting the complete environment: This procedure is adopted when the CALVIN runtime is no longer under the control of the Remote Actor Manager. This may occur due to DDoS attacks on the hardware that hosts the SAU or when that hardware is physically damaged. In this case, the actors which were running on the affected device have to be re-created. After re-creating the actors, their ports have to be reconnected to the rest of the application. The reconnecting of the ports is supported by the REST-API of the runtime. However, since the old port connections of the faulty actor were not closed properly, disconnecting them does not work. Therefore, restarting individual actors is not possible. Instead, all the actors of the attacked environment have to be re-created.

B) Migrating the actors: This procedure is adopted when a device is currently under attack but is still manageable. In this scenario, the Remote Actor Manager sends commands to the CALVIN runtime using the CALVIN migration feature. The migration command must specify the actor to be migrated and its destination runtime. All actors currently being executed on the attacked runtime are migrated to a non-attacked destination runtime.

6.1.3.1.3 Configuration of CALVIN runtimes

The configuration of the CALVIN environments on the different hardware is shown in Figure 6.3.

[Figure 6.3 (diagram). Labels: Linux Server, Raspberry Pi 1, Raspberry Pi 2, CALVIN Runtime, Remote Actor Manager, IED App, Management Actor, State Estimation (SE), Setup, TCP-Client. Legend: measurement/control data; migration/restart messaging; actor management/heartbeat signalling; communication parameter setup]

Figure 6.3 - Configuration of CALVIN environments

The functions of the different actors are as follows:

Management Actor: This actor is responsible for detecting live actors within the CALVIN runtime where it is deployed. A heartbeat technique is used: the Management Actor sends periodic commands to all the other actors deployed in the runtime, and each connected actor replies with an ACK message. The Management Actor then informs the Remote Actor Manager of the connected and the disconnected actors. Additionally, any actor can send a disconnect message to the Management Actor. This message is sent when an actor is shut down properly or is about to migrate. The feature is needed to prevent false positives, since a migrating actor is unavailable for a short time and therefore might not be able to answer a heartbeat signal. The heartbeat signal is a JSON message, to comply with CALVIN’s requirements on token types.

TCP-Client: Represents the data interfacing layer of the PSAU. It is responsible for acquiring measurements from remote IEDs and sending control set points to them.

Grid Monitoring Actor & Grid Control Actor: These actors represent the different monitoring and control applications that run in the PSAU and are responsible for operating the grid in real time. For this study, it is a State Estimation (SE) function.

Setup: This actor is needed for the initial TCP connection configuration.

6.1.4 Implementation of the grid monitoring application with CALVIN

A specific type of monitoring application, the “State Estimator” actor, is implemented in CALVIN. The State Estimator actor implements power grid state estimation based on the Weighted Least Squares (WLS) method. For simplicity, the necessary information on the grid structure is stored in the code of the actor. It would be possible to separate this into another actor which fetches the information from a database.

The TCP-client actor simulates the communication interface of the PSAU, which is responsible for connecting to remote IEDs, acquiring measurements and forwarding them to the State Estimator actor. To emulate the remote IED, a TCP server (IED App) has been configured outside the CALVIN environment; the TCP-client connects to it and requests measurements at regular intervals. The TCP-client actor decodes the received floating-point values and forwards them through a CALVIN port to the State Estimator actor. The estimated states are then written back to the TCP server (IED App) using the TCP-client actor. This emulates the setting of the set points of the IED. If the TCP-client actor of our CALVIN application is migrated to another runtime, the actor closes the TCP connection and tries to connect to the server again after it has migrated.
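The WLS principle used by the State Estimator actor can be illustrated as follows. This is a minimal sketch and not the actor's actual code, which is not reproduced in this report. It assumes a linear measurement model z = Hx + e with independent measurement errors of standard deviation sigma_i, for which the WLS estimate solves (H^T W H) x = H^T W z with W = diag(1/sigma_i^2). State estimation on a real distribution grid is nonlinear and is typically solved iteratively. The function names and the example values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data stays untouched.
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("gain matrix is singular: state is not observable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def wls_estimate(h: Matrix, z: List[float], sigma: List[float]) -> List[float]:
    """Return the x that minimises sum(((z_i - h_i . x) / sigma_i) ** 2)."""
    m, n = len(z), len(h[0])
    w = [1.0 / (s * s) for s in sigma]  # weights: inverse error variances
    # Gain matrix G = H^T W H and right-hand side H^T W z.
    gain = [[sum(w[i] * h[i][r] * h[i][c] for i in range(m)) for c in range(n)]
            for r in range(n)]
    rhs = [sum(w[i] * h[i][r] * z[i] for i in range(m)) for r in range(n)]
    return solve_linear(gain, rhs)


if __name__ == "__main__":
    # Hypothetical example: two states, three measurements (x1, x2, x1 - x2).
    h = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
    z = [1.02, 0.49, 0.50]
    print(wls_estimate(h, z, sigma=[0.01, 0.01, 0.02]))
```

Measurements with a smaller sigma receive a larger weight, so the estimate leans towards the more accurate instruments.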
The Remote Actor Manager is responsible for migrating or restarting the different actors from one runtime environment to another.

6.1.5 Test case & results

The performance of the CALVIN framework, in terms of the total delay incurred when either migrating or restarting the actors, is presented in this section, after carefully considering the trade-offs pertaining to the achievable bandwidth and delay of the CALVIN communication ports.

Figure 6.4 - Delay characterization of actor migration


Figure 6.5 - Delay characterization of actor restart

Migration is performed when the attacked node is alive and responsive to the rest of the network. Restarting of actors is performed when a CALVIN node is not reachable, due to either physical damage or cyber-attacks. For this test the CALVIN environments have been configured as depicted in Figure 6.3, and the SE actor is migrated to a Raspberry Pi 3. The results for the migration and the restarting of actors are shown in Figure 6.4 and Figure 6.5 respectively. In Figure 6.4 the attacked runtime denotes the targeted runtime whose actors need to be migrated. Migration takes less time than restarting the actors. This is attributed to the amount of interaction between the attacked runtime (whose actors need to be migrated) and the destination runtimes while migrating the actors.

However, when migrating automation functions, due care should be taken of the migration latency introduced by the communication infrastructure, as shown in Figure 6.4 and Figure 6.5. Only automation functions whose update period is longer than this latency are good candidates for migration without any degradation of service. Care should also be taken in choosing the target hardware to which the automation functions are migrated as part of DV, considering the computational resources available, the bandwidth of the communication infrastructure, the latency introduced by the communication infrastructure and so on.
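The two decisions discussed above, which relocation strategy to use and which functions can be relocated without degrading service, can be sketched as follows. This is an illustrative sketch under stated assumptions, not part of the test setup: it reads the criterion as "the update period of a function must exceed the relocation latency", and all names and numeric values are hypothetical rather than taken from the measurements in Figure 6.4 and Figure 6.5.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Strategy(Enum):
    MIGRATE = "migrate"  # strategy B: runtime attacked but still manageable
    RESTART = "restart"  # strategy A: runtime unreachable, re-create its actors


def choose_strategy(runtime_reachable: bool) -> Strategy:
    """Migrate while the attacked runtime still answers, otherwise restart."""
    return Strategy.MIGRATE if runtime_reachable else Strategy.RESTART


@dataclass(frozen=True)
class AutomationFunction:
    name: str
    update_period_s: float  # time between two consecutive outputs


def migration_candidates(functions: List[AutomationFunction],
                         relocation_latency_s: float) -> List[AutomationFunction]:
    """Keep the functions whose update period exceeds the relocation latency."""
    return [f for f in functions if f.update_period_s > relocation_latency_s]


if __name__ == "__main__":
    # Hypothetical update periods and latency, for illustration only.
    functions = [
        AutomationFunction("state_estimation", update_period_s=1.0),
        AutomationFunction("fast_protection", update_period_s=0.02),
    ]
    print(choose_strategy(runtime_reachable=True))
    print([f.name for f in migration_candidates(functions, relocation_latency_s=0.5)])
```

In practice the latency argument would be the measured delay of the chosen strategy on the target hardware, since restarting and migrating differ in duration.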


7. Application of Double Virtualization on other critical infrastructures

Cyber-attacks on all kinds of critical infrastructures are on the increase and are a growing concern for organisations and governments across the globe. Power generation facilities, metropolitan traffic control systems, water treatment systems and factories have become targets of attackers and have recently been hit with an array of network breaches, data thefts and denial-of-service activities.

Critical infrastructure facilities (electricity, oil, gas, water, waste, etc.) rely heavily on electrical, mechanical, hydraulic and other types of equipment. This equipment is controlled and monitored by dedicated computer systems known as controllers and sensors. These systems are connected to management systems, together forming networks that leverage SCADA (Supervisory Control and Data Acquisition) and ICS (Industrial Control System) solutions. Both ICS and SCADA enable efficient collection and analysis of data and help automate the control of equipment such as pumps, valves and relays. The benefits that these systems provide have contributed to their wide adoption. Their ruggedness and stability have enabled critical infrastructure facilities to use ICS and SCADA solutions for long periods of time. However, since most critical infrastructures are nowadays controlled by SCADA systems, a SCADA malfunction can have a debilitating impact on the community and society.

While their implementation is often proprietary, SCADA controllers are essentially small computers. They use standard computer elements such as operating systems (often embedded Windows or Unix), software applications, accounts and logins, communication protocols, etc. In general, a SCADA system is responsible for collecting information, transferring it to the central site, carrying out any necessary analysis and control, and then displaying that information on the operator screens.
The required control actions are then passed back to the process [18]. Typically, SCADA systems include the following components [19]:

1. Instruments in the field or in a facility that sense conditions such as pH, temperature, pressure, power level and flow rate.
2. Operating equipment such as pumps, valves, conveyors and substation breakers that can be controlled by energizing actuators or relays.
3. Local processors that communicate with the site’s instruments and operating equipment. This includes the Programmable Logic Controller (PLC), Remote Terminal Unit (RTU), Intelligent Electronic Device (IED) and Process Automation Controller (PAC). A single local processor may be responsible for dozens of inputs from instruments and outputs to operating equipment.
4. Short range communications between the local processors and the instruments and operating equipment. These relatively short cables or wireless connections carry analog and discrete signals using electrical characteristics, such as voltage and current, or using other established industrial communications protocols.
5. Host computers that act as the central point of monitoring and control. The host computer is where a human operator can supervise the process, receive alarms, review data and exercise control.
6. Long range communications between the local processors and host computers. This communication typically covers miles using methods such as leased phone lines, satellite, microwave, frame relay and cellular packet data.

Looking at this list of components, items 1, 2, 3 and 5 are basically physical devices (computers, servers or controllers) running a software function. Such functions could be easily virtualized and integrated in CALVIN runtimes and thus benefit from DV. Additionally, given the general-purpose nature of Petri nets, the component criticality assessment proposed in Chapter 3 could also be carried out for other critical infrastructures.
The application of DV would reduce the chance of blackouts similar to the one caused in Ukraine in December 2015, where the hardware running the SCADA was compromised. Since a single dedicated piece of hardware hosts the SCADA application, compromising that hardware gives access to all automation functions and control of all actuators deployed in the grid. With DV, however, the SCADA functions could be distributed over different physical devices and eventually migrated at random to different hardware within a meshed network. This would ensure that the SCADA functions are fully distributed, so that compromising a single piece of hardware, or a limited set, would not compromise the complete SCADA. This increases the availability of the key SCADA functions at all times and supports a safe, reliable and blackout-resistant grid operation.
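The idea of randomly relocating functions within a meshed network can be sketched as follows. This is a minimal illustration under stated assumptions and not an implemented component of the project: the function and host names are hypothetical, every healthy host is assumed able to run every function, and the random generator is passed in explicitly so that the behaviour can be reproduced with a seed.

```python
import random
from typing import Dict, Iterable, Optional


def pick_destination(current: str, healthy: Iterable[str],
                     rng: random.Random) -> Optional[str]:
    """Pick a random healthy host other than the current one, if any exists."""
    candidates = sorted(h for h in healthy if h != current)
    return rng.choice(candidates) if candidates else None


def shuffle_placement(placement: Dict[str, str], healthy: Iterable[str],
                      rng: random.Random) -> Dict[str, str]:
    """Return a new function-to-host mapping with every function relocated."""
    hosts = list(healthy)
    new_placement = {}
    for function, host in placement.items():
        destination = pick_destination(host, hosts, rng)
        # Stay in place when no alternative healthy host is available.
        new_placement[function] = destination if destination is not None else host
    return new_placement


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    placement = {"state_estimation": "server", "service_restoration": "pi1"}
    print(shuffle_placement(placement, ["server", "pi1", "pi2"], rng))
```

A deployment would combine such a selection with the resource, bandwidth and latency considerations discussed in Section 6.1.5 before triggering the actual migration.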


8. Conclusion

In the future, in which information and communication infrastructure will play a key role in the functioning of power systems, resilience will be a key characteristic to be addressed when designing such cyber-physical systems. This document described how the innovative concept of Double Virtualization has been implemented in a laboratory environment at RWTH in Aachen, Germany. In the demonstration, Substation Automation Functions are relocated to improve the resilience of the distribution grid's operation during a failure of the physical computing infrastructure or when a cyber-attack has been detected. The CALVIN IoT framework is utilised to realise the concept and to investigate the performance of the virtual control functions. Furthermore, a framework and an algorithm for resilient placement of virtual process control functions within mobile edge clouds have been developed by KTH in Stockholm, Sweden, and extensive simulation results show that the proposed solution is efficient and can provide a significant cost reduction compared to a greedy approach.

From the analysis based on Petri nets it was found that the most critical component of the reference automation architecture is the Primary Substation Automation Unit, which hosts the key functions responsible for the monitoring, control and protection of the distribution grid. Therefore, Double Virtualization was applied to this component using CALVIN. The migration of an exemplary automation function, namely state estimation, was shown in this study; it could be extended to other automation functions such as congestion management and optimal service restoration.

Finally, this concept can be exploited to increase the reliability of any automation infrastructure that is responsible for operating different kinds of networks, for example water, gas or electricity networks, and large process industries such as petroleum refining. The approach could be exploited in any system in which field measurement data are collected and actuator control decisions are made. Furthermore, an ideal application of Double Virtualization in the operation of the power grid is to provide high availability of the Optimal Service Restoration (OSR) algorithm, since utilities pay significant penalties for customers’ downtime due to faults and unplanned outages. By making the OSR highly available, the loads in the power outage zone can be quickly re-energized using alternate feeding paths, thus reducing penalties.


9. References

[1] SUCCESS D4.3, “First Solution Architecture (SA) and Solution Description (SD), V3”, April 2018.
[2] SUCCESS D2.4, “Resiliency by Design Concept, V1”, April 2017.
[3] SUCCESS D1.2, “Identification of existing threats, V2”, April 2017.
[4] SUCCESS D4.6, “Description of available components for SW functions, infrastructure and related documentation, V3”, April 2018.
[5] IDE4L Project Page.
[6] IDE4L Deliverable D3.2, “Architecture design and implementation”.
[7] Rongfei, et al., “Dependability analysis of control center networks in smart grid using stochastic Petri nets”.
[8] A. Avizienis, J. Laprie, B. Randell, and C. Landwehr, “Basic Concepts and Taxonomy of Dependable and Secure Computing”, IEEE Trans. Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan.-Mar. 2004.
[9] K. Molly, “On the Integration of Delay and Throughput Measures in Distributed Processing Models”, PhD dissertation, Univ. of California, 1981.
[10] R. Zeng, Y. Jiang, C. Lin, X. Chu, and F. Liu, “Performance Analysis of Data Management in Sensor Data Storage via Stochastic Petri Nets”, Proc. IEEE Global Telecomm. Conf. (Globecom), pp. 1-5, 2010.
[11] SUCCESS D3.9, “Next Generation Smart Meter, V3”, April 2018.
[12] A. M. Geoffrion, “Generalized Benders Decomposition”, Journal of Optimization Theory and Applications, vol. 10, no. 4, pp. 237-260, 1972.
[13] P. Zhao and G. Dán, “Resilient Placement of Virtual Process Control Functions in Mobile Edge Clouds”, in Proc. of IFIP Networking, 2017.
[14] A. Sadu, L. Ostendorf, G. Lipari, F. Ponci, A. Monti, “Resilient design of distribution grid automation system with CALVIN”, in 2018 IEEE International Energy Conference (ENERGYCON), Limassol, 2018 (in press).
[15] P. Persson, O. Angelsmark, “CALVIN - Merging Cloud and IoT”, Procedia Computer Science, vol. 52, pp. 210-217, 2015.
[16] G. A. Agha, “Actors: A Model of Concurrent Computation in Distributed Systems”, PhD thesis, Massachusetts Institute of Technology, Cambridge, Artificial Intelligence Lab, 1985.
[17] P. Maymounkov and D. Mazieres, “Kademlia: A peer-to-peer information system based on the XOR metric”, in International Workshop on Peer-to-Peer Systems, Springer, 2002, pp. 53-65.
[18] D. Bailey, E. Wright, Practical SCADA for Industry. Oxford: Elsevier, 2003.
[19] A. Hildick-Smith, Security for Critical Infrastructure SCADA Systems. SANS Institute, 2005.


10. List of Abbreviations

B2B      Business to Business
BMS      Building Management System
CAPEX    CAPital EXpenditure
CENELEC  European Committee for Electrotechnical Standardization
CEP      Complex Event Processing
COTS     Commercial off-the-shelf
CPMS     Charge Point Management System
CSA      Cloud Security Alliance
EMS      Decentralised energy management system
DER      Distributed Energy Resources
DMS      Distribution Management System
DMTF     Distributed Management Taskforce
DSE      Domain Specific Enabler
EAC      Exploitation Activities Coordinator
ERP      Enterprise Resource Planning
ESB      Electricity Supply Board
ESCO     Energy Service Companies
ESO      European Standardisation Organisations
ETP      European Technology Platform
ETSI     European Telecommunications Standards Institute
GE       Generic Enabler
HEMS     Home Energy Management System
HV       High Voltage
I2ND     Interfaces to the Network and Devices
ICT      Information and Communication Technology
IEC      International Electrotechnical Commission
IoT      Internet of Things
KPI      Key Performance Indicator
LV       Low Voltage
M2M      Machine to Machine
MEC      Mobile Edge Computing
MPLS     Multiprotocol Label Switching
MV       Medium Voltage
NIST     National Institute of Standards and Technology
O&M      Operations and Maintenance
OPEX     OPerational EXpenditure
PM       Project Manager
PMT      Project Management Team
PPP      Public Private Partnership
QEG      Quality Evaluation Group
S3C      Service Capacity; Capability; Connectivity
SCADA    Supervisory Control and Data Acquisition
SDH      Synchronous Digital Hierarchy
SDN      Software Defined Networks
SDOs     Standards Development Organisations
SET      Strategic Energy Technology
SG-CG    Smart Grid Coordination Group
SGSG     Smart Grid Stakeholders Group
SME      Small & Medium Enterprise
SoA      State of the Art
SON      Self Organizing Network
SS       Secondary Substation
TL       Task Leader
TM       Technical Manager
VPF      Virtual Process Function
VPP      Virtual Power Plant
WP       Work Package
WPL      Work Package Leader

11. List of Figures

Figure 3.1 - Methodology for designing functional resiliency of DA
Figure 3.2 - IDE4L automation architecture for active network management
Figure 3.3 - IDE4L HLUCs schematic, with synthetic representation of information exchange
Figure 3.4 - A simple Petri Net
Figure 3.5 - CPN of MVRTM use-cases
Figure 3.6 - CPN of MVSE with MVSF
Figure 3.7 - CPN of MVPC
Figure 3.8 - CPN of CCPC
Figure 3.9 - CPN model of the overall automation architecture
Figure 3.10 - Automation components considered for GSPN
Figure 3.11 - Sensor/Actuator failure model
Figure 3.12 - IED failure model
Figure 3.13 - Failure model of PSAU
Figure 3.14 - DMS failure model
Figure 3.15 - GSPN model of automation system failure model without Double Virtualization
Figure 3.16 - GSPN model of Distribution Automation system failure model with Double Virtualization
Figure 3.17 - a) Availability of key automation functions under sensor/actuator failure. b) Availability of key automation functions under IED failure. c) Availability of key automation functions under PSAU failure. d) Availability of key automation functions under DMS failure
Figure 4.1 - Overall Structure of Resilient VPF Placement Algorithm
Figure 4.2 - Total cost vs. the number of VPFs for a system of 15 MEC nodes
Figure 5.1 - SUCCESS Security Solution interfaces
Figure 6.1 - (a) The distributed execution of the application; (b) The software stack of CALVIN
Figure 6.2 - System Setup
Figure 6.3 - Configuration of CALVIN environments
Figure 6.4 - Delay characterization of actor migration
Figure 6.5 - Delay characterization of actor restart

12. List of Tables

Table 2.1 - Threats where Double Virtualization can be applied as countermeasure
Table 3.1 - IDE4L PUCs grouped by HLUCs and Clusters
Table 3.2 - Average token number per place
Table 3.3 - Mean frequency of failures
Table 6.1 - Hardware specifications
