When Technology Falters: The CareGroup Network Outage John D. Halamka MD CIO, CareGroup CIO, Harvard...

Date post: 17-Dec-2015
Page 1: When Technology Falters: The CareGroup Network Outage John D. Halamka MD CIO, CareGroup CIO, Harvard Medical School.

When Technology Falters: The CareGroup Network Outage

John D. Halamka MD

CIO, CareGroup

CIO, Harvard Medical School

Agenda

In depth overview of the Network Outage

Key Lessons

The Sequel – SQL Slammer

Questions and Answers

CareGroup Network as Built

[Figure: network topology diagram. Recoverable labels: sites Renaissance Park, East Campus, and West Campus; named switches switch-rca, switch-rcb, switch-rcc, switch-ccell118, switch-rob05, switch-ly030, switch-br203, switch-spg06a, switch-spg06b, and switch-ccw00m4 (5500 and 5509 chassis), with counts of attached 5505, 5500, 5509, and 3500 XL switches listed per building. Link legend: ATM OC-3 (155 Mbps) over SONET; ATM OC-3 (155 Mbps) over dark fiber; Fast EtherChannel (400 Mbps); Fast EtherChannel (800 Mbps); not active.]

Timeline

November 13, 2002 1:45pm
– Napster-like internal attack
– Change begins, redundant links cut
– Callisma and Cisco on site

November 14, 2002
– Spanning tree issues
– WAN issues
– CAP declared at 4:00pm

Core Switch Utilization

Timeline

November 15, 2002
– PACS rebuild
– Research/Cardiology rebuild
– Reboot of core and distribution layer

November 16, 2002
– VLAN mismatch
– Redundant core built as contingency

Core Switch Utilization

Root Cause Analysis

CareGroup Network grew organically by merger and acquisition into a massive bridged switched network which was not within Spanning Tree spec

Equipment was not life cycle managed

Router/switch configuration was not in accordance with best practices, e.g. multicast dense mode

Spanning Tree Problems

When TAC was first able to access and assess the network, we found the Layer 2 structure of the network to be unstable and out of specification with 802.1D standards. In some locations the management VLAN (VLAN 1) was 10 Layer 2 hops from the root.

The conservative default values for the Spanning Tree Protocol (STP) impose a maximum network diameter of seven. This means that two distinct bridges in the network should not be more than seven hops away from each other.
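The hop-count check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling TAC used: the function names, the link-list input format, and the daisy-chain example are all assumptions, and "hops" is counted as links traversed, following the slide's wording.

```python
from collections import deque

# Limit quoted on the slide for default STP timer values.
STP_DEFAULT_MAX_HOPS = 7


def hops_from(start, links):
    """Breadth-first search: Layer 2 hop count from `start` to each reachable bridge."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        bridge = queue.popleft()
        for nxt in neighbors.get(bridge, ()):
            if nxt not in hops:
                hops[nxt] = hops[bridge] + 1
                queue.append(nxt)
    return hops


def bridges_beyond_limit(root, links, limit=STP_DEFAULT_MAX_HOPS):
    """Return {bridge: hops} for every bridge more than `limit` hops from `root`."""
    return {b: h for b, h in hops_from(root, links).items() if h > limit}


# Hypothetical daisy chain of 11 switches: sw0 - sw1 - ... - sw10.
chain = [(f"sw{i}", f"sw{i + 1}") for i in range(10)]
print(bridges_beyond_limit("sw0", chain))  # {'sw8': 8, 'sw9': 9, 'sw10': 10}
```

Running the same search from every bridge, rather than only from the root, would give the full bridge-to-bridge diameter that the seven-hop rule refers to.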

Key Lessons

Partner with your network vendor
– Encourage external audits of your network
– Engage advanced engineering services
– Avoid senior management blind spots

Key Lessons

Avoid flat topology bridged switched networks.

Best Practice                                   CareGroup Network

One VLAN per subnet per physical switch         VLANs span many physical switches

Limited or no bridging                          Extensive use of bridging

Layer 2 switching limited to access layer       Layer 2 switching extended across core
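One way to audit the first row of this comparison is to list every VLAN that appears on more than one switch. The sketch below assumes a hypothetical VLAN-to-switch mapping has already been collected from the switch configurations; the function name, the input format, and the switch names are illustrative only.

```python
def vlans_spanning_switches(vlan_membership, max_switches=1):
    """Return {vlan_id: sorted switch names} for VLANs found on more than `max_switches`."""
    flagged = {}
    for vlan, switches in vlan_membership.items():
        unique = sorted(set(switches))
        if len(unique) > max_switches:
            flagged[vlan] = unique
    return flagged


# Hypothetical inventory: VLAN 1 is stretched across three switches.
inventory = {
    1: ["access-a", "access-b", "core-1"],
    20: ["access-a"],
    30: ["access-b"],
}
print(vlans_spanning_switches(inventory))  # {1: ['access-a', 'access-b', 'core-1']}
```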

Key Lessons

Re-evaluate the enterprise architecture of your network
– Routed core
– Switched distribution and access layers
– Robust firewall

Key Lessons

Life cycle manage your network
– Eliminate legacy protocols
– Recognize the value of new feature sets
– Hardware must keep up with the demands of a changing organization: video over IP, IP telephony, bioinformatics, image management

Key Lessons

Implement appropriate monitoring and diagnostic tools to maintain the health and hygiene of your network
– Concord
– NATKit
– CiscoWorks
– OpenView

Key Lessons

Have a robust downtime plan
– Out of band diagnostics
– Dial up modems and computers in key clinical areas
– Overview of CareGroup Disaster Recovery plan

Service Objectives

Protection Features

Protection features

Protection Techniques Cost versus Benefit

Protection Techniques by Vulnerability

Key Lessons

Implement Strict Change Control
– Standards, configurations, devices, protocols, links, processes, procedures, or services
– Prior review and approval of all network infrastructure changes
– Multi-discipline membership
– Changes classed as substantial, moderate, or minimal impact

Key Lessons

Implement Strict Change Control (cont)
– Substantial changes require Cisco AES review
– Changes scheduled 2am to 5am weekends
– Changes require baseline, testing, and recovery plans
– As-Built documentation to include overall, physical and logical diagrams
– NCCB recommends expense allocation
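The change-control rules on these two slides lend themselves to an automated pre-check. The sketch below is a hypothetical illustration, not CareGroup's actual process: the `ChangeRequest` fields, the function names, and the example requests are assumptions, and the window test treats "weekends" as Saturday and Sunday.

```python
from dataclasses import dataclass
from datetime import datetime

IMPACT_CLASSES = ("substantial", "moderate", "minimal")


@dataclass
class ChangeRequest:
    summary: str
    impact: str            # one of IMPACT_CLASSES
    start: datetime        # scheduled start time
    has_baseline: bool = False
    has_test_plan: bool = False
    has_recovery_plan: bool = False
    aes_reviewed: bool = False


def in_change_window(start):
    """True if `start` falls on a Saturday or Sunday between 2am and 5am."""
    return start.weekday() >= 5 and 2 <= start.hour < 5


def review(change):
    """Return a list of rule violations; an empty list means the request passes."""
    problems = []
    if change.impact not in IMPACT_CLASSES:
        problems.append("impact must be substantial, moderate, or minimal")
    if change.impact == "substantial" and not change.aes_reviewed:
        problems.append("substantial change needs Cisco AES review")
    if not in_change_window(change.start):
        problems.append("start is outside the 2am-5am weekend window")
    required = (
        ("baseline", change.has_baseline),
        ("test plan", change.has_test_plan),
        ("recovery plan", change.has_recovery_plan),
    )
    for label, present in required:
        if not present:
            problems.append(f"missing {label}")
    return problems


# Hypothetical request: a substantial change on a Tuesday afternoon.
risky = ChangeRequest("Rebuild core", "substantial",
                      datetime(2003, 3, 4, 14, 0), has_baseline=True)
for problem in review(risky):
    print(problem)
```

A request that passes every rule returns an empty list, so a review board could gate scheduling on `not review(change)`.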

The Sequel – SQL Slammer

Released at 12:30am on January 25, 2003

Infected East Coast at 12:40am

Microsoft SQL Server 2000 was patched; however, Microsoft did not issue any patches or security warnings on Microsoft Data Engine 2000 (MSDE), which is included with numerous desktop products

Spread of the Worm

Exact effect on CareGroup

MSDE and non-IS maintained databases infected

Network saturated by worm activity

Shut off links to Research areas

Blocked all traffic from the public internet

Network traffic levels returned to normal

Cleanup

Restart of servers and desktops that were disrupted by the outage

Once all research areas had cleaned desktops, we restored port 1433 connectivity

Further Lessons learned

VPN as a security risk

Implement a scanning program to analyze research desktop and server vulnerabilities

Ensure you have modern network equipment that affords you the tools to control intra-VLAN traffic
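As a starting point for the scanning program mentioned above, the sketch below checks which hosts accept TCP connections on a given port, defaulting to the port 1433 named on the Cleanup slide. It is a reachability check only and an assumption about how such a scan might begin, not a vulnerability assessment; the function names and host list are hypothetical.

```python
import socket


def is_tcp_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def hosts_listening(hosts, port=1433, timeout=1.0):
    """Return the subset of `hosts` that accept connections on `port`."""
    return [h for h in hosts if is_tcp_port_open(h, port, timeout)]


# Hypothetical usage against machines you administer:
# print(hosts_listening(["127.0.0.1"], port=1433))
```

Any host this reports would still need its installed software and patch level checked by other means.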

Conclusions

Lifecycle manage your network just as you would your desktop

Ensure senior management understands the value of the network as a strategic asset

Build great downtime procedures including out of band connectivity just in case the technology falters

