Post on 30-Mar-2015
EarthLink and Micromuse:
Growing up Together
Doug McClure
EarthLink Operations
Sr. Manager, Fault and Performance Mgmt
June 3, 2004
2
Fault & Performance Mgmt Overview
• One of the Nation’s Largest ISPs
• Headquarters in Atlanta, GA
– Key facilities in Dallas, TX; Pasadena and San Jose, CA; Knoxville, TN; and Seattle, WA
• Profitable, strong balance sheet
• Largest DSL footprint
• First-to-market with products that provide the best possible Internet experience
• Customer Advocacy: Fighting SPAM with technical solutions, litigation, legislative support, industry collaboration and consumer education
– Howard Carmack, aka the "Buffalo Spammer," was sentenced to 3-1/2 to 7 years in prison on May 27th after EarthLink received a $16.4M civil judgment in May 2003
• 10th Anniversary (1994-2004) – http://www.redefineyourworld.com
3
Fault & Performance Mgmt Overview
5.25M Customers
• ~4M Dialup (Premium ~3.5M, Value ~500K)
• ~1.2M Broadband (Cable, xDSL)
• ~160K Web Hosting (Unix, Windows)
• ~50K Wireless (Blackberry, PDA, Laptops, Wi-Fi)
Dial Access Coverage > 90% of US Population
• ~16K Local Dial Access Numbers
• ~500K Active Modem Ports (~50% ELNK, ~50% Outsourced)
• ~400 PoPs (18 Core Backbone PoPs, four data centers)
Broadband Coverage
• ~200 Markets with Broadband Offerings
Large and Diverse Infrastructure
• 2300 Network Elements
• 1500 Server Elements
• Thousands of Access Circuits, Hundreds of WAN Circuits
4
Fault & Performance Mgmt Overview
Access Technology Innovation
• Premium and Value Dial-up
• Broadband (Cable, xDSL, Satellite)
• Voice (Converged Devices, VoIP)
• Wireless (WiFi, CDMA, Blackberry, PDA)
• Broadband over Power Lines (BPL)
Value Added Service and Product Innovation
• Blocker Family: spamBlocker, POP-UP Blocker, ScamBlocker, Virus Blocker, Spyware Blocker
• Parental Controls
• Webmail
• Web Accelerator
5
Fault & Performance Mgmt Overview
Exceptional Customer Service
• 2003 PC Magazine Readers' Choice Awards for both high-speed and dial-up services
• 2003 highest ranking in customer satisfaction for the second year in a row for high-speed Internet service by J.D. Power and Associates in its Internet Service Provider Residential Customer Satisfaction Study℠
• 2003 CNET Editors' Choice award
6
Fault & Performance Mgmt Innovation = Constant Change
Drivers
• Speed to Market, Competition – Do more, faster
• Quality, Performance, Support Costs
• Compliance – Sarbanes-Oxley
Operational Challenges
• Release Management
• Change Management
• Service Level Management
7
Fault & Performance Mgmt Operations Maturity: Growing Up
Production Improvement Program (PIP)
• Foundation in IT Service Management, ITIL, CobIT
• Focusing on four main areas: Service Level Mgmt, Change Mgmt, Release Mgmt, and Production Security
– Over 10% of Operations staff have now attended ITIL Foundation Training
• 1 Master Level Certified (more planned)
• 9 Practitioner Level Trained in CCR Quadrant (pending certification results)
• 114 Foundation Level Trained (most pending certification results)
8
Fault & Performance Mgmt Operations Maturity: Growing Up
Service Level Management
• NOC, Help Desk
• Set and manage expectations internal/external to Operations
Change Management
• Provide oversight and control of the production environment
• Minimize risk and impact from change activities
Release Management
• Development Operations
• Minimize poor quality production releases
Enterprise Security
• Compliance, control, audit
9
Fault & Performance Mgmt EarthLink and Micromuse Facts
Very Early Netcool Adopter
• EarthLink (Mindspring) was Micromuse’s first US customer
– Began evaluating Micromuse Netcool in 1996, official customer April 1997
Early Innovation
• Early joint innovation and development helped build foundation for many of Micromuse’s key products
– EarthLink and Micromuse are revitalizing joint development projects with emerging service and business activity monitoring products
Driving 3rd Party Vendor Integration & Partnerships
• EarthLink requires detailed integration with Micromuse suite – much more than just “sending SNMP TRAPs”
– Quest Software, Compuware, PeopleSoft, Remedy, Cisco Systems, Arbor Networks
Current Deployment
• Netcool OMNIbus, Internet Service Monitors, Desktop Clients, Webtop, Impact, numerous Gateways, Probes, Data Source Adaptors
– Two Senior System Engineers, Three System Engineers, Two System Analysts devoted to Fault and Performance Management (Netcool + Other)
• Services provided for NOC (3 shifts, 6 per shift), Systems Administration (3 shifts, 10 per shift), Network Engineering
10
Fault & Performance Mgmt Moving Beyond “MoM” and Apple Pie
EarthLink’s Early Micromuse Netcool Deployment
• Focused on Netcool as the “Manager of Managers” or “MoM”
• Needed during EarthLink’s rapid growth and expansion
• Enabled event management, eliminated “swivel chair NOC”
“Apple Pie” is Event Correlation and Deduplication
• The Netcool sweet spot was providing EarthLink with event correlation and deduplication
– Able to reduce the event stream from 100,000s to 1,000s per week
– Further reduction expected to 100s per week through use of advanced Netcool/Impact policies and deployment of Netcool/Precision
• Enables NOC and support staffs to operate efficiently
Focus now on End-to-End Service Management
• Netcool Suite allows EarthLink to manage entire service
– Understand service relationships, service levels, perform service modeling and service discovery
• Enables impact assessment, prioritization, understanding service delivery chain
• Eliminates “needle in the haystack” approach of event management
– This is the problem that needs attention now (compared to “I think this is the event causing problems”)
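The deduplication idea on this slide can be illustrated with a small Python sketch. It is not Netcool code and does not use the ObjectServer schema: the Event fields, the (node, summary) key, and the tally counter are assumptions chosen only to show how repeated events collapse into one row with a running count.

```python
from dataclasses import dataclass


@dataclass
class Event:
    node: str       # device or server that raised the event (assumed field)
    summary: str    # short description of the problem (assumed field)
    severity: int   # higher means more severe
    timestamp: int  # seconds since epoch


def deduplicate(events):
    """Collapse repeated events into one row per (node, summary) key."""
    rows = {}
    for ev in events:
        key = (ev.node, ev.summary)  # hypothetical choice of identifier
        row = rows.get(key)
        if row is None:
            # First occurrence: create the row with a tally of 1.
            rows[key] = {
                "node": ev.node,
                "summary": ev.summary,
                "severity": ev.severity,
                "first_seen": ev.timestamp,
                "last_seen": ev.timestamp,
                "tally": 1,
            }
        else:
            # Repeat occurrence: bump the tally instead of adding a row.
            row["tally"] += 1
            row["severity"] = max(row["severity"], ev.severity)
            row["first_seen"] = min(row["first_seen"], ev.timestamp)
            row["last_seen"] = max(row["last_seen"], ev.timestamp)
    return list(rows.values())


raw_events = [
    Event("router-1", "Link down", 5, 100),
    Event("router-1", "Link down", 5, 160),
    Event("router-1", "Link down", 5, 220),
    Event("mail-3", "POP3 timeout", 3, 130),
]
deduped = deduplicate(raw_events)
```

Here four raw events become two rows, and the operator sees one "Link down" row with a tally of 3 rather than three separate alarms.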
11
Fault & Performance Mgmt Service Management Complexity
[Diagram: end-to-end service architecture. Layers shown: Client Applications (any web browser, Palm client, other clients), Presentation Layer, Application Services Layer, Core Services Layer, Infrastructure Layer. Servers labeled S80 through S112, plus Storage and Tickets systems, are linked by HTML, IMAP, POP3, SMTP, and API 1 to API 7 interfaces, with connections to other systems. Callouts: “Good Customer Experience?”, “Performance?”, “Infrastructure Events to Netcool”.]
Source: EarthLink Product Group
12
Fault & Performance Mgmt Service Management Complexity
[Chart: Number of Components vs. Time (24x7x365), annotated with System Changes and Infrastructure Events]
Identify key service elements
Instrument those elements
Consolidate & analyze data
Develop service model and SLAs
Dealing with EarthLink Service Complexity:
• The complexity and amount of data generated from end-to-end service management are enormous
• Networks, Firewalls, Servers, Applications, Switches, Routers, Load Balancers, Databases, etc.
• Netcool/ObjectServer is a must-have for EarthLink to effectively manage and understand its service event stream from end to end
• Impact 3.0’s cluster capability will enable EarthLink to analyze, enrich, suppress, and manage the event stream regardless of our growth
Source: EarthLink Product Group
• RAD (future), Impact, Precision (future)
• ISM, System Agents, SNMP
• ObjectServer, RAD (future), Impact
• RAD (future), Impact, ISM
13
Fault & Performance Mgmt The Customer IS Important
Customer Experience Monitoring and Management
• The Micromuse Netcool Suite enables proactive, real-time monitoring of the customer’s experience for core EarthLink services
– Over 14K Internet Service Monitors (ISM) instances in operation covering all key services (HTTP, HTTPS, SMTP, POP3, IMAP) and dedicated customers (ICMP)
• Allows for customer experience monitoring information to be correlated, analyzed, and presented in real-time
– Micromuse Netcool/ISMs, Keynote, Compuware Client Vantage, Quest Foglight
– External/internal synthetic testing, system and network element monitoring, and system and network port monitoring
• Immediate notification to support groups when customer’s experience degrades
14
Fault & Performance Mgmt The Business IS Important
Business Activity Monitoring and Management
• Expands IT Operations visibility vertically and horizontally
• Ties IT Operations data and Business data together
– System Downtime vs. Contact Center Call Volume
– Real-Time Customer Subscriptions vs. Sales Forecasts
• Enables Real Time Monitoring and Management of Business and IT processes
– Change and Downtime Management
– Customer Registration Management
15
Fault & Performance Mgmt Production Improvement Program
[Process diagram with three lanes plus Metrics & Reporting:
Release Mgt: Release Planning; Dev / Procurement; Release Design, Build; Release Acceptance; Roll-out Planning; Comm, Prep, Training; Distribution / Installation
Prod Sec: Policy, Procedures, Standards & Guidelines; Security Consulting; Security Assessment; Security Monitoring; Security Test & Sign off
Change Mgt: Request for Change (RFC) from Corp Project, Ops Project, or Non-Project work; Status Change (1) Prioritization, Risk Assessment and Forward Schedule of Change; Status Change (2) Change Approval and Proj. Service Availability; Status Change (3) Final Change Approval and Implementation; Status Change (4) Review Changes; Closed RFC]
Mutual Benefit from EarthLink’s Innovation and Advanced Use of Micromuse Products
Micromuse OMNIbus, Impact, Webtop, RAD, NFSM
Source: EarthLink SLM Group
16
Fault & Performance Mgmt Business Activity Monitoring
Managing the Impact of Change and Downtime Activities on the Business and Operations
17
Fault & Performance Mgmt Overview
Drivers
• Adoption of ITIL/COBIT Best Practices for Change Management
– Production Improvement Program (PIP), SOX Compliance, etc.
– Significant change for many groups – Fear, Uncertainty, Doubt (FUD)
• No Real-Time Visibility into Change/Downtime Management Activities
– Business Process
• Who, What, When, Where, Why, and How, Cost, Risk, and Impact
– Workflow – Monitor Lifecycle, SLAs, Bottlenecks – Is the process enabling Operations or is it a bottleneck?
– Impact on Infrastructure – False Positives, Contact Center Call Volume (COGS)
• Drive out False Positives from Production Monitoring Systems
– Huge burden on NOC and other support staff
• Desire to have Automated Remedy Trouble Ticket Creation
– Reduce time to address problems, reduce MTTR
18
Fault & Performance Mgmt Overview
Solution
• Provide Real-Time Visibility into Change/Downtime Process
– There are 12 pending and 24 scheduled change requests for tonight, 6 are underway and 8 start in 15 minutes or less
• Create Actionable Information
– Dept. 828 has five outstanding major change requests, attention is needed
• Ensure Business Rules are Guiding/Enabling the Process – Not Hindering It
– Eliminate FUD
• Report (dashboards, reports) on Process and Impact
– NOC and other support groups know what’s happening during change and downtime windows
– Management has oversight and visibility
– Business understands impact of change and downtime activity
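The kind of real-time summary quoted above ("12 pending and 24 scheduled ... 8 start in 15 minutes or less") could be derived along the following lines. This is only a sketch in Python under stated assumptions: the request records, the status names, and the summarize_changes helper are hypothetical and do not reflect EarthLink's RFC system or any Netcool API.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_changes(requests, now, soon=timedelta(minutes=15)):
    """Count change requests per status and those starting within `soon`."""
    by_status = Counter(r["status"] for r in requests)
    starting_soon = sum(
        1
        for r in requests
        if r["status"] == "scheduled" and now <= r["start"] <= now + soon
    )
    return {
        "pending": by_status["pending"],
        "scheduled": by_status["scheduled"],
        "underway": by_status["underway"],
        "starting_soon": starting_soon,
    }


# Hypothetical sample data; IDs and times are made up for illustration.
now = datetime(2004, 6, 3, 22, 0)
requests = [
    {"id": "RFC-1", "status": "pending", "start": now + timedelta(hours=3)},
    {"id": "RFC-2", "status": "scheduled", "start": now + timedelta(minutes=10)},
    {"id": "RFC-3", "status": "scheduled", "start": now + timedelta(hours=2)},
    {"id": "RFC-4", "status": "underway", "start": now - timedelta(minutes=30)},
]
summary = summarize_changes(requests, now)
```

A dashboard would re-run a summary like this whenever a request changes state, so the counts stay current through the change window.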
19
Fault & Performance Mgmt Implementation
• Micromuse Netcool/OMNIbus
– Custom integration with Request for Change (RFC) and Downtime Management System
– ObjectServer flexibility allows for definition of important business and IT data in each event to capture Change/Downtime Status
• Service Impact, Business Impact, Customer Impact, SLA, Restoral Priority, Escalation Path, etc.
• Micromuse Netcool/Impact 3.0
– Impact policies build lists in real time for all nodes listed in change/downtime request
– As change/downtime activity progresses through its lifecycle, the change/downtime Netcool event changes states
– Change/Downtime event suppression policy updates all incoming events that match node list during the maintenance window with “Suppression Status” and “Change/Downtime Reference Number”
• Micromuse Netcool/Webtop 1.2 – RAD 2.0
– Process owner (Change/Downtime Management Group) dashboard for monitoring and managing the overall end-to-end process, workflow, and business impact
– Business group dashboards for monitoring change/downtime activities within area of control (Network Engineering, MIS, etc.)
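The suppression behaviour described for the Impact policies can be sketched in Python as follows. This is an illustration of the logic only, not Netcool/Impact policy code: the ChangeWindow structure, the dictionary-based events, and the field names suppression_status and change_reference are assumptions standing in for the “Suppression Status” and “Change/Downtime Reference Number” fields named on the slide.

```python
from dataclasses import dataclass


@dataclass
class ChangeWindow:
    reference: str  # change/downtime reference number (hypothetical format)
    nodes: set      # nodes listed in the change/downtime request
    start: int      # window start, seconds since epoch
    end: int        # window end, seconds since epoch


def apply_suppression(event, windows):
    """Tag an incoming event if its node is inside an active window."""
    tagged = dict(event)  # copy so the caller's event is left untouched
    for w in windows:
        if event["node"] in w.nodes and w.start <= event["timestamp"] <= w.end:
            tagged["suppression_status"] = "suppressed"
            tagged["change_reference"] = w.reference
            return tagged
    # No matching window: the event stays visible to the NOC.
    tagged["suppression_status"] = "active"
    tagged["change_reference"] = None
    return tagged


windows = [ChangeWindow("CHG-0001", {"router-1", "router-2"}, 1000, 2000)]
in_window = apply_suppression({"node": "router-1", "timestamp": 1500}, windows)
outside = apply_suppression({"node": "router-1", "timestamp": 2500}, windows)
other_node = apply_suppression({"node": "mail-3", "timestamp": 1500}, windows)
```

Tagging rather than dropping the event keeps the record available for reporting, while the reference number ties each suppressed alarm back to the change that caused it.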
20
Fault & Performance Mgmt Webtop 1.2 Presentation
21
Fault & Performance Mgmt RAD 2.0 Presentation
22
Fault & Performance Mgmt Netcool Event Management
Change/Downtime Request Events
Suppressed Change/Downtime Activity Events
Change / Downtime Status
Event Suppressed by Change / Downtime
Change / Downtime ID
23
Fault & Performance Mgmt Future Enhancements
Planned Netcool/Impact Policies
• COGS Impact
– Assess support cost impact due to change and downtime activities within Operations and Customer Support in Real-Time
• Data Gap Management
– A common question: Why does my chart or graph have gaps?
– The solution: Annotate graphs, charts, portals, etc. with the reason for data gaps caused by planned change/downtime activities
– How: Integrate change and downtime event information with all performance, utilization, and capacity monitoring solutions via Impact 3.0
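A minimal Python sketch of the data gap idea: find holes in a sample series and label each with the change/downtime window that overlaps it. The function names, the (reference, start, end) window tuples, and the 300-second polling interval are assumptions for illustration; the slide does not describe how the planned Impact 3.0 integration would be built.

```python
def find_gaps(timestamps, interval):
    """Return (gap_start, gap_end) pairs where samples are missing."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval:
            gaps.append((prev, cur))
    return gaps


def annotate_gaps(gaps, windows):
    """Attach the reference of the first change window overlapping each gap."""
    annotated = []
    for gap_start, gap_end in gaps:
        reason = None
        for ref, win_start, win_end in windows:
            # Two ranges overlap when each starts before the other ends.
            if win_start < gap_end and gap_start < win_end:
                reason = ref
                break
        annotated.append({"start": gap_start, "end": gap_end, "reason": reason})
    return annotated


# Hypothetical samples polled every 300 seconds, with two holes.
samples = [0, 300, 600, 1800, 2100, 3300]
windows = [("CHG-0001", 700, 1700)]
annotations = annotate_gaps(find_gaps(samples, interval=300), windows)
```

In this example the first gap is explained by the planned change, while the second has no matching window and would be left unannotated for someone to investigate.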
24
Fault & Performance Mgmt Business Activity Monitoring
EarthLink Customer Registration, Provisioning, and Fulfillment Dashboards
25
Fault & Performance Mgmt RAD 2.0 Joint Development
Business Activity Monitoring: Real-Time Customer Registration Dashboard
26
Fault & Performance Mgmt RAD 2.0 Joint Development
Business Activity Monitoring: Real-Time Customer Registration Dashboard
27
Fault & Performance Mgmt Continuous Improvement
Building better Network and Systems Management
• Founded Atlanta Network and Systems Management Technical User Group (ANSMTUG) in January 2004
– http://www.ansmtug.org
– Metro-Atlanta Fortune 100, Service Providers, Enterprise, Media, and Emerging Technology Companies
• Bell South, The Home Depot, EarthLink, Southern Company, N2 Broadband, eDeltacom, Delta, CNN, Cingular, E*Trade, Knology Broadband, Cox Communications
• Customers helping Customers
– Use Micromuse and other NSM products better
– Collectively drive product requirements and features into Micromuse and other NSM vendors
• Special Interest Groups (SIG) Forming
– Best practices for NSM using Micromuse Netcool Suite
– Aligning NSM solutions to ITIL, MOF, CobIT, etc.
28
Fault & Performance Mgmt Challenges facing Micromuse
• Product Development, Focus, and Release Cycle
– Business * Monitoring (BAM, BSM, BI, BTI, B-I-N-G-O)
– Performance Monitoring & Management Solution
– Features vs. New Product – Finding the Right Balance
– Licensing – Needs Review and Simpler Approach
– Support New Technologies Sooner Across Core Products
– Uniform Release Cycle (core architecture components and capabilities)
• Discovery, Root Cause Analysis (RCA), Next-Gen Polling
– Emerging Competition
– Service / Application Discovery & RCA
– Universal Poller Concept
• Out of the Box Functionality and Updates
– Appearance of Requiring Too Much Customization
• Competition is focusing on this
• Many customers have product still on the shelf
– Ease of Use
• More out of the box, templates, examples, plug and play, wizards
• Tools and Utilities section on Support website is a start
– Improving Documentation
29
Fault & Performance Mgmt Closing and Q&A
Closing
Q&A
Doug McClure
Sr. Manager, Fault and Performance Mgmt
EarthLink Operations
dmcclure@corp.earthlink.net
404-748-7665 (W)
678-362-7712 (C)