Reactive Cloud Security | AWS Public Sector Summit 2016

Post on 21-Jan-2018

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ben Hagen, Cloud Security Operations @ Netflix

June 21, 2016

Reactive Cloud Security

Toward Self-Defending Cloud Environments

Introductions, because they matter.

Me

● Bachelor’s in Political Science, International Studies, Minor in Mandarin Chinese

● Master’s in Information Assurance

● Security Operations Center at Motorola

● Consulting at Motorola and Neohapsis

● Security at Obama 2012

● Security Operations at Netflix

Netflix

● 81+ million members

● Supporting 1,000+ device types

● Available in every* country

● Concurrent delivery from 3 global regions

● > 1/3 of all US broadband

● 1,000+ developers / 1,000s of applications

● A very large monthly AWS bill

● High elasticity

Netflix

● Application owners “own” their own DevOps

● Immutable server pattern

● Everything scales

● The average TTL of an instance is < 3 days

Security @ Netflix

● A paved road

● Enablers not blockers

● Application owners “own” their security; Security teams help them make the right choices

● ❤️❤️ Self-service, automation, and architecture ❤️❤️

Let’s talk about reactive cloud security

The old model

● A network firewall blocks traffic

● An intrusion prevention system blocks traffic

● A web application firewall blocks traffic

● Authentication/authorization blocks access

Block, block, block, block ...

We can do better.

What is Reactive Cloud Security?

● Environments should be architected for change

● Security models should understand and leverage these changes

● Reactive Cloud Security should ...

• Understand the context of events within your environment

• Automatically adapt the environment based on security conditions

That sounds great. What are some examples?

Environmental changes

● Scale an Auto Scaling group

● Modify security groups

● Adjust AWS Identity and Access Management (IAM) object privileges

● Turn on/off logging

● Isolate a system

● Tag a system

● Redeploy a system

● Shift traffic

● ...
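
Two of these changes, isolating and tagging a system, might be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes a boto3 EC2 client is passed in, and the function name, the quarantine security group, and the tag key are all hypothetical.

```python
def isolate_instance(ec2, instance_id, quarantine_sg_id, reason):
    """Move an instance into a quarantine security group and tag it.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    `quarantine_sg_id` is a hypothetical security group with no rules.
    """
    # Replace every security group on the instance with the quarantine group.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[quarantine_sg_id],
    )
    # Tag the instance so people and other automation can see why it moved.
    # The tag key is an assumption made for this sketch.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "security:isolated", "Value": reason}],
    )
    return {"instance": instance_id, "isolated_to": quarantine_sg_id}
```

Passing the client in, rather than creating it inside the function, keeps the reaction easy to exercise against a stub before it is pointed at a real account.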

OK. I get it. But how does it work?

The easy stuff: binary conditions

● There are things about your environment which should never change

● AWS CloudTrail should always be on

● Administrators should always have high privileges

● External traffic should only be terminated on Elastic Load Balancing load balancers

● SSL certificates should always be valid

● ...
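
Binary conditions like these lend themselves to simple pass/fail checks. The sketch below evaluates three of them against a snapshot of the environment. The snapshot layout (its keys and field names) is an assumption invented for illustration; in practice the data would come from whatever inventory you already collect.

```python
from datetime import datetime, timezone


def check_binary_conditions(snapshot, now=None):
    """Return the invariants violated by a (hypothetical) snapshot dict."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # CloudTrail should always be on.
    for trail in snapshot.get("trails", []):
        if not trail["is_logging"]:
            violations.append(f"cloudtrail-off:{trail['name']}")

    # SSL certificates should always be valid (here: simply not expired).
    for cert in snapshot.get("certificates", []):
        if cert["not_after"] <= now:
            violations.append(f"cert-expired:{cert['domain']}")

    # External traffic should only terminate on load balancers.
    for sg in snapshot.get("security_groups", []):
        if sg["open_to_internet"] and sg["attached_to"] != "elb":
            violations.append(f"public-non-elb:{sg['id']}")

    return violations
```

Each violation is a plain string, so the result can be logged, sent as a notification, or handed to a reaction without further translation.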

Less easy stuff: fuzzy conditions

● There are things about your environment that could change

● Web server CPU load should never exceed X%

● Patterns of inter-application traffic

● Engineers/administrators logging into systems

● API access patterns

● Inbound/outbound traffic patterns

● ...
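
Fuzzy conditions need a baseline rather than a fixed rule. One common way to build one, used here purely as an assumption and not as the method described in the talk, is to flag a value that sits several standard deviations away from recent history. The function name and default threshold are hypothetical.

```python
from statistics import mean, pstdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` when it deviates strongly from `history`.

    `history` is a list of recent observations (for example, logins per
    hour). The z-score test and the default threshold are illustrative.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return current != mu
    return abs(current - mu) / sigma > threshold
```

For example, with a history of `[10, 12, 11, 9, 10, 11]` logins per hour, a reading of 40 is flagged and a reading of 11 is not.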

Laying the groundwork: AWS

● AWS CloudTrail

• Make sure CloudTrail is turned on ... for all the things

• Stream to CloudWatch Logs (> 10 min latency)

• Use CloudWatch Events when you can (< 1 min latency)

• Connect both to AWS Lambda functions monitoring for specific conditions

● AWS Lambda functions identify, log/notify, and react to these conditions

• Create specific “OK” conditions, break glass buttons, etc.
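
A Lambda-style function for one such condition, CloudTrail being switched off, could look roughly like this. It is a sketch under stated assumptions: the event is a CloudTrail API call delivered through CloudWatch Events, `cloudtrail` is a boto3 CloudTrail client passed in by the caller, and the allow list stands in for an “OK” condition. The function name and the ARN in the allow list are hypothetical.

```python
# Hypothetical "OK" condition: principals allowed to stop logging.
ALLOWED_PRINCIPALS = {"arn:aws:iam::123456789012:role/break-glass"}


def handle_event(event, cloudtrail, allowed=ALLOWED_PRINCIPALS):
    """Identify, log, and react to a CloudTrail StopLogging call.

    `cloudtrail` is assumed to be a boto3 CloudTrail client.
    """
    detail = event.get("detail", {})

    # Identify: only StopLogging calls against CloudTrail are of interest.
    if detail.get("eventSource") != "cloudtrail.amazonaws.com":
        return "ignored"
    if detail.get("eventName") != "StopLogging":
        return "ignored"

    # "OK" condition: an approved principal did this on purpose.
    # Matching on the exact ARN is a simplification for this sketch.
    actor = (detail.get("userIdentity") or {}).get("arn", "unknown")
    if actor in allowed:
        return "allowed"

    trail = (detail.get("requestParameters") or {}).get("name")
    if not trail:
        return "ignored"

    # Log/notify, then react by turning logging back on.
    print(f"CloudTrail stopped on {trail} by {actor}; restarting")
    cloudtrail.start_logging(Name=trail)
    return "remediated"
```

A real Lambda entry point would create the client and call `handle_event(event, client)`; keeping the decision logic separate makes it testable without AWS credentials.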

Laying the groundwork: Non-AWS events

● Requires a robust, reliable, and (programmatically) accessible logging infrastructure

● Access logs, authentication logs, performance logs, etc.

● A leverageable pipeline ... ELK is a good start, but not appropriate for everything

• CloudWatch Logs, CloudWatch Metrics, Datadog, Statsd, $plunk, New Relic, etc.

● At Netflix we use Atlas, ES, and other big data pipelines (https://github.com/Netflix/atlas)

Strategy is important.

Three categories of events

#1 Fully automatable

#2 Almost automatable

#3 Never automatable
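
One way to make these three categories concrete is to route each event type to a different kind of handling. The mapping, the event names, and the returned action names below are all hypothetical; the only part taken from the slide is the three-way split.

```python
from enum import Enum


class Category(Enum):
    FULLY_AUTOMATABLE = 1
    ALMOST_AUTOMATABLE = 2
    NEVER_AUTOMATABLE = 3


# Hypothetical mapping from event type to category.
EVENT_CATEGORIES = {
    "cloudtrail_stopped": Category.FULLY_AUTOMATABLE,
    "unusual_login": Category.ALMOST_AUTOMATABLE,
    "suspected_data_loss": Category.NEVER_AUTOMATABLE,
}


def route(event_type):
    """Pick a handling path for an event type."""
    # Unknown events default to a human, the most conservative choice.
    category = EVENT_CATEGORIES.get(event_type, Category.NEVER_AUTOMATABLE)
    if category is Category.FULLY_AUTOMATABLE:
        return "auto_remediate"
    if category is Category.ALMOST_AUTOMATABLE:
        return "request_human_approval"
    return "escalate_to_human"
```

Defaulting unknown event types to category #3 is a design choice in this sketch: automation only acts on events someone has explicitly classified.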

Please talk about some more relevant buzzwords.

ChatOps

● Baby steps toward full reactive automation (for managing bucket #2 type events)

● Use a single shared interface to facilitate notifications, log work, provide context, and interact with tools

● Automation gets you the context and notification

● Humans approve and execute commands

● Two-factor is important!
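
The approval step can be sketched as a pending action that only runs once an authorized human confirms it with a second factor. Everything here is illustrative: the class, its fields, and the `verify_otp` callback (which stands in for whatever two-factor service is in use) are assumptions, and no real chat platform API is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class PendingAction:
    description: str               # context posted to the shared channel
    execute: Callable[[], str]     # the command automation has prepared
    approvers: Set[str]            # humans allowed to approve it
    log: List[str] = field(default_factory=list)
    executed: bool = False

    def approve(self, user, otp, verify_otp):
        """Run the action if `user` may approve and passes two-factor."""
        if self.executed:
            self.log.append(f"{user}: already executed")
            return False
        if user not in self.approvers:
            self.log.append(f"{user}: not an approver")
            return False
        if not verify_otp(user, otp):
            self.log.append(f"{user}: second factor failed")
            return False
        result = self.execute()
        self.executed = True
        self.log.append(f"{user}: approved, result={result}")
        return True
```

The `log` list plays the role of the shared interface's record of work: every attempt, successful or not, leaves a line behind.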

Right sizing your environment

● Monitor your environment so that security policies match reality

• IAM roles (look out for RepoMan from Netflix)

• Security groups (working on something here too)

• Amazon S3 policies 😣

● Start off with more than you need during development

● Monitor for X days

● Adjust policy based on actual usage; expose this information!

● Enable break-glass and self-service changes to automation
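
The monitor-then-adjust loop above can be reduced to a set comparison between what a role is granted and what it was actually seen using. This sketch does not describe how RepoMan works; the function name and parameters are hypothetical, and wildcard actions such as `s3:*` are not expanded.

```python
def right_size(granted, observed_calls, days_observed, min_days):
    """Split granted actions into (keep, remove) based on observed usage.

    `granted` is a set of action names such as "s3:GetObject".
    `observed_calls` is an iterable of actions seen in logs.
    Nothing is removed until the role has been watched for `min_days`.
    """
    granted = set(granted)
    if days_observed < min_days:
        return granted, set()
    used = set(observed_calls)
    keep = granted & used
    remove = granted - used
    return keep, remove
```

Returning the `remove` set instead of applying it directly leaves room to expose the proposed change to the application owner, and to restore a permission through a break-glass path if it turns out to be needed.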

In closing ...

● Cloud environments and modern development/deployment technologies can increase security

● Architect for flexibility and varying security conditions

● Seek to remove practices which can’t be automated

● ChatOps and right sizing are your friends

Thanks!

Feel free to reach out:

● bhagen@netflix.com

● benhagen@gmail.com

... or yell at me publicly:

● @benhagen