2013: Trends from the Trenches


Slides from the 2013 "trends talk" as delivered annually at Bio-IT World Boston.

transcript

1

Trends from the Trenches. 2013 Bio-IT World - Boston

2

Some less aspirational title slides ...

3

Trends from the Trenches. 2013 Bio-IT World - Boston

4

Trends from the Trenches. 2013 Bio-IT World - Boston

5

I’m Chris.

I’m an infrastructure geek.

I work for the BioTeam.

www.bioteam.net - Twitter: @chris_dag

Who, What, Why ...

6

BioTeam

‣ Independent consulting shop

‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done

‣ 10+ years bridging the “gap” between science, IT & high performance computing

Apologies in advance

7

If you have not heard me speak ...

‣ “Infamous” for speaking very fast and carrying a huge slide deck

• ~70 slides for 25 minutes is about average for me

• Let me mention what happened after my Pharma HPC best practices talk yesterday ...

By the time you see this slide I’ll be on my ~4th espresso

8

Why I do this talk every year ...

‣ BioTeam works for everyone

• Pharma, Biotech, EDU, Nonprofit, .Gov, etc.

‣ We get to see how groups of smart people approach similar problems

‣ We can speak honestly & objectively about what we see “in the real world”

Listen to me at your own risk

9

Standard Dag Disclaimer

‣ I’m not an expert, pundit, visionary or “thought leader”

‣ Any career success entirely due to shamelessly copying what actual smart people do

‣ I’m biased, burnt-out & cynical

‣ Filter my words accordingly

10

So why are you here? And before 9am!

11

It’s a risky time to be doing Bio-IT

12

Big Picture / Meta Issue

‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed

• Example: CCD sensor upgrade on that confocal microscopy rig just doubled storage requirements

• Example: The 2D ultrasound imager is now a 3D imager

• Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs

‣ For the above examples, do you think IT was informed in advance?

Science progressing way faster than IT can refresh/change

The Central Problem Is ...

‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure

• Bench science is changing month-to-month ...

• ... while our IT infrastructure only gets refreshed every 2-7 years

‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)

13

The Central Problem Is ...

‣ The easy period is over

‣ 5 years ago we could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary

‣ That does not work any more; real solutions required

14

15

The new normal.

And a related problem ...

‣ It has never been easier to acquire vast amounts of data cheaply and easily

‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity

‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers

• ... ideally without punching holes in your firewall or consuming all available internet bandwidth

16

If you get it wrong ...

‣ Lost opportunity

‣ Missing capability

‣ Frustrated & very vocal scientific staff

‣ Problems in recruiting, retention, publication & product development

17

18

Enough groundwork. Let's Talk Trends*

19

Topic: DevOps & Org Charts

20

The social contract between scientist and IT is changing forever

21

You can blame “the cloud” for this

22

DevOps & Scriptable Everything

‣ On (real) clouds, EVERYTHING has an API

‣ If it’s got an API you can automate and orchestrate it

‣ “scriptable datacenters” are now a very real thing
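A minimal sketch of what "everything has an API" looks like in practice, using boto, a widely used Python interface to AWS; the AMI ID, keypair and security-group names are illustrative placeholders rather than real values:

    import boto.ec2

    # Credentials are picked up from the environment or ~/.boto.
    conn = boto.ec2.connect_to_region("us-east-1")

    # Launch an analysis node entirely through API calls; "ami-xxxxxxxx",
    # "my-keypair" and "analysis-sg" are placeholders.
    reservation = conn.run_instances(
        "ami-xxxxxxxx",
        instance_type="m1.large",
        key_name="my-keypair",
        security_groups=["analysis-sg"],
    )
    instance = reservation.instances[0]

    # Tag the node so automated clean-up or billing scripts can find it later.
    instance.add_tag("project", "ngs-pipeline")
    print("%s %s" % (instance.id, instance.state))

The same pattern extends to storage, networking and DNS, which is what turns "scriptable datacenter" from marketing into something you can actually build on.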

23

DevOps & Scriptable Everything

‣ Incredible innovation in the past few years

‣ Driven mainly by companies with massive internet ‘fleets’ to manage

‣ ... but the benefits trickle down to us little people

24

DevOps will conquer the enterprise

‣ Over the past few years cloud automation/orchestration methods have been trickling down into our local infrastructures

‣ This will have significant impact on careers, job descriptions and org charts

2013: Continue to blur the lines between all these roles

25

Scientist/SysAdmin/Programmer

‣ Radical change in how IT is provisioned, delivered, managed & supported

• Technology Driver: Virtualization & Cloud

• Ops Driver: Configuration Mgmt, Systems Orchestration & Infrastructure Automation (sketch after this slide)

‣ SysAdmins & IT staff need to re-skill and retrain to stay relevant

www.opscode.com

2013: Continue to blur the lines between all these roles
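Chef recipes themselves are written in a Ruby DSL; purely to illustrate the underlying configuration-management idea in Python (declare the desired state, change the system only when it differs, then trigger dependent actions), here is a small hypothetical sketch. The config path and service name are made up:

    import os
    import subprocess

    def ensure_file(path, content, mode=0o644):
        """Write `content` to `path` only if it differs; return True if changed."""
        current = None
        if os.path.exists(path):
            with open(path) as f:
                current = f.read()
        if current == content:
            return False                 # already in the desired state: do nothing
        with open(path, "w") as f:
            f.write(content)
        os.chmod(path, mode)
        return True

    # Desired state for a (hypothetical) scheduler config; needs root to apply.
    changed = ensure_file("/etc/gridengine/queue.conf", "slots 64\n")

    # Restart the service only when something actually changed, mirroring the
    # notify/subscribe pattern used by Chef, Puppet and similar tools.
    if changed:
        subprocess.call(["service", "gridengine", "restart"])

Idempotence is the point: the same run can be applied safely to one node or a thousand, which is what makes the automation trustworthy at fleet scale.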

26

Scientist/SysAdmin/Programmer

‣ When everything has an API ...

‣ ... anything can be ‘orchestrated’ or ‘automated’ remotely

‣ And by the way ...

‣ The APIs (‘knobs & buttons’) are accessible to all, not just the bearded practitioners sitting in that room next to the datacenter

2013: Continue to blur the lines between all these roles

27

Scientist/SysAdmin/Programmer

‣ IT jobs, roles and responsibilities are going to change significantly

‣ SysAdmins must learn to program in order to harness automation tools

‣ Programmers & Scientists can now self-provision and control sophisticated IT resources

2013: Continue to blur the lines between all these roles

28

Scientist/SysAdmin/Programmer

‣ My take on the future ...

• SysAdmins (Windows & Linux) who can’t code will have career issues

• Far more control is going into the hands of the research end user

• IT support roles will radically change -- no longer owners or gatekeepers

‣ IT will “own” policies, procedures, reference patterns, identity mgmt, security & best practices

‣ Research will control the “what”, “when” and “how big”

29

Topic: Facility Observations

30

Facility 1: Enterprise vs Shadow IT

‣ Marked difference in the types of facilities we’ve been working in

‣ Discovery Research systems are firmly embedded in the enterprise datacenter

‣ ... moving away from “wild west” unchaperoned locations and mini-facilities

31

Facility 2: Colo Suites for R&D

‣ Marked increase in use of commercial colocation facilities for R&D systems

• And they’ve noticed!

- Markley Group (One Summer) has a booth

- Sabey is on this afternoon’s NYGenome panel

‣ Potential reasons:

• Expensive to build high-density hosting at small scale

• Easier metro networking to link remote users/sites

• Direct connect to cloud provider(s)

• High-speed research nets only a cross-connect away

32

Facility 3: Some really old stuff ...

‣ Final facility observation

‣ Average age of infrastructure we work on seems to be increasing

‣ ... very few aggressive 2-year refresh cycles these days

‣ Potential reasons

• Recession & consolidation still affecting or deferring major technology upgrades and changes

• Cloud: local upgrades deferred pending strategic cloud decisions

• Cloud: economic analysis showing the stark truth that local setups need to be run efficiently and at high utilization in order to justify their existence

33

Facility 4: Virtualization

‣ Every HPC environment we’ve worked on since 2011 has included (or plans to include) a local virtualization environment

• True for big systems: 2k cores / 2 petabyte disk

• True for small systems: 96 core CompChem cluster

‣ Unlikely to change; too many advantages

34

Facility 4: Virtualization

‣ HPC + Virtualization solves a lot of problems

• Deals with valid biz/scientific need for researchers to run/own/manage their own servers ‘near’ the HPC stack

‣ Solves a ton of research IT support issues

• Or at least leaves us a clear boundary line

‣ Lets us obtain useful “cloud” features without choking on endless BS shoveled at us by “private cloud” vendors

• Example: Server Catalogs + Self-service Provisioning

35

Topic: Compute

36

Compute:

‣ Still feels like a solved problem in 2013

‣ Compute power is a commodity

• Inexpensive relative to other costs

• Far less vendor differentiation than storage

• Easy to acquire; easy to deploy

Fat nodes are wiping out small and midsized clusters

37

Compute: Fat Nodes

‣ This box has 64 CPU Cores

• ... and up to 1TB of RAM

‣ Fantastic Genomics/Chemistry system

• A 256GB RAM version only costs $13,000*

‣ BioIT Homework:

• Go visit the Silicon Mechanics booth and find out the current cost of a box with 1TB RAM

Possibly the most significant ’13 compute trend

38

Defensive hedge against Big Data / HDFS

39

Compute: Local Disk is Back

‣ We’ve started to see organizations move away from blade servers and 1U pizza box enclosures for HPC

‣ The “new normal” may be 4U enclosures with massive local disk spindles - not occupied, just available

‣ Why? Hadoop & Big Data

‣ This is a defensive hedge against future HDFS or similar requirements

• Remember the ‘meta’ problem - science is changing far faster than we can refresh IT. This is a defensive future-proofing play.

‣ Hardcore Hadoop rigs sometimes operate at 1:1 ratio between core count and disk count

40

Topic: Network

41

Network:

‣ 10 Gigabit Ethernet still the standard

• ... although not as pervasive as I predicted in prior trend talks

‣ Non-Cisco options attractive

• BioIT homework: listen to the Arista talks and visit their booth.

‣ SDN still more hype than reality in our market

• May not see it until next round of large private cloud rollouts or new facility construction (if even)

42

Network:

‣ Infiniband for message passing in decline

• Still see it for comp chem, modeling & structure work; Started building such a system last week

• Still see it for parallel and clustered storage

• Decline seems to match decreasing popularity of MPI for latest generation of informatics and ‘omics tools

‣ Hadoop / HDFS seems to favor throughput and bandwidth over latency

43

Topic: Storage

44

Storage

‣ Still the biggest expense, biggest headache and scariest systems to design in modern life science informatics environments

‣ Most of my slides for last year’s trends talk focused on storage & data lifecycle issues

• Check http://slideshare.net/chrisdag/ if you want to see what I’ve said in the past

• Dag accuracy check: It was great yesterday to see DataDirect talking about the KVM hypervisor running on their storage shelves! I’m convinced more and more apps will run directly on storage in the future

‣ ... not doing that this year. The core problems and common approaches are largely unchanged and don’t need to be restated

45

It’s 2013, we know what questions to ask of our storage

46

Data like this lets us make realistic capacity planning and purchase decisions

NGS new data generation: 6-month window
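A back-of-the-envelope sketch of the capacity planning such data enables; every input number below is an illustrative placeholder, not a figure from the talk:

    # Rough storage forecast from instrument run rates; all inputs are illustrative.
    runs_per_month_per_instrument = 4
    raw_tb_per_run = 0.6            # raw output per sequencing run, in TB
    derived_data_multiplier = 2.5   # BAMs, indexes, QC reports, intermediates
    instruments = 3
    months = 12
    protection_overhead = 2.0       # replication / RAID / snapshot overhead

    monthly_tb = (runs_per_month_per_instrument * instruments *
                  raw_tb_per_run * derived_data_multiplier)
    usable_tb = monthly_tb * months
    raw_tb_to_buy = usable_tb * protection_overhead

    print("Growth: %.1f TB usable per month" % monthly_tb)
    print("12-month need: %.1f TB usable, %.1f TB raw" % (usable_tb, raw_tb_to_buy))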

47

Storage: 2013

‣ Advice: Stay on top of the “compute nodes with many disks” trends.

‣ HDFS, if suddenly required by your scientists, can be painful to deploy in a standard scale-out NAS environment

48

Storage: 2013

‣ Object Storage is getting interesting

Object Storage + Commodity Disk Pods

49

Storage: 2013

‣ Object storage is far more approachable

• ... used to see it in proprietary solutions for specific niche needs

• potentially on its way to the mainstream now

‣ Why?

• Benefits are compelling across a wide variety of interesting use cases

• Amazon S3 showed what a globe-spanning general purpose object store could do; this is starting to convince developers & ISVs to modify their software to support it

• www.swiftstack.com and others are making local object stores easy, inexpensive and approachable on commodity gear

• Most of your Tier1 storage and server vendors have a fully supported object store stack they can sell to you (or simply enable in a product you already have deployed in-house)
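To make the S3 point concrete: the entire programming model is roughly "put and get objects by key in a flat namespace." A minimal boto sketch, where the bucket and file names are placeholders:

    import boto

    # Credentials are read from the environment or ~/.boto.
    conn = boto.connect_s3()

    # Objects live in a flat namespace: bucket + key, no POSIX file semantics.
    bucket = conn.create_bucket("example-lab-archive")

    key = bucket.new_key("run42/sample7.fastq.gz")
    key.set_contents_from_filename("sample7.fastq.gz")        # upload
    key.get_contents_to_filename("copy-of-sample7.fastq.gz")  # download

    # Listing is by key prefix rather than by directory.
    for k in bucket.list(prefix="run42/"):
        print("%s %d" % (k.name, k.size))

The simplicity of that model is why ISVs are willing to retrofit support for it, and why Swift and most vendor object stacks expose the same basic put/get-by-key interface.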

50

Remember this disruptive technology example from last year?

51

100 Terabytes for $12,000 (more info: http://biote.am/8p )

52

Storage: 2013

‣ There are MANY reasons why you should not build that $12K Backblaze pod

• ... done wrong you will potentially inconvenience researchers, lose critical scientific information and (probably) lose your job

‣ Inexpensive or open source object storage software makes the ultra-cheap storage pod concept viable

53

Storage: 2013

‣ A single unit like this is risky and should only be used for well known and scoped use cases. Risks generally outweigh the disruptive price advantage

‣ However ...

‣ What if you had 3+ of these units running an object store stack with automatic triple location replication, recovery and self-healing?

• Then things get interesting

• This is one of the ‘lab’ projects I hope to work on in ’13
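For reference, and assuming an OpenStack Swift cluster of the kind SwiftStack packages is already running across those pods, client access is a few lines of python-swiftclient; the endpoint and credentials below are placeholders:

    import swiftclient

    # Placeholder endpoint and v1-auth credentials for an existing Swift cluster.
    conn = swiftclient.Connection(
        authurl="http://swift.example.org:8080/auth/v1.0",
        user="lab:researcher",
        key="secret",
    )

    conn.put_container("instrument-raw")

    # Swift handles replica placement and self-healing across zones/pods;
    # the client only ever does simple puts and gets.
    with open("sample7.fastq.gz", "rb") as f:
        conn.put_object("instrument-raw", "run42/sample7.fastq.gz", contents=f)

    headers, body = conn.get_object("instrument-raw", "run42/sample7.fastq.gz")
    print(headers.get("content-length"))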

54

Storage: 2013

‣ Caveat/Warning

• The 2013 editions of “Backblaze-like” enclosures mitigate many of the earlier availability, operational and reliability concerns

• Still an aggressive play that carries risk in exchange for a disruptive price point

‣ There is a middle ground

• Lots of action in the ZFS space with safer & more mainstream enclosures

• BioIT Homework: Visit the Silicon Mechanics booth and check out what they are doing with Nexenta’s Open Storage stuff.

55

Topic: Cloud

56

Can you do a Bio-IT talk without using the ‘C’ word?

57

Cloud: 2013

‣ Our core advice remains the same

‣ What’s changed

Core Advice

58

Cloud: 2013

‣ Research Organizations need a cloud strategy today

• Those that don’t will be bypassed by frustrated users

‣ IaaS cloud services are only a departmental credit card away ... and some senior scientists are too big to be fired for violating IT policy

Design Patterns

59

Cloud Advice

‣ You actually need three tested cloud design patterns:

‣ (1) To handle ‘legacy’ scientific apps & workflows

‣ (2) The special stuff that is worth re-architecting

‣ (3) Hadoop & big data analytics

Legacy HPC on the Cloud

60

Cloud Advice

‣ MIT StarCluster

• http://web.mit.edu/star/cluster/

‣ This is your baseline

‣ Extend as needed

“Cloudy” HPC

61

Cloud Advice

‣ Some of our research workflows are important enough to be rewritten for “the cloud” and the advantages that a truly elastic & API-driven infrastructure can deliver

‣ This is where you have the most freedom

‣ Many published best practices you can borrow

‣ Warning: Cloud vendor lock-in potential is strongest here

What you need to know

62

Hadoop & “Big Data”

‣ “Hadoop” and “Big Data” are now general terms

‣ You need to drill down to find out what people actually mean

‣ We are still in the period where senior leadership may demand “Hadoop” or “BigData” capability without any actual business or scientific need

What you need to know

63

Hadoop & “Big Data”

‣ In broad terms you can break “Big Data” down into two very basic use cases:

1. Compute: Hadoop can be used as a very powerful platform for the analysis of very large data sets. The google search term here is “map reduce”

2. Data Stores: Hadoop is driving the development of very sophisticated “no-SQL” “non-Relational” databases and data query engines. The google search terms include “nosql”, “couchdb”, “hive”, “pig” & “mongodb”, etc.

‣ Your job is to figure out which type applies for the groups requesting “Hadoop” or “BigData” capability
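For the "Compute" flavor, the MapReduce pattern itself is small enough to sketch. Hadoop Streaming runs the same two phases as separate executables over stdin/stdout and distributes them across the cluster; this single-process toy version (with invented input records) just shows the shape of the computation:

    from itertools import groupby
    from operator import itemgetter

    records = ["chr1 A", "chr1 T", "chr2 A", "chr1 A"]   # toy input records

    def mapper(record):
        # Emit (key, 1) pairs; here the key is just the second field.
        _, base = record.split()
        yield base, 1

    def reducer(key, counts):
        # Combine all values seen for one key.
        return key, sum(counts)

    # The "shuffle" step groups intermediate pairs by key; Hadoop does this
    # for you between the map and reduce phases.
    pairs = sorted(p for rec in records for p in mapper(rec))
    for key, group in groupby(pairs, key=itemgetter(0)):
        print(reducer(key, (count for _, count in group)))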

What has changed ...

Cloud: 2013

‣ Let’s revisit some of my bile from prior years

‣ “... private clouds: still utter crap”

‣ “... some AWS competitors are delusional pretenders”

‣ “... AWS has a multi-year lead on the competition”

64

Private Clouds in 2013:

‣ I’m no longer dismissing them as “utter crap”

‣ Usable & useful in certain situations

‣ BioTeam positive experiences with OpenStack

‣ Hype vs. Reality ratio still wacky

‣ Sensible only for certain shops

• Have you seen what you have to do to your networks & gear?

‣ Still important to remain cynical and perform proper due diligence

Non-AWS IaaS in 2013

‣ Three main drivers for BioTeam’s evolving IaaS practices and thinking for 2013:

‣ (1) Real world success with OpenStack & BT

‣ (2) Real world success with Google Compute

‣ (3) Real world multi-cloud DevOps

‣ Just to remain honest though:

• AWS still has a multi-year lead in product, service and features

• ... and many novel capabilities

• But some of the competition has some interesting benefits that AWS can’t match

BioTeam, BT & OpenStack

‣ We’ve been working with BT for a while now on various projects

‣ BT Cloud using OpenStack under the hood with some really nice architecture and operational features

‣ BioTeam developed a Chef-based HPC clustering stack and other tools that are currently being used by BT customers

• ... some of whom have spoken openly at this meeting

BioTeam & Google Compute Engine

‣ We can’t even get into the preview program

‣ But one of our customers did

‣ ... and we’ve been able to do some successful and interesting stuff

• Without changing operations or DevOps tools our client is capable of running both on AWS and Google Compute

• For this client and a few other use cases we believe we can span both clouds or construct architectures that would enable fast and relatively friction-free transitions
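One way this kind of cloud-spanning work is done in Python is Apache libcloud, which puts a single provisioning API in front of AWS, Google Compute Engine and others. Constructor arguments differ by provider and by libcloud release, so treat this as a sketch with placeholder credentials rather than working configuration:

    from libcloud.compute.types import Provider
    from libcloud.compute.providers import get_driver

    # Placeholder credentials; check the driver docs for the exact arguments
    # your libcloud version expects for each provider.
    aws = get_driver(Provider.EC2)("AKIA...", "aws-secret-key",
                                   region="us-east-1")
    gce = get_driver(Provider.GCE)("service-account@example-project",
                                   "/path/to/key.pem",
                                   project="example-project")

    # The same calls work against either cloud, which is what makes a
    # provider-spanning DevOps/orchestration layer practical.
    for driver in (aws, gce):
        nodes = driver.list_nodes()
        print("%s: %d nodes" % (driver.name, len(nodes)))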

Wrapping up ...

Chef, AWS, OpenStack & Google

‣ 2012 was the 1st year we did real work spanning multiple IaaS cloud platforms or at least replicating workloads on multiple platforms

‣ We’ve learned a lot - I think this may result in some interesting talks at next year’s Bio-IT meeting

- By BioTeam and actual end-users

‣ What makes this all possible is the DevOps / Orchestration stuff mentioned at the beginning of this presentation.

70

end; Thanks!

Slides: http://slideshare.net/chrisdag/