Infrastructure in Biomed IT - skendric.com · 2019. 11. 23. · Data Generation is Cheap...

Infrastructure in Biomed IT: Current Challenges

Stuart Kendrick

Systems Engineer

Seattle, WA USA

This deck available at http://www.skendric.com/seminar

Outline

• Introduction

• Data Generation is Cheap

• IT is a tool, not a Department

• Trade-offs: Agility and Scale

• What about the Cloud?

• People, Processes, Tools

• Bring it Together

Introduction

Intent

This presentation is aimed at IT infrastructure professionals supporting biomed research; here, I sketch my view of current challenges

• Data Generation is Cheap

• IT is a Tool, not a Department

• Trade-offs: Agility and Scale

• What about the Cloud?

• People, Processes, Tools

Frame

If you want to go fast, go alone. If you want to go far, go together

Many of the choices we face in IT infrastructure these days can be framed as trade-offs between agility and scale

A few people working in a small environment can produce a CubeSat in less than a year. But if you want a rover on Mars, you gotta bring together a big team and work together for decades

Significant investment in IT infrastructure allows us to tackle larger problems. But infrastructure is heavy and slows us down

Data Generation is Cheap

Data Generation is Cheap

• Historically, generating data was expensive; the IT infrastructure needed to capture, process, store, analyze, and publish that data was comparatively small

• Starting early in the 21st century, that ratio began to shift. The cost of generating data has been plunging, while the cost of the supporting IT infrastructure has been rising

• Today, I propose that many of your scientists can generate more data than your institution’s IT infrastructure can handle

Data Handling

Genome Sequencing

https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/ … also known as ‘the most shared chart’ in BioIT

Example

The cost of the equipment, people, time, reagents ... for sequencing the DNA in a genome used to be a lot ... But starting in ~2007, the cost started to plummet. Furthermore, the cost not only plummeted, it fell far faster than Moore’s Law would predict

Researchers – and clinicians – can now produce genome sequence cheaply ... And substantially outpace their IT infrastructure

Implications

In the early 90s, a million bucks’ worth of genome sequence fit on a floppy disk (360 KB) and could be transmitted across a then-modern network (10 Mb/s) in ~1 second

Today, a million bucks’ worth of genome sequence consumes 150 TB and takes several days to transmit over a modern network (10 Gb/s)

Assuming current trends, somewhere in the next decade a million bucks’ worth of genome sequence would consume many petabytes and take years to transmit over a current network (100 Gb/s)

This tells us that (a) the costs of data generation are plummeting, and that (b) Moore’s Law has ignored networking (networks are slow)
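The back-of-envelope arithmetic behind these figures can be sketched as follows. This assumes an ideal link running at full line rate with no protocol overhead, and decimal units (1 TB = 10^12 bytes); the function name is illustrative. The result is a lower bound: real transfers run slower, often much slower.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Lower bound on transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / link_bits_per_second


KB = 10**3
TB = 10**12

# Early 1990s: a 360 KB floppy over a 10 Mb/s network
floppy = transfer_seconds(360 * KB, 10 * 10**6)

# Today: 150 TB over a 10 Gb/s network
today = transfer_seconds(150 * TB, 10 * 10**9)

print(f"floppy: {floppy:.2f} s")
print(f"150 TB: {today / 3600:.1f} hours at line rate")
```

At line rate the 150 TB case already takes more than a day; shared links, disk throughput, and retries can stretch that further.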

The Speed of Light is a Problem

Claims

• Compute is cheap → the rise of the Cloud

• Storage is not nearly as cheap – remains a noticeable consideration for large data sets

• Networking is expensive

Networking is ~100x behind Storage and ~1000x behind Compute

{Lots of hand-waving here}

If you spend 99% of your budget on Networking and 1% on Storage, then you can move your bits around wherever & whenever you want: geography doesn’t matter to you

But I bet you didn’t do that …

Data has Gravity: if you produce a lot of it in one place, you will end up clustering your Compute & Storage near it

The Economist, 2019-09-12, Ubiquitous computing

Custom Instruments

Using the Allen Institute for another example:

Historically, we have built custom instruments which produce unique, and big, data sets

This approach pushes us toward heavy IT infrastructure: the Compute / Network / Storage / In-house Applications needed to handle this approach

When we underestimate how much Infrastructure this takes, we fall behind in meeting our goals

More generally: data has gravity

Scientific Direction

As another example, consider physics research. Some institutions build particle accelerators, gather unique data, and require heavy IT infrastructure as a result: they explore a particular flavor of physics which requires these data sets

Other institutions perform excellent physics also … without a particle accelerator. They tackle different problems – problems which don’t require data sets from accelerators

Leadership influences scientific direction by choosing the size & character of the IT Infrastructure constructed at their institution

Data Collection

IT is a Tool, not a Department

IT is a Tool, not a Department

Historically, IT functions lived entirely within a single department, called ‘IT’. These days, 99+% of IT functions are distributed throughout the company, operating in the hands of every knowledge worker

As a result, every staff member, during their moment-by-moment work, can place upward pressure on ‘IT’ costs. If you’re attempting to manage IT costs centrally, this distributed pressure can confuse you

If you add up all the $$ you’re spending on ‘IT-related’ efforts, I predict you’ll see a quantity which rivals what you’re spending on Facilities (traditionally the dominant administrative cost of a grant-funded institution)

In the knowledge economy, total IT costs have risen to rival Facilities

IT as a Tool

IT is a Tool, not a Department

More interestingly, because IT tools are laced throughout the daily activities of the knowledge worker (and who in your company is *not* a knowledge worker?), IT tools typically become rate-limiters to business activity

Compute / Network / Storage are obvious examples from the preceding section, but consider also:

- Does the build environment for your in-house developers compile new code in a couple minutes or a couple hours?

- What tools do your staff use to share files, both internally & externally, and are they happy with them? How long does it take to copy that data?

- How long do your staff wait for your Firewall administrators to update rules to support their custom applications?

Managing IT as a rate-limiter to business activity is hard

Perhaps IT Infrastructure is a Department

This is where my model of IT as a Tool falls down … at some point, you will want to centralize IT infrastructure support into a department, minimally for leadership control: when infrastructure fails, chunks of your business grind to a halt. Consider what happens when:

- Phones are down for a couple weeks (including your public-facing front number)

- Internet access becomes flaky

- Ticketing systems go down: consider developers, Facilities, IT itself

- The Virtual Server farm infrastructure gets flaky (no one VM is particularly critical, but when all tens / hundreds / thousands of them go down, the business is kaput)

The issue is not limited to IT infrastructure … this rate-limiting property is common across all infrastructure: consider if toilet flushing becomes unreliable for a couple weeks: that would put a serious dent in your productivity … combine that issue with a failure of the Facilities ticketing system …

Trade-offs: Agility and Scale

Infrastructure is Big & Slow

Our current building includes a 360 kW Data Center, expandable to 720 kW. After that, we’re done – no more space / cooling / power

We will soon max-out the existing 360 kW space with storage, servers, and compute, including a ~15 PB Storage Farm which cost us ~6 mil. What happens when that Storage Farm hits end-of-life? We could:

• Throw away its data, power it off, remove it, install a new one, power it on, and start from scratch

• Build out the 2nd half of the Data Center (~3.5 mil), install a new Storage Farm (~6 mil), copy the data from the old Farm (really?), retire the old one, and continue

Building Data Centers, standing up big storage, and keeping it all going is not an agile activity. But it does allow you to tackle larger problems

Trade-offs: Agility vs Scale

The Life of a Building

Another example from the Allen Institute

When we designed our current building, leadership chose a 1G (Cat6) cable plant. Why? This saved us ~$15-20 mil on the cost of a $150 mil building, over the 10G (Cat6a) option, and allowed us to build more lab space

Thus, in this building, we are committed to 1G throughput for your average workstation. <aside>Yes, we have augmented somewhat, spot-adding glass cabling, to deliver 10G+ rates to key locations, and we have room to add more. However, in general, the average workstation will max out at 1G rates for the life of the building. BTW: the Data Center has a more flexible plant and can accept new cabling, and thus higher rates, indefinitely</aside>

Giving the average workstation 10G access supports agility – the average researcher can then download a larger data set for analysis. In our environment, a few users can do this … but most users must architect their data flows to perform analysis via servers installed in the Data Center, which takes … more infrastructure and more calendar time … scales better, but is less agile


Regional Power Delivery

Another example drawn from the modeling world:

Current Machine Learning approaches favor using GPUs over CPUs, and GPUs are power-hungry

In our 10 kW / Cabinet Data Center, we can fit (40) CPU nodes in a single Cabinet … but only (4) GPU nodes per Cabinet … the Data Center itself maxes out at 740 kW, which is a significant draw, considering the size of the building

At some point, the volume of power the municipality is willing to deliver to your building constrains what you do
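The cabinet numbers above imply a rough power budget per node. A minimal sketch of that arithmetic, assuming the full 10 kW per cabinet goes to the nodes (ignoring switches and other overhead) and that every node of a given type draws the same power; the function name is illustrative:

```python
def watts_per_node(cabinet_kw: float, nodes_per_cabinet: int) -> float:
    """Average power budget available to each node in a cabinet."""
    return cabinet_kw * 1000 / nodes_per_cabinet


CABINET_KW = 10

cpu_watts = watts_per_node(CABINET_KW, 40)  # 40 CPU nodes per cabinet
gpu_watts = watts_per_node(CABINET_KW, 4)   # 4 GPU nodes per cabinet

print(f"CPU node budget: {cpu_watts:.0f} W")
print(f"GPU node budget: {gpu_watts:.0f} W")
print(f"One GPU node uses the power budget of {gpu_watts / cpu_watts:.0f} CPU nodes")
```

Under these assumptions, power rather than floor space is what limits how many GPU nodes fit in a cabinet.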


What about the Cloud?

The Cloud as Revolution

A revolutionary capability

The Cloud provides agile infrastructure at scale … doesn’t this solve our problem and allow us to quit trading off agility and scale?

That depends. At the Allen Institute, we have a history of building custom instruments to produce unique data sets which are big

Shoveling lots of data into and out of the Cloud is something the Cloud is bad at: slow & expensive. <aside>If you can generate your data inside the Cloud, life is way easier</aside>

If you generate lots of data on-prem, the Cloud may not be a good fit for your needs: data has gravity

Cloud

The Cloud is more reliable

But it gets worse

The Cloud tends to deliver reliable storage – they don’t lose your data*

But in biomed research, we cut costs – staffing, shoe-string systems, lack of data protection – and as a result we run storage systems which are just about guaranteed to lose data every year

For this reason (and others), on-prem storage can be cheaper, sometimes much cheaper, than Cloud storage

Are you prepared to pay for less (though more reliable) storage in the Cloud?

* Actually, Cloud providers probably do lose data – all file systems have corruption bugs in them. The hard part: without access to the file system, how do you know when that happens? For the bulk of Cloud data, flipped bits don’t matter (think cat videos or selfies) – but they very much matter in genome sequences. Smells like an unsolved problem.
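One common partial mitigation for silent corruption is to keep your own checksums outside the storage system and re-verify them after every transfer or on a schedule. The sketch below uses SHA-256 from Python's standard library; the function names and the in-memory manifest are illustrative assumptions, not a feature of any Cloud provider. It only detects corruption, it does not repair it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large sequence files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths: list[Path]) -> dict[str, str]:
    """Record a checksum per file at a moment the data is believed good."""
    return {str(p): sha256_of(p) for p in paths}


def find_corrupted(manifest: dict[str, str]) -> list[str]:
    """Return files whose current checksum no longer matches the manifest."""
    return [name for name, expected in manifest.items()
            if sha256_of(Path(name)) != expected]
```

Re-reading every byte to verify it costs I/O and, in the Cloud, possibly money, which is part of why this remains hard at scale.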

Revenue Models

When you buy Cloud, you pay for what you use. The more you use, the more you pay

That works great in retail – the holiday season comes along, you spin up zillions more Web servers selling your wares, your Cloud bill jumps … but your income zooms, so you’re happy. The holiday season ends, your income plummets: you spin down all those Web servers, and your Cloud bill nose-dives also

This revenue model bears little resemblance to the revenue models supporting research, in which utilization of IT resources correlates poorly with income

You have a grant for x $$ … no matter how many publications / brilliant insights / Nobel prizes you generate from this grant … the size of that grant doesn’t change – no matter how big your Cloud bill, you make zero additional income

The Cloud pushes researchers to perform an ROI calculation each time they store a terabyte, run a compute job, or otherwise consume infrastructure services

• Can your researchers do this?

• Can any researchers do this? … if your investigations were well-defined, it wouldn’t be research any more …

• Since *zero* revenue accrues from each run of your job in the Cloud … on what basis do you perform your ROI?

Charge-Back Models

• Grant-funded research results in siloized funding – by Federal law, you cannot spend $$ allocated to one grant on work being performed on another grant – doing so would get you and your institution into hot water

• Risk: barred from applying for future Federally funded work

• Compare this to General Motors – if GM makes money selling a Chevrolet, they can, if they choose, spend those $$ making more Buicks … this is considered normal business practice and no one at the Federal level cares

• So long as a researcher uses their personal credit card, this is manageable … but as soon as this approach quits scaling, research institutes who want to use the Cloud require highly granular charge-back approaches

• What happens when a postdoc spins up an uncontrolled HPC job in the Cloud and blows past the total budget in the grant on which they are working?

• Does your Cloud provider allow you to set integrated cost limits (compute / network / storage) at a per-postdoc, per-week, and per-grant level?
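To make the question concrete, here is a minimal sketch of the kind of guard such a limit implies: spend is tracked per (person, week) within a grant, and a job is refused when it would exceed either the weekly per-person limit or the grant's total. Everything here (class name, fields, dollar amounts) is hypothetical and does not correspond to any Cloud provider's API; a real deployment would need the provider's billing data as input.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GrantBudget:
    """Hypothetical spend tracker for a single grant."""
    total_limit: float               # dollars available on the grant
    weekly_limit_per_person: float   # dollars per person per week
    spent_total: float = 0.0
    spent_by_person_week: dict = field(
        default_factory=lambda: defaultdict(float))

    def authorize(self, person: str, week: str, estimated_cost: float) -> bool:
        """Approve and record a job only if both limits still hold."""
        key = (person, week)
        if self.spent_total + estimated_cost > self.total_limit:
            return False
        if (self.spent_by_person_week[key] + estimated_cost
                > self.weekly_limit_per_person):
            return False
        self.spent_total += estimated_cost
        self.spent_by_person_week[key] += estimated_cost
        return True


budget = GrantBudget(total_limit=50_000, weekly_limit_per_person=2_000)
print(budget.authorize("postdoc-a", "2019-W47", 1_500))  # within both limits
print(budget.authorize("postdoc-a", "2019-W47", 1_000))  # would exceed the weekly limit
```

The bookkeeping is simple; the open question in the bullet above is whether a provider exposes timely, integrated cost data and enforcement hooks at this granularity.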

Partly-Cloudy

This is a big subject

Come join us some October in Seattle at https://partly-cloudy.fredhutch.org to talk with peers about how to flow back and forth between on-prem and Cloud in support of biomedical research

People, Processes, Tools

People, Processes, Tools

Be wary of focusing on the CapEx side of IT infrastructure – sure, the numbers are eye-popping … but remember what matters, in order of decreasing impact on your business

1. People

2. Processes

3. Tools (e.g. CapEx IT Infrastructure)

If you spend a million bucks on storage without increasing staff velocity and improving your processes, you won’t enjoy the result


Centralization vs Siloization

Another challenge is deciding when to centralize your IT specialists and when to distribute them

Distributed IT specialists (department-specific) tend to promote agility; they can focus on a department’s priorities without distraction from larger initiatives and can adopt more risk in exchange for velocity

Centralized IT tends to promote scale (bigger Compute / Network / Storage / Applications) & maturity (e.g. staff retention, staff depth, data protection, compliance with regulation, cybersecurity …)

Pity your leadership: This is hard


Bring it Together

Summary

Leadership at research institutes influences scientific direction when it makes choices around IT Infrastructure – people / processes / tools, including CapEx investment:

Physical Layer (cabling, Data Center capacity)

Compute / Network / Storage

In-house Application Development (staff expertise)

Renting fat pipes to the Cloud; paying Cloud-scale bills

Data generation is cheap: Your researchers can easily overwhelm whatever infrastructure you build. For that matter, all your staff are knowledge workers now and are all placing upward pressure on your IT costs

How do we create feedback loops to help researchers understand their institution’s limits, to help them share capacity, to help them scope the research questions they are asking?

How do we trade off agility and scale?

Do we want to tackle small problems rapidly or big problems slowly?

Questions, Comments, Complaints?

stuart {dot} kendrick {dot} sea {at} gmail {dot} com

Last modified: 2019-11-23

This deck available at http://www.skendric.com/seminar
