Infrastructure in Biomed IT
Current Challenges
Stuart Kendrick
Systems Engineer
Seattle, WA USA
This deck available at http://www.skendric.com/seminar
Outline
• Introduction
• Data Generation is Cheap
• IT is a tool, not a Department
• Trade-offs: Agility and Scale
• What about the Cloud?
• People, Processes, Tools
• Bring it Together
Introduction
Intent
This presentation is aimed at IT infrastructure professionals supporting biomed research; here, I sketch my view of current challenges
• Data Generation is Cheap
• IT is a Tool, not a Department
• Trade-offs: Agility and Scale
• What about the Cloud?
• People, Processes, Tools
Frame
If you want to go fast, go alone. If you want to go far, go together
Many of the choices we face in IT infrastructure these days can be framed as trade-offs between agility and scale
A few people working in a small environment can produce a CubeSat in less than a year. But if you want a rover on Mars, you gotta bring together a big team and work together for decades
Significant investment in IT infrastructure allows us to tackle larger problems. But infrastructure is heavy and slows us down
Data Generation is Cheap
Data Generation is Cheap
• Historically, generating data was expensive; the IT infrastructure needed to capture, process, store, analyze, and publish that data was comparatively small
• Starting early in the 21st century, that ratio began to shift. The cost of generating data has been plunging, while the cost of the supporting IT infrastructure has been rising
• Today, I propose that many of your scientists can generate more data than your institution’s IT infrastructure can handle
Data Handling
Genome Sequencing
https://www.genome.gov/27565109/the-cost-of-sequencing-a-human-genome/ … also known as ‘the most shared chart’ in BioIT
Example
The cost of the equipment, people, time, reagents ... for sequencing the DNA in a genome used to be a lot ... But starting in ~2007, the cost started to plummet. Furthermore, the cost not only plummeted, but it quit tracking Moore’s law
Researchers – and clinicians – can now produce genome sequence cheaply ... And substantially outpace their IT infrastructure
Implications
In the early 90s, a million bucks worth of genome sequence fit on a floppy disk (360K) and could be transmitted across a then-modern network (10 Mb/s) in ~1 second
Today, a million bucks worth of genome sequence consumes 150TB and takes several days to transmit over a modern network (10Gb/s)
Assuming current trends, somewhere in the next decade a million bucks of genome sequence would consume many petabytes and take years to transmit over a current network (100Gb/s)
This tells us that (a) the costs of data generation are plummeting, and that (b) Moore’s Law has ignored networking (networks are slow)
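A rough sketch of the arithmetic behind these transfer times, in Python. The data sizes and link speeds come from the slide; the decimal units (1 K = 1000 bytes) and the 30% sustained-throughput figure are my assumptions for illustration, not measurements. At full line rate the 150TB case works out to a bit under a day and a half; real transfers rarely sustain line rate, which is how you land at several days.

```python
# Back-of-the-envelope transfer times for the examples on this slide.
# Assumptions (not from the deck): decimal units, and an illustrative
# 30% sustained efficiency for the "realistic" case.

KB = 10**3
TB = 10**12
MBPS = 10**6   # bits per second
GBPS = 10**9   # bits per second
SECONDS_PER_DAY = 86_400


def transfer_seconds(size_bytes, link_bits_per_second, efficiency=1.0):
    """Seconds to move size_bytes over a link running at a fraction of line rate."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return size_bytes * 8 / (link_bits_per_second * efficiency)


# Early 90s: a 360K floppy over 10 Mb/s
floppy_seconds = transfer_seconds(360 * KB, 10 * MBPS)

# Today: 150TB over 10 Gb/s, at line rate and at an assumed 30% efficiency
today_days_line_rate = transfer_seconds(150 * TB, 10 * GBPS) / SECONDS_PER_DAY
today_days_assumed = transfer_seconds(150 * TB, 10 * GBPS, efficiency=0.3) / SECONDS_PER_DAY

print(f"floppy: {floppy_seconds:.3f} s")
print(f"150TB at line rate: {today_days_line_rate:.1f} days")
print(f"150TB at 30% of line rate: {today_days_assumed:.1f} days")
```

Swap in your own data-set size and measured throughput to see where your institution sits.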
The Speed of Light is a Problem
Claims
• Compute is cheap → the rise of the Cloud
• Storage is not nearly as cheap – remains a noticeable consideration for large data sets
• Networking is expensive
Networking is ~100x behind Storage and ~1000x behind Compute
{Lots of hand-waving here}
If you spend 99% of your budget on Networking and 1% on Storage, then you can move your bits around wherever & whenever you want: geography doesn’t matter to you
But I bet you didn’t do that …
Data has Gravity: if you produce a lot of it in one place, you will end up clustering your Compute & Storage near it
Figure source: The Economist, 2019-09-12, Ubiquitous computing
Custom Instruments
Using the Allen Institute for another example:
Historically, we have built custom instruments which produce unique, and big, data sets
This approach pushes us toward heavy IT infrastructure: the Compute / Network / Storage / In-house Applications needed to handle this approach
When we underestimate how much Infrastructure this takes, we fall behind in meeting our goals
More generally: data has gravity
Scientific Direction
As another example, consider physics research. Some institutions build particle accelerators, gather unique data, and require heavy IT infrastructure as a result: they explore a particular flavor of physics which requires these data sets
Other institutions perform excellent physics also … without a particle accelerator. They tackle different problems – problems which don’t require data sets from accelerators
Leadership influences scientific direction by choosing the size & character of the IT Infrastructure constructed at their institution
Data Collection
IT is a Tool, not a Department
IT is a Tool, not a Department
Historically, IT functions lived entirely within a single department, called ‘IT’. These days, 99+% of IT functions are distributed throughout the company, operating in the hands of every knowledge worker
As a result, every staff member, during their moment-by-moment work, can place upward pressure on ‘IT’ costs. If you’re attempting to manage IT costs centrally, this distributed pressure can confuse you
If you add up all the $$ you’re spending on ‘IT-related’ efforts, I predict you’ll see a quantity which rivals what you’re spending on Facilities (traditionally the dominant administrative cost of a grant-funded institution)
In the knowledge economy, total IT costs have risen to rival Facilities
IT as a Tool
IT is a Tool, not a Department
More interestingly, because IT tools are laced throughout the daily activities of the knowledge worker (and who in your company is *not* a knowledge worker?), IT tools typically become rate-limiters to business activity
Compute / Network / Storage are obvious examples from the preceding section, but consider also:
- Does the build environment for your in-house developers compile new code in a couple minutes or a couple hours?
- What tools do your staff use to share files, both internally & externally, and are they happy with them? How long does it take to copy that data?
- How long do your staff wait for your Firewall administrators to update rules to support their custom applications?
Managing IT as a rate-limiter to business activity is hard
Perhaps IT Infrastructure is a Department
This is where my model of IT as a Tool falls down … at some point, you will want to centralize IT infrastructure support into a department, minimally for leadership control: when infrastructure fails, chunks of your business grind to a halt. Consider what happens when:
- Phones are down for a couple weeks (including your public-facing front number)
- Internet access becomes flaky
- Ticketing systems go down: consider developers, Facilities, IT itself
- The Virtual Server farm infrastructure gets flaky (no one VM is particularly critical, but when all tens / hundreds / thousands of them go down, the business is kaput)
- The issue is not limited to IT infrastructure … this rate-limiting property is common across all infrastructure: consider if toilet flushing becomes unreliable for a couple weeks: that would put a serious dent in your productivity … combine that issue with a failure of the Facilities ticketing system …
Trade-offs: Agility and Scale
Infrastructure is Big & Slow
Our current building includes a 360KW Data Center, expandable to 720KW. After that, we’re done – no more space / cooling / power
We will soon max-out the existing 360KW space with storage, servers, and compute, including a ~15PB Storage Farm which cost us ~6 mil. What happens when that Storage Farm hits end-of-life? We could:
• Throw away its data, power it off, remove it, install a new one, power it on, and start from scratch
• Build out the 2nd half of the Data Center (~3.5 mil), install a new Storage Farm (~6 mil), copy the data from the old Farm (really?), retire the old one, and continue
Building Data Centers, standing up big storage, and keeping it all going is not an agile activity. But it does allow you to tackle larger problems
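To put numbers on the second option, here is a small Python sketch. The capacity and dollar figures are the approximate ones from this slide; the link speeds used for the copy estimate (10 and 100 Gb/s) and the decimal units are my assumptions, and the copy times are line-rate best cases that ignore file-system and protocol overhead.

```python
# Rough arithmetic for the "build out and migrate" option.
# Slide figures: ~15PB farm, ~6 mil per farm, ~3.5 mil Data Center build-out.
# Assumed for illustration: decimal units and the link speeds passed to copy_days().

TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 86_400

FARM_BYTES = 15 * PB
FARM_COST = 6_000_000
DC_BUILDOUT_COST = 3_500_000

# Purchase cost per terabyte of the existing farm
cost_per_tb = FARM_COST / (FARM_BYTES / TB)

# CapEx for building out the second half and installing a new farm
option_b_capex = DC_BUILDOUT_COST + FARM_COST


def copy_days(size_bytes, link_gbps, efficiency=1.0):
    """Days to copy size_bytes between farms over a link of link_gbps Gb/s."""
    seconds = size_bytes * 8 / (link_gbps * 1e9 * efficiency)
    return seconds / SECONDS_PER_DAY


print(f"cost per TB: ${cost_per_tb:,.0f}")
print(f"option B CapEx: ${option_b_capex:,}")
print(f"copy at 10 Gb/s: {copy_days(FARM_BYTES, 10):.0f} days")
print(f"copy at 100 Gb/s: {copy_days(FARM_BYTES, 100):.0f} days")
```

Even at an idealized 100 Gb/s the copy takes about two weeks, which is why the "(really?)" above is a fair question.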
Trade-offs: Agility vs Scale
The Life of a Building
Another example from the Allen Institute
When we designed our current building, leadership chose a 1G (Cat6) cable plant. Why? This saved us ~$15-20 mil on the cost of a $150 mil building, over the 10G (Cat6a) option, and allowed us to build more lab space
Thus, in this building, we are committed to 1G throughput for your average workstation. <aside>Yes, we have augmented somewhat, spot-adding glass cabling, to deliver 10G+ rates to key locations, and we have room to add more. However, in general, the average workstation will max out at 1G rates for the life of the building. BTW: the Data Center has a more flexible plant and can accept new cabling, and thus higher rates, indefinitely</aside>
Giving the average workstation 10G access supports agility – the average researcher can then download a larger data set for analysis. In our environment, a few users can do this … but most users must architect their data flows to perform analysis via servers installed in the Data Center, which takes … more infrastructure and more calendar time … scales better, but is less agile
Regional Power Delivery
Another example drawn from the modeling world:
Current Machine Learning approaches favor using GPUs over CPUs, and GPUs are power-hungry
In our 10KW / Cabinet Data Center, we can fit (40) CPU nodes in a single Cabinet … but only (4) GPU nodes per Cabinet … the Data Center itself maxes out at 740KW, which is a significant draw, considering the size of the building
At some point, the volume of power the municipality is willing to deliver to your building constrains what you do
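A hedged sketch of the power arithmetic, in Python. The 10KW-per-Cabinet budget and the 40-versus-4 node counts come from this slide; the per-node draws (250 W and 2,500 W) are simply what those counts imply if power is the only limit, and the assumption that the entire Data Center budget reaches the Cabinets (nothing reserved for cooling or network gear) is mine.

```python
# How many nodes fit under a power budget?
# Assumptions (illustrative): power is the only constraint, and per-node
# draw is back-derived from the slide's 40 CPU / 4 GPU nodes per 10KW Cabinet.

CABINET_WATTS = 10_000
CPU_NODE_WATTS = 250     # implied by 40 nodes per Cabinet
GPU_NODE_WATTS = 2_500   # implied by 4 nodes per Cabinet


def nodes_per_cabinet(cabinet_watts, node_watts):
    """Whole nodes that fit within one Cabinet's power budget."""
    return cabinet_watts // node_watts


def max_nodes(datacenter_watts, cabinet_watts, node_watts):
    """Upper bound on nodes if every watt of the Data Center feeds Cabinets."""
    cabinets = datacenter_watts // cabinet_watts
    return cabinets * nodes_per_cabinet(cabinet_watts, node_watts)


# Using the 360KW figure for the currently built-out space
cpu_ceiling = max_nodes(360_000, CABINET_WATTS, CPU_NODE_WATTS)
gpu_ceiling = max_nodes(360_000, CABINET_WATTS, GPU_NODE_WATTS)

print(f"CPU nodes, upper bound: {cpu_ceiling}")
print(f"GPU nodes, upper bound: {gpu_ceiling}")
```

The tenfold gap between the two ceilings is the point: moving a workload from CPUs to GPUs changes how far a fixed power feed will stretch.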
What about the Cloud?
The Cloud as Revolution
A revolutionary capability
The Cloud provides agile infrastructure at scale … doesn’t this solve our problem and allow us to quit trading off agility and scale?
That depends. At the Allen Institute, we have a history of building custom instruments to produce unique data sets which are big
Shoveling lots of data into and out of the Cloud is something the Cloud is bad at: slow & expensive. <aside>If you can generate your data inside the Cloud, life is way easier</aside>
If you generate lots of data on-prem, the Cloud may not be a good fit for your needs: data has gravity
Cloud
The Cloud is more reliable
But it gets worse
The Cloud tends to deliver reliable storage – they don’t lose your data*
But in biomed research, we cut costs – staffing, shoestring systems, lack of data protection – and so we run storage systems which are just about guaranteed to lose data every year
For this (and other) reasons, on-prem storage can be cheaper, sometimes much cheaper, than Cloud storage
Are you prepared to pay for less (though more reliable) storage in the Cloud?
* Actually, Cloud providers probably do – all file systems have corruption bugs in them. The hard part: without access to the file system, how do you know when that happens? For the bulk of Cloud data, flipped bits don’t matter (think cat videos or selfies) – but they very much matter in genome sequences. Smells like an unsolved problem.
Revenue Models
When you buy Cloud, you pay for what you use. The more you use, the more you pay
That works great in retail – the holiday season comes along, you spin up zillions more Web servers selling your wares, your Cloud bill jumps … but your income zooms, so you’re happy. The holiday season ends, your income plummets: you spin down all those Web servers, and your Cloud bill nose-dives also
This revenue model bears little resemblance to the revenue models supporting research, in which utilization of IT resources correlates poorly with income
You have a grant for x $$ … no matter how many publications / brilliant insights / Nobel prizes you generate from this grant … the size of that grant doesn’t change – no matter how big your Cloud bill, you make zero additional income
The Cloud pushes researchers to perform an ROI calculation each time they store a terabyte, run a compute job, or otherwise consume infrastructure services
• Can your researchers do this?
• Can any researchers do this? … if your investigations were well-defined, it wouldn’t be research any more …
• Since *zero* revenue accrues from each run of your job in the Cloud … on what basis do you perform your ROI?
Charge-Back Models
• Grant-funded research results in siloized funding – by Federal law, you cannot spend $$ allocated to one grant on work being performed on another grant – doing so would get you and your institution into hot water
  • Risk: barred from applying for future Federally funded work
• Compare this to General Motors – if GM makes money selling a Chevrolet, they can, if they choose, spend those $$ making more Buicks … this is considered normal business practice and no one at the Federal level cares
• So long as a researcher uses their personal credit card, this is manageable … but as soon as this approach quits scaling, research institutes who want to use the Cloud require highly granular charge-back approaches
  • What happens when a postdoc spins up an uncontrolled HPC job in the Cloud and blows past the total budget in the grant on which they are working?
• Does your Cloud provider allow you to set integrated cost limits (compute / network / storage) on a per post-doc, per week, and per grant level?
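If your provider does not offer limits at this granularity, one option is to check exported billing records yourself. The Python sketch below shows the shape of such a check. Everything in it is hypothetical: the `Charge` record, the `find_overruns` function, and the cap values are illustrations I made up, not any Cloud provider's API. Note that a check like this reports overruns after the fact; it does not stop a runaway job.

```python
# Sketch of a granular charge-back check over exported billing records.
# All names here (Charge, find_overruns) are hypothetical illustrations.
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Charge:
    grant: str       # grant identifier the spend is billed against
    person: str      # who incurred the charge
    day: date        # date of the charge
    category: str    # "compute", "network", or "storage"
    dollars: float


def find_overruns(charges, weekly_cap, grant_budgets):
    """Return (person-weeks over the weekly cap, grants over total budget).

    Compute, network, and storage are summed together, so the caps are
    integrated across categories. A grant missing from grant_budgets is
    treated as having a zero budget.
    """
    per_week = defaultdict(float)
    per_grant = defaultdict(float)
    for c in charges:
        year, week = c.day.isocalendar()[:2]   # ISO year and week number
        per_week[(c.grant, c.person, year, week)] += c.dollars
        per_grant[c.grant] += c.dollars

    weeks_over = sorted(k for k, total in per_week.items() if total > weekly_cap)
    grants_over = sorted(
        g for g, total in per_grant.items() if total > grant_budgets.get(g, 0.0)
    )
    return weeks_over, grants_over
```

Run it nightly against the billing export and route the results to the grant's PI; a tighter loop would need provider-side enforcement.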
Partly-Cloudy
This is a big subject
Come join us some October in Seattle at https://partly-cloudy.fredhutch.org to talk with peers about how to flow back and forth between on-prem and Cloud in support of biomedical research
People, Processes, Tools
People, Processes, Tools
Be wary of focusing on the CapEx side of IT infrastructure – sure, the numbers are eye-popping … but remember what matters, in order of decreasing impact on your business
1. People
2. Processes
3. Tools (e.g. CapEx IT Infrastructure)
If you spend a million bucks on storage without increasing staff velocity and improving your processes, you won’t enjoy the result
Centralized vs Siloization
Another challenge is when to centralize your IT specialists and when to distribute them
Distributed IT specialists (department-specific) tend to promote agility; they can focus on a department’s priorities without distraction from larger initiatives and can adopt more risk in exchange for velocity
Centralized IT tends to promote scale (bigger Compute / Network / Storage / Applications) & maturity (e.g. staff retention, staff depth, data protection, compliance with regulation, cybersecurity …)
Pity your leadership: This is hard
Bring it Together
Summary
Leadership at research institutes influences scientific direction when it makes choices around IT Infrastructure – people / processes / tools, including CapEx investment:
Physical Layer (cabling, Data Center capacity)
Compute / Network / Storage
In-house Application Development (staff expertise)
Renting fat pipes to the Cloud; paying Cloud-scale bills
Data generation is cheap: Your researchers can easily overwhelm whatever infrastructure you build. For that matter, all your staff are knowledge workers now and are all placing upward pressure on your IT costs
How do we create feedback loops to help researchers understand their institution’s limits, to help them share capacity, to help them scope the research questions they are asking?
How do we trade off agility and scale?
Do we want to tackle small problems rapidly or big problems slowly?
Questions, Comments, Complaints?
stuart {dot} kendrick {dot} sea {at} gmail {dot} com
Last modified: 2019-11-23