Agile Data Warehousing with Production Sandboxing
Stephen Brobst, Chief Technology Officer
Teradata Corporation
Oliver Ratzesberger, Senior Director
eBay
Teradata: Raising Intelligence
Adastra Information Management Conference 2009
The Carlu, Toronto, Canada
April 23, 2009
Stephen Brobst, Teradata Corporation: stephen.[email protected]
Oliver Ratzesberger, eBay: oratzesberger@ebay.com
Agenda
• The Business Need
• Implementation Options
• Governance
• eBay Case Study
• Discussion
The Business Need
Sustaining Success in Data Warehousing
The biggest enemy to sustained success in data warehousing is stagnation.
Continued delivery of value from a data warehouse demands that organizations aggressively encourage new and creative methods in their use of information and analytics.
If an organization does not evolve and improve its data warehouse, value will diminish over time.
Value of a New Analytic Capability
[Figure: Value plotted against Time, with "Initial Deployment" and "Repeated Use" marked along the curve.]
Value of a New Analytic Capability
When a report or analytic capability is first introduced, it offers the potential for new insight and ways of exploiting information.
• Value initially increases as adoption of the new capability takes place.
• As time goes on, those once groundbreaking insights become old news.
• The value of the report or analytic capability begins to decrease over time after the point where the organization has incorporated the insights into its standard operating procedures.
Over time, the data warehouse will be burdened with generating lots and lots of reports that eventually deliver value that is less than the cost of maintaining them.
The Requirement for Innovation
Organizations must engage in two actions to ensure sustained success of a data warehouse:
1. Constantly innovate and develop new capabilities.
2. Prune older capabilities where the value no longer justifies the cost.
It is not enough to encourage and sponsor innovation ... organizations must provide an appropriate platform and environment to facilitate innovation in the area of information exploitation.
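The pruning action listed on this slide can be illustrated with a minimal sketch. It assumes each report carries an estimated yearly value and yearly maintenance cost. The `Report` type, its field names, and the figures are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """A deployed report with hypothetical yearly value and cost estimates."""
    name: str
    annual_value: float
    annual_cost: float


def prune_candidates(reports):
    """Return reports whose estimated value no longer justifies their cost."""
    return [r for r in reports if r.annual_value < r.annual_cost]


# Illustrative figures only (assumed, not taken from the slides).
reports = [
    Report("weekly_sales_summary", annual_value=40_000, annual_cost=5_000),
    Report("legacy_campaign_tracker", annual_value=1_000, annual_cost=6_000),
]

for report in prune_candidates(reports):
    print(f"Candidate for pruning: {report.name}")
```

In practice the hard part is estimating value, not comparing it; the comparison only shows where such estimates would feed a pruning decision.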
Use Case
Marketing has an outside source of data that it wants to bring into the data warehouse, but it is not yet sure whether the data has high value, so we need to experiment with the data before signing the purchase contract.
Rigorous Solution Methodology
[Figure: solution methodology diagram. Phases: Strategy, Research, Analyze, Design, Equip, Build, Integrate, Manage, spanned by Project Management. Activities shown include Opportunity Assessment, Enterprise Assessment, Application Requirement, System Architecture, Logical Model, Package Adaptation, Data Mapping, Infrastructure & Education, and User Curriculum.]
Rigorous Solutions Methodology
Takes too long!
Agile Data Warehousing
The Manifesto for Agile Software Development puts forth the following principles:
• Value individuals and interactions over processes and tools,
• Value working software over comprehensive documentation,
• Value customer collaboration over contract negotiation, and
• Value response to change over following a plan.
While it is recognized that there is value in the items on the right, the items on the left are valued even more in the agile development methodology.
Agile Data Warehousing
The underlying philosophy that drives these values in the context of data warehousing is to put the highest priority on satisfying end user (knowledge worker) requirements through early and continuous delivery of analytic capability.
This does not mean that requirements documents, design documents, entity-relationship diagrams, data dictionaries, etc. are not important - but it does mean that the emphasis is much more on delivery than process.
Agile Data Warehousing
Goal: Same-day availability of data in the analytic environment.
Agile Data Warehousing
Need a way to get data into the data warehouse without the overhead of a full-blown development methodology:
• Allow for "load and go" analytics.
• Allow non-certified content to be used together with content in the enterprise data warehouse (EDW).
• Limited users and limited use.
Implementation Options
Option 1: Separate Development System
• Deploy the "experimental" data in the development/test environment.
• Configure the development/test environment to the full size of the production environment.
• Demonstrate the value of the data and then (assuming positive ROI) use best practices to bring data into production DW.
Option 1: Separate Development System
Objection: It costs too much to keep the development/test environment at full size and fully up to date with the production environment.
Option 2: Downsized Development System
• Deploy the "experimental" data in the development/test environment.
• Configure the development/test environment with "sampling" from the production environment.
• Demonstrate the value of the data and then (assuming positive ROI) use best practices to bring data into production DW.
Option 2: Downsized Development System
Objection: It is more difficult to implement a prototype with sampled data sets, and more difficult to make the ROI case.
Option 3: Federated Development System
• Deploy the "experimental" data in the development/test environment.
• Join across the development and production environments to perform the analysis.
• Demonstrate the value of the data and then (assuming positive ROI) use best practices to bring data into production DW.
Option 3: Federated Development System
Objection: Performance sucks when joining large data sets across a network (even when both systems are Teradata!).
Option 4: Production Sandbox
• Deploy the "experimental" data directly into the production environment.
• Separate Teradata "databases" for sandbox data, with joins allowed to the production data, all on a single system.
• Demonstrate the value of the data and then (assuming positive ROI) use best practices to bring data into production DW.
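The idea of separate databases on one system, joined in a single query, can be sketched in miniature. This example uses SQLite's `ATTACH DATABASE` purely as a stand-in for separate Teradata databases; it is not Teradata syntax, and the table names, column names, and values are hypothetical.

```python
import sqlite3

# One connection stands in for the single production system.
conn = sqlite3.connect(":memory:")  # "main" plays the production DW
conn.execute("ATTACH DATABASE ':memory:' AS sandbox")  # separate sandbox database

# Certified production content (hypothetical table).
conn.execute(
    "CREATE TABLE main.customer (customer_id INTEGER PRIMARY KEY, region TEXT)"
)
conn.executemany(
    "INSERT INTO main.customer VALUES (?, ?)",
    [(1, "East"), (2, "West"), (3, "East")],
)

# Experimental, non-certified data loaded straight into the sandbox.
conn.execute("CREATE TABLE sandbox.vendor_score (customer_id INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO sandbox.vendor_score VALUES (?, ?)",
    [(1, 0.75), (2, 0.25), (3, 0.25)],
)

# Join sandbox data to production data on the same system, with no network hop.
rows = conn.execute(
    """
    SELECT c.region, AVG(s.score) AS avg_score
    FROM main.customer AS c
    JOIN sandbox.vendor_score AS s ON s.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

print(rows)
```

The sandbox table lives in its own namespace and can be dropped without touching production content, while the join still runs locally.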
Option 4: Production Sandbox
Objection: End users manage their own space and have create-table privileges on the production system.
Governance
Controls
• No "production" reporting from the sandbox environment.
• Automated resource governors to prevent "runaway" queries in the sandbox area.
• Data residency no more than XX days.
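Two of the controls above, the residency limit and the resource governor, lend themselves to a small sketch. The slide leaves the residency limit unspecified ("XX days"), so the limit is a parameter here and the value 30 is only a placeholder. The `SandboxTable` type, the CPU-seconds budget, and all names are hypothetical; a real deployment would rely on the database's own workload management features.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SandboxTable:
    """Hypothetical catalog entry for a table in the sandbox area."""
    name: str
    created_on: date


def expired_tables(tables, today, max_age_days):
    """Return tables that have stayed in the sandbox longer than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [t for t in tables if t.created_on < cutoff]


def is_runaway(cpu_seconds_used, cpu_seconds_limit):
    """Flag a sandbox query that has exceeded its (assumed) resource budget."""
    return cpu_seconds_used > cpu_seconds_limit


# The source leaves the limit as "XX days"; 30 is a placeholder, not a recommendation.
EXAMPLE_MAX_AGE_DAYS = 30

tables = [
    SandboxTable("vendor_trial_feed", date(2009, 1, 5)),
    SandboxTable("campaign_scratch", date(2009, 4, 10)),
]
overdue = expired_tables(
    tables, today=date(2009, 4, 23), max_age_days=EXAMPLE_MAX_AGE_DAYS
)
for table in overdue:
    print(f"Past residency limit: {table.name}")
```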
Promoting Content into the EDW
• Monitoring and "promoting" data content in the sandbox.
• Once the value is proven, use "proper" methodologies to integrate content into the enterprise data warehouse.
• Encourage refinement of requirements as part of the sandbox experience.
Case Study
The eBay Experience
eBay Analytics Technology Highlights
>50 TB/day of new, incremental data
>100k data elements
>50A10 new records/day
>50k chains of logic
>5000 business users & analysts
Active/Active
Turning over a TB every 5 seconds
24x7x365
Always online
Millions of queries/day
99.9+% availability
Near-real-time
eBay Analytics Core
[Figure: eBay analytics architecture diagram. Legible labels: MicroStrategy, Unica, Crystal, SAS, SQL; Primary and Secondary; Relational Data; MPP; Teradata; Linux; Local Interconnect; Wide-Area Interconnect: 1000 miles; Sun Fire; Solaris; XML, name/value, raw; Phoenix, AZ; MPP/HPC/Grid; Data Integration; Informatica. Capacities shown include 2.2 PB and 6.6 PB.]
Design for the Unknown
>85% of eBay analytical workload is NEW & unknown
Exploration is the core of an analytical company
The metrics you know are "cheap".
The metrics you don't know are expensive but also high in potential ROI.
Design can't be static or dependent on specific questions or dimensions.
"We Need Data Marts!"
[Figure: chart of data mart and data warehouse architecture types; most labels are illegible in the source.]
Data Mart Dilemma
Total Cost of Ownership (TCO)
Fully loaded cost: a staggering $500k++
Biggest drivers are maintaining separate databases and weekly/daily/hourly data transfers
Data inconsistencies
Data redundancy
Increased complexity
Loss of lineage over time
Analytics as a Service
Massive-scale analytical utility computing
Bring your data - perform your analytics
From simple web-based data upload ...
... to fully private utility access
Combine data and code with ALL existing data
Analytics as a Service
[Figure: screenshot; text illegible in the source.]
Analytics as a Service
From a simple web-based table upload:
[Figure: screenshot of the web-based table upload page; text illegible in the source.]
...to fully private utility access
We call them PETs (Prototyping Environment in Teradata).
More than 75 are active right now.
In most cases they are small (100 GB to 5 TB), since all the main data is already in the EDW.
They are free to the business units.
Analytics as a Service: Benefits
Improved time to market: days/weeks versus months.
Enable the business to do agile prototyping.
Enable the users to "Fail Fast": make it easy to try out new ideas.
Eliminate stray data marts.
Governance Rule 1
• Keep the production data clean.
• The data life cycle methodology is there for a reason.
• Do not "pollute" production data with data of unknown source and validation.
• Equivalent to a viral injection ... and you may not recover.
• Do not inject prototype data into "core" DW data:
• Data ingest (ETL/ELT) does NOT have access to the sandbox.
• Not even to populate the sandbox.
• Strictly and conceptually enforced on both batch and user accounts.
Governance Rule 2
• Prototypes written by experienced personnel:
• PETs assigned to NAMED personnel.
• Previous experience and training required.
• Prototype personnel are typically former DW developers who transitioned into a business unit.
• Speed of implementation.
• Knowledge of DW processes and methodologies.
• Knowledge of data.
Governance Rule 3
• Sunset dates must be applied:
• Hold a post mortem.
• Retire it or promote it.
• The prototype must not become a "black market" production application.
• Business cannot depend on them.
• DW cannot give them appropriate support.
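The sunset rule can be read as a simple decision procedure, sketched below under stated assumptions. The `Prototype` type, the `value_proven` flag (standing for the outcome of the post mortem), and the decision labels are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Decision(Enum):
    KEEP_RUNNING = "keep running until the sunset date"
    PROMOTE = "promote into the EDW using the full methodology"
    RETIRE = "retire and drop the sandbox content"


@dataclass(frozen=True)
class Prototype:
    """Hypothetical record for a sandbox prototype under governance."""
    name: str
    sunset_date: date
    value_proven: bool  # assumed outcome of the post mortem


def sunset_review(prototype, today):
    """Before the sunset date the prototype runs; afterwards it is promoted or retired."""
    if today < prototype.sunset_date:
        return Decision.KEEP_RUNNING
    return Decision.PROMOTE if prototype.value_proven else Decision.RETIRE


pet = Prototype("vendor_data_trial", sunset_date=date(2009, 6, 30), value_proven=False)
print(sunset_review(pet, today=date(2009, 7, 1)).value)
```

There is deliberately no "extend indefinitely" outcome in this sketch: once the sunset date passes, the only results are promotion or retirement, which is what keeps a prototype from turning into an unsupported production application.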
Key Process 1
Pre-defined methods, templates, and rules for setup and teardown:
• Well-defined rules for usage.
• Defined, named owners.
• Pre-defined security templates.
• Pre-defined Help Desk responses to add/drop users.
Key Process 2
Help Desk support is critical.
• Direct access for PET personnel to the most senior architectural and technical personnel.
• "Bidirectional" mentoring:
• Best and brightest technical resources get closer to the business ...
• Business gets closer to fast and effective implementations.
• It does not take long for PET personnel to become self-sufficient.
Key Learning
• In-place processes enable "time-to-market" benefits.
• Put the processes and security in place first.
• Failure = Learning
• Do so with great effectiveness ...
• Fail fast, fail early.
• Most business units now maintain a permanent sandbox.
• Complex analysis and decision making within a business day!
Questions?
References
[1] Higgins, D. Don't Just Tread Water. Teradata Magazine. Volume 8, Number 1. 2008. pp. 23-25.
[2] Brobst, S., M. McIntire, and E. Rado. Agile Data Warehousing with Integrated Sandboxing. The BI Journal. First Quarter, 2008.
[3] Beck, K., M. Beedle, A. van Bennekum, A. Cockburn, W. Cunningham, M. Fowler, J. Grenning, J. Highsmith, A. Hunt, R. Jeffries, J. Kern, B. Marick, R. Martin, S. Mellor, K. Schwaber, J. Sutherland, and D. Thomas. The Manifesto for Agile Software Development. February, 2001.