Ghislain Fourny
Big Data
13. Data Warehousing
The road to analytics
Another history of data management (T. Hofmann)
Age of Transactions: 1970s – 2000s
Age of Business Intelligence: 1995 -
Age of Big Data: 2000s -
Paradigms
OLTP vs. OLAP
OnLine Transaction Processing
Consistent and Reliable Record-Keeping
Transactions and results on small portions of data
Lots of transactions on small portions of data
Normalized Data
OnLine Analytical Processing
Data-based Decision Support
OLAP is Big
Possibly many joins
Large portions of the data
Few long, heavy queries
OLAP Examples
Web analytics
Sales analytics
Management support
Statistical analysis (census)
Scientific databases (e.g., bio-informatics)
OLTP vs. OLAP
OLTP: Detailed Individual Records
vs.
OLAP: Historical Summarized Consolidated Data
OLTP: Lots of writes
vs.
OLAP: Lots of reads
OLTP: Small sets of records
vs.
OLAP: Analysis over big chunks
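The contrast between small sets of records and analysis over big chunks can be made concrete with a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns and its rows are hypothetical and chosen only for illustration: the OLTP-style statements touch a single record, while the OLAP-style query reads and aggregates the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, country TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders (country, amount) VALUES (?, ?)",
    [("CH", 100), ("DE", 250), ("CH", 50), ("DE", 75)],
)

# OLTP style: a short write and a point lookup on one record.
conn.execute("UPDATE orders SET amount = 120 WHERE id = 1")
one_order = conn.execute("SELECT country, amount FROM orders WHERE id = 1").fetchone()

# OLAP style: a read-only aggregate over a big chunk (here: all) of the data.
totals = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
```

Here `one_order` is `('CH', 120)` and `totals` is `[('CH', 170), ('DE', 325)]`.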
OLTP: fully interactive (< 1s)
vs.
OLAP: slow interactive
OLTP: Consistency
vs.
OLAP: Redundancy
OLAP
A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management's decision-making process.
Subject-oriented
customers
products
sales
events
Integrated
Time-variant
Time in data warehouses is paramount (not so in OLTP systems)
[Timeline: 2016, Y-1, Y-2, Y-3, Y-4, Y-5, Y-6, Y-7, Y-8, Y-9]
Often past 5-10 years
Non-volatile
Load. Access. Period.
No updates.
Architecture
ERP, CRM, OLTP, Files -> ETL -> Analyze, Report, Mine
OLAP: Redundancy
Materialized views (denormalized)
1st Normal Form (tabular) – The Key
2nd Normal Form (not joined) – The Whole Key
3rd Normal Form – Nothing But The Key
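As a small illustration of the trade-off between normalized tables and denormalized, materialized data, the sketch below uses Python's sqlite3 module. The `customer`, `sale` and `sale_wide` tables are hypothetical. SQLite has no materialized views, so a plain table created from a join stands in for one here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): each piece of information is stored once.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE sale (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        amount INTEGER
    );
    INSERT INTO customer VALUES (1, 'Peter', 'Germany'), (2, 'Mary', 'Switzerland');
    INSERT INTO sale VALUES (1, 1, 1000), (2, 2, 1500), (3, 1, 3000);

    -- Denormalized (OLAP-style): the join is computed once and stored redundantly.
    CREATE TABLE sale_wide AS
        SELECT s.id AS sale_id, c.name, c.country, s.amount
        FROM sale s JOIN customer c ON c.id = s.customer_id;
""")

rows = conn.execute(
    "SELECT sale_id, name, country, amount FROM sale_wide ORDER BY sale_id"
).fetchall()
```

Peter's name and country now appear in two rows of `sale_wide`: that redundancy is the price paid for answering analytical queries without a join.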
Why materialize?
Operational data sources are too heterogeneous
OLAP: Special-purpose indices
OLAP: Derived data
Querying OLAP
Slow interactive (1 - 10s, hours)
vs.
Continuous monitoring/tracking
[Chart: Series 1-3 over Categories 1-4]
Summary of differences
              OLTP                          OLAP
Source        Original (operational)        Derived (consolidated)
Purpose       Business tasks                Decision support
Interface     Snapshot                      Multidimensional views
Writing       Short and fast, by end user   Periodic refreshes, by batch jobs
Queries       Simple, small results         Complex and aggregating
Design        Many normalized tables        Few denormalized cubes
Precision     ACID                          Sampling, confidence intervals
Freshness     Serializability               Reproducibility
Speed         Very fast                     Often slow
Optimization  Inter-query                   Intra-query
Space         Small, archiving old data     Large, less space efficient
Backup        Very important                Re-ETL
Data Model
Data Cubes
Data is stored in multidimensional hypercubes
Year
Country
Product
Fact
2016, CH, Server
Dimensions
Where?
What?
Who?
Which currency?
When?
etc.
Fact table
Where?       When?  Who?   Value
Germany      2016   Peter  1,000$
Germany      2015   Mary   15,000$
Switzerland  2016   Mary   1,500$
Switzerland  2015   Peter  3,000$
Australia    2015   Peter  6,000$
China        2015   Mary   1,000$
Aggregation
Where?       When?  Who?   Value
Germany      2016   Peter  1,000$
Germany      2015   Mary   15,000$
Switzerland  2016   Mary   1,500$
Switzerland  2015   Peter  3,000$
Australia    2015   Peter  6,000$
China        2015   Mary   1,000$
Aggregated over Where?:
When?  Who?   Value
2016   Peter  1,000$
2015   Mary   16,000$
2016   Mary   1,500$
2015   Peter  9,000$
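The aggregation step can be sketched in a few lines of Python. The rows are the ones from the fact table on the slides; the function name and the tuple layout are illustrative choices, not part of any OLAP product.

```python
from collections import defaultdict

# Fact table from the slides: (where, when, who, amount in $).
facts = [
    ("Germany", 2016, "Peter", 1000),
    ("Germany", 2015, "Mary", 15000),
    ("Switzerland", 2016, "Mary", 1500),
    ("Switzerland", 2015, "Peter", 3000),
    ("Australia", 2015, "Peter", 6000),
    ("China", 2015, "Mary", 1000),
]

def aggregate_over_where(rows):
    """Sum amounts per (when, who), dropping the 'where' dimension."""
    totals = defaultdict(int)
    for _where, when, who, amount in rows:
        totals[(when, who)] += amount
    return dict(totals)

totals = aggregate_over_where(facts)
```

The result matches the aggregated table: for example, `totals[(2015, "Mary")]` is 16000 (Germany plus China) and `totals[(2015, "Peter")]` is 9000 (Switzerland plus Australia).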
Slicing
Slicers and Dicers
Usually between 1 and 3 dicers, often 2
Slicers: Servers, World, USD
Dicers (rows: salesperson, columns: year):
        2014        2015        2016
Peter   1,000,000$  1,500,000$  1,400,000$
Mary    2,000,000$  2,300,000$  2,200,000$
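One way to picture slicers and dicers in code: slicers fix a single value on some dimensions, and dicers are the dimensions left to spread over rows and columns. The following is a minimal sketch under that reading. The dictionary layout, the function name and the "Laptops" row are assumptions made for illustration; the other figures are taken from the table above.

```python
def slice_and_dice(facts, slicers, row_dim, col_dim, measure="amount"):
    """Keep facts matching every slicer, then lay the measure out on a grid."""
    grid = {}
    for fact in facts:
        if all(fact[dim] == value for dim, value in slicers.items()):
            cell = (fact[row_dim], fact[col_dim])
            grid[cell] = grid.get(cell, 0) + fact[measure]
    return grid

facts = [
    {"product": "Servers", "currency": "USD", "who": "Peter", "year": 2014, "amount": 1_000_000},
    {"product": "Servers", "currency": "USD", "who": "Peter", "year": 2015, "amount": 1_500_000},
    {"product": "Servers", "currency": "USD", "who": "Mary", "year": 2014, "amount": 2_000_000},
    # Hypothetical row that the "Servers" slicer filters out.
    {"product": "Laptops", "currency": "USD", "who": "Mary", "year": 2014, "amount": 999},
]

grid = slice_and_dice(facts, {"product": "Servers", "currency": "USD"}, "who", "year")
```

With two dicers (`who` and `year`) the result is a two-dimensional grid, which is why two dicers are the most common case: the output fits a spreadsheet-like table.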
Products: the big three
Essbase
Cognos
Analysis Services
ETLing
OLAP: Derived data
ETL
Extract, Transform, Load
Extract
Triggers
Gateways
Incremental updates
Log extraction
Transform
Derivation
Value transformation (e.g., Herr -> Mister)
Cleaning
Filter, split, merge, join
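The transform operations can be illustrated with a minimal Python sketch. The record fields (`name`, `quantity`, `unit_price`) and the lookup table are hypothetical; only the Herr -> Mister mapping comes from the slide.

```python
# Lookup table for value transformation (the slide's example: Herr -> Mister).
TITLE_MAP = {"Herr": "Mister"}

def transform(records):
    """Clean, filter, split and derive fields for a list of source records."""
    out = []
    for rec in records:
        name = rec["name"].strip()             # cleaning: trim stray whitespace
        if not name:                           # filter: drop records without a name
            continue
        title, _, rest = name.partition(" ")   # split: title vs. remaining name
        out.append({
            "title": TITLE_MAP.get(title, title),           # value transformation
            "name": rest,
            "total": rec["quantity"] * rec["unit_price"],   # derivation
        })
    return out

source = [
    {"name": "  Herr Meier ", "quantity": 2, "unit_price": 50},
    {"name": "", "quantity": 1, "unit_price": 10},
]
clean = transform(source)
```

The second record is filtered out, and the first becomes `{"title": "Mister", "name": "Meier", "total": 100}`.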
Load
Integrity constraints
Sorting
Build indices
Partition
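A load step along these lines might look as follows with Python's sqlite3 module. This is only a sketch: the `sales` table, its constraints and the index name are assumptions, and partitioning is left out because SQLite does not offer it.

```python
import sqlite3

def load(conn, rows):
    """Bulk-load sorted rows into a constrained table, then build an index."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS sales (
            country TEXT    NOT NULL,
            year    INTEGER NOT NULL CHECK (year >= 1900),
            amount  INTEGER NOT NULL CHECK (amount >= 0)
        )
    """)
    with conn:  # one transaction for the whole batch; rolled back on error
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sorted(rows))
    # Building the index after the bulk insert avoids maintaining it row by row.
    conn.execute("CREATE INDEX IF NOT EXISTS sales_by_year ON sales (year)")

conn = sqlite3.connect(":memory:")
load(conn, [("Germany", 2016, 1000), ("China", 2015, 1000)])
loaded = conn.execute("SELECT country, year, amount FROM sales ORDER BY rowid").fetchall()

# A row violating an integrity constraint rejects the batch.
try:
    load(conn, [("Germany", 2016, -5)])
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```

After the first call the rows are stored in sorted order; the second batch is rejected by the `CHECK` constraint, so the table still holds two rows.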
Considerations
When?
Granularity
Infrastructure
Implementation
Two flavors of OLAP
ROLAP
MOLAP
Fact table (ROLAP)
Dim1 Dim2 Dim3 Dim4 Dim5 Value
Star Schema
Dim1 Dim2 Dim3 Dim4 Dim5 Value
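A star schema can be sketched in SQLite as one fact table whose dimension columns are foreign keys into small dimension tables. All table and column names below are hypothetical, and only two dimensions are broken out to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per dimension value.
    CREATE TABLE dim_country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_person  (id INTEGER PRIMARY KEY, name TEXT);

    -- Fact table at the center of the star: foreign keys plus the value.
    CREATE TABLE fact_sales (
        country_id INTEGER REFERENCES dim_country(id),
        person_id  INTEGER REFERENCES dim_person(id),
        year       INTEGER,
        value      INTEGER
    );

    INSERT INTO dim_country VALUES (1, 'Germany'), (2, 'Switzerland');
    INSERT INTO dim_person  VALUES (1, 'Peter'), (2, 'Mary');
    INSERT INTO fact_sales  VALUES (1, 1, 2016, 1000), (1, 2, 2015, 15000),
                                   (2, 2, 2016, 1500), (2, 1, 2015, 3000);
""")

per_country = conn.execute("""
    SELECT c.name, SUM(f.value)
    FROM fact_sales f JOIN dim_country c ON c.id = f.country_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```

Each analytical query joins the fact table with only the dimension tables it needs; here `per_country` is `[('Germany', 16000), ('Switzerland', 4500)]`.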
Snow-flake schema
Dim1 Dim2 Dim3 Dim4 Dim5 Value
Normalize more
Querying
Querying cubes
Tables: SQL
Cubes: MDX
MDX stands for...
Multi-Dimensional eXpressions
Measures
Amount of licenses
Revenues
Taxes paid
...
Dimensions
Quarter
Salesperson
Product
Country
In short...
A cube is a list of dimensions indexing a list of measures
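Read literally, this definition maps onto a dictionary keyed by one value per dimension, with a tuple of measures in each cell. The sketch below follows that reading. The dimension and measure names echo the two preceding slides, while the coordinates and numbers are made up.

```python
# Dimensions index the cube; each cell holds one value per measure.
DIMENSIONS = ("quarter", "salesperson", "product", "country")
MEASURES = ("licenses", "revenue", "taxes")

# All coordinates and numbers below are invented for illustration.
cube = {
    ("2016-Q4", "John", "Server", "CH"): (10, 50_000, 4_000),
    ("2016-Q4", "Mary", "Server", "DE"): (7, 35_000, 6_500),
}

def measure(cube, coordinates, name):
    """Look up one measure at one point of the cube."""
    return cube[coordinates][MEASURES.index(name)]

revenue = measure(cube, ("2016-Q4", "John", "Server", "CH"), "revenue")
```

Real MOLAP engines use far more compact storage, but the addressing idea is the same: a full set of coordinates identifies a cell, and the cell carries the measures.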
Hierarchies
Dimension values are organized in hierarchies.
[Location]
[Geo]: i.e., slice and aggregate by geographic region, etc.
[Economy]: i.e., slice and aggregate by economic partnership, etc.
Members
Members correspond to levels in a hierarchy.
[Geo]
  [Europe]
    [Switzerland]
      [ZH]
      [BE]
    [Germany]
    ...
  [Asia]
    [China]
    [India]
    ...
  [America]
    [Canada]
    [USA]
    [Brazil]
    ...
  [Africa]
  [Oceania]
Identifying a member
[Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich]
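A member identifier is a path from the dimension down through its hierarchy. As a rough sketch, the path can be split into its bracketed names and followed through a nested dictionary. The dictionary holds only a fragment of the [Geo] hierarchy from the slides, and the helper names are hypothetical.

```python
import re

# A fragment of the [Location].[Geo] hierarchy, as nested dictionaries.
HIERARCHY = {
    "Location": {
        "Geo": {
            "Europe": {
                "Switzerland": {"ZH": {"Zurich": {}}, "BE": {}},
                "Germany": {},
            }
        }
    }
}

def parse_member(path):
    """Split '[A].[B].[C]' into ['A', 'B', 'C']."""
    return re.findall(r"\[([^\]]+)\]", path)

def children(path):
    """Return the names directly below the member identified by the path."""
    node = HIERARCHY
    for name in parse_member(path):
        node = node[name]  # raises KeyError for an unknown member
    return sorted(node)

cantons = children("[Location].[Geo].[Europe].[Switzerland]")
```

Here `cantons` is `['BE', 'ZH']`, and the full path from the slide parses into six names ending in `Zurich`.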
Tuples
([Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich],
 [Salesmen].[People].[John],
 [Time].[Year].[2016].[Q4])
A list of members
Associated with a dimensionality (list of hierarchies)
([Location].[Geo], [Salesmen].[People], [Time].[Year])
Sets
{([Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich],
  [Salesmen].[People].[John],
  [Time].[Year].[2016].[Q4]),
 ([Location].[Geo].[Europe].[Switzerland].[BE].[Bärn],
  [Salesmen].[People].[Mary],
  [Time].[Year].[2016].[Q4]),
 ([Location].[Geo].[Europe].[Germany].[Berlin],
  [Salesmen].[People].[John],
  [Time].[Year].[2016].[Q3])}
A set of tuples with same dimensionality
MDX statements: dicing
SELECT [Measures].Members ON COLUMNS,
       [Location].[Geo].Members ON ROWS
FROM [Sales]
MDX statements: slicing
SELECT [Measures].Members ON COLUMNS,
       [Location].[Geo].Members ON ROWS
FROM [Sales]
WHERE [Products].[Line].[Laptops].[MBP]
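For comparison, on a relational (ROLAP) store roughly the same slice could be expressed in SQL. The sketch below runs it through Python's sqlite3 module. The flat `sales` table, its columns and its rows are assumptions, and the mapping to the MDX clauses is approximate rather than exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (geo TEXT, product TEXT, revenue INTEGER, licenses INTEGER);
    INSERT INTO sales VALUES
        ('Europe', 'MBP',   3000, 2),
        ('Europe', 'MBP',   1500, 1),
        ('Asia',   'MBP',   2000, 1),
        ('Europe', 'Other',  900, 3);
""")

rows = conn.execute("""
    SELECT geo, SUM(revenue), SUM(licenses)   -- measures ON COLUMNS
    FROM sales                                -- FROM [Sales]
    WHERE product = 'MBP'                     -- the slicer in WHERE
    GROUP BY geo                              -- members of [Geo] ON ROWS
    ORDER BY geo
""").fetchall()
```

The result is `[('Asia', 2000, 1), ('Europe', 4500, 3)]`: one row per region, one column per measure, restricted to the sliced product.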
Syntax
XBRL Architecture
Instance (.xml)
Schema (.xsd)
Linkbase (.xml)
Discoverable Taxonomy Set
Technologies
XML
XML Schema
XML Link
XML Names
Fact
<us-gaap:Assets
    contextRef="FI2012Q4"
    decimals="-6"
    id="Fact-600212FD4D06E63B4F8F6874C6E5BE74"
    unitRef="usd">86174000000</us-gaap:Assets>
Dimension  Value
What?      Assets
Who?       Coca Cola
When?      Dec 31, 2011
Of what?   USD
Context
<xbrli:context id="FI2011Q4">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0000021344</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:instant>2011-12-31</xbrli:instant>
  </xbrli:period>
</xbrli:context>
[Calendar: December 2011]
Unit
<xbrli:unit id="usd">
  <xbrli:measure>iso4217:USD</xbrli:measure>
</xbrli:unit>
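The fact, context and unit fit together by reference: the fact names a context and a unit by id, and those supply the "who", "when" and "of what" dimensions. The sketch below resolves these references with Python's xml.etree.ElementTree. Two things are assumptions: the us-gaap namespace URI, and the fact's `contextRef`, which is set to the id of the context shown above so that the lookup succeeds.

```python
import xml.etree.ElementTree as ET

NS = {
    "xbrli": "http://www.xbrl.org/2003/instance",
    "us-gaap": "http://fasb.org/us-gaap/2013-01-31",  # assumed namespace URI
}

# Minimal instance assembled from the slide fragments (contextRef is an assumption).
DOC = """
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:us-gaap="http://fasb.org/us-gaap/2013-01-31">
  <xbrli:context id="FI2011Q4">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000021344</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2011-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <us-gaap:Assets contextRef="FI2011Q4" decimals="-6" unitRef="usd">86174000000</us-gaap:Assets>
</xbrli:xbrl>
"""

def resolve_fact(root, fact):
    """Follow contextRef and unitRef to recover the fact's dimensions."""
    context_id = fact.get("contextRef")
    unit_id = fact.get("unitRef")
    context = root.find(f"xbrli:context[@id='{context_id}']", NS)
    unit = root.find(f"xbrli:unit[@id='{unit_id}']", NS)
    return {
        "what": fact.tag.split("}")[1],  # local name of the concept
        "who": context.find("xbrli:entity/xbrli:identifier", NS).text.strip(),
        "when": context.find("xbrli:period/xbrli:instant", NS).text.strip(),
        "of_what": unit.find("xbrli:measure", NS).text.strip(),
        "value": int(fact.text),
    }

root = ET.fromstring(DOC)
resolved = resolve_fact(root, root.find("us-gaap:Assets", NS))
```

The "who" comes back as the SEC CIK identifier rather than a company name; mapping it to a name would need an external lookup that is not shown here.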
Concept (XML Schema)
<xs:element
    id='us-gaap_Assets'
    name='Assets'
    nillable='true'
    substitutionGroup='xbrli:item'
    type='xbrli:monetaryItemType'
    xbrli:balance='debit'
    xbrli:periodType='instant' />
Graphs
DAGs
Trees
Node: locator
<loc
    xlink:href="http://xbrl.fasb.org/us-gaap/2013/elts/us-gaap-2013-01-31.xsd#us-gaap_Assets"
    xlink:label="loc_us-gaap_Assets_102D7A4D204ED45AC0DEDA6BBC78F386"
    xlink:type="locator" />
Node: resource
<link:label
    id="lab_ko_NetChangeInOperatingAssetsAndLiabilitiesDisclosureAbstract_A6469A522E35CBF355816876394722EE_label_en-US"
    xlink:label="lab_ko_NetChangeInOperatingAssetsAndLiabilitiesDisclosureAbstract_A6469A522E35CBF355816876394722EE"
    xlink:role="http://www.xbrl.org/2003/role/label"
    xlink:type="resource"
    xml:lang="en-US">NET CHANGE IN OPERATING ASSETS AND LIABILITIES DISCLOSURE [Abstract]</link:label>
Edge
<presentationArc
    order="10"
    preferredLabel="http://www.xbrl.org/2003/role/totalLabel"
    xlink:arcrole="http://www.xbrl.org/2003/arcrole/parent-child"
    xlink:from="loc_us-gaap_AssetsAbstract_2F55ECB2BF7C1A62009CDA6BBC757094"
    xlink:to="loc_us-gaap_Assets_102D7A4D204ED45AC0DEDA6BBC78F386"
    xlink:type="arc" />
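Arcs are the edges of the graph: each one connects a "from" label to a "to" label, and the `order` attribute sorts siblings. A minimal sketch of turning arcs into a parent-to-children map follows. The shortened labels, the `loc_AssetsCurrent` node and the `link:` prefix on the arc element are assumptions made to keep the example small and namespace-complete.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XLINK = "{http://www.w3.org/1999/xlink}"
LINK = "{http://www.xbrl.org/2003/linkbase}"

# Shortened, hypothetical labels stand in for the long generated ones.
LINKBASE = """
<link:presentationLink xmlns:link="http://www.xbrl.org/2003/linkbase"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:presentationArc order="10" xlink:type="arc"
      xlink:from="loc_AssetsAbstract" xlink:to="loc_Assets"/>
  <link:presentationArc order="5" xlink:type="arc"
      xlink:from="loc_AssetsAbstract" xlink:to="loc_AssetsCurrent"/>
</link:presentationLink>
"""

def children_by_parent(xml_text):
    """Map each 'from' label to its 'to' labels, sorted by the arc's order."""
    edges = defaultdict(list)
    for arc in ET.fromstring(xml_text).iter(LINK + "presentationArc"):
        order = float(arc.get("order"))
        edges[arc.get(XLINK + "from")].append((order, arc.get(XLINK + "to")))
    return {parent: [to for _, to in sorted(kids)] for parent, kids in edges.items()}

tree = children_by_parent(LINKBASE)
```

Although the arcs appear in the opposite order in the file, `tree` lists `loc_AssetsCurrent` before `loc_Assets` because siblings are sorted by `order`.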
Summary
Architecture
ERP, CRM, OLTP, Files -> ETL -> Analyze, Report, Mine