Ghislain Fourny Big Data - Systems Group · 2016-12-12 · Big Data 13. Data Warehousing...


Ghislain Fourny

Big Data 13. Data Warehousing

The road to analytics

Another history of data management (T. Hofmann)

Age of Transactions: 1970s – 2000s
Age of Business Intelligence: 1995 –
Age of Big Data: 2000s –

Paradigms

OLTP vs. OLAP

OnLine Transaction Processing (OLTP)

Consistent and reliable record-keeping

Transactions and results on small portions of data

Lots of transactions on small portions of data

Normalized data

OnLine Analytical Processing (OLAP)

Data-based decision support

OLAP is Big

Possibly many joins
Large portions of the data
Few long, heavy queries

OLAP Examples

Web analytics
Sales analytics
Management support
Statistical analysis (census)
Scientific databases (e.g., bio-informatics)

OLTP vs. OLAP

OLTP: detailed individual records
OLAP: historical, summarized, consolidated data

OLTP: lots of writes
OLAP: lots of reads

OLTP: small sets of records
OLAP: analysis over big chunks

OLTP: fully interactive (< 1 s)
OLAP: slow interactive

OLTP: consistency
OLAP: redundancy, redundancy, redundancy

OLAP

A data warehouse... is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making process

Subject-oriented

customers
products
sales
events

Integrated

Time-variant

Time in data warehouses is paramount (not so in OLTP systems)

Timeline: 2016, Y-1, Y-2, Y-3, Y-4, Y-5, Y-6, Y-7, Y-8, Y-9

Often past 5-10 years

Non-volatile

Load. Access. Period.

No updates

Architecture

Sources (ERP, CRM, OLTP, files) → ETL → Analyze, Report, Mine

OLAP: Redundancy

Materialized views (denormalized)

1st Normal Form (tabular) – The Key

2nd Normal Form (not joined) – The Whole Key

3rd Normal Form – Nothing But The Key
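
The contrast between normalized tables and a denormalized, materialized copy can be sketched as follows. This is only an illustration under assumptions: the table and column names are hypothetical, and because SQLite has no materialized views, a table built from a join stands in for one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style) tables: each piece of information is stored once.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         amount INTEGER);
    INSERT INTO customers VALUES (1, 'Peter', 'Germany'), (2, 'Mary', 'Switzerland');
    INSERT INTO orders VALUES (10, 1, 1000), (11, 2, 1500), (12, 2, 500);

    -- Stand-in for a materialized view: the join result is stored,
    -- so customer columns are repeated (redundantly) in every order row.
    CREATE TABLE orders_denormalized AS
        SELECT o.id AS order_id, c.name, c.country, o.amount
        FROM orders o JOIN customers c ON c.id = o.customer_id;
""")

# Analytical queries can now read one wide table without joining.
rows = conn.execute(
    "SELECT country, SUM(amount) FROM orders_denormalized "
    "GROUP BY country ORDER BY country"
).fetchall()
print(rows)
```

The price of the redundancy is that the stored copy must be rebuilt or refreshed when the source tables change.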

Why materialize?

Operational data sources are too heterogeneous

OLAP: Special-purpose indices

OLAP: Derived data

Querying OLAP

Slow interactive (1 - 10s hours)
vs.
Continuous monitoring/tracking

Summary of differences

              OLTP                          OLAP
Source        Original (operational)        Derived (consolidated)
Purpose       Business tasks                Decision support
Interface     Snapshot                      Multidimensional views
Writing       Short and fast, by end user   Periodic refreshes, by batch jobs
Queries       Simple, small results         Complex and aggregating
Design        Many normalized tables        Few denormalized cubes
Precision     ACID                          Sampling, confidence intervals
Freshness     Serializability               Reproducibility
Speed         Very fast                     Often slow
Optimization  Inter-query                   Intra-query
Space         Small, archiving old data     Large, less space efficient
Backup        Very important                Re-ETL

Data Model

Data Cubes

Data is stored in multidimensional hypercubes

Cube dimensions: Year, Country, Product

Fact

The cell at (2016, CH, Server)

Dimensions

Where? What? Who? Which currency? When? etc.

Fact table

Where?        When?   Who?    Value
Germany       2016    Peter    1,000$
Germany       2015    Mary    15,000$
Switzerland   2016    Mary     1,500$
Switzerland   2015    Peter    3,000$
Australia     2015    Peter    6,000$
China         2015    Mary     1,000$

Aggregation

Aggregating the fact table over the Where? dimension sums the values of all countries for each (When?, Who?) pair:

When?   Who?    Value
2016    Peter    1,000$
2015    Mary    16,000$
2016    Mary     1,500$
2015    Peter    9,000$
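
The aggregation above can be reproduced with a few lines of Python. This is a minimal sketch, not how an OLAP engine does it: the fact rows are copied from the slides, while the function name `aggregate_over_where` is made up for the illustration.

```python
from collections import defaultdict

# Fact table from the slides: (where, when, who, value in USD).
facts = [
    ("Germany", 2016, "Peter", 1000),
    ("Germany", 2015, "Mary", 15000),
    ("Switzerland", 2016, "Mary", 1500),
    ("Switzerland", 2015, "Peter", 3000),
    ("Australia", 2015, "Peter", 6000),
    ("China", 2015, "Mary", 1000),
]


def aggregate_over_where(rows):
    """Drop the 'where' dimension and sum values per (when, who) pair."""
    totals = defaultdict(int)
    for _where, when, who, value in rows:
        totals[(when, who)] += value
    return dict(totals)


totals = aggregate_over_where(facts)
for (when, who), value in totals.items():
    print(when, who, f"{value:,}$")
```

For example, the two 2015 rows for Mary (Germany and China) collapse into a single row with 16,000$.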

Slicing

Slicers and Dicers

Usually between 1 and 3 dicers, often 2

Slicers: Servers, World, USD

Dicers: years (columns) and salespeople (rows)

        2014          2015          2016
Peter   1,000,000$    1,500,000$    1,400,000$
Mary    2,000,000$    2,300,000$    2,200,000$
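
One way to picture slicers and dicers is as a filter followed by a pivot. The sketch below assumes a flat list of cube cells; the tuple layout, the extra "Laptops" row, and the function name `slice_and_dice` are hypothetical, and only the Servers/World/USD values come from the table above.

```python
# Hypothetical cells: (product, region, currency, year, person, value).
cells = [
    ("Servers", "World", "USD", 2014, "Peter", 1_000_000),
    ("Servers", "World", "USD", 2015, "Peter", 1_500_000),
    ("Servers", "World", "USD", 2016, "Peter", 1_400_000),
    ("Servers", "World", "USD", 2014, "Mary", 2_000_000),
    ("Servers", "World", "USD", 2015, "Mary", 2_300_000),
    ("Servers", "World", "USD", 2016, "Mary", 2_200_000),
    ("Laptops", "World", "USD", 2016, "Mary", 999),  # excluded by the slicers
]


def slice_and_dice(rows, product, region, currency):
    """Slicers fix one member per dimension; the dicers (person, year) span the grid."""
    grid = {}
    for prod, reg, cur, year, person, value in rows:
        if (prod, reg, cur) == (product, region, currency):
            grid.setdefault(person, {})[year] = value
    return grid


grid = slice_and_dice(cells, "Servers", "World", "USD")
for person, by_year in grid.items():
    print(person, [f"{by_year[y]:,}$" for y in sorted(by_year)])
```

Each slicer removes one dimension from view, and the remaining dicers become the rows and columns of the report.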

Products: the big three

Essbase

Cognos

Analysis Services

ETLing

OLAP: Derived data

ETL: Extract, Transform, Load

Extract

Triggers
Gateways
Incremental updates
Log extraction

Transform

Derivation
Value transformation (e.g., Herr → Mister)
Cleaning
Filter, split, merge, join
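
A transform step of this kind might look as follows. This is a sketch under assumptions: the record layout (`title`, `name`), the derived `label` field, and the function name `transform` are hypothetical; only the Herr → Mister mapping comes from the slide.

```python
# Value transformation table; only this entry appears on the slide.
TITLE_MAP = {"Herr": "Mister"}


def transform(records):
    """Clean, translate, and derive fields for a list of dict records."""
    result = []
    for rec in records:
        name = rec.get("name", "").strip()      # cleaning: trim whitespace
        if not name:                            # filter: drop records without a name
            continue
        raw_title = rec.get("title", "")
        title = TITLE_MAP.get(raw_title, raw_title)  # value transformation
        result.append({
            "title": title,
            "name": name,
            "label": f"{title} {name}".strip(),  # derivation: a computed field
        })
    return result


print(transform([{"title": "Herr", "name": " Peter "}, {"title": "Herr", "name": "  "}]))
```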

Load

Integrity constraints
Sorting
Build indices
Partition

Considerations

When?
Granularity
Infrastructure

Implementation

Two flavors of OLAP

ROLAP
MOLAP

Fact table (ROLAP)

Dim1 Dim2 Dim3 Dim4 Dim5 Value

Star Schema

Dim1 Dim2 Dim3 Dim4 Dim5 Value

Snow-flake schema

Dim1 Dim2 Dim3 Dim4 Dim5 Value

Normalize more
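
A star schema can be sketched in SQLite as a central fact table whose columns reference small dimension tables. The table and column names below are hypothetical; the rows reuse part of the fact table shown earlier in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per dimension value.
    CREATE TABLE dim_country (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_person  (id INTEGER PRIMARY KEY, name TEXT);

    -- Fact table: one foreign key per dimension plus the measured value.
    CREATE TABLE fact_sales (
        country_id INTEGER REFERENCES dim_country(id),
        year       INTEGER,
        person_id  INTEGER REFERENCES dim_person(id),
        value      INTEGER
    );

    INSERT INTO dim_country VALUES (1, 'Germany'), (2, 'Switzerland');
    INSERT INTO dim_person  VALUES (1, 'Peter'), (2, 'Mary');
    INSERT INTO fact_sales VALUES
        (1, 2016, 1, 1000), (1, 2015, 2, 15000),
        (2, 2016, 2, 1500), (2, 2015, 1, 3000);
""")

# A typical ROLAP query: join the fact table to a dimension and aggregate.
star_rows = conn.execute("""
    SELECT c.name, SUM(f.value)
    FROM fact_sales f JOIN dim_country c ON c.id = f.country_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(star_rows)
```

A snowflake schema would go one step further and normalize the dimension tables themselves, for example by moving a continent column of `dim_country` into its own table.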

Querying

Querying cubes

Tables: SQL
Cubes: MDX

MDX stands for Multi-Dimensional eXpressions

Measures

Amount of licenses
Revenues
Taxes paid
...

Dimensions

Quarter
Salesperson
Product
Country

In short...

A cube is a list of dimensions indexing a list of measures
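
Read literally, this definition suggests a mapping from dimension coordinates to measure values. The following is a rough sketch of that reading only; the class names, member values, and numbers are invented for the illustration.

```python
from typing import Dict, NamedTuple


class Coordinates(NamedTuple):
    """One value per dimension (the dimensions listed on the slide)."""
    quarter: str
    salesperson: str
    product: str
    country: str


class Measures(NamedTuple):
    """The measures stored in each cell (values below are made up)."""
    licenses: int
    revenues: float
    taxes_paid: float


cube: Dict[Coordinates, Measures] = {
    Coordinates("2016-Q4", "John", "Server", "CH"): Measures(10, 1000.0, 80.0),
    Coordinates("2016-Q4", "Mary", "Server", "CH"): Measures(5, 500.0, 40.0),
}

# Looking up a cell: the dimensions index the measures.
cell = cube[Coordinates("2016-Q4", "John", "Server", "CH")]
print(cell.revenues)
```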

Hierarchies

Dimension values are organized in hierarchies.

[Location]
  [Geo]: i.e., slice and aggregate by geographic region, etc.
  [Economy]: i.e., slice and aggregate by economic partnership, etc.

Members

Members correspond to levels in a hierarchy.

[Geo]
  [Europe]
    [Switzerland]
      [ZH]
      [BE]
    [Germany]
    ...
  [Asia]
    [China]
    [India]
    ...
  [America]
    [Canada]
    [USA]
    [Brazil]
    ...
  [Africa]
  [Oceania]

Identifying a member

[Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich]
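
A member path of this form can be read as a walk down the hierarchy tree. The sketch below assumes the hierarchy is held as nested dictionaries; `HIERARCHY`, `parse_member`, and `resolve` are hypothetical names, and only a fragment of the tree from the slides is included.

```python
import re

# Fragment of the hierarchy from the slides, as nested dictionaries.
HIERARCHY = {
    "Location": {
        "Geo": {
            "Europe": {
                "Switzerland": {"ZH": {"Zurich": {}}, "BE": {}},
                "Germany": {},
            },
            "Asia": {"China": {}, "India": {}},
        }
    }
}


def parse_member(path):
    """Split '[A].[B].[C]' into ['A', 'B', 'C']."""
    return re.findall(r"\[([^\]]+)\]", path)


def resolve(path, tree=HIERARCHY):
    """Follow the path from the root; raises KeyError for unknown members."""
    node = tree
    for name in parse_member(path):
        node = node[name]
    return node


print(parse_member("[Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich]"))
print(sorted(resolve("[Location].[Geo].[Asia]")))
```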

Tuples

A list of members:

([Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich],
 [Salesmen].[People].[John],
 [Time].[Year].[2016].[Q4])

Associated with a dimensionality (list of hierarchies):

([Location].[Geo],
 [Salesmen].[People],
 [Time].[Year])

Sets

A set of tuples with the same dimensionality:

{([Location].[Geo].[Europe].[Switzerland].[ZH].[Zurich],
  [Salesmen].[People].[John],
  [Time].[Year].[2016].[Q4]),
 ([Location].[Geo].[Europe].[Switzerland].[BE].[Bärn],
  [Salesmen].[People].[Mary],
  [Time].[Year].[2016].[Q4]),
 ([Location].[Geo].[Europe].[Germany].[Berlin],
  [Salesmen].[People].[John],
  [Time].[Year].[2016].[Q3])}

MDX statements: dicing

SELECT
  [Measures].Members ON COLUMNS,
  [Location].[Geo].Members ON ROWS
FROM [Sales]

MDX statements: slicing

SELECT
  [Measures].Members ON COLUMNS,
  [Location].[Geo].Members ON ROWS
FROM [Sales]
WHERE [Products].[Line].[Laptops].[MBP]

Syntax

XBRL Architecture

Instance (.xml)
Schema (.xsd)
Linkbase (.xml)
Discoverable Taxonomy Set

Technologies

XML
XML Schema
XML Link
XML Names

Fact

<us-gaap:Assets
    contextRef="FI2012Q4"
    decimals="-6"
    id="Fact-600212FD4D06E63B4F8F6874C6E5BE74"
    unitRef="usd">86174000000</us-gaap:Assets>

Dimension   Value
What?       Assets
Who?        Coca Cola
When?       Dec 31, 2011
Of what?    USD

Context

<xbrli:context id="FI2011Q4">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0000021344</xbrli:identifier>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:instant>2011-12-31</xbrli:instant>
  </xbrli:period>
</xbrli:context>

[Calendar of December 2011]

Unit

<xbrli:unit id="usd">
  <xbrli:measure>iso4217:USD</xbrli:measure>
</xbrli:unit>
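
How a fact picks up its dimensions through `contextRef` and `unitRef` can be sketched with Python's standard XML parser. The instance below is a stripped-down assumption, not a real filing: the `us-gaap` namespace URI is a placeholder, and the fact is pointed at the `FI2011Q4` context so that the example is self-consistent.

```python
import xml.etree.ElementTree as ET

NS = {
    "xbrli": "http://www.xbrl.org/2003/instance",
    "us-gaap": "http://example.com/us-gaap",  # placeholder namespace URI
}

INSTANCE = """
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:us-gaap="http://example.com/us-gaap">
  <xbrli:context id="FI2011Q4">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000021344</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2011-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <us-gaap:Assets contextRef="FI2011Q4" decimals="-6" unitRef="usd">86174000000</us-gaap:Assets>
</xbrli:xbrl>
"""

root = ET.fromstring(INSTANCE)
fact = root.find("us-gaap:Assets", NS)

# Follow the references from the fact to its context and unit.
context_id = fact.get("contextRef")
unit_id = fact.get("unitRef")
context = root.find(f"xbrli:context[@id='{context_id}']", NS)
unit = root.find(f"xbrli:unit[@id='{unit_id}']", NS)

resolved = {
    "what": "Assets",
    "who": context.find("xbrli:entity/xbrli:identifier", NS).text.strip(),
    "when": context.find("xbrli:period/xbrli:instant", NS).text.strip(),
    "unit": unit.find("xbrli:measure", NS).text.strip(),
    "value": int(fact.text),
}
print(resolved)
```

The resolved dictionary corresponds to the What? / Who? / When? / Of what? view of a fact, with the entity given by its SEC CIK identifier rather than its name.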

Concept (XML Schema)

<xs:element
    id='us-gaap_Assets'
    name='Assets'
    nillable='true'
    substitutionGroup='xbrli:item'
    type='xbrli:monetaryItemType'
    xbrli:balance='debit'
    xbrli:periodType='instant' />

Graphs

DAGs

Trees

Node: locator

<loc
    xlink:href="http://xbrl.fasb.org/us-gaap/2013/elts/us-gaap-2013-01-31.xsd#us-gaap_Assets"
    xlink:label="loc_us-gaap_Assets_102D7A4D204ED45AC0DEDA6BBC78F386"
    xlink:type="locator" />

Node: resource

<link:label
    id="lab_ko_NetChangeInOperatingAssetsAndLiabilitiesDisclosureAbstract_A6469A522E35CBF355816876394722EE_label_en-US"
    xlink:label="lab_ko_NetChangeInOperatingAssetsAndLiabilitiesDisclosureAbstract_A6469A522E35CBF355816876394722EE"
    xlink:role="http://www.xbrl.org/2003/role/label"
    xlink:type="resource"
    xml:lang="en-US">NET CHANGE IN OPERATING ASSETS AND LIABILITIES DISCLOSURE [Abstract]</link:label>

Edge

<presentationArc
    order="10"
    preferredLabel="http://www.xbrl.org/2003/role/totalLabel"
    xlink:arcrole="http://www.xbrl.org/2003/arcrole/parent-child"
    xlink:from="loc_us-gaap_AssetsAbstract_2F55ECB2BF7C1A62009CDA6BBC757094"
    xlink:to="loc_us-gaap_Assets_102D7A4D204ED45AC0DEDA6BBC78F386"
    xlink:type="arc" />
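
Parent-child arcs like the one above can be assembled into a tree by grouping children under their parent and sorting them by `order`. The sketch assumes the arcs have already been extracted into tuples; the shortened labels, the extra arcs, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical arcs as (from_label, to_label, order); labels are shortened.
arcs = [
    ("AssetsAbstract", "Assets", 10),
    ("AssetsAbstract", "AssetsCurrent", 5),
    ("BalanceSheet", "AssetsAbstract", 1),
]


def build_children(arc_list):
    """Map each parent to its children, sorted by the arc's order attribute."""
    grouped = defaultdict(list)
    for parent, child, order in arc_list:
        grouped[parent].append((order, child))
    return {p: [c for _, c in sorted(kids)] for p, kids in grouped.items()}


def find_roots(arc_list):
    """Nodes that appear as a parent but never as a child."""
    parents = {a[0] for a in arc_list}
    children = {a[1] for a in arc_list}
    return sorted(parents - children)


tree = build_children(arcs)
print(find_roots(arcs), tree)
```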

Summary

Architecture

Sources (ERP, CRM, OLTP, files) → ETL → Analyze, Report, Mine