Welcome to the Age of Data

Post on 15-Jan-2015

description

An introductory presentation on Big Data and Hadoop for bigdata.be, presented 11/Jan/2012 at Accenture (Brussels).

transcript

IIC » TECHNOLOGIEPARK 3 » B-9052 ZWIJNAARDE (GENT) » www.outerthought.org

Welcome to the age of data! BIGDATA.BE

who am i

» Steven Noels
» Founder & VP Product
» Makers of Lily: Interactive Big Data platform
» Open Source / Apache Software Foundation
» co-founder bigdata.be

Houston, we have a problem.

We’re drowning.

Drowning in a Sea of Data.

Mountains of Metadata.

The firehose of UGC (user-generated content).

Still, we can’t make much sense of it.

... and we throw a lot of it away.

We regard DATA as cost.

But data is an opportunity.

Think about it.

advertisements

recommendations

fraud detection

eyeballs

churn

The future is for data nerds.

This is what Big Data is about: new insights, new business.

3 issues for BIG DATA

[Figure, issue 1: volume plotted against time, with curves labeled "moore" and "data". Need: more capacity.]

solution: distributed systems

distributed systems are hard.

Issue 2: database, data warehouse, analytics

data shuffling, data duplication

Issue 3: “Top-performing organizations are twice as likely to apply analytics to activities.”

(MIT Sloan Management Review, Winter 2011)

enter

HBase

what is hadoop?

[Figure: 1 server, with disk, CPU and RAM]

[Figure: many servers. Resource labels: RAM, CPU, disk. Hadoop labels: HBase, Map/Reduce, HDFS.]

map/reduce

» Batch-oriented
» Data locality (code is shipped around)
» Heavy parallelization
» Process management
» Append-only files
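The map/reduce model behind these bullets can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and a thread pool in a single process stands in for a cluster. It shows the map, shuffle and reduce phases and why independent input splits parallelize well; it does not show data locality, since nothing is shipped between machines here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(split):
    """Map phase: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_splits):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: fold the values of one key into a single result."""
    return key, sum(values)


def word_count(splits, workers=4):
    # Splits are mapped independently of each other; that independence
    # is what makes heavy parallelization possible.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, splits))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # Hypothetical input: two splits, as if stored on two different servers.
    splits = [
        ["big data is big", "data is an opportunity"],
        ["we are drowning in data"],
    ]
    print(word_count(splits))
```

In a real cluster the map function would be sent to the servers that already hold each split, and the input files would be read once and never modified in place.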

Hadoop ecosystem

» Hadoop Common
» Subprojects
» Flume/Sqoop: Data collection systems for large distributed systems.
» HBase: A scalable, distributed database that supports structured data storage for large/wide tables.
» HDFS: A distributed file system that provides high throughput access to application data.
» Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
» MapReduce: A software framework for distributed processing of large data sets on compute clusters.
» Pig: A high-level data-flow language and execution framework for parallel computation.
» ZooKeeper: A high-performance coordination service for distributed applications.
» Mahout: machine learning libraries

[Figure: stack diagram]

CDH
» Coordination (ZOOKEEPER)
» Data Integration (FLUME, SQOOP)
» Fast Read/Write Access (HBASE)
» Languages / Compilers (PIG, HIVE)
» Workflow (OOZIE)
» Scheduling (OOZIE)
» Metadata (HIVE)
» UI Framework (HUE)
» SDK (HUE SDK)

» High-level data model / easy API
» Indexes
» Search
» Dev2Dev tutoring, integrated deployment and enterprise support
» Usage metrics, analytics & recommendations

real-time big data architecture

batch layer
1. store master dataset (append-only)
2. compute arbitrary views

serving layer
1. random access to batch views
2. updated by batch layer

speed layer (Storm)
1. compensate for high latency of updates to serving layer
2. fast, incremental algorithms
3. batch layer eventually overrides speed layer
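The interplay of the three layers can be sketched as a toy, single-process page-view counter. Everything here is an assumption made for illustration: the class name `LambdaSketch`, its methods, and the use of in-memory `Counter` objects in place of HDFS, batch views and a stream processor. It also simplifies one point: a real batch run takes time, so events arriving during the run would stay in the speed layer.

```python
from collections import Counter


class LambdaSketch:
    """Toy model of the batch, serving and speed layers for page-view counts."""

    def __init__(self):
        self.master = []             # batch layer: append-only master dataset
        self.batch_view = Counter()  # serving layer: view built by the last batch run
        self.speed_view = Counter()  # speed layer: events since the last batch run

    def append(self, page):
        # New data is only ever appended to the master dataset;
        # the speed layer updates its view incrementally.
        self.master.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        # The batch layer recomputes the view from the full master dataset,
        # which overrides whatever the speed layer had accumulated.
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, page):
        # Random access: merge the (stale) batch view with the recent increments.
        return self.batch_view[page] + self.speed_view[page]


if __name__ == "__main__":
    views = LambdaSketch()
    views.append("home")
    views.append("home")
    views.run_batch()
    views.append("home")
    print(views.query("home"))
```

Queries stay fresh between batch runs because the speed layer covers the gap, while the batch recomputation keeps the long-term result derived from the master dataset alone.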

Hadoop, interactive.

[Figure: Analytics and Interactics]
» batch, static files: 10^18
» interactive data management: 10^15
» (RDBMS): 10^9 to 10^12

My baby: Lily.

[Figure: data management, indexing, search, profile harvesting, smart insights, interactive audience metrics. Sectors: news & media, telecom, finance, commerce.]

The start of Lily.

Thank you!
for your attention
for your questions

» steven.noels@outerthought.com

» @stevenn