
A World of Data

“Thingsternet”

Compete by asking bigger questions

Living online

Big Data

“Gizillions” of mobile transactions

[Slide graphics: $ … $$$ … ??? … SLA – rising costs, open questions, and SLA pressure]

Yaaaay – Hadoop to Save the Daaaay!!

• But it’s not always easy to tame an elephant…

[Architecture diagram: customers → web client → web shop backend → web shop database (~100GB of product and customer transaction data)]

Introducing “DataCo”

“We don’t really have a big data problem…”

> 6 months?

[Architecture diagram: customers → web client → web shop backend → web shop database, now with additional sources: mobile app data, web app click stream data, IT/Ops and InfoSec data, product and customer transaction data]

Introducing “DataCo”

Active Archive / Self-Serve Ad-hoc BI

• Top sold products last 6, 12, and 18 months?

[Stack diagram: SQL via Hive and Impala over HDFS]

Using Sqoop to Ingest Data from MySQL

• Sqoop is a bi-directional structured data ingest tool

• Simple UI in Hue, more commonly used from the shell

$ sqoop import-all-tables -m 12 --connect \
    jdbc:mysql://my.sql.host:3306/retail_db --username=dataco_dba \
    --password=yow!2014 --compression-codec=snappy --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse

$ sqoop import -m 12 --connect jdbc:mysql://my.sql.host:3306/retail_db \
    --username=dataco_dba --password=yow!2014 \
    --table my_cool_table --hive-import --as-parquetfile
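For scripted ingestion, the flags above can be assembled programmatically before being handed to a shell. A minimal sketch (the helper function is hypothetical, and host/database/credentials are the demo placeholders from the slide, not real endpoints):

```python
# Hypothetical helper that assembles the Sqoop import-all-tables command
# shown above; it only builds the argv list, it does not run it.
def sqoop_import_cmd(host, db, user, password, mappers=12):
    """Build the argv for a parallel Sqoop import of all tables as Avro."""
    return [
        "sqoop", "import-all-tables",
        "-m", str(mappers),                      # number of parallel map tasks
        "--connect", f"jdbc:mysql://{host}:3306/{db}",
        f"--username={user}", f"--password={password}",
        "--compression-codec=snappy",            # compress the Avro blocks
        "--as-avrodatafile",                     # one Avro datafile set per table
        "--warehouse-dir=/user/hive/warehouse",  # one subdirectory per table
    ]

cmd = sqoop_import_cmd("my.sql.host", "retail_db", "dataco_dba", "yow!2014")
```

The list could then be passed to `subprocess.run(cmd)` on a host with the Sqoop client installed.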

Create Tables in Hive

• Hive is a batch query tool, but also the keeper of table structures

• Remember: table structure (metadata) is stored _separately_ from the data

hive> CREATE EXTERNAL TABLE products
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    >   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    >   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION 'hdfs:///user/hive/warehouse/products'
    > TBLPROPERTIES
    >   ('avro.schema.url'='hdfs://namenode_dataco/user/examples/products.avsc');
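The `avro.schema.url` property points at an Avro schema file on HDFS; Hive reads column names and types from it at query time. A minimal sketch of what such a `products.avsc` might contain (the field names here are hypothetical – the real file generated by Sqoop reflects the actual MySQL column types):

```python
import json

# Hypothetical Avro record schema for the products table; the actual
# products.avsc produced by the Sqoop import may differ.
products_schema = {
    "type": "record",
    "name": "products",
    "fields": [
        {"name": "product_id",    "type": "int"},
        {"name": "product_name",  "type": "string"},
        {"name": "product_price", "type": "double"},
    ],
}

# Serialize to the JSON text that would be stored as products.avsc on HDFS.
avsc = json.dumps(products_schema, indent=2)
```

Because the schema lives beside the data rather than inside Hive, other tools (Impala, Spark) can read the same files with the same structure.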

Use Impala via Hue to Query


Correlate Multi-type Data Sets

• Top viewed products last 6, 12, and 18 months?

[Stack diagram: Flume ingest into HDFS; SQL via Hive and Impala]

Ingest Data Using Flume

• Event-based data ingest framework

• Flexible multi-level (mini-transformation) pipeline

FLUME AGENT: source → [optional logic] → sink

• Flume source: continuously generated events, e.g. syslog, tweets

• Flume sink: another Flume agent, HDFS, HBase, Solr, or other destination
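The source → channel → sink wiring above is expressed in a properties file. A minimal sketch of such an agent config (agent, source, and sink names are hypothetical, as are the port and HDFS path):

```
# Minimal Flume agent sketch: a syslog TCP source feeding an HDFS sink
# through an in-memory channel. Names and values are illustrative only.
agent1.sources  = syslogSrc
agent1.channels = memoryChannel
agent1.sinks    = hdfsSink

agent1.sources.syslogSrc.type = syslogtcp
agent1.sources.syslogSrc.port = 5140
agent1.sources.syslogSrc.channels = memoryChannel

agent1.channels.memoryChannel.type = memory
agent1.channels.memoryChannel.capacity = 10000

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /user/flume/logs/%Y-%m-%d
agent1.sinks.hdfsSink.channel = memoryChannel
```

The channel decouples ingest rate from sink write rate, which is what makes the pipeline "multi-level" and lets agents be chained.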

Create Hive Tables over Log Data

• New use case, new data

• Create new tables over semi-structured log data

CREATE EXTERNAL TABLE intermediate_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) - - \\[([^\\]]*)\\] \"([^\ ]*) ([^\ ]*) ([^\ ]*)\" (\\d*) (\\d*) \"([^\"]*)\" \"([^\"]*)\"",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
LOCATION '/user/hive/warehouse/original_access_logs';

CREATE EXTERNAL TABLE tokenized_access_logs (
  ip STRING, date STRING, method STRING, url STRING, http_version STRING,
  code1 STRING, code2 STRING, dash STRING, user_agent STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/tokenized_access_logs';

ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;

INSERT OVERWRITE TABLE tokenized_access_logs
SELECT * FROM intermediate_access_logs;

exit;
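The `input.regex` above is the standard combined access-log pattern with Hive's double-escaping applied. Stripped of that escaping, it can be sanity-checked against a sample log line (the line below is made up) with Python's `re` module:

```python
import re

# The RegexSerDe pattern from the DDL above, minus Hive's string escaping.
LOG_PATTERN = re.compile(
    r'([^ ]*) - - \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ ]*)" '
    r'(\d*) (\d*) "([^"]*)" "([^"]*)"'
)

# A made-up sample line in Apache combined log format.
line = ('127.0.0.1 - - [01/Jan/2014:10:00:00 -0800] '
        '"GET /product/1234 HTTP/1.1" 200 512 "-" "Mozilla/5.0"')

m = LOG_PATTERN.match(line)
ip, date, method, url, http_version = m.group(1, 2, 3, 4, 5)
```

Groups 1-9 map one-to-one onto the nine STRING columns of `intermediate_access_logs`, which is exactly how the RegexSerDe assigns fields.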

Use Impala and Hue to Query


Multi-Use-Case Data Hub

• Why are sales dropping over the last 3 days?

[Stack diagram: Flume ingest into HDFS; search queries served by Solr]

Create your Index

• Create an empty Solr index configuration directory

• Edit the Solr Schema file to have the fields you want to search over

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --generate live_logs_dir

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false" />

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />

<field name="ip" type="text_general" indexed="true" stored="true"/>

<field name="request_date" type="date" indexed="true" stored="true"/>

Create your Index cont.

• Upload your configuration for a collection to ZooKeeper

• Tell Solr to start serving up a collection and start indexing data for it

$ solrctl --zk <ALL YOUR ZK IPs>/solr instancedir --create live_logs ./live_logs_dir

$ solrctl --zk <ALL YOUR ZK IPs>/solr collection --create live_logs -s 4

Flume and Morphline Pipeline

Flume with Morphlines Configured

• Configure Flume to use your Morphlines and post parsed data to Solr

...

# Describe solrSink
agent1.sinks.solrSink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solrSink.channel = memoryChannel
agent1.sinks.solrSink.batchSize = 1000
agent1.sinks.solrSink.batchDurationMillis = 1000
agent1.sinks.solrSink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solrSink.morphlineId = morphline
agent1.sinks.solrSink.threadCount = 1

...
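The `morphlineFile` referenced by the sink holds the parsing pipeline itself. A minimal sketch of what such a `morphline.conf` might look like (the collection name matches the slides, but the ZooKeeper host and dictionary path are placeholders, and the actual demo file may use different commands):

```
# Hypothetical morphline sketch: read each Flume event as one line,
# parse it with a grok pattern, and load the fields into Solr.
SOLR_LOCATOR : {
  collection : live_logs
  zkHost : "zk1.example.com:2181/solr"   # placeholder ZK ensemble
}

morphlines : [
  {
    id : morphline
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      { readLine { charset : UTF-8 } }
      {
        grok {
          dictionaryFiles : ["/opt/examples/flume/grok-dictionaries"]  # placeholder path
          expressions : { message : """%{COMBINEDAPACHELOG}""" }
        }
      }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

The `morphlineId` in the Flume sink config selects which entry in the `morphlines` array to run.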

Dynamic Search UI in Hue

Shared Storage!!

Challenges
• Only 3 days’ worth of monitoring data capacity
• No ability to correlate large research data sets
• No ability to study environmental impact ad hoc

Solution
• 50GB of monitor data per week • 2TB capacity
• Sqoop, Solr, Impala, HDFS

Benefits
• Ad-hoc and faster insight
• Reduced asthma-related ICU visits
• Total license fees < 3 processor licenses for EDW

How Do We Improve Healthcare?

How Do We Feed The World?

Global Warming Changes Conditions

How do we improve quality and resistance of crops and seeds in a variety of global and rapidly changing environments?

Challenges
• Time to market for each new product: 5-10 years
• 1,000+ scientists working in silos
• Data processing bottlenecks slow development

Solution
• PB-scale
• HBase, HDFS, Solr, MapReduce, Sqoop, Impala, …

Benefits
• Streamlined processes
• Time to results reduced from years to months!!!

How Do We Feed The World?

Challenges
• 100-200 B events/month
• Real-time multi-type event correlation is complex
• No way to do ad-hoc game analytics

Solution
• ~20 nodes • 256GB RAM servers
• Flume, Solr, Impala, HDFS

Benefits
• Ad-hoc insight on feature trends
• Significant TTR reduction
• ROI realized in the 1st week

Learn More?

• Stop by the Cloudera booth today!

• Play on your own: cloudera.com/live

• Get training: http://cloudera.com/content/cloudera/en/training.html

• Join the Community: cdh-user@cloudera.org

• Connect with me: @EvaAndreasson

Hope You Enjoyed This Talk!

Don’t forget to VOTE!!!

Bonus Track…

My Advice for the Road…

Try Something Simple First…

Decide what to Cook!

Collect All Ingredients

Use the Right Tool for the Right Task

Prepare All Ingredients

Don’t Forget the Importance of Visualization!

Challenges
• Tons of information locked away in medical records & scientific studies
• Different sources & systems can’t “talk” to each other

Solution
• Integration & storage of multi-structured experimental data
• Data access & exploration via Impala, R, HBase, Solr, Hive

Benefits
• Faster, cheaper genome sequencing
• Searchable index of variant call data for biologists to explore

Using Sqoop to Ingest Data from MySQL

• View your imported “tables”

• View all Avro files constituting a table

$ hadoop fs -ls /user/hive/warehouse/

$ hadoop fs -ls /user/hive/warehouse/mytablename/

Hadoop – A New Approach to Data Management

Schema on Read

Distributed Storage

Distributed Processing

Active Archive

Cost-Efficient Offload

Flexible Analytics

Hadoop: Storage & Batch Processing

The Birth of the Data Lake

2006: Core Hadoop
2007: Core Hadoop
2008: + HBase, ZooKeeper, Mahout
2009: + Pig, Hive
2010: + Flume, Avro, Sqoop
2011: + Bigtop, Oozie
2012: + Hue, Impala, Parquet
2013: + Solr, Sentry
2014: + Spark, Kafka

A Rapidly Growing Ecosystem

The Rise of an Enterprise Data Hub

[Diagram build, year by year: applications layered over HDFS]

2005-2007 – Hadoop (HDFS, MapReduce)
2008 – HBase, ZooKeeper, Mahout
2009 – Hive, Pig
2010 – Flume, Sqoop, Avro
2011 – Oozie, Hue
2012 – YARN, Impala, Parquet
2013 – Solr, Sentry
2014 – Spark, Kafka

The Hadoop Ecosystem – Explained!

[Diagram: component roles layered over a distributed file system (scalable storage)]
• Event-based data ingest • Batch processing • Interactive SQL
• Key-value store • Procedure-oriented query • Machine learning
• Process mgmt / workflow mgmt • GUI • Resource management and scheduling
• Free-text search • Real-time processing • Access control • DB connectivity

Common Use Cases

• Threat detection

• Active archive / accessible global knowledge base

• Data accuracy

• Streamlined cross-data type aggregation

• Richer customer profiling / ecommerce experience

• Interactive market segmenting / customer identification

• Expedited data modeling

• ….

The Right Tool For the Right Task

Tool   | Workload    | Use Case                                                 | Result Ordering
Hive   | Batch       | SQL, analytics & joins                                   | Structured
Pig    | Batch       | Procedure-oriented SQL, analytics & joins                | Structured
Impala | Interactive | SQL, analytics & joins                                   | Structured
Solr   | Interactive | Fuzzy, phonetic, polygon, geo-spatial search             | Relevance-based
HBase  | Real time   | Random key lookups over sparsely populated columnar data | Scan-order
Spark  | NRT         | Advanced analytics & ML                                  | Sorted

When to use what?

• Real Time Query (e.g. Impala)
  • I want to do BI reports or interactive analytical aggregations, but not wait hours for the response

• Batch Query (e.g. Pig, Hive)
  • I have nightly batch query jobs as part of a workflow

• Real Time Search (e.g. SolrCloud)
  • I have unstructured data I want to free-text search over
  • My SQL queries are getting more and more complex as they need to contain 15+ “like” conditions

• Real time key lookups (e.g. HBase)
  • I want random access to sparsely populated table-like data
  • I want to compare user profiles or behavior in real time

When to use what?

• Spark
  • I want to implement analytics algorithms over my data, and my data sets fit into memory
  • I have real-time streaming data I want to analyze in real time

• MapReduce
  • I want to do fail-safe, large ETL processing workloads
  • My data does not fit into memory and I want to batch process it with my custom logic – no real-time needs