Roman Nikitchenko, 04.12.2014
www.vitech.com.ua
Any real big data is just about a DIGITAL LIFE FOOTPRINT
THE SAME IS ABOUT...
NOT ALL THINGS IN OUR LIFE ARE NICE
BIG DATA is not about the data. It is about OUR ABILITY TO HANDLE IT.
YARN
Don't shoot your own foot with BIG GUN!
Some aspects are more special.
Most dangerous things in Big Data
Basics
Couple of specific notes
Beware!
THE MOST SERIOUS BIG DATA FAILURE IS ...
NO DATA
NO DATA
NO MONEY
The biggest mistake in a BIG DATA strategy is to limit the amount of data you collect.
WHERE ARE
YOU?
DATA LAKE
Take as much data about your business processes as you can. The more data you have, the more value you can get from it.
YOU ALWAYS HAVE OPTION
● We have developed our own online storage which lowers maintenance and stores anything.
The most serious errors in Big Data are about operations and infrastructure, not about algorithms or code.
LIVE WITH IT
YOU ALWAYS HAVE OPTION
● We have a special engineering roadmap for big data infrastructure development.
Why Hadoop?
[Figure: an equation graphic, "x MAX + = BIG DATA"]
Use robust solutions
What is HADOOP?
● Hadoop is an open source framework for big data: both distributed storage and processing.
● Hadoop is reliable and fault tolerant without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
Hadoop: don't do it yourself
Option? Our experience is:
● HortonWorks are 'barely open source'. Innovative, but 'running too fast'. Most of their key technologies are not so mature yet. Some people LOVE them.
● Cloudera is stable enough but not stale. Hadoop 2.5 with YARN, HBase 0.98.x, Spark 1.x. A good balance as of late 2014.
● MapR focuses on performance per node, but they are slightly outdated in terms of functionality, and their distribution comes at a cost. For cases where node performance is a high priority.
HBase motivation
Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large. There is an issue with lots of small files.
● Write once, read many times ideology.
● MapReduce is not so flexible, and neither is any database built on top of it.
● How about real time?
Uses commodity hardware...
The meaning of 'commodity' keeps growing
● 64G of RAM is considered a pretty small amount. 128G is a more and more common configuration.
● 2xCPU with 6 cores each is considered commodity.
● 4xHDD is a minimum. SSDs are used more and more often.
Virtualization
NOT SO REAL ELEPHANT
VIRTUALIZATION
CONCERNS
● Possible for key nodes. Not for workers unless you are really big.
● Several nodes on a single physical host: what happens if this host fails?
● Loaded services on a VM: is it meaningful? Double duties?
Virtualization: practical case
● Apache ZooKeeper is a QUORUM-based service.
● If a host with 2 ZK instances fails, everything fails, which breaks tolerance to 1 failure.
● Can you guarantee equal performance for ZK service instances?
● DON'T PUT QUORUM SERVICES IN A VIRTUAL ENVIRONMENT!
HOST
HOST
REAL EXAMPLE
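The quorum arithmetic behind this case can be sketched in a few lines. This is only an illustration, not ZooKeeper code: the host names and both helper functions are hypothetical, and the only fact used is that a majority quorum needs more than half of the ensemble alive.

```python
from collections import Counter
from typing import List


def has_quorum(alive: int, ensemble_size: int) -> bool:
    """A majority quorum needs strictly more than half of the ensemble."""
    return alive > ensemble_size // 2


def survives_any_single_host_failure(placement: List[str]) -> bool:
    """placement[i] is the (hypothetical) physical host running ZK instance i."""
    ensemble_size = len(placement)
    instances_per_host = Counter(placement)
    # Losing a host loses every instance placed on it.
    return all(
        has_quorum(ensemble_size - lost, ensemble_size)
        for lost in instances_per_host.values()
    )


dedicated = ["host-a", "host-b", "host-c"]    # one instance per physical host
virtualized = ["host-a", "host-a", "host-b"]  # two VMs share host-a

print(survives_any_single_host_failure(dedicated))
print(survives_any_single_host_failure(virtualized))
```

With three instances, the ensemble tolerates one instance failure, but one host failure takes out two instances when two VMs share that host.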
YOU ALWAYS HAVE OPTION
● Indeed, there are lots of options with virtualization. The only concern is the ability to use your own brain.
HBase motivation
Need online storage for big data?
LATENCY, SPEED and all Hadoop properties.
NO SECONDARY INDEXES OUT OF THE BOX.
YOU ALWAYS HAVE LOTS OF OPTIONS
● We have built our own search indexing technology.
INDEX ALTERNATIVE: SOLR
● SOLR indexes documents. What is stored in the SOLR index is not what you index. SOLR is NOT A STORAGE, ONLY AN INDEX.
● But it can index ANYTHING. A search result is a document ID.
● INDEX UPDATE: the index update request is analyzed, tokenized, transformed...
● INDEX QUERY: the same happens to queries, and search responses come back.
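A toy inverted index shows the idea: the same analysis step runs on updates and on queries, only tokens are kept, and a search returns document IDs. This is a minimal sketch, not SOLR's API; `analyze`, `ToyIndex` and the row IDs are hypothetical names.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase and split into word tokens; a stand-in for a real analyzer chain."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyIndex:
    """Keeps only token -> document ID postings, never the original text."""

    def __init__(self):
        self.postings = defaultdict(set)

    def update(self, doc_id, text):
        # Simplification: re-indexing a document does not remove its old tokens.
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def query(self, text):
        """Return IDs of documents containing every query token."""
        tokens = analyze(text)
        if not tokens:
            return set()
        result = set(self.postings.get(tokens[0], set()))
        for token in tokens[1:]:
            result &= self.postings.get(token, set())
        return result


index = ToyIndex()
index.update("row-1", "Hadoop is an open source framework")
index.update("row-2", "HBase runs on top of Hadoop")
hits = index.query("HADOOP framework")
print(hits)
```

The original documents have to be fetched from real storage by ID, which is where HBase comes in.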
● HBase handles online requests that change user data.
● NGData Lily indexer handles the stream of changes and transforms them into SOLR index change requests.
● Indexes are built in SOLR, so HBase data is searchable.
HBase: Data and search integration
● Client: data update. User just puts (or deletes) data.
● HBase cluster (HBase regions): REPLICATION. Replication can be set up down to column family level.
● Lily HBase NRT indexer: translates data changes into SOLR index updates.
● SOLR cloud: finally provides search. Search requests (HTTP) and search responses.
● HDFS: serves low level file system.
● Apache ZooKeeper does all coordination.
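The translation step can be pictured as a small filter-and-map over a stream of changes. This is a sketch under assumptions: the event and request dictionaries, `REPLICATED_FAMILIES` and both functions are hypothetical and do not reflect the Lily indexer's or SOLR's real interfaces.

```python
# Hypothetical event and request shapes, for illustration only.
REPLICATED_FAMILIES = {"doc"}  # assumption: only this column family is replicated


def to_index_request(event):
    """Translate one data change into one index update request."""
    if event["op"] == "put":
        return {"action": "add", "id": event["row"], "fields": dict(event["values"])}
    if event["op"] == "delete":
        return {"action": "delete", "id": event["row"]}
    raise ValueError("unknown operation: %r" % (event["op"],))


def translate(events):
    """Yield index requests for changes in replicated column families only."""
    for event in events:
        if event["family"] in REPLICATED_FAMILIES:
            yield to_index_request(event)


changes = [
    {"op": "put", "row": "user-1", "family": "doc", "values": {"name": "Ann"}},
    {"op": "put", "row": "user-1", "family": "stats", "values": {"visits": 3}},
    {"op": "delete", "row": "user-2", "family": "doc"},
]
requests = list(translate(changes))
print(requests)
```

The client never talks to the index directly: it only puts or deletes data, and the index follows.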
ETL
LOAD YOUR DATA WITH CARE
ENTERPRISE DATA HUB
Don't ruin your existing data warehouse. Just extend it with new, centralized big data storage through a data migration solution.
ETL & BD: main stages
EXTRACT: SQL server (Table1, Table2, Table3, Table4)
TRANSFORM: JOIN, Partition
LOAD: BIG DATA shards
● SQL solutions are usually not as distributed as Big Data ones. How do you partition your data?
● Big data storage is mostly non-relational. You have to map table relations into objects. Where do you put this complexity?
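Both questions can be made concrete with a small sketch: a one-to-many relation folded into nested objects, and a stable hash deciding which shard each object goes to. The table shapes, field names and shard count are hypothetical, and CRC32 is just one possible partitioning function.

```python
import zlib
from collections import defaultdict


def shard_for(key, shard_count):
    """Stable hash partitioning: the same key always lands on the same shard."""
    return zlib.crc32(str(key).encode("utf-8")) % shard_count


def join_into_objects(customers, orders):
    """Fold a one-to-many relation into one nested object per customer."""
    orders_by_customer = defaultdict(list)
    for order in orders:
        orders_by_customer[order["customer_id"]].append(
            {"id": order["id"], "total": order["total"]}
        )
    return [
        {"id": c["id"], "name": c["name"], "orders": orders_by_customer.get(c["id"], [])}
        for c in customers
    ]


# Hypothetical rows extracted from two SQL tables.
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders = [
    {"id": 10, "customer_id": 1, "total": 5.0},
    {"id": 11, "customer_id": 1, "total": 7.5},
]

objects = join_into_objects(customers, orders)
shards = {obj["id"]: shard_for(obj["id"], 3) for obj in objects}
print(objects)
print(shards)
```

Wherever this join runs, on the SQL side or on the big data side, that side carries the load.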
ETL & BD: complexity on SQL
[Diagram: the SQL server (Table1 to Table4) performs the JOIN; a single ETL stream feeds the BIG DATA shards. SQL dies on this.]
● It's hard to transform SQL relationships into NoSQL objects: complex joins.
● Simple stream on the big data side, lowered network traffic. HUGE load on SQL.
● What if you have several SQL servers and you need 2 times faster import?
ETL & BD: complexity on BD side
[Diagram: the SQL server (Table1 to Table4) sends several plain ETL streams; the JOIN happens on the BIG DATA shards. Much more scalable.]
● Simple streaming from SQL. Things like joins happen on the Big Data side.
● Even if you have 100 SQL servers, you only have to scale a single cluster.
● Network load is more intensive.
YARN: future of Hadoop
● YARN forms the resource management layer and completes a real distributed data OS, so heterogeneous clusters and multi-tenancy are real things.
● New distributed processing approaches: MapReduce is now only one among other YARN applications.
First ever world DATA OS
A 10,000-node computer... Recent technology changes are focused on higher scale: better resource usage and control, lower MTTR, higher security, redundancy, fault tolerance.
This is how retail agents often work.
YARN
This is how it often works.
YARN
[Diagram: the CPUs YARN presents versus what can be reality]
It's about reservation. Indeed, you could have no resources left because of a service that is not aware of YARN.
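The gap between what YARN accounts for and what the node can really give is plain arithmetic. This is a sketch with hypothetical numbers and field names, assuming every reserved vcore is fully used by its container; it is not how YARN itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class Node:
    physical_cores: int
    advertised_vcores: int      # what the node offers to YARN
    reserved_vcores: int = 0    # vcores YARN has already handed out to containers
    busy_outside_yarn: int = 0  # cores used by a service YARN knows nothing about

    def yarn_free(self) -> int:
        """Free capacity as YARN's bookkeeping sees it."""
        return self.advertised_vcores - self.reserved_vcores

    def really_free(self) -> int:
        """Free capacity once the YARN-unaware service is counted."""
        return max(0, self.physical_cores - self.reserved_vcores - self.busy_outside_yarn)


node = Node(physical_cores=4, advertised_vcores=4, reserved_vcores=1, busy_outside_yarn=3)
print(node.yarn_free(), node.really_free())
```

YARN would still schedule containers onto this node, although nothing is actually left for them.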
YOU ALWAYS HAVE OPTION
Apache Spark
● A better MapReduce, with at least some MapReduce elements able to be reused.
● New job models. Not only Map and Reduce.
● Scala and Python APIs in addition to Java. Functional model support.
● Results can be passed through memory, including the final one.
● Works much better if it knows the size of the job to do. Streaming is just a sequence of small jobs.
● Requires proper YARN tuning to use resources properly. No dynamic allocation of executors.
● Persistence: int limitation of 2G. HUGE amounts of memory as of today.
● You cannot partition data 'on the fly'. You should guess the right way.
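One way to make the partitioning guess less of a guess is to size partitions against the 2G limit up front. This is a rough sketch, assuming the limit is a block size bounded by a signed 32-bit integer and that data spreads evenly across partitions; `min_partitions` and its `safety_factor` are hypothetical helpers, not part of Spark.

```python
import math

MAX_BLOCK_BYTES = 2**31 - 1  # largest value of a signed 32-bit int


def min_partitions(dataset_bytes, safety_factor=2.0):
    """Smallest partition count keeping the average partition under limit / safety_factor."""
    target = MAX_BLOCK_BYTES / safety_factor
    return max(1, math.ceil(dataset_bytes / target))


one_tib = 2**40
count = min_partitions(one_tib)
print(count, one_tib / count)
```

Skewed keys can still push a single partition over the limit, so the safety factor is a judgment call.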
● Dynamic, faster to start up, resource reuse.
● Unified management infrastructure such as logging.
MapReduce + Spark on YARN
Your cluster is ready for next tasks
It is simply too good to wait...
TRUST ME ;-)
Share your knowledge!
DO NOT HIDE YOUR EXPERIENCE
Questions and discussion