Data infrastructure architecture for a medium size organization: tips for collecting, storing and analysis.

Transcript
Page 1: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Egor Pakhomov
Data Architect, AnchorFree
[email protected]

Data infrastructure architecture for a medium size organization: tips for collecting, storing and analysis.

Page 2: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Medium organization (<500 people) vs. big organization (>500 people):

DATA CUSTOMERS: >10 vs. >100
DATA VOLUME: "Big data" vs. "Big data"
DATA TEAM PEOPLE RESOURCES: enough to integrate and support some open source stack vs. enough to write our own data tools
FINANCIAL RESOURCES: enough to buy hardware for a Hadoop cluster vs. enough to buy some cloud solution (Databricks cloud, Google BigQuery...)

Data infrastructure architecture

Page 3: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

HOW TO MANAGE BIG DATA

WHEN YOU ARE NOT THAT BIG?

Page 4: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

About me

Data architect at AnchorFree
Spark contributor since 0.9
Integrated Spark in Yandex Islands; worked in Yandex Data Factory
Participated in "Alpine Data" development, a Spark-based data platform

Page 5: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Agenda

1. Data Querying. Why is SQL important and how to use it in Hadoop?
   • SQL vs R/Python
   • Impala vs Spark
   • Zeppelin vs SQL desktop client

2. Data Storage. How to store data so it can be queried fast and changed easily?
   • JSON vs Parquet
   • Schema vs schema-less

3. Data Aggregation. How to aggregate your data to work better with BI tools?
   • Aggregate your data!
   • SQL code is code!

Page 6: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

1. Data Querying

Why is SQL important and how to use it in Hadoop?

1. SQL vs R/Python
2. Impala vs Spark
3. Zeppelin vs SQL desktop client

Page 7: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

SQL sits at the center, serving all of the main consumers: BI, analysts, QA, and the regular data transformations.

Page 8: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

What do you need from a SQL engine?

• Fast
• Reliable
• Able to process terabytes of data
• Supports the Hive metastore
• Supports modern SQL statements

Page 9: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Hive metastore role

The Hive metastore maps table names to the HDFS files that back them:

table_1 -> file341, file542, file453
table_2 -> file457, file458, file459
table_3 -> file37, file568, file359
table_4 -> file3457, file568, file349
...

The driver of each SQL engine asks the metastore for this mapping and then schedules its executors over those files in HDFS.
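As a minimal sketch of this flow, assuming Spark as the SQL engine and a hypothetical table logs.events already registered in the metastore, a query only names the table; the metastore resolves it to HDFS files behind the scenes:

from pyspark.sql import SparkSession

# Sketch only: "logs.events" is a hypothetical table registered in the Hive metastore.
spark = (SparkSession.builder
         .appName("metastore-example")
         .enableHiveSupport()   # tell Spark to use the Hive metastore
         .getOrCreate())

# The driver asks the metastore which HDFS files back logs.events,
# then schedules executors to scan those files.
spark.sql("SELECT count(*) AS cnt FROM logs.events").show()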

Page 10: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Which one would you choose? Both!

                                   SparkSQL   Impala
SUPPORT HIVE METASTORE                +          +
FAST                                  -          +
RELIABLE (WORKS NOT ONLY IN RAM)      +          -
JSON SUPPORT                          +          -
HIVE COMPATIBLE SYNTAX                +          -
OUT OF THE BOX YARN SUPPORT           +          -
MORE THAN JUST A SQL FRAMEWORK        +          -

Page 11: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Connect Tableau to Hadoop

Step 1: Tableau -> ODBC/JDBC server -> Hadoop

Page 12: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Give SQL to users

Step 2: users -> ODBC/JDBC server -> Hadoop
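For users who prefer a programmatic client, the same ODBC/JDBC endpoint can also be reached from code. A sketch with PyHive; the library choice, host name, port, and table name are assumptions, not something the deck specifies:

from pyhive import hive

# Hypothetical HiveServer2/Thrift-style SQL endpoint in front of Hadoop.
conn = hive.connect(host="hadoop-gateway.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT count(*) FROM logs.events")
print(cursor.fetchall())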

Page 13: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Would not work...

1. Managing a desktop application on N laptops
2. One Spark context shared by many users
3. Lack of visualization
4. No decent resource scheduling

Page 14: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

No decent resource scheduling: One user blocks everyone

Page 15: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

No decent resource scheduling: Hadoop is good at resource scheduling!

Page 16: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Apache Zeppelin is our solution

Page 17: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

1. Web-based

2. Notebook-based

3. Great visualisation

4. Works with both Impala and Spark

5. Has a cloud offering with support: Zeppelin Hub from NFLabs

It’s great!

Page 18: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Apache Zeppelin integration with Hadoop

Page 19: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

2. Data Storage

How to store data so it can be queried fast and changed easily?

1. JSON vs Parquet
2. Schema vs schema-less

Page 20: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

What do you need from data storage?

• Flexible format
• Fast querying
• Access to "raw" data
• Has a schema

Page 21: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Can we choose just one data format? We need both!

                        JSON   Parquet
FLEXIBLE                  +
ACCESS TO "RAW" DATA      +
FAST QUERYING                     +
HAS SCHEMA                        +
IMPALA SUPPORT                    +
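One way "both" can look in practice, as a sketch: land the raw JSON untouched and publish a Parquet copy of the same data for fast querying. The paths and the partitioning scheme below are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Raw JSON stays on HDFS exactly as collected: flexible, schema inferred on read.
raw = spark.read.json("hdfs:///data/raw/logs/2016-04-16/")

# A columnar Parquet copy of the same data is what analysts and Impala query.
(raw.write
    .mode("overwrite")
    .parquet("hdfs:///data/parquet/logs/dt=2016-04-16/"))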

Page 22: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Let's compare elegance and speed:

FORMAT    QUERY                                                                          TIME
Parquet   SELECT Sum(some_field) FROM logs.parquet_datasource                            136 sec
JSON      SELECT Sum(Get_json_object(line, '$.some_field')) FROM logs.json_datasource    764 sec

Parquet is about 5 times faster! But when you need raw data, 5 times slower is not that bad.

Page 23: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

How data in these formats compares:

JSON:

{
  "First name": "Mike",
  "Last name": "Smith",
  "Gender": "Male",
  "Country": "US"
}
{
  "First name": "Anna",
  "Last name": "Smith",
  "Age": "45",
  "Country": "Canada",
  "Comments": "Some additional info"
}
...

Parquet:

FIRST NAME   LAST NAME   GENDER   AGE
Mike         Smith       Male     NULL
Anna         Smith       NULL     45
...          ...         ...      ...
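A sketch of how that conversion behaves in Spark: reading the two records above infers the union of all fields, and fields missing from a record become NULL columns, exactly like the Parquet view.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

records = [
    '{"First name": "Mike", "Last name": "Smith", "Gender": "Male", "Country": "US"}',
    '{"First name": "Anna", "Last name": "Smith", "Age": "45", "Country": "Canada", '
    '"Comments": "Some additional info"}',
]
# The schema is inferred as the union of all fields seen in the JSON records.
df = spark.read.json(spark.sparkContext.parallelize(records))
df.show()  # Age and Gender are NULL where a record did not contain them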

Page 24: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

3. Data Aggregation

How to aggregate your data to work better with BI tools?

1. Aggregate your data!
2. SQL code is code!

Page 25: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Aggregate your data!

● "Big data" does not mean you need to query all of the data daily
● BI tools should not do big queries
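A sketch of what "aggregate your data" means in practice: a small daily query builds a compact table that BI reads, so the BI tool never scans the raw logs. The database, table, and column names here are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Runs once a day; reports.daily_traffic stays tiny compared to logs.events.
spark.sql("""
    INSERT OVERWRITE TABLE reports.daily_traffic PARTITION (dt = '2016-04-16')
    SELECT country,
           count(*)        AS sessions,
           sum(bytes_sent) AS bytes_sent
    FROM logs.events
    WHERE dt = '2016-04-16'
    GROUP BY country
""")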

Page 26: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

How does aggregation work?

Git with queries -> Query executor -> Aggregated table -> BI tool (select * from ...)
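The "query executor" box can be very small; for example, a script scheduled daily that runs every .sql file checked into the git repository. A minimal sketch, assuming a repo layout of aggregations/*.sql and a ${date} placeholder inside the queries:

import glob
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
today = date.today().isoformat()

# Every query in git (re)builds one aggregated table for the BI tool.
for path in sorted(glob.glob("aggregations/*.sql")):
    with open(path) as f:
        query = f.read().replace("${date}", today)  # simple templating
    print("running", path)
    spark.sql(query)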

Page 27: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Report development process:

1. Create an aggregated table in Zeppelin
2. Create a BI report based on this table
3. Add the queries to git to run daily
4. Publish the report

Process for changing the data behind a report:

1. Change the query in git

Page 28: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

One more tip

Page 29: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

We do not use the Spark that comes with the Hadoop installation, because we need to:

1. Apply our own patches to the source code
2. Move to new versions before any official release
3. Move part of the infrastructure to a new version while the rest remains on the old one

Page 30: Data infrastructure architecture for medium size organization: tips for collecting, storing and analysis.

Questions?

Contact: Egor Pakhomov
[email protected]
https://www.linkedin.com/in/egor-pakhomov-35179a3a

