
Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks

Date post: 14-Apr-2017
Upload: dataconomy-media
© Cloudera, Inc. All rights reserved.

Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
Marcell Szabó
2015-12-10

RecordService [public beta since Sept 2015]

Hi, we’re looking for a data protection solution to mask sensitive customer data during queries using Hive, Impala, MR, Spark and HBase. Does Cloudera offer something appropriate?

Regards, John Doe

Motivation

HDFS
• rw-rw-r--

Sentry
• Access Control Rules on Hive MetaStore objects
  • INSERT / SELECT / ALL
  • TABLE / VIEW / URI
  • view allows: filtering, projection, masking
• Understood by
  • Impala, HiveServer
  • but others (MapRed, Spark): fallback to HDFS

Before RecordService

RecordService to the rescue!

- Want to mask passwords?
- Create a new file!
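
The HDFS mode string on this slide applies to a whole file, which is why masking a single field means writing a new file. As a minimal sketch of that granularity, the snippet below parses the same mode string with the JDK's POSIX permission helper. This is the local-filesystem API from `java.nio`, not Hadoop's own permission classes, and the class and method names are made up for the illustration.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class FilePermissionGranularity {

    // Returns true if "others" may read a file with the given POSIX-style mode string.
    // The answer covers the entire file: there is no per-column or per-row notion here.
    static boolean othersCanRead(String mode) {
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(mode);
        return perms.contains(PosixFilePermission.OTHERS_READ);
    }

    public static void main(String[] args) {
        // Mode from the slide: owner and group read/write, everyone else read-only.
        System.out.println("rw-rw-r-- readable by others: " + othersCanRead("rw-rw-r--"));
        // Tightening the mode hides every column of the file at once.
        System.out.println("rw-rw---- readable by others: " + othersCanRead("rw-rw----"));
    }
}
```

With only this mechanism, the choice is between exposing every field of a file or none of them.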

Filtering, Projection, Masking

CREATE VIEW eu_clients_for_marketing AS
SELECT name, date_of_birth, mask(credit_card_number) AS ccn, rating, region
FROM clients
WHERE region = 'Europe'
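
To make the three operations concrete, here is a small in-memory sketch of what such a view does to each row. It is an illustration only, not how Sentry or RecordService evaluate views. The `email` column, the sample rows, and the masking rule (keep the last four characters) are assumptions; the slide does not define what `mask` returns.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the view, for illustration only.
final class ViewSketch {

    // Source row. `email` is an invented extra column that the view does not expose.
    static final class Client {
        final String name, dateOfBirth, creditCardNumber, region, email;
        final int rating;
        Client(String name, String dateOfBirth, String creditCardNumber,
               int rating, String region, String email) {
            this.name = name; this.dateOfBirth = dateOfBirth;
            this.creditCardNumber = creditCardNumber;
            this.rating = rating; this.region = region; this.email = email;
        }
    }

    // Row shape exposed by the view: five columns, with the card number masked.
    static final class MarketingRow {
        final String name, dateOfBirth, ccn, region;
        final int rating;
        MarketingRow(String name, String dateOfBirth, String ccn, int rating, String region) {
            this.name = name; this.dateOfBirth = dateOfBirth; this.ccn = ccn;
            this.rating = rating; this.region = region;
        }
    }

    // Assumed masking rule: replace everything except the last four characters with 'X'.
    static String mask(String value) {
        int hidden = Math.max(0, value.length() - 4);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < hidden; i++) {
            sb.append('X');
        }
        return sb.append(value.substring(hidden)).toString();
    }

    static List<MarketingRow> euClientsForMarketing(List<Client> clients) {
        List<MarketingRow> out = new ArrayList<>();
        for (Client c : clients) {
            if (!"Europe".equals(c.region)) {
                continue;                               // filtering: WHERE region = 'Europe'
            }
            out.add(new MarketingRow(                   // projection: only the listed columns
                c.name, c.dateOfBirth,
                mask(c.creditCardNumber),               // masking
                c.rating, c.region));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Client> clients = new ArrayList<>();       // made-up sample data
        clients.add(new Client("Anna", "1980-01-01", "4111111111111111", 5, "Europe", "anna@example.com"));
        clients.add(new Client("Bob", "1975-06-30", "5500000000000004", 3, "Asia", "bob@example.com"));
        for (MarketingRow r : euClientsForMarketing(clients)) {
            System.out.println(r.name + " " + r.ccn + " " + r.region);
        }
    }
}
```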

RecordService [public beta since Sept 2015]

[Diagram; labels: Sentry, MetaStore]

Expectations for a protective layer

• Durable and complete protection
• Doesn’t disrupt the interface
• Doesn’t impair performance

Durable and complete protection
• Single access path
• Kerberos
• ZooKeeper
• Signed tasks, no user code

Doesn’t disrupt the interface

Spark Example

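// Assumes the RecordService Spark integration (package com.cloudera.recordservice.spark) is imported.
// Before: read the file straight from storage.
//val file = sc.textFile(path)
// After: read the same data through RecordService.
val file = sc.recordServiceTextFile(path)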

Spark SQL Example

ctx.sql(s"""
  |CREATE TEMPORARY TABLE $tbl
  |USING com.cloudera.recordservice.spark.DefaultSource
  |OPTIONS (
  |  RecordServiceTable '$db.$tbl',
  |  RecordServiceTableSize '$size'
  |)
  """.stripMargin)

MR Example

// Before: read Avro files directly from a path.
//FileInputFormat.setInputPaths(job, new Path(args[0]));
//job.setInputFormatClass(AvroKeyInputFormat.class);

// After: read the table named in args[0] through RecordService.
RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);

Client Integration APIs

• Drop-in replacements for common existing InputFormats
  • Text, Avro

• Can be used with Spark as well
  • SparkSQL: integration with the Data Sources API
  • Predicate pushdown, projection

• Migration should be easy
• Client APIs make things simpler
  • Don’t need to interact with HMS
  • Don’t need to care about the underlying storage format: the worker always returns records in a canonical format
  • Don’t need to care about storage engine details (e.g. S3)

Doesn’t impair performance

Terasort
• ~Worst case scenario: a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78-node cluster (12 cores/24 Hyper-Threaded, 12 disks)
• Ran on 1 billion, 50 billion and 1 trillion (~100TB) scales

[Chart: TeraChecksum. Normalized job time without vs. with RecordService, at 1B, 50B and 1T scales on MapReduce and on Spark. Data labels in the source: 0.85, 0.8, 1.03, 0.23, 0.48, 1; their assignment to individual bars is not recoverable.]

• See Github repo for more details and runnable examples.

Spark SQL
• Represents a more expected use case: data has a full schema
• TPC-DS: 500GB scale factor, on Parquet
• Cluster: 5 nodes

[Chart: TPC-DS. SparkSQL vs. SparkSQL with RecordService for queries Q3, Q7, Q8, Q19, Q27, Q34, Q42, Q43, Q52, Q53, Q55, Q61, Q68, Q73, Q88, Q96 and their geometric mean; axis range 0 to 400.]

~15% improvement in query times; queries are not scan bound

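[Chart: SparkSQL vs. SparkSQL with RecordService for two workloads, "2% Selective Scan" and "Sum(col)"; axis range 0 to 32. Data labels in the source: 23.5, 14, 31, 29.5; their assignment to individual bars is not recoverable.]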

Performance

• Shares some core components with Impala
  • IO management, optimized C++ code, runtime code generation, uses low level storage APIs
  • Highly efficient implementation of the scan functionality

• Optimized columnar on-wire format
  • Inspired by Apache Parquet

• Accelerates performance for many workloads
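
The slides do not describe the wire format itself, so the following is only a generic sketch of what "columnar" means for a batch of records: values are grouped per column instead of per row, so a reader that needs one column can skip the others. The schema and all class and method names are invented for the illustration and are not part of RecordService.

```java
// Hypothetical row-to-columnar conversion, for illustration only.
final class ColumnarBatchSketch {

    // Row-oriented record with an invented three-column schema.
    static final class Row {
        final long id;
        final String region;
        final int rating;
        Row(long id, String region, int rating) {
            this.id = id; this.region = region; this.rating = rating;
        }
    }

    // Column-oriented batch: one array per column, all of the same length.
    static final class Batch {
        final long[] ids;
        final String[] regions;
        final int[] ratings;
        Batch(int size) {
            ids = new long[size];
            regions = new String[size];
            ratings = new int[size];
        }
    }

    // Copies each field of each row into the array for its column.
    static Batch toColumnar(Row[] rows) {
        Batch batch = new Batch(rows.length);
        for (int i = 0; i < rows.length; i++) {
            batch.ids[i] = rows[i].id;
            batch.regions[i] = rows[i].region;
            batch.ratings[i] = rows[i].rating;
        }
        return batch;
    }

    // An aggregate over one column touches only that column's array.
    static long sumRatings(Batch batch) {
        long sum = 0;
        for (int rating : batch.ratings) {
            sum += rating;
        }
        return sum;
    }

    public static void main(String[] args) {
        Row[] rows = {                                  // made-up sample data
            new Row(1L, "Europe", 5),
            new Row(2L, "Asia", 3),
        };
        System.out.println("sum(rating) = " + sumRatings(toColumnar(rows)));
    }
}
```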

Conclusion

• RecordService => schema-based data access for Hadoop
  • security ++
  • performance ++
  • data format abstracted away
  • uniform access across Hadoop

• http://cloudera.github.io/RecordServiceClient/
  • read … try … report bugs … contribute!

Thank you

Marcell Szabó szama at cloudera.com

