Hortonworks Data Platform

(December 15, 2017)

Apache Spark Component Guide

http://docs.hortonworks.com


Hortonworks Data Platform: Apache Spark Component Guide

Copyright 2012-2017 Hortonworks, Inc. Some rights reserved.

The Hortonworks Data Platform, powered by Apache Hadoop, is a massively scalable and 100% open source platform for storing, processing and analyzing large volumes of data. It is designed to deal with data from many sources and formats in a very quick, easy and cost-effective manner. The Hortonworks Data Platform consists of the essential set of Apache Hadoop projects including MapReduce, Hadoop Distributed File System (HDFS), HCatalog, Pig, Hive, HBase, ZooKeeper and Ambari. Hortonworks is the major contributor of code and patches to many of these projects. These projects have been integrated and tested as part of the Hortonworks Data Platform release process, and installation and configuration tools have also been included.

Unlike other providers of platforms built using Apache Hadoop, Hortonworks contributes 100% of our code back to the Apache Software Foundation. The Hortonworks Data Platform is Apache-licensed and completely open source. We sell only expert technical support, training and partner-enablement services. All of our technology is, and will remain, free and open source.

Please visit the Hortonworks Data Platform page for more information on Hortonworks technology. For more information on Hortonworks services, please visit either the Support or Training page. Feel free to contact us directly to discuss your specific needs.

Except where otherwise noted, this document is licensed under the Creative Commons Attribution ShareAlike 4.0 License. http://creativecommons.org/licenses/by-sa/4.0/legalcode



Table of Contents

1. Analyzing Data with Apache Spark
2. Installing Spark
   2.1. Installing Spark Using Ambari
   2.2. Installing Spark Manually
   2.3. Verifying Spark Configuration for Hive Access
   2.4. Installing the Spark Thrift Server After Deploying Spark
   2.5. Validating the Spark Installation
3. Configuring Spark
   3.1. Configuring the Spark Thrift Server
      3.1.1. Enabling Spark SQL User Impersonation for the Spark Thrift Server
      3.1.2. Customizing the Spark Thrift Server Port
   3.2. Configuring the Livy Server
      3.2.1. Configuring SSL for the Livy Server
      3.2.2. Configuring High Availability for the Livy Server
   3.3. Configuring the Spark History Server
   3.4. Configuring Dynamic Resource Allocation
      3.4.1. Customizing Dynamic Resource Allocation Settings on an Ambari-Managed Cluster
      3.4.2. Configuring Cluster Dynamic Resource Allocation Manually
      3.4.3. Configuring a Job for Dynamic Resource Allocation
      3.4.4. Dynamic Resource Allocation Properties
   3.5. Configuring Spark for Wire Encryption
      3.5.1. Configuring Spark for Wire Encryption
      3.5.2. Configuring Spark2 for Wire Encryption
   3.6. Configuring Spark for a Kerberos-Enabled Cluster
      3.6.1. Configuring the Spark History Server
      3.6.2. Configuring the Spark Thrift Server
      3.6.3. Setting Up Access for Submitting Jobs
4. Running Spark
   4.1. Specifying Which Version of Spark to Run
   4.2. Running Sample Spark 1.x Applications
      4.2.1. Spark Pi
      4.2.2. WordCount
   4.3. Running Sample Spark 2.x Applications
      4.3.1. Spark Pi
      4.3.2. WordCount
5. Submitting Spark Applications Through Livy
   5.1. Using Livy with Spark Versions 1 and 2
   5.2. Using Livy with Interactive Notebooks
   5.3. Using the Livy API to Run Spark Jobs: Overview
   5.4. Running an Interactive Session With the Livy API
      5.4.1. Livy Objects for Interactive Sessions
      5.4.2. Setting Path Variables for Python
      5.4.3. Livy API Reference for Interactive Sessions
   5.5. Submitting Batch Applications Using the Livy API
      5.5.1. Livy Batch Object
      5.5.2. Livy API Reference for Batch Jobs
6. Running PySpark in a Virtual Environment
7. Automating Spark Jobs with Oozie Spark Action
   7.1. Configuring Oozie Spark Action for Spark 1
   7.2. Configuring Oozie Spark Action for Spark 2
8. Developing Spark Applications
   8.1. Using the Spark DataFrame API
   8.2. Using Spark SQL
      8.2.1. Accessing Spark SQL through the Spark Shell
      8.2.2. Accessing Spark SQL through JDBC or ODBC: Prerequisites
      8.2.3. Accessing Spark SQL through JDBC
      8.2.4. Accessing Spark SQL through ODBC
      8.2.5. Spark SQL User Impersonation
   8.3. Calling Hive User-Defined Functions
      8.3.1. Using Built-in UDFs
      8.3.2. Using Custom UDFs
   8.4. Using Spark Streaming
      8.4.1. Prerequisites
      8.4.2. Building and Running a Secure Spark Streaming Job
      8.4.3. Running Spark Streaming Jobs on a Kerberos-Enabled Cluster
      8.4.4. Sample pom.xml File for Spark Streaming with Kafka
   8.5. HBase Data on Spark with Connectors
      8.5.1. Selecting a Connector
      8.5.2. Using the Connector with Apache Phoenix
   8.6. Accessing HDFS Files from Spark
      8.6.1. Specify
