DMX
Install Guide
Version 9.10
Copyright 1990, 2020 Syncsort Incorporated. All rights reserved.
This document contains unpublished, confidential, and proprietary information of Syncsort
Incorporated. No disclosure or use of any portion of the contents of this document may be made
without the express written consent of Syncsort Incorporated.
Getting technical support: Customers with a valid maintenance contract can get technical assistance
via MySupport. There you will find product downloads and documentation for the products to which
you are entitled, as well as an extensive knowledge base.
Version 9.10
Last Update: 21 October 2020
Contents

DMX Overview
Installing DMX/DMX-h
    DMX-h Overview
    Prerequisites
    Step-by-Step Installation
    Configuring the DMX Run-time Service
    Applying a New License Key to an Existing Installation
Running DMX
    Graphical User Interfaces
    DMX Help
Connecting to Databases from DMX
    Amazon Redshift
    Azure Synapse Analytics (formerly SQL Data Warehouse)
    Databricks
    DB2
    Greenplum
    Hive data warehouses
    Apache Impala
    Microsoft SQL Server
    Netezza
    NoSQL Databases
    Oracle
    Snowflake
    Sybase
    Teradata
    Vertica
    Other DBMSs
    Defining ODBC Data Sources
Connecting to Message Queues from DMX
    IBM WebSphere MQ
Connecting to Salesforce from DMX
Connecting to SAP from DMX
    Registering DMX in SAP SLD
Connecting to HDFS from DMX
Connecting to Connect:Direct nodes from DMX
    Security
    Installation and Configuration
Connecting to Databricks File Systems (DBFSs)
    Databricks File System (DBFS) connection requirements
    Defining Databricks File System (DBFS) connections
Connecting to CyberArk Enterprise Password Vault
    CyberArk Licenses
Connecting to Protegrity Data Security Gateway
Connecting to QlikView data eXchange files from QlikView or Qlik Sense
    QlikView desktop installation overview
    Qlik Sense desktop installation overview
Connecting to Tableau Data Extract files from Tableau
    Tableau desktop installation overview
Removing DMX/DMX-h from Your System
DMX installation component options
    DMX Management Service installation and configuration
    DMX DataFunnel run-time service install and configuration
Technical Support
Documentation Conventions

The following conventions are used in the format sections of the command options in this manual.

Regular type: Items in regular type must be entered literally using either lowercase or uppercase
letters. Items may be abbreviated. Examples: ASCII, ascending

Italics (non-bold): Items in italics (non-bold) represent variables. You must substitute an
appropriate numerical or text value for the variable. Example: file_name

Braces { }: Braces indicate that a choice must be made among items contained in the braces. The
choices may be presented in an aligned column, or on one line separated by a vertical bar ( | ).
Examples: {"a" }, {X"xx" }, {AND | OR}

Brackets [ ]: Brackets indicate that an item is optional. A choice may be made among multiple items
contained in brackets. Examples: [alias], [+ | -]

Slash /: A slash identifies a DMX option keyword. The slash must be included when an option
keyword is specified. Examples: /INFILE, /infile

Double quotes " ": Double quotation marks that appear in a format statement must be specified
literally. Example: "b"-"e"

Ellipsis …: An ellipsis indicates that the preceding argument or group of arguments may be
repeated. Example: [expression…]

Sequence number: A sequence number indicates that a series of arguments or values may be
specified. The sequence number itself must never be specified. Example: field2
DMX Overview

DMX™ is a high-performance data transformation product. With DMX you can design, schedule, and
control all your data transformations from a simple graphical interface on your Windows desktop.
Data records can be input from many types of sources such as database tables, SAP systems,
Salesforce.com objects, flat files, XML files, pipes, etc. The records can be aggregated, joined, sorted,
merged, or just copied to the appropriate target(s). Before output, records can be filtered,
reformatted, or otherwise transformed.
Metadata, including record layouts, business rules, transformation definitions, run history and data
statistics, can be maintained either within a specific task or in a central repository. The effects of
making a change to your application can be analyzed through impact and lineage analysis.
You can run your data transformations directly from your desktop, on any UNIX or Windows server,
or schedule them for later execution, embed them in batch scripts, or invoke them from your own
programs.
Installing DMX/DMX-h

The DMX components that are installed depend on your license key:
• DMX server license key installs components based on whether you select a Standard, Full,
Classic, or Custom installation. See DMX installation component options.
• DMX workstation license key installs the development client, Job and Task Editors; the
DMX engine, dmxjob/dmexpress; and the service for development client, which is the DMX
Run-time Service, dmxd.
The version of DMX server software must be at least as high as the version of the DMX client
software that is used to develop jobs and connect to the server. Thus, when installing a new version
of DMX, ensure that you install the same release of DMX on your client and server machines. If you
are upgrading and unable to install both the client and the server at the same time, you need to
upgrade the server prior to upgrading the client.
DMX-h Overview

DMX-h is the Hadoop-enabled edition of DMX, providing the following Hadoop functionality:
• ETL Processing in Hadoop – Develop a DMX-h ETL application entirely in the DMX GUI to
run seamlessly in the Hadoop MapReduce framework, with no Pig, Hive, or Java
programming required. Currently, jobs can be run in either MapReduce or Spark. See the
online DMX Help topic "DMX-h".
• Hadoop Sort Acceleration – Seamlessly replace the native sort within Hadoop MapReduce
processing with the high-speed DMX engine sort, providing performance benefits without
programming changes to existing MapReduce jobs. See the DMX-h Sort User Guide, which is
included in the Documentation folder under your DMX software installation directory.
• Apache Spark Integration – Use the Spark mainframe connector to transfer mainframe data
to HDFS. See the online DMX Help topic “Spark Mainframe Connector”.
• Apache Sqoop Integration – Use the Sqoop mainframe import connector to transfer
mainframe data into HDFS. See the online DMX Help topic "Sqoop Mainframe Import
Connector”.
DMX-h Requirements

DMX-h requires the following:
• DMX-h Edition
• A supported Hadoop MapReduce and/or Spark distribution:
o MapReduce
▪ Cloudera CDH 5.x (5.2 and higher) – YARN (MRv2)
▪ Hortonworks Data Platform (HDP) 2.x (2.3 and higher) – YARN
▪ Apache Hadoop 2.x (2.2 and higher) – YARN
▪ MapR, Community Edition and Enterprise Edition only (previously termed M5 and
M7, respectively), 6.x – YARN
▪ Pivotal HD 3.0 – YARN
DMX-h is certified as ODPi (1.0 and higher) interoperable.
o Spark
▪ Spark on YARN on the following Hadoop distributions:
• Cloudera CDH 5.x (5.5 and higher)
• Hortonworks Data Platform (HDP) 2.3.4, 2.x (2.4 and higher)
• MapR 5.x (5.1 and higher), Community Edition and Enterprise Edition only
(previously named M5 and M7, respectively)
▪ Spark on Mesos 0.21.0
▪ Spark Standalone 1.5.2 and higher
DMX-h Component Setup and Operation

A DMX-h setup consists of the following:
• Windows workstation
o DMX must be installed as described in Step-by-Step Installation, Windows Systems.
o DMX Job and Task Editors are used for MapReduce job development.
o MapReduce jobs are submitted to Hadoop via the ETL server from the Job Editor.
• Linux ETL server (edge node)
o DMX must be installed as described in Step-by-Step Installation, UNIX Systems.
o The Hadoop client must be installed and configured to connect to the Hadoop cluster.
o The DMX Run-time Service, dmxd, must be running to respond to jobs run via the
Windows workstation; it calls dmxjob with the /HADOOP option, which ultimately calls
hadoop to submit jobs to the cluster.
• Hadoop cluster
o DMX must be installed without dmxd on all nodes in the Hadoop cluster as described in
Step-by-Step Installation, Hadoop Cluster.
o Each mapper and reducer runs the map side or reduce side task(s), respectively.
o All file descriptors for sources, targets, and intermediate files are carefully connected so
they fit into the Hadoop MapReduce flow.
Prerequisites

Before you install DMX on your system, ensure that the following are available:
• DMX software: This is generally downloaded from Syncsort’s web site as a self-extracting
executable file (Windows) or a tar file (UNIX).
• DMX license key: License keys are sent via e-mail as an attachment file called
DMExpressLicense.txt. If you need specific system information to obtain a license key, refer
to the section below on Getting DMX License Information.
If you have a DMX server license key and plan to install DMX installation components, the type of
user that you set up depends on whether impersonation privileges are extended. See DMX
installation user setup considerations.
• Operating system: DMX runs on the following operating systems, with the listed release
being the minimum supported. Both 32-bit and 64-bit versions are supported, unless
otherwise stated: AIX release 6.1 64-bit; HP-UX release 11.31 IA64 64-bit; Linux kernel
version 2.6.18 to 2.6.31 with C library version 2.5 to 2.11 on Pentium-class x86_64 64-bit
machines; Linux kernel version 2.6.16 with C library version 2.4 on IBM System z 64-bit
mainframes; SunOS 5.10 SPARC 64-bit; Windows Vista; Windows 7; Windows 8.x; Windows
10; and Windows Server 2008, 2012, and 2012 R2.
• Java version requirements: On Windows and UNIX/Linux systems, DMX requires Java
runtime version 1.7 or higher unless you are only running DMX Sort, which does not use
Java. DMX requires JDK 7.
• Communication security protocol: On Windows and UNIX/Linux systems, DMX supports
Transport Layer Security (TLS) up to and including TLS version 1.2.
• User rights: Sufficient privileges to install and start Windows Services for Windows
platforms and root privileges to install and start UNIX daemons on UNIX platforms. An
umask setting of 022 is required so that other users can run the installed executables. The
installation procedure sets and resets umask if required.
• Pluggable Authentication Modules (PAM): If you want to use PAM for authentication on
UNIX or Linux platforms, PAM must be installed and configured on the system.
• Database client software: If you want DMX to access data in database tables (either as data
source or target), then the appropriate database client software must be on the system and
accessible via the appropriate shared library or dynamic link library (dll) paths.
For example, to access an Oracle database, Oracle Client must be installed on the system
where you run DMX; to access a database via ODBC, an ODBC data source must be defined
on the system where you run DMX. For details on how to connect to a specific Database
Management System (DBMS), refer to the section Connecting to Databases from DMX.
• Message queue client software: If you want DMX to access data in a message queue, then the
appropriate message queue client software must be on the system and accessible via the
appropriate shared library or dynamic link library (dll) paths.
For example, to access an IBM WebSphere MQ queue, IBM WebSphere MQ client must be
installed on the system where you run DMX. For details on how to connect to a specific
message queue type, refer to the section Connecting to Message Queues from DMX.
• SAP client software: If you want DMX to access data in an SAP system, then the appropriate
SAP client software must be installed on the system where you run DMX and accessible via
the appropriate shared library or dynamic link library (dll) paths. For details on how to
connect to an SAP system, refer to the section Connecting to SAP from DMX.
• Hadoop software – If you want DMX to access data in a Hadoop Distributed File System
(HDFS), or you want to run DMX-h ETL MapReduce jobs, then a Hadoop distribution
configured to access the cluster must be installed on the edge/ETL node from which you run
DMX. For details on how to connect to HDFS, refer to the section Connecting to HDFS from DMX.
• Connect:Direct software – If you want DMX to access data using a Connect:Direct
connection, a Connect:Direct server and client (CLI/API) must be installed on the system
where you run DMX and must be configured to access the required Connect:Direct nodes. For
details on how to connect to a Connect:Direct node, refer to Connecting to Connect:Direct nodes from DMX.
• QlikView software – DMX supports QlikView data eXchange (QVX) files as targets. To access
QVX files as sources from QlikView or Qlik Sense, refer to Connecting to QlikView data
eXchange files from QlikView or Qlik Sense.
• Tableau software – DMX supports Tableau Data Extract (TDE) files as targets. To access
TDE files as sources from Tableau, refer to Connecting to Tableau Data Extract files from
Tableau.
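Among the prerequisites above, the umask requirement (022) can be checked from the shell before
running the installer. The following is a sketch only; note that the installation procedure also
sets and resets umask itself if required:

```shell
# Sketch only: check that the current umask is 022 before installing,
# so that other users can execute the installed binaries, and set it
# for this shell session if it is not.
current=$(umask)
case "$current" in
    022|0022) : ;;      # already correct, nothing to do
    *) umask 022 ;;     # set for this shell session only
esac
```

This only affects the current shell session; the installer still verifies the setting itself.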
DMX installation user setup considerations

The type of user that you set up to install DMX installation components depends on whether
impersonation privileges are extended:
• If you plan to use impersonation when running the DMX Run-time Service, dmxd, you must
install as root.
• When running the DataFunnel Run-time Service, dmxrund, considerations exist for the type
of user that installs components.
User setup when running dmxrund

If you do not plan to use impersonation when running dmxrund, set up a non-administrative user to
install and run on Windows or set up a service user to install and run on Linux.
Set up a non-administrative/service user
Windows

As the administrative user has impersonation privileges by default, set up a new user who does not
have administrative rights.
Linux

To install and run job requests without impersonation, create a service user, dmxuser, and run the
installation as dmxuser.
Set up impersonation

If you plan to use impersonation when running dmxrund, no user setup is required to install and run
on Windows; set up an impersonated user to install and run on Linux.
Windows

As the administrative user has impersonation privileges by default, no setup is required.
Linux

DMX installation impersonation considerations on Linux follow:
• No impersonation – Running jobs without impersonation does not require root access. Upon
receipt of a job submission request from the DMX management service, dmxmgr, dmxrund
calls the DMX engine, dmxdfnl, to run the submitted job as the service user, dmxuser.
• Impersonation – Running jobs with impersonation requires root access to impersonate the
specified user. While dmxrund is never granted root access, another installed component,
dmxexecutor, can enable impersonation. When dmxrund detects that dmxexecutor is
installed in the required directory with the correct permissions, dmxrund calls dmxexecutor
to impersonate the specific user that calls the DMX engine, dmxdfnl, which runs the
submitted jobs.
To install and run job requests with impersonation, do the following:
• Create a service user, dmxuser.
• Create a service group, dmexpress.
Note: If you choose to change the name of the service group, you must update the SERVICE_GROUP
property of the DMX custom impersonation configuration properties file.
• Add dmxuser to the service group.
• Run the installation as dmxuser.
• Ensure that the following files are in the specified directories with the specified permissions:
Directory and file: <DMX_installation>/bin/dmxexecutor
Permissions: -rwsr-x---
Notes: The ‘s’ represents the set-user identification (setuid) bit and indicates that dmxexecutor
is extended impersonation privileges to run submitted jobs as a specific user.

Directory and file: <DMX_installation>/conf/dmxexecutor.conf
Permissions: -rwx------
Notes: Updates to dmxexecutor.conf are required only if you choose to customize the impersonation.
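The permission bits listed above can be applied with standard commands. The following is a sketch
only: the install path is a hypothetical example (for demonstration it creates placeholder files
under /tmp), and in a real installation dmxexecutor should additionally be owned by root with
group dmexpress (chown root:dmexpress), which requires root access:

```shell
# Sketch only: DMX_HOME is a hypothetical path; substitute your own
# installation directory. Placeholder files are created here so the
# chmod commands have something to act on.
DMX_HOME=/tmp/dmx_demo
mkdir -p "$DMX_HOME/bin" "$DMX_HOME/conf"
touch "$DMX_HOME/bin/dmxexecutor" "$DMX_HOME/conf/dmxexecutor.conf"

# -rwsr-x--- : setuid bit (4) plus owner rwx (7), group r-x (5), other --- (0)
chmod 4750 "$DMX_HOME/bin/dmxexecutor"

# -rwx------ : configuration file accessible by the owner only
chmod 700 "$DMX_HOME/conf/dmxexecutor.conf"
```

Verify the result with `ls -l "$DMX_HOME/bin/dmxexecutor"`, which should show -rwsr-x---.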
Getting DMX License Information

To obtain a license key, you need the computer name, the hardware model, the number of processors,
and the operating system of each system on which DMX is to run. You can gather the system
information by running the DMX License Information program.
Windows Systems

You can run the DMX License Information program in the following ways:
• From Syncsort’s web site at:
http://www.syncsort.com/software/licenseinfo.exe
• If DMX is installed, go to Programs, DMExpress from the Start menu and select License Information.
The program prepares a license information document with the system information and then
displays it in a Notepad window. You can save the form (File, Save As) and e-mail it to Syncsort or to
your local DMX sales agent. The information is then used to create your license key(s).
UNIX Systems

You can run the DMX License Information program in the following ways:
• From Syncsort’s web site at:
http://www.syncsort.com/software/licenseinfo.sh
• If DMX is installed, go to the <dmx_home>/bin directory, where <dmx_home> denotes the
directory where DMX is installed, and type:
./licenseinfo
The program generates and displays a text file named SyncsortLicenseInfo.txt in the current
directory. You can e-mail the file to Syncsort or to your local DMX sales agent. The information is
then used to create your license key(s).
Step-by-Step Installation
Windows Systems
Interactive Installation
1. Make sure that any previous version of DMX has been removed (see Removing DMX from
Your System later in this guide if necessary).
2. Install DMX by running the setup program extracted from the downloaded executable,
either directly or via Control Panel, Add/Remove Programs:
\Windows\x86\setup.exe for 32-bit Windows
\Windows\x64\setup.exe for 64-bit Windows x64
3. You are prompted to either enter a license key or start a free trial. If you selected to enter
a license key, you can type in the location of the DMExpressLicense.txt file, or browse to it,
when prompted. You can also enter the license key manually.
4. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
5. Review the product options, components and features that are enabled by your license key.
6. If your license key is a
• DMX server license key, a menu displays from which you select from among the
component options:
o Standard
o Full
o Classic
o Custom
For information on these options, see DMX installation component options. Select an
option and make the appropriate selections.
• DMX workstation license key, no component options display for selection.
You are eligible for the classic DMX/DMX-h installation, which installs the development
client, Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the service for
development client, which is the DMX Run-time Service, dmxd.
7. Confirm the file folder into which you want to install DMX. The file folder is subsequently
referred to as <dmx_home>.
8. Select the program folder in which you want the DMX icons to appear.
9. Review the Setup Information; choose Back to change these options or Install to complete the
installation.
10. If your license key enables the DMX Run-time Service, select the configuration options for
the Service. You can also configure the DMX Run-time Service later via Control Panel,
Administrative Tools, Services.
11. You may be prompted to choose whether to automatically run SyncSort jobs in DMX, either
immediately or after subsequent un-installation of SyncSort, depending on the presence of the
SyncSort Conversion license option and an existing installation of SyncSort.
12. Upon setup completion, a list of menu shortcuts displays in the DMX program folder, which is
available through the Windows Start menu.
13. To run the Connect Portal web UI, you must configure the DMX management service,
dmxmgr, including authentication. Then, start the DMX Management Service via Control Panel,
Administrative Tools, Services.
14. To run copy projects in Connect Portal, start the DataFunnel Run-time Service via Control
Panel, Administrative Tools, Services. See DMX DataFunnel run-time service installation and
configuration for more details.
15. To run CDC replication projects in Connect Portal, separately install the latest version of
MIMIX Share. The MIMIX Share Listener service starts automatically when you install MIMIX
Share, but the listener service must be running to run CDC replication projects in Connect
Portal.
If you performed a full install, including the development client, the following menu shortcuts
display:
• DMExpress
• Apply a New License Key
• DMExpress Application Upgrade
• DMExpress Global Find
• DMExpress Help
• DMExpress Job Editor
• DMExpress Server
• DMExpress Task Editor
• DataFunnel
• License Information
• Reference Guides
• Release Notes
If you performed a standard or classic install, with the development client but not the Management
Service, the following menu shortcuts display:
• DMExpress
• Apply a New License Key
• DMExpress Application Upgrade
• DMExpress Global Find
• DMExpress Help
• DMExpress Job Editor
• DMExpress Server
• DMExpress Task Editor
• License Information
• Reference Guides
• Release Notes
If you installed the Management Service only (custom install), the following menu shortcuts display:
• DMExpress
• Apply a New License Key
• DMExpress Help
• DataFunnel
• License Information
• Reference Guides
• Release Notes
If you did not install the development client or the Management Service, the following menu
shortcuts display:
• DMExpress
• Apply a New License Key
• DMExpress Help
• Documentation
• License Information
• Reference Guides
• Release Notes
If you have ActiveX-based SyncSort applications which you choose to run with DMX, and you
subsequently uninstall SyncSort, you may need to re-register the SyncSortX ActiveX control. To
register the ActiveX control, open a command prompt and type the following command:
regsvr32.exe <dmx_home>/Programs/SyncSortX.dll
Silent Installation

Silent installation requires a silent setup file that can be recorded during an interactive installation.
Installation steps may differ depending on product licensing, so changing the version of DMX or
adding or removing packages may require re-recording the silent setup file.
To record the installation options

Open a command prompt; type the full path for the installation program followed by the options:
–r –f1<silent_setup_file>
where <silent_setup_file> is the full path for the file to record the installation options. If you are
installing from a downloaded image which is located in c:\downloads, you would type a command
like:
C:\downloads\DMExpress_1-4_windows.exe –r –f1c:\temp\setup.iss
An interactive installation starts and all the selected installation options are saved in the specified
file.
To run the installation in silent mode

Open a command prompt; type the pathname of the install executable followed by the options:
–s –f1<silent_setup_file> -slog<log_file>
where <silent_setup_file> is the full path for the file that was previously used to record the
installation options, and <log_file> is the full path for the installation log file generated by silent
installation. If you are installing from a downloaded image which is located in c:\downloads, you
would type a command like:
C:\downloads\DMExpress_1-4_windows.exe –s –f1c:\temp\setup.iss
If you do not specify the –slog option, then setup generates a log of the silent installation, setup.log,
in the folder from which the setup is run or in the folder where the specified silent setup file is
located.
Multiple command line options are separated with a space, but there should be no spaces inside a
command line option (for example, –slogc:\setup.log is valid, but –slog c:\setup.log is not).
Note: When running silent installation on a machine with User Account Control enabled, an
administrator command prompt or batch file can be used to avoid the initial prompt by the operating
system requesting elevated privileges. To start a Command Prompt with administrative privileges,
right-click the Command Prompt shortcut and select "Run as administrator".
UNIX Systems
Prerequisites for COBOL Support

DMX can be used to accelerate COBOL SORT and MERGE verbs or to process COBOL data files as
source or target. In order to use these features, you must have a license to use the COBOL compiler
on the system where the DMX task runs.
Micro Focus COBOL or Server Express

The following variables must be set prior to installation: the COBDIR and PATH variables must be
set and exported to include the COBOL compiler, and the following environment variable for shared
libraries must be set to include all the shareable libraries used by the compiler and exported on the
corresponding platform:
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
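On Linux, for example, the exports described above might look like the following. This is a sketch
only: the Micro Focus install path shown is a hypothetical example, and on AIX, HP-UX, or Solaris
the corresponding shared-library variable from the table above replaces LD_LIBRARY_PATH:

```shell
# Sketch only: /opt/microfocus/cobol is a hypothetical install path;
# substitute the actual location of your COBOL compiler.
export COBDIR=/opt/microfocus/cobol

# Make the compiler executables and its shareable libraries visible.
export PATH="$COBDIR/bin:$PATH"
export LD_LIBRARY_PATH="$COBDIR/lib:${LD_LIBRARY_PATH:-}"
```

Run these exports (or place them in the installing user’s profile) before starting the DMX install
script, so the installer can locate the compiler.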
AcuCorp’s ACUCOBOL-GT™ COBOL Development System

Support for ACUCOBOL-GT™ is available on the following UNIX platforms:
Operating System Architecture
HP-UX 64-bit for Itanium
AIX 64-bit on PowerPC
SunOS 64-bit on SPARC processors
The bit level of DMX must match that of the ACUCOBOL-GT™ installation.
Before running the DMX install script, set the environment variable ACUCOBOL:
export ACUCOBOL=<acucobol_install_dir>
where <acucobol_install_dir> is the location of your ACUCOBOL-GT™ installation. If the
environment variable COBDIR is set, unset it:
unset COBDIR
Once DMX has been installed, additional steps need to be performed to enable support for
AcuCOBOL. Please refer to the DMX online help topic “Installing support for AcuCOBOL.”
COBOL-IT

Support for COBOL-IT line sequential files is available on the following UNIX platforms:
Operating System Architecture
AIX 64-bit on PowerPC
Linux 64-bit for Intel-compatible processors
The bit level of DMX must match that of the COBOL-IT installation. The minimum supported
COBOL-IT version is 3.7.
Before running the DMX install script, do the following:
• Unset the environment variable COBDIR, if set:
unset COBDIR
• Set the environment variable COBOLITDIR:
export COBOLITDIR=<cobol-it_install_dir>
where <cobol-it_install_dir> is the location of your COBOL-IT installation.
To configure COBOL-IT runtime environment variables, refer to the DMX online help topic,
“Installing support for COBOL-IT.”
Informix C-ISAM Support

If you plan to use DMX to process Informix C-ISAM files, the environment variable INFORMIXDIR
must be set and exported prior to running the install script. The directory $INFORMIXDIR/lib must
contain the library libisam.a.
Unikix VSAM Support

If you plan to use DMX to process Unikix VSAM files, the environment variable UNIKIX must be set
and exported prior to running the install script. The directory $UNIKIX/lib must contain the library
libbcisam.a.
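As a minimal sketch of the two requirements above (the install paths shown are hypothetical
examples; substitute your own), the variables can be exported and the required libraries checked
before running the install script:

```shell
# Sketch only: hypothetical install locations; substitute your own.
export INFORMIXDIR=/opt/informix
export UNIKIX=/opt/unikix

# The install script expects these libraries to be present:
#   $INFORMIXDIR/lib/libisam.a    (Informix C-ISAM)
#   $UNIKIX/lib/libbcisam.a       (Unikix VSAM)
for lib in "$INFORMIXDIR/lib/libisam.a" "$UNIKIX/lib/libbcisam.a"; do
    [ -f "$lib" ] || echo "missing: $lib"
done
```

If either library is reported missing, verify the client installation before proceeding with the
DMX install.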
Interactive Installation
1. If you are installing from a tar file that you downloaded from Syncsort’s web site, extract the
contents of the tar file on your UNIX system using a command similar to:
tar xvof DMEXPRESS.TAR
This creates a directory dmexpress under the current directory.
2. Log in as user root if you wish to install or configure the DMX Run-time Service. The DMX
Run-time Service allows you to submit tasks or jobs from the DMX Task Editor or Job Editor
components, running on remote desktops, to execute on this DMX server.
To install using downloaded software, navigate to the dmexpress directory created when you
extracted the contents of the tar file and then run the install program. For example,
cd /usr/tmp/dmexpress
./install
3. Depending on your system and the licensed options, you may be asked several questions. For
example, on platforms where both a 32-bit and a 64-bit version of DMX are available, you are
asked to choose which one you would like to install.
You are prompted to either enter a license key or start a free trial. If you choose to enter a
license key, specify the location of the license key file, DMExpressLicense.txt, when
prompted.
4. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
5. Review the product options, components and features that are enabled by your license key.
If your license key is a
• DMX server license key, a menu displays from which you select from among the
component options:
DMExpress Components
DMExpress Engine
Service for Development Client
DataFunnel Run-time Service
Management Service
System
Computer name: ...
License Expiry Date
...
For information on these options, see DMX installation component options.
• DMX workstation license key, no component options display.
A workstation license entitles you to the classic DMX/DMX-h installation, which installs the
development client, the Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the
service for the development client, which is the DMX Run-time Service, dmxd.
6. Specify the directory into which you want to install DMX. This directory is subsequently
referred to as <dmx_home>.
7. If you logged on as root, you are prompted to indicate your choice for configuring the DMX
Run-time Service. You can start the service immediately and choose whether it starts at
system restart. You can also select PAM authentication if it is available on the system. To
configure the DMX Run-time Service at a later time, run the installation procedure as root
from the DMX installation directory. See run-time service install and configuration for
additional information.
8. You may be prompted to choose whether to automatically run SyncSort jobs in DMX, either
immediately or after subsequent un-installation of SyncSort, depending on the presence of
the SyncSort Conversion license option and an existing installation of SyncSort.
9. If you have a DMX server license, you are given the option to install the DataFunnel Run-
time Service and the option to install Management Service.
10. When the installation procedure completes, update your environment variables. Add
<dmx_home>/bin to your PATH, and add <dmx_home>/lib to the shared library path, for
example, by updating your profile. The environment variable that must be set for specific
platforms is as follows:
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
11. To run the Connect Portal web UI, you must configure the DMX management service,
dmxmgr, including authentication. Then, start the DMX Management Service, dmxmgr. See
configure the DMX management service for more details.
12. To run copy projects in Connect Portal, start the DataFunnel Runtime Service, dmxrund. See
DMX DataFunnel run-time service installation and configuration for more details.
13. To run CDC replication projects in Connect Portal, separately install the latest version of
MIMIX Share. The MIMIX Share Listener service starts automatically when you install MIMIX
Share, but the listener service must be running to run CDC replication projects in Connect
Portal.
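For example, the PATH and shared library settings from step 10 might be added to a login profile as follows on a Linux system; the installation directory shown is a placeholder for your actual <dmx_home>.

```shell
# Example profile additions (Linux). Substitute your actual <dmx_home>;
# on AIX set LIBPATH and on HP-UX set SHLIB_PATH instead of LD_LIBRARY_PATH.
DMX_HOME=/usr/software/DMExpress
PATH=$DMX_HOME/bin:$PATH; export PATH
LD_LIBRARY_PATH=$DMX_HOME/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
```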
Silent Installation
A silent installation allows you to easily install DMX on multiple machines with identical options.
You simply install interactively on the first machine using the record option to save your responses
to installation prompts in a file. Then you run the silent installation on the remaining machines,
pointing to the recorded response file. Because the silent installation is non-interactive, it can be
scripted to effectively automate installation on many machines.
1. To prepare to run the silent installation, initiate the interactive installation on the first
machine as described in the section above, but in step 3, run the install command with the
record option, -r, specifying the file in which to store your responses to installation prompts
as follows:
./install -r <silent_setup_file>
2. Upon successful completion of the interactive installation, run the install program with the
silent option, -s, and the silent log option, -slog, on the remaining machines that require
installation as follows:
./install -s <silent_setup_file> -slog <log_file>
where:
o <silent_setup_file> is the full path to the response file generated by the interactive
installation.
o <log_file> is the full path to the log file generated by the silent installation.
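As a sketch of such scripting, the silent install could be pushed to the remaining machines over ssh. The helper function, host names, and file paths below are hypothetical placeholders for your environment.

```shell
# Hypothetical automation sketch for the silent installation.
silent_install_cmd() {
    # $1 = response file path, $2 = log file path; emits the install command
    echo "cd /usr/tmp/dmexpress && ./install -s $1 -slog $2"
}

# for h in host2 host3 host4; do
#     scp DMEXPRESS.TAR /usr/tmp/dmx_setup.txt "$h:/usr/tmp/"
#     ssh "$h" "cd /usr/tmp && tar xvof DMEXPRESS.TAR"
#     ssh "$h" "$(silent_install_cmd /usr/tmp/dmx_setup.txt /usr/tmp/install.log)"
# done
```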
Hadoop Cluster
DMX-h must be installed on all the nodes in the Hadoop cluster using one of the following methods:
• Managed Methods - recommended for large clusters
o Cloudera Manager Parcel Installation – Store the parcel in the Cloudera Manager local
or remote parcel repository (requires root/sudo privileges), then distribute and activate
the parcel on the cluster nodes via Cloudera Manager (requires Administrator access to
Cloudera Manager). Available as of Cloudera Manager 4.5.
o Apache Ambari Service Installation – Deploy the DMX-h Service Definition Package to
the Ambari repository, then install DMX-h on the nodes in the cluster using the Ambari
web interface (requires root/sudo privileges). Available as of Ambari 1.7.
o RPM Installation – Deploy the RPM (Red Hat Package Manager) on all nodes in the
cluster, then use the RPM to install DMX-h on all nodes in the cluster (requires root/sudo
privileges).
• Manual/Silent Installation – Install DMX-h on one node and replicate on all remaining nodes
The DMX Run-time Service (dmxd) only needs to be running on the node(s) to which you want to
submit jobs from the DMX GUI; typically, this is the machine designated as the edge node. When
installing DMX-h using any of the managed methods, the DMX Run-time Service is not installed.
See Installing/Upgrading the DMX Run-time Service for instructions on how to do this on the edge
node.
Installation Packages for Managed Methods
There are two separate installation packages for DMX-h, one for the software and another for the
license. If you do not already have a license installed, install a license package along with the
software package. If the license isn't installed, DMX-h runs in trial mode, which eventually expires
and stops working.
If you want to upgrade from a release before the introduction of the second license package, you must
install both the software and license packages.
Cloudera Manager Parcel Installation
Note: Cloudera Manager does not support the mixing of parcels with any other managed
install method, and doing so could result in your Hadoop cluster not restarting.
Pre-Installation
Execute the following steps on the machine where Cloudera Manager is installed:
1. Run the self-extracting shrink-wrap executables for the software and license packages from
the directories where they are located. For the software executable, this is:
./dmexpress-<DMX version>-<OS>.parcel.bin
For the license executable, this is:
./dmexpresslicense_<license site ID>-<date>-<OS>.parcel.bin
For example, dmexpresslicense_12345-20190928-el6.parcel.bin
2. Read and accept the Software License Agreement.
3. Enter a target directory in which to put the extracted .parcel, .sha, and manifest.json files.
The manifest.json file is required to use DMX via a remote parcel repository. The default is
the current folder.
Installation
Install the DMX-h (dmexpress) parcel and the DMX-h license (dmexpresslicense-XXXXX) parcel on
all nodes in the cluster as follows:
1. Depending on whether you are using a local parcel repository or a remote parcel repository,
do one of the following:
• Local parcel repository – With root/sudo privileges, copy the extracted .parcel and .sha
files for software and license to the Cloudera Manager local parcel repository. The default
location is /opt/cloudera/parcel-repo/.
• Remote parcel repository – With root/sudo privileges, copy the extracted .parcel and
manifest.json files for software and license to your remote parcel repository. Ensure that
the files have read and execute permissions for all users. As outlined on Cloudera’s
Creating and Using a Parcel Repository page, follow the steps to Configure the Cloudera
Manager Server to Use the Parcel URL.
2. Logged in to Cloudera Manager as an Administrator user, click on the parcel indicator
button in the Cloudera Manager Admin console navigation bar to bring up the Parcels tab of
the Hosts page.
3. If not already detected, click on the Check for New Parcels button. Consider the following:
• If you are using a local parcel repository, you can see the “downloaded” parcels on this
page, for example, dmexpress 9.8.1 and/or dmexpresslicense_12345-20180928.
• If you are using a remote parcel repository, click on the Download button to download the
dmexpress and/or dmexpresslicense-XXXXX parcel from the remote repository.
Click on the Distribute button to distribute the dmexpress and/or dmexpresslicense-XXXXX
parcel to the nodes in the cluster. By default, the files are written to
/opt/cloudera/parcels/parcel_name/ on each node.
4. Upon completion of the distribution, either or both parcels can be activated by clicking on its
Activate button. If there was a previously activated distribution of DMX-h, be sure that no
DMX-h jobs are running, because Cloudera Manager automatically deactivates the old parcel
upon activation of the new parcel, and any running jobs fail.
5. Upon activation, the symbolic link /usr/dmexpress is created/updated to point to the
activated DMX installation.
See the Cloudera Manager Enterprise Edition User Guide for details on Managing Parcels.
Apache Ambari Service Installation
Pre-Installation
Execute the following steps on the machine where the Ambari server resides:
1. Run the self-extracting shrink-wrap executable for the software package from the directory
where it is located. For the software executable, this is:
./dmexpress-<DMX version>-<OS>.ambari-service.bin
For the license executable, this is:
./dmexpresslicense-<license site ID>-<date>-<arch>.ambari-service.bin
e.g. dmexpresslicense-12345-20180928-any.ambari-service.bin
2. Read and accept the Software License Agreement.
3. Enter a target directory in which to extract the DMX-h or DMX-h license service folder, or
press Enter to accept the default, which is the current directory. If a folder with the same
name already exists, you are prompted to overwrite; enter yes to overwrite, or no to exit the
extracting process.
4. Enter a target directory in which to copy the DMX-h or DMX-h license service package where
it can be found by the Ambari server, or press Enter to accept the default, which is the root
path of the latest stack.
5. Enter yes to restart the Ambari server for the new package to be picked up, or no to restart
later.
6. If the service definition for DMX-h or the DMX-h license already exists in the repository,
you are prompted to upgrade; enter yes to upgrade, or no to exit the process without
updating the existing service definition package.
7. Enter the Ambari server's hostname, username, and password, and the cluster name, as
prompted, to complete the upgrade.
a. If the credentials entered fail, you can re-run this step manually by executing the
following script, where <Ambari service extracted package path> is the directory you
specified in step 3:
<Ambari service extracted package path>/services/DMXh/package/scripts/prepare_dmxh_upgrade.sh
b. If the credentials entered fail for the license package, execute this script:
<Ambari service extracted package path>/services/DMXhLicense/package/scripts/prepare_dmxh_license_upgrade.sh
8. If there is no license installed, repeat steps 1-7 for the license .bin file.
Installation
Install the DMX-h and/or DMX-h License service on all nodes in the cluster as follows:
1. Log in to the Ambari dashboard and select Actions->Add Service.
2. On the Add Service Wizard page, select DMX-h and/or DMX-h License and click Next.
3. On the Assign Slaves and Clients page, check Client for all nodes, and click Next.
4. On the Configure Services page, click Next to continue with the default options
(recommended). Alternatively, if you wish to change the default installation directory,
expand the “Advanced” section and make changes to the DMX-h Base Directory setting,
ensuring that the same directory is specified for both the DMX-h and DMX-h License tabs,
and then click Next.
5. On the Review page, verify the configuration and click Deploy to deploy DMX-h and/or DMX-
h License, or click Back to make modifications.
6. On the Install, Start and Test page, wait for the DMX-h and/or DMX-h License service to be
successfully installed on each node. If an error occurs, select the "Failures encountered" text
to display an error log and identify the problem.
See http://docs.hortonworks.com/ for details on Apache Ambari.
RPM Installation
Pre-Installation
Execute the following steps on one node in, or with access to, the Hadoop cluster:
1. Run the self-extracting shrink-wrap executable for the software and license packages from
the directories where they are located. For the software RPM, this is:
./dmexpress-<DMX version>-1.x86_64.bin
For the license RPM, this is:
./dmexpresslicense-<license site ID>-<date>-<revision>.<arch>.bin
e.g. dmexpresslicense-12345-20180927-1.x86_64.bin
2. Read and accept the Software License Agreement.
3. Enter a target directory in which to put the extracted RPM file (the default is the current
folder).
Installation
You can deploy the RPM on all nodes in the cluster using configuration management software or
install the DMX-h RPM package on all nodes in the cluster directly:
1. Execute the following command with sudo or root privileges:
rpm -i dmexpress-<DMX version>-1.x86_64.rpm
The license RPM equivalent command is:
rpm -i dmexpresslicense-<license site ID>-<date>-<revision>.<arch>.rpm
This creates a dmexpress folder under the default install location of /usr. To install to a
different location (not recommended), use the --prefix option for both license and software
install, such as:
rpm -i --prefix /some/other/directory dmexpress-<DMX version>-1.x86_64.rpm
Alternatively, the RPM can be installed with your Linux distribution’s high-level package manager if
it supports RPM. For example, on RHEL and CentOS, the yum command can be used:
yum install dmexpress-<version>-1.x86_64.rpm
or
yum install dmexpresslicense-<license site ID>-<date>-<revision>.<arch>.rpm
If there is an existing package, you can upgrade the software or license RPM instead:
rpm -U <package>.rpm
or
yum upgrade <package>.rpm
Manual/Silent Installation
Pre-Installation
The following steps are required prior to running the manual installation:
1. Create a shared directory, hereafter referred to as <shared_directory>, that can be accessed
by all nodes in the cluster for sharing the following files/folders (otherwise, they would need
to be copied to the same location on each node in the cluster):
• The DMExpressLicense.txt file obtained from the download.
• The dmexpress sub-directory created upon the dmexpress tar file extraction.
• The response file for the DMX silent installations (generated upon install on the first
node).
2. Extract the DMX Software.
a) Copy DMExpressLicense.txt and the dmexpress tar file to the <shared_directory>.
b) Extract the contents of the dmexpress tar file in the <shared_directory> on your UNIX system:
tar xvof dmexpress_<DMX version>-1_<language>_linux_2-6_x86-64_64bit.tar
This creates a dmexpress/ directory under the current directory, hereafter referred to as the
<dmx_download_directory>.
Installation
To install DMX-h on each node in the cluster, follow the instructions under UNIX Systems, Silent
Installation. You must manually install DMX-h on the first node, specifying a file to record your
responses to the install prompts, and can then silently install DMX-h on the remaining nodes using
the recorded response file, ensuring that all nodes are configured consistently.
When running the manual installation on the first machine, respond no to the prompt about
installing the DMX Run-time Service unless you want all the nodes in the cluster to install/run it.
See Installing/Upgrading the DMX Run-time Service for instructions on installing it on at least one
machine to which DMX-h jobs are submitted from the GUI.
Installing/Upgrading the DMX Run-time Service
The DMX Run-time Service (dmxd) must be installed and running on any machine to which DMX-h
jobs are submitted from the GUI; typically this is the machine designated as the edge node. If you
install/upgrade DMX-h on the edge node using any of the managed installation methods, or using the
Manual/Silent installation method where you answer no to the prompt about installing the service,
the DMX Run-time Service is not installed/upgraded.
To install/upgrade the DMX Run-time Service on any machine where DMX-h is installed, follow the
instructions for UNIX systems in Configuring the DMX Run-time Service.
Cluster in the cloud using Cloudera Director
Using Cloudera Director, you can install DMX-h on all of the nodes of a cluster in Google Cloud
Platform (GCP) or in Amazon web services (AWS).
Provided that you update the Cloudera Director configuration file, Cloudera Director can install
DMX-h as part of a cluster creation process that is initiated from the Cloudera Director command-
line interface (CLI).
Note: As Cloudera works toward supporting third-party parcels in Cloudera Director, Syncsort is
committed to updating the DMX-h installation procedures in alignment with Cloudera Director
enhanced functionality.
Pre-Installation
To enable Cloudera Director to install DMX-h on a cluster in the cloud, update the
instancePostCreateScripts section of the Cloudera Director configuration file to invoke a DMX
installation script, which you create. At a minimum, the DMX installation script must install the
DMX RPM.
Example: instancePostCreateScripts section of a Cloudera Director configuration file
In the following instancePostCreateScripts example, the DMX installation script is copied from a
Google Cloud Storage bucket and executed.
instancePostCreateScripts: ["""#!/bin/sh
echo "Installing DMExpress..."
/usr/local/bin/gsutil cp gs://<bucket_name>/installdmx.sh installdmx.sh
chmod a+x installdmx.sh
sudo ./installdmx.sh
if test $? -ne 0
then
echo Failed to install DMX on cluster nodes.
exit 1
fi
echo "Done installing DMX ..."
exit 0
"""]
Example: DMX installation script
#!/bin/bash
version=9.2
shrinkWrapFile=dmexpress-${version}-1.x86_64.bin
shrinkWrapResponse=shrinkWrapResponse.txt
# create the shrink-wrap response file
cat <<EOF > $shrinkWrapResponse
a
EOF
/usr/local/bin/gsutil cp gs://<bucket_name>/$shrinkWrapFile $shrinkWrapFile
if test $? -ne 0
then
echo Failed to copy DMX shrinkwrap file from the bucket
echo ""
exit 1
fi
chmod a+x $shrinkWrapFile
#extract the rpm
./$shrinkWrapFile < $shrinkWrapResponse > shrinkWrap.out 2>&1
#install the rpm
rpm -i dmexpress-${version}-1.x86_64.rpm
if test $? -ne 0
then
echo Failed to install DMX RPM package
echo ""
exit 1
fi
rm -f $shrinkWrapResponse
rm -f $shrinkWrapFile
rm -f dmexpress-${version}-1.x86_64.rpm
Installation
From the Cloudera Director CLI, create the cluster. When the Cloudera Director cluster deployment
completes successfully, DMX-h is installed on all of the nodes in the cluster.
Post-installation
To enable the submission of DMX-h jobs from the DMX Job Editor on a Windows instance, do the
following:
1. SSH to the ETL server/edge node and run a preparation script, which you create, to do the
following: start the DMX Run-time Service, dmxd; create a UNIX account, dmxuser/dmxuser;
enable password authentication for SSH.
Example: ETL server/edge node preparation script
#!/bin/bash
# (1) start dmxd on master-node
DMEXPRESS_HOME_DIRECTORY=/usr/dmexpress
export DMEXPRESS_HOME_DIRECTORY
if [ "" != "022" -a "" != "0022" -a "" != "000" -a "" != "00" -a "" !=
"0000" -a "" != "002" -a "" != "02" -a "" != "0002" -a "" != "020" -a
"" != "0020" ]
then
umask 022 2>/dev/null
fi
if [ ! -f $DMEXPRESS_HOME_DIRECTORY/bin/dmxd ]
then
echo Failed to locate the DMX Run-time Service 'dmxd'.
exit 1
fi
mkdir -p $DMEXPRESS_HOME_DIRECTORY/logs
echo "JOBS_DETAILS_DIR=$DMEXPRESS_HOME_DIRECTORY/logs" >
$DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
echo "DMEXPRESS_EXE=$DMEXPRESS_HOME_DIRECTORY/bin/dmexpress" >>
$DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
echo "DMEXPRESS_AUTHENTICATION_METHOD=DEFAULT" >>
$DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
PATH=$DMEXPRESS_HOME_DIRECTORY/bin:$PATH:/usr/bin; export PATH
LD_LIBRARY_PATH=$DMEXPRESS_HOME_DIRECTORY/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
cd $DMEXPRESS_HOME_DIRECTORY/bin
echo Starting the DMX Run-time Service at `date`...
nohup ./dmxd ./dmxd.conf 1>dmxd.stdout 2>dmxd.stderr &
# (2) create dmxuser
useradd -d /home/dmxuser -m -s /bin/bash "dmxuser"
echo "dmxuser:dmxuser"| chpasswd
if test $? -ne 0
then
echo Failed to set password for user dmxuser.
exit 1
fi
# (3) enable password authentication for sftp
cat /etc/ssh/sshd_config | sed -e "s/PasswordAuthentication.*no/PasswordAuthentication yes/" > sshd_config_temp
mv sshd_config_temp /etc/ssh/sshd_config
/etc/init.d/sshd restart
if test $? -ne 0
then
echo Failed to enable ssh password login.
exit 1
fi
exit 0
2. As dmxd runs on port 32636 and the SSH service runs on port 22, modify the edge node
network rules to allow TCP connections to these ports from the Windows instance.
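For example, in a GCP deployment the network rule might look like the following gcloud command. The rule name and source range are placeholders, and your environment may manage firewall rules differently; treat this as an assumed sketch, not the required procedure.

```shell
# Hypothetical GCP firewall rule opening SSH (22) and dmxd (32636) to the
# Windows instance; adjust the rule name, network, and source range as needed.
gcloud compute firewall-rules create allow-dmx-edge \
    --allow tcp:22,tcp:32636 \
    --source-ranges <windows_instance_ip>/32
```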
Deploying DMX to a Databricks cluster in the cloud
To run jobs on a Databricks cluster, you must deploy DMX Server to the cluster and install it using
an RPM (Red Hat Package Manager) init script. DMX CloudFSUtil, a command-line utility included
with DMX-h, can move the required files to all the nodes in a cluster on Amazon Web Services
(AWS).
Note: DMX supports Databricks clusters deployed to Azure and AWS cloud platforms. CloudFSUtil
can only move files to AWS.
Requirements
Set up the Spark cluster, install Databricks, and configure JDBC for Spark and Databricks. DMX
jobs on Databricks cannot run on Spark versions 3.0.0 and higher. When connecting to Databricks
databases, we recommend using the Simba Spark SQL JDBC driver version 2.6.16 or higher.
If your Connect jobs connect to DB2, Oracle, or SQL Server databases, install the JDBC drivers on
Databricks in the same work directory as the dmxspark2ix.jar file and the work directory configured
in DMX execution profile files. For more information, see Connecting to Databricks File Systems
(DBFSs).
You can run CloudFSUtil to copy the jars to the appropriate location. For example:
cloudfsutil -put mssql-jdbc-8.2.2.jre8.jar dbfs:/mnt/azuregen2/work/mssql-jdbc-8.2.2.jre8.jar
You also need a user account with permission to run sudo to run the RPM install.
Prepare the install files
To install Connect on Databricks, you need three files:
1. A DMExpress executable bin file for Connect, typically named dmexpress-${version}-1.x86_64.bin.
2. A license key package bin file for Connect, typically named
dmexpresslicense-${licenseId}-${licenseDate}.x86_64.bin.
3. An RPM install script. Databricks runs RPM scripts during cluster startup to extract the
DMExpress executable bin files and install them on the cluster. Databricks requires Unix
(LF) end-of-line characters in the script to execute properly. A sample RPM installation script
is shown below:
#!/bin/bash
dbfsPath=/dbfs/mnt/azuregen2/connect
workDir=/dbfs/mnt/azuregen2/work
version=9.10.11
shrinkWrapFile=dmexpress-${version}-1.x86_64.bin
shrinkWrapLicenseFile=dmexpresslicense-${licenseId}-${licenseDate}.x86_64.bin
shrinkWrapResponse=shrinkWrapResponse.txt
# create the shrink-wrap response file
cat <<EOF > $shrinkWrapResponse
a
EOF
cp -f $dbfsPath/$shrinkWrapFile $shrinkWrapFile
chmod a+x $shrinkWrapFile
#extract the rpm
./$shrinkWrapFile < $shrinkWrapResponse
#install the rpm
if type rpm >/dev/null 2>&1;then
echo "rpm is present"
else
sudo apt-get update
sudo apt-get -y install rpm
fi
mkdir -p /usr/tmp >/dev/null 2>&1
sudo rpm -i dmexpress-${version}-1.x86_64.rpm
rm dmexpress-${version}-1.x86_64.*
#Next step can be done once manually outside of script or included here
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
cp -vf /usr/dmexpress/lib/dmxspark2ix.jar $workDir/dmxspark2ix.jar
fi
Within the example RPM script, we set the following variables:

variable                 value
dbfsPath                 The mount point within Databricks where you keep the
                         DMExpress bin executable files.
workDir                  A directory path to which to save DMX job staging materials.
                         This should match the configured workDirectory in your DMX
                         execution profile file. For information, see Connecting to
                         Databricks File Systems (DBFSs).
version                  The version of Connect to install, used in the script to build
                         the shrinkWrapFile and shrinkWrapLicenseFile variable values.
shrinkWrapFile,          The file names of your bin executables, built from the same
shrinkWrapLicenseFile    version variable.
shrinkWrapResponse       A buffer for input variables.
After the variable assignments, the script ensures that rpm is available, creates the /usr/tmp
directory for temporary storage, and installs the software. The last lines of the script copy the
Connect dmxspark2ix.jar library to the work directory, which is required only once per work
directory. Therefore, you can include the step within your RPM install script as shown above, or
you can copy the library from your local installation to the work directory using Connect’s
CloudFSUtil. For example:
cloudfsutil -put <dmxpress_installation>/bin/dmxspark2ix.jar dbfs:/<workdir>/
If the dmxspark2ix.jar library is missing from the work directory, DMX jobs running on the
cluster fail. You can use multiple work directories on the same cluster, and each work directory
requires the library.
Move the install files to Databricks
We recommend using the executable CloudFSUtil within Connect to transfer the install files from
your local computer to Databricks. For example, from a directory containing all install files, run:
cloudfsutil -put dmexpress-9.10.11-1.x86_64.bin azure://<account>.blob.core.windows.net/<container>/
cloudfsutil -put dmexpresslicense-70590-20200925-1.x86_64.bin azure://<account>.blob.core.windows.net/<container>/
cloudfsutil -put rpmInstall.sh dbfs:/<rootdir>/rpmInstall.sh
To make the RPM install script available during cluster initialization, you must save it to a DBFS
root folder.
The DMX bin executables must reside on a mounted drive.
Note: Azure portal typically moves the DMX bin executables to Databricks faster than DBFS due to
the size of the executable.
Install custom libraries for jobs submitted from a Windows Virtual Machine (VM)
Note: this section does not apply if you are using a Linux VM to submit jobs to Databricks.
To submit jobs with custom libraries to Databricks using a Windows VM, perform the following
additional steps:
1. If you use custom function libraries, copy the Unix version of your custom function library to
Databricks. For example, to use CloudFSUtil:
cloudfsutil -put mylibrary.so dbfs:/mnt/azuregen2/connect/customfunctions/mylibrary.so
2. Create a local copy of the RPM install script and open it for editing.
3. Add a line to the end of the script with a command that copies the custom function libraries
to the plugins folder of the Connect software. For example:
cp -vf $dbfsPath/customfunctions/*.so /usr/dmexpress/plugins/
4. Save the changed RPM install script and upload it to a DBFS root folder.
5. Restart the Databricks cluster to install the custom libraries.
Configure the cluster
On the cluster on which to install Connect, execute the following procedure:
1. At the top of the cluster’s page, click Edit.
2. Click Advanced Options.
3. Under Spark, add the JAVA_HOME environment variable. For example,
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
4. Under Logging, set the destination to DBFS and the cluster log path to your desired location.
For example:
dbfs:/mnt/azuregen2/cluster-logs
Note: when you save these values, Databricks adds an additional subdirectory to the end of
the log path for the cluster ID.
5. Under Init Scripts, select Destination DBFS and add the path and filename of your RPM
install script to the Init Script Path value. For example:
dbfs:/<rootDir>/rpmInstall.sh
6. Click on Confirm at the top of the cluster’s page to save these changes.
7. To confirm a successful install:
a. Click on Event Log on the cluster’s page to verify the INIT_SCRIPTS_STARTED and
INIT_SCRIPTS_FINISHED messages and times.
b. Review the software installed in /usr/dmexpress by running the following command
within a notebook:
ls -l /usr/dmexpress/
If the cluster reported errors during DMX install, review the init script logs stored in the
init_scripts/<cluster_id>_<container_ip> directory under the logging directory set up in the cluster’s
Advanced Options. For example:
ls -l /dbfs/mnt/azuregen2/cluster-logs/init_scripts
Start the cluster
To start the cluster, click Start at the top of the cluster’s page.
Configuring the DMX Run-time Service
The DMX Run-time Service needs to be running on any system to which you want to submit tasks or
jobs for execution. It is also required for certain other functions such as file browsing in a multi-
locale environment or viewing server statistics from the client.
The DMX Run-time Service is usually configured during installation, where you determine the
following options:
• automatic restart on system startup
• PAM authentication on supported UNIX and Linux platforms
To change these options, you can reconfigure the DMX Run-time Service as described below.
Stopping and starting the DMX Run-time Service
The DMX Run-time Service can be stopped and restarted at any time. Before stopping the DMX
Run-time Service, please verify that no job or task submitted from the graphical interface is running.
If a DMX client running a version prior to 5.2.5 connects to this DMX Run-time Service, then the
Remote Procedure Call (RPC) service must be running on the system when the DMX Run-time
Service starts. The RPC service is used to obtain additional ports required to connect to older DMX
clients. Refer to the section below on RPC ports used by older DMX clients. Otherwise, the RPC
service is not required and all ports associated with it may be blocked.
Windows systems
This procedure requires Administrator level access. Select the DMExpress Service from Control
Panel, Administrative Tools, Services, then select Properties from the pop-up menu. This opens the
DMExpress Service Properties dialog. Use the Start (or Stop) button. A progress bar may appear and
the Service status in the properties window changes to Started (or Stopped).
NOTE: In order to submit a job or task, a user must have local login privileges to the machine where
the service is running.
UNIX systems
Root level access is required. Run the install script in the DMX installation directory. For example:
>cd /usr/software/DMExpress
>./install
This gives you the option of configuring the DMX Run-time Service, where you can choose to stop
and/or start the service.
Automatic restart on system startup
Windows systems
Select Automatic in Startup type in the DMX Run-time Service Properties dialog to have the DMX
Run-time Service started automatically when the system starts; select Manual otherwise.
UNIX systems
Use the install procedure as described under Stopping and Starting the DMX Run-time Service
above. You are asked the appropriate questions.
PAM authentication
UNIX systems
To configure DMX to use PAM authentication, do the following:
1. Use the install procedure to stop and restart the service as described above.
If you have Pluggable Authentication Modules (PAM) installed and configured on the system,
you are asked whether DMExpress should use PAM to authenticate users.
2. Include the PAM library in the system library path.
DMExpress specifically looks for the library name libpam.so. If your library has a different
name, such as libpam.so.0.81.5, create a symbolic link to it in any directory that is included
in the shared library path environment variable. For example, this can be done in the DMX
lib directory, specifying the full path to your library:
cd /<dmx_home>/lib
ln -s /lib64/libpam.so.0.81.5 libpam.so
3. Modify the PAM configuration for the service that handles all network
connections to the server running the DMX Run-time Service, dmxd.
• On Linux systems, have your system administrator create a file named dmxd in the
/etc/pam.d/ directory and grant authentication and account management privileges to the
dmxd service. Alternatively, you can do the following:
1. Create a file named dmxd in the directory /etc/pam.d
2. Copy the contents of sshd to dmxd
• On UNIX systems, have your system administrator create a dmxd entry in the pam.conf
file, which is located in the /etc/ directory, and grant authentication and account
management privileges to the dmxd service. Alternatively, you can do the following:
1. Create an entry named dmxd in the file /etc/pam.conf
2. Copy the contents of telnet to the entry created for dmxd
4. Ensure that DMX is configured to use PAM authentication:
• Check the installation log file, install.log, which was created in the directory where you
installed DMX. If PAM is installed on your system, the DMX installation log includes a
question asking if DMX should use PAM authentication. Verify that the recorded
response is yes [y].
• Alternatively, if you have root access to the DMX remote server, login as root and verify
that the following appears in the service configuration file, dmxd.conf, which is located in
the <dmx_home>/bin directory:
DMEXPRESS_AUTHENTICATION_METHOD=PAM
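The Linux portions of the steps above can be sketched as follows. This is a hedged illustration, not the exact procedure: the install path, the versioned library name, and the file contents are examples written to scratch locations, standing in for the real system files.

```shell
# Hedged sketch of steps 2 and 4 above for Linux; paths and the library
# version are examples that stand in for real system files.
DMX_HOME=/tmp/dmx_demo                 # stands in for <dmx_home>
mkdir -p "$DMX_HOME/lib" "$DMX_HOME/bin"
# Step 2: link the versioned PAM library to the name DMExpress looks for.
touch "$DMX_HOME/lib/libpam.so.0.81.5" # stands in for /lib64/libpam.so.0.81.5
ln -sf libpam.so.0.81.5 "$DMX_HOME/lib/libpam.so"
# Step 4: verify the authentication method recorded in dmxd.conf.
echo "DMEXPRESS_AUTHENTICATION_METHOD=PAM" > "$DMX_HOME/bin/dmxd.conf"
grep "^DMEXPRESS_AUTHENTICATION_METHOD=" "$DMX_HOME/bin/dmxd.conf"
```

On a real system, the symbolic link is created against the actual versioned library and dmxd.conf is only inspected, not written.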
Communication ports required by the DMX server
The following TCP and UDP ports are used for communication with the DMX server. When
configuring your firewall, make sure the required ports are not blocked anywhere between the
system running DMX server and the systems running the DMX client.
If any DMX client that connects to this server is running a version of DMX older than 5.2.5,
additional RPC ports are used, and may be configured. Refer to the section below on RPC ports used
by older DMX clients.
Port number/transport   Description
32636/TCP DMX server port, used for communication with DMX client, or with other DMX
servers when using Grid Computing. It is not recommended to override this
port number; please contact Syncsort technical support if you need to do so.
Refer to the section below on Technical Support.
In addition, if a DMX task or job uses a remote UNIX server connection, or a Windows network path
with a UNC name, to access data (including source and target files) or metadata (including tasks,
jobs and external metadata), the following ports need to be open on the system hosting the files.
Port number/transport   Description
20/TCP,UDP FTP data port, if Secure FTP is not used
21/TCP,UDP FTP control port, if Secure FTP is not used
22/TCP,UDP Secure FTP port, if Secure FTP is used
445/TCP,UDP Windows shares
50070/TCP,UDP Hadoop Distributed File System name node
RPC ports used by older DMX clients
DMX clients older than 5.2.5 require additional ports to communicate with the DMX Server. These
ports are assigned by the RPC service at the time the DMX Run-time Service starts.
The following ports are used in addition to the standard ports used by the DMX Run-time Service.
Port number/transport   Description
Arbitrary port/TCP
DMX server port used for communication with DMX clients. An arbitrary port
is assigned when the DMX Run-time Service is started. The port number can
be configured as mentioned below, for example if your security policy does not
allow a wide range of ports to be open, or due to the presence of a firewall.
111/TCP,UDP UNIX RPC port mapper
135/TCP,UDP Windows RPC endpoint mapper
Configuring the Server port
Windows systems
On the machine where the DMX Run-time Service is installed, open the DMX Run-time Service
Properties dialog and stop the DMX Run-time Service as described above.
In the Start parameters edit box of the properties window, type:
/tcpport <DMX ServerPort>
where <DMX ServerPort> is the port you want the service to use. For example: /tcpport 7771
Start the DMX Run-time Service as described above. The DMX Run-time Service now uses the port
you provided.
UNIX systems
Stop the DMX Run-time Service as described above.
Edit the service configuration file, dmxd.conf, which is located in the <dmx_home>/bin directory, to
insert the following line:
DMEXPRESS_TCP_PORT=<DMX ServerPort>
where <DMX ServerPort> is the port you want the service to use. For example:
DMEXPRESS_TCP_PORT=7771
Stop and start the DMX Run-time Service as described above. The DMX Run-time Service now uses
the port you provided.
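The UNIX edit above can be sketched as follows. A scratch file stands in for <dmx_home>/bin/dmxd.conf, and 7771 is only an example port; on a real system you stop the service first and start it afterwards.

```shell
# Hedged sketch: insert DMEXPRESS_TCP_PORT into dmxd.conf. A scratch file
# stands in for <dmx_home>/bin/dmxd.conf, and 7771 is an example port.
DMXD_CONF=/tmp/dmxd_demo.conf
echo "DMEXPRESS_TCP_PORT=7771" > "$DMXD_CONF"
grep "^DMEXPRESS_TCP_PORT=" "$DMXD_CONF"
```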
Applying a New License Key to an Existing Installation
Applying a new license key updates your product license to a new licensed version. If your new
license enables features or products not installed in your original installation, applying a new license
key does not install them automatically.
Windows Systems
Applying a new key interactively
Perform the following steps to apply a new license key to an existing DMX installation:
1. Go to Programs, DMExpress from the Start menu and select Apply a New License Key.
2. Browse to the location of the license key file, DMExpressLicense.txt, or type in the license
key manually, when prompted.
3. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
4. Review the product options, components and features that are enabled by your license key.
5. Confirm the location of the existing DMX installation.
Applying a new key silently
Applying a new license key silently requires a setup file which can be recorded when applying a new
license key interactively.
To record the setup file
Open a command prompt; type the full path to the program applykey.exe, followed by the options:
-r -f1<silent_setup_file>
where <silent_setup_file> is the full path to the setup file that is created. For example, if DMX is
installed in “C:\Program Files\DMExpress\”, type:
"C:\Program Files\DMExpress\Programs\applykey.exe" -r -f1c:\temp\setup.iss
An interactive session begins and the options that are selected during the interactive session are
recorded in the specified setup file.
To run the applykey.exe program in silent mode
Open a command prompt; type the full path to the program applykey.exe, followed by the options:
-s -f1<silent_setup_file> -slog<log_file>
where <silent_setup_file> is the full path to a setup file that was created using the steps above, and
<log_file> is the full path to the log file which contains any output produced by the silent install run.
For example, if DMX is installed in “C:\Program Files\DMExpress\”, type:
"C:\Program Files\DMExpress\Programs\applykey.exe" -s -f1c:\temp\setup.iss -slogc:\temp\setup.log
If you do not specify the -slog option, applykey generates a log, setup.log, in the folder where
the silent setup file is located.
Multiple command line options are separated with a space, but there should be no spaces inside a
command line option (for example, -slogc:\setup.log is valid, but -slog c:\setup.log is not).
UNIX Systems
Applying a new key interactively
Perform the following steps to apply a new license key to an existing DMX installation:
1. Change to the <dmx_home> directory and run the applykey program:
cd <dmx_home>
./applykey
2. Specify the location of the license key file when prompted.
3. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
4. Review the product options, components and features that are enabled by your license key.
Applying a new key silently
A silent applykey process allows you to easily apply a new DMX license key on multiple machines
with identical options. You simply apply the key interactively on the first machine using the record
option to save your responses to applykey prompts in a file. Then you run the silent applykey process
on the remaining machines, pointing to the recorded response file. Because the silent applykey
process is non-interactive, it can be scripted to effectively automate applying the license key on many
machines.
1. To prepare to run the silent applykey process, initiate the interactive applykey process on
the first machine as described in the section above, but in step 1, run the applykey command
with the record option, -r, specifying the file in which to store your responses to applykey
prompts as follows:
./applykey -r <silent_setup_file>
Note: Before initiating the silent applykey process, ensure that all actively running jobs
complete successfully.
2. Upon successful completion of the interactive applykey process, run the applykey program
with the silent option, -s, and the silent log option, -slog, on the remaining machines that
require the new key as follows:
./applykey -s <silent_setup_file> -slog <log_file>
where:
• <silent_setup_file> is the full path to the response file generated by the interactive
applykey process.
• <log_file> is the full path to the log file generated by the silent applykey process.
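The record-then-replay flow above can be sketched as a script. This is only an illustration: DMX_HOME and the response file path are assumptions, and applykey is invoked only when it is actually installed on the machine.

```shell
# Hedged sketch of the record/replay applykey flow; DMX_HOME and the
# response-file path are example values, not prescribed by this guide.
DMX_HOME=${DMX_HOME:-/usr/software/DMExpress}
RESPONSES=/tmp/applykey_responses.iss
if [ -x "$DMX_HOME/applykey" ]; then
  cd "$DMX_HOME"
  ./applykey -r "$RESPONSES"                          # record once, interactively
  ./applykey -s "$RESPONSES" -slog /tmp/applykey.log  # replay silently
else
  echo "applykey not found under $DMX_HOME"
fi
```

On the remaining machines, only the silent replay line is needed, pointing at a copy of the recorded response file.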
DMX-h in a Hadoop Cluster
The method for applying a new license key to DMX-h on the nodes of a Hadoop cluster depends on
how you originally installed DMX-h in the cluster. Follow the instructions in the appropriate section
below.
Cloudera Manager Parcel Apply Key
1. Install and activate the new DMX-h license Cloudera parcel, as described in Cloudera
Manager Parcel Installation. The software parcel does not need to be modified.
2. (optional) Uninstall the old DMX-h license Cloudera parcel, as described in Cloudera
Manager Parcel Uninstall.
Apache Ambari Server Apply Key
1. Install the new DMX-h Ambari license service definition package, as described in Apache
Ambari Service Installation. The software package does not need to be modified. This
effectively updates the existing service definition package.
RPM Apply Key
1. Install the new DMX-h license RPM package, as described in RPM Installation. This
effectively updates the license key.
Manual/Silent Apply Key
See UNIX systems, Applying a new key silently.
Running DMX
Once you have installed DMX, you can create tasks corresponding to different stages of your process
via the DMX Task Editor, and group tasks as jobs and run jobs via the DMX Job Editor. You can
schedule jobs to run later or run them from within a batch script. You can obtain more information on both
the graphical user interfaces and on running tasks and jobs from the command line from the DMX
Online Help.
Graphical User Interfaces
On Windows systems, go to Programs, DMExpress from the Start menu and select DMExpress Task
Editor to run the DMX Task Editor. To run the DMX Job Editor, either select it from the Start,
Programs, DMExpress menu, or switch to it from within the Task Editor via the Run, Create Job
menu item.
DMX Help
To access DMX Help, go to Programs, DMExpress from the Start menu and select DMX Help or
select the Help, Topics menu item from within the Task Editor or the Job Editor.
Connecting to Databases from DMX
In order for DMX to access database tables as sources or targets, the appropriate database client
software must be on the system and accessible via the appropriate shared library or dynamic link
library (dll) paths. The following environment variable must be set to include the path to the
database client libraries and exported on the corresponding platform:
Windows PATH
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
On UNIX systems, the variable needs to be set and exported prior to starting the DMX Run-time
Service or running DMX tasks or jobs.
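For example, on Linux the variable can be set and exported as follows before starting the service; the DB2 client library path shown is only an example.

```shell
# Hedged sketch for Linux: export LD_LIBRARY_PATH before starting the DMX
# Run-time Service or running tasks. The DB2 client path is an example only.
LD_LIBRARY_PATH=/opt/ibm/db2/V11.5/sqllib/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```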
Additional client configuration might be required for a specific DBMS. The configuration steps
needed to access a specific DBMS are described in the following sections.
The DMX install program assists you with configuring and/or verifying connections to databases.
On UNIX systems, if you wish to configure and/or verify database connections any time after the
installation procedure, run the databaseSetup program as follows:
cd <dmx_home>
./databaseSetup
Amazon Redshift
Initial requirements
Before attempting to connect to Amazon Redshift, do the following:
• Configure the DMX server, which can be either an Amazon Elastic Compute Cloud (EC2)
instance or your local machine, to accept SSH connections.
• Depending on the DMX server, consider the following:
• EC2 instance – Set the size of the maximum transmission unit (MTU).
• Local machine - Due to throughput on the wide area network (WAN), you may notice a
performance lag at design time and at runtime.
If the local machine is behind a firewall, you may need to configure a Virtual Private Network
(VPN) to connect to the local machine from Amazon Redshift.
• Configure the DMX server to include the Amazon Redshift cluster public key and cluster
node IP addresses:
1. Retrieve the Amazon Redshift cluster public key and cluster node IP addresses.
2. Add the Amazon Redshift cluster public key to the DMX host's authorized keys file.
3. Configure the DMX host to accept all of the Amazon Redshift cluster node IP addresses.
4. Get the public key for the DMX host.
• Specify Amazon Redshift parameters in the DMX Redshift configuration file.
The parameters outlined in the DMX Redshift configuration file, as defined by the
DMX_REDSHIFT_INI_FILE environment variable, provide DMX with the values required to
access an Amazon S3 bucket and to invoke the Amazon Redshift COPY command.
Note: If DMX_REDSHIFT_INI_FILE is not set, DMX issues an error message upon task
initiation and the DMX task aborts.
A sample DMX Redshift configuration file is provided in the DMX installation directory as follows:
Windows C:\Program Files\DMExpress\Examples\Databases\Redshift\DMXRedshift.ini
UNIX <DMX_installation>/etc/DMXRedshift.ini
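On UNIX, the variable can be exported as follows before task initiation. The path shown assumes an installation under /usr/software/DMExpress; adjust it to your installation directory.

```shell
# Hedged sketch: point DMX at the Redshift configuration file. The path is an
# example assuming an install under /usr/software/DMExpress.
DMX_REDSHIFT_INI_FILE=/usr/software/DMExpress/etc/DMXRedshift.ini
export DMX_REDSHIFT_INI_FILE
echo "$DMX_REDSHIFT_INI_FILE"
```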
Installation and configuration
Connectivity between DMX and Amazon Redshift databases is established through the Amazon
Redshift ODBC driver and, when loading, through multiple SSH connections.
DMX optimizes load performance to Amazon Redshift databases through the invocation of the
Amazon Redshift COPY command.
Amazon Redshift ODBC driver installation
Windows systems
For Windows systems, ODBC driver installation includes the following:
1. Install and configure the Amazon Redshift ODBC 32-bit driver on Microsoft Windows
operating systems.
2. When creating a system DSN entry for the ODBC connection, ensure the following settings
on the given dialogs:
• Amazon Redshift ODBC Driver DSN Setup dialog: Use Declare/Fetch is selected.
• Amazon Redshift Data Type Configuration dialog:
o Use Unicode is unselected.
o Show Boolean Column As String is unselected.
o Max Varchar (Default 255) is populated with the value 65530.
UNIX systems
For UNIX systems, ODBC driver installation includes the following:
1. Install the Amazon Redshift ODBC 64-bit driver on Linux operating systems.
2. Configure the ODBC Driver on Linux operating systems.
When using the unixODBC driver manager, override the standard threading settings in the
ODBC section of odbcinst.ini as follows:
[ODBC]
Threading = 1
3. Update odbc.ini with the following name-value pairs:
UseDeclareFetch=1
UseUnicode=0
BoolsAsChar=0
MaxVarchar=65530
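The driver manager and DSN settings above can be sketched as file fragments. This is an illustration written to scratch files; your real odbcinst.ini and odbc.ini live wherever unixODBC is configured to find them.

```shell
# Hedged sketch of steps 2 and 3 above, written to scratch files for
# illustration; real files live where unixODBC expects them.
cat > /tmp/odbcinst_demo.ini <<'EOF'
[ODBC]
Threading = 1
EOF
cat > /tmp/odbc_demo.ini <<'EOF'
UseDeclareFetch=1
UseUnicode=0
BoolsAsChar=0
MaxVarchar=65530
EOF
grep "^MaxVarchar=" /tmp/odbc_demo.ini
```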
Azure Synapse Analytics (formerly SQL Data Warehouse)
Azure Synapse Analytics (formerly SQL Data Warehouse) is a cloud-based Enterprise Data
Warehouse (EDW) developed by Microsoft. Through JDBC connectivity, DMX-h supports Azure
Synapse Analytics as sources and targets.
Azure Synapse Analytics connection requirements
Azure Synapse Analytics requires a JDBC connection configuration with the driver name and
location for all connections. The parameters outlined in a DMX Azure Synapse Analytics
configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• MAXPARALLELSTREAMS - Optional. The maximum number of parallel streams created, on
demand, to load data in parallel for performance.
• STORAGEACCESSKEY - Required. Azure Blob Storage access key for an active account. If
the storage access key is missing or invalid, DMX issues an AZSQDWTERR error message
and aborts the job.
• WORKTABLECODEC - Optional compression codec to use to compress files in the staging
table. DMX currently supports gzip compression codec only.
• WORKTABLEDIRECTORY - Required. A URL that includes the Blob Storage account name
with the endpoint, including the container name. See https://docs.microsoft.com/en-
us/azure/storage/blobs/storage-blobs-introduction#blob-storage-resources. For example:
WorkTableDirectory=https://<dmxazurestorage>.blob.core.windows.net/dmx-
azstorage-container
Where <dmxazurestorage> is the blob storage account name and <dmx-azstorage-
container> is the container name. If the work table directory is missing or invalid, DMX
issues an AZSQDWTERR error message and aborts the job.
• WORKTABLESCHEMA - Optional schema name to create the staging data. If this
parameter is not set, DMX creates tables in the same schema as the target table.
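Assuming a simple name=value file format like the WorkTableDirectory example above, a configuration file might be sketched as follows. Every value is a placeholder written to a scratch file; the driver name, class path, and key are not values prescribed by this guide.

```shell
# Hedged sketch of an Azure Synapse Analytics configuration file; all values
# below are placeholders, written to a scratch file for illustration only.
cat > /tmp/dmx_synapse_demo.ini <<'EOF'
DriverName=com.microsoft.sqlserver.jdbc.SQLServerDriver
DriverClassPath=/opt/jdbc/mssql-jdbc.jar
MaxParallelStreams=4
StorageAccessKey=<your_storage_access_key>
WorkTableCodec=gzip
WorkTableDirectory=https://<dmxazurestorage>.blob.core.windows.net/dmx-azstorage-container
WorkTableSchema=staging
EOF
grep "^WorkTableDirectory=" /tmp/dmx_synapse_demo.ini
```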
Defining Azure Synapse Analytics database connections
In the Database Connection dialog, define a connection to an Azure Synapse Analytics database as
follows:
• At DBMS, select Azure Synapse Analytics.
• At Access Method, select JDBC.
• At Database, select a previously defined Azure Synapse Analytics JDBC database connection
URL.
• At Authentication, select Auto-detect.
DMX requirements to load data into an Azure Synapse Analytics target
Before using DMX to load data into an Azure Synapse Analytics target, do the following:
1. Create or verify that the master database contains a database master key.
2. Enable the db_owner privilege for the user connecting to Azure Synapse Analytics.
Alternately, set or verify the following more granular privileges for the connecting user:
EXEC sp_addrolemember 'db_datawriter', '<user>';
GRANT CONTROL TO <user>;
Azure Synapse Analytics target connections
Using an Azure Synapse Analytics JDBC connection, DMX-h can write supported Azure Synapse
Analytics data types to Azure Synapse Analytics targets directly for optimal performance.
Defining Azure Synapse Analytics targets
At the Target Database Table dialog, define an Azure Synapse Analytics database table target:
1. At Connection, select a previously defined Azure Synapse Analytics target connection or
select Add new... to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Azure Synapse
Analytics target database tables. Values specified here take precedence over their
corresponding property in the JDBC configuration file, if any.
o Maximum parallel streams - the maximum number of parallel streams that can be
established, on demand, to load data in parallel for performance.
o Work table directory - Required. A URL that includes the Blob Storage account name
with the endpoint, including the container name. See https://docs.microsoft.com/en-
us/azure/storage/blobs/storage-blobs-introduction#blob-storage-resources. For
example:
WorkTableDirectory=https://<dmxazurestorage>.blob.core.windows.ne
t/dmx-azstorage-container
Where <dmxazurestorage> is the Blob storage account name and <dmx-azstorage-
container> is the container name. If the work table directory is missing or invalid,
DMX issues an AZSQDWTERR error message and aborts the job.
o Work table codec - specifies the compression algorithm used to compress data staged
in Blob storage.
o Work table schema - the schema used to create the staging table.
4. Set commit interval and Abort task if any record is rejected are not supported.
Azure Synapse Analytics source connections
Using an Azure Synapse Analytics JDBC connection, DMX can read supported Azure Synapse
Analytics data types from any Azure Synapse Analytics table.
Defining Azure Synapse Analytics sources
For all DMX-h ETL jobs, DMX-h supports Azure Synapse Analytics database tables as sources and
as lookup sources. At the Source Database Table dialog or at the Lookup Source Database Table
dialog, define either an Azure Synapse Analytics database table source or lookup source, respectively:
• At Connection, select a previously defined Azure Synapse Analytics source connection or
select Add new... to add a new connection.
Databricks
Databricks is a cloud database Platform-as-a-Service for Spark supported on Azure and AWS Cloud
Services. Through JDBC connectivity, DMX-h supports Databricks databases as sources and
targets.
Databricks connection requirements
Databricks requires a JDBC connection configuration with the driver name and location for all
connections.
NOTE: A Databricks database connection is a database connection, which is logically different from
a Databricks File System (DBFS) connection, which is a remote file connection.
Before attempting to connect to Databricks, do the following:
• Install DMX server on an Amazon Elastic Compute Cloud (EC2) instance, Azure Virtual
Machine (VM), or your local machine.
• Specify JDBC and Spark parallelization parameters in the DMX JDBC configuration file.
The parameters outlined in the DMX JDBC configuration file, as defined by the
DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and optional
values required to access an Amazon S3 bucket or Microsoft Azure blob to invoke a
Databricks query.
• DMX accesses Databricks using key-based authentication. If no access keys are provided,
DMX issues a UNIAMCRE error message and aborts the job.
The parameters outlined in a DMX Databricks configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• ANALYZETABLESTATISTICS - When set to y, DMX can run analyze queries that collect
table statistics. The default value is n.
• ANALYZECOLUMNSTATISTICS - When set to y, DMX can run analyze queries that collect
column statistics. Default value is n.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel
streams that can be established for loading data to the staging data file. By default,
MAXPARALLELSTREAMS is set to the number of CPUs available in the client machine.
• WORKTABLEDIRECTORY - Required path to an s3 bucket, Azure blob container, or
Databricks File System (DBFS) store in which to stage data. You must mount an s3 bucket
or Azure blob container using the Databricks File System (DBFS). Example URLs could
include:
o s3a://dev for an S3 bucket
o wasbs://<container>@<account>.blob.core.windows.net/dev for an Azure Blob
o dbfs://dev for a DBFS store
• DBFSMOUNTPOINT - DBFS mount point (DBFS path) required by
WORKTABLEDIRECTORY. DBFSMOUNTPOINT is mandatory if the work table directory
maps to an S3/Azure URL.
• MAXWORKFILESIZE - Optional integer. The maximum size, in bytes, of a staging
file written by the task. The default value is 134217728, which is equivalent to 128 MB.
• WORKTABLESCHEMA - Optional schema name to use for staging data. The default
schema for staging data is the same as the target data schema.
• WORKTABLECODEC - A compression codec to compress the files in the staging directory.
Valid values are gzip (default), bzip2, and uncompressed.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon
establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEYID is
optional.
• AWSACCESSKEY - The 40-character string, also known as the secret access key, which
Amazon provides upon establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEY is
optional.
DMX requires the access key id and the secret access key to send requests to an Amazon S3
bucket unless an AWS temporary session token is required, in which case DMX requires the
access key id and AWS temporary session token. See the AWSTOKEN parameter below.
• AWSTOKEN - An AWS temporary session token, granting temporary security credentials
(temporary access keys and a security token) to any IAM user enabling them to access AWS
services. This alternative authentication method replaces a full-access AWS storage access
key. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket.
• AzureStorageAccessKey - A 512-bit Azure Blob Storage access key for an active account;
Microsoft issues two such keys upon establishing an Azure Portal account. If DMX runs in the
Azure Blob container, AzureStorageAccessKey is optional. If the storage access is required
and the key is missing or invalid, DMX issues an AZSQDWTERR error message and aborts
the job. DMX ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob
container.
• AzureStorageSAS - A shared access signature (SAS) URI that grants restricted access rights
to Azure Storage resources. This alternative authentication method replaces a full-access
Azure Storage access key. DMX ignores this parameter unless WORKTABLEDIRECTORY is
an Azure blob container.
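Using the parameters above, a Databricks configuration file might be sketched as follows. The driver name, jar path, mount point, and keys are placeholders written to a scratch file, not values from this guide.

```shell
# Hedged sketch of a Databricks configuration file using the parameters above;
# every value is a placeholder, written to a scratch file for illustration.
cat > /tmp/dmx_databricks_demo.ini <<'EOF'
DriverName=com.databricks.client.jdbc.Driver
DriverClassPath=/opt/jdbc/databricks-jdbc.jar
MaxParallelStreams=4
WorkTableDirectory=s3a://dev
DbfsMountPoint=/mnt/dev
MaxWorkFileSize=134217728
WorkTableCodec=gzip
AwsAccessKeyId=<your_access_key_id>
AwsAccessKey=<your_secret_access_key>
EOF
grep "^WorkTableDirectory=" /tmp/dmx_databricks_demo.ini
```

Note that DbfsMountPoint is set here because the work table directory maps to an S3 URL, matching the DBFSMOUNTPOINT requirement described above.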
Defining Databricks database connections
In the Database Connection dialog, define a connection to a Databricks database as follows:
• At DBMS, select Databricks.
• At Access Method, select JDBC.
• At Database, select a previously defined Databricks JDBC database connection URL.
• At Authentication, select Auto-detect.
Databricks target connections
Using a Databricks JDBC connection, DMX-h can write supported Databricks data types to
Databricks targets directly for optimal performance.
Defining Databricks targets
At the Target Database Table dialog, define a Databricks database table target:
1. At Connection, select a previously defined Databricks target connection or select Add new...
to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Databricks target
database tables. Values specified here take precedence over their corresponding property in
the jdbc configuration file, if any.
o Analyze table statistics - enables analyze queries that collect table statistics.
o Analyze column statistics - enables analyze queries that collect column statistics.
o Maximum parallel streams - Optional integer representing the maximum number of
parallel streams that Connect can establish for loading data into the staging data
file. By default, this is set to the number of CPUs available in the client machine.
o Work table directory - the parent-level directory in s3, blob, and/or dbfs in which
Connect creates job-specific subdirectories.
When the work table directory is an s3 bucket, you must mount the s3 bucket
through DBFS. For more details, see the Databricks documentation concerning
Amazon S3.
When the work table directory is an azure blob container, you must mount the blob
container through DBFS. For more details, see the Databricks documentation
concerning Azure storage.
o Work table schema - the schema used to create the staging table. By default, Connect
creates the staging table in the same schema as the target table.
o Work table codec - specifies the compression algorithm used to compress Databricks
data. Valid values are gzip (default), bzip2, and uncompressed.
4. Set commit interval and Abort task if any record is rejected are not supported.
Databricks source connections
Using a Spark JDBC connection, DMX can read supported Databricks data types from any
Databricks table.
Defining Databricks sources
For all DMX-h ETL jobs, DMX-h supports Databricks database tables as sources and as lookup
sources.
At the Source Database Table dialog or at the Lookup Source Database Table dialog, define either a
Databricks database table source or lookup source, respectively:
• At Connection, select a previously defined Databricks source connection or select Add new...
to add a new connection.
DB2
Your DB2 client must be installed on the system and configured so that it can connect to databases
that you want to access from DMX. For example, you can configure the client by cataloging
databases, or by defining database aliases in the db2cli.ini file. Please refer to specific DB2
documentation for details on configuring the client.
Windows Systems
To access DB2 databases, DB2 client software must be accessible via the dynamic link libraries (dll)
located under the <db2_home>/sqllib/bin folder, where <db2_home> denotes the directory where DB2
is installed.
UNIX Systems
To access DB2 databases, DB2 client software must be accessible via the shared libraries located
under the <instance_home>/sqllib/lib directory, where <instance_home> denotes the home directory
of the DB2 instance that you want to use to connect to the database.
Greenplum
Installation and configuration
DMX connects to Greenplum databases through the Greenplum ODBC driver and the Greenplum
psql client utility, which is a component of the Greenplum client software.
Install and configure the Greenplum client software on the system on which the DMX client is
installed.
To establish a connection to the Greenplum database, install the Greenplum ODBC driver and
create, configure, and test the ODBC data source name (DSN).
Greenplum client software installation
Windows systems
For Windows systems, client software installation includes the following:
1. Install the Greenplum client software.
a) Register as a user on the Pivotal Network site.
b) From the Greenplum Clients section of the Pivotal Greenplum Database download site,
download the Clients for Windows file, for example:
greenplum-clients-<client_software_version_number>-build-<build_version_number>-WinXP-x86_32.msi
For information on installing and configuring the Greenplum Windows client software, refer
to Greenplum Database Client Tools for Windows.
The default Greenplum client installation directory is as follows:
C:\Program Files (x86)\Greenplum\greenplum-clients-<client_software_version_number>-build-<build_version_number>
2. Verify that the Greenplum psql client utility is in a directory specified in the PATH.
Note: If the Greenplum psql client utility is not in the PATH when DMX initiates a load to
the Greenplum database, DMX issues an error message and the task aborts.
UNIX systems
For UNIX systems, client software installation includes the following:
1. Install the Greenplum client software.
a) Register as a user on the Pivotal Network site.
b) From the Greenplum Clients section of the Pivotal Greenplum Database download site,
download the applicable Greenplum UNIX client software, for example:
greenplum-clients-<client_software_version_number>-build-<build_version_number>-<platform.zip>
For information on installing and configuring the Greenplum UNIX client software, refer to
Greenplum Database Client Tools for UNIX.
The default Greenplum client installation directory is as follows:
/usr/local/greenplum-clients-<client_software_version_number>-build-<build_version_number>
To set up the system environment variables, run greenplum_clients_path.sh:
<greenplum_home>/greenplum_clients_path.sh
where <greenplum_home> is the Greenplum client software installation directory.
2. Verify that the Greenplum psql client utility is in a directory specified in the PATH.
Note: If the Greenplum psql client utility is not in the PATH when DMX initiates a load to the
Greenplum database, DMX issues an error message and the task aborts.
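The PATH check in step 2 can be sketched as follows; this is an illustration of the verification, not a DMX command.

```shell
# Hedged sketch of step 2: check whether the Greenplum psql client utility is
# on the PATH before initiating a load to the Greenplum database.
if command -v psql >/dev/null 2>&1; then
  echo "psql found at $(command -v psql)"
else
  echo "psql is not in the PATH; a Greenplum load would abort"
fi
```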
Greenplum ODBC driver installation and configuration
Windows systems
For Windows systems, driver installation and configuration includes the following:
1. Install the Greenplum ODBC driver.
From the Greenplum Connectivity section of the Pivotal Greenplum Database download site,
download the Connectivity for Windows driver file, for example:
greenplum-connectivity-<client_software_version_number>-build-
<build_version_number>-WinXP-x86_32.msi
The default Greenplum ODBC driver installation directory is as follows:
C:\Program Files (x86)\Greenplum\greenplum-connectivity-
<client_software_version_number>-build-
<build_version_number>\drivers\odbc
2. Verify that the ODBC driver libraries, which are dynamically linked libraries with the
.dll extension, are installed successfully.
3. Create and configure the ODBC DSN.
4. When creating a system DSN entry for the ODBC connection, ensure the following on the
Greenplum Advanced Options dialog:
o Use Declare/Fetch is selected.
o Show Boolean Column As String is unselected.
o Max Varchar (Default 255) is populated with the value 65530.
UNIX systems
For UNIX systems, the Greenplum driver is provided by DMX and is part of the DMX installation.
The Greenplum driver, _Sgplm<version_number>.so, is installed in the following directory:
<dmx>/ThirdParty/DataDirect/lib
Note: <dmx> is the DMX installation directory.
Greenplum driver configuration includes the following:
1. Add an entry for the Greenplum driver in the odbc.ini file.
A sample odbc.ini file is shipped with DMX and is located in the following directory:
<dmx>/etc.
In the Greenplum Data Source section of the odbc.ini file, add the Greenplum driver entry:
Driver=<dmx>/ThirdParty/DataDirect/lib/_Sgplm<version_number>.so
2. Define the embedded DataDirect ODBC Driver Manager as the ODBC driver manager.
The DataDirect ODBC Driver Manager is shipped with DMX and is installed in the following
directory:
<dmx>/ThirdParty/DataDirect
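Putting the steps above together, a Greenplum data source entry in odbc.ini might look like the following sketch. The DSN name, host, port, database, and driver version number are illustrative assumptions, and the connection attribute names follow common DataDirect conventions; confirm them against the sample odbc.ini shipped in <dmx>/etc:

```ini
[Greenplum]
Driver=<dmx>/ThirdParty/DataDirect/lib/_Sgplm27.so
HostName=gpmaster.example.com
PortNumber=5432
Database=sales
```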
Hive data warehouses
Apache Hive is a data warehouse infrastructure built on top of Hadoop that analyzes, queries, and
summarizes large datasets stored in Hadoop's Distributed File System (HDFS) and other compatible
file systems. Hive includes HiveQL, a query language useful for real-time analytics in Hadoop.
DMX-h can connect to Hive data warehouses as:
• sources when running on an ETL server/edge node or in the cluster
• targets when running on an ETL server/edge node or in the cluster
DMX can also access Hive tables as HCatalog sources and targets. To understand how DMX reads
and writes over Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC),
please read “Connecting to Hive data warehouses” in the product help.
Both JDBC and ODBC configurations have two parts:
1. Configuring connections on Windows, typically workstations
2. Configuring connections on Linux, typically edge nodes
To design jobs and tasks using the DMX GUI, configure Hive connections on a Windows workstation.
When you finish developing jobs, execute them on the cluster from an edge node so that they can
read and write to Hive.
Hive source connections
Using a Hive ODBC or JDBC connection, DMX-h can read supported Hive data types from all the
supported Hive file types, including Apache Avro, Apache Parquet, Optimized Row Columnar (ORC),
Record Columnar (RCFile), and text. DMX-h jobs running in the cluster can only read Hive sources
using JDBC connections.
Note: On an ETL server/edge node, reading from Hive sources via Hive ODBC drivers
yields low throughput. We only recommend using Hive ODBC drivers for sources serving at
most a few gigabytes of data, such as pre-aggregated data for analytics.
JDBC connectivity
When DMX-h runs in the cluster, DMX-h can directly read a Hive table when the underlying file
format is supported by the HCatalog API. In all other cases, when DMX-h reads from a Hive table on
an ETL server/edge node or in the cluster via JDBC, DMX-h stages the data temporarily in
compressed or uncompressed format to a text-backed Hive table.
A user can force DMX-h to stage data by setting the environment variable
DMX_HIVE_SOURCE_FORCE_STAGING to 1, which uses the two-step process implemented in
previous versions of DMX.
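For example, to force staging in the current shell session before running a job (the variable name and value come from the text above; the syntax assumes a POSIX shell):

```shell
# Force DMX-h to stage Hive source data via a text-backed work table,
# reverting to the two-step read process of previous DMX versions.
export DMX_HIVE_SOURCE_FORCE_STAGING=1
echo "DMX_HIVE_SOURCE_FORCE_STAGING=$DMX_HIVE_SOURCE_FORCE_STAGING"
```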
Hive target connections
DMX-h jobs and tasks write supported Hive data types to Hive targets using different methods
depending on whether they use JDBC or ODBC connections. JDBC is recommended over ODBC for
Hive targets. Consider the following constraints:
• JDBC - When DMX-h writes to a Hive table via JDBC, the job or task loads data directly into
the target tables. Writes are temporarily staged in compressed or non-compressed format to
a text-backed Hive table only when one of the following conditions limits direct access:
o A target table is a transactional (a.k.a. ACID) table
o A target table has one or more partitions
o A target table has any complex type column(s)
o The disposition for the target table is Truncate, Upsert, or Upsert and Apply change
(CDC)
o The job runs on localnode or singleclusternode
o A user forces DMX-h to stage data by setting the environment variable
DMX_HIVE_TARGET_FORCE_STAGING to 1, which uses the two-step process
implemented in previous versions of DMX
• ODBC - Based on the file format and whether the Hive table is partitioned, DMX uses one of
the following parallelized processes to write to Hive:
o Staged - DMX-h temporarily stages data from parallel streams in compressed or non-
compressed format to a text-backed Hive table.
o Direct - DMX-h loads data in parallel streams directly to the Hadoop file system for
optimal performance.
File Format                                            Partitioned   Non-partitioned
Apache Avro, Apache Parquet, or delimited text files   Staged        Direct
Other file formats                                     Staged        Staged
Generating text-backed Hive tables
When reading from or writing to Hive over JDBC or ODBC, DMX-h stages data in Hive
managed or external tables as shown below:
• By default, there is no specific work table directory configured, so DMX-h creates and stages
the data in a text-backed Hive managed table in the default schema.
• When you specify a work table directory, either by setting the
DMX_HIVE_WORK_TABLE_DIRECTORY environment variable or using a parameter in
the source or target table dialogs, DMX-h stages the data to a text-backed Hive external
table.
DMX-h deletes temporary text-backed Hive tables when the DMX-h job ends.
Additionally, you can apply compression to the work table, either by setting the
DMX_HIVE_WORK_TABLE_CODEC environment variable or using a parameter in the source or
target table dialogs.
For more information about work tables, see the Hive table staging topic in the Hive configuration
documentation in the product help.
Work table access
By default, DMX creates text-backed Hive tables as Hive-managed tables in the default schema.
The user ID that runs the DMX-h job must have CREATE TABLE privilege on this schema.
If the user cannot create tables in the default schema, you can configure DMX-h to use a different
schema either by
• setting the DMX_HIVE_WORK_TABLE_SCHEMA environment variable or
• using a parameter in the source or target table dialogs.
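The work-table settings described above can be collected as environment variables in a shell profile. The directory path, codec class, and schema name in this sketch are illustrative assumptions; valid codec names depend on your Hadoop distribution:

```shell
# Stage to a text-backed EXTERNAL Hive table under this directory
# instead of a managed table in the default schema (illustrative path).
export DMX_HIVE_WORK_TABLE_DIRECTORY=/tmp/dmx_work_tables

# Optionally compress staged work-table data (codec class is assumed).
export DMX_HIVE_WORK_TABLE_CODEC=org.apache.hadoop.io.compress.GzipCodec

# Create work tables in a schema where the job's user ID has the
# CREATE TABLE privilege (schema name is illustrative).
export DMX_HIVE_WORK_TABLE_SCHEMA=dmx_staging
```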
Hive configuration
Connecting to Hive from DMX-h requires the following configuration components:
• Hive JDBC connection and/or Hive ODBC connection.
• Hive table staging
• Hive table creation security (for Hive targets)
• Sentry/Ranger authorization, if used
Setting up Hive JDBC for Linux/UNIX
DMX uses Hive Java Database Connectivity (JDBC) on the cluster to execute Hive jobs and tasks. To
set up a JDBC connection from DMX-h to Hive on an edge node or server, you must:
1. Download a Hive JDBC driver JAR file
2. Set the JAVA_HOME environment variable
3. Configure the JDBC ini file
4. (Optional) Secure the database connection using Kerberos
5. Set the JDBC connection URL in DMX
Prerequisites
To successfully configure Hive JDBC and test a database connection from DMX-h on Linux/UNIX,
set up the following resources and gather the following permissions:
1. On cluster machines, obtain administrator privileges or contact information for a system
administrator
2. On the cluster, complete cluster provisioning
3. On the cluster, gather contact information for Cloudera or Hortonworks support
4. On the cluster, create or obtain access to an HDFS test directory with full RWX access
5. To use a Hortonworks cluster, complete the Hortonworks setup
6. To use Ambari, obtain:
a. Access to the DMX-h Ambari install package
b. On the Ambari server, administrator access to Ambari
c. On the Ambari server, a user account with permission to use the sudo command
Network connectivity
1. On the cluster, open port 2181 (or configured port) for ZooKeeper service discovery, plus all
ports for Hive server hosts/ports known to ZooKeeper, typically 10000 or 10001
2. On the Hive servers, open these ports:
a. 10000 (or configured port) for Hive
b. 8088 and 19888 to use YARN Resource Manager (RM)
3. On the edge node, open these ports:
a. 22 to use SSH
b. 32636 (or configured port) for DMX
4. If using Kerberos, open these ports:
a. 88 for Kerberos KDC
b. 749 for Kerberos admin servers (from the krb5.ini file)
Edge node configuration
1. Obtain a user account with permission to use the sudo command
2. Install Java JRE or JDK version 8 or newer with Hadoop. To use Kerberos authentication
with a JRE/JDK older than 8u161, install the Java Cryptography Extensions (JCE)
unlimited strength policy files or install a later JRE/JDK version.
3. Install all database clients
4. If you use SSL/TLS and the certificate is self-signed or signed by an in-house Certificate
Authority (CA), confirm read access to a Java keystore/truststore and that you know the
password
5. To use Kerberos, confirm that you can run the Kerberos kinit command
6. For each database, obtain administrator privileges or contact a database administrator
7. For each database, obtain a user ID that can access at least one test table
Set the JAVA_HOME variable
Linux operating systems typically configure the JAVA_HOME system variable with the Java install
path. To check the JAVA_HOME value and update it for all users:
1. Open Terminal
2. Type echo $JAVA_HOME and press Enter. If the JAVA_HOME variable is set, a message
containing the value stored in the variable appears, similar to the following:
$ echo $JAVA_HOME
/usr/bin/java/jdk1.8.0_191/jre
If the path displayed is the current install directory for the Java JRE or JDK, the
JAVA_HOME variable is set correctly.
3. If the message is not the current Java JRE or JDK install directory, create or edit the
JAVA_HOME variable value. To edit system variables, you need administrator privileges.
Downloading the Hive JDBC driver
Hive JDBC drivers connect client applications, including DMX-h, to Hive. Therefore, you must install
a Hive JDBC driver on any client, edge node, or server where DMX-h connects to Hive. All
Hadoop distributors ship Hive JDBC drivers in a JAR package you can download.
• Download the Cloudera Hive JDBC documentation and JAR package from the Cloudera
website.
• Download the Hortonworks JDBC driver and documentation from the downloads page on the
Cloudera website. We recommend using the latest version.
The Cloudera and Hortonworks drivers are both re-branded versions of Simba JDBC drivers. When
the DMX Help references the Simba JDBC driver, the reference also applies to any Cloudera or
Hortonworks JDBC driver you can use.
• For MapR, please refer to the MapR documentation website to download the driver
Configuring JDBC INI
Configure DMX to use the JDBC driver you downloaded.
1. Edit the JDBC configuration file for DMX
2. Create the DMX_JDBC_INI_FILE system variable
Editing the JDBC configuration file
To use the JDBC driver, add JDBC configuration information to a configuration file DMX can access.
Because you configure the path to this file in a system variable below, you can use any location DMX
can access.
1. Create a new or edit an existing JDBC configuration file
2. Type Hive JDBC parameters and values in this file, as described below.
Parameter Description
DriverName JDBC driver class. Consult the JDBC configuration
documentation from the Hadoop vendor for its value.
DriverClassPath
Path to the Hive JDBC driver file.
In some cases, the path contains a JDBC driver but does not
include all the required Java components and interferes with
Hive connections. Resolve this by installing a standalone JDBC
driver from Apache, which includes all necessary components,
and changing the DriverClassPath variable to the standalone
driver path.
IsSchemaSupported Optional. Set to true to ensure that all specified database tables
are identified correctly when DMX cannot determine whether
the DBMS supports a schema, such as in Hive. Valid values are
true or false and are case-insensitive.
3. Save the file and record its file path to use in an environment variable
The values in the configuration file must match the driver name set by your Hadoop distributor.
Please consult their documentation for the exact driver name and driver file name.
UNIX: JDBC Configuration File Examples
# Cloudera JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.cloudera.hive.jdbc41.HS2Driver
DriverClassPath=/opt/hivejdbc
# Hortonworks JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.simba.hive.jdbc41.HS2Driver
DriverClassPath=/opt/hivejdbc
# Apache Hive JDBC driver for HiveServer2 on Cloudera CDH
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=/opt/cloudera/parcels/CDH/jars
# Apache Hive JDBC driver for HiveServer2 on Hortonworks (HDP)
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=/usr/hdp/current/hive-client/lib
# Apache Hive JDBC driver for HiveServer2 on MapR
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=/opt/mapr/hive/hive-1.2/lib
Creating the DMX_JDBC_INI_FILE Environment variable
Create a DMX_JDBC_INI_FILE environment variable and assign the full path of the JDBC
configuration file as its value. The full path includes the directory location and the JDBC
configuration file name.
For example, to set the variable for users opening the bash shell:
1. Open the $HOME/.bash_profile or /etc/bashrc configuration file for editing
2. Type the following on a new line:
export DMX_JDBC_INI_FILE="<jdbc_configuration_file_path>"
Where <jdbc_configuration_file_path> is the full path of the JDBC configuration file
3. Save and close the configuration file
To verify the update
1. Open a new Terminal that uses the bash shell
2. Type echo $DMX_JDBC_INI_FILE and press Enter. If the export command is correct, the
terminal responds with a message containing the DMX_JDBC_INI_FILE value set in the
configuration file.
Securing a database connection using Kerberos
Note: You can only manage Kerberos authentication in DMX-h over JDBC connections.
You can configure Kerberos authentication outside of DMX-h for either ODBC or JDBC
connections.
To connect to Kerberos secured databases, DMX requires valid Kerberos tickets, similar to
authentication tokens, to authenticate its identity as a trusted client. To obtain a valid Kerberos
ticket, you can:
• Use DMX-h to leverage Java Authentication and Authorization Service (JAAS) to generate a
ticket and automatically supply it to the JDBC driver.
• (Recommended for Linux) Use a Kerberos client outside of DMX, such as the Java kinit
command, to generate a ticket and set up the JDBC driver to use it. We recommend this
method because Linux environments that use Kerberos typically generate a ticket at startup
that you can use for DMX communications.
These methods assume:
1. you can access the Kerberos Distribution Center (KDC) for the realm across the network
2. a valid /etc/krb5.conf file exists on a local machine
3. if your JRE/JDK is older than 8u161, you installed the Java Cryptography Extensions (JCE)
unlimited strength policy files.
Using DMX-h to manage Kerberos authentication
We recommend using DMX-h to reference Kerberos tickets, which you can implement with the
following procedure.
1. Stop all DMX jobs and tasks.
2. Verify that the /etc/krb5.conf file exists. If not, you need to install Kerberos on this machine.
3. Set Kerberos environment variables:
a) Set DMX_KERBEROS_KEYTAB to the absolute path to and name of your keytab file
b) Set DMX_KERBEROS_PRINCIPAL to the identity to be authenticated, or principal, to
which Kerberos assigns or has assigned the Kerberos ticket.
4. If your Kerberos service uses AES-256 encryption and your JDK/JRE is older than Java 8
update 161, install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction
Policy JAR file, which you can download for your Java version from the Oracle website.
Managing Kerberos authentication using a Kerberos client
For this alternative process, we recommend the following procedure:
1. Close all DMX job and task editors.
2. Verify that the /etc/krb5.conf file exists. If not, you need to install Kerberos on this machine.
3. Add a kinit command to a startup configuration file, such as .bashrc, to validate credentials
with Kerberos at startup. kinit generates or validates a ticket shared with all applications
including DMExpress.
4. Restart the Terminal or machine to load the new configuration.
Once a ticket is validated at startup, you can execute DMExpress jobs from the Terminal.
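The startup-file entry in step 3 might look like the following sketch; the principal, realm, and keytab path are illustrative assumptions to adjust for your environment:

```shell
# Added to $HOME/.bashrc: obtain a Kerberos ticket non-interactively
# at login using a keytab, so DMExpress jobs inherit a valid ticket.
kinit -k -t /home/dmxuser/dmxuser.keytab dmxuser@EXAMPLE.COM
klist    # optional: confirm the ticket was granted
```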
JDBC Connection URL
The JDBC URL you create when you develop a DMX job remains the same when you execute the job
on the cluster. Please refer to “JDBC Connection URL” under JDBC setup for Windows for details on
the URL string.
If the URL references any local files, for example sslTrustFile, update the URL string to reference
files stored on the Linux machine. You can avoid this requirement by using environment variables
in the URL for the paths to local files, so that you can set these file paths in each environment and
use the same URL in all of them.
Setting up Hive JDBC connections for Windows
DMX-h uses Hive Java Database Connectivity (JDBC) on Windows to design jobs. DMX-h supports
Hive connections on Windows at design time only, not at runtime.
To set up a JDBC connection from DMX-h to Hive on a Windows machine, you must:
1. Install a JDK/JRE on the Windows machine
2. Set the JAVA_HOME environment variable
3. Download a Hive JDBC JAR file
4. Configure the JDBC ini file
5. (Optional) Secure the database connection using Kerberos
6. Set the JDBC connection URL in DMX
Prerequisites
To successfully configure Hive JDBC and test a connection to a database from DMX-h on Windows,
set up the following resources and permissions:
1. On the cluster, complete cluster provisioning
2. Gather contact information for Cloudera or Hortonworks support
3. On the edge node, install database clients
4. Access a workstation with a supported Windows version, which must be Windows XP, 7, 8.x,
10, Server 2003, Server 2008 or Server 2012, either 32-bit or 64-bit
5. On the workstation, access to a local Administrator user account
6. On the workstation, access to the DMX-h Windows Installer
7. On the workstation, to use SSL, confirm read access to the trust store file and that you know
the password
8. On the workstation, install database clients
9. For each database, obtain a user ID that can access at least one test table
Network connectivity
A Windows machine must be able to connect to the following machines on the appropriate ports and
resolve their hostnames:
1. The Hive server host(s) on the appropriate port (default 10000)
2. On the edge node, ports 22 (or configured port) for SSH and 32636 (or configured port) for
DMX
3. If using ZooKeeper service discovery:
a. the ZooKeeper hosts in the connection string on the given port (default 2181)
b. all Hive server hosts/ports known to ZooKeeper
4. If the cluster is secured with Kerberos:
a. The Kerberos KDC on port 88
b. The Kerberos admin servers (from the krb5.ini file) on port 749
Kerberos prerequisites
Before setting up Kerberos on Windows:
1. Configure Kerberos using a local Windows administrator user.
2. Set the Windows environment variable DMX_KERBEROS_PRINCIPAL to a principal name
3. Copy a keytab file for the appropriate principal to the Windows machine
4. Set the Windows environment variable DMX_KERBEROS_KEYTAB to the path to the
keytab file
5. Copy the Kerberos config file (e.g. /etc/krb5.conf) from an edge node to the Windows machine
as C:\Windows\krb5.ini
SSL/TLS prerequisites
If you use SSL/TLS and the certificate is self-signed or signed by an in-house Certificate Authority
(CA), ensure that a Java keystore/truststore and its password are available.
Installing a JDK/JRE on the machine
DMExpress requires a Java Runtime Environment (JRE) to connect to Hive, so either a JRE or a
Java Development Kit (JDK) must be installed on your system. We recommend installing the same
JRE or JDK version Hadoop or Hive uses on the cluster.
1. Check your system to see if Java is already installed. To check for a JRE install on Windows,
navigate to Add/remove programs in the control panel and look for Java ‘X’ or Java ‘X’ update
‘XX’.
2. If Java is not present on your Windows system, download a JRE or JDK install package from
the Oracle website that matches the bit format (either 32-bit or 64-bit) of the DMX
application installed on Windows. You can identify the bit format of your DMX install by
a) Reviewing the log of any DMX job that ran on Windows, or
b) opening a command prompt, typing dmexpress /quit, and pressing Enter. The command
prompt displays the DMX install information similar to that shown below.
C:\WINDOWS\System32>dmexpress /quit
[DMExpress 9.10 Windows x64 64-bit Copyright (c) 2020 Precisely Inc.]
Setting the JAVA_HOME environment variable
To implement a JDBC connection on a Windows operating system, configure the JAVA_HOME system
variable with the Java install path. To check the JAVA_HOME value and update it for all users:
1. Open a console application with a command prompt
2. Type echo %JAVA_HOME% and press Enter. If the JAVA_HOME variable is set, a message
containing the value stored in the variable appears, similar to the following:
C:\WINDOWS\System32>echo %JAVA_HOME%
C:\Program Files\Java\jdk1.8.0_191\jre
If the path displayed is the current install directory for the Java JRE or JDK, the
JAVA_HOME variable is set correctly.
3. If the message is not the current Java JRE or JDK install directory, edit the JAVA_HOME
variable value. To edit system variables, you need local administrator privileges.
a) Click the Windows button, type environment in the search bar, and select "Edit the system
environment variables" from the search results
b) Select the JAVA_HOME system variable and click Edit
c) In the Variable Value dialog box, type the full path for the Java JRE or JDK
d) Click OK
e) Verify that the JAVA_HOME variable is in the list of System Variables and its value is
the current Java JRE or JDK install directory.
f) Click OK
4. If the message is simply %JAVA_HOME%, create a JAVA_HOME system variable. To create a
system variable, you need local administrator privileges.
a) Click the Windows button, type environment in the search bar, and select "Edit the system
environment variables" from the search results
b) In the System Variables section, choose New
c) In the Variable Name dialog box, type JAVA_HOME
d) In the Variable Value dialog box, type the full path for the Java JRE or JDK
e) Click OK
f) Verify that the JAVA_HOME variable is in the list of System Variables and its value is
the current Java JRE or JDK install directory.
g) Click OK
5. If you are unable to create or edit system variables, create the JAVA_HOME user variable.
Downloading a Hive JDBC driver
To connect to Hive through JDBC during development, DMX requires the Hive JDBC driver and its
dependencies installed on your Windows workstation. All Hadoop distributors ship drivers as a
package you can download.
• Download the Cloudera Hive JDBC documentation and JAR package from the Cloudera
website.
• Download the Hortonworks JDBC driver and documentation from the downloads page on the
Cloudera website. We recommend using the latest version.
The Cloudera and Hortonworks drivers are both re-branded versions of Simba JDBC drivers. When
the DMX Help references the Simba JDBC driver, the reference also applies to any Cloudera or
Hortonworks JDBC driver you can use.
• For MapR, please refer to the MapR documentation website to download the driver
• For other distributions, the Apache Hive JDBC driver can be used, but is not recommended.
Use a “standalone” version of the driver, if available.
Once downloaded, create a folder in your filesystem and copy the JDBC package into it.
Configuring JDBC INI
Configure DMX to use the JDBC jar files you downloaded:
1. Create a new or edit an existing JDBC configuration file for DMX
2. Create the DMX_JDBC_INI_FILE system variable
Editing the JDBC configuration file
To use the JDBC driver jar, add JDBC configuration information to a configuration file DMX can
access. Because you configure the path to this file in a system variable below, you can use any
location DMX can access.
1. Create a new or edit an existing JDBC configuration file
2. Type Hive JDBC parameters and values in this file, as described below.
Parameter Description
DriverName JDBC driver class. Consult the JDBC configuration
documentation from the Hadoop vendor for its value. This name
matches the driver class name used on an edge node.
DriverClassPath
Path to the Hive JDBC driver file.
In some cases, the path contains a JDBC driver but does not
include all the required Java components and interferes with
Hive connections. Resolve this by installing a standalone JDBC
driver from Apache, which includes all necessary components,
and changing the DriverClassPath variable to the standalone
driver path.
IsSchemaSupported
Optional. Set to true to ensure that all specified database tables
are identified correctly when DMX cannot determine whether
the DBMS supports a schema, such as in Hive. Valid values are
true or false and are case-insensitive.
3. Save the file and record its file path to use in the system variable
The values in the configuration file must match the driver name set by your Hadoop distributor.
Please consult their documentation for the exact driver name and driver file name.
Windows: JDBC Configuration File Examples
# Cloudera JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.cloudera.hive.jdbc41.HS2Driver
DriverClassPath=C:\HiveJDBC\Cloudera_HiveJDBC41_2.5.19.1053
# Hortonworks JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.simba.hive.jdbc41.HS2Driver
DriverClassPath=C:\HiveJDBC\Simba_HiveJDBC41_1.0.40.1052
# Apache Hive JDBC driver for HiveServer2
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=C:\Program Files (x86)\Hive\Hive-0.14.0.2.2.9.1-7\lib
Creating the DMX_JDBC_INI_FILE environment variable
Create a system variable DMX_JDBC_INI_FILE and assign the full path of the JDBC configuration
file as its value. The full path includes the directory location and the JDBC configuration file name.
To create this variable using the Windows Explorer:
1. Navigate to the Windows System Properties > Advanced tab
2. Click Environment Variables
3. In the System Variables section, choose New
4. In the Variable Name dialog box, type DMX_JDBC_INI_FILE
5. In the Variable Value dialog box, type the full path of your JDBC configuration file.
6. Click OK
7. Verify that the DMX_JDBC_INI_FILE variable is in the list of System Variables and its
value is the full path of your JDBC configuration file.
8. Click OK
We recommend creating it as a system variable on Windows (similar to the way the JAVA_HOME
variable is created).
Restart all DMX GUI programs after setting the variables, to ensure that the new values take effect.
Securing a database connection using Kerberos
Note: You can only manage Kerberos authentication in DMX-h over JDBC connections.
You can configure Kerberos authentication outside of DMX-h for either ODBC or JDBC
connections.
To connect to Kerberos secured databases, DMX requires valid Kerberos tickets to authenticate its
identity as a trusted client. To obtain a valid Kerberos ticket, you can:
• (Recommended for Windows) Use DMX-h to leverage Java Authentication and Authorization
Service (JAAS) to generate a ticket and automatically supply it to the JDBC driver. We
recommend this method for Windows because DMX is usually the first application to connect
to Hive outside the cluster. However, if your organization prohibits keytab configuration on
Windows, you must manage Kerberos tickets outside of DMX.
• Use a Kerberos client, such as the Java kinit command, outside of DMX to generate a ticket
and set up the JDBC driver to use it.
NOTE: Windows ships with a kinit command that is not the Java kinit
command.
These methods assume:
1. you can access the Kerberos Distribution Center (KDC) for the realm across the network
2. a valid C:\Windows\krb5.ini file exists on the local machine
3. If your JDK/JRE is older than Java 9 or 8u161 and your Kerberos service uses AES-256
encryption, you installed the Java Cryptography Extension (JCE) Unlimited Strength
Jurisdiction Policy JAR file, which you can download for your Java version from the Oracle
website.
Using DMX-h to manage Kerberos authentication
To enable DMX to manage Kerberos tickets for jobs, you must set the required DMX Kerberos
environment variable and, when connecting via JDBC on Windows, you must verify the
location of the Kerberos configuration file. Specifically, DMX calls the Kerberos kinit utility to
retrieve a Kerberos ticket with valid credentials before executing a job. Upon job initiation, DMX
provides authentication information to the DBMS. After job execution, DMX calls the Kerberos
kdestroy utility to destroy the Kerberos ticket.
DMX manages Kerberos authentication for
• running jobs from the Run job dialog
• connecting to Hive at design time
Note: To manage Kerberos tickets for DMX tasks, see Managing Kerberos outside of DMX
below
Set the following Kerberos environment variables in the Environment Variables tab of the DMX
Server dialog:
• DMX_KERBEROS_PRINCIPAL - required - the principal to be authenticated, to which
Kerberos assigns a ticket. Setting this variable enables DMX control of Kerberos tickets,
including calling the kinit utility.
• DMX_KERBEROS_KEYTAB - optional - the name and location of the principal's Kerberos
keytab file; if not specified, the default Kerberos keytab name and location will be used,
<user_home>\krb5.keytab on Windows.
We recommend using DMX-h to reference Kerberos tickets, which you can implement with the
following procedure.
1. Close all DMExpress job and task editors.
2. Copy the /etc/krb5.conf file from a Linux edge node or server running Kerberos into the
C:\Windows directory and rename it krb5.ini.
3. Generate or copy a keytab file onto your Windows machine.
a) To generate a keytab file, use ktutil with a valid Kerberos server user account. You
may refer to the University of Indiana website for examples.
b) To copy the keytab file, you need to know where it saved on your machine or a Kerberos
edge node or server
4. Set Kerberos environment variables:
a) Set DMX_KERBEROS_KEYTAB to the absolute path to and name of your keytab file
b) Set DMX_KERBEROS_PRINCIPAL to the identity to be authenticated, or principal, to
which Kerberos assigns or has assigned the Kerberos ticket
5. In the server setup dialog, test authentication using the Verify Connection button.
Managing Kerberos authentication outside of DMX
You can manage Kerberos tickets outside of DMX using Kerberos clients for Windows. You should
use these clients when:
1. Observing a policy prohibiting keytab configuration on Windows
2. Connecting with DMX-h via JDBC at design time using the Task Editor
3. Running dmxjob at the command line
To manage Kerberos tickets outside of DMX on Windows, consider the following:
1. Clear the DMX_KERBEROS_PRINCIPAL variable in the Environment Variables tab of the
DMX Server dialog, disabling DMX-h Kerberos authority.
2. Ensure that the Kerberos client utilities, such as kinit and klist, are installed on the
DMX workstation.
3. Copy the Kerberos configuration file (e.g. /etc/krb5.conf) from the edge node to
C:\Windows\krb5.ini on the Windows machine.
4. Set the Windows environment variable KRB5_CONFIG to the path to the Kerberos
configuration file for the Kerberos client utility.
5. Set the Windows environment variable KRB5CCNAME to the path to the Kerberos
cache file.
6. When managing Kerberos authentication outside of DMX, run the Kerberos kinit utility to
initialize a ticket and attempt to authenticate. If this fails, run the Kerberos klist utility to
determine whether you have a valid ticket or not. If your ticket is valid, correct problems
with your Windows or Java configuration. If your ticket is invalid, correct problems with
your Kerberos configuration.
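For example, the two variables from steps 4 and 5 might be set as follows; both paths are hypothetical placeholders:

```
KRB5_CONFIG=C:\Windows\krb5.ini
KRB5CCNAME=C:\Users\etluser\krb5cache
```

With these set, run kinit to initialize a ticket and klist to inspect it, as described in step 6.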
Upon job/task initiation, DMX provides authentication information to the DBMS.
Setting the JDBC Connection URL in DMX
DMExpress job developers can set the JDBC Connection URL in DMX; no administrator privilege is
required.
1. Create or open a DMExpress task and open the Database Connection dialog.
2. Select Hive as the DBMS and select JDBC as the access method.
3. Set the following fields in the Database Connection dialog:
Field Name Description
Name Connection name representative of database and type of
connection made.
DBMS Hive
Access Method JDBC
Database Provide a connection URL as described below
Authentication Kerberos if using Kerberos. Ignore otherwise
Connect As Ignore.
The connection URL string has the following format:
jdbc:hive2://<host>:<port>/<database>[;<attribute>=<value> [;...]]
where
Field Name Description
host The IP address or hostname of the Hive server. To use a
hostname, we recommend a fully qualified domain name (FQDN) to
avoid the machine auto-resolving the hostname to an incorrect domain.
port The listening port for the Hive service. The default is 10000.
database The name of the database schema you want to use. If the
database controls access using Sentry / Ranger, make sure you can
access the database and have read/write permissions on the table.
The default schema is default.
propertyName Semicolon-separated list of attributes that instruct the Hive
server to perform various tasks.
For the Cloudera and Hortonworks drivers, the following
parameter is always required:
UseNativeQuery=1
If you use a Cloudera or Hortonworks driver on a cluster or server
secured with Kerberos, add the following parameters:
AuthMech=1;KrbHostFQDN=<HiveFullyQualifiedDomainName>;KrbServiceName=hive;UseNativeQuery=1
You may also need to add ;KrbRealm=<Realm> if the Hive server is
not in the default realm.
If you use a HiveServer2 driver over SSL, add:
SSL=1;SSLKeyStore=<keystore_path>;SSLKeyStorePwd=<keystore_password>
where <keystore_path> is the full path of the sslTrustStore file
saved locally on the system.
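Putting the pieces together, a Kerberos-secured connection URL might be assembled as in the following sketch; the host, port, schema, and realm values are hypothetical placeholders for your own cluster:

```shell
# Hypothetical values; substitute the FQDN, port, and schema for your cluster.
HIVE_HOST="hiveserver1.example.com"
HIVE_PORT=10000
HIVE_DB="default"
# Kerberos-secured URL using the required Cloudera/Hortonworks parameters.
URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/${HIVE_DB};AuthMech=1;KrbHostFQDN=${HIVE_HOST};KrbServiceName=hive;UseNativeQuery=1"
echo "$URL"
```

Append ;KrbRealm=<Realm> to the URL if the Hive server is not in the default realm.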
Hive ODBC connection
Hive ODBC connections can be used for Hive sources and targets. Configuring Hive ODBC requires
the following steps, described in detail in the sections below:
1. Install and configure the Hive ODBC driver
2. Define a Hive ODBC data source
Installing and configuring the Hive ODBC driver
Ensure you have administrator/root privileges on the computer before you install the
driver.
Installing and configuring on Windows
1. Go to one of the following Hadoop vendor websites and download the Windows 32-bit Hive
ODBC driver and associated documentation. For example:
• Cloudera: http://www.cloudera.com/downloads/connectors/hive/odbc/2-5-21.html
• Hortonworks: http://hortonworks.com/hdp/addons/
• MapR: http://package.mapr.com/tools/MapR-ODBC/. Select the latest version of the file
MapR_odbc_<n.n.n>_x86.exe.
2. After downloading the file, double-click the file to run the installer.
3. Follow the installer's instructions and use the default settings.
For additional information about the installation and configuration settings, see the vendor's
documentation.
Installing on Linux
1. Go to one of the following Hadoop vendor websites and download the Linux 64-bit Hive
ODBC driver and associated documentation. For example:
• Cloudera: http://www.cloudera.com/downloads/connectors/hive/odbc/2-5-12.html .
Download the appropriate RPM for your Linux distribution.
• Hortonworks: http://hortonworks.com/hdp/addons/. Download the appropriate tar file
for your Linux distribution and extract the RPMs from it.
• MapR: http://doc.mapr.com/display/MapR/Hive+ODBC+Connector and
http://package.mapr.com/tools/MapR-ODBC/. Navigate the directories for your Linux
distribution and download the appropriate RPM.
2. Unpack the RPM package to install the driver files in the vendor's default location:
rpm -i <vendor-file>.rpm
3. Note the location of the installed files for later configuration steps. The default installation
location depends on the vendor and may be one of the following:
• /opt/Cloudera/hiveodbc/
• /usr/lib/hive/lib/native/hiveodbc/
• /opt/mapr/hiveodbc/
4. Continue with Configuring the Hive ODBC driver on Linux.
Configuring the Hive ODBC driver on Linux
The Hive ODBC driver installation includes the file <users-home>/.<vendor>.hiveodbc.ini, which
you use to configure the specific vendor's Hive ODBC driver. By default, this file begins with a
leading period and is installed in the user's home directory.
1. If you decide not to use the default location and file name for .<vendor>.hiveodbc.ini, set an
environment variable to locate the file. These examples assume you put the file (without a
leading period) in the /etc/ directory:
• Cloudera:
• For Cloudera Hive ODBC driver version 2.5.12 and higher: export
CLOUDERAHIVEINI=/etc/cloudera.hiveodbc.ini
• For Cloudera Hive ODBC driver versions prior to 2.5.12: export
SIMBAINI=/etc/cloudera.hiveodbc.ini
• Hortonworks:
• For Hortonworks Hive ODBC driver version 0.11 and higher: export SIMBAINI=/etc/hortonworks.hiveodbc.ini
• For Hortonworks Hive ODBC driver versions prior to 0.11: export
SIMBAINI=/etc/hortonworks.hiveodbc.ini
• MapR: export MAPRINI=/etc/mapr.hive.odbc.ini
2. Set the following driver manager options in your vendor-specific configuration file,
<vendor>.hiveodbc.ini, under the Driver section. The default DMX-h driver manager is
unixODBC.
a) Set DriverManagerEncoding to UTF-16.
b) Set ODBCInstLib to identify the ODBC installation's shared library for the ODBC driver
manager. The DMX-h default location is <dmx>/lib/libodbcinstSSL.so.
[Driver]
DriverManagerEncoding=UTF-16
ODBCInstLib=<dmx>/lib/libodbcinstSSL.so
3. Ensure that the Hive ODBC driver library is included at the beginning of the system library
path, LD_LIBRARY_PATH, by running the following command:
export LD_LIBRARY_PATH=<vendor's-Hive-ODBC-driver-installation>/lib:$LD_LIBRARY_PATH
4. The default location DMX-h uses for the .odbcinst.ini ODBC configuration file is <dmx>/etc.
If you decide to use a different location, set the ODBCSYSINI environment variable to the
directory containing your file.
5. Configuration options set in configuration file odbcinst.ini apply to all Hive connections.
Create a section for the Hive ODBC driver and set the following options as follows:
[ODBC Drivers]
<vendor> Hive ODBC Driver 64-bit=Installed
. . .
[<vendor> Hive ODBC Driver 64-bit]
Description=<vendor> Hive ODBC Driver (64-bit)
Driver=/<vendor's-Hive-ODBC-driver-installation>/<vendor's_Hive_ODBC_library_file>
For additional information about the installation and configuration settings, see the vendor's
documentation.
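As a concrete sketch, with the Cloudera driver installed in its default location, the odbcinst.ini entries from step 5 might look like the following; the driver name and library file name are assumptions that depend on your driver version, so check the /lib/ directory of your installation:

```
[ODBC Drivers]
Cloudera ODBC Driver for Apache Hive 64-bit=Installed

[Cloudera ODBC Driver for Apache Hive 64-bit]
Description=Cloudera ODBC Driver for Apache Hive (64-bit)
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
```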
Defining a Hive ODBC data source
To identify a Hive ODBC data source, you create a data source name (DSN) and set the options
required to connect to the data source.
Defining a Hive data source on Windows
1. Start the ODBC Data Source Administrator by following the instructions for Windows
systems at Defining ODBC Data Sources.
2. In the Create New Data Source dialog, select your Hive ODBC vendor's driver from the list,
and then click Finish.
3. Use the following settings in the ODBC Driver DSN Setup dialog for a Hive ODBC data
source:
Data Source Name Enter a name to identify the Hive DSN.
Host Enter the IP address or hostname of the Hive server.
Port Enter the listening port for the Hive service. The default is 10000.
Database Enter the name of the database schema you want to use. The default
schema is default.
Hive Server Type Select Hive Server 2.
Authentication mechanism Most Hive installations use User Name authentication by default. The
authentication mechanism for a Hive data source must match the
mechanism in use on the Hive server or the connection fails. Check with
your Hadoop system administrator.
Advanced Options Check Use Native Query and then click OK. Some ODBC Hive drivers
work with both HiveQL and SQL query languages. This setting enables
the use of native HiveQL instead of SQL.
Defining a Hive data source on Linux
Find general instructions for UNIX systems at Defining ODBC Data Sources.
1. The default location DMX-h uses for the odbc.ini ODBC configuration file is
<dmx>/etc/odbc.ini. If you decide to use a different location and file, set the ODBCSYSINI
environment variable to the full path and file name of your file.
2. In the odbc.ini file, add a new Hive data source entry to the [ODBC Data Sources] section.
Use the format <data-source-name>=<your-driver-description>:
[ODBC Data Sources]
Sample Hive DSN 64=Hive ODBC Driver 64-bit
3. Configure the new Hive data source by adding a section similar to the following to the
odbc.ini file. Note that sample values are shown. Consult your Hadoop system administrator
for guidance on settings appropriate for your environment:
[Sample Hive DSN 64]
Driver=/<vendor's-Hive-ODBC-driver-installation>/<vendor's_Hive_ODBC_library_file>
HiveServerType=2
HOST=<hive-server>
PORT=10000
UseNativeQuery=1
AuthMech=2
The following table lists valid values:
Odbc.ini options Description
Driver Set the location of the installed Hive ODBC Driver file. Find the driver file
<vendor’s_Hive_ODBC_library_file>, for example,
libhortonworkshiveodbc64.so, under your installed files /lib/ directory.
HiveServerType Set the HiveServerType to 2, for HiveServer2. HiveServer2 is a newer
version of HiveServer with improvements and additional features.
1 (default) HiveServer
2 HiveServer2
HOST Set the IP address or hostname of the Hive server.
PORT Set the listening port for the service. The default port for DMX-h Hive
installation is 10000.
UseNativeQuery Set the UseNativeQuery value to 1. Some Hive ODBC drivers work with
both HiveQL and SQL query languages.
0 (default) enables the SQL Connector feature
1 enables the HiveQL query language and disables the SQL Connector
feature
AuthMech Set the AuthMech value to the number representing the same
authentication mechanism as the Hive server. Most Hive installations use
User Name authentication (value 2) by default.
0 no authentication
1 Kerberos
2 (default) User Name
3 User Name and Password
4 User Name and Password (SSL)
5 Windows Azure HDInsight Emulator
6 Windows Azure HDInsight Service
7 HTTP
8 HTTPS
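For example, a DSN for a Kerberos-secured HiveServer2 instance would change AuthMech to 1 and add Kerberos options. The host and paths below are hypothetical, and the Kerberos key names mirror those shown earlier for the JDBC URL; they may vary by driver, so consult your driver documentation:

```
[Kerberos Hive DSN 64]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HiveServerType=2
HOST=hiveserver1.example.com
PORT=10000
UseNativeQuery=1
AuthMech=1
KrbHostFQDN=hiveserver1.example.com
KrbServiceName=hive
```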
Hive table staging
When writing to Hive targets, DMX-h stages the data as a text-backed Hive table when connecting
via:
• JDBC
• ODBC
Note:
• In all cases, sufficient space as well as CREATE TABLE privileges are needed to stage the
tables.
• For the ODBC cases, DMX-h writes the tables using the hive command, which must be in the
path.
Hive table creation security
With Hive version 0.13 and higher, the default security does not allow the user who creates the table
to read from or write to the table. To enable reading from and writing to the table without having to
modify access permissions after creating the table, do the following:
• In hive-site.xml, add the property hive.security.authorization.createtable.owner.grants and
set its value to SELECT and UPDATE.
• Ensure the user has read/write privileges to Hive data files on the Hadoop file system.
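A sketch of the corresponding hive-site.xml entry follows; the property name comes from the text above, while the placement within your existing configuration and the exact value formatting may vary by distribution:

```xml
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>SELECT,UPDATE</value>
</property>
```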
Sentry and Ranger authorization
DMX-h is compatible with the following authorization schemes:
• Cloudera Sentry
• Apache Ranger
Cloudera Sentry
DMX-h is certified to work with Cloudera's Sentry authorization of Hive databases, which requires
the following to be enabled in the Cloudera cluster:
• HDFS Access Control Lists (ACLs)
• automatic synchronization of HDFS ACLs with Sentry privileges
Note: When using Sentry, Hive impersonation is disabled by default. To ensure access to the
Work table directory, the default Hive user must have the correct permissions.
Apache Ranger
DMX-h is compatible with Apache Ranger, a framework for enabling, monitoring, and managing
data security across the Hadoop platform. Ranger works with Apache Hadoop (HDFS), Apache Hive,
Apache Kafka, and YARN, among other Apache projects.
Note: Ranger is currently designated as an Apache incubator project, and there are gaps in what it
works with in the Hadoop ecosystem, such as Apache HCatalog. Additionally, it does not work with
Amazon S3 or other cloud-based distributed filesystems.
Apache Impala
Apache Impala is a native analytic database for Apache Hadoop. Through JDBC connectivity,
DMX-h supports Impala databases as sources and targets when running on the ETL server/edge
node, in the cluster, and on a framework-determined node in the cluster.
Maximum length
The maximum post-extraction length that DMX-h supports for an Impala database record is
16,777,216 bytes (16 MB).
Impala connections
Connecting to Impala requires configuration steps before the connections can be defined. Connection
requirements and behavior differ between Impala sources and targets.
Impala source connections
Using an Impala JDBC connection, DMX-h can read supported Impala data types from all supported
Impala file types: Apache Avro, Apache Parquet, Record Columnar (RCFile), Text, and SequenceFile.
Note: As per Impala limitations, DMX-h can read complex data types, which include structures and
arrays, only from Parquet-backed tables in Impala.
JDBC connectivity
When DMX-h reads from an Impala database table on an ETL server/edge node or in the cluster via
JDBC, the data is staged temporarily in uncompressed format to a text-backed Impala table.
Impala target connections
Using an Impala JDBC connection, DMX-h can write supported Impala data types to Impala targets
directly for optimal performance.
Note: As per Impala limitations, DMX-h can write complex data types to Parquet-backed tables in
Impala via a Hive database connection only, not via a JDBC connection.
JDBC connectivity
When DMX-h writes to an Impala database table via JDBC, data is generally loaded directly into
target tables. Writes are staged temporarily in compressed or non-compressed format to a
text-backed Impala table only when one or more of the following conditions limits direct access:
• A target table has one or more partitions
• A parquet-backed target table has any timestamp columns
• A target table performs Truncate or Apply Change (CDC) dispositions
• The job runs on localnode or singleclusternode
• A user forces DMX-h to stage data by setting the environment variable
DMX_IMPALA_TARGET_FORCE_STAGING to 1, which uses the two-step process
implemented in previous versions of DMX
Update and Upsert dispositions are supported only for Kudu tables.
At run-time, DMX-h accesses the Kudu jars from /opt/cloudera/parcels/CDH/lib/kudu on the
edge/master node for Impala access. You can override this default location by using the environment
variable KUDU_HOME. For example, export KUDU_HOME=/opt/cloudera/parcels/CDH/lib/kudu
sets the location accessed at run-time to /opt/cloudera/parcels/CDH/lib/kudu.
Impala configuration
Connecting to Impala from DMX-h requires the following configuration components:
• Impala JDBC connection
• Impala table staging
• Apache Sentry authorization when applicable
Impala JDBC connection
To connect to Impala via JDBC on Windows at design time, download the JDBC driver and specify
the mandatory driver name and driver class path parameters in the JDBC configuration file:
• Download the applicable Cloudera Impala JDBC Simba-based driver.
See Configuring Impala to Work with JDBC.
• Set the driver name and driver class path in the JDBC configuration file.
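As a sketch, the two mandatory entries in the DMX JDBC configuration file might look like the following; the driver class name and jar path are assumptions that depend on the Cloudera driver version you download, so take the exact values from the driver's documentation:

```
DriverName=com.cloudera.impala.jdbc.Driver
DriverClassPath=/opt/jdbc/impala/ImpalaJDBC42.jar
```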
Impala table staging
When reading Impala sources and writing Impala targets, DMX-h stages the data as a text-backed
Impala table.
To stage the tables, sufficient space and CREATE TABLE privileges are required.
Defining Impala database connections
In the Database Connection dialog, the general pattern to define a connection to an Impala database
is as follows:
• At DBMS, select Impala.
• At Access Method, select JDBC.
• At Database, select a previously defined Impala JDBC database connection URL.
• At Authentication, select Auto-detect or Kerberos.
Note: When Kerberos authentication is required, ensure that Kerberos is selected.
Defining Impala sources
For all DMX-h ETL jobs, DMX-h supports Impala database tables as sources and as lookup sources.
At the Source Database Table dialog or the Lookup Source Database Table dialog, define either an
Impala database table source or lookup source, respectively:
• At Connection, select a previously defined Impala source connection or select Add
new... to add a new connection.
• On the Parameters tab, the following optional parameters are available for Impala database
table sources and lookup sources:
o Filter - equivalent to the text that follows a WHERE clause in a SQL query, the filter
parameter specifies the condition upon which records are extracted from an Impala
source table.
o For partitioned Impala database table sources and lookup sources, you can specify
a partition predicate in the WHERE clause, which serves as a filter that enables
partition pruning and limits scanning to those portions of the table relevant to
partitions.
o Work table directory - serves as the parent-level directory beneath which job-specific
subdirectories are created for staging data.
o Work table schema - the schema used to create the staging table.
o Impala configuration properties - any Impala configuration property can be entered
manually in the parameters grid.
Defining Impala targets
At the Target Database Table dialog, define an Impala database table target:
1. At Connection, select a previously defined Impala target connection or select Add
new... to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one. By default,
DMX-h creates text-backed Impala database tables; to create an Impala table backed by
some other file format, follow the instructions in the Create Database Table dialog help topic,
with the following modification:
3. Click View SQL.
4. In the SQL textbox, change STORED AS TEXTFILE to STORED AS <file_format>,
where <file_format> is the keyword for the applicable file format, such
as AVRO or PARQUET.
• User defined SQL statement is not supported.
• All target disposition methods are supported.
• All partition columns must be mapped.
5. On the Parameters tab, the following optional parameters are available for Impala target
database tables:
• Compute table statistics - To optimize subsequent Impala query performance, DMX-h
can run Impala analyze queries that collect target table statistics and target column
statistics after the load to the Impala target database.
o Valid values include true and false (default). If you specify false or if a parameter
value is blank, DMX-h does not run the parameter-specific query after the load to the
Impala target database.
o When Impala auto-analysis is enabled and DMX-h loads via staging table to the
Impala target database, Impala automatically computes table statistics, but not
column statistics, and stores the table statistics to the metastore.
• Maximum parallel streams - the maximum number of parallel streams that can be
established to load data; streams are created according to demand. This
value can also be specified via the environment
variable DMX_IMPALA_MAX_WRITE_THREADS. If specified both ways, the parameter
value takes precedence. If neither is specified, the default value is either the number of
CPUs on the edge node when running on the ETL server/edge node or is 1 for each
instance of DMX-h when running on the cluster.
• Work table codec - specifies the compression algorithm used to compress Impala data.
• Work table directory - serves as the parent-level directory beneath which job-specific
subdirectories are created for staging data.
• Work table schema - the schema used to create the staging table.
• Impala configuration properties - any Impala configuration property can be entered
manually in the parameters grid.
Note: The Set commit interval and Abort task if any record is rejected options are not supported.
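For instance, to cap the load at four parallel streams via the environment variable when the task parameter is not set, you might do the following; the value 4 is arbitrary for illustration:

```shell
# Cap DMX-h Impala load parallelism at 4 streams.
# The Maximum parallel streams task parameter, if set, takes precedence.
export DMX_IMPALA_MAX_WRITE_THREADS=4
echo "$DMX_IMPALA_MAX_WRITE_THREADS"
```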
Microsoft SQL Server
Your SQL Server client must be installed on the system and configured so that it can connect to
databases that you want to access from DMX. On 64-bit Windows, a SQL Server Native Client must
also be installed. Please refer to specific SQL Server documentation for details on configuring the
client.
Windows Systems
A SQL Server data source needs to be defined for each database that you want DMX to access. The
data source should be named the same as the SQL Server database it points to. Choose a SQL Native
Client as the DBMS driver on 64-bit Windows. See Defining ODBC Data Sources for details on
defining data sources on Windows systems.
Netezza
Installation and Configuration
DMX connects to Netezza databases through the Netezza nzload client utility, which is a component
of the Netezza client software package, and the Netezza Open Database Connectivity (ODBC) driver.
For Windows and UNIX systems, the client software package includes the Netezza client interface
software and the Netezza ODBC driver.
To establish a connection to the Netezza database, install the Netezza client software package on the
system on which the DMX client is installed.
Netezza client software package and driver installation
Windows systems
For Windows systems, the client software installation includes the following:
1. Install the Netezza client software package.
For procedures on installing the Netezza client software and ODBC driver, refer to the
installation chapter of the IBM Netezza System Administration Guide.
The default Netezza client installation is located in the following directory:
C:\Program Files (x86)\IBM Netezza Tools
The default Netezza ODBC driver installation is located in the following directory:
C:\Program Files (x86)\IBM Netezza ODBC Driver
2. Verify that the ODBC driver libraries, which are dynamic linked libraries with the .dll
extension, are installed successfully.
3. Verify that the Netezza ODBC driver installation directory is specified in the PATH.
4. Create and configure the ODBC DSN.
5. Specify the Netezza client utilities directory in the PATH.
For example, set the PATH as follows:
set PATH=%PATH%;C:\Program Files (x86)\IBM Netezza Tools\bin
Note: If the Netezza nzds and nzload client utilities are not in the PATH when DMX initiates
a load to the Netezza database, DMX does the following:
• nzds - DMX issues a performance warning message and establishes only one connection
to the Netezza database.
• nzload - DMX issues an error message and the DMX task aborts.
6. To run the nzds client utility, ensure that the database user account has the Manage
Hardware privilege.
For additional information on required privileges, see the IBM Netezza System
Administration Guide and the Netezza Data Loading Guide.
7. Verify the port number used to connect to the Netezza database.
When the NZ_DBMS_PORT environment variable is defined, DMX connects to the Netezza
database using the value specified in NZ_DBMS_PORT; otherwise, DMX connects to the
Netezza database using the default port number, 5480.
UNIX systems
For UNIX systems, the client software installation includes the following:
1. Install the Netezza client software package.
For procedures on installing the Netezza ODBC driver and the client software, refer to the
installation chapter of the IBM Netezza System Administration Guide.
The default Netezza client installation is located in the following directory:
/usr/local/nz/bin
The default Netezza ODBC driver installation is located in the following directory:
/usr/local/nz/lib64
2. Create and configure the ODBC DSN.
3. Specify the Netezza client utilities directory in the PATH.
For example, export the PATH as follows:
export PATH=$PATH:/usr/local/nz/bin
Note: If the Netezza nzds and nzload client utilities are not in the PATH when DMX initiates
a load to the Netezza database, DMX does the following:
• nzds - DMX issues a performance warning message and establishes only one connection
to the Netezza database.
• nzload - DMX issues an error message and the DMX task aborts.
4. Set NZ_ODBC_INI_PATH to point to the directory where the odbc.ini file (without the
leading period, ".") is located.
For example, set NZ_ODBC_INI_PATH as follows:
export NZ_ODBC_INI_PATH=$NZ_ODBC_INI_PATH:<directory_where_odbc.ini_is_located>
5. To run the nzds client utility, ensure that the database user account has the Manage
Hardware privilege.
For additional information on required privileges, see the IBM Netezza System
Administration Guide and the Netezza Data Loading Guide.
6. Verify the port number used to connect to the Netezza database.
When the NZ_DBMS_PORT environment variable is defined, DMX connects to the Netezza
database using the value specified in NZ_DBMS_PORT; otherwise, DMX connects to the
Netezza database using the default port number, 5480.
NoSQL Databases
DMX can connect to any NoSQL database, for example, Apache Cassandra, Apache HBase, and
MongoDB, provided that you install the applicable NoSQL database client software and a compliant
ODBC driver or JDBC driver.
DMX requires a Level 3.0 compliant ODBC driver or a Level 4.0 compliant JDBC driver to connect to
a NoSQL database. Provided that your ODBC or JDBC driver supports NoSQL databases as sources
and targets, DMX supports NoSQL databases as sources and targets.
To verify the level of NoSQL database support that your ODBC or JDBC driver provides, contact
your ODBC or JDBC driver vendor.
Installation and configuration
DMX connects to NoSQL databases through the client software applicable to your NoSQL database
and through a compliant ODBC or JDBC driver.
Client software installation and configuration
To reference client software download information, links, and installation instructions that are
applicable to current NoSQL databases, for example Cassandra, HBase, and MongoDB, consider the
following sites:
• Cassandra: http://cassandra.apache.org/
• HBase: http://hbase.apache.org/
• MongoDB: http://www.mongodb.org/
To establish a connection to a NoSQL database, install the applicable client software on the system
on which DMX is installed.
ODBC driver installation
DMX requires a Level 3.0 compliant ODBC driver to connect to a NoSQL database. For driver
installation and configuration information, refer to your ODBC driver documentation.
To reference ODBC driver download information for Simba ODBC drivers, for example, consider the
following sites:
• Cassandra: http://www.simba.com/connectors/apache-cassandra-odbc
• HBase: http://www.simba.com/connectors/apache-hbase-odbc
• MongoDB: http://www.simba.com/connectors/mongodb-odbc
The installation documentation applicable to these sites outlines the steps to create the ODBC DSN
and provides links to advanced options specific to the Simba driver for Cassandra, HBase, and
MongoDB.
You can also reference Defining ODBC Data Source Names. While you can use any ODBC driver
manager to load ODBC drivers for UNIX systems, by default, DMX uses the shipped unixODBC
driver manager.
JDBC driver installation
DMX requires a Level 4.0 compliant JDBC driver to connect to a NoSQL database. For driver
installation and configuration information, refer to your JDBC driver documentation.
Oracle
Your Oracle client must be installed on the system and configured so that it can connect to databases
that you want to access from DMX. Please refer to specific Oracle documentation for details on
configuring the client.
Oracle naming method
Oracle supports multiple naming methods to resolve Connect Identifiers. DMX only supports the
Oracle Local Naming Method, which uses aliases defined in the tnsnames.ora configuration file on
the Oracle client machine. This file is expected to reside in the <oracle_home>/network/admin
directory, where <oracle_home> denotes the directory where Oracle is installed, or in the directory
pointed to by the TNS_ADMIN environment variable. The Task Editor always reads the list of
available databases from tnsnames.ora to automatically populate the list of databases in the
Database Connection dialog. The file has to be formatted according to the Oracle documentation on
syntax rules for configuration files. Otherwise, DMX may not be able to read the file correctly,
resulting in an empty or partial database list in the Database Connection dialog. Verify that
TNSNAMES is listed as one of the values of the NAMES.DIRECTORY_PATH parameter in the
Oracle Net profile sqlnet.ora. The TNSNAMES field indicates that local naming is enabled.
If TNSNAMES is not listed as one of the values of the NAMES.DIRECTORY_PATH parameter in
sqlnet.ora, run Oracle Net Configuration Assistant or Oracle Net Manager to add local naming
method and the Oracle databases you want DMX to connect to. The configuration utility updates the
Oracle Net profile, sqlnet.ora, located in <oracle_home>/network/admin.
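A minimal tnsnames.ora entry might look like the following; the alias, host, and service name are hypothetical placeholders, and the exact syntax rules are given in the Oracle documentation:

```
ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = orcl.example.com)
    )
  )
```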
Windows systems
To access Oracle databases, Oracle client software must be accessible via the dynamic link libraries
(DLLs) located under the <oracle_home>\bin folder. The actual location of the Oracle installation is
usually stored in the ORACLE_HOME environment variable.
If you have installed the 64-bit version of DMX on 64-bit Windows, there are some important
differences with respect to defining a DMX Task and running your application.
UNIX systems
To access Oracle databases, Oracle client software must be accessible via the shared libraries located
under the <oracle_home>/lib and <oracle_home>/network/lib directories. The name of the shared
library directory may vary, e.g. lib32 or lib64, depending on the Oracle version.
Snowflake
Snowflake is a cloud data warehouse that separates storage from the compute platform in a cloud
environment. Through JDBC connectivity, DMX-h supports Snowflake data warehouses as sources
and targets.
Snowflake connection requirements
Snowflake requires a JDBC connection configuration with the driver name and location for all
connections. Before attempting to connect to Snowflake, do the following:
• Install DMX server on an Amazon Elastic Compute Cloud (EC2) instance or your local
machine.
• Specify JDBC and cluster parallelization parameters in the DMX JDBC configuration file.
The parameters outlined in the DMX JDBC configuration file, as defined by the
DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and
optional values required to access an Amazon S3 bucket and to invoke a Snowflake
COPY/MERGE query.
• If DMX runs inside EC2, attach an IAM role to the EC2 instance with the following
conditions:
1. The attached IAM role must grant DMX read and write access to objects in the work bucket
specified in the configuration file.
2. Configure the IAM role for Snowflake.
3. If the IAM role configured for Snowflake is not the same role attached to EC2, set the
IAMROLE parameter in the configuration file to the IAM role configured for Snowflake.
Note: When DMX cannot get temporary security credentials from an IAM role, DMX issues
an error message and the DMX task aborts.
• When DMX runs outside of an EC2 instance, DMX accesses Snowflake using key-based
authentication. If no access keys are provided, DMX issues a UNIAMCRE error message and
aborts the job.
The parameters outlined in a DMX Snowflake configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel
streams that can be established for loading data to the staging area. By default,
MAXPARALLELSTREAMS is set to the number of CPUs available on the client machine.
• WORKTABLEDIRECTORY - Required path to an S3 bucket or local directory. If the path is
an S3 URL, s3://<bucket>, DMX creates external staging data. If the path is a local
directory, file://<user/data>, DMX creates internal staging data using the specified local
directory.
• WORKTABLESCHEMA - Optional schema name in which to create the staging data. The
default schema for the staging data is the same as that of the target.
• WORKTABLENCRYPTION - Server-side encryption algorithm for encrypting staging data in
the S3 bucket. Valid values are AES256 and aws:kms.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon
establishing an AWS account. If DMX runs in EC2, AWSACCESSKEYID is optional.
• AWSACCESSKEY - The 40-character string, which is also referred to as the secret access
key, which Amazon provides upon establishing an AWS account. If DMX runs in EC2,
AWSACCESSKEY is optional.
DMX requires the access key id and the secret access key to send requests to an Amazon S3
bucket.
• IAMROLE - Optional Amazon Resource Name (ARN) for an IAM role that Snowflake uses for
authentication and authorization if the same role is not attached to EC2. If EC2 and
Snowflake share the same role, this parameter is not required.
• LoadViaPut - Optional character. If WORKTABLEDIRECTORY is not set, DMX uses a PUT
command to load data when LoadViaPut is set to "y". If the work table directory isn't
provided and the LoadViaPut parameter isn't set to "y", the DMX job aborts with an error
message.
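Putting the parameters above together, a minimal Snowflake section of the DMX JDBC configuration file might look like the following sketch. The section name, driver class, jar path, and bucket name are illustrative assumptions, not values prescribed by this guide; adjust them for your environment.

```ini
# Hypothetical example only -- names and paths are assumptions.
[snowflake]
DriverName = net.snowflake.client.jdbc.SnowflakeDriver    # assumed driver class
DriverClassPath = /opt/jdbc/snowflake-jdbc.jar            # assumed jar location
MAXPARALLELSTREAMS = 8
WORKTABLEDIRECTORY = s3://my-work-bucket                  # S3 URL => external staging
WORKTABLESCHEMA = STAGING
WORKTABLENCRYPTION = AES256
```

Point the DMX_JDBC_INI_FILE environment variable at this file so that DMX can read it at run time.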
Defining Snowflake database connections
In the Database Connection dialog, define a connection to a Snowflake database as follows:
• At DBMS, select Snowflake.
• At Access Method, select JDBC.
• At Database, select a previously defined Snowflake JDBC database connection URL.
• At Authentication, select Auto-detect.
Snowflake target connections Using a Snowflake JDBC connection, DMX-h can write supported Snowflake data types to Snowflake
targets directly for optimal performance.
Defining Snowflake targets
At the Target Database Table dialog, define a Snowflake database table target:
1. At Connection, select a previously defined Snowflake target connection or select Add new... to
add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Snowflake target
database tables. Values specified here take precedence over their corresponding properties in
the JDBC configuration file, if any.
o Maximum parallel streams - the maximum number of parallel streams that can be
established to load data; streams are created according to demand.
o Work directory connection - name of the Amazon S3 connection that DMX uses to
connect and write to the work table directory.
o Work table codec - specifies the compression algorithm used to compress Snowflake
data.
o Work table directory - serves as the parent-level directory beneath which job-specific
subdirectories are created for staging data.
o Work table encryption - server-side encryption algorithm used to encrypt the staging data.
o Work table schema - the schema used to create the staging table.
4. Note that Set commit interval and Abort task if any record is rejected are not supported.
Snowflake source connections Using a Snowflake JDBC connection, DMX can read supported Snowflake data types from any
Snowflake table.
Defining Snowflake sources
For all DMX-h ETL jobs, DMX-h supports Snowflake database tables as sources and as lookup
sources. At the Source Database Table dialog or at the Lookup Source Database Table dialog define
either a Snowflake database table source or lookup source respectively:
• At Connection, select a previously defined Snowflake source connection or select Add new...
to add a new connection.
Sybase Your Sybase client and Open Client Library must be installed on the system and configured so that
it can connect to databases that you want to access from DMX. Please refer to specific Sybase
documentation for details on configuring the client.
Windows Systems To access Sybase databases, Sybase client software and Open Client Library must be accessible via
the dynamic link libraries (dll) located in the installation directory. You can configure the client by
using the dsedit utility, provided with the Sybase installation, to define database connections in the
sql.ini file.
UNIX Systems To access Sybase databases, Sybase client software and Open Client Library must be accessible via
the shared libraries located in the Sybase installation directory. You can make the client libraries
accessible by running the scripts provided with Sybase, such as <sybase_home>/SYBASE.sh, where
<sybase_home> denotes the directory where Sybase is installed. You can configure the client by
using the dsedit utility, provided with the Sybase installation, to define database connections in the
interfaces file.
Teradata In order to define a task that uses a Teradata table, the DMX Task Editor needs to access the
Teradata database from the system where the DMX Task Editor is run. This requires the Teradata
Call-Level Interface Version 2 for Network-Attached Systems (CLIv2), which is a Teradata Tools and
Utilities product, to be installed and configured on that system.
To access the Teradata database at run-time, Teradata FastExport, Teradata Parallel Transporter
(TPT), Teradata FastLoad, Teradata MultiLoad and Teradata Parallel Data Pump, which are
Teradata Tools and Utilities products, must be installed and configured on the system where DMX
jobs are run.
Installation and configuration
Teradata client software For Windows and UNIX systems, the client software installation includes the following:
1. On the system where the DMX Task Editor runs, install and configure the Teradata Utility
Pack, which includes CLIv2 and the Teradata ODBC driver.
2. On the system where the DMX Job Editor runs, install and configure the Teradata extract
and load utilities.
For Windows systems, the default, base installation directory is as follows:
C:\Program Files (x86)\Teradata\Client\<client_software_version_number>
For UNIX systems, the default, base installation directory is as follows:
/opt/teradata/client/<client_software_version_number>
For installation instructions and for information on the subdirectories under which the software
components are installed, see the Teradata Tools and Utilities Installation Guide for Windows or UNIX
that corresponds to your Teradata client software version.
Note:
• The Teradata installer adds all required directories to the PATH.
• When connecting through CLIv2 using the TTU access method, you do not have to create and
configure an ODBC data source.
Vertica The primary way to connect to Vertica databases is via ODBC. With Vertica version 7 or later, DMX
establishes parallel connections using Vertica COPY LOCAL, providing optimal performance, ease-
of-use, and dynamic tuning.
With older versions of Vertica, there are cases when the DMX Vertica Load Example Files method
may perform better than the ODBC method, as described at the end of this topic in "Choosing a
Method."
Connecting via ODBC When connecting to Vertica databases via ODBC, DMX uses different load methods based on the
Vertica version, as shown in the following overview table:
Vertica Version    Load Method
7 or later         Multi-stream Vertica COPY LOCAL via ODBC on both Windows and Linux
6 or later         Linux: Multi-stream Vertica COPY LOCAL via ODBC
                   Windows: Multi-stream SQL INSERT via ODBC
Earlier than 6     Single-stream SQL INSERT via ODBC
Vertica Version 6 or Later When connecting to Vertica via ODBC, on Linux as of Vertica version 6, and on Windows as of
Vertica version 7, DMX uses Vertica COPY LOCAL to load data, which provides the best possible
load performance. If running Vertica 6 on Windows, it uses multi-stream SQL INSERT.
Vertica Earlier than Version 6 When connecting to Vertica via ODBC prior to version 6, the Vertica ODBC client driver method
loads data using a SQL INSERT statement to a single Vertica initiator node.
Configuring ODBC for Vertica The Vertica configuration file, vertica.ini, is used by Vertica to determine the absolute path to the
file containing the ODBC installer library and the absolute path to the directory containing the
Vertica client driver's error message files. The path to vertica.ini is set through the Vertica
configuration file environment variable, VERTICAINI.
Note: vertica.ini is different from the DMX node loading configuration file, DMXVertica.ini, which is
specified through DMX_VERTICA_INI_FILE.
To configure the Vertica ODBC driver:
1. Follow the instructions for defining ODBC data sources on Windows and UNIX systems.
For Vertica ODBC client driver v5.1 or later on UNIX/Linux platforms, specify the following
DSN parameters in vertica.ini and set the environment variable VERTICAINI to point to the
location of the vertica.ini file:
[Driver]
ODBCInstLib=<dmx_install>/lib/libodbcinstSSL.so
ErrorMessagesPath=<absolute_path_to_directory_containing_Vertica_client_driver's_localized_error_message_files>
where <dmx_install> is the directory where DMX is installed.
The error message files are generally stored in the same directory as the Vertica ODBC
driver files.
2. When using the unixODBC driver manager, override the standard threading settings in the
ODBC section of odbcinst.ini as follows:
[ODBC]
Threading = 1
For additional details, see the Vertica Programmer's Guide for your version of Vertica.
Other DBMSs
ODBC Some sources and targets other than the databases mentioned above (e.g. Microsoft Access
databases) may be accessed via ODBC, by defining the appropriate ODBC data sources. See Defining
ODBC Data Sources for details on defining data sources on Windows and UNIX systems.
JDBC To access a database management system (DBMS) that is not explicitly supported by DMX, Java
Database Connectivity API (JDBC) can be used when a JDBC driver is provided by the database
vendor. The JDBC driver establishes the connection to the database and implements the protocol for
transferring queries and results between a client and the database.
Connecting to JDBC sources and targets requires that you define a JDBC configuration file, set DMX
and JAVA environment variables, and specify the database connection URL, which the DBMS JDBC
driver uses to connect to a database through the Database Connection dialog.
Connection Overview Through the DMX_JDBC_INI_FILE environment variable, DMX gains access to the JDBC
configuration file, which you define. As per the JDBC driver properties outlined in the JDBC
configuration file, DMX determines the JDBC driver class name and the Java class path to the class
and dependent classes; establishes the connection with the DBMS; and connects to the source or
target database, which is specified in the database connection URL.
JDBC Configuration File Outlined within the JDBC configuration file are the JDBC driver class name and Java class path for
locating the driver class and dependent classes for each DBMS. A separate section in the JDBC
configuration file is required for each DBMS.
Format Requirements
The JDBC configuration file is organized in sections. Consider the following format requirements:
• A section header marks the beginning of each section and is specified by a string enclosed in
square brackets ([]). The enclosed string specifies the name or alias of the DBMS. To
establish a connection to the database in the DBMS, the DBMS name, which is enclosed
within brackets([]) in the section header of the JDBC configuration file, must match the
DBMS name specified in the second parameter of the database connection URL.
• Within each section, name-value pairs describe the properties of the JDBC driver. Unless
otherwise stated, parameter names are case-insensitive and parameter values are case-
sensitive. Each line can contain a maximum of one parameter description where the
parameter value is separated from the parameter name by an equal sign (“=”). Extra spaces
before and after the equal sign are ignored.
Consider the following parameters for accessing a DBMS through JDBC:
• DriverName - Mandatory - This parameter identifies the JDBC driver class name.
• DriverClassPath - Mandatory - This parameter identifies the Java class path that
points to the JDBC driver. Use a semi-colon (;) to separate entries in the path.
• SelectStatement and InsertStatement – Optional - If the query language used by the DBMS
does not follow the SQL92 standard, these optional parameters enable you to provide custom
queries for select and insert operations. When you provide these parameters, DMX uses the
statement templates to create the appropriate select and insert statements. If either the
SelectStatement or InsertStatement parameter is not defined, standard SQL is used for the
corresponding read/write operation. If the right side of the equal sign is blank for the
SelectStatement or InsertStatement parameter, the corresponding read/write operation is
not supported. You can use the following place holders in the statement templates:
o <columns> – location where the actual comma-separated columns should be placed.
o <table> – location where the actual table name should be placed.
• IsSchemaSupported – Optional - This optional parameter ensures that DMX correctly
identifies all specified database tables. Through JDBC calls, DMX can generally determine
whether a DBMS supports a schema; however, certain DBMSs, such as Hive, do not return
the values that are expected from certain JDBC calls. Under these circumstances, you can
set the IsSchemaSupported parameter to ensure that all specified database tables are
identified correctly. The values for this parameter can be either true or false; as an exception
to the general rule, IsSchemaSupported parameter values are case insensitive.
For information on connecting to Hive through ODBC, see Connecting to Hive data warehouses.
• The character '#' marks the start of a comment, which continues until the end of the line.
Comments are permitted anywhere within a JDBC configuration file.
• Empty lines are permitted anywhere within a JDBC configuration file.
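The format requirements above can be sketched as a small example section. The section name, driver class, and jar path below are illustrative assumptions; the only requirement is that the section name match the DBMS name in the database connection URL.

```ini
# Hypothetical example -- the section name must match the DBMS name
# used in the connection URL (e.g. jdbc:mysql://...).
[mysql]
DriverName = com.mysql.jdbc.Driver                    # assumed driver class
DriverClassPath = /opt/jdbc/mysql-connector-java.jar  # assumed jar location
IsSchemaSupported = false    # only needed for DBMSs, such as Hive, that do not
                             # return the expected values from JDBC calls
```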
DMX and JAVA Environment Variables
DMX_JDBC_INI_FILE To provide DMX with access to the JDBC configuration file, you must set the DMX environment
variable, DMX_JDBC_INI_FILE, to point to the full path of the JDBC configuration file. The full
path includes the directory location and the JDBC configuration file name.
Consider the following examples on setting the DMX_JDBC_INI_FILE environment variable:
On Windows: set DMX_JDBC_INI_FILE=C:\Program Files\DMExpress\Programs\DMXJdbcConfig.ini
On UNIX: export DMX_JDBC_INI_FILE=/usr/dmexpress/etc/DMXJdbcConfig.ini
JAVA_HOME After you install the Java Runtime Environment (JRE) in Windows, you must set the Java
environment variable, JAVA_HOME, to point to the JRE installation directory. The bit level (32 or
64) of the installed JRE must match the bit level of the DMX release that you are running.
Consider the following examples on setting the JAVA_HOME environment variable:
On Windows: set JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0_51
On UNIX: export JAVA_HOME=/usr/java/jdk1.6.0_24
Database Connection URL To connect to a database using the JDBC access method, you must specify the database connection
URL as the database specification in the Database Connection dialog.
MySQL Example At a minimum, each database connection URL, which the JDBC driver uses to connect to the JDBC
source or target, consists of jdbc, which is the required first parameter; the DBMS name, the
database host name, the database name, and any additional connection property specification.
To access database db1 in the MySQL DBMS installed on the local computer, consider the following
valid database URL:
jdbc:mysql://localhost/db1
where
jdbc - required first parameter to connect to a JDBC source or target.
mysql - DBMS name. This DBMS name must match the DBMS name specified within
brackets ([]) in the section header of the JDBC configuration file.
//localhost/db1 - host and database identification string that identifies the db1 database
in the local MySQL DBMS installation.
For additional information, see MySQL Driver and Data Source Class Names, URL Syntax, and
Configuration Properties for Connector/J.
Defining ODBC Data Sources A data source needs to be defined for each database that you want DMX to access through ODBC.
The data source name on the client, where DMX tasks or jobs are defined, has to be the same as the
data source name on the server where DMX tasks or jobs run.
Windows Systems You can define an ODBC data source through the ODBC Data Source Administrator as follows:
• From the Start menu, select Settings, Control Panel, Administrative Tools, Data Sources
(ODBC).
• In the ODBC Data Source Administrator dialog, choose the User DSN or System DSN tab,
and click on the Add button. On 64-bit Windows, select the User DSN tab.
• In the Create New Data Source dialog, select the appropriate DBMS driver from the list, e.g.
SQL Server, Microsoft Access Driver (*.mdb), etc. Then, press the Finish button.
• The setup wizard guides you with further driver specific instructions.
UNIX Systems On UNIX systems, you may choose to use the ODBC driver manager, unixODBC, that is shipped
with DMX, or you may use your own driver manager.
DMX Default Driver Manager The DMX install and databaseSetup programs assist you in creating unixODBC data sources.
Alternatively, you can define ODBC data sources manually. The ODBC data source manager
provides support for ODBC data sources through two configuration files, odbcinst.ini and odbc.ini
that are located in the directory <dmx_home>/etc. This directory also contains templates and
examples of the configuration files. To change the location of the configuration files, export the
ODBCSYSINI environment variable to the new directory where both files reside.
The files need to be set up appropriately before you can access databases via ODBC. This is a one-
time configuration step, similar to defining system data sources on Windows.
• <dmx_home>/etc/odbcinst.ini: This file contains DBMS specific and system specific driver
definitions. Configuring this file corresponds to selecting the DBMS driver while adding a
data source on a Windows system.
• <dmx_home>/etc/odbc.ini: This file contains DBMS specific data source definitions, based on
the drivers defined previously in the odbcinst.ini file. Configuring this file corresponds to
following DBMS driver specific instructions while adding a data source on a Windows
system.
If you wish to remove a data source, delete the section that corresponds to that data source from the
odbc.ini file. A section starts with the data source name enclosed by [], and ends at the beginning of
the next section or at the end of the file.
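As a sketch of the one-time configuration described above, a driver definition in odbcinst.ini and a matching data source section in odbc.ini might look like the following. The driver name, library path, and data source name are illustrative assumptions; consult your DBMS driver documentation for the actual keys it requires.

```ini
# <dmx_home>/etc/odbcinst.ini -- driver definition (path is an assumption)
[MyDBMSDriver]
Driver = /opt/mydbms/lib/libodbcdriver.so

# <dmx_home>/etc/odbc.ini -- data source definition referencing the driver above
[MyDataSource]
Driver     = MyDBMSDriver
Database   = mydb
Servername = dbhost.example.com
```

The section name in odbc.ini is the data source name that must match on both the client and the server.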
64 Bit ODBC The ODBC headers and libraries that are shipped with the Microsoft Data Access Components
(MDAC) 2.7 Software Development Kit (SDK) have changed from earlier versions of ODBC to
support 64-bit platforms. Since the ODBC driver for a specific DBMS and the unixODBC libraries
are built separately, there may be an incompatibility in the definition of the SQLLEN variable, which
was specifically introduced for ODBC access on 64-bit UNIX platforms. On 64-bit UNIX platforms,
DMX assumes that the ODBC driver is 64-bit compliant and defaults the value of the SQLLEN
variable to 8 bytes. You can override this default by setting the DMXSQLLEN value corresponding to
the specific DBMS driver in the odbcinst.ini file.
Use Other ODBC Driver Manager By default, DMX uses the shipped unixODBC driver manager to load all ODBC drivers. Some ODBC
drivers, such as Teradata ODBC driver, may not work with it. You can tell DMX to use a different
ODBC driver manager by specifying the option DMXODBCDRIVERMANAGER=No under the driver section in
the odbcinst.ini file. You need to make sure that your ODBC driver manager library path (e.g.
/usr/odbc/lib for Teradata V12) is in the system library path (e.g. LD_LIBRARY_PATH on Linux) so
that it is loaded first by DMX. In addition, you may need to export the ODBCINI environment
variable with the absolute path to the file odbc.ini (e.g. export ODBCINI=<dmx_home>/etc/odbc.ini).
Refer to your DBMS documentation for details on this requirement.
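Combining the DMXSQLLEN and DMXODBCDRIVERMANAGER settings described above, a driver section in odbcinst.ini might look like the following sketch. The section name and library path are illustrative assumptions, not values from this guide.

```ini
# Hypothetical odbcinst.ini driver section
[Teradata]
Driver = /usr/odbc/drivers/tdata.so  # illustrative path to the DBMS driver
DMXSQLLEN = 4                        # override the 8-byte SQLLEN default if needed
DMXODBCDRIVERMANAGER = No            # load a driver manager from the system library path
```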
Connecting to Message Queues from DMX DMX can access message queues as sources or targets when the appropriate message queue client
software is installed on the system and accessible. The configuration steps needed to access a specific
message queue are described in the following sections.
To connect to a message queue via a Data Connector, follow the installation instructions that
accompany the connector.
IBM WebSphere MQ To create a connection to an IBM Websphere queue manager, you specify the queue manager name
in the Message Queue Connection dialog, and provide a channel definition that includes:
• the channel name,
• the transport type, and
• the connection name, with an optional port number.
Port number DMX assumes a default port number of 1414 for the port where the server’s listener is expecting
client communication. You can change the port number by specifying it in parentheses following the
connection name. For example, 192.168.2.100(1415) or server-machine.com(1415).
The channel definition may be specified by either:
1. Defining the MQSERVER environment variable, or
2. Using a DMX WebSphere queue manager configuration file.
The MQSERVER environment variable You can define a channel via the MQSERVER environment variable as defined by IBM.
Example On Windows:
SET MQSERVER=CHANNEL1/TCP/MQSERVER01
or, to change the default port number:
SET MQSERVER=CHANNEL1/TCP/MQSERVER01(1418)
On Unix:
export MQSERVER=CHANNEL1/TCP/’MQSERVER01’
or, to change the default port number:
export MQSERVER=CHANNEL1/TCP/’MQSERVER01(1418)’
Queue manager configuration file You can also create channel definitions for one or more queue managers in a configuration file, and
provide the fully qualified file name to DMX in the DMX_CONNECTOR_ENV_MQ_WS_INI_FILE
environment variable. This populates the Queue manager combo box of the Message Queue
Connection dialog with the defined queue managers or their aliases.
The contents of the file must be formatted as follows:
• Anything following a “#” character until the end of the line is a comment. Comments are
allowed anywhere.
• Empty lines are allowed anywhere.
• The file is organized in sections. The beginning of each section (the section header) is
specified by a string enclosed in square brackets. The enclosed string may be a queue
manager name or a queue manager alias.
• The section headers must be unique.
• The lines between section headers contain the channel definition parameters for that
particular queue manager or alias. There are 4 supported parameters: queuemanager, channel,
transport, connectionname. The parameter values are separated from the parameter name by
an “=” character. The parameter names are case insensitive, but their values are case
sensitive, except for the transport parameter. Each line may contain at most one parameter
definition.
• The queuemanager parameter is used for cases where the section name is not a queue manager
name, but a queue manager alias. This is to allow potential configuration of different channel
definition options for the same queue manager.
If there is a configuration section with just the name but no parameters, the MQSERVER
environment variable definition is used at connection time. This saves you from typing in the queue
manager's name in the GUI, as the name appears in the list of known queue managers.
The connection parameters defined in this file override the MQSERVER environment variable only if
all 3 parameters (channel, transport and connectionname) are defined for a particular queue manager.
If you would like to define several different parameter sets for the same queue manager, use an alias
for the section name and override the queue manager name by defining the queuemanager parameter
inside the parameters section for that alias.
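As an illustration of the alias mechanism described above, two parameter sets for the same queue manager could be defined as follows. All names here are made up for the example.

```ini
# Hypothetical aliases for one queue manager (QM1)
[qm1.via.tcp]
queuemanager   = QM1
channel        = all.clients
transport      = tcp
connectionname = mq-host.example.com(1418)

# Section with a name but no parameters: the MQSERVER environment
# variable definition is used at connection time.
[QM1]
```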
Sample configuration file A sample configuration file (DMXWebSphereConnector.ini) is installed in the directory:
• On Windows: <dmx>\Examples\WebSphereConnector\DMXWebSphereConnector.ini
• On Unix: <dmx>/etc/DMXWebSphereConnector.ini
where <dmx> is the directory where DMX is installed.
Example Define the DMX_CONNECTOR_ENV_MQ_WS_INI_FILE environment variable:
SET DMX_CONNECTOR_ENV_MQ_WS_INI_FILE = C:\tmp\DMXWSConfig.ini
Create the DMXWSConfig.ini file at the above location, with the following content:
[my.local.queue.manager]
Channel = all.clients
Transport = tcp
Connectionname = mw-server.com
Connecting to Salesforce from DMX In order for DMX to connect to Salesforce.com, the SSL client certificate must be up-to-date. The
DMX installation includes the file cacert.pem in the <install_dir>/CACertificates directory. This is a
plain text file containing a set of public keys used for SSL authentication when connecting to
Salesforce.com.
If this file goes out-of-date, an HTTPSCVF error is issued when attempting to connect to
Salesforce.com. If that happens, go to http://curl.haxx.se/ca/cacert.pem, save the file as cacert.pem to
the <install_dir>/CACertificates directory, and retry the Salesforce.com connection.
Connecting to SAP from DMX In order for DMX to access data in an SAP system, SAP’s NetWeaver client libraries NW RFC SDK
7.10 with patch level 2 or higher must be on the system and accessible via the appropriate shared
library or dynamic link library (dll) paths. They can be downloaded from SAP’s marketplace at
http://service.sap.com/swdc. For Windows 64-bit platforms, both the 64-bit and 32-bit NetWeaver
client libraries are required.
The following environment variable must be set to include the path to the NetWeaver client
libraries, for example, <download_path>/nwrfcsdk/lib, and exported on the corresponding platform:
Windows PATH
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
On UNIX systems, the variable needs to be set and exported prior to starting the DMX Run-time
Service or running DMX tasks or jobs.
The SAP NetWeaver client libraries depend on the corresponding C/C++ libraries that they were
built with. The path to the C/C++ libraries must also be included in the library search path.
Windows: Microsoft C Runtime DLLs version 8.0 need to be installed. Refer to SAP Note
684106 at https://service.sap.com/sap/support/notes/684106. The vcredist_x86 package
needs to be installed on all Windows platforms. In addition, the vcredist_IA64 and
vcredist_x64 packages need to be installed on Windows IA64 and Windows x64 platforms,
respectively.
AIX: AIX C++ library libC.a – usually found in /usr/lib.
HP-UX IA64: HP C++ library libCsup.so.1 – usually found in /usr/lib/hpux64.
Linux: C library version 2.3.4 or higher, libstdc++.so.6 – usually found in /lib/tls and /usr/lib.
Refer to SAP Note 1021236 at https://service.sap.com/sap/support/notes/1021236.
Solaris: Sun C++ libraries libCstd.so.1 and libCrun.so.1 for SunOS 5.10 or higher – usually
found in /usr/lib/sparcv9 or /usr/lib/64.
If lower versions of these libraries are also on the system, then the path to the libraries of the
required version must be before the older versions in the library search path.
On all systems, the path to the DMX library must be before the path to the SAP NetWeaver client
libraries in the library search path.
The SAP client library must be configured so that it can connect to SAP systems that you want to
access from DMX. For example, you can configure the client by defining SAP system aliases in the
sapnwrfc.ini file. Please refer to specific SAP documentation for details on configuring the client.
Once configured, the directory where the file is located must be set in the environment variable
RFC_INI.
The DMX install program assists you with verifying connections to SAP systems.
On UNIX systems, if you wish to configure and/or verify SAP connections any time after the
installation procedure, run the SAPSetup program as follows:
cd <dmx_home>
./SAPSetup
Registering DMX in SAP SLD Per SAP recommendation, each DMX server should be registered in your SAP SLD (System
Landscape Directory). Please refer to the topic “Registration of DMExpress Components in the SAP
System Landscape Directory” in the DMX help.
Connecting to HDFS from DMX In order for DMX to access data located in HDFS, a Hadoop distribution must be installed and
configured as follows on the system where the DMX jobs and tasks are executed:
• The hadoop command must be accessible to DMX:
o DMX first looks for the hadoop command in $HADOOP_HOME/bin/hadoop, where the
environment variable HADOOP_HOME is set to the directory where Hadoop is installed.
Defining environment variables can be done through the Environment Variables tab of
the DMX Server dialog.
o If HADOOP_HOME is not defined or the directory can't be found, DMX looks for the
hadoop command in the system path, where it is automatically added by some Hadoop
distributions.
• The fs.default.name property in the core-site.xml configuration file must be set to point to
the Hadoop file system.
• The HTTP namenode daemon must be running on the default port 50070. If you would like
to use a different port number, please contact Technical Support.
• If the Hadoop cluster requires Kerberos authentication, you need to use the dmxkinit utility
to run your HDFS extract/load jobs/tasks.
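The lookup order for the hadoop command described above can be sketched as a small shell function. This is an illustration of the documented behavior, not DMX source code: it prefers $HADOOP_HOME/bin/hadoop and falls back to the system path.

```shell
# Sketch of the documented lookup order (illustrative only):
# 1. $HADOOP_HOME/bin/hadoop, if HADOOP_HOME is set and the command exists
# 2. otherwise, the hadoop command found in the system path
resolve_hadoop() {
  if [ -n "$HADOOP_HOME" ] && [ -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "$HADOOP_HOME/bin/hadoop"
  else
    command -v hadoop
  fi
}
```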
Connecting to Connect:Direct nodes from DMX In establishing connectivity to a Connect:Direct node on the mainframe, DMX initiates file transfers
from this node to an open-systems Linux server.
Security The Connect:Direct proprietary security protocol offers security through authentication and user
proxies. User authorities and user proxies are set up during Connect:Direct installation and
configuration.
Installation and Configuration For DMX to access data located on a Connect:Direct node, a Connect:Direct server and
Connect:Direct client must be installed on the same Linux machine on which DMX jobs and tasks
are executed.
• Configure Connect:Direct to access the required Connect:Direct nodes.
For details on configuring Connect:Direct nodes, refer to the IBM Sterling Connect:Direct product
documentation.
• Add a Connect:Direct user for each DMX user who accesses Connect:Direct.
• The DMX server must be configured as the Connect:Direct primary node (pnode) to enable
sampling with Connect:Direct connections.
• Prior to starting the DMX Run-time Service or to running DMX tasks or jobs, set the
following environment variables:
o NDMAPICFG points to the CLI/API configuration file, ndmapi.cfg, for example:
export NDMAPICFG=$NDMAPICFG:<connect:direct_home>/ndm/cfg/cliapi/ndmapi.cfg
DMX Install Guide 85
o PATH points to the Connect:Direct /bin directory, for example:
export PATH=$PATH:<connect:direct_home>/ndm/bin
If you plan to start the DMX Run-time Service using sudo, use the -E option to preserve the environment variable settings.
Note: If the :file.open.exit.program parameter in the user.exits section of the parameter
initialization configuration file, <connect_direct_home>/ndm/cfg/<node_name>/initparm.cfg, contains
any path, including the path to SSConnectDirectFileOpenUserExit, remove the full path such that the
parameter value is blank: :file.open.exit.program=:\
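For illustration, the edit described in the note might look as follows in initparm.cfg; the full path shown in the "before" line is a hypothetical example, not a value from your installation:

```text
Before (parameter contains a path):
:file.open.exit.program=/opt/cdunix/ndm/bin/SSConnectDirectFileOpenUserExit:\

After (parameter value is blank):
:file.open.exit.program=:\
```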
Connecting to Databricks File Systems (DBFSs)
Databricks is a cloud storage Platform-as-a-Service for Spark supported on Azure and AWS Cloud Services. Through JDBC connectivity, DMX-h supports Databricks File System (DBFS) content as a source and as a target.
NOTE: A DBFS connection is a remote file connection, which is logically different from a Databricks
database connection that supports queries.
Databricks File System (DBFS) connection requirements
Databricks requires a JDBC connection configuration with the driver name and location for all
connections. To use a Databricks File System (DBFS) connection, you also require a DMX execution
profile with Databricks deployment configuration parameters. To install DMX on a Databricks
cluster, follow the instructions in Deploying DMX to a Databricks cluster in the cloud.
Before attempting to connect to Databricks, do the following:
• Install the DMX server on an Amazon Elastic Compute Cloud (EC2) instance or your local machine.
• Specify JDBC and cluster parallelization parameters in the DMX JDBC configuration file.
The parameters outlined in the DMX JDBC configuration file, as defined by the
DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and optional
values required to access an Amazon S3 bucket or Azure blob to invoke a Databricks query.
Refer to the DMX Databricks JDBC configuration information provided below.
• DMX accesses Databricks using token-based authentication. If no access keys are provided, whether session-based or explicit, DMX issues a UNIAMCRE error message and aborts the job.
• To use a DBFS connection, specify Databricks deployment configuration parameters in a DMX
execution profile. You can use a global, user, and/or job-specific execution profile, but DBFS is
unreachable without this configuration. See the Execution Profile topic in the DMX Help.
The parameters outlined in a DMX Databricks JDBC configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• ANALYZETABLESTATISTICS - When set to y, DMX can run analyze queries that collect table statistics. The default value is n.
• ANALYZECOLUMNSTATISTICS - When set to y, DMX can run analyze queries that collect
column statistics. Default value is n.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of
parallel streams that Connect can establish for loading data into the staging data file. By
default, MAXPARALLELSTREAMS is set to the number of CPUs available in the client
machine.
• WORKTABLEDIRECTORY - Required path to an S3 bucket, Azure blob container, or Databricks File System (DBFS) store in which to stage data. You must mount an S3 bucket or Azure blob container using the Databricks File System (DBFS). Example URLs include:
o s3a://dev for an S3 bucket
o wasbs://[email protected]/dev for an Azure blob
o dbfs://dev for a DBFS store
• WORKTABLESCHEMA - Optional schema name in which to create the staging data. By default, the staging data uses the same schema as the target data.
• WORKTABLECODEC - A compression codec to compress the files in the staging directory. Valid values are gzip (default), bzip2, and uncompressed.
• MAXWORKFILESIZE - Optional integer. The maximum size in bytes of a staging file written by each task. The default value is 134217728, which is equivalent to 128 MB.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon
establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEYID is
optional.
• AWSACCESSKEY - The 40-character string, also known as the secret access key, which
Amazon provides upon establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEY is
optional. DMX requires the access key id and the secret access key to send requests to an
Amazon S3 bucket unless an AWS temporary session token is required, in which case DMX
requires the access key id and AWS temporary session token. See the AWSTOKEN
parameter below.
• AWSTOKEN - An AWS temporary session token, granting temporary security credentials
(temporary access keys and a security token) to any IAM user enabling them to access AWS
services. This alternative authentication method replaces a full-access AWS storage access
key. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket.
• AZURESTORAGEACCESSKEY - A 512-bit Azure Blob Storage access key for an active account; Microsoft issues two such keys upon establishing an Azure Portal account. If DMX runs in the Azure Blob container, AzureStorageAccessKey is optional. If storage access is required and the key is missing or invalid, DMX issues an AZSQDWTERR error message and aborts the job. DMX ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob container.
• AZURESTORAGESAS - A shared access signature (SAS) URI that grants restricted access
rights to Azure Storage resources. This alternative authentication method replaces a full-
access Azure Storage access key. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an Azure blob container.
• DBFSMOUNTPOINT - DBFS mount point (DBFS path) required by
WORKTABLEDIRECTORY. DBFSMOUNTPOINT is mandatory if the work table directory
maps to an S3/Azure URL.
• LOADVIADBFS - Optional character. If WORKTABLEDIRECTORY is not set, DMX uses a
DBFS command to load data when LoadViaDBFS is set to "y". If the work table directory
isn't provided and the LoadViaDBFS parameter isn't set to "y", the DMX job aborts with an
error message. This option requires Databricks deployment configuration parameters from a
DMX execution profile.
Consider the following format of a DMX Databricks configuration file:
[spark]
DriverName=<JDBC.Driver.name>
DriverClassPath=<JDBC_Driver_ClassPath>
ANALYZETABLESTATISTICS=<y|n>
ANALYZECOLUMNSTATISTICS=<y|n>
MAXPARALLELSTREAMS=<Maximum_Parallel_Streams>
WORKTABLEDIRECTORY=<Directory_path>
DBFSMOUNTPOINT=<DBFS_path>
MAXWORKFILESIZE=<Maximum_Bytes>
WORKTABLESCHEMA=<Amazon_S3_schema>
WORKTABLECODEC=<Amazon_S3_codec>
AWSACCESSKEYID=<AWS_access_key_id>
AWSACCESSKEY=<AWS_access_key>
AWSTOKEN=<AWS_token>
AZURESTORAGEACCESSKEY=<Azure_access_key>
AZURESTORAGESAS=<Azure_SAS_URI>
LoadViaDBFS=<y|n>
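As a concrete sketch, a minimal [spark] section for staging to an S3 bucket might look like the following; every value shown (driver name, jar path, bucket, mount point, keys) is a placeholder assumption, not a verified default:

```ini
[spark]
DriverName=com.simba.spark.jdbc.Driver
DriverClassPath=/opt/jdbc/SparkJDBC42.jar
MAXPARALLELSTREAMS=8
WORKTABLEDIRECTORY=s3a://my-staging-bucket/dev
DBFSMOUNTPOINT=/mnt/dev
WORKTABLECODEC=gzip
MAXWORKFILESIZE=134217728
AWSACCESSKEYID=AKIAXXXXXXXXXXXXXXXX
AWSACCESSKEY=****************************************
```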
Defining Databricks File System (DBFS) connections
Databricks File System (DBFS) connections connect to Databricks sources and targets as a remote
file connection. In the Databricks File System Connection dialog, define a connection to Databricks
File System (DBFS) as follows:
• At Name, type the name of the DBFS deployment configuration from the execution profile.
• At Current DMExpress Server, select the applicable DMExpress server running on the DBFS cluster.
• At Instance, select the URL of the Databricks endpoint to which DMX connects and sends
API requests.
• At Authentication, either:
1. At Token: Specify, type a TLS access token.
or
2. At Token: Use Repository, select the repository used to store tokens, either DMX Repository or CyberArk. At Token alias, type the alias for your TLS authentication token, or select Define to define a new token alias in a DMX Repository.
After you choose OK, you can define Source File connections and Target File connections for DBFS
sources or targets, respectively.
Connecting to CyberArk Enterprise Password Vault
DMX connects to CyberArk Enterprise Password Vault over a TLS-secured HTTPS connection and requires access to an up-to-date TLS client certificate. If the CyberArk server secures the DMX connection with a self-signed certificate, update <dmx_home>/CACertificates/cacert.pem with the public certificate at the same time you update or install the client certificate, where <dmx_home> is the DMX installation directory.
For DMX jobs run in a Hadoop cluster, update the client certificate and cacert.pem on the edge node
only. DMX distributes TLS configurations, keys, and certificates to the cluster nodes.
If a client certificate file is out-of-date, DMX issues an HTTPSCVF error when it attempts to connect
to the CyberArk server.
CyberArk Licenses
DMX can only connect to licensed CyberArk vaults. Check the CyberArk license status if you
encounter repeated failures to retrieve a CyberArk password.
Connecting to Protegrity Data Security Gateway
DMX connects to Protegrity Data Security Gateway by making REST API POST requests over HTTP.
The DMX Protect and Unprotect functions use Protegrity resources to protect and unprotect data sent to Protegrity. You must configure the Protegrity Gateway server to receive and process REST requests before DMX can use these functions. The API endpoint implementation determines the specific protection methods. Some details needed to set up protection are:
• All REST API calls use the POST method.
• Data is always sent as part of the HTTP message body.
• Data is always sent without any encoding change. The Protegrity server must return
protected data with the same encoding as the data input.
• DMX does not pass data with empty or NULL values to the Protegrity server.
Connecting to QlikView data eXchange files from QlikView or Qlik Sense
Qlik is the provider of QlikView and Qlik Sense business intelligence and visualization software applications. DMX supports QlikView data eXchange (QVX) files as targets. Through DMX, you define the QVX file and the QlikView data eXchange reformat layout.
QVX files can be used as data sources for QlikView or Qlik Sense.
QlikView desktop installation overview
To access QVX files as sources from within QlikView:
1. Install the QlikView desktop.
2. At the QlikView desktop:
a) Start QlikView Personal Edition.
b) At the File menu, select Open.
c) In the Open dialog, ensure that the file type is All Files (*.*) and browse to the
appropriate QVX file.
d) Select the QVX file and select Open.
e) At the File Wizard dialog, ensure that the File type is Qvx and select Finish.
f) At the Edit Script dialog, select Reload to execute the displayed script, which loads the
QVX data.
g) At the Fields tab of the Sheet Properties dialog, select the fields to display on the Main
QlikView sheet.
h) To save the data in the QlikView document, select Save.
Qlik Sense desktop installation overview
To access QVX files as sources from within Qlik Sense:
1. Install the Qlik Sense desktop.
For information on Qlik Sense, see Qlik Sense help.
2. At the Qlik Sense desktop:
a) Start the Qlik Sense desktop.
b) Select Create a New App.
c) In the Create new app dialog, enter the name of the application and select Create.
d) At the New app created dialog, select Open.
e) At the Qlik Sense desktop, select Quick data load.
f) At the Select file dialog, ensure that the file type is QlikView data exchange files (qvx),
browse to the appropriate QVX file, and select Select.
g) At the Select data from .qvx dialog, select the appropriate fields to load and select Load
data.
h) When the data loads successfully, a new data sheet is created.
i) To edit the data sheet, select Edit the sheet.
Connecting to Tableau Data Extract files from Tableau
Tableau is a business intelligence application that provides browser-based analytics. DMX supports Tableau Data Extract (TDE) files as targets. Through DMX, you define the TDE file and the Tableau Data Extract reformat layout.
TDE files can be used as data sources for Tableau.
Tableau desktop installation overview
To access TDE files as sources from within Tableau:
1. Install the Tableau desktop.
2. At the Tableau desktop:
a) Start Tableau Desktop.
b) Select Connect to Data.
c) In the File section of the Connect to Data page, select Tableau Data Extract.
d) At the Open dialog, browse to and select the Tableau Data Extract file.
e) At the Tableau Data Extract Connection dialog, enter the name of the data connection
for use in Tableau.
The data in the Tableau Data Extract file displays within the Tableau Desktop.
Removing DMX/DMX-h from Your System
Windows Systems
Perform the following steps to remove DMX from your system:
1. Ensure that the DMX Task Editor, DMX Job Editor, and DMX Server are closed and no
DMX jobs are running.
2. Go to Programs, DMExpress from the Start menu and select Uninstall DMX.
3. Alternatively, you can remove DMX as follows: Go to Settings, Control Panel from the Start
menu and double-click Add/Remove Programs. In the list of applications that can be
removed, select the entry for DMX. Click Add/Remove and confirm.
4. Delete folders if necessary. If you created any of your own files in the folder where you
installed DMX, these files are not removed by the uninstall program.
UNIX Systems
Perform the following steps to remove DMX from your system:
1. Ensure that no DMX jobs are running.
2. If you installed the DMX Run-time Service, you need to uninstall it first. Login as root and
run:
cd <dmx_home>
./install
When prompted, select to uninstall the service.
3. Remove the DMX directory:
cd <dmx_home>/..
rm -rf <dmx_home>
Remove any environment variable settings that you added to your profile, e.g. <dmx_home>/bin in
your PATH, after the DMX installation.
DMX-h in a Hadoop Cluster
The method for removing DMX-h from the nodes of a Hadoop cluster depends on how you originally installed DMX-h in the cluster. Follow the instructions in the appropriate section below.
Cloudera Manager Parcel Uninstall
Uninstall DMX-h on all nodes in the cluster as follows:
1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Click on the parcel indicator button in the Cloudera Manager Admin console navigation bar
to bring up the Parcels tab of the Hosts page.
4. In the currently activated dmexpress parcel, click on the Actions button and select
Deactivate to deactivate the parcel.
5. Once deactivated, click on the Actions button and select Remove From Hosts to remove the
parcel from the cluster nodes.
6. Once the parcel is removed from the cluster nodes, click on the Actions button and select
Delete to delete the parcel from the repository.
Apache Ambari Service Uninstall
Follow the instructions for RPM Uninstall, or:
1. Open the Ambari Web UI and navigate to “Hosts”
2. For each host, choose the “Installed” drop-down next to “Clients”
3. For both “DMX-h” and “DMX-h License” (if present), choose “UNINSTALL.”
Once uninstalled, either via the UI or using RPM, disable the uninstalled services:
1. Open the Ambari Web UI, and navigate to “Services”
2. For each of “DMX-h” and “DMX-h License” (if present), choose “Service Actions” -> “Delete
Service” and follow the prompts.
RPM Uninstall
Uninstall DMX-h on all nodes in the cluster as follows:
1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Run the following command with sudo or root privileges using the erase option, -e:
Software: rpm -e dmexpress
License: rpm -e dmexpresslicense-<license site ID>
e.g. rpm -e dmexpresslicense-12345
If you do not know your license site ID, run the following command to find the installed
license package name:
rpm -qa | grep dmexpresslicense-
You can also use an RPM wrapper such as yum instead:
yum erase dmexpress
yum erase dmexpresslicense-<license site ID>
Manual/Silent Uninstall
Uninstall DMX-h on the edge/ETL node and each remaining node in the cluster as follows:
1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Remove the DMX home directory on the edge node and all remaining nodes in the cluster:
cd <dmx_home>/..
rm -rf <dmx_home>
4. Remove any environment variable modifications made for DMX, such as the addition of
<dmx_home>/bin to your PATH.
Uninstall the DMX Run-time Service
When instructed to uninstall the DMX Run-time Service, run the install script in the DMX installation directory as root, and select the option to uninstall the DMX Run-time Service. For example:
cd /usr/local/DMExpress
./install
DMX installation component options
DMX installation component options include the following:
• Standard
The standard installation enables you to install the following components on one server:
o Development client, Job Editor and Task Editor
o DMX engine, dmxdfnl/dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
o DataFunnel Run-time Service, dmxrund
o See DMX DataFunnel run-time service installation and configuration.
• Full
The full installation enables you to install all DMX components on one server:
o Development client, DMX Job Editor and Task Editor
o DMX engine, dmxdfnl/dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
o DataFunnel Run-time Service, dmxrund
o See DMX DataFunnel run-time service installation and configuration.
o Management Service, which includes, dmxmgr, REST APIs, and the Connect Portal
user interface (UI)
See DMX Management Service installation and configuration.
• Classic
The classic installation enables you to install traditional DMX components on one server:
o Development client, Job Editor and Task Editor
o DMX engine, dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
• Custom
The custom installation enables you to install individual components on different servers:
o DMX engine
Installs the DMX engine, dmxdfnl/dmxjob/dmexpress.
o Service for development client
Installs the DMX Run-time Service, dmxd.
o DataFunnel Run-time Service
Installs the DataFunnel Run-time Service, dmxrund.
See DMX DataFunnel run-time service installation and configuration.
o Development client
Installs the Job Editor and Task Editor.
o Management Service
Installs the management service, dmxmgr, REST APIs, and the Connect Portal UI.
See DMX Management Service installation and configuration.
DMX Management Service installation and configuration
Installation
DMX Management Service executable
The DMX Management Service executable, DMXManager, is installed in the following directory:
Windows: <DMX_installation>\Programs
Linux: <DMX_installation>/bin
The DMX management service configuration file
The DMX management service configuration file, dmxmgr.properties, is installed in the following directory:
Windows: <DMX_installation>\Conf
Linux: <DMX_installation>/conf
Configuration
DMX management service configuration file
Many of the properties within dmxmgr.properties are populated with commented, preliminary default values. Consider each of the name-value pairs among the following properties within the file; uncomment and update to meet your system requirements:
• Server
• Secure socket layer (SSL)
• Authentication
• Central file repository
• Central database repository
• Logging
Configuration properties as environment variables
You can specify the configuration properties defined within dmxmgr.properties as environment variables by capitalizing the property name and replacing the period separator, ".", with an underscore. The configuration property name authentication.method could be specified as a Linux environment variable, for example, as follows: export AUTHENTICATION_METHOD=LDAP.
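The capitalize-and-replace rule above can be sketched with a small shell helper; the function name is illustrative and not part of DMX:

```shell
# Illustrative helper: convert a dmxmgr.properties key to its
# environment-variable form (replace "." with "_", then uppercase).
prop_to_env() {
  printf '%s\n' "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

prop_to_env authentication.method   # prints AUTHENTICATION_METHOD
prop_to_env server.port             # prints SERVER_PORT
```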
Server configuration properties
Server configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
Consider the following server configuration properties:
• server.address: The address of the embedded Apache Tomcat web server application. Required.
• server.port: The DMX management service port that is assigned to listen for client requests. Default: 8280. If the port number dedicated to listening for client requests is different from 8280, assign the appropriate value.
Secure socket layer configuration properties
Secure socket layer (SSL) configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
By default, the DMX management service disables SSL certification. To enable SSL certification, SSL configuration properties must be added to dmxmgr.properties.
Consider the following SSL configuration properties:
• security.require-ssl: Determines whether SSL certification is required. Values: False (default), True. For SSL certification to be enabled, the property value must be set to True.
• server.ssl.client-auth: Determines whether client authentication occurs during the SSL handshake. Value: want. For client authentication to occur during the SSL handshake, the property value must be set to want.
• server.ssl.key-alias: Alias of the SSL key. Required when SSL certification is enabled.
• server.ssl.key-password: Password of the SSL key. Required when SSL certification is enabled.
• server.ssl.key-store: Location of the DMX central management server keyStore. Required when SSL certification is enabled.
• server.ssl.key-store-password: Password of the DMX central management server keyStore. Required when SSL certification is enabled.
• server.ssl.trust-store: Location of the DMX central management server trustStore. Required when the DMX DataFunnel run-time service uses SSL.
• server.ssl.trust-store-password: Password of the DMX central management server trustStore. Required when the DMX DataFunnel run-time service uses SSL.
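Putting these together, an SSL section of dmxmgr.properties might look like the following sketch; all paths, aliases, and passwords shown are placeholder assumptions:

```properties
security.require-ssl=True
server.ssl.client-auth=want
server.ssl.key-alias=dmxmgr
server.ssl.key-password=changeit
server.ssl.key-store=/opt/dmexpress/conf/keystore.jks
server.ssl.key-store-password=changeit
server.ssl.trust-store=/opt/dmexpress/conf/truststore.jks
server.ssl.trust-store-password=changeit
```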
Authentication configuration properties
Authentication configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
Consider the following authentication configuration properties:
• authentication.method: The authentication method for authenticating users. Values: LDAP (default) and SIMPLE. If you skip the configuration setup during the installation, the installation process automatically assigns the value LDAP to the authentication.method property. When LDAP is the authentication method, you must provide LDAP authentication configuration properties.
• authentication.login.auto_create_users: Specifies whether new users should be created dynamically upon login. Values: true (default) and false. To successfully call REST APIs, valid user credentials on the authentication backend (for example, on the LDAP active directory) must also be registered with the DMX management service. When the property value is set to false, users are not automatically created and registered on the DMX management service even when they are registered on the LDAP active directory. Any attempt by an unregistered user to call the REST API layer of the DMX management service results in a call failure with status code 401/Unauthorized.
• authentication.login.default_role: The default user roles, which the DMX management service requires to operate, are automatically established as part of the DMX management service installation. Values: role_administrator and role_user (defaults). These roles are assigned dynamically as part of the initial login: the first user who successfully logs into the system is granted the user role role_administrator; any subsequent user who successfully logs into the system is granted the user role role_user. While not required, system administrators can create new, custom user roles and assign existing permissions to the new roles. Examples of possible custom user roles include the following: business user, operator, data scientist, data architect, solution engineer, developer.
• authentication.token.signature_secret: The signature secret used to Secure Hash Algorithm (SHA)-sign generated authentication tokens. If a signature secret value is not specified, a random secret is generated at DMX management service start-up. When generating the cryptographic signature of an authentication token, a portion of the authentication token segment is signed using a SHA message digest. A signature secret is applied to the message digest. The resulting signature value, which is applied to the authentication cookie, is encoded as a Base64 string.
• authentication.token.token_validity: The time in seconds for which a generated token is valid. Default: 36000 seconds, which is equivalent to 10 hours.
• authentication.token.cookie_domain: The domain attribute of the authentication token cookie. The cookie domain specifies to the browser that cookies should only be sent back to the DMX management service for the given domain. If the cookie domain is not specified, the cookie is sent back to the domain on the DMX management service from which the object was requested by default. For additional information, see HTTP Cookie Domain and Path.
• authentication.token.cookie_path: The path attribute of the authentication token cookie. The cookie path specifies to the browser that cookies should only be sent back to the DMX management service for the given path. If the cookie path is not specified, the cookie is sent back to the path on the DMX management service from which the object was requested by default. For additional information, see HTTP Cookie Domain and Path.
LDAP authentication configuration properties
• ldap.url: LDAP URL. Required when authentication.method is set to LDAP.
• ldap.active_directory.user_domain: LDAP active directory user domain. Required when authentication.method is set to LDAP.
• ldap.active_directory.root_domain: LDAP active directory root domain. Required when authentication.method is set to LDAP.
• ldap.search.managerDn: Distinguished name (DN) of the manager, which is the user that performs searches when the LDAP server does not support or has not enabled anonymous searches.
• ldap.search.managerPassword: Password of the manager that performs LDAP searches.
• ldap.search.userBaseDn: The search base DN for finding users.
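As a hedged sketch, an LDAP section of dmxmgr.properties might look like the following; the domain names, DNs, and password are placeholder assumptions:

```properties
authentication.method=LDAP
ldap.url=ldap://ldap.example.com:389
ldap.active_directory.user_domain=example.com
ldap.active_directory.root_domain=dc=example,dc=com
ldap.search.managerDn=cn=manager,dc=example,dc=com
ldap.search.managerPassword=changeit
ldap.search.userBaseDn=ou=users,dc=example,dc=com
```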
Central file repository configuration properties
Central file repository configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
The DMXDFNL root job and its job dependencies, which include subjobs, tasks, and operational metadata, are stored in the DMX central file repository. The DMX central file repository must be configured to reside on a local file system.
Consider the following file repository configuration properties:
Local central file repository configuration properties
• repository.url: Location of the local DMX central file repository. The default location of the central file repository is the home directory on your local client workstation.
History repository configuration properties
• history.repository.location: Location of the job execution history directory, which is relative to the DMX central file repository. Required. Beneath the top-level history directory, individual job run directories are created and organized by date:
~/.dmexpress/history/{YEAR}/{MONTH}/{DAY}/
Each job run directory contains files of the form:
<job_name>_<starttime>[_<job_number>]_log.{xml|txt}
<job_name>_<starttime>[_<job_number>].json
The job log is generated in XML or text format; the operational metadata log is generated in JSON format.
Central database repository configuration properties
Central database repository configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
The DMXDFNL job definition and runtime connection data are stored in the central database repository. The central database repository must be configured to reside on your local client workstation.
Consider the following database repository configuration properties:
• spring.datasource.url: Location of the local DMX central database repository. Required. The default location of the central database repository is beneath the home directory on your local client workstation: ~/.dmexpress/com.syncsort.dmxmgr/
• spring.datasource.username: Name of the database user with access to the database repository. Required.
• spring.datasource.password: Password value associated with the user with access to the database. Required.
• spring.datasource.driverClassName: Identifies the JDBC driver class name or Java class. Required.
Logging configuration properties
Logging configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
Consider the following logging configuration properties:
• logging.file: The relative or absolute path to and name of the DMX management service log file; for example: logging.file=${java.io.tmpdir:-/tmp}/dmxmgr.log
• logging.level.*: The level of logging detail written to the DMX management service log file that is defined in logging.file. ERROR, WARN, and INFO level messages are logged by default. Valid values include the following: ERROR, WARN, INFO, DEBUG, or TRACE.
DMX DataFunnel run-time service installation and configuration
Installation
DMX DataFunnel run-time service executable
The DMX DataFunnel run-time service executable, dmxrund, is installed in the following directory:
Windows: <DMX_installation>\Programs
Linux: <DMX_installation>/bin
DMX DataFunnel Run-time Service configuration file
The DMX DataFunnel Run-time Service configuration file, dmxrund.conf, is installed in the following directory:
Windows: <DMX_installation>\Conf
Linux: <DMX_installation>/conf
Linux only: DMX impersonation executable
The DMX impersonation executable, dmxexecutor.exe, is installed in the following Linux directory:
<DMX_installation>/bin
Linux only: DMX custom impersonation configuration file
The DMX custom impersonation configuration file, dmxexecutor.conf, is located in the following Linux directory:
<DMX_installation>/conf
Configuration
DMX DataFunnel Run-time Service configuration file
Many of the properties within dmxrund.conf are populated with commented, preliminary default values. Uncomment and update applicable properties to meet your system requirements.
DMX DataFunnel Run-time Service configuration properties
DMX DataFunnel Run-time Service configuration properties are defined through the name-value pairs specified in the DMX DataFunnel Run-time Service configuration file, dmxrund.conf.
Consider the following DataFunnel Run-time Service configuration properties:
SERVER_PORT
    The DMX execution service port that is assigned to listen for job execution requests from the DMX management service, dmxmgr.
    Values: 33636 (default)

DMEXPRESS_HOME
    The directory where DMX is installed.
    Values: Required

UNPACK_WORK_DIRECTORY
    The working directory where jobs are unpacked.
    Values: Required

SECURITY_ENABLED
    Determines whether Secure Sockets Layer (SSL) security is enabled. For SSL security to be enabled, the property value must be set to Y.
    Values: Y (default), N

SSL_SERVER_PRIVATE_KEY
    The path to the SSL server private key file, which is in PEM format.
    Values: Required when SSL security is enabled

SSL_SERVER_CERTIFICATE
    The path to the SSL server certificate public key file, which is in PEM format.
    Values: Required when SSL security is enabled

SSL_CLIENT_AUTHENTICATION_ENABLED
    Determines whether to authenticate the client.
    Values: Y (default), N

SSL_TRUSTED_CERTIFICATES
    The path to the trusted certificates file, which is in PEM format. This file can contain multiple client certificates in PEM format.
    Values: Required when SSL security is enabled
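As a concrete sketch, a dmxrund.conf fragment with SSL enabled might look like the following. The installation directory, working directory, and all file paths are placeholder assumptions; the port shown is the documented default, and the exact property syntax should be confirmed against the commented defaults shipped in dmxrund.conf.

```ini
# Illustrative dmxrund.conf fragment -- all paths are placeholders
SERVER_PORT=33636
DMEXPRESS_HOME=/opt/DMExpress
UNPACK_WORK_DIRECTORY=/var/tmp/dmx_unpack
SECURITY_ENABLED=Y
SSL_SERVER_PRIVATE_KEY=/opt/DMExpress/conf/server.key
SSL_SERVER_CERTIFICATE=/opt/DMExpress/conf/server.crt
SSL_CLIENT_AUTHENTICATION_ENABLED=Y
SSL_TRUSTED_CERTIFICATES=/opt/DMExpress/conf/trusted.pem
```

For testing, a self-signed key and certificate in PEM format can be generated with OpenSSL, for example: `openssl req -x509 -newkey rsa:2048 -nodes -keyout server.key -out server.crt -days 365`.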
Linux only: DMX custom impersonation configuration file

If dmxexecutor was established as the impersonated user during Linux pre-installation, you have the option of updating dmxexecutor.conf. Updating properties in dmxexecutor.conf is optional; update it only to customize the impersonation process.
DMX custom impersonation configuration properties

To customize impersonation, DMX custom impersonation configuration properties are defined through the name-value pairs specified in the DMX custom impersonation configuration file, dmxexecutor.conf.

Consider the following custom impersonation configuration properties:
SERVICE_GROUP
    The service group to which the service user belongs.
    Values: dmexpress (default)

MIN_USERID
    The minimum user identification (ID) number, or security access level, that is assigned for impersonation. If the user ID is less than this minimum value, the user is not impersonated and the job run aborts.
    Values: 500 (default)

BANNED_USERS
    A comma-separated list of users that dmxexecutor must not impersonate. All users not listed as banned qualify for impersonation.
    Upon receipt of a job submission request:
    • from a banned user, dmxexecutor rejects the job request, generates an error, and the job aborts.
    • from a user not listed as banned, dmxexecutor calls the DMX engine to run the job.

ALLOWED_USERS
    A comma-separated list of the only users that dmxexecutor may impersonate. All users not listed as allowed are disqualified from impersonation.
    Upon receipt of a job request:
    • from an allowed user, dmxexecutor calls the DMX engine to run the job.
    • from a user not listed as allowed, dmxexecutor rejects the job request, generates an error, and the job aborts.
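For illustration, a dmxexecutor.conf fragment combining these properties might look like the following. The user names are invented placeholders; BANNED_USERS suits a deny-list policy and ALLOWED_USERS an allow-list policy, so typically only one of the two would be set.

```ini
# Illustrative dmxexecutor.conf fragment -- user names are placeholders
SERVICE_GROUP=dmexpress
MIN_USERID=500
# Deny-list style: everyone except these users may be impersonated
BANNED_USERS=root,dbadmin
# Allow-list style: only these users may be impersonated
# ALLOWED_USERS=etluser1,etluser2
```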
Technical Support

If you have a maintenance support agreement for DMX and you encounter difficulties in installing or running DMX, contact Syncsort Incorporated.

In the United States (available 24 hours a day, 7 days a week):
Phone: 1-877-700-8270 or 201-930-8270
E-mail: [email protected]

In other countries:
Contact information can be found by country at https://mysupport.syncsort.com/.