DMX
Install Guide
Version 9.10
Copyright 1990, 2020 Syncsort Incorporated. All rights reserved.
This document contains unpublished, confidential, and proprietary information of Syncsort
Incorporated. No disclosure or use of any portion of the contents of this document may be made
without the express written consent of Syncsort Incorporated.
Getting technical support: Customers with a valid maintenance contract can get technical assistance
via MySupport. There you will find product downloads and documentation for the products to which
you are entitled, as well as an extensive knowledge base.
Version 9.10
Last Update: 21 October 2020
Contents

DMX Overview
Installing DMX/DMX-h
    DMX-h Overview
    Prerequisites
    Step-by-Step Installation
    Configuring the DMX Run-time Service
    Applying a New License Key to an Existing Installation
Running DMX
    Graphical User Interfaces
    DMX Help
Connecting to Databases from DMX
    Amazon Redshift
    Azure Synapse Analytics (formerly SQL Data Warehouse)
    Databricks
    DB2
    Greenplum
    Hive data warehouses
    Apache Impala
    Microsoft SQL Server
    Netezza
    NoSQL Databases
    Oracle
    Snowflake
    Sybase
    Teradata
    Vertica
    Other DBMSs
    Defining ODBC Data Sources
Connecting to Message Queues from DMX
    IBM WebSphere MQ
Connecting to Salesforce from DMX
Connecting to SAP from DMX
    Registering DMX in SAP SLD
Connecting to HDFS from DMX
Connecting to Connect:Direct nodes from DMX
    Security
    Installation and Configuration
Connecting to Databricks File Systems (DBFSs)
    Databricks File System (DBFS) connection requirements
    Defining Databricks File System (DBFS) connections
Connecting to CyberArk Enterprise Password Vault
    CyberArk Licenses
Connecting to Protegrity Data Security Gateway
Connecting to QlikView data eXchange files from QlikView or Qlik Sense
    QlikView desktop installation overview
    Qlik Sense desktop installation overview
Connecting to Tableau Data Extract files from Tableau
    Tableau desktop installation overview
Removing DMX/DMX-h from Your System
DMX installation component options
    DMX Management Service installation and configuration
    DMX DataFunnel run-time service install and configuration
Technical Support
Documentation Conventions

The following conventions are used in the format sections of the command options in this manual.

Regular type: Items in regular type must be entered literally using either lowercase or uppercase
letters. Items may be abbreviated. Examples: ASCII, ascending

Italics (non-bold): Items in italics (non-bold) represent variables. You must substitute an
appropriate numerical or text value for the variable. Example: file_name

Braces { }: Braces indicate that a choice must be made among items contained in the braces. The
choices may be presented in an aligned column, or on one line separated by a vertical bar ( | ).
Examples: {"a" }, {X"xx" }, {AND | OR}

Brackets [ ]: Brackets indicate that an item is optional. A choice may be made among multiple items
contained in brackets. Examples: [alias], [+ | -]

Slash /: A slash identifies a DMX option keyword. The slash must be included when an option
keyword is specified. Examples: /INFILE, /infile

Double quotes " ": Double quotation marks that appear in a format statement must be specified
literally. Example: "b"-"e"

Ellipsis …: An ellipsis indicates that the preceding argument or group of arguments may be
repeated. Example: [expression…]

Sequence number: A sequence number indicates that a series of arguments or values may be
specified. The sequence number itself must never be specified. Example: field2
DMX Overview

DMX™ is a high-performance data transformation product. With DMX you can design, schedule, and
control all your data transformations from a simple graphical interface on your Windows desktop.
Data records can be input from many types of sources such as database tables, SAP systems,
Salesforce.com objects, flat files, XML files, pipes, etc. The records can be aggregated, joined, sorted,
merged, or just copied to the appropriate target(s). Before output, records can be filtered,
reformatted, or otherwise transformed.
Metadata, including record layouts, business rules, transformation definitions, run history and data
statistics, can be maintained either within a specific task or in a central repository. The effects of
making a change to your application can be analyzed through impact and lineage analysis.
You can run your data transformations directly from your desktop, on any UNIX or Windows server,
or schedule them for later execution, embed them in batch scripts, or invoke them from your own
programs.
Installing DMX/DMX-h

The DMX components that are installed depend on your license key:
• DMX server license key installs components based on whether you select a Standard, Full,
Classic, or Custom installation. See DMX installation component options.
• DMX workstation license key installs the development client, Job and Task Editors; the
DMX engine, dmxjob/dmexpress; and the service for development client, which is the DMX
Run-time Service, dmxd.
The version of DMX server software must be at least as high as the version of the DMX client
software that is used to develop jobs and connect to the server. Thus, when installing a new version
of DMX, ensure that you install the same release of DMX on your client and server machines. If you
are upgrading and unable to install both the client and the server at the same time, you need to
upgrade the server prior to upgrading the client.
DMX-h Overview

DMX-h is the Hadoop-enabled edition of DMX, providing the following Hadoop functionality:
• ETL Processing in Hadoop – Develop a DMX-h ETL application entirely in the DMX GUI to
run seamlessly in the Hadoop MapReduce framework, with no Pig, Hive, or Java
programming required. Currently, jobs can be run in either MapReduce or Spark. See the
online DMX Help topic "DMX-h".
• Hadoop Sort Acceleration – Seamlessly replace the native sort within Hadoop MapReduce
processing with the high-speed DMX engine sort, providing performance benefits without
programming changes to existing MapReduce jobs. See the DMX-h Sort User Guide, which is
included in the Documentation folder under your DMX software installation directory.
• Apache Spark Integration – Use the Spark mainframe connector to transfer mainframe data
to HDFS. See the online DMX Help topic “Spark Mainframe Connector”.
• Apache Sqoop Integration – Use the Sqoop mainframe import connector to transfer
mainframe data into HDFS. See the online DMX Help topic "Sqoop Mainframe Import
Connector”.
DMX-h Requirements

DMX-h requires the following:
• DMX-h Edition
• A supported Hadoop MapReduce and/or Spark distribution:
o MapReduce
▪ Cloudera CDH 5.x (5.2 and higher) – YARN (MRv2)
▪ Hortonworks Data Platform (HDP) 2.x (2.3 and higher) – YARN
▪ Apache Hadoop 2.x (2.2 and higher) – YARN
▪ MapR, Community Edition and Enterprise Edition only (previously termed M5 and
M7, respectively), 6.x – YARN
▪ Pivotal HD 3.0 – YARN
DMX-h is certified as ODPi (1.0 and higher) interoperable.
o Spark
▪ Spark on YARN on the following Hadoop distributions:
• Cloudera CDH 5.x (5.5 and higher)
• Hortonworks Data Platform (HDP) 2.3.4, 2.x (2.4 and higher)
• MapR 5.x (5.1 and higher), Community Edition and Enterprise Edition only
(previously named M5 and M7, respectively)
▪ Spark on Mesos 0.21.0
▪ Spark Standalone 1.5.2 and higher
DMX-h Component Setup and Operation

A DMX-h setup consists of the following:
• Windows workstation
o DMX must be installed as described in Step-by-Step Installation, Windows Systems.
o DMX Job and Task Editors are used for MapReduce job development.
o MapReduce jobs are submitted to Hadoop via the ETL server from the Job Editor.
• Linux ETL server (edge node)
o DMX must be installed as described in Step-by-Step Installation, UNIX Systems.
o The Hadoop client must be installed and configured to connect to the Hadoop cluster.
o The DMX Run-time Service, dmxd, must be running to respond to jobs run via the
Windows workstation; it calls dmxjob with the /HADOOP option, which ultimately calls
hadoop to submit jobs to the cluster.
• Hadoop cluster
o DMX must be installed without dmxd on all nodes in the Hadoop cluster as described in
Step-by-Step Installation, Hadoop Cluster.
o Each mapper and reducer runs the map side or reduce side task(s), respectively.
o All file descriptors for sources, targets, and intermediate files are carefully connected so
they fit into the Hadoop MapReduce flow.
Prerequisites

Before you install DMX on your system, ensure that the following are available:
• DMX software: This is generally downloaded from Syncsort’s web site as a self-extracting
executable file (Windows) or a tar file (UNIX).
• DMX license key: License keys are sent via e-mail as an attachment file called
DMExpressLicense.txt. If you need specific system information to obtain a license key, refer
to the section below on Getting DMX License Information.
If you have a DMX server license key and plan to install DMX installation components, the type of
user that you set up depends on whether impersonation privileges are extended. See DMX
installation user setup considerations.
• Operating system: DMX runs on the following operating systems, with the listed release
being the minimum supported. Both 32-bit and 64-bit versions are supported, unless
otherwise stated: AIX release 6.1 64-bit; HP-UX release 11.31 IA64 64-bit; Linux kernel
version 2.6.18 to 2.6.31 with C library version 2.5 to 2.11 on Pentium-class x86_64 64-bit
machines; Linux kernel version 2.6.16 with C library version 2.4 on IBM System z 64-bit
mainframes; SunOS 5.10 SPARC 64-bit; Windows Vista; Windows 7; Windows 8.x; Windows
10; and Windows Server 2008, 2012, and 2012 R2.
• Java version requirements: On Windows and UNIX/Linux systems, DMX requires Java
runtime version 1.7 or higher unless you are only running DMX Sort, which does not use
Java. DMX requires JDK 7.
• Communication security protocol: On Windows and UNIX/Linux systems, DMX supports
Transport Layer Security (TLS) up to and including TLS version 1.2.
• User rights: Sufficient privileges to install and start Windows Services for Windows
platforms and root privileges to install and start UNIX daemons on UNIX platforms. An
umask setting of 022 is required so that other users can run the installed executables. The
installation procedure sets and resets umask if required.
• Pluggable Authentication Modules (PAM): If you want to use PAM for authentication on
UNIX or Linux platforms, PAM must be installed and configured on the system.
• Database client software: If you want DMX to access data in database tables (either as data
source or target), then the appropriate database client software must be on the system and
accessible via the appropriate shared library or dynamic link library (dll) paths.
For example, to access an Oracle database, Oracle Client must be installed on the system
where you run DMX; to access a database via ODBC, an ODBC data source must be defined
on the system where you run DMX. For details on how to connect to a specific Database
Management System (DBMS), refer to the section Connecting to Databases from DMX.
• Message queue client software: If you want DMX to access data in a message queue, then the
appropriate message queue client software must be on the system and accessible via the
appropriate shared library or dynamic link library (dll) paths.
For example, to access an IBM WebSphere MQ queue, IBM WebSphere MQ client must be
installed on the system where you run DMX. For details on how to connect to a specific
message queue type, refer to the section Connecting to Message Queues from DMX.
• SAP client software: If you want DMX to access data in an SAP system, then the appropriate
SAP client software must be installed on the system where you run DMX and accessible via
the appropriate shared library or dynamic link library (dll) paths. For details on how to
connect to an SAP system, refer to the section Connecting to SAP from DMX.
• Hadoop software – If you want DMX to access data in a Hadoop Distributed File System
(HDFS), or you want to run DMX-h ETL MapReduce jobs, then a Hadoop distribution
configured to access the cluster must be installed on the edge/ETL node from which you run
DMX. For details on how to connect to HDFS, refer to the section Connecting to HDFS from DMX.
• Connect:Direct software – If you want DMX to access data using a Connect:Direct
connection, a Connect:Direct server and client (CLI/API) must be installed on the system
where you run DMX and must be configured to access the required Connect:Direct nodes. For
details on how to connect to a Connect:Direct node, refer to Connecting to Connect:Direct nodes from DMX.
• QlikView software – DMX supports QlikView data eXchange (QVX) files as targets. To access
QVX files as sources from QlikView or Qlik Sense, refer to Connecting to QlikView data
eXchange files from QlikView or Qlik Sense.
• Tableau software – DMX supports Tableau Data Extract (TDE) files as targets. To access
TDE files as sources from Tableau, refer to Connecting to Tableau Data Extract files from
Tableau.
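Among the prerequisites above, the umask requirement (022) can be checked from the shell before
running the installer. The following is a sketch only; note that the installation procedure also
sets and resets umask itself if required:

```shell
# Sketch only: check that the current umask is 022 before installing,
# so that other users can execute the installed binaries, and set it
# for this shell session if it is not.
current=$(umask)
case "$current" in
    022|0022) : ;;      # already correct, nothing to do
    *) umask 022 ;;     # set for this shell session only
esac
```

This only affects the current shell session; the installer still verifies the setting itself.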
DMX installation user setup considerations

The type of user that you set up to install DMX installation components depends on whether
impersonation privileges are extended:
• If you plan to use impersonation when running the DMX Run-time Service, dmxd, you must
install as root.
• When running the DataFunnel Run-time Service, dmxrund, considerations exist for the type
of user that installs components.
User setup when running dmxrund

If you do not plan to use impersonation when running dmxrund, set up a non-administrative user to
install and run on Windows or set up a service user to install and run on Linux.
Set up a non-administrative/service user
Windows

As the administrative user has impersonation privileges by default, set up a new user who does not
have administrative rights.
Linux

To install and run job requests without impersonation, create a service user, dmxuser, and run the
installation as dmxuser.
Set up impersonation

If you plan to use impersonation when running dmxrund, no user setup is required to install and run
on Windows; set up an impersonated user to install and run on Linux.
Windows

As the administrative user has impersonation privileges by default, no setup is required.
Linux

DMX installation impersonation considerations on Linux follow:
• No impersonation – Running jobs without impersonation does not require root access. Upon
receipt of a job submission request from the DMX management service, dmxmgr, dmxrund
calls the DMX engine, dmxdfnl, to run the submitted job as the service user, dmxuser.
• Impersonation – Running jobs with impersonation requires root access to impersonate the
specified user. While dmxrund is never granted root access, another installed component,
dmxexecutor, can enable impersonation. When dmxrund detects that dmxexecutor is
installed in the required directory with the correct permissions, dmxrund calls dmxexecutor
to impersonate the specific user that calls the DMX engine, dmxdfnl, which runs the
submitted jobs.
To install and run job requests with impersonation, do the following:
• Create a service user, dmxuser.
• Create a service group, dmexpress.
Note: If you choose to change the name of the service group, you must update the SERVICE_GROUP
property of the DMX custom impersonation configuration properties file.
• Add dmxuser to the service group.
• Run the installation as dmxuser.
• Ensure that the following files are in the specified directories with the specified permissions:
Directory and file: <DMX_installation>/bin/dmxexecutor
Permissions: -rwsr-x---
Notes: The ‘s’ represents the set-user identification (setuid) bit and indicates that dmxexecutor
is extended impersonation privileges to run submitted jobs as a specific user.

Directory and file: <DMX_installation>/conf/dmxexecutor.conf
Permissions: -rwx------
Notes: Updates to dmxexecutor.conf are required only if you choose to customize the impersonation.
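The permission bits listed above can be applied with standard commands. The following is a sketch
only: the install path is a hypothetical example (for demonstration it creates placeholder files
under /tmp), and in a real installation dmxexecutor should additionally be owned by root with
group dmexpress (chown root:dmexpress), which requires root access:

```shell
# Sketch only: DMX_HOME is a hypothetical path; substitute your own
# installation directory. Placeholder files are created here so the
# chmod commands have something to act on.
DMX_HOME=/tmp/dmx_demo
mkdir -p "$DMX_HOME/bin" "$DMX_HOME/conf"
touch "$DMX_HOME/bin/dmxexecutor" "$DMX_HOME/conf/dmxexecutor.conf"

# -rwsr-x--- : setuid bit (4) plus owner rwx (7), group r-x (5), other --- (0)
chmod 4750 "$DMX_HOME/bin/dmxexecutor"

# -rwx------ : configuration file accessible by the owner only
chmod 700 "$DMX_HOME/conf/dmxexecutor.conf"
```

Verify the result with `ls -l "$DMX_HOME/bin/dmxexecutor"`, which should show -rwsr-x---.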
Getting DMX License Information

To obtain a license key, you need the computer name, the hardware model, the number of processors,
and the operating system of each system on which DMX is to run. You can gather the system
information by running the DMX License Information program.
Windows Systems

You can run the DMX License Information program in the following ways:
• From Syncsort’s web site at:
http://www.syncsort.com/software/licenseinfo.exe
• If DMX is installed, go to Programs, DMExpress from the Start menu and select License Information.
The program prepares a license information document with the system information and then
displays it in a Notepad window. You can save the form (File, Save As) and e-mail it to Syncsort or to
your local DMX sales agent. The information is then used to create your license key(s).
UNIX Systems

You can run the DMX License Information program in the following ways:
• From Syncsort’s web site at:
http://www.syncsort.com/software/licenseinfo.sh
• If DMX is installed, go to the <dmx_home>/bin directory, where <dmx_home> denotes the
directory where DMX is installed, and type:
./licenseinfo
The program generates and displays a text file named SyncsortLicenseInfo.txt in the current
directory. You can e-mail the file to Syncsort or to your local DMX sales agent. The information is
then used to create your license key(s).
Step-by-Step Installation
Windows Systems
Interactive Installation
1. Make sure that any previous version of DMX has been removed (see Removing DMX from
Your System later in this guide if necessary).
2. Install DMX by running the setup program extracted from the downloaded executable,
either directly or via Control Panel, Add/Remove Programs:
\Windows\x86\setup.exe for 32-bit Windows
\Windows\x64\setup.exe for 64-bit Windows x64
3. You are prompted to either enter a license key or start a free trial. If you selected to enter
a license key, you can type in the location of the DMExpressLicense.txt file, or browse to it,
when prompted. You can also enter the license key manually.
4. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
5. Review the product options, components and features that are enabled by your license key.
6. If your license key is a
• DMX server license key, a menu displays from which you select from among the
component options:
o Standard
o Full
o Classic
o Custom
For information on these options, see DMX installation component options. Select an
option and make the appropriate selections.
• DMX workstation license key, no component options display for selection.
You are eligible for the classic DMX/DMX-h installation, which installs the development
client, Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the service for
development client, which is the DMX Run-time Service, dmxd.
7. Confirm the file folder into which you want to install DMX. The file folder is subsequently
referred to as <dmx_home>.
8. Select the program folder in which you want the DMX icons to appear.
9. Review the Setup Information; choose Back to change these options or Install to complete the
installation.
10. If your license key enables the DMX Run-time Service, select the configuration options for
the Service. You can also configure the DMX Run-time Service later via Control Panel,
Administrative Tools, Services.
11. You may be prompted to choose whether to automatically run SyncSort jobs in DMX, either
immediately or after subsequent un-installation of SyncSort, depending on the presence of the
SyncSort Conversion license option and an existing installation of SyncSort.
12. Upon setup completion, a list of menu shortcuts displays in the DMX program folder, which is
available through the Windows Start menu.
13. To run the Connect Portal web UI, you must configure the DMX management service,
dmxmgr, including authentication. Then, start the DMX Management Service via Control Panel,
Administrative Tools, Services.
14. To run copy projects in Connect Portal, start the DataFunnel Run-time Service via Control
Panel, Administrative Tools, Services. See DMX DataFunnel run-time service installation and
configuration for more details.
15. To run CDC replication projects in Connect Portal, separately install the latest version of
MIMIX Share. The MIMIX Share Listener service starts automatically when you install MIMIX
Share, but the listener service must be running to run CDC replication projects in Connect
Portal.
If you performed a full install, including the development client, the following menu shortcuts
display:
• DMExpress
• Apply a New License Key
• DMExpress Application Upgrade
• DMExpress Global Find
• DMExpress Help
• DMExpress Job Editor
• DMExpress Server
• DMExpress Task Editor
• DataFunnel
• License Information
• Reference Guides
• Release Notes
If you performed a standard or classic install, with the development client but not the Management
Service, the following menu shortcuts display:
• DMExpress
• Apply a New License Key
• DMExpress Application Upgrade
• DMExpress Global Find
• DMExpress Help
• DMExpress Job Editor
• DMExpress Server
• DMExpress Task Editor
• License Information
• Reference Guides
• Release Notes
If you installed the Management Service only (custom install), the following menu shortcuts display:
• DMExpress
• Apply a New License Key
• DMExpress Help
• DataFunnel
• License Information
• Reference Guides
• Release Notes
If you did not install the development client or the Management Service, the following menu
shortcuts display:
• DMExpress
• Apply a New License Key
• DMExpress Help
• Documentation
• License Information
• Reference Guides
• Release Notes
If you have ActiveX-based SyncSort applications which you choose to run with DMX, and you
subsequently uninstall SyncSort, you may need to re-register the SyncSortX ActiveX control. To
register the ActiveX control, open a command prompt and type the following command:
regsvr32.exe <dmx_home>/Programs/SyncSortX.dll
Silent Installation

Silent installation requires a silent setup file that can be recorded during an interactive installation.
Installation steps may differ depending on product licensing, so changing the version of DMX or
adding or removing packages may require re-recording the silent setup file.
To record the installation options

Open a command prompt; type the full path for the installation program followed by the options:
–r –f1<silent_setup_file>
where <silent_setup_file> is the full path for the file to record the installation options. If you are
installing from a downloaded image which is located in c:\downloads, you would type a command
like:
C:\downloads\DMExpress_1-4_windows.exe –r –f1c:\temp\setup.iss
An interactive installation starts and all the selected installation options are saved in the specified
file.
To run the installation in silent mode

Open a command prompt; type the pathname of the install executable followed by the options:
–s –f1<silent_setup_file> -slog<log_file>
where <silent_setup_file> is the full path for the file that was previously used to record the
installation options, and <log_file> is the full path for the installation log file generated by silent
installation. If you are installing from a downloaded image which is located in c:\downloads, you
would type a command like:
C:\downloads\DMExpress_1-4_windows.exe –s –f1c:\temp\setup.iss
If you do not specify the –slog option, then setup generates a log of the silent installation, setup.log,
in the folder from which the setup is run or in the folder where the specified silent setup file is
located.
Multiple command line options are separated with a space, but there should be no spaces inside a
command line option (for example, –slogc:\setup.log is valid, but –slog c:\setup.log is not).
Note: When running silent installation on a machine with User Account Control enabled, an
administrator command prompt or batch file can be used to avoid the initial prompt by the operating
system requesting elevated privileges. To start a Command Prompt with administrative privileges,
right-click the Command Prompt shortcut and select "Run as administrator".
UNIX Systems
Prerequisites for COBOL Support

DMX can be used to accelerate COBOL SORT and MERGE verbs or to process COBOL data files as
source or target. In order to use these features, you must have a license to use the COBOL compiler
on the system where the DMX task runs.
Micro Focus COBOL or Server Express

The following variables must be set prior to installation: the COBDIR and PATH variables must be
set and exported to include the COBOL compiler, and the following environment variable for shared
libraries must be set to include all the shareable libraries used by the compiler and exported on the
corresponding platform:
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
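On Linux, for example, the exports described above might look like the following. This is a sketch
only: the Micro Focus install path shown is a hypothetical example, and on AIX, HP-UX, or Solaris
the corresponding shared-library variable from the table above replaces LD_LIBRARY_PATH:

```shell
# Sketch only: /opt/microfocus/cobol is a hypothetical install path;
# substitute the actual location of your COBOL compiler.
export COBDIR=/opt/microfocus/cobol

# Make the compiler executables and its shareable libraries visible.
export PATH="$COBDIR/bin:$PATH"
export LD_LIBRARY_PATH="$COBDIR/lib:${LD_LIBRARY_PATH:-}"
```

Run these exports (or place them in the installing user’s profile) before starting the DMX install
script, so the installer can locate the compiler.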
AcuCorp’s ACUCOBOL-GT™ COBOL Development System

Support for ACUCOBOL-GT™ is available on the following UNIX platforms:
Operating System Architecture
HP-UX 64-bit for Itanium
AIX 64-bit on PowerPC
SunOS 64-bit on SPARC processors
The bit level of DMX must match that of the ACUCOBOL-GT™ installation.
Before running the DMX install script, set the environment variable ACUCOBOL:
export ACUCOBOL=<acucobol_install_dir>
where <acucobol_install_dir> is the location of your ACUCOBOL-GT™ installation. If the
environment variable COBDIR is set, unset it:
unset COBDIR
Once DMX has been installed, additional steps need to be performed to enable support for
AcuCOBOL. Please refer to the DMX online help topic “Installing support for AcuCOBOL.”
COBOL-IT

Support for COBOL-IT line sequential files is available on the following UNIX platforms:
Operating System Architecture
AIX 64-bit on PowerPC
Linux 64-bit for Intel-compatible processors
The bit level of DMX must match that of the COBOL-IT installation. The minimum supported
COBOL-IT version is 3.7.
Before running the DMX install script, do the following:
• Unset the environment variable COBDIR, if set:
unset COBDIR
• Set the environment variable COBOLITDIR:
export COBOLITDIR=<cobol-it_install_dir>
where <cobol-it_install_dir> is the location of your COBOL-IT installation.
To configure COBOL-IT runtime environment variables, refer to the DMX online help topic,
“Installing support for COBOL-IT.”
Informix C-ISAM Support

If you plan to use DMX to process Informix C-ISAM files, the environment variable INFORMIXDIR
must be set and exported prior to running the install script. The directory $INFORMIXDIR/lib must
contain the library libisam.a.
Unikix VSAM Support

If you plan to use DMX to process Unikix VSAM files, the environment variable UNIKIX must be set
and exported prior to running the install script. The directory $UNIKIX/lib must contain the library
libbcisam.a.
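As a minimal sketch of the two requirements above (the install paths shown are hypothetical
examples; substitute your own), the variables can be exported and the required libraries checked
before running the install script:

```shell
# Sketch only: hypothetical install locations; substitute your own.
export INFORMIXDIR=/opt/informix
export UNIKIX=/opt/unikix

# The install script expects these libraries to be present:
#   $INFORMIXDIR/lib/libisam.a    (Informix C-ISAM)
#   $UNIKIX/lib/libbcisam.a       (Unikix VSAM)
for lib in "$INFORMIXDIR/lib/libisam.a" "$UNIKIX/lib/libbcisam.a"; do
    [ -f "$lib" ] || echo "missing: $lib"
done
```

If either library is reported missing, verify the client installation before proceeding with the
DMX install.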
Interactive Installation
1. If you are installing from a tar file that you downloaded from Syncsort’s web site, extract the
contents of the tar file on your UNIX system using a command similar to:
tar xvof DMEXPRESS.TAR
This creates a directory dmexpress under the current directory.
2. Log in as user root if you wish to install or configure the DMX Run-time Service. The DMX
Run-time Service allows you to submit tasks or jobs from the DMX Task Editor or Job Editor
components, running on remote desktops, to execute on this DMX server.
To install using downloaded software, navigate to the dmexpress directory created when you
extracted the contents of the tar file and then run the install program. For example,
cd /usr/tmp/dmexpress
./install
3. Depending on your system and the licensed options, you may be asked several questions. For
example, on platforms where both a 32-bit and a 64-bit version of DMX are available, you are
asked to choose which one you would like to install.
You are prompted to either enter a license key or start a free trial. If you choose to enter a
license key, specify the location of the license key file, DMExpressLicense.txt, when
prompted.
4. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
5. Review the product options, components and features that are enabled by your license key.
If your license key is a
• DMX server license key, a menu displays from which you select from among the
component options:
DMExpress Components
DMExpress Engine
Service for Development Client
DataFunnel Run-time Service
Management Service
System
Computer name: ...
License Expiry Date
...
For information on these options, see DMX installation component options.
• DMX workstation license key, no component options display.
A workstation license entitles you to the classic DMX/DMX-h installation, which installs the
development client, the Job and Task Editors; the DMX engine, dmxjob/dmexpress; and the
service for the development client, which is the DMX Run-time Service, dmxd.
6. Specify the directory into which you want to install DMX. This directory is subsequently
referred to as <dmx_home>.
7. If you logged on as root, you are prompted to indicate your choice for configuring the DMX
Run-time Service. You can start the service immediately and choose whether it starts at
system restart. You can also select PAM authentication if it is available on the system. To
configure the DMX Run-time Service at a later time, run the installation procedure as root
from the DMX installation directory. See run-time service install and configuration for
additional information.
8. You may be prompted to choose whether to automatically run SyncSort jobs in DMX, either
immediately or after subsequent un-installation of SyncSort, depending on the presence of
the SyncSort Conversion license option and an existing installation of SyncSort.
9. If you have a DMX server license, you are given the option to install the DataFunnel Run-
time Service and the option to install Management Service.
10. When the installation procedure completes, update your environment variables. Add
<dmx_home>/bin to your PATH, and add <dmx_home>/lib to the shared library path, for
example, by updating your profile. The environment variable that must be set for specific
platforms is as follows:
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
11. To run the Connect Portal web UI, you must configure the DMX management service,
dmxmgr, including authentication. Then, start the DMX Management Service, dmxmgr. See
configure the DMX management service for more details.
12. To run copy projects in Connect Portal, start the DataFunnel Runtime Service, dmxrund. See
DMX DataFunnel run-time service installation and configuration for more details.
13. To run CDC replication projects in Connect Portal, separately install the latest version of
MIMIX Share. The MIMIX Share Listener service starts automatically when you install MIMIX
Share, but the listener service must be running to run CDC replication projects in Connect
Portal.
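For example, the PATH and shared library settings from step 10 might be added to a login profile as follows on a Linux system; the installation directory shown is a placeholder for your actual <dmx_home>.

```shell
# Example profile additions (Linux). Substitute your actual <dmx_home>;
# on AIX set LIBPATH and on HP-UX set SHLIB_PATH instead of LD_LIBRARY_PATH.
DMX_HOME=/usr/software/DMExpress
PATH=$DMX_HOME/bin:$PATH; export PATH
LD_LIBRARY_PATH=$DMX_HOME/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
```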
Silent Installation
A silent installation allows you to easily install DMX on multiple machines with identical options.
You simply install interactively on the first machine using the record option to save your responses
to installation prompts in a file. Then you run the silent installation on the remaining machines,
pointing to the recorded response file. Because the silent installation is non-interactive, it can be
scripted to effectively automate installation on many machines.
1. To prepare to run the silent installation, initiate the interactive installation on the first
machine as described in the section above, but in step 3, run the install command with the
record option, -r, specifying the file in which to store your responses to installation prompts
as follows:
./install -r <silent_setup_file>
2. Upon successful completion of the interactive installation, run the install program with the
silent option, -s, and the silent log option, -slog, on the remaining machines that require
installation as follows:
./install -s <silent_setup_file> -slog <log_file>
where:
o <silent_setup_file> is the full path to the response file generated by the interactive
installation.
o <log_file> is the full path to the log file generated by the silent installation.
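As a sketch of such scripting, the silent install could be pushed to the remaining machines over ssh. The helper function, host names, and file paths below are hypothetical placeholders for your environment.

```shell
# Hypothetical automation sketch for the silent installation.
silent_install_cmd() {
    # $1 = response file path, $2 = log file path; emits the install command
    echo "cd /usr/tmp/dmexpress && ./install -s $1 -slog $2"
}

# for h in host2 host3 host4; do
#     scp DMEXPRESS.TAR /usr/tmp/dmx_setup.txt "$h:/usr/tmp/"
#     ssh "$h" "cd /usr/tmp && tar xvof DMEXPRESS.TAR"
#     ssh "$h" "$(silent_install_cmd /usr/tmp/dmx_setup.txt /usr/tmp/install.log)"
# done
```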
Hadoop Cluster
DMX-h must be installed on all the nodes in the Hadoop cluster using one of the following methods:
• Managed Methods - recommended for large clusters
o Cloudera Manager Parcel Installation – Store the parcel in the Cloudera Manager local
or remote parcel repository (requires root/sudo privileges), then distribute and activate
the parcel on the cluster nodes via Cloudera Manager (requires Administrator access to
Cloudera Manager). Available as of Cloudera Manager 4.5.
o Apache Ambari Service Installation – Deploy the DMX-h Service Definition Package to
the Ambari repository, then install DMX-h on the nodes in the cluster using the Ambari
web interface (requires root/sudo privileges). Available as of Ambari 1.7.
o RPM Installation – Deploy the RPM (Red Hat Package Manager) on all nodes in the
cluster, then use the RPM to install DMX-h on all nodes in the cluster (requires root/sudo
privileges).
• Manual/Silent Installation – Install DMX-h on one node and replicate on all remaining nodes
The DMX Run-time Service (dmxd) only needs to be running on the node(s) to which you want to
submit jobs from the DMX GUI; typically, this is the machine designated as the edge node. When
installing DMX-h using any of the managed methods, the DMX Run-time Service is not installed.
See Installing/Upgrading the DMX Run-time Service for instructions on how to do this on the edge
node.
Installation Packages for Managed Methods
There are two separate installation packages for DMX-h, one for the software and another for the
license. If you do not already have a license installed, install a license package along with the
software package. If the license isn't installed, DMX-h runs in trial mode, which eventually expires
and stops working.
If you want to upgrade from a release before the introduction of the second license package, you must
install both the software and license packages.
Cloudera Manager Parcel Installation
Note: Cloudera Manager does not support the mixing of parcels with any other managed
install method, and doing so could result in your Hadoop cluster not restarting.
Pre-Installation
Execute the following steps on the machine where Cloudera Manager is installed:
1. Run the self-extracting shrink-wrap executables for the software and license packages from
the directories where they are located. For the software executable, this is:
./dmexpress-<DMX version>-<OS>.parcel.bin
For the license executable, this is:
./dmexpresslicense_<license site ID>-<date>-<OS>.parcel.bin
For example, dmexpresslicense_12345-20190928-el6.parcel.bin
2. Read and accept the Software License Agreement.
3. Enter a target directory in which to put the extracted .parcel, .sha, and manifest.json files.
The manifest.json file is required to use DMX via a remote parcel repository. The default is
the current folder.
Installation
Install the DMX-h (dmexpress) parcel and the DMX-h license (dmexpresslicense-XXXXX) parcel on
all nodes in the cluster as follows:
1. Depending on whether you are using a local parcel repository or a remote parcel repository,
do one of the following:
• Local parcel repository – With root/sudo privileges, copy the extracted .parcel and .sha
files for software and license to the Cloudera Manager local parcel repository. The default
location is /opt/cloudera/parcel-repo/.
• Remote parcel repository – With root/sudo privileges, copy the extracted .parcel and
manifest.json files for software and license to your remote parcel repository. Ensure that
the files have read and execute permissions for all users. As outlined on Cloudera’s
Creating and Using a Parcel Repository page, follow the steps to Configure the Cloudera
Manager Server to Use the Parcel URL.
2. Logged in to Cloudera Manager as an Administrator user, click on the parcel indicator
button in the Cloudera Manager Admin console navigation bar to bring up the Parcels tab of
the Hosts page.
3. If not already detected, click on the Check for New Parcels button. Consider the following:
• If you are using a local parcel repository, you can see the “downloaded” parcels on this
page, for example, dmexpress 9.8.1 and/or dmexpresslicense_12345-20180928.
• If you are using a remote parcel repository, click on the Download button to download the
dmexpress and/or dmexpresslicense-XXXXX parcel from the remote repository.
Click on the Distribute button to distribute the dmexpress and/or dmexpresslicense-XXXXX
parcel to the nodes in the cluster. By default, the files are written to
/opt/cloudera/parcels/parcel_name/ on each node.
4. Upon completion of the distribution, either or both parcels can be activated by clicking on its
Activate button. If there was a previously activated distribution of DMX-h, be sure that no
DMX-h jobs are running, because Cloudera Manager automatically deactivates the old parcel
upon activation of the new parcel, and any running jobs fail.
5. Upon activation, the symbolic link /usr/dmexpress is created/updated to point to the
activated DMX installation.
See the Cloudera Manager Enterprise Edition User Guide for details on Managing Parcels.
Apache Ambari Service Installation
Pre-Installation
Execute the following steps on the machine where the Ambari server resides:
1. Run the self-extracting shrink-wrap executable for the software package from the directory
where it is located. For the software executable, this is:
./dmexpress-<DMX version>-<OS>.ambari-service.bin
For the license executable, this is:
./dmexpresslicense-<license site ID>-<date>-<arch>.ambari-service.bin
e.g. dmexpresslicense-12345-20180928-any.ambari-service.bin
2. Read and accept the Software License Agreement.
3. Enter a target directory in which to extract the DMX-h or DMX-h license service folder, or
press Enter to accept the default, which is the current directory. If a folder with the same
name already exists, you are prompted to overwrite; enter yes to overwrite, or no to exit the
extracting process.
4. Enter a target directory in which to copy the DMX-h or DMX-h license service package where
it can be found by the Ambari server, or press Enter to accept the default, which is the root
path of the latest stack.
5. Enter yes to restart the Ambari server for the new package to be picked up, or no to restart
later.
6. If the service definition for DMX-h or the DMX-h license already exists in the repository,
you are prompted to upgrade; enter yes to upgrade, or no to exit the process without
updating the existing service definition package.
7. Enter the Ambari server's hostname, username, and password, and the cluster name, as
prompted, to complete the upgrade.
a. If the credentials entered fail, you can re-run this step manually by executing the
following script, where <Ambari service extracted package path> is the directory you
specified in step 3:
<Ambari service extracted package path>/services/DMXh/package/scripts/prepare_dmxh_upgrade.sh
b. If the credentials entered fail for the license package, execute this script:
<Ambari service extracted package path>/services/DMXhLicense/package/scripts/prepare_dmxh_license_upgrade.sh
8. If there is no license installed, repeat steps 1-7 for the license .bin file.
Installation
Install the DMX-h and/or DMX-h License service on all nodes in the cluster as follows:
1. Log in to the Ambari dashboard and select Actions->Add Service.
2. On the Add Service Wizard page, select DMX-h and/or DMX-h License and click Next.
3. On the Assign Slaves and Clients page, check Client for all nodes, and click Next.
4. On the Configure Services page, click Next to continue with the default options
(recommended). Alternatively, if you wish to change the default installation directory,
expand the “Advanced” section and make changes to the DMX-h Base Directory setting,
ensuring that the same directory is specified for both the DMX-h and DMX-h License tabs,
and then click Next.
5. On the Review page, verify the configuration and click Deploy to deploy DMX-h and/or DMX-
h License, or click Back to make modifications.
6. On the Install, Start and Test page, wait for the DMX-h and/or DMX-h License service to be
successfully installed on each node. If an error occurs, select the "Failures encountered" text
to display an error log and identify the problem.
See http://docs.hortonworks.com/ for details on Apache Ambari.
RPM Installation
Pre-Installation
Execute the following steps on one node in, or with access to, the Hadoop cluster:
1. Run the self-extracting shrink-wrap executable for the software and license packages from
the directories where they are located. For the software RPM, this is:
./dmexpress-<DMX version>-1.x86_64.bin
For the license RPM, this is:
./dmexpresslicense-<license site ID>-<date>-<revision>.<arch>.bin
e.g. dmexpresslicense-12345-20180927-1.x86_64.bin
2. Read and accept the Software License Agreement.
3. Enter a target directory in which to put the extracted RPM file (the default is the current
folder).
Installation
You can deploy the RPM on all nodes in the cluster using configuration management software or
install the DMX-h RPM package on all nodes in the cluster directly:
1. Execute the following command with sudo or root privileges:
rpm -i dmexpress-<DMX version>-1.x86_64.rpm
The license RPM equivalent command is:
rpm -i dmexpresslicense-<license site ID>-<date>-<revision>.<arch>.rpm
This creates a dmexpress folder under the default install location of /usr. To install to a
different location (not recommended), use the --prefix option for both license and software
install, such as:
rpm -i --prefix /some/other/directory dmexpress-<DMX version>-1.x86_64.rpm
Alternatively, the RPM can be installed with your Linux distribution’s high-level package manager if
it supports RPM. For example, on RHEL and CentOS, the yum command can be used:
yum install dmexpress-<version>-1.x86_64.rpm
or
yum install dmexpresslicense-<license site ID>-<date>-<revision>.<arch>.rpm
If there is an existing package, you can upgrade the software or license RPM instead:
rpm -U <package>.rpm
or
yum upgrade <package>.rpm
Manual/Silent Installation
Pre-Installation
The following steps are required prior to running the manual installation:
1. Create a shared directory, hereafter referred to as <shared_directory>, that can be accessed
by all nodes in the cluster for sharing the following files/folders (otherwise, they would need
to be copied to the same location on each node in the cluster):
• The DMExpressLicense.txt file obtained from the download.
• The dmexpress sub-directory created upon the dmexpress tar file extraction.
• The response file for the DMX silent installations (generated upon install on the first
node).
2. Extract the DMX Software.
a) Copy DMExpressLicense.txt and the dmexpress tar file to the <shared_directory>.
b) Extract the contents of the dmexpress tar file in the <shared_directory> on your UNIX system:
tar xvof dmexpress_<DMX version>-1_<language>_linux_2-6_x86-64_64bit.tar
This creates a dmexpress/ directory under the current directory, hereafter referred to as the
<dmx_download_directory>.
Installation
To install DMX-h on each node in the cluster, follow the instructions under UNIX Systems, Silent
Installation. You must manually install DMX-h on the first node, specifying a file to record your
responses to the install prompts, and can then silently install DMX-h on the remaining nodes using
the recorded response file, ensuring that all nodes are configured consistently.
When running the manual installation on the first machine, respond no to the prompt about
installing the DMX Run-time Service unless you want all the nodes in the cluster to install/run it.
See Installing/Upgrading the DMX Run-time Service for instructions on installing it on at least one
machine to which DMX-h jobs are submitted from the GUI.
Installing/Upgrading the DMX Run-time Service
The DMX Run-time Service (dmxd) must be installed and running on any machine to which DMX-h
jobs are submitted from the GUI; typically this is the machine designated as the edge node. If you
install/upgrade DMX-h on the edge node using any of the managed installation methods, or using the
Manual/Silent installation method where you answer no to the prompt about installing the service,
the DMX Run-time Service is not installed/upgraded.
To install/upgrade the DMX Run-time Service on any machine where DMX-h is installed, follow the
instructions for UNIX systems in Configuring the DMX Run-time Service.
Cluster in the cloud using Cloudera Director
Using Cloudera Director, you can install DMX-h on all of the nodes of a cluster in Google Cloud
Platform (GCP) or in Amazon web services (AWS).
Provided that you update the Cloudera Director configuration file, Cloudera Director can install
DMX-h as part of a cluster creation process that is initiated from the Cloudera Director command-
line interface (CLI).
Note: As Cloudera works toward supporting third-party parcels in Cloudera Director, Syncsort is
committed to updating the DMX-h installation procedures in alignment with Cloudera Director
enhanced functionality.
Pre-Installation
To enable Cloudera Director to install DMX-h on a cluster in the cloud, update the
instancePostCreateScripts section of the Cloudera Director configuration file to invoke a DMX
installation script, which you create. At a minimum, the DMX installation script must install the
DMX RPM.
Example: instancePostCreateScripts section of a Cloudera Director configuration file
In the following instancePostCreateScripts example, the DMX installation script is copied from a
Google Cloud Storage bucket and executed.
instancePostCreateScripts: ["""#!/bin/sh
echo "Installing DMExpress..."
/usr/local/bin/gsutil cp gs://<bucket_name>/installdmx.sh installdmx.sh
chmod a+x installdmx.sh
sudo ./installdmx.sh
if test $? -ne 0
then
echo Failed to install DMX on cluster nodes.
exit 1
fi
echo "Done installing DMX ..."
exit 0
"""]
Example: DMX installation script
#!/bin/bash
version=9.2
shrinkWrapFile=dmexpress-${version}-1.x86_64.bin
shrinkWrapResponse=shrinkWrapResponse.txt
# create the shrink-wrap response file
cat <<EOF > $shrinkWrapResponse
a
EOF
/usr/local/bin/gsutil cp gs://<bucket_name>/$shrinkWrapFile $shrinkWrapFile
if test $? -ne 0
then
echo Failed to copy DMX shrinkwrap file from the bucket
echo ""
exit 1
fi
chmod a+x $shrinkWrapFile
#extract the rpm
./$shrinkWrapFile < $shrinkWrapResponse > shrinkWrap.out 2>&1
#install the rpm
rpm -i dmexpress-${version}-1.x86_64.rpm
if test $? -ne 0
then
echo Failed to install DMX RPM package
echo ""
exit 1
fi
rm -f $shrinkWrapResponse
rm -f $shrinkWrapFile
rm -f dmexpress-${version}-1.x86_64.rpm
Installation
From the Cloudera Director CLI, create the cluster. When the Cloudera Director cluster deployment
completes successfully, DMX-h is installed on all of the nodes in the cluster.
Post-installation
To enable the submission of DMX-h jobs from the DMX Job Editor on a Windows instance, do the
following:
1. SSH to the ETL server/edge node and run a preparation script, which you create, to do the
following: start the DMX Run-time Service, dmxd; create a UNIX account, dmxuser/dmxuser;
enable password authentication for SSH.
Example: ETL server/edge node preparation script
#!/bin/bash
# (1) start dmxd on master-node
DMEXPRESS_HOME_DIRECTORY=/usr/dmexpress
export DMEXPRESS_HOME_DIRECTORY
if [ "" != "022" -a "" != "0022" -a "" != "000" -a "" != "00" -a "" !=
"0000" -a "" != "002" -a "" != "02" -a "" != "0002" -a "" != "020" -a
"" != "0020" ]
then
umask 022 2>/dev/null
fi
if [ ! -f $DMEXPRESS_HOME_DIRECTORY/bin/dmxd ]
then
echo Failed to locate the DMX Run-time Service 'dmxd'.
exit 1
fi
mkdir -p $DMEXPRESS_HOME_DIRECTORY/logs
echo "JOBS_DETAILS_DIR=$DMEXPRESS_HOME_DIRECTORY/logs" >
$DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
echo "DMEXPRESS_EXE=$DMEXPRESS_HOME_DIRECTORY/bin/dmexpress" >>
$DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
echo "DMEXPRESS_AUTHENTICATION_METHOD=DEFAULT" >>
$DMEXPRESS_HOME_DIRECTORY/bin/dmxd.conf
PATH=$DMEXPRESS_HOME_DIRECTORY/bin:$PATH:/usr/bin; export PATH
LD_LIBRARY_PATH=$DMEXPRESS_HOME_DIRECTORY/lib:$LD_LIBRARY_PATH; export LD_LIBRARY_PATH
cd $DMEXPRESS_HOME_DIRECTORY/bin
echo Starting the DMX Run-time Service at `date`...
nohup ./dmxd ./dmxd.conf 1>dmxd.stdout 2>dmxd.stderr &
# (2) create dmxuser
useradd -d /home/dmxuser -m -s /bin/bash "dmxuser"
echo "dmxuser:dmxuser"| chpasswd
if test $? -ne 0
then
echo Failed to set password for user dmxuser.
exit 1
fi
# (3) enable password authentication for sftp
cat /etc/ssh/sshd_config | sed -e "s/PasswordAuthentication.*no/PasswordAuthentication yes/" > sshd_config_temp
mv sshd_config_temp /etc/ssh/sshd_config
/etc/init.d/sshd restart
if test $? -ne 0
then
echo Failed to enable ssh password login.
exit 1
fi
exit 0
2. As dmxd runs on port 32636 and the SSH service runs on port 22, modify the edge node
network rules to allow TCP connections to these ports from the Windows instance.
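For example, in a GCP deployment the network rule might look like the following gcloud command. The rule name and source range are placeholders, and your environment may manage firewall rules differently; treat this as an assumed sketch, not the required procedure.

```shell
# Hypothetical GCP firewall rule opening SSH (22) and dmxd (32636) to the
# Windows instance; adjust the rule name, network, and source range as needed.
gcloud compute firewall-rules create allow-dmx-edge \
    --allow tcp:22,tcp:32636 \
    --source-ranges <windows_instance_ip>/32
```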
Deploying DMX to a Databricks cluster in the cloud
To run jobs on a Databricks cluster, you must deploy DMX Server to the cluster and install it using
an RPM (Red Hat Package Manager) init script. DMX CloudFSUtil, a command-line utility included
with DMX-h, can move the required files to all the nodes in a cluster on Amazon Web Services
(AWS).
Note: DMX supports Databricks clusters deployed to Azure and AWS cloud platforms. CloudFSUtil
can only move files to AWS.
Requirements
Set up the Spark cluster, install Databricks, and configure JDBC for Spark and Databricks. DMX
jobs on Databricks cannot run on Spark versions 3.0.0 and higher. When connecting to Databricks
databases, we recommend using the Simba Spark SQL JDBC driver version 2.6.16 or higher.
If your Connect jobs connect to DB2, Oracle, or SQL Server databases, install the JDBC drivers on
Databricks in the same work directory as the dmxspark2ix.jar file and the work directory configured
in DMX execution profile files. For more information, see Connecting to Databricks File Systems
(DBFSs).
You can run CloudFSUtil to copy the jars to the appropriate location. For example:
cloudfsutil -put mssql-jdbc-8.2.2.jre8.jar dbfs:/mnt/azuregen2/work/mssql-jdbc-8.2.2.jre8.jar
You also need a user account with permission to run sudo to run the RPM install.
Prepare the install files
To install Connect on Databricks, you need three files:
1. A DMExpress executable bin file for Connect, typically named dmexpress-${version}-1.x86_64.bin.
2. A license key package bin file for Connect, typically named
dmexpresslicense-${licenseId}-${licenseDate}.x86_64.bin.
3. An RPM install script. Databricks runs RPM scripts during cluster startup to extract the
DMExpress executable bin files and install them on the cluster. Databricks requires Unix
(LF) end-of-line characters in the script to execute properly. A sample RPM installation script
is shown below:
#!/bin/bash
dbfsPath=/dbfs/mnt/azuregen2/connect
workDir=/dbfs/mnt/azuregen2/work
version=9.10.11
shrinkWrapFile=dmexpress-${version}-1.x86_64.bin
shrinkWrapLicenseFile=dmexpresslicense-${licenseId}-${licenseDate}.x86_64.bin
shrinkWrapResponse=shrinkWrapResponse.txt
# create the shrink-wrap response file
cat <<EOF > $shrinkWrapResponse
a
EOF
cp -f $dbfsPath/$shrinkWrapFile $shrinkWrapFile
chmod a+x $shrinkWrapFile
#extract the rpm
./$shrinkWrapFile < $shrinkWrapResponse
#install the rpm
if type rpm >/dev/null 2>&1;then
echo "rpm is present"
else
sudo apt-get update
sudo apt-get -y install rpm
fi
mkdir -p /usr/tmp >/dev/null 2>&1
sudo rpm -i dmexpress-${version}-1.x86_64.rpm
rm dmexpress-${version}-1.x86_64.*
#Next step can be done once manually outside of script or included here
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
cp -vf /usr/dmexpress/lib/dmxspark2ix.jar $workDir/dmxspark2ix.jar
fi
Within the example RPM script, we set the following variables:

variable                 value
dbfsPath                 The mount point within Databricks where you keep the
                         DMExpress bin executable files.
workDir                  A directory path to which to save DMX job staging materials.
                         This should match the configured workDirectory in your DMX
                         execution profile file. For information, see Connecting to
                         Databricks File Systems (DBFSs).
version                  The version of Connect to install, used in the script to build
                         the shrinkWrapFile and shrinkWrapLicenseFile variable values.
shrinkWrapFile,          The file names of your bin executables, built from the same
shrinkWrapLicenseFile    version variable.
shrinkWrapResponse       A buffer for input variables.
After the variable assignments, the script ensures that rpm is available, creates the /usr/tmp
directory for temporary storage, and installs the software. The last lines of the script copy the
Connect dmxspark2ix.jar library to the work directory, which is required only once per work
directory. Therefore, you can include the step within your RPM install script as shown above, or
you can copy the library from your local installation to the work directory using Connect’s
CloudFSUtil. For example:
cloudfsutil -put <dmxpress_installation>/bin/dmxspark2ix.jar dbfs:/<workdir>/
If the dmxspark2ix.jar library is missing from the work directory, DMX jobs running on the
cluster fail. You can use multiple work directories on the same cluster, and each work directory
requires the library.
Move the install files to Databricks
We recommend using the executable CloudFSUtil within Connect to transfer the install files from
your local computer to Databricks. For example, from a directory containing all install files, run:
cloudfsutil -put dmexpress-9.10.11-1.x86_64.bin azure://<account>.blob.core.windows.net/<container>/
cloudfsutil -put dmexpresslicense-70590-20200925-1.x86_64.bin azure://<account>.blob.core.windows.net/<container>/
cloudfsutil -put rpmInstall.sh dbfs:/<rootdir>/rpmInstall.sh
To make the RPM install script available during cluster initialization, you must save it to a DBFS
root folder.
The DMX bin executables must reside on a mounted drive.
Note: Azure portal typically moves the DMX bin executables to Databricks faster than DBFS due to
the size of the executable.
Install custom libraries for jobs submitted from a Windows Virtual Machine (VM)
Note: this section does not apply if you are using a Linux VM to submit jobs to Databricks.
To submit jobs with custom libraries to Databricks using a Windows VM, perform the following
additional steps:
1. If you use custom function libraries, copy the Unix version of your custom function library to
Databricks. For example, to use CloudFSUtil:
cloudfsutil -put mylibrary.so dbfs:/mnt/azuregen2/connect/customfunctions/mylibrary.so
2. Create a local copy of the RPM install script and open it for editing.
3. Add a line to the end of the script with a command that copies the custom function libraries
to the plugins folder of the Connect software. For example:
cp -vf $dbfsPath/customfunctions/*.so /usr/dmexpress/plugins/
4. Save the changed RPM install script and upload it to a DBFS root folder.
5. Restart the Databricks cluster to install the custom libraries.
Configure the cluster
On the cluster on which to install Connect, execute the following procedure:
1. At the top of the cluster’s page, click Edit.
2. Click Advanced Options.
3. Under Spark, add the JAVA_HOME environment variable. For example,
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
4. Under Logging, set the destination to DBFS and the cluster log path to your desired location.
For example:
dbfs:/mnt/azuregen2/cluster-logs
Note: when you save these values, Databricks adds an additional subdirectory to the end of
the log path for the cluster ID.
5. Under Init Scripts, select Destination DBFS and add the path and filename of your RPM
install script to the Init Script Path value. For example:
dbfs:/<rootDir>/rpmInstall.sh
6. Click on Confirm at the top of the cluster’s page to save these changes.
7. To confirm a successful install:
a. Click on Event Log on the cluster’s page to verify the INIT_SCRIPTS_STARTED and
INIT_SCRIPTS_FINISHED messages and times.
b. Review the software installed in /usr/dmexpress by running the following command
within a notebook:
ls -l /usr/dmexpress/
If the cluster reported errors during DMX install, review the init script logs stored in the
init_scripts/<cluster_id>_<container_ip> directory under the logging directory set up in the cluster’s
Advanced Options. For example:
ls -l /dbfs/mnt/azuregen2/cluster-logs/init_scripts
Start the cluster
To start the cluster, click Start at the top of the cluster’s page.
Configuring the DMX Run-time Service
The DMX Run-time Service needs to be running on any system to which you want to submit tasks or
jobs for execution. It is also required for certain other functions such as file browsing in a multi-
locale environment or viewing server statistics from the client.
The DMX Run-time Service is usually configured during installation, where you determine the
following options:
• automatic restart on system startup
• PAM authentication on supported UNIX and Linux platforms
To change these options, you can reconfigure the DMX Run-time Service as described below.
Stopping and starting the DMX Run-time Service
The DMX Run-time Service can be stopped and restarted at any time. Before stopping the DMX
Run-time Service, please verify that no job or task submitted from the graphical interface is running.
If a DMX client running a version prior to 5.2.5 connects to this DMX Run-time Service, then the
Remote Procedure Call (RPC) service must be running on the system when the DMX Run-time
Service starts. The RPC service is used to obtain additional ports required to connect to older DMX
clients. Refer to the section below on RPC ports used by older DMX clients. Otherwise, the RPC
service is not required and all ports associated with it may be blocked.
Windows systems
This procedure requires Administrator level access. Select the DMExpress Service from Control
Panel, Administrative Tools, Services, then select Properties from the pop-up menu. This opens the
DMExpress Service Properties dialog. Use the Start (or Stop) button. A progress bar may appear and
the Service status in the properties window changes to Started (or Stopped).
NOTE: In order to submit a job or task, a user must have local login privileges to the machine where
the service is running.
UNIX systems
Root level access is required. Run the install script in the DMX installation directory. For example:
>cd /usr/software/DMExpress
>./install
This gives you the option of configuring the DMX Run-time Service, where you can choose to stop
and/or start the service.
Automatic restart on system startup
Windows systems
Select Automatic in Startup type in the DMX Run-time Service Properties dialog to have the DMX
Run-time Service started automatically when the system starts; select Manual otherwise.
UNIX systems
Use the install procedure as described under Stopping and Starting the DMX Run-time Service
above. You are asked the appropriate questions.
PAM authentication
UNIX systems
To configure DMX to use PAM authentication, do the following:
1. Use the install procedure to stop and restart the service as described above.
If you have Pluggable Authentication Modules (PAM) installed and configured on the system,
you are asked whether DMExpress should use PAM to authenticate users.
2. Include the PAM library in the system library path.
DMExpress specifically looks for the library name libpam.so. If your library has a different
name, such as libpam.so.0.81.5, create a symbolic link to it in any directory that is included
in the shared library path environment variable. For example, this can be done in the DMX
lib directory, specifying the full path to your library:
cd /<dmx_home>/lib
ln -s /lib64/libpam.so.0.81.5 libpam.so
3. Modify the PAM configuration for the service that handles all network
connections to the server running the DMX Run-time Service, dmxd.
• On Linux systems, have your system administrator create a file named dmxd in the
/etc/pam.d/ directory and grant authentication and account management privileges to the
dmxd service. Alternatively, you can do the following:
1. Create a file named dmxd in the directory /etc/pam.d
2. Copy the contents of sshd to dmxd
• On UNIX systems, have your system administrator create a dmxd entry in the pam.conf
file, which is located in the /etc/ directory, and grant authentication and account
management privileges to the dmxd service. Alternatively, you can do the following:
1. Create an entry named dmxd in the file /etc/pam.conf
2. Copy the contents of telnet to the entry created for dmxd
4. Ensure that DMX is configured to use PAM authentication:
• Check the installation log file, install.log, which was created in the directory where you
installed DMX. If PAM is installed on your system, the DMX installation log includes a
question asking if DMX should use PAM authentication. Verify that the recorded
response is yes [y].
• Alternatively, if you have root access to the DMX remote server, login as root and verify
that the following appears in the service configuration file, dmxd.conf, which is located in
the <dmx_home>/bin directory:
DMEXPRESS_AUTHENTICATION_METHOD=PAM
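The Linux portions of the steps above can be sketched as follows. This is a hedged illustration, not the exact procedure: the install path, the versioned library name, and the file contents are examples written to scratch locations, standing in for the real system files.

```shell
# Hedged sketch of steps 2 and 4 above for Linux; paths and the library
# version are examples that stand in for real system files.
DMX_HOME=/tmp/dmx_demo                 # stands in for <dmx_home>
mkdir -p "$DMX_HOME/lib" "$DMX_HOME/bin"
# Step 2: link the versioned PAM library to the name DMExpress looks for.
touch "$DMX_HOME/lib/libpam.so.0.81.5" # stands in for /lib64/libpam.so.0.81.5
ln -sf libpam.so.0.81.5 "$DMX_HOME/lib/libpam.so"
# Step 4: verify the authentication method recorded in dmxd.conf.
echo "DMEXPRESS_AUTHENTICATION_METHOD=PAM" > "$DMX_HOME/bin/dmxd.conf"
grep "^DMEXPRESS_AUTHENTICATION_METHOD=" "$DMX_HOME/bin/dmxd.conf"
```

On a real system, the symbolic link is created against the actual versioned library and dmxd.conf is only inspected, not written.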
Communication ports required by the DMX server
The following TCP and UDP ports are used for communication with the DMX server. When
configuring your firewall, make sure the required ports are not blocked anywhere between the
system running DMX server and the systems running the DMX client.
If any DMX client that connects to this server is running a version of DMX older than 5.2.5,
additional RPC ports are used, and may be configured. Refer to the section below on RPC ports used
by older DMX clients.
Port number/transport   Description
32636/TCP DMX server port, used for communication with DMX client, or with other DMX
servers when using Grid Computing. It is not recommended to override this
port number; please contact Syncsort technical support if you need to do so.
Refer to the section below on Technical Support.
In addition, if a DMX task or job uses a remote UNIX server connection, or a Windows network path
with a UNC name, to access data (including source and target files) or metadata (including tasks,
jobs and external metadata), the following ports need to be open on the system hosting the files.
Port number/transport   Description
20/TCP,UDP FTP data port, if Secure FTP is not used
21/TCP,UDP FTP control port, if Secure FTP is not used
22/TCP,UDP Secure FTP port, if Secure FTP is used
445/TCP,UDP Windows shares
50070/TCP,UDP Hadoop Distributed File System name node
RPC ports used by older DMX clients
DMX clients older than 5.2.5 require additional ports to communicate with the DMX Server. These
ports are assigned by the RPC service at the time the DMX Run-time Service starts.
The following ports are used in addition to the standard ports used by the DMX Run-time Service.
Port number/transport   Description
Arbitrary port/TCP
DMX server port used for communication with DMX clients. An arbitrary port
is assigned when the DMX Run-time Service is started. The port number can
be configured as mentioned below, for example if your security policy does not
allow a wide range of ports to be open, or due to the presence of a firewall.
111/TCP,UDP UNIX RPC port mapper
135/TCP,UDP Windows RPC endpoint mapper
Configuring the Server port
Windows systems
On the machine where the DMX Run-time Service is installed, open the DMX Run-time Service
Properties dialog and stop the DMX Run-time Service as described above.
In the Start parameters edit box of the properties window, type:
/tcpport <DMX ServerPort>
where <DMX ServerPort> is the port you want the service to use. For example: /tcpport 7771
Start the DMX Run-time Service as described above. The DMX Run-time Service now uses the port
you provided.
UNIX systems
Stop the DMX Run-time Service as described above.
Edit the service configuration file, dmxd.conf, which is located in the <dmx_home>/bin directory, to
insert the following line:
DMEXPRESS_TCP_PORT=<DMX ServerPort>
where <DMX ServerPort> is the port you want the service to use. For example:
DMEXPRESS_TCP_PORT=7771
Stop and start the DMX Run-time Service as described above. The DMX Run-time Service now uses
the port you provided.
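The UNIX edit above can be sketched as follows. A scratch file stands in for <dmx_home>/bin/dmxd.conf, and 7771 is only an example port; on a real system you stop the service first and start it afterwards.

```shell
# Hedged sketch: insert DMEXPRESS_TCP_PORT into dmxd.conf. A scratch file
# stands in for <dmx_home>/bin/dmxd.conf, and 7771 is an example port.
DMXD_CONF=/tmp/dmxd_demo.conf
echo "DMEXPRESS_TCP_PORT=7771" > "$DMXD_CONF"
grep "^DMEXPRESS_TCP_PORT=" "$DMXD_CONF"
```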
Applying a New License Key to an Existing Installation
Applying a new license key updates your product license to a new licensed version. If your new
license enables features or products not installed in your original installation, applying a new license
key does not install them automatically.
Windows Systems
Applying a new key interactively
Perform the following steps to apply a new license key to an existing DMX installation:
1. Go to Programs, DMExpress from the Start menu and select Apply a New License Key.
2. Browse to the location of the license key file, DMExpressLicense.txt, or type in the license
key manually, when prompted.
3. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
4. Review the product options, components and features that are enabled by your license key.
5. Confirm the location of the existing DMX installation.
Applying a new key silently
Applying a new license key silently requires a setup file which can be recorded when applying a new
license key interactively.
To record the setup file
Open a command prompt; type the full path to the program applykey.exe, followed by the options:
-r -f1<silent_setup_file>
where <silent_setup_file> is the full path to the setup file that is created. For example, if DMX is
installed in “C:\Program Files\DMExpress\”, type:
"C:\Program Files\DMExpress\Programs\applykey.exe" -r -f1c:\temp\setup.iss
An interactive session begins and the options that are selected during the interactive session are
recorded in the specified setup file.
To run the applykey.exe program in silent mode
Open a command prompt; type the full path to the program applykey.exe, followed by the options:
-s -f1<silent_setup_file> -slog<log_file>
where <silent_setup_file> is the full path to a setup file that was created using the steps above, and
<log_file> is the full path to the log file which contains any output produced by the silent install run.
For example, if DMX is installed in “C:\Program Files\DMExpress\”, type:
"C:\Program Files\DMExpress\Programs\applykey.exe" -s -f1c:\temp\setup.iss -slogc:\temp\setup.log
If you do not specify the -slog option, applykey generates a log, setup.log, in the folder where
the silent setup file is located.
Multiple command line options are separated with a space, but there should be no spaces inside a
command line option (for example, -slogc:\setup.log is valid, but -slog c:\setup.log is not).
UNIX Systems
Applying a new key interactively
Perform the following steps to apply a new license key to an existing DMX installation:
1. Change to the <dmx_home> directory and run the applykey program:
cd <dmx_home>
./applykey
2. Specify the location of the license key file when prompted.
3. Read the terms of the Syncsort License Agreement and confirm your acceptance of them.
4. Review the product options, components and features that are enabled by your license key.
Applying a new key silently
A silent applykey process allows you to easily apply a new DMX license key on multiple machines
with identical options. You simply apply the key interactively on the first machine using the record
option to save your responses to applykey prompts in a file. Then you run the silent applykey process
on the remaining machines, pointing to the recorded response file. Because the silent applykey
process is non-interactive, it can be scripted to effectively automate applying the license key on many
machines.
1. To prepare to run the silent applykey process, initiate the interactive applykey process on
the first machine as described in the section above, but in step 1, run the applykey command
with the record option, -r, specifying the file in which to store your responses to applykey
prompts as follows:
./applykey -r <silent_setup_file>
Note: Before initiating the silent applykey process, ensure that all actively running jobs
complete successfully.
2. Upon successful completion of the interactive applykey process, run the applykey program
with the silent option, -s, and the silent log option, -slog, on the remaining machines that
require the new key as follows:
./applykey -s <silent_setup_file> -slog <log_file>
where:
• <silent_setup_file> is the full path to the response file generated by the interactive
applykey process.
• <log_file> is the full path to the log file generated by the silent applykey process.
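The record-then-replay flow above can be sketched as a script. This is only an illustration: DMX_HOME and the response file path are assumptions, and applykey is invoked only when it is actually installed on the machine.

```shell
# Hedged sketch of the record/replay applykey flow; DMX_HOME and the
# response-file path are example values, not prescribed by this guide.
DMX_HOME=${DMX_HOME:-/usr/software/DMExpress}
RESPONSES=/tmp/applykey_responses.iss
if [ -x "$DMX_HOME/applykey" ]; then
  cd "$DMX_HOME"
  ./applykey -r "$RESPONSES"                          # record once, interactively
  ./applykey -s "$RESPONSES" -slog /tmp/applykey.log  # replay silently
else
  echo "applykey not found under $DMX_HOME"
fi
```

On the remaining machines, only the silent replay line is needed, pointing at a copy of the recorded response file.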
DMX-h in a Hadoop Cluster
The method for applying a new license key to DMX-h on the nodes of a Hadoop cluster depends on
how you originally installed DMX-h in the cluster. Follow the instructions in the appropriate section
below.
Cloudera Manager Parcel Apply Key
1. Install and activate the new DMX-h license Cloudera parcel, as described in Cloudera
Manager Parcel Installation. The software parcel does not need to be modified.
2. (optional) Uninstall the old DMX-h license Cloudera parcel, as described in Cloudera
Manager Parcel Uninstall.
Apache Ambari Server Apply Key
1. Install the new DMX-h Ambari license service definition package, as described in Apache
Ambari Service Installation. The software package does not need to be modified. This
effectively updates the existing service definition package.
RPM Apply Key
1. Install the new DMX-h license RPM package, as described in RPM Installation. This
effectively updates the license key.
Manual/Silent Apply Key
See UNIX systems, Applying a new key silently.
Running DMX
Once you have installed DMX, you can create tasks corresponding to different stages of your process
via the DMX Task Editor, and group tasks as jobs and run jobs via the DMX Job Editor. You can
schedule jobs to run later or run them from within a batch script. You can obtain more information on both
the graphical user interfaces and on running tasks and jobs from the command line from the DMX
Online Help.
Graphical User Interfaces
On Windows systems, go to Programs, DMExpress from the Start menu and select DMExpress Task
Editor to run the DMX Task Editor. To run the DMX Job Editor, either select it from the Start,
Programs, DMExpress menu, or switch to it from within the Task Editor via the Run, Create Job
menu item.
DMX Help
To access DMX Help, go to Programs, DMExpress from the Start menu and select DMX Help or
select the Help, Topics menu item from within the Task Editor or the Job Editor.
Connecting to Databases from DMX
In order for DMX to access database tables as sources or targets, the appropriate database client
software must be on the system and accessible via the appropriate shared library or dynamic link
library (dll) paths. The following environment variable must be set to include the path to the
database client libraries and exported on the corresponding platform:
Windows PATH
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
On UNIX systems, the variable needs to be set and exported prior to starting the DMX Run-time
Service or running DMX tasks or jobs.
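For example, on Linux the variable can be set and exported as follows before starting the service; the DB2 client library path shown is only an example.

```shell
# Hedged sketch for Linux: export LD_LIBRARY_PATH before starting the DMX
# Run-time Service or running tasks. The DB2 client path is an example only.
LD_LIBRARY_PATH=/opt/ibm/db2/V11.5/sqllib/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH
echo "$LD_LIBRARY_PATH"
```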
Additional client configuration might be required for a specific DBMS. The configuration steps
needed to access a specific DBMS are described in the following sections.
The DMX install program assists you with configuring and/or verifying connections to databases.
On UNIX systems, if you wish to configure and/or verify database connections any time after the
installation procedure, run the databaseSetup program as follows:
cd <dmx_home>
./databaseSetup
Amazon Redshift
Initial requirements
Before attempting to connect to Amazon Redshift, do the following:
• Configure the DMX server, which can be either an Amazon Elastic Compute Cloud (EC2)
instance or your local machine, to accept SSH connections.
• Depending on the DMX server, consider the following:
• EC2 instance – Set the size of the maximum transmission unit (MTU).
• Local machine - Due to throughput on the wide area network (WAN), you may notice a
performance lag at design time and at runtime.
If the local machine is behind a firewall, you may need to configure a Virtual Private Network
(VPN) to connect to the local machine from Amazon Redshift.
• Configure the DMX server to include the Amazon Redshift cluster public key and cluster
node IP addresses:
1. Retrieve the Amazon Redshift cluster public key and cluster node IP addresses.
2. Add the Amazon Redshift cluster public key to the DMX host's authorized keys file.
3. Configure the DMX host to accept all of the Amazon Redshift cluster node IP addresses.
4. Get the public key for the DMX host.
• Specify Amazon Redshift parameters in the DMX Redshift configuration file.
The parameters outlined in the DMX Redshift configuration file, as defined by the
DMX_REDSHIFT_INI_FILE environment variable, provide DMX with the values required to
access an Amazon S3 bucket and to invoke the Amazon Redshift COPY command.
Note: If DMX_REDSHIFT_INI_FILE is not set, DMX issues an error message upon task
initiation and the DMX task aborts.
A sample DMX Redshift configuration file is provided in the DMX installation directory as follows:
Windows C:\Program Files\DMExpress\Examples\Databases\Redshift\DMXRedshift.ini
UNIX <DMX_installation>/etc/DMXRedshift.ini
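On UNIX, the variable can be exported as follows before task initiation. The path shown assumes an installation under /usr/software/DMExpress; adjust it to your installation directory.

```shell
# Hedged sketch: point DMX at the Redshift configuration file. The path is an
# example assuming an install under /usr/software/DMExpress.
DMX_REDSHIFT_INI_FILE=/usr/software/DMExpress/etc/DMXRedshift.ini
export DMX_REDSHIFT_INI_FILE
echo "$DMX_REDSHIFT_INI_FILE"
```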
Installation and configuration
Connectivity between DMX and Amazon Redshift databases is established through the Amazon
Redshift ODBC driver and, when loading, through multiple SSH connections.
DMX optimizes load performance to Amazon Redshift databases through the invocation of the
Amazon Redshift COPY command.
Amazon Redshift ODBC driver installation
Windows systems
For Windows systems, ODBC driver installation includes the following:
1. Install and configure the Amazon Redshift ODBC 32-bit driver on Microsoft Windows
operating systems.
2. When creating a system DSN entry for the ODBC connection, ensure the following settings
on the given dialogs:
• Amazon Redshift ODBC Driver DSN Setup dialog: Use Declare/Fetch is selected.
• Amazon Redshift Data Type Configuration dialog:
o Use Unicode is unselected.
o Show Boolean Column As String is unselected.
o Max Varchar (Default 255) is populated with the value 65530.
UNIX systems
For UNIX systems, ODBC driver installation includes the following:
1. Install the Amazon Redshift ODBC 64-bit driver on Linux operating systems.
2. Configure the ODBC Driver on Linux operating systems.
When using the unixODBC driver manager, override the standard threading settings in the
ODBC section of odbcinst.ini as follows:
[ODBC]
Threading = 1
3. Update odbc.ini with the following name-value pairs:
UseDeclareFetch=1
UseUnicode=0
BoolsAsChar=0
MaxVarchar=65530
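The driver manager and DSN settings above can be sketched as file fragments. This is an illustration written to scratch files; your real odbcinst.ini and odbc.ini live wherever unixODBC is configured to find them.

```shell
# Hedged sketch of steps 2 and 3 above, written to scratch files for
# illustration; real files live where unixODBC expects them.
cat > /tmp/odbcinst_demo.ini <<'EOF'
[ODBC]
Threading = 1
EOF
cat > /tmp/odbc_demo.ini <<'EOF'
UseDeclareFetch=1
UseUnicode=0
BoolsAsChar=0
MaxVarchar=65530
EOF
grep "^MaxVarchar=" /tmp/odbc_demo.ini
```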
Azure Synapse Analytics (formerly SQL Data Warehouse)
Azure Synapse Analytics (formerly SQL Data Warehouse) is a cloud-based Enterprise Data
Warehouse (EDW) developed by Microsoft. Through JDBC connectivity, DMX-h supports Azure
Synapse Analytics as sources and targets.
Azure Synapse Analytics connection requirements
Azure Synapse Analytics requires a JDBC connection configuration with the driver name and
location for all connections. The parameters outlined in a DMX Azure Synapse Analytics
configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• MAXPARALLELSTREAMS - Optional. The maximum number of parallel streams created, on
demand, to load data in parallel for performance.
• STORAGEACCESSKEY - Required. Azure Blob Storage access key for an active account. If
the storage access key is missing or invalid, DMX issues an AZSQDWTERR error message
and aborts the job.
• WORKTABLECODEC - Optional compression codec to use to compress files in the staging
table. DMX currently supports gzip compression codec only.
• WORKTABLEDIRECTORY - Required. A URL that includes the Blob Storage account name
with the endpoint, including the container name. See https://docs.microsoft.com/en-
us/azure/storage/blobs/storage-blobs-introduction#blob-storage-resources. For example:
WorkTableDirectory=https://<dmxazurestorage>.blob.core.windows.net/dmx-
azstorage-container
Where <dmxazurestorage> is the blob storage account name and <dmx-azstorage-
container> is the container name. If the work table directory is missing or invalid, DMX
issues an AZSQDWTERR error message and aborts the job.
• WORKTABLESCHEMA - Optional schema name to create the staging data. If this
parameter is not set, DMX creates tables in the same schema as the target table.
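Assuming a simple name=value file format like the WorkTableDirectory example above, a configuration file might be sketched as follows. Every value is a placeholder written to a scratch file; the driver name, class path, and key are not values prescribed by this guide.

```shell
# Hedged sketch of an Azure Synapse Analytics configuration file; all values
# below are placeholders, written to a scratch file for illustration only.
cat > /tmp/dmx_synapse_demo.ini <<'EOF'
DriverName=com.microsoft.sqlserver.jdbc.SQLServerDriver
DriverClassPath=/opt/jdbc/mssql-jdbc.jar
MaxParallelStreams=4
StorageAccessKey=<your_storage_access_key>
WorkTableCodec=gzip
WorkTableDirectory=https://<dmxazurestorage>.blob.core.windows.net/dmx-azstorage-container
WorkTableSchema=staging
EOF
grep "^WorkTableDirectory=" /tmp/dmx_synapse_demo.ini
```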
Defining Azure Synapse Analytics database connections
In the Database Connection dialog, define a connection to an Azure Synapse Analytics database as
follows:
• At DBMS, select Azure Synapse Analytics.
• At Access Method, select JDBC.
• At Database, select a previously defined Azure Synapse Analytics JDBC database connection
URL.
• At Authentication, select Auto-detect.
DMX requirements to load data into an Azure Synapse Analytics target
Before using DMX to load data into an Azure Synapse Analytics target, do the following:
1. Create or verify that the master database contains a database master key.
2. Enable the db_owner privilege for the user connecting to Azure Synapse Analytics.
Alternately, set or verify the following more granular privileges for the connecting user:
EXEC sp_addrolemember 'db_datawriter', '<user>';
GRANT CONTROL TO <user>;
Azure Synapse Analytics target connections
Using an Azure Synapse Analytics JDBC connection, DMX-h can write supported Azure Synapse
Analytics data types to Azure Synapse Analytics targets directly for optimal performance.
Defining Azure Synapse Analytics targets
At the Target Database Table dialog, define an Azure Synapse Analytics database table target:
1. At Connection, select a previously defined Azure Synapse Analytics target connection or
select Add new... to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Azure Synapse
Analytics target database tables. Values specified here take precedence over their
corresponding property in the JDBC configuration file, if any.
o Maximum parallel streams - the maximum number of parallel streams that can be
established, on demand, to load data in parallel for performance.
o Work table directory - Required. A URL that includes the Blob Storage account name
with the endpoint, including the container name. See https://docs.microsoft.com/en-
us/azure/storage/blobs/storage-blobs-introduction#blob-storage-resources. For
example:
WorkTableDirectory=https://<dmxazurestorage>.blob.core.windows.ne
t/dmx-azstorage-container
Where <dmxazurestorage> is the Blob storage account name and <dmx-azstorage-
container> is the container name. If the work table directory is missing or invalid,
DMX issues an AZSQDWTERR error message and aborts the job.
o Work table codec - specifies the compression algorithm used to compress data staged
in Blob storage.
o Work table schema - the schema used to create the staging table.
4. Set commit interval and Abort task if any record is rejected are not supported.
Azure Synapse Analytics source connections
Using an Azure Synapse Analytics JDBC connection, DMX can read supported Azure Synapse
Analytics data types from any Azure Synapse Analytics table.
Defining Azure Synapse Analytics sources
For all DMX-h ETL jobs, DMX-h supports Azure Synapse Analytics database tables as sources and
as lookup sources. At the Source Database Table dialog or at the Lookup Source Database Table
dialog, define either an Azure Synapse Analytics database table source or lookup source, respectively:
• At Connection, select a previously defined Azure Synapse Analytics source connection or
select Add new... to add a new connection.
Databricks
Databricks is a cloud database Platform-as-a-Service for Spark supported on Azure and AWS Cloud
Services. Through JDBC connectivity, DMX-h supports Databricks databases as sources and
targets.
Databricks connection requirements
Databricks requires a JDBC connection configuration with the driver name and location for all
connections.
NOTE: A Databricks database connection is a database connection, which is logically different from
a Databricks File System (DBFS) connection, which is a remote file connection.
Before attempting to connect to Databricks, do the following:
• Install DMX server on an Amazon Elastic Compute Cloud (EC2) instance, Azure Virtual
Machine (VM), or your local machine.
• Specify JDBC and Spark parallelization parameters in the DMX JDBC configuration file.
The parameters outlined in the DMX JDBC configuration file, as defined by the
DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and optional
values required to access an Amazon S3 bucket or Microsoft Azure blob to invoke a
Databricks query.
• DMX accesses Databricks using key-based authentication. If no access keys are provided,
DMX issues a UNIAMCRE error message and aborts the job.
The parameters outlined in a DMX Databricks configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• ANALYZETABLESTATISTICS - When set to y, DMX can run analyze queries that collect
table statistics. The default value is n.
• ANALYZECOLUMNSTATISTICS - When set to y, DMX can run analyze queries that collect
column statistics. Default value is n.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel
streams that can be established for loading data to the staging data file. By default,
MAXPARALLELSTREAMS is set to the number of CPUs available in the client machine.
• WORKTABLEDIRECTORY - Required path to an s3 bucket, Azure blob container, or
Databricks File System (DBFS) store in which to stage data. You must mount an s3 bucket
or Azure blob container using the Databricks File System (DBFS). Example URLs could
include:
o s3a://dev for an S3 bucket
o wasbs://<container>@<account>.blob.core.windows.net/dev for an Azure Blob
o dbfs://dev for a DBFS store
• DBFSMOUNTPOINT - DBFS mount point (DBFS path) required by
WORKTABLEDIRECTORY. DBFSMOUNTPOINT is mandatory if the work table directory
maps to an S3/Azure URL.
• MAXWORKFILESIZE - Optional integer. The maximum size, in bytes, of a staging
file written by the task. The default value is 134217728, which is equivalent to 128 MB.
• WORKTABLESCHEMA - Optional schema name to use for staging data. The default
schema for staging data is the same as the target data schema.
• WORKTABLECODEC - A compression codec to compress the files in the staging directory.
Valid values are gzip (default), bzip2, and uncompressed.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon
establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEYID is
optional.
• AWSACCESSKEY - The 40-character string, also known as the secret access key, which
Amazon provides upon establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEY is
optional.
DMX requires the access key id and the secret access key to send requests to an Amazon S3
bucket unless an AWS temporary session token is required, in which case DMX requires the
access key id and AWS temporary session token. See the AWSTOKEN parameter below.
• AWSTOKEN - An AWS temporary session token, granting temporary security credentials
(temporary access keys and a security token) to any IAM user enabling them to access AWS
services. This alternative authentication method replaces a full-access AWS storage access
key. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket.
• AzureStorageAccessKey - A 512-bit Azure Blob Storage access key for an active account;
Microsoft issues two such keys upon establishing an Azure Portal account. If DMX runs in the
Azure Blob container, AzureStorageAccessKey is optional. If the storage access is required
and the key is missing or invalid, DMX issues an AZSQDWTERR error message and aborts
the job. DMX ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob
container.
• AzureStorageSAS - A shared access signature (SAS) URI that grants restricted access rights
to Azure Storage resources. This alternative authentication method replaces a full-access
Azure Storage access key. DMX ignores this parameter unless WORKTABLEDIRECTORY is
an Azure blob container.
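Using the parameters above, a Databricks configuration file might be sketched as follows. The driver name, jar path, mount point, and keys are placeholders written to a scratch file, not values from this guide.

```shell
# Hedged sketch of a Databricks configuration file using the parameters above;
# every value is a placeholder, written to a scratch file for illustration.
cat > /tmp/dmx_databricks_demo.ini <<'EOF'
DriverName=com.databricks.client.jdbc.Driver
DriverClassPath=/opt/jdbc/databricks-jdbc.jar
MaxParallelStreams=4
WorkTableDirectory=s3a://dev
DbfsMountPoint=/mnt/dev
MaxWorkFileSize=134217728
WorkTableCodec=gzip
AwsAccessKeyId=<your_access_key_id>
AwsAccessKey=<your_secret_access_key>
EOF
grep "^WorkTableDirectory=" /tmp/dmx_databricks_demo.ini
```

Note that DbfsMountPoint is set here because the work table directory maps to an S3 URL, matching the DBFSMOUNTPOINT requirement described above.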
Defining Databricks database connections
In the Database Connection dialog, define a connection to a Databricks database as follows:
• At DBMS, select Databricks.
• At Access Method, select JDBC.
• At Database, select a previously defined Databricks JDBC database connection URL.
• At Authentication, select Auto-detect.
Databricks target connections
Using a Databricks JDBC connection, DMX-h can write supported Databricks data types to
Databricks targets directly for optimal performance.
Defining Databricks targets
At the Target Database Table dialog, define a Databricks database table target:
1. At Connection, select a previously defined Databricks target connection or select Add new...
to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Databricks target
database tables. Values specified here take precedence over their corresponding property in
the jdbc configuration file, if any.
o Analyze table statistics - enables analyze queries that collect table statistics.
o Analyze column statistics - enables analyze queries that collect column statistics.
o Maximum parallel streams - Optional integer representing the maximum number of
parallel streams that Connect can establish for loading data into the staging data
file. By default, this is set to the number of CPUs available in the client machine.
o Work table directory - the parent-level directory in s3, blob, and/or dbfs in which
Connect creates job-specific subdirectories.
When the work table directory is an s3 bucket, you must mount the s3 bucket
through DBFS. For more details, see the Databricks documentation concerning
Amazon S3.
When the work table directory is an azure blob container, you must mount the blob
container through DBFS. For more details, see the Databricks documentation
concerning Azure storage.
o Work table schema - the schema used to create the staging table. By default, Connect
creates the staging table in the same schema as the target table.
o Work table codec - specifies the compression algorithm used to compress Databricks
data. Valid values are gzip (default), bzip2, and uncompressed.
4. Set commit interval and Abort task if any record is rejected are not supported.
Databricks source connections
Using a Spark JDBC connection, DMX can read supported Databricks data types from any
Databricks table.
Defining Databricks sources
For all DMX-h ETL jobs, DMX-h supports Databricks database tables as sources and as lookup
sources.
At the Source Database Table dialog or at the Lookup Source Database Table dialog, define either a
Databricks database table source or lookup source, respectively:
• At Connection, select a previously defined Databricks source connection or select Add new...
to add a new connection.
DB2
Your DB2 client must be installed on the system and configured so that it can connect to databases
that you want to access from DMX. For example, you can configure the client by cataloging
databases, or by defining database aliases in the db2cli.ini file. Please refer to specific DB2
documentation for details on configuring the client.
Windows Systems
To access DB2 databases, DB2 client software must be accessible via the dynamic link libraries (dll)
located under the <db2_home>/sqllib/bin folder, where <db2_home> denotes the directory where DB2
is installed.
UNIX Systems
To access DB2 databases, DB2 client software must be accessible via the shared libraries located
under the <instance_home>/sqllib/lib directory, where <instance_home> denotes the home directory
of the DB2 instance that you want to use to connect to the database.
Greenplum
Installation and configuration
DMX connects to Greenplum databases through the Greenplum ODBC driver and the Greenplum
psql client utility, which is a component of the Greenplum client software.
Install and configure the Greenplum client software on the system on which the DMX client is
installed.
To establish a connection to the Greenplum database, install the Greenplum ODBC driver and
create, configure, and test the ODBC data source name (DSN).
Greenplum client software installation
Windows systems
For Windows systems, client software installation includes the following:
1. Install the Greenplum client software.
a) Register as a user on the Pivotal Network site.
b) From the Greenplum Clients section of the Pivotal Greenplum Database download site,
download the Clients for Windows file, for example:
greenplum-clients-<client_software_version_number>-build-<build_version_number>-WinXP-x86_32.msi
For information on installing and configuring the Greenplum Windows client software, refer
to Greenplum Database Client Tools for Windows.
The default Greenplum client installation directory is as follows:
C:\Program Files (x86)\Greenplum\greenplum-clients-<client_software_version_number>-build-<build_version_number>
2. Verify that the Greenplum psql client utility is in a directory specified in the PATH.
Note: If the Greenplum psql client utility is not in the PATH when DMX initiates a load to
the Greenplum database, DMX issues an error message and the task aborts.
UNIX systems
For UNIX systems, client software installation includes the following:
1. Install the Greenplum client software.
a) Register as a user on the Pivotal Network site.
b) From the Greenplum Clients section of the Pivotal Greenplum Database download site,
download the applicable Greenplum UNIX client software, for example:
greenplum-clients-<client_software_version_number>-build-<build_version_number>-<platform.zip>
For information on installing and configuring the Greenplum UNIX client software, refer to
Greenplum Database Client Tools for UNIX.
The default Greenplum client installation directory is as follows:
/usr/local/greenplum-clients-<client_software_version_number>-build-<build_version_number>
To set up the system environment variables, run greenplum_clients_path.sh:
<greenplum_home>/greenplum_clients_path.sh
where <greenplum_home> is the Greenplum client software installation directory.
2. Verify that the Greenplum psql client utility is in a directory specified in the PATH.
Note: If the Greenplum psql client utility is not in the PATH when DMX initiates a load to the
Greenplum database, DMX issues an error message and the task aborts.
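The PATH check in step 2 can be sketched as follows; this is an illustration of the verification, not a DMX command.

```shell
# Hedged sketch of step 2: check whether the Greenplum psql client utility is
# on the PATH before initiating a load to the Greenplum database.
if command -v psql >/dev/null 2>&1; then
  echo "psql found at $(command -v psql)"
else
  echo "psql is not in the PATH; a Greenplum load would abort"
fi
```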
Greenplum ODBC driver installation and configuration
Windows systems
For Windows systems, driver installation and configuration includes the following:
1. Install the Greenplum ODBC driver.
From the Greenplum Connectivity section of the Pivotal Greenplum Database download site,
download the Connectivity for Windows driver file, for example:
greenplum-connectivity-<client_software_version_number>-build-
<build_version_number>-WinXP-x86_32.msi
The default Greenplum ODBC driver installation directory is as follows:
C:\Program Files (x86)\Greenplum\greenplum-connectivity-
<client_software_version_number>-build-
<build_version_number>\drivers\odbc
2. Verify that the ODBC driver libraries, which are dynamically linked libraries with the
.dll extension, are installed successfully.
3. Create and configure the ODBC DSN.
4. When creating a system DSN entry for the ODBC connection, ensure the following on the
Greenplum Advanced Options dialog:
o Use Declare/Fetch is selected.
o Show Boolean Column As String is unselected.
o Max Varchar (Default 255) is populated with the value 65530.
UNIX systems
For UNIX systems, the Greenplum driver is provided by DMX and is part of the DMX installation.
The Greenplum driver, _Sgplm<version_number>.so, is installed in the following directory:
<dmx>/ThirdParty/DataDirect/lib
Note: <dmx> is the DMX installation directory.
Greenplum driver configuration includes the following:
1. Add an entry for the Greenplum driver in the odbc.ini file.
A sample odbc.ini file is shipped with DMX and is located in the following directory:
<dmx>/etc.
In the Greenplum Data Source section of the odbc.ini file, add the Greenplum driver entry:
Driver=<dmx>/ThirdParty/DataDirect/lib/_Sgplm<version_number>.so
2. Define the embedded DataDirect ODBC Driver Manager as the ODBC driver manager.
The DataDirect ODBC Driver Manager is shipped with DMX and is installed in the following
directory:
<dmx>/ThirdParty/DataDirect
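Putting the steps above together, a Greenplum data source entry in odbc.ini might look like the following sketch. The DSN name, host, port, database, and driver version number are illustrative assumptions, and the connection attribute names follow common DataDirect conventions; confirm them against the sample odbc.ini shipped in <dmx>/etc:

```ini
[Greenplum]
Driver=<dmx>/ThirdParty/DataDirect/lib/_Sgplm27.so
HostName=gpmaster.example.com
PortNumber=5432
Database=sales
```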
Hive data warehouses
Apache Hive is a data warehouse infrastructure built on top of Hadoop that analyzes, queries, and
summarizes large datasets stored in Hadoop's Distributed File System (HDFS) and other compatible
file systems. Hive includes HiveQL, a query language useful for real-time analytics in Hadoop.
DMX-h can connect to Hive data warehouses as:
• sources when running on an ETL server/edge node or in the cluster
• targets when running on an ETL server/edge node or in the cluster
DMX can also access Hive tables as HCatalog sources and targets. To understand how DMX reads
and writes over Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC),
please read “Connecting to Hive data warehouses” in the product help.
Both JDBC and ODBC configurations have two parts:
1. Configuring connections on Windows, typically workstations
2. Configuring connections on Linux, typically edge nodes
To design jobs and tasks using the DMX GUI, configure Hive connections on a Windows workstation.
When you finish developing jobs, execute them on the cluster from an edge node so that they can
read and write to Hive.
Hive source connections
Using a Hive ODBC or JDBC connection, DMX-h can read supported Hive data types from all the
supported Hive file types, including Apache Avro, Apache Parquet, Optimized Row Columnar (ORC),
Record Columnar (RCFile), and text. DMX-h jobs running in the cluster can only read Hive sources
using JDBC connections.
Note: On an ETL server/edge node, reading from Hive sources via Hive ODBC drivers
yields low throughput. We only recommend using Hive ODBC drivers for sources serving at
most a few gigabytes of data, such as pre-aggregated data for analytics.
JDBC connectivity
When DMX-h runs in the cluster, DMX-h can directly read a Hive table when the underlying file
format is supported by the HCatalog API. In all other cases, when DMX-h reads from a Hive table on
an ETL server/edge node or in the cluster via JDBC, DMX-h stages the data temporarily in
compressed or uncompressed format to a text-backed Hive table.
A user can force DMX-h to stage data by setting the environment variable
DMX_HIVE_SOURCE_FORCE_STAGING to 1, which uses the two-step process implemented in
previous versions of DMX.
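For example, to force staging in the current shell session before running a job (the variable name and value come from the text above; the syntax assumes a POSIX shell):

```shell
# Force DMX-h to stage Hive source data via a text-backed work table,
# reverting to the two-step read process of previous DMX versions.
export DMX_HIVE_SOURCE_FORCE_STAGING=1
echo "DMX_HIVE_SOURCE_FORCE_STAGING=$DMX_HIVE_SOURCE_FORCE_STAGING"
```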
Hive target connections
DMX-h jobs and tasks write supported Hive data types to Hive targets using different methods
depending on whether they use JDBC or ODBC connections. JDBC is recommended over ODBC for
Hive targets. Consider the following constraints:
• JDBC - When DMX-h writes to a Hive table via JDBC, the job or task loads data directly into
the target tables. Writes are temporarily staged in compressed or non-compressed format to
a text-backed Hive table only when one of the following conditions limits direct access:
o A target table is a transactional (a.k.a. ACID) table
o A target table has one or more partitions
o A target table has any complex type column(s)
o The disposition for the target table is Truncate, Upsert, or Upsert and Apply change
(CDC)
o The job runs on localnode or singleclusternode
o A user forces DMX-h to stage data by setting the environment variable
DMX_HIVE_TARGET_FORCE_STAGING to 1, which uses the two-step process
implemented in previous versions of DMX
• ODBC - Based on the file format and whether the Hive table is partitioned, DMX uses one of
the following parallelized processes to write to Hive:
o Staged - DMX-h temporarily stages data from parallel streams in compressed or non-
compressed format to a text-backed Hive table.
o Direct - DMX-h loads data in parallel streams directly to the Hadoop file system for
optimal performance.
File Format                                            Partitioned   Non-partitioned
Apache Avro, Apache Parquet, or delimited text files   Staged        Direct
Other file formats                                     Staged        Staged
Generating text-backed Hive tables
When reading from or writing to Hive over JDBC or ODBC, DMX-h stages data in Hive
managed or external tables as shown below:
• By default, there is no specific work table directory configured, so DMX-h creates and stages
the data in a text-backed Hive managed table in the default schema.
• When you specify a work table directory, either by setting the
DMX_HIVE_WORK_TABLE_DIRECTORY environment variable or using a parameter in
the source or target table dialogs, DMX-h stages the data to a text-backed Hive external
table.
DMX-h deletes temporary text-backed Hive tables when the DMX-h job ends.
Additionally, you can apply compression to the work table, either by setting the
DMX_HIVE_WORK_TABLE_CODEC environment variable or using a parameter in the source or
target table dialogs.
For more information about work tables, see the Hive table staging topic in the Hive configuration
documentation in the product help.
Work table access
By default, DMX creates text-backed Hive tables as Hive-managed tables in the default schema.
The user ID that runs the DMX-h job must have CREATE TABLE privilege on this schema.
If the user cannot create tables in the default schema, you can configure DMX-h to use a different
schema either by
• setting the DMX_HIVE_WORK_TABLE_SCHEMA environment variable or
• using a parameter in the source or target table dialogs.
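The work-table settings described above can be collected as environment variables in a shell profile. The directory path, codec class, and schema name in this sketch are illustrative assumptions; valid codec names depend on your Hadoop distribution:

```shell
# Stage to a text-backed EXTERNAL Hive table under this directory
# instead of a managed table in the default schema (illustrative path).
export DMX_HIVE_WORK_TABLE_DIRECTORY=/tmp/dmx_work_tables

# Optionally compress staged work-table data (codec class is assumed).
export DMX_HIVE_WORK_TABLE_CODEC=org.apache.hadoop.io.compress.GzipCodec

# Create work tables in a schema where the job's user ID has the
# CREATE TABLE privilege (schema name is illustrative).
export DMX_HIVE_WORK_TABLE_SCHEMA=dmx_staging
```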
Hive configuration
Connecting to Hive from DMX-h requires the following configuration components:
• Hive JDBC connection and/or Hive ODBC connection.
• Hive table staging
• Hive table creation security (for Hive targets)
• Sentry/Ranger authorization, if used
Setting up Hive JDBC for Linux/UNIX
DMX uses Hive Java Database Connectivity (JDBC) on the cluster to execute Hive jobs and tasks. To
set up a JDBC connection from DMX-h to Hive on an edge node or server, you must:
1. Download a Hive JDBC driver JAR file
2. Set the JAVA_HOME environment variable
3. Configure the JDBC ini file
4. (Optional) Secure the database connection using Kerberos
5. Set the JDBC connection URL in DMX
Prerequisites
To successfully configure Hive JDBC and test a database connection from DMX-h on Linux/UNIX,
set up the following resources and gather the following permissions:
1. On cluster machines, obtain administrator privileges or contact information for a system
administrator
2. On the cluster, complete cluster provisioning
3. On the cluster, gather contact information for Cloudera or Hortonworks support
4. On the cluster, create or obtain access to an HDFS test directory with full RWX access
5. To use a Hortonworks cluster, complete the Hortonworks setup
6. To use Ambari, obtain:
a. Access to the DMX-h Ambari install package
b. On the Ambari server, administrator access to Ambari
c. On the Ambari server, a user account with permission to use the sudo command
Network connectivity
1. On the cluster, open port 2181 (or configured port) for ZooKeeper service discovery, plus all
ports for Hive server hosts/ports known to ZooKeeper, typically 10000 or 10001
2. On the Hive servers, open these ports:
a. 10000 (or configured port) for Hive
b. 8088 and 19888 to use YARN Resource Manager (RM)
3. On the edge node, open these ports:
a. 22 to use SSH
b. 32636 (or configured port) for DMX
4. If using Kerberos, open these ports:
a. 88 for Kerberos KDC
b. 749 for Kerberos admin servers (from the krb5.ini file)
Edge node configuration
1. Obtain a user account with permission to use the sudo command
2. Install Java JRE or JDK version 8 or newer with Hadoop. To use Kerberos authentication
with a JRE/JDK older than 8u161, install the Java Cryptography Extensions (JCE)
unlimited strength policy files or install a later JRE/JDK version.
3. Install all database clients
4. If you use SSL/TLS and the certificate is self-signed or signed by an in-house Certificate
Authority (CA), confirm read access to a Java keystore/truststore and that you know the
password
5. To use Kerberos, confirm that you can run the Kerberos kinit command
6. For each database, obtain administrator privileges or contact a database administrator
7. For each database, obtain a user ID that can access at least one test table
Set the JAVA_HOME variable
Linux operating systems typically configure the JAVA_HOME system variable with the Java install
path. To check the JAVA_HOME value and update it for all users:
1. Open Terminal
2. Type echo $JAVA_HOME and press Enter. If the JAVA_HOME variable is set, a message
containing the value stored in the variable appears, similar to the following:
$ echo $JAVA_HOME
/usr/bin/java/jdk1.8.0_191/jre
If the path displayed is the current install directory for the Java JRE or JDK, the
JAVA_HOME variable is set correctly.
3. If the message is not the current Java JRE or JDK install directory, create or edit the
JAVA_HOME variable value. To edit system variables, you need administrator privileges.
Downloading the Hive JDBC driver
Hive JDBC drivers connect client applications, including DMX-h, to Hive. Therefore, you must install
a Hive JDBC driver on any client, edge node, or server where DMX-h connects to Hive. All
Hadoop distributors ship Hive JDBC drivers in a JAR package you can download.
• Download the Cloudera Hive JDBC documentation and JAR package from the Cloudera
website.
• Download the Hortonworks JDBC driver and documentation from the downloads page on the
Cloudera website. We recommend using the latest version.
The Cloudera and Hortonworks drivers are both re-branded versions of Simba JDBC drivers. When
the DMX Help references the Simba JDBC driver, the reference also applies to any Cloudera or
Hortonworks JDBC driver you can use.
• For MapR, please refer to the MapR documentation website to download the driver
Configuring JDBC INI
Configure DMX to use the JDBC driver you downloaded.
1. Edit the JDBC configuration file for DMX
2. Create the DMX_JDBC_INI_FILE system variable
Editing the JDBC configuration file
To use the JDBC driver, add JDBC configuration information to a configuration file DMX can access.
Because you configure the path to this file in a system variable below, you can use any location DMX
can access.
1. Create a new or edit an existing JDBC configuration file
2. Type Hive JDBC parameters and values in this file, as described below.
Parameter Description
DriverName JDBC driver class. Consult the JDBC configuration
documentation from the Hadoop vendor for its value.
DriverClassPath
Path to the Hive JDBC driver file.
In some cases, the path contains a JDBC driver but does not
include all the required Java components and interferes with
Hive connections. Resolve this by installing a standalone JDBC
driver from Apache, which includes all necessary components,
and changing the DriverClassPath variable to the standalone
driver path.
IsSchemaSupported Optional. Set to true to ensure that all specified database tables
are identified correctly when DMX cannot determine whether
the DBMS supports a schema, such as in Hive. Valid values are
true or false and are case-insensitive.
3. Save the file and record its file path to use in an environment variable
The values in the configuration file must match the driver name set by your Hadoop distributor.
Please consult their documentation for the exact driver name and driver file name.
UNIX: JDBC Configuration File Examples
# Cloudera JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.cloudera.hive.jdbc41.HS2Driver
DriverClassPath=/opt/hivejdbc
# Hortonworks JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.simba.hive.jdbc41.HS2Driver
DriverClassPath=/opt/hivejdbc
# Apache Hive JDBC driver for HiveServer2 on Cloudera CDH
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=/opt/cloudera/parcels/CDH/jars
# Apache Hive JDBC driver for HiveServer2 on Hortonworks (HDP)
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=/usr/hdp/current/hive-client/lib
# Apache Hive JDBC driver for HiveServer2 on MapR
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=/opt/mapr/hive/hive-1.2/lib
Creating the DMX_JDBC_INI_FILE Environment variable
Create a DMX_JDBC_INI_FILE environment variable and assign the full path of the JDBC
configuration file as its value. The full path includes the directory location and the JDBC
configuration file name.
For example, to set the variable for users opening the bash shell:
1. Open the $HOME/.bash_profile or /etc/bashrc configuration file for editing
2. Type the following on a new line:
export DMX_JDBC_INI_FILE="<jdbc_configuration_file_path>"
Where <jdbc_configuration_file_path> is the full path of the JDBC configuration file
3. Save and close the configuration file
To verify the update
1. Open a new Terminal that uses the bash shell
2. Type echo $DMX_JDBC_INI_FILE and press Enter. If the export command is correct, the
terminal responds with a message containing the DMX_JDBC_INI_FILE value set in the
configuration file.
Securing a database connection using Kerberos
Note: You can only manage Kerberos authentication in DMX-h over JDBC connections.
You can configure Kerberos authentication outside of DMX-h for either ODBC or JDBC
connections.
To connect to Kerberos secured databases, DMX requires valid Kerberos tickets, similar to
authentication tokens, to authenticate its identity as a trusted client. To obtain a valid Kerberos
ticket, you can:
• Use DMX-h to leverage Java Authentication and Authorization Service (JAAS) to generate a
ticket and automatically supply it to the JDBC driver.
• (Recommended for Linux) Use a Kerberos client outside of DMX, such as the Java kinit
command, to generate a ticket and set up the JDBC driver to use it. We recommend this
method because Linux environments that use Kerberos typically generate a ticket at startup
that you can use for DMX communications.
These methods assume:
1. you can access the Kerberos Distribution Center (KDC) for the realm across the network
2. a valid /etc/krb5.conf file exists on a local machine
3. if your JRE/JDK is older than 8u161, you installed the Java Cryptography Extensions (JCE)
unlimited strength policy files.
Using DMX-h to manage Kerberos authentication
We recommend using DMX-h to reference Kerberos tickets, which you can implement with the
following procedure.
1. Stop all DMX jobs and tasks.
2. Verify that the /etc/krb5.conf file exists. If not, you need to install Kerberos on this machine.
3. Set Kerberos environment variables:
a) Set DMX_KERBEROS_KEYTAB to the absolute path to and name of your keytab file
b) Set DMX_KERBEROS_PRINCIPAL to the identity to be authenticated, or principal, to
which Kerberos assigns or has assigned the Kerberos ticket.
4. If your Kerberos service uses AES-256 encryption and your JDK/JRE is older than Java 8
update 161, install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction
Policy JAR file, which you can download for your Java version from the Oracle website.
Managing Kerberos authentication using a Kerberos client
For this alternative process, we recommend the following procedure:
1. Close all DMX job and task editors.
2. Verify that the /etc/krb5.conf file exists. If not, you need to install Kerberos on this machine.
3. Add a kinit command to a startup configuration file, such as .bashrc, to validate credentials
with Kerberos at startup. kinit generates or validates a ticket shared with all applications
including DMExpress.
4. Restart the Terminal or machine to load the new configuration.
Once a ticket is validated at startup, you can execute DMExpress jobs from the Terminal.
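The startup-file entry in step 3 might look like the following sketch; the principal, realm, and keytab path are illustrative assumptions to adjust for your environment:

```shell
# Added to $HOME/.bashrc: obtain a Kerberos ticket non-interactively
# at login using a keytab, so DMExpress jobs inherit a valid ticket.
kinit -k -t /home/dmxuser/dmxuser.keytab dmxuser@EXAMPLE.COM
klist    # optional: confirm the ticket was granted
```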
JDBC Connection URL
The JDBC URL you create when you develop a DMX job remains the same when you execute the job
on the cluster. Please refer to “JDBC Connection URL” under JDBC setup for Windows for details on
the URL string.
If the URL references any local files, for example sslTrustFile, update the URL string to reference
files stored on the Linux machine. You can avoid this requirement by using environment variables
in the URL for the paths to local files, so that you can set these file paths in each environment and
use the same URL in all of them.
Setting up Hive JDBC connections for Windows
DMX-h uses Hive Java Database Connectivity (JDBC) on Windows to design jobs. DMX-h supports
Hive connections on Windows at design time only, not at runtime.
To set up a JDBC connection from DMX-h to Hive on a Windows machine, you must:
1. Install a JDK/JRE on the Windows machine
2. Set the JAVA_HOME environment variable
3. Download a Hive JDBC JAR file
4. Configure the JDBC ini file
5. (Optional) Secure the database connection using Kerberos
6. Set the JDBC connection URL in DMX
Prerequisites
To successfully configure Hive JDBC and test a connection to a database from DMX-h on Windows,
set up the following resources and permissions:
1. On the cluster, complete cluster provisioning
2. Gather contact information for Cloudera or Hortonworks support
3. On the edge node, install database clients
4. Access a workstation with a supported Windows version, which must be Windows XP, 7, 8.x,
10, Server 2003, Server 2008 or Server 2012, either 32-bit or 64-bit
5. On the workstation, access to a local Administrator user account
6. On the workstation, access to the DMX-h Windows Installer
7. On the workstation, to use SSL, confirm read access to the trust store file and that you know
the password
8. On the workstation, install database clients
9. For each database, obtain a user ID that can access at least one test table
Network connectivity
A Windows machine must be able to connect to the following machines on the appropriate ports and
resolve their hostnames:
1. The Hive server host(s) on the appropriate port (default 10000)
2. On the edge node, ports 22 (or configured port) for SSH and 32636 (or configured port) for
DMX
3. If using ZooKeeper service discovery:
a. the ZooKeeper hosts in the connection string on the given port (default 2181)
b. all Hive server hosts/ports known to ZooKeeper
4. If the cluster is secured with Kerberos:
a. The Kerberos KDC on port 88
b. The Kerberos admin servers (from the krb5.ini file) on port 749
Kerberos prerequisites
Before setting up Kerberos on Windows:
1. Configure Kerberos using a local Windows administrator user.
2. Set the Windows environment variable DMX_KERBEROS_PRINCIPAL to a principal name
3. Copy a keytab file for the appropriate principal to the Windows machine
4. Set the Windows environment variable DMX_KERBEROS_KEYTAB to the path to the
keytab file
5. Copy the Kerberos config file (e.g. /etc/krb5.conf) from an edge node to the Windows machine
as C:\Windows\krb5.ini
SSL/TLS prerequisites
If you use SSL/TLS and the certificate is self-signed or signed by an in-house Certificate Authority
(CA), ensure that a Java keystore/truststore and its password are available.
Installing a JDK/JRE on the machine
DMExpress requires a Java Runtime Environment (JRE) to connect to Hive, so either a JRE or a
Java Development Kit (JDK) must be installed on your system. We recommend installing the same
JRE or JDK version Hadoop or Hive uses on the cluster.
1. Check your system to see if Java is already installed. To check for a JRE install on Windows,
navigate to Add/remove programs in the control panel and look for Java ‘X’ or Java ‘X’ update
‘XX’.
2. If Java is not present on your Windows system, download a JRE or JDK install package from
the Oracle website that matches the bit format (either 32-bit or 64-bit) of the DMX
application installed on Windows. You can identify the bit format of your DMX install by
a) Reviewing the log of any DMX job that ran on Windows, or
b) opening a command prompt, typing dmexpress /quit, and pressing Enter. The command
prompt displays the DMX install information similar to that shown below.
C:\WINDOWS\System32>dmexpress /quit
[DMExpress 9.10 Windows x64 64-bit Copyright (c) 2020 Precisely Inc.]
Setting the JAVA_HOME environment variable
To implement a JDBC connection on a Windows operating system, configure the JAVA_HOME system
variable with the Java install path. To check the JAVA_HOME value and update it for all users:
1. Open a console application with a command prompt
2. Type echo %JAVA_HOME% and press Enter. If the JAVA_HOME variable is set, a message
containing the value stored in the variable appears, similar to the following:
C:\WINDOWS\System32>echo %JAVA_HOME%
C:\Program Files\Java\jdk1.8.0_191\jre
If the path displayed is the current install directory for the Java JRE or JDK, the
JAVA_HOME variable is set correctly.
3. If the message is not the current Java JRE or JDK install directory, edit the JAVA_HOME
variable value. To edit system variables, you need local administrator privileges.
a) Click the Windows button, type environment in the search bar, and select "Edit the system
environment variables" from the search results
b) Select the JAVA_HOME system variable and click Edit
c) In the Variable Value dialog box, type the full path for the Java JRE or JDK
d) Click OK
e) Verify that the JAVA_HOME variable is in the list of System Variables and its value is
the current Java JRE or JDK install directory.
f) Click OK
4. If the message is simply %JAVA_HOME%, create a JAVA_HOME system variable. To create a
system variable, you need local administrator privileges.
a) Click the Windows button, type environment in the search bar, and select "Edit the system
environment variables" from the search results
b) In the System Variables section, choose New
c) In the Variable Name dialog box, type JAVA_HOME
d) In the Variable Value dialog box, type the full path for the Java JRE or JDK
e) Click OK
f) Verify that the JAVA_HOME variable is in the list of System Variables and its value is
the current Java JRE or JDK install directory.
g) Click OK
5. If you are unable to create or edit system variables, create the JAVA_HOME user variable.
Downloading a Hive JDBC driver
To connect to Hive through JDBC during development, DMX requires the Hive JDBC driver and its
dependencies installed on your Windows workstation. All Hadoop distributors ship drivers as a
package you can download.
• Download the Cloudera Hive JDBC documentation and JAR package from the Cloudera
website.
• Download the Hortonworks JDBC driver and documentation from the downloads page on the
Cloudera website. We recommend using the latest version.
The Cloudera and Hortonworks drivers are both re-branded versions of Simba JDBC drivers. When
the DMX Help references the Simba JDBC driver, the reference also applies to any Cloudera or
Hortonworks JDBC driver you can use.
• For MapR, please refer to the MapR documentation website to download the driver
• For other distributions, the Apache Hive JDBC driver can be used, but is not recommended.
Use a “standalone” version of the driver, if available.
Once downloaded, create a folder in your filesystem and copy the JDBC package into it.
Configuring JDBC INI
Configure DMX to use the JDBC jar files you downloaded:
1. Create a new or edit an existing JDBC configuration file for DMX
2. Create the DMX_JDBC_INI_FILE system variable
Editing the JDBC configuration file
To use the JDBC driver jar, add JDBC configuration information to a configuration file DMX can
access. Because you configure the path to this file in a system variable below, you can use any
location DMX can access.
1. Create a new or edit an existing JDBC configuration file
2. Type Hive JDBC parameters and values in this file, as described below.
Parameter Description
DriverName JDBC driver class. Consult the JDBC configuration
documentation from the Hadoop vendor for its value. This name
matches the driver class name used on an edge node.
DriverClassPath
Path to the Hive JDBC driver file.
In some cases, the path contains a JDBC driver but does not
include all the required Java components and interferes with
Hive connections. Resolve this by installing a standalone JDBC
driver from Apache, which includes all necessary components,
and changing the DriverClassPath variable to the standalone
driver path.
IsSchemaSupported
Optional. Set to true to ensure that all specified database tables
are identified correctly when DMX cannot determine whether
the DBMS supports a schema, such as in Hive. Valid values are
true or false and are case-insensitive.
3. Save the file and record its file path to use in the system variable
The values in the configuration file must match the driver name set by your Hadoop distributor.
Please consult their documentation for the exact driver name and driver file name.
Windows: JDBC Configuration File Examples
# Cloudera JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.cloudera.hive.jdbc41.HS2Driver
DriverClassPath=C:\HiveJDBC\Cloudera_HiveJDBC41_2.5.19.1053
# Hortonworks JDBC Simba-based driver for HiveServer2
[hive2]
DriverName=com.simba.hive.jdbc41.HS2Driver
DriverClassPath=C:\HiveJDBC\Simba_HiveJDBC41_1.0.40.1052
# Apache Hive JDBC driver for HiveServer2
[hive2]
DriverName=org.apache.hive.jdbc.HiveDriver
DriverClassPath=C:\Program Files (x86)\Hive\Hive-0.14.0.2.2.9.1-7\lib
Creating the DMX_JDBC_INI_FILE environment variable
Create a system variable DMX_JDBC_INI_FILE and assign the full path of the JDBC configuration
file as its value. The full path includes the directory location and the JDBC configuration file name.
To create this variable using the Windows Explorer:
1. Navigate to the Windows System Properties > Advanced tab
2. Click Environment Variables
3. In the System Variables section, choose New
4. In the Variable Name dialog box, type DMX_JDBC_INI_FILE
5. In the Variable Value dialog box, type the full path of your JDBC configuration file.
6. Click OK
7. Verify that the DMX_JDBC_INI_FILE variable is in the list of System Variables and its
value is the full path of your JDBC configuration file.
8. Click OK
We recommend creating it as a system variable on Windows (similar to the way the JAVA_HOME
variable is created).
Restart all DMX GUI programs after setting the variables, to ensure that the new values take effect.
Securing a database connection using Kerberos
Note: You can only manage Kerberos authentication in DMX-h over JDBC connections.
You can configure Kerberos authentication outside of DMX-h for either ODBC or JDBC
connections.
To connect to Kerberos secured databases, DMX requires valid Kerberos tickets to authenticate its
identity as a trusted client. To obtain a valid Kerberos ticket, you can:
• (Recommended for Windows) Use DMX-h to leverage Java Authentication and Authorization
Service (JAAS) to generate a ticket and automatically supply it to the JDBC driver. We
recommend this method for Windows because DMX is usually the first application to connect
to Hive outside the cluster. However, if your organization prohibits keytab configuration on
Windows, you must manage Kerberos tickets outside of DMX.
• Use a Kerberos client, such as the Java kinit command, outside of DMX to generate a ticket
and set up the JDBC driver to use it.
NOTE: Windows ships with a kinit command that is not the Java kinit
command.
These methods assume:
1. you can access the Kerberos Distribution Center (KDC) for the realm across the network
2. a valid C:\Windows\krb5.ini file exists on the local machine
3. If your JDK/JRE is older than Java 9 or 8u161 and your Kerberos service uses AES-256
encryption, you installed the Java Cryptography Extension (JCE) Unlimited Strength
Jurisdiction Policy JAR file, which you can download for your Java version from the Oracle
website.
Using DMX-h to manage Kerberos authentication
To enable DMX to manage Kerberos tickets for jobs, you must set the required DMX Kerberos
environment variable and, when connecting via JDBC on Windows, you must verify the
location of the Kerberos configuration file. Specifically, DMX calls the Kerberos kinit utility to
retrieve a Kerberos ticket with valid credentials before executing a job. Upon job initiation, DMX
provides authentication information to the DBMS. After job execution, DMX calls the Kerberos
kdestroy utility to destroy the Kerberos ticket.
DMX manages Kerberos authentication for
• running jobs from the Run job dialog
• connecting to Hive at design time
Note: To manage Kerberos tickets for DMX tasks, see Managing Kerberos outside of DMX
below
Set the following Kerberos environment variables in the Environment Variables tab of the DMX
Server dialog:
• DMX_KERBEROS_PRINCIPAL - required - the principal to be authenticated, to which
Kerberos assigns a ticket. Setting this variable enables DMX control of Kerberos tickets,
including calling the kinit utility.
• DMX_KERBEROS_KEYTAB - optional - the name and location of the principal's Kerberos
keytab file; if not specified, the default Kerberos keytab name and location will be used,
<user_home>\krb5.keytab on Windows.
We recommend using DMX-h to reference Kerberos tickets, which you can implement with the
following procedure.
1. Close all DMExpress job and task editors.
2. Copy the /etc/krb5.conf file from a Linux edge node or server running Kerberos into the
C:\Windows directory and rename it krb5.ini.
3. Generate or copy a keytab file onto your Windows machine.
a) To generate a keytab file, use ktutil with a valid Kerberos server user account. You
may refer to the University of Indiana website for examples.
b) To copy the keytab file, you need to know where it saved on your machine or a Kerberos
edge node or server
4. Set Kerberos environment variables:
a) Set DMX_KERBEROS_KEYTAB to the absolute path to and name of your keytab file
b) Set DMX_KERBEROS_PRINCIPAL to the identity to be authenticated, or principal, to
which Kerberos assigns or has assigned the Kerberos ticket
5. In the server setup dialog, test authentication using the Verify Connection button.
Managing Kerberos authentication outside of DMX
You can manage Kerberos tickets outside of DMX using Kerberos clients for Windows. You should
use these clients when:
1. Observing a policy prohibiting keytab configuration on Windows
2. Connecting with DMX-h via JDBC at design time using the Task Editor
3. Running dmxjob at the command line
To manage Kerberos tickets outside of DMX on Windows, consider the following:
1. Clear the DMX_KERBEROS_PRINCIPAL variable in the Environment Variables tab of the
DMX Server dialog, disabling DMX-h Kerberos authority.
2. Ensure that the Kerberos client utilities, such as kinit and klist, are installed on the
DMX workstation.
3. Copy the Kerberos configuration file (e.g. /etc/krb5.conf) from the edge node to
C:\Windows\krb5.ini on the Windows machine.
4. Set the Windows environment variable KRB5_CONFIG to the path to the Kerberos
configuration file for the Kerberos client utility.
5. Set the Windows environment variable KRB5CCNAME to the path to the Kerberos
cache file.
6. When managing Kerberos authentication outside of DMX, run the Kerberos kinit utility to
initialize a ticket and attempt to authenticate. If this fails, run the Kerberos klist utility to
determine whether you have a valid ticket or not. If your ticket is valid, correct problems
with your Windows or Java configuration. If your ticket is invalid, correct problems with
your Kerberos configuration.
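For example, the two variables from steps 4 and 5 might be set as follows; both paths are hypothetical placeholders:

```
KRB5_CONFIG=C:\Windows\krb5.ini
KRB5CCNAME=C:\Users\etluser\krb5cache
```

With these set, run kinit to initialize a ticket and klist to inspect it, as described in step 6.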
Upon job/task initiation, DMX provides authentication information to the DBMS.
Setting the JDBC Connection URL in DMX
DMExpress job developers can set the JDBC Connection URL in DMX; no administrator privilege is
required.
1. Create or open a DMExpress task and open the Database Connection dialog.
2. Select Hive as the DBMS and select JDBC as the access method.
3. Set the following fields in the Database Connection dialog:
Field Name Description
Name Connection name representative of database and type of
connection made.
DBMS Hive
Access Method JDBC
Database Provide a connection URL as described below
Authentication Kerberos if using Kerberos. Ignore otherwise
Connect As Ignore.
The connection URL string has the following format:
jdbc:hive2://<host>:<port>/<database>[;<attribute>=<value> [;...]]
where
Field Name Description
host The IP address or hostname of the Hive server. To use a
hostname, we recommend a fully qualified domain name (FQDN) to
avoid the machine auto-resolving the hostname to an incorrect domain.
port The listening port for the Hive service. The default is 10000.
database The name of the database schema you want to use. If the
database controls access using Sentry / Ranger, make sure you can
access the database and have read/write permissions on the table.
The default schema is default.
propertyName Semicolon-separated list of attributes that instruct the Hive
server to perform various tasks.
For the Cloudera and Hortonworks drivers, the following
parameter is always required:
UseNativeQuery=1
If you use a Cloudera or Hortonworks driver on a cluster or server
secured with Kerberos, add the following parameters:
AuthMech=1;KrbHostFQDN=<HiveFullyQualifiedDomainName>;KrbServiceName=hive;UseNativeQuery=1
You may also need to add ;KrbRealm=<Realm> if the Hive server is
not in the default realm.
If you use a HiveServer2 driver over SSL, add:
SSL=1;SSLKeyStore=<keystore_path>;SSLKeyStorePwd=<keystore_password>
where <keystore_path> is the full path of the sslTrustStore file
saved locally on the system.
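Putting the pieces together, a Kerberos-secured connection URL might be assembled as in the following sketch; the host, port, schema, and realm values are hypothetical placeholders for your own cluster:

```shell
# Hypothetical values; substitute the FQDN, port, and schema for your cluster.
HIVE_HOST="hiveserver1.example.com"
HIVE_PORT=10000
HIVE_DB="default"
# Kerberos-secured URL using the required Cloudera/Hortonworks parameters.
URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/${HIVE_DB};AuthMech=1;KrbHostFQDN=${HIVE_HOST};KrbServiceName=hive;UseNativeQuery=1"
echo "$URL"
```

Append ;KrbRealm=<Realm> to the URL if the Hive server is not in the default realm.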
Hive ODBC connection
Hive ODBC connections can be used for Hive sources and targets. Configuring Hive ODBC requires
the following steps, described in detail in the sections below:
1. Install and configure the Hive ODBC driver
2. Define a Hive ODBC data source
Installing and configuring the Hive ODBC driver
Ensure you have administrator/root privileges on the computer before you install the
driver.
Installing and configuring on Windows
1. Go to one of the following Hadoop vendor websites and download the Windows 32-bit Hive
ODBC driver and associated documentation. For example:
• Cloudera: http://www.cloudera.com/downloads/connectors/hive/odbc/2-5-21.html
• Hortonworks: http://hortonworks.com/hdp/addons/
• MapR: http://package.mapr.com/tools/MapR-ODBC/. Select the latest version of the file
MapR_odbc_<n.n.n>_x86.exe.
2. After downloading the file, double-click the file to run the installer.
3. Follow the installer's instructions and use the default settings.
For additional information about the installation and configuration settings, see the vendor's
documentation.
Installing on Linux
1. Go to one of the following Hadoop vendor websites and download the Linux 64-bit Hive
ODBC driver and associated documentation. For example:
• Cloudera: http://www.cloudera.com/downloads/connectors/hive/odbc/2-5-12.html .
Download the appropriate RPM for your Linux distribution.
• Hortonworks: http://hortonworks.com/hdp/addons/. Download the appropriate tar file
for your Linux distribution and extract the RPMs from it.
• MapR: http://doc.mapr.com/display/MapR/Hive+ODBC+Connector and
http://package.mapr.com/tools/MapR-ODBC/. Navigate the directories for your Linux
distribution and download the appropriate RPM.
2. Unpack the RPM package to install the driver files in the vendor's default location:
rpm -i <vendor-file>.rpm
3. Note the location of the installed files for later configuration steps. The default installation
location depends on the vendor and may be one of the following:
• /opt/Cloudera/hiveodbc/
• /usr/lib/hive/lib/native/hiveodbc/
• /opt/mapr/hiveodbc/
4. Continue with Configuring the Hive ODBC driver on Linux.
Configuring the Hive ODBC driver on Linux
The Hive ODBC driver installation includes the file <users-home>/.<vendor>.hiveodbc.ini, which
you use to configure the specific vendor's Hive ODBC driver. By default, this file begins with a
leading period and is installed in the user's home directory.
1. If you decide not to use the default location and file name for .<vendor>.hiveodbc.ini, set an
environment variable to locate the file. These examples assume you put the file (without a
leading period) in the /etc/ directory:
• Cloudera:
• For Cloudera Hive ODBC driver version 2.5.12 and higher: export
CLOUDERAHIVEINI=/etc/cloudera.hiveodbc.ini
• For Cloudera Hive ODBC driver versions prior to 2.5.12: export
SIMBAINI=/etc/cloudera.hiveodbc.ini
• Hortonworks:
• For Hortonworks Hive ODBC driver version 0.11 and higher: export SIMBAINI=/etc/hortonworks.hiveodbc.ini
• For Hortonworks Hive ODBC driver versions prior to 0.11: export
SIMBAINI=/etc/hortonworks.hiveodbc.ini
• MapR: export MAPRINI=/etc/mapr.hive.odbc.ini
2. Set the following driver manager options in your vendor-specific configuration file,
<vendor>.hiveodbc.ini, under the Driver section. The default DMX-h driver manager is
unixODBC.
a) Set DriverManagerEncoding to UTF-16.
b) Set ODBCInstLib to identify the ODBC installation's shared library for the ODBC driver
manager. The DMX-h default location is <dmx>/lib/libodbcinstSSL.so.
[Driver]
DriverManagerEncoding=UTF-16
ODBCInstLib=<dmx>/lib/libodbcinstSSL.so
3. Ensure that the Hive ODBC driver library is included at the beginning of the system library
path, LD_LIBRARY_PATH, by running the following command:
export LD_LIBRARY_PATH=<vendor's-Hive-ODBC-driver-installation>/lib:$LD_LIBRARY_PATH
4. The default location DMX-h uses for the .odbcinst.ini ODBC configuration file is <dmx>/etc.
If you decide to use a different location, set the ODBCSYSINI environment variable to the
directory containing your file.
5. Configuration options set in configuration file odbcinst.ini apply to all Hive connections.
Create a section for the Hive ODBC driver and set the following options as follows:
[ODBC Drivers]
<vendor> Hive ODBC Driver 64-bit=Installed
. . .
[<vendor> Hive ODBC Driver 64-bit]
Description=<vendor> Hive ODBC Driver (64-bit)
Driver=/<vendor's-Hive-ODBC-driver-installation>/<vendor's_Hive_ODBC_library_file>
For additional information about the installation and configuration settings, see the vendor's
documentation.
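As a concrete sketch, with the Cloudera driver installed in its default location, the odbcinst.ini entries from step 5 might look like the following; the driver name and library file name are assumptions that depend on your driver version, so check the /lib/ directory of your installation:

```
[ODBC Drivers]
Cloudera ODBC Driver for Apache Hive 64-bit=Installed

[Cloudera ODBC Driver for Apache Hive 64-bit]
Description=Cloudera ODBC Driver for Apache Hive (64-bit)
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
```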
Defining a Hive ODBC data source
To identify a Hive ODBC data source, you create a data source name (DSN) and set the options
required to connect to the data source.
Defining a Hive data source on Windows
1. Start the ODBC Data Source Administrator by following the instructions for Windows
systems at Defining ODBC Data Sources.
2. In the Create New Data Source dialog, select your Hive ODBC vendor's driver from the list,
and then click Finish.
3. Use the following settings in the ODBC Driver DSN Setup dialog for a Hive ODBC data
source:
Data Source Name Enter a name to identify the Hive DSN.
Host Enter the IP address or hostname of the Hive server.
Port Enter the listening port for the Hive service. The default is 10000.
Database Enter the name of the database schema you want to use. The default
schema is default.
Hive Server Type Select Hive Server 2.
Authentication mechanism Most Hive installations use User Name authentication by default. The
authentication mechanism for a Hive data source must match the
mechanism in use on the Hive server or the connection fails. Check with
your Hadoop system administrator.
Advanced Options Check Use Native Query and then click OK. Some ODBC Hive drivers
work with both HiveQL and SQL query languages. This setting enables
the use of native HiveQL instead of SQL.
Defining a Hive data source on Linux
Find general instructions for UNIX systems at Defining ODBC Data Sources.
1. The default location DMX-h uses for the odbc.ini ODBC configuration file is
<dmx>/etc/odbc.ini. If you decide to use a different location and file, set the ODBCSYSINI
environment variable to the full path and file name of your file.
2. In the odbc.ini file, add a new Hive data source entry to the [ODBC Data Sources] section.
Use the format <data-source-name>=<your-driver-description>:
[ODBC Data Sources]
Sample Hive DSN 64=Hive ODBC Driver 64-bit
3. Configure the new Hive data source by adding a section similar to the following to the
odbc.ini file. Note that sample values are shown. Consult your Hadoop system administrator
for guidance on settings appropriate for your environment:
[Sample Hive DSN 64]
Driver=/<vendor's-Hive-ODBC-driver-installation>/<vendor's_Hive_ODBC_library_file>
HiveServerType=2
HOST=<hive-server>
PORT=10000
UseNativeQuery=1
AuthMech=2
The following table lists valid values:
Odbc.ini options Description
Driver Set the location of the installed Hive ODBC Driver file. Find the driver file
<vendor’s_Hive_ODBC_library_file>, for example,
libhortonworkshiveodbc64.so, under your installed files /lib/ directory.
HiveServerType Set the HiveServerType to 2, for HiveServer2. HiveServer2 is a newer
version of HiveServer with improvements and additional features.
1 (default) HiveServer
2 HiveServer2
HOST Set the IP address or hostname of the Hive server.
PORT Set the listening port for the service. The default port for DMX-h Hive
installation is 10000.
UseNativeQuery Set the UseNativeQuery value to 1. Some Hive ODBC drivers work with
both HiveQL and SQL query languages.
0 (default) enables the SQL Connector feature
1 enables the HiveQL query language and disables the SQL Connector
feature
AuthMech Set the AuthMech value to the number representing the same
authentication mechanism as the Hive server. Most Hive installations use
User Name authentication (value 2) by default.
0 no authentication
1 Kerberos
2 (default) User Name
3 User Name and Password
4 User Name and Password (SSL)
5 Windows Azure HDInsight Emulator
6 Windows Azure HDInsight Service
7 HTTP
8 HTTPS
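For example, a DSN for a Kerberos-secured HiveServer2 instance would change AuthMech to 1 and add Kerberos options. The host and paths below are hypothetical, and the Kerberos key names mirror those shown earlier for the JDBC URL; they may vary by driver, so consult your driver documentation:

```
[Kerberos Hive DSN 64]
Driver=/opt/cloudera/hiveodbc/lib/64/libclouderahiveodbc64.so
HiveServerType=2
HOST=hiveserver1.example.com
PORT=10000
UseNativeQuery=1
AuthMech=1
KrbHostFQDN=hiveserver1.example.com
KrbServiceName=hive
```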
Hive table staging
When writing to Hive targets, DMX-h stages the data as a text-backed Hive table when connecting
via:
• JDBC
• ODBC
Note:
• In all cases, sufficient space as well as CREATE TABLE privileges are needed to stage the
tables.
• For the ODBC cases, DMX-h writes the tables using the hive command, which must be in the
path.
Hive table creation security
With Hive version 0.13 and higher, the default security does not allow the user who creates the table
to read from or write to the table. To enable reading from and writing to the table without having to
modify access permissions after creating the table, do the following:
• In hive-site.xml, add the property hive.security.authorization.createtable.owner.grants and
set its value to SELECT and UPDATE.
• Ensure the user has read/write privileges to Hive data files on the Hadoop file system.
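A sketch of the corresponding hive-site.xml entry follows; the property name comes from the text above, while the placement within your existing configuration and the exact value formatting may vary by distribution:

```xml
<property>
  <name>hive.security.authorization.createtable.owner.grants</name>
  <value>SELECT,UPDATE</value>
</property>
```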
Sentry and Ranger authorization
DMX-h is compatible with the following authorization schemes:
• Cloudera Sentry
• Apache Ranger
Cloudera Sentry
DMX-h is certified to work with Cloudera's Sentry authorization of Hive databases, which requires
the following to be enabled in the Cloudera cluster:
• HDFS Access Control Lists (ACLs)
• automatic synchronization of HDFS ACLs with Sentry privileges
Note: When using Sentry, Hive impersonation is disabled by default. To ensure access to the
Work table directory, the default Hive user must have the correct permissions.
Apache Ranger
DMX-h is compatible with Apache Ranger, a framework for enabling, monitoring, and managing
data security across the Hadoop platform. Ranger works with Apache Hadoop (HDFS), Apache Hive,
Apache Kafka, and YARN, among other Apache projects.
Note: Ranger is currently designated as an Apache incubator project, and there are gaps in what it
works with in the Hadoop ecosystem, such as Apache HCatalog. Additionally, it does not work with
Amazon S3 or other cloud-based distributed filesystems.
Apache Impala
Apache Impala is a native analytic database for Apache Hadoop. Through JDBC connectivity,
DMX-h supports Impala databases as sources and targets when running on the ETL server/edge
node, in the cluster, and on a framework-determined node in the cluster.
Maximum length
The maximum post-extraction length that DMX-h supports for an Impala database record is
16,777,216 bytes (16 MB).
Impala connections
Connecting to Impala requires configuration steps before the connections can be defined. Connection
requirements and behavior differ between Impala sources and targets.
Impala source connections
Using an Impala JDBC connection, DMX-h can read supported Impala data types from all supported
Impala file types: Apache Avro, Apache Parquet, Record Columnar (RCFile), Text, and SequenceFile.
Note: As per Impala limitations, DMX-h can read complex data types, which include structures and
arrays, only from Parquet-backed tables in Impala.
JDBC connectivity
When DMX-h reads from an Impala database table on an ETL server/edge node or in the cluster via
JDBC, the data is staged temporarily in uncompressed format to a text-backed Impala table.
Impala target connections
Using an Impala JDBC connection, DMX-h can write supported Impala data types to Impala targets
directly for optimal performance.
Note: As per Impala limitations, DMX-h can write complex data types to Parquet-backed tables in
Impala via a Hive database connection only, not via a JDBC connection.
JDBC connectivity
When DMX-h writes to an Impala database table via JDBC, data is generally loaded directly into
target tables. Writes are staged temporarily in compressed or non-compressed format to a
text-backed Impala table only when one or more of the following conditions limits direct access:
• A target table has one or more partitions
• A parquet-backed target table has any timestamp columns
• A target table performs Truncate or Apply Change (CDC) dispositions
• The job runs on localnode or singleclusternode
• A user forces DMX-h to stage data by setting the environment variable
DMX_IMPALA_TARGET_FORCE_STAGING to 1, which uses the two-step process
implemented in previous versions of DMX
Update and Upsert dispositions are supported only for Kudu tables.
At run-time, DMX-h accesses the Kudu jars from /opt/cloudera/parcels/CDH/lib/kudu on the
edge/master node for Impala access. You can override this default location by using the environment
variable KUDU_HOME. For example, export KUDU_HOME=/opt/cloudera/parcels/CDH/lib/kudu
sets the location accessed at run-time to /opt/cloudera/parcels/CDH/lib/kudu.
Impala configuration
Connecting to Impala from DMX-h requires the following configuration components:
• Impala JDBC connection
• Impala table staging
• Apache Sentry authorization when applicable
Impala JDBC connection
To connect to Impala via JDBC on Windows at design time, download the JDBC driver and specify
the mandatory driver name and driver class path parameters in the JDBC configuration file:
• Download the applicable Cloudera Impala JDBC Simba-based driver.
See Configuring Impala to Work with JDBC.
• Set the driver name and driver class path in the JDBC configuration file.
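As a sketch, the two mandatory entries in the DMX JDBC configuration file might look like the following; the driver class name and jar path are assumptions that depend on the Cloudera driver version you download, so take the exact values from the driver's documentation:

```
DriverName=com.cloudera.impala.jdbc.Driver
DriverClassPath=/opt/jdbc/impala/ImpalaJDBC42.jar
```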
Impala table staging
When reading Impala sources and writing Impala targets, DMX-h stages the data as a text-backed
Impala table.
To stage the tables, sufficient space and CREATE TABLE privileges are required.
Defining Impala database connections
In the Database Connection dialog, the general pattern to define a connection to an Impala database
is as follows:
• At DBMS, select Impala.
• At Access Method, select JDBC.
• At Database, select a previously defined Impala JDBC database connection URL.
• At Authentication, select Auto-detect or Kerberos.
Note: When Kerberos authentication is required, ensure that Kerberos is selected.
Defining Impala sources
For all DMX-h ETL jobs, DMX-h supports Impala database tables as sources and as lookup sources.
At the Source Database Table dialog or the Lookup Source Database Table dialog, define either an
Impala database table source or lookup source, respectively:
• At Connection, select a previously defined Impala source connection or select Add
new... to add a new connection.
• On the Parameters tab, the following optional parameters are available for Impala database
table sources and lookup sources:
o Filter - equivalent to the text that follows a WHERE clause in a SQL query, the filter
parameter specifies the condition upon which records are extracted from an Impala
source table.
o For partitioned Impala database table sources and lookup sources, you can specify
a partition predicate in the WHERE clause, which serves as a filter that enables
partition pruning and limits scanning to those portions of the table relevant to
partitions.
o Work table directory - serves as the parent-level directory beneath which job-specific
subdirectories are created for staging data.
o Work table schema - the schema used to create the staging table.
o Impala configuration properties - any Impala configuration property can be entered
manually in the parameters grid.
Defining Impala targets
At the Target Database Table dialog, define an Impala database table target:
1. At Connection, select a previously defined Impala target connection or select Add
new... to add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one. By default,
DMX-h creates text-backed Impala database tables; to create an Impala table backed by
some other file format, follow the instructions in the Create Database Table dialog help topic,
with the following modification:
3. Click View SQL.
4. In the SQL textbox, change STORED AS TEXTFILE to STORED AS <file_format>,
where <file_format> is the keyword for the applicable file format, such
as AVRO or PARQUET.
• User defined SQL statement is not supported.
• All target disposition methods are supported.
• All partition columns must be mapped.
5. On the Parameters tab, the following optional parameters are available for Impala target
database tables:
• Compute table statistics - To optimize subsequent Impala query performance, DMX-h
can run Impala analyze queries that collect target table statistics and target column
statistics after the load to the Impala target database.
o Valid values include true and false (default). If you specify false or if a parameter
value is blank, DMX-h does not run the parameter-specific query after the load to the
Impala target database.
o When Impala auto-analysis is enabled and DMX-h loads via staging table to the
Impala target database, Impala automatically computes table statistics, but not
column statistics, and stores the table statistics to the metastore.
• Maximum parallel streams - the maximum number of parallel streams that can be
established to load data; streams are created according to demand. This
value can also be specified via the environment
variable DMX_IMPALA_MAX_WRITE_THREADS. If specified both ways, the parameter
value takes precedence. If neither is specified, the default value is either the number of
CPUs on the edge node when running on the ETL server/edge node or is 1 for each
instance of DMX-h when running on the cluster.
• Work table codec - specifies the compression algorithm used to compress Impala data.
• Work table directory - serves as the parent-level directory beneath which job-specific
subdirectories are created for staging data.
• Work table schema - the schema used to create the staging table.
• Impala configuration properties - any Impala configuration property can be entered
manually in the parameters grid.
Note: The Set commit interval and Abort task if any record is rejected options are not supported.
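For instance, to cap the load at four parallel streams via the environment variable when the task parameter is not set, you might do the following; the value 4 is arbitrary for illustration:

```shell
# Cap DMX-h Impala load parallelism at 4 streams.
# The Maximum parallel streams task parameter, if set, takes precedence.
export DMX_IMPALA_MAX_WRITE_THREADS=4
echo "$DMX_IMPALA_MAX_WRITE_THREADS"
```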
Microsoft SQL Server
Your SQL Server client must be installed on the system and configured so that it can connect to
databases that you want to access from DMX. On 64-bit Windows, a SQL Server Native Client must
also be installed. Please refer to specific SQL Server documentation for details on configuring the
client.
Windows Systems
A SQL Server data source needs to be defined for each database that you want DMX to access. The
data source should be named the same as the SQL Server database it points to. Choose a SQL Native
Client as the DBMS driver on 64-bit Windows. See Defining ODBC Data Sources for details on
defining data sources on Windows systems.
Netezza
Installation and Configuration
DMX connects to Netezza databases through the Netezza nzload client utility, which is a component
of the Netezza client software package, and the Netezza Open Database Connectivity (ODBC) driver.
For Windows and UNIX systems, the client software package includes the Netezza client interface
software and the Netezza ODBC driver.
To establish a connection to the Netezza database, install the Netezza client software package on the
system on which the DMX client is installed.
Netezza client software package and driver installation
Windows systems
For Windows systems, the client software installation includes the following:
1. Install the Netezza client software package.
For procedures on installing the Netezza client software and ODBC driver, refer to the
installation chapter of the IBM Netezza System Administration Guide.
The default Netezza client installation is located in the following directory:
C:\Program Files (x86)\IBM Netezza Tools
The default Netezza ODBC driver installation is located in the following directory:
C:\Program Files (x86)\IBM Netezza ODBC Driver
2. Verify that the ODBC driver libraries, which are dynamic linked libraries with the .dll
extension, are installed successfully.
3. Verify that the Netezza ODBC driver installation directory is specified in the PATH.
4. Create and configure the ODBC DSN.
5. Specify the Netezza client utilities directory in the PATH.
For example, set the PATH as follows:
set PATH=%PATH%;C:\Program Files (x86)\IBM Netezza Tools\bin
Note: If the Netezza nzds and nzload client utilities are not in the PATH when DMX initiates
a load to the Netezza database, DMX does the following:
• nzds - DMX issues a performance warning message and establishes only one connection
to the Netezza database.
• nzload - DMX issues an error message and the DMX task aborts.
6. To run the nzds client utility, ensure that the database user account has the Manage
Hardware privilege.
For additional information on required privileges, see the IBM Netezza System
Administration Guide and the Netezza Data Loading Guide.
7. Verify the port number used to connect to the Netezza database.
When the NZ_DBMS_PORT environment variable is defined, DMX connects to the Netezza
database using the value specified in NZ_DBMS_PORT; otherwise, DMX connects to the
Netezza database using the default port number, 5480.
UNIX systems
For UNIX systems, the client software installation includes the following:
1. Install the Netezza client software package.
For procedures on installing the Netezza ODBC driver and the client software, refer to the
installation chapter of the IBM Netezza System Administration Guide.
The default Netezza client installation is located in the following directory:
/usr/local/nz/bin
The default Netezza ODBC driver installation is located in the following directory:
/usr/local/nz/lib64
2. Create and configure the ODBC DSN.
3. Specify the Netezza client utilities directory in the PATH.
For example, export the PATH as follows:
export PATH=$PATH:/usr/local/nz/bin
Note: If the Netezza nzds and nzload client utilities are not in the PATH when DMX initiates
a load to the Netezza database, DMX does the following:
• nzds - DMX issues a performance warning message and establishes only one connection
to the Netezza database.
• nzload - DMX issues an error message and the DMX task aborts.
4. Set NZ_ODBC_INI_PATH to point to the directory where the odbc.ini file (without the
leading period, ".") is located.
For example, set NZ_ODBC_INI_PATH as follows:
export NZ_ODBC_INI_PATH=$NZ_ODBC_INI_PATH:<directory_where_odbc.ini_is_located>
5. To run the nzds client utility, ensure that the database user account has the Manage
Hardware privilege.
For additional information on required privileges, see the IBM Netezza System
Administration Guide and the Netezza Data Loading Guide.
6. Verify the port number used to connect to the Netezza database.
When the NZ_DBMS_PORT environment variable is defined, DMX connects to the Netezza
database using the value specified in NZ_DBMS_PORT; otherwise, DMX connects to the
Netezza database using the default port number, 5480.
NoSQL Databases
DMX can connect to any NoSQL database, for example, Apache Cassandra, Apache HBase, and
MongoDB, provided that you install the applicable NoSQL database client software and a compliant
ODBC driver or JDBC driver.
DMX requires a Level 3.0 compliant ODBC driver or a Level 4.0 compliant JDBC driver to connect to
a NoSQL database. Provided that your ODBC or JDBC driver supports NoSQL databases as sources
and targets, DMX supports NoSQL databases as sources and targets.
To verify the level of NoSQL database support that your ODBC or JDBC driver provides, contact
your ODBC or JDBC driver vendor.
Installation and configuration
DMX connects to NoSQL databases through the client software applicable to your NoSQL database
and through a compliant ODBC or JDBC driver.
Client software installation and configuration
To reference client software download information, links, and installation instructions that are
applicable to current NoSQL databases, for example Cassandra, HBase, and MongoDB, consider the
following sites:
• Cassandra: http://cassandra.apache.org/
• HBase: http://hbase.apache.org/
• MongoDB: http://www.mongodb.org/
To establish a connection to a NoSQL database, install the applicable client software on the system
on which DMX is installed.
ODBC driver installation
DMX requires a Level 3.0 compliant ODBC driver to connect to a NoSQL database. For driver
installation and configuration information, refer to your ODBC driver documentation.
To reference ODBC driver download information for Simba ODBC drivers, for example, consider the
following sites:
• Cassandra: http://www.simba.com/connectors/apache-cassandra-odbc
• HBase: http://www.simba.com/connectors/apache-hbase-odbc
• MongoDB: http://www.simba.com/connectors/mongodb-odbc
The installation documentation applicable to these sites outlines the steps to create the ODBC DSN
and provides links to advanced options specific to the Simba driver for Cassandra, HBase, and
MongoDB.
You can also reference Defining ODBC Data Source Names. While you can use any ODBC driver
manager to load ODBC drivers for UNIX systems, by default, DMX uses the shipped unixODBC
driver manager.
JDBC driver installation
DMX requires a Level 4.0 compliant JDBC driver to connect to a NoSQL database. For driver
installation and configuration information, refer to your JDBC driver documentation.
Oracle
Your Oracle client must be installed on the system and configured so that it can connect to databases
that you want to access from DMX. Please refer to specific Oracle documentation for details on
configuring the client.
Oracle naming method
Oracle supports multiple naming methods to resolve Connect Identifiers. DMX only supports the
Oracle Local Naming Method, which uses aliases defined in the tnsnames.ora configuration file on
the Oracle client machine. This file is expected to reside in the <oracle_home>/network/admin
directory, where <oracle_home> denotes the directory where Oracle is installed, or in the directory
pointed to by the TNS_ADMIN environment variable. The Task Editor always reads the list of
available databases from tnsnames.ora to automatically populate the list of databases in the
Database Connection dialog. The file has to be formatted according to the Oracle documentation on
syntax rules for configuration files. Otherwise, DMX may not be able to read the file correctly,
resulting in an empty or partial database list in the Database Connection dialog. Verify that
TNSNAMES is listed as one of the values of the NAMES.DIRECTORY_PATH parameter in the
Oracle Net profile sqlnet.ora. The TNSNAMES field indicates that local naming is enabled.
If TNSNAMES is not listed as one of the values of the NAMES.DIRECTORY_PATH parameter in
sqlnet.ora, run Oracle Net Configuration Assistant or Oracle Net Manager to add local naming
method and the Oracle databases you want DMX to connect to. The configuration utility updates the
Oracle Net profile, sqlnet.ora, located in <oracle_home>/network/admin.
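A minimal tnsnames.ora entry might look like the following; the alias, host, and service name are hypothetical placeholders, and the exact syntax rules are given in the Oracle documentation:

```
ORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = dbhost.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = orcl.example.com)
    )
  )
```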
Windows systems
To access Oracle databases, Oracle client software must be accessible via the dynamic link libraries
(DLLs) located under the <oracle_home>\bin folder. The actual location of the Oracle installation is
usually stored in the ORACLE_HOME environment variable.
If you have installed the 64-bit version of DMX on 64-bit Windows, there are some important
differences with respect to defining a DMX Task and running your application.
UNIX systems
To access Oracle databases, Oracle client software must be accessible via the shared libraries located
under the <oracle_home>/lib and <oracle_home>/network/lib directories. The name of the shared
library directory may vary, e.g. lib32 or lib64, depending on the Oracle version.
Snowflake
Snowflake is a cloud data warehouse that separates storage from the compute platform in a cloud
environment. Through JDBC connectivity, DMX-h supports Snowflake data warehouses as sources
and targets.
Snowflake connection requirements
Snowflake requires a JDBC connection configuration with the driver name and location for all
connections. Before attempting to connect to Snowflake, do the following:
• Install DMX server on an Amazon Elastic Compute Cloud (EC2) instance or your local
machine.
• Specify JDBC and cluster parallelization parameters in the DMX JDBC configuration file.
The parameters outlined in the DMX JDBC configuration file, as defined by the
DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and
optional values required to access an Amazon S3 bucket and to invoke a Snowflake
COPY/MERGE query.
• If DMX runs inside EC2, attach an IAM role to the EC2 instance with the following
conditions:
1. The attached IAM role must grant DMX read and write access to objects in the work bucket
specified in the configuration file.
2. Configure the IAM role for Snowflake.
3. If the IAM role configured for Snowflake is not the same role attached to EC2, set the
IAMROLE parameter in the configuration file to the IAM role configured for Snowflake.
Note: When DMX cannot get temporary security credentials from an IAM role, DMX issues
an error message and the DMX task aborts.
• When DMX runs outside of an EC2 instance, DMX accesses Snowflake using key-based
authentication. If no access keys are provided, DMX issues a UNIAMCRE error message and
aborts the job.
The parameters outlined in a DMX Snowflake configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of parallel
streams that can be established for loading data to the staging area. By default,
MAXPARALLELSTREAMS is set to the number of CPUs available on the client machine.
• WORKTABLEDIRECTORY - Required path to an S3 bucket or local directory. If the path is
an S3 URL, s3://<bucket>, DMX creates external staging data. If the path is a local
directory, file://<user/data>, DMX creates internal staging data using the specified local
directory.
• WORKTABLESCHEMA - Optional schema name in which to create the staging data. The
default schema for the staging data is the same as that of the target.
• WORKTABLENCRYPTION - Server-side encryption algorithm for encrypting staging data in
the S3 bucket. Valid values are AES256 and aws:kms.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon
establishing an AWS account. If DMX runs in EC2, AWSACCESSKEYID is optional.
• AWSACCESSKEY - The 40-character string, which is also referred to as the secret access
key, which Amazon provides upon establishing an AWS account. If DMX runs in EC2,
AWSACCESSKEY is optional.
DMX requires the access key id and the secret access key to send requests to an Amazon S3
bucket.
• IAMROLE - Optional Amazon Resource Name (ARN) for an IAM role that Snowflake uses for
authentication and authorization if the same role is not attached to EC2. If EC2 and
Snowflake share the same role, this parameter is not required.
• LoadViaPut - Optional character. If WORKTABLEDIRECTORY is not set, DMX uses a PUT
command to load data when LoadViaPut is set to "y". If the work table directory isn't
provided and the LoadViaPut parameter isn't set to "y", the DMX job aborts with an error
message.
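Putting the parameters above together, a minimal Snowflake section of the DMX JDBC configuration file might look like the following sketch. The section name, driver class, jar path, and bucket name are illustrative assumptions, not values prescribed by this guide; adjust them for your environment.

```ini
# Hypothetical example only -- names and paths are assumptions.
[snowflake]
DriverName = net.snowflake.client.jdbc.SnowflakeDriver    # assumed driver class
DriverClassPath = /opt/jdbc/snowflake-jdbc.jar            # assumed jar location
MAXPARALLELSTREAMS = 8
WORKTABLEDIRECTORY = s3://my-work-bucket                  # S3 URL => external staging
WORKTABLESCHEMA = STAGING
WORKTABLENCRYPTION = AES256
```

Point the DMX_JDBC_INI_FILE environment variable at this file so that DMX can read it at run time.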
Defining Snowflake database connections
In the Database Connection dialog, define a connection to a Snowflake database as follows:
• At DBMS, select Snowflake.
• At Access Method, select JDBC.
• At Database, select a previously defined Snowflake JDBC database connection URL.
• At Authentication, select Auto-detect.
Snowflake target connections Using a Snowflake JDBC connection, DMX-h can write supported Snowflake data types to Snowflake
targets directly for optimal performance.
Defining Snowflake targets
At the Target Database Table dialog, define a Snowflake database table target:
1. At Connection, select a previously defined Snowflake target connection or select Add new... to
add a new one.
2. Select a table from the list of Tables, or select Create new... to create a new one.
o User defined SQL statement is not supported.
o All target disposition methods are supported.
3. On the Parameters tab, the following optional parameters are available for Snowflake target
database tables. Values specified here take precedence over their corresponding properties in
the JDBC configuration file, if any.
o Maximum parallel streams - the maximum number of parallel streams that can be
established to load data; streams are created according to demand.
o Work directory connection - name of the Amazon S3 connection that DMX uses to
connect and write to the work table directory.
o Work table codec - specifies the compression algorithm used to compress Snowflake
data.
o Work table directory - serves as the parent-level directory beneath which job-specific
subdirectories are created for staging data.
o Work table encryption - server-side encryption algorithm used to encrypt the staging data.
o Work table schema - the schema used to create the staging table.
4. Note that Set commit interval and Abort task if any record is rejected are not supported.
Snowflake source connections Using a Snowflake JDBC connection, DMX can read supported Snowflake data types from any
Snowflake table.
Defining Snowflake sources
For all DMX-h ETL jobs, DMX-h supports Snowflake database tables as sources and as lookup
sources. At the Source Database Table dialog or at the Lookup Source Database Table dialog define
either a Snowflake database table source or lookup source respectively:
• At Connection, select a previously defined Snowflake source connection or select Add new...
to add a new connection.
Sybase Your Sybase client and Open Client Library must be installed on the system and configured so that
it can connect to databases that you want to access from DMX. Please refer to specific Sybase
documentation for details on configuring the client.
Windows Systems To access Sybase databases, Sybase client software and Open Client Library must be accessible via
the dynamic link libraries (dll) located in the installation directory. You can configure the client by
using the dsedit utility, provided with the Sybase installation, to define database connections in the
sql.ini file.
UNIX Systems To access Sybase databases, Sybase client software and Open Client Library must be accessible via
the shared libraries located in the Sybase installation directory. You can make the client libraries
accessible by running the scripts provided with Sybase, such as <sybase_home>/SYBASE.sh, where
<sybase_home> denotes the directory where Sybase is installed. You can configure the client by
using the dsedit utility, provided with the Sybase installation, to define database connections in the
interfaces file.
Teradata In order to define a task that uses a Teradata table, the DMX Task Editor needs to access the
Teradata database from the system where the DMX Task Editor is run. This requires the Teradata
Call-Level Interface Version 2 for Network-Attached Systems (CLIv2), which is a Teradata Tools and
Utilities product, to be installed and configured on that system.
To access the Teradata database at run-time, Teradata FastExport, Teradata Parallel Transporter
(TPT), Teradata FastLoad, Teradata MultiLoad and Teradata Parallel Data Pump, which are
Teradata Tools and Utilities products, must be installed and configured on the system where DMX
jobs are run.
Installation and configuration
Teradata client software For Windows and UNIX systems, the client software installation includes the following:
1. On the system where the DMX Task Editor runs, install and configure the Teradata Utility
Pack, which includes CLIv2 and the Teradata ODBC driver.
2. On the system where the DMX Job Editor runs, install and configure the Teradata extract
and load utilities.
For Windows systems, the default, base installation directory is as follows:
C:\Program Files (x86)\Teradata\Client\<client_software_version_number>
For UNIX systems, the default, base installation directory is as follows:
/opt/teradata/client/<client_software_version_number>
For installation instructions and for information on the subdirectories under which the software
components are installed, see the Teradata Tools and Utilities Installation Guide for Windows or UNIX
that corresponds to your Teradata client software version.
Note:
• The Teradata installer adds all required directories to the PATH.
• When connecting through CLIv2 using the TTU access method, you do not have to create and
configure an ODBC data source.
Vertica The primary way to connect to Vertica databases is via ODBC. With Vertica version 7 or later, DMX
establishes parallel connections using Vertica COPY LOCAL, providing optimal performance, ease-
of-use, and dynamic tuning.
With older versions of Vertica, there are cases when the DMX Vertica Load Example Files method
may perform better than the ODBC method, as described at the end of this topic in "Choosing a
Method."
Connecting via ODBC When connecting to Vertica databases via ODBC, DMX uses different load methods based on the
Vertica version, as shown in the following overview table:
Vertica Version    Load Method
7 or later         Multi-stream Vertica COPY LOCAL via ODBC on both Windows and Linux
6 or later         Linux: Multi-stream Vertica COPY LOCAL via ODBC
                   Windows: Multi-stream SQL INSERT via ODBC
Earlier than 6     Single-stream SQL INSERT via ODBC
Vertica Version 6 or Later When connecting to Vertica via ODBC, on Linux as of Vertica version 6, and on Windows as of
Vertica version 7, DMX uses Vertica COPY LOCAL to load data, which provides the best possible
load performance. If running Vertica 6 on Windows, it uses multi-stream SQL INSERT.
Vertica Earlier than Version 6 When connecting to Vertica via ODBC prior to version 6, the Vertica ODBC client driver method
loads data using a SQL INSERT statement to a single Vertica initiator node.
Configuring ODBC for Vertica The Vertica configuration file, vertica.ini, is used by Vertica to determine the absolute path to the
file containing the ODBC installer library and the absolute path to the directory containing the
Vertica client driver's error message files. The path to vertica.ini is set through the Vertica
configuration file environment variable, VERTICAINI.
Note: vertica.ini is different from the DMX node loading configuration file, DMXVertica.ini, which is
specified through DMX_VERTICA_INI_FILE.
To configure the Vertica ODBC driver:
1. Follow the instructions for defining ODBC data sources on Windows and UNIX systems.
For Vertica ODBC client driver v5.1 or later on UNIX/Linux platforms, specify the following
DSN parameters in vertica.ini and set the environment variable VERTICAINI to point to the
location of the vertica.ini file:
[Driver]
ODBCInstLib=<dmx_install>/lib/libodbcinstSSL.so
ErrorMessagesPath=<absolute_path_to_directory_containing_Vertica_client_driver's_localized_error_message_files>
where <dmx_install> is the directory where DMX is installed.
The error message files are generally stored in the same directory as the Vertica ODBC
driver files.
2. When using the unixODBC driver manager, override the standard threading settings in the
ODBC section of odbcinst.ini as follows:
[ODBC]
Threading = 1
For additional details, see the Vertica Programmer's Guide for your version of Vertica.
Other DBMSs
ODBC Some sources and targets other than the databases mentioned above (e.g. Microsoft Access
databases) may be accessed via ODBC, by defining the appropriate ODBC data sources. See Defining
ODBC Data Sources for details on defining data sources on Windows and UNIX systems.
JDBC To access a database management system (DBMS) that is not explicitly supported by DMX, Java
Database Connectivity API (JDBC) can be used when a JDBC driver is provided by the database
vendor. The JDBC driver establishes the connection to the database and implements the protocol for
transferring queries and results between a client and the database.
Connecting to JDBC sources and targets requires that you define a JDBC configuration file, set DMX
and JAVA environment variables, and specify the database connection URL, which the DBMS JDBC
driver uses to connect to a database through the Database Connection dialog.
Connection Overview Through the DMX_JDBC_INI_FILE environment variable, DMX gains access to the JDBC
configuration file, which you define. As per the JDBC driver properties outlined in the JDBC
configuration file, DMX determines the JDBC driver class name and the Java class path to the class
and dependent classes; establishes the connection with the DBMS; and connects to the source or
target database, which is specified in the database connection URL.
JDBC Configuration File Outlined within the JDBC configuration file are the JDBC driver class name and Java class path for
locating the driver class and dependent classes for each DBMS. A separate section in the JDBC
configuration file is required for each DBMS.
Format Requirements
The JDBC configuration file is organized in sections. Consider the following format requirements:
• A section header marks the beginning of each section and is specified by a string enclosed in
square brackets ([]). The enclosed string specifies the name or alias of the DBMS. To
establish a connection to the database in the DBMS, the DBMS name, which is enclosed
within brackets([]) in the section header of the JDBC configuration file, must match the
DBMS name specified in the second parameter of the database connection URL.
• Within each section, name-value pairs describe the properties of the JDBC driver. Unless
otherwise stated, parameter names are case-insensitive and parameter values are case-
sensitive. Each line can contain a maximum of one parameter description where the
parameter value is separated from the parameter name by an equal sign (“=”). Extra spaces
before and after the equal sign are ignored.
Consider the following parameters for accessing a DBMS through JDBC:
• DriverName - Mandatory - This parameter identifies the JDBC driver class name.
• DriverClassPath - Mandatory - This parameter identifies the Java class path that
points to the JDBC driver. Use a semi-colon (;) to separate entries in the path.
• SelectStatement and InsertStatement – Optional - If the query language used by the DBMS
does not follow the SQL92 standard, these optional parameters enable you to provide custom
queries for select and insert operations. When you provide these parameters, DMX uses the
statement templates to create the appropriate select and insert statements. If either the
SelectStatement or InsertStatement parameter is not defined, standard SQL is used for the
corresponding read/write operation. If the right side of the equal sign is blank for the
SelectStatement or InsertStatement parameter, the corresponding read/write operation is
not supported. You can use the following place holders in the statement templates:
o <columns> – location where the actual comma-separated columns should be placed.
o <table> – location where the actual table name should be placed.
• IsSchemaSupported – Optional - This optional parameter ensures that DMX correctly
identifies all specified database tables. Through JDBC calls, DMX can generally determine
whether a DBMS supports a schema; however, certain DBMSs, such as Hive, do not return
the values that are expected from certain JDBC calls. Under these circumstances, you can
set the IsSchemaSupported parameter to ensure that all specified database tables are
identified correctly. The values for this parameter can be either true or false; as an exception
to the general rule, IsSchemaSupported parameter values are case insensitive.
For information on connecting to Hive through ODBC, see Connecting to Hive data warehouses.
• The character '#' marks the start of a comment, which continues until the end of the line.
Comments are permitted anywhere within a JDBC configuration file.
• Empty lines are permitted anywhere within a JDBC configuration file.
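The format requirements above can be sketched as a small example section. The section name, driver class, and jar path below are illustrative assumptions; the only requirement is that the section name match the DBMS name in the database connection URL.

```ini
# Hypothetical example -- the section name must match the DBMS name
# used in the connection URL (e.g. jdbc:mysql://...).
[mysql]
DriverName = com.mysql.jdbc.Driver                    # assumed driver class
DriverClassPath = /opt/jdbc/mysql-connector-java.jar  # assumed jar location
IsSchemaSupported = false    # only needed for DBMSs, such as Hive, that do not
                             # return the expected values from JDBC calls
```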
DMX and JAVA Environment Variables
DMX_JDBC_INI_FILE To provide DMX with access to the JDBC configuration file, you must set the DMX environment
variable, DMX_JDBC_INI_FILE, to point to the full path of the JDBC configuration file. The full
path includes the directory location and the JDBC configuration file name.
Consider the following examples on setting the DMX_JDBC_INI_FILE environment variable:
On Windows: set DMX_JDBC_INI_FILE=C:\Program Files\DMExpress\Programs\DMXJdbcConfig.ini
On UNIX: export DMX_JDBC_INI_FILE=/usr/dmexpress/etc/DMXJdbcConfig.ini
JAVA_HOME After you install the Java Runtime Environment (JRE) in Windows, you must set the Java
environment variable, JAVA_HOME, to point to the JRE installation directory. The bit level (32 or
64) of the installed JRE must match the bit level of the DMX release that you are running.
Consider the following examples on setting the JAVA_HOME environment variable:
On Windows: set JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0_51
On UNIX: export JAVA_HOME=/usr/java/jdk1.6.0_24
Database Connection URL To connect to a database using the JDBC access method, you must specify the database connection
URL as the database specification in the Database Connection dialog.
MySQL Example At a minimum, each database connection URL, which the JDBC driver uses to connect to the JDBC
source or target, consists of jdbc, which is the required first parameter; the DBMS name, the
database host name, the database name, and any additional connection property specification.
To access database db1 in the MySQL DBMS installed on the local computer, consider the following
valid database URL:
jdbc:mysql://localhost/db1
where
jdbc - required first parameter to connect to a JDBC source or target.
mysql - DBMS name. This DBMS name must match the DBMS name specified within
brackets ([]) in the section header of the JDBC configuration file.
//localhost/db1 - host and database identification string that identifies the db1 database
in the local MySQL DBMS installation.
For additional information, see MySQL Driver and Data Source Class Names, URL Syntax, and
Configuration Properties for Connector/J.
Defining ODBC Data Sources A data source needs to be defined for each database that you want DMX to access through ODBC.
The data source name on the client, where DMX tasks or jobs are defined, has to be the same as the
data source name on the server where DMX tasks or jobs run.
Windows Systems You can define an ODBC data source through the ODBC Data Source Administrator as follows:
• From the Start menu, select Settings, Control Panel, Administrative Tools, Data Sources
(ODBC).
• In the ODBC Data Source Administrator dialog, choose the User DSN or System DSN tab,
and click on the Add button. On 64-bit Windows, select the User DSN tab.
• In the Create New Data Source dialog, select the appropriate DBMS driver from the list, e.g.
SQL Server, Microsoft Access Driver (*.mdb), etc. Then, press the Finish button.
• The setup wizard guides you with further driver specific instructions.
UNIX Systems On UNIX systems, you may choose to use the ODBC driver manager, unixODBC, that is shipped
with DMX, or you may use your own driver manager.
DMX Default Driver Manager The DMX install and databaseSetup programs assist you in creating unixODBC data sources.
Alternatively, you can define ODBC data sources manually. The ODBC data source manager
provides support for ODBC data sources through two configuration files, odbcinst.ini and odbc.ini
that are located in the directory <dmx_home>/etc. This directory also contains templates and
examples of the configuration files. To change the location of the configuration files, export the
ODBCSYSINI environment variable to the new directory where both files reside.
The files need to be set up appropriately before you can access databases via ODBC. This is a one-
time configuration step, similar to defining system data sources on Windows.
• <dmx_home>/etc/odbcinst.ini: This file contains DBMS specific and system specific driver
definitions. Configuring this file corresponds to selecting the DBMS driver while adding a
data source on a Windows system.
• <dmx_home>/etc/odbc.ini: This file contains DBMS specific data source definitions, based on
the drivers defined previously in the odbcinst.ini file. Configuring this file corresponds to
following DBMS driver specific instructions while adding a data source on a Windows
system.
If you wish to remove a data source, delete the section that corresponds to that data source from the
odbc.ini file. A section starts with the data source name enclosed by [], and ends at the beginning of
the next section or at the end of the file.
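As a sketch of the one-time configuration described above, a driver definition in odbcinst.ini and a matching data source section in odbc.ini might look like the following. The driver name, library path, and data source name are illustrative assumptions; consult your DBMS driver documentation for the actual keys it requires.

```ini
# <dmx_home>/etc/odbcinst.ini -- driver definition (path is an assumption)
[MyDBMSDriver]
Driver = /opt/mydbms/lib/libodbcdriver.so

# <dmx_home>/etc/odbc.ini -- data source definition referencing the driver above
[MyDataSource]
Driver     = MyDBMSDriver
Database   = mydb
Servername = dbhost.example.com
```

The section name in odbc.ini is the data source name that must match on both the client and the server.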
64 Bit ODBC The ODBC headers and libraries that are shipped with the Microsoft Data Access Components
(MDAC) 2.7 Software Development Kit (SDK) have changed from earlier versions of ODBC to
support 64-bit platforms. Since the ODBC driver for a specific DBMS and the unixODBC libraries
are built separately, there may be an incompatibility in the definition of the SQLLEN variable, which
was specifically introduced for ODBC access on 64-bit UNIX platforms. On 64-bit UNIX platforms,
DMX assumes that the ODBC driver is 64-bit compliant and defaults the value of the SQLLEN
variable to 8 bytes. You can override this default by setting the DMXSQLLEN value corresponding to
the specific DBMS driver in the odbcinst.ini file.
Use Other ODBC Driver Manager By default, DMX uses the shipped unixODBC driver manager to load all ODBC drivers. Some ODBC
drivers, such as Teradata ODBC driver, may not work with it. You can tell DMX to use a different
ODBC driver manager by specifying the option DMXODBCDRIVERMANAGER=No under the driver section in
the odbcinst.ini file. You need to make sure that your ODBC driver manager library path (e.g.
/usr/odbc/lib for Teradata V12) is in the system library path (e.g. LD_LIBRARY_PATH on Linux) so
that it is loaded first by DMX. In addition, you may need to export the ODBCINI environment
variable with the absolute path to the file odbc.ini (e.g. export ODBCINI=<dmx_home>/etc/odbc.ini).
Refer to your DBMS documentation for details on this requirement.
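Combining the DMXSQLLEN and DMXODBCDRIVERMANAGER settings described above, a driver section in odbcinst.ini might look like the following sketch. The section name and library path are illustrative assumptions, not values from this guide.

```ini
# Hypothetical odbcinst.ini driver section
[Teradata]
Driver = /usr/odbc/drivers/tdata.so  # illustrative path to the DBMS driver
DMXSQLLEN = 4                        # override the 8-byte SQLLEN default if needed
DMXODBCDRIVERMANAGER = No            # load a driver manager from the system library path
```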
Connecting to Message Queues from DMX DMX can access message queues as sources or targets when the appropriate message queue client
software is installed on the system and accessible. The configuration steps needed to access a specific
message queue are described in the following sections.
To connect to a message queue via a Data Connector, follow the installation instructions that
accompany the connector.
IBM WebSphere MQ To create a connection to an IBM Websphere queue manager, you specify the queue manager name
in the Message Queue Connection dialog, and provide a channel definition that includes:
• the channel name,
• the transport type, and
• the connection name, with an optional port number.
Port number DMX assumes a default port number of 1414 for the port where the server’s listener is expecting
client communication. You can change the port number by specifying it in parentheses following the
connection name. For example, 192.168.2.100(1415) or server-machine.com(1415).
The channel definition may be specified by either:
1. Defining the MQSERVER environment variable, or
2. Using a DMX WebSphere queue manager configuration file.
The MQSERVER environment variable You can define a channel via the MQSERVER environment variable as defined by IBM.
Example On Windows:
SET MQSERVER=CHANNEL1/TCP/MQSERVER01
or, to change the default port number:
SET MQSERVER=CHANNEL1/TCP/MQSERVER01(1418)
On Unix:
export MQSERVER=CHANNEL1/TCP/’MQSERVER01’
or, to change the default port number:
export MQSERVER=CHANNEL1/TCP/’MQSERVER01(1418)’
Queue manager configuration file You can also create channel definitions for one or more queue managers in a configuration file, and
provide the fully qualified file name to DMX in the DMX_CONNECTOR_ENV_MQ_WS_INI_FILE
environment variable. This populates the Queue manager combo box of the Message Queue
Connection dialog with the defined queue managers or their aliases.
The contents of the file must be formatted as follows:
• Anything following a “#” character until the end of the line is a comment. Comments are
allowed anywhere.
• Empty lines are allowed anywhere.
• The file is organized in sections. The beginning of each section (the section header) is
specified by a string enclosed in square brackets. The enclosed string may be a queue
manager name or a queue manager alias.
• The section headers must be unique.
• The lines between section headers contain the channel definition parameters for that
particular queue manager or alias. There are 4 supported parameters: queuemanager, channel,
transport, connectionname. The parameter values are separated from the parameter name by
an “=” character. The parameter names are case insensitive, but their values are case
sensitive, except for the transport parameter. Each line may contain at most one parameter
definition.
• The queuemanager parameter is used for cases where the section name is not a queue manager
name, but a queue manager alias. This is to allow potential configuration of different channel
definition options for the same queue manager.
If there is a configuration section with just the name but no parameters, the MQSERVER
environment variable definition is used at connection time. This saves you from typing in the queue
manager's name in the GUI, as the name appears in the list of known queue managers.
The connection parameters defined in this file override the MQSERVER environment variable only if
all 3 parameters (channel, transport and connectionname) are defined for a particular queue manager.
If you would like to define several different parameter sets for the same queue manager, use an alias
for the section name and override the queue manager name by defining the queuemanager parameter
inside the parameters section for that alias.
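As an illustration of the alias mechanism described above, two parameter sets for the same queue manager could be defined as follows. All names here are made up for the example.

```ini
# Hypothetical aliases for one queue manager (QM1)
[qm1.via.tcp]
queuemanager   = QM1
channel        = all.clients
transport      = tcp
connectionname = mq-host.example.com(1418)

# Section with a name but no parameters: the MQSERVER environment
# variable definition is used at connection time.
[QM1]
```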
Sample configuration file A sample configuration file (DMXWebSphereConnector.ini) is installed in the directory:
• On Windows: <dmx>\Examples\WebSphereConnector\DMXWebSphereConnector.ini
• On Unix: <dmx>/etc/DMXWebSphereConnector.ini
where <dmx> is the directory where DMX is installed.
Example Define the DMX_CONNECTOR_ENV_MQ_WS_INI_FILE environment variable:
SET DMX_CONNECTOR_ENV_MQ_WS_INI_FILE = C:\tmp\DMXWSConfig.ini
Create the DMXWSConfig.ini file at the above location, with the following content:
[my.local.queue.manager]
Channel = all.clients
Transport = tcp
Connectionname = mw-server.com
Connecting to Salesforce from DMX In order for DMX to connect to Salesforce.com, the SSL client certificate must be up-to-date. The
DMX installation includes the file cacert.pem in the <install_dir>/CACertificates directory. This is a
plain text file containing a set of public keys used for SSL authentication when connecting to
Salesforce.com.
If this file goes out-of-date, an HTTPSCVF error is issued when attempting to connect to
Salesforce.com. If that happens, go to http://curl.haxx.se/ca/cacert.pem, save the file as cacert.pem to
the <install_dir>/CACertificates directory, and retry the Salesforce.com connection.
Connecting to SAP from DMX In order for DMX to access data in an SAP system, SAP’s NetWeaver client libraries NW RFC SDK
7.10 with patch level 2 or higher must be on the system and accessible via the appropriate shared
library or dynamic link library (dll) paths. They can be downloaded from SAP’s marketplace at
http://service.sap.com/swdc. For Windows 64-bit platforms, both the 64-bit and 32-bit NetWeaver
client libraries are required.
The following environment variable must be set to include the path to the NetWeaver client
libraries, for example, <download_path>/nwrfcsdk/lib, and exported on the corresponding platform:
Windows PATH
AIX LIBPATH
HP-UX SHLIB_PATH
Linux LD_LIBRARY_PATH
Solaris LD_LIBRARY_PATH
On UNIX systems, the variable needs to be set and exported prior to starting the DMX Run-time
Service or running DMX tasks or jobs.
The SAP NetWeaver client libraries depend on the corresponding C/C++ libraries that they were
built with. The path to the C/C++ libraries must also be included in the library search path.
Windows: Microsoft C Runtime DLLs version 8.0 need to be installed. Refer to SAP Note
684106 at https://service.sap.com/sap/support/notes/684106. The vcredist_x86 package
needs to be installed on all Windows platforms. In addition, the vcredist_IA64 and
vcredist_x64 packages need to be installed on Windows IA64 and Windows x64 platforms,
respectively.
AIX: AIX C++ library libC.a – usually found in /usr/lib.
HP-UX IA64: HP C++ library libCsup.so.1 – usually found in /usr/lib/hpux64.
Linux: C library version 2.3.4 or higher, libstdc++.so.6 – usually found in /lib/tls and /usr/lib.
Refer to SAP Note 1021236 at https://service.sap.com/sap/support/notes/1021236.
Solaris: Sun C++ libraries libCstd.so.1 and libCrun.so.1 for SunOS 5.10 or higher – usually
found in /usr/lib/sparcv9 or /usr/lib/64.
If lower versions of these libraries are also on the system, then the path to the libraries of the
required version must be before the older versions in the library search path.
On all systems, the path to the DMX library must be before the path to the SAP NetWeaver client
libraries in the library search path.
The SAP client library must be configured so that it can connect to SAP systems that you want to
access from DMX. For example, you can configure the client by defining SAP system aliases in the
sapnwrfc.ini file. Please refer to specific SAP documentation for details on configuring the client.
Once configured, the directory where the file is located must be set in the environment variable
RFC_INI.
The DMX install program assists you with verifying connections to SAP systems.
On UNIX systems, if you wish to configure and/or verify SAP connections any time after the
installation procedure, run the SAPSetup program as follows:
cd <dmx_home>
./SAPSetup
Registering DMX in SAP SLD Per SAP recommendation, each DMX server should be registered in your SAP SLD (System
Landscape Directory). Please refer to the topic “Registration of DMExpress Components in the SAP
System Landscape Directory” in the DMX help.
Connecting to HDFS from DMX In order for DMX to access data located in HDFS, a Hadoop distribution must be installed and
configured as follows on the system where the DMX jobs and tasks are executed:
• The hadoop command must be accessible to DMX:
o DMX first looks for the hadoop command in $HADOOP_HOME/bin/hadoop, where the
environment variable HADOOP_HOME is set to the directory where Hadoop is installed.
Defining environment variables can be done through the Environment Variables tab of
the DMX Server dialog.
o If HADOOP_HOME is not defined or the directory can't be found, DMX looks for the
hadoop command in the system path, where it is automatically added by some Hadoop
distributions.
• The fs.default.name property in the core-site.xml configuration file must be set to point to
the Hadoop file system.
• The HTTP namenode daemon must be running on the default port 50070. If you would like
to use a different port number, please contact Technical Support.
• If the Hadoop cluster requires Kerberos authentication, you need to use the dmxkinit utility
to run your HDFS extract/load jobs/tasks.
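The lookup order for the hadoop command described above can be sketched as a small shell function. This is an illustration of the documented behavior, not DMX source code: it prefers $HADOOP_HOME/bin/hadoop and falls back to the system path.

```shell
# Sketch of the documented lookup order (illustrative only):
# 1. $HADOOP_HOME/bin/hadoop, if HADOOP_HOME is set and the command exists
# 2. otherwise, the hadoop command found in the system path
resolve_hadoop() {
  if [ -n "$HADOOP_HOME" ] && [ -x "$HADOOP_HOME/bin/hadoop" ]; then
    echo "$HADOOP_HOME/bin/hadoop"
  else
    command -v hadoop
  fi
}
```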
Connecting to Connect:Direct nodes from DMX In establishing connectivity to a Connect:Direct node on the mainframe, DMX initiates file transfers
from this node to an open-systems Linux server.
Security The Connect:Direct proprietary security protocol offers security through authentication and user
proxies. User authorities and user proxies are set up during Connect:Direct installation and
configuration.
Installation and Configuration For DMX to access data located on a Connect:Direct node, a Connect:Direct server and
Connect:Direct client must be installed on the same Linux machine on which DMX jobs and tasks
are executed.
• Configure Connect:Direct to access the required Connect:Direct nodes.
For details on configuring Connect:Direct nodes, refer to the IBM Sterling Connect:Direct product
documentation.
• Add a Connect:Direct user for each DMX user who accesses Connect:Direct.
• The DMX server must be configured as the Connect:Direct primary node (pnode) to enable
sampling with Connect:Direct connections.
• Prior to starting the DMX Run-time Service or to running DMX tasks or jobs, set the
following environment variables:
o NDMAPICFG points to the CLI/API configuration file, ndmapi.cfg, for example:
export NDMAPICFG=$NDMAPICFG:<connect:direct_home>/ndm/cfg/cliapi/ndmapi.cfg
DMX Install Guide 85
o PATH points to the Connect:Direct /bin directory, for example:
export PATH=$PATH:<connect:direct_home>/ndm/bin
If you plan to start the DMX Run-time Service using sudo, use the -E option to preserve the environment variable settings.
Note: If the :file.open.exit.program parameter in the user.exits section of the parameter
initialization configuration file, <connect_direct_home>/ndm/cfg/<node_name>/initparm.cfg, contains
any path, including the path to SSConnectDirectFileOpenUserExit, remove the full path such that the
parameter value is blank: :file.open.exit.program=:\
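For illustration, the edit described in the note might look as follows in initparm.cfg; the full path shown in the "before" line is a hypothetical example, not a value from your installation:

```text
Before (parameter contains a path):
:file.open.exit.program=/opt/cdunix/ndm/bin/SSConnectDirectFileOpenUserExit:\

After (parameter value is blank):
:file.open.exit.program=:\
```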
Connecting to Databricks File Systems (DBFSs)
Databricks is a cloud storage Platform-as-a-Service for Spark supported on Azure and AWS Cloud Services. Through JDBC connectivity, DMX-h supports Databricks File System (DBFS) content as a source and as a target.
NOTE: A DBFS connection is a remote file connection, which is logically different from a Databricks
database connection that supports queries.
Databricks File System (DBFS) connection requirements
Databricks requires a JDBC connection configuration with the driver name and location for all
connections. To use a Databricks File System (DBFS) connection, you also require a DMX execution
profile with Databricks deployment configuration parameters. To install DMX on a Databricks
cluster, follow the instructions in Deploying DMX to a Databricks cluster in the cloud.
Before attempting to connect to Databricks, do the following:
• Install the DMX server on an Amazon Elastic Compute Cloud (EC2) instance or your local machine.
• Specify JDBC and cluster parallelization parameters in the DMX JDBC configuration file.
The parameters outlined in the DMX JDBC configuration file, as defined by the
DMX_JDBC_INI_FILE environment variable, provide DMX with the mandatory and optional
values required to access an Amazon S3 bucket or Azure blob to invoke a Databricks query.
Refer to the DMX Databricks JDBC configuration information provided below.
• DMX accesses Databricks using token-based authentication. If no access keys are provided, whether session-based or explicit, DMX issues a UNIAMCRE error message and aborts the job.
• To use a DBFS connection, specify Databricks deployment configuration parameters in a DMX
execution profile. You can use a global, user, and/or job-specific execution profile, but DBFS is
unreachable without this configuration. See the Execution Profile topic in the DMX Help.
The parameters outlined in a DMX Databricks JDBC configuration file include the following:
• DriverName - Required JDBC driver name.
• DriverClassPath - Required JDBC class path.
• ANALYZETABLESTATISTICS - When set to y, DMX can run analyze queries that collect table statistics. The default value is n.
• ANALYZECOLUMNSTATISTICS - When set to y, DMX can run analyze queries that collect
column statistics. Default value is n.
• MAXPARALLELSTREAMS - Optional integer representing the maximum number of
parallel streams that Connect can establish for loading data into the staging data file. By
default, MAXPARALLELSTREAMS is set to the number of CPUs available in the client
machine.
• WORKTABLEDIRECTORY - Required path to an S3 bucket, Azure blob container, or Databricks File System (DBFS) store in which to stage data. You must mount an S3 bucket or Azure blob container using the Databricks File System (DBFS). Example URLs include:
o s3a://dev for an S3 bucket
o wasbs://[email protected]/dev for an Azure blob
o dbfs://dev for a DBFS store
• WORKTABLESCHEMA - Optional schema name in which to create the staging data. By default, the staging data uses the same schema as the target data.
• WORKTABLECODEC - A compression codec to compress the files in the staging directory. Valid values are gzip (default), bzip2, and uncompressed.
• MAXWORKFILESIZE - Optional integer. The maximum size in bytes of a staging file written by each task. The default value is 134217728, which is equivalent to 128 MB.
• AWSACCESSKEYID - A 20-character, alphanumeric string that Amazon provides upon
establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEYID is
optional.
• AWSACCESSKEY - The 40-character string, also known as the secret access key, which
Amazon provides upon establishing an AWS account. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an S3 bucket. If DMX runs in EC2, AWSACCESSKEY is
optional. DMX requires the access key id and the secret access key to send requests to an
Amazon S3 bucket unless an AWS temporary session token is required, in which case DMX
requires the access key id and AWS temporary session token. See the AWSTOKEN
parameter below.
• AWSTOKEN - An AWS temporary session token, granting temporary security credentials
(temporary access keys and a security token) to any IAM user enabling them to access AWS
services. This alternative authentication method replaces a full-access AWS storage access
key. DMX ignores this parameter unless WORKTABLEDIRECTORY is an S3 bucket.
• AZURESTORAGEACCESSKEY - A 512-bit Azure Blob Storage access key for an active account; Microsoft issues two such keys upon establishing an Azure Portal account. If DMX runs in the Azure Blob container, AzureStorageAccessKey is optional. If storage access is required and the key is missing or invalid, DMX issues an AZSQDWTERR error message and aborts the job. DMX ignores this parameter unless WORKTABLEDIRECTORY is an Azure blob container.
• AZURESTORAGESAS - A shared access signature (SAS) URI that grants restricted access
rights to Azure Storage resources. This alternative authentication method replaces a full-
access Azure Storage access key. DMX ignores this parameter unless
WORKTABLEDIRECTORY is an Azure blob container.
• DBFSMOUNTPOINT - DBFS mount point (DBFS path) required by
WORKTABLEDIRECTORY. DBFSMOUNTPOINT is mandatory if the work table directory
maps to an S3/Azure URL.
• LOADVIADBFS - Optional character. If WORKTABLEDIRECTORY is not set, DMX uses a
DBFS command to load data when LoadViaDBFS is set to "y". If the work table directory
isn't provided and the LoadViaDBFS parameter isn't set to "y", the DMX job aborts with an
error message. This option requires Databricks deployment configuration parameters from a
DMX execution profile.
Consider the following format of a DMX Databricks configuration file:
[spark]
DriverName=<JDBC.Driver.name>
DriverClassPath=<JDBC_Driver_ClassPath>
ANALYZETABLESTATISTICS=<y|n>
ANALYZECOLUMNSTATISTICS=<y|n>
MAXPARALLELSTREAMS=<Maximum_Parallel_Streams>
WORKTABLEDIRECTORY=<Directory_path>
DBFSMOUNTPOINT=<DBFS_path>
MAXWORKFILESIZE=<Maximum_Bytes>
WORKTABLESCHEMA=<Amazon_S3_schema>
WORKTABLECODEC=<Amazon_S3_codec>
AWSACCESSKEYID=<AWS_access_key_id>
AWSACCESSKEY=<AWS_access_key>
AWSTOKEN=<AWS_token>
AZURESTORAGEACCESSKEY=<Azure_access_key>
AZURESTORAGESAS=<Azure_SAS_URI>
LoadViaDBFS=<y|n>
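As a concrete sketch, a minimal [spark] section for staging to an S3 bucket might look like the following; every value shown (driver name, jar path, bucket, mount point, keys) is a placeholder assumption, not a verified default:

```ini
[spark]
DriverName=com.simba.spark.jdbc.Driver
DriverClassPath=/opt/jdbc/SparkJDBC42.jar
MAXPARALLELSTREAMS=8
WORKTABLEDIRECTORY=s3a://my-staging-bucket/dev
DBFSMOUNTPOINT=/mnt/dev
WORKTABLECODEC=gzip
MAXWORKFILESIZE=134217728
AWSACCESSKEYID=AKIAXXXXXXXXXXXXXXXX
AWSACCESSKEY=****************************************
```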
Defining Databricks File System (DBFS) connections
Databricks File System (DBFS) connections connect to Databricks sources and targets as a remote
file connection. In the Databricks File System Connection dialog, define a connection to Databricks
File System (DBFS) as follows:
• At Name, type the name of the DBFS deployment configuration from the execution profile.
• At Current DMExpress Server, select the applicable DMExpress server running on the DBFS cluster.
• At Instance, select the URL of the Databricks endpoint to which DMX connects and sends
API requests.
• At Authentication, either:
1. At Token: Specify, type a TLS access token.
or
2. At Token: Use Repository, select the repository used to store tokens, either DMX Repository or CyberArk. At Token alias, type the alias for your TLS authentication token, or select Define to define a new token alias in a DMX Repository.
After you choose OK, you can define Source File connections and Target File connections for DBFS
sources or targets, respectively.
Connecting to CyberArk Enterprise Password Vault
DMX connects to CyberArk Enterprise Password Vault over a TLS-secured HTTPS connection and requires access to an up-to-date TLS client certificate. If the CyberArk server secures the DMX connection with a self-signed certificate, update <dmx_home>/CACertificates/cacert.pem with the public certificate at the same time you update or install the client certificate, where <dmx_home> is the DMX installation directory.
For DMX jobs run in a Hadoop cluster, update the client certificate and cacert.pem on the edge node
only. DMX distributes TLS configurations, keys, and certificates to the cluster nodes.
If a client certificate file is out-of-date, DMX issues an HTTPSCVF error when it attempts to connect
to the CyberArk server.
CyberArk Licenses
DMX can only connect to licensed CyberArk vaults. Check the CyberArk license status if you
encounter repeated failures to retrieve a CyberArk password.
Connecting to Protegrity Data Security Gateway
DMX connects to Protegrity Data Security Gateway by making REST API POST requests over HTTP.
The DMX Protect and Unprotect functions use Protegrity resources to protect and unprotect data sent to Protegrity. You must configure the Protegrity Gateway server to receive and process REST requests before DMX can use these functions. The API endpoint implementation determines the specific protection methods. Some details needed to set up protection are:
• All REST API calls use the POST method.
• Data is always sent as part of the HTTP message body.
• Data is always sent without any encoding change. The Protegrity server must return
protected data with the same encoding as the data input.
• DMX does not pass data with empty or NULL values to the Protegrity server.
Connecting to QlikView data eXchange files from QlikView or Qlik Sense
Qlik is the provider of QlikView and Qlik Sense business intelligence and visualization software applications. DMX supports QlikView data eXchange (QVX) files as targets. Through DMX, you define the QVX file and the QlikView data eXchange reformat layout.
QVX files can be used as data sources for QlikView or Qlik Sense.
QlikView desktop installation overview
To access QVX files as sources from within QlikView:
1. Install the QlikView desktop.
2. At the QlikView desktop:
a) Start QlikView Personal Edition.
b) At the File menu, select Open.
c) In the Open dialog, ensure that the file type is All Files (*.*) and browse to the
appropriate QVX file.
d) Select the QVX file and select Open.
e) At the File Wizard dialog, ensure that the File type is Qvx and select Finish.
f) At the Edit Script dialog, select Reload to execute the displayed script, which loads the
QVX data.
g) At the Fields tab of the Sheet Properties dialog, select the fields to display on the Main
QlikView sheet.
h) To save the data in the QlikView document, select Save.
Qlik Sense desktop installation overview
To access QVX files as sources from within Qlik Sense:
1. Install the Qlik Sense desktop.
For information on Qlik Sense, see Qlik Sense help.
2. At the Qlik Sense desktop:
a) Start the Qlik Sense desktop.
b) Select Create a New App.
c) In the Create new app dialog, enter the name of the application and select Create.
d) At the New app created dialog, select Open.
e) At the Qlik Sense desktop, select Quick data load.
f) At the Select file dialog, ensure that the file type is QlikView data exchange files (qvx),
browse to the appropriate QVX file, and select Select.
g) At the Select data from .qvx dialog, select the appropriate fields to load and select Load
data.
h) When the data loads successfully, a new data sheet is created.
i) To edit the data sheet, select Edit the sheet.
Connecting to Tableau Data Extract files from Tableau
Tableau is a business intelligence application that provides browser-based analytics. DMX supports Tableau Data Extract (TDE) files as targets. Through DMX, you define the TDE file and the Tableau Data Extract reformat layout.
TDE files can be used as data sources for Tableau.
Tableau desktop installation overview
To access TDE files as sources from within Tableau:
1. Install the Tableau desktop.
2. At the Tableau desktop:
a) Start Tableau Desktop.
b) Select Connect to Data.
c) In the File section of the Connect to Data page, select Tableau Data Extract.
d) At the Open dialog, browse to and select the Tableau Data Extract file.
e) At the Tableau Data Extract Connection dialog, enter the name of the data connection
for use in Tableau.
The data in the Tableau Data Extract file displays within the Tableau Desktop.
Removing DMX/DMX-h from Your System
Windows Systems
Perform the following steps to remove DMX from your system:
1. Ensure that the DMX Task Editor, DMX Job Editor, and DMX Server are closed and no
DMX jobs are running.
2. Go to Programs, DMExpress from the Start menu and select Uninstall DMX.
3. Alternatively, you can remove DMX as follows: Go to Settings, Control Panel from the Start
menu and double-click Add/Remove Programs. In the list of applications that can be
removed, select the entry for DMX. Click Add/Remove and confirm.
4. Delete folders if necessary. If you created any of your own files in the folder where you
installed DMX, these files are not removed by the uninstall program.
UNIX Systems
Perform the following steps to remove DMX from your system:
1. Ensure that no DMX jobs are running.
2. If you installed the DMX Run-time Service, you need to uninstall it first. Login as root and
run:
cd <dmx_home>
./install
When prompted, select to uninstall the service.
3. Remove the DMX directory:
cd <dmx_home>/..
rm -rf <dmx_home>
Remove any environment variable settings that you added to your profile, e.g. <dmx_home>/bin in
your PATH, after the DMX installation.
DMX-h in a Hadoop Cluster
The method for removing DMX-h from the nodes of a Hadoop cluster depends on how you originally installed DMX-h in the cluster. Follow the instructions in the appropriate section below.
Cloudera Manager Parcel Uninstall
Uninstall DMX-h on all nodes in the cluster as follows:
1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Click on the parcel indicator button in the Cloudera Manager Admin console navigation bar
to bring up the Parcels tab of the Hosts page.
4. In the currently activated dmexpress parcel, click on the Actions button and select
Deactivate to deactivate the parcel.
5. Once deactivated, click on the Actions button and select Remove From Hosts to remove the
parcel from the cluster nodes.
6. Once the parcel is removed from the cluster nodes, click on the Actions button and select
Delete to delete the parcel from the repository.
Apache Ambari Service Uninstall
Follow the instructions for RPM Uninstall, or:
1. Open the Ambari Web UI and navigate to “Hosts”
2. For each host, choose the “Installed” drop-down next to “Clients”
3. For both “DMX-h” and “DMX-h License” (if present), choose “UNINSTALL.”
Once uninstalled, either via the UI or using RPM, disable the uninstalled services:
1. Open the Ambari Web UI, and navigate to “Services”
2. For each of “DMX-h” and “DMX-h License” (if present), choose “Service Actions” -> “Delete
Service” and follow the prompts.
RPM Uninstall
Uninstall DMX-h on all nodes in the cluster as follows:
1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Run the following command with sudo or root privileges using the erase option, -e:
Software: rpm -e dmexpress
License: rpm -e dmexpresslicense-<license site ID>
e.g. rpm -e dmexpresslicense-12345
If you do not know your license site ID, run the following command to find the installed
license package name:
rpm -qa | grep dmexpresslicense-
You can also use an RPM wrapper such as yum instead:
yum erase dmexpress
yum erase dmexpresslicense-<license site ID>
Manual/Silent Uninstall
Uninstall DMX-h on the edge/ETL node and each remaining node in the cluster as follows:
1. Ensure that no DMX-h jobs are running.
2. Uninstall the DMX Run-time Service on any edge/cluster node where it is running.
3. Remove the DMX home directory on the edge node and all remaining nodes in the cluster:
cd <dmx_home>/..
rm -rf <dmx_home>
4. Remove any environment variable modifications made for DMX, such as the addition of
<dmx_home>/bin to your PATH.
Uninstall the DMX Run-time Service
When instructed to uninstall the DMX Run-time Service, run the install script in the DMX installation directory as root, and select the option to uninstall the DMX Run-time Service. For example:
cd /usr/local/DMExpress
./install
DMX installation component options
DMX installation component options include the following:
• Standard
The standard installation enables you to install the following components on one server:
o Development client, Job Editor and Task Editor
o DMX engine, dmxdfnl/dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
o DataFunnel Run-time Service, dmxrund
o See DMX DataFunnel run-time service installation and configuration.
• Full
The full installation enables you to install all DMX components on one server:
o Development client, DMX Job Editor and Task Editor
o DMX engine, dmxdfnl/dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
o DataFunnel Run-time Service, dmxrund
o See DMX DataFunnel run-time service installation and configuration.
o Management Service, which includes, dmxmgr, REST APIs, and the Connect Portal
user interface (UI)
See DMX Management Service installation and configuration.
• Classic
The classic installation enables you to install traditional DMX components on one server:
o Development client, Job Editor and Task Editor
o DMX engine, dmxjob/dmexpress
o Service for development client, which is the DMX Run-time Service, dmxd
• Custom
The custom installation enables you to install individual components on different servers:
o DMX engine
Installs the DMX engine, dmxdfnl/dmxjob/dmexpress.
o Service for development client
Installs the DMX Run-time Service, dmxd.
o DataFunnel Run-time Service
Installs the DataFunnel Run-time Service, dmxrund.
See DMX DataFunnel run-time service installation and configuration.
o Development client
Installs the Job Editor and Task Editor.
o Management Service
Installs the management service, dmxmgr, REST APIs, and the Connect Portal UI.
See DMX Management Service installation and configuration.
DMX Management Service installation and configuration
Installation
DMX Management Service executable
The DMX Management Service executable, DMXManager, is installed in the following directory:
Windows: <DMX_installation>\Programs
Linux: <DMX_installation>/bin
The DMX management service configuration file
The DMX management service configuration file, dmxmgr.properties, is installed in the following directory:
Windows: <DMX_installation>\Conf
Linux: <DMX_installation>/conf
Configuration
DMX management service configuration file
Many of the properties within dmxmgr.properties are populated with commented, preliminary default values. Consider each of the name-value pairs among the following properties within the file; uncomment and update to meet your system requirements:
• Server
• Secure socket layer (SSL)
• Authentication
• Central file repository
• Central database repository
• Logging
Configuration properties as environment variables
You can specify the configuration properties defined within dmxmgr.properties as environment variables by capitalizing the property name and replacing the period separator, ".", with an underscore. The configuration property name authentication.method could be specified as a Linux environment variable, for example, as follows: export AUTHENTICATION_METHOD=LDAP.
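The capitalize-and-replace rule above can be sketched with a small shell helper; the function name is illustrative and not part of DMX:

```shell
# Illustrative helper: convert a dmxmgr.properties key to its
# environment-variable form (replace "." with "_", then uppercase).
prop_to_env() {
  printf '%s\n' "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

prop_to_env authentication.method   # prints AUTHENTICATION_METHOD
prop_to_env server.port             # prints SERVER_PORT
```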
Server configuration properties
Server configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
Consider the following server configuration properties:
• server.address: The address of the embedded Apache Tomcat web server application. Required.
• server.port: The DMX management service port that is assigned to listen for client requests. Default: 8280. If the port number dedicated to listening for client requests is different from 8280, assign the appropriate value.
Secure socket layer configuration properties
Secure socket layer (SSL) configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
By default, the DMX management service disables SSL certification. To enable SSL certification, SSL configuration properties must be added to dmxmgr.properties.
Consider the following SSL configuration properties:
• security.require-ssl: Determines whether SSL certification is required. Values: False (default), True. For SSL certification to be enabled, the property value must be set to True.
• server.ssl.client-auth: Determines whether client authentication occurs during the SSL handshake. Value: want. For client authentication to occur during the SSL handshake, the property value must be set to want.
• server.ssl.key-alias: Alias of the SSL key. Required when SSL certification is enabled.
• server.ssl.key-password: Password of the SSL key. Required when SSL certification is enabled.
• server.ssl.key-store: Location of the DMX central management server keyStore. Required when SSL certification is enabled.
• server.ssl.key-store-password: Password of the DMX central management server keyStore. Required when SSL certification is enabled.
• server.ssl.trust-store: Location of the DMX central management server trustStore. Required when the DMX DataFunnel run-time service uses SSL.
• server.ssl.trust-store-password: Password of the DMX central management server trustStore. Required when the DMX DataFunnel run-time service uses SSL.
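Putting these together, an SSL section of dmxmgr.properties might look like the following sketch; all paths, aliases, and passwords shown are placeholder assumptions:

```properties
security.require-ssl=True
server.ssl.client-auth=want
server.ssl.key-alias=dmxmgr
server.ssl.key-password=changeit
server.ssl.key-store=/opt/dmexpress/conf/keystore.jks
server.ssl.key-store-password=changeit
server.ssl.trust-store=/opt/dmexpress/conf/truststore.jks
server.ssl.trust-store-password=changeit
```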
Authentication configuration properties
Authentication configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
Consider the following authentication configuration properties:
• authentication.method: The authentication method for authenticating users. Values: LDAP (default) and SIMPLE. If you skip the configuration setup during the installation, the installation process automatically assigns the value LDAP to the authentication.method property. When LDAP is the authentication method, you must provide LDAP authentication configuration properties.
• authentication.login.auto_create_users: Specifies whether new users should be created dynamically upon login. Values: true (default) and false. To successfully call REST APIs, valid user credentials on the authentication backend (for example, on the LDAP active directory) must also be registered with the DMX management service. When the property value is set to false, users are not automatically created and registered on the DMX management service even when they are registered on the LDAP active directory. Any attempt by an unregistered user to call the REST API layer of the DMX management service results in a call failure with status code 401/Unauthorized.
• authentication.login.default_role: The default user roles, which the DMX management service requires to operate, are automatically established as part of the DMX management service installation. Values: role_administrator and role_user (defaults). These roles are assigned dynamically as part of the initial login: the first user who successfully logs into the system is granted the user role role_administrator; any subsequent user who successfully logs into the system is granted the user role role_user. While not required, system administrators can create new, custom user roles and assign existing permissions to the new roles. Examples of possible custom user roles include the following: business user, operator, data scientist, data architect, solution engineer, developer.
• authentication.token.signature_secret: The signature secret used to Secure Hash Algorithm (SHA)-sign generated authentication tokens. If a signature secret value is not specified, a random secret is generated at DMX management service start-up. When generating the cryptographic signature of an authentication token, a portion of the authentication token segment is signed using a SHA message digest. A signature secret is applied to the message digest. The resulting signature value, which is applied to the authentication cookie, is encoded as a Base64 string.
• authentication.token.token_validity: The time in seconds for which a generated token is valid. Default: 36000 seconds, which is equivalent to 10 hours.
• authentication.token.cookie_domain: The domain attribute of the authentication token cookie. The cookie domain specifies to the browser that cookies should only be sent back to the DMX management service for the given domain. If the cookie domain is not specified, the cookie is sent back to the domain on the DMX management service from which the object was requested by default. For additional information, see HTTP Cookie Domain and Path.
• authentication.token.cookie_path: The path attribute of the authentication token cookie. The cookie path specifies to the browser that cookies should only be sent back to the DMX management service for the given path. If the cookie path is not specified, the cookie is sent back to the path on the DMX management service from which the object was requested by default. For additional information, see HTTP Cookie Domain and Path.
LDAP authentication configuration properties
• ldap.url: LDAP URL. Required when authentication.method is set to LDAP.
• ldap.active_directory.user_domain: LDAP active directory user domain. Required when authentication.method is set to LDAP.
• ldap.active_directory.root_domain: LDAP active directory root domain. Required when authentication.method is set to LDAP.
• ldap.search.managerDn: Distinguished name (DN) of the manager, which is the user that performs searches when the LDAP server does not support or has not enabled anonymous searches.
• ldap.search.managerPassword: Password of the manager that performs LDAP searches.
• ldap.search.userBaseDn: The search base DN for finding users.
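As a hedged sketch, an LDAP section of dmxmgr.properties might look like the following; the domain names, DNs, and password are placeholder assumptions:

```properties
authentication.method=LDAP
ldap.url=ldap://ldap.example.com:389
ldap.active_directory.user_domain=example.com
ldap.active_directory.root_domain=dc=example,dc=com
ldap.search.managerDn=cn=manager,dc=example,dc=com
ldap.search.managerPassword=changeit
ldap.search.userBaseDn=ou=users,dc=example,dc=com
```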
Central file repository configuration properties
Central file repository configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
The DMXDFNL root job and its job dependencies, which include subjobs, tasks, and operational metadata, are stored in the DMX central file repository. The DMX central file repository must be configured to reside on a local file system.
Consider the following file repository configuration properties:
Local central file repository configuration properties
• repository.url: Location of the local DMX central file repository. The default location of the central file repository is the home directory on your local client workstation.
History repository configuration properties
• history.repository.location: Location of the job execution history directory, which is relative to the DMX central file repository. Required. Beneath the top-level history directory, individual job run directories are created and organized by date:
~/.dmexpress/history/{YEAR}/{MONTH}/{DAY}/
Each job run directory contains files of the form:
<job_name>_<starttime>[_<job_number>]_log.{xml|txt}
<job_name>_<starttime>[_<job_number>].json
The job log is generated in XML or text format; the operational metadata log is generated in JSON format.
Central database repository configuration properties
Central database repository configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
The DMXDFNL job definition and runtime connection data are stored in the central database repository. The central database repository must be configured to reside on your local client workstation.
Consider the following database repository configuration properties:
• spring.datasource.url: Location of the local DMX central database repository. Required. The default location of the central database repository is beneath the home directory on your local client workstation: ~/.dmexpress/com.syncsort.dmxmgr/
• spring.datasource.username: Name of the database user with access to the database repository. Required.
• spring.datasource.password: Password value associated with the user with access to the database. Required.
• spring.datasource.driverClassName: Identifies the JDBC driver class name or Java class. Required.
Logging configuration properties
Logging configuration properties are defined through the name-value pairs specified in the DMX management service configuration file, dmxmgr.properties.
Consider the following logging configuration properties:
• logging.file: The relative or absolute path to and name of the DMX management service log file; for example: logging.file=${java.io.tmpdir:-/tmp}/dmxmgr.log
• logging.level.*: The level of logging detail written to the DMX management service log file that is defined in logging.file. ERROR, WARN, and INFO level messages are logged by default. Valid values include the following: ERROR, WARN, INFO, DEBUG, or TRACE.
DMX DataFunnel run-time service installation and configuration
Installation
DMX DataFunnel run-time service executable
The DMX DataFunnel run-time service executable, dmxrund, is installed in the following directory:
Windows: <DMX_installation>\Programs
Linux: <DMX_installation>/bin
DMX DataFunnel Run-time Service configuration file
The DMX DataFunnel Run-time Service configuration file, dmxrund.conf, is installed in the following directory:
Windows: <DMX_installation>\Conf
Linux: <DMX_installation>/conf
Linux only: DMX impersonation executable
The DMX impersonation executable, dmxexecutor.exe, is installed in the following Linux directory:
<DMX_installation>/bin
Linux only: DMX custom impersonation configuration file
The DMX custom impersonation configuration file, dmxexecutor.conf, is located in the following Linux directory:
<DMX_installation>/conf
Configuration
DMX DataFunnel Run-time Service configuration file
Many of the properties within dmxrund.conf are populated with commented, preliminary default values. Uncomment and update applicable properties to meet your system requirements.
DMX DataFunnel Run-time Service configuration properties
DMX DataFunnel Run-time Service configuration properties are defined through the name-value pairs specified in the DMX DataFunnel Run-time Service configuration file, dmxrund.conf.
Consider the following DataFunnel Run-time Service configuration properties:
SERVER_PORT
    The DMX execution service port that is assigned to listen for job execution requests from the DMX management service, dmxmgr.
    Values: 33636 (default)

DMEXPRESS_HOME
    The directory where DMX is installed.
    Values: Required

UNPACK_WORK_DIRECTORY
    The working directory where jobs are unpacked.
    Values: Required

SECURITY_ENABLED
    Determines whether Secure Sockets Layer (SSL) security is enabled. For SSL security to be enabled, the property value must be set to Y.
    Values: Y (default), N

SSL_SERVER_PRIVATE_KEY
    The path to the SSL server private key file, which is in PEM format.
    Values: Required when SSL security is enabled

SSL_SERVER_CERTIFICATE
    The path to the SSL server certificate public key file, which is in PEM format.
    Values: Required when SSL security is enabled

SSL_CLIENT_AUTHENTICATION_ENABLED
    Determines whether to authenticate the client.
    Values: Y (default), N

SSL_TRUSTED_CERTIFICATES
    The path to the trusted certificates file, which is in PEM format. This file can contain multiple client certificates in PEM format.
    Values: Required when SSL security is enabled
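As a concrete sketch, a dmxrund.conf fragment with SSL enabled might look like the following. The installation directory, working directory, and all file paths are placeholder assumptions; the port shown is the documented default, and the exact property syntax should be confirmed against the commented defaults shipped in dmxrund.conf.

```ini
# Illustrative dmxrund.conf fragment -- all paths are placeholders
SERVER_PORT=33636
DMEXPRESS_HOME=/opt/DMExpress
UNPACK_WORK_DIRECTORY=/var/tmp/dmx_unpack
SECURITY_ENABLED=Y
SSL_SERVER_PRIVATE_KEY=/opt/DMExpress/conf/server.key
SSL_SERVER_CERTIFICATE=/opt/DMExpress/conf/server.crt
SSL_CLIENT_AUTHENTICATION_ENABLED=Y
SSL_TRUSTED_CERTIFICATES=/opt/DMExpress/conf/trusted.pem
```

For testing, a self-signed key and certificate in PEM format can be generated with OpenSSL, for example: `openssl req -x509 -newkey rsa:2048 -nodes -keyout server.key -out server.crt -days 365`.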
Linux only: DMX custom impersonation configuration file

If dmxexecutor was established as the impersonated user during Linux pre-installation, you have the option of updating dmxexecutor.conf. Updating properties in dmxexecutor.conf is optional; update it only to customize the impersonation process.
DMX custom impersonation configuration properties

To customize impersonation, DMX custom impersonation configuration properties are defined through the name-value pairs specified in the DMX custom impersonation configuration file, dmxexecutor.conf.

Consider the following custom impersonation configuration properties:
SERVICE_GROUP
    The service group to which the service user belongs.
    Values: dmexpress (default)

MIN_USERID
    The minimum user identification (ID) number, or security access level, that is assigned for impersonation. If the user ID is less than this minimum value, the user is not impersonated and the job run aborts.
    Values: 500 (default)

BANNED_USERS
    A comma-separated list of users that dmxexecutor must not impersonate. All users not listed as banned qualify for impersonation.
    Upon receipt of a job submission request:
    • from a banned user, dmxexecutor rejects the job request, generates an error, and the job aborts.
    • from a user not listed as banned, dmxexecutor calls the DMX engine to run the job.

ALLOWED_USERS
    A comma-separated list of the only users that dmxexecutor may impersonate. All users not listed as allowed are disqualified from impersonation.
    Upon receipt of a job request:
    • from an allowed user, dmxexecutor calls the DMX engine to run the job.
    • from a user not listed as allowed, dmxexecutor rejects the job request, generates an error, and the job aborts.
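For illustration, a dmxexecutor.conf fragment combining these properties might look like the following. The user names are invented placeholders; BANNED_USERS suits a deny-list policy and ALLOWED_USERS an allow-list policy, so typically only one of the two would be set.

```ini
# Illustrative dmxexecutor.conf fragment -- user names are placeholders
SERVICE_GROUP=dmexpress
MIN_USERID=500
# Deny-list style: everyone except these users may be impersonated
BANNED_USERS=root,dbadmin
# Allow-list style: only these users may be impersonated
# ALLOWED_USERS=etluser1,etluser2
```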
Technical Support

If you have a maintenance support agreement for DMX and you encounter difficulties in installing or running DMX, contact Syncsort Incorporated.

In the United States (available 24 hours a day, 7 days a week):
Phone: 1-877-700-8270 or 201-930-8270
E-mail: [email protected]

In other countries:
Contact information can be found by country at https://mysupport.syncsort.com/.