
MapR Administrator Training

April 2012

Version 2.1.1

Quick Start
Installation
Administration
Development
Reference


All rights reserved.

The MapR logo is a registered trademark of MapR Technologies, Inc.

DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

MapR Technologies, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more U.S. patents or pending patent applications in the U.S. and in other countries.


Table of Contents

Home ...... 6
Start Here ...... 7
    Quick Start - Test Drive MapR on a Virtual Machine ...... 10
        Installing the MapR Virtual Machine ...... 11
        A Tour of the MapR Virtual Machine ...... 12
        Working with Snapshots, Mirrors, and Schedules ...... 16
        Getting Started with HBase ...... 19
        Getting Started with Hive ...... 21
        Getting Started with Pig ...... 23
Installation Guide ...... 24
    Requirements for Installation ...... 25
        ulimit ...... 33
    Planning the Deployment ...... 35
        Planning Cluster Hardware ...... 41
    Preparing Packages and Repositories ...... 42
    Installing MapR Software ...... 47
        Setting Up Hadoop Ecosystem Components ...... 51
            Flume ...... 52
            HBase ...... 53
            Hive ...... 59
            Mahout ...... 74
            MultiTool ...... 77
            Oozie ...... 78
            Pig ...... 80
    Bringing Up the Cluster and Applying a License ...... 82
    Configuring the Cluster ...... 84
        Central Configuration ...... 90
        Setting up the MapR Metrics Database ...... 91
        Working with Multiple Clusters ...... 93
    Setting Up the Client ...... 94
    Third Party Solutions ...... 101
        Datameer ...... 102
        Karmasphere ...... 103
        HParser ...... 104
    Troubleshooting Installation Issues ...... 105
Administration Guide ...... 106
    Monitoring ...... 107
        Alarms and Notifications ...... 108
        Centralized Logging ...... 109
        Monitoring Node Metrics ...... 110
        Service Metrics ...... 111
        Job Metrics ...... 113
        Third-Party Monitoring Tools ...... 114
            Ganglia ...... 115
            Nagios Integration ...... 116
    Managing Data with Volumes ...... 117
        Mirror Volumes ...... 121
        Schedules ...... 125
        Snapshots ...... 127
    Managing the Cluster ...... 129
        Balancers ...... 130
        Cluster Upgrade ...... 133
            Converting a Cluster from Root to Non-root User ...... 134
            Manual Upgrade ...... 135
            Rolling Upgrade ...... 152
        Disks ...... 155
            Working with a Logical Volume Manager ...... 157
            Setting Up Disks for MapR ...... 158
            Specifying Disks or Partitions for Use by MapR ...... 160
        Dial Home ...... 162
        Nodes ...... 163
            Adding Nodes to a Cluster ...... 168
            Managing Services on a Node ...... 169
            Node Topology ...... 170
            Isolating CLDB Nodes ...... 171
            Isolating ZooKeeper Nodes ...... 172
            Removing Roles ...... 173
        Services ...... 175
            Changing the User for MapR Services ...... 176
            Failover ...... 177
            TaskTracker Blacklisting ...... 180
            Assigning Services to Nodes for Best Performance ...... 181
        Startup and Shutdown ...... 182
        Uninstalling MapR ...... 184


    Users and Groups ...... 185
        Managing Permissions ...... 186
        Managing Quotas ...... 189
    Security ...... 191
        PAM Configuration ...... 192
        Secured TaskTracker ...... 194
        Subnet Whitelist ...... 195
    Placing Jobs on Specified Nodes ...... 196
    Setting Up MapR NFS ...... 198
    Disaster Recovery ...... 201
    Troubleshooting Cluster Administration ...... 202
        'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf ...... 203
        Out of Memory Troubleshooting ...... 204
    Setting up a MapR Cluster on Amazon Elastic MapReduce ...... 205
Development Guide ...... 209
    Working with MapReduce ...... 210
        Configuring MapReduce ...... 211
        Job Scheduling ...... 212
        Standalone Operation ...... 219
        Tuning Your MapR Install ...... 220
        Compiling Pipes Programs ...... 223
    Working with MapR-FS ...... 224
        Chunk Size ...... 228
        Compression ...... 229
    Working with Data ...... 231
        Accessing Data with NFS ...... 232
        Copying Data from Apache Hadoop ...... 236
        Data Protection ...... 238
    Provisioning Applications ...... 240
        Provisioning for Capacity ...... 243
        Provisioning for Performance ...... 244
    MapR Metrics and Job Performance ...... 245
    Troubleshooting Development Issues ...... 246
Migration Guide ...... 247
    Planning the Migration ...... 248
    Initial MapR Deployment ...... 249
    Component Migration ...... 250
    Application Migration ...... 252
    Data Migration ...... 253
    Node Migration ...... 255
Reference Guide ...... 256
    Release Notes ...... 257
        Version 2.1 Release Notes ...... 259
            Hadoop Compatibility in Version 2.1 ...... 261
            Version 2.1.1 Release Notes ...... 262
        Version 2.0 Release Notes ...... 263
            Hadoop Compatibility in Version 2.0 ...... 265
            Package Dependencies for MapR version 2.x ...... 266
            Packages and Dependencies for MapR Version 2.x ...... 269
            Version 2.0.1 Release Notes ...... 272
        Version 1.2 Release Notes ...... 275
            Version 1.2.10 Release Notes ...... 277
            Version 1.2.9 Release Notes ...... 279
            Version 1.2.7 Release Notes ...... 281
            Version 1.2.3 Release Notes ...... 282
            Version 1.2.2 Release Notes ...... 283
            Hadoop Compatibility in Version 1.2 ...... 284
        Version 1.1 Release Notes ...... 291
            Version 1.1.3 Release Notes ...... 293
            Version 1.1.2 Release Notes ...... 294
            Version 1.1.1 Release Notes ...... 296
            Hadoop Compatibility in Version 1.1 ...... 298
        Version 1.0 Release Notes ...... 304
            Hadoop Compatibility in Version 1.0 ...... 307
        Beta Release Notes ...... 313
        Alpha Release Notes ...... 316
    Packages and Dependencies for MapR Software ...... 317
    MapR Control System ...... 318
        Cluster Views ...... 320
        MapR-FS Views ...... 338
        NFS HA Views ...... 347
        Alarms Views ...... 349
        System Settings Views ...... 352
        Other Views ...... 358


            CLDB View ...... 359
            HBase View ...... 361
            JobTracker View ...... 365
            Nagios View ...... 372
            Terminal View ...... 375
    Hadoop Commands ...... 376
        hadoop archive ...... 378
        hadoop classpath ...... 379
        hadoop daemonlog ...... 380
        hadoop distcp ...... 382
        hadoop fs ...... 385
        hadoop jar ...... 388
        hadoop job ...... 389
        hadoop jobtracker ...... 391
        hadoop mfs ...... 392
        hadoop mradmin ...... 395
        hadoop pipes ...... 396
        hadoop queue ...... 397
        hadoop tasktracker ...... 398
        hadoop version ...... 402
        hadoop conf ...... 403
    API Reference ...... 404
        acl ...... 407
            acl edit ...... 408
            acl set ...... 409
            acl show ...... 411
        alarm ...... 413
            alarm clear ...... 414
            alarm clearall ...... 415
            alarm config load ...... 416
            alarm config save ...... 418
            alarm list ...... 419
            alarm names ...... 421
            alarm raise ...... 422
        config ...... 423
            config load ...... 426
            config save ...... 428
        dashboard ...... 429
            dashboard info ...... 430
        dialhome ...... 434
            dialhome ackdial ...... 435
            dialhome enable ...... 436
            dialhome lastdialed ...... 437
            dialhome metrics ...... 438
            dialhome status ...... 439
        disk ...... 440
            disk add ...... 441
            disk list ...... 443
            disk listall ...... 444
            disk remove ...... 445
        entity ...... 447
            entity info ...... 448
            entity list ...... 450
            entity modify ...... 452
        license ...... 453
            license add ...... 454
            license addcrl ...... 455
            license apps ...... 456
            license list ...... 457
            license listcrl ...... 458
            license remove ...... 459
            license showid ...... 460
        nagios ...... 461
            nagios generate ...... 462
        nfsmgmt ...... 465
            nfsmgmt refreshexports ...... 466
        node ...... 467
            add-to-cluster ...... 468
            node allow-into-cluster ...... 469
            node heatmap ...... 470
            node list ...... 472
            node listcldbs ...... 477
            node listcldbzks ...... 478
            node listzookeepers ...... 479
            node maintenance ...... 480


            node metrics ...... 481
            node move ...... 486
            node path ...... 487
            node remove ...... 488
            node services ...... 489
            node topo ...... 490
        schedule ...... 491
            schedule create ...... 493
            schedule list ...... 494
            schedule modify ...... 495
            schedule remove ...... 496
        service list ...... 497
        setloglevel ...... 498
            setloglevel cldb ...... 499
            setloglevel fileserver ...... 500
            setloglevel hbmaster ...... 501
            setloglevel hbregionserver ...... 502
            setloglevel jobtracker ...... 503
            setloglevel nfs ...... 504
            setloglevel tasktracker ...... 505
        trace ...... 506
            trace dump ...... 507
            trace info ...... 508
            trace print ...... 510
            trace reset ...... 511
            trace resize ...... 512
            trace setlevel ...... 513
            trace setmode ...... 514
        urls ...... 515
        virtualip ...... 516
            virtualip add ...... 517
            virtualip edit ...... 518
            virtualip list ...... 519
            virtualip move ...... 520
            virtualip remove ...... 521
        job ...... 522
            job changepriority ...... 523
            job kill ...... 524
            job linklogs ...... 525
            job table ...... 526
        volume ...... 531
            volume create ...... 532
            volume dump create ...... 535
            volume dump restore ...... 537
            volume fixmountpath ...... 539
            volume info ...... 540
            volume link create ...... 541
            volume link remove ...... 542
            volume list ...... 543
            volume mirror push ...... 547
            volume mirror start ...... 548
            volume mirror stop ...... 549
            volume modify ...... 550
            volume mount ...... 552
            volume move ...... 553
            volume remove ...... 554
            volume rename ...... 555
            volume showmounts ...... 556
            volume snapshot create ...... 557
            volume snapshot list ...... 558
            volume snapshot preserve ...... 560
            volume snapshot remove ...... 562
            volume unmount ...... 564
        Metrics API ...... 565
        task ...... 566
            task failattempt ...... 567
            task killattempt ...... 568
            task table ...... 569
        rlimit ...... 574
            rlimit get ...... 575
            rlimit set ...... 576
        userconfig ...... 577
            userconfig load ...... 578
        dump ...... 580
            dump balancerinfo ...... 581

Page 7: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

dump balancermetrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585 dump changeloglevel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589

dump cldbnodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590 dump containerinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592

dump replicationmanagerinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 dump replicationmanagerqueueinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598

dump rereplicationinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601 dump rolebalancerinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604

dump rolebalancermetrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605 dump volumeinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606

dump volumenodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609 dump zkinfo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610

Alarms Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621

configure.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 disksetup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625

mapr-support-collect.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627 pullcentralconfig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629 rollingupgrade.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630 Environment Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631

Configuration Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632 .dfs_attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633

cldb.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 core-site.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636 daemon.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639

disktab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 hadoop-metrics.properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641

mapr-clusters.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644 mapred-default.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645

mapred-site.xml . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 mfs.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668

taskcontroller.cfg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669 warden.conf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 670

zoo.cfg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674 Ports Used by MapR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675

Best Practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676 Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680

Home

Welcome to MapR! If you are not sure how to get started, here are a few places to find the information you are looking for:

Quick Start - Test Drive MapR on a Virtual Machine - Try out a single-node cluster that's ready to roll, right out of the box!
Installation Guide - Learn how to set up a production cluster, large or small
Development Guide - Read more about what you can do with a MapR cluster
Administration Guide - Learn how to configure and tune a MapR cluster for performance

Start Here

The MapR Distribution for Apache Hadoop is the easiest, most dependable, and fastest Hadoop distribution on the planet. It is the only Hadoop distribution that allows direct data input and output via MapR Direct Access NFS™ with realtime analytics, and the first to provide true High Availability (HA) at all levels. MapR introduces logical volumes to Hadoop. A volume is a way to group data and apply policy across an entire data set. MapR provides hardware status and control with the MapR Control System, a comprehensive UI including a Heatmap™ that displays the health of the entire cluster at a glance.

In this section, you can learn about MapR's unique features and how they provide the highest performing, lowest cost Hadoop available.

To learn more about MapR, including information about MapR partners, see the following sections:

MapR Provides Complete Hadoop Compatibility
Intuitive, Powerful Cluster Management with the MapR Control System
Reliability, Fault-Tolerance, and Data Recovery with MapR
High-Performance Hadoop Clusters with MapR Direct Shuffle
Get Started

MapR Provides Complete Hadoop Compatibility

MapR is a complete Hadoop distribution.

MapR provides the following packages:

Apache Hadoop 0.20.2
Cascading 2.0
Flume 1.2.0
HBase 0.92.1
HCatalog 0.4.0
Hive 0.9.0
Mahout 0.6 and 0.7
Oozie 3.1.0
Pig 0.10.0
Sqoop 1.4.1
Whirr 0.7.0

For more information, see the Version 2.0 Release Notes.

Intuitive, Powerful Cluster Management with the MapR Control System

The MapR Control System webapp provides powerful hardware insight down to the node level, as well as complete control of users, volumes, quotas, mirroring, and snapshots. Filterable alarms and notifications provide immediate warnings about hardware failures or other conditions that require attention, allowing a cluster administrator to detect and resolve problems quickly.

MapR lets you control data access and placement, so that multiple concurrent Hadoop jobs can safely share the cluster.

Provisioning resources is simple. You can easily create a volume for a project or department in a few clicks. MapR integrates with NIS and LDAP, making it easy to manage users and groups. The MapR Control System provides a flexible web-based user interface to cluster administration. From the MapR Control System, you can assign user or group quotas, limit the amount of data a user or group can write, or limit a volume's size.

Setting recovery time objective (RTO) and recovery point objective (RPO) points for a data set is a simple matter of scheduling snapshots and mirrors on a volume through the MapR Control System. You can set read and write permissions on volumes directly via NFS or using hadoop fs commands, and volumes provide administrative delegation through Access Control Lists (ACLs). Through the MapR Control System you can control who can mount, unmount, snapshot, or mirror a volume.

Because MapR is a complete Hadoop distribution, you can run your Hadoop jobs the way you always have.

Unrestricted Writes to the Cluster with MapR Direct Access NFS

The MapR NFS service lets you access data on a licensed MapR cluster via the NFS protocol. You mount a cluster through NFS on a variety of clients.

Clusters with the M3 license can run MapR NFS on one node, enabling you to mount your cluster as a standard POSIX-compliant filesystem. Once your cluster is mounted on NFS, you can use standard shell scripting to read and write live data in the cluster.

You can run multiple NFS server nodes by upgrading to the M5 license level. You can use virtual IP addresses (VIPs) to provide transparent NFS failover with multiple NFS servers. You can also have each node in your cluster self-mount to NFS to make all of your cluster's data available from every node. These NFS self-mounts enable you to run standard shell scripts to work with the cluster's Hadoop data directly.

Data Protection, Availability, and Performance with Volume Management

With volumes, you can control access to data, set replication factor, and place specific data sets on specific racks or nodes for performance or data protection. Volumes control data access to specific users or groups with Linux-style permissions that integrate with existing LDAP and NIS directories. Use volume quotas to prevent data overruns from consuming excessive storage capacity.

One of the most powerful aspects of the volume concept is the ways in which a volume provides data protection:

To enable point-in-time recovery and easy backups, volumes have manual and policy-based snapshot capability.
For true business continuity, you can manually or automatically mirror volumes and synchronize them between clusters or datacenters to enable easy disaster recovery.
You can set volume read/write permission and delegate administrative functions to control data access.

You can export volumes with MapR Direct Access NFS with HA, allowing data read and write operations directly to Hadoop without the need for temporary storage or log collection. Multiple NFS nodes provide the same view of the cluster regardless of where the client connects.

Realtime Hadoop Analytics: Intuitive and Powerful Performance Metrics

New in the 2.0 release, the MapR Job Metrics service provides in-depth access to the performance statistics of your cluster and the jobs that run on it. With MapR Job Metrics, you can examine trends in resource use, diagnose unusual node behavior, or examine how changes in your job configuration affect the job's execution.

The MapR Node Metrics service, also new in the 2.0 release, provides detailed information on the activity and resource usage of specific nodes within your cluster.

Critical MapR services collect information on cluster resource utilization and activity that you can write directly to a file or integrate into the third-party Ganglia tool.

Expand Your Capabilities with Third-Party Solutions

MapR has partnered with Datameer, which provides a self-service Business Intelligence platform that runs best on the MapR Distribution for Apache Hadoop. Your download of MapR includes a 30-day trial version of Datameer Analytics Solution (DAS), which provides spreadsheet-style analytics, ETL and data visualization capabilities.

For More Information

Read about Provisioning Applications
Learn about Direct Access NFS
Check out Datameer

Reliability, Fault-Tolerance, and Data Recovery with MapR

With clusters growing to thousands of nodes, hardware failures are inevitable even with the most reliable machines in place. The MapR Distribution for Hadoop has been designed from the ground up to seamlessly tolerate hardware failure.

MapR is the first Hadoop distribution to provide true high availability (HA) and failover at all levels, including a MapR Distributed HA NameNode™. If a disk or node in the cluster fails, MapR automatically restarts any affected processes on another node without requiring administrative intervention. The HA JobTracker ensures that any tasks interrupted by a node or disk failure are re-started on another TaskTracker node. In the event of any failure, the job's completed task state is preserved and no tasks are lost. For additional data reliability, every bit of data on the wire is compressed and CRC-checked.

For more information:

Take a look at the Heatmap
Learn about Volumes, Snapshots, and Mirroring
Explore Data Protection scenarios
Read about Job Metrics and Node Metrics

High-Performance Hadoop Clusters with MapR Direct Shuffle

The MapR distribution for Hadoop achieves up to three times the performance of any other Hadoop distribution, and can reduce your equipment costs by half.

MapR Direct Shuffle uses the Distributed NameNode to drastically improve Reduce-phase performance. Unlike Hadoop distributions that use the local filesystem for shuffle and HTTP to transport shuffle data, MapR shuffle data is readable directly from anywhere on the network. MapR stores data with Lockless Storage Services™, a sharded system that eliminates contention and overhead from data transport and retrieval. Automatic, transparent client-side compression reduces network overhead and reduces footprint on disk, while direct block device I/O provides throughput at hardware speed with no additional overhead. As an additional performance boost, with MapR Realtime Hadoop, you can read files while they are still being written.

MapR gives you ways to tune the performance of your cluster. Using mirrors, you can load-balance reads on highly-accessed data to alleviate bottlenecks and improve read bandwidth to multiple users. You can run MapR Direct Access NFS on many nodes – all nodes in the cluster, if desired – and load-balance reads and writes across the entire cluster. Volume topology helps you further tune performance by allowing you to place resource-intensive Hadoop jobs and high-activity data on the fastest machines in the cluster.

For more information:

Read about Tuning Your MapR Install
Read about Provisioning for Performance

Get Started

Now that you know a bit about how the features of MapR Distribution for Apache Hadoop work, take a quick tour to see for yourself how they can work for you:

Quick Start - Test Drive MapR on a Virtual Machine - Try out a single-node cluster that's ready to roll, right out of the box!
Installation Guide - Learn how to set up a production cluster, large or small
Development Guide - Read more about what you can do with a MapR cluster
Administration Guide - Learn how to configure and tune a MapR cluster for performance

Quick Start - Test Drive MapR on a Virtual Machine

The MapR Virtual Machine is a fully-functional single-node Hadoop cluster capable of running MapReduce programs and working with applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware Player.

The MapR Virtual Machine desktop contains the following icons:

MapR Control System - navigates to the graphical control system for managing the cluster
MapR User Guide - navigates to the MapR online documentation
MapR NFS - navigates to the NFS-mounted cluster storage layer

Ready for a tour? The following documents will help you get started:

Installing the MapR Virtual Machine
A Tour of the MapR Virtual Machine
Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

Installing the MapR Virtual Machine

The MapR Virtual Machine runs on VMware Player, a free desktop application that lets you run a virtual machine on a Windows or Linux PC. You can download VMware Player from the VMware web site. To install the VMware Player, see the VMware documentation.

For Linux and Windows, download the free VMware Player
For Mac, purchase VMware Fusion

Use of VMware Player is subject to the VMware Player end user license terms, and VMware provides no support for VMware Player. For self-help resources, see the VMware Player FAQ.

Requirements

The MapR Virtual Machine requires at least 20 GB free hard disk space and 2 GB of RAM on the host system. You will see higher performance with more RAM and more free hard disk space.

To run the MapR Virtual Machine, the host system must have one of the following 64-bit x86 architectures:

A 1.3 GHz or faster AMD CPU with segment-limit support in long mode
A 1.3 GHz or faster Intel CPU with VT-x support

If you have an Intel CPU with VT-x support, you must verify that VT-x support is enabled in the host system BIOS. The BIOS settings that must be enabled for VT-x support vary depending on the system vendor. See the VMware knowledge base article at http://kb.vmware.com/kb/1003944 for information about how to determine if VT-x support is enabled.

Installing and Running the MapR Virtual Machine

1. Choose whether to install the M3 Edition or the M5 Edition, and download the corresponding archive file:
   M3 Edition - http://package.mapr.com/releases/v2.0.0/vmdemo/MapR-VM-2.0.0.15153GA-1-m3.tbz2
   M5 Edition - http://package.mapr.com/releases/v2.0.0/vmdemo/MapR-VM-2.0.0.15153GA-1-m5.tbz2

2. On a UNIX system, use the tar command to extract the archive to your home directory or another directory of your choosing:

   tar -xvf MapR-VM-<version>.tbz2

3. On a Windows system, use a decompression utility such as 7-zip to extract the archive.

4. Run the VMware Player.

5. Click Open a Virtual Machine, navigate to the directory into which you extracted the archive, then open the MapR-VM.vmx virtual machine.

   Tip for VMware Fusion: If you are running VMware Fusion, make sure to select Open or Open and Run instead of creating a new virtual machine.

6. To log on to the MapR Control System, use the username mapr and the password mapr (all lowercase).

Once the virtual machine is fully started, you can proceed with the tour.

A Tour of the MapR Virtual Machine

In this tutorial, you'll get familiar with the MapR Control System dashboard, learn how to get data into the cluster (and organized), and run some MapReduce jobs on Hadoop. You can read the following sections in order or browse them as you explore on your own:

The Dashboard
Working with Volumes
Exploring NFS
Running a MapReduce Job

Once you feel comfortable working with the MapR Virtual Machine, you can move on to more advanced topics:

Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

The Dashboard

The dashboard, the main screen in the MapR Control System, shows the health of the cluster at a glance. To get to the dashboard, click the MapR Control System link on the desktop of the MapR Virtual Machine and log on with the username root and the password mapr. If it is your first time using the MapR Control System, you will need to accept the terms of the license agreement to proceed.

Parts of the dashboard:

To the left, the navigation pane lets you navigate to other views that display more detailed information about nodes in the cluster, volumes in the MapR Storage Services layer, NFS settings, Alarms Views, and System Settings Views.
In the center, the main dashboard view displays the nodes in a "heat map" that uses color to indicate node health--since there is only one node in the MapR Virtual Machine cluster, there is a single green square.
To the right, information about cluster usage is displayed.

Try clicking the Health button at the top right of the heat map. You will see different kinds of information that can be displayed in the heat map.
Try clicking the green square representing the node. You will see more detailed information about the status of the node.

By the way, the browser is pre-configured with the following bookmarks, which you will find useful as you gain experience with Hadoop, MapReduce, and the MapR Control System:

MapR Control System
JobTracker Status
TaskTracker Status
HBase Master
CLDB Status

Don't worry if you aren't sure what those are yet.

Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. Volumes provide the following features, which are critical to efficient usage of the cluster as cluster size grows:

Applying user permissions and ownership at the volume level
Providing mirrors and snapshots of volume content
Specifying data replication properties and the location of data on specific nodes/racks
Allowing a volume of data to be mounted or unmounted like any other device

As your cluster grows, you will work more and more with volumes to provision for efficient, high-availability access to data. For more details, see Managing Data with Volumes.

In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.
2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Notice that the mount path and the volume name do not have to match. The volume name is a permanent identifier for the volume, and the mount path determines the file path by which the volume is accessed. The topology determines the racks available for storing the volume and its replicas.
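If you prefer the command line, you can create the same volume with the maprcli tool from a terminal. The following is a minimal sketch; see the volume create command reference for the full set of options:

maprcli volume create -name MyVolume -path /myvolume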

In the next step, you'll see how to get data into the cluster with NFS.

Exploring NFS

With MapR, you can mount the cluster via NFS, and browse it as if it were a filesystem. Try clicking the MapR NFS icon on the MapR Virtual Machine desktop.

When you navigate to mapr > my.cluster.com, you can see the myvolume volume that you created in the previous example.

Try copying some files to the volume; a good place to start is the files constitution.txt and sample-table.txt, which are attached to this page. Both are text files, which will be useful when running the Word Count example later.

To download them, select Attachments from the Tools menu to the top right of this document (the one you are reading now) and then click the links for those two files.
Once they are downloaded, you can add them to the cluster.
Since you'll be using them as input to MapReduce jobs in a few minutes, create a directory called in inside the myvolume volume and drag the files there. If you do not have a volume mounted at myvolume on the cluster, use the instructions in Working with Volumes above to create it.

By the way, if you want to verify that you are really copying the files into the Hadoop cluster, you can open a terminal on the MapR Virtual Machine (select Applications > Accessories > Terminal) and type hadoop fs -ls /myvolume/in to see that the files are there.
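Because the cluster is mounted over NFS, ordinary shell commands work too. The following is a sketch that assumes the cluster is NFS-mounted at /mapr/my.cluster.com; the exact path on your virtual machine may differ:

# Create the input directory and copy the sample files over NFS
mkdir -p /mapr/my.cluster.com/myvolume/in
cp ~/constitution.txt ~/sample-table.txt /mapr/my.cluster.com/myvolume/in/

# Verify from the Hadoop side that the files arrived
hadoop fs -ls /myvolume/in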

The Terminal

When you run MapReduce jobs, and when you use Hive, Pig, or HBase, you'll be working with the Linux terminal. Open a terminal window by selecting Applications > Accessories > Terminal.

Running a MapReduce Job

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files (like the ones you copied to the cluster in the previous section). The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise you will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Try MapReduce

1. On the MapR Virtual Machine, open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster. If you are not sure how, see the previous section. Create the directory /myvolume/in and put the files there.
3. Type the following line to run the Word Count job:

   hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
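To see the word counts from the terminal, you can print the first lines of the output file (assuming the job wrote its results to /myvolume/out as above):

hadoop fs -cat /myvolume/out/part-r-00000 | head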

That's it! If you're ready, you can try some more advanced exercises:

Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

Working with Snapshots, Mirrors, and Schedules

Snapshots, mirrors, and schedules help you protect your data from user error, make backup copies, and in larger clusters provide load balancing for highly-accessed data. These features are available under the M5 license.

If you are working with an M5 virtual machine, you can use this section to get acquainted with snapshots, mirrors, and schedules.
If you are working with the M3 virtual machine, you should proceed to the sections about Getting Started with Hive, Getting Started with Pig, and Getting Started with HBase.

Taking Snapshots

A snapshot is a point-in-time image of a volume that protects data against user error. Although other strategies such as replication and mirroring provide good protection, they cannot protect against accidental file deletion or corruption. You can create a snapshot of a volume manually before embarking on risky jobs or operations, or set a snapshot schedule on the volume to ensure that you can always roll back to specific points in time.

Try creating a snapshot manually:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the MyVolume volume (which you created during the previous tutorial).
3. Expand the MapR Virtual Machine window or scroll the browser to the right until the New Snapshot button is visible.
4. Click New Snapshot to display the Snapshot Name dialog.
5. Type a name for the new snapshot in the Name field.
6. Click OK to create the snapshot.
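The same operation is also available from a terminal via the maprcli tool (a minimal sketch; the snapshot name here is arbitrary):

maprcli volume snapshot create -volume MyVolume -snapshotname MySnapshot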

Try scheduling snapshots:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the MyVolume volume name (which you created during the previous tutorial), or by selecting the checkbox beside MyVolume and clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.

Viewing Snapshot Contents

All the snapshots of a volume are available in a directory called .snapshot at the volume's top level. For example, the snapshots of the volume MyVolume, which is mounted at /myvolume, are available in the /myvolume/.snapshot directory. You can view the snapshots using the hadoop fs -ls command or via NFS. If you list the contents of the top-level directory in the volume, you will not see .snapshot — but it's there.

To view the snapshots for /myvolume on the command line, type hadoop fs -ls /myvolume/.snapshot
To view the snapshots for /myvolume in the file browser via NFS, navigate to /myvolume and use CTRL-L to specify an explicit path, then add .snapshot to the end.

Creating Mirrors

A mirror is a full read-only copy of a volume, which you can use for backups, data transfer to another cluster, or load balancing. A mirror is itself a type of volume; after you create a mirror volume, you can sync it with its source volume manually or set a schedule for automatic sync.

Try creating a mirror volume:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Select the Local Mirror Volume radio button at the top of the dialog.
4. Type my-mirror in the Mirror Name field.
5. Type MyVolume in the Source Volume Name field.
6. Type /my-mirror in the Mount Path field.
7. To schedule mirror sync, select a schedule from the Mirror Update Schedule dropdown menu.
8. Click OK to create the volume.

You can also sync a mirror manually; it works just like taking a manual snapshot. View the list of volumes, select the checkbox next to a mirror volume, and click Start Mirroring.
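From a terminal, a manual sync can likewise be started with the maprcli tool (a sketch, assuming the mirror volume created above):

maprcli volume mirror start -name my-mirror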

Working with Schedules

The MapR Virtual Machine comes pre-loaded with a few schedules, but you can create your own as well. Once you have created a schedule, you can use it for snapshots and mirrors on any volume. Each schedule contains one or more rules that determine when to trigger a snapshot or a mirror sync, and how long to keep snapshot data resulting from the rule.

Try creating a schedule:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type My Schedule in the Schedule Name field.
4. Define a schedule rule in the Schedule Rules section:
   a. From the first dropdown menu, select Every 5 min.
   b. Use the Retain For field to specify how long the data is to be preserved. Type 1 in the box, and select hour(s) from the dropdown menu.
5. Click Save Schedule to create the schedule.

You can use the schedule "My Schedule" to perform a snapshot or mirror operation automatically every 5 minutes. If you use "My Schedule" to automate snapshots, they will be preserved for one hour (you will have 12 snapshots of the volume, on average).

Next Steps

If you haven't already, try the following tutorials:

Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

Getting Started with HBase

HBase is the Hadoop database, designed to provide random, realtime read/write access to very large tables — billions of rows and millions of columns — on clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. (For more information about HBase, see the HBase project page.)

We'll be working with HBase from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.

HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.

Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns you need for your data on a per-row basis.

Create a table in HBase:

1. Start the HBase shell by typing the following command:

   /opt/mapr/hbase/hbase-0.90.4/bin/hbase shell

2. Create a table called weblog with one column family named stats:

   create 'weblog', 'stats'

3. Verify the table creation by listing everything:

   list

4. Add a test value to the daily column in the stats column family for row 1:

   put 'weblog', 'row1', 'stats:daily', 'test-daily-value'

5. Add a test value to the weekly column in the stats column family for row 1:

   put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'

6. Add a test value to the weekly column in the stats column family for row 2:

   put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'

7. Type scan 'weblog' to display the contents of the table. Sample output:

ROW          COLUMN+CELL
 row1        column=stats:daily, timestamp=1321296699190, value=test-daily-value
 row1        column=stats:weekly, timestamp=1321296715892, value=test-weekly-value
 row2        column=stats:weekly, timestamp=1321296787444, value=test-weekly-value
2 row(s) in 0.0440 seconds

8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:

   COLUMN          CELL
    stats:daily    timestamp=1321296699190, value=test-daily-value
    stats:weekly   timestamp=1321296715892, value=test-weekly-value
   2 row(s) in 0.0330 seconds

9. Type disable 'weblog' to disable the table.
10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.

Getting Started with Hive

Hive is a data warehouse system for Hadoop that uses a SQL-like language called HiveQL to query structured data stored in a distributed filesystem. (For more information about Hive, see the Apache Hive project page.)

You'll be working with Hive from the Linux shell. To use Hive, open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, you'll create a Hive table, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click on sample-table.txt, select Save Link As... from the pop-up menu, select a directory to save to, then click OK. If you're working on the MapR Virtual Machine, we'll be loading the file from the MapR Virtual Machine's local file system (not the cluster storage layer), so save the file in the MapR Home directory (for example, /home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal:

1. Make sure you are in the Home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output.

mapr@mapr-desktop:~$ cat sample-table.txt
1320352532	1001	http://www.mapr.com/doc	http://www.mapr.com	192.168.10.1
1320352533	1002	http://www.mapr.com	http://www.example.com	192.168.10.10
1320352546	1001	http://www.mapr.com	http://www.mapr.com/doc	192.168.10.1

Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file represents a web log.

Create a table in Hive and load the source data:

1. Type the following command to start the Hive shell, using tab-completion to expand the <version>:

   /opt/mapr/hive/hive-<version>/bin/hive

2. At the hive> prompt, type the following command to create the table:

   CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Type the following command to load the data from sample-table.txt into the table:

   LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;
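If you want to confirm the table definition before querying, standard HiveQL works here as well; for example, at the hive> prompt:

DESCRIBE web_log;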

Run basic queries against the table:

1. Try the simplest query, one that displays all the data in the table:

SELECT web_log.* FROM web_log;

This query would be inadvisable with a large table, but with the small sample table it returns very quickly.

2. Try a simple SELECT to extract only data that matches a desired string:

SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

This query launches a MapReduce job to filter the data.
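If you want to experiment further, a simple aggregation also launches a MapReduce job. For example, this query (an illustration, not part of the original tutorial) counts page views per user:

SELECT web_log.userid, COUNT(*) FROM web_log GROUP BY web_log.userid;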

Getting Started with Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig project page.)

You'll be working with Pig from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, we'll use Pig to run a MapReduce job that counts the words in the file /myvolume/in/constitution.txt on the cluster, and store the results in /myvolume/wordcount.

1. First, make sure you have downloaded the constitution.txt file: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click constitution.txt to save it.
2. Make sure the file is loaded onto the cluster, in the directory /myvolume/in. If you are not sure how, look at the NFS tutorial on A Tour of the MapR Virtual Machine.

Open a Pig shell and get started:

1. In the terminal, type the pig command to start the Pig shell.
2. At the grunt> prompt, type the following lines (press ENTER after each):

A = LOAD '/myvolume/in' USING TextLoader() AS (words:chararray);

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;

D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO '/myvolume/wordcount';

After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

3. When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the directory /myvolume/wordcount to see the results.
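For example, you can list and print the results from the terminal (the exact part-file name may vary from job to job):

hadoop fs -ls /myvolume/wordcount
hadoop fs -cat /myvolume/wordcount/part-r-00000 | head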

Installation Guide

This guide provides instructions on how to install MapR software, including details on system requirements, planning for the deployment, installing packages, configuring, and launching a cluster.

The process of designing and configuring a cluster from the ground up involves the following steps.

1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to deploy on which nodes in the cluster.
3. PREPARE package files for installation, either relying on MapR's repository or locating packages on a local network.
4. INSTALL the MapR software.
   a. On each node, INSTALL the planned MapR services.
   b. On all nodes, RUN configure.sh.
   c. On all nodes, FORMAT disks for use by MapR.
5. BRING UP the cluster and apply a license.
6. CONFIGURE the cluster.
   a. SET UP the administrative user.
   b. SET UP MapR Metrics.
   c. CHECK that the correct services are running.
   d. SET UP node topology.
   e. SET UP initial volume structure.
   f. SET UP NFS for high availability (HA). (M5 Edition only)
   g. SET UP authentication.
   h. CONFIGURE cluster email settings.
   i. CONFIGURE permissions.
   j. SET user quotas.
   k. CONFIGURE alarm notifications.
   l. ISOLATE the CLDB service on dedicated nodes for large clusters. (optional)

Requirements for Installation

Before setting up a MapR cluster, ensure that every node satisfies the following hardware and software requirements, and consider which MapR license provides the features you need.

Node Hardware — At least 4 GB of RAM, preferably 32 GB or more
Directories on Operating System Partition — At least 10 GB of free space on the operating system partition, 10% more swap space than physical RAM with a minimum swap space of 24 GB, 10 GB available in the /tmp directory, and 128 GB in the /opt directory
Storage — At least three unmounted physical drives or partitions per node for use by MapR storage. In a production environment, each node is likely to have at least 12 physical disks. Do not use RAID or Logical Volume Management on any disks or partitions for use by MapR.
Operating System and Software
  One of the following Linux distributions:
    64-bit CentOS 5.4 or greater
    64-bit Red Hat 5.4 or greater
    64-bit SUSE Linux Enterprise Server 11.x
    64-bit Ubuntu 9.04 or greater
  One of the following versions of Java:
    Sun Java JDK 1.6 or 1.7
    OpenJDK 1.6
    OpenJDK 1.7 (Ubuntu only)
  MapR Metrics requires a MySQL database accessible from the cluster.
Configuration — Each node must have a unique, resolvable hostname, and the following configuration:
  Environment Variables — Make sure JAVA_HOME points to the correct version of Java. If desired, set MAPR_SUBNETS to limit MapR network traffic to certain subnets.
  NTP — To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node.
  Hostname Resolution — Each node must be able to perform forward and reverse hostname resolution with every other node in the cluster.
  Users and Groups — Add a MapR user (the user under which MapR services will run) with matching name, UID and GID on all nodes. MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a large cluster, you should consider configuring all nodes to use LDAP or another user management system.
  Keyless SSH — Set up keyless (passwordless) SSH access to all nodes in the cluster for the MapR user on the installing node and webserver nodes.
  Network Ports — Make sure all ports used by MapR are open.
  ulimit — On each node, the value for ulimit should be set to 64000.
Licensing — Before installing MapR, consider the capabilities you will need and make sure you have obtained the corresponding license.
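Several of these prerequisites can be spot-checked from a shell on each node. The following is a minimal sketch; service names and paths vary by distribution:

# Confirm the default Java version and JAVA_HOME
java -version
echo $JAVA_HOME

# Confirm forward and reverse hostname resolution
hostname -f
getent hosts `hostname -f`

# Confirm the open-file limit (should report 64000)
ulimit -n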

If you are setting up a large cluster, it is a good idea to use a configuration management tool such as Puppet or Chef, or a parallel ssh tool, to facilitate the installation of MapR packages across all the nodes in the cluster. The following sections provide details about the prerequisites for setting up the cluster.

Node Hardware and Cluster Architecture

Minimum Requirements
  64-bit processor
  4 GB DRAM
  1 network interface
  At least one free unmounted drive or partition, 100 GB or more
  At least 10 GB of free space on the operating system partition
  24 GB or 10% more swap space than RAM, whichever is greater (if this is not possible, see Memory Overcommit)

Recommended
  64-bit processor with 8-12 cores
  32 GB DRAM or more
  2 GigE network interfaces
  3-12 disks of 1-3 TB each
  At least 20 GB of free space on the operating system partition
  32 GB swap space or more (see also Memory Overcommit)

In practice, it is useful to have 12 or more disks per node, not only for greater total storage but also to provide a larger number of available storage pools. If you anticipate a lot of big reduces, you will need additional network bandwidth in relation to disk I/O speeds. MapR can detect multiple NICs with multiple IP addresses on each node and manage network throughput accordingly to maximize bandwidth. In general, the more network bandwidth you can provide, the faster jobs will run on the cluster. When designing a cluster for heavy CPU workloads, the processor on each node is more important than networking bandwidth and available disk space.

When you plan the hardware architecture for your cluster, consider the following elements:

Data storage needs.
Network bandwidth needs, including intermediate data needed during MapReduce job execution.
Workload type: CPU intensive, I/O intensive, or memory-intensive.
How data is moved to and from the cluster, including how much of this data is transmitted over the network.

Standby nodes to handle failover for critical services such as CLDB, ZooKeeper, JobTracker, or the HBase Master.

For most use cases, network bandwidth and disk I/O are more common limiting factors than CPU capacity. Balance your network and disk transfer rates to meet your expected data rates where possible, using multiple network interface controllers (NICs) per node. MapR nodes use multiple NICs transparently, making it unnecessary to bond or trunk your NICs together.

Because the MapR software handles disk formatting and data protection on its own, the disks and partitions your node provides must be raw, without any RAID or logical volume management.

Example Architecture

The following example architecture provides specifications for a standard compute/storage node for general purposes, and two sample rack configurations made up of the standard nodes. MapR is able to make effective use of more drives per node than standard Hadoop, so each node should present enough face plate area to allow a large number of drives. The standard node specification allows for either 2 or 4 1Gb/s ethernet network interfaces.

Compute/Storage Node

2U Chassis
Single motherboard, dual socket
2 x 4-core + 32 GB RAM or 2 x 6-core + 48 GB RAM
12 x 2 TB 7200-RPM drives
2 or 4 network interfaces (on-board NIC + additional NIC)
OS on single partition on one drive (remainder of drive used for storage)

50TB Rack Configuration

10 compute/storage nodes (10 x 12 x 2 TB storage; 3x replication, 25% margin)
24-port 1 Gb/s rack-top switch with 2 x 10Gb/s uplink
Add second switch if each node uses 4 network interfaces

100TB Rack Configuration

20 compute/storage nodes (20 x 12 x 2 TB storage; 3x replication, 25% margin)
48-port 1 Gb/s rack-top switch with 4 x 10Gb/s uplink
Add second switch if each node uses 4 network interfaces

To grow the cluster, just add more nodes and racks, adding additional service instances as needed. MapR rebalances the cluster automatically.

Directories on Operating System Partition

Follow these guidelines for disk space allocated to the following directories on each node:

/tmp: 10 GB
/opt: 128 GB
/opt/mapr/zkdata: A separate 500MB partition

Storage

Set up at least three unmounted drives or partitions, separate from the operating system drives or partitions, for use by MapR-FS. You must ensure that available disks can be used by MapR, and after MapR installation the disks must be formatted for use. See Setting Up Disks for MapR for details.

It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O, you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each.

You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID 0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites:
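For example, the following commands format three raw disks into storage pools with a stripe width of 3 (a sketch; the device names and the disk list file /tmp/disks.txt are illustrative assumptions):

# List the raw disks to give to MapR-FS, one or more per line
cat > /tmp/disks.txt <<EOF
/dev/sdb
/dev/sdc
/dev/sdd
EOF

# Format the disks, grouping up to 3 disks per storage pool
/opt/mapr/server/disksetup -W 3 -F /tmp/disks.txt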

CentOS
Red Hat
Ubuntu
SuSE

When you set up a node to use RAID, use the disksetup -W 1 command to set up storage pools that consist of exactly one logical storage unit each.


If you do not have dedicated disks or partitions available for MapR, you can use a flat file instead, for evaluation purposes only. Using a flat file does not provide high performance or data protection. Do not use a flat file for storage on a production cluster. See Setting Up Disks for MapR for details.
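As a sketch for an evaluation-only setup (the file path and 20 GB size are illustrative, not MapR recommendations):

# create a 20 GB flat file to stand in for a raw disk
dd if=/dev/zero of=/root/storagefile bs=1G count=20

# list the flat file in the disks file that will be passed to disksetup
echo /root/storagefile > /tmp/disks.txt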

Operating System and Software

Install a compatible 64-bit operating system on all nodes. MapR currently supports the following operating systems:

64-bit CentOS 5.4 or greater
64-bit Red Hat 5.4 or greater
64-bit Ubuntu 9.04 or greater
64-bit SUSE Linux Enterprise Server 11.x

Each node must also have one of the following versions of Java installed:

Sun Java JDK 1.6 or 1.7
OpenJDK 1.6
OpenJDK 1.7 (Ubuntu only)

If Java is already installed, check which versions of Java are installed:

java -version

Use update-alternatives to make sure the correct Java version is being used as the default Java:

sudo update-alternatives --config java

To use MapR Metrics functionality, the cluster must meet the following requirements:

MySQL Server — You must provide access to a MySQL server. The MapR Metrics service uses the MySQL server to store analytics data about jobs and tasks in the cluster. The MySQL server must be on the same network as the cluster.
EPEL Repository — Extra Packages for Enterprise Linux (EPEL) provides components that MapR Metrics needs (CentOS, Red Hat and SUSE only).
M5 License — To get the most out of MapR Metrics, you'll need an M5 license. With an M3 license, you won't have access to charts or histograms to visualize your data.
MapR 2.0 or higher — The Metrics feature is new to MapR 2.0 and is not usable with earlier releases.

libmysqlclient16 package — The libmysqlclient16 package is a dependency for the mapr-metrics package on SUSE. Install the libmysqlclient16 package with the following command before installing the mapr-metrics package:

zypper install libmysqlclient16

EPEL for Red Hat/CentOS and MapR Metrics

The sdparm package is a dependency for MapR's mapr-core package on Red Hat. Red Hat v5.x does not include sdparm, while Red Hat v6.x does. On Red Hat v5.x, nodes must have access to the EPEL repository to get the sdparm package. The mysql-connector-java package is a dependency for Red Hat 6.x.

The EPEL repository is also required for the MapR Metrics package on Red Hat, CentOS and SUSE platforms.

Enabling access to the EPEL repository on CentOS or Red Hat 5.x:

1. Download version 5 of the EPEL repository:

wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

2. Install the EPEL repository:

rpm -Uvh epel-release-5*.rpm

Enabling access to the EPEL repository on CentOS or Red Hat 6.x:

1. Download version 5 of the EPEL repository to enable access to the mysql-connector-java package:

wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

2. Download the current version of the EPEL repository:

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-<version>-noarch.rpm


3. Install the EPEL repository:

rpm -Uvh epel-release-6*.rpm

Configuration

Each node must be configured as follows:

Each node must have a unique hostname.
SELinux must be disabled during the install procedure. If the MapR services run as a non-root user, SELinux can be enabled after installation and while the cluster is running.
Each node must be able to perform forward and reverse hostname resolution with every other node in the cluster.
MapR Administrative user - a Linux user chosen to have administrative privileges on the cluster.

The MapR user must exist on each node, and the user name, user id (UID) and primary group id (GID) must match on all nodes. Make sure the user has a password (using sudo passwd <user>, for example).
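For example, a minimal sketch of creating a matching MapR user on a node (the mapr name and the UID/GID value of 5000 are illustrative; repeat with identical values on every node):

# use the same UID and GID on every node in the cluster
groupadd -g 5000 mapr
useradd -m -u 5000 -g 5000 mapr
passwd mapr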

Make sure the limit on the number of processes (NPROC_RLIMIT) is not set too low for the root user; the value should be at least 32768. In Red Hat or CentOS, the default may be very low (1024, for example). In Ubuntu, there may be no default; you should only set this value if you see errors related to inability to create new threads. Use the ulimit command to remove limits on file sizes or other computing resources. Each node must have a number of available file descriptors greater than four times the number of nodes in the cluster. See ulimit for more detailed information.
syslog must be enabled.
To reduce TaskTracker failover time, set tcp_retries2 to 5 on every node:

1. Add the following line to /etc/sysctl.conf:

net.ipv4.tcp_retries2 = 5

2. Issue the following command:

sysctl -p

3. Ensure that the setting has taken effect. Issue the following command, and make sure the output is 5:

cat /proc/sys/net/ipv4/tcp_retries2

In VM environments like EC2, VMware, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the /etc/default/irqbalance file and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).
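A sketch of that edit, assuming the file already contains an ENABLED=1 line:

# turn off the IRQ balancer
sed -i 's/^ENABLED=1/ENABLED=0/' /etc/default/irqbalance
# reboot the node for the change to take effect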

Environment Variables

The /opt/mapr/conf/env.sh script sets default values for the JAVA_HOME and MAPR_SUBNETS environment variables with /etc/environment. You can set these values manually as needed:

JAVA_HOME points to a specific version of Java, if this node needs to run multiple versions of Java.
MAPR_SUBNETS takes a comma-separated list of up to four subnets in CIDR notation with no spaces, as in the example export MAPR_SUBNETS=1.2.3.4/12,5.6/24. Every node in the cluster must be reachable on one of the subnets listed. Set this value to limit MapR-FS to a particular set of network interface controllers (NICs) on a node with multiple NICs.

NTP

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.

An internal NTP server enables your cluster to remain synchronized in the event that an outside NTP server is inaccessible.
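To verify that a node is synchronized, a minimal sketch (assumes the standard ntpd tools are installed):

# list the NTP peers and their offsets; an asterisk marks the peer in use
ntpq -p

# on Red Hat or CentOS, confirm the ntpd service is running
service ntpd status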

DNS Resolution

For MapR to work properly, all nodes on the cluster must be able to communicate with each other. Each node must have a unique hostname, and must be able to resolve all other hosts with both normal and reverse DNS name lookup.

You can use the hostname command on each node to check the hostname. Example:


$ hostname -f
swarm

If the hostname command returns a hostname, you can use the getent command to check whether the hostname exists in the hosts database. The getent command should return a valid IP address on the local network, associated with a fully-qualified domain name for the host. Example:

$ getent hosts `hostname`
10.250.1.53 swarm.corp.example.com

If you do not get the expected output from the hostname command or the getent command, correct the host and DNS settings on the node. A common problem is an incorrect loopback entry (127.0.x.x), which prevents the correct IP address from being assigned to the hostname. Pay special attention to the format of /etc/hosts. For more information, see the hosts(5) man page. Example:

127.0.0.1 localhost
10.10.5.10 mapr-hadoopn.maprtech.prv mapr-hadoopn

Users and Groups

Two users are important when installing and setting up the MapR cluster:

root is used to install MapR software on each node.
The “MapR user” is the user that MapR services run as (typically named mapr or hadoop) on each node. The MapR user has full privileges to administer the cluster. Administrative privilege with varying levels of control can be assigned to other users as well.

Before installing MapR, decide on the name, user id (UID) and group id (GID) for the MapR user. The MapR user must exist on each node, and the user name, UID and primary GID must match on all nodes.

MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted to a specific amount of disk usage. For more information, see Managing Quotas.

By default, MapR gives the root user full administrative permissions. If the nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to another user after deployment. See Configuring the Cluster.

On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM Configuration.

Keyless SSH

Before beginning installation, set up keyless SSH access to nodes on the cluster for the MapR user. During normal operation of the cluster, the MapR Control System (running on the webserver or a client machine) relies on keyless SSH as the MapR user for certain features, including centralized management of disks (via the disk commands), support utilities, and rolling upgrades. If you choose not to provide keyless SSH for the MapR user to all nodes, the cluster will run fine, but you will be unable to use the above features remotely. However, you can accomplish the same tasks locally on each node as follows:

Use the disk commands on each node to manage its own disks.
Use the mapr-support-collect.sh support utility with the -O or --online option, to use the warden instead of SSH for support dump collection from nodes.
Upgrade the cluster manually instead of performing a rolling upgrade.
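If you do set up keyless SSH for the MapR user, a minimal sketch (the node hostname is illustrative; repeat ssh-copy-id for every node in the cluster):

# as the MapR user, generate a key pair; accept the defaults and leave the passphrase empty
su - mapr
ssh-keygen -t rsa

# copy the public key to each node
ssh-copy-id mapr@node2.example.com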

Network Ports

The following table lists the network ports that must be open for use by MapR.

Service Port

CLDB 7222

CLDB JMX monitor port 7220

CLDB web port 7221


HBase Master 60000

Hive Metastore 9083

JobTracker 9001

JobTracker web 50030

LDAP 389

LDAPS 636

MFS server 5660

NFS 2049

NFS monitor (for HA) 9997

NFS management 9998

NFS VIP service 9997 and 9998

Oozie 11000

Port mapper 111

SMTP 25

SSH 22

TaskTracker web 50060

Web UI HTTPS 8443

Web UI HTTP 8080

ZooKeeper 5181

ZooKeeper follower-to-leader communication 2888

ZooKeeper leader election 3888

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.
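One way to close the port, sketched with iptables (how rules are persisted varies by distribution; this is an assumption, not a MapR-provided procedure):

# reject inbound HTTP on port 80
iptables -A INPUT -p tcp --dport 80 -j REJECT

# on Red Hat or CentOS, save the rule so it survives a reboot
service iptables save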

ulimit

On each node, ulimit specifies the number of file handles that can be opened simultaneously. With the default value of 1024, the system appears to be out of disk space and shows no inodes available. The value for ulimit should be set to 64000.

Setting ulimit for CentOS/Red Hat:

1. Edit /etc/security/limits.conf and add the following lines:

root soft nofile 64000
root hard nofile 64000

2. Check that the /etc/pam.d/su file contains the following settings:


#%PAM-1.0

auth sufficient pam_rootok.so

# Uncomment the following line to implicitly trust users in the "wheel" group.

#auth sufficient pam_wheel.so trust use_uid

# Uncomment the following line to require a user to be in the "wheel" group.

#auth required pam_wheel.so use_uid

auth include system-auth

account sufficient pam_succeed_if.so uid = 0 use_uid quiet

account include system-auth

password include system-auth

session include system-auth

session optional pam_xauth.so

3. Reboot the system.
4. Run the following command to check the ulimit setting:

ulimit -n

The command should report 64000.

Setting ulimit for Ubuntu:

1. Edit /etc/security/limits.conf and add the following lines:

root soft nofile 64000
root hard nofile 64000

2. Edit /etc/pam.d/su and uncomment the following line:

session required pam_limits.so

3. Reboot the system.
4. Run the following command to check the ulimit setting:

ulimit -n

The command should report 64000.

Licensing

Before installing MapR, consider the capabilities you will need and make sure you have obtained the corresponding license. If you need NFS, data protection with snapshots and mirroring, job and cluster performance analytics, or plan to set up a cluster with high availability (HA), you will need an M5 license. You can obtain and install a license through the License Manager after installation. For more information about which features are included in each license type, see MapR Editions.


If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the MapRBuildVersion file in the /opt/mapr directory. If the version is 1.0.0 and includes GA, then you must upgrade before applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.


Planning the Deployment

MapR is a complete Hadoop distribution, implemented as a number of services running on individual nodes in a cluster. Deploying the MapR software to a cluster requires some preliminary planning. A MapR cluster is made up of several individual nodes. Each node runs a number of MapR services, which define the node's role(s) in the cluster. In a typical cluster, most (or all) nodes are dedicated to data processing and storage, and a smaller number of nodes run other services that provide cluster coordination and management.

This page contains the following topics addressing deployment considerations:

Planning Cluster Hardware
  Example Architecture
MapR Services
  Service Coordination with ZooKeeper and Warden
Assigning Services to Nodes for Best Performance
  Don't Overload the ZooKeeper
  Reduce TaskTracker Slots Where Necessary
  Separate High-Demand Services
Planning for High Availability (HA)
Planning Services for Nodes
  Planning a Small M3 Cluster
  Planning a Small High-Availability Cluster
  Planning a Large High-Availability Cluster
Planning for NFS on an M5 Cluster
Planning for MapR Metrics

Planning Cluster Hardware

When planning the hardware architecture for the cluster, make sure all hardware meets the Requirements for Installation specifications.

The architecture of the cluster hardware is an important consideration when planning a deployment. Among the considerations are anticipated data storage and network bandwidth needs, including intermediate data generated during MapReduce job execution. The type of workload is important: consider whether the planned cluster usage will be CPU-intensive, I/O-intensive, or memory-intensive. Think about how data will be loaded into and out of the cluster, and how much data is likely to be transmitted over the network.

Typically, the CPU is less of a bottleneck than network bandwidth and disk I/O. To the extent possible, network and disk transfer rates should be balanced to meet the anticipated data rates using multiple NICs per node. It is not necessary to bond or trunk the NICs together; MapR is able to take advantage of multiple NICs transparently. Each node should provide raw disks and partitions to MapR, with no RAID or logical volume manager, as MapR takes care of formatting and data protection.

Example Architecture

The following example architecture provides specifications for a standard compute/storage node for general purposes, and two sample rack configurations made up of the standard nodes. MapR is able to make effective use of more drives per node than standard Hadoop, so each node should present enough face plate area to allow a large number of drives. The standard node specification allows for either 2 or 4 1Gb/s ethernet network interfaces.

Standard Compute/Storage Node

2U chassis
Single motherboard, dual socket
2 x 4-core + 32 GB RAM or 2 x 6-core + 48 GB RAM
12 x 2 TB 7200-RPM drives
2 or 4 network interfaces (on-board NIC + additional NIC)
OS on single partition on one drive (remainder of drive used for storage)

Standard 50TB Rack Configuration

10 standard compute/storage nodes (10 x 12 x 2 TB storage; 3x replication, 25% margin)
24-port 1 Gb/s rack-top switch with 2 x 10Gb/s uplink
Add second switch if each node uses 4 network interfaces

Standard 100TB Rack Configuration

20 standard nodes (20 x 12 x 2 TB storage; 3x replication, 25% margin)
48-port 1 Gb/s rack-top switch with 4 x 10Gb/s uplink
Add second switch if each node uses 4 network interfaces

To grow the cluster, just add more nodes and racks, adding additional service instances as needed. MapR rebalances the cluster automatically.

MapR Services

The following table shows the services that can be run on a node, and the name of the package used to install the service. 

CLDB (package mapr-cldb): Maintains the container location database (CLDB) service. The CLDB service maintains the MapR FileServer storage (MapR-FS) and is aware of all NFS and FileServer nodes in the cluster. The CLDB service coordinates data storage services among MapR FileServer nodes, MapR NFS gateways, and MapR clients.

FileServer (package mapr-fileserver): Runs the MapR FileServer (MapR-FS) service.

HBaseMaster (package mapr-hbase-master): The HBase master service manages the region servers that make up HBase table storage.

HRegionServer (package mapr-hbase-regionserver): The HBase region server is used with the HBaseMaster service and provides storage for an individual HBase region.

JobTracker (package mapr-jobtracker): Hadoop JobTracker service. The JobTracker service coordinates the execution of MapReduce jobs by assigning tasks to TaskTracker nodes and monitoring task execution.

Metrics (package mapr-metrics): Provides real-time analytics data on cluster and job performance through the Job Metrics interface.

NFS (package mapr-nfs): Provides read-write MapR Direct Access NFS™ access to the cluster, with full support for concurrent read and write access.

TaskTracker (package mapr-tasktracker): Hadoop TaskTracker service. The TaskTracker service starts and tracks MapReduce tasks on a node. The TaskTracker service receives task assignments from the JobTracker service and manages task execution.

WebServer (package mapr-webserver): Runs the MapR Control System and provides the MapR Heatmap™.

ZooKeeper (package mapr-zookeeper): Enables high availability (HA) and fault tolerance for MapR clusters by providing coordination.

Service Coordination with ZooKeeper and Warden

At runtime, a process called the warden runs on all nodes to manage, monitor, and report on the other services on each node. The MapR cluster uses Apache ZooKeeper to coordinate services. ZooKeeper runs on an odd number of nodes (at least three, and preferably five or more) and prevents service coordination conflicts by enforcing a rigid set of rules and conditions that determine which instance of each service is the master. The warden will not start any services unless ZooKeeper is reachable and more than half of the configured ZooKeeper nodes are live.
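To see which ZooKeeper instances are live and which one is the leader, you can query the service on any ZooKeeper node; a sketch, assuming the mapr-zookeeper init script accepts the qstatus argument:

# report whether this ZooKeeper instance is a leader or follower
service mapr-zookeeper qstatus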

Assigning Services to Nodes for Best Performance

How you assign services to nodes depends on the scale of your cluster and the MapR license level. As the cluster size grows, it becomes advantageous to locate control services (such as ZooKeeper and CLDB) on nodes that do not run compute services (such as TaskTracker). The MapR M3 Edition license does not include HA capabilities, which restricts how many instances of certain services can run. The number of nodes and the services they run will evolve over the life cycle of the cluster, and there is no such thing as the perfect service layout for all applications. When setting up a cluster initially, take into consideration the following points from the page Assigning Services to Nodes for Best Performance.

The architecture of MapR software allows virtually any service to run on any node, or nodes, to provide a high-availability, high-performance cluster. Below are some guidelines to help plan your cluster's service layout.

Don't Overload the ZooKeeper

High latency on a ZooKeeper node can lead to an increased incidence of ZooKeeper quorum failures. A ZooKeeper quorum failure occurs when the cluster finds too few copies of the ZooKeeper service running. If the ZooKeeper node is also running other services, competition for computing resources can lead to increased latency for that node. If your cluster experiences issues relating to ZooKeeper quorum failures, consider reducing or eliminating the number of other services running on the ZooKeeper node.

Reduce TaskTracker Slots Where Necessary

Monitor the server load on the nodes in your cluster that are running high-demand services such as ZooKeeper or CLDB. If the TaskTracker service is running on nodes that also run a high-demand service, you can reduce the number of task slots provided by the TaskTracker service. Tune the number of task slots according to the acceptable load levels for nodes in your cluster.

Separate High-Demand Services

The following are guidelines about which services to separate on large clusters:

JobTracker on ZooKeeper nodes: Avoid running the JobTracker service on nodes that are running the ZooKeeper service. On large clusters, the JobTracker service can consume significant resources.
MySQL on CLDB nodes: Avoid running the MySQL server that supports the MapR Metrics service on a CLDB node. Consider running the MySQL server on a machine external to the cluster to prevent the MySQL server's resource needs from affecting services on the cluster.
TaskTracker on CLDB or ZooKeeper nodes: When the TaskTracker service is running on a node that is also running the CLDB or ZooKeeper services, consider reducing the number of task slots that this node's instance of the TaskTracker service provides. See Tuning Your MapR Install.
Webserver on CLDB nodes: Avoid running the webserver on CLDB nodes. Queries to the MapR Metrics service can impose a bandwidth load that reduces CLDB performance.
JobTracker on large clusters: Run the JobTracker service on a dedicated node for clusters with over 250 nodes.

Planning for High Availability (HA)

A properly licensed and configured MapR cluster provides automatic failover for continuity throughout the stack. Configuring a cluster for HA involves redundant instances of specific services, as well as a correct configuration of the MapR NFS service. HA features are not available with the M3 Edition license.

The following are the minimum numbers of each service required for HA:

Service Minimum number of instances for HA

CLDB 2

ZooKeeper 3

HBase Master 2

JobTracker 2

NFS 2

On a large cluster, you may choose to prepare extra machines in preparation for failover events. In this case, you keep cold spare nodes at the ready to replace administrative nodes (nodes running CLDB, JobTracker, ZooKeeper, or HBase Master) in case of a hardware failure. Use the maprcli node maintenance command to place a node into maintenance mode when the node needs replacement.
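A sketch of placing a node into maintenance mode (the hostname is illustrative, and the -nodes and -timeoutminutes parameter names are assumptions; check the maprcli reference for the exact syntax):

maprcli node maintenance -nodes node14.example.com -timeoutminutes 30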

Planning Services for Nodes

How you assign services to nodes depends on the scale of your cluster and the MapR license level. For a single-node cluster, no decisions are involved. All of the services you are using run on the single node. On medium clusters, the performance demands of the CLDB and ZooKeeper services require them to be assigned to separate nodes to optimize performance. On large clusters, good cluster performance requires that these services run on separate nodes.

MapR clusters running the M5 license enable you to run NFS on multiple nodes to provide virtual IP addresses for automatic transparent failover and High Availability (HA).

Below are examples of several possible cluster configurations.

Planning a Small M3 Cluster

For a small cluster using the free M3 Edition license, assign the CLDB, JobTracker, NFS, and WebServer services to one node each. A hardware failure on any of these nodes would result in a service interruption, but the cluster can be recovered. Assign the ZooKeeper service to the CLDB node and two other nodes. Assign the FileServer and TaskTracker services to every node in the cluster.

Example Service Configuration for a 5-Node M3 Cluster


This cluster has several single points of failure, at the nodes with CLDB, JobTracker and NFS.

Planning a Small High-Availability Cluster

A small M5 cluster can ensure high availability (HA) for all services by providing at least two instances of each service, eliminating single points of failure. The example below depicts a 5-node HA M5 cluster with HBase installed. ZooKeeper is installed on three nodes. CLDB, JobTracker, and HBase Master services are installed on two nodes each, spread out as much as possible across the nodes:

Example Service Configuration for a 5-Node M5 Cluster

Planning a Large High-Availability Cluster

On a large cluster designed for high availability (HA), assign services according to Assigning Services to Nodes for Best Performance. The example below depicts a 150-node HA M5 cluster. The majority of nodes are dedicated to the TaskTracker service. ZooKeeper, CLDB, and JobTracker are installed on three nodes each, and are isolated from other services. The NFS server is installed on most machines, providing high network bandwidth to the cluster.

Example Service Configuration for a 100+ Node M5 Cluster


Planning for NFS on an M5 Cluster

You can run NFS on multiple nodes in your cluster with an M5 license for MapR.

Plan which nodes will provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of write throughput and 5Gbps of read throughput, the following node configurations would be suitable:

12 NFS nodes with a single 1GbE connection each
6 NFS nodes with dual 1GbE connections each
4 NFS nodes with quadruple 1GbE connections each

When you set up NFS on all of the file server nodes, you enable a self-mounted NFS point for each node. A cluster made up of nodes with self-mounted NFS points enables you to run native applications as tasks. You can use round-robin DNS or a hardware load balancer to mount NFS on one or more dedicated gateways outside the cluster to allow controlled access.

See Setting Up MapR NFS for details on configuring NFS on the cluster.

Planning for NFS with Virtual IP addresses

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also enable high availability (HA) NFS. In a HA NFS system, when an NFS node fails, data requests are satisfied by other NFS nodes in the pool. Use a minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should be in the same IP subnet as the interfaces to which they will be assigned. See Setting Up VIPs for NFS for details on enabling VIPs for your cluster.

If you plan to use VIPs on your M5 cluster's NFS nodes, consider the following tips:

Set up NFS on at least three nodes if possible.
All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover.
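VIP pools can be created from the command line as well as from the MapR Control System; a sketch, assuming the maprcli virtualip add parameters described in Setting Up VIPs for NFS (the address range and netmask are illustrative):

maprcli virtualip add -virtualip 10.10.5.100 -virtualipend 10.10.5.115 -netmask 255.255.255.0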


Planning for MapR Metrics

MapR Metrics provides statistical information about jobs, tasks, and task attempts in easy-to-read graphical form. See Job Metrics for in-depth information about the kinds of data MapR Metrics can display.

If you plan to use the MapR Metrics service, install the mapr-metrics package on all nodes running JobTracker and/or the webserver.

MapR Metrics uses a MySQL database to store analytics data. The MapR Control System draws on the data from the MySQL database to present the charts that represent job and task attempt characteristics to you.

Before you install MapR Metrics, make sure your cluster meets the Requirements for Installation and that you have configured the MySQL database with the MapR Control System.


Preparing Packages and Repositories

When installing MapR software, each node must have access to the package files. There are several ways to specify where the packages will be. This section describes the ways to make packages available to each node. The options are:

Installing from MapR's Internet repository
Installing from a local repository
Installing from a local path containing rpm or deb package files

You also must consider all packages that the MapR software depends on. You can install dependencies on each node before beginning the MapR installation process, or you can specify repositories and allow the package manager on each node to resolve dependencies. See Packages and Dependencies for MapR Software for details.

As of version 2.0, MapR has separated the distribution into two repositories:

MapR packages, which provide core functionality for MapR clusters, such as the MapR filesystem
Hadoop ecosystem packages, which are not specific to MapR, such as HBase, Hive and Pig

Installing from MapR's Internet repository

The MapR repository on the Internet provides all the packages you need in order to install a MapR cluster using native tools such as yum on Red Hat or CentOS, or apt-get on Ubuntu. Installing from MapR's repository is generally the easiest method for installation, but requires the greatest amount of bandwidth. With this method, each node must be connected to the Internet and will individually download the necessary packages.

Below are instructions on setting up repositories for each supported Linux distribution.

Adding the MapR repository on Red Hat or CentOS

1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents, substituting the appropriate <version>:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v<version>/redhat/
enabled=1
gpgcheck=0
protect=1

[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem/redhat
enabled=1
gpgcheck=0
protect=1

(See the Release Notes for the correct paths for all past releases.)

3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

Adding the MapR repository on SUSE

1. Change to the root user (or use sudo for the following commands).
2. Use the following command to add the repository for MapR packages, substituting the appropriate <version>:

zypper ar http://package.mapr.com/releases/v<version>/suse/ maprtech

3. Use the following command to add the repository for MapR ecosystem packages:


zypper ar http://package.mapr.com/releases/ecosystem/suse/ maprecosystem

(See the Release Notes for the correct paths for all past releases.)

4. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

5. Update the system package index by running the following command:

zypper refresh

6. MapR packages require a compatibility package in order to install and run on SUSE. Execute the following command to install the SUSE compatibility package:

zypper install mapr-compat-suse

Adding the MapR repository on Ubuntu

1. Change to the root user (or use sudo for the following commands).
2. Add the following lines to /etc/apt/sources.list, substituting the appropriate <version>:

deb http://package.mapr.com/releases/v<version>/ubuntu/ mapr optional
deb http://package.mapr.com/releases/ecosystem/ubuntu binary/

(See the Release Notes for the correct paths for all past releases.)

3. Update the package indexes.

apt-get update

4. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
Retries "0";
HTTP {
Proxy "http://<user>:<password>@<host>:<port>";
};
};

Installing from a local repository

You can set up a local repository on each node to provide access to installation packages. With this method, the package manager on each node installs from packages in the local repository. Nodes do not need to be connected to the Internet.

Below are instructions on setting up a local repository for each supported Linux distribution. These instructions create a single repository that includes both MapR components and the Hadoop ecosystem components.

Setting up a local repository requires running a web server that nodes access to download the packages. Setting up a web server is not documented here.
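For a quick evaluation-only server rooted at the repository directory, one option (an assumption, not part of the MapR procedure) is Python's built-in HTTP server, which must run as root to bind port 80:

cd /var/www/html
python -m SimpleHTTPServer 80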

Creating a local repository on Red Hat or CentOS

1. Log in as root on the node.
2. Create the following directory if it does not exist: /var/www/html/yum/base
3. On a computer that is connected to the Internet, download the following files, substituting the appropriate <version> and <datestamp>:


http://package.mapr.com/releases/v<version>/redhat/mapr-v<version>GA.rpm.tgz
http://package.mapr.com/releases/ecosystem/redhat/mapr-ecosystem-<datestamp>.rpm.tgz

(See the Release Notes for the correct paths for all past releases.)

4. Copy the files to /var/www/html/yum/base on the node, and extract them there.

tar -xvzf mapr-v<version>GA.rpm.tgz
tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz

5. Create the base repository headers:

createrepo /var/www/html/yum/base

When finished, verify the contents of the new /var/www/html/yum/base/repodata directory: filelists.xml.gz, other.xml.gz, primary.xml.gz, repomd.xml

To add the repository on each node:

1. Add the following lines to the /etc/yum.conf file:

[maprtech]
name=MapR Technologies, Inc.
baseurl=http://<host>/yum/base
enabled=1
gpgcheck=0

Creating a local repository on SUSE

1. Log in as root on the node.
2. Create the following directory if it does not exist: /var/www/html/zypper/base
3. On a computer that is connected to the Internet, download the following files, substituting the appropriate <version> and <datestamp>:

http://package.mapr.com/releases/v<version>/suse/mapr-v<version>GA.rpm.tgz
http://package.mapr.com/releases/ecosystem/suse/mapr-ecosystem-<datestamp>.rpm.tgz

(See the Release Notes for the correct paths for all past releases.)

4. Copy the files to /var/www/html/zypper/base on the node, and extract them there.

tar -xvzf mapr-v<version>GA.rpm.tgz
tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz

5. Create the base repository headers:

createrepo /var/www/html/zypper/base

When finished, verify the contents of the new /var/www/html/zypper/base/repodata directory: filelists.xml.gz, other.xml.gz, primary.xml.gz, repomd.xml

To add the repository on each node:


1. Use the following commands to add the repository for MapR packages and the MapR ecosystem packages, substituting the appropriate <version>:

zypper ar http://<host>/zypper/base/ maprtech

Creating a local repository on Ubuntu

To create a local repository:

1. Log in as root on the machine where you will set up the repository.
2. Change to the /root directory and create the following directories within it:

~/mapr
  dists
    binary
      optional
        binary-amd64
  mapr

3. On a computer that is connected to the Internet, download the following files, substituting the appropriate <version> and <datestamp>:

http://package.mapr.com/releases/v<version>/ubuntu/mapr-v<version>GA.deb.tgz
http://package.mapr.com/releases/ecosystem/ubuntu/mapr-ecosystem-<datestamp>.deb.tgz

(See the Release Notes for the correct paths for all past releases.)

4. Copy the files to /root/mapr/mapr on the node, and extract them there.

tar -xvzf mapr-v<version>GA.deb.tgz
tar -xvzf mapr-ecosystem-<datestamp>.deb.tgz

5. Navigate to the /root/mapr/ directory.
6. Use dpkg-scanpackages to create Packages.gz in the binary-amd64 directory:

dpkg-scanpackages . /dev/null | gzip -9c > ./dists/binary/optional/binary-amd64/Packages.gz

7. Move the entire /root/mapr directory to the default directory served by the HTTP server (e.g. /var/www) and make sure the HTTP server is running.

To add the repository on each node:

1. Add the following line to /etc/apt/sources.list on each node, replacing <host> with the IP address or hostname of the node where you created the repository:

deb http://<host>/mapr binary optional

2. On each node, update the package indexes (as root or with sudo).

apt-get update

After performing the above steps, you can use apt-get as normal to install MapR software and Hadoop ecosystem components on each node from the local repository.


Installing from a local path containing rpm or deb package files

You can download package files and store them locally, and install from there. This option is useful for clusters that are not connected to the Internet.

1. Using a machine connected to the Internet, download the tarball for the MapR components and the Hadoop ecosystem components, substituting the appropriate <platform>, <version>, and <datestamp>:

http://package.mapr.com/releases/v<version>/<platform>/mapr-v<version>GA.rpm.tgz (or .deb.tgz)
http://package.mapr.com/releases/ecosystem/<platform>/mapr-ecosystem-<datestamp>.rpm.tgz (or .deb.tgz)

(See the Release Notes for the correct paths for all past releases.)

2. Extract the tarball to a local directory, either on each node or on a local network accessible by all nodes.

tar -xvzf mapr-v<version>GA.rpm.tgz
tar -xvzf mapr-ecosystem-<datestamp>.rpm.tgz

MapR package dependencies need to be pre-installed on each node in order for MapR installation to succeed. If you are not using a package manager to install dependencies from Internet repositories, you need to manually download and install other dependency packages as well.


Installing MapR Software

After planning for your MapR deployment and preparing packages and repositories, you are ready to install the MapR software.

To proceed you will need the following:

A list of the hostnames (or IP addresses) for all CLDB nodes
A list of the hostnames (or IP addresses) for all ZooKeeper nodes
A list of all disks and/or partitions to be used for the MapR cluster on all nodes

Perform the following steps, starting with the control nodes, which are the nodes that will run the CLDB and ZooKeeper services:

1. On each node, INSTALL the planned MapR services.
2. On all nodes, RUN the configure.sh script to configure the node.
3. On all fileserver nodes, FORMAT disks allocated to MapR using the disksetup script.

Before you proceed, make sure that all nodes meet the Requirements for Installation.

Installing MapR Packages

Based on your deployment plan for which services to run on which nodes, use the commands in this section to install the appropriate packages for each node.

You can use a package manager such as yum or apt-get, which will automatically resolve and install dependency packages, provided that necessary repositories have been set up correctly. Alternatively, you can use rpm or dpkg commands to manually install package files that you have downloaded and extracted to a local directory.

When installing from package files, you must manually pre-install any dependency packages in order for the installation to succeed. Note that most MapR packages depend on the mapr-core package. Similarly, many Hadoop ecosystem components have internal dependencies, such as the hbase-internal package for mapr-hbase-regionserver. See Packages and Dependencies for MapR Software for details.

Installing on Red Hat or CentOS

If you are installing from a repository:

1. Change to the root user (or use sudo for the following command).
2. Use the yum command to install the services planned for the node. For example:

Use the following command to install TaskTracker and MapR-FS

yum install mapr-tasktracker mapr-fileserver

Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, HBase and Mahout:

yum install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-hbase mapr-mahout

If you are installing from a local path containing package files:

1. Change to the root user (or use sudo for the following command).
2. Change the working directory to the location where the rpm package files are located.
3. Use the rpm command to install the appropriate packages for the node. For example:

Use the following command to install TaskTracker and MapR-FS

rpm -ivh mapr-core-<version>.GA-1.x86_64.rpm mapr-fileserver-<version>.GA-1.x86_64.rpm \
mapr-tasktracker-<version>.GA-1.x86_64.rpm

Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, HBase Master, HBase, Hive and Pig:


rpm -ivh mapr-core-<version>.GA-1.x86_64.rpm mapr-cldb-<version>.GA-1.x86_64.rpm \
mapr-jobtracker-<version>.GA-1.x86_64.rpm mapr-webserver-<version>.GA-1.x86_64.rpm \
mapr-zk-internal-<version>-1.x86_64.rpm mapr-zookeeper-<version>.GA-1.x86_64.rpm \
mapr-hbase-internal-<version>-1.noarch.rpm mapr-hbase-master-<version>-1.noarch.rpm \
mapr-hive-<version>-1.noarch.rpm \
mapr-pig-<version>-1.noarch.rpm

Installing on SUSE

If you are installing from a repository:

1. Change to the root user (or use sudo for the following command).
2. Use the zypper command to install the services planned for the node. For example:

Use the following command to install TaskTracker and MapR-FS

zypper install mapr-tasktracker mapr-fileserver

Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, HBase and Mahout

zypper install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-hbase mapr-mahout

If you are installing from a local path containing package files:

1. Change to the root user (or use sudo for the following command).
2. Change the working directory to the location where the rpm package files are located.
3. Use the rpm command to install the appropriate packages for the node. For example:

Use the following command to install TaskTracker and MapR-FS

rpm -ivh mapr-core-<version>.GA-1.x86_64.rpm mapr-fileserver-<version>.GA-1.x86_64.rpm \
mapr-tasktracker-<version>.GA-1.x86_64.rpm

Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, HBase Master, HBase, Hive and Pig:

rpm -ivh mapr-core-<version>.GA-1.x86_64.rpm mapr-cldb-<version>.GA-1.x86_64.rpm \
mapr-jobtracker-<version>.GA-1.x86_64.rpm mapr-webserver-<version>.GA-1.x86_64.rpm \
mapr-zk-internal-<version>-1.x86_64.rpm mapr-zookeeper-<version>.GA-1.x86_64.rpm \
mapr-hbase-internal-<version>-1.noarch.rpm mapr-hbase-master-<version>-1.noarch.rpm \
mapr-hive-<version>-1.noarch.rpm \
mapr-pig-<version>-1.noarch.rpm

Installing on Ubuntu

If you are installing from a repository:

1. Change to the root user (or use sudo for the following commands).
2. On all nodes, issue the following command to update the Ubuntu package cache:

apt-get update

3. Use the apt-get install command to install the services planned for the node. For example:

Use the following command to install TaskTracker and MapR-FS

apt-get install mapr-tasktracker mapr-fileserver

Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, HBase and Mahout


apt-get install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-hbase mapr-mahout

If you are installing from a local path containing package files:

1. Change to the root user (or use sudo for the following command).
2. Change the working directory to the location where the deb package files are located.
3. Use the dpkg command to install the appropriate packages for the node. For example:

Use the following command to install TaskTracker and MapR-FS

dpkg -i mapr-core_<version>.GA-1.x86_64.deb mapr-fileserver_<version>.GA-1.x86_64.deb \
mapr-tasktracker_<version>.GA-1.x86_64.deb

Use the following command to install CLDB, JobTracker, Webserver, ZooKeeper, HBase Master, HBase, Hive and Pig:

dpkg -i mapr-core_<version>.GA-1_amd64.deb mapr-cldb_<version>.GA-1_amd64.deb \
mapr-jobtracker_<version>.GA-1_amd64.deb mapr-webserver_<version>.GA-1_amd64.deb \
mapr-zk-internal_<version>-1_amd64.deb mapr-zookeeper_<version>.GA-1_amd64.deb \
mapr-hbase-internal-<version>_all.deb mapr-hbase-master-<version>_all.deb \
mapr-hive-<version>_all.deb \
mapr-pig-<version>_all.deb

After installation completes without errors, the software is installed in the /opt/mapr directory. For every service that installs successfully, a file is created in /opt/mapr/roles. You can examine this directory to verify installation for the node. For example:

# ls -l /opt/mapr/roles
total 0
-rwxr-xr-x 1 root root 0 Sep 19 17:59 fileserver
-rwxr-xr-x 1 root root 0 Sep 19 17:58 tasktracker
-rwxr-xr-x 1 root root 0 Sep 19 17:58 webserver
-rwxr-xr-x 1 root root 0 Sep 19 17:58 zookeeper

Configure Nodes in the Cluster with the configure.sh Script

The script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. Theconfigure.shscript creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sureyou have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes aswell. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] \
  -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 \
  -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 \
  -N MyCluster

Formatting Disks with the disksetup Script

On all nodes on which mapr-fileserver is installed, use the following procedure to format disks and partitions for use by MapR.


This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them with spaces. For example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks.txt file. For example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.

If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.
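The following sketch shows one way to survey a node's disks before building the disks file; the device names are examples only, and the survey commands are standard Linux tools rather than part of MapR:

# survey disks and current mounts to identify free, unmounted devices
fdisk -l | grep '^Disk /dev'
mount | grep '^/dev'

# build the disks file and format (verify the device names first!)
cat > /tmp/disks.txt <<'EOF'
/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd
EOF
/opt/mapr/server/disksetup -F /tmp/disks.txt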


Setting Up Hadoop Ecosystem Components

This section provides information about integrating the following tools with a MapR cluster:

Mahout - Environment variable settings needed to run Mahout on MapR
Ganglia - Setting up Ganglia monitoring on a MapR cluster
Nagios Integration - Generating a Nagios Object Definition file for use with a MapR cluster
Compiling Pipes Programs - Using Hadoop Pipes on a MapR cluster
HBase - Installing and using HBase on MapR
MultiTool - Starting Cascading Multitool on a MapR cluster
Flume - Installing and using Flume on a MapR cluster
Hive - Installing and using Hive on a MapR cluster, and setting up a MySQL metastore
Pig - Installing and using Pig on a MapR cluster


Flume

Flume is a reliable, distributed service for collecting, aggregating, and moving large amounts of log data, generally delivering the data to a distributed file system such as MapR-FS. For more information about Flume, see the Apache Flume Incubation Wiki.

Installing Flume

The following procedures use the operating system package managers to download and install from the MapR Repository. If you want to install this component manually from package files, see Package Dependencies for MapR version 2.x.

To install Flume on an Ubuntu cluster:

1. Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
2. Update the list of available packages:

apt-get update

3. On each planned Flume node, install mapr-flume:

apt-get install mapr-flume

To install Flume on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
2. On each planned Flume node, install mapr-flume:

yum install mapr-flume

Using Flume

For information about configuring and using Flume, see the following documents:

Flume User FAQ
Flume Recipes


HBase

HBase is the Hadoop database, which provides random, realtime read/write access to very large datasets.

See Installing HBase for information about using HBase with MapR
See Setting Up Compression with HBase for information about compressing HFile storage
See Running MapReduce Jobs with HBase for information about using MapReduce with HBase
See HBase Best Practices for HBase tips and tricks

Installing HBase

Plan which nodes should run the HBase Master service, and which nodes should run the HBase RegionServer. At least one node (generally three nodes) should run the HBase Master; for example, install HBase Master on the ZooKeeper nodes. Any or all of the remaining nodes can run the HBase RegionServer. When you install HBase RegionServer on nodes that also run TaskTracker, reduce the number of map and reduce slots to avoid oversubscribing the machine. The following procedures use the operating system package managers to download and install from the MapR Repository. If you want to install this component manually from package files, see Package Dependencies for MapR version 2.x.

To install HBase on an Ubuntu cluster:

1. Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
2. Update the list of available packages:

apt-get update

3. On each planned HBase Master node, install mapr-hbase-master:

apt-get install mapr-hbase-master

4. On each planned HBase RegionServer node, install mapr-hbase-regionserver:

apt-get install mapr-hbase-regionserver

5. On all HBase nodes, run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
6. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

To install HBase on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.
2. On each planned HBase Master node, install mapr-hbase-master:

yum install mapr-hbase-master

3. On each planned HBase RegionServer node, install mapr-hbase-regionserver:

yum install mapr-hbase-regionserver

4. On all HBase nodes, run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
5. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

Installing HBase on a Client


To use the HBase shell from a machine outside the cluster, you can install HBase on a computer running the MapR client. For MapR client setup instructions, see Setting Up the Client.

Prerequisites:

The MapR client must be installed
You must know the IP addresses or hostnames of the ZooKeeper nodes on the cluster

To install HBase on a client computer:

1. Execute the following commands as root or using sudo.
2. On the client computer, install mapr-hbase-internal:

CentOS or Red Hat: yum install mapr-hbase-internal
Ubuntu: apt-get install mapr-hbase-internal

3. Edit hbase-site.xml, setting the hbase.zookeeper.quorum property to include a comma-separated list of the IP addresses or hostnames of the ZooKeeper nodes on the cluster you will be working with. Example:

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>10.10.25.10,10.10.25.11,10.10.25.13</value>
</property>
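After editing hbase-site.xml, a quick way to confirm that the client can reach the cluster is to pipe a read-only command through the HBase shell; a minimal sketch (use TAB completion to fill in the <version> placeholder):

# run a single read-only command through the HBase shell from the client
echo "status" | /opt/mapr/hbase/hbase-<version>/bin/hbase shell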

Getting Started with HBase

In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.

HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.

Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns you need for your data on a per-row basis.

Create a table in HBase:

1. Start the HBase shell by typing the following command:

/opt/mapr/hbase/hbase-0.90.4/bin/hbase shell

2. Create a table called weblog with one column family named stats:

create 'weblog', 'stats'

3. Verify the table creation by listing everything:

list

4. Add a test value to the daily column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:daily', 'test-daily-value'

5. Add a test value to the weekly column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'

6. Add a test value to the weekly column in the stats column family for row 2:


put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'

7. Type scan 'weblog' to display the contents of the table. Sample output:

ROW                  COLUMN+CELL
 row1                column=stats:daily, timestamp=1321296699190, value=test-daily-value
 row1                column=stats:weekly, timestamp=1321296715892, value=test-weekly-value
 row2                column=stats:weekly, timestamp=1321296787444, value=test-weekly-value
2 row(s) in 0.0440 seconds

8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:

COLUMN               CELL
 stats:daily         timestamp=1321296699190, value=test-daily-value
 stats:weekly        timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds

9. Type disable 'weblog' to disable the table.
10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.
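The same steps can also be run non-interactively by piping commands into the HBase shell, which is handy for scripted smoke tests; a minimal sketch of the tutorial above:

/opt/mapr/hbase/hbase-0.90.4/bin/hbase shell <<'EOF'
create 'weblog', 'stats'
put 'weblog', 'row1', 'stats:daily', 'test-daily-value'
scan 'weblog'
disable 'weblog'
drop 'weblog'
exit
EOF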

Setting Up Compression with HBase

Using compression with HBase reduces the number of bytes transmitted over the network and stored on disk. These benefits often outweigh the performance cost of compressing the data on every write and uncompressing it on every read.

GZip Compression

GZip compression is included with most Linux distributions, and works natively with HBase. To use GZip compression, specify it in the per-column family compression flag while creating tables in the HBase shell. Example:

create 'mytable', {NAME=>'colfam:', COMPRESSION=>'gz'}

LZO Compression

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm, included in most Linux distributions, that is designed for decompression speed.

To Set Up LZO Compression for Use with HBase:

1. Make sure HBase is installed on the nodes where you plan to run it. See Planning the Deployment and Installing MapR Software for more information.
2. On each HBase node, ensure the native LZO base library is installed:

On Ubuntu: apt-get install liblzo2-dev
On Red Hat or CentOS: yum install liblzo2-devel

3. Check out the native connector library from http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/. For 0.20.2, check out branches/branch-0.1:

svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/branches/branch-0.1/

For 0.21 or 0.22, check out trunk:

svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/trunk/

4. Set the compiler flags and build the native connector library:


$ export CFLAGS="-m64"
$ ant compile-native
$ ant jar

5. Create a directory for the native libraries (use TAB completion to fill in the <version> placeholder):

mkdir -p /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64

6. Copy the build results into the appropriate HBase directories on every HBase node. Example:

$ cp build/hadoop-gpl-compression-0.2.0-dev.jar /opt/mapr/hbase/hbase-<version>/lib
$ cp build/native/Linux-amd64-64/lib/libgplcompression.* \
    /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/

Once LZO is set up, you can specify it in the per-column family compression flag while creating tables in HBase shell. Example:

create 'mytable', {NAME=>'colfam:', COMPRESSION=>'lzo'}

Running MapReduce Jobs with HBase

To run MapReduce jobs with data stored in HBase, set the HADOOP_CLASSPATH environment variable to the output of the hbase classpath command (use TAB completion to fill in the <version> placeholder):

$ export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-<version>/bin/hbase classpath`

Note the backticks (`).

Example: Exporting a table named t1 with MapReduce

Notes: On a node in a MapR cluster, the output directory /hbase/export_t1 will be located in the MapR Hadoop filesystem, so to list the output files in the example below, use the following hadoop fs command from the node's command line:

# hadoop fs -ls /hbase/export_t1

To view the output:

# hadoop fs -cat /hbase/export_t1/part-m-00000


# cd /opt/mapr/hadoop/hadoop-0.20.2
# export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-0.90.4/bin/hbase classpath`
# ./bin/hadoop jar /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar export t1 /hbase/export_t1
11/09/28 09:35:11 INFO mapreduce.Export: verisons=1, starttime=0, endtime=9223372036854775807
11/09/28 09:35:11 INFO fs.JobTrackerWatcher: Current running JobTracker is: lohit-ubuntu/10.250.1.91:9001
11/09/28 09:35:12 INFO mapred.JobClient: Running job: job_201109280920_0003
11/09/28 09:35:13 INFO mapred.JobClient:  map 0% reduce 0%
11/09/28 09:35:19 INFO mapred.JobClient: Job complete: job_201109280920_0003
11/09/28 09:35:19 INFO mapred.JobClient: Counters: 15
11/09/28 09:35:19 INFO mapred.JobClient:   Job Counters
11/09/28 09:35:19 INFO mapred.JobClient:     Aggregate execution time of mappers(ms)=3259
11/09/28 09:35:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/09/28 09:35:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/09/28 09:35:19 INFO mapred.JobClient:     Launched map tasks=1
11/09/28 09:35:19 INFO mapred.JobClient:     Data-local map tasks=1
11/09/28 09:35:19 INFO mapred.JobClient:     Aggregate execution time of reducers(ms)=0
11/09/28 09:35:19 INFO mapred.JobClient:   FileSystemCounters
11/09/28 09:35:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=61319
11/09/28 09:35:19 INFO mapred.JobClient:   Map-Reduce Framework
11/09/28 09:35:19 INFO mapred.JobClient:     Map input records=5
11/09/28 09:35:19 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=107991040
11/09/28 09:35:19 INFO mapred.JobClient:     Spilled Records=0
11/09/28 09:35:19 INFO mapred.JobClient:     CPU_MILLISECONDS=780
11/09/28 09:35:19 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=759836672
11/09/28 09:35:19 INFO mapred.JobClient:     Map output records=5
11/09/28 09:35:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=63
11/09/28 09:35:19 INFO mapred.JobClient:     GC time elapsed (ms)=35
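The same jar also provides an import driver that reverses the export. A minimal sketch, assuming you have first created a target table (here the hypothetical t1_copy) in the HBase shell with the same column families as t1:

# in the HBase shell first: create 't1_copy', '<family>' (matching t1's families)
# then load the exported files back in:
cd /opt/mapr/hadoop/hadoop-0.20.2
export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-0.90.4/bin/hbase classpath`
./bin/hadoop jar /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar import t1_copy /hbase/export_t1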


HBase Best Practices

The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at /hbase). Example:

hadoop mfs -setcompression off /hbase

You can check whether compression is turned off in a directory or mounted volume by using hadoop mfs to list the file contents. Example:

hadoop mfs -ls /hbase

The letter Z in the output indicates compression is turned on; the letter U indicates compression is turned off. See hadoop mfs for more information.

On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the /opt/mapr/conf/warden.conf file. See Tuning Your MapR Install for more information.


Hive

Apache Hive is a data warehouse system for Hadoop that uses a SQL-like language called Hive Query Language (HQL) to query structured data stored in a distributed filesystem. For more information about Hive, see the Apache Hive project page.

On this page:

Installing Hive
Getting Started with Hive
Using Hive with MapR Volumes
Setting Up Hive with a MySQL Metastore
Hive-HBase Integration

Once Hive is installed, the executable is located at: /opt/mapr/hive/hive-<version>/bin/hive

Make sure the JAVA_HOME environment variable is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun

Make sure the HIVE_HOME environment variable is set correctly. Example:

# export HIVE_HOME=/opt/mapr/hive/hive-<version>

Installing Hive

The following procedures use the operating system package managers to download and install Hive from the MapR Repository. If you want to install this component manually from package files, see Packages and Dependencies for MapR Software. This procedure is to be performed on a MapR cluster (see the Installation Guide) or client (see Setting Up the Client).

Default Hive MapR Hadoop Filesystem Directories

It is not necessary to create and chmod the Hive /tmp and /user/hive/warehouse directories in the MapR Hadoop filesystem. By default, MapR creates and configures these directories for you when you create your first Hive table.

These default directories are defined in the $HIVE_HOME/conf/hive-default.xml file:

<configuration>
...
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive-${user.name}</value>
  <description>Scratch space for Hive jobs</description>
</property>

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
...
</configuration>

If you need to modify the default names for one or both of these directories, create a $HIVE_HOME/conf/hive-site.xml file for this purpose if it doesn't already exist.

Copy the hive.exec.scratchdir and/or hive.metastore.warehouse.dir property elements from the hive-default.xml file and paste them into an XML configuration element in the hive-site.xml file. Modify the value elements for these directories in the hive-site.xml file as desired, then save and close the hive-site.xml file and close the hive-default.xml file.
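For example, the override file can be created in one step with a shell heredoc; a minimal sketch (the /myvolume paths are placeholders, and the property names are the ones shown in hive-default.xml above):

# write a hive-site.xml that overrides both directories
# (quoting the heredoc delimiter keeps ${user.name} literal for Hive)
cat > $HIVE_HOME/conf/hive-site.xml <<'EOF'
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/myvolume/tmp/hive-${user.name}</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/myvolume/warehouse</value>
  </property>
</configuration>
EOF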

To install Hive on an Ubuntu cluster:


1. Execute the following commands as root or using sudo.
2. Update the list of available packages:

apt-get update

3. On each planned Hive node, install mapr-hive:

apt-get install mapr-hive

To install Hive on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.
2. On each planned Hive node, install mapr-hive:

yum install mapr-hive

Getting Started with Hive

In this tutorial, you'll create a Hive table, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: on the A Tour of the MapR Virtual Machine page, select Tools > Attachments, right-click on sample-table.txt, select Save Link As... from the pop-up menu, select a directory to save to, then click OK. If you're working on the MapR Virtual Machine, we'll be loading the file from the MapR Virtual Machine's local file system (not the cluster storage layer), so save the file in the MapR home directory (for example, /home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal:

1. Make sure you are in the home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output.

mapr@mapr-desktop:~$ cat sample-table.txt
1320352532   1001   http://www.mapr.com/doc   http://www.mapr.com       192.168.10.1
1320352533   1002   http://www.mapr.com       http://www.example.com    192.168.10.10
1320352546   1001   http://www.mapr.com       http://www.mapr.com/doc   192.168.10.1

Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file represents a web log.

Create a table in Hive and load the source data:

1. Type the following command to start the Hive shell, using tab-completion to expand the <version>:

/opt/mapr/hive/hive-<version>/bin/hive

2. At the hive> prompt, type the following command to create the table:

CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Type the following command to load the data from sample-table.txt into the table:

LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;

Run basic queries against the table:

Try the simplest query, one that displays all the data in the table:


SELECT web_log.* FROM web_log;

This query would be inadvisable with a large table, but with the small sample table it returns very quickly.

Try a simple SELECT to extract only data that matches a desired string:

SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

This query launches a MapReduce job to filter the data.

When the Hive shell starts, it reads an initialization file called .hiverc, which is located in the HIVE_HOME/bin/ or $HOME/ directories. You can edit this file to set custom parameters or commands that initialize the Hive command-line environment, one command per line.

When you run the Hive shell, you can specify a MySQL initialization script file using the -i option. Example:

hive -i <filename>

Using Hive with MapR Volumes

MapR-FS does not allow moving or renaming files across volume boundaries. Before running a Hive job, be sure to set the Hive scratch directory and Hive warehouse directory to locations in the same volume where the data for the job resides. The following sections provide additional detail.

Hive Scratch Directory

When running an import job on data from a MapR volume, make sure to set hive.exec.scratchdir to a directory in the same volume (where the data for the job resides). Set the parameter to a directory (for example, /tmp) under the volume's mount point (as viewed in Volume Properties). You can set this parameter from the Hive shell. Example:

hive> set hive.exec.scratchdir=/myvolume/tmp

Hive Warehouse Directory

When writing queries that move data between tables, make sure the tables are in the same volume. By default, all tables are created under the path /user/hive/warehouse under the root volume. This value is specified by the hive.metastore.warehouse.dir property, which you can set from the Hive shell. Example:

hive> set hive.metastore.warehouse.dir=/myvolume/mydirectory

Setting Up Hive with a MySQL Metastore

The metadata for Hive tables and partitions is stored in the Hive Metastore (for more information, see the Hive project documentation). By default, the Hive Metastore stores all Hive metadata in an embedded Apache Derby database in MapR-FS. Derby only allows one connection at a time; if you want multiple concurrent Hive sessions, you can use MySQL for the Hive Metastore. You can run the Hive Metastore on any machine that is accessible from Hive.

Prerequisites

Make sure MySQL is installed on the machine on which you want to run the Metastore, and make sure you are able to connect to the MySQL server from the Hive machine. You can test this with the following command:

mysql -h <hostname> -u <user>

The database administrator must create a database for the Hive metastore data, and the username specified in javax.jdo.option.ConnectionUserName must have permissions to access it. The database can be specified using the ConnectionURL parameter. The tables and schemas are created automatically when the metastore is first started.
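For example, the database and user can be bootstrapped with a few statements; a minimal sketch (the hive database name, hiveuser, and the password are placeholders, not values MapR requires):

mysql -u root -p <<'EOF'
CREATE DATABASE hive;
GRANT ALL PRIVILEGES ON hive.* TO 'hiveuser'@'%' IDENTIFIED BY 'password';
FLUSH PRIVILEGES;
EOF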


Download and install the driver for the MySQL JDBC connector. Example:

$ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.18.tar.gz/from/http://mysql.he.net/' | tar xz
$ sudo cp mysql-connector-java-5.1.18/mysql-connector-java-5.1.18-bin.jar /opt/mapr/hive/hive-<version>/lib/

Configuring Hive for MySQL

Create the hive-site.xml file in the Hive configuration directory (/opt/mapr/hive/hive-<version>/conf) with the contents from the example below. Then set the parameters as follows:

You can set a specific port for Thrift URIs by adding the export METASTORE_PORT=<port> command to the hive-env.sh file (if hive-env.sh does not exist, create it in the Hive configuration directory). Example:

export METASTORE_PORT=9083

To connect to an existing MySQL metastore, make sure the ConnectionURL parameter and the Thrift URIs parameter in hive-site.xml point to the metastore's host and port.

Once you have the configuration set up, start the Hive Metastore service using the following command (use tab auto-complete to fill in the <version>):

/opt/mapr/hive/hive-<version>/bin/hive --service metastore

You can use nohup hive --service metastore to run the metastore in the background.
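For example, a sketch that keeps the metastore running after you log out (the log path is arbitrary):

nohup /opt/mapr/hive/hive-<version>/bin/hive --service metastore > /tmp/hive-metastore.log 2>&1 &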

Example hive-site.xml


<configuration>

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value><fill in with password></value>
  <description>password to use against metastore database</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:9083</value>
</property>

</configuration>

Hive-HBase Integration

You can create HBase tables from Hive that can be accessed by both Hive and HBase. This allows you to run Hive queries on HBase tables. You can also convert existing HBase tables into Hive-HBase tables and run Hive queries on those tables as well.

In this section:

Install and Configure Hive and HBase
Getting Started with Hive-HBase Integration

Install and Configure Hive and HBase

1. Install and configure Hive if it is not already installed.

2. Install and configure HBase if it is not already installed.

3. Execute the jps command and ensure that all relevant Hadoop, HBase, and ZooKeeper processes are running.

Example:


$ jps
21985 HRegionServer
1549 jenkins.war
15051 QuorumPeerMain
30935 Jps
15551 CommandServer
15698 HMaster
15293 JobTracker
15328 TaskTracker
15131 WardenMain

Configure the hive-site.xml File

1. Open the hive-site.xml file with your favorite editor, or create a hive-site.xml file if it doesn't already exist:

$ cd $HIVE_HOME
$ vi conf/hive-site.xml

2. Copy the following XML code and paste it into the hive-site.xml file.

Note: If you already have an existing hive-site.xml file with a configuration element block, just copy the property element block code below and paste it inside the configuration element block in the hive-site.xml file.

Example configuration:

<configuration>

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,file:///opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar,file:///opt/mapr/zookeeper/zookeeper-3.3.2/zookeeper-3.3.2.jar</value>
  <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>xx.xx.x.xxx,xx.xx.x.xxx,xx.xx.x.xxx</value>
  <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
</property>

<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
  <description>The ZooKeeper client port. The MapR default clientPort is 5181.</description>
</property>

</configuration>

3. Save and close the hive-site.xml file.

If you have successfully completed all of the steps in this Install and Configure Hive and HBase section, you're ready to begin the Getting Started with Hive-HBase Integration tutorial in the next section.

Getting Started with Hive-HBase Integration

In this tutorial we will:

Create a Hive table
Populate the Hive table with data from a text file


Query the Hive table
Create a Hive-HBase table
Introspect the Hive-HBase table from HBase
Populate the Hive-HBase table with data from the Hive table
Query the Hive-HBase table from Hive
Convert an existing HBase table into a Hive-HBase table

Be sure that you have successfully completed all of the steps in the Install and Configure Hive and HBase section before beginning this Getting Started tutorial.

This Getting Started tutorial closely parallels the Hive-HBase Integration section of the Apache Hive Wiki, with thanks to Samuel Guo and other contributors to that effort. If you are familiar with their approach to Hive-HBase integration, you should be immediately comfortable with this material.

However, please note that there are some significant differences in this Getting Started section, especially in regard to configuration and command parameters, or the lack thereof. Follow the instructions in this Getting Started tutorial to the letter so you can have an enjoyable and successful experience.

Create a Hive table with two columns:

Change to your Hive installation directory if you're not already there and start Hive:

$ cd $HIVE_HOME
$ bin/hive

Execute the CREATE TABLE command to create the Hive pokes table:

hive> CREATE TABLE pokes (foo INT, bar STRING);

To see if the pokes table has been created successfully, execute the SHOW TABLES command:

hive> SHOW TABLES;
OK
pokes
Time taken: 0.74 seconds

The pokes table appears in the list of tables.

Populate the Hive pokes table with data

Execute the LOAD DATA LOCAL INPATH command to populate the Hive pokes table with data from the kv1.txt file.

The kv1.txt file is provided in the $HIVE_HOME/examples directory.

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

A message appears confirming that the table was created successfully, and the Hive prompt reappears:

Copying data from file:...
OK
Time taken: 0.278 seconds
hive>

Execute a SELECT query on the Hive pokes table:


hive> SELECT * FROM pokes WHERE foo = 98;

The SELECT statement executes, runs a MapReduce job, and prints the job output:

OK
98      val_98
98      val_98
Time taken: 18.059 seconds

The output of the SELECT command displays two identical rows because there are two identical rows in the Hive pokes table with a key of 98.

Note: This is a good illustration of the concept that Hive tables can have multiple identical keys. As we will see shortly, HBase tables cannot have multiple identical keys, only unique keys.

To create a Hive-HBase table, enter these four lines of code at the Hive prompt:

hive> CREATE TABLE hbase_table_1(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    > TBLPROPERTIES ("hbase.table.name" = "xyz");

After a brief delay, a message appears confirming that the table was created successfully:

OK
Time taken: 5.195 seconds

Note: The TBLPROPERTIES clause is not required, but those new to Hive-HBase integration may find it easier to understand what's going on if Hive and HBase use different names for the same table.

In this example, Hive will recognize this table as "hbase_table_1" and HBase will recognize this table as "xyz".

Start the HBase shell:

Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell:

$ cd $HBASE_HOME
$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, rUnknown, Wed Nov 9 17:35:00 PST 2011

hbase(main):001:0>

Execute the list command to see a list of HBase tables:

hbase(main):001:0> list
TABLE
xyz
1 row(s) in 0.8260 seconds

HBase recognizes the Hive-HBase table named xyz. This is the same table known to Hive as hbase_table_1.

Display the description of the xyz table in the HBase shell:


hbase(main):004:0> describe "xyz"
DESCRIPTION                                                          ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', true
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.0190 seconds

From the Hive prompt, insert data from the Hive table pokes into the Hive-HBase table hbase_table_1:

hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
...
2 Rows loaded to hbase_table_1
OK
Time taken: 13.384 seconds

Query hbase_table_1 to see the data we have inserted into the Hive-HBase table:

hive> SELECT * FROM hbase_table_1;
OK
98      val_98
Time taken: 0.56 seconds

Even though we loaded two rows from the Hive pokes table that had the same key of 98, only one row was actually inserted into hbase_table_1. This is because hbase_table_1 is an HBase table, and although Hive tables support duplicate keys, HBase tables only support unique keys. HBase tables arbitrarily retain only one row for a given key, and silently discard all of the data associated with duplicate keys.

Convert a pre-existing HBase table to a Hive-HBase table

To convert a pre-existing HBase table to a Hive-HBase table, enter the following four commands at the Hive prompt.

Note that in this example the existing HBase table is my_hbase_table.

hive> CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
    > TBLPROPERTIES("hbase.table.name" = "my_hbase_table");

Now we can run a Hive query against the pre-existing HBase table my_hbase_table, which Hive sees as hbase_table_2:

hive> SELECT * FROM hbase_table_2 WHERE key > 400 AND key < 410;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
OK
401     val_401
402     val_402
403     val_403
404     val_404
406     val_406
407     val_407
409     val_409
Time taken: 9.452 seconds


ZooKeeper Connections

If you see an error message similar to the following, ensure that hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort are properly defined in the $HIVE_HOME/conf/hive-site.xml file.

Failed with exception java.io.IOException: org.apache.hadoop.hbase.ZooKeeperConnectionException:
HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a
sign that the server has too many connections (30 is the default). Consider inspecting your
ZK server logs for that error and then make sure you are reusing HBaseConfiguration as often as
you can. See HTable's javadoc for more information.
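One quick way to check that each ZooKeeper server in the quorum is reachable on the MapR client port is the standard ZooKeeper ruok probe; a minimal sketch, assuming the nc utility is installed and substituting the addresses from your hive-site.xml:

# a healthy ZooKeeper server answers "imok"
echo ruok | nc 10.10.25.10 5181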


Hive ODBC Connector

Before You Begin

The MapR Hive ODBC Connector is an ODBC driver for Apache Hive 0.70 and later that conforms to the ODBC 3.52 specification. The standard query language for ODBC is SQL; Hive's query language, HiveQL, includes a subset of ANSI SQL-92. When using an application that connects via ODBC to Hive, you may need to rewrite queries to compensate for SQL features that are not present in Hive. Applications that use SQL will recognize HiveQL, but might not provide access to HiveQL-specific features such as multi-table insert. Please refer to the HiveQL wiki for up-to-date information on HiveQL.

You will need to configure a Data Source Name (DSN), a definition that specifies how to connect to Hive. DSNs are typically managed by the operating system and may be used by multiple applications. Some applications do not use DSNs. You will need to refer to your particular application's documentation to understand how it connects using ODBC.

Software and Hardware Requirements

Using the MapR Hive ODBC Connector on Windows requires:

Windows® 7 Professional or Windows® 2008 R2. Both 32-bit and 64-bit editions are supported.
Microsoft .NET Framework 4.0
The Microsoft Visual C++ 2010 Redistributable Package (runtimes required to run applications developed with Visual C++ on a computer that does not have Visual C++ 2010 installed)
A Hadoop cluster with the Hive service installed and running. You should find out from the cluster administrator the hostname or IP address for the Hive service and the port that the service is running on. (The default port for Hive is 10000.)

Installation and Configuration

There are versions of the connector for 32-bit and 64-bit applications. The 64-bit version of the connector works only with 64-bit DSNs; the 32-bit connector works only with 32-bit DSNs. 64-bit Windows machines can run both 64-bit and 32-bit applications; on a 64-bit system, you might need to install both versions of the connector in order to set up DSNs to work with both. If both the 32-bit connector and the 64-bit connector are installed, you must configure DSNs for each independently, in their separate Data Source Administrators.

To install the Hive ODBC Connector:

1. Run the installer to get started:
   To install the 64-bit connector, download and run http://package.mapr.com/tools/MapR-ODBC/MapR_odbc_1.00.100.7_x64.exe
   To install the 32-bit connector, download and run http://package.mapr.com/tools/MapR-ODBC/MapR_odbc_1.00.100.7_x86.exe
2. Perform the following steps, clicking Next after each:
   a. Accept the license agreement.
   b. Select an installation folder.
3. On the Information window, click Next.
4. On the Completing... window, click Finish.
5. Install a DSN corresponding to your Hive server.

To create a Data Source Name (DSN)

1. Open the Data Source Administrator from the Start menu. Example: Start > MapR Hive ODBC Connector > 64-Bit ODBC Driver Manager
2. Click Add to open the Create New Data Source dialog.
3. Select Hive ODBC Connector and click Finish to open the MapR Hive ODBC Connector Setup window.
4. Enter the connection information for the Hive instance:
   Data Source Name: a name for the DSN
   Description: an optional description for the DSN
   Host: the IP or hostname of your Hive server
   Port: the listening port for the Hive service
   Database: use show databases at the Hive command line if you are not sure
5. Click Test to test the connection.
6. When you're sure the connection works, click Finish.

SQLPrepare Optimization

The connector currently uses query execution to determine the result-set's metadata for SQLPrepare. The downside of this is that SQLPrepare is slow, because query execution tends to be slow. You can configure the connector to speed up SQLPrepare if you do not need the result-set's metadata. To change the behavior for SQLPrepare, create a NOPSQLPrepare String value under your DSN. If the value is set to a non-zero value, SQLPrepare will not use query execution to derive the result-set's metadata. If this registry entry is not defined, the default value is 0.

Notes


Data Types

The following data types are supported:

Type Description

TINYINT 1-byte integer

SMALLINT 2-byte integer

INT 4-byte integer

BIGINT 8-byte integer

FLOAT Single-precision floating-point number

DOUBLE Double-precision floating-point number

BOOLEAN True/false value

STRING Sequence of characters

Not yet supported:

The aggregate types (ARRAY, MAP, and STRUCT)
The new timestamp types introduced in Hive 0.80

HiveQL Notes

CAST Function

HiveQL doesn’t support the CONVERT function; it uses the CAST function to perform type conversion. Example:

CAST (<expression> AS <type>)

Using CAST in HiveQL:

Use the HiveQL names for the eight data types supported by Hive in the CAST expression. For example, to convert 1.0 to an integer, use CAST (1.0 AS INT) rather than CAST (1.0 AS SQL_INTEGER).
Hive does not do a range check during CAST operations. For example, CAST (1000000 AS SQL_TINYINT) returns a TINYINT of value 64, rather than the expected error.
Unlike SQL, Hive returns null instead of an error if it fails to convert the data. For example, CAST ("STRING" AS INT) returns null (see the sketch after this list).
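The null-on-failure behavior is easy to demonstrate from the command line; a minimal sketch, assuming the web_log table from the earlier Hive tutorial exists (any existing table works):

# the query prints NULL because the string cannot be converted
hive -e "SELECT CAST('STRING' AS INT) FROM web_log LIMIT 1;"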

Using CAST with BOOLEAN values:

The boolean value TRUE converts to the numeric value 1
The boolean value FALSE converts to the numeric value 0
The numeric value 0 converts to the boolean value FALSE; any other number converts to TRUE
The empty string converts to the boolean value FALSE; any other string converts to TRUE

The HiveQL STRING type stores text strings and corresponds to the SQL_LONGVARCHAR data type. The CAST operation successfully converts strings to numbers if the strings contain only numeric characters; otherwise the conversion fails.

You can tune the column length used for STRING columns. To change the default length reported for STRING columns, add the DefaultStringColumnLength registry entry under your DSN and specify a value. If this registry entry is not defined, the default length of 1024 characters is used.

Delimiters

The connector uses Thrift to connect to the Hive server. Hive returns the result set of a HiveQL query as newline-delimited rows whose fields are tab-delimited. Hive currently does not escape any tab characters in the fields. Make sure to escape any tab or newline characters in the Hive data, including platform-specific newline character sequences such as line feed (LF) for UNIX/Linux/Mac OS X, carriage return/line feed (CR/LF) for Windows, and carriage return (CR) for older Macintosh platforms.

Notes on Applications

Microsoft Access

Version tested: “2010” (=14.0), 32- and 64-bit.

Notes: Linked tables are not currently available.

Microsoft Excel/Query


Version tested: “2010” (=14.0), 32- and 64-bit.

Notes: From the ribbon, use Data > From Other Sources and select either From Data Connection Wizard or From Microsoft Query. The former requires a pre-defined DSN while the latter supports creating a DSN on the fly. You can use the ODBC driver via the OLE DB for ODBC Driver bridge.

Tableau Desktop

Version tested: 7.0, 32-bit only.

Notes: Prior to version 7.0.n, you will need to install a TDC to maximize the capability of the driver. From version 7.0.n onward, you can specify the driver via the MapR Hadoop Hive option from the Connect to Data tab.


Hive ODBC Connector License and Copyright Information

Third Party Trademarks

ICU License - ICU 1.8.1 and later

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1995-2010 International Business Machines Corporation and others

All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

All trademarks and registered trademarks mentioned herein are the property of their respective owners.

OpenSSL

Copyright (c) 1998-2008 The OpenSSL Project.  All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1.  Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgment: "This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)"

4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact [email protected].

5. Products derived from this software may not be called "OpenSSL" nor may "OpenSSL" appear in their names without prior written permission of the OpenSSL Project.

6. Redistributions of any form whatsoever must retain the following acknowledgment: "This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (http://www.openssl.org/)"

THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE OpenSSL PROJECT OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Expat

Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Apache Hive

Copyright 2008-2011 The Apache Software Foundation.

Apache Thrift

Copyright 2006-2010 The Apache Software Foundation.


Mahout

Apache Mahout™ is a scalable machine learning library. For more information about Mahout, see the Apache Mahout project.

On this page:

Installing Mahout
Configuring the Mahout Environment
Getting Started with Mahout

Installing Mahout

Mahout can be installed when MapR services are initially installed, as discussed in Installing MapR Services. If Mahout wasn't installed during the initial MapR services installation, it can be installed at a later date by executing the instructions in this section. These procedures may be performed on a node in a MapR cluster (see the Installation Guide) or on a client (see Setting Up the Client).

The Mahout installation procedures below use the operating system's package manager to download and install Mahout from the MapR Repository. If you want to install this component manually from package files, see Package Dependencies for MapR version 2.x.

Installing Mahout on a MapR Node

Mahout only needs to be installed on the nodes in the cluster from which Mahout applications will be executed, so you may only need to install Mahout on one node. However, depending on the number of Mahout users and the number of scheduled Mahout jobs, you may need to install Mahout on more than one node.

Mahout applications may run MapReduce programs, and by default Mahout will use the cluster's default JobTracker to execute MapReduce jobs.

Install Mahout on a MapR node running Ubuntu

Install Mahout on a MapR node running Ubuntu as root or using sudo by executing the following apt-get install command:

# apt-get install mapr-mahout

Install Mahout on a MapR node running Red Hat or CentOS

Install Mahout on a MapR node running Red Hat or CentOS as root or using sudo by executing the following command:

# yum install mapr-mahout

Installing Mahout on a Client

If you install Mahout on a Linux client, you can run Mahout applications from the client that execute MapReduce jobs on the cluster that your client is configured to use.

Tip: You don't have to install Mahout on the cluster in order to run Mahout applications from your client.

Install Mahout on a client running Ubuntu

Install Mahout on a client running Ubuntu as root or using sudo by executing the following command:

# apt-get install mapr-mahout

Install Mahout on a client running Red Hat or CentOS

Install Mahout on a client running Red Hat or CentOS as root or using sudo by executing the following command:

# yum install mapr-mahout

Configuring the Mahout Environment

After installation, the Mahout executable is located in the following directory:

/opt/mapr/mahout/mahout-<version>/bin/mahout


Example: /opt/mapr/mahout/mahout-0.7/bin/mahout

To use Mahout with MapR, set the following environment variables:

MAHOUT_HOME - the path to the Mahout directory. Example: $ export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7

JAVA_HOME - the path to the Java directory. Example for Ubuntu: $ export JAVA_HOME=/usr/lib/jvm/java-6-sun

JAVA_HOME - the path to the Java directory. Example for Red Hat and CentOS:  $ export JAVA_HOME=/usr/java/jdk1.6.0_24

HADOOP_HOME - the path to the Hadoop directory. Example: $ export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2

HADOOP_CONF_DIR - the path to the directory containing Hadoop configuration parameters. Example: $ export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

You can set these environment variables persistently for all users by adding them to the /etc/environment file as root or using sudo. The order of the environment variables in the file doesn't matter.

Example entries for setting environment variables in the /etc/environment file for Ubuntu:

JAVA_HOME=/usr/lib/jvm/java-6-sun
MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7
HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

Example entries for setting environment variables in the /etc/environment file for Red Hat and CentOS:

JAVA_HOME=/usr/java/jdk1.6.0_24
MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7
HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

After adding or editing environment variables in the /etc/environment file, you can activate them without rebooting by executing the source command:

$ source /etc/environment

Note: A user who doesn't have root or sudo permissions can add these environment variable entries to his or her ~/.bashrc file. The environment variables will be set each time the user logs in.
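A minimal sketch of that per-user approach (the paths repeat the Ubuntu examples above; adjust them to match your installation):

cat >> ~/.bashrc << 'EOF'
# Mahout environment for MapR (example paths)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.7
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf
EOF
source ~/.bashrc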

Getting Started with Mahout

To see the sample applications bundled with Mahout, execute the following command:

$ ls $MAHOUT_HOME/examples/bin

To run the Twenty Newsgroups Classification Example, execute the following commands:

$ cd $MAHOUT_HOME
$ ./examples/bin/build-20news-bayes.sh

The output from this example shows the progress of the MapReduce jobs, followed by a summary of the classifier's results.


MultiTool

The mt command is a wrapper around Cascading.Multitool, a command-line tool for processing large text files and datasets (like sed and grep on Unix). The mt command is located in the /opt/mapr/contrib/multitool/bin directory. To use mt, change to the multitool directory. Example:

cd /opt/mapr/contrib/multitool
./bin/mt



Oozie

Oozie is a workflow system for Hadoop. Using Oozie, you can set up workflows that execute MapReduce jobs and coordinators that manage workflows.
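To make the workflow concept concrete, here is a minimal sketch of a workflow definition (a workflow.xml file) containing a single MapReduce action. It is illustrative only: the mapper and reducer class names are hypothetical, and ${jobTracker}, ${nameNode}, ${inputDir}, and ${outputDir} are parameters you would supply in the job's properties file.

<workflow-app xmlns="uri:oozie:workflow:0.2" name="example-wf">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- Hypothetical mapper and reducer classes -->
        <property>
          <name>mapred.mapper.class</name>
          <value>org.example.SampleMapper</value>
        </property>
        <property>
          <name>mapred.reducer.class</name>
          <value>org.example.SampleReducer</value>
        </property>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce action failed, error message: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>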

Installing Oozie

The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to Preparing Packages and Repositories.

To install Oozie on a MapR cluster:

Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster with the MapR repository properly set. If you have not installed MapR, see the Installation Guide.

1. If you are installing on Ubuntu, update the list of available packages:

apt-get update

2. Install mapr-oozie:
RHEL/CentOS:

yum install mapr-oozie

SUSE:

zypper install mapr-oozie

Ubuntu:

apt-get install mapr-oozie

3. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

4. Use the oozie-setup.sh script to set up Oozie:

/opt/mapr/oozie/oozie-<version>/bin/oozie-setup.sh

5. Start the Oozie daemon:

/etc/init.d/mapr-oozie start

The command returns immediately, but it might take a few minutes for Oozie to start.

6. Use the following command to see if Oozie has started:

/etc/init.d/mapr-oozie status

Checking the Status of Oozie

Once Oozie is installed, you can check the status using the command line or the Oozie web console.

To check the status of Oozie using the command line:

Use the oozie admin command:



/opt/mapr/oozie/oozie-<version>/bin/oozie admin -oozie http://localhost:11000/oozie -status

The following output indicates normal operation:

System mode: NORMAL

To check the status of Oozie using the web console:

Point your browser to http://localhost:11000/oozie

Examples

After verifying the status of Oozie, set up and try the examples to get familiar with Oozie.

To set up the examples and copy them to the cluster:

1. Extract the Oozie examples archive oozie-examples.tar.gz:

cd /opt/mapr/oozie/oozie-<version>
tar xvfz ./oozie-examples.tar.gz

2. Mount the cluster via NFS (see Accessing Data with NFS). Example:

mkdir /mnt/mapr
mount localhost:/mapr /mnt/mapr

3. Create a directory for the examples. Example:

mkdir /mnt/mapr/my.cluster.com/myvolume/oozie-examples

4. Copy the Oozie examples from the local directory to the cluster directory. Example:

cp -r /opt/mapr/oozie/oozie-<version>/examples /mnt/mapr/my.cluster.com/myvolume/oozie-examples

5. Set the OOZIE_URL environment variable so that you don't have to provide the -oozie option when you run each job:

export OOZIE_URL="http://localhost:11000/oozie"
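Before running the map-reduce example below, it can help to look at its job.properties file, which supplies the parameters the workflow references. The following entries are a sketch with illustrative assumptions for this walkthrough: maprfs:/// is commonly used for both nameNode and jobTracker on a MapR cluster, and the application path matches the directory created above.

nameNode=maprfs:///
jobTracker=maprfs:///
queueName=default
oozie.wf.application.path=${nameNode}/myvolume/oozie-examples/examples/apps/map-reduce
outputDir=map-reduce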

To run the examples:

1. Choose an example and run it with the oozie job command. Example:

/opt/mapr/oozie/oozie-<version>/bin/oozie job -config /opt/mapr/oozie/oozie-<version>/examples/apps/map-reduce/job.properties -run

2. Make a note of the returned job ID.
3. Using the job ID, check the status of the job using the command line or the Oozie web console, as shown below.

Using the command line, type the following (substituting the job ID for the <job id> placeholder):

/opt/mapr/oozie/oozie-<version>/bin/oozie job -info <job id>

Using the Oozie web console, point your browser to http://localhost:11000/oozie and click All Jobs.



Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. For more information about Pig, see the Pig project page.

Once Pig is installed, the executable is located at: /opt/mapr/pig/pig-<version>/bin/pig

Make sure the JAVA_HOME environment variable is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun

Installing Pig

The following procedures use the operating system package managers to download and install Pig from the MapR Repository. If you want to install this component manually from package files, see Package Dependencies for MapR version 2.x.

To install Pig on an Ubuntu cluster:

Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.

1. Update the list of available packages:

apt-get update

2. On each planned Pig node, install mapr-pig:

apt-get install mapr-pig

To install Pig on a Red Hat or CentOS cluster:

Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.

1. On each planned Pig node, install mapr-pig:

yum install mapr-pig

Getting Started with Pig

In this tutorial, we'll use Pig to run a MapReduce job that counts the words in the file /myvolume/in/constitution.txt on the cluster, and store the results in the directory /myvolume/wordcount.

1. First, make sure you have downloaded the file: on the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click constitution.txt to save it.
2. Make sure the file is loaded onto the cluster, in the directory /myvolume/in. If you are not sure how, look at the NFS tutorial on A Tour of the MapR Virtual Machine.

3. Open a Pig shell and get started: in the terminal, type the pig command to start the Pig shell. At the grunt> prompt, type the following lines (press ENTER after each):

A = LOAD '/myvolume/in' USING TextLoader() AS (words:chararray);

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;


D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO '/myvolume/wordcount';

After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the directory /myvolume/wordcount to see the results.
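One quick way to inspect the results is through an NFS mount of the cluster (the mount point and volume name below repeat the Oozie examples earlier; adjust them for your setup):

# List the output files and view the first few word counts
ls /mnt/mapr/my.cluster.com/myvolume/wordcount
head /mnt/mapr/my.cluster.com/myvolume/wordcount/part-*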



Bringing Up the Cluster and Applying a License

Bringing up the cluster involves starting the ZooKeeper service, starting the CLDB service, setting up the administrative user, and installing a license with the MapR Control System. Once these initial steps are done, the cluster is functional on a limited set of nodes. Not all services are started yet, but you can use the MapR Control System Dashboard to examine nodes and activity on the cluster. You then proceed to start services on all remaining nodes.

To bring up the cluster, you will need a list of the following:

Node(s) on which CLDB is installed
Node(s) on which ZooKeeper is installed
Node(s) on which the webserver is installed
Username for the MapR user, which is the Linux (or LDAP) user that will have administrative privileges on the cluster

To Bring Up the Cluster

1. Start ZooKeeper on all nodes where it is installed, by issuing one of the following commands:

/etc/init.d/mapr-zookeeper start

service mapr-zookeeper start

2. On one of the CLDB nodes and one of the webserver nodes, start the warden by issuing one of the following commands. (If the CLDB and webserver are on separate nodes, run the command on each node. If they are on the same node, run the command on only that one node.)

/etc/init.d/mapr-warden start

service mapr-warden start

Before continuing, wait 30 to 60 seconds for the warden to start the CLDB service. The following command may fail if executed before the CLDB has started successfully.

3. Log in to the node running CLDB, and issue the following command to give full permission to the chosen administrative user:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

4. On a machine that is connected to the cluster and to the Internet, perform the following steps to open the MapR Control System and install the license:

a. In a browser, view the MapR Control System by navigating to the node that is running the MapR Control System: https://<MCS node>:8443
b. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
c. The first time MapR starts, you must accept the Terms of Use and choose whether to enable the MapR Dial Home service.
d. Log in to the MapR Control System as the administrative user you designated earlier.
e. Until a license is applied, the MapR Control System dashboard might show some nodes in the amber "degraded" state. Don't worry if not all nodes are green and "healthy" at this stage.

5. In the navigation pane of the MapR Control System, expand the System Settings Views group and click Manage Licenses to display the MapR License Management dialog.

6. Click Add Licenses via Web. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

7. On all remaining nodes, execute one of the following commands to start the warden:

/etc/init.d/mapr-warden start

service mapr-warden start

8. Log in to the MapR Control System. Under the Cluster group in the left pane, click Dashboard.


9. Check the Services pane and make sure each service is running the correct number of instances, according to your deployment plan.



Configuring the Cluster

After installing MapR Services and bringing up the cluster, perform the following configuration steps.

This page contains the following topics:

Setting Up the Administrative User
Setting up the MapR Metrics Database
Checking the Services
Cluster Topology
Volume Setup
NFS Setup Requirements
Configuring Authentication
Configuring Email
Configuring SMTP
Configuring Permissions
Setting Quotas
Configuring alarm notifications
Isolating the CLDB Node(s)

Setting Up the Administrative User

Give the administrative user full control over the cluster:

1. Log on to any cluster node as root (or use sudo for the following command).
2. Execute the following command, replacing <user> with the administrative username:

sudo /opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

For general information about users and groups in the cluster, see Users and Groups.

Setting up the MapR Metrics Database

Perform these post-install configuration steps to enable the MapR Metrics database:

1. Log on to your SQL server.
2. Add the following line to the /etc/mysql/my.cnf file:

bind-address = <internal_ip_of_mysql_node>

3. Restart MySQL with the command service mysql restart.
4. Start MySQL as the root user with the command mysql -u root -p.
5. Source the /opt/mapr/bin/setup.sql script from the mysql> prompt to set up the database schema for the Metrics database:

mysql> source /opt/mapr/bin/setup.sql

6. Create a local mapr user on the MySQL server with the command CREATE USER 'mapr'@'localhost' IDENTIFIED BY 'mapr';
7. Grant read and write privileges to the mapr user with the following commands:

GRANT ALL PRIVILEGES ON metrics.* TO 'mapr'@'%';
FLUSH PRIVILEGES;
exit

8. On each node in the cluster that has the mapr-metrics package installed, specify your MySQL database parameters in one of the following ways:

To specify the MySQL database parameters from the command line, run the configure.sh script:

configure.sh -R -d <host>:<port> -du <database username> -dp <database password> -ds metrics



To specify the MySQL database parameters from the MapR Control System (MCS), click Navigation > System Settings > Metrics to display the Configure Metrics Database dialog. In the URL field, enter the hostname and port of the machine running the MySQL server. In the Username and Password fields, enter the username and password of the MySQL user.

The username you provide must have full permissions when logged in from any node in the cluster. When you change the Metrics configuration information from the initial settings, you must restart the hoststats service on each node that reports Metrics data. You can restart the hoststats service from the MCS or with the command maprcli node services -name hoststats -action restart.
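As an illustrative recap of step 8 (the hostname, port, and credentials below are placeholders, reusing the mapr/mapr user created above), the command-line variant might look like this, followed by a quick check that the metrics database is reachable from a cluster node:

# Point this node's Metrics reporting at the MySQL server (example values)
/opt/mapr/server/configure.sh -R -d mysqlnode:3306 -du mapr -dp mapr -ds metrics

# Verify that the mapr user can reach the metrics database
mysql -u mapr -p -h mysqlnode -e 'SHOW TABLES;' metrics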

Customizing the Database Name

You can change the name of the database from the default name of metrics by editing the setup.sql script before sourcing the script.

Checking the Services

Use the following steps to start the MapR Control System and check that all configured services are running:

1. Start the MapR Control System: in a browser, go to the following URL, replacing <host> with the hostname of the node that is running the WebServer: https://<host>:8443
2. Log in using the administrative username and password.
3. The first time you run the MapR Control System, you must accept the MapR Terms of Service. Click I Accept to proceed.
4. Under the Cluster Views group in the left pane, click Dashboard.
5. Check the Services pane and make sure each service is running the correct number of instances. For example: if you have configured 5 servers to run the CLDB service, you should see that 5 of 5 instances are running.

If one or more services have not started, wait a few minutes to see if the warden rectifies the problem. If not, you can try to start the services manually. See Managing Services.
If too few instances of a service have been configured, check that the service is installed on all appropriate nodes. If not, you can add the service to any nodes where it is missing. See Reconfiguring a Node.

Cluster Topology

The recommended procedure is to set up a physical topology as follows:


/data
    /rack1
    /rack2
    ...
    /rackn
/decommissioned

Then, set the default topology for volumes to /data. When a volume is created, MapR will properly replicate it across all racks, ensuring that copies exist on separate racks to guard against rack failure. The /decommissioned topology exists to allow you to automatically migrate data off nodes that you plan to remove from the cluster. When you move a node to /decommissioned, all its data is migrated back to the /data topology; once the node is empty, it can be removed without danger of data under-replication.

To set up the recommended topology:

1. Put all cluster nodes in a topology called /data, with each rack as a subtopology (example: /data/rack1, /data/rack2, and so on).
2. Set the default volume topology to /data:

maprcli config save -values '{"cldb.default.volume.topology":"/data"}'

If you need to segregate CLDB data, put the CLDB nodes in a topology called /cldb and change the topology for the CLDB volume (mapr.cldb.internal) to /cldb.

To decommission a node and avoid data under-replication, set the topology for the node to /decommissioned and wait for some time. All data should move off of that node silently. When the node is empty of all data, you can turn it off and remove it from the cluster permanently using the GUI or CLI.
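For example, the move can be performed with the same maprcli node move command used elsewhere in this guide (the server ID is a placeholder):

# Move a node to /decommissioned so its data drains back to /data
maprcli node move -serverids <serverid> -topology /decommissioned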

Volume Setup

Follow these guidelines to establish a basic set of volumes for a MapR cluster:

Top of the cluster: Create multiple small volumes with shallow pathnames at the top of the cluster.
Mirroring strategy: Mirror mapr.cluster.root and other volumes along short mount paths to provide high availability and high bandwidth, and to decrease the risk of data corruption for the most frequently accessed paths. Set up a mirroring policy and default re-replication schedule for the mirrors.
Basic volume layout: Create non-mirrored volumes for users or projects. Create these non-mirrored volumes below the layer of mirrored volumes.
HBase volume: Do not mirror the volume /hbase, which is used for all HBase data.

NFS Setup Requirements

Make sure that your cluster meets the following conditions before you start the MapR NFS gateway:

The stock Linux NFS service must not be running. Linux NFS and MapR NFS cannot run concurrently.
The portmapper service must be running. You can use the command ps a | grep portmap to check.
The mapr-nfs package must be present and installed. You can list the contents of the /opt/mapr/roles directory to check for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Adding a License.
Make sure the MapR NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.

MapR-NFS uses 64-bit inode numbers by default. You cannot use 32-bit clients or 32-bit applications running on 64-bit clients without forcing an inode conversion from 64 to 32 bits. Hashing 64-bit inodes down to 32 bits can potentially cause inum conflicts and is not advised. To force the NFS server to convert inode numbers to 32 bits, set the value of the Use32BitFileId property to 1 in the nfsserver.conf file.

When upgrading from MapR 1.2.7 to a higher version, you must stop and start the NFS servers, and re-mount all clients. The file handle changes between 1.x and 2.0, and the cached file IDs cannot persist across an upgrade.


NFS Memory Settings

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node.

NFS Memory Setting Example

...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services, unless you see specific memory-related problems occurring.

Running NFS on a Non-standard Port

To run NFS on an arbitrary port, modify the following line in warden.conf:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start

Add -p <portnumber> to the end of the line, as in the following example:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start -p 12345

After modifying warden.conf, restart the MapR NFS server by issuing the following command:

maprcli node services -nodes <nodename> -nfs restart

You can verify the port change with the rpcinfo -p localhost command.

High Availability NFS and Virtual IPs

You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses (VIPs); if one node fails, the VIP will be automatically reassigned to the next NFS node in the pool. If you do not specify a list of NFS nodes, then MapR uses any available node running the MapR NFS service. You can add a server to the pool simply by starting the MapR NFS service on it. Before following this procedure, make sure you are running NFS on the servers to which you plan to assign VIPs. You should install NFS on at least three nodes. If all NFS nodes are connected to only one subnet, then adding another NFS server to the pool is as simple as starting NFS on that server; the MapR cluster automatically detects it and adds it to the pool.

You can restrict VIP assignment to specific NFS nodes or MAC addresses by adding them to the NFS pool list manually. VIPs are not assigned to any nodes that are not on the list, regardless of whether they are running NFS. If the cluster's NFS nodes have multiple network interface cards (NICs) connected to different subnets, you should restrict VIP assignment to the NICs that are on the correct subnet: for each NFS server, choose whichever MAC address is on the subnet from which the cluster will be NFS-mounted, then add it to the list. If you add a VIP that is not accessible on the subnet, then failover will not work. You can only set up VIPs for failover between network interfaces that are in the same subnet. In large clusters with multiple subnets, you can set up multiple groups of VIPs to provide NFS failover for the different subnets.

You can set up VIPs with the virtualip add command, or using the Add Virtual IPs dialog in the MapR Control System. The Add Virtual IPs dialog lets you specify a range of virtual IP addresses and assign them to the pool of servers that are running the NFS service. The available servers are displayed in the left pane in the lower half of the dialog. Servers that have been added to the NFS VIP pool are displayed in the right pane in the lower half of the dialog.

To set up VIPs for NFS using the MapR Control System:

1. In the Navigation pane, expand the NFS HA group and click the NFS Setup view.
2. Click Start NFS to start the NFS Gateway service on nodes where it is installed.
3. Click Add VIP to display the Add Virtual IPs dialog.



4. Enter the start of the VIP range in the Starting IP field.
5. Enter the end of the VIP range in the Ending IP field. If you are assigning only one VIP, you can leave the field blank.
6. Enter the netmask for the VIP range in the Netmask field. Example: 255.255.255.0
7. If you wish to restrict VIP assignment to specific servers or MAC addresses:
a. If each NFS node has one NIC, or if all NICs are on the same subnet, select NFS servers in the left pane.
b. If each NFS node has multiple NICs connected to different subnets, select the server rows with the correct MAC addresses in the left pane.
8. Click Add to add the selected servers or MAC addresses to the list of servers to which the VIPs will be assigned. The servers appear in the right pane.
9. Click OK to assign the VIPs and exit.

Configuring Authentication

If you use Kerberos, LDAP, or another authentication scheme, make sure PAM is configured correctly to give MapR access. See PAM Configuration.

Configuring Email

MapR can notify users by email when certain conditions occur. There are three ways to specify the email addresses of MapR users:

From an LDAP directory
By domain
Manually, for each user

To configure email from an LDAP directory:

1. In the MapR Control System, expand the System Settings Views group and click Email Addresses to display the Configure Email Addresses dialog.
2. Select Use LDAP and enter the information about the LDAP directory into the appropriate fields.
3. Click Save to save the settings.

To configure email by domain:

1. In the MapR Control System, expand the System Settings Views group and click Email Addresses to display the Configure Email Addresses dialog.
2. Select Use Company Domain and enter the domain name in the text field.
3. Click Save to save the settings.

To configure email manually for each user:

1. Create a volume for the user.
2. In the MapR Control System, expand the MapR-FS group and click User Disk Usage.
3. Click the username to display the User Properties dialog.
4. Enter the user's email address in the Email field.
5. Click Save to save the settings.

Configuring SMTP

Use the following procedure to configure the cluster to use your SMTP server to send mail:

1. In the MapR Control System, expand the System Settings Views group and click SMTP to display the Configure Sending Email dialog.
2. Enter the information about how MapR will send mail:
Provider: assists in filling out the fields if you use Gmail.
SMTP Server: the SMTP server to use for sending mail.
This server requires an encrypted connection (SSL): specifies an SSL connection to SMTP.
SMTP Port: the SMTP port to use for sending mail.
Full Name: the name MapR should use when sending email. Example: MapR Cluster
Email Address: the email address MapR should use when sending email.
Username: the username MapR should use when logging on to the SMTP server.
SMTP Password: the password MapR should use when logging on to the SMTP server.

3. Click Test SMTP Connection.
4. If there is a problem, check the fields to make sure the SMTP information is correct.
5. Once the SMTP connection is successful, click Save to save the settings.

Configuring Permissions

By default, users are able to log on to the MapR Control System, but do not have permission to perform any actions. You can grant specific permissions to individual users and groups. See Managing Permissions.

Setting Quotas

Set default disk usage quotas. If needed, you can set specific quotas for individual users and groups. See Managing Quotas.

Configuring alarm notifications

If an alarm is raised on the cluster, MapR sends an email notification by default to the user associated with the object on which the alarm was raised. For example, if a volume goes over its allotted quota, MapR raises an alarm and sends email to the volume creator. You can configure MapR to send email to a custom email address in addition to or instead of the default email address, or not to send email at all, for each alarm type. See Notifications.

Isolating the CLDB Node(s)

If your deployment plan isolates the CLDB on dedicated nodes for performance reasons, you need to perform additional configuration steps.

In a large cluster (100 nodes or more) create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
a. PREPARE the node, making sure it meets the requirements.
b. ADD the MapR Repository.
2. CREATE a roles file for the node that lists only the following packages:
mapr-cldb
mapr-webserver
mapr-core
mapr-fileserver
3. INSTALL the services to your node.

To set up a volume topology that restricts the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:
maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology, using the MapR Control System or the following command:
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly, using the MapR Control System or the following command:
maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value, using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:
maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:
maprcli volume move -name <volume> -topology /defaultRack
All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.
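A minimal sketch of that default-topology setting, reusing the maprcli config save syntax shown earlier (the /defaultRack value is an example):

# Keep new volumes off the CLDB-only nodes by defaulting them to /defaultRack
maprcli config save -values '{"cldb.default.volume.topology":"/defaultRack"}'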


Central Configuration

MapR services can be configured globally across the cluster, from master configuration files stored in MapR-FS, eliminating the need to edit configuration files on all nodes individually. Each service has a corresponding file in /opt/mapr/servicesconf listing the configuration files it requires. A script called pullcentralconfig on each node periodically pulls updates from the master configuration files to the local disk:

If the master configuration file is newer, the local copy is overwritten by the master copy.
If the local configuration file is newer, no changes are made to the local copy.

The services must be restarted for any configuration changes to take effect; pullcentralconfig does not restart the services automatically.

To enable Central Configuration after upgrading to version 2.x from 1.2.x, see Rolling Upgrade.

The mapr.configuration volume (normally mounted at /var/mapr/configuration) contains directories with master configuration files:

Configuration files in the default directory are applied to all nodes.
Subdirectories in the nodes directory contain configuration files that are applied to individual nodes. To specify custom configuration for an individual node, create a directory corresponding to the hostname. For example, the configuration files in a directory named /var/mapr/configuration/nodes/host1.r1.nyc would only be applied to the machine with the hostname host1.r1.nyc.

The following parameters in warden.conf control whether central configuration is enabled, the path to the master configuration files, and how often pullcentralconfig runs:

centralconfig.enabled — Specifies whether to enable central configuration.
pollcentralconfig.interval.seconds — The frequency to check for configuration updates, in seconds.
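A minimal sketch of those warden.conf entries (the values shown are illustrative assumptions, not documented defaults):

centralconfig.enabled=true
pollcentralconfig.interval.seconds=300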





Working with Multiple Clusters

In order to mirror volumes between clusters, you must use configure.sh to create an additional entry in mapr-clusters.conf on the source cluster, for each cluster to which it will mirror. The command syntax for configure.sh is:

configure.sh -C <cluster 2 CLDB nodes> -N <cluster 2 name>

You can cross-mirror between clusters: to mirror some volumes from cluster A to cluster B and other volumes from cluster B to cluster A, you would set up mapr-clusters.conf as follows:

Entries in mapr-clusters.conf on cluster A nodes:
First line contains name and CLDB servers of cluster A
Second line contains name and CLDB servers of cluster B
Entries in mapr-clusters.conf on cluster B nodes:
First line contains name and CLDB servers of cluster B
Second line contains name and CLDB servers of cluster A

By creating additional entries, you can mirror from one cluster to several others.

Each cluster must already be set up and running, and must have a unique name. All nodes in each cluster should be able to resolve all other nodes (via DNS or entries in /etc/hosts).

To set up multiple clusters:

1. On each cluster, make a note of the cluster name and CLDB nodes (the first line in mapr-clusters.conf).
2. On each cluster's webserver node, add the remote cluster's CLDB nodes to /opt/mapr/conf/mapr-clusters.conf, using the following format:

clusterA 10.10.80.241:7222 10.10.80.242:7222 10.10.80.243:7222
clusterB 10.10.80.231:7222

3. On each cluster, restart the mapr-webserver service on all nodes where it is running.
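For instance, to add the clusterB entry shown above on cluster A's nodes, you could reuse the configure.sh syntax from the top of this section (the IP address is taken from the example file):

/opt/mapr/server/configure.sh -C 10.10.80.231:7222 -N clusterB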


Setting Up the Client

MapR provides several interfaces for working with a cluster from a client computer:

MapR Control System - manage the cluster, including nodes, volumes, users, and alarms
Direct Access NFS™ - mount the cluster in a local directory
MapR client - work with MapR Hadoop directly

Mac OS X
Red Hat/CentOS
SUSE
Ubuntu
Windows

MapR Control System

The MapR Control System is web-based, and works with the following browsers:

Chrome
Safari
Firefox 3.0 and above
Internet Explorer 7 and 8

To use the MapR Control System, navigate to the host that is running the WebServer in the cluster. MapR Control System access to the cluster is typically via HTTP on port 8080 or via HTTPS on port 8443; you can specify the protocol and port in the Configure HTTP dialog. You should disable pop-up blockers in your browser to allow MapR to open help links in new browser tabs.

Direct Access NFS™

You can mount a MapR cluster locally as a directory on a Mac, Linux, or Windows computer.

Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount. Example:
usa-node01:/mapr - for mounting from the command line
nfs://usa-node01/mapr - for mounting from the Mac Finder

Make sure the client machine has the appropriate username and password to access the NFS share. For best results, the username and password for accessing the MapR cluster should be the same username and password used to log into the client machine.

Automatically mounting NFS to MapRFS on a Cluster

To automatically mount NFS to MapRFS on the cluster my.cluster.com at the mount point /mapr2, add the following line to /opt/mapr/conf/mapr_fstab:

localhost:/mapr/my.cluster.com/user /mapr2 hard,nolock

Linux

1. Make sure the NFS client is installed. Examples:
sudo yum install nfs-utils (Red Hat or CentOS)
sudo apt-get install nfs-common (Ubuntu)
sudo zypper install nfs-client (SUSE)

2. List the NFS shares exported on the server. Example:
showmount -e usa-node01
3. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
4. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr

You can also add an NFS mount to /etc/fstab so that it mounts automatically when your system starts up. Example:


# device mountpoint fs-type options dump fsckorder
...
usa-node01:/mapr /mapr nfs rw 0 0
...

Mac

To mount the cluster from the Finder:

1. Open the Disk Utility: go to Applications > Utilities > Disk Utility.
2. Select File > NFS Mounts.
3. Click the + at the bottom of the NFS Mounts window.
4. In the dialog that appears, enter the following information:
Remote NFS URL: The URL for the NFS mount. If you do not know the URL, use the showmount command described below. Example: nfs://usa-node01/mapr
Mount location: The mount point where the NFS mount should appear in the local filesystem.
5. Click the triangle next to Advanced Mount Parameters.
6. Enter nolocks in the text field.
7. Click Verify.
8. Important: On the dialog that appears, click Don't Verify to skip the verification process.

The MapR cluster should now appear at the location you specified as the mount point.

To mount the cluster from the command line:

1. List the NFS shares exported on the server. Example:
showmount -e usa-node01
2. Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
3. Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr

Windows

Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by creating an empty file or directory in the volume's root directory).


With Windows NFS clients, use the -o nolock option on the NFS server to prevent the Linux NLM from registering with the portmapper. The native Linux NLM conflicts with the MapR NFS server.

To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise:

1. Open Start > Control Panel > Programs.
2. Select Turn Windows features on or off.
3. Select Services for NFS.
4. Click OK.
5. Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:

To mount the cluster on other Windows versions:

1. Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
2. Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can map local Windows users to cluster Linux users, if desired.
3. Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:

To map a network drive with the Map Network Drive tool:


1. Open Start > My Computer.
2. Select Tools > Map Network Drive.
3. In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
4. Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field. This name must follow UNC. Alternatively, click the Browse… button to find the correct folder by browsing available network shares.
5. Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
6. Click Finish.

See Accessing Data with NFS for more information.

MapR Client

The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you can submit Map/Reduce jobs and run hadoop fs and hadoop mfs commands. The MapR client is compatible with the following operating systems:

CentOS 5.5 or above
Mac OS X (Intel)
Red Hat Enterprise Linux 5.5 or above
Ubuntu 9.04 or above
SUSE Enterprise 11.1 or above
Windows 7 and Windows Server 2008

Do not install the client on a cluster node. It is intended for use on a computer that has no other MapR software installed. Do not install other MapR software on a MapR client computer. To run MapR CLI commands, establish an ssh session to a node in the cluster.

To configure the client, you will need the cluster name and the IP addresses and ports of the CLDB nodes on the cluster. The configure.sh configuration script has the following syntax:

Linux —

configure.sh [-N <cluster name>] -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...]

Windows —

server\configure.bat -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...]

Linux or Mac Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Windows Example:


server\configure.bat -c -C 10.10.100.1:7222

Installing the MapR Client on CentOS or Red Hat

The MapR Client supports Red Hat Enterprise Linux 5.5 or above.

1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.1.0/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

4. Remove any previous MapR software. You can use rpm -qa | grep mapr to get a list of installed MapR packages, then type the packages separated by spaces after the rpm -e command. Example:

rpm -qa | grep mapr
rpm -e mapr-fileserver mapr-core

5. Install the MapR client for your target architecture:

yum install mapr-client.i386

yum install mapr-client.x86_64

6. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on SUSE

The MapR Client supports SUSE Enterprise 11.1 or above.

1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.1.0/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

4. Remove any previous MapR software. You can use rpm -qa | grep mapr to get a list of installed MapR packages, then type the packages separated by spaces after the zypper rm command. Example:


rpm -qa | grep mapr
zypper rm mapr-fileserver mapr-core

5. Install the MapR client: zypper install mapr-client
6. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Ubuntu

The MapR Client supports Ubuntu 9.04 or above.

1. Change to the root user (or use sudo for the following commands).
2. Add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v2.1.0/ubuntu/ mapr optional
To install a previous release, see the Release Notes for the correct path to use.
3. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt.conf:

Acquire {
  Retries "0";
  HTTP {
    Proxy "http://<user>:<password>@<host>:<port>";
  };
};

4. Remove any previous MapR software. You can use dpkg -l | grep mapr to get a list of installed MapR packages, then type the packages separated by spaces after the dpkg -r command. Example:

dpkg -l | grep mapr
dpkg -r mapr-core mapr-fileserver

5. Update your Ubuntu repositories. Example:

apt-get update

6. Install the MapR client: apt-get install mapr-client
7. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Mac OS X

The MapR Client supports Mac OS X (Intel).

1. Download the archive package.mapr.com/releases/v2.1.0/mac/mapr-client-2.1.0.16877GA-1.x86_64.tar.gz
2. Open the Terminal application.
3. Create the /opt directory:
sudo mkdir -p /opt
4. Extract mapr-client-2.1.0.16877GA-1.x86_64.tar.gz into the /opt directory. Example:
sudo tar -C /opt -xvf mapr-client-2.1.0.16877GA-1.x86_64.tar.gz
5. Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:
sudo /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Windows

The MapR Client supports Windows 7 and Windows Server 2008.


1. Make sure Java is installed on the computer, and JAVA_HOME is set correctly.
2. Open the command line.
3. Create the directory \opt\mapr on your c: drive (or another hard drive of your choosing); either use Windows Explorer, or type the following at the command prompt:
mkdir c:\opt\mapr
4. Set MAPR_HOME to the directory you created in the previous step. Example:
SET MAPR_HOME=c:\opt\mapr
5. Navigate to MAPR_HOME:
cd %MAPR_HOME%
6. Download the correct archive into MAPR_HOME:

On a 64-bit Windows machine, download http://package.mapr.com/releases/v2.1.0/windows/mapr-client-2.1.0.16877GA-1.amd64.zip
On a 32-bit Windows machine, download http://package.mapr.com/releases/v2.1.0/windows/mapr-client-2.1.0.16877GA-1.x86.zip

7. Extract the archive:
On a 64-bit Windows machine: jar -xvf mapr-client-2.1.0.16877GA-1.amd64.zip
On a 32-bit Windows machine: jar -xvf mapr-client-2.1.0.16877GA-1.x86.zip

8. Run configure.bat to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:
server\configure.bat -c -C 10.10.100.1:7222

On the Windows client, you can run MapReduce jobs using the hadoop.bat command the way you would normally use the hadoop command. For example, to list the contents of a directory, instead of hadoop fs -ls you would type the following:
hadoop.bat fs -ls

Before running jobs on the Windows client, set the following properties in %MAPR_HOME%\hadoop\hadoop-<version>\conf\core-site.xml on the Windows machine to match the username, user ID, and group ID that have been set up for you on the cluster:

<property>
  <name>hadoop.spoofed.user.uid</name>
  <value>{UID}</value>
</property>
<property>
  <name>hadoop.spoofed.user.gid</name>
  <value>{GID}</value>
</property>
<property>
  <name>hadoop.spoofed.user.username</name>
  <value>{id of user who has UID}</value>
</property>

You can determine the correct uid and gid values for your username by logging into a cluster node and typing the id command. Example:

$ id
uid=1000(pconrad) gid=1000(pconrad) groups=4(adm),20(dialout),24(cdrom),46(plugdev),105(lpadmin),119(admin),122(sambashare),1000(pconrad)

On the Windows client, because the native Hadoop library is not present, the hadoop fs -getmerge command is not available.


Third Party Solutions

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Datameer

Datameer provides the world's first business intelligence platform built natively for Hadoop. Datameer delivers powerful, self-service analytics for the BI user through a simple spreadsheet UI, along with point-and-click data integration (ETL) and data visualization capabilities.

MapR provides a pre-packaged version of Datameer Analytics Solution ("DAS"). DAS is delivered as an RPM or Debian package.

See How to setup DAS on MapR to add the DAS package to your MapR environment.
Visit Demos for MapR to explore several demos included in the package that illustrate the usage of DAS in behavioral analytics and IT systems management use cases.
Check out the library of video tutorials with step-by-step walk-throughs on how to use DAS, and demo videos showing various applications.

If you have questions about using DAS, please visit the DAS documentation. For information about Datameer, please visit www.datameer.com.


Karmasphere

Karmasphere provides software products for data analysts and data professionals so they can unlock the power of Big Data in Hadoop, opening a whole new world of possibilities to add value to the business. Karmasphere equips analysts with the ability to discover new patterns, relationships, and drivers in any kind of data - unstructured, semi-structured, or structured - that were not possible to find before.

The Karmasphere Big Data Analytics product line supports the MapR distributions, M3 and M5 Editions, and includes:

Karmasphere Analyst, which provides data analysts immediate entry to structured and unstructured data on Hadoop, through SQL and other familiar languages, so that they can make ad-hoc queries, interact with the results, and iterate - without the aid of IT.
Karmasphere Studio, which provides developers that support analytic teams a graphical environment to analyze their MapReduce code as they develop custom analytic algorithms and systematize the creation of meaningful datasets for analysts.

To get started with Karmasphere Analyst or Studio:

Request a 30-day trial of Karmasphere Analyst or Studio for MapR
Learn more about Karmasphere Big Data Analytics products
View videos about Karmasphere Big Data Analytics products
Access technical resources
Read documentation for Karmasphere products

If you have questions about Karmasphere please email [email protected] or visit www.karmasphere.com.


HParser

HParser is a data transformation (data handler) environment optimized for Hadoop. This easy-to-use, codeless parsing software enables processing of any file format inside Hadoop with scale and efficiency. It provides Hadoop developers with out-of-the-box Hadoop parsing capabilities to address the variety and complexity of data sources, including logs, industry standards, documents, and binary or hierarchical data.

MapR has partnered with Informatica to provide the Community Edition of HParser:

The HParser package can be downloaded from Informatica as a Zip archive that includes the HParser engine, the Data Transformation HParser Jar file, HParser Studio, and the HParser Operator Guide.
The HParser engine is also available as an RPM via the MapR repository, making it easier to install the HParser Engine on all nodes in the cluster.

HParser can be installed on a MapR cluster running CentOS or Red Hat Enterprise Linux.

To install HParser on a MapR cluster:

1. Register on the Informatica site.
2. Download the Zip file containing the Community Edition of HParser, and extract it.
3. Familiarize yourself with the installation procedure in the HParser Operator Guide.
4. On each node, install HParser Engine from the MapR repository by typing the following command as root or with sudo:
yum install hparser-engine
5. Choose a Command Node, a node in the cluster from which you will issue HParser commands.
6. Following the instructions in the HParser Operator Guide, copy the HParser Jar file to the Command Node and create the HParser configuration file.


Troubleshooting Installation Issues

This section provides information about troubleshooting installation problems. Click a subtopic below for more detail.


Administration Guide

Welcome to the MapR Administration Guide! This guide is for system administrators tasked with managing MapR clusters. Topics include how to manage data by using volumes; how to monitor the cluster for performance; how to manage users and groups; how to add and remove nodes from the cluster; and more.

The focus of the Administration Guide is managing the nodes and services that make up a cluster. For details of fine-tuning MapR for specific jobs, see the Development Guide. The Administration Guide does not cover the details of installing MapR software on a cluster. See the Installation Guide for details on planning and installing a MapR cluster.

Click on one of the sub-sections below to get started.

Monitoring
  Alarms and Notifications
  Centralized Logging
  Monitoring Node Metrics
  Service Metrics
  Job Metrics
  Third-Party Monitoring Tools
Managing Data with Volumes
  Mirror Volumes
  Schedules
  Snapshots
Managing the Cluster
  Balancers
  Cluster Upgrade
  Disks
  Nodes
  Services
  Startup and Shutdown
  Uninstalling MapR
Users and Groups
  Managing Permissions
  Managing Quotas
Security
  PAM Configuration
  Secured TaskTracker
  Subnet Whitelist
Placing Jobs on Specified Nodes
Setting Up MapR NFS
Disaster Recovery
Troubleshooting Cluster Administration
  'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf
  Out of Memory Troubleshooting

Setting up a MapR Cluster on Amazon Elastic MapReduce


Monitoring

This section provides information about monitoring the cluster. Click a subtopic below for more details.

Alarms and Notifications
Centralized Logging
Monitoring Node Metrics
Service Metrics
Job Metrics
Third-Party Monitoring Tools
  Ganglia
  Nagios Integration


Alarms and Notifications

On a cluster with an M5 license, MapR raises alarms and sends notifications to alert you to information about the cluster:

Cluster health, including disk failures
Volumes that are under-replicated or over quota
Services not running

You can see any currently raised alarms in the Alarms Views of the MapR Control System, or using the alarm list command. For a list of all alarms, see the Alarms Reference.

To view cluster alarms using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Dashboard view.
2. All alarms for the cluster and its nodes and volumes are displayed in the Alarms pane.

To view node alarms using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Node Alarms view.

You can also view node alarms in the Node Properties view, the NFS Alarm Status view, and the Alarms pane of the Dashboard view.

To view volume alarms using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Volume Alarms view.

You can also view volume alarms in the Alarms pane of the Dashboard view.

Notifications

When an alarm is raised, MapR can send an email notification to either or both of the following addresses:

The owner of the cluster, node, volume, or entity for which the alarm was raised (standard notification)
A custom email address for the named alarm

You can set up alarm notifications using the alarm config save command or from the Alarms Views in the MapR Control System.

To set up alarm notifications using the MapR Control System:

1. In the Navigation pane, expand the Alarms group and click the Alarm Notifications view.
2. Display the Configure Alarm Subscriptions dialog by clicking Alarm Notifications.
3. For each Alarm:
To send notifications to the owner of the cluster, node, volume, or entity, select the Standard Notification checkbox.
To send notifications to an additional email address, type an email address in the Additional Email Address field.
4. Click Save to save the configuration changes.
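The same subscriptions can be scripted with the alarm config save command. A sketch, using the alarm name NODE_ALARM_DISK_FAILURE and an illustrative address; the value format (alarm name, standard-notification toggle, extra email) may vary by release, so check the alarm config save reference:

maprcli alarm config save -values "NODE_ALARM_DISK_FAILURE,1,ops@example.com"

Here 1 enables the standard notification to the owner, and ops@example.com receives the additional copy.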


Centralized Logging

Analyzing log files is an essential part of tracking and tuning Hadoop jobs and tasks. MapR's Centralized Logging feature, new with the v2.0 release, makes job analysis easier than it has ever been before.

MapR's Centralized Logging feature provides a job-centric view of all log files generated by tracker nodes throughout the cluster. During or after execution of a job, use the maprcli job linklogs command to create a centralized log directory populated with symbolic links to all log files related to tasks, map attempts, and reduce attempts pertaining to the specified job(s). If MapR-FS is mounted using NFS, you can use standard tools like grep and find to investigate issues which may be distributed across multiple nodes in the cluster.

Log files contain details such as which Mapper and Reducer tasks ran on which nodes; how many attempts were tried; and how long each attempt lasted. The distributed nature of MapReduce processing has historically created challenges for analyzing the execution of jobs, because Mapper and Reducer tasks are scattered throughout the cluster. Task-related logs are written out by task trackers running on distributed nodes, and each node might be processing tasks for multiple jobs simultaneously. Without Centralized Logging, a user with access to all nodes would need to access all log files created by task trackers, filter out information from unrelated jobs, and then merge together log details in order to get a complete picture of job execution. Centralized Logging automates all of these steps.

Usage

Use the maprcli to initiate Centralized Logging:

maprcli job linklogs -jobid <jobPattern> -todir <maprfsDir> [ -jobconf <pathToJobXml> ]

The following directory structure will be created under the specified <maprfsDir> directory for all jobids matching <jobPattern>:

<jobid>/hosts/<host>/ contains symbolic links to log directories of tasks executed for <jobid> on <host>
<jobid>/mappers/ contains symbolic links to log directories of all map task attempts for <jobid> across the whole cluster
<jobid>/reducers/ contains symbolic links to log directories of all reduce task attempts for <jobid> across the whole cluster

You can use any glob prefixed with job; otherwise, job is automatically prepended. There is just one match if the full job ID is used.

This command uses the centralized job history location as specified in your current configuration by mapred.job.tracker.history.completed.location, by default /var/mapr/cluster/mapred/jobTracker/history/done. If the location has changed since the job(s) of interest were run, you can supply the optional jobconf parameter.

Examples

maprcli job linklogs -jobid job_201204041514_0001 -todir /myvolume/joblogviewdir
links the logs of a single job, job_201204041514_0001.
maprcli job linklogs -jobid job_${USER} -todir /myvolume/joblogviewdir
links the logs of all jobs by the current shell user.
maprcli job linklogs -jobid job_*_wordcount1 -todir /myvolume/joblogviewdir
links the logs of all jobs named wordcount1.
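Once the links exist and MapR-FS is NFS-mounted (at an illustrative mount point such as /mapr/MyCluster), ordinary shell tools can sweep a whole job's logs in one pass. A sketch; grep's -R flag is used so the symbolic links are followed:

grep -R "FATAL" /mapr/MyCluster/myvolume/joblogviewdir/job_201204041514_0001/mappers/
find /mapr/MyCluster/myvolume/joblogviewdir/job_201204041514_0001/hosts/ -name "syslog"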

Enabling/Disabling Centralized Logging

This feature is controlled by the definition of HADOOP_TASKTRACKER_ROOT_LOGGER in hadoop-env.sh. The default value is INFO,maprfsDRFA, implying the feature is on. Changing the value to <anyLoggingLevel>,DRFA turns Centralized Logging off. Restart the TaskTracker for the change to take effect:

maprcli node services -tasktracker restart -nodes <list of nodes>


Monitoring Node Metrics

You can examine fine-grained analytics information about the nodes in your cluster by using the Node Metrics API with the maprcli command-line tool. You can use this information to examine specific aspects of your node's performance at a very granular level. The node metrics API returns data as a table sent to your terminal's standard output or as a JSON file. The JSON file includes in-depth reports on the activity on each CPU, disk partition, and network interface in your node.

Node metrics cover the following general categories:

CPU time used
Memory used
RPC activity
Process activity
Storage used
TaskTracker resources used
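A sketch of a typical invocation follows; the node name and epoch time range are illustrative, and the exact parameter set may vary by release (see the node metrics reference):

maprcli node metrics -nodes node1 -start 1329344700 -end 1329348300 -json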


Service Metrics

MapR services produce metrics that can be written to an output file or consumed by Ganglia. The file metrics output is directed by the hadoop-metrics.properties files.

By default, the CLDB and FileServer metrics are sent via unicast to the Ganglia gmon server running on localhost. To send the metrics directly to a Gmeta server, change the cldb.servers property to the hostname of the Gmeta server. To send the metrics to a multicast channel, change the cldb.servers property to the IP address of the multicast channel.

Metrics Collected

Below are the kinds of metrics that can be collected.

CLDB:
Number of FileServers
Number of Volumes
Number of Containers
Cluster Disk Space Used GB
Cluster Disk Space Available GB
Cluster Disk Capacity GB
Cluster Memory Capacity MB
Cluster Memory Used MB
Cluster Cpu Busy %
Cluster Cpu Total
Number of FS Container Failure Reports
Number of Client Container Failure Reports
Number of FS RW Container Reports
Number of Active Container Reports
Number of FS Volume Reports
Number of FS Register
Number of container lookups
Number of container assign
Number of container corrupt reports
Number of rpc failed
Number of rpc received

FileServers:
FS Disk Used GB
FS Disk Available GB
Cpu Busy %
Memory Total MB
Memory Used MB
Memory Free MB
Network Bytes Received
Network Bytes Sent

Setting Up Service Metrics

To configure metrics for a service:

1. Edit the appropriate hadoop-metrics.properties file on all CLDB nodes, depending on the service:
For MapR-specific services, edit /opt/mapr/conf/hadoop-metrics.properties
For standard Hadoop services, edit /opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties
2. In the sections specific to the service:
Un-comment the lines pertaining to the context to which you wish the service to send metrics.
Comment out the lines pertaining to other contexts.
3. Restart the service.

To enable service metrics:

As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}'
maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}'

To disable service metrics:

As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"0"}'
maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"0"}'

Example


In the following example, CLDB service metrics will be sent to the Ganglia context:

# CLDB metrics config - Pick one out of null, file or ganglia.
# Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context

# Configuration of the "cldb" context for null
#cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#cldb.period=10

# Configuration of the "cldb" context for file
#cldb.class=org.apache.hadoop.metrics.file.FileContext
#cldb.period=60
#cldb.fileName=/tmp/cldbmetrics.log

# Configuration of the "cldb" context for ganglia
cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=localhost:8649
cldb.spoof=1


Job Metrics

The MapR Metrics service collects and displays analytics information about the Hadoop jobs, tasks, and task attempts that run on the nodes in your cluster. You can use this information to examine specific aspects of your cluster's performance at a very granular level, enabling you to monitor how your cluster responds to changing workloads and optimize your Hadoop jobs or cluster configuration. The analytics information collected by the MapR Metrics service is stored in a MySQL database. The server running MySQL does not have to be a node in the cluster, but the nodes in your cluster must have access to the server.

The MapR Control System presents the jobs running on your cluster and the tasks that make up a specific job as a sortable list, along with histograms and line charts that represent the distribution of a particular metric. You can sort the list by the metric you're interested in to quickly find any outliers, then display specific detailed information about a job or task attempt that you want to learn more about. The filtering capabilities of the MapR Control System enable you to narrow down the display of data to the ranges you're interested in.

The MapR Control System displays data using histograms (for jobs) and line charts (for jobs and task attempts). All histograms and charts are implemented in HTML5, CSS and JavaScript to enable display on your browser or mobile device without requiring plug-ins. The histograms presented by MapR Metrics divide continuous data, such as a range of job durations, into a sequence of discrete bins. For example, a range of durations from 0 to 10000 seconds could be presented as 20 individual bins that cover a 500-second band each. The height of the histogram's bar for each bin represents the number of jobs with a duration in the bin's range. The line charts in MapR Metrics display the trend over time for the value of a specific metric.

An M3 license for MapR displays basic information. The M5 license provides sophisticated graphs and histograms, providing access to trends and detailed statistics. Either license provides access to MapR Metrics from the MapR Control System and the command line interface.

The job metrics cover the following categories:

Cluster resource use (CPU and memory)
Duration
Task count (map, reduce, failed map, failed reduce)
Map rates (record input and output, byte input and output)
Reduce rates (record input and output, shuffle bytes)
Task attempt counts (map, reduce, failed map, failed reduce)
Task attempt durations (average map, average reduce, maximum map, maximum reduce)

The task attempt metrics cover the following categories:

Times (task attempt duration, garbage collection time, CPU time)
Local byte rate (read and written)
MapR-FS byte rate (read and written)
Memory usage (bytes of physical and virtual memory)
Record rates (map input, map output, reduce input, reduce output, skipped, spilled, combined input, combined output)
Reduce task attempt input groups
Reduce task attempt shuffle bytes

Example: Using MapR Metrics To Diagnose a Faulty Network Interface Card (NIC)

In this example, a node in your cluster has a NIC that is intermittently failing. This condition is leading to abnormally long task completion times due to that node being occasionally unreachable. In the Metrics interface, you can display a job's average and maximum task attempt durations for both map and reduce attempts. A high variance between the average and maximum attempt durations suggests that some task attempts are taking an unusually long time. You can sort the list of jobs by maximum map task attempt duration to find jobs with such an unusually high variance.

Click the name of a job to display information about the job's tasks, then sort the task attempt list by duration to find the outliers. Because the list of tasks includes information about the node the task is running on, you can see that several of these unusually long-running task attempts are assigned to the same node. This information suggests that there may be an issue with that specific node that is causing task attempts to take longer than usual.

When you display summary information for that node, you can see that the Network I/O speeds are lower than the speeds for other similarly configured nodes in the cluster. You can use that information to examine the node's network I/O configuration and hardware and diagnose the specific cause.


Third-Party Monitoring Tools

MapR works with the following third-party monitoring tools:

Ganglia
Nagios


Ganglia

Ganglia is a scalable distributed system monitoring tool that allows remote viewing of live or historical statistics for a cluster. The Ganglia system consists of the following components:

A PHP-based web front end
Ganglia monitoring daemon (gmond): a multi-threaded monitoring daemon
Ganglia meta daemon (gmetad): a multi-threaded aggregation daemon
A few small utility programs

The gmetad daemon aggregates metrics from the gmond instances, storing them in a database. The front end pulls metrics from the database and graphs them. You can aggregate data from multiple clusters by setting up a separate gmetad for each, and then a master gmetad to aggregate data from the others. If you configure Ganglia to monitor multiple clusters, remember to use a separate port for each cluster.

MapR with Ganglia

The CLDB reports metrics about its own load, as well as cluster-wide metrics such as CPU and memory utilization, the number of active FileServer nodes, the number of volumes created, etc. For a complete list of metrics, see Service Metrics.

MapRGangliaContext collects and sends CLDB metrics, FileServer metrics, and cluster-wide metrics to Gmon or Gmeta, depending on the configuration. On the Ganglia front end, these metrics are displayed separately for each FileServer by hostname. The Ganglia monitor only needs to be installed on CLDB nodes to collect all the metrics required for monitoring a MapR cluster. To monitor other services such as HBase and MapReduce, install Gmon on nodes running the services and configure them as you normally would.

The Ganglia properties for the cldb and fileserver contexts are configured in the file $INSTALL_DIR/conf/hadoop-metrics.properties. Any changes to this file require a CLDB restart.

Installing Ganglia

To install Ganglia on Ubuntu:

1. On each CLDB node, install ganglia-monitor: sudo apt-get install ganglia-monitor
2. On the machine where you plan to run the Gmeta daemon, install gmetad: sudo apt-get install gmetad
3. On the machine where you plan to run the Ganglia front end, install ganglia-webfrontend: sudo apt-get install ganglia-webfrontend

To install Ganglia on Red Hat:

1. Download the following RPM packages for Ganglia version 3.1 or later:
ganglia-gmond
ganglia-gmetad
ganglia-web
2. On each CLDB node, install ganglia-gmond: rpm -ivh <ganglia-gmond>
3. On the machine where you plan to run the Ganglia meta daemon, install gmetad: rpm -ivh <gmetad>
4. On the machine where you plan to run the Ganglia front end, install ganglia-web: rpm -ivh <ganglia-web>

For more details about Ganglia configuration and installation, see the Ganglia documentation.

To start sending CLDB metrics to Ganglia:

1. Make sure the CLDB is configured to send metrics to Ganglia (see Service Metrics).
2. As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}'
maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}'

To stop sending CLDB metrics to Ganglia:

As root (or using sudo), run the following commands:

maprcli config save -values '{"cldb.ganglia.cldb.metrics":"0"}'
maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"0"}'


Nagios Integration

Nagios is an open-source cluster monitoring tool. MapR can generate a Nagios Object Definition File that describes the nodes in the cluster and the services running on each. You can generate the file using the MapR Control System or the nagios generate command, then save the file in the proper location in your Nagios environment.

MapR recommends Nagios version 3.3.1 and version 1.4.15 of the plugins.

To generate a Nagios file using the MapR Control System:

1. In the Navigation pane, click Nagios.
2. Copy and paste the output, and save as the appropriate Object Definition File in your Nagios environment.
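From the command line, the equivalent is to redirect the generated definitions to a file; the target path below is illustrative and depends on your Nagios layout:

maprcli nagios generate > /etc/nagios/objects/mapr.cfg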

For more information, see the Nagios documentation.


Managing Data with Volumes

MapR provides volumes as a way to organize data and manage cluster performance. A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. A well-structured volume hierarchy is an essential aspect of your cluster's performance. As your cluster grows, keeping your volume hierarchy efficient maximizes your data's availability. Without a volume structure in place, your cluster's performance will be negatively affected. This section discusses fundamental volume concepts.

You can use volumes to enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. Create a volume for each user, department, or project. You can mount volumes under other volumes to build a structure that reflects the needs of your organization. The volume structure defines how data is distributed across the nodes in your cluster. Create multiple small volumes with shallow paths at the top of your cluster's volume hierarchy to spread the load of access requests across the nodes.

On a cluster with an M5 license, you can create a special type of volume called a mirror, a local or remote read-only copy of an entire volume. Mirrors are useful for load balancing or disaster recovery.

With an M5 license, you can also create a snapshot, an image of a volume at a specific point in time. Snapshots are useful for rollback to a known data set. You can create snapshots and synchronize mirrors manually or using a schedule.

MapR lets you control and configure volumes in a number of ways:

Replication - set the number of physical copies of the data, for robustness and performance
Topology - restrict a volume to certain physical racks or nodes (requires M5 license and m permission on the volume)
Quota - set a hard disk usage limit for a volume (requires M5 license)
Advisory Quota - receive a notification when a volume exceeds a soft disk usage quota (requires M5 license)
Ownership - set a user or group as the accounting entity for the volume
Permissions - give users or groups permission to perform specified volume operations
File Permissions - Unix-style read/write permissions on volumes

Volumes are stored as pieces called containers that contain files, directories, and other data. Containers are replicated to protect data. There are normally three copies of each container stored on separate nodes to provide uninterrupted access to all data even if a node fails. For each volume, you can specify a desired replication factor and a minimum replication factor:

The desired replication factor is the number of replicated copies you want to keep for normal operation and data protection. When the number of copies falls below the desired replication factor, but remains equal to or above the minimum replication factor, re-replication occurs after the timeout specified in the cldb.fs.mark.rereplicate.sec parameter (configurable using the config API).
The minimum replication factor is the absolute minimum number of copies you can tolerate. When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level.

If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication. If a disk failure is detected, any data stored on the failed disk is re-replicated without regard to the timeout specified in the cldb.fs.mark.rereplicate.sec parameter.
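For example, to adjust these thresholds on an existing volume from the command line, a sketch along these lines should work (the volume name is illustrative, and the -minreplication option is assumed to be present on your release):

maprcli volume modify -name project-data -replication 3 -minreplication 2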

The following sections describe procedures associated with volumes:

To create a new volume, see Creating a Volume (requires cv permission on the volume)
To view a list of volumes, see Viewing a List of Volumes
To view a single volume's properties, see Viewing Volume Properties
To modify a volume, see Modifying a Volume (requires m permission on the volume)
To mount a volume, see Mounting a Volume (requires mnt permission on the volume)
To unmount a volume, see Unmounting a Volume (requires m permission on the volume)
To remove a volume, see Removing a Volume (requires d permission on the volume)
To set volume topology, see Setting Volume Topology (requires m permission on the volume)

See also:

Mirror Volumes
Snapshots
Schedules

Creating a Volume

When creating a volume, the only required parameters are the volume type (normal or mirror) and the volume name. You can set the ownership, permissions, quotas, and other parameters at the time of volume creation, or use the Volume Properties dialog to set them later. If you plan to schedule snapshots or mirrors, it is useful to create a schedule ahead of time; the schedule will appear in a drop-down menu in the Volume Properties dialog.

By default, the root user and the volume creator have full control permissions on the volume. You can grant specific permissions to other users and groups:


Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

You can create a volume using the volume create command, or use the following procedure to create a volume using the MapR Control System.

To create a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Use the Volume Type radio button at the top of the dialog to choose whether to create a standard volume, a local mirror, or a remote mirror.
4. Type a name for the volume or source volume in the Volume Name or Mirror Name field.
5. If you are creating a mirror volume:
a. Type the name of the source volume in the Source Volume Name field.
b. If you are creating a remote mirror volume, type the name of the cluster where the source volume resides in the Source Cluster Name field.
6. You can set a mount path for the volume by typing a path in the Mount Path field.
7. You can specify which rack or nodes the volume will occupy by selecting a topology from the Topology drop-down selector.
8. You can set permissions using the fields in the Ownership & Permissions section:
a. Click [ + Add Permission ] to display fields for a new permission.
b. In the left field, type either u: and a user name, or g: and a group name.
c. In the right field, select permissions to grant to the user or group.
9. You can associate a standard volume with an accountable entity and set quotas in the Usage Tracking section:
a. In the Group/User field, select User or Group from the dropdown menu and type the user or group name in the text field.
b. To set an advisory quota, select the checkbox beside Volume Advisory Quota and type a quota (in megabytes) in the text field.
c. To set a quota, select the checkbox beside Volume Quota and type a quota (in megabytes) in the text field.
10. You can set the replication factor and choose a snapshot or mirror schedule in the Replication and Snapshot section:
a. Type the desired replication factor in the Replication Factor field. When the number of replicas drops below this threshold, the volume is re-replicated after a timeout period (configurable with the cldb.fs.mark.rereplicate.sec parameter using the config API).
b. Type the minimum replication factor in the Minimum Replication field. When the number of replicas drops below this threshold, the volume is aggressively re-replicated to bring it above the minimum replication factor.
c. To schedule snapshots or mirrors, select a schedule from the Snapshot Schedule dropdown menu or the Mirror Update Schedule dropdown menu respectively.
11. Click OK to create the volume.
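The command-line route covers the same ground in one line. A minimal sketch with illustrative names (see the volume create reference for the full option list):

maprcli volume create -name project-data -path /projects/project-data -replication 3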

Viewing a List of Volumes

You can view all volumes using the volume list command, or view them in the MapR Control System using the following procedure.

To view all volumes using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Volumes view.

Viewing Volume Properties

You can view volume properties using the volume info command, or use the following procedure to view them using the MapR Control System.

To view the properties of a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name, then clicking the Properties button.
3. After examining the volume properties, click Close to exit without saving changes to the volume.

Modifying a Volume

You can modify any attributes of an existing volume, except for the following restriction:


Mirror and normal volumes cannot be converted to the other type.

You can modify a volume using the volume modify command, or use the following procedure to modify a volume using the MapR Control System.

To modify a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name, then clicking the Properties button.
3. Make changes to the fields. See Creating a Volume for more information about the fields.
4. After examining the volume properties, click Modify Volume to save changes to the volume.

Mounting a Volume

You can mount a volume using the volume mount command, or use the following procedure to mount a volume using the MapR Control System.

To mount a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to mount.
3. Click the Mount button.

You can also mount or unmount a volume using the Mounted checkbox in the Volume Properties dialog. See Modifying a Volume for more information.
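For scripted use, the command-line equivalent is a one-liner; the volume name and path below are illustrative:

maprcli volume mount -name project-data -path /projects/project-data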

Unmounting a Volume

You can unmount a volume using the volume unmount command, or use the following procedure to unmount a volume using the MapR Control System.

To unmount a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to unmount.
3. Click the Unmount button.

You can also mount or unmount a volume using the Mounted checkbox in the Volume Properties dialog. See Modifying a Volume for more information.

Removing a Volume or Mirror

You can remove a volume using the volume remove command, or use the following procedure to remove a volume using the MapR Control System.

To remove a volume or mirror using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the checkbox next to the volume you wish to remove.
3. Click the Remove button to display the Remove Volume dialog.
4. In the Remove Volume dialog, click the Remove Volume button.

Setting Volume Topology

You can place a volume on specific racks, nodes, or groups of nodes by setting its topology to an existing node topology.

Your node topology describes the locations of nodes and racks in a cluster to the MapR system. The MapR software uses node topology to determine the location of replicated copies of data. Optimally defined cluster topology results in data being replicated to separate racks, resulting in continued data availability in the event of rack or node failure.

Define your cluster's topology by specifying a topology path for each node in the cluster. The paths group nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data.

For more information about node topology, see Node Topology.

To set volume topology, choose the path that corresponds to the node topology of the rack or nodes where you would like the volume to reside. You can set volume topology using the MapR Control System or with the volume move command.


To set volume topology using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name or by selecting the checkbox beside the volume name, then clicking the Properties button.
3. Click Move Volume to display the Move Volume dialog.
4. Select a topology path that corresponds to the rack or nodes where you would like the volume to reside.
5. Click Move Volume to return to the Volume Properties dialog.
6. Click Modify Volume to save changes to the volume.

Setting Default Volume Topology

By default, new volumes are created with a topology of / (root directory). To change the default topology, use the config save command to change the cldb.default.volume.topology configuration parameter. Example:

maprcli config save -values "{\"cldb.default.volume.topology\":\"/default-rack\"}"

After running the above command, new volumes have the /default-rack volume topology by default.

Example: Setting Up CLDB-Only Nodes

In a large cluster (100 nodes or more), create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
a. PREPARE the node, making sure it meets the requirements.
b. ADD the MapR Repository.
2. CREATE a roles file for the node that lists only the following packages:
mapr-cldb
mapr-webserver
mapr-core
mapr-fileserver
3. INSTALL the services to your node.

To set up a volume topology that restricts the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:
maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology. Use the MapR Control System or the following command:
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly using the MapR Control System or the following command:
maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:
maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:
maprcli volume move -name <volume> -topology /defaultRack

All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.


Mirror Volumes

A mirror volume is a read-only physical copy of another volume, the source volume. You can use mirror volumes in the same cluster (local mirroring) to provide local load balancing by using mirror volumes to serve read requests for the most frequently accessed data in the cluster. You can also mirror volumes on a separate cluster (remote mirroring) for backup and disaster readiness purposes.

Once you've created a mirror volume, keeping your mirror synchronized with its source volume is fast. Because mirror operations are based on a snapshot of the source volume, your source volume remains available for read and write operations for the entire duration of the process.

Mirroring Overview

Creating a mirror volume is similar to creating a normal read/write volume. However, when you create a mirror volume, you must specify a source volume that the mirror retrieves content from. This retrieval is called the mirroring operation.

The MapR system creates a temporary snapshot of the source volume at the start of a mirroring operation. The mirroring process reads content from the snapshot into the mirror volume. The source volume remains available for read and write operations during the mirroring process. If the mirroring operation is schedule-based, the snapshot expires according to the value of the schedule's Retain For parameter. Snapshots created during manual mirroring persist until they are deleted manually.

The mirroring process transmits differences between the source volume and the mirror. The initial mirroring operation copies the entire source volume, but subsequent mirroring operations can be extremely fast.

Mirroring is extremely resilient. In the case of a network partition, where some or all of the machines that host the source volume cannot communicate with the machines that host the mirror volume, the mirroring operation periodically retries the connection. Once the network is restored, the mirroring operation resumes.

The root volume contains a writable volume link, .rw. Use this link to navigate to the source volume instead of the mirror. When the root volume is mirrored, the / mount path refers to one of the root volume's mirrors, and is read-only. The /.rw mount path refers to the source volume, and is read/write.

A mount path that consists entirely of mirrored volumes refers to a mirrored copy of the target volume. When a mount path contains volumes that are not mirrored, the path refers to the target volume directly.

Example Volume Topology with Mirrors

For the four volumes /, a, b, and c, the following table indicates the volumes referred to by example mount paths for particular combinations of mirrored and not mirrored volumes in the path:

/ | a | b | c | This Path | Refers To This Volume... | Which is...
Mirrored | Mirrored | Mirrored | Mirrored | /a/b/c | Mirror of c | Read-only
Mirrored | Mirrored | Mirrored | Mirrored | /.rw/a/b/c | c directly | Read/Write
Mirrored | Mirrored | Not Mirrored | Mirrored | /a/b/c | c directly | Read/Write
Mirrored | Mirrored | Not Mirrored | Mirrored | /a | Mirror of a | Read-only
Not Mirrored | Mirrored | Mirrored | Mirrored | /a/b/c | c directly | Read/Write

Working with Mirrors

You can automate mirror synchronization by setting a schedule. You can also use the volume mirror start command to synchronize data manually.

Completion time for a mirroring operation is affected by available network bandwidth and the amount of data to transmit. Initializing a mirror requires creating a full copy of the source volume, which can take some time if the source volume is large.

For best performance, set the mirroring schedule according to the anticipated rate of data changes and the available bandwidth for mirroring.

Mirror Cascades

In a cascade, one mirror synchronizes to the source volume, and each successive mirror uses a previous mirror as its source. Mirror cascades are useful for propagating data over a distance, then re-propagating the data locally instead of transferring the same data remotely again for each copy of the mirror. In the example below, the < character indicates a mirror's source:

/ < mirror1 < mirror2 < mirror3
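Because each mirror simply names the previous mirror as its source, a cascade can be built with successive volume create calls. A sketch with illustrative names, using the -type 1 mirror syntax shown under Local Mirroring below:

maprcli volume create -name mirror1 -source mapr.cluster.root -type 1
maprcli volume create -name mirror2 -source mirror1 -type 1
maprcli volume create -name mirror3 -source mirror2 -type 1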


A mirror cascade makes more efficient use of your cluster's network bandwidth, but synchronization can be slower to propagate through the chain. For cases where synchronization of mirrors is a higher priority than network bandwidth optimization, make each mirror read directly from the source volume:

mirror1 > / < mirror2
mirror3 > / < mirror4

You can create or break a mirror cascade made from existing mirror volumes by changing the source volume of each mirror in the Volume Properties dialog.

Other Mirror Operations

For more information on mirror volume operations, see the following sections:

You can set the topology of a mirror volume to determine the placement of the data.
You can change a mirror's source volume by changing the source volume in the Volume Properties dialog.
To create a new mirror volume, refer to Creating a Volume (requires M5 license and cv permission)
To modify a mirror (including changing its source), see Modifying a Volume
To remove a mirror, see Removing a Volume or Mirror

Local Mirroring

A local mirror volume is a mirror volume whose source is on the same cluster. Local mirror volumes are useful for load balancing or for providing a read-only copy of a data set.

You can locate your local mirror volumes in specific servers or on racks with particularly high bandwidth, mounted in a public directory separate from the source volume.

The most frequently accessed volumes in a cluster are likely to be the root volume and its immediate children. In order to load-balance read operations on these volumes, mirror the root volume (typically mapr.cluster.root, which is mounted at /). By mirroring these volumes, read requests can be served from the mirrors, distributing load across the nodes. Less-frequently accessed volumes that are lower in the hierarchy do not need mirror volumes. Since the mount paths for those volumes are not mirrored throughout, those volumes are writable.

To create a local mirror using the MapR Control System:

1. Log on to the MapR Control System.
2. In the navigation pane, select MapR-FS > Volumes.
3. Click the New Volume button.
4. In the New Volume dialog, specify the following values:
a. Select Local Mirror Volume.
b. Enter a name for the mirror volume in the Volume Name field. If the mirror is on the same cluster as the source volume, the source and mirror volumes must have different names.
c. Enter the source volume name (not mount point) in the Source Volume field.
d. To automate mirroring, select a schedule from the Mirror Update Schedule dropdown menu.

To create a local mirror using the volume create command:

1. Connect via ssh to a node on the cluster where you wish to create the mirror.
2. Use the volume create command to create the mirror volume. Specify the source volume name, provide a name for the mirror volume, and specify a type of 1. Example:

maprcli volume create -name volume-a -source volume-a -type 1 -schedule 2

Remote Mirroring

A remote mirror volume is a mirror volume with a source in another cluster. You can use remote mirrors for offsite backup, for data transfer to remote facilities, and for load and latency balancing for large websites. By mirroring the cluster's root volume and all other volumes in the cluster, you can create an entire mirrored cluster that keeps in sync with the source cluster.

Backup mirrors for disaster recovery can be located on physical media outside the cluster or in a remote cluster. In the event of a disaster affecting the source cluster, you can check the time of last successful synchronization to determine how current the backup is (see Mirror Status below).

Creating Remote Mirrors


Creating remote mirrors is similar to creating local mirrors, except that the source cluster name must also be specified.

To create a remote mirror using the MapR Control System:

1. Log on to the MapR Control System.
2. Check the Cluster Name (near the MapR logo). If you are not connected to the cluster on which you wish to create a mirror:
a. Click the [+] next to the Cluster Name.
b. In the Available Clusters dialog, click the name of the desired cluster.
c. In the Launching Web Interface dialog, click the desired cluster again to connect.
3. In the navigation pane, select MapR-FS > Volumes.
4. Click the New Volume button.
5. In the New Volume dialog, specify the following values:
a. Select Local Mirror Volume or Remote Mirror Volume.
b. Enter a name for the mirror volume in the Volume Name field. If the mirror is on the same cluster as the source volume, the source and mirror volumes must have different names.
c. Enter the source volume name (not mount point) in the Source Volume field.
d. Enter the source cluster name in the Source Cluster field.
e. To automate mirroring, select a schedule from the Mirror Update Schedule dropdown menu.

To create a remote mirror using the command:volume create

1. Connect to a node on the cluster where you wish to create the mirror.
2. Use the volume create command to create the mirror volume. Specify the source volume and cluster in the format <volume>@<cluster>, provide a name for the mirror volume, and specify a type of 1. Example:

maprcli volume create -name volume-a -source volume-a@cluster-1 -type 1 -schedule 2

Moving Large Amounts of Data to a Remote Cluster

Use the volume dump create command to create volume copies for transport on physical media. The volume dump create command creates backup files containing the volumes, which can be reconstituted into mirrors at the remote cluster with the volume dump restore command. Associate these mirrors with their source volumes with the volume modify command to re-establish synchronization.
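A minimal sketch of this workflow follows; the volume, cluster, and file names are hypothetical, and the parameter names (-name, -dumpfile, -source) are assumptions consistent with the command descriptions above rather than a verbatim reference:

# On the source cluster, write the volume to a dump file for transport
maprcli volume dump create -name volume-a -dumpfile volume-a.dump
# On the remote cluster, reconstitute the dump file into a mirror volume
maprcli volume dump restore -name volume-a-mirror -dumpfile volume-a.dump
# Associate the new mirror with its source to re-establish synchronization
maprcli volume modify -name volume-a-mirror -source volume-a@cluster-1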

Working with Multiple Clusters

In order to mirror volumes between clusters, you must use configure.sh to create an additional entry in mapr-clusters.conf on the source cluster for each cluster to which it will mirror. The command syntax for configure.sh is:

configure.sh -C <cluster 2 CLDB nodes> -N <cluster 2 name>
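For example, a hypothetical invocation on cluster A registering a remote cluster named clusterB with two CLDB nodes might look like this (host names are placeholders; a comma-separated CLDB node list is an assumption):

/opt/mapr/server/configure.sh -C nodeB1:7222,nodeB2:7222 -N clusterB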

You can cross-mirror between clusters: to mirror some volumes from cluster A to cluster B and other volumes from cluster B to cluster A, you would set up mapr-clusters.conf as follows:

Entries in mapr-clusters.conf on cluster A nodes:
   First line contains the name and CLDB servers of cluster A
   Second line contains the name and CLDB servers of cluster B

Entries in mapr-clusters.conf on cluster B nodes:
   First line contains the name and CLDB servers of cluster B
   Second line contains the name and CLDB servers of cluster A

By creating additional entries, you can mirror from one cluster to several others.

Each cluster must already be set up and running, and must have a unique name. All nodes in each cluster should be able to resolve all other nodes (via DNS or entries in /etc/hosts).

To set up multiple clusters:

1. On each cluster, make a note of the cluster name and CLDB nodes (the first line in mapr-clusters.conf).
2. On each cluster's webserver node, add the remote cluster's CLDB nodes to /opt/mapr/conf/mapr-clusters.conf, using the following format:

   clusterA 10.10.80.241:7222 10.10.80.242:7222 10.10.80.243:7222
   clusterB 10.10.80.231:7222


3. On each cluster, restart the mapr-webserver service on all nodes where it is running.

Mirror Status

You can see a list of all mirror volumes and their current status in the Mirror Volumes view (in the MapR Control System, select MapR-FS then Mirror Volumes) or by using the volume list command. You can see additional information about mirror volumes on the CLDB status page (in the MapR Control System, select CLDB), which shows the status and last successful synchronization of all mirrors, as well as the container locations for all volumes. You can also find container locations using the hadoop mfs commands.

Starting a Mirror

When a mirror starts, all the data in the source volume is copied into the mirror volume. Starting a mirror volume requires that the mirror volume exist and be associated with a source. After you start a mirror, synchronize it with the source volume regularly to keep the mirror current. You can start a mirror using the volume mirror start command, or use the following procedure to start mirroring using the MapR Control System.
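For example, a hypothetical command-line invocation (the -name parameter is an assumption consistent with the other volume commands shown in this section, and volume-a is a placeholder mirror volume name):

maprcli volume mirror start -name volume-a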

To start mirroring using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to mirror.
3. Click the Start Mirroring button.

Stopping a Mirror

Stopping a mirror halts any replication or synchronization process currently in progress. Stopping a mirror does not delete or remove the mirror volume. Stop a mirror with the volume mirror stop command, or use the following procedure to stop mirroring using the MapR Control System.

To stop mirroring using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to stop mirroring.
3. Click the Stop Mirroring button.

Pushing Changes to Mirrors

To push a mirror means to start pushing data from the source volume to all its local mirrors. You can push source volume changes out to all mirrors using the volume mirror push command, which returns after the data has been pushed.
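A minimal sketch, assuming volume mirror push takes the source volume's name via -name as the other volume commands in this section do:

maprcli volume mirror push -name volume-a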

Using Volume Links with Mirrors

When you mirror a volume, read requests to the source volume can be served by any of its mirrors on the same cluster via a volume link of type mirror. A volume link is similar to a normal volume mount point, except that you can specify whether it points to the source volume or its mirrors.

To write to (and read from) the source volume, mount the source volume normally. As long as the source volume is mounted below a non-mirrored volume, you can read and write to the volume normally via its direct mount path. You can also use a volume link of type writeable to write directly to the source volume regardless of its mount point.
To read from the mirrors, use the volume link create command to make a volume link (of type mirror) to the source volume. Any read requests from the volume link are distributed among the volume's mirrors. Since the volume link provides access to the mirror volumes, you do not need to mount the mirror volumes.


Schedules

A schedule is a group of rules that specify recurring points in time at which certain actions occur. You can use schedules to automate the creation of snapshots and mirrors; after you create a schedule, it appears as a choice in the scheduling menu when you are editing the properties of a task that can be scheduled:

To apply a schedule to snapshots, see Scheduling a Snapshot.
To apply a schedule to volume mirroring, see Creating a Volume.

Schedules require the M5 license. The following sections provide information about the actions you can perform on schedules:

To create a schedule, see Creating a Schedule
To view a list of schedules, see Viewing a List of Schedules
To modify a schedule, see Modifying a Schedule
To remove a schedule, see Removing a Schedule

Creating a Schedule

You can create a schedule using the schedule create command, or use the following procedure to create a schedule using the MapR Control System.
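As a command-line sketch, the schedule create command accepts the schedule as a JSON description; the exact rule fields shown here (frequency, date, time, retain) are assumptions modeled on the Schedule Rules options described in the procedure below:

maprcli schedule create -schedule '{"name":"weekly-snapshots","rules":[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"}]}'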

To create a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type a name for the new schedule in the Schedule Name field.
4. Define one or more schedule rules in the Schedule Rules section:
   a. From the first dropdown menu, select a frequency (Once, Yearly, Monthly, etc.).
   b. From the next dropdown menu, select a time point within the specified frequency. For example: if you selected Monthly in the first dropdown menu, select the day of the month in the second dropdown menu.
   c. Continue with each dropdown menu, proceeding to the right, to specify the time at which the scheduled action is to occur.
   d. Use the Retain For field to specify how long the data is to be preserved. For example: if the schedule is attached to a volume for creating snapshots, the Retain For field specifies how far after creation the snapshot expiration date is set.
5. Click [ + Add Rule ] to specify additional schedule rules, as desired.
6. Click Save Schedule to create the schedule.

Viewing a List of Schedules

You can view a list of schedules using the schedule list command, or use the following procedure to view a list of schedules using the MapR Control System.

To view a list of schedules using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Schedules view.

Modifying a Schedule

When you modify a schedule, the new set of rules replaces any existing rules for the schedule.

You can modify a schedule using the schedule modify command, or use the following procedure to modify a schedule using the MapR Control System.

To modify a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click the name of the schedule to modify.
3. Modify the schedule as desired:
   a. Change the schedule name in the Schedule Name field.
   b. Add, remove, or modify rules in the Schedule Rules section.
4. Click Save Schedule to save changes to the schedule.

For more information, see Creating a Schedule.

Removing a Schedule

You can remove a schedule using the schedule remove command, or use the following procedure to remove a schedule using the MapR Control System.


To remove a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click the name of the schedule to remove.
3. Click Remove Schedule to display the Remove Schedule dialog.
4. Click Yes to remove the schedule.


Snapshots

A snapshot is a read-only image of a volume at a specific point in time. On an M5-licensed cluster, you can create a snapshot manually or automate the process with a schedule. Snapshots are useful any time you need to be able to roll back to a known good data set at a specific point in time. For example, before performing a risky operation on a volume, you can create a snapshot to enable "undo" capability for the entire volume. A snapshot takes no time to create, and initially uses no disk space, because it stores only the incremental changes needed to roll the volume back to the point in time when the snapshot was created.

The following sections describe procedures associated with snapshots:

To view the contents of a snapshot, see Viewing the Contents of a Snapshot
To create a snapshot, see Creating a Volume Snapshot (requires M5 license)
To view a list of snapshots, see Viewing a List of Snapshots
To remove a snapshot, see Removing a Volume Snapshot

Viewing the Contents of a Snapshot

At the top level of each volume is a directory called .snapshot containing all the snapshots for the volume. You can view the directory with hadoop fs commands or by mounting the cluster with NFS. To prevent recursion problems, ls and hadoop fs -ls do not show the .snapshot directory when the top-level volume directory contents are listed. You must navigate explicitly to the .snapshot directory to view and list the snapshots for the volume.

Example:

root@node41:/opt/mapr/bin# hadoop fs -ls /myvol/.snapshot
Found 1 items
drwxrwxrwx   - root root          1 2011-06-01 09:57 /myvol/.snapshot/2011-06-01.09-57-49

Creating a Volume Snapshot

You can create a snapshot manually or use a schedule to automate snapshot creation. Each snapshot has an expiration date that determines how long the snapshot will be retained:

When you create the snapshot manually, specify an expiration date.
When you schedule snapshots, the expiration date is determined by the Retain parameter of the schedule.

For more information about scheduling snapshots, see Scheduling a Snapshot.

Creating a Snapshot Manually

You can create a snapshot using the volume snapshot create command, or use the following procedure to create a snapshot using the MapR Control System.
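For example, a minimal command-line invocation with hypothetical volume and snapshot names (the -volume and -snapshotname parameter names are assumptions consistent with the other maprcli examples in this guide):

maprcli volume snapshot create -volume volume-a -snapshotname mysnapshot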

To create a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume for which you want a snapshot, then click the New Snapshot button to display the Snapshot Name dialog.
3. Type a name for the new snapshot in the Name... field.
4. Click OK to create the snapshot.

Scheduling a Snapshot

You schedule a snapshot by associating an existing schedule with a normal (non-mirror) volume. You cannot schedule snapshots on mirror volumes; in fact, since mirrors are read-only, creating a snapshot of a mirror would provide no benefit. You can schedule a snapshot by passing the ID of a schedule to the volume modify command, or you can use the following procedure to choose a schedule for a volume using the MapR Control System.
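For example, a hypothetical invocation attaching the schedule with ID 2 (the same schedule ID used in the volume create examples earlier) to an existing volume; the -schedule parameter name is an assumption:

maprcli volume modify -name volume-a -schedule 2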

To schedule a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the name of the volume then clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.


For information about creating a schedule, see Schedules.

Viewing a List of Snapshots

Viewing all Snapshots

You can view snapshots with the volume snapshot list command or by using the MapR Control System.

To view snapshots using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Snapshots view.

Viewing Snapshots for a Volume

You can view snapshots for a volume by passing the volume to the volume snapshot list command or by using the MapR Control System.
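For example, a hypothetical invocation filtering by volume (the -volume parameter name is an assumption):

maprcli volume snapshot list -volume volume-a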

To view snapshots using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the Snapshots button to display the Snapshots for Volume dialog.

Removing a Volume Snapshot

Each snapshot has an expiration date and time, at which it is deleted automatically. You can remove a snapshot manually before its expiration, or you can preserve a snapshot to prevent it from expiring.

Removing a Volume Snapshot Manually

You can remove a snapshot using the volume snapshot remove command, or use the following procedure to remove a snapshot using the MapR Control System.
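A minimal sketch with hypothetical names (the parameter names are assumptions matching the snapshot create example above):

maprcli volume snapshot remove -volume volume-a -snapshotname mysnapshot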

To remove a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
2. Select the checkbox beside each snapshot you wish to remove.
3. Click Remove Snapshot to display the Remove Snapshots dialog.
4. Click Yes to remove the snapshot or snapshots.

To remove a snapshot from a specific volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume name.
3. Click Snapshots to display the Snapshots for Volume dialog.
4. Select the checkbox beside each snapshot you wish to remove.
5. Click Remove to display the Remove Snapshots dialog.
6. Click Yes to remove the snapshot or snapshots.

Preserving a Volume Snapshot

You can preserve a snapshot using the volume snapshot preserve command, or use the following procedure to preserve a snapshot using the MapR Control System.

To preserve a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
2. Select the checkbox beside each snapshot you wish to preserve.
3. Click Preserve Snapshot to preserve the snapshot or snapshots.

To preserve a snapshot from a specific volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume name.
3. Click Snapshots to display the Snapshots for Volume dialog.
4. Select the checkbox beside each snapshot you wish to preserve.
5. Click Preserve to preserve the snapshot or snapshots.


Managing the Cluster

This section describes the tools and processes involved in managing a MapR cluster. Topics include upgrading the MapR software version; adding and removing disks and nodes; managing data replication and disk space with balancers; managing the services on a node; managing the topology of a cluster; and more.

Choose a subtopic below for more detail.

Balancers
Cluster Upgrade
   Converting a Cluster from Root to Non-root User
   Manual Upgrade
   Rolling Upgrade
Disks
   Working with a Logical Volume Manager
   Setting Up Disks for MapR
   Specifying Disks or Partitions for Use by MapR
Dial Home
Nodes
   Adding Nodes to a Cluster
   Managing Services on a Node
   Node Topology
   Isolating CLDB Nodes
   Isolating ZooKeeper Nodes
   Removing Roles
Services
   Changing the User for MapR Services
   Failover
   TaskTracker Blacklisting
   Assigning Services to Nodes for Best Performance
Startup and Shutdown
Uninstalling MapR


Balancers

The disk space balancer and the replication role balancer redistribute data in the MapR storage layer to ensure maximum performance and efficient use of space:

The disk space balancer works to ensure that the percentage of space used on all disks in the node is similar, so that no nodes are overloaded.
The replication role balancer changes the replication roles of cluster containers so that the replication process uses network bandwidth evenly.

To view balancer configuration values:

Pipe the maprcli config load command through grep. Example:

# maprcli config load -json | grep balancer
"cldb.balancer.disk.max.switches.in.nodes.percentage":"10",
"cldb.balancer.disk.paused":"1",
"cldb.balancer.disk.sleep.interval.sec":"120",
"cldb.balancer.disk.threshold.percentage":"70",
"cldb.balancer.logging":"0",
"cldb.balancer.role.max.switches.in.nodes.percentage":"10",
"cldb.balancer.role.paused":"1",
"cldb.balancer.role.sleep.interval.sec":"900",
"cldb.balancer.startup.interval.sec":"1800",

To set balancer configuration values:

Use the config save command to set the appropriate values. Example:

# maprcli config save -values {"cldb.balancer.disk.max.switches.in.nodes.percentage":"20"}

By default, the balancers are turned off.

To turn on the disk space balancer, use config save to set cldb.balancer.disk.paused to 0.
To turn on the replication role balancer, use config save to set cldb.balancer.role.paused to 0.
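For example, following the config save pattern shown above, you would turn both balancers on like this:

# maprcli config save -values {"cldb.balancer.disk.paused":"0"}
# maprcli config save -values {"cldb.balancer.role.paused":"0"}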

Disk Space Balancer

The disk space balancer is a tool that balances disk space usage on a cluster by moving containers between storage pools.

Whenever a storage pool is over 70% full, the disk space balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the node is similar.

You can view disk usage on all nodes in the Disks view, by clicking Cluster > Nodes in the Navigation pane and then choosing Disks from the dropdown.

Disk Space Balancer Configuration Parameters

Parameter: cldb.balancer.disk.threshold.percentage
Value: 70
Description: Threshold for moving containers out of a given storage pool, expressed as utilization percentage.

Parameter: cldb.balancer.disk.paused
Value: 1
Description: Specifies whether the disk space balancer runs:
   0 - Not paused (normal operation)
   1 - Paused (does not perform any container moves)

Parameter: cldb.balancer.disk.max.switches.in.nodes.percentage
Value: 10
Description: This can be used to throttle the disk balancer. If it is set to 10, the balancer will throttle the number of concurrent container moves to 10% of the total nodes in the cluster (minimum 2).


Disk Space Balancer Status

Use the maprcli dump balancerinfo command to view detailed information about the storage pools on a cluster.

# maprcli dump balancerinfo
usedMB  fsid                 spid                              percentage  outTransitMB  inTransitMB  capacityMB
209     5567847133641152120  01f8625ba1d15db7004e52b9570a8ff3  1           0             0            15200
209     1009596296559861611  816709672a690c96004e52b95f09b58d  1           0             0            15200

If there are any active container moves at the time the command is run, maprcli dump balancerinfo returns information about the source and destination storage pools.

# maprcli dump balancerinfo -json
....
{
        "containerid":7840,
        "sizeMB":15634,
        "From fsid":8081858704500413174,
        "From IP:Port":"10.50.60.64:5660-",
        "From SP":"9e649bf0ac6fb9f7004fa19d200abcde",
        "To fsid":3770844641152008527,
        "To IP:Port":"10.50.60.73:5660-",
        "To SP":"fefcc342475f0286004fad963f0fghij"
}

For more information about this command, see maprcli dump balancerinfo.

Disk Space Balancer Metrics

The maprcli dump balancermetrics command returns a cumulative count of container moves and MB of data moved between storage pools since the current CLDB became the master CLDB.

# maprcli dump balancermetrics -json
{
        "timestamp":1337770325979,
        "status":"OK",
        "total":1,
        "data":[
                {
                        "numContainersMoved":10090,
                        "numMBMoved":3147147,
                        "timeOfLastMove":"Wed May 23 03:51:44 PDT 2012"
                }
        ]
}

For more information about this command, see maprcli dump balancermetrics.

Replication Role Balancer

The replication role balancer is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers).

The replication role balancer changes the replication role of the containers in a cluster so that network bandwidth is spread evenly across all nodes during the replication process. A container's replication role determines how it is replicated to the other nodes in the cluster. For name containers (the volume's first container), replication occurs simultaneously from the master to all replica containers. For data containers, replication proceeds from the master to the intermediate container(s) until it reaches the tail containers. Replication occurs over the network between nodes, often in separate racks.

Replication Role Balancer Configuration Parameters

Parameter: cldb.balancer.role.paused
Value: 1
Description: Specifies whether the role balancer runs:
   0 - Not paused (normal operation)
   1 - Paused (does not perform any container replication role switches)

Parameter: cldb.balancer.role.max.switches.in.nodes.percentage
Value: 10
Description: This can be used to throttle the role balancer. If it is set to 10, the balancer will throttle the number of concurrent role switches to 10% of the total nodes in the cluster (minimum 2).

Replication Role Balancer Status

The maprcli dump rolebalancerinfo command returns information about the number of active replication role switches. During a replication role switch, the replication role balancer selects a master or intermediate data container and switches its replication role to that of a tail data container.

# maprcli dump rolebalancerinfo -json
{
        "timestamp":1335835436698,
        "status":"OK",
        "total":1,
        "data":[
                {
                        "containerid": 36659,
                        "Tail IP:Port":"10.50.60.123:5660-",
                        "Updates blocked Since":"Wed May 23 05:48:15 PDT 2012"
                }
        ]
}

For more information about this command, see maprcli dump rolebalancerinfo.


Cluster Upgrade

The following sections provide information about upgrading the cluster:

Rolling Upgrade provides information about automatically applying MapR software upgrades to a cluster.
Manual Upgrade provides information about stopping all nodes, installing updated packages manually, and restarting all nodes.
Converting a Cluster from Root to Non-root User provides the procedure for converting a MapR cluster to a non-root user.

NFS and Upgrading

Starting in MapR release 1.2.8, a change in the NFS file handle format makes NFS file handles incompatible between NFS servers running MapR version 1.2.7 or earlier and servers running MapR 1.2.8 and following.

NFS clients that were originally mounted to NFS servers on nodes running MapR version 1.2.7 or earlier must remount the file system when the node is upgraded to MapR version 1.2.8 or following.

When upgrading from MapR version 1.2.7 or earlier to version 1.2.8 or later:

1. Upgrade a subset of the existing NFS server nodes, or install the newer version of MapR on a set of new nodes.
2. If the selected NFS server nodes are using virtual IP numbers (VIPs), reassign those VIPs to other NFS server nodes that are still running the previous version of MapR.
3. Apply the upgrade to the selected set of NFS server nodes.
4. Start the NFS servers on nodes upgraded to the newer version.
5. Unmount the NFS clients from the NFS servers of the older version.
6. Remount the NFS clients on the upgraded NFS server nodes. Stage these remounts in groups of 100 or fewer clients to prevent performance disruptions.
7. After remounting all NFS clients, stop the NFS servers on nodes running the older version, then continue the upgrade process.

Due to changes in file handles between versions, cached file IDs cannot persist across this upgrade.


Converting a Cluster from Root to Non-root User

This procedure converts a MapR cluster running as root to run as a non-root user.

Perform this procedure on all nodes after completing the upgrade procedure. Do not perform this procedure concurrently with an upgrade.

To convert a MapR cluster from running as root to running as a non-root user:

1. Create a user with the same UID/GID across the cluster. Assign that user to the MAPR_USER environment variable.
2. On each node:
   a. Stop the Warden service and the ZooKeeper service, if present.

      service mapr-warden stop
      service mapr-zookeeper stop

   b. Run the /opt/mapr/server/config-mapr-user.sh -u <MapR user> [-g <MapR group>] script.
   c. Start the Warden service and the ZooKeeper service, if present.

      service mapr-zookeeper start
      service mapr-warden start

   d. Run the /opt/mapr/server/upgrade2mapruser.sh script.

The MAPR_UID_MISMATCH alarm may be raised during this process. The alarm clears when this process is complete on all nodes.
The upgrade2mapruser.sh script waits ten minutes. If the cluster upgrade takes longer than ten minutes, the script fails. After completing the cluster upgrade, re-run the upgrade2mapruser.sh script on all nodes where the script failed.

Disabling Superuser Access for the Root User

To disable root user (UID 0) access to the MapR filesystem on a cluster that is running as a non-root user, use either of the following commands:

The squash root configuration value treats all requests from UID 0 as coming from UID -2 (nobody):

maprcli config save -values {"cldb.squash.root":"1"}

The reject root configuration value automatically fails all filesystem requests from UID 0:

maprcli config save -values {"cldb.reject.root":"1"}


Manual Upgrade

Upgrading the MapR cluster manually entails stopping all nodes, installing updated packages, and restarting the nodes. Here are a few tips:

Make sure to add the correct repository directory for the version of MapR software you wish to install.
Work on all nodes at the same time; that is, stop the warden on all nodes before proceeding to the next step, and so on.
Use the procedure corresponding to the operating system on your cluster.

After upgrading your cluster to MapR 2.x, you can run MapR as a non-root user.

To upgrade your Hadoop ecosystem components after a rolling upgrade, follow the procedures in Manual Upgrade for Hadoop Ecosystem Components. (See Packages and Dependencies for MapR Software for a list of the Hadoop ecosystem packages.)

Tips when upgrading manually to MapR 2.x:

If upgrading to version 2.1, stop the cluster entirely before performing the upgrade. After the upgrade is successful on all nodes and the cluster is up and running, run the following command on any cluster node to enable new features:

maprcli config save -values {cldb.v2.features.enabled:1}

Installing a newer version of MapR software might introduce new package dependencies. Dependency packages must be installed on all nodes in the cluster in addition to the updated MapR packages. If you are upgrading using a package manager such as zypper or apt-get, then the package manager on each node must have access to repositories for dependency packages. If installing from package files, you must pre-install dependencies on all nodes in the cluster prior to upgrading the MapR software. See Packages and Dependencies for MapR Software.
When performing a manual upgrade, it is necessary to run configure.sh on any nodes that are running the HBase region server or HBase master.
If you are upgrading a node that is running the NFS service, use the mount command to determine whether the node has mounted MapR-FS through its own physical IP (not a VIP). If so, unmount any such mount points before beginning the upgrade process.
If you are upgrading a node that is running the HBase RegionServer and the warden stop command does not return after five minutes, kill the HBase RegionServer process manually:

1. Determine the process ID of the HBase RegionServer:

   cat /opt/mapr/logs/hbase-root-regionserver.pid

2. Kill the HBase RegionServer using the following command, substituting the process ID from the previous step for the placeholder <PID>:

   kill -9 <PID>

After upgrading, you can optionally convert the cluster from running as root to a non-root user.

CentOS and Red Hat

Perform the following three procedures:

Upgrading the cluster - installing the new versions of the packages
Setting the version - manually updating the software configuration to reflect the correct version
Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To upgrade the cluster:

On each node, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Add the MapR yum repository for the latest version of MapR software, removing any old versions. For more information, see Preparing Packages and Repositories.
4. Stop the warden:

   /etc/init.d/mapr-warden stop


5. If ZooKeeper is installed on the node, stop it:

   /etc/init.d/mapr-zookeeper stop

6. Upgrade the MapR packages with the following command:

   yum upgrade 'mapr-*'

7. If ZooKeeper is installed on the node, start it:

   /etc/init.d/mapr-zookeeper start

8. Start the warden:

   /etc/init.d/mapr-warden start

To update the configuration files:

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as the configuration directory.

Ubuntu

Perform the following three procedures:

Upgrading the cluster - installing the new versions of the packages
Setting the version - manually updating the software configuration to reflect the correct version
Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To upgrade the cluster:

The easiest way to upgrade all MapR packages on Ubuntu is to temporarily move the normal sources.list and replace it with a special sources.list that specifies only the MapR software repository, then use apt-get upgrade to upgrade all packages from the special sources.list file. On each node, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Rename the normal /etc/apt/sources.list to prevent apt-get upgrade from reading packages from other repositories:

   mv /etc/apt/sources.list /etc/apt/sources.list.orig

4. Create a new /etc/apt/sources.list file and add the MapR apt-get repository for the latest version of MapR software, removing any old versions. For more information, see Preparing Packages and Repositories.
5. Stop the warden:

   /etc/init.d/mapr-warden stop

6. If ZooKeeper is installed on the node, stop it:

   /etc/init.d/mapr-zookeeper stop

7. Clear the APT cache:

   apt-get clean

8. Update the list of available packages:


   apt-get update

9. Upgrade all MapR packages:

   apt-get upgrade

10. Rename the special /etc/apt/sources.list and restore the original:

   mv /etc/apt/sources.list /etc/apt/sources.list.mapr
   mv /etc/apt/sources.list.orig /etc/apt/sources.list

11. If ZooKeeper is installed on the node, start it:

   /etc/init.d/mapr-zookeeper start

12. Start the warden:

   /etc/init.d/mapr-warden start

Setting the Version

After completing the upgrade on all nodes, use the following steps on any node to set the correct version:

1. Check the software version by looking at the contents of the MapRBuildVersion file. Example:

$ cat /opt/mapr/MapRBuildVersion

2. Set the version accordingly using the config save command. Example:

maprcli config save -values {mapr.targetversion:"2.0.1.15869GA-1"}

Updating the Configuration

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as the configuration directory.


Manual Upgrade for Hadoop Ecosystem Components

The rolling upgrade script upgrades core MapR packages only. Follow the procedures in this section to upgrade the Hadoop ecosystem components on your MapR cluster.

The procedures in this section depend on the presence of the correct MapR Hadoop Ecosystem Components Repository entry in your system's package manager.

Red Hat and CentOS

For a list of the currently supported versions of ecosystem components, see Hadoop Compatibility in Version 2.1. The repository for Red Hat and CentOS packages is at http://package.mapr.com/releases/ecosystem/redhat.

Alternate supported versions of some components are available at http://package.mapr.com/releases/ecosystem-all/redhat/, but this directory is not a repository. Download any desired packages manually.

Flume

Upgrading Flume version 0.9.4 to version 1.2

Remove Flume 0.9.4 manually and install version 1.2

1. Run the command yum remove mapr-flume
2. Run the command yum install mapr-flume

Install Flume version 1.2 with the downgrade option

Run the command yum downgrade mapr-flume. This command removes Flume version 0.9.4 and installs Flume version 1.2.

Keep Flume version 0.9.4 and install version 1.2

1. Download the RPM file for Flume 1.2 from package.mapr.com/releases/ecosystem/
2. Run the command rpm -i --force mapr-flume-1.2.15190-1.noarch.rpm

HBase


Different HBase versions use different data formats for the ROOT and META tables. Due to these differences, and differences in the methods used to register with the ZooKeeper service, do not downgrade HBase from versions 0.94.1 or 0.92.1 to version 0.90.6.

Upgrading HBase version 0.90.6 to version 0.94.1

Due to changes in package naming conventions, running the command yum install mapr-hbase-master mapr-hbase-regionserver fails with these error messages:

Error: Protected multilib versions: mapr-hbase-master-0.94.1.15190-1.noarch != mapr-hbase-master-1.2.9.14962.GA-sp1.x86_64
Error: Protected multilib versions: mapr-hbase-regionserver-0.94.1.15190-1.noarch != mapr-hbase-regionserver-1.2.9.14962.GA-sp1.x86_64

Remove HBase version 0.90.6 manually and install version 0.94.1

1. Run the command yum remove mapr-hbase-regionserver mapr-hbase-master mapr-hbase-internal
2. Run the command yum install mapr-hbase-regionserver mapr-hbase-master

Install HBase version 0.94.1 using the downgrade option

1. Run the command yum downgrade mapr-hbase-master mapr-hbase-regionserver mapr-hbase-internal. This command removes all directories except conf and logs.
2. Manually copy over any configuration changes from the previous install.


The yum downgrade command removes the hbaseversion file from the /opt/mapr/hbase directory. You must manually re-create this file after installing HBase. Run the command echo "0.94.1" > hbaseversion to re-create the file.

Keep HBase version 0.90.6 and install version 0.94.1

Run the command yum install mapr-hbase-master-0.94.1.15190-1 mapr-hbase-regionserver-0.94.1.15190-1 mapr-hbase-internal-0.94.1.15190-1

Installing the mapr-hbase-internal package is not optional. This installation updates several required files directing other MapR packages to the correct HBase version.

Upgrading HBase version 0.90.6 to version 0.92.1

Due to changes in package naming conventions, running the command yum install mapr-hbase-master mapr-hbase-regionserver fails with these error messages:

Error: Protected multilib versions: mapr-hbase-master-0.92.1.15190-1.noarch != mapr-hbase-master-1.2.9.14962.GA-sp1.x86_64
Error: Protected multilib versions: mapr-hbase-regionserver-0.92.1.15190-1.noarch != mapr-hbase-regionserver-1.2.9.14962.GA-sp1.x86_64

Remove HBase version 0.90.6 manually and install version 0.92.1

1. Run the command yum remove mapr-hbase-regionserver mapr-hbase-master mapr-hbase-internal
2. Run the command yum install mapr-hbase-regionserver mapr-hbase-master

Install HBase version 0.92.1 using the downgrade option

1. Run the command yum downgrade mapr-hbase-master mapr-hbase-regionserver mapr-hbase-internal. This command removes all directories except /conf and /logs.
2. Manually copy over any configuration changes from the previous install.

The yum downgrade command removes the hbaseversion file from the /opt/mapr/hbase directory. You must manually re-create this file after installing HBase. Run the command echo "0.92.1" > hbaseversion to re-create the file.

Keep HBase version 0.90.6 and install version 0.92.1

Run the command yum install mapr-hbase-master-0.92.1.15190-1 mapr-hbase-regionserver-0.92.1.15190-1 mapr-hbase-internal-0.92.1.15190-1

Installing the mapr-hbase-internal package is not optional. This installation updates several required files directing other MapR packages to the correct HBase version.

Upgrading HBase version 0.92.1 to version 0.94.1

Upgrading HBase version 0.92.1 to version 0.94.1 requires you to have installed the 0.92.1 versions of these packages:

mapr-hbase-master
mapr-hbase-regionserver
mapr-hbase-internal

Run the command yum install mapr-hbase-master-0.92.1.15190-1 mapr-hbase-internal-0.92.1.15190-1 to ensure these packages are present.

Clean upgrade:

Run the command yum install mapr-hbase-master mapr-hbase-internal.


Keep mapr-hbase-internal to avoid issues from conflicting versions of hbase-internal.

Manually remove 0.92.1 and install 0.94.1:

1. Run the command yum remove mapr-hbase-master mapr-hbase-internal
2. Run the command yum install mapr-hbase-master mapr-hbase-internal

Keep HBase version 0.92.1 and install version 0.94.1:

1. Run the command yum install mapr-hbase-master-0.92.1.15190-1 mapr-hbase-internal-0.92.1.15190-1
2. Download the following HBase version 0.94.1 RPM packages from package.mapr.com/releases/ecosystem/:

   mapr-hbase-master
   mapr-hbase-regionserver
   mapr-hbase-internal

3. Run the command rpm -ivh --force mapr-hbase-master-0.94.1.15190-1.noarch.rpm mapr-hbase-internal-0.94.1.15190-1.noarch.rpm

Downgrade from HBase version 0.94.1 to version 0.92.1

Remove HBase version 0.94.1 and install version 0.92.1

1. Run the command yum remove mapr-hbase-master mapr-hbase-internal
2. Run the command yum install mapr-hbase-master-0.92.1.15190-1 mapr-hbase-internal-0.92.1.15190-1

Keep HBase version 0.94.1 and install version 0.92.1

1. Run the command yum install mapr-hbase-master mapr-hbase-internal
2. Download the following HBase version 0.92.1 RPM packages from package.mapr.com/releases/ecosystem/:

   mapr-hbase-master
   mapr-hbase-regionserver
   mapr-hbase-internal

3. Run the command rpm -ivh --force mapr-hbase-master-0.92.1.15190-1.noarch.rpm mapr-hbase-internal-0.92.1.15190-1.noarch.rpm

Hive

Upgrading Hive from version 0.7.1 to 0.9.0

Back up your metastore database before upgrading Hive.

Refer to the README file in the /opt/mapr/hive/hive-0.9.0/scripts/metastore/upgrade/<metastore_database> directory after upgrading Hive for directions on updating your existing metastore_db schema to work with the new Hive version. Scripts are provided for MySQL and Derby. You must update your metastore database schema before creating any new Hive tables.

After the upgrade, verify that the schema upgrade for the metastore database has completed successfully. A few sample diagnostics:

The show tables command in Hive should provide a complete list of all your Hive tables.
Perform simple SELECT operations on Hive tables that existed before the upgrade.
Perform filtered SELECT operations on Hive tables that existed before the upgrade.

Remove Hive version 0.7.1 manually and install Hive version 0.9.0

1. Run the command yum remove mapr-hive.
2. Run the command yum install mapr-hive.

Remove Hive version 0.7.1 and install version 0.9.0 using the downgrade option

Run the command yum downgrade mapr-hive to remove Hive version 0.7.1 and install Hive version 0.9.0.

Keep Hive version 0.7.1 and install Hive version 0.9.0

1. Download the RPM files for Hive version 0.9.0 from package.mapr.com/releases/ecosystem/.
2. Run the command rpm -i --force mapr-hive-0.9.0.15541-1.noarch.rpm.

Mahout

Upgrading Mahout version 0.5 to Mahout version 0.6 or version 0.7


Due to changes in package naming conventions, running yum install mapr-mahout fails with the following error:

h != mapr-mahout-1.2.9.15300.GA.v0.5-sp3.x86_64

Use one of the following options to install Mahout version 0.6 or version 0.7.

Remove Mahout version 0.5 manually and install Mahout version 0.6 or Mahout version 0.7

1. Remove Mahout version 0.5 with the yum remove mapr-mahout command.
2. To install Mahout version 0.7, run yum install mapr-mahout.
3. To install Mahout version 0.6, run yum install mapr-mahout-0.6.15190-1.

Remove Mahout version 0.5 manually and install Mahout version 0.6 and Mahout version 0.7

1. Remove Mahout version 0.5 with the yum remove mapr-mahout command.
2. Run yum install mapr-mahout to install Mahout version 0.7.
3. Download the RPM files for Mahout version 0.6.
4. Run the command rpm -i --force mapr-mahout-0.6.15190-1.noarch.rpm to install Mahout version 0.6.

Install Mahout version 0.6 or version 0.7 with the downgrade option

1. To install Mahout version 0.7, run the yum downgrade mapr-mahout command.
2. To install Mahout version 0.6, run the yum downgrade mapr-mahout-0.6.15190-1 command.

Both of these options clean up the existing Mahout version 0.5 installation.

Keep Mahout version 0.5 and install Mahout version 0.6 or Mahout version 0.7

1. Download the RPM files for Mahout version 0.6 or Mahout version 0.7 from package.mapr.com/releases/ecosystem/.
2. To install Mahout version 0.6 while keeping an existing Mahout version 0.5 installation, run the rpm -i --force mapr-mahout-0.6.15190-1.noarch.rpm command.
3. To install Mahout version 0.7 while keeping an existing Mahout version 0.5 installation, run the rpm -i --force mapr-mahout-0.7.15190-1.noarch.rpm command.

Keep Mahout version 0.5 and install Mahout version 0.6 and Mahout version 0.7

1. Download the RPM files for Mahout version 0.6 or Mahout version 0.7 from package.mapr.com/releases/ecosystem/.
2. Run the command rpm -i --force mapr-mahout-0.6.15190-1.noarch.rpm to install Mahout version 0.6.
3. Run the command rpm -i --force mapr-mahout-0.7.15190-1.noarch.rpm to install Mahout version 0.7.

Oozie

Upgrading Oozie version 3.0.0 to version 3.1.0

Remove Oozie version 3.0.0 manually and install Oozie version 3.1.0

1. Run the command yum remove mapr-oozie.
2. Run the command yum install mapr-oozie.
3. Run the configure.sh script to set directory permissions.

Remove Oozie version 3.0.0 automatically and install Oozie version 3.1.0.

1. Run the command yum install mapr-oozie to remove your Oozie version 3.0.0 installation and install Oozie version 3.1.0.
2. Run the configure.sh script to set directory permissions.

Keep Oozie version 3.0.0 and install Oozie version 3.1.0

1. Download the RPM files for Oozie 3.1.0 from package.mapr.com/releases/ecosystem/.
2. Run the command rpm -i --force mapr-oozie-internal-3.2.0.15541-1.noarch.rpm.
3. Run the configure.sh script to set directory permissions.

Pig

Upgrading Pig version 0.9.0 to version 0.10.0

Remove Pig version 0.9.0 manually and install version 0.10.0

1. Run the command yum remove mapr-pig mapr-pig-internal
2. Run the command yum install mapr-pig-0.10.0.15190-1

Installing Pig version 0.10.0 with the downgrade option

Run the command yum downgrade mapr-pig. This command erases Pig version 0.9.0 and installs version 0.10.0.


Keep Pig version 0.9.0 and install version 0.10.0

1. Download the RPM file for Pig version 0.10.0 from package.mapr.com/releases/ecosystem/.
2. Run the command rpm -i --force mapr-pig-0.10.0.15190-1.noarch.rpm.

Sqoop

Upgrading Sqoop version 1.3.0 to version 1.4.1

Remove Sqoop version 1.3.0 manually and install version 1.4.1

1. Run the command yum remove mapr-sqoop
2. Run the command yum install mapr-sqoop

Upgrading to Sqoop version 1.4.1 with yum install

Run the command yum install mapr-sqoop. This command removes Sqoop version 1.3.0 and installs version 1.4.1.

Keep Sqoop version 1.3.0 and install version 1.4.1

1. Download the RPM file for Sqoop 1.4.1 from package.mapr.com/releases/ecosystem/
2. Run the command rpm -i --force mapr-sqoop-1.4.1.15190-1.noarch.rpm

Whirr

Upgrading Whirr version 0.3.0 to version 0.7.0

Remove Whirr version 0.3.0 manually and install version 0.7.0

1. Run the command yum remove mapr-whirr
2. Run the command yum install mapr-whirr

Install Whirr version 0.7.0 using the downgrade option

Run the command yum downgrade mapr-whirr to erase Whirr version 0.3.0 and install version 0.7.0.

Keep Whirr version 0.3.0 and install version 0.7.0

1. Download the RPM file for Whirr 0.7.0 from package.mapr.com/releases/ecosystem/.
2. Run the command rpm -i --force mapr-whirr-0.7.0.15190-1.noarch.rpm.

Ubuntu

For a list of the currently supported versions of ecosystem components, see Hadoop Compatibility in Version 2.1. The repository for Ubuntu packages is at http://package.mapr.com/releases/ecosystem/ubuntu.

Alternate supported versions of some components are available at http://package.mapr.com/releases/ecosystem-all/ubuntu/, but this directory is not a repository. Download any desired packages manually.

Flume

Upgrading Flume version 0.9.4 to version 1.2

Remove Flume version 0.9.4 manually and install version 1.2

1. Run the command apt-get remove mapr-flume
2. Run the command apt-get install mapr-flume

Install Flume version 1.2 and keep version 0.9.4

Run the command apt-get install mapr-flume-1.2.0.15190.

Install Flume version 1.2 and keep version 0.9.4 using the dpkg command

1. Download the DEB package for mapr-flume version 1.2 from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-flume-1.2.0.15190_all.deb

HBase

Upgrading HBase version 0.90.6 to 0.92.1

Page 145: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 143For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

1. 2.

1.

1. 2.

1. 2.

1.

1. 2.

1. 2.

1. 2.

Remove HBase version 0.90.6 manually and install version 0.92.1

1. Run the command apt-get remove mapr-hbase-internal mapr-hbase-master mapr-hbase-regionserver
2. Run the command apt-get install mapr-hbase-master mapr-hbase-regionserver

Install HBase version 0.92.1 and keep version 0.90.6

Run the command apt-get install mapr-hbase-internal-0.92.1.15190 mapr-hbase-master-0.92.1.15190 mapr-hbase-regionserver-0.92.1.15190

Install HBase version 0.92.1 using the dpkg command and keep version 0.90.6

1. Download the DEB packages for mapr-hbase from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-hbase-internal-0.92.1.15190_all.deb mapr-hbase-master-0.92.1.15190_all.deb mapr-hbase-regionserver-0.92.1.15190_all.deb

Upgrading HBase version 0.90.6 to 0.94.1

Remove HBase version 0.90.6 manually and install version 0.94.1

1. Run the command apt-get remove mapr-hbase-internal mapr-hbase-master mapr-hbase-regionserver
2. Run the command apt-get install mapr-hbase-master mapr-hbase-regionserver

Install HBase version 0.94.1 and keep version 0.90.6

Run the command apt-get install mapr-hbase-internal-0.94.1.15190 mapr-hbase-master-0.94.1.15190 mapr-hbase-regionserver-0.94.1.15190

Install HBase version 0.94.1 using the dpkg command and keep version 0.90.6

1. Download the DEB packages for mapr-hbase from http://package.mapr.com/releases/ecosystem-all/
2. Run the command dpkg -i mapr-hbase-internal-0.94.1.15190_all.deb mapr-hbase-master-0.94.1.15190_all.deb mapr-hbase-regionserver-0.94.1.15190_all.deb

Hive

Upgrading Hive version 0.7.1 to 0.9.0

Back up your metastore database before upgrading Hive.

Refer to the README file in the /opt/mapr/hive/hive-0.9.0/scripts/metastore/upgrade/<metastore_database> directory after upgrading Hive for directions on updating your existing metastore_db schema to work with the new Hive version. Scripts are provided for MySQL and Derby. You must update your metastore database schema before creating any new Hive tables.

After the upgrade, verify that the schema upgrade for the metastore database has completed successfully. A few sample diagnostics:

The show tables command in Hive should provide a complete list of all your Hive tables.
Perform simple SELECT operations on Hive tables that existed before the upgrade.
Perform filtered SELECT operations on Hive tables that existed before the upgrade.

Remove Hive version 0.7.1 manually and install version 0.9.0

1. Run the command apt-get remove mapr-hive mapr-hive-internal
2. Run the command apt-get install mapr-hive

Force install Hive version 0.9.0 and keep Hive version 0.7.1

Run the command apt-get install mapr-hive=0.9.0.15190.

Install using the dpkg command

1. Download the DEB package for mapr-hive from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-hive-0.9.0.15190_all.deb to install Hive version 0.9.0 and keep any existing version of Hive.

Mahout

Upgrading Mahout version 0.5 to version 0.6

Remove Mahout version 0.5 manually and install version 0.6

Page 146: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 144For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

1. 2.

1. 2.

1. 2.

1. 2. 3.

1.

2.

1. 2. 3.

1. 2.

1. 2.

1. 2.

1. Run the command apt-get remove mapr-mahout
2. Run the command apt-get install mapr-mahout

Automatically remove Mahout version 0.5 and install version 0.6

Run the command apt-get install mapr-mahout-0.6.15190. This command installs Mahout version 0.6 and removes the existing Mahout installation.

Install Mahout version 0.6 and remove version 0.5 using the dpkg command

1. Download the DEB package for mapr-mahout version 0.6 from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-mahout-0.6.15190_all.deb to install Mahout version 0.6 and remove version 0.5

Install Mahout version 0.7 and remove version 0.5 using the dpkg command

1. Download the DEB package for mapr-mahout version 0.7 from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-mahout-0.7.15190_all.deb to install Mahout version 0.7 and remove version 0.5

Oozie

Upgrading Oozie version 3.0.0 to version 3.1.0

Remove Oozie version 3.0.0 manually and install version 3.1.0

1. Run the command apt-get remove mapr-oozie mapr-oozie-internal
2. Run the command apt-get install mapr-oozie mapr-oozie-internal
3. Run the configure.sh script to set directory permissions.

Install Oozie version 3.1.0 and keep version 3.0.0

1. Run the command apt-get install mapr-oozie. This command installs Oozie version 3.1.0 and keeps the existing installation of version 3.0.0.
2. Run the configure.sh script to set directory permissions.

Install Oozie version 3.1.0 and keep version 3.0.0 using the dpkg command

1. Download the DEB packages for mapr-oozie from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-oozie-3.1.0.15190_all.deb mapr-oozie-internal-3.1.0.15190_all.deb
3. Run the configure.sh script to set directory permissions.

Pig

Upgrading Pig version 0.9.0 to version 0.10.0

Remove Pig version 0.9.0 manually and install version 0.10.0

1. Run the command apt-get remove mapr-pig mapr-pig-internal
2. Run the command apt-get install mapr-pig

Force install Pig version 0.10.0 and keep Pig version 0.9.0

Run the command apt-get install mapr-pig=0.10.0.15190.

Install Pig version 0.10.0 using the dpkg command and keep version 0.9.0

1. Download the DEB package for mapr-pig from package.mapr.com/releases/ecosystem/
2. Run the command dpkg -i mapr-pig-0.10.0.15190_all.deb

Sqoop

Upgrading Sqoop version 1.3.0 to 1.4.1

Remove Sqoop version 1.3.0 manually and install version 1.4.1

1. Run the command apt-get remove mapr-sqoop mapr-sqoop-internal
2. Run the command apt-get install mapr-sqoop

Install Sqoop version 1.4.1 and keep version 1.3.0 using the apt-get command

Run the command apt-get install mapr-sqoop.

Install Sqoop version 1.4.1 and keep version 1.3.0 using the dpkg command

1. Download the DEB package for mapr-sqoop version 1.4.1 from package.mapr.com/releases/ecosystem/


2. Run the command dpkg -i mapr-sqoop-1.4.1.15190_all.deb


Upgrading to Version 2.1.0

Upgrading the MapR cluster to version 2.1.0 entails preparing to install, installing updated packages, and restarting the MapR services on all nodes. After upgrading your cluster to MapR 2.1.0, you can run MapR as a non-root user.

This procedure is specific to Version 2.1.0. MapR recommends upgrading to the latest version instead, using one of the standard procedures:

- Manual Upgrade
- Rolling Upgrade

To upgrade your Hadoop ecosystem components after an upgrade, follow the procedures in Manual Upgrade for Hadoop Ecosystem Components. (See Packages and Dependencies for MapR Software for a list of the Hadoop ecosystem packages.)

Important when upgrading to Version 2.1.0:

- Upgrade the active JobTracker last (after upgrading all other nodes in the cluster).
- Before upgrade, make sure no CLDB nodes have the JobTracker installed.

Tips when upgrading to MapR 2.1.0:

- Installing a newer version of MapR software might introduce new package dependencies. Dependency packages must be installed on all nodes in the cluster in addition to the updated MapR packages. If you are upgrading using a package manager such as zypper or apt-get, then the package manager on each node must have access to repositories for dependency packages. If installing from package files, you must pre-install dependencies on all nodes in the cluster prior to upgrading the MapR software. See Packages and Dependencies for MapR Software.
- When performing a manual upgrade, it is necessary to run /opt/mapr/server/configure.sh on any nodes that are running the HBase RegionServer or HBase Master.
- If you are upgrading a node that is running the NFS service, use the mount command to determine whether the node has mounted MapR-FS through its own physical IP (not a VIP). If so, unmount any such mount points before beginning the upgrade process.
- If you are upgrading a node that is running the HBase RegionServer and the warden stop command does not return after five minutes, kill the HBase RegionServer process manually:
  1. Determine the process ID of the HBase RegionServer:

     cat /opt/mapr/logs/hbase-root-regionserver.pid

  2. Kill the HBase RegionServer using the following command, substituting the process ID from the previous step for the <PID> placeholder:

     kill -9 <PID>

After upgrading, you can optionally convert the cluster from running as root to a non-root user.

CentOS and Red Hat

Perform the following procedures:

- Preparing for upgrade - setting tcp_retries2 and moving the JobTracker
- Upgrading the cluster - installing the new versions of the packages
- Setting the version - manually updating the software configuration to reflect the correct version
- Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To prepare for upgrade:

Do not update the repository yet. You'll do that in the next section.

1. On all nodes, set tcp_retries2 to 5 (a sketch for applying this setting across all nodes at once follows this procedure):
   a. Add the following line to /etc/sysctl.conf:


net.ipv4.tcp_retries2 = 5

This step reduces TaskTracker failover time.

   b. Issue the following command:

sysctl -p

   c. Ensure that the setting has taken effect. Issue the following command, and make sure the output is 5:

cat /proc/sys/net/ipv4/tcp_retries2

2. If the JobTracker is running on any CLDB nodes, install replacement JobTrackers on non-CLDB nodes so that you can remove the JobTracker from any CLDB nodes:

   a. Install one replacement JobTracker on a non-CLDB node for each JobTracker that is on a CLDB node. Make sure to install the old version you are upgrading from. Run the following commands as root on each node where you are installing a JobTracker:

yum install mapr-jobtracker
/opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>]

   b. Determine which host is running the active JobTracker:

maprcli node list -filter '[svc==jobtracker]' -columns h

   c. Remove the JobTracker from any CLDB nodes where it is installed, making sure to remove the active JobTracker last. On each CLDB node that has the JobTracker installed, except for the active JobTracker, run the following commands:

yum remove mapr-jobtracker
/opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>]

   d. Determine which host is now running the active JobTracker:

maprcli node list -filter '[svc==jobtracker]' -columns h
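Step 1 must be repeated on every node in the cluster. As a convenience, here is a minimal hedged sketch that pushes the setting out over SSH; it assumes passwordless SSH as root and a hypothetical node list in /tmp/nodes.txt, and it blindly appends to /etc/sysctl.conf, so run it only once per node:

# Apply net.ipv4.tcp_retries2 = 5 on every node listed in /tmp/nodes.txt
for host in $(cat /tmp/nodes.txt); do
  ssh root@"$host" "echo 'net.ipv4.tcp_retries2 = 5' >> /etc/sysctl.conf && sysctl -p"
done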

To upgrade the cluster:

Remember to upgrade the active JobTracker last.

On each node, saving the active JobTracker for last, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Add the MapR yum repository for the latest version of MapR software, removing any old versions. For more information, see Preparing Packages and Repositories.
4. Stop the warden:

service mapr-warden stop

5. If ZooKeeper is installed on the node, stop it:

service mapr-zookeeper stop

6. Upgrade the MapR packages with the following command:

yum upgrade 'mapr-*'


7. If ZooKeeper is installed on the node, start it:

service mapr-zookeeper start

8. Start the warden:

service mapr-warden start

After performing the procedure on all nodes, proceed to the steps Setting the Version and Updating the Configuration.

Ubuntu

Perform the following procedures:

- Preparing for upgrade - setting tcp_retries2 and moving the JobTracker
- Upgrading the cluster - installing the new versions of the packages
- Setting the version - manually updating the software configuration to reflect the correct version
- Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To prepare for upgrade:

Do not update the repository yet. You'll do that in the next section.

1. On all nodes, set tcp_retries2 to 5:
   a. Add the following line to /etc/sysctl.conf:

net.ipv4.tcp_retries2 = 5

   b. Issue the following command:

sysctl -p

   c. Ensure that the setting has taken effect. Issue the following command, and make sure the output is 5:

cat /proc/sys/net/ipv4/tcp_retries2

2. If the JobTracker is running on any CLDB nodes, install replacement JobTrackers on non-CLDB nodes so that you can remove the JobTracker from any CLDB nodes:

   a. Install one replacement JobTracker on a non-CLDB node for each JobTracker that is on a CLDB node. Make sure to install the old version you are upgrading from. Run the following commands as root on each node where you are installing a JobTracker:

apt-get install mapr-jobtracker
/opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>]

   b. Determine which host is running the active JobTracker:

maprcli node list -filter '[svc==jobtracker]' -columns h

   c. Remove the JobTracker from any CLDB nodes where it is installed, making sure to remove the active JobTracker last. On each CLDB node that has the JobTracker installed, except for the active JobTracker, run the following commands:

apt-get purge mapr-jobtracker
/opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>]

   d. Determine which host is now running the active JobTracker:

maprcli node list -filter '[svc==jobtracker]' -columns h


To upgrade the cluster:

The easiest way to upgrade all MapR packages on Ubuntu is to temporarily move the normal sources.list and replace it with a special sources.list that specifies only the MapR software repository, then use apt-get upgrade to upgrade all packages from the special sources.list file.

Remember to upgrade the active JobTracker last.

On each node, saving the active JobTracker for last, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Rename the normal /etc/apt/sources.list to prevent apt-get upgrade from reading packages from other repositories:

mv /etc/apt/sources.list /etc/apt/sources.list.orig

4. Create a new /etc/apt/sources.list file and add the MapR apt-get repository for the latest version of MapR software, removing any old versions. For more information, see Preparing Packages and Repositories.
5. Stop the warden:

service mapr-warden stop

6. If ZooKeeper is installed on the node, stop it:

service mapr-zookeeper stop

7. Clear the APT cache:

apt-get clean

8. Update the list of available packages:

apt-get update

9. Upgrade all MapR packages:

apt-get upgrade

10. Rename the special /etc/apt/sources.list and restore the original:

mv /etc/apt/sources.list /etc/apt/sources.list.mapr
mv /etc/apt/sources.list.orig /etc/apt/sources.list

11. If ZooKeeper is installed on the node, start it:

service mapr-zookeeper start

12. Start the warden:

service mapr-warden start

After performing the procedure on all nodes, proceed to the steps Setting the Version and Updating the Configuration.

SUSE

Perform the following procedures:

- Preparing for upgrade - setting tcp_retries2 and moving the JobTracker


- Upgrading the cluster - installing the new versions of the packages
- Setting the version - manually updating the software configuration to reflect the correct version
- Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To prepare for upgrade:

Do not update the repository yet. You'll do that in the next section.

1. On all nodes, set tcp_retries2 to 5:
   a. Add the following line to /etc/sysctl.conf:

net.ipv4.tcp_retries2 = 5

   b. Issue the following command:

sysctl -p

   c. Ensure that the setting has taken effect. Issue the following command, and make sure the output is 5:

cat /proc/sys/net/ipv4/tcp_retries2

2. If the JobTracker is running on any CLDB nodes, install replacement JobTrackers on non-CLDB nodes so that you can remove the JobTracker from any CLDB nodes:

   a. Install one replacement JobTracker on a non-CLDB node for each JobTracker that is on a CLDB node. Make sure to install the old version you are upgrading from. Run the following commands as root on each node where you are installing a JobTracker:

zypper install mapr-jobtracker
/opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>]

   b. Determine which host is running the active JobTracker:

maprcli node list -filter '[svc==jobtracker]' -columns h

   c. Remove the JobTracker from any CLDB nodes where it is installed, making sure to remove the active JobTracker last. On each CLDB node that has the JobTracker installed, except for the active JobTracker, run the following commands:

zypper remove mapr-jobtracker
/opt/mapr/server/configure.sh -C <CLDB nodes> -Z <Zookeeper nodes> [-N <cluster name>]

   d. Determine which host is now running the active JobTracker:

maprcli node list -filter '[svc==jobtracker]' -columns h

To upgrade the cluster:

Remember to upgrade the active JobTracker last.

On each node, saving the active JobTracker for last, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Add the MapR repository for the latest version of MapR software, removing any old versions. For more information, see Preparing Packages and Repositories.
4. Stop the warden:

service mapr-warden stop

5. If ZooKeeper is installed on the node, stop it:


service mapr-zookeeper stop

6. Upgrade the MapR packages with the following command:

zypper upgrade 'mapr-*'

7. If ZooKeeper is installed on the node, start it:

service mapr-zookeeper start

8. Start the warden:

service mapr-warden start

After performing the procedure on all nodes, proceed to the steps Setting the Version and Updating the Configuration.

Setting the Version

After completing the upgrade on all nodes, use the following steps on any node to set the correct version:

1. Check the software version by looking at the contents of the MapRBuildVersion file. Example:

$ cat /opt/mapr/MapRBuildVersion

2. Set the version accordingly using the config save command. Example:

maprcli config save -values {mapr.targetversion:"2.0.1.15869GA-1"}
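If you prefer not to retype the build string, here is a minimal sketch that reads the installed version and saves it in one step; it assumes the MapRBuildVersion file contains exactly the version string you want to set:

# Read the installed build version and set it as the target version
v=$(cat /opt/mapr/MapRBuildVersion)
maprcli config save -values "{mapr.targetversion:\"$v\"}"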

Updating the Configuration Files

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as the configuration directory.
4. To enable the new features, issue the following command:

maprcli config save -values {cldb.v2.features.enabled:1}
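For step 1, it can help to diff the old and new configuration directories to spot custom settings that still need to be carried over; a hedged example in which the hadoop-0.20.2 path is illustrative, so substitute your actual <version>:

diff -ru /opt/mapr/hadoop/hadoop-0.20.2/conf /opt/mapr/hadoop/hadoop-0.20.2/conf.new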


Rolling Upgrade

A rolling upgrade installs the latest version of MapR core software on all nodes in the cluster. Perform a rolling upgrade by running the rollingupgrade.sh script from a node in the cluster.

The rolling upgrade script upgrades the core packages on each node, logging output to the rolling upgrade log (/opt/mapr/logs/rollingupgrade.log). You must specify either a directory containing packages (using the -p option) or a version to fetch from the MapR repository (using the -v option). Here are a few tips:

- If you specify a local directory with the -p option, you must either ensure that the same directory containing the packages exists on all the nodes, or use the -x option to copy packages out to each node via SCP automatically (requires the -s option). If you use the -x option, the upgrade process copies the packages from the directory specified with -p into the same directory path on each node. For the path where you can download MapR software, see the Release Notes page.
- In a multi-cluster setting, use -c to specify which cluster to upgrade. If -c is not specified, the default cluster is upgraded.
- When specifying the version with the -v parameter, use the format x.y.z to specify the major, minor, and revision numbers of the target version. Example: 2.0.1
- The rpmrebuild package (Red Hat) or dpkg-repack package (Ubuntu) enables automatic rollback if the upgrade fails.

The upgrade script does not install the packages that are required for automatic rollback. To enable automatic rollback, install these packages before attempting the upgrade.

- Specify the -n option to the rollingupgrade.sh script to disable rollback on a failed upgrade.
- Installing a newer version of MapR software might introduce new package dependencies. Dependency packages must be installed on all nodes in the cluster in addition to the updated MapR packages. If you are upgrading using a package manager such as yum or apt-get, then the package manager on each node must have access to repositories for dependency packages. If installing from package files, you must pre-install dependencies on all nodes in the cluster prior to upgrading the MapR software. See Packages and Dependencies for MapR Software.

Starting in MapR version 2.0, the rolling upgrade script only upgrades MapR core packages, not any of the Hadoop ecosystem components. (See Packages and Dependencies for MapR Software for a list of the MapR packages and Hadoop ecosystem packages.) Follow the procedures in Manual Upgrade for Hadoop Ecosystem Components to upgrade your cluster's Hadoop ecosystem components.

There are two ways to perform a rolling upgrade:

- Via SSH - If keyless SSH is set up between all nodes, use the -s option to automatically upgrade all nodes without user intervention.
- Node by node - If SSH is not available, the script prepares the cluster for upgrade and guides the user through upgrading each node. In a node-by-node installation, you must individually run the commands to upgrade each node when instructed by the rollingupgrade.sh script.

Before upgrading the cluster, make sure that the following packages are installed on the appropriate nodes:

- On all Red Hat and CentOS nodes, rpmrebuild 2.4 or higher
- On all Ubuntu nodes, dpkg-repack

After upgrading your cluster to MapR 2.x, you can run MapR as a non-root user.

To determine whether or not the appropriate package is installed on each node, run the following command to see a list of all installed versions of the package:

On Red Hat and CentOS nodes:

rpm -qa | grep rpmrebuild

On Ubuntu nodes:

dpkg -l | grep dpkg-repack

Requirements

On the computer from which you will be starting the upgrade, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. If you are starting the upgrade from a computer that is not a MapR client or a MapR cluster node, you must add the MapR repository (see Preparing Packages and Repositories) and install mapr-core:


   - CentOS or Red Hat: yum install mapr-core
   - SUSE: zypper install mapr-core
   - Ubuntu: apt-get install mapr-core
3. Run configure.sh, using -C to specify the cluster CLDB nodes and -Z to specify the cluster ZooKeeper nodes. Example:

/opt/mapr/server/configure.sh -C 10.10.100.1,10.10.100.2,10.10.100.3 -Z 10.10.100.1,10.10.100.2,10.10.100.3

4. To enable a fully automatic rolling upgrade, ensure passwordless SSH is enabled to all nodes for the root user, from the computer on which the upgrade will be started.

On all nodes and the computer from which you will be starting the upgrade, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. Add the MapR software repository (see Preparing Packages and Repositories).
3. Install the rolling upgrade scripts:
   - CentOS or Red Hat: yum install mapr-upgrade
   - Ubuntu: apt-get install mapr-upgrade
4. Install the following packages to enable automatic rollback:
   - CentOS or Red Hat: yum install rpmrebuild
   - Ubuntu: apt-get install dpkg-repack
5. If you are planning to upgrade from downloaded packages instead of the repository, prepare a directory containing the package files. This directory should reside at the same absolute path on each node.

Upgrading the Cluster via SSH

On the node from which you will be starting the upgrade, issue the rollingupgrade.sh command as root (or with sudo) to upgrade the cluster:

- If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the <directory> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -s -p <directory>

- If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <version> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -s -v <version>
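For instance, a concrete invocation; the version number shown is illustrative, so substitute the target version for your cluster:

/opt/upgrade-mapr/rollingupgrade.sh -s -v 2.1.1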

Upgrading the Cluster Node by Node

On the node from which you will be starting the upgrade, use the rollingupgrade.sh command as root (or with sudo) to upgrade the cluster:

1. Start the upgrade:
   - If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the <directory> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -p <directory>

   - If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <version> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -v <version>

2. When prompted, run singlenodeupgrade.sh on all nodes other than the master CLDB node, following the on-screen instructions.
3. When prompted, run singlenodeupgrade.sh on the master CLDB node, following the on-screen instructions.

Enabling Central Configuration When Upgrading to MapR v2.0


When upgrading to version 2.0 from 1.2.x, existing configuration files are not overwritten by the new configuration files for v2.0. This gives administrators greater control of the upgrade process. Settings for new features in v2.0, including Central Configuration, must be manually migrated into configuration files.

Add the following lines to the file /opt/mapr/conf/warden.conf:

centralconfig.enabled=true
pollcentralconfig.interval.seconds=300


Disks

MapR-FS groups disks into storage pools, usually made up of two or three disks.

When adding disks to MapR-FS, it is a good idea to add at least two or three at a time so that MapR can create properly-sized storage pools. Each node in a MapR cluster can support up to 36 storage pools.

When you remove a disk from MapR-FS, any other disks in the storage pool are also removed from MapR-FS automatically; the disk you removed, as well as the others in its storage pool, are then available to be added to MapR-FS again. You can either replace the disk and re-add it along with the other disks that were in the storage pool, or just re-add the other disks if you do not plan to replace the disk you removed.

MapR maintains a list of disks used by MapR-FS in a file called disktab on each node.

The following sections provide procedures for working with disks:

- Adding Disks - adding disks for use by MapR-FS
- Removing Disks - removing disks from use by MapR-FS
- Handling Disk Failure - replacing a disk in case of failure
- Tolerating Slow Disks - increasing the disk timeout to handle slow disks

Before removing or replacing disks, make sure the Replication Alarm (VOLUME_ALARM_DATA_UNDER_REPLICATED) and Data Alarm (VOLUME_ALARM_DATA_UNAVAILABLE) are not raised. These alarms can indicate potential or actual data loss! If either alarm is raised, it may be necessary to attempt repair using fsck before removing or replacing disks.

Adding Disks

You can add one or more available disks to MapR-FS using the disk add command or the MapR Control System. In both cases, MapR automatically takes care of formatting the disks and creating storage pools.
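For example, a hedged command-line invocation using the disk add syntax shown under Handling Disk Failure below; the host IP and disk names are illustrative:

maprcli disk add -host 10.10.100.1 -disks /dev/sdc,/dev/sdd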

If you are running MapR 1.2.2 or earlier, do not use the disk add command or the MapR Control System to add disks to MapR-FS. You must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the disk add command):

1. Use the MapR Control System to remove the failed disk. All other disks in the same storage pool are removed at the same time. Make a note of which disks have been removed.
2. Create a text file /tmp/disks.txt containing a list of the disks you just removed. See Setting Up Disks for MapR.
3. Add the disks to MapR-FS by typing the following command (as root or with sudo):
   /opt/mapr/server/disksetup -F /tmp/disks.txt

To add disks using the MapR Control System:

1. Add physical disks to the node or nodes according to the correct hardware procedure.
2. In the Navigation pane, expand the Cluster group and click the Nodes view.
3. Click the name of the node on which you wish to add disks.
4. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to add.
5. Click Add Disks to MapR-FS to add the disks. Properly-sized storage pools are allocated automatically.

Removing Disks

You can remove one or more disks from MapR-FS using the disk remove command or the MapR Control System. When you remove disks from MapR-FS, any other disks in the same storage pool are also removed from MapR-FS and become available (not in use, and eligible to be re-added to MapR-FS).

- If you are removing and replacing failed disks, you can install the replacements, then re-add the replacement disks and the other disks that were in the same storage pool(s) as the failed disks.
- If you are removing disks but not replacing them, you can just re-add the other disks that were in the same storage pool(s) as the failed disks.
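A hedged command-line equivalent, using the disk remove syntax shown under Handling Disk Failure below; the host IP and disk name are illustrative:

maprcli disk remove -host 10.10.100.1 -disks /dev/sdc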

To remove disks using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Click the name of the node from which you wish to remove disks.
3. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to remove.
4. Click Remove Disks from MapR-FS to remove the disks from MapR-FS.
5. Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are taken offline and marked as available (not in use by MapR).


6. Remove the physical disks from the node or nodes according to the correct hardware procedure.

Handling Disk Failure

When disks fail, MapR raises an alarm and identifies which disks on which nodes have failed.

If a disk failure alarm is raised, check the Failure Reason field in /opt/mapr/logs/faileddisk.log to determine the reason for failure. There are two failure cases that may not require disk replacement:

- Failure Reason: Timeout - try increasing the mfs.io.disk.timeout in mfs.conf.
- Failure Reason: Disk GUID mismatch - if a node has restarted, the drive labels (sda, etc.) can be reassigned by the operating system, and will no longer match the entries in disktab. Edit disktab according to the instructions in the log to repair the problem.

To replace disks using the MapR command-line interface:

1. On the node with the failed disk(s), look at the Disk entries in /opt/mapr/logs/faileddisk.log to determine which disk or disks have failed.
2. Look at the Failure Reason entries to determine whether the disk(s) should be replaced.
3. Use the disk remove command to remove the disk(s). Use the following syntax, substituting the hostname or IP address for <host> and a list of disks for <disks>:

maprcli disk remove -host <host> -disks <disks>

4. Wait several minutes while the removal process completes.

5. Note any disks that appear in the output from fdisk -l but not in the disktab file (the disks from the same storage pool(s) as the failed disk(s), which have been removed from MapR-FS in the previous step). Replace the failed disks on the node or nodes according to the correct hardware procedure.
6. Use the disk add command to add the replacement disk(s) and the others that were in the same storage pool(s). Use the following syntax, substituting the hostname or IP address for <host> and a list of disks for <disks>:

maprcli disk add -host <host> -disks <disks>

Properly-sized storage pools are allocated automatically.
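Putting the procedure together, a hedged end-to-end sketch; the host IP and disk names are illustrative, with /dev/sdc standing in for the failed disk and /dev/sdd and /dev/sde for the other disks in its storage pool:

maprcli disk remove -host 10.10.100.1 -disks /dev/sdc
# ... replace the failed drive according to the hardware procedure, then:
maprcli disk add -host 10.10.100.1 -disks /dev/sdc,/dev/sdd,/dev/sde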

To replace disks using the MapR Control System:

1. Identify the failed disk or disks:
   a. In the Navigation pane, expand the Cluster group and click the Nodes view.
   b. Click the name of the node on which you wish to replace disks, and look in the MapR-FS and Available Disks pane.
2. Remove the failed disk or disks from MapR-FS:
   a. In the MapR-FS and Available Disks pane, select the checkboxes beside the failed disks.
   b. Click Remove Disks from MapR-FS to remove the disks from MapR-FS.
   c. Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are taken offline and marked as available (not in use by MapR).
3. Replace the failed disks on the node or nodes according to the correct hardware procedure.
4. Add the replacement and available disks to MapR-FS:
   a. In the Navigation pane, expand the Cluster group and click the Nodes view.
   b. Click the name of the node on which you replaced the disks.
   c. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to add.
   d. Click Add Disks to MapR-FS to add the disks. Properly-sized storage pools are allocated automatically.

Tolerating Slow Disks

The mfs.io.disk.timeout parameter in mfs.conf determines how long MapR waits for a disk to respond before assuming it has failed. If healthy disks are too slow, and are erroneously marked as failed, you can increase the value of this parameter.
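For example, a hedged edit; the mfs.conf file is typically found at /opt/mapr/conf/mfs.conf, and the value shown is purely illustrative, so check your current setting first and raise it incrementally:

grep mfs.io.disk.timeout /opt/mapr/conf/mfs.conf
# then raise the value by editing that line in mfs.conf, for example:
mfs.io.disk.timeout=60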


Working with a Logical Volume Manager

The Logical Volume Manager creates symbolic links to each logical volume's block device, from a directory path in the form /dev/<volume group>/<volume name>. MapR needs the actual block location, which you can find by using the ls -l command to list the symbolic links.

1. Make sure you have free, unmounted logical volumes for use by MapR:
   a. Unmount any mounted logical volumes that can be erased and used for MapR.
   b. Allocate any free space in an existing logical volume group to new logical volumes (for one way to do this, see the lvcreate sketch after this procedure).
2. Make a note of the volume group and volume name of each logical volume.
3. Use ls -l with the volume group and volume name to determine the path of each logical volume's block device. Each logical volume is a symbolic link to a logical block device from a directory path that uses the volume group and volume name: /dev/<volume group>/<volume name>. The following example shows output that represents a volume group named mapr containing logical volumes named mapr1, mapr2, mapr3, and mapr4:

# ls -l /dev/mapr/mapr*
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr1 -> /dev/mapper/mapr-mapr1
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr2 -> /dev/mapper/mapr-mapr2
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr3 -> /dev/mapper/mapr-mapr3
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr4 -> /dev/mapper/mapr-mapr4

4. Create a text file /tmp/disks.txt containing the paths to the block devices for the logical volumes (one path on each line). Example:

$ cat /tmp/disks.txt
/dev/mapper/mapr-mapr1
/dev/mapper/mapr-mapr2
/dev/mapper/mapr-mapr3
/dev/mapper/mapr-mapr4

5. Pass disks.txt to disksetup:

# sudo /opt/mapr/server/disksetup -F /tmp/disks.txt
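Step 1b above mentions allocating free space to new logical volumes; here is a minimal hedged sketch using the standard LVM tools, in which the volume group name, volume name, and size are all illustrative:

# Create a new 100 GB logical volume named mapr5 in the existing volume group mapr
lvcreate -L 100G -n mapr5 mapr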


Setting Up Disks for MapR

MapR formats and uses disks for the Lockless Storage Services layer (MapR-FS), recording these disks in the disktab file. In a production environment, or when testing performance, MapR should be configured to use physical hard drives and partitions. In some cases, it is necessary to reinstall the operating system on a node so that the physical hard drives are available for direct use by MapR. Reinstalling the operating system provides an unrestricted opportunity to configure the hard drives. If the installation procedure assigns hard drives to be managed by the Linux Logical Volume Manager (LVM) by default, you should explicitly remove from LVM configuration the drives you plan to use with MapR. It is common to let LVM manage one physical drive containing the operating system partition(s) and to leave the rest unmanaged by LVM for use with MapR.

To determine if a disk or partition is ready for use by MapR:

1. Run the command sudo lsof <partition> to determine whether any processes are already using the disk or partition.
2. There should be no output when running sudo fuser <partition>, indicating there is no process accessing the specific disk or partition.
3. The disk or partition should not be mounted, as checked via the output of the mount command.
4. The disk or partition should not have an entry in the /etc/fstab file.
5. The disk or partition should be accessible to standard Linux tools such as mkfs. You should be able to successfully format the partition using a command like sudo mkfs.ext3 <partition>, as this is similar to the operations MapR performs during installation. If mkfs fails to access and format the partition, then it is highly likely MapR will encounter the same problem.

Any disk or partition that passes the above testing procedure can be added to the list of disks and partitions passed to the disksetup command.
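A hedged helper that runs the checks above against one candidate partition; the partition path is illustrative, and each of the first four commands should produce no output:

p=/dev/sdc1
sudo lsof "$p"
sudo fuser "$p"
mount | grep "$p"
grep "$p" /etc/fstab
# The next command DESTROYS any data on $p; run it only on a partition dedicated to MapR
sudo mkfs.ext3 "$p"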

To specify disks or partitions for use by MapR:

The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:

/dev/sdb/dev/sdc1 /dev/sdc2 /dev/sdc4/dev/sdd

Later, when you run disksetup to format the disks, specify the disks.txt file. For example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.

If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.

Run disksetup only after running configure.sh.

To evaluate MapR using a flat storage file instead of formatting disks:

When setting up a small cluster for evaluation purposes, if a particular node does not have physical disks or partitions available to dedicate to the cluster, you can use a flat file on an existing disk partition as the node's storage. Create at least a 16 GB file, and include a path to the file in the disk list file for the disksetup script.

The following example creates a 20 GB flat file (bs=1G specifies 1-gigabyte blocks, multiplied by count=20) at /root/storagefile:

$ dd if=/dev/zero of=/root/storagefile bs=1G count=20

Then, you would add the following to the /tmp/disks.txt disk list file to be used by disksetup:

/root/storagefile



Dial Home

MapR provides a service called Dial Home, which automatically collects certain metrics about the cluster for use by support engineers and to help us improve and evolve our product. When you first install MapR, you are presented with the option to enable or disable Dial Home. We recommend enabling it. You can enable or disable Dial Home later, using the dialhome enable command.
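Hedged examples of the dialhome commands referenced above, assuming the maprcli dialhome subcommands available in this release:

maprcli dialhome status
maprcli dialhome enable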


Nodes

This page provides information about managing nodes in the cluster, including the following topics:

- Viewing a List of Nodes
- Adding a Node
- Managing Services
- Formatting Disks on a Node
- Removing a Node
- Decommissioning a Node
- Reconfiguring a Node
  - Stopping a Node
  - Installing or Removing Software or Hardware
  - Setting Up a Node
  - Starting the Node
- Renaming a Node

Viewing a List of Nodes

You can view all nodes using the node list command, or view them in the MapR Control System using the following procedure.

To view all nodes using the MapR Control System:

In the Navigation pane, expand the Cluster group and click the Nodes view.
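From the command line, the node list command shows the same information; for example, the -columns h shorthand used elsewhere in this document restricts the output to hostnames:

maprcli node list
maprcli node list -columns h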

Adding a Node

To Add Nodes to a Cluster

1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to run on the new nodes.
3. INSTALL MapR Software:
   - On all new nodes, ADD the MapR Repository.
   - On each new node, INSTALL the planned MapR services.
   - On all new nodes, RUN configure.sh.
   - On all new nodes, FORMAT disks for use by MapR.
   - If any configuration files on your existing cluster's nodes have been modified (for example, warden.conf or mapred-site.xml), replace the default configuration files on all new nodes with the appropriate modified files.
4. Start ZooKeeper on all new nodes that have ZooKeeper installed:

   /etc/init.d/mapr-zookeeper start

5. Start the warden on all new nodes:

   /etc/init.d/mapr-warden start

6. If any of the new nodes are CLDB and/or ZooKeeper nodes, RUN configure.sh on all new and existing nodes in the cluster, specifying all CLDB and ZooKeeper nodes.
7. SET UP node topology for the new nodes.
8. On any new nodes running NFS, SET UP NFS for HA.

Managing Services

You can manage node services using the node services command, or in the MapR Control System using the following procedure.

To manage node services using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside the node or nodes you wish to manage.
3. Click the Manage Services button to display the Manage Node Services dialog.
4. For each service you wish to start or stop, select the appropriate option from the corresponding drop-down menu.
5. Click Change Node to start and stop the services according to your selections.

You can also display the Manage Node Services dialog by clicking Manage Services in the Node Properties view.
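A hedged command-line equivalent using the node services command; the node name mapr-node1 is illustrative, and the per-service flag pattern follows the example later on this page:

maprcli node services -nodes mapr-node1 -tasktracker restart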


Formatting Disks on a Node

The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:

/dev/sdb/dev/sdc1 /dev/sdc2 /dev/sdc4/dev/sdd

Later, when you run disksetup to format the disks, specify the disks.txt file. For example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.

If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.

Removing a Node

You can remove a node using the node remove command, or in the MapR Control System using the following procedure. Removing a node detaches the node from the cluster, but does not remove the MapR software from the node.

To remove a node using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside the node or nodes you wish to remove.
3. Click Manage Services and stop all services on the node.
4. Wait 5 minutes. The Remove button becomes active.
5. Click the Remove button to display the Remove Node dialog.
6. Click Remove Node to remove the node.

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

You can also remove a node by clicking Remove Node in the Node Properties view.

Decommissioning a Node

Use the following procedures to remove a node and uninstall the MapR software. This procedure detaches the node from the cluster and removes the MapR packages, log files, and configuration files, but does not format the disks.

Before Decommissioning a Node
Make sure any data on the node is replicated and any needed services are running elsewhere. For example, if decommissioning the node would result in too few instances of the CLDB, start CLDB on another node beforehand; if you are decommissioning a ZooKeeper node, make sure you have enough ZooKeeper instances to meet a quorum after the node is removed. See Planning the Deployment for recommendations.

To decommission a node permanently:

Do not use this procedure to decommission multiple nodes concurrently.

1. Change to the root user (or use sudo for the following commands).
2. Stop the warden:
   /etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:


   /etc/init.d/mapr-zookeeper stop
4. Determine which MapR packages are installed on the node:

   - dpkg --list | grep mapr (Ubuntu)
   - rpm -qa | grep mapr (Red Hat or CentOS)

5. Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:
   - apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
   - yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)

6. Remove the /opt/mapr directory to remove any instances of hostid, hostname, zkdata, and zookeeper left behind by the package manager.
7. Remove any MapR cores in the /opt/cores directory.
8. If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run configure.sh on all other nodes in the cluster (see Configuring a Node).

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

Reconfiguring a Node

You can add, upgrade, or remove services on a node to perform a manual software upgrade or to change the roles a node serves. There are five steps to this procedure:

1. Stopping the Node
2. Formatting the Disks (optional)
3. Installing or Removing Software or Hardware
4. Configuring the Node
5. Starting the Node

This procedure is designed to make changes to existing MapR software on a machine that has already been set up as a MapR cluster node. If you need to install software for the first time on a machine to create a new node, please see Adding a Node instead.

Stopping a Node

1. Change to the root user (or use sudo for the following commands).
2. Stop the warden:
   /etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
   /etc/init.d/mapr-zookeeper stop

Installing or Removing Software or Hardware

Before installing or removing software or hardware, stop the node using the procedure described in Stopping the Node.

Once the node is stopped, you can add, upgrade, or remove software or hardware. At some point after adding or removing services, it is recommended to restart the warden, to re-optimize memory allocation among all the services on the node. It is not crucial to perform this step immediately; you can restart the warden at a time when the cluster is not busy.

To add or remove individual MapR packages, use the standard package management commands for your Linux distribution:

- apt-get (Ubuntu)
- yum (Red Hat or CentOS)

For information about the packages to install, see Planning the Deployment.

You can add or remove services from a node after it has been deployed in a cluster. This process involves installing or uninstalling packages on the node, and then updating the cluster to recognize the new role for this node.

Adding a service to an existing node:

The process of adding a service to a node is similar to the initial installation process for nodes. For further detail see Installing MapR Software.

1. Install the package(s) corresponding to the new role(s) using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you added the CLDB or ZooKeeper role, you must run configure.sh on all other nodes in the cluster.
4. If you added the fileserver role, run disksetup to format and prepare disks for use as storage.
5. Restart the warden:


% /etc/init.d/mapr-warden restart

When the warden restarts, it picks up the new configuration and starts the new services, making them active in the cluster.

Removing a service from an existing node:

1. Stop the service you want to remove from the MapR Control System (MCS) or with the maprcli command-line tool. The following example stops the HBase Master service:

% maprcli node services -hbmaster stop -nodes mapr-node1

2. Purge the service packages with the apt-get, yum, or zypper commands, as suitable for your operating system.
3. Run the configure.sh script with the -R option.
4. When you remove the CLDB or ZooKeeper role from a node, run configure.sh on all nodes in the cluster.

Setting Up a Node

Formatting the Disks

The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:

/dev/sdb/dev/sdc1 /dev/sdc2 /dev/sdc4/dev/sdd

Later, when you run disksetup to format the disks, specify the disks.txt file. For example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.

If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.

Configuring the Node

The configure.sh script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

- CLDB: 7222
- ZooKeeper: 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster


Starting the Node

1. If ZooKeeper is installed on the node, start it:
   /etc/init.d/mapr-zookeeper start
2. Start the warden:
   /etc/init.d/mapr-warden start

Renaming a Node

To rename a node:

1. Stop the warden on the node. Example:

/etc/init.d/mapr-warden stop

2. If the node is a ZooKeeper node, stop ZooKeeper on the node. Example:

/etc/init.d/mapr-zookeeper stop

3. Rename the host:
- On Red Hat or CentOS, edit the HOSTNAME parameter in the /etc/sysconfig/network file and restart the xinetd service or reboot the node.
- On Ubuntu, change the old hostname to the new hostname in the /etc/hostname and /etc/hosts files.

4. If the node is a ZooKeeper node or a CLDB node, run configure.sh with a list of CLDB and ZooKeeper nodes. See configure.sh.
5. If the node is a ZooKeeper node, start ZooKeeper on the node. Example:

/etc/init.d/mapr-zookeeper start

6. Start the warden on the node. Example:

/etc/init.d/mapr-warden start


Adding Nodes to a Cluster

To Add Nodes to a Cluster

1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to run on the new nodes.
3. INSTALL MapR Software:
- On all new nodes, ADD the MapR Repository.
- On each new node, INSTALL the planned MapR services.
- On all new nodes, RUN configure.sh.
- On all new nodes, FORMAT disks for use by MapR.
4. If any configuration files on your existing cluster's nodes have been modified (for example, warden.conf or mapred-site.xml), replace the default configuration files on all new nodes with the appropriate modified files.
5. Start ZooKeeper on all new nodes that have ZooKeeper installed:

/etc/init.d/mapr-zookeeper start

6. Start the warden on all new nodes:

/etc/init.d/mapr-warden start

7. If any of the new nodes are CLDB and/or ZooKeeper nodes, RUN configure.sh on all new and existing nodes in the cluster, specifying all CLDB and ZooKeeper nodes.
8. SET UP node topology for the new nodes.
9. On any new nodes running NFS, SET UP NFS for HA.


Managing Services on a Node

You can add or remove services from a node after it has been deployed in a cluster. This process involves installing or uninstalling packages onthe node, and then updating the cluster to recognize the new role for this node.

Adding a service to an existing node:

The process of adding a service to a node is similar to the initial installation process for nodes. For further detail, see Installing MapR Software.

1. Install the package(s) corresponding to the new role(s) using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you added the CLDB or ZooKeeper role, you must run configure.sh on all other nodes in the cluster.
4. If you added the fileserver role, run disksetup to format and prepare disks for use as storage.
5. Restart the warden:

% /etc/init.d/mapr-warden restart

When the warden restarts, it picks up the new configuration and starts the new services, making them active in the cluster.

Removing a service from an existing node:

1. Stop the service you want to remove from the MapR Control System (MCS) or with the maprcli command-line tool. The following example stops the HBase master service:

% maprcli node services -hbmaster stop -nodes mapr-node1

2. Purge the service packages with the apt-get, yum, or zypper commands, as suitable for your operating system.
3. Run the configure.sh script with the -R option.
4. When you remove the CLDB or ZooKeeper role from a node, run configure.sh on all nodes in the cluster.


Node Topology

Your node topology describes the locations of nodes and racks in a cluster to the MapR system. The MapR software uses node topology to determine the location of replicated copies of data. Optimally defined cluster topology results in data being replicated to separate racks, resulting in continued data availability in the event of rack or node failure.

Define your cluster's topology by specifying a topology path for each node in the cluster. The paths group nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data.

Topology paths can be as simple or complex as needed to correspond to your cluster layout. In a simple cluster, each topology path might consist of the rack only (for example, /rack-1). In a deployment consisting of multiple large datacenters, each topology path can be much longer (for example, /europe/uk/london/datacenter2/room4/row22/rack5). MapR uses topology paths to spread out replicated copies of data, placing each copy on a separate path. By setting each path to correspond to a physical rack, you can ensure that replicated data is distributed across racks to improve fault tolerance.

After you have defined node topology for the nodes in your cluster, you can use volume topology to place volumes on specific racks, nodes, or groups of nodes. See Setting Volume Topology for more information.

Recommended Node Topology

The node topology described in this section enables you to gracefully migrate data off a node in order to decommission the node for replacement or maintenance while avoiding data under-replication.

Establish a /data topology path to serve as the default topology path for the volumes in that cluster. Establish a /decommissioned topology path that is not assigned to any volumes.

When you need to migrate a data volume off a particular node, move that node from the /data path to the /decommissioned path. Since no data volumes are assigned to that topology path, standard data replication will migrate the data off that node to other nodes that are still in the /data topology path.
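For example, a node could be moved into the decommissioned topology with a command along the following lines (the server ID shown is hypothetical; obtain real IDs with maprcli node list):

maprcli node move -serverids 6475182753920016590 -topology /decommissioned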

You can run the following command to check if a given volume is present on a specified node:

maprcli dump volumenodes -volumename <volume> -json | grep <ip:port>

Run this command for each non-local volume in your cluster. Once all the data has migrated off the node, you can decommission the node or place it in maintenance mode.

If you need to segregate CLDB data, create a /cldb topology node and move the CLDB nodes under /cldb. Point the topology for the CLDB volume (mapr.cldb.internal) to /cldb. See Isolating CLDB Nodes for details.

Setting Node Topology Manually

You can specify a topology path for one or more nodes using the node topo command, or in the MapR Control System using the following procedure.

To set node topology using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside each node whose topology you wish to set.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. Set the path in the New Path field:

- To define a new path, type a topology path. Topology paths must begin with a forward slash ('/').
- To use a path you have already defined, select it from the dropdown.

5. Click Move Node to set the new topology.

Setting Node Topology with a Script

For large clusters, you can specify complex topologies in a text file or by using a script. Each line in the text file or script output specifies a single node and the full topology path for that node in the following format:

<ip or hostname> <topology>

The text file or script must be specified and available on the local filesystem on all CLDB nodes:

- To set topology with a text file, set net.topology.table.file.name in /opt/mapr/conf/cldb.conf to the text file name.
- To set topology with a script, set net.topology.script.file.name in /opt/mapr/conf/cldb.conf to the script file name.

If you specify a script and a text file, the MapR system uses the topology specified by the script.
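As an illustration, a minimal topology text file might look like the following (the hostnames and rack paths are hypothetical):

10.10.50.252 /rack-1
10.10.50.253 /rack-1
10.10.50.254 /rack-2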


Isolating CLDB Nodes

In a large cluster (100 nodes or more) create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
- PREPARE the node, making sure it meets the requirements.
- ADD the MapR Repository.

2. CREATE a roles file for the node that lists only the following packages:
mapr-cldb
mapr-webserver
mapr-core
mapr-fileserver

3. INSTALL the services to your node.

To set up a volume topology that restricts the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:
maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology. Use the MapR Control System or the following command:
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly using the MapR Control System or the following command:
maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:
maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:
maprcli volume move -name <volume> -topology /defaultRack
All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.


Isolating ZooKeeper Nodes

For large clusters (100 nodes or more), isolate the ZooKeeper on nodes that do not perform any other function. Isolating the ZooKeeper node enables the node to perform its functions without competing for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages.

Do not install the FileServer package on an isolated ZooKeeper node in order to prevent MapR from using this node for data storage.

To set up a ZooKeeper-only node:

1. SET UP the node as usual:
- PREPARE the node, making sure it meets the requirements.
- ADD the MapR Repository.

2. INSTALL the following packages to the node:
mapr-zookeeper
mapr-zk-internal
mapr-core


Removing Roles

To remove roles from an existing node:

1. Purge the packages corresponding to the roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have removed the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.

The warden picks up the new configuration automatically. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden restart

Removing the mapr-filesystem role requires additional steps. Refer to Removing the Filesystem Role.


Removing the Filesystem Role

Removing the mapr-filesystem role from a node is more complex than removing other roles. The CLDB tracks data precisely on all filesystem nodes, and therefore you should direct the cluster CLDB to stop tracking the node before removing the mapr-filesystem role.

For a planned decommissioning of a node, use node topologies to migrate data off the node before removing the mapr-filesystem role. For example, you could move the node out of a live /active topology into an /offline topology that has no volumes assigned to it, in order to force data off the node. Otherwise, some data will be under-replicated as soon as the node is removed. Refer to Node Topology.

The following procedure involves halting all MapR services on the node temporarily. If this will disrupt critical services on your cluster, such as CLDB or JobTracker, migrate those services to a different node first, and then proceed.

To remove the mapr-filesystem role from a node:

1. Stop the warden, which will halt all MapR services on the node.
2. Wait 5 minutes, after which the CLDB will mark the node as critical.
3. Remove the node from the cluster, to direct the CLDB to stop tracking this node. You can do this in the MapR Control System GUI or with the maprcli node remove command.
4. Remove the mapr-fileserver role by deleting the /opt/mapr/roles/fileserver file on the node.
5. Run configure.sh on the node to reconfigure the node without the mapr-fileserver role.
6. Start the warden.
7. Remove any volumes that were stored locally on the node. You can do this in the MapR Control System GUI or with the maprcli volume remove command.

For example:

/opt/mapr # /etc/init.d/mapr-warden stop
...wait 5 minutes for CLDB to recognize the node is down...
/opt/mapr # maprcli node remove 10.10.80.61
/opt/mapr # rm /opt/mapr/roles/fileserver
/opt/mapr # /opt/mapr/server/configure.sh -R
/opt/mapr # /etc/init.d/mapr-warden start
/opt/mapr # maprcli volume remove -name mapr.mapr-desktop.local.logs
/opt/mapr # maprcli volume remove -name mapr.mapr-desktop.local.mapred
/opt/mapr # maprcli volume remove -name mapr.mapr-desktop.local.metrics


Services

Viewing Services on the Cluster

You can view services on the cluster using the dashboard info command, or using the MapR Control System. In the MapR Control System, the running services on the cluster are displayed in the Services pane of the Dashboard.

To view the running services on the cluster using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster Views pane and click Dashboard.

Viewing Services on a Node

You can view services on a single node using the service list command, or using the MapR Control System. In the MapR Control System, the running services on a node are displayed in the Node Properties View.

To view the running services on a node using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster Views pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.

Starting Services

You can start services using the node services command, or using the MapR Control System.
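For example, from the command line you might start the NFS service on a node as follows, mirroring the stop example shown earlier (the hostname is hypothetical):

% maprcli node services -nfs start -nodes mapr-node1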

To start specific services on a node using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster Views pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.
4. Click the checkbox next to each service you would like to start, and click Start Service.

Stopping Services

You can stop services using the node services command, or using the MapR Control System.

To stop specific services on a node using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster Views pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.
4. Click the checkbox next to each service you would like to stop, and click Stop Service.

Adding Services

Services determine which roles a node fulfills. You can view a list of the roles configured for a given node by listing the /opt/mapr/roles directory on the node. To add roles to a node, you must install the corresponding services.
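For example, listing that directory on a node shows its current roles (the output below is illustrative):

$ ls /opt/mapr/roles
cldb fileserver tasktracker webserver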



Changing the User for MapR Services

All services should be run with the same UID/GID on all nodes in the cluster. To fix the NODE_ALARM_MAPRUSER_MISMATCH alarm, perform the following steps on the node for which the alarm is raised.

To run MapR services as the root user:

1. Stop the warden:

/etc/init.d/mapr-warden stop

2. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

3. Run the script $INSTALL_DIR/server/config-mapr-user.sh -u root
4. If ZooKeeper is installed, start it:

/etc/init.d/mapr-zookeeper start

5. Start the warden:

/etc/init.d/mapr-warden start

To run MapR services as a non-root user:

1. Stop the warden:

/etc/init.d/mapr-warden stop

2. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

3. If the MAPR_USER does not exist, create the user/group with the same UID and GID.
4. If the MAPR_USER exists, verify that the UID of MAPR_USER is the same as the value on the CLDB node.
5. Run $INSTALL_DIR/server/config-mapr-user.sh -u MAPR_USER
6. If ZooKeeper is installed, start it:

/etc/init.d/mapr-zookeeper start

7. Start the warden:

/etc/init.d/mapr-warden start

After clearing NODE_ALARM_MAPRUSER_MISMATCH alarms on all nodes, run $INSTALL_DIR/server/upgrade2mapruser.sh on all nodes wherever the alarm is raised.


Failover

The CLDB service automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If the CLDB process dies, it is automatically restarted on the node. All jobs and processes wait for the CLDB to return, and resume from where they left off, with no data or job loss.

If the node itself fails, the CLDB data is still safe, and the cluster can continue normally as soon as the CLDB is started on another node. In an M5-licensed cluster, a failed CLDB node automatically fails over to another CLDB node without user intervention and without data loss. It is possible to recover from a failed CLDB node on an M3 cluster, but the procedure is somewhat different.

Recovering from a Failed CLDB Node on an M3 Cluster

To recover from a failed CLDB node, perform the steps listed below:

1. Restore ZooKeeper - if necessary, install ZooKeeper on an additional node.
2. Locate the CLDB data - locate the nodes where replicates of CLDB data are stored, and choose one to serve as the new CLDB node.
3. Stop the selected node - stop the node you have chosen, to prepare for installing the CLDB service.
4. Install the CLDB on the selected node - install the CLDB service on the new CLDB node.
5. Configure the selected node - run configure.sh to inform the CLDB node where the CLDB and ZooKeeper services are running.
6. Start the selected node - start the new CLDB node.
7. Restart all nodes - stop each node in the cluster, run configure.sh on it, and start it.

After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the config save command to set the cldb.replication.manager.start.mins parameter.
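A minimal sketch of changing this parameter from the command line, assuming a delay of 30 minutes:

maprcli config save -values '{"cldb.replication.manager.start.mins":"30"}'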

Restore ZooKeeper

If the CLDB node that failed was also running ZooKeeper, install ZooKeeper on another node to maintain the minimum required number of ZooKeeper nodes.

Locate the CLDB Data

After restoring the ZooKeeper service on the M3 cluster, use the maprcli dump zkinfo command to identify the latest epoch of the CLDB, identify the nodes where replicates of the CLDB are stored, and select one of those nodes to serve as the new CLDB node.

Perform the following steps on any cluster node:

1. Log in as root or use sudo for the following commands.
2. Issue the maprcli dump zkinfo command using the -json flag:

# maprcli dump zkinfo -json

The output displays the ZooKeeper znodes.

3. In the /datacenter/controlnodes/cldb/epoch/1 directory, locate the CLDB with the latest epoch.

{ :" Container ID:1"/datacenter/controlnodes/cldb/epoch/1/KvStoreContainerInfo" VolumeId:1 Master:10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Servers: 10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Inactive Servers: UnusedServers: Latest epoch:13"}

The Latest Epoch field identifies the current epoch of the CLDB data. In this example, the latest epoch is 13.

Select a CLDB from among the copies at the latest epoch. For example, 10.250.2.41:5660--13-VALID indicates that the node has a copy at epoch 13 (the latest epoch).

You can now install a new CLDB on the selected node.

Stop the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:


/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop

Install the CLDB on the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. Log in as root or use sudo for the following commands.
2. Install the CLDB service on the node:

RHEL/CentOS: yum install mapr-cldb
Ubuntu: apt-get install mapr-cldb

3. Wait until the failover delay expires. If you try to start the CLDB before the failover delay expires, the following message appears:

CLDB HA check failed: not licensed, failover denied: elapsed time since last failure=<time in minutes> minutes

Configure the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

The configure.sh script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

Start the Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. If ZooKeeper is installed on the node, start it:
/etc/init.d/mapr-zookeeper start
2. Start the Warden:
/etc/init.d/mapr-warden start

Restart All Nodes

On all nodes in the cluster, perform the following procedures:

Stop the node:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop

Configure the node with the new CLDB and ZooKeeper addresses:

The configure.sh script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222


ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

Start the node:

1. If ZooKeeper is installed on the node, start it:
/etc/init.d/mapr-zookeeper start
2. Start the Warden:
/etc/init.d/mapr-warden start


TaskTracker Blacklisting

In the event that a TaskTracker is not performing properly, it can be blacklisted so that no jobs will be scheduled to run on it. There are two types of TaskTracker blacklisting:

- Per-job blacklisting, which prevents scheduling new tasks from a particular job
- Cluster-wide blacklisting, which prevents scheduling new tasks from all jobs

Per-Job Blacklisting

The mapred.max.tracker.failures configuration value in mapred-site.xml specifies a number of task failures in a specific job after which the TaskTracker is blacklisted for that job. The TaskTracker can still accept tasks from other jobs, as long as it is not blacklisted cluster-wide (see below).

A job can only blacklist up to 25% of TaskTrackers in the cluster.
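As a sketch, this property is set in mapred-site.xml like any other Hadoop configuration value (the value 4 below is illustrative, not a recommended default):

<property>
  <name>mapred.max.tracker.failures</name>
  <value>4</value>
</property>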

Cluster-Wide Blacklisting

A TaskTracker can be blacklisted cluster-wide for any of the following reasons:

- The number of blacklists from successful jobs (the fault count) exceeds mapred.max.tracker.blacklists
- The TaskTracker has been manually blacklisted using hadoop job -blacklist-tracker <host>
- The status of the TaskTracker (as reported by a user-provided health-check script) is not healthy

If a TaskTracker is blacklisted, any currently running tasks are allowed to finish, but no further tasks are scheduled. If a TaskTracker has been blacklisted due to mapred.max.tracker.blacklists or using the hadoop job -blacklist-tracker <host> command, un-blacklisting requires a TaskTracker restart.

Only 50% of the TaskTrackers in a cluster can be blacklisted at any one time.

After 24 hours, the TaskTracker is automatically removed from the blacklist and can accept jobs again.

Blacklisting a TaskTracker Manually

To blacklist a TaskTracker manually, run the following command as the administrative user:

hadoop job -blacklist-tracker <hostname>

Manually blacklisting a TaskTracker prevents additional tasks from being scheduled on the TaskTracker. Any currently running tasks are allowed to finish.

Un-blacklisting a TaskTracker Manually

If a TaskTracker is blacklisted per job, you can un-blacklist it by running the following command as the administrative user:

hadoop job -unblacklist <jobid> <hostname>

If a TaskTracker has been blacklisted cluster-wide due to mapred.max.tracker.blacklists or using the hadoop job -blacklist-tracker <host> command, un-blacklisting requires a TaskTracker restart. If a TaskTracker has been blacklisted cluster-wide due to a non-healthy status, correct the indicated problem and run the health check script again. When the script picks up the healthy status, the TaskTracker is un-blacklisted.


Assigning Services to Nodes for Best Performance

The architecture of MapR software allows virtually any service to run on any node, or nodes, to provide a high-availability, high-performance cluster. Below are some guidelines to help plan your cluster's service layout.

Don't Overload the ZooKeeper

High latency on a ZooKeeper node can lead to an increased incidence of ZooKeeper quorum failures. A ZooKeeper quorum failure occurs when the cluster finds too few copies of the ZooKeeper service running. If the ZooKeeper node is also running other services, competition for computing resources can lead to increased latency for that node. If your cluster experiences issues relating to ZooKeeper quorum failures, consider reducing or eliminating the number of other services running on the ZooKeeper node.

Reduce TaskTracker Slots Where Necessary

Monitor the server load on the nodes in your cluster that are running high-demand services such as ZooKeeper or CLDB. If the TaskTracker service is running on nodes that also run a high-demand service, you can reduce the number of task slots provided by the TaskTracker service. Tune the number of task slots according to the acceptable load levels for nodes in your cluster.

Separate High-Demand Services

The following are guidelines about which services to separate on large clusters:

- JobTracker on ZooKeeper nodes: Avoid running the JobTracker service on nodes that are running the ZooKeeper service. On large clusters, the JobTracker service can consume significant resources.
- MySQL on CLDB nodes: Avoid running the MySQL server that supports the MapR Metrics service on a CLDB node. Consider running the MySQL server on a machine external to the cluster to prevent the MySQL server's resource needs from affecting services on the cluster.
- TaskTracker on CLDB or ZooKeeper nodes: When the TaskTracker service is running on a node that is also running the CLDB or ZooKeeper services, consider reducing the number of task slots that this node's instance of the TaskTracker service provides. See Tuning Your MapR Install.
- Webserver on CLDB nodes: Avoid running the webserver on CLDB nodes. Queries to the MapR Metrics service can impose a bandwidth load that reduces CLDB performance.
- JobTracker on large clusters: Run the JobTracker service on a dedicated node for clusters with over 250 nodes.


Startup and Shutdown

To safely shut down and restart an entire cluster, preserving all data and full replication, you must follow a specific sequence that stops writes so that the cluster does not shut down in the middle of an operation:

1. Shut down the NFS service everywhere it is running.
2. Shut down the CLDB nodes.
3. Shut down all remaining nodes.

This procedure ensures that on restart the data is replicated and synchronized, so that there is no single point of failure for any data.

To shut down the cluster:

1. Change to the root user (or use sudo for the following commands).
2. Before shutting down the cluster, you will need a list of NFS nodes, CLDB nodes, and all remaining nodes. Once the CLDB is shut down, you cannot retrieve a list of nodes; it is important to obtain this information at the beginning of the process. Use the node list command as follows:

Determine which nodes are running the NFS gateway. Example:

/opt/mapr/bin/maprcli node list -filter -columns id,h,hn,svc, rp"[rp==/*]and[svc==nfs]"id service hostname health ip 6475182753920016590 fileserver,tasktracker,nfs,hoststats node-252.cluster.us 0 10.10.50.252 8077173244974255917 tasktracker,cldb,fileserver,nfs,hoststats node-253.cluster.us 0 10.10.50.253 5323478955232132984 webserver,cldb,fileserver,nfs,hoststats,jobtracker node-254.cluster.us 0 10.10.50.254

Determine which nodes are running the CLDB. Example:

/opt/mapr/bin/maprcli node list -filter -columns id,h,hn,svc, rp"[rp==/*]and[svc==cldb]"

List all non-CLDB nodes. Example:

/opt/mapr/bin/maprcli node list -filter -columns id,h,hn,svc, rp"[rp==/*]and[svc!=cldb]"

3. Shut down all NFS instances. Example:

/opt/mapr/bin/maprcli node services -nfs stop -nodes node-252.cluster.us,node-253.cluster.us,node-254.cluster.us

4. SSH into each CLDB node and stop the warden. Example:

/etc/init.d/mapr-warden stop

5. SSH into each of the remaining nodes and stop the warden. Example:

/etc/init.d/mapr-warden stop

If desired, you can shut down the nodes using the Linux halt command.

To start up the cluster:

1. If the cluster nodes are not running, start them.
2. Change to the root user (or use sudo for the following commands).
3. Start ZooKeeper on nodes where it is installed. Example:


/etc/init.d/mapr-zookeeper start

4. On all nodes, start the warden. Example:

/etc/init.d/mapr-warden start

5. Over a period of time (depending on the cluster size and other factors) the cluster comes up automatically. After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the config save command to set the cldb.replication.manager.start.mins parameter.


Uninstalling MapR

To re-purpose machines, you may wish to remove nodes and uninstall MapR software.

Removing Nodes from a Cluster

To remove nodes from a cluster: first uninstall the desired nodes, then run configure.sh on the remaining nodes. Finally, if you are using Ganglia, restart all gmeta and gmon daemons in the cluster.

To uninstall a node:

On each node you want to uninstall, perform the following steps:

Do not use this procedure to decommission multiple nodes concurrently.

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:
/etc/init.d/mapr-warden stop
3. If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop
4. Determine which MapR packages are installed on the node:

dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)

5. Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:
apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)

6. Remove the /opt/mapr directory to remove any instances of hostid, hostname, zkdata, and zookeeper left behind by the package manager.
7. Remove any MapR cores in the /opt/cores directory.
8. If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run configure.sh on all other nodes in the cluster (see Configuring a Node).

To reconfigure the cluster:

The configure.sh script configures a node to be part of a MapR cluster, or modifies services running on an existing node in the cluster. The script creates (or updates) configuration files related to the cluster and the services running on the node. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. You can optionally specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.


Users and Groups

Two users are important when installing and setting up the MapR cluster:

- root is used to install MapR software on each node
- The "MapR user" is the user that MapR services run as (typically named mapr or hadoop) on each node. The MapR user has full privileges to administer the cluster. Administrative privilege with varying levels of control can be assigned to other users as well.

Before installing MapR, decide on the name, user id (UID) and group id (GID) for the MapR user. The MapR user must exist on each node, and the user name, UID and primary GID must match on all nodes.

MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted to a specific amount of disk usage. For more information, see Managing Quotas.

By default, MapR gives the root user full administrative permissions. If the nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to another user after deployment. See Configuring the Cluster.

On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM Configuration.

To create a volume for a user or group:

1. In the Volumes view, click New Volume.
2. In the New Volume dialog, set the volume attributes:

- In Volume Setup, type a volume name. Make sure the Volume Type is set to Normal Volume.
- In Ownership & Permissions, set the volume owner and specify the users and groups who can perform actions on the volume.
- In Usage Tracking, set the accountable group or user, and set a quota or advisory quota if needed.
- In Replication & Snapshot Scheduling, set the replication factor and choose a snapshot schedule.

3. Click OK to save the settings.

See Managing Data with Volumes for more information. You can also create a volume using the volume create command.
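For example, a home volume for a user might be created from the command line as follows (the volume name, mount path, accountable entity, and quota are all hypothetical):

maprcli volume create -name home.jsmith -path /user/jsmith -ae jsmith -quota 100G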

You can see users and groups that own volumes in the User Disk Usage view or using the entity list command.


Managing Permissions

MapR manages permissions using two mechanisms:

- Cluster and volume permissions use access control lists (ACLs), which specify actions particular users are allowed to perform on a certain cluster or volume
- MapR-FS permissions control access to directories and files in a manner similar to Linux file permissions. To manage permissions, you must have fc permissions.

Cluster and Volume Permissions

Cluster and volume permissions use ACLs, which you can edit using the MapR Control System or the acl commands.

Cluster Permissions

The following table lists the actions a user can perform on a cluster, and the corresponding codes used in the cluster ACL.

Code    Allowed Action                                                                  Includes
login   Log in to the MapR Control System, use the API and command-line interface,
        read access on cluster and volumes                                              cv
ss      Start/stop services
cv      Create volumes
a       Admin access                                                                    All permissions except fc
fc      Full control (administrative access and permission to change the cluster ACL)   a

Setting Cluster Permissions

You can modify cluster permissions using the acl edit and acl set commands, or using the MapR Control System.
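For example, the following sketch grants a hypothetical user login and volume-creation permissions on the cluster:

maprcli acl edit -type cluster -user jsmith:login,cv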

To add cluster permissions using the MapR Control System:

1. Expand the System Settings Views group and click Permissions to display the Edit Permissions dialog.
2. Click [ + Add Permission ] to add a new row. Each row lets you assign permissions to a single user or group.
3. Type the name of the user or group in the empty text field:

- If you are adding permissions for a user, type u:<user>, replacing <user> with the username.
- If you are adding permissions for a group, type g:<group>, replacing <group> with the group name.

4. Click the Open Arrow to expand the Permissions dropdown.
5. Select the permissions you wish to grant to the user or group.
6. Click OK to save the changes.

To remove cluster permissions using the MapR Control System:

1. Expand the System Settings Views group and click Permissions to display the Edit Permissions dialog.
2. Remove the desired permissions:
- To remove all permissions for a user or group, click the delete button next to the corresponding row.
- To change the permissions for a user or group, click the Open Arrow to expand the Permissions dropdown, then unselect the permissions you wish to revoke from the user or group.
3. Click OK to save the changes.

Volume Permissions

The following table lists the actions a user can perform on a volume, and the corresponding codes used in the volume ACL.

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume


m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

To mount or unmount volumes under a directory, the user must have read/write permissions on the directory (see MapR-FS Permissions).

You can set volume permissions using the acl edit and acl set commands, or using the MapR Control System.
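For example, a sketch of granting a hypothetical user dump and modify permissions on a volume named test-volume:

maprcli acl edit -type volume -name test-volume -user jsmith:dump,m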

To add volume permissions using the MapR Control System:

1. Expand the MapR-FS group and click Volumes.
2. To create a new volume and set permissions, click New Volume to display the New Volume dialog. To edit permissions on an existing volume, click the volume name to display the Volume Properties dialog.

3. In the Permissions section, click [ + Add Permission ] to add a new row. Each row lets you assign permissions to a single user or group.
4. Type the name of the user or group in the empty text field:

- If you are adding permissions for a user, type u:<user>, replacing <user> with the username.
- If you are adding permissions for a group, type g:<group>, replacing <group> with the group name.

5. Click the Open Arrow to expand the Permissions dropdown.
6. Select the permissions you wish to grant to the user or group.
7. Click OK to save the changes.

To remove volume permissions using the MapR Control System:

1. Expand the MapR-FS group and click Volumes.
2. Click the volume name to display the Volume Properties dialog.
3. Remove the desired permissions:
- To remove all permissions for a user or group, click the delete button next to the corresponding row.
- To change the permissions for a user or group, click the Open Arrow to expand the Permissions dropdown, then unselect the permissions you wish to revoke from the user or group.
4. Click OK to save the changes.

MapR-FS Permissions

MapR-FS permissions are similar to the POSIX permissions model. Each file and directory is associated with a user (the owner) and a group. You can set read, write, and execute permissions separately for:

- The owner of the file or directory
- Members of the group associated with the file or directory
- All other users.

The permissions for a file or directory are called its mode. The mode of a file or directory can be expressed in two ways:

- Text - a string that indicates the presence of the read (r), write (w), and execute (x) permission or their absence (-) for the owner, group, and other users respectively. Example: rwxr-xr-x
- Octal - three octal digits (for the owner, group, and other users), that use individual bits to represent the three permissions. Example: 755

Both rwxr-xr-x and 755 represent the same mode: the owner has all permissions, and the group and other users have read and execute permissions only.

Text Modes

String modes are constructed from the characters in the following table.

Text Description

u The file's owner.

g The group associated with the file or directory.

o Other users (users that are not the owner, and not in the group).

a All (owner, group and others).


= Assigns the permissions. Example: "a=rw" sets read and write permissions and disables execution for all.

- Removes a specific permission. Example: "a-x" revokes execution permission from all users without changing read and write permissions.

+ Adds a specific permission. Example: "a+x" grants execution permission to all users without changing read and write permissions.

r Read permission

w Write permission

x Execute permission

Octal Modes

To construct each octal digit, add together the values for the permissions you wish to grant:

- Read: 4
- Write: 2
- Execute: 1
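For example, the octal mode 755 breaks down as: owner = 4+2+1 = 7 (read, write, and execute), group = 4+1 = 5 (read and execute), other users = 4+1 = 5 (read and execute), which matches the text mode rwxr-xr-x.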

Syntax

You can change the modes of directories and files in the MapR storage using either the hadoop fs command with the -chmod option, or using the chmod command via NFS. The syntax for both commands is similar:

hadoop fs -chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...]

chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...]

Parameters and Options

Parameter/Option Description

-R If specified, this option applies the new mode recursively throughout the directory structure.

MODE A string that specifies a mode.

OCTALMODE A three-digit octal number that specifies the new mode for the file or directory.

URI A relative or absolute path to the file or directory for which to change the mode.

Examples

The following examples are all equivalent:

chmod 755 script.sh

chmod u=rwx,g=rx,o=rx script.sh

chmod u=rwx,go=rx script.sh


Managing Quotas

Quotas limit the disk space used by a volume or an entity (user or group) on an M5-licensed cluster, by specifying the amount of disk space the volume or entity is allowed to use:

- A volume quota limits the space used by a volume.
- A user/group quota limits the space used by all volumes owned by a user or group.

Quotas are expressed as an integer value plus a single letter to represent the unit:

- B - bytes
- K - kilobytes
- M - megabytes
- G - gigabytes
- T - terabytes
- P - petabytes

Example: 500G specifies a 500 gigabyte quota.

If a volume or entity exceeds its quota, further disk writes are prevented and a corresponding alarm is raised:

- AE_ALARM_AEQUOTA_EXCEEDED - an entity exceeded its quota
- VOLUME_ALARM_QUOTA_EXCEEDED - a volume exceeded its quota

A quota that prevents writes above a certain threshold is also called a hard quota. In addition to the hard quota, you can also set an advisory quota for a user, group, or volume. An advisory quota does not enforce disk usage limits, but raises an alarm when it is exceeded:

- AE_ALARM_AEADVISORY_QUOTA_EXCEEDED - an entity exceeded its advisory quota
- VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED - a volume exceeded its advisory quota

In most cases, it is useful to set the advisory quota somewhat lower than the hard quota, to give advance warning that disk usage is approaching the allowed limit.

To manage quotas, you must have a or fc permissions.

Quota Defaults

You can set hard quota and advisory quota defaults for users and groups. When a user or group is created, the default quota and advisory quota apply unless overridden by specific quotas.

Setting Volume Quotas and Advisory Quotas

You can set a volume quota using the volume modify command, or use the following procedure to set a volume quota using the MapR Control System.
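For example, a sketch of setting both quotas on a hypothetical volume from the command line, using the same values as the procedure below:

maprcli volume modify -name test-volume -quota 500G -advisoryquota 250G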

To set a volume quota using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking the Properties button.
3. In the Usage Tracking section, select the Volume Quota checkbox and type a quota (value and unit) in the field. Example: 500G
4. To set the advisory quota, select the Volume Advisory Quota checkbox and type a quota (value and unit) in the field. Example: 250G
5. After setting the quota, click Modify Volume to save changes to the volume.

Setting User/Group Quotas and Advisory Quotas

You can set a user/group quota using the entity modify command, or use the following procedure to set a user/group quota using the MapR Control System.
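A minimal command-line sketch, assuming a hypothetical user jsmith and assuming -type 0 denotes a user entity (1 denotes a group):

maprcli entity modify -name jsmith -type 0 -quota 500G -advisoryquota 250G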

To set a user or group quota using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the User Disk Usage view.
2. Select the checkbox beside the user or group name for which you wish to set a quota, then click the Edit Properties button to display the User Properties dialog.
3. In the Usage Tracking section, select the User/Group Quota checkbox and type a quota (value and unit) in the field. Example: 500G
4. To set the advisory quota, select the User/Group Advisory Quota checkbox and type a quota (value and unit) in the field. Example: 250G
5. After setting the quota, click OK to save changes to the entity.


Setting Quota Defaults

You can set an entity quota using the entity modify command, or use the following procedure to set an entity quota using the MapR Control System.

To set quota defaults using the MapR Control System:

1. In the Navigation pane, expand the System Settings group.
2. Click the Quota Defaults view to display the Configure Quota Defaults dialog.
3. To set the user quota default, select the Default User Total Quota checkbox in the User Quota Defaults section, then type a quota (value and unit) in the field.
4. To set the user advisory quota default, select the Default User Advisory Quota checkbox in the User Quota Defaults section, then type a quota (value and unit) in the field.
5. To set the group quota default, select the Default Group Total Quota checkbox in the Group Quota Defaults section, then type a quota (value and unit) in the field.
6. To set the group advisory quota default, select the Default Group Advisory Quota checkbox in the Group Quota Defaults section, then type a quota (value and unit) in the field.
7. After setting the quotas, click Save to save the changes.


Security

This section provides information about managing security on a MapR cluster. Click a subtopic below for more detail.

- PAM Configuration
- Secured TaskTracker
- Subnet Whitelist


PAM Configuration

MapR uses Pluggable Authentication Modules (PAM) for user authentication in the MapR Control System. Make sure PAM is installed and configured on the node running the mapr-webserver.

There are typically several PAM modules (profiles), configurable via configuration files in the /etc/pam.d/ directory. Each standard UNIX program normally installs its own profile. MapR can use (but does not require) its own mapr-admin PAM profile. The MapR Control System webserver tries the following three profiles in order:

1. mapr-admin (expects that the user has created the /etc/pam.d/mapr-admin profile)
2. sudo (/etc/pam.d/sudo)
3. sshd (/etc/pam.d/sshd)

The profile configuration file (for example, /etc/pam.d/sudo) should contain an entry corresponding to the authentication scheme used by your system. For example, if you are using local OS authentication, check for the following entry:

auth sufficient pam_unix.so # For local OS Auth

Example: Configuring PAM with mapr-admin

Although there are several viable ways to configure PAM to work with the MapR UI, we recommend using the mapr-admin profile. The following example shows how to configure the /etc/pam.d/mapr-admin file. If LDAP is not configured, comment out the LDAP lines.

Example /etc/pam.d/mapr-admin file

account required pam_unix.so
account sufficient pam_succeed_if.so uid < 1000 quiet
account [default=bad success=ok user_unknown=ignore] pam_ldap.so
account required pam_permit.so

auth sufficient pam_unix.so nullok_secure
auth requisite pam_succeed_if.so uid >= 1000 quiet
auth sufficient pam_ldap.so use_first_pass
auth required pam_deny.so

password sufficient pam_unix.so md5 obscure min=4 max=8 nullok try_first_pass
password sufficient pam_ldap.so
password required pam_deny.so

session required pam_limits.so
session required pam_unix.so
session optional pam_ldap.so

The following sections provide information about configuring PAM to work with LDAP or Kerberos.

The /etc/pam.d/sudo file should be modified only with care and only when absolutely necessary.

LDAP

To configure PAM with LDAP:

1. Install the appropriate PAM packages:
   On Ubuntu: sudo apt-get install libpam-ldap
   On Redhat/Centos: sudo yum install pam_ldap

2. Open /etc/pam.d/sudo and check for the following line:

auth sufficient pam_ldap.so # For LDAP Auth


Kerberos

To configure PAM with Kerberos:

1. Install the appropriate PAM packages:
   On Redhat/Centos: sudo yum install pam_krb5
   On Ubuntu: sudo apt-get install libpam-krb5

2. Open /etc/pam.d/sudo and check for the following line:

auth sufficient pam_krb5.so # For kerberos Auth


Secured TaskTracker

You can control which users are able to submit jobs to the TaskTracker. By default, the TaskTracker is secured: all TaskTracker nodes should have the same user and group databases, and only users who are present on all TaskTracker nodes (with the same user ID on all nodes) can submit jobs. You can disallow certain users (including root or other superusers) from submitting jobs, or remove user restrictions from the TaskTracker completely. The relevant settings live in /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml and in taskcontroller.cfg.

To disallow root:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and set min.user.id=0 on all TaskTracker nodes.
3. Restart all TaskTrackers.

To disallow all superusers:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and set min.user.id=1000 on all TaskTracker nodes.
3. Restart all TaskTrackers.

To disallow specific users:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and add the banned.users parameter on all TaskTracker nodes, setting it to a comma-separated list of usernames. Example:

banned.users=foo,bar

3. Restart all TaskTrackers.

To remove all user restrictions and run all jobs as root:

1. Edit mapred-site.xml and set mapred.task.tracker.task.controller = org.apache.hadoop.mapred.DefaultTaskController on all TaskTracker nodes.

2. Restart all TaskTrackers.

When you make the above setting, all jobs submitted by any user will run as root, and will have the ability to overwrite, delete, or damage data regardless of ownership or permissions.
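For reference, the mapred-site.xml setting shared by the procedures above looks like the following when expressed as an XML property (a sketch; verify the property name against your installation's mapred-site.xml):

<property>
  <name>mapred.tasktracker.task-controller.config.overwrite</name>
  <value>false</value>
</property>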


Subnet Whitelist

To provide additional cluster security, you can limit cluster data access to a whitelist of trusted subnets. The mfs.subnets.whitelist parameter in mfs.conf accepts a comma-separated list of subnets in CIDR notation. If this parameter is set, the FileServer service only accepts requests from the specified subnets.
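For example, an /opt/mapr/conf/mfs.conf entry restricting FileServer access to two trusted subnets might look like the following (the subnet addresses are placeholders):

mfs.subnets.whitelist=10.10.15.0/24,10.10.16.0/24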


Placing Jobs on Specified Nodes

You can run jobs on specified nodes or groups of nodes using label-based scheduling: assigning labels to various groups of nodes and then using the labels to specify where jobs run. The labels are mapped to nodes using the node labels file, a file stored in MapR-FS. When you run jobs, you can place them on specified nodes individually or at the queue level.

When using label-based job placement, you cannot use the Fair Scheduler with preemption or task prefetch. For details on prefetch, see the mapreduce.tasktracker.prefetch.maptasks parameter on the mapred-site.xml page.

The Node Labels File

The node labels file defines labels for cluster nodes, to identify them for the purpose of specifying where to run jobs. Each line in the node labels file consists of an identifier that specifies one or more nodes, and one or more labels to apply to the specified nodes, separated by whitespace:

<identifier> <labels>

The identifier specifies nodes by matching the node names or IP addresses in one of two ways:
Unix-style glob, which supports the ? and * wildcards
Java regular expressions

The labels are a comma-delimited list of labels to apply to the nodes matched by the identifier. Labels containing whitespace should be enclosed in single or double quotation marks.

Sample node label file

The following example shows both glob identifiers and regular expression identifiers.

/perfnode200.*/ big, "Production Machines"
/perfnode203.*/ big, 'Development Machines'
perfnode15* good
perfnode201* slow
perfnode204* good, big

The file path and name are specified in the mapreduce.jobtracker.node.labels.file parameter in mapred-site.xml. If no file is specified, jobs can run on any nodes in the cluster. You can use hadoop job -showlabels to view the labels of all active nodes.

The mapreduce.jobtracker.node.labels.monitor.interval parameter in mapred-site.xml determines how often the JobTracker should poll the node label file for changes (the default is two minutes). You can also use hadoop job -refreshlabels to manually tell the JobTracker to re-load the node label file.
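As a concrete sketch, the two parameters might be set in mapred-site.xml as follows (the file path is a placeholder, and the interval value is assumed to be in milliseconds, here two minutes):

<property>
  <name>mapreduce.jobtracker.node.labels.file</name>
  <value>/myvolume/node.labels</value>
</property>
<property>
  <name>mapreduce.jobtracker.node.labels.monitor.interval</name>
  <value>120000</value>
</property>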

Placing Jobs

The placement of jobs on nodes or groups of nodes is controlled by labels applied to the queues to which jobs are submitted, or to the jobs themselves. A queue or job label is an expression that uses the logical operators OR, AND, NOT to specify the nodes described in the node labels file:

A queue label specifies the node or nodes that will run all jobs submitted to a queue.
A job label specifies the node or nodes that will run a job.

Job and queue labels specify nodes using labels corresponding to the node labels file, with the operators || (OR) and && (AND).

Examples:

"Production Machines" || good — selects nodes that are in either the "Production Machines" group or the good group.
'Development Machines' && good — selects nodes only if they are in both the 'Development Machines' group and the good group.

If a job is submitted with a label that does not include any nodes, the job will remain in the PREP state until nodes exist that meet the criteria (or until the job is killed). For example, in the node labels file above, there are no nodes in both the 'Development Machines' group and the good group. If a job is submitted with the label 'Development Machines' && good, it cannot execute until there are nodes that exist in both groups. If the node labels file is edited so that the 'Development Machines' group and the good group have nodes in common, the job will execute as soon as the JobTracker becomes aware of the change, either after the mapreduce.jobtracker.node.labels.monitor.interval or when you execute the hadoop job -refreshlabels command.


Queue Labels

Queue labels are defined using the mapred.queue.<queue-name>.label parameter in mapred-site.xml. The corresponding mapred.queue.<queue-name>.label.policy parameter specifies one of the following policies that determine the precedence of queue labels and job labels:

PREFER_QUEUE — always use the label set on the queue
PREFER_JOB — always use the label set on the job
AND (default) — job label AND node label
OR — job label OR node label

You can set a default queue policy using mapred.queue.default.label.

Example: Setting a policy on the default queue

The following excerpt from mapred-site.xml shows the PREFER_QUEUE policy set on the default queue.

<property>
  <name>mapred.queue.default.label</name>
  <value>big || good</value>
</property>
<property>
  <name>mapred.queue.default.label.policy</name>
  <value>PREFER_QUEUE</value>
</property>

Job Labels

There are three ways to set job labels:

Use set() from the Hadoop configuration API in your Java application. Example: conf.set("mapred.job.label", "Production Machines");
Pass the label with -Dmapred.job.label when running a job with hadoop jar
Set mapred.job.label in mapred-site.xml
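For the second option, a command-line sketch (the jar name, class name, and paths are placeholders, and the job is assumed to parse generic options via ToolRunner):

hadoop jar myjob.jar MyJobClass -Dmapred.job.label='good' /input /output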

Examples

The following examples show the job placement policy behavior in certain scenarios, using the sample node labels file above.

Job Label   Queue Label   Queue Policy   Outcome
good        big           PREFER_JOB     The job runs on nodes labeled "good" (hostnames match perfnode15* or perfnode204*)
good        big           PREFER_QUEUE   The job runs on nodes labeled "big" (hostnames match /perfnode200.*/ or /perfnode203.*/)
good        big           AND            The job runs on nodes only if they are labeled both "good" and "big" (hostnames match perfnode204*)
good        big           OR             The job runs on nodes if they are labeled either "good" or "big" (hostnames match /perfnode200.*/, /perfnode203.*/, perfnode15*, or perfnode204*)


Setting Up MapR NFS

The MapR NFS service lets you access data on a licensed MapR cluster via the NFS protocol:

M3 license: one NFS node allows you to access your cluster as a standard POSIX-compliant filesystem.
M5 license: multiple NFS servers allow each node to mount its own MapR-FS via NFS, with VIPs enabled for high availability (HA) and load balancing.

You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the Client.

Before You Start: NFS Setup Requirements

Make sure the following conditions are met before using the MapR NFS gateway:

The stock Linux NFS service must not be running. Linux NFS and MapR NFS cannot run concurrently.
The portmapper service must be running. You can use the command ps a | grep portmap to check.
The mapr-nfs package must be present and installed. You can list the contents of the /opt/mapr/roles directory to check for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Adding a License.
Make sure the MapR NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.
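A quick shell sketch of the portmapper and package checks named above (paths and service names are taken from the list; adjust for your distribution):

# verify the portmapper service is running
ps a | grep portmap
# verify the mapr-nfs role is installed on this node
ls /opt/mapr/roles | grep nfs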

NFS and Upgrading

Starting in MapR release 1.2.8, a change in the NFS file handle format makes NFS file handles incompatible between NFS servers running MapR version 1.2.7 or earlier and servers running MapR 1.2.8 and following.

NFS clients that were originally mounted to NFS servers on nodes running MapR version 1.2.7 or earlier must remount the file system when the node is upgraded to MapR version 1.2.8 or following.

When upgrading from MapR version 1.2.7 or earlier to version 1.2.8 or later:

1. Upgrade a subset of the existing NFS server nodes, or install the newer version of MapR on a set of new nodes.
2. If the selected NFS server nodes are using virtual IP numbers (VIPs), reassign those VIPs to other NFS server nodes that are still running the previous version of MapR.
3. Apply the upgrade to the selected set of NFS server nodes.
4. Start the NFS servers on nodes upgraded to the newer version.
5. Unmount the NFS clients from the NFS servers of the older version.
6. Remount the NFS clients on the upgraded NFS server nodes. Stage these remounts in groups of 100 or fewer clients to prevent performance disruptions.
7. After remounting all NFS clients, stop the NFS servers on nodes running the older version, then continue the upgrade process.

Due to changes in file handles between versions, cached file IDs cannot persist across this upgrade.

To preserve compatibility with 32-bit applications and system calls, MapR-NFS uses 32-bit inode numbers by default. On 64-bit clients, this default forces the client's 64-bit inode numbers to be hashed down to 32 bits. Hashing 64-bit inodes down to 32 bits can potentially cause inum conflicts. To change the default behavior to 64-bit inode numbers, set the value of the Use32BitFileId property to 0 in the nfsserver.conf file, then restart the NFS server.

NFS on an M3 Cluster

At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Managing Services on a Node to install the mapr-nfs package on the node where you would like to run NFS.

NFS on an M5 Cluster

At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of write throughput and 5Gbps of read throughput, here are a few ways to set up NFS:

12 NFS nodes, each of which has a single 1GbE connection
6 NFS nodes, each of which has a dual 1GbE connection
4 NFS nodes, each of which has a quad 1GbE connection


You can also set up NFS on all file server nodes to enable a self-mounted NFS point for each node. Self-mounted NFS for each node in a cluster enables you to run native applications as tasks. You can mount NFS on one or more dedicated gateways outside the cluster (using round-robin DNS or behind a hardware load balancer) to allow controlled access.

NFS and Virtual IP addresses

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also enable high availability (HA) NFS. In a HA NFS system, when an NFS node fails, data requests are satisfied by other NFS nodes in the pool. Use a minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should be in the same IP subnet as the interfaces to which they will be assigned. See Setting Up VIPs for NFS for details on enabling VIPs for your cluster.

Here are a few tips:

Set up NFS on at least three nodes if possible.
All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide high availability (HA) and failover.

To add the NFS service to a running cluster, use the instructions in Managing Services on a Node to install the mapr-nfs package on the nodes where you would like to run NFS.

NFS Memory Settings

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node. Example:

...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services unless you see specific memory-related problems occurring.

Running NFS on a Non-standard Port

To run NFS on an arbitrary port, modify the following line in warden.conf:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start

Add -p <portnumber> to the end of the line, as in the following example:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start -p 12345

After modifying warden.conf, restart the MapR NFS server by issuing the following command:

maprcli node services -nodes <nodename> -nfs restart

You can verify the port change with the rpcinfo -p localhost command.


MapR uses version 3 of the NFS protocol. NFS version 4 bypasses the port mapper and attempts to connect to the default port only. If you are running NFS on a non-standard port, mounts from NFS version 4 clients time out. Use the -o nfsvers=3 option to specify NFS version 3.
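For example, a client mount command forcing NFS version 3 might look like the following sketch (the server name and mount point are placeholders):

mount -o nfsvers=3 <nfs-server-node>:/mapr /mapr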


Disaster Recovery

It is a good idea to set up an automatic backup of the CLDB volume at regular intervals; in the event that all CLDB nodes fail, you can restore the CLDB from a backup. If you have more than one MapR cluster, you can back up the CLDB volume for each cluster onto the other clusters; otherwise, you can save the CLDB locally to external media such as a USB drive.

To back up a CLDB volume from a remote cluster:

1. Set up a cron job on the remote cluster to save the container information to a file by running the following command:
   /opt/mapr/bin/maprcli dump cldbnodes -zkconnect <IP:port of ZooKeeper leader> > <path to file>
2. Set up a cron job to copy the container information file to a volume on the local cluster.
3. Create a mirror volume on the local cluster, choosing the mapr.cldb.internal volume from the remote cluster as the source volume. Set the mirror sync schedule so that it will run at the same time as the cron job.

To back up a CLDB volume locally:

1. Set up a cron job to save the container information to a file on external media by running the following command:
   /opt/mapr/bin/maprcli dump cldbnodes -zkconnect <IP:port of ZooKeeper leader> > <path to file>
2. Set up a cron job to create a dump file of the local mapr.cldb.internal volume on external media. Example:
   /opt/mapr/bin/maprcli volume dump create -name mapr.cldb.internal -dumpfile <path_to_file>
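A sketch of the first step as a crontab entry (the ZooKeeper host and output path are placeholders; MapR's ZooKeeper conventionally listens on port 5181):

# run the CLDB container-location dump nightly at 02:30
30 2 * * * /opt/mapr/bin/maprcli dump cldbnodes -zkconnect zkhost:5181 > /media/usb/cldbnodes.dump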

For information about restoring from a backup of the CLDB, contact MapR Support.


Troubleshooting Cluster Administration

This section provides information about troubleshooting cluster administration problems. Click a subtopic below for more detail.

'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf
Out of Memory Troubleshooting


'ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils' in cldb.log, caused by mixed-case cluster name in mapr-clusters.conf

MapR cluster names are case sensitive. However, some versions of MapR v1.2.x have a bug in which the cluster names specified in /opt/mapr/conf/mapr-clusters.conf are not treated as case sensitive. If you have a cluster with a mixed-case name, after upgrading from v1.2 to v2.0+, you may experience CLDB errors (in particular for mirror volumes) which generate messages like the following in cldb.log:

2012-07-31 04:43:50,716 ERROR com.mapr.baseutils.cldbutils.CLDBRpcCommonUtils [VolumeMirrorThread]: Unable to reach cluster with name: qacluster1.2.9. No entry found in file /conf/mapr-clusters.conf for cluster qacluster1.2.9. Failing the CLDB RPC with status 133

(The path given in this message is relative to /opt/mapr/, which might be misleading.)

As a workaround after upgrading, to continue working with mirror volumes created in v1.2, duplicate any lines with upper-case letters in mapr-clusters.conf, converting all letters to lower case.

Mirror volumes created in v2.0+ do not exhibit this behavior.


Out of Memory Troubleshooting

When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems:

If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly.

If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side MapReduce configuration (see the sketch after this list).

If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes.
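For the heap-size adjustment above, one hedged sketch, assuming the stock Hadoop mapred.child.java.opts parameter governs task JVM options in your configuration (the 1024m value is a placeholder):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>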

To reduce the number of slots on a node:

1. Stop the TaskTracker service on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker stop

2. Edit the file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml:
   Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum
   Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum

3. Start the TaskTracker on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker start


Setting up a MapR Cluster on Amazon Elastic MapReduce

MapR offers an open, enterprise-grade distribution that makes Hadoop easier to use and more dependable. Combined with Amazon Elastic MapReduce's managed Hadoop environment, seamless integration with other AWS services, and hourly pricing with no upfront fees or long-term commitments, Amazon EMR with MapR offers customers a powerful tool for generating insights from their data. For more details on EMR with MapR, visit the EMR with the MapR Distribution for Hadoop detail page.

Starting an EMR Job Flow with the MapR Distribution for Hadoop from the AWS Management Console

1. Log in to your Amazon Web Services account: use your normal Amazon Web Services (AWS) credentials to log in to your AWS account.
2. From the AWS Management Console, select Elastic MapReduce, then select Create New Job Flow.
3. Select MapR M3 Edition or MapR M5 Edition from the Hadoop Version drop-down selector.

MapR M3 Edition is a complete Hadoop distribution that provides many unique capabilities such as industry-standard NFS and ODBC interfaces, end-to-end management, high reliability, and automatic compression. You can manage a MapR cluster via the AWS Management Console, the command line, or a REST API. Amazon EMR's standard rates include the full functionality of MapR M3 at no additional cost.
MapR M5 Edition expands the capabilities of M3 with enterprise-grade capabilities such as high availability, snapshots, and mirroring.
4. Continue to specify your job flow as described in Creating a Job Flow.

To start an interactive Pig session, select Pig program when you create the job flow, then select Start an Interactive Pig Session.

To start an interactive Hive session, select Hive program when you create the job flow, then select Start an Interactive Hive Session.

Amazon EMR with MapR provides a Debian environment with MapR software running on each node. MapR's NFS interface mounts the cluster on localhost at the /mapr directory. Packages for Hadoop ecosystem components are in the /home/hadoop/mapr-pkgs directory. The default ecosystem components for Amazon EMR clusters are HBase and Sqoop.

Starting an EMR Job Flow with the MapR Distribution for Hadoop from the Command Line Interface


To use the command line interface commands, download and install the Amazon Elastic MapReduce Ruby Client.

Use the --with-supported-products parameter with the elastic-mapreduce command to specify a MapR distribution:

Use mapr-m3 for MapR M3.
Use mapr-m5 for MapR M5.

Launching a job flow with MapR M3

The following command launches a job flow with one EC2 Large instance as a master that uses the MapR M3 Edition distribution.

./elastic-mapreduce --create --alive \
  --instance-type m1.xlarge \
  --num-instances 5 \
  --with-supported-products mapr-m3

See the linked article for more information about the elastic-mapreduce command's options.

To use MapR commands with a REST API, include the following mandatory parameters:

SupportedProducts.member.1=mapr-m3
bootstrap-action=s3://elasticmapreduce/thirdparty/mapr/scripts/mapr_emr_install.sh
args="--base-path,s3://elasticmapreduce/thirdparty/mapr/"

See the documentation for more information on how to interact with your EMR cluster using a REST API.

Configuring your MapR Job Flow

After your MapR job flow is running, you need to open a port to enable access to the MapR Control System (MCS). Follow these steps to open a port.

1. Select your job from the list of jobs displayed in Your Elastic MapReduce Job Flows in the Elastic MapReduce tab of the AWS Management Console, then select the Description tab in the lower pane. Make a note of the Master Public DNS Name value.
2. Click the Amazon EC2 tab in the AWS Management Console to open the Amazon EC2 Console Dashboard.
3. Select Security Groups from the Network & Security group in the Navigation pane at the left of the EC2 Console Dashboard.
4. Select Elastic MapReduce-master from the list displayed in Security Groups.
5. In the lower pane, click the Inbound tab.
6. In Port Range:, type 8453. Leave the default value in the Source: field.

The standard MapR port is 8443. Use port number 8453 instead of 8443 when you use the MapR REST API calls to a MapR on Amazon EMR cluster.

7. Click Add Rule, then click Apply Rule Changes.

You can now navigate to the master node's DNS address. Connect to port 8453 to log in to the MapR Control System. Use the string hadoop for both login and password at the MCS login screen.

Testing Your Cluster

Follow these steps to create a file and run your first MapReduce job:

1. Connect to the master node with SSH as user hadoop. Pass your .pem credentials file to ssh with the -i flag, as in this example:

ssh -i /path_to_pemfile/credentials.pem hadoop@<master-public-DNS>

2. Create a simple text file:


cd /mapr/my.cluster.com
mkdir in
echo "the quick brown fox jumps over the lazy dog" > in/data.txt

3. Run the following command to perform a word count on the text file:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount \
  /mapr/my.cluster.com/in /mapr/my.cluster.com/out

As the job runs, you should see terminal output similar to the following:

12/06/09 00:00:37 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-118-194-139.ec2.internal/10.118.194.139:9001
12/06/09 00:00:37 INFO input.FileInputFormat: Total input paths to process : 1
12/06/09 00:00:37 INFO mapred.JobClient: Running job: job_201206082332_0004
12/06/09 00:00:38 INFO mapred.JobClient: map 0% reduce 0%
12/06/09 00:00:50 INFO mapred.JobClient: map 100% reduce 0%
12/06/09 00:00:57 INFO mapred.JobClient: map 100% reduce 100%
12/06/09 00:00:58 INFO mapred.JobClient: Job complete: job_201206082332_0004
12/06/09 00:00:58 INFO mapred.JobClient: Counters: 25
12/06/09 00:00:58 INFO mapred.JobClient: Job Counters
12/06/09 00:00:58 INFO mapred.JobClient: Launched reduce tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of mappers(ms)=6193
12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/06/09 00:00:58 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/06/09 00:00:58 INFO mapred.JobClient: Launched map tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Data-local map tasks=1
12/06/09 00:00:58 INFO mapred.JobClient: Aggregate execution time of reducers(ms)=4875
12/06/09 00:00:58 INFO mapred.JobClient: FileSystemCounters
12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_READ=385
12/06/09 00:00:58 INFO mapred.JobClient: MAPRFS_BYTES_WRITTEN=276
12/06/09 00:00:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=94449
12/06/09 00:00:58 INFO mapred.JobClient: Map-Reduce Framework
12/06/09 00:00:58 INFO mapred.JobClient: Map input records=1
12/06/09 00:00:58 INFO mapred.JobClient: Reduce shuffle bytes=94
12/06/09 00:00:58 INFO mapred.JobClient: Spilled Records=16
12/06/09 00:00:58 INFO mapred.JobClient: Map output bytes=80
12/06/09 00:00:58 INFO mapred.JobClient: CPU_MILLISECONDS=1530
12/06/09 00:00:58 INFO mapred.JobClient: Combine input records=9
12/06/09 00:00:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
12/06/09 00:00:58 INFO mapred.JobClient: Reduce input records=8
12/06/09 00:00:58 INFO mapred.JobClient: Reduce input groups=8
12/06/09 00:00:58 INFO mapred.JobClient: Combine output records=8
12/06/09 00:00:58 INFO mapred.JobClient: PHYSICAL_MEMORY_BYTES=329244672
12/06/09 00:00:58 INFO mapred.JobClient: Reduce output records=8
12/06/09 00:00:58 INFO mapred.JobClient: VIRTUAL_MEMORY_BYTES=3252969472
12/06/09 00:00:58 INFO mapred.JobClient: Map output records=9
12/06/09 00:00:58 INFO mapred.JobClient: GC time elapsed (ms)=18

4. Check the /mapr/my.cluster.com/out directory for a file named part-r-00000 with the results of the job.

cat out/part-r-00000
brown 1
dog 1
fox 1
jumps 1
lazy 1
over 1
quick 1
the 2


Development Guide

Welcome to the MapR Development Guide! This guide is for Hadoop developers who create, manage, and optimize MapReduce jobs on a MapR cluster. The topics in this guide include tuning MapReduce settings; working with the MapR file system (MapR-FS); using NFS to access data; using Metrics to analyze jobs; and more.

The focus of the Development Guide is job management. See the Administration Guide for details of configuring cluster topology and services. See the Installation Guide for details on planning and installing a MapR cluster.

Click on one of the sub-sections below to get started.

Working with MapReduce
  Configuring MapReduce
  Compiling Pipes Programs
Working with MapR-FS
  Chunk Size
  Compression
Working with Data
  Accessing Data with NFS
  Copying Data from Apache Hadoop
  Data Protection
  Provisioning Applications
MapR Metrics and Job Performance
Troubleshooting Development Issues


Working with MapReduce

If you have used Hadoop in the past to run MapReduce jobs, then running jobs on the MapR Distribution for Apache Hadoop will be very familiar to you. MapR is a full Hadoop distribution, API-compatible with all versions of Hadoop. MapR provides additional capabilities not present in any other Hadoop distribution.

Click on one of the sub-sections below to get started.

Configuring MapReduce
  Job Scheduling
  Standalone Operation
  Tuning Your MapR Install

Compiling Pipes Programs


Configuring MapReduce

You can configure your MapR installation in a number of ways to address your specific cluster's needs.

This section contains information about the following topics:

Job Scheduling - Prioritize the MapReduce jobs that run on your MapR cluster
Standalone Operation - Running MapReduce jobs locally, using the local filesystem
Tuning Your MapR Install - Strategies for optimizing resources to meet the goals of your application


Job Scheduling

You can use job scheduling to prioritize the MapReduce jobs that run on your MapR cluster.

The MapReduce system supports a minimum of one queue, named default. Hence, this parameter's value should always contain the string default. Some job schedulers, like the Capacity Scheduler, support multiple queues.

The default job schedule is queue-based and uses FIFO (First In First Out) ordering. In a production environment with multiple users or groups that compete for cluster resources, consider using one of the multiuser schedulers available in MapR: the Fair Scheduler or the Capacity Scheduler.

MapR Hadoop supports the following job schedulers:

FIFO queue-based scheduler: This is the default scheduler. The FIFO queue scheduler runs jobs based on the order in which the jobs were submitted. You can prioritize a job by changing the value of the mapred.job.priority property or by calling the setJobPriority() method.
Fair Scheduler: The Fair Scheduler allocates a share of cluster capacity to each user over time. The design goal of the Fair Scheduler is to assign resources to jobs so that each job receives an equal share of resources over time. The Fair Scheduler enforces fair sharing within each pool. Running jobs share the pool's resources.
Capacity Scheduler: The Capacity Scheduler enables users or organizations to simulate an individual MapReduce cluster with FIFO scheduling for each user or organization. You can define organizations using queues.


The Capacity Scheduler

The Capacity Scheduler is a multi-user MapReduce job scheduler that enables users or organizations to simulate a dedicated MapReduce cluster with FIFO scheduling for each user or organization.

The Capacity Scheduler divides the cluster into multiple queues that identify distinct groups or organizations. The Capacity Scheduler allocates a fraction of the cluster's total capacity to each queue. When a job is submitted to a queue, the job is scheduled on a FIFO (First In First Out) basis.

Enabling the Capacity Scheduler

To enable the Capacity Scheduler, define the mapred.jobtracker.taskScheduler property in the mapred-default.xml file.

Property Value

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.CapacityTaskScheduler
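Expressed as an XML property, this setting would look like the following sketch (placed in mapred-default.xml, per the text above):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
</property>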

Configuring the Capacity Scheduler

Setting Up Queues

The Capacity Scheduler enables you to define multiple queues to which users and groups can submit jobs. Once queues are defined, users can submit jobs to a queue using the mapred.job.queue.name property in the job configuration.

To define multiple queues, modify the mapred.queue.names property in the mapred-site.xml file.

Property Description

mapred.queue.names Comma separated list of queues to which jobs can be submitted.
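A sketch of such a definition in mapred-site.xml (the research queue name is a placeholder; the default queue should remain in the list):

<property>
  <name>mapred.queue.names</name>
  <value>default,research</value>
</property>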

A separate configuration file may be used to configure properties for each of the queues managed by the scheduler. For more information, see Configuring Properties for Queues.

Setting Up Job Queue ACLs

The Capacity Scheduler enables you to define access control lists (ACLs) to control which users and groups can submit jobs to each queue, using configuration parameters of the form mapred.queue.<queue-name>.<acl-name>.

To enable and configure ACLs for the queue, define the following properties in the mapred-default.xml file.

Property Description

mapred.acls.enabled
If true, access control lists are supported and are checked whenever a job is submitted or administered.

mapred.queue.<queue-name>.acl-submit-job
Specifies a list of users and groups that can submit jobs to the specified queue-name. The comma-separated lists of users and groups are separated by a blank space. For example, user1,user2 group1,group2. To define a list of groups only, enter a blank space at the beginning of the group list.

mapred.queue.<queue-name>.acl-administer-job
Specifies a list of users and groups that can change the priority or kill jobs submitted to the specified queue-name. The comma-separated lists of users and groups are separated by a blank space. Example: user1,user2 group1,group2. To define a list of groups only, enter a blank space at the beginning of the group list. No matter the ACL, the job owner can always change the priority of or kill a job.
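For instance, enabling ACLs and restricting submissions to a hypothetical research queue might be written as the following sketch (user, group, and queue names are placeholders):

<property>
  <name>mapred.acls.enabled</name>
  <value>true</value>
</property>
<property>
  <name>mapred.queue.research.acl-submit-job</name>
  <value>user1,user2 group1,group2</value>
</property>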

Configuring Properties for Queues

The Capacity Scheduler enables you to configure queue-specific properties that determine how each queue is managed by the scheduler. All queue-specific properties are defined in the conf/capacity-scheduler.xml file. By default, a single queue named default is configured.

To specify a property for a queue that is defined in the site configuration, use the property name mapred.capacity-scheduler.queue.<queue-name>.<property-name>. For example, to define the guaranteed-capacity property for a queue named research, specify the property name as mapred.capacity-scheduler.queue.research.guaranteed-capacity.
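In capacity-scheduler.xml, that example might be written as the following sketch (the value 20 is a placeholder percentage):

<property>
  <name>mapred.capacity-scheduler.queue.research.guaranteed-capacity</name>
  <value>20</value>
</property>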

The properties defined for queues and their descriptions are listed in the table below:

Property Description

mapred.capacity-scheduler.queue.<queue-name>.guaranteed-capacity

Specifies the percentage of the slots in the cluster that are guaranteed to be available for jobs in this queue. The sum of the guaranteed capacities configured for all queues must be less than or equal to 100.


mapred.capacity-scheduler.queue.<queue-name>.reclaim-time-limit

Specifies the amount of time (in seconds) before which resources distributed to other queues will be reclaimed.

mapred.capacity-scheduler.queue.<queue-name>.supports-priority

If true, the priorities of jobs are taken into account in scheduling decisions. Jobs with a higher priority value are given access to the queue's resources before jobs with a lower priority value.

mapred.capacity-scheduler.queue.<queue-name>.minimum-user-limit-percent

Specifies a value that defines the maximum percentage of resources that can be allocated to a user at any given time. The minimum percentage of resources allocated depends on the number of users who have submitted jobs. For example, suppose a value of 25 is set for this property. If two users have submitted jobs to a queue, no single user can use more than 50% of the queue resources. If a third user submits a job, no single user can use more than 33% of the queue resources. With four or more users, no user can use more than 25% of the queue's resources. If a value of 100 is set, no user limits are imposed.

Memory Management

Job Initialization Parameters

The Capacity Scheduler initializes jobs before they are scheduled, and thereby reduces the memory footprint of the JobTracker. You can control the "laziness" of the job initialization by defining the following properties in the capacity-scheduler.xml file.

Property Description

mapred.capacity-scheduler.queue.<queue-name>.maximum-initialized-jobs-per-user

Specifies the maximum number of jobs that can be pre-initialized for a user in the queue. Once a job starts running, the scheduler no longer takes that job into consideration when it computes the maximum number of jobs each user is allowed to initialize.

mapred.capacity-scheduler.init-poll-interval Specifies the time (in milliseconds) used to poll the scheduler job queue for jobs to be initialized.

mapred.capacity-scheduler.init-worker-threads Specifies the number of worker threads used to initialize jobs in a set of queues. If the configured value is equal to the number of job queues, each thread is assigned jobs from a single queue. If the configured value is less than the number of queues, a single thread can receive jobs from more than one queue; the thread initializes the queues in a round-robin fashion. If the configured value is greater than the number of queues, the number of threads spawned is equal to the number of job queues.

Administering the Capacity Scheduler

Once installation and configuration are complete, you can review the setup from the admin UI after starting the cluster.

1. Start the Map/Reduce cluster as usual.
2. Open the JobTracker web UI.
3. The queues you have configured should be listed under the Scheduling Information section of the page.
4. The properties for the queues should be visible in the Scheduling Information column against each queue.


The Fair Scheduler

The Fair Scheduler is a multi-user MapReduce job scheduler that enables organizations to share a large cluster among multiple users and ensure that all jobs get roughly an equal share of CPU time.

The Fair Scheduler organizes jobs into pools and shares resources fairly across all pools. By default, each user is allocated a separate pool and, therefore, gets an equal share of the cluster no matter how many jobs they submit. Within each pool, fair sharing is used to share capacity between the running jobs. Pools can also be given weights to share the cluster non-proportionally in the config file.

Using the Fair Scheduler, you can define custom pools with guaranteed minimum capacities.

When using the Fair Scheduler with preemption, you must disable label-based job placement and task prefetch. For details on prefetch, see the mapreduce.tasktracker.prefetch.maptasks parameter on the mapred-site.xml page. For details on preemption, see the Apache Hadoop documentation on the Fair Scheduler.

Enabling the Fair Scheduler

To enable the Fair Scheduler in your MapR cluster, define the mapred.jobtracker.taskScheduler property in the mapred-site.xml file and set several Fair Scheduler properties in the same file.

1. Define the mapred.fairscheduler.allocation.file property as conf/pools.xml in the mapred-site.xml file.

<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>conf/pools.xml</value>
</property>

2. Define the mapred.jobtracker.taskScheduler property in the mapred-site.xml file.

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

3. Set the mapred.fairscheduler.assignmultiple property to true in the mapred-site.xml file.

<property>
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>

4. Set the mapred.fairscheduler.eventlog.enabled property to false in the mapred-site.xml file.

<property>
  <name>mapred.fairscheduler.eventlog.enabled</name>
  <value>false</value>
</property>

5. Restart the JobTracker, then check that the Fair Scheduler is running by going to http://<jobtracker URL>/scheduler in the JobTracker's web UI. For example, browse to http://localhost:50030/scheduler on a node running the JobTracker. For more information about the job scheduler administration page, see Administering the Fair Scheduler.

Configuring the Fair Scheduler

The following properties can be set in mapred-site.xml to configure the Fair Scheduler. Whenever you change Fair Scheduler properties, you must restart the JobTracker.

Property Description

mapred.fairscheduler.allocation.file Specifies the path to the XML file (conf/pools.xml) that contains the allocations for each pool, as well as the per-pool and per-user limits on the number of running jobs. If this property is not provided, allocations are not used.


mapred.fairscheduler.assignmultiple Allows the scheduler to assign both a map task and a reduce task on each heartbeat, which improves cluster throughput when there are many small tasks to run. A Boolean value; by default, false.

mapred.fairscheduler.sizebasedweight If true, the size of a job is taken into account in calculating its weight for fair sharing. The weight given to the job is proportional to the log of the number of tasks required. If false, the weight of a job is based entirely on its priority.

mapred.fairscheduler.poolnameproperty Specifies which jobconf property is used to determine the pool that a job belongs in. String; default: user.name (that is, one pool for each user). Some other useful values to set this to are:

group.name: to create a pool per Unix group.
mapred.job.queue.name: the same property as the queue name in the Capacity Scheduler.

mapred.fairscheduler.preemption Boolean property for enabling preemption. Default: false.
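The allocation file itself (conf/pools.xml) follows the stock Hadoop Fair Scheduler format. A minimal sketch, assuming the standard pool and user elements (pool name, user name, and limits are placeholders):

<?xml version="1.0"?>
<allocations>
  <!-- a pool with minimum slot guarantees, a job cap, and extra weight -->
  <pool name="research">
    <minMaps>4</minMaps>
    <minReduces>2</minReduces>
    <maxRunningJobs>10</maxRunningJobs>
    <weight>2.0</weight>
  </pool>
  <!-- a per-user running-job limit -->
  <user name="jsmith">
    <maxRunningJobs>3</maxRunningJobs>
  </user>
</allocations>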

The Fair Scheduler ExpressLane

MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

Property Value Description

mapred.fairscheduler.smalljob.schedule.enable true Enable small job fast scheduling inside the fair scheduler. TaskTrackers should reserve a slot, called an ephemeral slot, which is used for small jobs if the cluster is busy.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in a small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in a small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in a small job. Default is 1GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Max memory in mbytes reserved for an ephemeral slot. Default is 200mb. This value must be the same on JobTracker and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.

Fair Scheduler Extension Points

The Fair Scheduler offers several extension points through which the basic functionality can be extended. For example, the weight calculation can be modified to give a priority boost to new jobs, implementing a "shortest job first" policy which reduces response times for interactive jobs even further.

mapred.fairscheduler.weightadjuster An extensibility point that enables you to specify a class that adjusts the weights of running jobs. This class should implement the WeightAdjuster interface.

There is currently one example implementation: the NewJobWeightBooster, which increases the weight of jobs for the first five minutes of their lifetime to let short jobs finish faster. To use it, set the weightadjuster property to the full class name, org.apache.hadoop.mapred.NewJobWeightBooster. NewJobWeightBooster itself provides two parameters for setting the duration and boost factor.

mapred.newjobweightbooster.factor: Factor by which new jobs' weight should be boosted. Default is 3.
mapred.newjobweightbooster.duration: Duration in milliseconds; default 300000, for five minutes.


mapred.fairscheduler.loadmanager An extensibility point that enables you to specify a class that determines how many maps and reduces can run on a given TaskTracker. This class should implement the LoadManager interface. By default, the task caps in the Hadoop config file are used, but this option could be used to base the load on available memory and CPU utilization, for example.

mapred.fairscheduler.taskselector An extensibility point that enables you to specify a class that determines which task from within a job to launch on a given tracker. This can be used to change either the locality policy (for example, keep some jobs within a particular rack) or the speculative execution algorithm (select when to launch speculative tasks). By default, it uses Hadoop's default algorithms from JobInProgress.

Administering the Fair Scheduler

You can administer the Fair Scheduler at runtime using two mechanisms:

Allocation config file: It is possible to modify pools' allocations and user and pool running job limits at runtime by editing the allocation config file. The scheduler will reload this file 10-15 seconds after it sees that it was modified.
JobTracker web interface: Current jobs, pools, and fair shares can be examined through the JobTracker's web interface, at http://<jobtracker URL>/scheduler. For example, browse to http://localhost:50030/scheduler on a node running the JobTracker. In the web interface you can modify job priorities, move jobs between pools, and see the effects on the fair shares.

The following fields can be seen for each job on the web interface:

Fields Description

Submitted Shows the date and time the job was submitted.

JobID, User, Name Displays job identifiers as on the standard web UI.

Pool Shows the current pool of the job. Select another value to move the job to another pool.

Priority Shows the current priority of the job. Select another value to change the job's priority.

Maps/Reduces Finished Shows the number of tasks finished / total tasks.

Maps/Reduces Running Shows the tasks currently running.

Map/Reduce Fair Share Shows the average number of task slots that this job should have at any given time according to fair sharing. The actual number of tasks will go up and down depending on how much compute time the job has had, but on average it will get its fair share amount.

In the advanced web UI (navigate to http://<jobtracker URL>/scheduler?advanced), you can view four additional columns that display internal calculations:

Fields Description

Maps/Reduce Weight Shows the weight of the job in the fair sharing calculations. The weight of the job depends on its priority and optionally, if the sizebasedweight and newjobweightbooster properties are enabled, on its size and age.

Map/Reduce Deficit Shows the job's scheduling deficit in machine-seconds, that is, the amount of resources the job should have received according to its fair share, minus the amount it actually received. A positive value indicates the job will be scheduled again in the near future because it needs to catch up to its fair share. The Fair Scheduler schedules jobs with higher deficit ahead of others. See Fair Scheduler Implementation Details for details.

Fair Scheduler Implementation Details

There are two aspects to implementing fair scheduling: calculating each job's fair share, and choosing which job to run when a task slot becomes available.

To select jobs to run, the scheduler keeps track of a "deficit" for each job, which is the difference between the amount of compute time the job should have gotten on an ideal scheduler and the amount of compute time it actually got. This is a measure of how "unfair" the job's situation is. Every few hundred milliseconds, the scheduler updates the deficit of each job by comparing how many tasks the job had running during this interval with its fair share. Whenever a task slot becomes available, it is assigned to the job with the highest deficit. There is one exception: if there are one or more jobs not meeting their pool capacity guarantees, the scheduler chooses among only these "needy" jobs, based on their deficit, to ensure that the scheduler meets pool guarantees as soon as possible.

The fair shares are calculated by dividing the capacity of the cluster among runnable jobs according to a "weight" for each job. By default the weight is based on priority, with each level of priority having 2x higher weight than the next. (For example, VERY_HIGH has 4x the weight of NORMAL.) However, weights can also be based on job sizes and ages, as described in Configuring the Fair Scheduler. For jobs that are


in a pool, fair shares also take into account the minimum guarantee for that pool. This capacity is divided among the jobs in that pool according to their weights.

When limits on a user's running jobs or a pool's running jobs are in place, the scheduler chooses which jobs get to run by sorting all jobs, first in order of priority, and second in order of submit time, as in the standard Hadoop scheduler. Any jobs that fall after the user's or pool's limit in this ordering are queued up and wait until they can be run. During this time, they are left out of the fair sharing calculations and do not gain or lose deficit (that is, their fair share is set to zero).


Standalone Operation

You can run MapReduce jobs locally, using the local filesystem, by setting mapred.job.tracker=local in mapred-site.xml. With that parameter set, you can use the local filesystem for both input and output, read input from MapR-FS and write output to the local filesystem, or read input from the local filesystem and write output to MapR-FS.

Examples

Input and output on local filesystem

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Input from MapR-FS

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Output to MapR-FS

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input output 'dfs[a-z.]+'


Tuning Your MapR Install

MapR automatically tunes the cluster for most purposes. A service called the warden determines machine resources on nodes configured to run the TaskTracker service, and sets MapReduce parameters accordingly.

On nodes with multiple CPUs, MapR uses taskset to reserve CPUs for MapR services:

On nodes with five to eight CPUs, CPU 0 is reserved for MapR services
On nodes with nine or more CPUs, CPU 0 and CPU 1 are reserved for MapR services

In certain circumstances, you might wish to manually tune MapR to provide higher performance. For example, when running a job consisting of unusually large tasks, it is helpful to reduce the number of slots on each TaskTracker and adjust the Java heap size. The following sections provide MapReduce tuning tips. If you change any settings in mapred-site.xml, restart the TaskTracker.

NFS Write Performance

The kernel tunable sunrpc.tcp_slot_table_entries represents the number of simultaneous Remote Procedure Call (RPC) requests. This tunable's default value is 16. Increasing this value to 128 may improve write speeds. Use the command sysctl -w sunrpc.tcp_slot_table_entries=128 to set the value. Add an entry to your sysctl.conf file to make the setting persist across reboots.
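For example, a minimal sketch of both steps, assuming root privileges and a distribution that reads /etc/sysctl.conf at boot:

# Apply the new RPC slot table size immediately
sysctl -w sunrpc.tcp_slot_table_entries=128

# Persist the setting across reboots
echo "sunrpc.tcp_slot_table_entries = 128" >> /etc/sysctl.conf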

NFS write performance varies between different Linux distributions. This suggested change may have no effect, or even a negative effect, on your particular cluster.

Memory Settings

Memory for MapR Services

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate percent, max, and min parameters in the warden.conf file:

...
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
...

The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services, unless you see specific memory-related problems occurring.

MapReduce Memory

The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If necessary, you can use the mapreduce.tasktracker.reserved.physicalmemory.mb parameter to set the maximum physical memory reserved by MapReduce tasks, or you can set it to -1 to disable physical memory accounting and task management.

If the node runs out of memory, MapReduce tasks are killed by the OOM-killer to free memory. You can use mapred.child.oom_adj (copy it from mapred-default.xml) to adjust the oom_adj parameter for MapReduce tasks. The possible values of oom_adj range from -17 to +15. The higher the score, the more likely the associated process is to be killed by the OOM-killer.
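As a sketch, both parameters can be set in mapred-site.xml; the values below are illustrative, not recommendations:

<property>
  <!-- -1 disables physical memory accounting and task management -->
  <name>mapreduce.tasktracker.reserved.physicalmemory.mb</name>
  <value>-1</value>
</property>
<property>
  <!-- higher oom_adj scores are killed first under memory pressure -->
  <name>mapred.child.oom_adj</name>
  <value>10</value>
</property>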

Job Configuration

Map Tasks

Map tasks use memory mainly in two ways:

The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
The application consumes memory to run the map function.

MapReduce framework memory is controlled by io.sort.mb. If io.sort.mb is smaller than the data emitted from the mapper, the task ends up spilling data to disk. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. By default io.sort.mb is 100 MB. It should be approximately 1.25 times the number of data bytes emitted from the mapper. If you cannot resolve memory problems by adjusting io.sort.mb, then try to re-write the application to use less memory in its map function.
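For example, a mapper that emits roughly 200 MB of intermediate data suggests an io.sort.mb of about 1.25 x 200 = 250. A minimal mapred-site.xml sketch:

<property>
  <!-- ~1.25 x the bytes emitted by the mapper (illustrative value) -->
  <name>io.sort.mb</name>
  <value>250</value>
</property>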


Compression

To turn off MapR compression for map outputs, set mapreduce.maprfs.use.compression=false
To turn on LZO or any other compression, set mapreduce.maprfs.use.compression=false and mapred.compress.map.output=true
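As a sketch, the two settings combined in mapred-site.xml; the codec property in the last entry is an assumption that the hadoop-lzo package is installed and on the classpath:

<property>
  <name>mapreduce.maprfs.use.compression</name>
  <value>false</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <!-- assumes the hadoop-lzo codec is available -->
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>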

Reduce Tasks

If tasks fail because of an Out of Heap Space error, increase the heap space (the -Xmx option in mapred.reduce.child.java.opts) to give more memory to the tasks. If map tasks are failing, you can also try reducing io.sort.mb (see mapred.map.child.java.opts in mapred-site.xml).
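For instance, a sketch giving each reduce task a 2 GB heap (the value is illustrative):

<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>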

TaskTracker Configuration

MapR sets up map and reduce slots on each TaskTracker node using formulas based on the number of CPUs present on the node. The default formulas are stored in the following parameters in mapred-site.xml:

mapred.tasktracker.map.tasks.maximum: (CPUS > 2) ? (CPUS * 0.75) : 1 (at least one map slot, up to 0.75 times the number of CPUs)
mapred.tasktracker.reduce.tasks.maximum: (CPUS > 2) ? (CPUS * 0.50) : 1 (at least one reduce slot, up to 0.50 times the number of CPUs)

You can adjust the maximum number of map and reduce slots by editing the formulas used in mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. The following variables are used in the formulas:

CPUS - number of CPUs present on the node
DISKS - number of disks present on the node
MEM - memory reserved for MapReduce tasks

Ideally, the number of map and reduce slots should be decided based on the needs of the application. Map slots should be based on how many map tasks can fit in memory, and reduce slots should be based on the number of CPUs. If each task in a MapReduce job takes 3 GB, and each node has 9 GB reserved for MapReduce tasks, then the total number of map slots should be 3. The amount of data each map task must process also affects how many map slots should be configured. If each map task processes 256 MB (the default chunk size in MapR), then each map task should have 800 MB of memory. If there are 4 GB reserved for map tasks, then the number of map slots should be 4000 MB / 800 MB, or 5 slots.
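As a sketch, a custom formula goes directly into the parameter value in mapred-site.xml, in the same form as the defaults above; the ratios here are illustrative, and you should verify that your MapR version accepts them:

<property>
  <!-- one map slot per CPU on larger nodes -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>(CPUS > 2) ? (CPUS * 1.0) : 1</value>
</property>
<property>
  <!-- one reduce slot per four CPUs -->
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>(CPUS > 2) ? (CPUS * 0.25) : 1</value>
</property>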

MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes in advance of the availability of slots, creating a pipeline. This optimization allows the TaskTracker to launch each map task as soon as the previous running map task finishes. The number of tasks to over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the parameter mapreduce.tasktracker.prefetch.maptasks.


ExpressLane

MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

mapred.fairscheduler.smalljob.schedule.enable (default: true): Enable small job fast scheduling inside the fair scheduler. TaskTrackers should reserve a slot, called an ephemeral slot, which is used for small jobs if the cluster is busy.
mapred.fairscheduler.smalljob.max.maps (default: 10): Small job definition. Maximum number of maps allowed in a small job.
mapred.fairscheduler.smalljob.max.reducers (default: 10): Small job definition. Maximum number of reducers allowed in a small job.
mapred.fairscheduler.smalljob.max.inputsize (default: 10737418240): Small job definition. Maximum input size in bytes allowed for a small job. Default is 10 GB.
mapred.fairscheduler.smalljob.max.reducer.inputsize (default: 1073741824): Small job definition. Maximum estimated input size for a reducer allowed in a small job. Default is 1 GB per reducer.
mapred.cluster.ephemeral.tasks.memory.limit.mb (default: 200): Small job definition. Maximum memory in MB reserved for an ephemeral slot. Default is 200 MB. This value must be the same on JobTracker and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.
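For example, a sketch that tightens the small-job definition in mapred-site.xml so that only very small jobs use ephemeral slots (the values are illustrative):

<property>
  <name>mapred.fairscheduler.smalljob.max.maps</name>
  <value>5</value>
</property>
<property>
  <!-- 1 GB instead of the 10 GB default -->
  <name>mapred.fairscheduler.smalljob.max.inputsize</name>
  <value>1073741824</value>
</property>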



Compiling Pipes Programs

To facilitate running jobs on various platforms, MapR provides hadoop pipes, hadoop pipes util, and pipes-example sources.

When using pipes, all nodes must run the same distribution of the operating system. If you run different distributions (Red Hat and CentOS, for example) on nodes in the same cluster, the compiled application might run on some nodes but not others.

To compile the pipes example:

1. Install libssl on all nodes.
2. Set the LIBS environment variable as follows:
   export LIBS=-lcrypto
   This is needed for the time being, to fix errors in the configuration script.
3. Change to the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/utils directory, and execute the following commands:
   chmod +x configure
   ./configure # resolve any errors
   make install
4. Change to the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/pipes directory, and execute the following commands:
   chmod +x configure
   ./configure # resolve any errors
   make install
5. The APIs and libraries will be in the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/install directory.
6. Compile pipes-example:
   cd /opt/mapr/hadoop/hadoop-0.20.2/src/c++
   g++ pipes-example/impl/wordcount-simple.cc -Iinstall/include/ -Linstall/lib/ -lhadooputils -lhadooppipes -lssl -lcrypto -lpthread -o wc-simple

To run the pipes example:

1. Copy the pipes program into MapR-FS.
2. Run the hadoop pipes command:

hadoop pipes -Dhadoop.pipes.java.recordreader=true -Dhadoop.pipes.java.recordwriter=true -input <input-dir> -output <output-dir> -program <MapR-FS path to program>
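For instance, a run of the example compiled above might look like the following (all paths are hypothetical):

# Copy the compiled binary into MapR-FS, then run it
hadoop fs -put wc-simple /user/mapr/bin/wc-simple
hadoop pipes -Dhadoop.pipes.java.recordreader=true -Dhadoop.pipes.java.recordwriter=true -input /user/mapr/in -output /user/mapr/out -program /user/mapr/bin/wc-simple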


Working with MapR-FS


The mapr-clusters.conf file contains a list of the clusters that can be used by your application. The first cluster in the list is treated as the default cluster.

In core-site.xml, the fs.default.name parameter determines the default filesystem used by your application. Normally, this should be set to one of the following values:

maprfs:/// - resolves to the default cluster in mapr-clusters.conf
maprfs:///mapr/<cluster name>/ or /mapr/<cluster name>/ - resolves to the specified cluster

In general, the first two options (maprfs:/// and maprfs:///mapr/<cluster name>/) provide the most flexibility, because they are not tied to an IP address and will continue to function even if the IP address of the master CLDB changes (during failover, for example).
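For example, a minimal core-site.xml sketch that uses the default cluster from mapr-clusters.conf:

<property>
  <name>fs.default.name</name>
  <value>maprfs:///</value>
</property>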

Using Java to Interface with MapR-FS

In your Java application, you will use a Hadoop Configuration object to interface with MapR-FS. When you run your Java application, add the /opt/mapr/hadoop/hadoop-<version>/conf configuration directory to the Java classpath. When you instantiate a Configuration object, it is created with default values drawn from configuration files in that directory.

Sample Code

The following sample code shows how to interface with MapR-FS using Java. The example creates a directory, writes a file, then reads the contents of the file.

Compiling the sample code requires only the Hadoop core JAR:

javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java

Running the sample code uses the following library path:

java -Djava.library.path=/opt/mapr/lib -cp .:\
/opt/mapr/hadoop/hadoop-0.20.2/conf:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar \
MapRTest /test

Sample Code

/* Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved */

//package com.mapr.fs;

import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;

/**
 * Assumes MapR is installed in /opt/mapr.
 *
 * Compilation needs only the Hadoop core jar:
 * javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java
 *
 * Run:
 * java -Djava.library.path=/opt/mapr/lib -cp /opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:.:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar MapRTest /test
 */

public class MapRTest
{
    public static void main(String args[]) throws Exception {
        byte buf[] = new byte[65 * 1024];
        int ac = 0;

        if (args.length != 1) {
            System.out.println("usage: MapRTest pathname");
            return;
        }

        // maprfs:/// -> uses the first entry in /opt/mapr/conf/mapr-clusters.conf
        // maprfs:///mapr/my.cluster.com/
        // /mapr/my.cluster.com/

        // String uri = "maprfs:///";
        String dirname = args[ac++];

        Configuration conf = new Configuration();

        // FileSystem fs = FileSystem.get(URI.create(uri), conf); // if wanting to use a different cluster
        FileSystem fs = FileSystem.get(conf);

        Path dirpath = new Path(dirname + "/dir");
        Path wfilepath = new Path(dirname + "/file.w");
        // Path rfilepath = new Path(dirname + "/file.r");
        Path rfilepath = wfilepath;

        // try mkdir
        boolean res = fs.mkdirs(dirpath);
        if (!res) {
            System.out.println("mkdir failed, path: " + dirpath);
            return;
        }

        System.out.println("mkdir( " + dirpath + ") went ok, now writing file");

        // create wfile
        FSDataOutputStream ostr = fs.create(wfilepath,
            true,                 // overwrite
            512,                  // buffersize
            (short) 1,            // replication
            (long)(64*1024*1024)  // chunksize
        );
        ostr.write(buf);
        ostr.close();

        System.out.println("write( " + wfilepath + ") went ok");

        // read rfile
        System.out.println("reading file: " + rfilepath);
        FSDataInputStream istr = fs.open(rfilepath);
        int bb = istr.readInt();
        istr.close();
        System.out.println("Read ok");
    }
}


Using C to Interface with MapR-FS

MapR provides a modified version of libhdfs.so that supports access to both MapR-FS and HDFS. MapR-FS is API-compatible with HDFS; if you already have a client program built to use libhdfs.so, you do not have to relink your program just to access the MapR filesystem. However, re-linking to the MapR-specific shared library libMapRClient.so will give you better performance, because it does not make any Java calls to access the filesystem (unlike libhdfs.so):

If you will be using libMapRClient.so with a Java MapReduce application, then you must link your program to libjvm (see run1.sh, below).
If you will be using libMapRClient.so with a C/C++ client program (no Java involved), then you do not need to link to libjvm. In this case, use the following gcc option:

-Wl,--allow-shlib-undefined (see run2.sh)

The libhdfs.so library provides backward compatibility; if you need to access a distributed filesystem other than MapR-FS, you must link to libhdfs.so.

The APIs are defined in the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs/hdfs.h header file, which includes documentation for each API. Three sample programs are included in the same directory: hdfs_test.c, hdfs_write.c, and hdfs_read.c.

Finally, before running your program, some environment variables need to be set, depending on which option is chosen. For examples, look at run1.sh and run2.sh.

run1.sh

#!/bin/bash

# Ensure JAVA_HOME is defined
if [ "${JAVA_HOME}" = "" ]; then
  echo "JAVA_HOME not defined"
  exit 1
fi

# Setup environment
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mapr/lib:${JAVA_HOME}/jre/lib/amd64/server/
GCC_OPTS="-I. -I${HADOOP_HOME}/src/c++/libhdfs -I${JAVA_HOME}/include -I${JAVA_HOME}/include/linux -L${HADOOP_HOME}/c++/lib -L${JAVA_HOME}/jre/lib/amd64/server/ -L/opt/mapr/lib -lMapRClient -ljvm"

# Compile
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_test.c -o hdfs_test
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_read.c -o hdfs_read
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_write.c -o hdfs_write

# Run tests
./hdfs_test -m

run2.sh


#!/bin/bash

# Setup environment
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
GCC_OPTS="-Wl,--allow-shlib-undefined -I. -I${HADOOP_HOME}/src/c++/libhdfs -L/opt/mapr/lib -lMapRClient"
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/opt/mapr/lib

# Compile and Link
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_test.c -o hdfs_test
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_read.c -o hdfs_read
gcc ${GCC_OPTS} ${HADOOP_HOME}/src/c++/libhdfs/hdfs_write.c -o hdfs_write

# Run tests
./hdfs_test -m


Chunk Size

Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important:

Smaller chunk sizes result in larger numbers of map tasks, which can result in lower performance due to task scheduling overhead
Larger chunk sizes require more memory to sort the map task output, which can crash the JVM or add significant garbage collection overhead
MapR can deliver a single stream at upwards of 300 MB per second, making it possible to use larger chunks than in stock Hadoop. Generally, it is wise to set the chunk size between 64 MB and 256 MB.

Chunk size is set at the directory level. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use the chunk size specified by the settings for the directory where the file is written. If you change a directory's chunk size settings after writing a file, the file will keep the old chunk size settings. Further writes to the file will use the file's existing chunk size.

Setting Chunk Size

You can set the chunk size for a given directory in two ways:

Change the ChunkSize attribute in the .dfs_attributes file at the top level of the directory
Use the command hadoop mfs -setchunksize <size> <directory>

For example, if the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test, you can set the chunk size to 268,435,456 bytes by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting ChunkSize=268435456. To accomplish the same thing from the hadoop shell, use the following command:

hadoop mfs -setchunksize 268435456 /mapr/my.cluster.com/projects/test


Compression

MapR provides compression for files stored in the cluster. Compression is set at the directory level. Files inherit the compression settings of the directory that contains them, as do subdirectories on which compression has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. If you change a directory's compression settings after writing a file, the file will keep the old compression settings; that is, if you write a file in an uncompressed directory and then turn compression on, the file does not automatically end up compressed, and vice versa. Further writes to the file will use the file's existing compression setting.

Only the owner of a directory can change its compression settings or other attributes. Write permission is not sufficient.

By default, MapR does not compress files whose filename extension indicates they are already compressed. The default list of filename extensions is as follows:

bz2, gz, lzo, tgz, tbz2, zip, z, Z, mp3, jpg, jpeg, mpg, mpeg, avi, gif, png

The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. Example:

maprcli config save -values '{"mapr.fs.nocompression":"bz2,gz,lzo,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'

The list can be viewed with the config load command. Example:

maprcli config load -keys mapr.fs.nocompression

Setting Compression on Directories

You can turn compression on or off for a given directory in two ways:

Change the Compression attribute in the .dfs_attributes file at the top level of the directory
Use the command hadoop mfs -setcompression on|off <dir>

For example, if the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test, you can turn off compression by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting Compression=false. To accomplish the same thing from the hadoop shell, use the following command:

hadoop mfs -setcompression off /projects/test

You can view the compression settings for directories using the hadoop mfs -ls command.
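For example, checking the directory from the example above (the listing includes each entry's compression setting):

hadoop mfs -ls /projects/test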

Setting Compression During Shuffle

You can use the -Dmapreduce.maprfs.use.compression switch to turn compression off during the Shuffle phase of a MapReduce job. Example:


hadoop jar xxx.jar -Dmapreduce.maprfs.use.compression=false


Working with Data

This section contains information about working with data:

Copying Data from Apache Hadoop - using distcp to copy data to MapR from an Apache cluster
Data Protection - how to protect data from corruption or deletion
Accessing Data with NFS - how to mount the cluster via NFS
Managing Data with Volumes - using volumes to manage data
    Mirror Volumes - local or remote copies of volumes
    Schedules - scheduling for snapshots and mirrors
    Snapshots - point-in-time images of volumes



Accessing Data with NFS

Unlike other Hadoop distributions, which only allow cluster data import or export as a batch operation, MapR lets you mount the cluster itself via NFS so that your applications can read and write data directly. MapR allows direct file modification and multiple concurrent reads and writes via POSIX semantics. With an NFS-mounted cluster, you can read and write data directly with standard tools, applications, and scripts. For example, you could run a MapReduce job that outputs to a CSV file, then import the CSV file directly into SQL via NFS.

MapR exports each cluster as the directory /mapr/<cluster name> (for example, /mapr/my.cluster.com). If you create a mount point with the local path /mapr, then Hadoop FS paths and NFS paths to the cluster will be the same. This makes it easy to work on the same files via NFS and Hadoop. In a multi-cluster setting, the clusters share a single namespace, and you can see them all by mounting the top-level /mapr directory.

MapR uses version 3 of the NFS protocol. NFS version 4 bypasses the port mapper and attempts to connect to the default port only. If you are running NFS on a non-standard port, mounts from NFS version 4 clients time out.

Mounting the Cluster

Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount. Example:

usa-node01:/mapr - for mounting from the command line
nfs://usa-node01/mapr - for mounting from the Mac Finder

Make sure the client machine has the appropriate username and password to access the NFS share. For best results, the username and password for accessing the MapR cluster should be the same username and password used to log into the client machine.

Automatically mounting NFS to MapRFS on a Cluster

To automatically mount NFS to MapRFS on the cluster my.cluster.com at the /mapr2 mount point, add the following line to /opt/mapr/conf/mapr_fstab:

localhost:/mapr/my.cluster.com/user /mapr2 hard,nolock

Linux

1. Make sure the NFS client is installed. Examples:
   sudo yum install nfs-utils (Red Hat or CentOS)
   sudo apt-get install nfs-common (Ubuntu)
   sudo zypper install nfs-client (SUSE)
2. List the NFS shares exported on the server. Example:
   showmount -e usa-node01
3. Set up a mount point for an NFS share. Example:
   sudo mkdir /mapr
4. Mount the cluster via NFS. Example:
   sudo mount -o nolock usa-node01:/mapr /mapr

You can also add an NFS mount to /etc/fstab so that it mounts automatically when your system starts up. Example:

# device            mountpoint   fs-type   options   dump   fsckorder
...
usa-node01:/mapr    /mapr        nfs       rw        0      0
...

Mac

To mount the cluster from the Finder:

1. Open the Disk Utility: go to Applications > Utilities > Disk Utility.
2. Select File > NFS Mounts.
3. Click the + at the bottom of the NFS Mounts window.
4. In the dialog that appears, enter the following information:



   Remote NFS URL: The URL for the NFS mount. If you do not know the URL, use the showmount command described below. Example: nfs://usa-node01/mapr
   Mount location: The mount point where the NFS mount should appear in the local filesystem.
5. Click the triangle next to Advanced Mount Parameters.
6. Enter nolocks in the text field.
7. Click Verify.
8. Important: On the dialog that appears, click Don't Verify to skip the verification process.

The MapR cluster should now appear at the location you specified as the mount point.

To mount the cluster from the command line:

1. List the NFS shares exported on the server. Example:
   showmount -e usa-node01
2. Set up a mount point for an NFS share. Example:
   sudo mkdir /mapr
3. Mount the cluster via NFS. Example:
   sudo mount -o nolock usa-node01:/mapr /mapr

Windows

Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by creating an empty file or directory in the volume's root directory).

With Windows NFS clients, use the -o nolock option on the NFS server to prevent the Linux NLM from registering with the portmapper. The native Linux NLM conflicts with the MapR NFS server.

To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise:



1. Open Start > Control Panel > Programs.
2. Select Turn Windows features on or off.
3. Select Services for NFS.
4. Click OK.
5. Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
   mount -o nolock usa-node01:/mapr z:

To mount the cluster on other Windows versions:

1. Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
2. Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can map local Windows users to cluster Linux users, if desired.
3. Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
   mount -o nolock usa-node01:/mapr z:

To map a network drive with the Map Network Drive tool:

 

1. Open Start > My Computer.



2. Select Tools > Map Network Drive.
3. In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
4. Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field.
5. Browse for the MapR cluster or type the name of the folder to map. This name must follow UNC. Alternatively, click the Browse… button to find the correct folder by browsing available network shares.
6. Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
7. Click Finish.

Setting Compression and Chunk Size

Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these attributes, change the corresponding values in the file.

Example:

# lines beginning with # are treated as comments
Compression=lz4
ChunkSize=268435456

Valid values:

Compression: lz4, lzf, zlib, or false
Chunk size (in bytes): a multiple of 65,536 (64 KB) or zero (no chunks). Example: 131072

You can also set compression and chunk size using the hadoop mfs command.

By default, MapR does not compress files whose filename extension indicates they are already compressed. The default list of filename extensions is as follows:

bz2, gz, lzo, tgz, tbz2, zip, z, Z, mp3, jpg, jpeg, mpg, mpeg, avi, gif, png

The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. Example:

maprcli config save -values '{"mapr.fs.nocompression":"bz2,gz,lzo,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'

The list can be viewed with the config load command. Example:

maprcli config load -keys mapr.fs.nocompression



Copying Data from Apache Hadoop

There are three ways to copy data from an HDFS cluster to MapR:

If the HDFS cluster uses the same version of the RPC protocol that MapR uses (currently version 4), use distcp normally to copy data with the following procedure.
If you are copying very small amounts of data, use hftp.
If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason the above steps do not work, you can push data from the HDFS cluster.

To copy data from HDFS to MapR using distcp:

<NameNode IP> - the IP address of the NameNode in the HDFS cluster
<NameNode Port> - the port for connecting to the NameNode in the HDFS cluster
<HDFS path> - the path to the HDFS directory from which you plan to copy data
<MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data
<file> - a file in the HDFS path

1. From a node in the MapR cluster, try hadoop fs -ls to determine whether the MapR cluster can successfully communicate with the HDFS cluster:

hadoop fs -ls <NameNode IP>:<NameNode port>/<path>

2. If the hadoop fs -ls command is successful, try hadoop fs -cat to determine whether the MapR cluster can read file contents from the specified path on the HDFS cluster:

hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file>

3. If you are able to communicate with the HDFS cluster and read file contents, use distcp to copy data from the HDFS cluster to the MapR cluster:

hadoop distcp <NameNode IP>:<NameNode port>/<HDFS path> <MapR-FS path>

Using hftp

<NameNode IP> - the IP address of the NameNode in the HDFS cluster
<NameNode HTTP Port> - the HTTP port on the NameNode in the HDFS cluster
<HDFS path> - the path to the HDFS directory from which you plan to copy data
<MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data

Use distcp over HFTP to copy files:

hadoop distcp hftp://<NameNode IP>:<NameNode HTTP Port>/<HDFS path> <MapR-FS path>

To push data from an HDFS cluster

Perform the following steps from a MapR client or node (any computer that has either mapr-core or mapr-client installed). For more information about setting up a MapR client, see Setting Up the Client.

<input path> - the HDFS path to the source data
<output path> - the MapR-FS path to the target directory
<MapR CLDB IP> - the IP address of the master CLDB node on the MapR cluster

1. Log in as the root user (or use sudo for the following commands).
2. Create the /tmp/maprfs-client/ directory on the Apache Hadoop JobClient node.
3. Copy the following files from a MapR client or any MapR node to the /tmp/maprfs-client/ directory:

   /opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar
   /opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar
   /opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so

4. Install the files in the correct places on the Apache Hadoop JobClient node:



   cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/.
   cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/.
   cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so

   If you are on a 32-bit client, use Linux-i386-32 in place of Linux-amd64-64 above.
5. If the JobTracker is a different node from the JobClient node, copy and install the files to the JobTracker node as well, using the above steps.
6. On the JobTracker node, set fs.maprfs.impl=com.mapr.fs.MapRFileSystem in $HADOOP_HOME/conf/core-site.xml.
7. Restart the JobTracker.
8. You can now copy data to the MapR cluster by running distcp on the JobClient node of the Apache Hadoop cluster. Example:

./bin/hadoop distcp -Dfs.maprfs.impl=com.mapr.fs.MapRFileSystem -libjars /tmp/maprfs-client/maprfs-0.1.jar,/tmp/maprfs-client/zookeeper-3.3.2.jar -files /tmp/maprfs-client/libMapRClient.so <input path> maprfs://<MapR CLDB IP>:7222/<output path>



Data Protection

You can use MapR to protect your data from hardware failures, accidental overwrites, and natural disasters. MapR organizes data into volumes so that you can apply different data protection strategies to different types of data. The following scenarios describe a few common problems and how easily and effectively MapR protects your data from loss.

Scenario: Hardware Failure

Even with the most reliable hardware, growing cluster and datacenter sizes will make frequent hardware failures a real threat to business continuity. In a cluster with 10,000 disks on 1,000 nodes, it is reasonable to expect a disk failure more than once a day and a node failure every few days.

Solution: Topology and Replication Factor

MapR automatically replicates data and places the copies on different nodes to safeguard against data loss in the event of hardware failure. By default, MapR assumes that all nodes are in a single rack. You can provide MapR with information about the rack locations of all nodes by setting topology paths. MapR interprets each topology path as a separate rack, and attempts to replicate data onto different racks to provide continuity in case of a power failure affecting an entire rack. These replicas are maintained, copied, and made available seamlessly without user intervention.

To set up topology and replication:

1. In the MapR Control System, open the MapR-FS group and click Nodes to display the Nodes view.
2. Set up each rack with its own path. For each rack, perform the following steps:
   a. Click the checkboxes next to the nodes in the rack.
   b. Click the Change Topology button to display the Change Node Topology dialog.
   c. In the Change Node Topology dialog, type a path to represent the rack. For example, if the cluster name is cluster1 and the nodes are in rack 14, type /cluster1/rack14.
3. When creating volumes, choose a Replication Factor of 3 or more to provide sufficient data redundancy.

Scenario: Accidental Overwrite

Even in a cluster with data replication, important data can be overwritten or deleted accidentally. If a data set is accidentally removed, the removal itself propagates across the replicas and the data is lost. Users or applications can corrupt data, and once the corruption spreads to the replicas the damage is permanent.

Solution: Snapshots

With MapR, you can create a point-in-time snapshot of a volume, allowing recovery from a known good data set. You can create a manual snapshot to enable recovery to a specific point in time, or schedule snapshots to occur regularly to maintain a recent recovery point. If data is lost, you can restore the data using the most recent snapshot (or any snapshot you choose). Snapshots do not add a performance penalty, because they do not involve additional data copying operations; a snapshot can be created almost instantly regardless of data size.

Example: Creating a Snapshot Manually

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of the volume, then click the New Snapshot button to display the Snapshot Name dialog.
3. Type a name for the new snapshot in the Name... field.
4. Click OK to create the snapshot.

Example: Scheduling Snapshots

This example schedules snapshots for a volume hourly and retains them for 24 hours.

To create a schedule:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. In the Schedule Name field, type "Every Hour".
4. From the first dropdown menu in the Schedule Rules section, select Hourly.
5. In the Retain For field, specify 24 Hours.
6. Click Save Schedule to create the schedule.

To apply the schedule to the volume:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking the Properties button.



3. In the Replication and Snapshot Scheduling section, choose "Every Hour."
4. Click Modify Volume to apply the changes and close the dialog.

Scenario: Disaster Recovery

A severe natural disaster can cripple an entire datacenter, leading to permanent data loss unless a disaster plan is in place.

Solution: Mirroring to Another Cluster

MapR makes it easy to protect against loss of an entire datacenter by mirroring entire volumes to a different datacenter. A mirror is a full read-only copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data. If the volumes on the original cluster contain a large amount of data, you can store them on physical media using the volume dump create command and transport them to the mirror cluster. Otherwise, you can simply create mirror volumes that point to the volumes on the original cluster and copy the data over the network. The mirroring operation conserves bandwidth by transmitting only the deltas between the source and the mirror, and by compressing the data over the wire. In addition, MapR uses checksums and a latency-tolerant protocol to ensure success even on high-latency WANs. You can set up a cascade of mirrors to replicate data over a distance. For instance, you can mirror data from New York to London, then use lower-cost links to replicate the data from London to Paris and Rome.

To set up mirroring to another cluster:

1. Use the volume dump create command to create a full volume dump for each volume you want to mirror.
2. Transport the volume dump to the mirror cluster.
3. For each volume on the original cluster, set up a corresponding volume on the mirror cluster.
   a. Restore the volume using the volume dump restore command.
   b. In the MapR Control System, click Volumes under the MapR-FS group to display the Volumes view.
   c. Click the name of the volume to display the Volume Properties dialog.
   d. Set the Volume Type to Remote Mirror Volume.
   e. Set the Source Volume Name to the source volume name.
   f. Set the Source Cluster Name to the cluster where the source volume resides.
   g. In the Replication and Mirror Scheduling section, choose a schedule to determine how often the mirror will sync.

To recover volumes from mirrors:

1. Use the volume dump create command to create a full volume dump for each mirror volume you want to restore. Example:
   maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume@cluster
2. Transport the volume dump to the rebuilt cluster.
3. For each volume on the mirror cluster, set up a corresponding volume on the rebuilt cluster.
   a. Restore the volume using the volume dump restore command. Example:
      maprcli volume dump restore -name volume@cluster -dumpfile fulldump1
   b. Copy the files to a standard (non-mirror) volume.


Provisioning Applications

Provisioning a new application involves meeting the business goals of performance, continuity, and security while providing necessary resources to a client, department, or project. You'll want to know how much disk space is needed, and what the priorities are in terms of performance and reliability. Once you have gathered all the requirements, you will create a volume to manage the application data. A volume provides convenient control over data placement, performance, protection, and policy for an entire data set.

Make sure the cluster has the storage and processing capacity for the application. You'll need to take into account the starting and predicted size of the data, the performance and protection requirements, and the memory required to run all the processes required on each node. Here is the information to gather before beginning:

Access: How often will the data be read and written? What is the ratio of reads to writes?
Continuity: What is the desired recovery point objective (RPO)? What is the desired recovery time objective (RTO)?
Performance: Is the data static, or will it change frequently? Is the goal data storage or data processing?
Size: How much data capacity is required to start? What is the predicted growth of the data?

The considerations in the above table will determine the best way to set up a volume for the application.

About Volumes

Volumes provide a number of ways to help you meet the performance, access, and continuity goals of an application, while managing application data size:

Mirroring - create read-only copies of the data for highly accessed data or multi-datacenter access
Permissions - allow users and groups to perform specific actions on a volume
Quotas - monitor and manage the data size by project, department, or user
Replication - maintain multiple synchronized copies of data for high availability and failure protection
Snapshots - create a real-time point-in-time data image to enable rollback
Topology - place data on a high-performance rack or limit data to a particular set of machines

See Managing Data with Volumes.

Mirroring

Mirroring means creating mirror volumes: full physical read-only copies of normal volumes, for fault tolerance and high performance. When you create a mirror volume, you specify a source volume from which to copy data, and you can also specify a schedule to automate re-synchronization of the data to keep the mirror up-to-date. After a mirror is initially copied, the synchronization process saves bandwidth and reads on the source volume by transferring only the deltas needed to bring the mirror volume to the same state as its source volume. A mirror volume need not be on the same cluster as its source volume; MapR can sync data on another cluster (as long as it is reachable over the network). When creating multiple mirrors, you can further reduce the mirroring bandwidth overhead by daisy-chaining the mirrors. That is, set the source volume of the first mirror to the original volume, the source volume of the second mirror to the first mirror, and so on. Each mirror is a full copy of the volume, so remember to take the number of mirrors into account when planning application data size. See Mirrors.

Permissions

MapR provides fine-grained control over which users and groups can perform specific tasks on volumes and clusters. When you create a volume, keep in mind which users or groups should have these types of access to the volume. You may want to create a specific group to associate with a project or department, then add users to the group so that you can apply permissions to them all at the same time. See Managing Permissions.

Quotas

You can use quotas to limit the amount of disk space an application can use. There are two types of quotas:

User/Group quotas limit the amount of disk space available to a user or group
Volume quotas limit the amount of disk space available to a volume

When the data owned by a user, group, or volume exceeds the quota, MapR prevents further writes until either the data size falls below the quota again, or the quota is raised to accommodate the data.

Volumes, users, and groups can also be assigned advisory quotas. An advisory quota does not limit the disk space available, but raises an alarm and sends a notification when the space used exceeds a certain point. When you set a quota, you can use a slightly lower advisory quota as a warning that the data is about to exceed the quota, preventing further writes.

Page 243: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 241For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

Remember that volume quotas do not take into account disk space used by sub-volumes (because volume paths are logical, not physical).

You can set a User/Group quota to manage and track the disk space used by an accounting entity (a department, project, or application):

1. Create a group to represent the accounting entity.
2. Create one or more volumes and use the group as the Accounting Entity for each.
3. Set a User/Group quota for the group.
4. Add the appropriate users to the group.

When a user writes to one of the volumes associated with the group, any data written counts against the group's quota. Any writes to volumes not associated with the group are not counted toward the group's quota. See Managing Quotas.
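As a sketch, quotas can also be set from the command line with maprcli; the group and volume names below are hypothetical, and the -type 1 flag is assumed to designate a group entity:

# Give the group 'analytics' a 2 TB hard quota and a 1800 GB advisory quota
maprcli entity modify -name analytics -type 1 -quota 2T -advisoryquota 1800G

# Cap a single volume at 500 GB, with an early warning at 450 GB
maprcli volume modify -name project.analytics -quota 500G -advisoryquota 450G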

Replication

When you create a volume, you can choose a replication factor to safeguard important data. MapR manages the replication automatically, raising an alarm and notification if replication falls below the minimum level you have set. Each replica of a volume is a full copy of the volume; remember to take that into account when planning application data size.

Snapshots

A snapshot is an instant image of a volume at a particular point in time. Snapshots take no time to create, because they only record changes to data over time rather than the data itself. You can manually create a snapshot to enable rollback to a particular known data state, or schedule periodic automatic snapshots to ensure a specific recovery point objective (RPO). You can use snapshots and mirrors to achieve a near-zero recovery time objective (RTO). Snapshots store only the deltas between a volume's current state and its state when the snapshot is taken. Initially, snapshots take no space on disk, but they can grow arbitrarily as a volume's data changes. When planning application data size, take into account how much the data is likely to change, and how often snapshots will be taken. See Snapshots.

Topology

You can restrict a volume to a particular rack by setting its physical topology attribute. This is useful for placing an application's data on a high-performance rack (for critical applications) or a low-performance rack (to keep it out of the way of critical applications). See Setting Volume Topology.

Scenarios

Here are a few ways to configure the application volume based on different types of data. If the application requires more than one type of data, you can set up multiple volumes.

Important data: High replication factor. Frequent snapshots to minimize RPO and RTO. Mirroring in a remote cluster.
Highly accessed data: High replication factor. Mirroring for high-performance reads. Topology: data placement on high-performance machines.
Scratch data: No snapshots, mirrors, or replication. Topology: data placement on low-performance machines.
Static data: Mirroring and replication set by performance and availability requirements. One snapshot (to protect against accidental changes). Volume set to read-only.

The following documents provide examples of different ways to provision an application to meet business goals:

Provisioning for Capacity
Provisioning for Performance

Setting Up the Application

Once you know the course of action to take based on the application's data and performance needs, you can use the MapR Control System to set up the application.

Creating a Group and a Volume
Setting Up Mirroring
Setting Up Snapshots
Setting Up User or Group Quotas


Creating a Group and a Volume

Create a group and a volume for the application. If you already have a snapshot schedule prepared, you can apply it to the volume at creation time. Otherwise, use the procedure in Setting Up Snapshots below, after you have created the volume.

Setting Up Mirroring

1. If you want the mirror to sync automatically, use the procedure in Creating a Schedule to create a schedule.
2. Use the procedure in Creating a Volume to create a mirror volume (a command-line sketch follows this list). Make sure to set the following fields:

Volume Type - Mirror Volume
Source Volume - the volume you created for the application
Responsible Group/User - in most cases, the same as for the source volume

Setting Up Snapshots

To set up automatic snapshots for the volume, use the procedure in Scheduling a Snapshot.



Provisioning for Capacity

You can easily provision a volume for maximum data storage capacity by setting a low replication factor, setting hard and advisory quotas, and tracking storage use by users, groups, and volumes. You can also set permissions to limit who can write data to the volume.

The replication factor determines how many complete copies of a volume are stored in the cluster. The actual storage requirement for a volume is the volume size multiplied by its replication factor. To maximize storage capacity, set the replication factor on the volume to 1 at the time you create the volume.

Volume quotas and user or group quotas limit the amount of data that can be written by a user or group, or the maximum size of a specific volume. When the data size exceeds the advisory quota, MapR raises an alarm and notification but does not prevent additional data writes. Once the data exceeds the hard quota, no further writes are allowed for the volume, user, or group. The advisory quota is generally somewhat lower than the hard quota, to provide advance warning that the data is in danger of exceeding the hard quota. For a high-capacity volume, the volume quotas should be as large as possible. You can use the advisory quota to warn you when the volume is approaching its maximum size.

To use the volume capacity wisely, you can limit write access to a particular user or group. Create a new user or group on all nodes in the cluster.

In this scenario, storage capacity takes precedence over high performance and data recovery; to maximize data storage, there will be no snapshots or mirrors set up in the cluster. A low replication factor means that the data is less effectively protected against loss in the event that disks or nodes fail. Because of these tradeoffs, this strategy is most suitable for risk-tolerant large data sets, and should not be used for data with stringent protection, recovery, or performance requirements.

To create a high-capacity volume:

1. Set up a user or group that will be responsible for the volume. For more information, see Users & Groups.
2. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
3. Click the New Volume button to display the New Volume dialog.
4. In the Volume Setup pane, set the volume name and mount path.
5. In the Usage Tracking pane:
   a. In the Group/User section, select User or Group and enter the user or group responsible for the volume.
   b. In the Quotas section, check Volume Quota and enter the maximum capacity of the volume, based on the storage capacity of your cluster. Example: 1 TB
   c. Check Volume Advisory Quota and enter a lower number than the volume quota, to serve as advance warning when the data approaches the hard quota. Example: 900 GB
6. In the Replication & Snapshot Scheduling pane:
   a. Set Replication to 1.
   b. Do not select a snapshot schedule.
7. Click OK to create the volume.
8. Set the permissions on the volume via NFS or using hadoop fs. You can limit writes to root and the responsible user or group.

See Managing Data with Volumes for more information.
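The same volume can be provisioned from the command line. This is a sketch under assumed names (hicap, hicap-group); -aetype 1 makes the group the accounting entity, and exact option support may vary by release:

maprcli volume create -name hicap -path /hicap -replication 1 -quota 1T -advisoryquota 900G -ae hicap-group -aetype 1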



Provisioning for Performance

You can provision a high-performance volume by creating multiple mirrors of the data and defining volume topology to control data placement: store the data on your fastest servers (for example, servers that use SSDs instead of hard disks).

When you create mirrors of a volume, make sure your application load-balances reads across the mirrors to increase performance. Each mirror is an actual volume, so you can control data placement and replication on each mirror independently. The most efficient way to create multiple mirrors is to cascade them rather than creating all the mirrors from the same source volume. Create the first mirror from the original volume, then create the second mirror using the first mirror as the source volume, and so on. You can mirror the volume within the same cluster or to another cluster, possibly in a different datacenter.

You can set node topology paths to specify the physical locations of nodes in the cluster, and volume topology paths to limit volumes to specific nodes or racks.

To set node topology:

Use the following steps to create a rack path representing the high-performance nodes in your cluster.

1. In the MapR Control System, open the MapR-FS group and click Nodes to display the Nodes view.
2. Click the checkboxes next to the high-performance nodes.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. In the Change Node Topology dialog, type a path to represent the high-performance rack. For example, if the cluster name is cluster1 and the high-performance nodes make up rack 14, type /cluster1/rack14. (A command-line sketch follows this list.)
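Node topology can also be assigned with maprcli. A sketch, assuming you first look up the server IDs of the target nodes; substitute your own IDs and path:

maprcli node list -columns id,hostname
maprcli node move -serverids <server IDs> -topology /cluster1/rack14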

To set up the source volume:

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Set the Topology to limit the volume to the high-performance rack. Example: /default/rack14

To Set Up the First Mirror

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Choose Local Mirror Volume.
5. Set the Source Volume Name to the original volume name. Example: original-volume
6. Set the Topology to a different rack from the source volume.

To Set Up Subsequent Mirrors

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Choose Local Mirror Volume.
5. Set the Source Volume Name to the previous mirror volume name. Example: mirror1
6. Set the Topology to a different rack from the source volume and the other mirror.

See Managing Data with Volumes for more information.
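A cascaded mirror chain can be scripted along the following lines. The names, cluster, and topology paths are hypothetical, and exact option names may differ by release:

maprcli volume create -name mirror1 -path /mirror1 -type mirror -source original-volume@my.cluster.com -topology /cluster1/rack15
maprcli volume create -name mirror2 -path /mirror2 -type mirror -source mirror1@my.cluster.com -topology /cluster1/rack16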


MapR Metrics and Job Performance

The MapR Metrics service collects and displays detailed analytics about the tasks and task attempts that comprise your Hadoop job. You can use the MapR Control System to display charts based on those analytics and diagnose performance issues with a particular job.


The MapR Control System presents the jobs running on your cluster and the tasks that make up a specific job as a sortable list, along with histograms and line charts that represent the distribution of a particular metric. You can sort the list by the metric you're interested in to quickly find any outliers, then display specific detailed information about a job or task attempt that you want to learn more about. The filtering capabilities of the MapR Control System enable you to narrow down the display of data to the ranges you're interested in.

For example, if a job lists 100% map task completion and 99% reduce task completion, you can filter the views in the MapR Control System to list only reduce tasks. Once you have a list of your job's reduce tasks, you can sort the list by duration to see if any reduce task attempts are taking an abnormally long time to execute, then display detailed information about those task attempts, including log files for those task attempts.

You can also use the Metrics displays to gauge performance. Consider two different jobs that perform the same function. One job is written in Python using pydoop, and the other job is written in C++ using Pipes. To evaluate how these jobs perform on the cluster, you can open two browser windows logged into the MapR Control System and filter the display down to the metrics you're most interested in while the jobs are running.


Troubleshooting Development Issues

This section provides information about troubleshooting development problems. Click a subtopic below for more detail.


Migration Guide

This guide provides instructions for migrating business-critical data and applications from an Apache Hadoop cluster to a MapR cluster.

The MapR distribution is 100% API-compatible with Apache Hadoop, and migration is a relatively straightforward process. The additional features available in MapR provide new ways to interact with your data. In particular, MapR provides a fully read/write storage layer that can be mounted as a filesystem via NFS, allowing existing processes, legacy workflows, and desktop applications full access to the entire cluster.

Migration consists of the following steps:

1. Planning the Migration — Identify the goals of the migration, understand the differences between your current cluster and the MapR cluster, and identify potential gotchas.
2. Initial MapR Deployment — Install, configure, and test the MapR cluster.
3. Component Migration — Migrate your customized components to the MapR cluster.
4. Application Migration — Migrate your applications to the MapR cluster and test using a small set of data.
5. Data Migration — Migrate your data to the MapR cluster and test the cluster against performance benchmarks.
6. Node Migration — Take down old nodes from the previous cluster and install them as MapR nodes.


Planning the Migration

The first phase of migration is planning. In this phase you will identify the requirements and goals of the migration, identify potential issues in the migration, and define a strategy.

The requirements and goals of the migration depend on a number of factors:

Data migration — can you move your datasets individually, or must the data be moved all at once?
Downtime — can you tolerate downtime, or is it important to complete the migration with no interruption in service?
Customization — what custom patches or applications are running on the cluster?
Storage — is there enough space to store the data during the migration?

The MapR Hadoop distribution is 100% plug-and-play compatible with Apache Hadoop, so you do not need to make changes to your applications to run them on a MapR cluster. MapR Hadoop automatically configures compression and memory settings, task heap sizes, and local volumes for shuffle data.



Initial MapR Deployment

The initial MapR deployment phase consists of installing, configuring, and testing the MapR cluster and any ecosystem components (such as Hive, HBase, or Pig) on an initial set of nodes. Once you have the MapR cluster deployed, you will be able to begin migrating data and applications.

To deploy the MapR cluster on the selected nodes, follow the steps in the Installation Guide.

1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to deploy on which nodes in the cluster.
3. PREPARE package files for installation, either relying on MapR's repository or locating packages on a local network.
4. INSTALL the MapR software.


   a. On each node, INSTALL the planned MapR services.
   b. On all nodes, RUN configure.sh.
   c. On all nodes, FORMAT disks for use by MapR.

5. BRING UP the cluster and apply a license.
6. CONFIGURE the cluster.


   a. SET UP the administrative user.
   b. SET UP MapR Metrics.
   c. CHECK that the correct services are running.
   d. SET UP node topology.
   e. SET UP initial volume structure.
   f. SET UP NFS for high availability (HA). (M5 Edition only)
   g. SET UP authentication.
   h. CONFIGURE cluster email settings.
   i. CONFIGURE permissions.
   j. SET user quotas.
   k. CONFIGURE alarm notifications.
   l. ISOLATE the CLDB service on dedicated nodes for large clusters. (optional)



Component Migration

MapR Hadoop features the complete Hadoop distribution including components such as Hive and HBase. There are a few things to know about migrating Hive and HBase, or about migrating custom components you have patched yourself.

Hive Migration

Hive facilitates the analysis of large datasets stored in the Hadoop filesystem by organizing that data into tables that can be queried and analyzed using a dialect of SQL called HiveQL. The schemas that define these tables and all other Hive metadata are stored in a centralized repository called the metastore.

If you would like to continue using Hive tables developed on an HDFS cluster in a MapR cluster, you can import Hive metadata from the metastore to recreate those tables in MapR. Depending on your needs, you can choose to import a subset of table schemas or the entire metastore in a single go.

Importing table schemas into a MapR cluster

Use this procedure to import a subset of the Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test a subset of applications using a smaller subset of data.

Use the following procedure to import Hive metastore data into a new metastore running on a node in the MapR cluster. You will need to redirect all links that formerly pointed to HDFS (hdfs://<namenode>:<port number>/<path>) to point to MapR-FS (maprfs:///<path>).

Importing an entire Hive metastore into a MapR cluster

Use this procedure to import an entire Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test all applications using a complete dataset. MySQL is a very popular choice for the Hive metastore and so we'll use it as an example. If you are using another RDBMS, consult the relevant documentation.

1. Ensure that both Hive and your database are installed on one of the nodes in the MapR cluster. For step-by-step instructions on setting up a standalone MySQL metastore, see Setting Up Hive with a MySQL Metastore.
2. On the HDFS cluster, back up the metastore to a file:

mysqldump [options] --databases db_name... > filename

3. Ensure that queries in the dumpfile point to MapR-FS rather than HDFS. Search the dumpfile and edit all of the URIs that point to hdfs:// so that they point to maprfs:/// instead.
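A one-line rewrite with sed might look like the following sketch. The NameNode host and port and the dumpfile name are hypothetical; review the edited file before importing it:

sed -i 's|hdfs://namenode1:8020|maprfs://|g' metastore_dump.sql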

4. Import the data from the dumpfile into the metastore running on the node in the MapR cluster:

mysql [options] db_name < filename

Using Hive with MapR volumes

MapR-FS does not allow moving or renaming across volume boundaries. Be sure to set the Hive Scratch Directory and Hive Warehouse Directory in the same volume where the data for the Hive job resides before running the job. For more information see Using Hive with MapR Volumes.
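For instance, both directories can be pointed into the job's volume at launch time. This is a sketch; the paths are hypothetical, and the same properties can instead be set in hive-site.xml:

hive --hiveconf hive.exec.scratchdir=/myvolume/tmp --hiveconf hive.metastore.warehouse.dir=/myvolume/warehouse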

HBase Migration

HBase is the Hadoop database, which provides random, real-time read/write access to very large datasets. The MapR Hadoop distribution includes HBase and is fully integrated with MapR enhancements for speed, usability, and dependability. MapR provides a volume (normally mounted at /hbase) to store HBase data.

HBase bulk load jobs: If you are currently using HBase bulk load jobs to import data into HDFS, make sure to load your data into a path under the /hbase volume.
Compression: The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (see the sketch following this list). For more information, see HBase Best Practices.
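Compression on a directory can be toggled with the hadoop mfs utility; a minimal sketch, assuming the default /hbase mount point:

hadoop mfs -setcompression off /hbase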

Custom Components

If you have applied your own patches to a component and wish to continue to use that customized component with the MapR distribution, you should keep the following considerations in mind:

MapR libraries: All Hadoop components must point to MapR for the Hadoop libraries. Change any absolute paths. Do not hardcode hdfs:// or maprfs:// into your applications. This is also true of Hadoop ecosystem components that are not included in the MapR Hadoop distribution (such as Cascading). For more information see Working with MapR-FS.


Component compatibility: Before you commit to the migration of a customized component (for example, customized HBase), check the MapR release notes to see if MapR Technologies has issued a patch that satisfies your business requirements. MapR Technologies publishes a list of Hadoop common patches and MapR patches with each release and makes those patches available for our customers to take, build, and deploy. For more information, see the Release Notes.
ZooKeeper coordination service: Certain components, such as HBase, depend on ZooKeeper. When you migrate your customized component from the HDFS cluster to the MapR cluster, make sure it points correctly to the MapR ZooKeeper service.



Application Migration

In this phase you will migrate your applications to the MapR cluster test environment. The goal of this phase is to get your applications running smoothly on the MapR cluster using a subset of data. Once you have confirmed that all applications and components are running as expected, you can begin migrating your data.

Migrating your applications from HDFS to MapR is relatively easy. MapR Hadoop is 100% plug-and-play compatible with Apache Hadoop, so you do not need to make changes to your applications to run them on a MapR cluster.

Application Migration Guidelines

Keep the following considerations in mind when you migrate your applications:

MapR Libraries — Ensure that your applications can find the libraries and configurations they expect. Make sure the Java classpath includes the path to maprfs.jar and the java.library.path includes libMapRClient.so.
MapR Storage — Every application must point to MapR-FS (maprfs:///) rather than HDFS (hdfs://). If your application uses fs.default.name, then it will work automatically. If you have hardcoded HDFS links into your applications, you must redirect those links so that they point to MapR-FS. Setting a default path of maprfs:/// tells your applications to use the cluster specified in the first line of mapr-clusters.conf. You can also specify a specific cluster with maprfs://<cluster name>/. (A launch sketch follows this list.)
Permissions — The distcp command does not copy permissions; permissions defined in HDFS do not transfer automatically to MapR-FS. MapR uses a combination of access control lists (ACLs) to specify cluster or volume-level permissions and file permissions to manage directory and file access. You must define these permissions in MapR when you migrate your customized components, applications, and data. For more information, see Managing Permissions.
Memory — Remove explicit memory settings defined in your applications. If memory is set explicitly in the application, the jobs may fail after migration to MapR.

Application Migration Roadmap

Generally, the best approach to migrating your applications to MapR is to import a small subset of data, then test and tune your application using that data in a test environment before you import your production data.

The following procedure offers a simple roadmap for migrating and running your applications in a MapR cluster test environment.

1. Copy over a small amount of data to the MapR cluster. Use the hadoop distcp hftp command to copy over a small number of files:

$ hadoop distcp hftp://namenode1:50070/foo maprfs:///bar

You must specify the namenode IP address, port number, and source directory on the HDFS cluster. For more information, see Copying Data from Apache Hadoop.

2. Run the application.
3. Add more data and test again.
4. When the application is running to your satisfaction, use the same process to test and tune another application.



Data Migration

Once you have installed and configured your MapR cluster in a test environment and migrated your applications to the MapR cluster, you can begin to copy over your data from the Apache Hadoop HDFS to the MapR cluster.

In the application migration phase, you should have already moved over small amounts of data using the hadoop distcp hftp command; see Application Migration Roadmap. While this method is ideal for copying over the very small amounts of data required for an initial test, you must use different methods to migrate your data.

There are two ways to migrate large datasets from an HDFS cluster to MapR:

Distributed Copy — Use the hadoop distcp command to copy data from HDFS to MapR-FS. This is the preferred method for moving large amounts of data.
Push Data — If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason you cannot use the hadoop distcp command, you can push data from HDFS to MapR-FS.

Important: Ensure that you have laid out your volumes and defined policies before you migrate your data from the HDFS cluster to the MapR cluster. Note that you cannot copy over permissions defined in HDFS.

Distributed Copy

The hadoop distcp command (distributed copy) enables you to use a MapReduce job to copy large amounts of data between clusters. “The hadoop distcp command expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.”

You can use the hadoop distcp command to migrate data from a Hadoop HDFS cluster to MapR-FS only if the HDFS cluster uses the same version of the RPC protocol as that used by the MapR cluster. Currently, MapR uses version 4. (If the clusters do not share the same version of the RPC protocol, you must use the push data method described below.)

To copy data from HDFS to MapR using hadoop distcp:

1. From a node in the MapR cluster, try hadoop fs -ls to determine whether the MapR cluster can successfully communicate with the HDFS cluster:

hadoop fs -ls <NameNode IP>:<NameNode port>/<path>

2. If the hadoop fs -ls command is successful, try hadoop fs -cat to determine whether the MapR cluster can read file contents from the specified path on the HDFS cluster:

hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file>

3. If you are able to communicate with the HDFS cluster and read file contents, use distcp to copy data from the HDFS cluster to the MapR cluster (a filled-in sketch follows):

hadoop distcp <NameNode IP>:<NameNode port>/<HDFS path> <MapR-FS path>
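Filled in with sample values, the sequence might look like this sketch. The NameNode address, port, and paths are hypothetical; substitute your own:

hadoop fs -ls 10.10.1.10:8020/user/data
hadoop fs -cat 10.10.1.10:8020/user/data/part-00000
hadoop distcp 10.10.1.10:8020/user/data /migrated/data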

Pushing Data from HDFS to MapR-FS

If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason you cannot use the hadoop distcp command to copy files from HDFS to MapR-FS, you can push data from the HDFS cluster to the MapR cluster.

Perform the following steps from a MapR client or node (any computer that has either mapr-core or mapr-client installed). For more information about setting up a MapR client, see Setting Up the Client.

<input path>: The HDFS path to the source data.
<output path>: The MapR-FS path to the target directory.
<MapR CLDB IP>: The IP address of the master CLDB node on the MapR cluster.

1. Log into a MapR client or node as the root user (or use sudo for the following commands).
2. Create the /tmp/maprfs-client/ directory on the Apache Hadoop JobClient node.
3. Copy the following files from a MapR client or any MapR node to the /tmp/maprfs-client/ directory:

/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar



/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar
/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so

4. Install the files in the correct places on the Apache Hadoop JobClient node:

cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/
cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/
cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so

Note: If you are on a 32-bit client, use Linux-i386-32 in place of Linux-amd64-64 above.

5. If the JobTracker is a different node from the JobClient node, copy and install the files to the JobTracker node as well using the above steps.
6. On the JobTracker node, set fs.maprfs.impl=com.mapr.fs.MapRFileSystem in $HADOOP_HOME/conf/core-site.xml.
7. Restart the JobTracker.
8. Copy data to the MapR cluster by running the hadoop distcp command on the JobClient node of the Apache Hadoop cluster (a sketch follows).
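Using the placeholders defined above, the final push might look like this sketch, assuming the CLDB listens on its default port, 7222:

hadoop distcp <input path> maprfs://<MapR CLDB IP>:7222/<output path>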


Node Migration

Once you have loaded your data and tested and tuned your applications, you can decommission HDFS data-nodes and add them to the MapR cluster.

This is a three-step process:

1. Decommissioning nodes on an Apache Hadoop cluster: The Hadoop decommission feature enables you to gracefully remove a set of existing data-nodes from a cluster while it is running, without data loss. For more information, see the Hadoop Wiki FAQ.
2. Meeting minimum hardware and software requirements: Ensure that every data-node you want to add to the MapR cluster meets the hardware, software, and configuration requirements.
3. Adding nodes to a MapR cluster: You can then add those data-nodes to the MapR cluster. For more information, see Adding Nodes to a Cluster.


Reference Guide

The MapR Reference Guide contains in-depth reference information for MapR Software. Choose a subtopic below for more detail.

Release Notes - Known issues and new features, by release
MapR Control System - User interface reference
API Reference - Information about the command-line interface and the REST API
Utilities - MapR tool and utility reference
Environment Variables - Environment variables specific to MapR
Configuration Files - Information about MapR settings
Ports Used by MapR - List of network ports used by MapR services
Glossary - Essential MapR terms and definitions
Hadoop Commands - Listing of Hadoop commands and options


Release Notes

This section contains Release Notes for all releases of MapR Distribution for Apache Hadoop:

Version 2.1 Release Notes - November 21, 2012
Version 2.0.1 Release Notes - November 6, 2012
Version 2.0 Release Notes - August 10, 2012
Version 1.2.9 Release Notes - July 12, 2012
Version 1.2.7 Release Notes - June 4, 2012
Version 1.2.3 Release Notes - March 14, 2012
Version 1.2.2 Release Notes - February 2, 2012
Version 1.2 Release Notes - December 12, 2011
Version 1.1.3 Release Notes - September 29, 2011
Version 1.1.2 Release Notes - September 7, 2011
Version 1.1.1 Release Notes - August 22, 2011
Version 1.1 Release Notes - July 28, 2011
Version 1.0 Release Notes - June 29, 2011
Beta Release Notes - April 1, 2011
Alpha Release Notes - February 15, 2011

Repository Paths

Version 2.1.0
http://package.mapr.com/releases/v2.1.0/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.1.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.1.0/suse/ (SUSE)

Version 2.0.1
http://package.mapr.com/releases/v2.0.1/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.0.1/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.0.1/suse/ (SUSE)

Version 2.0
http://package.mapr.com/releases/v2.0.0/mac/ (Mac)
http://package.mapr.com/releases/v2.0.0/redhat/ (CentOS or Red Hat)
http://package.mapr.com/releases/v2.0.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.0.0/suse/ (SUSE)
http://package.mapr.com/releases/v2.0.0/windows/ (Windows)

Version 1.2.9
http://package.mapr.com/releases/v1.2.9/mac/ (Mac)
http://package.mapr.com/releases/v1.2.9/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.9/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.9/windows/ (Windows)

Version 1.2.7
http://package.mapr.com/releases/v1.2.7/mac/ (Mac)
http://package.mapr.com/releases/v1.2.7/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.7/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.7/windows/ (Windows)

Version 1.2.3
http://package.mapr.com/releases/v1.2.3/mac/ (Mac)
http://package.mapr.com/releases/v1.2.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.3/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.3/windows/ (Windows)

Version 1.2.2
http://package.mapr.com/releases/v1.2.2/mac/ (Mac)
http://package.mapr.com/releases/v1.2.2/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.2/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.2/windows/ (Windows)

Version 1.2.0
http://package.mapr.com/releases/v1.2.0/mac/ (Mac)
http://package.mapr.com/releases/v1.2.0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.0/windows/ (Windows)

Version 1.1.3
http://package.mapr.com/releases/v1.1.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.3/ubuntu/ (Ubuntu)

Version 1.1.2 - Internal maintenance release

Version 1.1.1
http://package.mapr.com/releases/v1.1.1/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.1/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.1/ubuntu/ (Ubuntu)

Version 1.1.0
http://package.mapr.com/releases/v1.1.0-sp0/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.0-sp0/ubuntu/ (Ubuntu)

Version 1.0.0
http://package.mapr.com/releases/v1.0.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.0.0-sp0/ubuntu/ (Ubuntu)


Version 2.1 Release Notes

New In This Release

Performance improvements: MapR 2.1 has increased performance.
Starting in version 2.1, MapR is compatible with Hadoop 1.0.3.
Improved CLDB/HA failover due to faster network failure discovery.
Continuous client access throughout the failover process.
Faster CLDB failover.
Faster NFS VIP failover.

Resolved Issues

MapReduce

(Issue 8692) Initialization scripts now point to the correct binary on RedHat and CentOS.

Filesystem

(Issue 8237) Old FSID values are now removed correctly.
(Issue 8170) Cache expiration no longer causes binding to return null and generate a core.

NFS

(Issue 7909) Multihomed nodes now failover correctly with VIPs.

MCS and CLI

(Issue 8266) The maprcli volume remove command properly handles filters passed with the -filter switch.
(Issue 8642) The maprcli disk remove command now correctly handles flat files.

CLDB

(Issue 7893) A given CLDB node can now have multiple IP addresses in mapr-clusters.conf.
(Issue 8578) CLDB failover time decreased significantly.
(Issue 8598) CLDB now properly sets device information when relinquishing an IP address.
(Issue 8874) CLDB failover during rereplication no longer raises false under-replication alarms.
(Issue 8922) CLDB no longer indefinitely and incorrectly marks some containers as spurious.

Hive

(Issue 4981) Hive now supports lower replication settings for intermediate data.

Logging

(Issue 8325) Diagnostics info files now generate only for failed task attempts.
(Issue 8459) Distributed cache log messages now have information about job ID and file sizes.

Known Issues

(Issue 5537)

Rare MapR application crashes due to Java JVM dumping core. This is a Java Runtime Environment issue.

(Issue 7310)

In some cases, Map/Reduce preemption may not work with Map Task prefetch and Expresslane. MapR recommends using prefetch, preemption, and expresslane mutually exclusively.

(Issue 7332)

The https://issues.apache.org/jira/browse/HIVE-2907 bug is not yet fixed in MapR's release. It may cause the dropping of a table with a large number of partitions to fail due to an out-of-memory error.


(Issue 7834)

There are no obvious indications when a MapR license expires, except the degradation of services to the non-licensed level. The workaround is to pay attention to License expiration alarms before the actual expiry.

(Issue 9067)

The job tracker must be manually restarted after upgrading from any MapR version prior to 2.0 to MapR version 2.0 or later on a live cluster.

Map/Reduce and Hadoop Patches Integrated Since Last Release

Hadoop Common Patches

MapR 2.1 includes the following Apache Hadoop patches that are not included in the MapR distribution for Hadoop version 2.0:

[HADOOP-8329] Build fails with Java 7
[HADOOP-8430] Backport new FileSystem methods introduced by HADOOP-8014 to branch-1
[HADOOP-6546] BloomMapFile can return false negatives
[HADOOP-8151] Error handling in snappy decompressor throws invalid exceptions
[HADOOP-6642] Fix javac, javadoc, findbugs warnings
[HADOOP-7539] merge hadoop archive goodness from trunk to .20
[HADOOP-7602] wordcount, sort etc on har files fails with NPE
[HADOOP-7594] Support HTTP REST in HttpServer
[HADOOP-7661] FileSystem.getCanonicalServiceName throws NPE for any file system uri that doesn't have an authority.
[HADOOP-7649] TestMapredGroupMappingServiceRefresh and TestRefreshUserMappings fail after HADOOP-7625
[HADOOP-7215] RPC clients must connect over a network interface corresponding to the host name in the client's kerberos principal key
[HADOOP-7509] Improve message when Authentication is required
[HADOOP-8445] Token should not print the password in toString
[HADOOP-8587] HarFileSystem access of harMetaCache isn't threadsafe
[HADOOP-7836] TestSaslRPC#testDigestAuthMethodHostBasedToken fails with hostname localhost.localdomain
[HADOOP-6975] integer overflow in S3InputStream for blocks > 2GB
[HADOOP-8552] Conflict: Same security.log.file for multiple users.
[HADOOP-8612] Backport HADOOP-8599 to branch-1 (Non empty response when read beyond eof)

MapReduce Patches

MapR 2.1 includes the following Apache MapReduce patches that are not included in the MapR distribution for Hadoop version 2.0:

[MAPREDUCE-336] The logging level of the tasks should be configurable by the job
[MAPREDUCE-4359] Potential deadlock in Counters
[MAPREDUCE-2452] Delegation token cancellation shouldn't hold global JobTracker lock
[MAPREDUCE-3993] Graceful handling of codec errors during decompression
[MAPREDUCE-4385] FairScheduler.maxTasksToAssign() should check for fairscheduler.assignmultiple.maps < TaskTracker.availableSlots
[MAPREDUCE-2779] JobSplitWriter.java can't handle large job.split file
[MAPREDUCE-4036] Streaming TestUlimit fails on CentOS 6
[MAPREDUCE-4355] Add RunningJob.getJobStatus()
[MAPREDUCE-4415] Backport the Job.getInstance methods from MAPREDUCE-1505 to branch-1
[MAPREDUCE-4154] streaming MR job succeeds even if the streaming command fails
[MAPREDUCE-4464] Hostnames with an underscore no longer cause reduce tasks to fail.


Hadoop Compatibility in Version 2.1

As of November 21, 2012, the MapR Hadoop Ecosystem contains the following packages:

Cascading 2.1
Flume 1.2.0
Hbase 0.92.2
Hcatalog 0.4.0
Hive 0.9.0
Mahout 0.7
Oozie 3.2.0
Pig 0.10.0
Sqoop 1.4.2
Whirr 0.7.0

These packages are available from the following repositories:

RedHat (http://package.mapr.com/releases/ecosystem/redhat)
Ubuntu (http://package.mapr.com/releases/ecosystem/ubuntu)
SUSE (http://package.mapr.com/releases/ecosystem/suse)

Other supported versions of this software are available from http://package.mapr.com/releases/ecosystem-all/:

Cascading 2.0
Hbase 0.92.1
Hbase 0.94.1
Mahout 0.6
Oozie 3.1.0
Sqoop 1.4.1

The collection of packages at http://package.mapr.com/releases/ecosystem-all/ is not a repository. Download your desired packages manually.


Version 2.1.1 Release Notes

New In This Release

New alarm: NODE_ALARM_METRICS_WRITE_PROBLEM.
Architectural changes to the handling of the Hadoop configuration files mapred-site.xml, mapred-default.xml, and core-site.xml. These changes are transparent to developers and administrators; continue to specify custom values for parameters in those files as normal.
New utility: hadoop conf, which displays configuration information for a node.

Resolved Issues

General

(Issue 9017) A multi-thread timing issue no longer results in null pointer exceptions.
(Issue 9044) A multi-thread timing issue with the Fair Scheduler no longer results in null pointer exceptions.
(Issue 9046) The configure-common.sh script now uses a default path when mapred-site.xml does not specify the path.
(Issue 9075) The hadoop conf CLI tool checks key values and can dump the entire configuration.

JobTracker

(Issue 9067) JobTrackers start correctly after a manual rolling upgrade to this version.

MCS and CLI

(Issue 9080) The diskremove.sh script now correctly invokes mrconfig.

HBase

(Issue 9037) MapReduce task attempts connect more reliably to the HBase RegionServer. Improved performance in fileclient, particularly in HBase random read operations.

Known Issues

(Issue 5537)

Rare MapR application crashes due to Java JVM dumping core. This is a Java Runtime Environment issue.

(Issue 7310)

In some cases, Map/Reduce preemption may not work with Map Task prefetch and Expresslane. MapR recommends using prefetch, preemption, and expresslane mutually exclusively.

(Issue 7332)

The https://issues.apache.org/jira/browse/HIVE-2907 bug is not yet fixed in MapR's release. It may cause the dropping of a table with a large number of partitions to fail due to an out-of-memory error.

(Issue 7834)

There are no obvious indications when a MapR license expires, except the degradation of services to the non-licensed level. The workaround is to pay attention to License expiration alarms before the actual expiry.

Map/Reduce and Hadoop Patches Integrated Since Last Release

The 2.1.1 release of the MapR distribution of Hadoop has not integrated any Apache Hadoop patches that are not included in the 2.1.0 release of the MapR distribution for Hadoop.



Version 2.0 Release Notes

Important Notes

NFS and Upgrading

Starting in MapR release 1.2.8, a change in the NFS file handle format makes NFS file handles incompatible between NFS servers running MapR version 1.2.7 or earlier and servers running MapR 1.2.8 and following.

NFS clients that were originally mounted to NFS servers on nodes running MapR version 1.2.7 or earlier must remount the file system when the node is upgraded to MapR version 1.2.8 or following.

When upgrading from MapR version 1.2.7 or earlier to version 1.2.8 or later:

1. Upgrade a subset of the existing NFS server nodes, or install the newer version of MapR on a set of new nodes.
2. If the selected NFS server nodes are using virtual IP numbers (VIPs), reassign those VIPs to other NFS server nodes that are still running the previous version of MapR.
3. Apply the upgrade to the selected set of NFS server nodes.
4. Start the NFS servers on nodes upgraded to the newer version.
5. Unmount the NFS clients from the NFS servers of the older version.
6. Remount the NFS clients on the upgraded NFS server nodes. Stage these remounts in groups of 100 or fewer clients to prevent performance disruptions.
7. After remounting all NFS clients, stop the NFS servers on nodes running the older version, then continue the upgrade process.

Due to changes in file handles between versions, cached file IDs cannot persist across this upgrade.

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Cascading 2.0
Flume 1.2.0
Hbase 0.92.1
Hcatalog 0.4.0
Hive 0.9.0
Mahout 0.6 and 0.7
Oozie 3.1.0
Pig 0.10.0
Sqoop 1.4.1
Whirr 0.7.0

New in This Release

MapR Metrics

MapR 2.0 provides job and task analytics. The MapR Metrics service collects and displays analytics information about the Hadoop jobs, tasks, and task attempts that run on the nodes in your cluster. You can use this information to examine specific aspects of your cluster's performance at a very granular level, enabling you to monitor how your cluster responds to changing workloads and optimize your Hadoop jobs or cluster configuration. This information is available both in the Jobs area of the MapR Control System and via the Metrics API. MapR Metrics requires a SQL server to store metrics data.

Centralized Configuration

MapR services can be configured globally across the cluster, from master configuration files stored in MapR-FS, eliminating the need to edit configuration files on all nodes individually. See Central Configuration.

Centralized Logging

MapR's Centralized Logging feature provides a job-centric view of all log files generated by tracker nodes throughout the cluster. During or after execution of a job, use the maprcli job linklogs command to create a centralized log directory populated with symbolic links to all log files related to tasks, map attempts, and reduce attempts pertaining to the specified job(s). If MapR-FS is mounted using NFS, you can use standard tools like grep and find to investigate issues which may be distributed across multiple nodes in the cluster.

Installation Script

The maprinstall script automates the process of installing and starting MapR services on a single node or multiple nodes. The maprinstall script can take command-line arguments or interactive input about how to proceed with the installation.

Running as Non-root User

MapR services can now run as a user other than root, enabling greater security and compliance with SELinux. During installation, you can choose or create a user under which MapR services run; after upgrading to Version 2.0, you can convert a cluster from running as root to a non-root user.

Node Metrics

Metrics information about each node in the cluster is available through the node metrics API.

HCatalog Support

MapR now supports the HCatalog metadata and data storage abstraction service to provide interoperability across the Pig, MapReduce, and Hive services.

Resolved Issues

(Issue 5830) - After a disk failure, MFS may access a non-existent disk and expose a Linux kernel panic
(Issue 6592) - MFS thinks a volume is not mounted, while it in fact is mounted, which may lead to MapReduce task failures
(Issue 6724) - Under heavy I/O load on multiple volumes with readahead, MFS may core dump due to null-pointer assert failure
(Issue 7422) - MFS may not "heartbeat" to CLDB for more than 5 minutes due to FSID mismatch, resulting in errors in MFS and CLDB logs
(Issue 7630) - Permission denied on relative path

Known Issues

(Issue 5537)

Rare MapR application crashes due to Java JVM dumping core. This is a Java Runtime Environment issue.

(Issue 7310)

In some cases, Map/Reduce preemption may not work with Map Task prefetch and Expresslane. MapR recommends using prefetch, preemption, and expresslane mutually exclusively.

(Issue 7332)

The https://issues.apache.org/jira/browse/HIVE-2907 bug is not yet fixed in MapR's release. It may cause the dropping of a table with a large number of partitions to fail due to an out-of-memory error.

(Issue 7834)

There are no obvious indications when a MapR license expires, except the degradation of services to the non-licensed level. The workaround is to pay attention to License expiration alarms before the actual expiry.


Hadoop Compatibility in Version 2.0

MapR provides the following packages:

Apache Hadoop 0.20.2
Cascading 2.0
Flume 1.2.0
Hbase 0.92.1
Hcatalog 0.4.0
Hive 0.9.0
Mahout 0.6 and 0.7
Oozie 3.1.0
Pig 0.10.0
Sqoop 1.4.1
Whirr 0.7.0


Package Dependencies for MapR version 2.x

This page contains details about package dependencies. The Package Dependencies section lists the non-MapR packages that the MapR distribution depends on. The Node Roles, Packages and Dependencies section lists the packages that contain each MapR component, and their dependencies.

Package Dependencies

This section contains dependencies for the MapR software for each supported version of Linux:

Package Dependencies for Red Hat and CentOS
Package Dependencies for SUSE
Package Dependencies for Ubuntu
Core MapR Packages
MapR Hadoop Ecosystem Packages

Package Dependencies for Red Hat and CentOS

Make sure the following packages are installed on each node:

bash
rpcbind
Ajaxterm
dmidecode
glibc
hdparm
iputils
irqbalance
libgcc
libstdc++
redhat-lsb
rpm-libs
sdparm
shadow-utils
syslinux
unzip
zip

On ZooKeeper nodes, netcat is required.
On nodes running the Metrics service, mysql_sock is required.

Package Dependencies for SUSE

All package dependencies for Red Hat are also applicable for SUSE.
Installing on SUSE also requires the following package to be installed:

mapr-compat-suse-2.0.0.15132GA-1.x86_64.rpm

Package Dependencies for Ubuntu

Make sure the following packages are installed on each node:

adduser
ajaxterm
awk
bash
coreutils
dmidecode
dpkg-repack
grep
hdparm
iputils-arping
irqbalance
libc6
libgcc1
libstdc++6
lsb-base
nfs-common
perl


procps
sdparm
sed
syslinux
unzip
zip

On ZooKeeper nodes, netcat is required.

If setting up a local repository, the following packages are required on the repository machine:

dpkg-dev
apache2

Node Roles, Packages and Dependencies

This section lists the roles (i.e., services) that can be installed on a node, which package implements each specific role, and what dependencies exist.

Package files for Red Hat, CentOS, and SUSE distributions have the extension *.rpm. Package files for the Ubuntu distribution have the extension *.deb. The table lists package files ending in rpm only, but each package has an equivalent deb file for Ubuntu.

Core MapR Packages

Role | Package | Internal Dependencies
CLDB | mapr-cldb-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm
MapR Client | mapr-client-2.1.0.16877GA-1.amd64.rpm | None
FileServer | mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
MapR Metrics | mapr-metrics-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
JobTracker | mapr-jobtracker-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm
NFS | mapr-nfs-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
MapR Single Node | mapr-single-node-2.1.0.16877GA-1.x86_64.rpm | mapr-cldb-2.1.0.16877GA-1.x86_64.rpm, mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm, mapr-jobtracker-2.1.0.16877GA-1.x86_64.rpm, mapr-nfs-2.1.0.16877GA-1.x86_64.rpm, mapr-tasktracker-2.1.0.16877GA-1.x86_64.rpm, mapr-webserver-2.1.0.16877GA-1.x86_64.rpm, mapr-zookeeper-2.1.0.16877GA-1.x86_64.rpm, mapr-zk-internal-2.1.0.16877GA-1.x86_64.rpm
TaskTracker | mapr-tasktracker-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm
MapR Upgrade | mapr-upgrade-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
Webserver | mapr-webserver-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
ZooKeeper | mapr-zookeeper-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-zk-internal-2.1.0.16877GA-1.x86_64.rpm

MapR Hadoop Ecosystem Packages

Role | Package | Internal Dependencies
Flume | mapr-flume-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-flume-internal-2.1.0.16877GA-1.x86_64.rpm
HBase Master | mapr-hbase-master-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm
HBase Region Server | mapr-hbase-regionserver-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm
HCatalog | mapr-hcatalog-0.4.0.15190-1.noarch.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
HCatalog Server | mapr-hcatalog-server-0.4.0.15190-1.noarch.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
Hive | mapr-hive-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm, mapr-hive-internal-2.1.0.16877GA-1.x86_64.rpm
Oozie | mapr-oozie-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-oozie-internal-2.1.0.16877GA-1.x86_64.rpm
Pig | mapr-pig-2.1.0.16877GA-1.x86_64.rpm | mapr-pig-internal-2.1.0.16877GA-1.x86_64.rpm
Sqoop | mapr-sqoop-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm, mapr-sqoop-internal-2.1.0.16877GA-1.x86_64.rpm
Whirr | mapr-whirr-2.1.0.16877GA-1.x86_64.rpm | None


Packages and Dependencies for MapR Version 2.x

This page contains details about packages and dependencies for the MapR software and Hadoop ecosystem components for the MapR version 2.x release.

This page contains the following subtopics:

Package Dependencies
  Package Dependencies for Red Hat and CentOS
  Package Dependencies for SUSE
  Package Dependencies for Ubuntu
Node Roles, Packages and Dependencies
  Core MapR Packages
  MapR Hadoop Ecosystem Packages

The Package Dependencies section lists the non-MapR packages that the MapR distribution depends on. The Node Roles, Packages and Dependencies section lists the packages that contain each MapR component, and their dependencies.

Package Dependencies

This section contains dependencies for the MapR software for each supported version of Linux:

Package Dependencies for Red Hat and CentOS

Make sure the following packages are installed on each node:

bash
rpcbind
ajaxterm
dmidecode
glibc
hdparm
iputils
irqbalance
libgcc
libstdc++
redhat-lsb
rpm-libs
sdparm
shadow-utils
syslinux
unzip
zip

On ZooKeeper nodes, netcat is required. On nodes running the Metrics service, mysql_sock is required.
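As a quick check on a Red Hat or CentOS node, you can query for these packages with rpm and install any that are missing with yum. This is a minimal sketch, assuming the node has access to a configured yum repository; the package names are taken from the list above:

rpm -q bash rpcbind dmidecode glibc hdparm iputils irqbalance redhat-lsb rpm-libs sdparm shadow-utils syslinux unzip zip | grep "is not installed"
yum install -y redhat-lsb sdparm syslinux    # install whichever packages the check reports as missing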

Package Dependencies for SUSE

All package dependencies for Red Hat are also applicable to SUSE. Installing on SUSE also requires the following package to be installed:

mapr-compat-suse-2.0.0.15132GA-1.x86_64.rpm
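For example, the compatibility package can be installed directly from the package file with rpm. This is a sketch, assuming the file has been downloaded to the current directory:

rpm -ivh mapr-compat-suse-2.0.0.15132GA-1.x86_64.rpm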

Package Dependencies for Ubuntu

Make sure the following packages are installed on each node:

adduser
ajaxterm
awk
bash
coreutils
dmidecode
dpkg-repack
grep
hdparm
iputils-arping
irqbalance
libc6


libgcc1
libstdc++6
lsb-base
nfs-common
perl
procps
sdparm
sed
syslinux
unzip
zip

On ZooKeeper nodes, netcat is required.
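As with Red Hat, a quick way to satisfy these dependencies on Ubuntu is a single apt-get invocation. This is a sketch, assuming the standard Ubuntu repositories are configured (the netcat package covers the ZooKeeper requirement noted above):

apt-get update
apt-get install -y adduser ajaxterm bash coreutils dmidecode dpkg-repack grep hdparm iputils-arping irqbalance libc6 libgcc1 libstdc++6 lsb-base nfs-common perl procps sdparm sed syslinux unzip zip netcat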

If setting up a local repository, the following packages are required on the repository machine:

dpkg-dev
apache2
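A minimal sketch of setting up such a local repository, assuming the MapR .deb files have been copied to /var/www/mapr on the repository machine (the directory name and host are placeholders):

cd /var/www/mapr
dpkg-scanpackages . /dev/null | gzip -9c > Packages.gz

Client nodes can then reference the repository by adding a line such as the following to /etc/apt/sources.list:

deb http://<repository-host>/mapr ./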

Node Roles, Packages and Dependencies

This section lists the roles (i.e., services) that can be installed on a node, the package that implements each role, and the dependencies that exist.

Package files for Red Hat, CentOS and SUSE distributions have the extension .rpm. Package files for the Ubuntu distribution have the extension .deb. The table lists package files ending in .rpm only, but each package has an equivalent .deb file for Ubuntu.

Core MapR Packages

Role | Package | Internal Dependencies

CLDB | mapr-cldb-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm
MapR Client | mapr-client-2.1.0.16877GA-1.amd64.rpm | None
FileServer | mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
MapR Metrics | mapr-metrics-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
JobTracker | mapr-jobtracker-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm
NFS | mapr-nfs-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
MapR Single Node | mapr-single-node-2.1.0.16877GA-1.x86_64.rpm | mapr-cldb-2.1.0.16877GA-1.x86_64.rpm, mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm, mapr-jobtracker-2.1.0.16877GA-1.x86_64.rpm, mapr-nfs-2.1.0.16877GA-1.x86_64.rpm, mapr-tasktracker-2.1.0.16877GA-1.x86_64.rpm, mapr-webserver-2.1.0.16877GA-1.x86_64.rpm, mapr-zookeeper-2.1.0.16877GA-1.x86_64.rpm, mapr-zk-internal-2.1.0.16877GA-1.x86_64.rpm
TaskTracker | mapr-tasktracker-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-fileserver-2.1.0.16877GA-1.x86_64.rpm
MapR Upgrade | mapr-upgrade-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
Webserver | mapr-webserver-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
ZooKeeper | mapr-zookeeper-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-zk-internal-2.1.0.16877GA-1.x86_64.rpm
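Because each role package declares its internal dependencies, installing a role pulls in the packages it needs. For example, a hypothetical sketch of installing the FileServer and TaskTracker roles on a Red Hat or CentOS node, assuming the MapR yum repository is configured (mapr-core is installed automatically as a dependency):

yum install mapr-fileserver mapr-tasktracker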

MapR Hadoop Ecosystem Packages


Role | Package | Internal Dependencies

Flume | mapr-flume-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-flume-internal-2.1.0.16877GA-1.x86_64.rpm
HBase Master | mapr-hbase-master-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm
HBase Region Server | mapr-hbase-regionserver-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm
HCatalog | mapr-hcatalog-0.4.0.15190-1.noarch.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
HCatalog Server | mapr-hcatalog-server-0.4.0.15190-1.noarch.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm
Hive | mapr-hive-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm, mapr-hive-internal-2.1.0.16877GA-1.x86_64.rpm
Oozie | mapr-oozie-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-oozie-internal-2.1.0.16877GA-1.x86_64.rpm
Pig | mapr-pig-2.1.0.16877GA-1.x86_64.rpm | mapr-pig-internal-2.1.0.16877GA-1.x86_64.rpm
Sqoop | mapr-sqoop-2.1.0.16877GA-1.x86_64.rpm | mapr-core-2.1.0.16877GA-1.x86_64.rpm, mapr-hbase-internal-2.1.0.16877GA-1.x86_64.rpm, mapr-sqoop-internal-2.1.0.16877GA-1.x86_64.rpm
Whirr | mapr-whirr-2.1.0.16877GA-1.x86_64.rpm | None


Version 2.0.1 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Cascading 2.0
Flume 1.2.0
HBase 0.92.1
HCatalog 0.4.0
Hive 0.9.0
Mahout 0.6 and 0.7
Oozie 3.1.0
Pig 0.10.0
Sqoop 1.4.1
Whirr 0.7.0

We have released HBase 0.94.1 and Oozie 3.2.0 separately, in the following location: http://package.mapr.com/releases/ecosystem-all

New in This Release

This is a maintenance release with no new features.

Resolved Issues

The following issues have been resolved in this release.

MapReduce

(Issue 6132)

JobTracker HA detects the JT process being terminated and fails over to a standby JT. However, if the JT process is present but not responsive, no failover occurs. With the fix, if the JT process is unresponsive for a configurable duration (default 10 minutes), its jstack is logged and it is restarted automatically.

(Issue 7628)

TaskTracker won't run if the node hostname resolves to an IP not covered by MAPR_SUBNETS.
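For reference, MAPR_SUBNETS restricts MapR to a set of subnets. A minimal sketch, assuming the variable is exported in /opt/mapr/conf/env.sh and that the node's addresses fall in the example ranges (the CIDR values are placeholders):

export MAPR_SUBNETS=10.10.1.0/24,10.10.2.0/24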

(Issue 8105)

Potential deadlock while cleaning up Job Metrics data.

(Issue 8115)

Split calculation does not handle a MapR-FS chunk size of 0, and the client JVM spins at 100% CPU utilization while submitting a Map/Reduce job.

Filesystem

(Issue 6717)

Potential memory leak when flushing dirty inodes in the file system.

(Issue 7894)

Mirroring no longer causes an assert failure by losing track of the most recent replica of a blank source container.

(Issue 7925)

A lot of containers may get created on a volume when a node goes down.

(Issue 7931)

IOMgr trying to access the cache pages after failing the IO can cause an assert in the filesystem.

(Issue 7984)


Name container for a large volume stuck under-replicated for two days due to a message timeout, likely because of network issues.

(Issue 8071)

CLDB shows containers as having a master, but the containers are stale because messages larger than 64K were not being sent.

(Issue 8092)

fsck doesn't work on storage pools > 6TB.

(Issue 8139)

No CLDB master for ~20 minutes after upgrading.

(Issue 8174)

Excessive number of containers in mapred local volume.

(Issue 8179)

The "disk online" operation may crash with EIO error.

(Issue 8189)

The cldb.balancer.disk.threshold.percentage parameter is not used.

(Issue 8262)

When the topology of a local volume changes, containers for the local volume are not re-used, resulting in a large number of container creation operations.

Logging

(Issue 7677)

Can't set debug logging for maprcli (or any option) from the command line.

(Issue 7793)

JobTracker log shows garbage characters in the "NetworkTopology: Adding a new node:" entry.

(Issue 8144)

FileClient should have an option to add the IP address of the node to log messages along with the container ID.

MCS & CLI

(Issue 6507)

Spurious volume alarms may get raised.

(Issue 6552)

Support for anonymous authentication with SMTP.

(Issue 7892)

Nodes that register early, while a CLDB node is becoming the master CLDB node, sometimes incorrectly appear twice in the balancer classification.

(Issue 7989)

Add start time to maprcli dump rereplicationinfo -json output.

(Issue 8047)

Maintenance mode may not work correctly - node under maintenance is reported with status Critical.


(Issue 8134)

The 2.0 MCS keeps giving the error "Cannot use both alarmednodes and nfsnodes together" after switching from the nfsnodes report to the alarmed nodes report.

Services

(Issue 8032)

Warden and ZooKeeper fail to start due to requiretty being enabled by default in version 1.7.4p5 of sudo.

(Issue 8280)

When MapR is run as non-root user, stop warden may not stop JT/TT due to incorrect permissions on the pid file.

NFS

(Issue 8260)

NFS client hung accessing files with chunksize set to 0.

(Issue 8136)

Buffer overflow when there are more entries to export than the 1024-byte buffer can hold.

Security

(Issue 8137)

mount fails due to the mapr user not having root privileges.


Version 1.2 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

New in This Release

Dial Home

Dial Home is a feature that collects information about the cluster for MapR support and engineering. You can opt in or out of the Dial Home feature when you first install or upgrade MapR. To change the Dial Home status of your cluster at any time, see the Dial Home commands.
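A sketch of checking and changing the Dial Home status from the command line, assuming the dialhome subcommands take this form in your release:

maprcli dialhome status
maprcli dialhome enable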

Rolling Upgrade

The rollingupgrade.sh script performs a software upgrade of an entire MapR cluster. See Cluster Upgrade for details.

Improvements to the Support Tools

The mapr-support-collect.sh script has been enhanced to generate and gather support output from specified cluster nodes into a single output file via MapR-FS. To support this feature, mapr-support-collect.sh has a new option:

-O, --online  Specifies a space-separated list of nodes from which to gather support output.

There is now a "mini-dump" option for both support-dump.sh and support-collect.sh to limit the size of the support output. When the -m or --mini-dump option is specified along with a size, support-dump.sh collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size. The total size of the output is therefore limited to approximately 2 * size * number of logs. The size can be specified in bytes, or using the following suffixes:

b - blocks (512 bytes)
k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

-m, --mini-dump <size>  For any log file greater than 2 * <size>, collects only a head and tail, each of the specified size. The <size> may have a suffix specifying units: b - blocks (512 bytes), k - kilobytes (1024 bytes), m - megabytes (1024 kilobytes)
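For example, a sketch of gathering support output from two nodes while capping log sizes with the mini-dump option (the script path and node names are assumptions for illustration):

/opt/mapr/support/tools/mapr-support-collect.sh -O "node-a node-b" -m 10m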

MapR Virtual Machine

The MapR Virtual Machine is a fully functional single-node Hadoop cluster capable of running MapReduce programs and working with applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware Player.

Windows 7 Client

A Windows 7 version of the MapR Client is now available. The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you can submit Map/Reduce jobs and run hadoop fs and hadoop mfs commands.

Resolved Issues

(Issue 4307) Snapshot create fails with error EEXIST


Known Issues

(Issue 5590)

When mapr-core is upgraded from 1.1.3 to 1.2, /opt/mapr/bin/versions.sh is updated to contain hbase-version 0.90.4, even if HBase has not been upgraded. This can create problems for any process that uses versions.sh to determine the correct version of HBase. After upgrading to Version 1.2, check that the version of HBase specified in versions.sh is correct.

(Issue 5489) HBase nodes require configure.sh during rolling upgrade

When performing an upgrade to MapR 1.2, the HBase package is upgraded from version 0.90.2 to 0.90.4, and it is necessary to run configure.sh on any nodes that are running the HBase region server or HBase master.

(Issue 4269) Bulk operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.
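A sketch of the two commands, assuming a mirror volume named projects-mirror and a source volume named projects (both names are placeholders):

maprcli volume mirror start -name projects-mirror -full true
maprcli volume dump create -name projects -dumpfile projects.dump -full true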


Version 1.2.10 Release Notes

Release Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Flume 0.9.4
HBase 0.90.6
Hive 0.7.1
Mahout 0.5
Oozie 3.0.0
Pig 0.9.0
Sqoop 1.3.0
Whirr 0.3.0

New in This Release

This is a maintenance release. No new features.

Resolved Issues

MapReduce

(Issue 6132) High-availability JobTracker now logs jstack and automatically restarts when the JT process exists but is unresponsive for a configurable amount of time. The default configuration is ten minutes.
(Issue 7628) TaskTracker won't run if the node hostname resolves to an IP that is not covered by the value in the MAPR_SUBNETS environment variable.
(Issue 8115) Split calculation now correctly handles a MapR-FS chunk size of zero. The client JVM no longer spins at 100% CPU utilization while submitting a MapReduce job.
(Issue 7310) MapReduce task preemption now works correctly when tasks are scheduled with ExpressLane.

File System

(Issue 6021) Cleaning multiple inodes in a container now creates a single update thread.
(Issue 6717) Potential memory leak in flushing dirty inodes in file system.
(Issue 7894) Mirroring no longer causes an assert failure by losing track of the most recent replica of a blank source container.
(Issue 7924) The number of working GetData RPCs reported after a failed attempt is now correct.
(Issue 7925) A node going down no longer creates a large number of containers in a volume.
(Issue 7931) IOMgr no longer causes assert failures in the filesystem while trying to access the cache pages after failing.
(Issue 7984) Name containers for large volumes are no longer stuck as under-replicated due to message timeout.
(Issue 8071) CLDB no longer shows a stale container as master due to messages over 64K not being sent.
(Issue 8092) fsck now works on Storage Pools over 6TB.
(Issue 8139) CLDB no longer takes over 20 minutes to come up on clusters with large numbers of containers.
(Issue 5537) Allocating buffers while buffers are still draining no longer causes segmentation faults.
(Issue 8179) "disk online" operations no longer crash with an EIO error.
(Issue 7502) Mirroring: Rollforward operations now correctly update the container epoch, allowing for mirror restarts.
(Issue 7657) Nodes with a very large number of containers no longer become stuck waiting on a BECOME_MASTER command to process.
(Issue 7881) Write failures now correctly throw exceptions.
(Issue 7599) MFS cache now populates with correct hostname information for containers.
(Issue 7685) Disk balancer selects host nodes for new container replicas more evenly.
(Issue 7722) MFS now supports more than 256 groups.
(Issue 7752) MFS no longer returns null hostnames on rare occasions.
(Issue 7691) New command hadoop mfs -lsrv <path> recursively lists all paths within a single volume.
(Issue 7575) Renaming of mapr.cluster.root now supported.

Logging

(Issue 7677) Debug log levels can be set from the command line.
(Issue 7793) Updating the hostname entry in serverTab_ before calling createMapRBlockLocation() no longer generates garbage characters in the JobTracker log.
(Issue 7653) Excessive MFS logging fixed.
(Issue 7810) More details are logged, including:

Where a job was killed from (MCS or API)
User that sent the API call to kill a job
IP address sending the API call to kill a job

(Issue 7892) Balancing classifications for CLDB logs no longer show the same node multiple times.

MCS and CLI


(Issue 8206) Spurious volume alarms no longer being raised.
(Issue 7989) The output of the maprcli dump rereplicationinfo -json command now includes the start time of the resyncing operation.
(Issue 8047) Nodes in maintenance mode now display correctly in MCS.
(Issue 8404) Volumes tab in the MCS UI now behaves properly.

NFS

(Issue 8260) NFS client no longer hangs while accessing files with chunksize set to 0.
(Issue 8136) The maprcli nfsmgmt refreshexports command no longer generates a buffer overflow when /opt/mapr/conf/exports is over 1024 bytes in size.

Alerts

(Issue 5941) Email alerts now dispatch correctly.


Version 1.2.9 Release Notes

Release Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Flume 0.9.4
HBase 0.90.6
Hive 0.7.1
Mahout 0.5
Oozie 3.0.0
Pig 0.9.0
Sqoop 1.3.0
Whirr 0.3.0

New in This Release

This is a maintenance release. No new features.

Resolved Issues

General

(Issue 5862) MapRclient.dll now correctly exports the libhdfs API
(Issue 5941) Fixed email sending inconsistency
(Issue 7531) Fixed problems with inline setup and "permission denied" errors
(Issue 7582) Fix to ignore invalid hostnames in /opt/mapr/conf/mapr-clusters.conf and reprocess them on demand

JobTracker

(Issue 5761) Fixed JobTracker registration problem with ZooKeeper after reboot
(Issue 6132) Fixed JobTracker hang that caused inconsistent failover behavior
(Issue 6861) Users submitting jobs without a queue name can no longer cause the JobTracker to fail
(Issue 6901) Fixed issue where TaskTrackers fail to kill tasks and then the TaskTracker hangs

NFS

Various fixes and refinements to the NFS feature to enhance performance, improve reliability, increase failover performance, and improve scalability.

CLDB

Various fixes and refinements to improve CLDB failover performance and replication behavior including:

(Issue 7052) Resolved problem with ZooKeeper disconnecting CLDB and causing failover.

MapR-FS

(Issue 7218) Fixed CLDB RPC problems on multi-homed servers (caused CLDB shutdown on ZooKeeper restart)
(Issue 7465) Resolved intermittent container resync failures
(Issue 7558) Fixed cause of MapR file system cores seen in customer clusters
(Issue 7586) Fix to avoid creating directories with chunksize zero
(Issue 7605) Fixed problem with Hadoop jobs failing due to MFS error 110

Known Issues

(Issue 7630)

If a user submits a job with a relative path on a fresh installation, the job may fail with a permission denied error for the /user directory, because that path does not yet exist. As a workaround, the administrator should create a volume called users mounted at /user. Example:

maprcli volume create -name users -path /user

It is a good idea to create a volume for each user, mounted within /user.
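For example, a per-user volume for a hypothetical user jsmith could be created as follows (the user name is a placeholder):

maprcli volume create -name jsmith -path /user/jsmith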


Version 1.2.7 Release Notes

Important Notes

Linux Leap Second Bug

A bug in the hrtimer component of the Linux ntp subsystem was discovered when the leap second was applied to Coordinated Universal Time (UTC) on June 30, 2012 at 23:59:60 UTC. To work around the bug, MapR recommends running the following command as root on all nodes:

/etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date

Wait a day before re-enabling NTP:

/etc/init.d/ntp start

Inline Setup

Inline setup is a setting that causes each job's setup task to run as a thread directly inside the JobTracker instead of being forked out as a separate task by a TaskTracker. This means that jobs that need a setup task will start running faster in some cases, because they don't need to wait for the TaskTrackers to get scheduled and then run the setup task.

MapR recommends turning off inline setup (mapreduce.jobtracker.inline.setup.cleanup in mapred-site.xml) on production clusters, because it is dangerous to have the JobTracker execute user-defined code as the privileged JT user (root). If you originally installed version 1.2.7 or earlier, inline setup defaults to true and you should set it to false by adding the following to mapred-site.xml:

<property>
  <name>mapreduce.jobtracker.inline.setup.cleanup</name>
  <value>false</value>
  <description></description>
</property>

Release Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Flume 0.9.4
HBase 0.90.6
Hive 0.7.1
Mahout 0.5
Oozie 3.0.0
Pig 0.9.0
Sqoop 1.3.0
Whirr 0.3.0

New in This Release

Support for SUSE Linux
CLDB enhancements to improve reliability, scalability and performance
Fixes in the MapReduce layer related to security and data integrity
FSCK performance and logging improvements
Updates to the MapR storage services layer to improve performance, security and stability
Support for a whitelist of subnets from which MapR-FS will accept requests
Support for HBase 0.90.6
NFS improvements to increase performance, reliability and failure recovery
Miscellaneous defect fixes to rolling upgrade and the MapR GUI
MapR now works with Accumulo


Version 1.2.3 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(1438) Home directory support
(5833) Improve CLDB failover time
(5857) Streamline container replication and reporting
(5934) Ensure disktab is correct after reboot
(6014) Optimize Java garbage collection to mitigate CLDB disruptions
(6074) Enhance CLDB exception handling
(6140) Improve mfs exception handling
(6144) CLDB timeout in M3 reduced to 1 hour on new node (instant on same node)
(6166) Add API to blacklist a TaskTracker manually
(6171) Fixed container stuck offline problem
(6198) Corrected overcommit documentation page
(6211) Rolling upgrade enhancements
(6235) BTree improvements
(6273) Fixed getBlockLocations() in MapReduce layer


Version 1.2.2 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(5840) MFS process stays at 100% after upgrade
(5848) Container resyncs not executed in a timely manner
(5866) MFS generates core file
(5897) CLDB exception causes failover
(5907) CLDB exception causes failover
(5961) File system loses track of which container is master
(5971) Memory leak in MapR client
(6044) CLDB over-replicates data


Hadoop Compatibility in Version 1.2

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

MapR HBase Patches

In the /opt/mapr/hbase/hbase-0.90.4/mapr-hbase-patches directory, MapR provides the following patches for HBase:

0000-hbase-with-mapr.patch
0001-HBASE-4196-0.90.4.patch
0002-HBASE-4144-0.90.4.patch
0003-HBASE-4148-0.90.4.patch
0004-HBASE-4159-0.90.4.patch
0005-HBASE-4168-0.90.4.patch
0006-HBASE-4196-0.90.4.patch
0007-HBASE-4095-0.90.4.patch
0008-HBASE-4222-0.90.4.patch
0009-HBASE-4270-0.90.4.patch
0010-HBASE-4238-0.90.4.patch
0011-HBASE-4387-0.90.4.patch
0012-HBASE-4295-0.90.4.patch
0013-HBASE-4563-0.90.4.patch
0014-HBASE-4570-0.90.4.patch
0015-HBASE-4562-0.90.4.patch

MapR Pig Patches

In the /opt/mapr/pig/pig-0.9.0/mapr-pig-patches directory, MapR provides the following patches for Pig:

0000-pig-mapr-compat.patch
0001-remove-hardcoded-hdfs-refs.patch
0002-pigmix2.patch
0003-pig-hbase-compatibility.patch

MapR Mahout Patches

In the /opt/mapr/mahout/mahout-0.5/mapr-mahout-patches directory, MapR provides the following patches for Mahout:

0000-mahout-mapr-compat.patch

MapR Hive Patches

In the /opt/mapr/hive/hive-0.7.1/mapr-hive-patches directory, MapR provides the following patches for Hive:

0000-symlink-support-in-hive-binary.patch
0001-remove-unnecessary-fsscheme-check.patch
0002-remove-unnecessary-fsscheme-check-1.patch


MapR Flume Patches

In the /opt/mapr/flume/flume-0.9.4/mapr-flume-patches directory, MapR provides the following patches for Flume:

0000-flume-mapr-compat.patch

MapR Sqoop Patches

In the /opt/mapr/sqoop/sqoop-1.3.0/mapr-sqoop-patches directory, MapR provides the following patches for Sqoop:

0000-setting-hadoop-hbase-versions-to-mapr-shipped-versions.patch

MapR Oozie Patches

In the /opt/mapr/oozie/oozie-3.0.0/mapr-oozie-patches directory, MapR provides the following patches for Oozie:

0000-oozie-with-mapr.patch
0001-OOZIE-022-3.0.0.patch
0002-OOZIE-139-3.0.0.patch

HBase Common Patches

MapR 1.2 includes the following Apache HBase patches that are not included in the Apache HBase base version 0.90.4:

[HBASE-4169] FSUtils LeaseRecovery for non HDFS FileSystems.
[HBASE-4168] A client continues to try and connect to a powered down regionserver
[HBASE-4196] TableRecordReader may skip first row of region
[HBASE-4144] RS does not abort if the initialization of RS fails
[HBASE-4148] HFileOutputFormat doesn't fill in TIMERANGE_KEY metadata
[HBASE-4159] HBaseServer - IPC Reader threads are not daemons
[HBASE-4095] Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
[HBASE-4270] IOE ignored during flush-on-close causes dataloss
[HBASE-4238] CatalogJanitor can clear a daughter that split before processing its parent
[HBASE-4387] Error while syncing: DFSOutputStream is closed
[HBASE-4295] rowcounter does not return the correct number of rows in certain circumstances
[HBASE-4563] When error occurs in this.parent.close(false) of split, the split region cannot write or read
[HBASE-4570] Fix a race condition that could cause inconsistent results from scans during concurrent writes.
[HBASE-4562] When split doing offlineParentInMeta encounters error, it'll cause data loss
[HBASE-4222] Make HLog more resilient to write pipeline failures

Oozie Common Patches

MapR 1.2 includes the following Apache Oozie patches that are not included in the Apache Oozie base version 3.0.0:

[GH-0022] Add Hive action
[GH-0139] Add Sqoop action

Hadoop Common Patches

MapR 1.2 includes the following Apache Hadoop patches that are not included in the Apache Hadoop base version 0.20.2:

[HADOOP-1722] Make streaming to handle non-utf8 byte array
[HADOOP-1849] IPC server max queue size should be configurable
[HADOOP-2141] speculative execution start up condition based on completion time
[HADOOP-2366] Space in the value for dfs.data.dir can cause great problems
[HADOOP-2721] Use job control for tasks (and therefore for pipes and streaming)
[HADOOP-2838] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
[HADOOP-3327] Shuffling fetchers waited too long between map output fetch re-tries
[HADOOP-3659] Patch to allow hadoop native to compile on Mac OS X
[HADOOP-4012] Providing splitting support for bzip2 compressed files
[HADOOP-4041] IsolationRunner does not work as documented
[HADOOP-4490] Map and Reduce tasks should run as the user who submitted the job
[HADOOP-4655] FileSystem.CACHE should be ref-counted
[HADOOP-4656] Add a user to groups mapping service


[HADOOP-4675] Current Ganglia metrics implementation is incompatible with Ganglia 3.1
[HADOOP-4829] Allow FileSystem shutdown hook to be disabled
[HADOOP-4842] Streaming combiner should allow command, not just JavaClass
[HADOOP-4930] Implement setuid executable for Linux to assist in launching tasks as job owners
[HADOOP-4933] ConcurrentModificationException in JobHistory.java
[HADOOP-5170] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
[HADOOP-5175] Option to prohibit jars unpacking
[HADOOP-5203] TT's version build is too restrictive
[HADOOP-5396] Queue ACLs should be refreshed without requiring a restart of the job tracker
[HADOOP-5419] Provide a way for users to find out what operations they can do on which M/R queues
[HADOOP-5420] Support killing of process groups in LinuxTaskController binary
[HADOOP-5442] The job history display needs to be paged
[HADOOP-5450] Add support for application-specific typecodes to typed bytes
[HADOOP-5469] Exposing Hadoop metrics via HTTP
[HADOOP-5476] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken
[HADOOP-5488] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully
[HADOOP-5528] Binary partitioner
[HADOOP-5582] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact representation).
[HADOOP-5592] Hadoop Streaming - GzipCodec
[HADOOP-5613] change S3Exception to checked exception
[HADOOP-5643] Ability to blacklist tasktracker
[HADOOP-5656] Counter for S3N Read Bytes does not work
[HADOOP-5675] DistCp should not launch a job if it is not necessary
[HADOOP-5733] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
[HADOOP-5737] UGI checks in testcases are broken
[HADOOP-5738] Split waiting tasks field in JobTracker metrics to individual tasks
[HADOOP-5745] Allow setting the default value of maxRunningJobs for all pools
[HADOOP-5784] The length of the heartbeat cycle should be configurable.
[HADOOP-5801] JobTracker should refresh the hosts list upon recovery
[HADOOP-5805] problem using top level s3 buckets as input/output directories
[HADOOP-5861] s3n files are not getting split by default
[HADOOP-5879] GzipCodec should read compression level etc from configuration
[HADOOP-5913] Allow administrators to be able to start and stop queues
[HADOOP-5958] Use JDK 1.6 File APIs in DF.java wherever possible
[HADOOP-5976] create script to provide classpath for external tools
[HADOOP-5980] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController
[HADOOP-5981] HADOOP-2838 doesnt work as expected
[HADOOP-6132] RPC client opens an extra connection for VersionedProtocol
[HADOOP-6133] ReflectionUtils performance regression
[HADOOP-6148] Implement a pure Java CRC32 calculator
[HADOOP-6161] Add get/setEnum to Configuration
[HADOOP-6166] Improve PureJavaCrc32
[HADOOP-6184] Provide a configuration dump in json format.
[HADOOP-6227] Configuration does not lock parameters marked final if they have no value.
[HADOOP-6234] Permission configuration files should use octal and symbolic
[HADOOP-6254] s3n fails with SocketTimeoutException
[HADOOP-6269] Missing synchronization for defaultResources in Configuration.addResource
[HADOOP-6279] Add JVM memory usage to JvmMetrics
[HADOOP-6284] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full
[HADOOP-6299] Use JAAS LoginContext for our login
[HADOOP-6312] Configuration sends too much data to log4j
[HADOOP-6337] Update FilterInitializer class to be more visible and take a conf for further development
[HADOOP-6343] Stack trace of any runtime exceptions should be recorded in the server logs.
[HADOOP-6400] Log errors getting Unix UGI
[HADOOP-6408] Add a /conf servlet to dump running configuration
[HADOOP-6419] Change RPC layer to support SASL based mutual authentication
[HADOOP-6433] Add AsyncDiskService that is used in both hdfs and mapreduce
[HADOOP-6441] Prevent remote CSS attacks in Hostname and UTF-7.
[HADOOP-6453] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
[HADOOP-6471] StringBuffer -> StringBuilder - conversion of references as necessary
[HADOOP-6496] HttpServer sends wrong content-type for CSS files (and others)
[HADOOP-6510] doAs for proxy user
[HADOOP-6521] FsPermission:SetUMask not updated to use new-style umask setting.
[HADOOP-6534] LocalDirAllocator should use whitespace trimming configuration getters
[HADOOP-6543] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers
[HADOOP-6558] archive does not work with distcp -update
[HADOOP-6568] Authorization for default servlets
[HADOOP-6569] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read
[HADOOP-6572] RPC responses may be out-of-order with respect to SASL
[HADOOP-6577] IPC server response buffer reset threshold should be configurable
[HADOOP-6578] Configuration should trim whitespace around a lot of value types
[HADOOP-6599] Split RPC metrics into summary and detailed metrics
[HADOOP-6609] Deadlock in DFSClient#getBlockLocations even with the security disabled
[HADOOP-6613] RPC server should check for version mismatch first


"Bad Connection to FS" message in FSShell should print message from the exception[HADOOP-6627] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.[HADOOP-6631] AccessControlList uses full-principal names to verify acls causing queue-acls to fail[HADOOP-6634] Benchmark overhead of RPC session establishment[HADOOP-6637] FileSystem.get() does RPC retries within a static synchronized block[HADOOP-6640] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention[HADOOP-6644] login object in UGI should be inside the subject[HADOOP-6649] ShellBasedUnixGroupsMapping shouldn't have a cache[HADOOP-6652] NullPointerException in setupSaslConnection when browsing directories[HADOOP-6653] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file[HADOOP-6663] RPC.waitForProxy should retry through NoRouteToHostException[HADOOP-6667] zlib.compress.level ignored for DefaultCodec initialization[HADOOP-6669] UserGroupInformation doesn't support use in hash tables[HADOOP-6670] Performance Improvement in Secure RPC[HADOOP-6674] user object in the subject in UGI should be reused in case of a relogin.[HADOOP-6687] Incorrect exit codes for "dfs -chown", "dfs -chgrp"[HADOOP-6701] Relogin behavior for RPC clients could be improved[HADOOP-6706] Symbolic umask for file creation is not consistent with posix[HADOOP-6710] FsShell 'hadoop fs -text' does not support compression codecs[HADOOP-6714] Client does not close connection when an exception happens during SASL negotiation[HADOOP-6718] NetUtils.connect should check that it hasn't connected a socket to itself[HADOOP-6722] unchecked exceptions thrown in IPC Connection orphan clients[HADOOP-6723] IPC doesn't properly handle IOEs thrown by socket factory[HADOOP-6724] adding some java doc to Server.RpcMetrics, UGI[HADOOP-6745] NullPointerException for hadoop clients launched from streaming tasks[HADOOP-6757] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race[HADOOP-6760] exception while doing RPC I/O closes channel[HADOOP-6762] UserGroupInformation.createProxyUser's javadoc is broken[HADOOP-6776] Add a new newInstance method in FileSystem that takes a "user" as argument[HADOOP-6813] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh[HADOOP-6815] Provide a JNI-based implementation of GroupMappingServiceProvider[HADOOP-6818] Provide a web server plugin that uses a static user for the web UI[HADOOP-6832] IPC leaks call parameters when exceptions thrown[HADOOP-6833] Introduce additional statistics to FileSystem[HADOOP-6859] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of[HADOOP-6864]

GroupMappingServiceProvider) The efficient comparators aren't always used except for BytesWritable and Text[HADOOP-6881] RawLocalFileSystem#setWorkingDir() does not work for relative names[HADOOP-6899] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal[HADOOP-6907] BZip2Codec incorrectly implements read()[HADOOP-6925] Fix BooleanWritable comparator in 0.20[HADOOP-6928] The GroupMappingServiceProvider interface should be public[HADOOP-6943] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template[HADOOP-6950] Allow wildcards to be used in ProxyUsers configurations[HADOOP-6995] Configuration.writeXML should not hold lock while outputting[HADOOP-7082] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context[HADOOP-7101] Remove unnecessary DNS reverse lookups from RPC layer[HADOOP-7104] Implement chmod with JNI[HADOOP-7110] FsShell should dump all exceptions at DEBUG level[HADOOP-7114] Add a cache for getpwuid_r and getpwgid_r calls[HADOOP-7115] NPE in Configuration.writeXml[HADOOP-7118] Timed out shell commands leak Timer threads[HADOOP-7122] getpwuid_r is not thread-safe on RHEL6[HADOOP-7156] SecureIO should not check owner on non-secure clusters that have no native support[HADOOP-7172] Remove unused fstat() call from NativeIO[HADOOP-7173] WritableComparator.get should not cache comparator objects[HADOOP-7183] Remove deprecated local.cache.size from core-default.xml[HADOOP-7184]

MapReduce Patches

MapR 1.2 includes the following Apache MapReduce patches that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
[MAPREDUCE-118] Job.getJobID() will always return null
[MAPREDUCE-144] TaskMemoryManager should log process-tree's status while killing tasks.
[MAPREDUCE-181] Secure job submission
[MAPREDUCE-211] Provide a node health check script and run it periodically to check the node health status
[MAPREDUCE-220] Collecting cpu and memory usage for MapReduce tasks
[MAPREDUCE-270] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes
[MAPREDUCE-277] Job history counters should be available on the UI.
[MAPREDUCE-339] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to fail.
[MAPREDUCE-364] Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.


[MAPREDUCE-370] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
[MAPREDUCE-415] JobControl Job does always has an unassigned name
[MAPREDUCE-416] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
[MAPREDUCE-461] Enable ServicePlugins for the JobTracker
[MAPREDUCE-463] The job setup and cleanup tasks should be optional
[MAPREDUCE-467] Collect information about number of tasks succeeded / total per time unit for a tasktracker.
[MAPREDUCE-476] extend DistributedCache to work locally (LocalJobRunner)
[MAPREDUCE-478] separate jvm param for mapper and reducer
[MAPREDUCE-516] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
[MAPREDUCE-517] The capacity-scheduler should assign multiple tasks per heartbeat
[MAPREDUCE-521] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.
[MAPREDUCE-532] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue
[MAPREDUCE-551] Add preemption to the fair scheduler
[MAPREDUCE-572] If #link is missing from uri format of -cacheArchive then streaming does not throw error.
[MAPREDUCE-655] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.
[MAPREDUCE-676] Existing diagnostic rules fail for MAP ONLY jobs
[MAPREDUCE-679] XML-based metrics as JSP servlet for JobTracker
[MAPREDUCE-680] Reuse of Writable objects is improperly handled by MRUnit
[MAPREDUCE-682] Reserved tasktrackers should be removed when a node is globally blacklisted
[MAPREDUCE-693] Conf files not moved to "done" subdirectory after JT restart
[MAPREDUCE-698] Per-pool task limits for the fair scheduler
[MAPREDUCE-706] Support for FIFO pools in the fair scheduler
[MAPREDUCE-707] Provide a jobconf property for explicitly assigning a job to a pool
[MAPREDUCE-709] node health check script does not display the correct message on timeout
[MAPREDUCE-714] JobConf.findContainingJar unescapes unnecessarily on Linux
[MAPREDUCE-716] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
[MAPREDUCE-722] More slots are getting reserved for HiRAM job tasks then required
[MAPREDUCE-732] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode
[MAPREDUCE-734] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs
[MAPREDUCE-739] Allow relative paths to be created inside archives.
[MAPREDUCE-740] Provide summary information per job once a job is finished.
[MAPREDUCE-744] Support in DistributedCache to share cache files with other users after HADOOP-4493
[MAPREDUCE-754] NPE in expiry thread when a TT is lost
[MAPREDUCE-764] TypedBytesInput's readRaw() does not preserve custom type codes
[MAPREDUCE-768] Configuration information should generate dump in a standard format.
[MAPREDUCE-771] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks
[MAPREDUCE-782] Use PureJavaCrc32 in mapreduce spills
[MAPREDUCE-787] -files, -archives should honor user given symlink path
[MAPREDUCE-809] Job summary logs show status of completed jobs as RUNNING
[MAPREDUCE-814] Move completed Job history files to HDFS
[MAPREDUCE-817] Add a cache for retired jobs with minimal job info and provide a way to access history file url
[MAPREDUCE-825] JobClient completion poll interval of 5s causes slow tests in local mode
[MAPREDUCE-840] DBInputFormat leaves open transaction
[MAPREDUCE-842] Per-job local data on the TaskTracker node should have right access-control
[MAPREDUCE-856] Localized files from DistributedCache should have right access-control
[MAPREDUCE-871] Job/Task local files have incorrect group ownership set by LinuxTaskController binary
[MAPREDUCE-875] Make DBRecordReader execute queries lazily
[MAPREDUCE-885] More efficient SQL queries for DBInputFormat
[MAPREDUCE-890] After HADOOP-4491, the user who started mapred system is not able to run job.
[MAPREDUCE-896] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[MAPREDUCE-899] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are misconfigured.
[MAPREDUCE-927] Cleanup of task-logs should happen in TaskTracker instead of the Child
[MAPREDUCE-947] OutputCommitter should have an abortJob method
[MAPREDUCE-964] Inaccurate values in jobSummary logs
[MAPREDUCE-967] TaskTracker does not need to fully unjar job jars
[MAPREDUCE-968] NPE in distcp encountered when placing _logs directory on S3FileSystem
[MAPREDUCE-971] distcp does not always remove distcp.tmp.dir
[MAPREDUCE-1028] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.
[MAPREDUCE-1030] Reduce tasks are getting starved in capacity scheduler
[MAPREDUCE-1048] Show total slot usage in cluster summary on jobtracker webui
[MAPREDUCE-1059] distcp can generate uneven map task assignments
[MAPREDUCE-1083] Use the user-to-groups mapping service in the JobTracker
[MAPREDUCE-1085] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit
[MAPREDUCE-1086] hadoop commands in streaming tasks are trying to write to tasktracker's log
[MAPREDUCE-1088] JobHistory files should have narrower 0600 perms
[MAPREDUCE-1089] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
[MAPREDUCE-1090] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.
[MAPREDUCE-1098] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
[MAPREDUCE-1100] User's task-logs filling up local disks on the TaskTrackers
[MAPREDUCE-1103] Additional JobTracker metrics
[MAPREDUCE-1105] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity
[MAPREDUCE-1118] Capacity Scheduler scheduling information is hard to read / should be tabular format
[MAPREDUCE-1131] Using profilers other than hprof can cause JobClient to report job failure
[MAPREDUCE-1140] Per cache-file refcount can become negative when tasks release distributed-cache files


[MAPREDUCE-1143] runningMapTasks counter is not properly decremented in case of failed Tasks.
[MAPREDUCE-1155] Streaming tests swallow exceptions
[MAPREDUCE-1158] running_maps is not decremented when the tasks of a job is killed/failed
[MAPREDUCE-1160] Two log statements at INFO level fill up jobtracker logs
[MAPREDUCE-1171] Lots of fetch failures
[MAPREDUCE-1178] MultipleInputs fails with ClassCastException
[MAPREDUCE-1185] URL to JT webconsole for running job and job history should be the same
[MAPREDUCE-1186] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir
[MAPREDUCE-1196] MAPREDUCE-947 incompatibly changed FileOutputCommitter
[MAPREDUCE-1198] Alternatively schedule different types of tasks in fair share scheduler
[MAPREDUCE-1213] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
[MAPREDUCE-1219] JobTracker Metrics causes undue load on JobTracker
[MAPREDUCE-1221] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold
[MAPREDUCE-1231] Distcp is very slow
[MAPREDUCE-1250] Refactor job token to use a common token interface
[MAPREDUCE-1258] Fair scheduler event log not logging job info
[MAPREDUCE-1285] DistCp cannot handle -delete if destination is local filesystem
[MAPREDUCE-1288] DistributedCache localizes only once per cache URI
[MAPREDUCE-1293] AutoInputFormat doesn't work with non-default FileSystems
[MAPREDUCE-1302] TrackerDistributedCacheManager can delete file asynchronously
[MAPREDUCE-1304] Add counters for task time spent in GC
[MAPREDUCE-1307] Introduce the concept of Job Permissions
[MAPREDUCE-1313] NPE in FieldFormatter if escape character is set and field is null
[MAPREDUCE-1316] JobTracker holds stale references to retired jobs via unreported tasks
[MAPREDUCE-1342] Potential JT deadlock in faulty TT tracking
[MAPREDUCE-1354] Incremental enhancements to the JobTracker for better scalability
[MAPREDUCE-1372] ConcurrentModificationException in JobInProgress
[MAPREDUCE-1378] Args in job details links on jobhistory.jsp are not URL encoded
[MAPREDUCE-1382] MRAsyncDiscService should tolerate missing local.dir
[MAPREDUCE-1397] NullPointerException observed during task failures
[MAPREDUCE-1398] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.
[MAPREDUCE-1399] The archive command shows a null error message
[MAPREDUCE-1403] Save file-sizes of each of the artifacts in DistributedCache in the JobConf
[MAPREDUCE-1421] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385
[MAPREDUCE-1422] Changing permissions of files/dirs under job-work-dir may be needed so that cleaning up of job-dir in all mapred-local-directories succeeds always
[MAPREDUCE-1423] Improve performance of CombineFileInputFormat when multiple pools are configured
[MAPREDUCE-1425] archive throws OutOfMemoryError
[MAPREDUCE-1435] symlinks in cwd of the task are not handled properly after MAPREDUCE-896
[MAPREDUCE-1436] Deadlock in preemption code in fair scheduler
[MAPREDUCE-1440] MapReduce should use the short form of the user names
[MAPREDUCE-1441] Configuration of directory lists should trim whitespace
[MAPREDUCE-1442] StackOverflowError when JobHistory parses a really long line
[MAPREDUCE-1443] DBInputFormat can leak connections
[MAPREDUCE-1454] The servlets should quote server generated strings sent in the response
[MAPREDUCE-1455] Authorization for servlets
[MAPREDUCE-1457] For secure job execution, couple of more UserGroupInformation.doAs needs to be added
[MAPREDUCE-1464] In JobTokenIdentifier change method getUsername to getUser which returns UGI
[MAPREDUCE-1466] FileInputFormat should save #input-files in JobConf
[MAPREDUCE-1476] committer.needsTaskCommit should not be called for a task cleanup attempt
[MAPREDUCE-1480] CombineFileRecordReader does not properly initialize child RecordReader
[MAPREDUCE-1493] Authorization for job-history pages
[MAPREDUCE-1503] Push HADOOP-6551 into MapReduce
[MAPREDUCE-1505] Cluster class should create the rpc client only when needed
[MAPREDUCE-1521] Protection against incorrectly configured reduces
[MAPREDUCE-1522] FileInputFormat may change the file system of an input path
[MAPREDUCE-1526] Cache the job related information while submitting the job, this would avoid many RPC calls to JobTracker.
[MAPREDUCE-1533] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString()
[MAPREDUCE-1538] TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit
[MAPREDUCE-1543] Log messages of JobACLsManager should use security logging of HADOOP-6586
[MAPREDUCE-1545] Add 'first-task-launched' to job-summary
[MAPREDUCE-1550] UGI.doAs should not be used for getting the history file of jobs
[MAPREDUCE-1563] Task diagnostic info would get missed sometimes.
[MAPREDUCE-1570] Shuffle stage - Key and Group Comparators
[MAPREDUCE-1607] Task controller may not set permissions for a task cleanup attempt's log directory
[MAPREDUCE-1609] TaskTracker.localizeJob should not set permissions on job log directory recursively
[MAPREDUCE-1611] Refresh nodes and refresh queues doesnt work with service authorization enabled
[MAPREDUCE-1612] job conf file is not accessible from job history web page
[MAPREDUCE-1621] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output
[MAPREDUCE-1635] ResourceEstimator does not work after MAPREDUCE-842
[MAPREDUCE-1641] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives
[MAPREDUCE-1656] JobStory should provide queue info.
[MAPREDUCE-1657] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs
[MAPREDUCE-1664] Job Acls affect Queue Acls


Add a metrics to track the number of heartbeats processed [MAPREDUCE-1680]
Tasks should not be scheduled after tip is killed/failed. [MAPREDUCE-1682]
Remove JNI calls from ClusterStatus cstr [MAPREDUCE-1683]
JobHistory shouldn't be disabled for any reason [MAPREDUCE-1699]
TaskRunner can get NPE in getting ugi from TaskTracker [MAPREDUCE-1707]
Truncate logs of finished tasks to prevent node thrash due to excessive logging [MAPREDUCE-1716]
Authentication between pipes processes and java counterparts. [MAPREDUCE-1733]
Un-deprecate the old MapReduce API in the 0.20 branch [MAPREDUCE-1734]
DistributedCache creates its own FileSytem instance when adding a file/archive to the path [MAPREDUCE-1744]
Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators [MAPREDUCE-1754]
Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved [MAPREDUCE-1759]
CompletedJobStatusStore initialization should fail if {mapred.job.tracker.persist.jobstatus.dir} is unwritable [MAPREDUCE-1778]
IFile should check for null compressor [MAPREDUCE-1784]
Add streaming config option for not emitting the key [MAPREDUCE-1785]
Support for file sizes less than 1MB in DFSIO benchmark. [MAPREDUCE-1832]
FairScheduler.tasksToPeempt() can return negative number [MAPREDUCE-1845]
Include job submit host information (name and ip) in jobconf and jobdetails display [MAPREDUCE-1850]
MultipleOutputs does not cache TaskAttemptContext [MAPREDUCE-1853]
Add read timeout on userlog pull [MAPREDUCE-1868]
Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler [MAPREDUCE-1872]
MRAsyncDiskService does not properly absolutize volume root paths [MAPREDUCE-1887]
MapReduce daemons should close FileSystems that are not needed anymore [MAPREDUCE-1900]
TrackerDistributedCacheManager never cleans its input directories [MAPREDUCE-1914]
Ability for having user's classes take precedence over the system classes for tasks' classpath [MAPREDUCE-1938]
Limit the size of jobconf. [MAPREDUCE-1960]
ConcurrentModificationException when shutting down Gridmix [MAPREDUCE-1961]
java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps [MAPREDUCE-1985]
TestDFSIO read test may not read specified bytes. [MAPREDUCE-2023]
Race condition in writing the jobtoken password file when launching pipes jobs [MAPREDUCE-2082]
Secure local filesystem IO from symlink vulnerabilities [MAPREDUCE-2096]
task-controller shouldn't require o-r permissions [MAPREDUCE-2103]
safely handle InterruptedException and interrupted status in MR code [MAPREDUCE-2157]
Race condition in LinuxTaskController permissions handling [MAPREDUCE-2178]
JT should not try to remove mapred.system.dir during startup [MAPREDUCE-2219]
If Localizer can't create task log directory, it should fail on the spot [MAPREDUCE-2234]
JobTracker "over-synchronization" makes it hang up in certain cases [MAPREDUCE-2235]
LinuxTaskController doesn't properly escape environment variables [MAPREDUCE-2242]
Servlets should specify content type [MAPREDUCE-2253]
FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below fairshare. [MAPREDUCE-2256]
Permissions race can make getStagingDir fail on local filesystem [MAPREDUCE-2289]
TT should fail to start on secure cluster when SecureIO isn't available [MAPREDUCE-2321]
Add metrics to the fair scheduler [MAPREDUCE-2323]
memory-related configurations missing from mapred-default.xml [MAPREDUCE-2328]
Improve error messages when MR dirs on local FS have bad ownership [MAPREDUCE-2332]
mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI [MAPREDUCE-2351]
Make the MR changes to reflect the API changes in SecureIO library [MAPREDUCE-2353]
A task succeeded even though there were errors on all attempts. [MAPREDUCE-2356]
Shouldn't hold lock on rjob while localizing resources. [MAPREDUCE-2364]
TaskTracker can't retrieve stdout and stderr from web UI [MAPREDUCE-2366]
TaskLogsTruncater does not need to check log ownership when running as Child [MAPREDUCE-2371]
TaskLogAppender mechanism shouldn't be set in log4j.properties [MAPREDUCE-2372]
When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout [MAPREDUCE-2373]
Should not use PrintWriter to write taskjvm.sh [MAPREDUCE-2374]
task-controller fails to parse configuration if it doesn't end in \n [MAPREDUCE-2377]
Distributed cache sizing configurations are missing from mapred-default.xml [MAPREDUCE-2379]


Version 1.1 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

Mac OS Client

A Mac OS client is now available. For more information, see Installing the MapR Client on Mac OS X.

Resolved Issues

(Issue 4415) Select and Kill Controls in JobTracker UI
(Issue 2809) Add Release Note for NFS Dependencies

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.
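
For example, a lingering or failed snapshot can be deleted and recreated from the command line. A minimal sketch; the volume and snapshot names are placeholders:

maprcli volume snapshot remove -volume myvol -snapshotname mysnap
maprcli volume snapshot create -volume myvol -snapshotname mysnap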

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.


(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.
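
For example, to stop and restart all services on the node by restarting the warden, a minimal sketch assuming the standard init script location:

/etc/init.d/mapr-warden stop
/etc/init.d/mapr-warden start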

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.
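
To locate broken links before copying, you can scan the source tree where it is locally accessible. A minimal sketch using GNU find; the mount path is a placeholder:

find /mapr/my.cluster.com/data -xtype l    # prints symlinks whose targets do not exist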

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html
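
For example, to find and remove stray checksum files under a local directory, a minimal sketch using GNU find; the path is a placeholder:

find /tmp/localdest -name '.*.crc' -print -delete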

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.
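
One way to close the port is an iptables rule, shown here as a minimal sketch; adapt it to your firewall policy, and note that the rule is not persistent across reboots unless saved with your distribution's iptables-save mechanism:

iptables -A INPUT -p tcp --dport 80 -j REJECT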

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMware, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).
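
For example, a minimal sketch of the edit, run as root on each node and followed by a reboot:

sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance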

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.
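
For example, before retrying the restore, a minimal sketch with a placeholder volume name:

maprcli volume mirror stop -name myvol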

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.
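
For example, a minimal sketch with placeholder volume names and dump file path:

maprcli volume mirror start -name mymirror -full true
maprcli volume dump create -name myvol -dumpfile /tmp/myvol.dump -full true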

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start
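
For example, steps 2 through 4 on a single node might look like the following minimal sketch, assuming ntpdate is installed and an NTP server is reachable:

/etc/init.d/mapr-zookeeper stop
ntpdate pool.ntp.org    # or set the clock manually with date
/etc/init.d/mapr-zookeeper start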


Version 1.1.3 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(Issue 4307) Snapshot create fails with error EEXIST

Known Issues

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.


Version 1.1.2 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(Issue 4037) Starting Newly Added Services (Documentation fix)
(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links
(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name
(Issue 3524) Apache Port 80 Open (Documentation fix)
(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines (Documentation fix)
(Issue 3244) Volume Mirror Issue
(Issue 3028) Changing the Time on a ZooKeeper Node (Documentation fix)

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.


(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.


Version 1.1.1 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

EMC License support

Packages built for EMC have "EMC" in the MapRBuildVersion (example: 1.1.0.10806EMC-1).
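
To check which build a node is running, a minimal sketch assuming the standard install location of the version file:

cat /opt/mapr/MapRBuildVersion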

HBase LeaseRecovery

FSUtils LeaseRecovery is supported in HBase trunk and HBase 0.90.2. To run a different version of HBase with MapR, apply the following patches and compile HBase:

https://issues.apache.org/jira/secure/attachment/12489782/4169-v5.txt
https://issues.apache.org/jira/secure/attachment/12489818/4169-correction.txt
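
A minimal sketch of applying the patches to an HBase source tree; the patch level, directory name, and build command are assumptions to adjust for your tree:

cd hbase-0.90.2
patch -p0 < 4169-v5.txt
patch -p0 < 4169-correction.txt
mvn package -DskipTests    # rebuild HBase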

Resolved Issues

(Issue 4792) Synchronization in reading JobTracker address
(Issue 4910) File attributes sometimes not updated correctly in client cache
(HBase Jira 4169) FSUtils LeaseRecovery for non HDFS FileSystems
(Issue 4905) FSUtils LeaseRecovery for MapR

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology


Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMware, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start


Hadoop Compatibility in Version 1.1

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

Hadoop Common Patches

MapR 1.1 includes the following Apache Hadoop issues that are not included in the Apache Hadoop base version 0.20.2:

Make streaming to handle non-utf8 byte array [HADOOP-1722]
IPC server max queue size should be configurable [HADOOP-1849]
speculative execution start up condition based on completion time [HADOOP-2141]
Space in the value for dfs.data.dir can cause great problems [HADOOP-2366]
Use job control for tasks (and therefore for pipes and streaming) [HADOOP-2721]
Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni [HADOOP-2838]
Shuffling fetchers waited too long between map output fetch re-tries [HADOOP-3327]
Patch to allow hadoop native to compile on Mac OS X [HADOOP-3659]
Providing splitting support for bzip2 compressed files [HADOOP-4012]
IsolationRunner does not work as documented [HADOOP-4041]
Map and Reduce tasks should run as the user who submitted the job [HADOOP-4490]
FileSystem.CACHE should be ref-counted [HADOOP-4655]
Add a user to groups mapping service [HADOOP-4656]
Current Ganglia metrics implementation is incompatible with Ganglia 3.1 [HADOOP-4675]
Allow FileSystem shutdown hook to be disabled [HADOOP-4829]
Streaming combiner should allow command, not just JavaClass [HADOOP-4842]
Implement setuid executable for Linux to assist in launching tasks as job owners [HADOOP-4930]
ConcurrentModificationException in JobHistory.java [HADOOP-4933]
Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide [HADOOP-5170]
Option to prohibit jars unpacking [HADOOP-5175]
TT's version build is too restrictive [HADOOP-5203]
Queue ACLs should be refreshed without requiring a restart of the job tracker [HADOOP-5396]
Provide a way for users to find out what operations they can do on which M/R queues [HADOOP-5419]
Support killing of process groups in LinuxTaskController binary [HADOOP-5420]
The job history display needs to be paged [HADOOP-5442]
Add support for application-specific typecodes to typed bytes [HADOOP-5450]
Exposing Hadoop metrics via HTTP [HADOOP-5469]
calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken [HADOOP-5476]
HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully [HADOOP-5488]
Binary partitioner [HADOOP-5528]
Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact representation). [HADOOP-5582]

Hadoop Streaming - GzipCodec [HADOOP-5592]
change S3Exception to checked exception [HADOOP-5613]
Ability to blacklist tasktracker [HADOOP-5643]
Counter for S3N Read Bytes does not work [HADOOP-5656]
DistCp should not launch a job if it is not necessary [HADOOP-5675]
Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics [HADOOP-5733]
UGI checks in testcases are broken [HADOOP-5737]
Split waiting tasks field in JobTracker metrics to individual tasks [HADOOP-5738]
Allow setting the default value of maxRunningJobs for all pools [HADOOP-5745]
The length of the heartbeat cycle should be configurable. [HADOOP-5784]
JobTracker should refresh the hosts list upon recovery [HADOOP-5801]
problem using top level s3 buckets as input/output directories [HADOOP-5805]
s3n files are not getting split by default [HADOOP-5861]
GzipCodec should read compression level etc from configuration [HADOOP-5879]
Allow administrators to be able to start and stop queues [HADOOP-5913]
Use JDK 1.6 File APIs in DF.java wherever possible [HADOOP-5958]
create script to provide classpath for external tools [HADOOP-5976]
LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController [HADOOP-5980]
HADOOP-2838 doesnt work as expected [HADOOP-5981]
RPC client opens an extra connection for VersionedProtocol [HADOOP-6132]
ReflectionUtils performance regression [HADOOP-6133]
Implement a pure Java CRC32 calculator [HADOOP-6148]
Add get/setEnum to Configuration [HADOOP-6161]
Improve PureJavaCrc32 [HADOOP-6166]


Provide a configuration dump in json format. [HADOOP-6184]
Configuration does not lock parameters marked final if they have no value. [HADOOP-6227]
Permission configuration files should use octal and symbolic [HADOOP-6234]
s3n fails with SocketTimeoutException [HADOOP-6254]
Missing synchronization for defaultResources in Configuration.addResource [HADOOP-6269]
Add JVM memory usage to JvmMetrics [HADOOP-6279]
Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full [HADOOP-6284]
Use JAAS LoginContext for our login [HADOOP-6299]
Configuration sends too much data to log4j [HADOOP-6312]
Update FilterInitializer class to be more visible and take a conf for further development [HADOOP-6337]
Stack trace of any runtime exceptions should be recorded in the server logs. [HADOOP-6343]
Log errors getting Unix UGI [HADOOP-6400]
Add a /conf servlet to dump running configuration [HADOOP-6408]
Change RPC layer to support SASL based mutual authentication [HADOOP-6419]
Add AsyncDiskService that is used in both hdfs and mapreduce [HADOOP-6433]
Prevent remote CSS attacks in Hostname and UTF-7. [HADOOP-6441]
Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH [HADOOP-6453]
StringBuffer -> StringBuilder - conversion of references as necessary [HADOOP-6471]
HttpServer sends wrong content-type for CSS files (and others) [HADOOP-6496]
doAs for proxy user [HADOOP-6510]
FsPermission:SetUMask not updated to use new-style umask setting. [HADOOP-6521]
LocalDirAllocator should use whitespace trimming configuration getters [HADOOP-6534]
Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers [HADOOP-6543]
archive does not work with distcp -update [HADOOP-6558]
Authorization for default servlets [HADOOP-6568]
FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read [HADOOP-6569]
RPC responses may be out-of-order with respect to SASL [HADOOP-6572]
IPC server response buffer reset threshold should be configurable [HADOOP-6577]
Configuration should trim whitespace around a lot of value types [HADOOP-6578]
Split RPC metrics into summary and detailed metrics [HADOOP-6599]
Deadlock in DFSClient#getBlockLocations even with the security disabled [HADOOP-6609]
RPC server should check for version mismatch first [HADOOP-6613]
"Bad Connection to FS" message in FSShell should print message from the exception [HADOOP-6627]
FileUtil.fullyDelete() should continue to delete other files despite failure at any level. [HADOOP-6631]
AccessControlList uses full-principal names to verify acls causing queue-acls to fail [HADOOP-6634]
Benchmark overhead of RPC session establishment [HADOOP-6637]
FileSystem.get() does RPC retries within a static synchronized block [HADOOP-6640]
util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention [HADOOP-6644]
login object in UGI should be inside the subject [HADOOP-6649]
ShellBasedUnixGroupsMapping shouldn't have a cache [HADOOP-6652]
NullPointerException in setupSaslConnection when browsing directories [HADOOP-6653]
BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file [HADOOP-6663]
RPC.waitForProxy should retry through NoRouteToHostException [HADOOP-6667]
zlib.compress.level ignored for DefaultCodec initialization [HADOOP-6669]
UserGroupInformation doesn't support use in hash tables [HADOOP-6670]
Performance Improvement in Secure RPC [HADOOP-6674]
user object in the subject in UGI should be reused in case of a relogin. [HADOOP-6687]
Incorrect exit codes for "dfs -chown", "dfs -chgrp" [HADOOP-6701]
Relogin behavior for RPC clients could be improved [HADOOP-6706]
Symbolic umask for file creation is not consistent with posix [HADOOP-6710]
FsShell 'hadoop fs -text' does not support compression codecs [HADOOP-6714]
Client does not close connection when an exception happens during SASL negotiation [HADOOP-6718]
NetUtils.connect should check that it hasn't connected a socket to itself [HADOOP-6722]
unchecked exceptions thrown in IPC Connection orphan clients [HADOOP-6723]
IPC doesn't properly handle IOEs thrown by socket factory [HADOOP-6724]
adding some java doc to Server.RpcMetrics, UGI [HADOOP-6745]
NullPointerException for hadoop clients launched from streaming tasks [HADOOP-6757]
WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race [HADOOP-6760]
exception while doing RPC I/O closes channel [HADOOP-6762]
UserGroupInformation.createProxyUser's javadoc is broken [HADOOP-6776]
Add a new newInstance method in FileSystem that takes a "user" as argument [HADOOP-6813]
refreshSuperUserGroupsConfiguration should use server side configuration for the refresh [HADOOP-6815]
Provide a JNI-based implementation of GroupMappingServiceProvider [HADOOP-6818]
Provide a web server plugin that uses a static user for the web UI [HADOOP-6832]
IPC leaks call parameters when exceptions thrown [HADOOP-6833]
Introduce additional statistics to FileSystem [HADOOP-6859]
Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of GroupMappingServiceProvider) [HADOOP-6864]
The efficient comparators aren't always used except for BytesWritable and Text [HADOOP-6881]
RawLocalFileSystem#setWorkingDir() does not work for relative names [HADOOP-6899]
Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal [HADOOP-6907]
BZip2Codec incorrectly implements read() [HADOOP-6925]
Fix BooleanWritable comparator in 0.20 [HADOOP-6928]
The GroupMappingServiceProvider interface should be public [HADOOP-6943]
Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template [HADOOP-6950]


Allow wildcards to be used in ProxyUsers configurations [HADOOP-6995]
Configuration.writeXML should not hold lock while outputting [HADOOP-7082]
UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context [HADOOP-7101]
Remove unnecessary DNS reverse lookups from RPC layer [HADOOP-7104]
Implement chmod with JNI [HADOOP-7110]
FsShell should dump all exceptions at DEBUG level [HADOOP-7114]
Add a cache for getpwuid_r and getpwgid_r calls [HADOOP-7115]
NPE in Configuration.writeXml [HADOOP-7118]
Timed out shell commands leak Timer threads [HADOOP-7122]
getpwuid_r is not thread-safe on RHEL6 [HADOOP-7156]
SecureIO should not check owner on non-secure clusters that have no native support [HADOOP-7172]
Remove unused fstat() call from NativeIO [HADOOP-7173]
WritableComparator.get should not cache comparator objects [HADOOP-7183]
Remove deprecated local.cache.size from core-default.xml [HADOOP-7184]

MapReduce Patches

MapR 1.1 includes the following Apache MapReduce issues that are not included in the Apache Hadoop base version 0.20.2:

Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API [MAPREDUCE-112]
Job.getJobID() will always return null [MAPREDUCE-118]
TaskMemoryManager should log process-tree's status while killing tasks. [MAPREDUCE-144]
Secure job submission [MAPREDUCE-181]
Provide a node health check script and run it periodically to check the node health status [MAPREDUCE-211]
Collecting cpu and memory usage for MapReduce tasks [MAPREDUCE-220]
TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes [MAPREDUCE-270]
Job history counters should be avaible on the UI. [MAPREDUCE-277]
JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to fail. [MAPREDUCE-339]
Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api. [MAPREDUCE-364]
Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api. [MAPREDUCE-369]
Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api. [MAPREDUCE-370]
JobControl Job does always has an unassigned name [MAPREDUCE-415]
Move the completed jobs' history files to a DONE subdirectory inside the configured history directory [MAPREDUCE-416]
Enable ServicePlugins for the JobTracker [MAPREDUCE-461]
The job setup and cleanup tasks should be optional [MAPREDUCE-463]
Collect information about number of tasks succeeded / total per time unit for a tasktracker. [MAPREDUCE-467]
extend DistributedCache to work locally (LocalJobRunner) [MAPREDUCE-476]
separate jvm param for mapper and reducer [MAPREDUCE-478]
Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs [MAPREDUCE-516]
The capacity-scheduler should assign multiple tasks per heartbeat [MAPREDUCE-517]
After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks. [MAPREDUCE-521]
Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue [MAPREDUCE-532]
Add preemption to the fair scheduler [MAPREDUCE-551]
If #link is missing from uri format of -cacheArchive then streaming does not throw error. [MAPREDUCE-572]
Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api. [MAPREDUCE-655]
Existing diagnostic rules fail for MAP ONLY jobs [MAPREDUCE-676]
XML-based metrics as JSP servlet for JobTracker [MAPREDUCE-679]
Reuse of Writable objects is improperly handled by MRUnit [MAPREDUCE-680]
Reserved tasktrackers should be removed when a node is globally blacklisted [MAPREDUCE-682]
Conf files not moved to "done" subdirectory after JT restart [MAPREDUCE-693]
Per-pool task limits for the fair scheduler [MAPREDUCE-698]
Support for FIFO pools in the fair scheduler [MAPREDUCE-706]
Provide a jobconf property for explicitly assigning a job to a pool [MAPREDUCE-707]
node health check script does not display the correct message on timeout [MAPREDUCE-709]
JobConf.findContainingJar unescapes unnecessarily on Linux [MAPREDUCE-714]
org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle [MAPREDUCE-716]
More slots are getting reserved for HiRAM job tasks then required [MAPREDUCE-722]
node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode [MAPREDUCE-732]
java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs [MAPREDUCE-734]
Allow relative paths to be created inside archives. [MAPREDUCE-739]
Provide summary information per job once a job is finished. [MAPREDUCE-740]
Support in DistributedCache to share cache files with other users after HADOOP-4493 [MAPREDUCE-744]
NPE in expiry thread when a TT is lost [MAPREDUCE-754]
TypedBytesInput's readRaw() does not preserve custom type codes [MAPREDUCE-764]
Configuration information should generate dump in a standard format. [MAPREDUCE-768]
Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks [MAPREDUCE-771]
Use PureJavaCrc32 in mapreduce spills [MAPREDUCE-782]
-files, -archives should honor user given symlink path [MAPREDUCE-787]
Job summary logs show status of completed jobs as RUNNING [MAPREDUCE-809]
Move completed Job history files to HDFS [MAPREDUCE-814]
Add a cache for retired jobs with minimal job info and provide a way to access history file url [MAPREDUCE-817]
JobClient completion poll interval of 5s causes slow tests in local mode [MAPREDUCE-825]
DBInputFormat leaves open transaction [MAPREDUCE-840]


Per-job local data on the TaskTracker node should have right access-control [MAPREDUCE-842]
Localized files from DistributedCache should have right access-control [MAPREDUCE-856]
Job/Task local files have incorrect group ownership set by LinuxTaskController binary [MAPREDUCE-871]
Make DBRecordReader execute queries lazily [MAPREDUCE-875]
More efficient SQL queries for DBInputFormat [MAPREDUCE-885]
After HADOOP-4491, the user who started mapred system is not able to run job. [MAPREDUCE-890]
Users can set non-writable permissions on temporary files for TT and can abuse disk usage. [MAPREDUCE-896]
When using LinuxTaskController, localized files may become accessible to unintended users if permissions are misconfigured. [MAPREDUCE-899]
Cleanup of task-logs should happen in TaskTracker instead of the Child [MAPREDUCE-927]
OutputCommitter should have an abortJob method [MAPREDUCE-947]
Inaccurate values in jobSummary logs [MAPREDUCE-964]
TaskTracker does not need to fully unjar job jars [MAPREDUCE-967]
NPE in distcp encountered when placing _logs directory on S3FileSystem [MAPREDUCE-968]
distcp does not always remove distcp.tmp.dir [MAPREDUCE-971]
Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state. [MAPREDUCE-1028]
Reduce tasks are getting starved in capacity scheduler [MAPREDUCE-1030]
Show total slot usage in cluster summary on jobtracker webui [MAPREDUCE-1048]
distcp can generate uneven map task assignments [MAPREDUCE-1059]
Use the user-to-groups mapping service in the JobTracker [MAPREDUCE-1083]
For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit [MAPREDUCE-1085]
hadoop commands in streaming tasks are trying to write to tasktracker's log [MAPREDUCE-1086]
JobHistory files should have narrower 0600 perms [MAPREDUCE-1088]
Fair Scheduler preemption triggers NPE when tasks are scheduled but not running [MAPREDUCE-1089]
Modify log statement in Tasktracker log related to memory monitoring to include attempt id. [MAPREDUCE-1090]
Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks. [MAPREDUCE-1098]
User's task-logs filling up local disks on the TaskTrackers [MAPREDUCE-1100]
Additional JobTracker metrics [MAPREDUCE-1103]
CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity [MAPREDUCE-1105]
Capacity Scheduler scheduling information is hard to read / should be tabular format [MAPREDUCE-1118]
Using profilers other than hprof can cause JobClient to report job failure [MAPREDUCE-1131]
Per cache-file refcount can become negative when tasks release distributed-cache files [MAPREDUCE-1140]
runningMapTasks counter is not properly decremented in case of failed Tasks. [MAPREDUCE-1143]
Streaming tests swallow exceptions [MAPREDUCE-1155]
running_maps is not decremented when the tasks of a job is killed/failed [MAPREDUCE-1158]
Two log statements at INFO level fill up jobtracker logs [MAPREDUCE-1160]
Lots of fetch failures [MAPREDUCE-1171]
MultipleInputs fails with ClassCastException [MAPREDUCE-1178]
URL to JT webconsole for running job and job history should be the same [MAPREDUCE-1185]
While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir [MAPREDUCE-1186]
MAPREDUCE-947 incompatibly changed FileOutputCommitter [MAPREDUCE-1196]
Alternatively schedule different types of tasks in fair share scheduler [MAPREDUCE-1198]
TaskTrackers restart is very slow because it deletes distributed cache directory synchronously [MAPREDUCE-1213]
JobTracker Metrics causes undue load on JobTracker [MAPREDUCE-1219]
Kill tasks on a node if the free physical memory on that machine falls below a configured threshold [MAPREDUCE-1221]
Distcp is very slow [MAPREDUCE-1231]
Refactor job token to use a common token interface [MAPREDUCE-1250]
Fair scheduler event log not logging job info [MAPREDUCE-1258]
DistCp cannot handle -delete if destination is local filesystem [MAPREDUCE-1285]
DistributedCache localizes only once per cache URI [MAPREDUCE-1288]
AutoInputFormat doesn't work with non-default FileSystems [MAPREDUCE-1293]
TrackerDistributedCacheManager can delete file asynchronously [MAPREDUCE-1302]
Add counters for task time spent in GC [MAPREDUCE-1304]
Introduce the concept of Job Permissions [MAPREDUCE-1307]
NPE in FieldFormatter if escape character is set and field is null [MAPREDUCE-1313]
JobTracker holds stale references to retired jobs via unreported tasks [MAPREDUCE-1316]
Potential JT deadlock in faulty TT tracking [MAPREDUCE-1342]
Incremental enhancements to the JobTracker for better scalability [MAPREDUCE-1354]
ConcurrentModificationException in JobInProgress [MAPREDUCE-1372]
Args in job details links on jobhistory.jsp are not URL encoded [MAPREDUCE-1378]
MRAsyncDiscService should tolerate missing local.dir [MAPREDUCE-1382]
NullPointerException observed during task failures [MAPREDUCE-1397]
TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed. [MAPREDUCE-1398]
The archive command shows a null error message [MAPREDUCE-1399]
Save file-sizes of each of the artifacts in DistributedCache in the JobConf [MAPREDUCE-1403]
LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385 [MAPREDUCE-1421]
Changing permissions of files/dirs under job-work-dir may be needed sothat cleaning up of job-dir in all mapred-local-directories succeeds always [MAPREDUCE-1422]
Improve performance of CombineFileInputFormat when multiple pools are configured [MAPREDUCE-1423]
archive throws OutOfMemoryError [MAPREDUCE-1425]
symlinks in cwd of the task are not handled properly after MAPREDUCE-896 [MAPREDUCE-1435]
Deadlock in preemption code in fair scheduler [MAPREDUCE-1436]
MapReduce should use the short form of the user names [MAPREDUCE-1440]
Configuration of directory lists should trim whitespace [MAPREDUCE-1441]
StackOverflowError when JobHistory parses a really long line [MAPREDUCE-1442]


DBInputFormat can leak connections [MAPREDUCE-1443]
The servlets should quote server generated strings sent in the response [MAPREDUCE-1454]
Authorization for servlets [MAPREDUCE-1455]
For secure job execution, couple of more UserGroupInformation.doAs needs to be added [MAPREDUCE-1457]
In JobTokenIdentifier change method getUsername to getUser which returns UGI [MAPREDUCE-1464]
FileInputFormat should save #input-files in JobConf [MAPREDUCE-1466]
committer.needsTaskCommit should not be called for a task cleanup attempt [MAPREDUCE-1476]
CombineFileRecordReader does not properly initialize child RecordReader [MAPREDUCE-1480]
Authorization for job-history pages [MAPREDUCE-1493]
Push HADOOP-6551 into MapReduce [MAPREDUCE-1503]
Cluster class should create the rpc client only when needed [MAPREDUCE-1505]
Protection against incorrectly configured reduces [MAPREDUCE-1521]
FileInputFormat may change the file system of an input path [MAPREDUCE-1522]
Cache the job related information while submitting the job, this would avoid many RPC calls to JobTracker. [MAPREDUCE-1526]
Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString() [MAPREDUCE-1533]
TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit [MAPREDUCE-1538]
Log messages of JobACLsManager should use security logging of HADOOP-6586 [MAPREDUCE-1543]
Add 'first-task-launched' to job-summary [MAPREDUCE-1545]
UGI.doAs should not be used for getting the history file of jobs [MAPREDUCE-1550]
Task diagnostic info would get missed sometimes. [MAPREDUCE-1563]
Shuffle stage - Key and Group Comparators [MAPREDUCE-1570]
Task controller may not set permissions for a task cleanup attempt's log directory [MAPREDUCE-1607]
TaskTracker.localizeJob should not set permissions on job log directory recursively [MAPREDUCE-1609]
Refresh nodes and refresh queues doesnt work with service authorization enabled [MAPREDUCE-1611]
job conf file is not accessible from job history web page [MAPREDUCE-1612]
Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output [MAPREDUCE-1621]
ResourceEstimator does not work after MAPREDUCE-842 [MAPREDUCE-1635]
Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives [MAPREDUCE-1641]
JobStory should provide queue info. [MAPREDUCE-1656]
After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs [MAPREDUCE-1657]
Job Acls affect Queue Acls [MAPREDUCE-1664]
Add a metrics to track the number of heartbeats processed [MAPREDUCE-1680]
Tasks should not be scheduled after tip is killed/failed. [MAPREDUCE-1682]
Remove JNI calls from ClusterStatus cstr [MAPREDUCE-1683]
JobHistory shouldn't be disabled for any reason [MAPREDUCE-1699]
TaskRunner can get NPE in getting ugi from TaskTracker [MAPREDUCE-1707]
Truncate logs of finished tasks to prevent node thrash due to excessive logging [MAPREDUCE-1716]
Authentication between pipes processes and java counterparts. [MAPREDUCE-1733]
Un-deprecate the old MapReduce API in the 0.20 branch [MAPREDUCE-1734]
DistributedCache creates its own FileSytem instance when adding a file/archive to the path [MAPREDUCE-1744]
Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators [MAPREDUCE-1754]
Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved [MAPREDUCE-1759]
CompletedJobStatusStore initialization should fail if {mapred.job.tracker.persist.jobstatus.dir} is unwritable [MAPREDUCE-1778]
IFile should check for null compressor [MAPREDUCE-1784]
Add streaming config option for not emitting the key [MAPREDUCE-1785]
Support for file sizes less than 1MB in DFSIO benchmark. [MAPREDUCE-1832]
FairScheduler.tasksToPeempt() can return negative number [MAPREDUCE-1845]
Include job submit host information (name and ip) in jobconf and jobdetails display [MAPREDUCE-1850]
MultipleOutputs does not cache TaskAttemptContext [MAPREDUCE-1853]
Add read timeout on userlog pull [MAPREDUCE-1868]
Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler [MAPREDUCE-1872]
MRAsyncDiskService does not properly absolutize volume root paths [MAPREDUCE-1887]
MapReduce daemons should close FileSystems that are not needed anymore [MAPREDUCE-1900]
TrackerDistributedCacheManager never cleans its input directories [MAPREDUCE-1914]
Ability for having user's classes take precedence over the system classes for tasks' classpath [MAPREDUCE-1938]
Limit the size of jobconf. [MAPREDUCE-1960]
ConcurrentModificationException when shutting down Gridmix [MAPREDUCE-1961]
java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps [MAPREDUCE-1985]
TestDFSIO read test may not read specified bytes. [MAPREDUCE-2023]
Race condition in writing the jobtoken password file when launching pipes jobs [MAPREDUCE-2082]
Secure local filesystem IO from symlink vulnerabilities [MAPREDUCE-2096]
task-controller shouldn't require o-r permissions [MAPREDUCE-2103]
safely handle InterruptedException and interrupted status in MR code [MAPREDUCE-2157]
Race condition in LinuxTaskController permissions handling [MAPREDUCE-2178]
JT should not try to remove mapred.system.dir during startup [MAPREDUCE-2219]
If Localizer can't create task log directory, it should fail on the spot [MAPREDUCE-2234]
JobTracker "over-synchronization" makes it hang up in certain cases [MAPREDUCE-2235]
LinuxTaskController doesn't properly escape environment variables [MAPREDUCE-2242]
Servlets should specify content type [MAPREDUCE-2253]
FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below fairshare. [MAPREDUCE-2256]
Permissions race can make getStagingDir fail on local filesystem [MAPREDUCE-2289]
TT should fail to start on secure cluster when SecureIO isn't available [MAPREDUCE-2321]
Add metrics to the fair scheduler [MAPREDUCE-2323]


memory-related configurations missing from mapred-default.xml [MAPREDUCE-2328]
Improve error messages when MR dirs on local FS have bad ownership [MAPREDUCE-2332]
mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI [MAPREDUCE-2351]
Make the MR changes to reflect the API changes in SecureIO library [MAPREDUCE-2353]
A task succeeded even though there were errors on all attempts. [MAPREDUCE-2356]
Shouldn't hold lock on rjob while localizing resources. [MAPREDUCE-2364]
TaskTracker can't retrieve stdout and stderr from web UI [MAPREDUCE-2366]
TaskLogsTruncater does not need to check log ownership when running as Child [MAPREDUCE-2371]
TaskLogAppender mechanism shouldn't be set in log4j.properties [MAPREDUCE-2372]
When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout [MAPREDUCE-2373]
Should not use PrintWriter to write taskjvm.sh [MAPREDUCE-2374]
task-controller fails to parse configuration if it doesn't end in \n [MAPREDUCE-2377]
Distributed cache sizing configurations are missing from mapred-default.xml [MAPREDUCE-2379]


Version 1.0 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

MapR GA Version 1.0 Documentation

New in This Release

Rolling Upgrade

The rollingupgrade.sh script upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages, either via SSH or node by node. This makes it easy to upgrade a MapR cluster with a minimum of downtime.

32-Bit Client

The MapR Client can now be installed on both 64-bit and 32-bit computers. See MapR Client.

Core File Removal

In the event of a core dump on a node, MapR writes the core file to the /opt/cores directory. If disk space on the node is nearly full, MapR automatically reclaims space by deleting core files. To prevent a specific core file from being deleted, rename the file to start with a period (.). Example:

mv mfs.core.2127.node12 .mfs.core.2127.node12

Resolved Issues

Removing Nodes
(Issue 4068) Upgrading Red Hat
(Issue 3984) HBase Upgrade
(Issue 3965) Volume Dump Restore Failure
(Issue 3890) Sqoop Requires HBase
(Issue 3560) Intermittent Scheduled Mirror Failure
(Issue 2949) NFS Mounting Issue on Ubuntu
(Issue 2815) File Cleanup is Slow

Known Issues

(Issue 4415) Select and Kill Controls in JobTracker UI

The Select and Kill controls in the JobTracker UI appear when the webinterface.private.actions parameter in mapred-site.xml is set to true. In MapR clusters upgraded from the beta version of the software, the parameter must be added manually for the controls to appear.

To enable the Select and Kill controls in the JobTracker UI, copy the following lines from /opt/mapr/hadoop/hadoop-0.20.2/conf.new/mapred-site.xml to /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml:


<!-- JobTracker's Web Interface Configuration -->
<property>
  <name>webinterface.private.actions</name>
  <value>true</value>
  <description>
    If set to true, jobs can be killed from JT's web interface.
    Enable this option if the interfaces are only reachable by
    those who have the right authorization.
  </description>
</property>

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a Select All checkbox and a link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.


(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html
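One way to clear the stale checksum files is a find sweep over the local source directory (a sketch; the path is a placeholder, and the pattern assumes the hidden .<name>.crc naming described above):

find /path/to/local/source -name '.*.crc' -type f -delete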

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).
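For example, the change can be applied with a quick edit (a sketch assuming the stock Ubuntu file layout; verify /etc/default/irqbalance on your nodes before scripting this):

sudo sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance
sudo reboot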

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.
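For example, using the maprcli command-line interface (the volume name is a placeholder):

maprcli volume mirror stop -name myvolume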

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.
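For example (a sketch; the volume names and dump file path are placeholders):

maprcli volume mirror start -name myvol-mirror -full true
maprcli volume dump create -name myvol -dumpfile /tmp/myvol.dump -full true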

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start
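Run end to end on the ZooKeeper node, the sequence looks like this (a sketch; it assumes ntpdate is installed and an NTP server is reachable):

/etc/init.d/mapr-zookeeper stop
ntpdate pool.ntp.org
/etc/init.d/mapr-zookeeper start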

(Issue 2809) NFS Dependencies

If you are installing the MapR NFS service on a node that cannot connect to the standard apt-get or yum repositories, you should install the following packages by hand:

CentOS: iputils, portmap, glibc-common-2.5-49.el5_5.7

Red Hat: rpcbind, iputils

Ubuntu: nfs-common, iputils-arping
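On such a disconnected node, the packages have to be installed from files copied onto the node. A sketch for CentOS and Ubuntu (file names are placeholders for packages downloaded elsewhere):

# CentOS / Red Hat:
rpm -ivh iputils-*.rpm portmap-*.rpm glibc-common-2.5-49.el5_5.7*.rpm
# Ubuntu:
dpkg -i nfs-common_*.deb iputils-arping_*.deb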


Hadoop Compatibility in Version 1.0

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

Hadoop Common Patches

MapR 1.0 includes the following Apache Hadoop issues that are not included in the Apache Hadoop base version 0.20.2:

[HADOOP-1722] Make streaming to handle non-utf8 byte array
[HADOOP-1849] IPC server max queue size should be configurable
[HADOOP-2141] speculative execution start up condition based on completion time
[HADOOP-2366] Space in the value for dfs.data.dir can cause great problems
[HADOOP-2721] Use job control for tasks (and therefore for pipes and streaming)
[HADOOP-2838] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
[HADOOP-3327] Shuffling fetchers waited too long between map output fetch re-tries
[HADOOP-3659] Patch to allow hadoop native to compile on Mac OS X
[HADOOP-4012] Providing splitting support for bzip2 compressed files
[HADOOP-4041] IsolationRunner does not work as documented
[HADOOP-4490] Map and Reduce tasks should run as the user who submitted the job
[HADOOP-4655] FileSystem.CACHE should be ref-counted
[HADOOP-4656] Add a user to groups mapping service
[HADOOP-4675] Current Ganglia metrics implementation is incompatible with Ganglia 3.1
[HADOOP-4829] Allow FileSystem shutdown hook to be disabled
[HADOOP-4842] Streaming combiner should allow command, not just JavaClass
[HADOOP-4930] Implement setuid executable for Linux to assist in launching tasks as job owners
[HADOOP-4933] ConcurrentModificationException in JobHistory.java
[HADOOP-5170] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide
[HADOOP-5175] Option to prohibit jars unpacking
[HADOOP-5203] TT's version build is too restrictive
[HADOOP-5396] Queue ACLs should be refreshed without requiring a restart of the job tracker
[HADOOP-5419] Provide a way for users to find out what operations they can do on which M/R queues
[HADOOP-5420] Support killing of process groups in LinuxTaskController binary
[HADOOP-5442] The job history display needs to be paged
[HADOOP-5450] Add support for application-specific typecodes to typed bytes
[HADOOP-5469] Exposing Hadoop metrics via HTTP
[HADOOP-5476] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken
[HADOOP-5488] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully
[HADOOP-5528] Binary partitioner
[HADOOP-5582] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact representation).
[HADOOP-5592] Hadoop Streaming - GzipCodec
[HADOOP-5613] change S3Exception to checked exception
[HADOOP-5643] Ability to blacklist tasktracker
[HADOOP-5656] Counter for S3N Read Bytes does not work
[HADOOP-5675] DistCp should not launch a job if it is not necessary
[HADOOP-5733] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
[HADOOP-5737] UGI checks in testcases are broken
[HADOOP-5738] Split waiting tasks field in JobTracker metrics to individual tasks
[HADOOP-5745] Allow setting the default value of maxRunningJobs for all pools
[HADOOP-5784] The length of the heartbeat cycle should be configurable.
[HADOOP-5801] JobTracker should refresh the hosts list upon recovery
[HADOOP-5805] problem using top level s3 buckets as input/output directories
[HADOOP-5861] s3n files are not getting split by default
[HADOOP-5879] GzipCodec should read compression level etc from configuration
[HADOOP-5913] Allow administrators to be able to start and stop queues
[HADOOP-5958] Use JDK 1.6 File APIs in DF.java wherever possible
[HADOOP-5976] create script to provide classpath for external tools
[HADOOP-5980] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController
[HADOOP-5981] HADOOP-2838 doesnt work as expected
[HADOOP-6132] RPC client opens an extra connection for VersionedProtocol
[HADOOP-6133] ReflectionUtils performance regression
[HADOOP-6148] Implement a pure Java CRC32 calculator
[HADOOP-6161] Add get/setEnum to Configuration
[HADOOP-6166] Improve PureJavaCrc32


[HADOOP-6184] Provide a configuration dump in json format.
[HADOOP-6227] Configuration does not lock parameters marked final if they have no value.
[HADOOP-6234] Permission configuration files should use octal and symbolic
[HADOOP-6254] s3n fails with SocketTimeoutException
[HADOOP-6269] Missing synchronization for defaultResources in Configuration.addResource
[HADOOP-6279] Add JVM memory usage to JvmMetrics
[HADOOP-6284] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full
[HADOOP-6299] Use JAAS LoginContext for our login
[HADOOP-6312] Configuration sends too much data to log4j
[HADOOP-6337] Update FilterInitializer class to be more visible and take a conf for further development
[HADOOP-6343] Stack trace of any runtime exceptions should be recorded in the server logs.
[HADOOP-6400] Log errors getting Unix UGI
[HADOOP-6408] Add a /conf servlet to dump running configuration
[HADOOP-6419] Change RPC layer to support SASL based mutual authentication
[HADOOP-6433] Add AsyncDiskService that is used in both hdfs and mapreduce
[HADOOP-6441] Prevent remote CSS attacks in Hostname and UTF-7.
[HADOOP-6453] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
[HADOOP-6471] StringBuffer -> StringBuilder - conversion of references as necessary
[HADOOP-6496] HttpServer sends wrong content-type for CSS files (and others)
[HADOOP-6510] doAs for proxy user
[HADOOP-6521] FsPermission:SetUMask not updated to use new-style umask setting.
[HADOOP-6534] LocalDirAllocator should use whitespace trimming configuration getters
[HADOOP-6543] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers
[HADOOP-6558] archive does not work with distcp -update
[HADOOP-6568] Authorization for default servlets
[HADOOP-6569] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read
[HADOOP-6572] RPC responses may be out-of-order with respect to SASL
[HADOOP-6577] IPC server response buffer reset threshold should be configurable
[HADOOP-6578] Configuration should trim whitespace around a lot of value types
[HADOOP-6599] Split RPC metrics into summary and detailed metrics
[HADOOP-6609] Deadlock in DFSClient#getBlockLocations even with the security disabled
[HADOOP-6613] RPC server should check for version mismatch first
[HADOOP-6627] "Bad Connection to FS" message in FSShell should print message from the exception
[HADOOP-6631] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.
[HADOOP-6634] AccessControlList uses full-principal names to verify acls causing queue-acls to fail
[HADOOP-6637] Benchmark overhead of RPC session establishment
[HADOOP-6640] FileSystem.get() does RPC retries within a static synchronized block
[HADOOP-6644] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention
[HADOOP-6649] login object in UGI should be inside the subject
[HADOOP-6652] ShellBasedUnixGroupsMapping shouldn't have a cache
[HADOOP-6653] NullPointerException in setupSaslConnection when browsing directories
[HADOOP-6663] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file
[HADOOP-6667] RPC.waitForProxy should retry through NoRouteToHostException
[HADOOP-6669] zlib.compress.level ignored for DefaultCodec initialization
[HADOOP-6670] UserGroupInformation doesn't support use in hash tables
[HADOOP-6674] Performance Improvement in Secure RPC
[HADOOP-6687] user object in the subject in UGI should be reused in case of a relogin.
[HADOOP-6701] Incorrect exit codes for "dfs -chown", "dfs -chgrp"
[HADOOP-6706] Relogin behavior for RPC clients could be improved
[HADOOP-6710] Symbolic umask for file creation is not consistent with posix
[HADOOP-6714] FsShell 'hadoop fs -text' does not support compression codecs
[HADOOP-6718] Client does not close connection when an exception happens during SASL negotiation
[HADOOP-6722] NetUtils.connect should check that it hasn't connected a socket to itself
[HADOOP-6723] unchecked exceptions thrown in IPC Connection orphan clients
[HADOOP-6724] IPC doesn't properly handle IOEs thrown by socket factory
[HADOOP-6745] adding some java doc to Server.RpcMetrics, UGI
[HADOOP-6757] NullPointerException for hadoop clients launched from streaming tasks
[HADOOP-6760] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race
[HADOOP-6762] exception while doing RPC I/O closes channel
[HADOOP-6776] UserGroupInformation.createProxyUser's javadoc is broken
[HADOOP-6813] Add a new newInstance method in FileSystem that takes a "user" as argument
[HADOOP-6815] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh
[HADOOP-6818] Provide a JNI-based implementation of GroupMappingServiceProvider
[HADOOP-6832] Provide a web server plugin that uses a static user for the web UI
[HADOOP-6833] IPC leaks call parameters when exceptions thrown
[HADOOP-6859] Introduce additional statistics to FileSystem
[HADOOP-6864] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of GroupMappingServiceProvider)
[HADOOP-6881] The efficient comparators aren't always used except for BytesWritable and Text
[HADOOP-6899] RawLocalFileSystem#setWorkingDir() does not work for relative names
[HADOOP-6907] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal
[HADOOP-6925] BZip2Codec incorrectly implements read()
[HADOOP-6928] Fix BooleanWritable comparator in 0.20
[HADOOP-6943] The GroupMappingServiceProvider interface should be public
[HADOOP-6950] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template


[HADOOP-6995] Allow wildcards to be used in ProxyUsers configurations
[HADOOP-7082] Configuration.writeXML should not hold lock while outputting
[HADOOP-7101] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context
[HADOOP-7104] Remove unnecessary DNS reverse lookups from RPC layer
[HADOOP-7110] Implement chmod with JNI
[HADOOP-7114] FsShell should dump all exceptions at DEBUG level
[HADOOP-7115] Add a cache for getpwuid_r and getpwgid_r calls
[HADOOP-7118] NPE in Configuration.writeXml
[HADOOP-7122] Timed out shell commands leak Timer threads
[HADOOP-7156] getpwuid_r is not thread-safe on RHEL6
[HADOOP-7172] SecureIO should not check owner on non-secure clusters that have no native support
[HADOOP-7173] Remove unused fstat() call from NativeIO
[HADOOP-7183] WritableComparator.get should not cache comparator objects
[HADOOP-7184] Remove deprecated local.cache.size from core-default.xml

MapReduce Patches

MapR 1.0 includes the following Apache MapReduce issues that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
[MAPREDUCE-118] Job.getJobID() will always return null
[MAPREDUCE-144] TaskMemoryManager should log process-tree's status while killing tasks.
[MAPREDUCE-181] Secure job submission
[MAPREDUCE-211] Provide a node health check script and run it periodically to check the node health status
[MAPREDUCE-220] Collecting cpu and memory usage for MapReduce tasks
[MAPREDUCE-270] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes
[MAPREDUCE-277] Job history counters should be avaible on the UI.
[MAPREDUCE-339] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to fail.
[MAPREDUCE-364] Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.
[MAPREDUCE-370] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
[MAPREDUCE-415] JobControl Job does always has an unassigned name
[MAPREDUCE-416] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
[MAPREDUCE-461] Enable ServicePlugins for the JobTracker
[MAPREDUCE-463] The job setup and cleanup tasks should be optional
[MAPREDUCE-467] Collect information about number of tasks succeeded / total per time unit for a tasktracker.
[MAPREDUCE-476] extend DistributedCache to work locally (LocalJobRunner)
[MAPREDUCE-478] separate jvm param for mapper and reducer
[MAPREDUCE-516] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
[MAPREDUCE-517] The capacity-scheduler should assign multiple tasks per heartbeat
[MAPREDUCE-521] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.
[MAPREDUCE-532] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue
[MAPREDUCE-551] Add preemption to the fair scheduler
[MAPREDUCE-572] If #link is missing from uri format of -cacheArchive then streaming does not throw error.
[MAPREDUCE-655] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.
[MAPREDUCE-676] Existing diagnostic rules fail for MAP ONLY jobs
[MAPREDUCE-679] XML-based metrics as JSP servlet for JobTracker
[MAPREDUCE-680] Reuse of Writable objects is improperly handled by MRUnit
[MAPREDUCE-682] Reserved tasktrackers should be removed when a node is globally blacklisted
[MAPREDUCE-693] Conf files not moved to "done" subdirectory after JT restart
[MAPREDUCE-698] Per-pool task limits for the fair scheduler
[MAPREDUCE-706] Support for FIFO pools in the fair scheduler
[MAPREDUCE-707] Provide a jobconf property for explicitly assigning a job to a pool
[MAPREDUCE-709] node health check script does not display the correct message on timeout
[MAPREDUCE-714] JobConf.findContainingJar unescapes unnecessarily on Linux
[MAPREDUCE-716] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
[MAPREDUCE-722] More slots are getting reserved for HiRAM job tasks then required
[MAPREDUCE-732] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode
[MAPREDUCE-734] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs
[MAPREDUCE-739] Allow relative paths to be created inside archives.
[MAPREDUCE-740] Provide summary information per job once a job is finished.
[MAPREDUCE-744] Support in DistributedCache to share cache files with other users after HADOOP-4493
[MAPREDUCE-754] NPE in expiry thread when a TT is lost
[MAPREDUCE-764] TypedBytesInput's readRaw() does not preserve custom type codes
[MAPREDUCE-768] Configuration information should generate dump in a standard format.
[MAPREDUCE-771] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks
[MAPREDUCE-782] Use PureJavaCrc32 in mapreduce spills
[MAPREDUCE-787] -files, -archives should honor user given symlink path
[MAPREDUCE-809] Job summary logs show status of completed jobs as RUNNING
[MAPREDUCE-814] Move completed Job history files to HDFS
[MAPREDUCE-817] Add a cache for retired jobs with minimal job info and provide a way to access history file url
[MAPREDUCE-825] JobClient completion poll interval of 5s causes slow tests in local mode
[MAPREDUCE-840] DBInputFormat leaves open transaction


[MAPREDUCE-842] Per-job local data on the TaskTracker node should have right access-control
[MAPREDUCE-856] Localized files from DistributedCache should have right access-control
[MAPREDUCE-871] Job/Task local files have incorrect group ownership set by LinuxTaskController binary
[MAPREDUCE-875] Make DBRecordReader execute queries lazily
[MAPREDUCE-885] More efficient SQL queries for DBInputFormat
[MAPREDUCE-890] After HADOOP-4491, the user who started mapred system is not able to run job.
[MAPREDUCE-896] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[MAPREDUCE-899] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are misconfigured.
[MAPREDUCE-927] Cleanup of task-logs should happen in TaskTracker instead of the Child
[MAPREDUCE-947] OutputCommitter should have an abortJob method
[MAPREDUCE-964] Inaccurate values in jobSummary logs
[MAPREDUCE-967] TaskTracker does not need to fully unjar job jars
[MAPREDUCE-968] NPE in distcp encountered when placing _logs directory on S3FileSystem
[MAPREDUCE-971] distcp does not always remove distcp.tmp.dir
[MAPREDUCE-1028] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.
[MAPREDUCE-1030] Reduce tasks are getting starved in capacity scheduler
[MAPREDUCE-1048] Show total slot usage in cluster summary on jobtracker webui
[MAPREDUCE-1059] distcp can generate uneven map task assignments
[MAPREDUCE-1083] Use the user-to-groups mapping service in the JobTracker
[MAPREDUCE-1085] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit
[MAPREDUCE-1086] hadoop commands in streaming tasks are trying to write to tasktracker's log
[MAPREDUCE-1088] JobHistory files should have narrower 0600 perms
[MAPREDUCE-1089] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
[MAPREDUCE-1090] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.
[MAPREDUCE-1098] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
[MAPREDUCE-1100] User's task-logs filling up local disks on the TaskTrackers
[MAPREDUCE-1103] Additional JobTracker metrics
[MAPREDUCE-1105] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity
[MAPREDUCE-1118] Capacity Scheduler scheduling information is hard to read / should be tabular format
[MAPREDUCE-1131] Using profilers other than hprof can cause JobClient to report job failure
[MAPREDUCE-1140] Per cache-file refcount can become negative when tasks release distributed-cache files
[MAPREDUCE-1143] runningMapTasks counter is not properly decremented in case of failed Tasks.
[MAPREDUCE-1155] Streaming tests swallow exceptions
[MAPREDUCE-1158] running_maps is not decremented when the tasks of a job is killed/failed
[MAPREDUCE-1160] Two log statements at INFO level fill up jobtracker logs
[MAPREDUCE-1171] Lots of fetch failures
[MAPREDUCE-1178] MultipleInputs fails with ClassCastException
[MAPREDUCE-1185] URL to JT webconsole for running job and job history should be the same
[MAPREDUCE-1186] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir
[MAPREDUCE-1196] MAPREDUCE-947 incompatibly changed FileOutputCommitter
[MAPREDUCE-1198] Alternatively schedule different types of tasks in fair share scheduler
[MAPREDUCE-1213] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
[MAPREDUCE-1219] JobTracker Metrics causes undue load on JobTracker
[MAPREDUCE-1221] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold
[MAPREDUCE-1231] Distcp is very slow
[MAPREDUCE-1250] Refactor job token to use a common token interface
[MAPREDUCE-1258] Fair scheduler event log not logging job info
[MAPREDUCE-1285] DistCp cannot handle -delete if destination is local filesystem
[MAPREDUCE-1288] DistributedCache localizes only once per cache URI
[MAPREDUCE-1293] AutoInputFormat doesn't work with non-default FileSystems
[MAPREDUCE-1302] TrackerDistributedCacheManager can delete file asynchronously
[MAPREDUCE-1304] Add counters for task time spent in GC
[MAPREDUCE-1307] Introduce the concept of Job Permissions
[MAPREDUCE-1313] NPE in FieldFormatter if escape character is set and field is null
[MAPREDUCE-1316] JobTracker holds stale references to retired jobs via unreported tasks
[MAPREDUCE-1342] Potential JT deadlock in faulty TT tracking
[MAPREDUCE-1354] Incremental enhancements to the JobTracker for better scalability
[MAPREDUCE-1372] ConcurrentModificationException in JobInProgress
[MAPREDUCE-1378] Args in job details links on jobhistory.jsp are not URL encoded
[MAPREDUCE-1382] MRAsyncDiscService should tolerate missing local.dir
[MAPREDUCE-1397] NullPointerException observed during task failures
[MAPREDUCE-1398] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.
[MAPREDUCE-1399] The archive command shows a null error message
[MAPREDUCE-1403] Save file-sizes of each of the artifacts in DistributedCache in the JobConf
[MAPREDUCE-1421] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385
[MAPREDUCE-1422] Changing permissions of files/dirs under job-work-dir may be needed so that cleaning up of job-dir in all mapred-local-directories succeeds always
[MAPREDUCE-1423] Improve performance of CombineFileInputFormat when multiple pools are configured
[MAPREDUCE-1425] archive throws OutOfMemoryError
[MAPREDUCE-1435] symlinks in cwd of the task are not handled properly after MAPREDUCE-896
[MAPREDUCE-1436] Deadlock in preemption code in fair scheduler
[MAPREDUCE-1440] MapReduce should use the short form of the user names
[MAPREDUCE-1441] Configuration of directory lists should trim whitespace
[MAPREDUCE-1442] StackOverflowError when JobHistory parses a really long line


[MAPREDUCE-1443] DBInputFormat can leak connections
[MAPREDUCE-1454] The servlets should quote server generated strings sent in the response
[MAPREDUCE-1455] Authorization for servlets
[MAPREDUCE-1457] For secure job execution, couple of more UserGroupInformation.doAs needs to be added
[MAPREDUCE-1464] In JobTokenIdentifier change method getUsername to getUser which returns UGI
[MAPREDUCE-1466] FileInputFormat should save #input-files in JobConf
[MAPREDUCE-1476] committer.needsTaskCommit should not be called for a task cleanup attempt
[MAPREDUCE-1480] CombineFileRecordReader does not properly initialize child RecordReader
[MAPREDUCE-1493] Authorization for job-history pages
[MAPREDUCE-1503] Push HADOOP-6551 into MapReduce
[MAPREDUCE-1505] Cluster class should create the rpc client only when needed
[MAPREDUCE-1521] Protection against incorrectly configured reduces
[MAPREDUCE-1522] FileInputFormat may change the file system of an input path
[MAPREDUCE-1526] Cache the job related information while submitting the job, this would avoid many RPC calls to JobTracker.
[MAPREDUCE-1533] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString()
[MAPREDUCE-1538] TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit
[MAPREDUCE-1543] Log messages of JobACLsManager should use security logging of HADOOP-6586
[MAPREDUCE-1545] Add 'first-task-launched' to job-summary
[MAPREDUCE-1550] UGI.doAs should not be used for getting the history file of jobs
[MAPREDUCE-1563] Task diagnostic info would get missed sometimes.
[MAPREDUCE-1570] Shuffle stage - Key and Group Comparators
[MAPREDUCE-1607] Task controller may not set permissions for a task cleanup attempt's log directory
[MAPREDUCE-1609] TaskTracker.localizeJob should not set permissions on job log directory recursively
[MAPREDUCE-1611] Refresh nodes and refresh queues doesnt work with service authorization enabled
[MAPREDUCE-1612] job conf file is not accessible from job history web page
[MAPREDUCE-1621] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output
[MAPREDUCE-1635] ResourceEstimator does not work after MAPREDUCE-842
[MAPREDUCE-1641] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives
[MAPREDUCE-1656] JobStory should provide queue info.
[MAPREDUCE-1657] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs
[MAPREDUCE-1664] Job Acls affect Queue Acls
[MAPREDUCE-1680] Add a metrics to track the number of heartbeats processed
[MAPREDUCE-1682] Tasks should not be scheduled after tip is killed/failed.
[MAPREDUCE-1683] Remove JNI calls from ClusterStatus cstr
[MAPREDUCE-1699] JobHistory shouldn't be disabled for any reason
[MAPREDUCE-1707] TaskRunner can get NPE in getting ugi from TaskTracker
[MAPREDUCE-1716] Truncate logs of finished tasks to prevent node thrash due to excessive logging
[MAPREDUCE-1733] Authentication between pipes processes and java counterparts.
[MAPREDUCE-1734] Un-deprecate the old MapReduce API in the 0.20 branch
[MAPREDUCE-1744] DistributedCache creates its own FileSytem instance when adding a file/archive to the path
[MAPREDUCE-1754] Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators
[MAPREDUCE-1759] Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved
[MAPREDUCE-1778] CompletedJobStatusStore initialization should fail if {mapred.job.tracker.persist.jobstatus.dir} is unwritable
[MAPREDUCE-1784] IFile should check for null compressor
[MAPREDUCE-1785] Add streaming config option for not emitting the key
[MAPREDUCE-1832] Support for file sizes less than 1MB in DFSIO benchmark.
[MAPREDUCE-1845] FairScheduler.tasksToPeempt() can return negative number
[MAPREDUCE-1850] Include job submit host information (name and ip) in jobconf and jobdetails display
[MAPREDUCE-1853] MultipleOutputs does not cache TaskAttemptContext
[MAPREDUCE-1868] Add read timeout on userlog pull
[MAPREDUCE-1872] Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler
[MAPREDUCE-1887] MRAsyncDiskService does not properly absolutize volume root paths
[MAPREDUCE-1900] MapReduce daemons should close FileSystems that are not needed anymore
[MAPREDUCE-1914] TrackerDistributedCacheManager never cleans its input directories
[MAPREDUCE-1938] Ability for having user's classes take precedence over the system classes for tasks' classpath
[MAPREDUCE-1960] Limit the size of jobconf.
[MAPREDUCE-1961] ConcurrentModificationException when shutting down Gridmix
[MAPREDUCE-1985] java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps
[MAPREDUCE-2023] TestDFSIO read test may not read specified bytes.
[MAPREDUCE-2082] Race condition in writing the jobtoken password file when launching pipes jobs
[MAPREDUCE-2096] Secure local filesystem IO from symlink vulnerabilities
[MAPREDUCE-2103] task-controller shouldn't require o-r permissions
[MAPREDUCE-2157] safely handle InterruptedException and interrupted status in MR code
[MAPREDUCE-2178] Race condition in LinuxTaskController permissions handling
[MAPREDUCE-2219] JT should not try to remove mapred.system.dir during startup
[MAPREDUCE-2234] If Localizer can't create task log directory, it should fail on the spot
[MAPREDUCE-2235] JobTracker "over-synchronization" makes it hang up in certain cases
[MAPREDUCE-2242] LinuxTaskController doesn't properly escape environment variables
[MAPREDUCE-2253] Servlets should specify content type
[MAPREDUCE-2256] FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below fairshare.
[MAPREDUCE-2289] Permissions race can make getStagingDir fail on local filesystem
[MAPREDUCE-2321] TT should fail to start on secure cluster when SecureIO isn't available
[MAPREDUCE-2323] Add metrics to the fair scheduler


[MAPREDUCE-2328] memory-related configurations missing from mapred-default.xml
[MAPREDUCE-2332] Improve error messages when MR dirs on local FS have bad ownership
[MAPREDUCE-2351] mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI
[MAPREDUCE-2353] Make the MR changes to reflect the API changes in SecureIO library
[MAPREDUCE-2356] A task succeeded even though there were errors on all attempts.
[MAPREDUCE-2364] Shouldn't hold lock on rjob while localizing resources.
[MAPREDUCE-2366] TaskTracker can't retrieve stdout and stderr from web UI
[MAPREDUCE-2371] TaskLogsTruncater does not need to check log ownership when running as Child
[MAPREDUCE-2372] TaskLogAppender mechanism shouldn't be set in log4j.properties
[MAPREDUCE-2373] When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout
[MAPREDUCE-2374] Should not use PrintWriter to write taskjvm.sh
[MAPREDUCE-2377] task-controller fails to parse configuration if it doesn't end in \n
[MAPREDUCE-2379] Distributed cache sizing configurations are missing from mapred-default.xml


Beta Release Notes

General Information

New in This Release

Services Down Alarm Removed

The Services Down Alarm (NODE_ALARM_MISC_DOWN) has been removed.

Hoststats Service Down Alarm Added

The Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN) has been added. This alarm indicates that the Hoststats service on the indicated node is not running.

Installation Directory Full Alarm Added

The Installation Directory Full Alarm (NODE_ALARM_OPT_MAPR_FULL) has been added. This alarm indicates that the /opt/mapr directory on the indicated node is approaching capacity.

Root Partition Full Alarm Added

The Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL) has been added. This alarm indicates that the / directory on the indicated node is approaching capacity.

Cores Present Alarm Added

The Cores Present Alarm (NODE_ALARM_CORE_PRESENT) has been added. This alarm indicates that a service on the indicated node has crashed, leaving a core dump file.

Global fsck scan

Global fsck automatically scans the entire MapR cluster for errors. If an error is found, contact MapR Support for assistance.

Volume Mirrors

A volume mirror is a full read-only copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data, or for higher-performance read concurrency. Creating a mirror requires the mir permission. See Managing Volumes.
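For example, once a mirror volume exists, a sync can be started from the command line (a sketch; the mirror volume name is a placeholder):

maprcli volume mirror start -name myvol-mirror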

Resolved Issues

(Issue 3724) Default Settings Must Be Changed
(Issue 3620) Can't Run MapReduce Jobs as Non-Root User
(Issue 2434) Mirroring Disabled in Alpha
(Issue 2282) fsck Not Present in Alpha

Known Issues

Removing Nodes

The MapR Beta release may experience problems when nodes are removed from the cluster. The problems are likely to be seen as inconsistencies in the GUI and can be corrected by stopping and restarting the CLDB process. This behavior will be corrected in the GA release.

(Issue 4068) Upgrading Red Hat

When upgrading MapR packages on nodes that run Red Hat, you should only upgrade packages if they appear on the following list:

mapr-core
mapr-flume-internal
mapr-hbase-internal
mapr-hive-internal
mapr-oozie-internal
mapr-pig-internal
mapr-sqoop-internal
mapr-zk-internal


Other installed packages should not be upgraded. If you accidentally upgrade other packages, you can restore the node to proper operation by forcing a reinstall of the latest versions of the packages using the following steps:

1. Log in as root (or use sudo for the following steps).
2. Stop the warden: /etc/init.d/mapr-warden stop
3. If ZooKeeper is installed and running, stop it: /etc/init.d/mapr-zookeeper stop
4. Force reinstall of the packages by running yum reinstall with a list of packages to be installed. Example: yum reinstall mapr-core mapr-zk-internal
5. If ZooKeeper is installed on the node, start it: /etc/init.d/mapr-zookeeper start
6. Start the warden: /etc/init.d/mapr-warden start
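Run end to end, the recovery sequence from the steps above looks like this (the package list is an example):

/etc/init.d/mapr-warden stop
/etc/init.d/mapr-zookeeper stop
yum reinstall mapr-core mapr-zk-internal
/etc/init.d/mapr-zookeeper start
/etc/init.d/mapr-warden start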

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html

(Issue 3965) Volume Dump Restore Failure

The volume dump restore command can fail with error 22 (EINVAL) if nodes containing the volume dump are restarted during the restore operation. To fix the problem, run the command again after the nodes have restarted.

(Issue 3984) HBase Upgrade

If you are using HBase and upgrading during the MapR beta, please contact MapR Support for assistance.

(Issue 3890) Sqoop Requires HBase

The Sqoop package requires HBase, but the package dependency is not set. If you install Sqoop, you must also explicitly install HBase.

(Issue 3817) Increasing File Handle Limits Requires Restarting PAM Session Management

If you're upgrading from the Apache distribution of Hadoop on Ubuntu 10.x, it is not sufficient to modify /etc/security/limits.conf to increase the file handle limits for all the new users. You must also modify your PAM configuration, by adding the following line to /etc/pam.d/common-session and then restarting the services:

session required pam_limits.so
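For reference, the corresponding /etc/security/limits.conf entries look like the following (illustrative values; tune the limit for your cluster):

* soft nofile 64000
* hard nofile 64000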

(Issue 3560) Intermittent Scheduled Mirror Failure

Under certain conditions, a scheduled mirror ends prematurely. To work around the issue, re-start mirroring manually. This issue will be corrected in a post-beta code release.

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines


In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. If a mirror operation is not feasible (due to bandwidth constraints, for example), then you should restore the mirror volume from a full dump file. When creating a dump file from a volume that has been repaired with fsck, use the volume dump create command without specifying -s to create a full volume dump.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start

(Issue 2949) NFS Mounting Issue on Ubuntu

When mounting a cluster via NFS, you must include the vers=3 option, which specifies NFS protocol version 3. If no version is specified, NFS uses the highest version supported by the kernel and the mount command, which in most cases is version 4. Version 4 is not yet supported by MapR-FS NFS.
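For example (a sketch; the NFS node hostname and local mount point are placeholders):

mount -o vers=3 nfsnode:/mapr /mapr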

(Issue 2815) File Cleanup is Slow

After a MapReduce job is completed, cleanup of files and directories associated with the tasks can take a long time and tie up the TaskTracker node. If this happens on multiple nodes, it can cause a temporary cluster outage. If this happens, check the JobTracker View and make sure all TaskTrackers are back online before submitting additional jobs.

(Issue 2809) NFS Dependencies

If you are installing the MapR NFS service on a node that cannot connect to the standard apt-get or yum repositories, you should install the following packages by hand:

CentOS: iputils, portmap, glibc-common-2.5-49.el5_5.7

Red Hat: rpcbind, iputils

Ubuntu: nfs-common, iputils-arping

(Issue 2495) NTP Requirement

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.
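To verify that NTP is running and synchronized on a node, commands along these lines can be used (service names vary by distribution):

ntpq -p
service ntpd status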


Alpha Release Notes

New in This Release

As this is the first release, there are no added or changed features.

Resolved Issues

As this is the first release, there are no issues resolved or carried over from a previous release.

Known Issues

(Issue 2495) NTP Requirement

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.

(Issue 2434) Mirroring Disabled in Alpha

Volume Mirroring is intentionally disabled in the MapR Alpha Release. User interface elements and API commands related to mirroring are non-functional.

(Issue 2282) fsck Not Present in Alpha

MapR cluster fsck is not present in the Alpha release.


Packages and Dependencies for MapR Software

This page links to the Packages and Dependencies page for each release of the MapR software. These pages list:

Package dependencies for each Linux platform
MapR packages for all services, and their dependencies
Hadoop ecosystem packages and their dependencies

Select a link below for version-specific details:

Packages and Dependencies for MapR Version 2.x


MapR Control System

The MapR Control System main screen consists of a navigation pane to the left and a view to the right. Dialogs appear over the main screen to perform certain actions.


Logging on to the MapR Control System

1. In a browser, navigate to the node that is running the mapr-webserver service:

https://<hostname>:8443

2. When prompted, enter the username and password of the administrative user.

The Dashboard

The Navigation pane to the left lets you choose which view to display on the right.

The main view groups are:

Cluster Views - information about the nodes in the cluster
MapR-FS - information about volumes, snapshots and schedules
NFS HA Views - NFS nodes and virtual IP addresses
Alarms Views - node and volume alarms
System Settings Views - configuration of alarm notifications, quotas, users, groups, SMTP, and HTTP

Some other views are separate from the main navigation tree:

CLDB View - information about the container location database
HBase View - information about HBase on the cluster
JobTracker View - information about the JobTracker
Nagios View - information about the Nagios configuration script
Terminal View - an ssh terminal for logging in to the cluster

Views

Views display information about the system. As you open views, tabs along the top let you switch between them quickly. 

Clicking any column name in a view sorts the data in ascending or descending order by that column.

Most views contain the following controls:

a Filter toolbar that lets you sort data in the view, so you can quickly find the information you want

an info symbol ( ) that you can click for help


Some views contain collapsible panes that provide different types of detailed information. Each collapsible pane has a control at the top left that expands and collapses the pane. The control changes to show the state of the pane:

- pane is collapsed; click to expand

- pane is expanded; click to collapse

Views that contain many results provide the following controls:

First ( ) - navigates to the first screenful of results

Previous ( ) - navigates to the previous screenful of results

Next ( ) - navigates to the next screenful of results

Last ( ) - navigates to the last screenful of results

Refresh ( ) - refreshes the list of results

The Filter Toolbar

The Filter toolbar lets you build search expressions to provide sophisticated filtering capabilities for locating specific data on views that display a large number of nodes. Expressions are implicitly connected by the AND operator; any search results satisfy the criteria specified in all expressions.

There are three controls in the Filter toolbar:

The close control ( ) removes the expression.
The Add button adds a new expression.
The Filter Help button displays brief help about the Filter toolbar.

Expressions

Each expression specifies a semantic statement that consists of a field, an operator, and a value.

The first dropdown menu specifies the field to match.
The second dropdown menu specifies the type of match to perform.
The text field specifies a value to match or exclude in the field. You can use a wildcard to substitute for any part of the string.


Cluster Views

The Cluster view group provides the following views:

Dashboard - a summary of information about cluster health, activity, and usage
Nodes - information about nodes in the cluster
Node Heatmap - a summary of the health of nodes in the cluster
Jobs - information about jobs, tasks, and task attempts

Dashboard

The Dashboard displays a summary of information about the cluster in six panes:

Cluster Heat Map - the alarms and health for each node, by rack
Alarms - a summary of alarms for the cluster
Cluster Utilization - CPU, Memory, and Disk Space usage
Services - the number of instances of each service
Volumes - the number of available, under-replicated, and unavailable volumes
MapReduce Jobs - the number of running and queued jobs, running tasks, and blacklisted nodes

Links in each pane provide shortcuts to more detailed information. The following sections provide information about each pane.

Cluster Heat Map

The Cluster Heat Map pane displays the health of the nodes in the cluster, by rack. Each node appears as a colored square to show its health at a glance.

The Show Legend/Hide Legend link above the heatmap shows or hides a key to the color-coded display.

The drop-down menu at the top right of the pane lets you filter the results to show the following criteria:

Health
  (green): healthy; all services up, MapR-FS and all disks OK, and normal heartbeat
  (orange): degraded; one or more services down, or no heartbeat for over 1 minute
  (red): critical; MapR-FS Inactive/Dead/Replicate, or no heartbeat for over 5 minutes
  (gray): maintenance
  (purple): upgrade in process
CPU Utilization
  (green): below 50%; (orange): 50% - 80%; (red): over 80%
Memory Utilization
  (green): below 50%; (orange): 50% - 80%; (red): over 80%
Disk Space Utilization
  (green): below 50%; (orange): 50% - 80%; (red): over 80% or all disks dead
Disk Failure(s) - status of the NODE_ALARM_DISK_FAILURE alarm
  (red): raised; (green): cleared
Excessive Logging - status of the NODE_ALARM_DEBUG_LOGGING alarm
  (red): raised; (green): cleared
Software Installation & Upgrades - status of the NODE_ALARM_VERSION_MISMATCH alarm
  (red): raised; (green): cleared
Time Skew - status of the NODE_ALARM_TIME_SKEW alarm
  (red): raised; (green): cleared


CLDB Service Down - status of the NODE_ALARM_SERVICE_CLDB_DOWN alarm
  (red): raised; (green): cleared
FileServer Service Down - status of the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm
  (red): raised; (green): cleared
JobTracker Service Down - status of the NODE_ALARM_SERVICE_JT_DOWN alarm
  (red): raised; (green): cleared
TaskTracker Service Down - status of the NODE_ALARM_SERVICE_TT_DOWN alarm
  (red): raised; (green): cleared
HBase Master Service Down - status of the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm
  (red): raised; (green): cleared
HBase Regionserver Service Down - status of the NODE_ALARM_SERVICE_HBREGION_DOWN alarm
  (red): raised; (green): cleared
NFS Service Down - status of the NODE_ALARM_SERVICE_NFS_DOWN alarm
  (red): raised; (green): cleared
WebServer Service Down - status of the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm
  (red): raised; (green): cleared
Hoststats Service Down - status of the NODE_ALARM_SERVICE_HOSTSTATS_DOWN alarm
  (red): raised; (green): cleared
Root Partition Full - status of the NODE_ALARM_ROOT_PARTITION_FULL alarm
  (red): raised; (green): cleared
Installation Directory Full - status of the NODE_ALARM_OPT_MAPR_FULL alarm
  (red): raised; (green): cleared
Cores Present - status of the NODE_ALARM_CORE_PRESENT alarm
  (red): raised; (green): cleared

Clicking a rack name navigates to the Nodes view, which provides more detailed information about the nodes in the rack.

Clicking a colored square navigates to the Node Properties View, which provides detailed information about the node.

Alarms

The Alarms pane displays the following information about alarms on the system:

Alarm - a list of alarms raised on the cluster
Last Raised - the most recent time each alarm state changed
Summary - how many nodes or volumes have raised each alarm

Clicking any column name sorts data in ascending or descending order by that column.

Cluster Utilization

The Cluster Utilization pane displays a summary of the total usage of the following resources:

CPU
Memory
Disk Space

For each resource type, the pane displays the percentage of cluster resources used, the amount used, and the total amount present in the system.


Services

The Services pane shows information about the services running on the cluster. For each service, the pane displays the following information:

Actv - the number of running instances of the service
Stby - the number of instances of the service that are configured and standing by to provide failover
Stop - the number of instances of the service that have been intentionally stopped
Fail - the number of instances of the service that have failed, indicated by a corresponding Service Down alarm
Total - the total number of instances of the service configured on the cluster

Clicking a service navigates to the Services view.

Volumes

The Volumes pane displays the total number of volumes, and the number of volumes that are mounted and unmounted. For each category, the Volumes pane displays the number, percent of the total, and total size.

Clicking mounted or unmounted navigates to the Volumes view.

MapReduce Jobs

The MapReduce Jobs pane shows information about MapReduce jobs:

Running Jobs - the number of MapReduce jobs currently running
Queued Jobs - the number of MapReduce jobs queued to run
Running Tasks - the number of MapReduce tasks currently running
Blacklisted Nodes - the number of nodes that have been eliminated from the MapReduce pool


Nodes

The Nodes view displays the nodes in the cluster, by rack. The Nodes view contains two panes: the Topology pane and the Nodes pane. The Topology pane shows the racks in the cluster. Selecting a rack displays that rack's nodes in the Nodes pane to the right. Selecting Cluster displays all the nodes in the cluster.

Clicking any column name sorts data in ascending or descending order by that column.

Selecting the checkboxes beside one or more nodes makes the following buttons available:

Manage Services - displays the Manage Node Services dialog, which lets you start and stop services on the node
Remove - displays the Remove Node dialog, which lets you remove the node
Change Topology - displays the Change Node Topology dialog, which lets you change the topology path for a node

Selecting the checkbox beside a single node makes the following button available:

Properties - navigates to the Node Properties View, which displays detailed information about a single node.

The dropdown menu at the top left specifies the type of information to display:

Overview - general information about each node
Services - services running on each node
Machine Performance - information about memory, CPU, I/O and RPC performance on each node
Disks - information about disk usage, failed disks, and the MapR-FS heartbeat from each node
MapReduce - information about the JobTracker heartbeat and TaskTracker slots on each node
NFS Nodes - the IP addresses and Virtual IPs assigned to each NFS node
Alarm Status - the status of alarms on each node

Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.
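The node data shown in these panes can also be retrieved from the command line; a minimal sketch using maprcli node list:

# list every node together with the services configured on it
maprcli node list -columns svc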

Overview

The Overview displays the following general information about nodes in the cluster:


Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Phys IP(s) - the IP address or addresses associated with each node
FS HB - time since each node's last heartbeat to the CLDB
JT HB - time since each node's last heartbeat to the JobTracker
Physical Topology - the rack path to each node

Services

The Services view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Services - a list of the services running on each node
Physical Topology - each node's physical topology

Machine Performance

The Machine Performance view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Memory - the percentage of memory used and the total memory
# CPUs - the number of CPUs present on each node
% CPU Idle - the percentage of CPU capacity that is idle on each node
Bytes Received - the network input
Bytes Sent - the network output
# RPCs - the number of RPC calls
RPC In Bytes - the RPC input, in bytes
RPC Out Bytes - the RPC output, in bytes
# Disk Reads - the number of RPC disk reads
# Disk Writes - the number of RPC disk writes
Disk Read Bytes - the number of bytes read from disk
Disk Write Bytes - the number of bytes written to disk
# Disks - the number of disks present

Disks

The Disks view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
# Bad Disks - the number of failed disks on each node
Usage - the amount of disk used and total disk capacity, in gigabytes

MapReduce

The MapReduce  view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
JT HB - the time since each node's most recent JobTracker heartbeat
TT Map Slots - the number of map slots on each node
TT Map Slots Used - the number of map slots in use on each node
TT Reduce Slots - the number of reduce slots on each node
TT Reduce Slots Used - the number of reduce slots in use on each node

NFS Nodes

The NFS Nodes view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Phys IP(s) - the IP address or addresses associated with each node
VIP(s) - the virtual IP address or addresses assigned to each node

Alarm Status

The Alarm Status view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Version Alarm - whether the NODE_ALARM_VERSION_MISMATCH alarm is raised
Excess Logs Alarm - whether the NODE_ALARM_DEBUG_LOGGING alarm is raised
Disk Failure Alarm - whether the NODE_ALARM_DISK_FAILURE alarm is raised
Time Skew Alarm - whether the NODE_ALARM_TIME_SKEW alarm is raised
Root Partition Alarm - whether the NODE_ALARM_ROOT_PARTITION_FULL alarm is raised
Installation Directory Alarm - whether the NODE_ALARM_OPT_MAPR_FULL alarm is raised
Core Present Alarm - whether the NODE_ALARM_CORE_PRESENT alarm is raised
CLDB Alarm - whether the NODE_ALARM_SERVICE_CLDB_DOWN alarm is raised
FileServer Alarm - whether the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm is raised
JobTracker Alarm - whether the NODE_ALARM_SERVICE_JT_DOWN alarm is raised
TaskTracker Alarm - whether the NODE_ALARM_SERVICE_TT_DOWN alarm is raised
HBase Master Alarm - whether the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm is raised
HBase Region Alarm - whether the NODE_ALARM_SERVICE_HBREGION_DOWN alarm is raised
NFS Gateway Alarm - whether the NODE_ALARM_SERVICE_NFS_DOWN alarm is raised
WebServer Alarm - whether the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm is raised

Node Properties View

The Node Properties view displays detailed information about a single node in seven collapsible panes:

Alarms
Machine Performance
General Information
MapReduce
Manage Node Services
MapR-FS and Available Disks
System Disks

Buttons:

Remove Node - displays the Remove Node dialog

Alarms

The Alarms pane displays a list of alarms that have been raised on the system, and the following information about each alarm:

Alarm - the alarm name
Last Raised - the most recent time when the alarm was raised
Summary - a description of the alarm

Machine Performance

The Activity Since Last Heartbeat pane displays the following information about the node's performance and resource usage since it last reported to the CLDB:

Memory Used - the amount of memory in use on the node
Disk Used - the amount of disk space used on the node
CPU - the number of CPUs and the percentage of CPU used on the node
Network I/O - the input and output to the node per second
RPC I/O - the number of RPC calls on the node and the amount of RPC input and output
Disk I/O - the amount of data read from and written to the disk
# Operations - the number of disk reads and writes

General Information

The General Information pane displays the following general information about the node:

FS HB - the amount of time since the node performed a heartbeat to the CLDB
JT HB - the amount of time since the node performed a heartbeat to the JobTracker
Physical Topology - the rack path to the node

MapReduce

The MapReduce pane displays the number of map and reduce slots used, and the total number of map and reduce slots on the node.

MapR-FS and Available Disks

The MapR-FS and Available Disks pane displays the disks on the node, and the following information about each disk:

Mnt - whether the disk is mounted or unmounted
Disk - the disk name
File System - the file system on the disk
Used - the percentage used and total size of the disk

Clicking the checkbox next to a disk lets you select the disk for addition or removal.



Buttons:

Add Disks to MapR-FS - with one or more disks selected, adds the disks to the MapR-FS storage
Remove Disks from MapR-FS - with one or more disks selected, removes the disks from the MapR-FS storage

If you are running MapR 1.2.2 or earlier, do not use the disk add command or the MapR Control System to add disks to MapR-FS. You must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the disk add command):

1. Use the MapR Control System to remove the failed disk. All other disks in the same storage pool are removed at the same time. Make a note of which disks have been removed.
2. Create a text file /tmp/disks.txt containing a list of the disks you just removed. See Setting Up Disks for MapR.
3. Add the disks to MapR-FS by typing the following command (as root or with sudo): /opt/mapr/server/disksetup -F /tmp/disks.txt
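To confirm which disks a node contributes to MapR-FS before and after this procedure, the disk listing is also available from the command line; a minimal sketch (the hostname is a placeholder):

# list the disks on one node, including which are in use by MapR-FS
maprcli disk list -host node1.example.com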

System Disks

The System Disks pane displays information about disks present and mounted on the node:

Mnt - whether the disk is mounted
Device - the device name of the disk
File System - the file system
Used - the percentage used and total capacity

Manage Node Services

The Manage Node Services pane displays the status of each service on the node:

Service - the name of each service
State:

0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured (configure.sh has not run)
2 - RUNNING: the service is installed, has been started by the warden, and is currently executing
3 - STOPPED: the service is installed and configure.sh has run, but the service is currently not executing

Log Path - the path where each service stores its logs


Buttons:

Start Service - starts the selected services
Stop Service - stops the selected services
Log Settings - displays the Trace Activity dialog

You can also start and stop services in the Manage Node Services dialog, by clicking Manage Services in the Nodes view.
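Services can likewise be started and stopped from the command line with maprcli node services; a minimal sketch using the per-service form of the command (the hostname is a placeholder):

# stop the TaskTracker on one node, then start it again
maprcli node services -nodes node1.example.com -tasktracker stop
maprcli node services -nodes node1.example.com -tasktracker start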

Trace Activity

The Trace Activity dialog lets you set the log level of a specific service on a particular node.

The Log Level dropdown specifies the logging threshold for messages.

Buttons:

OK - save changes and exit
Close - exit without saving changes

Remove Node

The Remove Node dialog lets you remove the specified node.


The Remove Node dialog contains a radio button that lets you choose how to remove the node:

Shut down all services and then remove - shut down services before removing the node
Remove immediately (-force) - remove the node without shutting down services

Buttons:

Remove Node - removes the node
Cancel - returns to the Node Properties View without removing the node
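The equivalent command-line operation is maprcli node remove; a minimal sketch (the hostname is a placeholder):

# remove a node from the cluster after its services have been stopped
maprcli node remove -nodes node1.example.com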

Manage Node Services

The Manage Node Services dialog lets you start and stop services on the node.

The Service Changes section contains a dropdown menu for each service:

No change - leave the service running if it is running, or stopped if it is stopped
Start - start the service
Stop - stop the service

Buttons:

Change Node - start and stop the selected services as specified by the dropdown menus
Cancel - returns to the Node Properties View without starting or stopping any services


You can also start and stop services in the Manage Node Services pane of the Node Properties view.

Change Node Topology

The Change Node Topology dialog lets you change the rack or switch path for one or more nodes.

The Change Node Topology dialog consists of two panes:

Node(s) to move - shows the node or nodes specified in the Nodes view
New Path - contains the following fields:

Path to Change - rack path or switch path
New Path - the new node topology path

The Change Node Topology dialog contains the following buttons:

Move Node - changes the node topology
Close - returns to the Nodes view without changing the node topology
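Topology can also be changed from the command line with maprcli node move, which identifies nodes by server ID rather than hostname; a minimal sketch (the server ID and rack path are placeholders):

# look up each node's server ID
maprcli node list -columns id
# move the node to a new rack path
maprcli node move -serverids 5151085876952586037 -topology /data/rack2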

Node Heatmap

The Node Heatmap view displays information about each node, by rack.

The dropdown menu above the heatmap lets you choose the type of information to display. See Cluster Heat Map.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Jobs


The Jobs view displays the data collected by the MapR Metrics service. The Jobs view contains two panes: the chart pane and the data grid. The chart pane displays the data corresponding to the selected metric in histogram form. The data grid lists the jobs running on the cluster.

The dropdown menu above the chart pane lets you choose the type of information to display:

Cumulative CPU usage
Cumulative physical memory usage
Job duration
Number of map tasks per job
Number of reduce tasks per job
Number of failed map tasks per job
Number of failed reduce tasks per job
Number of map task attempts per job
Number of reduce task attempts per job
Number of failed map task attempts per job
Number of failed reduce task attempts per job
Rate of map input records per job
Rate of map output records per job
Rate of reduce input records per job
Rate of reduce output records per job
Rate of reduce shuffle bytes per job
Average duration of map attempt per job
Average duration of reduce attempt per job
Maximum duration of map attempt per job
Maximum duration of reduce attempt per job

Select the Filter checkbox to display the Filter toolbar, which provides additional data filtering options.

The x-axis drop-down selector lets you change the display scale of the histogram's X axis between a uniform or logarithmic scale. Hover the cursor over a bar in the histogram to display the Filter and Zoom buttons. Click the Filter button or click the bar to filter the table below the histogram by the data range corresponding to that bar. The selected bar turns yellow. Hover the cursor over the selected bar to display the Clear Filter and Zoom buttons. Click the Clear Filter button to remove the filter from the data range in the table below the histogram.

Double-click a bar or click the Zoom button to zoom in and display a new histogram that displays metrics constrained to the data range represented by the bar. The data range applied to the metrics data set displays above the histogram.


Click the Filter button to clear a filter condition, or uncheck the checkbox in the green bar above the histogram to clear the entire filter.

Check the box next to a job in the table below the histogram to enable the View Job button. If the job is still running, checking this box also enables the Kill Job button. Click Kill Job to display a confirmation dialog:

Click Yes to kill the job. Click No to cancel.

Click the View Job button or click the job name in the table below the histogram to open the Job tab for that job.

The Job Pane

From the main Jobs page, select a job from the list below the histogram and click View Job. You can also click directly on the name of the job in the list. The Job Properties pane displays with the Tasks tab selected by default. This pane has three tabs, Tasks, Charts, and Info. If the job is running, the Kill Job button is enabled.

The Tasks Tab

The Tasks tab has two panes. The upper pane displays histograms of metrics for the tasks and task attempts in the selected job. The lower pane displays a table that lists the tasks and primary task attempts in the selected job. Tasks can be in any of the following states:

COMPLETE
FAILED
KILLED
PENDING
RUNNING

The table of tasks also lists the following information for each task:

Task ID. Click the link to display a table with information about the task attempts for this task.
Task type:

M: Map
R: Reduce
TC: Task Cleanup
JS: Job Setup
JC: Job Cleanup

Primary task attempt ID. Click the link to display the task attempt pane for this task attempt.
Task starting timestamp
Task ending timestamp
Task duration
Host locality
Node running the task. Click the link to display the Node Properties pane for this node.


You can select the following task histogram metrics for this job from the drop-down selector:

Task Duration
Task Attempt Duration
Task Attempt Local Bytes Read
Task Attempt Local Bytes Written
Task Attempt MapR-FS Bytes Read
Task Attempt MapR-FS Bytes Written
Task Attempt Garbage Collection Time
Task Attempt CPU Time
Task Attempt Physical Memory Bytes
Task Attempt Virtual Memory Bytes
Map Task Attempt Input Records
Map Task Attempt Output Records
Map Task Attempt Skipped Records
Map Task Attempt Input Bytes
Map Task Attempt Output Bytes
Reduce Task Attempt Input Groups
Reduce Task Attempt Shuffle Bytes
Reduce Task Attempt Input Records
Reduce Task Attempt Output Records
Reduce Task Attempt Skipped Records
Task Attempt Spilled Records
Combined Task Attempt Input Records
Combined Task Attempt Output Records

Uncheck the Show Map Tasks box to hide map tasks. Uncheck the Show Reduce Tasks box to hide reduce tasks. Check the Show Setup/Cleanup Tasks box to display job and task setup and cleanup tasks. Histogram filtering and zoom work in the same way as the Jobs pane.

The Charts Tab

Click the Charts tab to display your job's line chart metrics.


Click a chart's close button to dismiss it. Click the add button to add a new line chart.

Line charts can display the following metrics for your job:

Cumulative CPU used
Cumulative physical memory used
Number of failed map tasks
Number of failed reduce tasks
Number of running map tasks
Number of running reduce tasks
Number of map task attempts
Number of failed map task attempts
Number of failed reduce task attempts
Rate of map record input
Rate of map record output
Rate of map input bytes
Rate of map output bytes
Rate of reduce record output
Rate of reduce shuffle bytes
Average duration of map attempts
Average duration of reduce attempts
Maximum duration of map attempts
Maximum duration of reduce attempts

The Information Tab

The Information tab of the Job Properties pane displays summary information about the job in three collapsible panes:

The MapReduce Framework Counters pane displays information about this job's MapReduce activity.

The Job Counters pane displays information about the number of this job's map tasks.

The File System Counters pane displays information about this job's interactions with the cluster's file system.

The Task Table

The Task table displays a list of the task attempts for the selected task, along with the following information for each task attempt:

Status:

RUNNING
SUCCEEDED
FAILED
UNASSIGNED
KILLED
COMMIT PENDING
FAILED UNCLEAN
KILLED UNCLEAN

Task attempt ID. Click the link to display the task attempt pane for this task attempt.
Task attempt type:

M: Map
R: Reduce
TC: Task Cleanup
JS: Job Setup
JC: Job Cleanup

Task attempt starting timestamp
Task attempt ending timestamp
Task attempt shuffle ending timestamp
Task attempt sort ending timestamp
Task attempt duration
Node running the task attempt. Click the link to display the Node Properties pane for this node.
A link to the log file for this task attempt
Diagnostic information about this task attempt

The Task Attempt Pane

The Task Attempt pane has two tabs, Info and Charts.

The Task Attempt Info Tab


The Info tab displays summary information about this task attempt in two collapsible panes:

The File System Counters pane displays information about this task attempt's interactions with the cluster's file system.

The MapReduce Framework Counters pane displays information about this task attempt's MapReduce activity.

The Task Attempt Charts Tab

The Task Attempt Charts tab displays line charts for metrics specific to this task attempt. By default, this tab displays charts for these metrics:

Cumulative CPU by Time
Physical Memory by Time
Virtual Memory by Time

Click a chart's close button to dismiss it. Click the add button to add a new line chart.

Line charts can display the following metrics for your task:

Combine Task Attempt Input Records
Combine Task Attempt Output Records
Map Task Attempt Input Bytes
Map Task Attempt Input Records
Map Task Attempt Output Bytes
Map Task Attempt Output Records
Map Task Attempt Skipped Records
Reduce Task Attempt Input Groups
Reduce Task Attempt Input Records
Reduce Task Attempt Output Records
Reduce Task Attempt Shuffle Bytes
Reduce Task Attempt Skipped Records
Task Attempt CPU Time
Task Attempt Local Bytes Read
Task Attempt Local Bytes Written
Task Attempt MapR-FS Bytes Read
Task Attempt MapR-FS Bytes Written
Task Attempt Physical Memory Bytes
Task Attempt Spilled Records
Task Attempt Virtual Memory Bytes


MapR-FS Views

The MapR-FS group provides the following views:

Volumes - information about volumes in the cluster
Mirror Volumes - information about mirrors
User Disk Usage - cluster disk usage
Snapshots - information about volume snapshots
Schedules - information about schedules

Volumes

The Volumes view displays the following information about volumes in the cluster:

Mnt - whether the volume is mounted
Vol Name - the name of the volume
Mount Path - the path where the volume is mounted
Creator - the user or group that owns the volume
Quota - the volume quota
Vol Size - the size of the volume
Snap Size - the size of the volume snapshot
Total Size - the size of the volume and all its snapshots
Replication Factor - the number of copies of the volume
Physical Topology - the rack path to the volume

Clicking any column name sorts data in ascending or descending order by that column.

The Show Unmounted checkbox specifies whether to show unmounted volumes:

selected - show both mounted and unmounted volumes
unselected - show mounted volumes only

The Show System checkbox specifies whether to show system volumes:

selected - show both system and user volumes
unselected - show user volumes only

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Clicking New Volume displays the New Volume dialog.

Selecting one or more checkboxes next to volumes enables the following buttons:

Remove - displays the Remove Volume dialog
Properties - displays the Volume Properties dialog (becomes Edit X Volumes if more than one checkbox is selected)
Snapshots - displays the Snapshots for Volume dialog
New Snapshot - displays the Snapshot Name dialog


New Volume

The New Volume dialog lets you create a new volume.

For mirror volumes, the Replication & Snapshot Scheduling section is replaced with a section called Replication & Mirror Scheduling:

The Volume Setup section specifies basic information about the volume using the following fields:

Volume Type - a standard volume, or a local or remote mirror volume
Volume Name (required) - a name for the new volume
Mount Path - a path on which to mount the volume
Mounted - whether the volume is mounted at creation
Topology - the new volume's rack topology
Read-only - if checked, prevents writes to the volume

The Ownership & Permissions section lets you grant specific permissions on the volume to certain users or groups:

User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Permissions table below)
Delete button - deletes the current row
[ + Add Permission ] - adds a new row

Volume Permissions

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

The Usage Tracking section sets the accountable entity and quotas for the volume using the following fields:

Group/User - the group/user that is accountable for the volume
Quotas - the volume quotas:

Volume Advisory Quota - if selected, the advisory quota for the volume as an integer plus a single letter to represent the unit
Volume Quota - if selected, the quota for the volume as an integer plus a single letter to represent the unit

The Replication & Snapshot Scheduling section (normal volumes) contains the following fields:

Replication - the desired replication factor for the volume
Minimum Replication - the minimum replication factor for the volume. When the number of replicas drops to or below this number, the volume is aggressively re-replicated to bring it above the minimum replication factor.
Snapshot Schedule - determines when snapshots will be automatically created; select an existing schedule from the pop-up menu

The Replication & Mirror Scheduling section (mirror volumes) contains the following fields:

Replication Factor - the desired replication factor for the volume
Actual Replication - what percent of the volume data is replicated once (1x), twice (2x), and so on, respectively
Mirror Update Schedule - determines when mirrors will be automatically updated; select an existing schedule from the pop-up menu
Last Mirror Operation - the status of the most recent mirror operation

Buttons:

Save - creates the new volume
Close - exits without creating the volume
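The fields in this dialog map onto the parameters of the maprcli volume create command; a minimal sketch (the volume name, path, and quota sizes are placeholders):

# create a standard volume with a mount path, replication settings, and quotas
maprcli volume create -name projects.alpha -path /projects/alpha \
  -replication 3 -minreplication 2 -quota 100G -advisoryquota 80G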

Remove Volume

The Remove Volume dialog prompts you for confirmation before removing the specified volume or volumes.


Buttons:

Remove Volume - removes the volume or volumes
Cancel - exits without removing the volume or volumes

Volume Properties

The Volume Properties dialog lets you view and edit volume properties.


For mirror volumes, the Replication & Snapshot Scheduling section is replaced with a section called Replication & Mirror Scheduling:


For information about the fields in the Volume Properties dialog, see New Volume.

Snapshots for Volume

The Snapshots for Volume dialog displays the following information about snapshots for the specified volume:

Snapshot Name - the name of the snapshot
Disk Used - the disk space occupied by the snapshot
Created - the date and time the snapshot was created
Expires - the snapshot expiration date and time

Buttons:

New Snapshot - displays the Snapshot Name dialog
Remove - when the checkboxes beside one or more snapshots are selected, displays the Remove Snapshots dialog
Preserve - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring
Close - closes the dialog

Snapshot Name

The Snapshot Name dialog lets you specify the name for a new snapshot you are creating.


The Snapshot Name dialog creates a new snapshot with the name specified in the following field:

Name For New Snapshot(s) - the new snapshot name

Buttons:

OK - creates a snapshot with the specified name
Cancel - exits without creating a snapshot
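Snapshots can also be created from the command line; a minimal sketch using maprcli volume snapshot create (the volume and snapshot names are placeholders):

# take a named snapshot of a volume
maprcli volume snapshot create -volume projects.alpha -snapshotname alpha-20121201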

Remove Snapshots

The Remove Snapshots dialog prompts you for confirmation before removing the specified snapshot or snapshots.

Buttons

Yes - removes the snapshot or snapshots
No - exits without removing the snapshot or snapshots

Mirror Volumes

The Mirror Volumes pane displays information about mirror volumes in the cluster:

Mnt - whether the volume is mounted
Vol Name - the name of the volume
Src Vol - the source volume
Src Clu - the source cluster
Orig Vol - the originating volume for the data being mirrored
Orig Clu - the originating cluster for the data being mirrored
Last Mirrored - the time at which mirroring was most recently completed
Status - the status of the last mirroring operation
% Done - progress of the mirroring operation
Error(s) - any errors that occurred during the last mirroring operation
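A mirror can be synchronized on demand from the command line; a minimal sketch using maprcli volume mirror start (the mirror volume name is a placeholder):

# start a mirror operation for a mirror volume
maprcli volume mirror start -name projects.alpha.mirror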

 

User Disk Usage


The User Disk Usage view displays information about disk usage by cluster users:

Name - the username
Disk Usage - the total disk space used by the user
# Vols - the number of volumes
Hard Quota - the user's quota
Advisory Quota - the user's advisory quota
Email - the user's email address

Snapshots

The Snapshots view displays the following information about volume snapshots in the cluster:

Snapshot Name - the name of the snapshot
Volume Name - the name of the source volume for the snapshot
Disk Space Used - the disk space occupied by the snapshot
Created - the creation date and time of the snapshot
Expires - the expiration date and time of the snapshot

Clicking any column name sorts data in ascending or descending order by that column.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Buttons:

Remove Snapshot - when the checkboxes beside one or more snapshots are selected, displays the Remove Snapshots dialog
Preserve Snapshot - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring
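Both operations are also scriptable; a minimal sketch using the snapshot commands (the snapshot ID is a placeholder):

# list all snapshots, including their IDs and expiration times
maprcli volume snapshot list
# prevent a snapshot from expiring
maprcli volume snapshot preserve -snapshots 256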

Schedules

The Schedules view lets you view and edit schedules, which can then be attached to events to create occurrences. A schedule is a named group of rules that describe one or more points of time in the future at which an action can be specified to take place.


The left pane of the Schedules view lists the following information about the existing schedules:

Schedule Name - the name of the schedule; clicking a name displays the schedule details in the right pane for editing

In Use - indicates whether the schedule is in use, or attached to an action

The right pane provides the following tools for creating or editing schedules:

Schedule Name - the name of the schedule
Schedule Rules - specifies schedule rules with the following components:

A dropdown that specifies frequency (Once, Yearly, Monthly, Weekly, Daily, Hourly, Every X minutes)
Dropdowns that specify the time within the selected frequency
Retain For - the time for which the scheduled snapshot or mirror data is to be retained after creation

[ +Add Rule ] - adds another rule to the schedule

Navigating away from a schedule with unsaved changes displays the Save Schedule dialog.

Buttons:

New Schedule - starts editing a new schedule
Remove Schedule - displays the Remove Schedule dialog
Save Schedule - saves changes to the current schedule
Cancel - cancels changes to the current schedule
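Schedules can also be created from the command line by passing a JSON description to maprcli schedule create; a minimal sketch, assuming the rule fields (frequency, date, time, retain) and placeholder values:

# create a weekly schedule (Sundays at 2:00) that retains data for two weeks
maprcli schedule create -schedule '{"name":"weekly-2am","rules":[{"frequency":"weekly","date":"sun","time":2,"retain":"2w"}]}'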

Remove Schedule

The Remove Schedule dialog prompts you for confirmation before removing the specified schedule.

Buttons

Yes - removes the schedule
No - exits without removing the schedule


NFS HA Views

The NFS view group provides the following views:

NFS Setup - information about NFS nodes in the cluster
VIP Assignments - information about virtual IP addresses (VIPs) in the cluster
NFS Nodes - information about NFS nodes in the cluster

NFS Setup

The NFS Setup view displays information about NFS nodes in the cluster and any VIPs assigned to them:

Starting VIP - the starting IP of the VIP range
Ending VIP - the ending IP of the VIP range
Node Name(s) - the names of the NFS nodes
IP Address(es) - the IP addresses of the NFS nodes
MAC Address(es) - the MAC addresses associated with the IP addresses

Buttons:

Start NFS - displays the Manage Node Services dialog
Add VIP - displays the Add Virtual IPs dialog
Edit - when one or more checkboxes are selected, edits the specified VIP ranges
Remove - when one or more checkboxes are selected, removes the specified VIP ranges
Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view)
VIP Assignments - displays the VIP Assignments view
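VIP ranges can also be added from the command line; a minimal sketch using maprcli virtualip add (the addresses and netmask are placeholders):

# assign a range of virtual IPs to the cluster's NFS gateways
maprcli virtualip add -virtualip 192.168.10.100 -virtualipend 192.168.10.105 -netmask 255.255.255.0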

VIP Assignments

The VIP Assignments view displays VIP assignments beside the nodes to which they are assigned:

Virtual IP Address - each VIP in the range
Node Name - the node to which the VIP is assigned
IP Address - the IP address of the node
MAC Address - the MAC address associated with the IP address

Buttons:

Start NFS - displays the Manage Node Services dialog
Add VIP - displays the Add Virtual IPs dialog
Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view)

NFS Nodes

The NFS Nodes view displays information about nodes running the NFS service:

Hlth - the health of the node
Hostname - the hostname of the node
Phys IP(s) - physical IP addresses associated with the node
VIP(s) - virtual IP addresses associated with the node

Buttons:

Properties - when one or more nodes are selected, navigates to the Node Properties View
Manage Services - navigates to the Manage Node Services dialog, which lets you start and stop services on the node
Remove - navigates to the Remove Node dialog, which lets you remove the node
Change Topology - navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node


Alarms Views

The Alarms view group provides the following views:

Node Alarms - information about node alarms in the cluster
Volume Alarms - information about volume alarms in the cluster
User/Group Alarms - information about users or groups that have exceeded quotas
Alarm Notifications - configure where notifications are sent when alarms are raised

Node Alarms

The Node Alarms view displays information about node alarms in the cluster.

Hlth - a color indicating the status of each node (see Cluster Heat Map)
Hostname - the hostname of the node
Version Alarm - last occurrence of the NODE_ALARM_VERSION_MISMATCH alarm
Excess Logs Alarm - last occurrence of the NODE_ALARM_DEBUG_LOGGING alarm
Disk Failure Alarm - last occurrence of the NODE_ALARM_DISK_FAILURE alarm
Time Skew Alarm - last occurrence of the NODE_ALARM_TIME_SKEW alarm
Root Partition Alarm - last occurrence of the NODE_ALARM_ROOT_PARTITION_FULL alarm
Installation Directory Alarm - last occurrence of the NODE_ALARM_OPT_MAPR_FULL alarm
Core Present Alarm - last occurrence of the NODE_ALARM_CORE_PRESENT alarm
CLDB Alarm - last occurrence of the NODE_ALARM_SERVICE_CLDB_DOWN alarm
FileServer Alarm - last occurrence of the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm
JobTracker Alarm - last occurrence of the NODE_ALARM_SERVICE_JT_DOWN alarm
TaskTracker Alarm - last occurrence of the NODE_ALARM_SERVICE_TT_DOWN alarm
HBase Master Alarm - last occurrence of the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm
HBase Regionserver Alarm - last occurrence of the NODE_ALARM_SERVICE_HBREGION_DOWN alarm
NFS Gateway Alarm - last occurrence of the NODE_ALARM_SERVICE_NFS_DOWN alarm
WebServer Alarm - last occurrence of the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm
Hoststats Alarm - last occurrence of the NODE_ALARM_SERVICE_HOSTSTATS_DOWN alarm

See Alarms Reference.

Clicking any column name sorts data in ascending or descending order by that column.

The left pane of the Node Alarms view displays the following information about the cluster:

Topology - the rack topology of the cluster

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.

Buttons:

Properties - navigates to the Node Properties View
Remove - navigates to the Remove Node dialog, which lets you remove the node
Manage Services - navigates to the Manage Node Services dialog, which lets you start and stop services on the node
Change Topology - navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node

Volume Alarms

The Volume Alarms view displays information about volume alarms in the cluster:

Mnt - whether the volume is mounted
Vol Name - the name of the volume
Snapshot Alarm - last Snapshot Failed alarm
Mirror Alarm - last Mirror Failed alarm
Replication Alarm - last Data Under-Replicated alarm
Data Alarm - last Data Unavailable alarm
Vol Advisory Quota Alarm - last Volume Advisory Quota Exceeded alarm
Vol Quota Alarm - last Volume Quota Exceeded alarm

Clicking any column name sorts data in ascending or descending order by that column. Clicking a volume name displays the Volume Properties dialog.

Selecting the Show Unmounted checkbox shows unmounted volumes as well as mounted volumes.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Buttons:

New Volume - displays the New Volume dialog
Properties - if the checkboxes beside one or more volumes are selected, displays the Volume Properties dialog
Mount (Unmount) - if an unmounted volume is selected, mounts it; if a mounted volume is selected, unmounts it
Remove - if the checkboxes beside one or more volumes are selected, displays the Remove Volume dialog
Start Mirroring - if a mirror volume is selected, starts the mirror sync process
Snapshots - if the checkboxes beside one or more volumes are selected, displays the Snapshots for Volume dialog
New Snapshot - if the checkboxes beside one or more volumes are selected, displays the Snapshot Name dialog

User/Group Alarms

The User/Group Alarms view displays information about user and group quota alarms in the cluster:

Name - the name of the user or group
User Advisory Quota Alarm - the last Advisory Quota Exceeded alarm
User Quota Alarm - the last Quota Exceeded alarm


Buttons:

Edit Properties

Alarm Notifications

The Configure Global Alarm Notifications dialog lets you specify where email notifications are sent when alarms are raised.

Fields:

Alarm Name - select the alarm to configure
Standard Notification - send notification to the default for the alarm type (the cluster administrator or volume creator, for example)
Additional Email Address - specify an additional custom email address to receive notifications for the alarm type

Buttons:

Save - save changes and exit
Close - exit without saving changes


System Settings Views

The System Settings view group provides the following views:

Email Addresses - specify MapR user email addresses
Permissions - give permissions to users
Quota Defaults - settings for default quotas in the cluster
SMTP - settings for sending email from MapR
HTTP - settings for accessing the MapR Control System via a browser
MapR Licenses - MapR license settings
Metrics Database - settings for the MapR Metrics MySQL database

Email Addresses

The Configure Email Addresses dialog lets you specify whether MapR gets user email addresses from an LDAP directory, or uses a company domain:

Use Company Domain - specify a domain to append after each username to determine each user's email address
Use LDAP - obtain each user's email address from an LDAP server

Buttons:

Save - save changes and exit
Close - exit without saving changes

Permissions

The Edit Permissions dialog lets you grant specific cluster permissions to particular users and groups.

User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Permissions table below)
Delete button - deletes the current row
[ + Add Permission ] - adds a new row

Cluster Permissions

Code - Allowed Action (Includes)

login - Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes (includes cv)
ss - Start/stop services
cv - Create volumes
a - Admin access (includes all permissions except fc)
fc - Full control (administrative access and permission to change the cluster ACL) (includes a)

Buttons:

OK - save changes and exit
Close - exit without saving changes
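Cluster permissions can also be granted from the command line with the acl commands; a minimal sketch (the username is a placeholder):

# grant a user login and volume-creation permission on the cluster
maprcli acl edit -type cluster -user jdoe:login,cv
# review the resulting cluster ACL
maprcli acl show -type cluster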

Quota Defaults

The Configure Quota Defaults dialog lets you set the default quotas that apply to users and groups.

The User Quota Defaults section contains the following fields:

Default User Advisory Quota - if selected, sets the advisory quota that applies to all users without an explicit advisory quota
Default User Total Quota - if selected, sets the total quota that applies to all users without an explicit total quota

The Group Quota Defaults section contains the following fields:

Default Group Advisory Quota - if selected, sets the advisory quota that applies to all groups without an explicit advisory quota
Default Group Total Quota - if selected, sets the total quota that applies to all groups without an explicit total quota

Buttons:

Save - saves the settings
Close - exits without saving the settings
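The defaults set here apply to every user or group without an explicit quota; an individual override can be set from the command line with maprcli entity modify. A minimal sketch (the username and sizes are placeholders; -type 0 denotes a user, 1 a group):

# give one user an explicit 100 GB quota and an 80 GB advisory quota
maprcli entity modify -name jdoe -type 0 -quota 100G -advisoryquota 80G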

SMTP

The Configure Sending Email dialog lets you configure the email account from which the MapR cluster sends alerts and other notifications.

The Configure Sending Email (SMTP) dialog contains the following fields:

Provider - selects Gmail or another email provider; if you select Gmail, the other fields are partially populated to help you with the configuration
SMTP Server - specifies the SMTP server to use when sending email
The server requires an encrypted connection (SSL) - use SSL when connecting to the SMTP server
SMTP Port - the port to use on the SMTP server
Full Name - the name used in the From field when the cluster sends an alert email
Email Address - the email address used in the From field when the cluster sends an alert email
Username - the username used to log onto the email account the cluster will use to send email
SMTP Password - the password to use when sending email

Buttons:

Save - saves the settings
Close - exits without saving the settings
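These settings can also be written from the command line via maprcli config save; a minimal sketch, assuming the mapr.smtp.* configuration key names and placeholder values (the current keys can be checked with maprcli config load -json):

# point the cluster at an SMTP relay (key names assumed)
maprcli config save -values '{"mapr.smtp.server":"smtp.example.com","mapr.smtp.port":"587","mapr.smtp.sslrequired":"true"}'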

HTTP

The Configure HTTP dialog lets you configure access to the MapR Control System via HTTP and HTTPS.


The sections in the Configure HTTP dialog let you enable HTTP and HTTPS access, and set the session timeout:

Enable HTTP Access - if selected, configure HTTP access with the following field:
HTTP Port - the port on which to connect to the MapR Control System via HTTP

Enable HTTPS Access - if selected, configure HTTPS access with the following fields:
HTTPS Port - the port on which to connect to the MapR Control System via HTTPS
HTTPS Keystore Path - a path to the HTTPS keystore
HTTPS Keystore Password - a password to access the HTTPS keystore
HTTPS Key Password - a password to access the HTTPS key

Session Timeout - the number of seconds before an idle session times out

Buttons:

Save - saves the settings
Close - exits without saving the settings

MapR Licenses

The MapR License Management dialog lets you add and activate licenses for the cluster, and displays the Cluster ID and the following information about existing licenses:

Name - the name of each license
Issued - the date each license was issued
Expires - the expiration date of each license
Nodes - the nodes to which each license applies

If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA, then you must upgrade before applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.


Fields:

Cluster ID - the unique identifier needed for licensing the cluster

Buttons:

Add Licenses via Web - navigates to the MapR licensing form online
Add License via Upload - alternate licensing mechanism: upload via browser
Add License via Copy/Paste - alternate licensing mechanism: paste license key
Apply Licenses - validates the licenses and applies them to the cluster
Close - closes the dialog
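Licenses can also be managed from the command line; a minimal sketch using the license commands (the file path is a placeholder):

# install a license from a file obtained from MapR
maprcli license add -license /tmp/license.txt -is_file true
# confirm the installed licenses
maprcli license list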

Metrics Database Configuration

The Metrics configuration dialog enables you to specify the location and login credentials of the MySQL server that stores information for Job Metrics.


Fields:

URL - the hostname and port of the machine running the MySQL server
Username - the username for the metrics MySQL database
Password - the password for the metrics MySQL database

Buttons:

Save - saves the MySQL information in the fields
Cancel - closes the dialog


Other Views

In addition to the MapR Control System views, there are views that display detailed information about the system:

CLDB View - information about the container location database
HBase View - information about HBase on the cluster
JobTracker View - information about the JobTracker
Nagios View - information about the Nagios configuration script
Terminal View - an ssh terminal for logging in to the cluster

With the exception of the MapR Launchpad, the above views include the following buttons:

Refresh Button - refreshes the view

Popout Button - opens the view in a new browser window


CLDB View

The CLDB view provides information about the Container Location Database (CLDB). To display the CLDB view, open the MapR Control System and click CLDB in the navigation pane.

The following table describes the fields on the CLDB view:

Field Description

CLDB Mode  

CLDB BuildVersion  

CLDB Status  

Cluster Capacity  

Cluster Used  

Cluster Available  

Active FileServers A list of FileServers, and the following information about each:

ServerID (Hex)
ServerID
HostPort
HostName
Network Location
Capacity (MB)
Used (MB)
Available (MB)
Last Heartbeat (s)
State
Type

Volumes A list of volumes, and the following information about each:

Volume Name
Mount Point
Mounted
ReadOnly
Volume ID
Volume Topology
Quota
Advisory Quota
Used
LogicalUsed
Root Container ID
Replication
Guaranteed Replication
Num Containers

Accounting Entities A list of users and groups, and the following information about each:

AE Name
AE Type
AE Quota
AE Advisory Quota
AE Used


Mirrors A list of mirrors, and the following information about each:

Mirror Volume Name
Mirror ID
Mirror NextID
Mirror Status
Last Successful Mirror Time
Mirror SrcVolume
Mirror SrcRootContainerID
Mirror SrcClusterName
Mirror SrcSnapshot
Mirror DataGenerator Volume

Snapshots A list of snapshots, and the following information about each:

Snapshot ID
RW Volume ID
Snapshot Name
Root Container ID
Snapshot Size
Snapshot InProgress

Containers A list of containers, and the following information about each:

Container ID
Volume ID
Latest Epoch
SizeMB
Container Master Location
Container Locations
Inactive Locations
Unused Locations
Replication Type

Snapshot Containers A list of snapshot containers, and the following information about each:

Snapshot Container ID - unique ID of the container
Snapshot ID - ID of the snapshot corresponding to the container
RW Container ID - corresponding source container ID
Latest Epoch
SizeMB - container size, in MB
Container Master Location - location of the container's master replica
Container Locations
Inactive Locations


HBase View

The HBase View provides information about HBase on the cluster.

Field Description

Local Logs A link to the HBase Local Logs View

Thread Dump A link to the HBase Thread Dump View

Log Level A link to the HBase Log Level View, a form for getting/setting the log level

Master Attributes A list of attributes, and the following information about each:

Attribute Name
Value
Description

Catalog Tables A list of tables, and the following information about each:

Table
Description

User Tables  

Region Servers A list of region servers in the cluster, and the following information about each:

Address
Start Code
Load
Total


HBase Local Logs View

The HBase Local Logs view displays a list of the local HBase logs. Clicking a log name displays the contents of the log. Each log name can be copied and pasted into the HBase Log Level View to get or set the current log level.


HBase Log Level View

The HBase Log Level View is a form for getting and setting log levels that determine which information gets logged. The Log field accepts a log name (which can be copied from the HBase Local Logs View and pasted). The Level field takes any of the following valid log levels:

ALL
TRACE
DEBUG
INFO
WARN
ERROR
OFF
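The form drives the same get/set operations as the daemons' logLevel servlet described under the hadoop daemonlog command later in this guide. As a sketch (the hostname and the HBase Master info port 60010 are assumptions for a typical installation), setting a level amounts to a request of the form:

http://perfnode51.perf.lab:60010/logLevel?log=org.apache.hadoop.hbase.master.HMaster&level=DEBUG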


HBase Thread Dump View

The HBase Thread Dump View displays a dump of the HBase threads.

Example:

Process Thread Dump: 40 active threads
Thread 318 (1962516546@qtp-879081272-3):
  State: RUNNABLE
  Blocked count: 8
  Waited count: 32
  Stack:
    sun.management.ThreadImpl.getThreadInfo0(Native Method)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
    sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
    org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:149)
    org.apache.hadoop.http.HttpServer$StackServlet.doGet(HttpServer.java:695)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
    org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:826)
    org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    org.mortbay.jetty.Server.handle(Server.java:326)
    org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
Thread 50 (perfnode51.perf.lab:60000-CatalogJanitor):
  State: TIMED_WAITING
  Blocked count: 1081
  Waited count: 1350
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:91)
    org.apache.hadoop.hbase.Chore.run(Chore.java:74)
Thread 49 (perfnode51.perf.lab:60000-BalancerChore):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 270
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:91)
    org.apache.hadoop.hbase.Chore.run(Chore.java:74)
Thread 48 (MASTER_OPEN_REGION-perfnode51.perf.lab:60000-1):
  State: WAITING
  Blocked count: 2
  Waited count: 3
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6d1cf4e5
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)


JobTracker View

Field Description

State  

Started  

Version  

Compiled  

Identifier  

Cluster Summary The heapsize, and the following information about the cluster:

Running Map Tasks
Running Reduce Tasks
Total Submissions
Nodes
Occupied Map Slots
Occupied Reduce Slots
Reserved Map Slots
Reserved Reduce Slots
Map Task Capacity
Reduce Task Capacity
Avg. Tasks/Node
Blacklisted Nodes
Excluded Nodes
MapTask Prefetch Capacity

Scheduling Information A list of queues, and the following information about each:

Queue name
State
Scheduling Information

Filter A field for filtering results by Job ID, Priority, User, or Name

Running Jobs A list of running MapReduce jobs, and the following information about each:

JobId
Priority
User
Name
Start Time
Map % Complete
Current Map Slots
Failed MapAttempts
MapAttempt Time Avg/Max
Cumulative Map CPU
Current Map PMem
Reduce % Complete
Current Reduce Slots
Failed ReduceAttempts
ReduceAttempt Time Avg/Max
Cumulative Reduce CPU
Current Reduce PMem


Completed Jobs A list of completed MapReduce jobs, and the following information about each:

JobId
Priority
User
Name
Start Time
Total Time
Maps Launched
Map Total
Failed MapAttempts
MapAttempt Time Avg/Max
Cumulative Map CPU
Reducers Launched
Reduce Total
Failed ReduceAttempts
ReduceAttempt Time Avg/Max
Cumulative Reduce CPU
Cumulative Reduce PMem
Vaidya Reports

Retired Jobs A list of retired MapReduce jobs, and the following information about each:

JobId
Priority
User
Name
State
Start Time
Finish Time
Map % Complete
Reduce % Complete
Job Scheduling Information
Diagnostic Info

Local Logs A link to the local logs

JobTracker Configuration A link to a page containing Hadoop JobTracker configuration values


JobTracker Configuration View

Field Description

fs.automatic.close TRUE

fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary

fs.checkpoint.edits.dir ${fs.checkpoint.dir}

fs.checkpoint.period 3600

fs.checkpoint.size 67108864

fs.default.name maprfs:///

fs.file.impl org.apache.hadoop.fs.LocalFileSystem

fs.ftp.impl org.apache.hadoop.fs.ftp.FTPFileSystem

fs.har.impl org.apache.hadoop.fs.HarFileSystem

fs.har.impl.disable.cache TRUE

fs.hdfs.impl org.apache.hadoop.hdfs.DistributedFileSystem

fs.hftp.impl org.apache.hadoop.hdfs.HftpFileSystem

fs.hsftp.impl org.apache.hadoop.hdfs.HsftpFileSystem

fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem

fs.maprfs.impl com.mapr.fs.MapRFileSystem

fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem

fs.s3.block.size 67108864

fs.s3.buffer.dir ${hadoop.tmp.dir}/s3

fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem

fs.s3.maxRetries 4

fs.s3.sleepTimeSeconds 10

fs.s3n.block.size 67108864

fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem

fs.trash.interval 0

hadoop.job.history.location file:////opt/mapr/hadoop/hadoop-0.20.2/bin/../logs/history

hadoop.logfile.count 10

hadoop.logfile.size 10000000

hadoop.native.lib TRUE

hadoop.proxyuser.root.groups root

hadoop.proxyuser.root.hosts

hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory

hadoop.security.authentication simple

hadoop.security.authorization FALSE

hadoop.security.group.mapping org.apache.hadoop.security.ShellBasedUnixGroupsMapping

hadoop.tmp.dir /tmp/hadoop-${user.name}

hadoop.util.hash.type murmur


io.bytes.per.checksum 512

io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec

io.file.buffer.size 8192

io.map.index.skip 0

io.mapfile.bloom.error.rate 0.005

io.mapfile.bloom.size 1048576

io.seqfile.compress.blocksize 1000000

io.seqfile.lazydecompress TRUE

io.seqfile.sorter.recordlimit 1000000

io.serializations org.apache.hadoop.io.serializer.WritableSerialization

io.skip.checksum.errors FALSE

io.sort.factor 256

io.sort.record.percent 0.17

io.sort.spill.percent 0.99

ipc.client.connect.max.retries 10

ipc.client.connection.maxidletime 10000

ipc.client.idlethreshold 4000

ipc.client.kill.max 10

ipc.client.tcpnodelay FALSE

ipc.server.listen.queue.size 128

ipc.server.tcpnodelay FALSE

job.end.retry.attempts 0

job.end.retry.interval 30000

jobclient.completion.poll.interval 5000

jobclient.output.filter FAILED

jobclient.progress.monitor.poll.interval 1000

keep.failed.task.files FALSE

local.cache.size 10737418240

map.sort.class org.apache.hadoop.util.QuickSort

mapr.localoutput.dir output

mapr.localspill.dir spill

mapr.localvolumes.path /var/mapr/local

mapred.acls.enabled FALSE

mapred.child.oom_adj 10

mapred.child.renice 10

mapred.child.taskset TRUE

mapred.child.tmp ./tmp

mapred.cluster.ephemeral.tasks.memory.limit.mb 200

mapred.compress.map.output FALSE


mapred.fairscheduler.allocation.file conf/pools.xml

mapred.fairscheduler.assignmultiple TRUE

mapred.fairscheduler.eventlog.enabled FALSE

mapred.fairscheduler.smalljob.max.inputsize 10737418240

mapred.fairscheduler.smalljob.max.maps 10

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824

mapred.fairscheduler.smalljob.max.reducers 10

mapred.fairscheduler.smalljob.schedule.enable TRUE

mapred.healthChecker.interval 60000

mapred.healthChecker.script.timeout 600000

mapred.inmem.merge.threshold 1000

mapred.job.queue.name default

mapred.job.reduce.input.buffer.percent 0

mapred.job.reuse.jvm.num.tasks -1

mapred.job.shuffle.input.buffer.percent 0.7

mapred.job.shuffle.merge.percent 0.66

mapred.job.tracker ted-desk.perf.lab:9001

mapred.job.tracker.handler.count 10

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done

mapred.job.tracker.http.address 0.0.0.0:50030

mapred.job.tracker.persist.jobstatus.active FALSE

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo

mapred.job.tracker.persist.jobstatus.hours 0

mapred.jobtracker.completeuserjobs.maximum 100

mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst

mapred.jobtracker.job.history.block.size 3145728

mapred.jobtracker.jobhistory.lru.cache.size 5

mapred.jobtracker.maxtasks.per.job -1

mapred.jobtracker.port 9001

mapred.jobtracker.restart.recover TRUE

mapred.jobtracker.retiredjobs.cache.size 1000

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler

mapred.line.input.format.linespermap 1

mapred.local.dir ${hadoop.tmp.dir}/mapred/local

mapred.local.dir.minspacekill 0

mapred.local.dir.minspacestart 0

mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log

mapred.map.max.attempts 4

mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec


mapred.map.tasks 2

mapred.map.tasks.speculative.execution FALSE

mapred.max.maps.per.node -1

mapred.max.reduces.per.node -1

mapred.max.tracker.blacklists 4

mapred.max.tracker.failures 4

mapred.merge.recordsBeforeProgress 10000

mapred.min.split.size 0

mapred.output.compress FALSE

mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec

mapred.output.compression.type RECORD

mapred.queue.names default

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log

mapred.reduce.copy.backoff 300

mapred.reduce.max.attempts 4

mapred.reduce.parallel.copies 12

mapred.reduce.slowstart.completed.maps 0.95

mapred.reduce.tasks 1

mapred.reduce.tasks.speculative.execution FALSE

mapred.running.map.limit -1

mapred.running.reduce.limit -1

mapred.skip.attempts.to.start.skipping 2

mapred.skip.map.auto.incr.proc.count TRUE

mapred.skip.map.max.skip.records 0

mapred.skip.reduce.auto.incr.proc.count TRUE

mapred.skip.reduce.max.skip.groups 0

mapred.submit.replication 10

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system

mapred.task.cache.levels 2

mapred.task.profile FALSE

mapred.task.profile.maps 0-2

mapred.task.profile.reduces 0-2

mapred.task.timeout 600000

mapred.task.tracker.http.address 0.0.0.0:50060

mapred.task.tracker.report.address 127.0.0.1:0

mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController

mapred.tasktracker.dns.interface default

mapred.tasktracker.dns.nameserver default

mapred.tasktracker.ephemeral.tasks.maximum 1


mapred.tasktracker.ephemeral.tasks.timeout 10000

mapred.tasktracker.ephemeral.tasks.ulimit 4294967296

mapred.tasktracker.expiry.interval 600000

mapred.tasktracker.indexcache.mb 10

mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50): 1

mapred.tasktracker.taskmemorymanager.monitoring-interval 5000

mapred.tasktracker.tasks.sleeptime-before-sigkill 5000

mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp

mapred.userlog.limit.kb 0

mapred.userlog.retain.hours 24

mapreduce.heartbeat.10 300

mapreduce.heartbeat.100 1000

mapreduce.heartbeat.1000 10000

mapreduce.heartbeat.10000 100000

mapreduce.job.acl-view-job  

mapreduce.job.complete.cancel.delegation.tokens TRUE

mapreduce.job.split.metainfo.maxsize 10000000

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery

mapreduce.jobtracker.recovery.maxtime 120

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging

mapreduce.maprfs.use.compression TRUE

mapreduce.reduce.input.limit -1

mapreduce.tasktracker.outofband.heartbeat FALSE

mapreduce.tasktracker.prefetch.maptasks 1

mapreduce.use.fastreduce FALSE

mapreduce.use.maprfs TRUE

tasktracker.http.threads 2

topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping

topology.script.number.args 100

webinterface.private.actions FALSE


Nagios View

The Nagios view displays a dialog containing a Nagios configuration script.

Example:

############# Commands #############

define command {
  command_name check_fileserver_proc
  command_line $USER1$/check_tcp -p 5660
}

define command {
  command_name check_cldb_proc
  command_line $USER1$/check_tcp -p 7222
}

define command {
  command_name check_jobtracker_proc
  command_line $USER1$/check_tcp -p 50030
}

define command {
  command_name check_tasktracker_proc
  command_line $USER1$/check_tcp -p 50060
}

define command {
  command_name check_nfs_proc
  command_line $USER1$/check_tcp -p 2049
}

define command {
  command_name check_hbmaster_proc
  command_line $USER1$/check_tcp -p 60000
}

define command {
  command_name check_hbregionserver_proc
  command_line $USER1$/check_tcp -p 60020


}

define command {
  command_name check_webserver_proc
  command_line $USER1$/check_tcp -p 8443
}

################# HOST: perfnode51.perf.lab ###############

define host {
  use linux-server
  host_name perfnode51.perf.lab
  address 10.10.30.51
  check_command check-host-alive
}

################# HOST: perfnode52.perf.lab ###############

define host {
  use linux-server
  host_name perfnode52.perf.lab
  address 10.10.30.52
  check_command check-host-alive
}

################# HOST: perfnode53.perf.lab ###############

define host {
  use linux-server
  host_name perfnode53.perf.lab
  address 10.10.30.53
  check_command check-host-alive
}

################# HOST: perfnode54.perf.lab ###############

define host {
  use linux-server
  host_name perfnode54.perf.lab
  address 10.10.30.54
  check_command check-host-alive
}

################# HOST: perfnode55.perf.lab ###############

define host {
  use linux-server
  host_name perfnode55.perf.lab
  address 10.10.30.55
  check_command check-host-alive
}

################# HOST: perfnode56.perf.lab ###############

define host {
  use linux-server
  host_name perfnode56.perf.lab
  address 10.10.30.56


  check_command check-host-alive
}


Terminal View

The ssh terminal provides access to the command line.


Hadoop Commands

All Hadoop commands are invoked by the bin/hadoop script.

Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Hadoop has an option parsing framework that handles generic options as well as running classes.

COMMAND_OPTION Description

--config confdir Overwrites the default Configuration directory. Default is ${HADOOP_HOME}/conf.

COMMAND Various commands with their options are described in the following sections.

GENERIC_OPTIONS The common set of options supported by multiple commands.

COMMAND_OPTIONS Various command options are described in the following sections.

Useful Information: Running the hadoop script without any arguments prints the description for all commands.

Commands

The following hadoop commands may be run on MapR:

Command Description

archive -archiveName NAME <src>* <dest>

The hadoop archive command creates a Hadoop archive, a file that contains other files. A Hadoop archive always has a *.har extension.

classpath The hadoop classpath command prints the class path needed to access the Hadoop JAR and the required libraries.

daemonlog The hadoop daemonlog command may be used to get or set the log level of Hadoop daemons.

distcp <source> <destination>

The hadoop distcp command is a tool for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

fs The hadoop fs command runs a generic filesystem user client that interacts with the MapR filesystem (MapR-FS).

jar <jar> The hadoop jar command runs a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using this command.

job Manipulates MapReduce jobs.

jobtracker Runs the MapReduce Jobtracker node.

mfs The hadoop mfs command performs operations on directories in the cluster. The main purposes of hadoop mfs are to display directory information and contents, to create symbolic links, and to set compression and chunk size on a directory.

mradmin Runs a MapReduce admin client.

pipes Runs a pipes job.

queue Gets information about job queues.

tasktracker The hadoop tasktracker command runs a MapReduce tasktracker node.

version The hadoop version command prints the Hadoop software version.


Useful Information: Most Hadoop commands print help when invoked without parameters.

Generic Options

Implement the Tool interface and the following generic Hadoop command-line options are available for many of the Hadoop commands.

Generic options are supported by the distcp, fs, job, mradmin, pipes, and queue Hadoop commands.

Generic Option Description

-conf <filename1 filename2...>

Add the specified configuration files to the list of resources available in the configuration.

-D <property=value> Set a value for the specified Hadoop configuration property.

-fs <local|filesystem URI> Set the URI of the default filesystem.

-jt <local|jobtracker:port> Specify a jobtracker for a given host and port. This command option is a shortcut for -Dmapred.job.tracker=host:port

-files <file1,file2,...> Specify files to be copied to the map reduce cluster.

-libjars <jar1,jar2,...> Specify JAR files to be included in the classpath of the mapper and reducer tasks.

-archives <archive1,archive2,...> Specify archive files (JAR, tar, tar.gz, ZIP) to be copied and unarchived on the task node.
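For example, generic options let you redirect a single command without editing configuration files. In the following sketch, the jobtracker host and the file path are hypothetical; the first command lists jobs on a specific jobtracker, and the second overrides a configuration property for one invocation:

$ hadoop job -jt perfnode51.perf.lab:9001 -list
$ hadoop fs -D fs.trash.interval=1440 -rm /user/alice/old.log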

CLASSNAME

The hadoop script can be used to invoke any class.

Usage: hadoop CLASSNAME

Runs the class named CLASSNAME.
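For example, the following invokes the main method of a class that ships with Hadoop and prints version information; any class on the Hadoop classpath can be invoked the same way:

$ hadoop org.apache.hadoop.util.VersionInfo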


hadoop archive

The hadoop archive command creates a Hadoop archive, a file that contains other files. A Hadoop archive always has a *.har extension.

Syntax

hadoop [ Generic Options ] archive -archiveName <name> [-p <parent>] <source> <destination>

Parameters

Parameter Description

-archiveName <name> Name of the archive to be created.

-p <parent_path> The parent argument specifies the relative path to which the files should be archived.

<source> Filesystem pathnames which work as usual with regular expressions.

<destination> Destination directory which would contain the archive.

Examples

Archive within a single directory

hadoop archive -archiveName myArchive.har -p /foo/bar /outputdir

The above command creates an archive of the directory /foo/bar in the directory /outputdir.

Archive to another directory

hadoop archive -archiveName myArchive.har -p /foo/bar a/b/c e/f/g

The above command creates an archive of the directory /foo/bar/a/b/c in the directory /foo/bar/e/f/g.


hadoop classpath

The hadoop classpath command prints the class path needed to access the Hadoop jar and the required libraries.

Syntax

hadoop classpath

Output

$ hadoop classpath
/opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-codec-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-test-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-api-2.1.jar
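A common use of the printed path is to compile MapReduce code against the cluster's Hadoop libraries. This is a sketch; the source file name MyJob.java is hypothetical:

$ javac -classpath "$(hadoop classpath)" MyJob.java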


hadoop daemonlog

The hadoop daemonlog command gets and sets the log level for each daemon.

Hadoop daemons all produce logfiles that you can use to learn about what is happening on the system. You can use the hadoop daemonlog command to temporarily change the log level of a component when debugging the system.

Syntax

hadoop daemonlog -getlevel | -setlevel <host>:<port> <name> [ <level> ]

Parameters

The following command options are supported for the hadoop daemonlog command:

Parameter Description

-getlevel <host:port> <name> Prints the log level of the daemon running at the specified host and port, by querying

http://<host>:<port>/logLevel?log=<name>

<host>: The host on which to get the log level.
<port>: The port by which to get the log level.
<name>: The daemon on which to get the log level, usually the fully qualified classname of the daemon doing the logging. For example, org.apache.hadoop.mapred.JobTracker for the JobTracker daemon.

-setlevel <host:port> <name> <level>

Sets the log level of the daemon running at the specified host and port, by querying

http://<host>:<port>/logLevel?log=<name>

<host>: The host on which to set the log level.
<port>: The port by which to set the log level.
<name>: The daemon on which to set the log level.
<level>: The log level to set the daemon.

Examples

Getting the log levels of a daemon

To get the log level for each daemon enter a command such as the following:

hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker
Connecting to http://10.250.1.15:50030/logLevel?log=org.apache.hadoop.mapred.JobTracker
Submitted Log Name: org.apache.hadoop.mapred.JobTracker
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: ALL

Setting the log level of a daemon

To temporarily set the log level for a daemon enter a command such as the following:


hadoop daemonlog -setlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker DEBUG
Connecting to http://10.250.1.15:50030/logLevel?log=org.apache.hadoop.mapred.JobTracker&level=DEBUG
Submitted Log Name: org.apache.hadoop.mapred.JobTracker
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG ...
Effective level: DEBUG

Using this method, the log level is automatically reset when the daemon is restarted.

To make the change to log level of a daemon persistent, enter a command such as the following:

hadoop daemonlog -setlevel 10.250.1.15:50030 log4j.logger.org.apache.hadoop.mapred.JobTracker DEBUG


hadoop distcp

The hadoop distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Syntax

hadoop [ Generic Options ] distcp
  <source> <destination>
  [-p [rbugp] ]
  [-i ]
  [-log ]
  [-m ]
  [-overwrite ]
  [-update ]
  [-f <URI list> ]
  [-filelimit <n> ]
  [-sizelimit <n> ]
  [-delete ]

Parameters

Command Options

The following command options are supported for the hadoop distcp command:

Parameter Description

<source> Specify the source URL.

<destination> Specify the destination URL.

-p [rbugp] Preserve
r: replication number
b: block size
u: user
g: group
p: permission

-p alone is equivalent to -prbugp. Modification times are not preserved. When you specify -update, status updates are not synchronized unless the file sizes also differ.

-i Ignore failures. As explained below, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted.

-log <logdir> Write logs to <logdir>. The hadoop distcp command keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed.

-m <num_maps> Maximum number of simultaneous copies. Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. See Map Sizing.

-overwrite Overwrite destination. If a map fails and -i is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in Overwriting Files Between Clusters, it also changes the semantics for generating destination paths, so users should use this carefully.

-update Overwrite if the <source> size is different from the <destination> size. As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. See Updating Files Between Clusters.

-f <URI list> Use list at <URI list> as source list. This is equivalent to listing each source on the command line. The value of <URI list>must be a fully qualified URI.

-filelimit <n> Limit the total number of files to be <= n. See Symbolic Representations.

-sizelimit <n> Limit the total size to be <= n bytes. See Symbolic Representations.


-delete Delete the files existing in the <destination> but not in the <source>. The deletion is done by the FS shell, so the trash will be used if it is enabled.

Generic Options

The hadoop distcp command supports the following generic options: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.

Symbolic Representations

The <n> parameter in -filelimit and -sizelimit can be specified with symbolic representation. For example,

1230k = 1230 * 1024 = 1259520
891g = 891 * 1024^3 = 956703965184

Map Sizing

The hadoop distcp command attempts to size each map comparably so that each copies roughly the same number of bytes. Note that files are the finest level of granularity, so increasing the number of simultaneous copiers (i.e. maps) may not always increase the number of simultaneous copies nor the overall throughput.

If -m is not specified, distcp will attempt to schedule work for min (total_bytes / bytes.per.map, 20 * num_task_trackers) maps, where bytes.per.map defaults to 256MB.
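As a worked illustration of this formula (the numbers are hypothetical): copying 100 GB (102400 MB) on a cluster with 4 tasktrackers schedules min(102400 / 256, 20 * 4) = min(400, 80) = 80 maps.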

Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommendedfor long-running and regularly run jobs.

Examples

Basic inter-cluster copying

The hadoop distcp command is most often used to copy files between clusters:

hadoop distcp maprfs:///mapr/cluster1/foo \
maprfs:///mapr/cluster2/bar

The command in the example expands the namespace under /foo on cluster1 into a temporary file, partitions its contents among a set of map tasks, and starts a copy on each TaskTracker from cluster1 to cluster2. Note that the hadoop distcp command expects absolute paths.

Only those files that do not already exist in the destination are copied over from the source directory.

Updating files between clusters

Use the hadoop distcp -update command to synchronize changes between clusters.

$ hadoop distcp -update maprfs:///mapr/cluster1/foo maprfs:///mapr/cluster2/bar/foo

Files in the /foo subtree are copied from cluster1 to cluster2 only if the size of the source file is different from that of the size of the destination file. Otherwise, the files are skipped over.

Note that using the -update option changes how distributed copy interprets the source and destination paths, making it necessary to add the trailing /foo subdirectory in the second cluster.

Overwriting files between clusters

By default, distributed copy skips files that already exist in the destination directory, but you can overwrite those files using the -overwrite option. In this example, multiple source directories are specified:

$ hadoop distcp -overwrite maprfs:///mapr/cluster1/foo/a \
maprfs:///mapr/cluster1/foo/b \
maprfs:///mapr/cluster2/bar


As with the -update option, using -overwrite changes the way that the source and destination paths are interpreted by distributed copy: the contents of the source directories are compared to the contents of the destination directory. The distributed copy aborts in case of a conflict.

Migrating Data from HDFS to MapR-FS

The hadoop distcp command can be used to migrate data from an HDFS cluster to MapR-FS where the HDFS cluster uses the same version of the RPC protocol as that used by MapR. For a discussion, see Copying Data from Apache Hadoop.

$ hadoop distcp namenode1:50070/foo maprfs:///bar

You must specify the IP address and HTTP port (usually 50070) for the namenode on the HDFS cluster.


hadoop fs

The hadoop fs command runs a generic filesystem user client that interacts with the MapR filesystem (MapR-FS).

Syntax

hadoop [ Generic Options ] fs
  [-cat <src>]
  [-chgrp [-R] GROUP PATH...]
  [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
  [-chown [-R] [OWNER][:[GROUP]] PATH...]
  [-copyFromLocal <localsrc> ... <dst>]
  [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
  [-count [-q] <path>]
  [-cp <src> <dst>]
  [-df <path>]
  [-du <path>]
  [-dus <path>]
  [-expunge]
  [-get [-ignoreCrc] [-crc] <src> <localdst>]
  [-getmerge <src> <localdst> [addnl]]
  [-help [cmd]]
  [-ls <path>]
  [-lsr <path>]
  [-mkdir <path>]
  [-moveFromLocal <localsrc> ... <dst>]
  [-moveToLocal <src> <localdst>]
  [-mv <src> <dst>]
  [-put <localsrc> ... <dst>]
  [-rm [-skipTrash] <src>]
  [-rmr [-skipTrash] <src>]
  [-stat [format] <path>]
  [-tail [-f] <path>]
  [-test -[ezd] <path>]
  [-text <path>]
  [-touchz <path>]

Parameters

Command Options

The following command parameters are supported for hadoop fs:

Parameter Description

-cat <src> Fetch all files that match the file pattern defined by the <src> parameter and display their contents on stdout.

-fs [local | <file system URI>] Specify the file system to use.

If not specified, the current configuration is used, taken from the following, in increasing precedence:

core-default.xml inside the hadoop jar file
core-site.xml in $HADOOP_CONF_DIR

The local option means use the local file system as your DFS.

<file system URI> specifies a particular file system to contact. This argument is optional but if used must appear first on the command line. Exactly one additional argument must be specified.

-ls <path> List the contents that match the specified file pattern. If path is not specified, the contents of /user/<currentUser> will be listed.

Directory entries are of the form dirName (full path) <dir> and file entries are of the form fileName (full path) <r n> size, where n is the number of replicas specified for the file and size is the size of the file, in bytes.


-lsr <path> Recursively list the contents that match the specified file pattern. Behaves very similarly to hadoop fs -ls, except that the data is shown for all the entries in the subtree.

-df [<path>] Shows the capacity, free and used space of the filesystem.

If the filesystem has multiple partitions, and no path to a particular partition is specified, then the status of the root partitions will be shown.

-du <path> Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the Unix command du -sb <path>/* in case of a directory, and to du -b <path> in case of a file.

The output is in the form name(full path) size (in bytes).

-dus <path> Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the Unix command du -sb. The output is in the form name(full path) size (in bytes).

-mv <src> <dst> Move files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, the destination must be a directory.

-cp <src> <dst> Copy files that match the file pattern <src> to a destination. When copying multiple files, the destination must be a directory.

-rm [-skipTrash] <src> Delete all files that match the specified file pattern. Equivalent to the Unix command rm <src>. The -skipTrash option bypasses trash, if enabled, and immediately deletes <src>.

-rmr [-skipTrash] <src> Remove all directories which match the specified file pattern. Equivalent to the Unix command rm -rf <src>. The -skipTrash option bypasses trash, if enabled, and immediately deletes <src>.

-put <localsrc> ... <dst> Copy files from the local file system into fs.

-copyFromLocal <localsrc> ... <dst> Identical to the -put command.

-moveFromLocal <localsrc> ... <dst> Same as -put, except that the source is deleted after it's copied.

-get [-ignoreCrc] [-crc] <src> <localdst> Copy files that match the file pattern <src> to the local name. <src> is kept. When copying multiple files, the destination must be a directory.

-getmerge <src> <localdst> Get all the files in the directories that match the source file pattern and merge and sort them to only one file on local fs. <src> is kept.

-copyToLocal [-ignoreCrc] [-crc] <src> <localdst> Identical to the -get command.

-moveToLocal <src> <localdst> Not implemented yet

-mkdir <path> Create a directory in specified location.

-tail [-f] <file> Show the last 1KB of the file. The -f option shows appended data as the file grows.

-touchz <path> Write a timestamp in yyyy-MM-dd HH:mm:ss format in a file at <path>. An error is returned if the file exists with non-zero length.

-test -[ezd] <path> If the file exists (-e), has zero length (-z), or is a directory (-d), then return 0; else return 1.

-text <src> Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.


-stat [format] <path> Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), filename (%n), block size (%o), replication (%r), modification date (%y, %Y).

-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...

Changes permissions of a file. This works similar to the shell's chmod with a few exceptions.

-R modifies the files recursively. This is the only option currently supported.

MODE is the same as mode used for the shell chmod command. The only letters recognized are rwxXt. That is, +t,a+r,g-w,+rwx,o=r.

OCTALMODE is the mode specified in 3 or 4 digits. If 4 digits, the first may be 1 or 0 to turn the sticky bit on or off, respectively. Unlike the shell command, it is not possible to specify only part of the mode. E.g. 754 is same as u=rwx,g=rx,o=r.

If none of 'augo' is specified, 'a' is assumed and unlike the shell command, no umask is applied.

-chown [-R] [OWNER][:[GROUP]] PATH... Changes owner and group of a file. This is similar to the shell's chown with a few exceptions.

-R modifies the files recursively. This is the only option currently supported.

If only owner or group is specified, then only owner or group is modified. The owner and group names may only consist of digits, alphabet, and any of '-.@/', i.e. [-.@/a-zA-Z0-9]. The names are case-sensitive.

WARNING: Avoid using '.' to separate user name and group though Linux allows it. If user names have dots in them and you are using the local file system, you might see surprising results since the shell chown command is used for local files.

-chgrp [-R] GROUP PATH... This is equivalent to -chown ... :GROUP ...

-count [-q] <path> Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns are:

DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME

or, with -q:

QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME

-help [cmd] Displays help for the given command, or all commands if none is specified.

Generic Options

The following generic options are supported for the hadoop fs command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.
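Examples

The following sketch (the paths and the file name are hypothetical) copies a local file into MapR-FS and inspects the result:

$ hadoop fs -mkdir /user/alice/in
$ hadoop fs -put data.txt /user/alice/in
$ hadoop fs -ls /user/alice/in
$ hadoop fs -cat /user/alice/in/data.txt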


hadoop jar

The hadoop jar command runs a program contained in a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using this command.

Syntax

hadoop jar <jar> [<arguments>]

Parameters

The following command parameters are supported for hadoop jar:

Parameter Description

<jar> The JAR file.

<arguments> Arguments to the program specified in the JAR file.

Examples

Streaming Jobs

Hadoop streaming jobs are run using the hadoop jar command. The Hadoop streaming utility enables you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc

The -input, -output, -mapper, and -reducer streaming command options are all required for streaming jobs. Either an executable or a Java class may be used for the mapper and the reducer. For more information about and examples of streaming jobs, see Streaming examples.

Word Count

The simple Word Count program is another example of a program that is run using the hadoop jar command. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory.

$ hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out


hadoop job

The hadoop job command enables you to manage MapReduce jobs.

Syntax

hadoop job [Generic Options]
  [-submit <job-file>]
  [-status <job-id>]
  [-counter <job-id> <group-name> <counter-name>]
  [-kill <job-id>]
  [-unblacklist <job-id> <hostname>]
  [-set-priority <job-id> <priority>]
  [-events <job-id> <from-event-#> <#-of-events>]
  [-history <jobOutputDir>]
  [-list [all]]
  [-list-active-trackers]
  [-list-blacklisted-trackers]
  [-list-attempt-ids <job-id> <task-type> <task-state>]
  [-kill-task <task-id>]
  [-fail-task <task-id>]
  [-blacklist-tasktracker <hostname>]
  [-showlabels]

Parameters

Command Options

The following command options are supported for hadoop job:

Parameter Description

-submit <job-file> Submits the job.

-status <job-id> Prints the map and reduce completion percentage and all job counters.

-counter <job-id> <group-name> <counter-name> Prints the counter value.

-kill <job-id> Kills the job.

-unblacklist <job-id> <hostname> Removes a tasktracker job from the jobtracker's blacklist.

-set-priority <job-id> <priority> Changes the priority of the job. Valid priority values are VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW.

The job scheduler uses this property to determine the order in which jobs are run.

-events <job-id> <from-event-#> <#-of-events> Prints the events' details received by jobtracker for the given range.

-history <jobOutputDir> Prints job details, failed and killed tip details.

-list [all] The -list all option displays all jobs. The -list command without the all option displays only jobs which are yet to complete.

-list-active-trackers Prints all active tasktrackers.

-list-blacklisted-trackers Prints blacklisted tasktrackers.

-list-attempt-ids <job-id> <task-type> <task-state> Lists the IDs of task attempts.

-kill-task <task-id> Kills the task. Killed tasks are not counted against failed attempts.

-fail-task <task-id> Fails the task. Failed tasks are counted against failed attempts.

-blacklist-tasktracker <hostname> Pauses all current tasktracker jobs and prevents additional jobs from being scheduled on the tasktracker.

-showlabels Dumps label information of all active nodes.


Generic Options

The following generic options are supported for the hadoop job command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.

Examples

Submitting Jobs

The hadoop job -submit command enables you to submit a job to the specified jobtracker.

$ hadoop job -jt darwin:50020 -submit job.xml

Stopping Jobs Gracefully

Use the hadoop job -kill command to stop a running or queued job.

$ hadoop job -kill <job-id>

Viewing Job History Logs

Run the hadoop job -history command to view the history logs summary in the specified directory.

$ hadoop job -history output-dir

This command will print job details, failed and killed tip details.

Additional details about the job, such as successful tasks and task attempts made for each task, can be viewed by adding the -all option:

$ hadoop job -history all output-dir

Blacklisting Tasktrackers

The hadoop job command, when run as root or using sudo, can be used to manually blacklist tasktrackers:

hadoop job -blacklist-tasktracker <hostname>

Manually blacklisting a tasktracker pauses any running jobs and prevents additional jobs from being scheduled. For a detailed discussion, see TaskTracker Blacklisting.


hadoop jobtracker

The hadoop jobtracker command runs the MapReduce jobtracker node.

Syntax

hadoop jobtracker [-dumpConfiguration]

Parameters

The hadoop jobtracker command supports the following command options:

Parameter Description

-dumpConfiguration Dumps the configuration used by the jobtracker, along with the queue configuration, in JSON format to standard output, and exits.
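For example, to capture the dumped configuration for offline inspection (the output file name is arbitrary):

$ hadoop jobtracker -dumpConfiguration > jobtracker-conf.json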


hadoop mfs

The hadoop mfs command performs operations on directories in the cluster. The main purposes of hadoop mfs are to display directory information and contents, to create symbolic links, and to set compression and chunk size on a directory.

Syntax

hadoop mfs
    [ -ln <target> <symlink> ]
    [ -ls <path> ]
    [ -lsd <path> ]
    [ -lsr <path> ]
    [ -Lsr <path> ]
    [ -lsrv <path> ]
    [ -lss <path> ]
    [ -setcompression on|off|lzf|lz4|zlib <dir> ]
    [ -setchunksize <size> <dir> ]
    [ -help <command> ]

Parameters

The normal command syntax is to specify a single option from the following table, along with its corresponding arguments. If compression and chunk size are not set explicitly for a given directory, the values are inherited from the parent directory.

Parameter Description

-ln <target> <symlink>

Creates a symbolic link <symlink> that points to the target path <target>, similar to the standard Linux ln -s command.

-ls <path> Lists files in the directory specified by <path>. The hadoop mfs -ls command corresponds to the standard hadoop fs -ls command, but provides the following additional information:

Chunks used for each file
Server where each chunk resides

-lsd <path> Lists files in the directory specified by <path>, and also provides information about the specified directory itself:

Whether compression is enabled for the directory (indicated by z)
The configured chunk size (in bytes) for the directory

-lsr <path> Lists files in the directory and subdirectories specified by <path>, recursively. The hadoop mfs -lsr command corresponds to the standard hadoop fs -lsr command, but provides the following additional information:

Chunks used for each file
Server where each chunk resides

-Lsr <path> Equivalent to -lsr, but additionally dereferences symbolic links.

-lsrv <path> Lists all paths recursively without crossing volume links.

-lss <path> Lists files in the directory specified by <path>, with an additional column that displays the number of disk blocks per file. Disk blocks are 8192 bytes.

-setcompression on|off|lzf|lz4|zlib <dir>

Turns compression on or off on the directory specified in <dir>, and sets the compression type:

on — turns on compression using the default algorithm (LZ4)
off — turns off compression
lzf — turns on compression and sets the algorithm to LZF
lz4 — turns on compression and sets the algorithm to LZ4
zlib — turns on compression and sets the algorithm to ZLIB

-setchunksize <size> <dir>

Sets the chunk size in bytes for the directory specified in <dir>. The <size> parameter must be a multiple of 65536.

-help <command> Displays help for the hadoop mfs command.


Examples

The hadoop mfs command is used to view file contents. You can use this command to check if compression is turned off in a directory or mounted volume. For example:

# hadoop mfs -ls /
Found 23 items
vrwxr-xr-x Z - root root 13 2012-04-29 10:24 268435456 /.rw
           p mapr.cluster.root writeable 2049.35.16584 -> 2049.16.2 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
vrwxr-xr-x U - root root 7 2012-04-28 22:16 67108864 /hbase
           p mapr.hbase default 2049.32.16578 -> 2050.16.2 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
drwxr-xr-x Z - root root 0 2012-04-29 09:14 268435456 /tmp
           p 2049.41.16596 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660
vrwxr-xr-x Z - root root 1 2012-04-27 22:59 268435456 /user
           p users default 2049.36.16586 -> 2055.16.2 scale-50.scale.lab:5660 scale-52.scale.lab:5660 scale-51.scale.lab:5660
drwxr-xr-x Z - root root 1 2012-04-27 22:37 268435456 /var
           p 2049.33.16580 scale-50.scale.lab:5660 scale-51.scale.lab:5660 scale-52.scale.lab:5660

In the above example, the letter Z indicates LZ4 compression on the directory; the letter U indicates that the directory is uncompressed.
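You can also use hadoop mfs to change the compression and chunk size settings described above. The following sketch assumes a hypothetical directory /projects/data; it enables ZLIB compression and sets a 128 MB chunk size (134217728 bytes, which is a multiple of 65536):

hadoop mfs -setcompression zlib /projects/data
hadoop mfs -setchunksize 134217728 /projects/data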

Output

When used with -ls, -lsd, -lsr, or -lss, hadoop mfs displays information about files and directories. For each file or directory, hadoop mfs displays a line of basic information followed by lines listing the chunks that make up the file, in the following format:

{mode} {compression} {replication} {owner} {group} {size} {date} {chunk size} {name}
    {chunk} {fid} {host} [{host}...]
    {chunk} {fid} {host} [{host}...]
    ...

Volume links are displayed as follows:

{mode} {compression} {replication} {owner} {group} {size} {date} {chunk size} {name}
    {chunk} {target volume name} {writability} {fid} -> {fid} [{host}...]

For volume links, the first fid is the chunk that stores the volume link itself; the fid after the arrow (->) is the first chunk in the target volume.

The following table describes the values:

mode A text string indicating the read, write, and execute permissions for the owner, group, and other permissions. See also Managing Permissions.

compression U — uncompressed
L — LZF
Z (uppercase) — LZ4
z (lowercase) — ZLIB

replication The replication factor of the file (directories display a dash instead)

owner The owner of the file or directory

group The group of the file or directory

size The size of the file or directory

date The date the file or directory was last modified

chunk size The chunk size of the file or directory

name The name of the file or directory

chunk The chunk number. The first chunk is a primary chunk labeled "p", a 64K chunk containing the root of the file. Subsequent chunks are numbered in order.


fid The chunk's file ID, which consists of three parts:

The ID of the container where the file is stored
The inode of the file within the container
An internal version number (a worked example follows this table)

host The host on which the chunk resides. When several hosts are listed, the first host is the first copy of the chunk and subsequent hosts are replicas.

target volume name

The name of the volume pointed to by a volume link.

writability Displays whether the volume is writable.
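As a worked example, consider the fid 2049.35.16584 from the sample listing above: 2049 is the ID of the container where the file is stored, 35 is the inode of the file within that container, and 16584 is the internal version number.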


hadoop mradmin

The hadoop mradmin command runs Map-Reduce administrative commands.

Syntax

hadoop [ Generic Options ] mradmin [-refreshServiceAcl] [-refreshQueues] [-refreshNodes] [-refreshUserToGroupsMappings] [-refreshSuperUserGroupsConfiguration] [-help [cmd]]

Parameters

The following command parameters are supported for hadoop mradmin:

Parameter Description

-refreshServiceAcl Reload the service-level authorization policy file. The job tracker will reload the authorization policy file.

-refreshQueues Reload the queue ACLs and state. The JobTracker will reload the mapred-queues.xml file.

-refreshUserToGroupsMappings Refresh user-to-groups mappings.

-refreshSuperUserGroupsConfiguration Refresh superuser proxy groups mappings.

-refreshNodes Refresh the hosts information at the job tracker.

-help [cmd] Displays help for the given command, or all commands if none is specified.

The following generic options are supported for hadoop mradmin:

Generic Option Description

-conf <configuration file> Specify an application configuration file.

-D <property=value> Use value for given property.

-fs <local|file system URI> Specify a file system.

-jt <local|jobtracker:port> Specify a job tracker.

-files <comma separated list of files> Specify comma separated files to be copied to the map reduce cluster.

-libjars <comma separated list of jars> Specify comma-separated jar files to include in the classpath.

-archives <comma separated list of archives> Specify comma-separated archives to be unarchived on the compute machines.
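For example, after editing the hosts configuration you might refresh the jobtracker's view of the cluster's nodes as follows (a sketch; assumes you run it as a user with administrative privileges on a node that can reach the jobtracker):

hadoop mradmin -refreshNodes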


hadoop pipes

The hadoop pipes command runs a pipes job.

Hadoop Pipes is the C++ interface to Hadoop MapReduce. Hadoop Pipes uses sockets to enable tasktrackers to communicate with processes running the C++ map or reduce functions. See also Compiling Pipes Programs.

Syntax

hadoop [ Generic Options ] pipes
    [ -output <path> ]
    [ -jar <jar file> ]
    [ -inputformat <class> ]
    [ -map <class> ]
    [ -partitioner <class> ]
    [ -reduce <class> ]
    [ -writer <class> ]
    [ -program <executable> ]
    [ -reduces <num> ]

Parameters

Command Options

The following command parameters are supported for :hadoop pipes

Parameter Description

-output <path> Specify the output directory.

-jar <jar file> Specify the jar filename.

-inputformat <class> Specify the InputFormat class.

-map <class> Specify the Java Map class.

-partitioner <class> Specify the Java Partitioner.

-reduce <class> Specify the Java Reduce class.

-writer <class> Specify the Java RecordWriter.

-program <executable> Specify the URI of the executable.

-reduces <num> Specify the number of reduces.
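For example, a pipes job that runs a compiled C++ executable might be submitted as follows (a sketch; the executable path, output directory, and reduce count are hypothetical, and the binary must already exist in the cluster's file system):

hadoop pipes -program /user/mapr/bin/wordcount -output /user/mapr/wordcount-out -reduces 2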

Generic Options

The following generic options are supported for the hadoop pipes command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.


hadoop queue

The hadoop queue command displays job queue information.

Syntax

hadoop [ Generic Options ] queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]

Parameters

Command Options

The hadoop queue command supports the following command options:

Parameter Description

-list Gets the list of job queues configured in the system, along with the scheduling information associated with each job queue.

-info <job-queue-name> [-showJobs]

Displays the job queue information and associated scheduling information of the particular job queue. If the -showJobs option is present, a list of jobs submitted to the particular job queue is displayed.

-showacls Displays the queue name and associated queue operations allowed for the current user. The list consists of only those queues to which the user has access.
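For example, the following sketch lists the configured queues and then shows the jobs submitted to one of them (the queue name default is an assumption; substitute a queue configured on your cluster):

hadoop queue -list
hadoop queue -info default -showJobs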

Generic Options

The following generic options are supported for the hadoop queue command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.


hadoop tasktracker

The hadoop tasktracker command runs a MapReduce tasktracker node.

Syntax

hadoop tasktracker

Output

mapr@mapr-desktop:~$ hadoop tasktracker12/03/21 21:19:56 INFO mapred.TaskTracker: STARTUP_MSG:/************************************************************STARTUP_MSG: Starting TaskTrackerSTARTUP_MSG: host = mapr-desktop/127.0.1.1STARTUP_MSG: args = []STARTUP_MSG: version = 0.20.2-devSTARTUP_MSG: build = -r ; compiled by 'root' on Thu Dec 8 22:43:13 PST 2011************************************************************/12/03/21 21:19:56 INFO mapred.TaskTracker:/*-------------- TaskTracker Properties ----------------Systemjava.runtime.name: Java(TM) SE EnvironmentRuntimesun.boot.library.path: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64java.vm.version: 20.1-b02hadoop.root.logger: INFO,consolejava.vm.vendor: Sun Microsystems Inc.java.vendor.url: http://java.sun.com/path.separator: :java.vm.name: Java HotSpot(TM) 64-Bit Server VMfile.encoding.pkg: sun.iosun.java.launcher: SUN_STANDARDuser.country: USsun.os.patch.level: unknownjava.vm.specification.name: Java Virtual Machine Specificationuser.dir: /home/maprjava.runtime.version: 1.6.0_26-b03java.awt.graphicsenv: sun.awt.X11GraphicsEnvironmentjava.endorsed.dirs: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/endorsedos.arch: amd64java.io.tmpdir: /tmpline.separator:

hadoop.log.file: hadoop.logjava.vm.specification.vendor: Sun Microsystems Inc.os.name: Linuxhadoop.id.str:sun.jnu.encoding: UTF-8java.library.path: /opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/ /Linux-amd64-64:nativehadoop.home.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/..java.specification.name: Java Platform API Specificationjava.class.version: 50.0sun.management.compiler: HotSpot 64-Bit Tiered Compilershadoop.pid.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/../pidsos.version: 2.6.32-33-genericuser.home: /home/mapruser.timezone: America/Los_Angelesjava.awt.printerjob: sun.print.PSPrinterJobfile.encoding: UTF-8java.specification.version: 1.6java.class.path:/opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/lib/jvm/java-6-sun-1.6.0.26/lib/tools.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/b


in/../lib/commons-codec-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-test-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-api-2.1.jaruser.name: maprjava.vm.specification.version: 1.0sun.java.command: org.apache.hadoop.mapred.TaskTrackerjava.home: /usr/lib/jvm/java-6-sun-1.6.0.26/jresun.arch.data.model: 64user.language: enjava.specification.vendor: Sun Microsystems Inc.hadoop.log.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/../logsjava.vm.info: mixed modejava.version: 1.6.0_26java.ext.dirs: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/ext:/usr/java/packages/lib/extsun.boot.class.path:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/modules/jdk.boot.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/classesjava.vendor: Sun Microsystems Inc.file.separator: /java.vendor.url.bug: http://java.sun.com/cgi-bin/bugreport.cgisun.io.unicode.encoding: UnicodeLittlesun.cpu.endian: littlehadoop.policy.file: hadoop-policy.xmlsun.desktop: gnomesun.cpu.isalist:------------------------------------------------------------*/12/03/21 21:19:57 
INFO mapred.TaskTracker: /tmp is not tmpfs or ramfs. Java Hotspot Instrumentationwill be disabled by default12/03/21 21:19:57 INFO mapred.TaskTracker: Cleaning up config files from the job history folder12/03/21 21:19:57 INFO mapred.TaskTracker: TT local config is/opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml12/03/21 21:19:57 INFO mapred.TaskTracker: Loading resource properties file : /opt/mapr//logs/cpu_mem_disk12/03/21 21:19:57 INFO mapred.TaskTracker: Physical memory reserved mapreduce tasks = 2105540608forbytes12/03/21 21:19:57 INFO mapred.TaskTracker: CPUS: 112/03/21 21:19:57 INFO mapred.TaskTracker: Total MEM: 1.9610939GB12/03/21 21:19:57 INFO mapred.TaskTracker: Reserved MEM: 2008MB12/03/21 21:19:57 INFO mapred.TaskTracker: Reserved MEM Ephemeral slots 0for12/03/21 21:19:57 INFO mapred.TaskTracker: DISKS: 212/03/21 21:19:57 INFO mapred.TaskTracker: Map slots 1, Default heapsize map task 873 mbfor12/03/21 21:19:57 INFO mapred.TaskTracker: Reduce slots 1, Default heapsize reduce task 1135 mbfor12/03/21 21:19:57 INFO mapred.TaskTracker: Ephemeral slots 0, memory given each ephemeral slot 200formb12/03/21 21:19:57 INFO mapred.TaskTracker: Prefetch map slots 112/03/21 21:20:07 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via


org.mortbay.log.Slf4jLog12/03/21 21:20:08 INFO http.HttpServer: Added global filtersafety(class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)12/03/21 21:20:08 WARN mapred.TaskTracker: Error writing to TaskController configwhilefilejava.io.FileNotFoundException: /opt/mapr/hadoop/hadoop-0.20.2/bin/../conf/taskcontroller.cfg(Permission denied)12/03/21 21:20:08 ERROR mapred.TaskTracker: Can not start TaskTracker because java.io.IOException:Cannot run program :"/opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/Linux-amd64-64/bin/task-controller"java.io.IOException: error=13, Permission denied at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at org.apache.hadoop.util.Shell.runCommand(Shell.java:267) at org.apache.hadoop.util.Shell.run(Shell.java:249) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442) at org.apache.hadoop.mapred.LinuxTaskController.setup(LinuxTaskController.java:142) at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2149) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5216)Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied at java.lang.UNIXProcess.<init>(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) ... 6 more

12/03/21 21:20:08 INFO mapred.TaskTracker: SHUTDOWN_MSG:/************************************************************SHUTDOWN_MSG: Shutting down TaskTracker at mapr-desktop/127.0.1.1************************************************************/


hadoop version

The hadoop version command prints the hadoop software version.

Syntax

hadoop version

Output

mapr@mapr-desktop:~$ hadoop version
Hadoop 0.20.2-dev
Subversion -r
Compiled by root on Thu Dec 8 22:43:13 PST 2011
From source with checksum 19fa44df0cb831c45ef984f21feb7110


hadoop conf

The hadoop conf command outputs the configuration information for this node to standard output.

Syntax

hadoop [ generic options ] conf [ -dump ] [ -key <parameter name>]

Parameters

Parameter Description

-dump Dumps the entire configuration set to standard output.

-key <parameter name> Displays the configured value for the specified parameter.

Examples

Dumping a node's entire configuration to a text file

hadoop conf -dump > nodeconfiguration.txt

The above command creates a text file named nodeconfiguration.txt that contains the node's configuration information. Using the tail utility to examine the last few lines of the file displays the following information:

[user@hostame:01] tail nodeconfiguration.txt
mapred.merge.recordsBeforeProgress=10000
io.mapfile.bloom.error.rate=0.005
io.bytes.per.checksum=512
mapred.cluster.ephemeral.tasks.memory.limit.mb=200
mapred.fairscheduler.smalljob.max.inputsize=10737418240
ipc.client.tcpnodelay=true
mapreduce.tasktracker.reserved.physicalmemory.mb.low=0.80
fs.s3.sleepTimeSeconds=10
mapred.task.tracker.report.address=127.0.0.1:0
*** MapR Configuration Dump: END ***
[user@hostname:02]

Displaying the configured value of a specific parameter

[user@hostame:01] hadoop conf -key io.bytes.per.checksum
512
[user@hostname:02]

The above command returns 512 as the configured value of the io.bytes.per.checksum parameter.


API Reference

Overview

This guide provides information about the MapR command API. Most commands can be run on the command-line interface (CLI), or by making REST requests programmatically or in a browser. To run CLI commands, use a Client machine or an ssh connection to any node in the cluster. To use the REST interface, make HTTP requests to a node that is running the WebServer service.

Each command reference page includes the command syntax, a table that describes the parameters, and examples of command usage. In each parameter table, required parameters are in bold text. For output commands, the reference pages include tables that describe the output fields. Values that do not apply to particular combinations are marked NA.

REST API Syntax

MapR REST calls use the following format:

https://<host>:<port>/rest/<command>[/<subcommand>...]?<parameters>

Construct the <parameters> list from the required and optional parameters, in the format <parameter>=<value>, separated by the ampersand (&) character. Example:

https://r1n1.qa.sj.ca.us:8443/rest/volume/mount?name=test-volume&path=/test

Values in REST API calls must be URL-encoded. For readability, the values in this document are presented using the actual characters, rather than the URL-encoded versions.

Authentication

To make REST calls using or , provide the username and password.curl wget

Curl Syntax

curl -k -u <username>:<password> https://<host>:<port>/rest/<command>...

Wget Syntax

wget --no-check-certificate --user <username> --password <password> https://<host>:<port>/rest/<command>...
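For example, the following curl sketch retrieves the cluster ACL (the host, port, and mapr:mapr credentials are assumptions; substitute values for your cluster):

curl -k -u mapr:mapr 'https://r1n1.sj.us:8443/rest/acl/show?type=cluster'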

Command-Line Interface (CLI) Syntax

The MapR CLI commands are documented using the following conventions:

[Square brackets] indicate an optional parameter
<Angle brackets> indicate a value to enter

The following syntax example shows that the volume mount command requires the -name parameter, for which you must enter a list of volumes, and all other parameters are optional:

maprcli volume mount
    [ -cluster <cluster> ]
    -name <volume list>
    [ -path <path list> ]

For clarity, the syntax examples show each parameter on a separate line; in practical usage, the command and all parameters and options are typed on a single line. Example:

maprcli volume mount -name test-volume -path /test

Common Parameters


The following parameters are available for many commands in both the REST and command-line contexts.

Parameter Description

cluster The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

zkconnect A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format '<host>[:<port>][,<host>[:<port>]...]'. Default: 'localhost:5181'. In most cases the ZooKeeper connect string can be omitted, but it is useful in certain cases when the CLDB is not running.
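For example, the following sketch uses the cluster parameter to run a command against a cluster other than the one where it is issued (my.cluster.com is a placeholder cluster name):

maprcli alarm list -cluster my.cluster.com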

Common Options

The following options are available for most commands in the command-line context.

Option Description

-noheader When displaying tabular output from a command, omits the header row.

-long Shows the entire value. This is useful when the command response contains complex information. When -long is omitted, complex information is displayed as an ellipsis (...).

-json Displays command output in JSON format. When -json is omitted, the command output is displayed in tabular format.
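For example, the following sketch displays the cluster ACL in JSON rather than the default tabular format (see the acl show command later in this guide):

maprcli acl show -type cluster -json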

Filters

Some MapR CLI commands use filters, which let you specify large numbers of nodes or volumes by matching specified values in specified fields rather than by typing each name explicitly.

Filters use the following format:

[<field><operator>"<value>"]<and|or>[<field><operator>"<value>"] ...

field Field on which to filter.  The field depends on the command with which the filter is used.

operator An operator for that field:

== - Exact match
!= - Does not match
> - Greater than
< - Less than
>= - Greater than or equal to
<= - Less than or equal to

value Value on which to filter. Wildcards (using *) are allowed for operators == and !=. There is a special value all that matches all values.

You can use the wildcard (*) for partial matches. For example, you can display all volumes whose owner is root and whose name begins with test, as follows:

maprcli volume list -filter [n=="test*"]and[on=="root"]

Response

The commands return responses in JSON or in a tabular format. When you run commands from the command line, the response is returned in tabular format unless you specify JSON using the -json option; when you run commands through the REST interface, the response is returned in JSON.

Success

On a successful call, each command returns the error code zero (OK) and any data requested. When JSON output is specified, the data is returned as an array of records along with the status code and the total number of records. In the tabular format, the data is returned as a sequence of rows, each of which contains the fields in the record separated by tabs.


JSON
{
  "status":"OK",
  "total":<number of records>,
  "data":[
    {
      <record>
    }
    ...
  ]
}

Tabular
status
0

Or

<heading> <heading> <heading> ...
<field> <field> <field> ...
...

Error

When an error occurs, the command returns the error code and a descriptive message.

JSON
{
  "status":"ERROR",
  "errors":[
    {
      "id":<error code>,
      "desc":"<command>: <error message>"
    }
  ]
}

Tabular
ERROR (<error code>) - <command>: <error message>


acl

The acl commands let you work with access control lists (ACLs):

acl edit - modifies a specific user's access to a cluster or volume
acl set - modifies the ACL for a cluster or volume
acl show - displays the ACL associated with a cluster or volume

In order to use the acl edit command, you must have full control (fc) permission on the cluster or volume for which you are running the command.

Specifying Permissions

Specify permissions for a user or group with a string that lists the permissions for that user or group. To specify permissions for multiple users or groups, use a string for each, separated by spaces. The format is as follows:

Users - <user>:<action>[,<action>...][ <user>:<action>[,<action>...]]
Groups - <group>:<action>[,<action>...][ <group>:<action>[,<action>...]]

The following tables list the permission codes used by the acl commands.

Cluster Permission Codes

Code Allowed Action Includes

login Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes

cv

ss Start/stop services  

cv Create volumes  

a Admin access All permissions except fc

fc Full control (administrative access and permission to change the cluster ACL) a
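For example, the permission string jsmith:login,cv rjones:fc (a sketch with hypothetical user names) grants jsmith log-in and volume-creation rights, while giving rjones full control, including the ability to change the cluster ACL.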

Volume Permission Codes

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)


acl edit

The acl edit command grants one or more specific volume or cluster permissions to a user. To use the acl edit command, you must have full control (fc) permissions on the volume or cluster for which you are running the command.

The permissions are specified as a comma-separated list of permission codes. See acl. You must specify either a user or a group. When the type is volume, a volume name must be specified using the name parameter.

Syntax

CLI
maprcli acl edit
    [ -cluster <cluster name> ]
    [ -group <group> ]
    [ -name <name> ]
    -type cluster|volume
    [ -user <user> ]

REST
http[s]://<host>:<port>/rest/acl/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

group Groups and allowed actions for each group. See acl. Format: <group>:<action>[,<action>...][ <group>:<action>[,<action>...]]

name The object name.

type The object type (cluster or volume).

user Users and allowed actions for each user. See acl. Format: <user>:<action>[,<action>...][ <user>:<action>[,<action>...]]

Examples

Give the user jsmith dump, restore, and delete permissions for "test-volume":

CLI
maprcli acl edit -type volume -name test-volume -user jsmith:dump,restore,d


acl set

The acl set command specifies the entire ACL for a cluster or volume. Any previous permissions are overwritten by the new values, and any permissions omitted are removed. To use the acl set command, you must have full control (fc) permissions on the volume or cluster for which you are running the command.

The permissions are specified as a comma-separated list of permission codes. See acl. You must specify either a user or a group. When the type is volume, a volume name must be specified using the name parameter.

The acl set command removes any previous ACL values. If you wish to preserve some of the permissions, you should either use the acl edit command instead of acl set, or use acl show to list the values before overwriting them.

Syntax

CLI
maprcli acl set
    [ -cluster <cluster name> ]
    [ -group <group> ]
    [ -name <name> ]
    -type cluster|volume
    [ -user <user> ]

REST
http[s]://<host>:<port>/rest/acl/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

group Groups and allowed actions for each group. See acl. Format: <group>:<action>[,<action>...][ <group>:<action>[,<action>...]]

name The object name.

type The object type (cluster or volume).

user Users and allowed actions for each user. See acl. Format: <user>:<action>[,<action>...][ <user>:<action>[,<action>...]]

Examples

Give the user root full control of the cluster my.cluster.com and remove all permissions for all other users:

CLI
maprcli acl set -type cluster -cluster my.cluster.com -user root:fc

Usage Example


# maprcli acl show -type cluster
Principal       Allowed actions
User root       [login, ss, cv, a, fc]
User lfedotov   [login, ss, cv, a, fc]
User mapr       [login, ss, cv, a, fc]

# maprcli acl set -type cluster -cluster my.cluster.com -user root:fc
# maprcli acl show -type cluster
Principal       Allowed actions
User root       [login, ss, cv, a, fc]

Notice that the specified permissions have overwritten the existing ACL.

Give multiple users specific permissions for the volume test-volume and remove all permissions for all other users:

CLI
maprcli acl set -type volume -name test-volume -user jsmith:dump,restore,m rjones:fc


acl show

Displays the ACL associated with an object (cluster or a volume). An ACL contains the list of users who can perform specific actions.

Syntax

CLI
maprcli acl show
    [ -cluster <cluster> ]
    [ -group <group> ]
    [ -name <name> ]
    [ -output long|short|terse ]
    [ -perm ]
    -type cluster|volume
    [ -user <user> ]

REST
http[s]://<host>:<port>/rest/acl/show?<parameters>

Parameters

Parameter Description

cluster The name of the cluster on which to run the command

group The group for which to display permissions

name The cluster or volume name

output The output format:

long
short
terse

perm When this option is specified, acl show displays the permissions available for the object type specified in the type parameter.

type Cluster or volume.

user The user for which to display permissions

Output

The actions that each user or group is allowed to perform on the cluster or the specified volume. For information about each allowed action, see acl.

Principal    Allowed actions
User root    [r, ss, cv, a, fc]
Group root   [r, ss, cv, a, fc]
All users    [r]

Examples

Show the ACL for "test-volume":


CLI
maprcli acl show -type volume -name test-volume

Show the permissions that can be set on a cluster:

CLI
maprcli acl show -type cluster -perm


alarm

The alarm commands perform functions related to system alarms:

alarm clear - clears one or more alarms
alarm clearall - clears all alarms
alarm config load - displays the email addresses to which alarm notifications are to be sent
alarm config save - saves changes to the email addresses to which alarm notifications are to be sent
alarm list - displays alarms on the cluster
alarm names - displays all alarm names
alarm raise - raises a specified alarm

Alarm Notification Fields

The following fields specify the configuration of alarm notifications.

Field Description

alarm The named alarm.

individual Specifies whether individual alarm notifications are sent to the default email address for the alarm type.

0 - do not send notifications to the default email address for the alarm type
1 - send notifications to the default email address for the alarm type

email A custom email address for notifications about this alarm type. If specified, alarm notifications are sent to this email address, regardless of whether they are sent to the default email address.

Alarm Types

See Alarms Reference.

Alarm History

To see a history of alarms that have been raised, look at the /opt/mapr/logs/cldb.log file on the master CLDB node. Example:

grep ALARM /opt/mapr/logs/cldb.log


alarm clear

Clears one or more alarms. Permissions required: fc or a.

Syntax

CLI
maprcli alarm clear
    -alarm <alarm>
    [ -cluster <cluster> ]
    [ -entity <host, volume, user, or group name> ]

REST
http[s]://<host>:<port>/rest/alarm/clear?<parameters>

Parameters

Parameter Description

alarm The named alarm to clear. See Alarm Types.

cluster The cluster on which to run the command.

entity The entity on which to clear the alarm.

Examples

Clear a specific alarm:

CLI
maprcli alarm clear -alarm NODE_ALARM_DEBUG_LOGGING

REST
https://r1n1.sj.us:8443/rest/alarm/clear?alarm=NODE_ALARM_DEBUG_LOGGING


alarm clearall

Clears all alarms. Permissions required: fc or a.

Syntax

CLI
maprcli alarm clearall
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/alarm/clearall?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

Examples

Clear all alarms:

CLI
maprcli alarm clearall

REST
https://r1n1.sj.us:8443/rest/alarm/clearall


alarm config load

Displays the configuration of alarm notifications. Permissions required: fc or a.

Syntax

CLI
maprcli alarm config load
    [ -cluster <cluster> ]
    [ -output terse|verbose ]

REST
http[s]://<host>:<port>/rest/alarm/config/load

Parameters

Parameter Description

cluster The cluster on which to run the command.

output Whether the output should be terse or verbose.

Output

A list of configuration values for alarm notifications.

Output Fields

See Alarm Notification Fields.

Sample output


alarm                                   individual  email
CLUSTER_ALARM_BLACKLIST_TTS             1
CLUSTER_ALARM_UPGRADE_IN_PROGRESS       1
CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS    1
VOLUME_ALARM_SNAPSHOT_FAILURE           1
VOLUME_ALARM_MIRROR_FAILURE             1
VOLUME_ALARM_DATA_UNDER_REPLICATED      1
VOLUME_ALARM_DATA_UNAVAILABLE           1
VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED    1
VOLUME_ALARM_QUOTA_EXCEEDED             1
NODE_ALARM_CORE_PRESENT                 1
NODE_ALARM_DEBUG_LOGGING                1
NODE_ALARM_DISK_FAILURE                 1
NODE_ALARM_OPT_MAPR_FULL                1
NODE_ALARM_VERSION_MISMATCH             1
NODE_ALARM_TIME_SKEW                    1
NODE_ALARM_SERVICE_CLDB_DOWN            1
NODE_ALARM_SERVICE_FILESERVER_DOWN      1
NODE_ALARM_SERVICE_JT_DOWN              1
NODE_ALARM_SERVICE_TT_DOWN              1
NODE_ALARM_SERVICE_HBMASTER_DOWN        1
NODE_ALARM_SERVICE_HBREGION_DOWN        1
NODE_ALARM_SERVICE_NFS_DOWN             1
NODE_ALARM_SERVICE_WEBSERVER_DOWN       1
NODE_ALARM_SERVICE_HOSTSTATS_DOWN       1
NODE_ALARM_ROOT_PARTITION_FULL          1
AE_ALARM_AEADVISORY_QUOTA_EXCEEDED      1
AE_ALARM_AEQUOTA_EXCEEDED               1

Examples

Display the alarm notification configuration:

CLI
maprcli alarm config load

REST
https://r1n1.sj.us:8443/rest/alarm/config/load


alarm config save

Sets notification preferences for alarms. Permissions required: fc or a.

Alarm notifications can be sent to the default email address and a specific email address for each named alarm. If individual is set to 1 for a specific alarm, then notifications for that alarm are sent to the default email address for the alarm type. If a custom email address is provided, notifications are sent there regardless of whether they are also sent to the default email address.

Syntax

CLI
maprcli alarm config save
    [ -cluster <cluster> ]
    -values <values>

REST
http[s]://<host>:<port>/rest/alarm/config/save?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

values A comma-separated list of configuration values for one or more alarms, in the following format:

<alarm>,<individual>,<email>

See Alarm Notification Fields.

Examples

Send alert emails for the AE_ALARM_AEQUOTA_EXCEEDED alarm to the default email address and a custom email address:

CLI
maprcli alarm config save -values "AE_ALARM_AEQUOTA_EXCEEDED,1,[email protected]"

REST
https://r1n1.sj.us:8443/rest/alarm/config/save?values=AE_ALARM_AEQUOTA_EXCEEDED,1,[email protected]


alarm list

Lists alarms in the system. Permissions required: fc or a.

You can list all alarms, alarms by type (Cluster, Node or Volume), or alarms on a particular node or volume. To retrieve a count of all alarm types, pass 1 in the summary parameter. You can specify the alarms to return by filtering on type and entity. Use start and limit to retrieve only a specified window of data.

Syntax

CLI
maprcli alarm list
    [ -alarm <alarm ID> ]
    [ -cluster <cluster> ]
    [ -entity <host or volume> ]
    [ -limit <limit> ]
    [ -output (terse|verbose) ]
    [ -start <offset> ]
    [ -summary (0|1) ]
    [ -type <alarm type> ]

REST
http[s]://<host>:<port>/rest/alarm/list?<parameters>

Parameters

Parameter Description

alarm The alarm type to return. See Alarm Types.

cluster The cluster on which to list alarms.

entity The name of the cluster, node, volume, user, or group to check for alarms.

limit The number of records to retrieve. Default: 2147483647

output Whether the output should be terse or verbose.

start The list offset at which to start.

summary Specifies the type of data to return:

1 = count by alarm type
0 = list of alarms

type The entity type:

cluster
node
volume
ae

Output

Information about one or more named alarms on the cluster, or for a specified node, volume, user, or group.

Output Fields


Field Description

alarm state State of the alarm:

0 = Clear
1 = Raised

description A description of the condition that raised the alarm

entity The name of the volume, node, user, or group.

alarm name The name of the alarm.

alarm statechange time The date and time the alarm was most recently raised.

Sample Output

alarm state  description                                                 entity                               alarm name                          alarm statechange time
1            Volume desired replication is 1, current replication is 0  mapr.qa-node173.qa.prv.local.logs    VOLUME_ALARM_DATA_UNDER_REPLICATED  1296707707872
1            Volume data unavailable                                    mapr.qa-node173.qa.prv.local.logs    VOLUME_ALARM_DATA_UNAVAILABLE       1296707707871
1            Volume desired replication is 1, current replication is 0  mapr.qa-node235.qa.prv.local.mapred  VOLUME_ALARM_DATA_UNDER_REPLICATED  1296708283355
1            Volume data unavailable                                    mapr.qa-node235.qa.prv.local.mapred  VOLUME_ALARM_DATA_UNAVAILABLE       1296708283099
1            Volume desired replication is 1, current replication is 0  mapr.qa-node175.qa.prv.local.logs    VOLUME_ALARM_DATA_UNDER_REPLICATED  1296706343256

Examples

List a summary of all alarms

CLI
maprcli alarm list -summary 1

REST
https://r1n1.sj.us:8443/rest/alarm/list?summary=1

List cluster alarms

CLI
maprcli alarm list -type 0

REST
https://r1n1.sj.us:8443/rest/alarm/list?type=0


alarm names

Displays a list of alarm names. Permissions required: fc or a.

Syntax

CLI
maprcli alarm names

REST
http[s]://<host>:<port>/rest/alarm/names

Examples

Display all alarm names:

CLI
maprcli alarm names

REST
https://r1n1.sj.us:8443/rest/alarm/names


alarm raise

Raises a specified alarm or alarms. Permissions required: fc or a.

Syntax

CLI
maprcli alarm raise
    -alarm <alarm>
    [ -cluster <cluster> ]
    [ -description <description> ]
    [ -entity <cluster, entity, host, node, or volume> ]

REST
http[s]://<host>:<port>/rest/alarm/raise?<parameters>

Parameters

Parameter Description

alarm The alarm type to raise. See Alarm Types.

cluster The cluster on which to run the command.

description A brief description.

entity The entity on which to raise alarms.

Examples

Raise a specific alarm:

CLI
maprcli alarm raise -alarm NODE_ALARM_DEBUG_LOGGING

REST
https://r1n1.sj.us:8443/rest/alarm/raise?alarm=NODE_ALARM_DEBUG_LOGGING


config

The config commands let you work with configuration values for the MapR cluster:

config load - displays the values
config save - makes changes to the stored values

Configuration Fields

Field Default Value Description

cldb.balancer.disk.max.switches.in.nodes.percentage 10  

cldb.balancer.disk.paused 1  

cldb.balancer.disk.sleep.interval.sec 2 * 60  

cldb.balancer.disk.threshold.percentage 70  

cldb.balancer.logging 0  

cldb.balancer.role.max.switches.in.nodes.percentage 10  

cldb.balancer.role.paused 1  

cldb.balancer.role.sleep.interval.sec 15 * 60  

cldb.balancer.startup.interval.sec 30 * 60  

cldb.cluster.almost.full.percentage 90 The percentage at which the CLUSTER_ALARM_CLUSTER_ALMOST_FULL alarm is triggered.

cldb.container.alloc.selector.algo 0  

cldb.container.assign.buffer.sizemb 1 * 1024  

cldb.container.create.diskfull.threshold 80  

cldb.container.sizemb 16 * 1024  

cldb.default.chunk.sizemb 256  

cldb.default.volume.topology   The default topology for new volumes.

cldb.dialhome.metrics.rotation.period 365  

cldb.fileserver.activityreport.interval.hb.multiplier 3  

cldb.fileserver.containerreport.interval.hb.multiplier 1800  

cldb.fileserver.heartbeat.interval.sec 1  

cldb.force.master.for.container.minutes 1  

cldb.fs.mark.inactive.sec 5 * 60  

cldb.fs.mark.rereplicate.sec 60 * 60 The number of seconds a node can fail to heartbeat before it is considered dead. Once a node is considered dead, the CLDB re-replicates any data contained on the node.

cldb.fs.workallocator.num.volume.workunits 20  

cldb.fs.workallocator.num.workunits 80  

cldb.ganglia.cldb.metrics 0  

cldb.ganglia.fileserver.metrics 0  

cldb.heartbeat.monitor.sleep.interval.sec 60  

cldb.log.fileserver.timeskew.interval.mins 60  

cldb.max.parallel.resyncs.star 2  


cldb.min.containerid 1  

cldb.min.fileservers 1 The minimum number of fileservers.

cldb.min.snap.containerid 1  

cldb.min.snapid 1  

cldb.replication.manager.start.mins 15 The delay between CLDB startup and replication manager startup, to allow all nodes to register and heartbeat.

cldb.replication.process.num.containers 60  

cldb.replication.sleep.interval.sec 15  

cldb.replication.tablescan.interval.sec 2 * 60  

cldb.restart.wait.time.sec 180  

cldb.snapshots.inprogress.cleanup.minutes 30  

cldb.topology.almost.full.percentage 90  

cldb.volume.default.replication   The default replication for the CLDB volumes.

cldb.volume.epoch    

cldb.volumes.default.min.replication 2  

cldb.volumes.default.replication 3  

mapr.domainname   The domain name MapR uses to get operating system users and groups (in domain mode).

mapr.entityquerysource   Sets MapR to get user information from LDAP (LDAP mode) or from the operating system of a domain (domain mode):

ldap
domain

mapr.eula.user    

mapr.eula.time    

mapr.fs.nocompression "bz2,gz,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"

The file types that should not be compressed. See Not Compressed Extensions.

mapr.fs.permissions.supergroup   The super group of the MapR-FS layer.

mapr.fs.permissions.superuser   The super user of the MapR-FS layer.

mapr.ldap.attribute.group   The LDAP server group attribute.

mapr.ldap.attribute.groupmembers   The LDAP server groupmembers attribute.

mapr.ldap.attribute.mail   The LDAP server mail attribute.

mapr.ldap.attribute.uid   The LDAP server uid attribute.

mapr.ldap.basedn   The LDAP server Base DN.

mapr.ldap.binddn   The LDAP server Bind DN.

mapr.ldap.port   The port MapR is to use on the LDAP server.

mapr.ldap.server   The LDAP server MapR uses to get users and groups (in LDAPmode).

mapr.ldap.sslrequired   Specifies whether the LDAP server requires SSL:

0 == no
1 == yes


mapr.license.exipry.notificationdays 30  

mapr.quota.group.advisorydefault   The default group advisory quota; see Managing Quotas.

mapr.quota.group.default   The default group quota; see Managing Quotas.

mapr.quota.user.advisorydefault   The default user advisory quota; see Managing Quotas.

mapr.quota.user.default   The default user quota; see Managing Quotas.

mapr.smtp.port   The port MapR uses on the SMTP server (1-65535).

mapr.smtp.sender.email   The reply-to email address MapR uses when sending notifications.

mapr.smtp.sender.fullname   The full name MapR uses in the Sender field when sending notifications.

mapr.smtp.sender.password   The password MapR uses to log in to the SMTP server when sending notifications.

mapr.smtp.sender.username   The username MapR uses to log in to the SMTP server when sending notifications.

mapr.smtp.server   The SMTP server that MapR uses to send notifications.

mapr.smtp.sslrequired   Specifies whether SSL is required when sending email:

0 == no
1 == yes

mapr.targetversion    

mapr.webui.http.port   The port MapR uses for the MapR Control System over HTTP (0-65535); if 0 is specified, disables HTTP access.

mapr.webui.https.certpath   The HTTPS certificate path.

mapr.webui.https.keypath   The HTTPS key path.

mapr.webui.https.port   The port MapR uses for the MapR Control System over HTTPS (0-65535); if 0 is specified, disables HTTPS access.

mapr.webui.timeout   The number of seconds the MapR Control System allows to elapse before timing out.

mapreduce.cluster.permissions.supergroup   The super group of the MapReduce layer.

mapreduce.cluster.permissions.superuser   The super user of the MapReduce layer.


config load

Displays information about the cluster configuration. You can use the keys parameter to specify which information to display.

Syntax

CLI
maprcli config load
    [ -cluster <cluster> ]
    -keys <keys>

REST
http[s]://<host>:<port>/rest/config/load?<parameters>

Parameters

Parameter Description

cluster The cluster for which to display values.

keys The fields for which to display values; see the Configuration Fields table.

Output

Information about the cluster configuration. See the Configuration Fields table.

Sample Output

{
    "status":"OK",
    "total":1,
    "data":[
        {
            "mapr.webui.http.port":"8080",
            "mapr.fs.permissions.superuser":"root",
            "mapr.smtp.port":"25",
            "mapr.fs.permissions.supergroup":"supergroup"
        }
    ]
}

Examples

Display several keys:

CLI
maprcli config load -keys mapr.webui.http.port,mapr.webui.https.port,mapr.webui.https.keystorepath,mapr.webui.https.keystorepassword,mapr.webui.https.keypassword,mapr.webui.timeout


REST
https://r1n1.sj.us:8443/rest/config/load?keys=mapr.webui.http.port,mapr.webui.https.port,mapr.webui.https.keystorepath,mapr.webui.https.keystorepassword,mapr.webui.https.keypassword,mapr.webui.timeout
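For scripted access, the same REST endpoint can be called with a standard HTTP client such as curl. The sketch below is illustrative only; the host, username, and password are placeholders for your own values.

# Query two configuration fields over REST (hypothetical host and credentials).
# -k skips certificate verification for a self-signed HTTPS certificate.
curl -k -u mapr:secret \
  "https://r1n1.sj.us:8443/rest/config/load?keys=mapr.webui.http.port,mapr.webui.timeout"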


config save

Saves configuration information, specified as key/value pairs. Permissions required: fc or a.

See the Configuration Fields table.

Syntax

CLI
maprcli config save
    [ -cluster <cluster> ]
    -values <values>

REST
http[s]://<host>:<port>/rest/config/save?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

values A JSON object containing configuration fields; see the Configuration Fields table.

Examples

Configure MapR SMTP settings:

CLI
maprcli config save -values '{"mapr.smtp.provider":"gmail","mapr.smtp.server":"smtp.gmail.com","mapr.smtp.sslrequired":"true","mapr.smtp.port":"465","mapr.smtp.sender.fullname":"Ab Cd","mapr.smtp.sender.email":"[email protected]","mapr.smtp.sender.username":"[email protected]","mapr.smtp.sender.password":"abc"}'

REST
https://r1n1.sj.us:8443/rest/config/save?values={"mapr.smtp.provider":"gmail","mapr.smtp.server":"smtp.gmail.com","mapr.smtp.sslrequired":"true","mapr.smtp.port":"465","mapr.smtp.sender.fullname":"AbCd","mapr.smtp.sender.email":"[email protected]","mapr.smtp.sender.username":"[email protected]","mapr.smtp.sender.password":"abc"}
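A quick way to confirm that a setting took effect is to read the field back with config load. A minimal sketch using fields from the Configuration Fields table:

# Set the MapR Control System timeout to 1800 seconds
maprcli config save -values '{"mapr.webui.timeout":"1800"}'

# Read the field back to verify the new value
maprcli config load -keys mapr.webui.timeout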


dashboard

The dashboard info command displays a summary of information about the cluster.


dashboard info

Displays a summary of information about the cluster. For best results, use the -json option when running dashboard info from the command line.

Syntax

CLI
maprcli dashboard info
    [ -cluster <cluster> ]
    [ -multi_cluster_info true|false. default: false ]
    [ -version true|false. default: false ]
    [ -zkconnect <ZooKeeper connect string> ]

REST
http[s]://<host>:<port>/rest/dashboard/info?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

multi_cluster_info Specifies whether to display cluster information from multiple clusters.

version Specifies whether to display the version.

zkconnect ZooKeeper Connect String

Output

A summary of information about the services, volumes, mapreduce jobs, health, and utilization of the cluster.

Output Fields

Field Description

Timestamp The time at which the dashboard info data was retrieved, expressed as a Unix epoch time.

Status The success status of the dashboard info command.

Total The number of clusters for which data was queried in the dashboard info command.

Version The MapR software version running on the cluster.

Cluster The following information about the cluster:

name — the cluster name
ip — the IP address of the active CLDB
id — the cluster ID


services The number of active, stopped, failed, and total installed services on the cluster:

CLDB
File server
Job tracker
Task tracker
HB master
HB region server

volumes The number and size (in GB) of volumes that are:

Mounted
Unmounted

mapreduce The following mapreduce information:

Queue time
Running jobs
Queued jobs
Running tasks
Blacklisted jobs

maintenance The following information about system health:

Failed disk nodes
Cluster alarms
Node alarms
Versions

utilization The following utilization information:

CPU
Memory
Disk space
Compression

Sample Output

# maprcli dashboard info -json
{
    "timestamp":1336760972531,
    "status":"OK",
    "total":1,
    "data":[
        {
            "version":"2.0.0",
            "cluster":{
                "name":"mega-cluster",
                "ip":"192.168.50.50",
                "id":"7140172612740778586"
            },
            "volumes":{
                "mounted":{
                    "total":76,
                    "size":88885376
                },
                "unmounted":{
                    "total":1,
                    "size":6
                }
            },
            "utilization":{
                "cpu":{
                    "util":14,
                    "total":528,
                    "active":75
                },
                "memory":{
                    "total":2128177,
                    "active":896194
                },
                "disk_space":{
                    "total":707537,
                    "active":226848
                },
                "compression":{
                    "compressed":86802,
                    "uncompressed":116655
                }
            },
            "services":{
                "fileserver":{
                    "active":22,
                    "stopped":0,
                    "failed":0,
                    "total":22
                },
                "nfs":{
                    "active":1,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "webserver":{
                    "active":1,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "cldb":{
                    "active":1,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "tasktracker":{
                    "active":21,
                    "stopped":0,
                    "failed":0,
                    "total":21
                },
                "jobtracker":{
                    "active":1,
                    "standby":0,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "hoststats":{
                    "active":22,
                    "stopped":0,
                    "failed":0,
                    "total":22
                }
            },
            "mapreduce":{
                "running_jobs":1,
                "queued_jobs":0,
                "running_tasks":537,
                "blacklisted":0
            }
        }
    ]
}

Examples

Display dashboard information:

CLI
maprcli dashboard info -json

REST
https://r1n1.sj.us:8443/rest/dashboard/info
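Because the -json option returns well-formed JSON, the output can be piped into a JSON processor for scripting. A minimal sketch, assuming the jq utility is installed:

# Print the cluster name and the IP address of the active CLDB
maprcli dashboard info -json | jq -r '.data[0].cluster | "\(.name) \(.ip)"'

# Print the number of failed fileserver services
maprcli dashboard info -json | jq '.data[0].services.fileserver.failed'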


dialhome

The dialhome commands let you change the Dial Home status of your cluster:

dialhome ackdial - acknowledges a successful Dial Home transmission.
dialhome enable - enables or disables Dial Home.
dialhome lastdialed - displays the last Dial Home transmission.
dialhome metrics - displays the metrics collected by Dial Home.
dialhome status - displays the current Dial Home status.


dialhome ackdial

Acknowledges the most recent Dial Home on the cluster. Permissions required: fc or a.

Syntax

CLI
maprcli dialhome ackdial
    [ -forDay <date> ]

REST
http[s]://<host>:<port>/rest/dialhome/ackdial[?parameters]

Parameters

Parameter Description

forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in MM/DD/YY format. Default: yesterday.

Examples

Acknowledge Dial Home:

CLI
maprcli dialhome ackdial

REST
https://r1n1.sj.us:8443/rest/dialhome/ackdial


dialhome enable

Enables or disables Dial Home on the cluster. Permissions required: fc or a.

Syntax

CLI
maprcli dialhome enable -enable 0|1

REST
http[s]://<host>:<port>/rest/dialhome/enable

Parameters

Parameter Description

enable Specifies whether to enable or disable Dial Home:

0 - Disable
1 - Enable

Output

A success or failure message.

Sample output

pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome enable -enable 1
Successfully enabled dialhome

pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome status
Dial home status is: enabled

Examples

Enable Dial Home:

CLI
maprcli dialhome enable -enable 1

REST
https://r1n1.sj.us:8443/rest/dialhome/enable?enable=1
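The enable and status commands can be combined in a short script that turns Dial Home on only when it is currently off. A sketch; the grep pattern assumes the "enabled 1" output format shown under dialhome status:

# Enable Dial Home if it is not already enabled
if ! maprcli dialhome status | grep -q "enabled 1"; then
    maprcli dialhome enable -enable 1
fi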


dialhome lastdialed

Displays the date of the last successful Dial Home call. Permissions required: fc or a.

Syntax

CLI
maprcli dialhome lastdialed

REST
http[s]://<host>:<port>/rest/dialhome/lastdialed

Output

The date of the last successful Dial Home call.

Sample output

$ maprcli dialhome lastdialed
date
1322438400000

Examples

Show the date of the most recent Dial Home:

CLI
maprcli dialhome lastdialed

REST
https://r1n1.sj.us:8443/rest/dialhome/lastdialed


dialhome metrics

Returns a compressed metrics object. Permissions required: fc or a.

Syntax

CLI
maprcli dialhome metrics
    [ -forDay <date> ]

REST
http[s]://<host>:<port>/rest/dialhome/metrics

Parameters

Parameter Description

forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in MM/DD/YY format. Default: yesterday.

Output

Sample output

$ maprcli dialhome metrics
metrics
[B@48067064

Examples

Show the Dial Home metrics:

CLI
maprcli dialhome metrics

REST
https://r1n1.sj.us:8443/rest/dialhome/metrics


dialhome status

Displays the Dial Home status. Permissions required: fc or a.

Syntax

CLI
maprcli dialhome status

REST
http[s]://<host>:<port>/rest/dialhome/status

Output

The current Dial Home status.

Sample output

$ maprcli dialhome status
enabled
1

Examples

Display the Dial Home status:

CLI
maprcli dialhome status

REST
https://r1n1.sj.us:8443/rest/dialhome/status


disk

The disk commands let you work with disks:

disk add adds a disk to a node
disk list lists disks
disk listall lists all disks
disk remove removes a disk from a node

Disk Fields

The following table shows the fields displayed in the output of the disk list and disk listall commands. You can choose which fields (columns) to display and sort in ascending or descending order by any single field.

Field Description

hn Hostname of the node that owns this disk or partition.

n Name of the disk or partition.

st Disk status:

0 = Good
1 = Bad disk

pst Disk power status:

0 = Active/idle (normal operation)
1 = Standby (low power mode)
2 = Sleeping (lowest power mode, drive is completely shut down)

mt Disk mount status:

0 = unmounted
1 = mounted

fs File system type

mn Model number

sn Serial number

fw Firmware version

ven Vendor name

dst Total disk space, in MB

dsu Disk space used, in MB

dsa Disk space available, in MB

err Disk error message, in English. Note that this will not be translated. Only sent if st == 1.

ft Disk failure time, MapR disks only. Only sent if st == 1.


disk add

Adds one or more disks to the specified node. Permissions required: or fc a

If you are running MapR 1.2.2 or earlier, do not use the disk add command or the MapR Control System to add disks to MapR-FS. You must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the disk add command):

1. Use the MapR Control System to remove the failed disk. All other disks in the same storage pool are removed at the same time. Make a note of which disks have been removed.
2. Create a text file /tmp/disks.txt containing a list of the disks you just removed. See Setting Up Disks for MapR.
3. Add the disks to MapR-FS by typing the following command (as root or with sudo): /opt/mapr/server/disksetup -F /tmp/disks.txt
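Steps 2 and 3 of the procedure above can be expressed as shell commands. This is a sketch only; the device names are hypothetical placeholders for the disks you actually removed in step 1:

# Step 2: list the removed disks, one per line
cat > /tmp/disks.txt <<EOF
/dev/sdc
/dev/sdd
EOF

# Step 3: add the listed disks back to MapR-FS (run as root or with sudo)
sudo /opt/mapr/server/disksetup -F /tmp/disks.txt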

Syntax

CLI
maprcli disk add
    [ -cluster ]
    -disks <disk names>
    -host <host>

REST
http[s]://<host>:<port>/rest/disk/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to add disks.

disks A comma-separated list of disk names. Examples:

["disk"]
["disk","disk","disk"...]

host The hostname or IP address of the machine on which to add the disk.

Output

Output Fields

Field Description

ip The IP address of the machine that owns the disk(s).

disk The name of a disk or partition. Example: "sca" or "sca/sca1"

all The string all, meaning all unmounted disks for this node.


Examples

Add a disk:

CLI
maprcli disk add -disks ["/dev/sda1"] -host 10.250.1.79

REST
https://r1n1.sj.us:8443/rest/disk/add?disks=["/dev/sda1"]


disk list

The maprcli disk list command lists the disks on a node.

Syntax

CLI
maprcli disk list
    -host <host>
    [ -output terse|verbose ]
    [ -system 1|0 ]

REST
http[s]://<host>:<port>/rest/disk/list?<parameters>

Parameters

Parameter Description

host The node on which to list the disks.

output Whether the output should be terse or verbose.

system Show only operating system disks:

0 - shows only MapR-FS disks
1 - shows only operating system disks
Not specified - shows both MapR-FS and operating system disks

Output

Information about the specified disks. See the Disk Fields table.

Examples

List disks on a host:

CLI
maprcli disk list -host 10.10.100.22

REST
https://r1n1.sj.us:8443/rest/disk/list?host=10.10.100.22


disk listall

Lists all disks in the cluster.

Syntax

CLI
maprcli disk listall
    [ -cluster <cluster> ]
    [ -columns <columns> ]
    [ -filter <filter> ]
    [ -limit <limit> ]
    [ -output terse|verbose ]
    [ -start <offset> ]

REST
http[s]://<host>:<port>/rest/disk/listall?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Disk Fields table.

filter A filter specifying disks to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Always the string terse.

start The offset from the starting row according to sort. Default: 0

Output

Information about all disks. See the Disk Fields table.

Examples

List all disks:

CLI
maprcli disk listall

REST
https://r1n1.sj.us:8443/rest/disk/listall


disk remove

Removes a disk from MapR-FS. Permissions required: fc or a.

The disk remove command does not remove a disk containing unreplicated data unless forced. To force disk removal, specify -force with the value 1.

Only use the -force 1 option if you are sure that you do not need the data on the disk. This option removes the disk without regard to replication factor or other data protection mechanisms, and may result in permanent data loss.

Syntax

CLI
maprcli disk remove
    [ -cluster <cluster> ]
    -disks <disk names>
    [ -force 0|1 ]
    -host <host>

REST
http[s]://<host>:<port>/rest/disk/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

disks A list of disks, in the form:

["disk"]
or
["disk","disk","disk"...]
or
[]

force Whether to force:

0 (default) - do not remove the disk or disks if there is unreplicated data on the disk
1 - remove the disk or disks regardless of data loss or other consequences

host The hostname or IP address of the node from which to remove the disk.

Output

Output Fields

Field Description

disk The name of a disk or partition. Example: sca or sca/sca1

all The string all, meaning all unmounted disks attached to the node.

disks A comma-separated list of disks which have non-replicated volumes. Example: "sca" or "sca/sca1,scb"


Examples

Remove a disk:

CLI
maprcli disk remove -disks ["sda1"]

REST
https://r1n1.sj.us:8443/rest/disk/remove?disks=["sda1"]
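When replacing a failed drive, disk remove is typically followed by disk add once the new drive is in place. A sketch using hypothetical host and device names; note that removing one disk also removes the other disks in its storage pool:

# Remove the failed disk from MapR-FS (quotes keep the shell from expanding the brackets)
maprcli disk remove -host 10.10.100.22 -disks '["/dev/sdc"]'

# ...physically replace the drive, then add it (and any others removed with it) back...
maprcli disk add -host 10.10.100.22 -disks '["/dev/sdc","/dev/sdd"]'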


entity

The entity commands let you work with entities (users and groups):

entity info shows information about a specified user or group
entity list lists users and groups in the cluster
entity modify edits information about a specified user or group


entity info

Displays information about an entity.

Syntax

CLI
maprcli entity info
    [ -cluster <cluster> ]
    -name <entity name>
    [ -output terse|verbose ]
    -type <type>

REST
http[s]://<host>:<port>/rest/entity/info?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The entity name.

output Whether to display terse or verbose output.

type The entity type

Output

DiskUsage  EntityQuota  EntityType  EntityName  VolumeCount  EntityAdvisoryquota  EntityId
864415     0            0           root        208          0                    0

Output Fields

           

Field Description

DiskUsage Disk space used by the user or group

EntityQuota The user or group quota

EntityType The entity type

EntityName The entity name

VolumeCount The number of volumes associated with the user or group

EntityAdvisoryquota The user or group advisory quota

EntityId The ID of the user or group


Examples

Display information for the user 'root':

CLI
maprcli entity info -type 0 -name root

REST
https://r1n1.sj.us:8443/rest/entity/info?type=0&name=root


entity list

Lists the entities (users and groups) in the cluster.

Syntax

CLI
maprcli entity list
    [ -alarmedentities true|false ]
    [ -cluster <cluster> ]
    [ -columns <columns> ]
    [ -filter <filter> ]
    [ -limit <rows> ]
    [ -output terse|verbose ]
    [ -start <start> ]

REST
http[s]://<host>:<port>/rest/entity/list?<parameters>

Parameters

Parameter Description

alarmedentities Specifies whether to list only entities that have exceeded a quota or advisory quota.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying entities to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Specifies whether output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

Output

Information about the users and groups.

Fields

Field Description

EntityType Entity type:

0 = User
1 = Group

EntityName User or Group name

EntityId User or Group id

EntityQuota Quota, in MB. 0 = no quota.

EntityAdvisoryquota Advisory quota, in MB. 0 = no advisory quota.

VolumeCount The number of volumes this entity owns.


DiskUsage Disk space used by all of the entity's volumes, in MB.

Sample Output

DiskUsage  EntityQuota  EntityType  EntityName  VolumeCount  EntityAdvisoryquota  EntityId
5859220    0            0           root        209          0                    0

Examples

List all entities:

CLI
maprcli entity list

REST
https://r1n1.sj.us:8443/rest/entity/list
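To see only entities that are over quota, combine the alarmedentities and columns parameters described above. A minimal sketch:

# List entities that have exceeded a quota or advisory quota,
# showing only their name, disk usage, and quota
maprcli entity list -alarmedentities true -columns EntityName,DiskUsage,EntityQuota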


entity modify

Modifies a user or group quota or email address. Permissions required: fc or a.

Syntax

CLI
maprcli entity modify
    [ -advisoryquota <advisory quota> ]
    [ -cluster <cluster> ]
    [ -email <email> ]
    [ -entities <entities> ]
    -name <entityname>
    [ -quota <quota> ]
    -type <type>

REST
http[s]://<host>:<port>/rest/entity/modify?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota.

cluster The cluster on which to run the command.

email Email address.

entities A comma-separated list of entities, in the format <type>:<name>. Example: 0:<user1>,0:<user2>,1:<group1>,1:<group2>...

name The entity name.

quota The quota for the entity.

type The entity type:

0 = user
1 = group

Examples

Modify the email address for the user 'root':

CLI
maprcli entity modify -name root -type 0 -email [email protected]

REST
https://r1n1.sj.us:8443/rest/entity/modify?name=root&type=0&[email protected]
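Quotas can be set the same way as the email address. A sketch with a hypothetical username; the quota values are shown in MB, as an assumption based on the entity list field descriptions:

# Set a 102400 MB quota and an 81920 MB advisory quota for the user alice
maprcli entity modify -name alice -type 0 -quota 102400 -advisoryquota 81920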


license

The license commands let you work with MapR licenses:

license add - adds a license
license addcrl - adds a certificate revocation list (CRL)
license apps - displays the features included in the current license
license list - lists licenses on the cluster
license listcrl - lists CRLs
license remove - removes a license
license showid - displays the cluster ID


license add

Adds a license. Permissions required: fc or a.

The license can be specified either by passing the license string itself to license add, or by specifying a file containing the license string.

Syntax

CLI
maprcli license add
    [ -cluster <cluster> ]
    [ -is_file true|false ]
    -license <license>

REST
http[s]://<host>:<port>/rest/license/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

is_file Specifies whether the license parameter specifies a file. If false, the license parameter contains the license string itself.

license The license to add to the cluster. If -is_file is true, license specifies the filename of a license file. Otherwise, license contains the license string itself.

Examples

Adding a License from a File

Assuming a file /tmp/license.txt containing a license string, the following command adds the license to the cluster.

CLI
maprcli license add -is_file true -license /tmp/license.txt


license addcrl

Adds a certificate revocation list (CRL). Permissions required: fc or a.

Syntax

CLI
maprcli license addcrl
    [ -cluster <cluster> ]
    -crl <crl>
    [ -is_file true|false ]

REST
http[s]://<host>:<port>/rest/license/addcrl?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

crl The CRL to add to the cluster. If -is_file is true, crl specifies the filename of a CRL file. Otherwise, crl contains the CRL string itself.

is_file Specifies whether the CRL is contained in a file.


license apps

Displays the features authorized for the current license. Permissions required: login

Syntax

CLI
maprcli license apps
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/apps?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


license list

Lists licenses on the cluster. Permissions required: login

Syntax

CLI
maprcli license list
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


license listcrl

Lists certificate revocation lists (CRLs) on the cluster. Permissions required: login

Syntax

CLI
maprcli license listcrl
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/listcrl?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


license remove

Removes a license. Permissions required: fc or a.

Syntax

CLI
maprcli license remove
    [ -cluster <cluster> ]
    -license_id <license>

REST
http[s]://<host>:<port>/rest/license/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

license_id The license to remove.


license showid

Displays the cluster ID for use when creating a new license. Permissions required: login

Syntax

CLI
maprcli license showid
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/license/showid?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


nagios

The nagios generate command generates a topology script for Nagios.


nagios generate

Generates a Nagios Object Definition file that describes the cluster nodes and the services running on each.

Syntax

CLI
maprcli nagios generate
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/nagios/generate?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

Sample Output


############# Commands #############

define command {
    command_name check_fileserver_proc
    command_line $USER1$/check_tcp -p 5660
}

define command {
    command_name check_cldb_proc
    command_line $USER1$/check_tcp -p 7222
}

define command {
    command_name check_jobtracker_proc
    command_line $USER1$/check_tcp -p 50030
}

define command {
    command_name check_tasktracker_proc
    command_line $USER1$/check_tcp -p 50060
}

define command {
    command_name check_nfs_proc
    command_line $USER1$/check_tcp -p 2049
}

define command {
    command_name check_hbmaster_proc
    command_line $USER1$/check_tcp -p 60000
}

define command {
    command_name check_hbregionserver_proc
    command_line $USER1$/check_tcp -p 60020
}

define command {
    command_name check_webserver_proc
    command_line $USER1$/check_tcp -p 8443
}

################# HOST: host1 ###############

define host {
    use linux-server
    host_name host1
    address 192.168.1.1
    check_command check-host-alive
}

################# HOST: host2 ###############

define host {
    use linux-server
    host_name host2
    address 192.168.1.2
    check_command check-host-alive
}

Examples

Generate a Nagios configuration, specifying the cluster name:


CLI
maprcli nagios generate -cluster cluster-1

REST
https://host1:8443/rest/nagios/generate?cluster=cluster-1

Generate a Nagios configuration and save it to the file nagios.conf:

CLI
maprcli nagios generate > nagios.conf
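After generating the file, you can check it with Nagios's preflight mode before reloading the Nagios service. A sketch; the paths assume a standard Nagios layout and are placeholders for your own installation:

# Write the MapR object definitions where Nagios will read them
maprcli nagios generate -cluster cluster-1 > /etc/nagios/objects/mapr.cfg

# Verify the full configuration before restarting Nagios
nagios -v /etc/nagios/nagios.cfg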


nfsmgmt

The nfsmgmt refreshexports command refreshes the NFS exports on the specified host and/or port.


nfsmgmt refreshexports

Refreshes the list of clusters and mount points available to mount with NFS. Permissions required: fc or a.

Syntax

CLI
maprcli nfsmgmt refreshexports
    [ -nfshost <host> ]
    [ -nfsport <port> ]

REST
http[s]://<host><:port>/rest/nfsmgmt/refreshexports?<parameters>

Parameters

Parameter Description

nfshost The hostname of the node that is running the MapR NFS server.

nfsport The port to use.
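Examples

Refresh the exports on a specific NFS gateway node. A sketch; the hostname is a placeholder, and the port shown assumes the default MapR NFS management port (9998):

CLI
maprcli nfsmgmt refreshexports -nfshost nfs1.example.com -nfsport 9998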


node

The node commands let you work with nodes in the cluster:

node heatmap
node list
node path
node remove
node services
node topo



node allow-into-cluster

Allows host IDs to join the cluster after duplicates have been resolved.

When the CLDB detects duplicate nodes with the same host ID, all nodes with that host ID are removed from the cluster and prevented from joining it again. After making sure that all nodes have unique host IDs, you can use the node allow-into-cluster command to un-ban the host ID that was previously duplicated among several nodes.

Syntax

CLI
maprcli node allow-into-cluster
    [ -hostids <host IDs> ]

REST
http[s]://<host>:<port>/rest/node/allow-into-cluster?<parameters>

Parameters

Parameter Description

hostids A comma-separated list of host IDs.

Examples

Allow former duplicate host IDs node1 and node2 to join the cluster:

CLI
maprcli node allow-into-cluster -hostids node1,node2

REST
https://r1n1.sj.us:8443/rest/node/allow-into-cluster?hostids=node1,node2


node heatmap

Displays a heatmap for the specified nodes.

Syntax

CLI
maprcli node heatmap
    [ -cluster <cluster> ]
    [ -filter <filter> ]
    [ -view <view> ]

REST
http[s]://<host>:<port>/rest/node/heatmap?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying nodes to display. See Filters for more information.

view Name of the heatmap view to show:

status - Node status:
    0 = Healthy
    1 = Needs attention
    2 = Degraded
    3 = Maintenance
    4 = Critical
cpu - CPU utilization, as a percent from 0-100.
memory - Memory utilization, as a percent from 0-100.
diskspace - MapR-FS disk space utilization, as a percent from 0-100.
DISK_FAILURE - Status of the DISK_FAILURE alarm. 0 if clear, 1 if raised.
SERVICE_NOT_RUNNING - Status of the SERVICE_NOT_RUNNING alarm. 0 if clear, 1 if raised.
CONFIG_NOT_SYNCED - Status of the CONFIG_NOT_SYNCED alarm. 0 if clear, 1 if raised.

Output

The heatmap output has the following structure:


{
    status: "OK",
    data: [{
        "{{rackTopology}}" : {
            "{{nodeName}}" : {{heatmapValue}},
            "{{nodeName}}" : {{heatmapValue}},
            "{{nodeName}}" : {{heatmapValue}},
            ...
        },
        "{{rackTopology}}" : {
            "{{nodeName}}" : {{heatmapValue}},
            "{{nodeName}}" : {{heatmapValue}},
            "{{nodeName}}" : {{heatmapValue}},
            ...
        },
        ...
    }]
}

Output Fields

Field Description

rackTopology The topology for a particular rack.

nodeName The name of the node in question.

heatmapValue The value of the metric specified in the view parameter for this node, as an integer.

Examples

Display a heatmap for the default rack:

CLI
maprcli node heatmap

REST
https://r1n1.sj.us:8443/rest/node/heatmap

Display memory usage for the default rack:

CLI
maprcli node heatmap -view memory

REST
https://r1n1.sj.us:8443/rest/node/heatmap?view=memory
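Because the heatmap output is keyed by rack topology and node name, a JSON processor can flatten it for scripting. A sketch, assuming the jq utility is installed and the structure shown above:

# List each node whose status metric is non-zero (i.e., not healthy)
maprcli node heatmap -json | jq -r \
  '.data[0] | to_entries[] | .key as $rack
   | .value | to_entries[]
   | select(.value != 0) | "\($rack)/\(.key): \(.value)"'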


node list

Lists nodes in the cluster.

Syntax

CLI
maprcli node list
    [ -alarmednodes 1 ]
    [ -cluster <cluster> ]
    [ -columns <columns> ]
    [ -filter <filter> ]
    [ -limit <limit> ]
    [ -nfsnodes 1 ]
    [ -output terse|verbose ]
    [ -start <offset> ]
    [ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/list?<parameters>

Parameters

Parameter Description

alarmednodes If set to 1, displays only nodes with raised alarms. Cannot be used if nfsnodes is set.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying nodes to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

nfsnodes If set to 1, displays only nodes running NFS. Cannot be used if alarmednodes is set.

output Specifies whether the output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

zkconnect ZooKeeper Connect String

Output

Information about the nodes. See the Fields table below.

Sample Output


bytesSent dreads davail TimeSkewAlarm servicesHoststatsDownAlarm ServiceHBMasterDownNotRunningAlarm ServiceNFSDownNotRunningAlarm ttmapUsed DiskFailureAlarm mused id mtotal cpus utilization rpcout ttReduceSlots ServiceFileserverDownNotRunningAlarm ServiceCLDBDownNotRunningAlarm dtotal jt-heartbeat ttReduceUsed dwriteK ServiceTTDownNotRunningAlarm ServiceJTDownNotRunningAlarm ttmapSlots dused uptime hostname health disks faileddisks fs-heartbeat rpcin ip dreadK dwrites ServiceWebserverDownNotRunningAlarm rpcs LogLevelAlarm ServiceHBRegionDownNotRunningAlarm bytesReceived service topo(rack) MapRfs disks ServiceMiscDownNotRunningAlarm VersionMismatchAlarm
8300 0 269 0 0 0 0 75 0 4058 6394230189818826805 7749 4 3 141 50 0 0 286 2 10 32 0 0 100 16 Thu Jan 15 16:58:57 PST 1970 whatsup 0 1 0 0 51 10.250.1.48 0 2 0 0 0 0 8236 /third/rack/whatsup 1 0 0

Fields

Field Description

bytesReceived Bytes received by the node since the last CLDB heartbeat.

bytesSent Bytes sent by the node since the last CLDB heartbeat.

CorePresentAlarm Cores Present Alarm (NODE_ALARM_CORE_PRESENT):

0 = Clear
1 = Raised

cpus The total number of CPUs on the node.

davail Disk space available on the node.

DiskFailureAlarm Failed Disks alarm (DISK_FAILURE):

0 = Clear
1 = Raised

disks Total number of disks on the node.

dreadK Disk Kbytes read since the last heartbeat.

dreads Disk read operations since the last heartbeat.

dtotal Total disk space on the node.

dused Disk space used on the node.

dwriteK Disk Kbytes written since the last heartbeat.

dwrites Disk write ops since the last heartbeat.

faileddisks Number of failed MapR-FS disks on the node.


fs-heartbeat Time since the last heartbeat to the CLDB, in seconds.

health Overall node health, calculated from various alarm states:

0 = Healthy
1 = Needs attention
2 = Degraded
3 = Maintenance
4 = Critical

hostname The host name.

id The node ID.

ip A list of IP addresses associated with the node.

jt-heartbeat Time since the last heartbeat to the JobTracker, in seconds.

LogLevelAlarm Excessive Logging Alarm (NODE_ALARM_DEBUG_LOGGING):

0 = Clear
1 = Raised

MapRfs disks  

mtotal Total memory, in GB.

mused Memory used, in GB.

HomeMapRFullAlarm Installation Directory Full Alarm (NODE_ALARM_OPT_MAPR_FULL):

0 = Clear
1 = Raised

RootPartitionFullAlarm Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL):

0 = Clear
1 = Raised

rpcin RPC bytes received since the last heartbeat.

rpcout RPC bytes sent since the last heartbeat.

rpcs Number of RPCs since the last heartbeat.

service A comma-separated list of services running on the node:

cldb - CLDB
fileserver - MapR-FS
jobtracker - JobTracker
tasktracker - TaskTracker
hbmaster - HBase Master
hbregionserver - HBase RegionServer
nfs - NFS Gateway

Example: "cldb,fileserver,nfs"

ServiceCLDBDownNotRunningAlarm CLDB Service Down Alarm (NODE_ALARM_SERVICE_CLDB_DOWN):

0 = Clear
1 = Raised

ServiceFileserverDownNotRunningAlarm Fileserver Service Down Alarm (NODE_ALARM_SERVICE_FILESERVER_DOWN):

0 = Clear
1 = Raised

ServiceHBMasterDownNotRunningAlarm HBase Master Service Down Alarm (NODE_ALARM_SERVICE_HBMASTER_DOWN):

0 = Clear
1 = Raised

ServiceHBRegionDownNotRunningAlarm HBase Regionserver Service Down Alarm (NODE_ALARM_SERVICE_HBREGION_DOWN):

0 = Clear
1 = Raised


ServicesHoststatsDownNotRunningAlarm Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN):

0 = Clear
1 = Raised

ServiceJTDownNotRunningAlarm Jobtracker Service Down Alarm (NODE_ALARM_SERVICE_JT_DOWN):

0 = Clear
1 = Raised

ServiceMiscDownNotRunningAlarm 0 = Clear
1 = Raised

ServiceNFSDownNotRunningAlarm NFS Service Down Alarm (NODE_ALARM_SERVICE_NFS_DOWN):

0 = Clear
1 = Raised

ServiceTTDownNotRunningAlarm Tasktracker Service Down Alarm (NODE_ALARM_SERVICE_TT_DOWN):

0 = Clear
1 = Raised

ServicesWebserverDownNotRunningAlarm Webserver Service Down Alarm (NODE_ALARM_SERVICE_WEBSERVER_DOWN):

0 = Clear
1 = Raised

TimeSkewAlarm Time Skew alarm (NODE_ALARM_TIME_SKEW):

0 = Clear
1 = Raised

racktopo The rack path.

ttmapSlots TaskTracker map slots.

ttmapUsed TaskTracker map slots used.

ttReduceSlots TaskTracker reduce slots.

ttReduceUsed TaskTracker reduce slots used.

uptime Date when the node came up.

utilization CPU use percentage since the last heartbeat.

VersionMismatchAlarm Software Version Mismatch Alarm (NODE_ALARM_VERSION_MISMATCH):

0 = Clear
1 = Raised

Examples

List all nodes:

CLI
maprcli node list

REST
https://r1n1.sj.us:8443/rest/node/list

List the health of all nodes:


CLI
maprcli node list -columns service,health

REST
https://r1n1.sj.us:8443/rest/node/list?columns=service,health

List the number of slots on all nodes:

CLI
maprcli node list -columns ip,ttmapSlots,ttmapUsed,ttReduceSlots,ttReduceUsed

REST
https://r1n1.sj.us:8443/rest/node/list?columns=ip,ttmapSlots,ttmapUsed,ttReduceSlots,ttReduceUsed
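The columns parameter can also be combined with a filter to narrow the listing. A sketch; the filter syntax is described on the Filters page, and the IP pattern is a placeholder:

# Show hostname and health for nodes whose IP address matches a subnet pattern
maprcli node list -filter '[ip==10.10.100.*]' -columns hostname,health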


node listcldbs

The node listcldbs API returns the hostnames of the nodes in the cluster that are running the CLDB service.

Syntax

CLI
maprcli node listcldbs
    [ -cluster <cluster name> ]
    [ -cldb <cldb hostname|ip:port> ]

REST
http[s]://<host>:<port>/rest/node/listcldbs?<parameters>

Parameters

Parameter Description

cluster name The name of the cluster for which to return the list of CLDB node hostnames.

cldb hostname|ip:port The hostname or IP address and port number of a CLDB node.

Examples

Return the list of CLDB nodes for the cluster my.cluster.com:

CLI
maprcli node listcldbs -cluster my.cluster.com

REST
https://r1n1.sj.us:8443/rest/node/listcldbs?cluster=my.cluster.com


node listcldbzks

The node listcldbzks API returns the hostnames of the nodes in the cluster that are running the CLDB service and the IP addresses and port numbers for the nodes in the cluster that are running the ZooKeeper service.

Syntax

CLI
maprcli node listcldbzks
    [ -cluster <cluster name> ]
    [ -cldb <cldb hostname|ip:port> ]

REST
http[s]://<host>:<port>/rest/node/listcldbzks?<parameters>

Parameters

Parameter Description

cluster name The name of the cluster for which to return the CLDB and ZooKeeper information.

cldb hostname|ip:port The hostname or IP address and port number of a CLDB node.

Examples

Return CLDB and ZooKeeper node information for the cluster my.cluster.com:

CLI
maprcli node listcldbzks -cluster my.cluster.com

REST
https://r1n1.sj.us:8443/rest/node/listcldbzks?cluster=my.cluster.com


node listzookeepers

The node listzookeepers API returns the hostnames of the nodes in the cluster that are running the ZooKeeper service.

Syntax

CLI
maprcli node listzookeepers
    [ -cluster <cluster name> ]
    [ -cldb <cldb hostname|ip:port> ]

REST
http[s]://<host>:<port>/rest/node/listzookeepers?<parameters>

Parameters

Parameter Description

cluster name The name of the cluster for which to return the list of ZooKeeper node hostnames.

cldb hostname|ip:port The hostname or IP address and port number of a CLDB node.

Examples

Return the list of ZooKeeper nodes for the cluster my.cluster.com:

CLI
maprcli node listzookeepers -cluster my.cluster.com

REST
https://r1n1.sj.us:8443/rest/node/listzookeepers?cluster=my.cluster.com


node maintenance

Places a node into maintenance mode.

Syntax

CLI
maprcli node maintenance
    [ -cluster <cluster> ]
    [ -serverids <serverids> ]
    [ -nodes <nodes> ]
    -timeoutminutes <minutes>

REST
http[s]://<host>:<port>/rest/node/maintenance?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

serverids List of server IDs

nodes List of nodes

timeoutminutes Duration of timeout in minutes

Output Fields

Field Description

path The physical topology path to the node.

errorChildCount The number of descendants of the node which have overall status 0.

OKChildCount The number of descendants of the node which have overall status 1.

configChildCount The number of descendants of the node which have overall status 2.

Bringing a Node out of Maintenance

To bring a node back from maintenance before the timeout expires:

1. Stop the mfs service on the node.
2. Run the command maprcli node maintenance -nodes <node in maintenance> -timeoutminutes 0.
3. Start the mfs service on the node.


node metrics

Retrieves metrics information for nodes in a cluster.

Use the node metrics API to retrieve node data for your job.

Syntax

CLI
maprcli node metrics
    -nodes <nodes>
    -start <start_time>
    -end <end_time>
    [ -json ]
    [ -interval <interval> ]
    [ -events ]
    [ -columns <columns> ]
    [ -cluster <cluster name> ]

Parameters

Parameter Description

nodes A space-separated list of node names.

start The start of the time range. Can be a UTC timestamp or a UTC date in MM/DD/YY format.

end The end of the time range. Can be a UTC timestamp or a UTC date in MM/DD/YY format.

json Specify this flag to return data as a JSON object.

interval Data measurement interval in seconds. The minimum value is 10 seconds.

events Specify TRUE to return node events only. The default value of this parameter is FALSE.

columns Comma-separated list of column names to return.

cluster Cluster name.

Column Name Parameters

The node metrics API always returns the NODE (node name), TIMESTAMPSTR (timestamp string), and TIMESTAMP (integer timestamp) columns. Use the -columns flag to specify a comma-separated list of column names to return.

The CPUNICE, CPUUSER, and CPUSYSTEM parameters return information in jiffies. This unit measures one tick of the system timer interrupt and is usually equivalent to 10 milliseconds, but may vary depending on your particular node configuration. Call sysconf(_SC_CLK_TCK) to determine the exact value for your node.
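For example, to convert a jiffies figure from the output into seconds, divide it by the node's clock tick rate. A small sketch using standard shell utilities:

# Jiffies per second on this node (typically 100)
ticks=$(getconf CLK_TCK)

# Convert a CPUIDLE reading of 169173957 jiffies to seconds
echo "scale=2; 169173957 / $ticks" | bc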

Parameter Description Notes

CPUNICE Amount of CPU time used by processes with a positive nice value.  

CPUUSER Amount of CPU time used by user processes.  

CPUSYSTEM Amount of CPU time used by system processes.  

LOAD5PERCENT Percentage of time this node spent at load 5 or below  

LOAD1PERCENT Percentage of time this node spent at load 1 or below  

MEMORYCACHED Memory cache size in bytes  

MEMORYSHARED Shared memory size in bytes  


MEMORYBUFFERS Memory buffer size in bytes  

MEMORYUSED Memory used in bytes  

PROCRUN Number of processes running  

RPCCOUNT Number of RPC calls  

RPCINBYTES Number of bytes passed in by RPC calls  

RPCOUTBYTES Number of bytes passed out by RPC calls  

SERVAVAILSIZEMB Server storage available in megabytes  

SERVUSEDSIZEMB Server storage used in megabytes  

SWAPFREE Free swap space in bytes  

TTMAPUSED Number of TaskTracker slots used for map tasks  

TTREDUCEUSED Number of TaskTracker slots used for reduce tasks  

Three column name parameters return data that is too granular to display in a standard table. Use the -json option to return this information as a JSON object.

Parameter Description Metrics Returned

CPUS Activity on this node's CPUs. Each CPU on the node is numbered from zero, cpu0 to cpuN. Metrics returned are for each CPU.

    CPUIDLE: Amount of CPU time spent idle. Reported as jiffies.
    CPUIOWAIT: Amount of CPU time spent waiting for I/O operations. Reported as jiffies.
    CPUTOTAL: Total amount of CPU time. Reported as jiffies.

DISKS Activity on this node's disks. Metrics returned are for each partition.

    READOPS: Number of read operations.
    READKB: Number of kilobytes read.
    WRITEOPS: Number of write operations.
    WRITEKB: Number of kilobytes written.

NETWORK Activity on this node's network interfaces. Metrics returned are for each interface.

    BYTESIN: Number of bytes received.
    BYTESOUT: Number of bytes sent.
    PKTSIN: Number of packets received.
    PKTSOUT: Number of packets sent.

Examples

To retrieve the percentage of time that a node spent at the 1 and 5 load levels:

[user@host ~]# maprcli node metrics -nodes my.node.lab -start 07/25/12 -end 07/26/12 -interval 7200 -columns LOAD1PERCENT,LOAD5PERCENT

NODE         LOAD5PERCENT  LOAD1PERCENT  TIMESTAMPSTR                  TIMESTAMP
my.node.lab                              Wed Jul 25 12:52:40 PDT 2012  1343245960047
my.node.lab  11            18            Wed Jul 25 14:52:50 PDT 2012  1343253170000
my.node.lab  10            23            Wed Jul 25 16:52:50 PDT 2012  1343260370000
my.node.lab  15            46            Wed Jul 25 18:52:57 PDT 2012  1343267577000
my.node.lab  18            34            Wed Jul 25 20:52:58 PDT 2012  1343274778000
my.node.lab  28            70            Wed Jul 25 22:53:01 PDT 2012  1343281981000
my.node.lab  35            84            Thu Jul 26 00:53:01 PDT 2012  1343289181000
my.node.lab  30            35            Thu Jul 26 02:53:03 PDT 2012  1343296383000
my.node.lab  36            62            Thu Jul 26 04:53:10 PDT 2012  1343303590000
my.node.lab  37            44            Thu Jul 26 06:53:14 PDT 2012  1343310794000
my.node.lab  12            28            Thu Jul 26 08:53:21 PDT 2012  1343318001000
my.node.lab  22            38            Thu Jul 26 10:53:30 PDT 2012  1343325210000

Sample JSON object


This JSON object is returned by the following command:

[user@host ~]# maprcli node metrics -nodes my.node.lab -json -start 1343290000000 -end 1343300000000 -interval 28800 -columns LOAD1PERCENT,LOAD5PERCENT,CPUS

{
    "timestamp":1343333063869,
    "status":"OK",
    "total":3,
    "data":[
        {
            "NODE":"my.node.lab",
            "TIMESTAMPSTR":"Wed Jul 25 18:00:05 PDT 2012",
            "TIMESTAMP":1343264405000,
            "LOAD1PERCENT":13,
            "LOAD5PERCENT":12,
            "CPUS":{
                "cpu0":{
                    "CPUIDLE":169173957,
                    "CPUIOWAIT":2982912,
                    "CPUTOTAL":173897423
                },
                "cpu1":{
                    "CPUIDLE":172217855,
                    "CPUIOWAIT":26760,
                    "CPUTOTAL":174016589
                },
                "cpu2":{
                    "CPUIDLE":171071574,
                    "CPUIOWAIT":4051,
                    "CPUTOTAL":173957716
                }
            }
        },
        {
            "NODE":"my.node.lab",
            "TIMESTAMPSTR":"Thu Jul 26 02:00:08 PDT 2012",
            "TIMESTAMP":1343293208000,
            "LOAD1PERCENT":17,
            "LOAD5PERCENT":13,
            "CPUS":{
                "cpu0":{
                    "CPUIDLE":169173957,
                    "CPUIOWAIT":2982912,
                    "CPUTOTAL":173897423
                },
                "cpu1":{
                    "CPUIDLE":172217855,
                    "CPUIOWAIT":26760,
                    "CPUTOTAL":174016589
                },
                "cpu2":{
                    "CPUIDLE":171071574,
                    "CPUIOWAIT":4051,
                    "CPUTOTAL":173957716
                }
            }
        },
        {
            "NODE":"my.node.lab",
            "TIMESTAMPSTR":"Thu Jul 26 10:00:08 PDT 2012",
            "TIMESTAMP":1343322008000,
            "LOAD1PERCENT":18,
            "LOAD5PERCENT":13,
            "CPUS":{
                "cpu0":{
                    "CPUIDLE":169173957,
                    "CPUIOWAIT":2982912,
                    "CPUTOTAL":173897423
                },
                "cpu1":{
                    "CPUIDLE":172217855,
                    "CPUIOWAIT":26760,
                    "CPUTOTAL":174016589
                },
                "cpu2":{
                    "CPUIDLE":171071574,
                    "CPUIOWAIT":4051,
                    "CPUTOTAL":173957716
                }
            }
        }
    ]
}


node move

Moves one or more nodes to a different topology. Permissions required: fc or a

Syntax

CLI
maprcli node move
    [ -cluster <cluster> ]
    -serverids <server IDs>
    -topology <topology>

REST
http[s]://<host>:<port>/rest/node/move?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

serverids The server IDs of the nodes to move.

topology The new topology.
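Examples

Move a node to the topology /data/rack1 (a sketch; the server ID and topology path are hypothetical placeholders, and server IDs can be found in the output of the node list command):

CLI
# The server ID below is a made-up placeholder; substitute the real ID of your node
maprcli node move -serverids 5028071792030016375 -topology /data/rack1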


node path

Changes the path of the specified node or nodes. Permissions required: fc or a

Syntax

CLI
maprcli node path
    [ -cluster <cluster> ]
    [ -filter <filter> ]
    [ -nodes <node names> ]
    -path <path>
    [ -which switch|rack|both ]
    [ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/path?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying the nodes on which to run the command. See Filters for more information.

nodes A list of node names, separated by spaces.

path The path to change.

which Which path to change: switch, rack or both. Default: rack

zkconnect ZooKeeper Connect String.


node remove

The node remove command removes one or more server nodes from the system. Permissions required: fc or a

After issuing the node remove command, wait several minutes to ensure that the node has been properly and completely removed.

Syntax

CLI
maprcli node remove
    [ -filter <filter> ]
    [ -force true|false ]
    [ -nodes <node names> ]
    [ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/remove?<parameters>

Parameters

Parameter Description

filter A filter specifying the nodes to remove. See Filters for more information.

force Forces the service stop operations. Default: false

nodes A list of node names, separated by spaces.

zkconnect ZooKeeper Connect String. Example: 'host:port,host:port,host:port,...'. Default: localhost:5181
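Examples

Remove a node by name (illustrative; node001 is a hypothetical node name):

CLI
# After this returns, wait several minutes for the removal to complete
maprcli node remove -nodes node001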


node services

Starts, stops, restarts, suspends, or resumes services on one or more server nodes. Permissions required: ss, fc, or a

The same set of services applies to all specified nodes; to manipulate different groups of services differently, send multiple requests. 

Note: the suspend and resume actions have not yet been implemented.

Syntax

CLI
maprcli node services
    [ -action restart|resume|start|stop|suspend ]
    [ -cldb restart|resume|start|stop|suspend ]
    [ -cluster <cluster> ]
    [ -fileserver restart|resume|start|stop|suspend ]
    [ -filter <filter> ]
    [ -hbmaster restart|resume|start|stop|suspend ]
    [ -hbregionserver restart|resume|start|stop|suspend ]
    [ -jobtracker restart|resume|start|stop|suspend ]
    [ -name <service> ]
    [ -nfs restart|resume|start|stop|suspend ]
    [ -nodes <node names> ]
    [ -tasktracker restart|resume|start|stop|suspend ]
    [ -zkconnect <ZooKeeper Connect String> ]

REST
http[s]://<host>:<port>/rest/node/services?<parameters>

Parameters

When used together, the action and name parameters specify an action to perform on a service. To start the JobTracker, for example, you can either specify start for the action and jobtracker for the name, or simply specify start on the jobtracker parameter; see the example after this table.

Parameter Description

action An action to perform on a service specified in the name parameter: restart, resume, start, stop, or suspend

cldb Starts or stops the cldb service. Values: restart, resume, start, stop, or suspend

cluster The cluster on which to run the command.

fileserver Starts or stops the fileserver service. Values: restart, resume, start, stop, or suspend

filter A filter specifying nodes on which to start or stop services. See for more information.Filters

hbmaster Starts or stops the hbmaster service. Values: restart, resume, start, stop, or suspend

hbregionserver Starts or stops the hbregionserver service. Values: restart, resume, start, stop, or suspend

jobtracker Starts or stops the jobtracker service. Values: restart, resume, start, stop, or suspend

name A service on which to perform an action specified by the action parameter.

nfs Starts or stops the nfs service. Values: restart, resume, start, stop, or suspend

nodes A list of node names, separated by spaces.

tasktracker Starts or stops the tasktracker service. Values: restart, resume, start, stop, or suspend

zkconnect ZooKeeper Connect String
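Examples

Start the JobTracker on a node using either of the two equivalent forms described above (a sketch; node001 is a hypothetical node name):

CLI
# Generic form: name the service and the action separately
maprcli node services -nodes node001 -name jobtracker -action start

# Service-specific form: pass the action directly to the jobtracker parameter
maprcli node services -nodes node001 -jobtracker start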


node topo

Lists cluster topology information.

Lists internal nodes only (switches/racks/etc) and not leaf nodes (server nodes).

Syntax

CLI
maprcli node topo [ -cluster <cluster> ] [ -path <path> ]

REST
http[s]://<host>:<port>/rest/node/topo?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

path The path on which to list node topology.

Output

Node topology information.

Sample output

{
    status: "OK",
    total: recordCount,
    data: [
        {
            path: 'path',
            status: [errorChildCount, OKChildCount, configChildCount],
        },
        ...additional structures for each topology node...
    ]
}

Output Fields

Field Description

path The physical topology path to the node.

errorChildCount The number of descendants of the node which have overall status 0.

OKChildCount The number of descendants of the node which have overall status 1.

configChildCount The number of descendants of the node which have overall status 2.
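Examples

List the internal topology nodes under a rack path (a sketch; /data is a hypothetical topology path):

CLI
# Lists switches/racks under /data, not the server nodes themselves
maprcli node topo -path /data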


schedule

The schedule commands let you work with schedules:

schedule create - creates a schedule
schedule list - lists schedules
schedule modify - modifies the name or rules of a schedule by ID
schedule remove - removes a schedule by ID

A schedule is a JSON object that specifies a single or recurring time for volume snapshot creation or mirror syncing. For a schedule to be useful, it must be associated with at least one volume. See volume create and volume modify.

Schedule Fields

The schedule object contains the following fields:

Field Value

id The ID of the schedule.

name The name of the schedule.

inuse Indicates whether the schedule is associated with an action.

rules An array of JSON objects specifying how often the scheduled action occurs. See Rule Fields below.

Rule Fields

The following table shows the fields to use when creating a rules object.

Field Values

frequency How often to perform the action:

once - Once
yearly - Yearly
monthly - Monthly
weekly - Weekly
daily - Daily
hourly - Hourly
semihourly - Every 30 minutes
quarterhourly - Every 15 minutes
fiveminutes - Every 5 minutes
minute - Every minute

retain How long to retain the data resulting from the action. For example, if the schedule creates a snapshot, the retain field sets the snapshot's expiration. The retain field consists of an integer and one of the following units of time:

mi - minutes
h - hours
d - days
w - weeks
m - months
y - years

time The time of day to perform the action, in 24-hour format: HH

date The date on which to perform the action:

For single occurrences, specify month, day and year: MM/DD/YYYY
For yearly occurrences, specify the month and day: MM/DD
For monthly occurrences, specify the day: DD
Daily and hourly occurrences do not require the date field.


Example

The following example JSON shows a schedule called "snapshot," with three rules.

{
    "id":8,
    "name":"snapshot",
    "inuse":0,
    "rules":[
        {
            "frequency":"monthly",
            "date":"8",
            "time":14,
            "retain":"1m"
        },
        {
            "frequency":"weekly",
            "date":"sat",
            "time":14,
            "retain":"2w"
        },
        {
            "frequency":"hourly",
            "retain":"1d"
        }
    ]
}


schedule create

Creates a schedule. Permissions required: fc or a

A schedule can be associated with a volume to automate mirror syncing and snapshot creation. See volume create and volume modify.

Syntax

CLI
maprcli schedule create [ -cluster <cluster> ] -schedule <JSON>

REST
http[s]://<host>:<port>/rest/schedule/create?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

schedule A JSON object describing the schedule. See Schedule Objects for more information.

Examples

Scheduling a Single Occurrence

CLI
maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"once","retain":"1w","time":13,"date":"12/5/2010"}]}'

REST
https://r1n1.sj.us:8443/rest/schedule/create?schedule={"name":"Schedule-1","rules":[{"frequency":"once","retain":"1w","time":13,"date":"12/5/2010"}]}

A Schedule with Several Rules

CLI
maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"},{"frequency":"hourly","retain":"1w"},{"frequency":"yearly","date":"11/5","time":14,"retain":"1w"}]}'

REST
https://r1n1.sj.us:8443/rest/schedule/create?schedule={"name":"Schedule-1","rules":[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"},{"frequency":"hourly","retain":"1w"},{"frequency":"yearly","date":"11/5","time":14,"retain":"1w"}]}


schedule list

Lists the schedules on the cluster.

Syntax

CLI
maprcli schedule list [ -cluster <cluster> ] [ -output terse|verbose ]

REST
http[s]://<host>:<port>/rest/schedule/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

output Specifies whether the output should be terse or verbose.

Output

A list of all schedules on the cluster. See Schedule Objects for more information.

Examples

List schedules:

CLI
maprcli schedule list

REST
https://r1n1.sj.us:8443/rest/schedule/list



schedule modify

Modifies an existing schedule, specified by ID. Permissions required: fc or a

To find a schedule's ID:

1. Use the schedule list command to list the schedules.
2. Select the schedule to modify.
3. Pass the selected schedule's ID in the -id parameter to the schedule modify command.

Syntax

CLI
maprcli schedule modify
    [ -cluster <cluster> ]
    -id <schedule ID>
    [ -name <schedule name> ]
    [ -rules <JSON> ]

REST
http[s]://<host>:<port>/rest/schedule/modify?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

id The ID of the schedule to modify.

name The new name of the schedule.

rules A JSON object describing the rules for the schedule. If specified, replaces the entire existing rules object in the schedule. For information about the fields to use in the JSON object, see Rule Fields.

Examples

Modify a schedule

CLI
maprcli schedule modify -id 0 -name Newname -rules '[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"}]'

REST
https://r1n1.sj.us:8443/rest/schedule/modify?id=0&name=Newname&rules=[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"}]


schedule remove

Removes a schedule.

A schedule can only be removed if it is not associated with any volumes. See volume modify.

Syntax

CLI
maprcli schedule remove [ -cluster <cluster> ] -id <schedule ID>

REST
http[s]://<host>:<port>/rest/schedule/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

id The ID of the schedule to remove.

Examples

Remove schedule with ID 0:

CLI
maprcli schedule remove -id 0

REST
https://r1n1.sj.us:8443/rest/schedule/remove?id=0


service list

Lists all services on the specified node, along with the state and log path for each service.

Syntax

CLI
maprcli service list -node <node name>

REST
http[s]://<host>:<port>/rest/service/list?<parameters>

Parameters

Parameter Description

node The node on which to list the services

Output

Information about services on the specified node. For each service, the status is reported numerically:

0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured (configure.sh has not run)
2 - RUNNING: the service is installed, has been started by the warden, and is currently executing
3 - STOPPED: the service is installed and configure.sh has run, but the service is currently not executing
5 - STAND_BY: the service is installed and is in standby mode, waiting to take over in case of failure of another instance (mainly used for JobTracker warm standby)
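Examples

List the services on a node (a sketch; node001 is a hypothetical node name):

CLI
# Reports each service on node001 with its numeric state and log path
maprcli service list -node node001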


setloglevel

The setloglevel commands set the log level on individual services:

setloglevel cldb - Sets the log level for the CLDB.
setloglevel hbmaster - Sets the log level for the HBase Master.
setloglevel hbregionserver - Sets the log level for the HBase RegionServer.
setloglevel jobtracker - Sets the log level for the JobTracker.
setloglevel fileserver - Sets the log level for the FileServer.
setloglevel nfs - Sets the log level for the NFS.
setloglevel tasktracker - Sets the log level for the TaskTracker.


setloglevel cldb

Sets the log level on the CLDB service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel cldb
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/cldb?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The CLDB port
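Examples

Enable DEBUG logging for one CLDB class (a sketch; the class name is illustrative, node001 is a hypothetical node name, and 7222 assumes the default CLDB port):

CLI
# com.mapr.fs.cldb.CLDB is used here only as an example class name
maprcli setloglevel cldb -classname com.mapr.fs.cldb.CLDB -loglevel DEBUG -node node001 -port 7222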


setloglevel fileserver

Sets the log level on the FileServer service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel fileserver
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/fileserver?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The MapR-FS port


setloglevel hbmaster

Sets the log level on the HBase Master service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel hbmaster
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/hbmaster?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The HBase Master webserver port


setloglevel hbregionserver

Sets the log level on the HBase RegionServer service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel hbregionserver
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/hbregionserver?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The HBase RegionServer webserver port


setloglevel jobtracker

Sets the log level on the JobTracker service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel jobtracker
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/jobtracker?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The JobTracker webserver port


setloglevel nfs

Sets the log level on the NFS service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel nfs
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/nfs?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The NFS port


setloglevel tasktracker

Sets the log level on the TaskTracker service. Permissions required: fc or a

Syntax

CLI
maprcli setloglevel tasktracker
    -classname <class>
    -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN
    -node <node>
    -port <port>

REST
http[s]://<host>:<port>/rest/setloglevel/tasktracker?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set: DEBUG, ERROR, FATAL, INFO, TRACE, or WARN

node The node on which to set the log level.

port The TaskTracker port


trace

The trace commands let you view and modify the trace buffer, and the trace levels for the system modules. The valid trace levels are:

DEBUG
INFO
ERROR
WARN
FATAL

The following pages provide information about the trace commands:

trace dump
trace info
trace print
trace reset
trace resize
trace setlevel
trace setmode


trace dump

Dumps the contents of the trace buffer into the MapR-FS log.

Syntax

CLI
maprcli trace dump [ -host <host> ] [ -port <port> ]

REST None.

Parameters

Parameter Description

host The IP address of the node from which to dump the trace buffer. Default: localhost

port The port to use when dumping the trace buffer. Default: 5660

Examples

Dump the trace buffer to the MapR-FS log:

CLI
maprcli trace dump


trace info

Displays the trace level of each module.

Syntax

CLI
maprcli trace info [ -host <host> ] [ -port <port> ]

REST None.

Parameters

Parameter Description

host The IP address of the node on which to display the trace level of each module. Default: localhost

port The port to use. Default: 5660

Output

A list of all modules and their trace levels.

Sample Output


RPC Client Initialize
**Trace is in DEFAULT mode.**
Allowed Trace Levels are:
FATAL
ERROR
WARN
INFO
DEBUG
**Trace buffer size: 2097152**
Modules and levels:
Global : INFO
RPC : ERROR
MessageQueue : ERROR
CacheMgr : INFO
IOMgr : INFO
Transaction : ERROR
Log : INFO
Cleaner : ERROR
Allocator : ERROR
BTreeMgr : ERROR
BTree : ERROR
BTreeDelete : ERROR
BTreeOwnership : INFO
MapServerFile : ERROR
MapServerDir : INFO
Container : INFO
Snapshot : INFO
Util : ERROR
Replication : INFO
PunchHole : ERROR
KvStore : ERROR
Truncate : ERROR
Orphanage : INFO
FileServer : INFO
Defer : ERROR
ServerCommand : INFO
NFSD : INFO
Cidcache : ERROR
Client : ERROR
Fidcache : ERROR
Fidmap : ERROR
Inode : ERROR
JniCommon : ERROR
Shmem : ERROR
Table : ERROR
Fctest : ERROR
DONE

Examples

Display trace info:

CLI
maprcli trace info


trace print

Manually dumps the trace buffer to stdout.

Syntax

CLI
maprcli trace print [ -host <host> ] [ -port <port> ] -size <size>

REST None.

Parameters

Parameter Description

host The IP address of the node from which to dump the trace buffer to stdout. Default: localhost

port The port to use. Default: 5660

size The number of kilobytes of the trace buffer to print. Maximum: 64

Output

The most recent <size> bytes of the trace buffer.

-----------------------------------------------------
2010-10-04 13:59:31,0000 Program: mfs on Host: fakehost IP: 0.0.0.0, Port: 0, PID: 0
-----------------------------------------------------
DONE

Examples

Display the trace buffer:

CLI
maprcli trace print


trace reset

Resets the in-memory trace buffer.

Syntax

CLI
maprcli trace reset [ -host <host> ] [ -port <port> ]

REST None.

Parameters

Parameter Description

host The IP address of the node on which to reset the trace parameters. Default: localhost

port The port to use. Default: 5660

Examples

Reset trace parameters:

CLI
maprcli trace reset


trace resize

Resizes the trace buffer.

Syntax

CLI
maprcli trace resize [ -host <host> ] [ -port <port> ] -size <size>

REST None.

Parameters

Parameter Description

host The IP address of the node on which to resize the trace buffer. Default: localhost

port The port to use. Default: 5660

size The size of the trace buffer, in kilobytes. Default: 2097152  Minimum: 1

Examples

Resize the trace buffer to 1000 kilobytes:

CLI
maprcli trace resize -size 1000


trace setlevel

Sets the trace level on one or more modules.

Syntax

CLI
maprcli trace setlevel
    [ -host <host> ]
    -level <trace level>
    -module <module name>
    [ -port <port> ]

REST None.

Parameters

Parameter Description

host The node on which to set the trace level. Default: localhost

module The module on which to set the trace level. If set to all, sets the trace level on all modules.

level The new trace level. If set to default, sets the trace level to its default.

port The port to use. Default: 5660

Examples

Set the trace level of the log module to INFO:

CLI
maprcli trace setlevel -module log -level info

Set the trace levels of all modules to their defaults:

CLI
maprcli trace setlevel -module all -level default


trace setmode

Sets the trace mode. There are two modes:

Default
Continuous

In default mode, all trace messages are saved in a memory buffer. If there is an error, the buffer is dumped to stdout. In continuous mode, every allowed trace message is dumped to stdout in real time.

Syntax

CLI
maprcli trace setmode [ -host <host> ] -mode default|continuous [ -port <port> ]

REST None.

Parameters

Parameter Description

host The IP address of the host on which to set the trace mode

mode The trace mode.

port The port to use.

Examples

Set the trace mode to continuous:

CLI
maprcli trace setmode -mode continuous


urls

The urls command displays the status page URL for the specified service.

Syntax

CLI
maprcli urls
    [ -cluster <cluster> ]
    -name <service name>
    [ -zkconnect <zookeeper connect string> ]

REST
http[s]://<host>:<port>/rest/urls/<name>

Parameters

Parameter Description

cluster The name of the cluster on which to run the command.

name The name of the service for which to get the status page:

cldb
jobtracker
tasktracker

zkconnect ZooKeeper Connect String

Examples

Display the URL of the status page for the CLDB service:

CLI
maprcli urls -name cldb

REST
https://r1n1.sj.us:8443/rest/maprcli/urls/cldb


virtualip

The virtualip commands let you work with virtual IP addresses for NFS nodes:

virtualip add - adds a range of virtual IP addresses
virtualip edit - edits a range of virtual IP addresses
virtualip list - lists virtual IP addresses
virtualip move - reassigns a range of virtual IP addresses to a MAC
virtualip remove - removes a range of virtual IP addresses

Virtual IP Fields

Field Description

macaddress The MAC address of the virtual IP.

netmask The netmask of the virtual IP.

virtualipend The virtual IP range end.


virtualip add

Adds a virtual IP address. Permissions required: fc or a

Syntax

CLI
maprcli virtualip add
    [ -cluster <cluster> ]
    [ -gateway <gateway> ]
    [ -macs <MAC address> ]
    -netmask <netmask>
    -virtualip <virtualip>
    [ -virtualipend <virtual IP range end> ]

REST
http[s]://<host>:<port>/rest/virtualip/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

gateway The NFS gateway IP or address

macs A list of the MAC addresses that represent the NICs on the nodes that the VIPs in the VIP range can be associated with. Use this list to limit VIP assignment to NICs on a particular subnet when your NFS server is part of multiple subnets.

netmask The netmask of the virtual IP.

virtualip The virtual IP, or the start of the virtual IP range.

virtualipend The virtual IP range end.
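Examples

Add a range of three VIPs (a sketch; the addresses reuse those from the virtualip move example below, and the netmask is an illustrative value for a /24 subnet):

CLI
# 192.168.0.8 through 192.168.0.10 become virtual IPs for NFS failover
maprcli virtualip add -virtualip 192.168.0.8 -virtualipend 192.168.0.10 -netmask 255.255.255.0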


virtualip edit

Edits a virtual IP (VIP) range. Permissions required: fc or a

Syntax

CLI
maprcli virtualip edit
    [ -cluster <cluster> ]
    [ -macs <mac address(es)> ]
    -netmask <netmask>
    -virtualip <virtualip>
    [ -virtualipend <range end> ]

REST
http[s]://<host>:<port>/rest/virtualip/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

macs A list of MAC addresses. These MAC addresses belong to the NICs on the nodes that the VIP or VIP range can be associated with.

netmask The netmask for the VIP or VIP range.

virtualip The start of the VIP range, or the VIP if only one VIP is used.

virtualipend The end of the VIP range if more than one VIP is used.


virtualip list

Lists the virtual IP addresses in the cluster.

Syntax

CLI
maprcli virtualip list
    [ -cluster <cluster> ]
    [ -columns <columns> ]
    [ -filter <filter> ]
    [ -limit <limit> ]
    [ -nfsmacs <NFS macs> ]
    [ -output <output> ]
    [ -range <range> ]
    [ -start <start> ]

REST
http[s]://<host>:<port>/rest/virtualip/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

columns The columns to display.

filter A filter specifying VIPs to list. See for more information.Filters

limit The number of records to return.

nfsmacs The MAC addresses of servers running NFS.

output Whether the output should be terse or verbose.

range The VIP range.

start The index of the first record to return.


virtualip move

The virtualip move API reassigns a virtual IP or a range of virtual IP addresses to a specified Media Access Control (MAC) address.

Syntax

CLI
maprcli virtualip move
    [ -cluster <cluster name> ]
    -virtualip <virtualip>
    [ -virtualipend <virtualip end range> ]
    -tomac <mac>

REST
http[s]://<host>:<port>/rest/virtualip/move?<parameters>

Parameters

Parameter Description

cluster The name of the cluster where the virtual IP addresses are being moved.

virtualip A virtual IP address. If you provide a value for -virtualipend, this virtual IP address defines the beginning of the range.

virtualipend A virtual IP address that defines the end of a virtual IP address range.

tomac The MAC address to which the virtual IP addresses are being assigned.

Examples

Move a range of three virtual IP addresses to a MAC address for the cluster my.cluster.com:

CLI
maprcli virtualip move -cluster my.cluster.com -virtualip 192.168.0.8 -virtualipend 192.168.0.10 -tomac 00:FE:ED:CA:FE:99

REST
https://r1n1.sj.us:8443/rest/virtualip/move?cluster=my.cluster.com&virtualip=192.168.0.8&virtualipend=192.168.0.10&tomac=00%3AFE%3AED%3ACA%3AFE%3A99


virtualip remove

Removes a virtual IP (VIP) or a VIP range. Permissions required: fc or a

Syntax

CLI
maprcli virtualip remove
    [ -cluster <cluster> ]
    -virtualip <virtual IP>
    [ -virtualipend <Virtual IP Range End> ]

REST
http[s]://<host>:<port>/rest/virtualip/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

virtualip The virtual IP or the start of the VIP range to remove.

virtualipend The end of the VIP range to remove.
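Examples

Remove a VIP range (a sketch; the addresses are illustrative):

CLI
# Removes VIPs 192.168.0.8 through 192.168.0.10
maprcli virtualip remove -virtualip 192.168.0.8 -virtualipend 192.168.0.10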


job

The job commands enable you to manipulate information about the Hadoop jobs that are running on your cluster:

job changepriority - Changes the priority of a specific job.
job kill - Kills a specific job.
job linklogs - Uses Centralized Logging to create symbolic links to all the logs relating to the activity of a specific job.
job table - Retrieves detailed information about the jobs running on the cluster.


job changepriority

Changes the priority of the specified job.

Syntax

CLI
maprcli job changepriority
    [ -cluster <cluster name> ]
    -jobid <job ID>
    -priority NORMAL|LOW|VERY_LOW|HIGH|VERY_HIGH

REST
http[s]://<host>:<port>/rest/job/changepriority?<parameters>

Parameters

Parameter Description

cluster Cluster name

jobid Job ID

priority New job priority

Examples

Changing a Job's Priority:

CLI
maprcli job changepriority -jobid job_201120603544_8282 -priority LOW

REST
https://r1n1.sj.us:8443/rest/job/changepriority?jobid=job_201120603544_8282&priority=LOW


job kill

The job kill API kills the specified job.

Syntax

CLI
maprcli job kill
    [ -cluster <cluster name> ]
    -jobid <job ID>

REST
http[s]://<host>:<port>/rest/job/kill?[cluster=cluster_name&]jobid=job_ID

Parameters

Parameter Description

cluster Cluster name

jobid Job ID

Examples

Killing a Job

CLI
maprcli job kill -jobid job_201120603544_8282

REST
https://r1n1.sj.us:8443/rest/job/kill?jobid=job_201120603544_8282


job linklogs

The maprcli job linklogs command performs Centralized Logging, which provides a job-centric view of all log files generated by tracker nodes during job execution.

The output of job linklogs is a directory populated with symbolic links to all log files related to tasks, map attempts, and reduce attempts pertaining to the specified job(s). The command can be performed during or after a job.

Syntax

CLI
maprcli job linklogs -jobid <jobPattern> -todir <destinationDirectory>

REST
http[s]://<host>:<port>/rest/job/linklogs?jobid=<jobPattern>&todir=<destinationDirectory>

Parameters

Parameter Description

jobid A regular expression specifying the target jobs.

todir The target location to dump the Centralized Logging output directories.

Output

The following directory structure will be created in the location specified by todir for all jobids matching the jobid parameter.

<jobid>/hosts/<host>/ contains symbolic links to log directories of tasks executed for <jobid> on <host>
<jobid>/mappers/ contains symbolic links to log directories of all map task attempts for <jobid> across the whole cluster
<jobid>/reducers/ contains symbolic links to log directories of all reduce task attempts for <jobid> across the whole cluster

Examples

Link logs for all jobs named "wordcount1" and dump output to /myvolume/joblogviewdir:

CLI
maprcli job linklogs -jobid job_*_wordcount1 -todir /myvolume/joblogviewdir

REST
https://r1n1.sj.us:8443/rest/job/linklogs?jobid=job_*_wordcount1&todir=/myvolume/joblogviewdir


job table

Retrieves histograms and line charts for job metrics.

Use the job table API to retrieve job metrics for your job. The metrics data can be formatted for histogram display or line chart display.

Syntax

REST
http[s]://<host>:<port>/api/job/table?output=terse&filter=<string>&chart=<chart_type>&columns=<list_of_columns>&scale=<scale_type>

Parameters

Parameter Description

filter Filters results to match the value of a specified string.

chart Chart type to use: line for a line chart, bar for a histogram.

columns Comma-separated list of column names to return.

bincount Number of histogram bins.

scale Scale to use for the histogram. Specify linear for a linear scale and log for a logarithmic scale.

Column Names

Parameter Description Notes

jmadavg Job Average Map Attempt Duration  

jradavg Job Average Reduce Attempt Duration  

jtadavg Job Average Task Duration Filter Only

jcmtct Job Complete Map Task Count Filter Only

jcrtct Job Complete Reduce Task Count Filter Only

jctct Job Complete Task Count Filter Only

jccpu Job Cumulative CPU  

jcmem Job Cumulative Physical Memory  

jcpu Job Current CPU Filter Only

jmem Job Current Memory Filter Only

jfmtact Job Failed Map Task Attempt Count  

jfmtct Job Failed Map Task Count  

jfrtact Job Failed Reduce Task Attempt Count  

jfrtct Job Failed Reduce Task Count  

jftact Job Failed Task Attempt Count Filter Only

jftct Job Failed Task Count Filter Only

jmibps Job Map Input Bytes Rate Per-second throughput rate

jmirps Job Map Input Records Rate Per-second throughput rate

jmobps Job Map Output Bytes Rate Per-second throughput rate

Page 529: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 527For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

jmorps Job Map Output Records Rate Per-second throughput rate

jmtact Job Map Task Attempt Count  

jmtct Job Map Task Count  

jmadmax Job Maximum Map Attempt Duration  

jradmax Job Maximum Reduce Attempt Duration  

jtadmax Job Maximum Task Duration Filter Only

jribps Job Reduce Input Bytes Rate  

jrirps Job Reduce Input Records Rate  

jrobps Job Reduce Output Bytes Rate  

jrorps Job Reduce Output Records Rate  

jrsbps Job Reduce Shuffle Bytes Rate  

jrtact Job Reduce Task Attempt Count  

jrtct Job Reduce Task Count  

jrumtct Job Running Map Task Count Filter Only

jrurtct Job Running Reduce Task Count Filter Only

jrutct Job Running Task Count Filter Only

jtact Job Task Attempt Count Filter Only

jtct Job Total Task Count Filter Only

jd Job Duration Histogram Only

Examples

Retrieve a Histogram:

REST
https://r1n1.sj.us:8443/api/job/table?chart=bar&filter=[tt!=JOB_SETUP]and[tt!=JOB_CLEANUP]and[jid==job_201129649560_3390]&columns=td&bincount=28&scale=log

CURL
curl -d @json https://r1n1.sj.us:8443/api/job/table

In the curl example above, the json file contains a URL-encoded version of the information in the Request section below.

Request

Page 530: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 528For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

GENERAL_PARAMS:
{
    [chart: "bar" | "line",]
    columns: <comma-sep list of column terse names>,
    [filter: "[<terse_field><operator><value>]and[...]",]
    [output: terse,]
    [start: int,]
    [limit: int]
}

REQUEST_PARAMS_HISTOGRAM:
{
    chart: bar
    columns: jd
    filter: <anything>
}

REQUEST_PARAMS_LINE:
{
    chart: line,
    columns: jmem,
    filter: NOT PARSED, UNUSED IN BACKEND
}

REQUEST_PARAMS_GRID:
{
    columns: jid,jn,js,jd
    filter: <any real filter expression>
    output: terse,
    start: 0,
    limit: 50
}

Response

RESPONSE_SUCCESS_HISTOGRAM:
{
    "status": "OK",
    "total": 15,
    "columns": ["jd"],
    "binlabels": ["0-5s","5-10s","10-30s","30-60s","60-90s","90s-2m","2m-5m","5m-10m","10m-30m","30m-1h","1h-2h","2h-6h","6h-12h","12h-24h",">24h"],
    "binranges": [
        [0,5000],
        [5000,10000],
        [10000,30000],
        [30000,60000],
        [60000,90000],
        [90000,120000],
        [120000,300000],
        [300000,600000],
        [600000,1800000],
        [1800000,3600000],
        [3600000,7200000],
        [7200000,21600000],
        [21600000,43200000],
        [43200000,86400000],
        [86400000]
    ],
    "data": [33,919,1,133,9820,972,39,2,44,80,11,93,31,0,0]
}

RESPONSE_SUCCESS_GRID:
{
    "status": "OK",
    "total": 50,
    "columns": ["jid","jn","ju","jg","js","jcpu","jccpu","jmem","jcmem","jpri","jmpro","jrpro","jsbt","jst","jft","jd","jmtct","jfmtact","jrtct","jmtact","jrtact","jtact","jfrtact","jftact","jfmtct","jfrtct","jftct","jtct","jrumtct","jrurtct","jrutct","jctct","jcmtct","jcrtct","jmirps","jmorps","jmibps","jmobps","jrirps","jrorps","jribps","jrobps","jtadavg","jmadavg","jmadmax","jtadmax","jradavg","jradmax"],
    "data": [
        ["job_201210216041_7311","Billboard Top 10","heman","jobberwockies","PREP",69,9106628,857124,181087410,"LOW",30,48,1309992275580,1316685654403,1324183149687,7497495284,72489,25227,6171223,95464,6171184,6266648,-38,25189,13115,-4,13111,6243712,5329,4,6243712,6225268,54045,6171223,403,128,570,137,172,957,490,179,246335,367645,1758151,1758151,125024,514028],
        ["job_201129309372_8897","Super Big","srivas","jobberwockies","KILLED",59,3125830,2895159,230270693,"LOW",91,1,1313111819653,1323504739893,1326859602015,3354862122,8705980,3774739,7691269,12515000,16273631,28788631,8196156,11970895,2706470,4365698,7072168,16397249,215570,35,16397249,9109476,5783940,3325536,707,509,345,463,429,93,88,752,406336,553455,3392429,3392429,259216,511285],
        ["job_201165490737_7144","Trending Human Interaction","mickey","jobberwockies","PREP",100,1304791,504092,728635524,"VERY_LOW",57,90,1301684627596,1331548331890,1331592957521,44625631,7389503,3770494,5433308,11011495,15048822,26060317,9769362,13539856,2544349,4315172,6859521,12822811,21932,327,12822811,5941031,4823222,1117809,739,654,561,426,925,23,420,597,292024,470314,1854688,1854688,113733,566672],
        ["job_201152533959_6159","Star Search","darth","fbi","FAILED",82,7151113,2839682,490527441,"NORMAL",51,61,1305367042224,1325920952001,1327496965896,1576013895,8939964,2041524,4965024,10795895,8786681,19582576,3924842,5966366,1130482,3422544,4553026,13904988,833761,2,13904988,8518199,6975721,1542478,665,916,10,34,393,901,608,916,186814,331708,2500504,2500504,41920,251453]
    ]
}

RESPONSE_SUCCESS_LINE:
{
    "status": "OK",
    "total": 22,
    "columns": ["jcmem"],
    "data": [
        [1329891055016,0],
        [1329891060016,8],
        [1329891065016,16],
        [1329891070016,1024],
        [1329891075016,2310],
        [1329891080016,3243],
        [1329891085016,4345],
        [1329891090016,7345],
        [1329891095016,7657],
        [1329891100016,8758],
        [1329891105016,9466],
        [1329891110016,10345],
        [1329891115016,235030],
        [1329891120016,235897],
        [1329891125016,287290],
        [1329891130016,298390],
        [1329891135016,301355],
        [1329891140016,302984],
        [1329891145016,303985],
        [1329891150016,304403],
        [1329891155016,503030],
        [1329891160016,983038]
    ]
}


volume

The volume commands let you work with volumes, snapshots and mirrors:

volume create - creates a volume
volume dump create - creates a volume dump
volume dump restore - restores a volume from a volume dump
volume info - displays information about a volume
volume link create - creates a symbolic link
volume link remove - removes a symbolic link
volume list - lists volumes in the cluster
volume mirror push - pushes a volume's changes to its local mirrors
volume mirror start - starts mirroring a volume
volume mirror stop - stops mirroring a volume
volume modify - modifies a volume
volume mount - mounts a volume
volume move - moves a volume
volume remove - removes a volume
volume rename - renames a volume
volume snapshot create - creates a volume snapshot
volume snapshot list - lists volume snapshots
volume snapshot preserve - prevents a volume snapshot from expiring
volume snapshot remove - removes a volume snapshot
volume unmount - unmounts a volume


volume create

Creates a volume. Permissions required: cv, fc, or a

Syntax

CLI
maprcli volume create
    -name <volume name>
    -type 0|1
    [ -advisoryquota <advisory quota> ]
    [ -ae <accounting entity> ]
    [ -aetype <accounting entity type> ]
    [ -cluster <cluster> ]
    [ -createparent 0|1 ]
    [ -group <list of group:allowMask> ]
    [ -localvolumehost <localvolumehost> ]
    [ -localvolumeport <localvolumeport> ]
    [ -maxinodesalarmthreshold <maxinodesalarmthreshold> ]
    [ -minreplication <minimum replication factor> ]
    [ -mount 0|1 ]
    [ -path <mount path> ]
    [ -quota <quota> ]
    [ -readonly <read-only status> ]
    [ -replication <replication factor> ]
    [ -replicationtype <type> ]
    [ -rereplicationtimeoutsec <seconds> ]
    [ -rootdirperms <root directory permissions> ]
    [ -schedule <ID> ]
    [ -source <source> ]
    [ -topology <topology> ]
    [ -user <list of user:allowMask> ]

REST
http[s]://<host>:<port>/rest/volume/create?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

ae The accounting entity that owns the volume.

aetype The type of accounting entity:

0 = user
1 = group

cluster The cluster on which to create the volume.

createparent Specifies whether or not to create a parent volume:

0 = Do not create a parent volume.
1 = Create a parent volume.

group Space-separated list of group:permission pairs.

localvolumehost The local volume host.

localvolumeport The local volume port. Default: 5660


maxinodesalarmthreshold Threshold for the INODES_EXCEEDED alarm.

minreplication The minimum replication level. Default: 2. When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level. If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication.

mount Specifies whether the volume is mounted at creation time.

name The name of the volume to create.

path The path at which to mount the volume.

quota The quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

readonly Specifies whether or not the volume is read-only:

0 = Volume is read/write.
1 = Volume is read-only.

replication The desired replication level. Default: 3. When the number of copies falls below the desired replication factor, but remains equal to or above the minimum replication factor, re-replication occurs after the timeout specified in the cldb.fs.mark.rereplicate.sec parameter.

replicationtype The desired replication type. You can specify low_latency (star replication) or high_throughput (chain replication). The default setting is high_throughput.

rereplicationtimeoutsec The re-replication timeout, in seconds.

rootdirperms Permissions on the volume root directory.

schedule The ID of a schedule. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync with its source volume (mirror volume) on the specified schedule. Use the schedule list command to find the ID of the named schedule you wish to apply to the volume.

source For mirror volumes, the source volume to mirror, in the format <source volume>@<cluster> (required when creating a mirror volume).

topology The rack path to the volume.

user Space-separated list of user:permission pairs.

type The type of volume to create:

0 - standard volume
1 - mirror volume

Examples

Create the volume "test-volume" mounted at "/test/test-volume":

CLI
maprcli volume create -name test-volume -path /test/test-volume

REST
https://r1n1.sj.us:8443/rest/volume/create?name=test-volume&path=/test/test-volume

Create Volume with a Quota and an Advisory Quota

This example creates a volume with the following parameters:

advisoryquota: 100M
name: volumename
path: /volumepath
quota: 500M
replication: 3
schedule: 2
topology: /East Coast
type: 0

CLI
maprcli volume create -name volumename -path /volumepath -advisoryquota 100M -quota 500M -replication 3 -schedule 2 -topology "/East Coast" -type 0

REST
https://r1n1.sj.us:8443/rest/volume/create?advisoryquota=100M&name=volumename&path=/volumepath&quota=500M&replication=3&schedule=2&topology=/East%20Coast&type=0

Create the mirror volume "test-volume.mirror" from source volume "test-volume" and mount at "/test/test-volume-mirror":

CLI
maprcli volume create -name test-volume.mirror -source test-volume@src-cluster-name -path /test/test-volume-mirror

REST
https://r1n1.sj.us:8443/rest/volume/create?name=test-volume.mirror&source=test-volume@src-cluster-name&path=/test/test-volume-mirror
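The REST rows above show only the request URL. Invoked from a shell, the request must also carry credentials; a minimal sketch with curl, assuming HTTP basic authentication with a cluster user (the mapr:mapr credentials and the -k flag for a self-signed certificate are placeholders, not part of the original examples):

# Placeholder credentials; -k skips certificate verification on a test cluster
curl -k -u mapr:mapr \
  'https://r1n1.sj.us:8443/rest/volume/create?name=test-volume&path=/test/test-volume'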


volume dump create

The volume dump create command creates a volume dump file containing data from a volume for distribution or restoration. Permissions required: dump, fc, or a

You can use volume dump create to create two types of files:

full dump files containing all data in a volume
incremental dump files that contain changes to a volume between two points in time

A full dump file is useful for restoring a volume from scratch. An incremental dump file contains the changes necessary to take an existing (or restored) volume from one point in time to another. Along with the dump file, a full or incremental dump operation can produce a state file (specified by the -e parameter) that contains a table of the version number of every container in the volume at the time the dump file was created. This represents the end point of the dump file, which is used as the start point of the next incremental dump. The main difference between creating a full dump and creating an incremental dump is whether the -s parameter is specified; if -s is not specified, the volume dump create command includes all volume data and creates a full dump file. If you create a full dump followed by a series of incremental dumps, the result is a sequence of dump files and their accompanying state files:

dumpfile1 statefile1

dumpfile2 statefile2

dumpfile3 statefile3

...

To maintain an up-to-date dump of a volume:

1. Create a full dump file. Example:

maprcli volume dump create -name cli-created -dumpfile fulldump1 -e statefile1

2. Periodically, add an incremental dump file. Examples:

maprcli volume dump create -s statefile1 -e statefile2 -name cli-created -dumpfile incrdump1
maprcli volume dump create -s statefile2 -e statefile3 -name cli-created -dumpfile incrdump2
maprcli volume dump create -s statefile3 -e statefile4 -name cli-created -dumpfile incrdump3

...and so on.

You can restore the volume from scratch, using the volume dump restore command with the full dump file, followed by each incremental dump file in sequence. A scripted version of this cycle is sketched below.
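The full-plus-incremental cycle above lends itself to scripting. A minimal sketch, assuming the volume and file-name sequence used in the examples (the increment-counter argument is illustrative, not part of the command itself):

#!/bin/bash
# Rolling dump for volume "cli-created"; pass the current increment number.
# File names (statefileN, incrdumpN) follow the sequence shown above.
N=$1
if [ "$N" -le 1 ]; then
  # First run: full dump plus the initial state file
  maprcli volume dump create -name cli-created -dumpfile fulldump1 -e statefile1
else
  PREV=$((N - 1))
  # Later runs: incremental dump from the previous state file
  maprcli volume dump create -name cli-created \
      -s "statefile$PREV" -e "statefile$N" -dumpfile "incrdump$PREV"
fi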

Syntax

CLI
maprcli volume dump create
    [ -cluster <cluster> ]
    [ -s <start state file> ]
    [ -e <end state file> ]
    [ -o ]
    [ -dumpfile <dump file> ]
    -name volumename

REST None.

Parameters

Parameter Description


cluster The cluster on which to run the command.

dumpfile The name of the dump file (ignored if -o is used).

e The name of the state file to create for the end point of the dump.

name A volume name.

o This option dumps the volume to stdout instead of to a file.

s The start point for an incremental dump.

Examples

Create a full dump:

CLI
maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume

Create an incremental dump:

CLI
maprcli volume dump create -s statefile1 -e statefile2 -name volume -dumpfile incrdump1


volume dump restore

The volume dump restore command restores or updates a volume from a dump file. Permissions required: dump, fc, or a

There are two ways to use volume dump restore:

With a full dump file, volume dump restore recreates a volume from scratch from volume data stored in the dump file.
With an incremental dump file, volume dump restore updates a volume using incremental changes stored in the dump file.

The volume that results from a volume dump restore operation is a mirror volume whose source is the volume from which the dump was created. After the operation, this volume can perform mirroring from the source volume.

When you are updating a volume from an incremental dump file, you must specify an existing volume and an incremental dump file. To restore from a sequence of previous dump files, first restore from the volume's full dump file, then apply each subsequent incremental dump file.

A restored volume may contain mount points that represent volumes that were mounted under the original source volume from which the dump was created. In the restored volume, these mount points have no meaning and do not provide access to any volumes that were mounted under the source volume. If the source volume still exists, then the mount points in the restored volume will work if the restored volume is associated with the source volume as a mirror.

To restore from a full dump plus a sequence of incremental dumps:

1. Restore from the full dump file, using the -n option to create a new mirror volume and the -name option to specify the name. Example:

maprcli volume dump restore -dumpfile fulldump1 -name restore1 -n

2. Restore from each incremental dump file in order, specifying the same volume name. Examples:

maprcli volume dump restore -dumpfile incrdump1 -name restore1
maprcli volume dump restore -dumpfile incrdump2 -name restore1
maprcli volume dump restore -dumpfile incrdump3 -name restore1

...and so on.
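The same procedure can be scripted; a minimal sketch, assuming the dump files from the volume dump create examples are in the working directory:

#!/bin/bash
# Recreate volume "restore1" from the full dump, then replay increments in order.
maprcli volume dump restore -dumpfile fulldump1 -name restore1 -n
for f in incrdump1 incrdump2 incrdump3; do
  maprcli volume dump restore -dumpfile "$f" -name restore1
done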

Syntax

CLI
maprcli volume dump restore
    [ -cluster <cluster> ]
    [ -dumpfile dumpfilename ]
    [ -i ]
    [ -n ]
    -name <volume name>

REST None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

dumpfile The name of the dumpfile (ignored if -i is used).

i This option reads the dump file from stdin.

n This option creates a new volume if it doesn't exist.

name A volume name, in the form volumename


Examples

Restore a volume from a full dump file:

CLI
maprcli volume dump restore -name volume -dumpfile fulldump1

Apply an incremental dump file to a volume:

CLI
maprcli volume dump restore -name volume -dumpfile incrdump1


volume fixmountpath

Corrects the mount path of a volume. Permissions required: fc or a

The CLDB maintains information about the mount path of every volume. If a directory in a volume's path is renamed (by a hadoop fs command, for example) the information in the CLDB will be out of date. The volume fixmountpath command does a reverse path walk from the volume and corrects the mount path information in the CLDB.

Syntax

CLI
maprcli volume fixmountpath
    -name <name>
    [ -cluster <clustername> ]

REST
http[s]://<host>:<port>/rest/volume/fixmountpath?<parameters>

Parameters

Parameter Description

name The volume name.

clustername The cluster name

Examples

Fix the mount path of volume v1:

CLI
maprcli volume fixmountpath -name v1

REST
https://r1n1.sj.us:8443/rest/volume/fixmountpath?name=v1


volume info

Displays information about the specified volume.

Syntax

CLI
maprcli volume info
    [ -cluster <cluster> ]
    [ -name <volume name> ]
    [ -output terse|verbose ]
    [ -path <path> ]

REST
http[s]://<host>:<port>/rest/volume/info?<parameters>

Parameters

You must specify either name or path.

Parameter Description

cluster The cluster on which to run the command.

name The volume for which to retrieve information.

output Whether the output should be terse or verbose.

path The mount path of the volume for which to retrieve information.


volume link create

Creates a link to a volume. Permissions required: fc or a

Syntax

CLI
maprcli volume link create
    [ -cluster <clustername> ]
    -path <path>
    -type <type>
    -volume <volume>

REST
http[s]://<host>:<port>/rest/volume/link/create?<parameters>

Parameters

Parameter Description

path The path parameter specifies the link path and other information, using the following syntax:

/link/[maprfs::][volume::]<volume type>::<volume name>

link - the link path
maprfs - a keyword to indicate a special MapR-FS link
volume - a keyword to indicate a link to a volume
volume type - writeable or mirror
volume name - the name of the volume

Example:

/abc/maprfs::mirror::abc

type The volume type: writeable or mirror.

volume The volume name.

clustername The cluster name.

Examples

Create a link to v1 at the path /v1.mirror:

CLI
maprcli volume link create -volume v1 -type mirror -path /v1.mirror

REST
https://r1n1.sj.us:8443/rest/volume/link/create?path=/v1.mirror&type=mirror&volume=v1


volume link remove

Removes the specified symbolic link. Permissions required: fc or a

Syntax

CLI
maprcli volume link remove
    -path <path>
    [ -cluster <clustername> ]

REST
http[s]://<host>:<port>/rest/volume/link/remove?<parameters>

Parameters

Parameter Description

path The symbolic link to remove. The path parameter specifies the link path and other information about the symbolic link, using the following syntax:

/link/[maprfs::][volume::]<volume type>::<volume name>

link - the symbolic link path
maprfs - a keyword to indicate a special MapR-FS link
volume - a keyword to indicate a link to a volume
volume type - writeable or mirror
volume name - the name of the volume

Example:

/abc/maprfs::mirror::abc

clustername The cluster name.

Examples

Remove the link /abc:

CLI
maprcli volume link remove -path /abc/maprfs::mirror::abc

REST
https://r1n1.sj.us:8443/rest/volume/link/remove?path=/abc/maprfs::mirror::abc


volume list

Lists information about volumes specified by name, path, or filter.

Syntax

CLI
maprcli volume list
    [ -alarmedvolumes 1 ]
    [ -cluster <cluster> ]
    [ -columns <columns> ]
    [ -filter <filter> ]
    [ -limit <limit> ]
    [ -nodes <nodes> ]
    [ -output terse | verbose ]
    [ -start <offset> ]

REST
http[s]://<host>:<port>/rest/volume/list?<parameters>

Parameters

Parameter Description

alarmedvolumes Specifies whether to list alarmed volumes only.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying volumes to list. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

nodes A list of nodes. If specified, volume list only lists volumes on the specified nodes.

output Specifies whether the output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0
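For example, a hypothetical invocation that combines these parameters to page through mounted volumes, returning only a few of the fields listed below (the bracketed filter grammar is the [<field><operator><value>] form described under Filters; the field choices here are illustrative):

maprcli volume list -columns volumename,mountdir,used \
    -filter '[mounted==1]' -limit 10 -start 0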

Field Description

volumeid Unique volume ID.

volumetype Volume type:

0 = normal volume
1 = mirror volume

volumename Unique volume name.

mountdir Unique volume path (may be null if the volume is unmounted).


mounted Volume mount status:

0 = unmounted
1 = mounted

rackpath Rack path.

creator Username of the volume creator.

aename Accountable entity name.

aetype Accountable entity type:

0 = user
1 = group

uacl Users ACL (comma-separated list of user names).

gacl Group ACL (comma-separated list of group names).

quota Quota, in MB; 0 = no quota.

advisoryquota Advisory quota, in MB; 0 = no advisory quota.

used Disk space used, in MB, not including snapshots.

snapshotused Disk space used for all snapshots, in MB.

totalused Total space used for volume and snapshots, in MB.

readonly Read-only status:

0 = read/write
1 = read-only

numreplicas Desired replication factor (number of replications).

minreplicas Minimum replication factor (number of replications)

actualreplication The actual current replication factor by percentage of the volume, as a zero-based array of integers from 0 to 100. For each position in the array, the value is the percentage of the volume that is replicated index number of times. Example: arf=[5,10,85] means that 5% is not replicated, 10% is replicated once, 85% is replicated twice.

schedulename The name of the schedule associated with the volume.

scheduleid The ID of the schedule associated with the volume.

mirrorSrcVolumeId Source volume ID (mirror volumes only).

mirrorSrcVolume Source volume name (mirror volumes only).

mirrorSrcCluster The cluster where the source volume resides (mirror volumes only).

lastSuccessfulMirrorTime Last successful Mirror Time, milliseconds since 1970 (mirror volumes only).

mirrorstatus Mirror Status (mirror volumes only):

0 = success
1 = running
2 = error

mirror-percent-complete Percent completion of last/current mirror (mirror volumes only).

snapshotcount Snapshot count.

SnapshotFailureAlarm Status of SNAPSHOT_FAILURE alarm:

0 = Clear
1 = Raised


AdvisoryQuotaExceededAlarm Status of VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED alarm:

0 = Clear
1 = Raised

QuotaExceededAlarm Status of VOLUME_QUOTA_EXCEEDED alarm:

0 = Clear
1 = Raised

MirrorFailureAlarm Status of MIRROR_FAILURE alarm:

0 = Clear
1 = Raised

DataUnderReplicatedAlarm Status of DATA_UNDER_REPLICATED alarm:

0 = Clear
1 = Raised

DataUnavailableAlarm Status of DATA_UNAVAILABLE alarm:

0 = Clear
1 = Raised

Output

Information about the specified volumes.


mirrorstatus QuotaExceededAlarm numreplicas schedulename DataUnavailableAlarm volumeid rackpath volumename used volumetype SnapshotFailureAlarm mirrorDataSrcVolumeId advisoryquota aetype creator snapshotcount quota mountdir scheduleid snapshotused MirrorFailureAlarm AdvisoryQuotaExceededAlarm minreplicas mirrorDataSrcCluster actualreplication aename mirrorSrcVolumeId mirrorId mirrorSrcCluster lastSuccessfulMirrorTime nextMirrorId mirrorDataSrcVolume mirrorSrcVolume mounted logicalUsed readonly totalused DataUnderReplicatedAlarm mirror-percent-complete
0 0 3 every15min 0 362 / ATS-Run-2011-01-31-160018 864299 0 0 0 0 0 root 3 0 /ATS-Run-2011-01-31-160018 4 1816201 0 0 1 ... root 0 0 0 0 1 2110620 0 2680500 0 0
0 0 3 0 12 / mapr.cluster.internal 0 0 0 0 0 0 root 0 0 /var/mapr/cluster 0 0 0 0 1 ... root 0 0 0 0 1 0 0 0 0 0
0 0 3 0 11 / mapr.cluster.root 1 0 0 0 0 0 root 0 0 / 0 0 0 0 1 ... root 0 0 0 0 1 1 0 1 0 0
0 0 10 0 21 / mapr.jobtracker.volume 1 0 0 0 0 0 root 0 0 /var/mapr/cluster/mapred/jobTracker 0 0 0 0 1 ... root 0 0 0 0 1 1 0 1 0 0
0 0 3 0 1 / mapr.kvstore.table 0 0 0 0 0 0 root 0 0 0 0 0 0 1 ... root 0 0 0 0 0 0 0 0 0 0

Output Fields

See the Fields table above.


volume mirror push

Pushes the changes in a volume to all of its mirror volumes in the same cluster, and waits for each mirroring operation to complete. Use this command when you need to push recent changes.

Syntax

CLI
maprcli volume mirror push
    [ -cluster <cluster> ]
    -name <volume name>
    [ -verbose true|false ]

REST None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume to push.

verbose Specifies whether the command output should be verbose. Default: true

Output

Sample Output

Starting mirroring of volume mirror1
Mirroring complete for volume mirror1
Successfully completed mirror push to all local mirrors of volume volume1

Examples

Push changes from the volume "volume1" to its local mirror volumes:

CLI
maprcli volume mirror push -name volume1 -cluster mycluster


volume mirror start

Starts mirroring on the specified volume from its source volume. License required: M5 Permissions required: fc or a

When a mirror is started, the mirror volume is synchronized from a hidden internal snapshot so that the mirroring process is not affected by any concurrent changes to the source volume. The volume mirror start command does not wait for mirror completion, but returns immediately. The changes to the mirror volume occur atomically at the end of the mirroring process; deltas transmitted from the source volume do not appear until mirroring is complete.

To provide rollback capability for the mirror volume, the mirroring process creates a snapshot of the mirror volume before starting the mirror, with the following naming format: <volume>.mirrorsnap.<date>.<time>.

Normally, the mirroring operation transfers only deltas from the last successful mirror. Under certain conditions (mirroring a volume repaired by fsck, for example), the source and mirror volumes can become out of sync. In such cases, it is impossible to transfer deltas, because the state is not the same for both volumes. Use the -full option to force the mirroring operation to transfer all data to bring the volumes back in sync.

Syntax

CLI
maprcli volume mirror start
    [ -cluster <cluster> ]
    [ -full true|false ]
    -name <volume name>

REST
http[s]://<host>:<port>/rest/volume/mirror/start?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

full Specifies whether to perform a full copy of all data. If false, only the deltas are copied.

name The volume for which to start the mirror.

Output

Sample Output

messages  Started mirror operation for volumes 'test-mirror'

Examples

Start mirroring the mirror volume "test-mirror":

CLI
maprcli volume mirror start -name test-mirror
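If the mirror and its source have fallen out of sync (after an fsck repair, for example), add the -full parameter described above to force a complete copy rather than a delta transfer:

CLI
maprcli volume mirror start -name test-mirror -full true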


volume mirror stop

Stops mirroring on the specified volume. License required: M5 Permissions required: fc or a

The volume mirror stop command lets you stop mirroring (for example, during a network outage). You can use the volume mirror start command to resume mirroring.

Syntax

CLI
maprcli volume mirror stop
    [ -cluster <cluster> ]
    -name <volume name>

REST
http[s]://<host>:<port>/rest/volume/mirror/stop?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume for which to stop the mirror.

Output

Sample Output

messages  Stopped mirror operation for volumes 'test-mirror'

Examples

Stop mirroring the mirror volume "test-mirror":

CLI
maprcli volume mirror stop -name test-mirror


volume modify

Modifies an existing volume. Permissions required: m, fc, or a

An error occurs if the name or path refers to a non-existent volume, or cannot be resolved.

Syntax

CLI
maprcli volume modify
    [ -cluster <cluster> ]
    -name <volume name>
    [ -source <source> ]
    [ -replication <replication> ]
    [ -minreplication <minimum replication> ]
    [ -user <list of user:allowMask> ]
    [ -group <list of group:allowMask> ]
    [ -aetype <aetype> ]
    [ -ae <accounting entity> ]
    [ -quota <quota> ]
    [ -advisoryquota <advisory quota> ]
    [ -readonly <readonly> ]
    [ -schedule <schedule ID> ]
    [ -maxinodesalarmthreshold <threshold> ]

REST
http[s]://<host>:<port>/rest/volume/modify?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

ae The accounting entity that owns the volume.

aetype The type of accounting entity:

0 = user
1 = group

cluster The cluster on which to run the command.

group Space-separated list of group:permission pairs.

minreplication The minimum replication level. Default: 0

name The name of the volume to modify.

quota The quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

readonly Specifies whether the volume is read-only. 

0 = read/write
1 = read-only

replication The desired replication level. Default: 0

schedule A schedule ID. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync with its source volume (mirror volume) on the specified schedule.


source (Mirror volumes only) The source volume from which a mirror volume receives updates, specified in the format <volume>@<cluster>.

user Space-separated list of user:permission pairs.

threshold Threshold for the INODES_EXCEEDED alarm.

Examples

Change the source volume of the mirror "test-mirror":

CLI
maprcli volume modify -name test-mirror -source volume-2@my-cluster

REST
https://r1n1.sj.us:8443/rest/volume/modify?name=test-mirror&source=volume-2@my-cluster
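The other parameters work the same way. For instance, a sketch that raises the quota and advisory quota of an existing volume (the volume name is illustrative):

CLI
maprcli volume modify -name test-volume -quota 500G -advisoryquota 400G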


volume mount

Mounts one or more specified volumes. Permissions required: mnt, fc, or a

Syntax

CLI
maprcli volume mount
    [ -cluster <cluster> ]
    -name <volume list>
    [ -path <path list> ]
    [ -createparent 0|1 ]

REST
http[s]://<host>:<port>/rest/volume/mount?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The name of the volume to mount.

path The path at which to mount the volume.

createparent Specifies whether or not to create a parent volume:

0 = Do not create a parent volume.
1 = Create a parent volume.

Examples

Mount the volume "test-volume" at the path "/test":

CLI
maprcli volume mount -name test-volume -path /test

REST
https://r1n1.sj.us:8443/rest/volume/mount?name=test-volume&path=/test


volume move

Moves the specified volume or mirror to a different topology. Permissions required: m, fc, or a

Syntax

CLI
maprcli volume move
    [ -cluster <cluster> ]
    -name <volume name>
    -topology <path>

REST
http[s]://<host>:<port>/rest/volume/move?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume name.

topology The new rack path to the volume.
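Examples

Move a volume to a new rack path (a minimal sketch based on the syntax above; the volume name and topology are illustrative):

CLI
maprcli volume move -name test-volume -topology "/East Coast"

REST
https://r1n1.sj.us:8443/rest/volume/move?name=test-volume&topology=/East%20Coast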


volume remove

Removes the specified volume or mirror. Permissions required: d, fc, or a

Syntax

CLI
maprcli volume remove
    [ -cluster <cluster> ]
    [ -force ]
    -name <volume name>
    [ -filter <filter> ]

REST
http[s]://<host>:<port>/rest/volume/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

force Forces the removal of the volume, even if it would otherwise be prevented.

name The volume name.

filter All volumes with names that match the filter are removed.
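Examples

Remove a single volume (a minimal sketch based on the syntax above; the volume name is illustrative):

CLI
maprcli volume remove -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/remove?name=test-volume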


volume rename

Renames the specified volume or mirror. Permissions required: m, fc, or a

Syntax

CLI
maprcli volume rename
    [ -cluster <cluster> ]
    -name <volume name>
    -newname <new volume name>

REST
http[s]://<host>:<port>/rest/volume/rename?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume name.

newname The new volume name.


volume showmounts

The volume showmounts API returns a list of mount points for the specified volume.

Syntax

CLI
maprcli volume showmounts
    [ -cluster <cluster name> ]
    -name <volume name>

REST
http[s]://<host>:<port>/rest/volume/showmounts?<parameters>

Parameters

Parameter Description

cluster name The name of the cluster hosting the volume.

volume name The name of the volume to return a list of mount points for.

Examples

Return the mount points for volume mapr.user.volume for the cluster my.cluster.com:

CLI
maprcli volume showmounts -cluster my.cluster.com -name mapr.user.volume

REST
https://r1n1.sj.us:8443/rest/volume/showmounts?cluster=my.cluster.com&name=mapr.user.volume


volume snapshot create

Creates a snapshot of the specified volume, using the specified snapshot name. License required: M5 Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot create
    [ -cluster <cluster> ]
    -snapshotname <snapshot>
    -volume <volume>

REST
http[s]://<host>:<port>/rest/volume/snapshot/create?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

snapshotname The name of the snapshot to create.

volume The volume for which to create a snapshot.

Examples

Create a snapshot called "test-snapshot" for volume "test-volume":

CLI
maprcli volume snapshot create -snapshotname test-snapshot -volume test-volume

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/create?volume=test-volume&snapshotname=test-snapshot
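Scripted snapshots commonly embed a timestamp in the name, echoing the <volume>.mirrorsnap.<date>.<time> pattern that mirroring uses. A minimal sketch (the volume name and name format are illustrative):

#!/bin/bash
# Create a date-stamped snapshot of one volume (names are illustrative).
VOL=test-volume
maprcli volume snapshot create -volume "$VOL" \
    -snapshotname "$VOL.manual.$(date +%Y-%m-%d.%H-%M-%S)"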


volume snapshot list

Displays info about a set of snapshots. You can specify the snapshots by volumes or paths, or by specifying a filter to select volumes with certain characteristics.

Syntax

CLI
maprcli volume snapshot list
    [ -cluster <cluster> ]
    [ -columns <fields> ]
    ( -filter <filter> | -path <volume path list> | -volume <volume list> )
    [ -limit <rows> ]
    [ -output terse|verbose ]
    [ -start <offset> ]

REST
http[s]://<host>:<port>/rest/volume/snapshot/list?<parameters>

Parameters

Specify exactly one of the following parameters: volume, path, or filter.

Parameter Description

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below. Default: none

filter A filter specifying snapshots to list. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Specifies whether the output should be terse or verbose. Default: verbose

path A comma-separated list of paths for which to list snapshots.

start The offset from the starting row according to sort. Default: 0

volume A comma-separated list of volumes for which to list snapshots.

Fields

The following table lists the fields used in the sort and columns parameters, and returned as output.

Field Description

snapshotid Unique snapshot ID.

snapshotname Snapshot name.

volumeid ID of the volume associated with the snapshot.

volumename Name of the volume associated with the snapshot.

volumepath Path to the volume associated with the snapshot.

ownername Owner (user or group) associated with the volume.


ownertype Owner type for the owner of the volume: 

0 = user
1 = group

dsu Disk space used for the snapshot, in MB.

creationtime Snapshot creation time, milliseconds since 1970

expirytime Snapshot expiration time, milliseconds since 1970; 0 = never expires.

Output

The specified columns about the specified snapshots.

Sample Output

creationtime ownername snapshotid snapshotname expirytime diskspaceused volumeid volumename ownertype volumepath
1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001 1063191 362 ATS-Run-2011-01-31-160018 1 /dummy
1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057 753010 362 ATS-Run-2011-01-31-160018 1 /dummy
1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0 362 ATS-Run-2011-01-31-160018 1 /dummy
1289152800001 dummy 102 test-volume-2.2010-11-07.10:00:00 1289239200001 0 14 test-volume-2 1 /dummy

Output Fields

See the Fields table above.

Examples

List all snapshots:

CLI
maprcli volume snapshot list

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/list


volume snapshot preserve

Preserves one or more snapshots from expiration. Specify the snapshots by volumes, paths, filter, or IDs. License required: M5 Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot preserve
    [ -cluster <cluster> ]
    ( -filter <filter> | -path <volume path list> | -snapshots <snapshot list> | -volume <volume list> )

REST
http[s]://<host>:<port>/rest/volume/snapshot/preserve?<parameters>

Parameters

Specify exactly one of the following parameters: volume, path, filter, or snapshots.

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying snapshots to preserve. See for more information.Filters

path A comma-separated list of paths for which to preserve snapshots.

snapshots A comma-separated list of snapshot IDs to preserve.

volume A comma-separated list of volumes for which to preserve snapshots.

Examples

Preserve two snapshots by ID:

First, use volume snapshot list to get the IDs of the snapshots you wish to preserve. Example:

# maprcli volume snapshot list
creationtime ownername snapshotid snapshotname expirytime diskspaceused volumeid volumename ownertype volumepath
1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001 1063191 362 ATS-Run-2011-01-31-160018 1 /dummy
1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057 753010 362 ATS-Run-2011-01-31-160018 1 /dummy
1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0 362 ATS-Run-2011-01-31-160018 1 /dummy
1289152800001 dummy 102 test-volume-2.2010-11-07.10:00:00 1289239200001 0 14 test-volume-2 1 /dummy

Use the IDs in the volume snapshot preserve command. For example, to preserve the first two snapshots in the above list, run the commands as follows:

Page 563: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 561For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

CLI
maprcli volume snapshot preserve -snapshots 363,364

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/preserve?snapshots=363,364
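When the goal is simply to keep every snapshot of particular volumes, the -volume form listed in the parameters table avoids copying IDs by hand (volume names are illustrative):

CLI
maprcli volume snapshot preserve -volume test-volume,test-volume-2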


volume snapshot remove

Removes one or more snapshots. License required: M5 Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot remove
    [ -cluster <cluster> ]
    ( -snapshotname <snapshot name> | -snapshots <snapshots> | -volume <volume name> )

REST
http[s]://<host>:<port>/rest/volume/snapshot/remove?<parameters>

Parameters

Specify exactly one of the following parameters: snapshotname,  snapshots, or volume.

Parameter Description

cluster The cluster on which to run the command.

snapshotname The name of the snapshot to remove.

snapshots A comma-separated list of IDs of snapshots to remove.

volume The name of the volume from which to remove the snapshot.

Examples

Remove the snapshot "test-snapshot":

CLI
maprcli volume snapshot remove -snapshotname test-snapshot

REST
https://10.250.1.79:8443/api/volume/snapshot/remove?snapshotname=test-snapshot

Remove two snapshots by ID:

First, use volume snapshot list to get the IDs of the snapshots you wish to remove. Example:

# maprcli volume snapshot list
creationtime ownername snapshotid snapshotname expirytime diskspaceused volumeid volumename ownertype volumepath
1296788400768 dummy 363 ATS-Run-2011-01-31-160018.2011-02-03.19-00-00 1296792000001 1063191 362 ATS-Run-2011-01-31-160018 1 /dummy
1296789308786 dummy 364 ATS-Run-2011-01-31-160018.2011-02-03.19-15-02 1296792902057 753010 362 ATS-Run-2011-01-31-160018 1 /dummy
1296790200677 dummy 365 ATS-Run-2011-01-31-160018.2011-02-03.19-30-00 1296793800001 0 362 ATS-Run-2011-01-31-160018 1 /dummy
1289152800001 dummy 102 test-volume-2.2010-11-07.10:00:00 1289239200001 0 14 test-volume-2 1 /dummy


Use the IDs in the volume snapshot remove command. For example, to remove the first two snapshots in the above list, run the commands as follows:

CLI
maprcli volume snapshot remove -snapshots 363,364

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/remove?snapshots=363,364


volume unmount

Unmounts one or more mounted volumes. Permissions required: mnt, fc, or a

Syntax

CLI
maprcli volume unmount
    [ -cluster <cluster> ]
    [ -force 1 ]
    -name <volume name>

REST
http[s]://<host>:<port>/rest/volume/unmount?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

force Specifies whether to force the volume to unmount.

name The name of the volume to unmount.

Examples

Unmount the volume "test-volume":

CLI
maprcli volume unmount -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/unmount?name=test-volume


Metrics API

A Hadoop job sets the rules that the JobTracker service uses to break an input data set into discrete tasks and assign those tasks to individual nodes. The MapR Metrics service provides two API calls that enable you to retrieve grids of job data or task attempt data depending on the parameters you send:

/api/job/table retrieves information about the jobs running on your cluster. You can use this API to retrieve information about the number of task attempts for jobs on the cluster, job duration, job computing resource use (CPU and memory), and job data throughput (both records and bytes per second).
/api/task/table retrieves information about the tasks that make up a specific job, as well as the specific task attempts. You can use this API to retrieve information about a task attempt's data throughput, measured in number of records per second as well as in bytes per second.

Both of these APIs provide robust filtering capabilities to display data with a high degree of specificity.
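Both calls are ordinary HTTPS requests. A hypothetical authenticated query (the host and credentials are placeholders; the bracket characters in the filter must be URL-encoded, %5B and %5D standing for [ and ], as in the task table example later in this document):

# Hypothetical job-table query filtered to a single job ID
curl -k -u mapr:mapr \
  'https://r1n1.sj.us:8443/api/job/table?filter=%5Bjid==job_201129649560_3390%5D'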


task

The task commands enable you to manipulate information about the Hadoop jobs that are running on your cluster:

task killattempt - Kills a specific task attempt.
task failattempt - Ends a specific task attempt as failed.
task table - Retrieves detailed information about the task attempts associated with a job running on the cluster.


task failattempt

The task failattempt API ends the specified task attempt as failed.

Syntax

CLI
maprcli task failattempt
    [ -cluster <cluster name> ]
    -taskattemptid <task attempt ID>

REST
http[s]://<host>:<port>/rest/task/failattempt?[cluster=cluster_name&]taskattemptid=task_attempt_ID

Parameters

Parameter Description

cluster Cluster name

taskattemptid Task attempt ID

Examples

Ending a Task Attempt as Failed

CLI
maprcli task failattempt -taskattemptid attempt_201187941846_1077_300_7707

REST
https://r1n1.sj.us:8443/rest/task/failattempt?taskattemptid=attempt_201187941846_1077_300_7707


task killattempt

The task killattempt API kills the specified task attempt.

Syntax

CLI
maprcli task killattempt
    [ -cluster <cluster name> ]
    -taskattemptid <task attempt ID>

REST
http[s]://<host>:<port>/rest/task/killattempt?[cluster=cluster_name&]taskattemptid=task_attempt_ID

Parameters

Parameter Description

cluster Cluster name

taskattemptid Task attempt ID

Examples

Killing a Task Attempt

CLI
maprcli task killattempt -taskattemptid attempt_201187941846_1077_300_7707

REST
https://r1n1.sj.us:8443/rest/task/killattempt?taskattemptid=attempt_201187941846_1077_300_7707


task table

Retrieves histograms and line charts for task metrics.

Use the task table API to retrieve task analytics data for your job. The metrics data can be formatted for histogram display or line chart display.

Syntax

REST
http[s]://<host>:<port>/api/task/table?output=terse&filter=string&chart=chart_type&columns=list_of_columns&scale=scale_type

Parameters

Parameter Description

filter Filters results to match the value of a specified string.

chart Chart type to use: line for a line chart, bar for a histogram.

columns Comma-separated list of column names to return.

bincount Number of histogram bins.

scale Scale to use for the histogram. Specify linear for a linear scale and log for a logarithmic scale.

Column Names

The following table lists the terse short names for particular metrics regarding task attempts.

Parameter Description

tacir Combine Task Attempt Input Records

tacor Combine Task Attempt Output Records

tamib Map Task Attempt Input Bytes

tamir Map Task Attempt Input Records

tamob Map Task Attempt Output Bytes

tamor Map Task Attempt Output Records

tamsr Map Task Attempt Skipped Records

tarig Reduce Task Attempt Input Groups

tarir Reduce Task Attempt Input Records

taror Reduce Task Attempt Output Records

tarsb Reduce Task Attempt Shuffle Bytes

tarsr Reduce Task Attempt Skipped Records

tacput Task Attempt CPU Time

talbr Task Attempt Local Bytes Read

talbw Task Attempt Local Bytes Written

tambr Task Attempt MapR-FS Bytes Read

tambw Task Attempt MapR-FS Bytes Written


tapmem Task Attempt Physical Memory Bytes

taspr Task Attempt Spilled Records

tavmem Task Attempt Virtual Memory Bytes

tad Task Attempt Duration (histogram only)

tagct Task Attempt Garbage Collection Time (histogram only)

td Task Duration (histogram only)

Example

Retrieve a Task Histogram:

REST
https://r1n1.sj.us:8443/api/task/table?chart=bar&filter=%5Btt!=JOB_SETUP%5Dand%5Btt!=JOB_CLEANUP%5Dand%5Bjid==job_201129649560_3390%5D&columns=td&bincount=28&scale=log

CURL
curl -d @json https://r1n1.sj.us:8443/api/task/table

In the example above, the file json contains a URL-encoded version of the information in the Request section below.

Request

GENERAL_PARAMS:
{
  [chart: "bar" | "line",]
  columns: <comma-separated list of column terse names>,
  [filter: "[<terse_field>{operator}<value>]and[...]",]
  [output: terse,]
  [start: int,]
  [limit: int]
}

REQUEST_PARAMS_HISTOGRAM:
{
  chart: bar,
  columns: td,
  filter: <anything>
}

REQUEST_PARAMS_LINE:
{
  chart: line,
  columns: tapmem,
  filter: NOT PARSED, UNUSED IN BACKEND
}

REQUEST_PARAMS_GRID:
{
  columns: tid,tt,tsta,tst,tft,
  filter: <any real filter expression>,
  output: terse,
  start: 0,
  limit: 50
}
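A sketch of driving the POST form from a shell: the request parameters are URL-encoded into a local file named json and passed to curl with -d @json, as in the CURL example above (the credentials and -k flag are placeholders, not part of the original examples):

#!/bin/bash
# Write a URL-encoded histogram request, then POST it to the task table API.
cat > json <<'EOF'
chart=bar&columns=td&bincount=28&scale=log&filter=%5Btt!=JOB_SETUP%5Dand%5Btt!=JOB_CLEANUP%5D
EOF
curl -k -u mapr:mapr -d @json https://r1n1.sj.us:8443/api/task/table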

Response

RESPONSE_SUCCESS_HISTOGRAM:
{
  "status" : "OK",
  "total" : 15,
  "columns" : [ "td" ],
  "binlabels" : [ "0-5s", "5-10s", "10-30s", "30-60s", "60-90s", "90s-2m", "2m-5m", "5m-10m", "10m-30m", "30m-1h", "1h-2h", "2h-6h", "6h-12h", "12h-24h", ">24h" ],
  "binranges" : [
    [0,5000],
    [5000,10000],
    [10000,30000],
    [30000,60000],
    [60000,90000],
    [90000,120000],
    [120000,300000],
    [300000,600000],
    [600000,1800000],
    [1800000,3600000],
    [3600000,7200000],
    [7200000,21600000],
    [21600000,43200000],
    [43200000,86400000],
    [86400000]
  ],
  "data" : [33,919,1,133,9820,972,39,2,44,80,11,93,31,0,0]
}

RESPONSE_SUCCESS_GRID:
{
  "status" : "OK",
  "total" : 67,
  "columns" : [ "ts", "tid", "tt", "tsta", "tst", "tft", "td", "th", "thl" ],
  "data" : [
    [ "FAILED", "task_201204837529_1284_9497_4858", "REDUCE", "attempt_201204837529_1284_9497_4858_3680", 1301066803229, 1322663797292, 21596994063, "newyork-rack00-8", "remote" ],
    [ "PENDING", "task_201204837529_1284_9497_4858", "MAP", "attempt_201204837529_1284_9497_4858_8178", 1334918721349, 1341383566992, 6464845643, "newyork-rack00-7", "unknown" ],
    [ "RUNNING", "task_201204837529_1284_9497_4858", "JOB_CLEANUP", "attempt_201204837529_1284_9497_4858_1954", 1335088225728, 1335489232319, 401006591, "newyork-rack00-8", "local" ]
  ]
}

RESPONSE_SUCCESS_LINE:
{
  "status" : "OK",
  "total" : 22,
  "columns" : [ "tapmem" ],
  "data" : [
    [1329891055016,0],
    [1329891060016,8],
    [1329891065016,16],
    [1329891070016,1024],
    [1329891075016,2310],
    [1329891080016,3243],
    [1329891085016,4345],
    [1329891090016,7345],
    [1329891095016,7657],
    [1329891100016,8758],
    [1329891105016,9466],
    [1329891110016,10345],
    [1329891115016,235030],
    [1329891120016,235897],
    [1329891125016,287290],
    [1329891130016,298390],
    [1329891135016,301355],
    [1329891140016,302984],
    [1329891145016,303985],
    [1329891150016,304403],
    [1329891155016,503030],
    [1329891160016,983038]
  ]
}


rlimit

The rlimit commands enable you to get and set resource usage limits for your cluster.

rlimit get
rlimit set


rlimit get

The rlimit get API returns the resource usage limit for the cluster's disk resource.

Syntax

CLI
maprcli rlimit get -resource disk
    [ -cluster <cluster name> ]

REST
http[s]://<host>:<port>/rest/rlimit/get?<parameters>

Parameters

Parameter Description

resource The type of resource to get the usage limit for. Currently only the value disk is supported.

cluster name The name of the cluster whose usage limit is being queried.

Examples

Return the disk usage limit for the cluster my.cluster.com:

CLI
maprcli rlimit get -resource disk -cluster my.cluster.com

REST
https://r1n1.sj.us:8443/rest/rlimit/get?cluster=my.cluster.com


rlimit set

The rlimit set API sets the resource usage limit for the cluster's disk resource.

Syntax

CLI
maprcli rlimit set -resource disk
    [ -cluster <cluster name> ]
    -value <limit>

REST
http[s]://<host>:<port>/rest/rlimit/set?<parameters>

Parameters

Parameter Description

resource The type of resource to set the usage limit for. Currently only the value disk is supported.

cluster name The name of the cluster whose usage limit is being set.

limit The value of the limit being set. You can express the value as KB, MB, GB, or TB.

Examples

Set the disk usage limit for the cluster my.cluster.com to 80TB:

CLI
maprcli rlimit set -resource disk -cluster my.cluster.com -value 80TB

REST
https://r1n1.sj.us:8443/rest/rlimit/set?resource=disk&cluster=my.cluster.com&value=80TB
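A sketch that records the current limit before changing it, so the old value can be restored if needed (the cluster name is illustrative; the output of rlimit get is saved verbatim, not parsed):

#!/bin/bash
CLUSTER=my.cluster.com
# Capture the current disk limit for reference, then raise it.
maprcli rlimit get -resource disk -cluster "$CLUSTER" | tee rlimit-before.txt
maprcli rlimit set -resource disk -cluster "$CLUSTER" -value 80TB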


userconfig

The userconfig load command displays information about the current user.


userconfig load

Loads the configuration for the specified user.

Syntax

CLI
maprcli userconfig load -username <username>

REST
http[s]://<host>:<port>/rest/userconfig/load?<parameters>

Parameters

Parameter Description

username The username for which to load the configuration.

Output

The configuration for the specified user.

Sample Output

username  fsadmin  mradmin
root      1        1

Output Fields

Field Description

username The username for the specified user.

email The email address for the user.

fsadmin Indicates whether the user is a MapR-FS Administrator:

0 = no; 1 = yes

mradmin Indicates whether the user is a MapReduce Administrator:

0 = no; 1 = yes

helpUrl URL pattern for locating help files on the server. Example:

http://www.mapr.com/doc/display/MapR-<version>/<page>#<topic>


helpVersion Version of the help content corresponding to this build of MapR. Note that this is different from the build version.

Examples

View the root user's configuration:

CLI
maprcli userconfig load -username root

REST
https://r1n1.sj.us:8443/rest/userconfig/load?username=root
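The numeric fsadmin and mradmin fields make it straightforward to test for administrative rights from a script. A minimal sketch, assuming the -json output carries the field names shown in the output fields table above:

#!/bin/bash
# Check whether a user is a MapR-FS administrator
# (assumes the JSON output contains "fsadmin":1 for administrators)
USER_TO_CHECK=root
if maprcli userconfig load -username "$USER_TO_CHECK" -json | grep -q '"fsadmin":1'; then
    echo "$USER_TO_CHECK is a MapR-FS administrator"
else
    echo "$USER_TO_CHECK is not a MapR-FS administrator"
fi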


dump

The maprcli dump commands can be used to view key information about volumes, containers, storage pools, and MapR cluster services for debugging and troubleshooting.

dump balancerinfo returns detailed information about the storage pools on a cluster. If there are any active container moves, the command returns information about the source and destination storage pools.
dump balancermetrics returns a cumulative count of container moves and MB of data moved between storage pools.
dump cldbnodes returns the IP address and port number of the CLDB nodes on the cluster.
dump containerinfo returns detailed information about one or more specified containers.
dump replicationmanagerinfo returns information about volumes and the containers on those volumes, including the nodes on which the containers have been replicated and the space allocated to each container.
dump replicationmanagerqueueinfo returns information that enables you to identify containers that are under-replicated or over-replicated.
dump rereplicationinfo returns information about the ongoing re-replication of replica containers, including the destination IP address and port number, the ID number of the destination file server, and the ID number of the destination storage pool.
dump rolebalancerinfo returns information about active replication role switches.
dump rolebalancermetrics returns the cumulative number of times that the replication role balancer has switched the replication role of name containers and data containers on the cluster.
dump volumeinfo returns information about volumes and the associated containers.
dump volumenodes returns the IP address and port number of volume nodes.
dump zkinfo returns the ZooKeeper znodes. This command is used by the mapr-support-collect.sh script to gather cluster diagnostics for troubleshooting.


dump balancerinfo

The maprcli dump balancerinfo command enables you to see how space is used in storage pools and to track active container moves.

The disk space balancer is a tool that balances disk space usage on a cluster by moving containers between storage pools. Whenever a storage pool is over 70% full (or over a threshold defined by the cldb.balancer.disk.threshold.percentage parameter), the disk space balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the node is similar. For more information, see Disk Space Balancer.
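As a sketch of how the threshold might be inspected and adjusted with the config load and config save commands documented elsewhere in this guide (the 80% value below is illustrative, not a recommendation):

# View the current disk balancer threshold
maprcli config load -keys cldb.balancer.disk.threshold.percentage

# Raise the threshold to 80% so the balancer starts moving containers later
maprcli config save -values {"cldb.balancer.disk.threshold.percentage":"80"}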

Syntax

maprcli dump balancerinfo [-cluster <cluster name>]

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

Output

The maprcli dump balancerinfo command returns detailed information about the storage pools on a cluster. If there are any active container moves, the command returns information about the source and destination storage pools.


# maprcli dump balancerinfo -cluster my.cluster.com -json
{
    "timestamp":1337036566035,
    "status":"OK",
    "total":187,
    "data":[
        {
            "spid":"4bc329ce06752062004fa1a537abcdef",
            "fsid":5410063549464613987,
            "ip:port":"10.50.60.72:5660-",
            "capacityMB":1585096,
            "usedMB":1118099,
            "percentage":70,
            "fullnessLevel":"AboveAverage",
            "inTransitMB":0,
            "outTransitMB":31874
        },
        {
            "spid":"761fec1fabf32104004fad9630ghijkl",
            "fsid":3770844641152008527,
            "ip:port":"10.50.60.73:5660-",
            "capacityMB":1830364,
            "usedMB":793679,
            "percentage":47,
            "fullnessLevel":"BelowAverage",
            "inTransitMB":79096,
            "outTransitMB":0
        },
        ....
        {
            "containerid":4034,
            "sizeMB":16046,
            "From fsid":5410063549464613987,
            "From IP:Port":"10.50.60.72:5660-",
            "From SP":"4bc329ce06752062004fa1a537abcefg",
            "To fsid":3770844641152008527,
            "To IP:Port":"10.50.60.73:5660-",
            "To SP":"761fec1fabf32104004fad9630ghijkl"
        },

Output fields

Field Description

spid The unique ID number of the storage pool.

fsid The unique ID number of the file server. The FSID identifies a MapR-FS instance or a node that has MapR-FS running in the cluster. Typically, each node has a group of storage pools, so the same FSID will correspond to multiple SPIDs.

ip:port The host IP address and MapR-FS port.

capacityMB The total capacity of the storage pool (in MB).

usedMB The amount of space used on the storage pool (in MB).

percentage The percentage of the storage pool currently utilized; a ratio of the space used (usedMB) to the total capacity (capacityMB) of the storage pool.

fullnessLevel The fullness of the storage pool relative to the fullness of the rest of the cluster. Possible values are OverUsed, AboveAverage, Average, BelowAverage, and UnderUsed. For more information, see Monitoring storage pool space usage below.

inTransitMB The amount of data (in MB) that the disk space balancer is currently moving into a storage pool.


outTransitMB The amount of data (in MB) that the disk space balancer is currently moving out of a storage pool.

The following fields are returned only if the disk space balancer is actively moving one or more containers at the time the command is run.

Field Description

containerid The unique ID number of the container.

sizeMB The amount of data (in MB) being moved.

From fsid The FSID (file server ID number) of the source file server.

From IP:Port The IP address and port number of the source node.

From SP The SPID (storage pool ID) of the source storage pool.

To fsid The FSID (file server ID number) of the destination file server.

To IP:Port The IP address and port number of the destination node.

To SP The SPID (storage pool ID number) of the destination storage pool.

Examples

Monitoring storage pool space usage

You can use the maprcli dump balancerinfo command to monitor space usage on storage pools.

# maprcli dump balancerinfo -json

....{ : ,"spid" "4bc329ce06752062004fa1a537abcefg" :5410063549464613987,"fsid" : ,"ip:port" "10.50.60.72:5660-" :1585096,"capacityMB" :1118099,"usedMB" :70,"percentage" : ,"fullnessLevel" "AboveAverage" :0,"inTransitMB" :31874"outTransitMB" },

Tracking active container moves

Using the maprcli dump balancerinfo command you can monitor the activity of the disk space balancer. Whenever there are active container moves, the command returns information about the source and destination storage pools.

# maprcli dump balancerinfo -json
....
{
    "containerid":7840,
    "sizeMB":15634,
    "From fsid":8081858704500413174,
    "From IP:Port":"10.50.60.64:5660-",
    "From SP":"9e649bf0ac6fb9f7004fa19d20rstuvw",
    "To fsid":3770844641152008527,
    "To IP:Port":"10.50.60.73:5660-",
    "To SP":"fefcc342475f0286004fad963flmnopq"
}

The example shows that a container (7840) is being moved from a storage pool on node 10.50.60.64 to a storage pool on node 10.50.60.73.


Tip: You can use the storage pool IDs (SPIDs) to search the CLDB and MFS logs for activity (balancer moves, container moves, creates, deletes, etc.) related to specific storage pools.
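For example, a quick way to follow one storage pool through the logs (the log locations assume a default /opt/mapr installation, and the SPID is taken from the sample output above):

# Search the CLDB and MFS logs for activity related to one SPID
SPID=4bc329ce06752062004fa1a537abcefg
grep "$SPID" /opt/mapr/logs/cldb.log
grep "$SPID" /opt/mapr/logs/mfs.log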


dump balancermetrics

The maprcli dump balancermetrics command returns a cumulative count of container moves and MB of data moved between storage pools.

The disk space balancer is a tool that balances disk space usage on a cluster by moving containers between storage pools. Whenever a storage pool is over 70% full (or over a threshold defined by the cldb.balancer.disk.threshold.percentage parameter), the disk space balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the node is similar. For more information, see Disk Space Balancer.

Syntax

maprcli dump balancermetrics [-cluster <cluster name>]

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

Output

The maprcli dump balancermetrics command returns a cumulative count of the container moves, and of the MB of data moved, between the storage pools on the cluster.


dump changeloglevel

Changes the log level of the specified class on the CLDB.

Syntax

CLI
maprcli dump changeloglevel [ -classname <class name> ] [ -loglevel <log level> ] [ -cldbip <host> ] [ -cldbiprt <port> ]

REST
None

Parameters

Parameter Description

classname The class name.

loglevel The log level to set.

cldbip The IP address of the CLDB to use. Default: 127.0.0.1

cldbiprt The port to use on the CLDB. Default: 7222

Examples

Change the log level:

CLI
maprcli dump changeloglevel

REST
None


dump cldbnodes

The maprcli dump cldbnodes command lists the nodes that contain container location database (CLDB) data.

The CLDB is a service running on one or more MapR nodes that maintains the location of cluster containers, services, and other information. TheCLDB automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If theCLDB process dies, it is automatically restarted on the node.

Syntax

maprcli dump cldbnodes [-cluster <cluster name>] -zkconnect <ZooKeeper Connect String>

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-zkconnect <ZooKeeper connect string> A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format: '<host>[:<port>][,<host>[:<port>]...]'

Output

The maprcli dump cldbnodes command returns the IP address and port number of the CLDB nodes on the cluster.

$ maprcli dump cldbnodes -zkconnect localhost:5181 -json
{
    "timestamp":1309882069107,
    "status":"OK",
    "total":1,
    "data":[
        {
            "valid":[
                "10.10.30.39:5660-10.50.60.39:5660-",
                "10.10.30.38:5660-10.50.60.38:5660-",
                "10.10.30.35:5660-10.50.60.35:5660-"
            ]
        }
    ]
}

Examples

Disaster Recovery

In the event that all CLDB nodes fail, you can restore the CLDB from a backup. It is a good idea to set up an automatic backup of the CLDB volume at regular intervals. You can use the maprcli dump cldbnodes command to set up cron jobs to back up CLDB volumes locally or to external media such as a USB drive. For more information, see Disaster Recovery.


To back up a CLDB volume from a remote cluster:

1. Set up a cron job to save the container information on the remote cluster using the following command:

# maprcli dump cldbnodes -zkconnect <ZooKeeper connect string> > <path to file>

2. Set up a cron job to copy the container information file to a volume on the local cluster.
3. Create a mirror volume on the local cluster, choosing the mapr.cldb.internal volume from the remote cluster as the source volume.
4. Set the mirror sync schedule so that it will run at the same time as the cron job.

To back up a CLDB volume locally:

1. Set up a cron job to save the container information to a file on external media by running the following command:

# maprcli dump cldbnodes -zkconnect <ZooKeeper connect string> > <path to file>

2. Set up a cron job to create a dump file of the local mapr.cldb.internal volume on external media. Example:

# maprcli volume dump create -name mapr.cldb.internal -dumpfile <path to file>
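For illustration, a crontab entry for step 1 might look like the following; the schedule, ZooKeeper host, maprcli path, and output path are assumptions to adapt to your environment:

# Save the CLDB container information every night at 2:00 AM
0 2 * * * /opt/mapr/bin/maprcli dump cldbnodes -zkconnect localhost:5181 > /media/backup/cldbnodes.txt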

For information about restoring from a backup of the CLDB, contact MapR Support.


dump containerinfo

The maprcli dump containerinfo command enables you to view detailed information about one or more specified containers.

A container is a unit of sharded storage in a MapR cluster. Every container in a MapR volume is either a name container or a data container.

The name container is the first container in a volume and holds that volume's namespace and file chunk locations. Depending on its replication role, a name container may be either a master container (part of the original copy of the volume) or a replica container (one of the replicas in the replication chain).

Every data container is either a master container, an intermediate container, or a tail container.

Syntax

maprcli dump containerinfo [-clustername <cluster name>] -ids <id1,id2,id3 ...>

Parameters

Parameter Description

-clustername <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-ids <id1,id2,id3...> Specifies one or more container IDs. Container IDs are comma separated.
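Because container IDs are comma separated, a single invocation can query several containers at once. For example (the IDs below are illustrative):

# Query three containers in one call
maprcli dump containerinfo -ids 2049,2050,2053 -json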

Output

The maprcli dump containerinfo command returns information about one or more containers.


# maprcli dump containerinfo -ids 2049 -json
{
    "timestamp":1335831624586,
    "status":"OK",
    "total":1,
    "data":[
        {
            "ContainerId":2049,
            "Epoch":11,
            "Master":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID",
            "ActiveServers":{
                "IP:Port":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID"
            },
            "InactiveServers":{
            },
            "UnusedServers":{
            },
            "OwnedSizeMB":"0 MB",
            "SharedSizeMB":"0 MB",
            "LogicalSizeMB":"0 MB",
            "Mtime":"Thu Mar 22 15:44:22 PDT 2012",
            "NameContainer":"true",
            "VolumeName":"mapr.cluster.root",
            "VolumeId":93501816,
            "VolumeReplication":3,
            "VolumeMounted":true
        }
    ]
}

Output fields

Field Description

ContainerID The unique ID number for the container.

Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that anout-of-date copy cannot become the master for the container.

Master The physical IP address and port number of the master copy. The master copy is part of the original copy of the volume.

ActiveServers The physical IP address and port number of each active node on which the container resides.

InactiveServers The physical IP address and port number of each inactive node on which the container resides.

UnusedServers The physical IP address and port number of servers from which no "heartbeat" has been received for quite some time.

OwnedSizeMB The size on disk (in MB) dedicated to the container.

SharedSizeMB The size on disk (in MB) shared by the container.

LogicalSizeMB The logical size on disk (in MB) of the container.

TotalSizeMB The total size on disk (in MB) allocated to the container. Combines the Owned Size and Shared Size.

Mtime The time of the last modification to the contents of the container.

NameContainer Indicates if the container is the name container for the volume. If true, the container holds the volume's namespace information and file chunk locations.

VolumeName The name of the volume.

VolumeId The unique ID number of the volume.

VolumeReplication The replication factor, the number of copies of a volume excluding the original.


VolumeMounted Indicates whether the volume is mounted. If true, the volume is currently mounted. If false, the volume is not mounted.


dump replicationmanagerinfo

The maprcli dump replicationmanagerinfo command enables you to see which containers are under-replicated or over-replicated in a specified volume. For each container, the command displays the current state of that container.

Syntax

maprcli dump replicationmanagerinfo [-cluster <cluster name>] -volumename <volume name>

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-volumename <volume name> Specifies the name of the volume.

Output

The maprcli dump replicationmanagerinfo command returns information about volumes and the containers on those volumes, including the nodes on which the containers have been replicated and the space allocated to each container.


# maprcli dump replicationmanagerinfo -cluster my.cluster.com -volumename mapr.metrics -json
{
    "timestamp":1335830006872,
    "status":"OK",
    "total":2,
    "data":[
        {
            "VolumeName":"mapr.metrics",
            "VolumeId":182964332,
            "VolumeTopology":"/",
            "VolumeUsedSizeMB":1,
            "VolumeReplication":3,
            "VolumeMinReplication":2
        },
        {
            "ContainerId":2053,
            "Epoch":9,
            "Master":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--9-VALID",
            "ActiveServers":{
                "IP:Port":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--9-VALID"
            },
            "InactiveServers":{
            },
            "UnusedServers":{
            },
            "OwnedSizeMB":"1 MB",
            "SharedSizeMB":"0 MB",
            "LogicalSizeMB":"1 MB",
            "Mtime":"Mon Apr 30 16:40:41 PDT 2012",
            "NameContainer":"true"
        }
    ]
}

Output fields

Field Description

VolumeName Indicates the name of the volume.

VolumeId Indicates the ID number of the volume.

VolumeTopology The volume topology corresponds to the node topology of the rack or nodes where the volume resides. By default, new volumes are created with a topology of / (root directory). For more information, see Volume Topology.

VolumeUsedSizeMB The size on disk (in MB) of the volume.

VolumeReplication The desired replication factor, the number of copies of a volume excluding the original. The default value is 3.

VolumeMinReplication The minimum replication factor, the number of copies of a volume (excluding the original) that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, writes to the volume are disabled. The default value is 2.

ContainerId The unique ID number for the container.

Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that anout-of-date copy cannot become the master for the container.

Master The physical IP address and port number of the master copy. The master copy is part of the original copy of the volume.

ActiveServers The physical IP address and port number of each active node on which the container resides.

InactiveServers The physical IP address and port number of each inactive node on which the container resides.


UnusedServers The physical IP address and port number of each server from which no "heartbeat" has been received for quite some time.

OwnedSizeMB The size on disk (in MB) dedicated to the container.

SharedSizeMB The size on disk (in MB) shared by the container.

LogicalSizeMB The logical size on disk (in MB) of the container.

Mtime Indicates the time of the last modification to the container's contents.

NameContainer Indicates if the container is the name container for the volume. If true, the container is the volume's first container and replication occurs simultaneously from the master to the intermediate and tail containers.


dump replicationmanagerqueueinfo

The maprcli dump replicationmanagerqueueinfo command enables you to determine the status of under-replicated containers and over-replicated containers.

Syntax

maprcli dump replicationmanagerqueueinfo [-cluster <cluster name>] -queue <queue>

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-queue <queue> The name of the queue. Valid values are 0, 1, or 2. Queue 0 includes containers that have copies below the minimum replication factor for the volume. Queue 1 includes containers that have copies below the replication factor for the volume, but above the minimum replication factor. Queue 2 includes containers that are over-replicated.

Output

The command returns information about one of three queues: 0, 1, or 2. Depending on themaprcli dump replicationmanagerqueueinfoqueue value entered, the command displays information about containers that are under-replicated or over-replicated. You can use thisinformation to decide if you need to change the replication factor for that volume.


# maprcli dump replicationmanagerqueueinfo -queue 0
Mtime                         LogicalSizeMB  UnusedServers  ActiveServers  TotalSizeMB  NameContainer  InactiveServers  ContainerId  Master                      Epoch  SharedSizeMB  OwnedSizeMB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2065         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2064         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
                              0 MB                          ...            0 MB         true                            1            10.250.1.103:5660--8-VALID  8      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2066         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  1 MB                          ...            0 MB         false                           2069         10.250.1.103:5660--5-VALID  5      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  1 MB                          ...            0 MB         false                           2068         10.250.1.103:5660--5-VALID  5      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2071         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2070         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2073         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2072         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2075         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2074         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2077         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2076         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:36:30 PDT 2012  0 MB                          ...            0 MB         true                            2049         10.250.1.103:5660--7-VALID  7      0 MB          0 MB
Thu May 17 10:36:36 PDT 2012  0 MB                          ...            0 MB         true                            2050         10.250.1.103:5660--7-VALID  7      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         true                            2051         10.250.1.103:5660--6-VALID  6      0 MB          0 MB
Thu May 17 10:37:06 PDT 2012  0 MB                          ...            0 MB         true                            2053         10.250.1.103:5660--6-VALID  6      0 MB          0 MB
Fri May 18 14:33:44 PDT 2012  0 MB                          ...            0 MB         true                            2054         10.250.1.103:5660--5-VALID  5      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         true                            2055         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         true                            2056         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2057         10.250.1.103:5660--5-VALID  5      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2058         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2059         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2060         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2061         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2062         10.250.1.103:5660--3-VALID  3      0 MB          0 MB
Thu May 17 10:32:59 PDT 2012  0 MB                          ...            0 MB         false                           2063         10.250.1.103:5660--3-VALID  3      0 MB          0 MB

Output fields

Field Description


ContainerID The unique ID number of the container.

Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that anout-of-date copy cannot become the master for the container.

Master The physical IP address and port number of the master copy. The master copy is part of the original copy of the volume.

ActiveServers The physical IP address and port number of each active node on which the container resides.

InactiveServers The physical IP address and port number of each inactive node on which the container resides.

UnusedServers The physical IP address and port number of servers from which no "heartbeat" has been received for quite some time.

OwnedSizeMB The size on disk (in MB) dedicated to the container.

SharedSizeMB The size on disk (in MB) shared by the container.

LogicalSizeMB The logical size on disk (in MB) of the container.

TotalSizeMB The total size on disk (in MB) allocated to the container. Combines the Owned Size and Shared Size.

Mtime The time of the last modification to the contents of the container.

NameContainer Indicates if the container is the name container for the volume. If true, the container holds the volume's namespace information and file chunk locations.
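Because each queue must be queried separately, a short shell loop can summarize all three at once. A minimal sketch, assuming the -json output carries a ContainerId field per container as shown in the output fields table above:

#!/bin/bash
# Report how many containers sit in each replication manager queue:
# 0 = below minimum replication, 1 = below target replication, 2 = over-replicated
for q in 0 1 2; do
    count=$(maprcli dump replicationmanagerqueueinfo -queue "$q" -json | grep -c '"ContainerId"')
    echo "queue $q: $count containers"
done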


dump rereplicationinfo

The maprcli dump rereplicationinfo command enables you to view information about the re-replication of containers.

Re-replication occurs whenever the number of available replica containers drops below the number prescribed by that volume's replication factor. Re-replication may occur for a variety of reasons including replica container corruption, node unavailability, hard disk failure, or an increase in replication factor.

Syntax

maprcli dump rereplicationinfo [-cluster <cluster name>]

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

Output

The maprcli dump rereplicationinfo command returns information about the ongoing re-replication of replica containers, including the destination IP address and port number, the ID number of the destination file server, and the ID number of the destination storage pool.


# maprcli dump rereplicationinfo -json
{
    "timestamp":1338222709331,
    "status":"OK",
    "total":7,
    "data":[
        {
            "containerid":2158,
            "replica":{
                "sizeMB":15467,
                "To fsid":9057314602141502940,
                "To IP:Port":"192.0.2.28:5660-",
                "To SP":"03b5970f41abbe48004f828abaabcdef"
            }
        },
        {
            "containerid":3367,
            "replica":{
                "sizeMB":658,
                "To fsid":3684488804112157043,
                "To IP:Port":"192.0.2.33:5660-",
                "To SP":"3b86b4ce5bfd6bbf004f87e9b6ghijkl"
            }
        },
        {
            "containerid":3376,
            "replica":{
                "sizeMB":630,
                "To fsid":3684488804112157043,
                "To IP:Port":"192.0.2.33:5660-",
                "To SP":"3b86b4ce5bfd6bbf004f87e9b6ghijkl"
            }
        },
        {
            "containerid":3437,
            "replica":{
                "sizeMB":239,
                "To fsid":6776586767180745590,
                "To IP:Port":"192.0.2.32:5660-",
                "To SP":"6cd440fad0426db7004f828b2amnopqr"
            }
        },
        {
            "containerid":8833,
            "replica":{
                "sizeMB":7327,
                "To fsid":9057314602141502940,
                "To IP:Port":"192.0.2.28:5660-",
                "To SP":"33885e3c5be9a04d004f828abcstuvwx"
            }
        }
    ]
}

Output fields

Field Description

sizeMB The amount of data (in MB) being moved.

To fsid The ID number (FSID) of the destination file server.

To IP:Port The IP address and port number of the destination node.


To SP The ID number (SPID) of the destination storage pool.


dump rolebalancerinfo

The maprcli dump rolebalancerinfo command enables you to monitor the replication role balancer and view information about active replication role switches.

The replication role balancer is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers).

The replication role balancer changes the replication role of the containers in a cluster so that network bandwidth is spread evenly across all nodes during the replication process. A container's replication role determines how it is replicated to the other nodes in the cluster. For name containers (the volume's first container), replication occurs simultaneously from the master to all replica containers. For data containers, replication proceeds from the master to the intermediate container(s) until it reaches the tail containers. For more information, see Replication Role Balancer.

Syntax

maprcli dump rolebalancerinfo [-cluster <cluster name>]

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

Output

The maprcli dump rolebalancerinfo command returns information about active replication role switches.

# maprcli dump rolebalancerinfo -json
{
    "timestamp":1335835436698,
    "status":"OK",
    "total":1,
    "data":[
        {
            "containerid":36659,
            "Tail IP:Port":"10.50.60.123:5660-",
            "Updates blocked Since":"Wed May 23 05:48:15 PDT 2012"
        }
    ]
}

Output fields

Field Description

containerid The unique ID number of the container.

Tail IP:Port The IP address and port number of the tail container node.

Updates blocked Since The date and time since which updates to the container have been blocked. During a replication role switch, updates to that container are blocked.


dump rolebalancermetrics

The maprcli dump rolebalancermetrics command enables you to view the number of times that the replication role balancer has switched the replication role of the name containers and data containers to ensure that containers are balanced across the nodes in the cluster.

The replication role balancer is a tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers).

The replication role balancer changes the replication role of the containers in a cluster so that network bandwidth is spread evenly across all nodes during the replication process. A container's replication role determines how it is replicated to the other nodes in the cluster. For name containers (the volume's first container), replication occurs simultaneously from the master to all replica containers. For data containers, replication proceeds from the master to the intermediate container(s) until it reaches the tail containers. For more information, see Replication Role Balancer.

Syntax

maprcli dump rolebalancermetrics [-cluster <cluster name>]

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

Output

The maprcli dump rolebalancermetrics command returns the cumulative number of times that the replication role balancer has switched the replication role of name containers and data containers on the cluster.

# maprcli dump rolebalancermetrics -json
{
    "timestamp":1337777286527,
    "status":"OK",
    "total":1,
    "data":[
        {
            "numNameContainerSwitches":60,
            "numDataContainerSwitches":28,
            "timeOfLastMove":"Wed May 23 05:48:00 PDT 2012"
        }
    ]
}

Output fields

Field Description

numNameContainerSwitches The number of times that the replication role balancer has switched the replication role of name containers.

numDataContainerSwitches The number of times that the replication role balancer has switched the replication role of data containers.

timeOfLastMove The date and time of the last replication role change.


dump volumeinfo

The maprcli dump volumeinfo command enables you to view information about a volume and the containers within that volume.

A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. Using volumes, you can enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. For more information, see Managing Data with Volumes.

Syntax

maprcli dump volumeinfo [-cluster <cluster name>] -volumename <volume name>

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-volumename <volume name> The name of the volume.

Output

The maprcli dump volumeinfo command returns information about the volume and the containers associated with that volume. Volume information includes the ID, volume name, and replication factor. For each container on the specified volume, the command returns information about nodes and storage.


# maprcli dump volumeinfo -volumename mapr.cluster.root -json
{
    "timestamp":1335830155441,
    "status":"OK",
    "total":2,
    "data":[
        {
            "VolumeName":"mapr.cluster.root",
            "VolumeId":93501816,
            "VolumeTopology":"/",
            "VolumeUsedSizeMB":0,
            "VolumeReplication":3,
            "VolumeMinReplication":2
        },
        {
            "ContainerId":2049,
            "Epoch":11,
            "Master":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID",
            "ActiveServers":{
                "IP:Port":"10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--11-VALID"
            },
            "InactiveServers":{
            },
            "UnusedServers":{
            },
            "OwnedSizeMB":"0 MB",
            "SharedSizeMB":"0 MB",
            "LogicalSizeMB":"0 MB",
            "Mtime":"Thu Mar 22 15:44:22 PDT 2012",
            "NameContainer":"true"
        }
    ]
}

Output fields

Field Description

VolumeName The name of the volume.

VolumeId The unique ID number of the volume.

VolumeTopology The volume topology corresponds to the node topology of the rack or nodes where the volume resides. By default, new volumes are created with a topology of / (root directory). For more information, see Volume Topology.

VolumeUsedSizeMB The size on disk (in MB) of the volume.

VolumeReplication The desired replication factor, the number of copies of a volume. The default value is 3. The maximum value is 6.

VolumeMinReplication The minimum replication factor, the number of copies of a volume (excluding the original) that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, writes to the volume are disabled. The default value is 2.

ContainerId The unique ID number of the container.

Epoch A sequence number that indicates the most recent copy of the container. The CLDB uses the epoch to ensure that anout-of-date copy cannot become the master for the container.

Master The physical IP address and port number of the master copy. The master copy is part of the original copy of the volume.

ActiveServers The physical IP address and port number of each active node on which the container resides.

InactiveServers The physical IP address and port number of each inactive node on which the container resides.


UnusedServers The physical IP address and port number of servers from which no "heartbeat" has been received for quite some time.

OwnedSizeMB The size on disk (in MB) dedicated to the container.

SharedSizeMB The size on disk (in MB) shared by the container.

LogicalSizeMB The logical size on disk (in MB) of the container.

TotalSizeMB The total size on disk (in MB) allocated to the container. Combines the Owned Size and Shared Size.

Mtime Indicates the time of the last modification to the contents of the container.

NameContainer Indicates if the container is the name container for the volume. If true, the container is the volume's first container and replication occurs simultaneously from the master to the intermediate and tail containers.


dump volumenodes

The maprcli dump volumenodes command enables you to view information about the nodes on a volume.

Syntax

maprcli dump volumenodes [-cluster <cluster name>] -volumename <volume name>

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-volumename <volume name> The name of the volume.

Output

The maprcli dump volumenodes command returns the IP address and port number of volume nodes.

# maprcli dump volumenodes -volumename mapr.hbase -json
{
    "timestamp":1337280188850,
    "status":"OK",
    "total":1,
    "data":[
        {
            "Servers":{
                "IP:Port":"10.250.1.103:5660--7-VALID"
            }
        }
    ]
}

Output fields

Field Description

IP:Port The IP address and MapR-FS port.


dump zkinfo

The maprcli dump zkinfo command enables you to view a snapshot of the data stored in ZooKeeper as a result of cluster operations.

ZooKeeper prevents service coordination conflicts by enforcing a rigid set of rules and conditions, provides cluster-wide information about running services and their configuration, and provides a mechanism for almost instantaneous service failover. The warden will not start any services unless ZooKeeper is reachable and more than half of the configured ZooKeeper nodes are live.

The mapr-support-collect.sh script calls the maprcli dump supportdump command to gather cluster diagnostics for troubleshooting. For more information, see mapr-support-collect.sh.

Syntax

maprcli dump zkinfo [-cluster <cluster name>] [-zkconnect <connect string>]

Parameters

Parameter Description

-cluster <cluster name> The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

-zkconnect <connect string> A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format: '<host>[:<port>][,<host>[:<port>]...]'

Output

The maprcli dump zkinfo command is run as part of support dump tools to view the current state of the ZooKeeper service. The command should always be run using the -json flag. Output in the tabular format is not useful. Command output displays the data stored in the ZooKeeper hierarchical tree of znodes.

# maprcli dump zkinfo -json
{
    "timestamp":1335825202157,
    "status":"OK",
    "total":1,
    "data":[
        {
            "/_Stats":"\ncZxid = 0,ctime = Wed Dec 31 16:00:00 PST 1969,mZxid = 0,mtime = Wed Dec 31 16:00:00 PST 1969,pZxid = 516,cversion = 12,dataVersion = 0,aclVersion = 0,ephemeralOwner = 0,dataLength = 0,numChildren = 13",
            "/":[
                {
                ....
                }
            ]
        }
    ]
}

Output fields

You can use the maprcli dump zkinfo command as you would use a database snapshot. The /services, /services_config, /servers, and /*_locks znodes are used by the wardens to store and exchange information.

Field Description

services The /services directory is used by the wardens to store and exchange information about services.



datacenter The /datacenter directory contains CLDB "vital signs" that you can use to identify the CLDB master, the most recent epoch, and other key data. For more information, see Moving CLDB Data below.

services_config The /services_config directory is used by the wardens to store and exchange information.

zookeeper The /zookeeper directory stores information about the ZooKeeper service.

servers The /servers directory is used by the wardens to store and exchange information.

nodes The /nodes directory (znode) stores key information about the nodes.

Examples

Moving CLDB Data

In an M3-licensed cluster, CLDB data must be recovered from a failed CLDB node and installed on another node. The cluster can continue normally as soon as the CLDB is started on another node. For more information, see Recovering from a Failed CLDB Node on an M3 Cluster.

Use the maprcli dump zkinfo command to identify the latest epoch of the CLDB, identify the nodes where replicates of the CLDB are stored, and select one of those nodes to serve as the new CLDB node. Perform the following steps on any cluster node:

1. Log in as root or use sudo for the following commands.
2. Issue the maprcli dump zkinfo command using the -json flag:

# maprcli dump zkinfo -json

The output displays the ZooKeeper znodes.

3. In the /datacenter/controlnodes/cldb/epoch/1 directory, locate the CLDB with the latest epoch.

{ :" Container ID:1"/datacenter/controlnodes/cldb/epoch/1/KvStoreContainerInfo" VolumeId:1 Master:10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Servers: 10.250.1.15:5660-172.16.122.1:5660-192.168.115.1:5660--13-VALID Inactive Servers: UnusedServers: Latest epoch:13"}

The Latest Epoch field identifies the current epoch of the CLDB data. In this example, the latest epoch is 13.

4. Select a CLDB from among the copies at the latest epoch. For example, 10.250.2.41:5660--13-VALID indicates that the node has a copy at epoch 13 (the latest epoch).
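As a convenience, the epoch lines can be pulled out of the full dump with a simple filter. A minimal sketch, with the grep pattern based on the sample output above:

# Show only the lines that report a CLDB epoch
maprcli dump zkinfo -json | grep "Latest epoch"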


Alarms Reference

This page provides details for all alarm types.

User/Group Alarms
    Entity Advisory Quota Alarm
    Entity Quota Alarm

Cluster Alarms
    Blacklist Alarm
    License Near Expiration
    License Expired
    Cluster Almost Full
    Cluster Full
    Maximum Licensed Nodes Exceeded Alarm
    Upgrade in Progress
    VIP Assignment Failure

Node Alarms
    CLDB Service Alarm
    Core Present Alarm
    Debug Logging Active
    Disk Failure
    Duplicate Host ID
    FileServer Service Alarm
    HBMaster Service Alarm
    HBRegion Service Alarm
    Hoststats Alarm
    Installation Directory Full Alarm
    JobTracker Service Alarm
    MapR-FS High Memory Alarm
    MapR User Mismatch
    Metrics Write Problem Alarm
    NFS Service Alarm
    PAM Misconfigured Alarm
    Root Partition Full Alarm
    TaskTracker Service Alarm
    TaskTracker Local Directory Full Alarm
    Time Skew Alarm
    Version Alarm
    WebServer Service Alarm

Volume Alarms
    Data Unavailable
    Data Under-Replicated
    Inodes Limit Exceeded
    Mirror Failure
    No Nodes in Topology
    Snapshot Failure
    Topology Almost Full
    Topology Full Alarm
    Volume Advisory Quota Alarm
    Volume Quota Alarm

User/Group Alarms

User/group alarms indicate problems with user or group quotas. The following tables describe the MapR user/group alarms.

Entity Advisory Quota Alarm

UI Column User Advisory Quota Alarm

Logged As AE_ALARM_AEADVISORY_QUOTA_EXCEEDED

Meaning A user or group has exceeded its advisory quota. See Managing Quotas for more information about user/group quotas.

Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on volumes created by the user or group, or stop further data writes to those volumes.

Entity Quota Alarm


UI Column User Quota Alarm

Logged As AE_ALARM_AEQUOTA_EXCEEDED

Meaning A user or group has exceeded its quota. Further writes by the user or group will fail. See Managing Quotas for more information about user/group quotas.

Resolution Free some space on the volumes created by the user or group, or increase the user or group quota.

Cluster Alarms

Cluster alarms indicate problems that affect the cluster as a whole. The following tables describe the MapR cluster alarms.

Blacklist Alarm

UI Column Blacklist Alarm

Logged As CLUSTER_ALARM_BLACKLIST_TTS

Meaning The JobTracker has blacklisted a TaskTracker node because tasks on the node have failed too many times.

Resolution To determine which node or nodes have been blacklisted, see the JobTracker status page (click JobTracker in the Navigation Pane). The JobTracker status page provides links to the TaskTracker log for each node; look at the log for the blacklisted node or nodes to determine why tasks are failing on the node.

License Near Expiration

UI Column License Near Expiration Alarm

Logged As CLUSTER_ALARM_LICENSE_NEAR_EXPIRATION

Meaning The M5 license associated with the cluster is within 30 days of expiration.

Resolution Renew the M5 license.

License Expired

UI Column License Expiration Alarm

Logged As CLUSTER_ALARM_LICENSE_EXPIRED

Meaning The M5 license associated with the cluster has expired. M5 features have been disabled.

Resolution Renew the M5 license.

Cluster Almost Full

UI Column Cluster Almost Full

Logged As CLUSTER_ALARM_CLUSTER_ALMOST_FULL

Meaning The cluster storage is almost full. The percentage of storage used before this alarm is triggered is 90% by default, and is controlled by the configuration parameter cldb.cluster.almost.full.percentage.

Resolution Reduce the amount of data stored in the cluster. If the cluster storage is less than 90% full, check the cldb.cluster.almost.full.percentage parameter via the config load command, and adjust it if necessary via the config save command.
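For example, a minimal check-and-adjust sequence might look like the following (the 95 value is illustrative, not a recommendation):

maprcli config load -keys cldb.cluster.almost.full.percentage
maprcli config save -values {"cldb.cluster.almost.full.percentage":"95"}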

Cluster Full

UI Column Cluster Full

Logged As CLUSTER_ALARM_CLUSTER_FULL


Meaning The cluster storage is full. MapReduce operations have been halted.

Resolution Free up some space on the cluster.

Maximum Licensed Nodes Exceeded Alarm

UI Column Licensed Nodes Exceeded Alarm

Logged As CLUSTER_ALARM_LICENSE_MAXNODES_EXCEEDED

Meaning The cluster has exceeded the number of nodes specified in the license.

Resolution Remove some nodes, or upgrade the license to accommodate the added nodes.

Upgrade in Progress

UI Column Software Installation & Upgrades

Logged As CLUSTER_ALARM_UPGRADE_IN_PROGRESS

Meaning A rolling upgrade of the cluster is in progress.

Resolution No action is required. Performance may be affected during the upgrade, but the cluster should still function normally. After the upgrade is complete, the alarm is cleared.

VIP Assignment Failure

UI Column VIP Assignment Alarm

Logged As CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS

Meaning MapR was unable to assign a VIP to any NFS servers.

Resolution Check the VIP configuration, and make sure at least one of the NFS servers in the VIP pool is up and running. See Configuring NFS for HA. This alarm can also indicate that a VIP's hostname exceeds the maximum allowed length of 16 characters. Check the /opt/mapr/logs/nfsmon.log log file for additional information.

Node Alarms

Node alarms indicate problems in individual nodes. The following tables describe the MapR node alarms.

CLDB Service Alarm

UI Column CLDB Alarm

Logged As NODE_ALARM_SERVICE_CLDB_DOWN

Meaning The CLDB service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the CLDB service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the CLDB service, the alarm is cleared. If the warden is unable to restart the CLDB service, it may be necessary to contact technical support.

Core Present Alarm

UI Column Core files present

Logged As NODE_ALARM_CORE_PRESENT


Meaning A service on the node has crashed and created a core dump file. When all core files are removed, the alarm is cleared.

Resolution Contact technical support.

Debug Logging Active

UI Column Excess Logs Alarm

Logged As NODE_ALARM_DEBUG_LOGGING

Meaning Debug logging is enabled on the node.

Resolution Debug logging generates enormous amounts of data, and can fill up disk space. If debug logging is not absolutely necessary, turn it off: either use the Manage Services pane in the Node Properties view or the setloglevel command. If it is absolutely necessary, make sure that the logs in /opt/mapr/logs are not in danger of filling the entire disk.
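As a sketch, assuming setloglevel accepts a service name, target node, and level (verify the exact flags with the command's usage help), returning the fileserver to normal logging on a node might look like this:

maprcli setloglevel fileserver -loglevel INFO -node node01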

Disk Failure

UI Column Disk Failure Alarm

Logged As NODE_ALARM_DISK_FAILURE

Meaning A disk has failed on the node.

Resolution Check the disk health log (/opt/mapr/logs/faileddisk.log) to determine which disk failed and view any SMART data provided by the disk. See Handling Disk Failure.

Duplicate Host ID

UI Column Duplicate Host ID

Logged As NODE_ALARM_DUPLICATE_HOSTID

Meaning Two or more nodes in the cluster have the same host ID.

Resolution Multiple nodes with the same host ID are prevented from joining the cluster, in order to prevent addressing problems that can lead to data loss. To correct the problem and clear the alarm, make sure all host IDs are unique and use the maprcli node allow-into-cluster command to un-ban the affected host IDs.
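For example, with a hypothetical host ID, the un-ban command might look like the following (the -hostids parameter takes the affected host IDs):

maprcli node allow-into-cluster -hostids 5f32b4b0a9f9e1a1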

FileServer Service Alarm

UI Column FileServer Alarm

Logged As NODE_ALARM_SERVICE_FILESERVER_DOWN

Meaning The FileServer service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the FileServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the FileServer service, the alarm is cleared. If the warden is unable to restart the FileServer service, it may be necessary to contact technical support.

HBMaster Service Alarm

UI Column HBase Master Alarm

Logged As NODE_ALARM_SERVICE_HBMASTER_DOWN

Meaning The HBMaster service on the node has stopped running.


Resolution Go to the Manage Services pane of the Node Properties View to check whether the HBMaster service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the HBMaster service, the alarm is cleared. If the warden is unable to restart the HBMaster service, it may be necessary to contact technical support.

HBRegion Service Alarm

UI Column HBase RegionServer Alarm

Logged As NODE_ALARM_SERVICE_HBREGION_DOWN

Meaning The HBRegion service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the HBRegion service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the HBRegion service, the alarm is cleared. If the warden is unable to restart the HBRegion service, it may be necessary to contact technical support.

Hoststats Alarm

UI Column Hoststats process down

Logged As NODE_ALARM_HOSTSTATS_DOWN

Meaning The Hoststats service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the Hoststats service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be necessary to contact technical support.

Installation Directory Full Alarm

UI Column Installation Directory full

Logged As NODE_ALARM_OPT_MAPR_FULL

Meaning The /opt/mapr partition on the node is running out of space (95% full).

Resolution Free up some space in /opt/mapr on the node.

JobTracker Service Alarm

UI Column JobTracker Alarm

Logged As NODE_ALARM_SERVICE_JT_DOWN

Meaning The JobTracker service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the JobTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the JobTracker service, the alarm is cleared. If the warden is unable to restart the JobTracker service, it may be necessary to contact technical support.

MapR-FS High Memory Alarm

UI Column High FileServer Memory Alarm

Logged As NODE_ALARM_HIGH_MFS_MEMORY


Meaning Memory consumed by the fileserver service on the node is high.

Resolution Log on as root to the node for which the alarm is raised, and restart the warden: /etc/init.d/mapr-warden restart

MapR User Mismatch

UI Column MapR User Mismatch Alarm

Logged As NODE_ALARM_MAPRUSER_MISMATCH

Meaning The cluster nodes are not all set up to run MapR services as the same user (for example, some nodes are running MapR as root while others are running as mapr_user).

Resolution For the nodes on which the User Mismatch alarm is raised, follow the steps in Changing the User for MapR Services.

Metrics Write Problem Alarm

UI Column Metrics write problem Alarm

Logged As NODE_ALARM_METRICS_WRITE_PROBLEM

Meaning Unable to write Metrics data to the database or the MapR-FS local Metrics volume.

Resolution This issue can have multiple causes. To clear the alarm, check the log file at /opt/mapr/logs/hoststats.log for the cause of the write failure. In the case of database access failure, restore write access to the MySQL database. For more information, consult the process outlined in Setting up the MapR Metrics Database.

NFS Service Alarm

UI Column NFS Alarm

Logged As NODE_ALARM_SERVICE_NFS_DOWN

Meaning The NFS service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the NFS service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the NFS service, the alarm is cleared. If the warden is unable to restart the NFS service, it may be necessary to contact technical support.

PAM Misconfigured Alarm

UI Column PAM Alarm

Logged As NODE_ALARM_PAM_MISCONFIGURED

Meaning The PAM authentication on the node is configured incorrectly.

Resolution See PAM Configuration.

Root Partition Full Alarm

UI Column Root partition full

Logged As NODE_ALARM_ROOT_PARTITION_FULL

Meaning The root partition ('/') on the node is running out of space (99% full).

Resolution Free up some space in the root partition of the node.

TaskTracker Service Alarm


UI Column TaskTracker Alarm

Logged As NODE_ALARM_SERVICE_TT_DOWN

Meaning The TaskTracker service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the TaskTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the TaskTracker service, the alarm is cleared. If the warden is unable to restart the TaskTracker service, it may be necessary to contact technical support.

TaskTracker Local Directory Full Alarm

UI Column TaskTracker Local Directory Full Alarm

Logged As NODE_ALARM_TT_LOCALDIR_FULL

Meaning The local directory used by the TaskTracker on the specified node(s) is full, and the TaskTracker cannot operate as a result.

Resolution Delete or move data from the local disks, or add storage to the specified node(s), and try the jobs again.

Time Skew Alarm

UI Column Time Skew Alarm

Logged As NODE_ALARM_TIME_SKEW

Meaning The clock on the node is out of sync with the master CLDB by more than 20 seconds.

Resolution Use NTP to synchronize the time on all the nodes in the cluster.
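For example, on a typical Linux node you might sync the clock once against an NTP server and then keep the NTP daemon running; the exact commands and the server shown are illustrative and vary by distribution:

ntpdate -u pool.ntp.org
service ntpd start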

Version Alarm

UI Column Version Alarm

Logged As NODE_ALARM_VERSION_MISMATCH

Meaning One or more services on the node are running an unexpected version.

Resolution Stop the node, restore the correct version of any services you have modified, and restart the node. See Managing Nodes.

WebServer Service Alarm

UI Column WebServer Alarm

Logged As NODE_ALARM_SERVICE_WEBSERVER_DOWN

Meaning The WebServer service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties View to check whether the WebServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the WebServer service, the alarm is cleared. If the warden is unable to restart the WebServer service, it may be necessary to contact technical support.

Volume Alarms

Volume alarms indicate problems in individual volumes. The following tables describe the MapR volume alarms.

Data Unavailable


UI Column Data Alarm

Logged As VOLUME_ALARM_DATA_UNAVAILABLE

Meaning This is a potentially very serious alarm that may indicate data loss. Some of the data on the volume cannot be located. This alarm indicates that enough nodes have failed to bring the replication factor of part or all of the volume to zero. For example, if the volume is stored on a single node and has a replication factor of one, the Data Unavailable alarm will be raised if that node fails or is taken out of service unexpectedly. If a volume is replicated properly (and therefore is stored on multiple nodes) then the Data Unavailable alarm can indicate that a significant number of nodes are down.

Resolution Investigate any nodes that have failed or are out of service.

You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the Dashboard.
Check the cluster(s) for any snapshots or mirrors that can be used to re-create the volume. You can see snapshots and mirrors in the MapR-FS view.

Data Under-Replicated

UI Column Replication Alarm

Logged As VOLUME_ALARM_DATA_UNDER_REPLICATED

Meaning The volume replication factor is lower than the minimum replication factor set in Volume Properties. This can be caused by failing disks or nodes, or the cluster may be running out of storage space.

Resolution Investigate any nodes that are failing. You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the Dashboard. Determine whether it is necessary to add disks or nodes to the cluster. This alarm is generally raised when the nodes that store the volumes or replicas have not sent a heartbeat for five minutes. To prevent re-replication during normal maintenance procedures, MapR waits a specified interval (by default, one hour) before considering the node dead and re-replicating its data. You can control this interval by setting the cldb.fs.mark.rereplicate.sec parameter using the config save command.
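For example, to lengthen the interval to two hours before re-replication begins (the 7200 value is illustrative):

maprcli config save -values {"cldb.fs.mark.rereplicate.sec":"7200"}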

Inodes Limit Exceeded

UI Column Inodes Exceeded Alarm

Logged As VOLUME_ALARM_INODES_EXCEEDED

Meaning The volume contains too many files.

Resolution This alarm indicates that not enough volumes are set up to handle the number of files stored in the cluster. Typically, each user or project should have a separate volume.

Mirror Failure

UI Column Mirror Alarm

Logged As VOLUME_ALARM_MIRROR_FAILURE

Meaning A mirror operation failed.

Resolution Make sure the CLDB is running on both the source cluster and the destination cluster. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on both clusters for more information. If the attempted mirror operation was between two clusters, make sure that both clusters are reachable over the network. Make sure the source volume is available and reachable from the cluster that is performing the mirror operation.

No Nodes in Topology

UI Column No Nodes in Vol Topo

Logged As VOLUME_ALARM_NO_NODES_IN_TOPOLOGY


Meaning The path specified in the volume's topology no longer corresponds to a physical topology that contains any nodes, either due to node failures or changes to node topology settings. While this alarm is raised, MapR places data for the volume on nodes outside the volume's topology to prevent write failures.

Resolution Add nodes to the specified volume topology, either by moving existing nodes or adding nodes to the cluster. See Node Topology.

Snapshot Failure

UI Column Snapshot Alarm

Logged As VOLUME_ALARM_SNAPSHOT_FAILURE

Meaning A snapshot operation failed.

Resolution Make sure the CLDB is running. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on both clusters for more information. If the attempted snapshot was a scheduled snapshot that was running in the background, try a manual snapshot.

Topology Almost Full

UI Column Vol Topo Almost Full

Logged As VOLUME_ALARM_TOPOLOGY_ALMOST_FULL

Meaning The nodes in the specified topology are running out of storage space.

Resolution Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the specified topology.

Topology Full Alarm

UI Column Vol Topo Full

Logged As VOLUME_ALARM_TOPOLOGY_FULL

Meaning The nodes in the specified topology have run out of storage space.

Resolution Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the specified topology.

Volume Advisory Quota Alarm

UI Column Vol Advisory Quota Alarm

Logged As VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED

Meaning A volume has exceeded its advisory quota.

Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on the volume or stop further data writes.

Volume Quota Alarm

UI Column Vol Quota Alarm

Logged As VOLUME_ALARM_QUOTA_EXCEEDED

Meaning A volume has exceeded its quota. Further writes to the volume will fail.

Resolution Free some space on the volume or increase the volume hard quota.


Utilities

This section contains information about the following scripts and commands:

configure.sh - configures a node or client to work with the cluster
disksetup - sets up disks for use by MapR storage
mapr-support-collect.sh - collects cluster information for use by MapR Support
pullcentralconfig - pulls master configuration files from the cluster to the local disk
rollingupgrade.sh - upgrades software on a MapR cluster


configure.sh

Sets up a MapR cluster or client, creates or modifies /opt/mapr/conf/mapr-clusters.conf, and updates the corresponding *.conf and *.xml files.

Each time configure.sh is run, it creates or modifies a line in /opt/mapr/conf/mapr-clusters.conf containing a cluster name followed by a list of CLDB nodes. If you do not specify a name (using the -N parameter), configure.sh applies a default name (my.cluster.com) to the cluster. Subsequent runs of configure.sh without the -N parameter will operate on this default cluster. If you specify a name when you first run configure.sh, you can modify the CLDB and ZooKeeper settings corresponding to the named cluster by specifying the same name and running configure.sh again. Whenever you run configure.sh, you must be aware of the existing cluster name or names in mapr-clusters.conf and specify the -N parameter accordingly. If you specify a name that does not exist, a new line is created in mapr-clusters.conf and treated as a configuration for a separate cluster.

The normal use of configure.sh is to set up a MapR cluster, or to set up a MapR client for communication with one or more clusters.

To set up a cluster, run configure.sh on all nodes, specifying the cluster's CLDB and ZooKeeper nodes, and a cluster name if desired. If setting up a cluster on virtual machines, use the --isvm parameter.
To set up a client, run configure.sh on the client machine, specifying the CLDB and ZooKeeper nodes of the cluster or clusters. When setting up a client to work with multiple clusters, run configure.sh for each cluster, specifying the CLDB and ZooKeeper nodes normally and specifying the name with the -N parameter. On a client, use both the -c and -C parameters.
To change services (other than the CLDB and ZooKeeper) running on a node, run configure.sh with the -R option. If you change the location or number of CLDB or ZooKeeper services in a cluster, run configure.sh and specify the new lineup of CLDB and ZooKeeper nodes.
To specify a MySQL database to use for storing MapR Metrics data, use the -d, -du, and -dp parameters. If these are not specified, you can configure the database later using the MapR Control System or the MapR API.
To specify a user for running MapR services, either set $MAPR_USER to the username before running configure.sh, or specify the username in the -u parameter when running configure.sh.

On a Windows client, the script is named configure.bat but otherwise works in a similar way to configure.sh.

Syntax

/opt/mapr/server/configure.sh -C cldb_list (hostname[:port_no] [,hostname[:port_no]...]) -M cldb_mh_list (hostname[:port_no][,[hostname[:port_no]...]) -Z zookeeper_list (hostname[:port_no][,hostname[:port_no]...]) [ -c ] [ --isvm ] [ -J <CLDB JMX port> ] [ -L <log file> ] [ -N <cluster name> ] [ -R ] [ -d <host>:<port> ] [ -du <database username> ] [ -dp <database password> ] [ --create-user|-a ] [ -U <user ID> ] [ -u <username> ] [ -G <group ID> ] [ -g <group name> ] [ -f ]

Parameters

Parameter Description

-C Use the -C option only for CLDB servers that have a single IP address each. This option takes a list of the CLDB nodes that this machine uses to connect to the MapR cluster. The list is in the following format:

hostname[:port_no] [,hostname[:port_no]...]


-M Use the -M option only for multihomed CLDB servers that have more than one IP address. This option takes a list of the multihomed CLDB nodes that this machine uses to connect to the MapR cluster. The list is in the following format:

hostname[:port_no][,[hostname[:port_no]...]]

-Z The -Z option is required unless -c (lowercase) is specified. This option takes a list of the ZooKeeper nodes in the cluster. The list is in the following format:

hostname[:port_no][,hostname[:port_no]...]

--isvm Specifies virtual machine setup. Required when configure.sh is run on a virtual machine.

-c Specifies client setup. See Setting Up the Client.

-J Specifies the JMX port for the CLDB. Default: 7220

-L Specifies a log file. If not specified, configure.sh logs errors to /opt/mapr/logs/configure.log.

-N Specifies the cluster name.

-R After initial node configuration, specifies that configure.sh should use the previously configured ZooKeeper and CLDB nodes. When -R is specified, the CLDB credentials are read from mapr-clusters.conf and the ZooKeeper credentials are read from warden.conf. The -R option is useful for making changes to the services configured on a node without changing the CLDB and ZooKeeper nodes. The -C and -Z parameters are not required when -R is specified.

-d The host and port of the MySQL database to use for storing MapR Metrics data.

-du The username for logging into the MySQL database used for storing MapR Metrics data.

-dp The password for logging into the MySQL database used for storing MapR Metrics data.

--create-user or -a Create a local user to run MapR services, using the specified user from -u or the environment variable $MAPR_USER.

-U The user ID to use when creating $MAPR_USER with the --create-user or -a option; corresponds to the -u or --uid option of the useradd command in Linux.

-u The user name under which MapR services will run.

-G The group ID to use when creating $MAPR_USER with the --create-user or -a option; corresponds to the -g or --gid option of the useradd command in Linux.

-g The group name under which MapR services will run.

-f Specifies that the node should be configured without the system prerequisite check.

Examples

Add a node (not CLDB or ZooKeeper) to a cluster that is running the CLDB and ZooKeeper on three nodes:

On the new node, run the following command:

/opt/mapr/server/configure.sh -C nodeA,nodeB,nodeC -Z nodeA,nodeB,nodeC

Configure a client to work with cluster my.cluster.com, which has one CLDB at nodeA:

On a Linux client, run the following command:

/opt/mapr/server/configure.sh -N my.cluster.com -c -C nodeA

On a Windows 7 client, run the following command:

/opt/mapr/server/configure.bat -N my.cluster.com -c -C nodeA

Add a second cluster to the configuration:

On a node in the second cluster your.cluster.com, run the following command:


configure.sh -C nodeZ -N your.cluster.com

Adding CLDB servers with multiple IP addresses to a cluster:

In this example, the cluster my.cluster.com has CLDB servers at nodeA, nodeB, nodeC, and nodeD. The CLDB servers nodeB and nodeD have two NICs each, at eth0 and eth1.

On a node in the cluster , run the following command:my.cluster.com

configure.sh -N my.cluster.com -C nodeAeth0,nodeCeth0 -M nodeBeth0,nodeBeth1 -M nodeDeth0,nodeDeth1 -Z zknodeA


disksetup

The disksetup command formats specified disks for use by MapR storage and adds those disks to the disktab file. See Setting Up Disks for MapR for more information about when and how to use disksetup.

Syntax

/opt/mapr/server/disksetup <disk list file> [-F] [-G] [-M] [-W <stripe_width>]

Options

Option Description

-F Forces formatting of all specified disks. Disks that are already formatted for MapR are not reformatted by disksetup unless you specify this option. The -F option fails when a filesystem has an entry in the disktab file, is mounted, or is in use. Call maprcli disk remove to remove a disk entry from the disktab file.

-G Generates disktab contents from the input disk list, but does not format disks. This option is useful if disk names change after a reboot, or if the disktab file is damaged.

-M Uses the maximum available number of disks per storage pool.

-W Specifies the number of disks per storage pool.

Examples

Setting up disks specified in the file /tmp/disks.txt:

/opt/mapr/server/disksetup -F /tmp/disks.txt

Reformatting all disks

To reformat all disks, remove the disktab file and issue the disksetup -F command to format the disks:

/opt/mapr/server/disksetup -F

To reformat a particular disk from the disktab, use the maprcli disk remove and maprcli disk add commands. For more information, see Setting Up Disks for MapR.
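For example, removing and re-adding a single disk might look like the following (the host and disk names are illustrative):

maprcli disk remove -host node1 -disks /dev/sdc
maprcli disk add -host node1 -disks /dev/sdc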

Specifying disks

The disksetup script is used to format disks for use by the MapR cluster. Create a text file /tmp/disks.txt listing the disks and partitions for use by MapR on the node. Each line lists either a single disk or all applicable partitions on a single disk. When listing multiple partitions on a line, separate them by spaces. For example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks.txt file. For example:


/opt/mapr/server/disksetup -F /tmp/disks.txt

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere.

If you are re-using a node that was used previously in another cluster, it is important to format the disks to remove any traces of data from the old cluster.

Test Purposes Only: Using a Flat File for Storage

When setting up a small cluster for evaluation purposes, if a particular node does not have physical disks or partitions available to dedicate to the cluster, you can use a flat file on an existing disk partition as the node's storage. Create at least a 16GB file, and include a path to the file in the disk list file for the disksetup script.

The following example creates a 20 GB flat file (bs=1G specifies 1-gigabyte blocks, multiplied by count=20) at /root/storagefile:

$ dd if=/dev/zero of=/root/storagefile bs=1G count=20

Then, you would add the following to the disk list file /tmp/disks.txt to be used by disksetup:

/root/storagefile


mapr-support-collect.sh

Collects information about a cluster's recent activity, to help MapR Support diagnose problems.

The "mini-dump" option limits the size of the support output. When the or option is specified along with a size, -m --mini-dump support-dump collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size. The total size of.sh

the output is therefore limited to approximately 2 * size * number of logs. The size can be specified in bytes, or using the following suffixes:

b - bytesk - kilobytes (1024 bytes)m - megabytes (1024 kilobytes)
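For example, the following sketch collects a support dump while truncating any large log to a 10-megabyte head and tail (the output name is illustrative):

/opt/mapr/support/tools/mapr-support-collect.sh -m 10m -n minidump-output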

Syntax

/opt/mapr/support/tools/mapr-support-collect.sh [ -h|--hosts <host file> ] [ -H|--host <host entry> ] [ -Q|--no-cldb ] [ -n|--name <name> ] [ -d|--output-dir <path> ] [ -l|--no-logs ] [ -s|--no-statistics ] [ -c|--no-conf ] [ -i|--no-sysinfo ] [ -x|--exclude-cluster ] [ -u|--user <user> ] [ -m|--mini-dump <size> ] [ -O|--online ] [ -p|--par <par> ] [ -t|--dump-timeout <dump timeout> ] [ -T|--scp-timeout <SCP timeout> ] [ -C|--cluster-timeout <cluster timeout> ] [ -y|--yes ] [ -S|--scp-port <SCP port> ] [ --collect-cores ] [ --move-cores ] [ --port <port> ] [ -?|--help ]

Parameters

Parameter Description

-h or --hosts A file containing a list of hosts. Each line contains one host entry, in the format [user@]host[:port]

-H or --host One or more hosts in the format [user@]host[:port]

-Q or --no-cldb If specified, the command does not query the CLDB for list of nodes

-n or --name Specifies the name of the output file. If not specified, the default is a date-named file in the format YYYY-MM-DD-hh-mm-ss.tar

-d or --output-dir The absolute path to the output directory. If not specified, the default is /opt/mapr/support/collect/

-l or --no-logs If specified, the command output does not include log files

-s or --no-statistics If specified, the command output does not include statistics

-c or --no-conf If specified, the command output does not include configurations

-i or --no-sysinfo If specified, the command output does not include system information

-x or --exclude-cluster If specified, the command output does not collect cluster diagnostics


-u or --user The username for ssh connections

-m or --mini-dump <size> For any log file greater than 2 * <size>, collects only a head and tail each of the specified size. The <size> may have a suffix specifying units:

b - blocks (512 bytes)
k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

-O or --online Specifies a space-separated list of nodes from which to gather support output, and uses the warden instead of ssh for transmitting the support data.

-p or --par The maximum number of nodes from which support dumps will be gathered concurrently (default: 10)

-t or --dump-timeout The timeout for execution of the mapr-support-dump command on a node (default: 120 seconds; 0 = no limit)

-T or --scp-timeout The timeout for copy of support dump output from a remote node to the local file system (default: 120 seconds; 0 = no limit)

-C or --cluster-timeout The timeout for collection of cluster diagnostics (default: 300 seconds; 0 = no limit)

-y or --yes If specified, the command does not require acknowledgement of the number of nodes that will be affected

-S or --scp-port The local port to which remote nodes will establish an SCP session

--collect-cores If specified, the command collects cores of running mfs processes from all nodes (off by default)

--move-cores If specified, the command moves mfs and nfs cores from /opt/cores from all nodes (off by default)

--port The port number used by FileServer (default: 5660)

-? or --help Displays usage help text

Examples

Collect support information and dump it to the file /opt/mapr/support/collect/mysupport-output.tar:

/opt/mapr/support/tools/mapr-support-collect.sh -n mysupport-output


pullcentralconfig

The /opt/mapr/server/pullcentralconfig script on each node pulls master configuration files from /var/mapr/configuration on the cluster to the local disk:

If the master configuration file is newer, the local copy is overwritten by the master copy.
If the local configuration file is newer, no changes are made to the local copy.

The mapr.configuration volume (normally mounted at /var/mapr/configuration) contains directories with master configuration files:

Configuration files in the default directory are applied to all nodes.
To specify custom configuration files for individual nodes, create directories corresponding to individual hostnames. For example, the configuration files in a directory named /var/mapr/configuration/host1.r1.nyc would only be applied to the machine with the hostname host1.r1.nyc.

The following parameters in warden.conf control whether central configuration is enabled, the path to the master configuration files, and how often pullcentralconfig runs:

centralconfig.enabled - Specifies whether to enable central configuration
pollcentralconfig.interval.seconds - The frequency to check for configuration updates, in seconds
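As an illustration, warden.conf entries enabling central configuration might look like the following (the interval shown is an assumed value, not a documented default):

centralconfig.enabled=true
pollcentralconfig.interval.seconds=300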


rollingupgrade.sh

Upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages.

By default, any node on which the upgrade fails is rolled back to the previous version. To disable rollback, use the -n option. To force installation regardless of the existing version on each node, use the -r option. For more information about using rollingupgrade.sh, see Cluster Upgrade.

Syntax

/opt/upgrade-mapr/rollingupgrade.sh [-c <cluster name>] [-d] [-h] [-i <identity file>] [-n] [-p <directory>] [-r] [-s] [-u <username>] [-v <version>] [-x]

Parameters

Parameter Description

-c Cluster name.

-d If specified, performs a dry run without upgrading the cluster.

-h Displays help text.

-i Specifies an identity file for SSH. See the .SSH man page

-n Specifies that the node should not be rolled back to the previous version if upgrade fails.

-p Specifies a directory containing the upgrade packages.

-r Specifies reinstallation of packages even on nodes that are already at the target version.

-s Specifies SSH to upgrade nodes.

-u A username for SSH.

-v The target upgrade version, using the x.y.z format to specify the major, minor, and revision numbers. Example: 1.2.0

-x Specifies that packages should be copied to nodes via SCP.
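For example, a dry run of an SSH-based upgrade, sketched here with an illustrative cluster name and target version, might look like this:

/opt/upgrade-mapr/rollingupgrade.sh -c my.cluster.com -s -u mapr -v 2.1.1 -d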


Environment Variables

The following table describes environment variables specific to MapR.

Variable Example Values Description

JAVA_HOME /usr/lib/jvm/java-6-sun The directory where the correct version of Java is installed.

MAPR_HOME /opt/mapr The directory in which MapR is installed.

MAPR_SUBNETS 1.2.3.4/12, 5.6/24 If you do not want MapR to use all NICs on each node, use the environment variable MAPR_SUBNETS to restrict MapR traffic to specific NICs. Set MAPR_SUBNETS to a comma-separated list of up to four subnets in CIDR notation with no spaces. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. When MAPR_SUBNETS is set, make sure the node can reach all nodes in the cluster (servers and clients) using the specified subnets.

MAPR_USER mapr_user Used with configure.sh to specify the user under which MapR runs its services. If it is not explicitly set, defaults to the user mapr_user. After configure.sh is run, the value is stored in daemon.conf.
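For example, to restrict MapR traffic to two subnets before starting services (the subnet values are illustrative):

export MAPR_SUBNETS=10.10.15.0/24,10.10.16.0/24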


Configuration Files

This guide contains reference information about the following configuration files:

.dfs_attributes - Controls compression and chunk size for each directory
cldb.conf - Specifies configuration parameters for the CLDB and cluster topology
core-site.xml - Specifies the default filesystem
daemon.conf - Specifies the user and group that MapR services run as
disktab - Lists the disks in use by MapR-FS
hadoop-metrics.properties - Specifies where to output service metric reports
mapr-clusters.conf - Specifies the CLDB nodes for one or more clusters that can be reached from the node or client
mapred-default.xml - Contains MapReduce default settings that can be overridden using mapred-site.xml. Not to be edited directly by users.
mapred-site.xml - Core MapReduce settings
mfs.conf - Specifies parameters about the MapR-FS server on each node
The Roles File - Defines the configuration of services and nodes at install time
taskcontroller.cfg - Specifies TaskTracker configuration parameters
warden.conf - Specifies parameters related to MapR services and the warden. Not to be edited directly by users.
zoo.cfg - Specifies ZooKeeper configuration parameters


.dfs_attributes

Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these attributes, change the corresponding values in the file.

Example:

# lines beginning with # are treated as comments
Compression=lz4
ChunkSize=268435456

Valid values:

Compression: lz4, lzf, zlib, or false
Chunk size (in bytes): a multiple of 65536 (64 K) or zero (no chunks). Example: 131072

You can also set compression and chunk size using the hadoop mfs command.
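For example, assuming the standard hadoop mfs options for these attributes, the following commands would set lz4 compression and a 256 MB chunk size on a directory (the path is illustrative):

hadoop mfs -setcompression lz4 /myvolume/mydir
hadoop mfs -setchunksize 268435456 /myvolume/mydir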


cldb.conf

The /opt/mapr/conf/cldb.conf file specifies configuration parameters for the CLDB and for cluster topology.

Field Value Description

cldb.min.fileservers 1 Number of fileservers that must register with the CLDB before the root volume is created

cldb.port 7222 The port on which the CLDB listens.

cldb.numthreads 10 The number of threads reserved for use by the CLDB.

cldb.web.port 7221 The port the CLDB uses for the webserver.

cldb.containers.cache.entries 1000000 The maximum number of read/write containers available in the CLDB cache.

net.topology.script.file.name   The path to a script that associates IP addresses with physical topology paths. The script takes the IP address of a single node as input and returns the physical topology that should be associated with the specified node.

net.topology.table.file.name   The path to a text file that associates IP addresses with physical topology paths. Each line of the text file contains the IP address or hostname of one node, followed by the topology path that should be associated with the node.

cldb.zookeeper.servers   The nodes that are running ZooKeeper, in the format <host:port>.

hadoop.version   The version of Hadoop supported by the cluster.

cldb.jmxremote.port 7220 The CLDB JMX remote port

Example cldb.conf file


## CLDB Config file.
# Properties defined in this file are loaded during startup
# and are valid only for CLDB which loaded the config.
# These parameters are not persisted anywhere else.
#
# Wait until minimum number of fileserver register with
# CLDB before creating Root Volume
cldb.min.fileservers=1
# CLDB listening port
cldb.port=7222
# Number of worker threads
cldb.numthreads=10
# CLDB webport
cldb.web.port=7221
# Number of RW containers in cache
#cldb.containers.cache.entries=1000000
#
# Topology script to be used to determine
# Rack topology of node
# Script should take an IP address as input and print rack path
# on STDOUT. eg
# $>/home/mapr/topo.pl 10.10.10.10
# $>/mapr-rack1
# $>/home/mapr/topo.pl 10.10.10.20
# $>/mapr-rack2
#net.topology.script.file.name=/home/mapr/topo.pl
#
# Topology mapping file used to determine
# Rack topology of node
# File is of a 2 column format (space separated)
# 1st column is an IP address or hostname
# 2nd column is the rack path
# Line starting with '#' is a comment
# Example file contents
# 10.10.10.10 /mapr-rack1
# 10.10.10.20 /mapr-rack2
# host.foo.com /mapr-rack3
#net.topology.table.file.name=/home/mapr/topo.txt
#
# ZooKeeper address
cldb.zookeeper.servers=zoink:5181
# Hadoop metrics jar version
hadoop.version=0.20.2
# CLDB JMX remote port
cldb.jmxremote.port=7220


core-site.xml

The /opt/mapr/hadoop/hadoop-<version>/conf/core-site.xml file contains configuration information that overrides the default values for core Hadoop properties. Overrides of the default values for MapReduce configuration properties are stored in the mapred-site.xml file.

To override a default value, specify the new value within the <configuration> tags, using the following format:

<property>
  <name> </name>
  <value> </value>
  <description> </description>
</property>

The table of core parameters describes the possible entries to place in the <name> and <value> tags. The <description> tag is optional but recommended for maintainability.

You can examine the current configuration information for this node by using the hadoop conf -dump command from a command line.

Default core-site.xml file

<?xml version="1.0"?><?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

</configuration>
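For example, to enable the trash feature with hourly checkpoints (60 is an illustrative value for the fs.trash.interval parameter described below), you could add the following inside the <configuration> tags:

<property>
  <name>fs.trash.interval</name>
  <value>60</value>
  <description>Minutes between trash checkpoints; 0 disables trash</description>
</property>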

Core Parameters

Parameter Default Value Description

fs.automatic.close True The default behavior for filesystem instances is to close when the program exits. This filesystem closure uses a JVM shutdown hook. Set this property to False to disable this behavior. This is an advanced option. Set the value of fs.automatic.close to False only if your server application requires a specific shutdown sequence.

fs.file.impl org.apache.hadoop.fs.LocalFileSystem The filesystem for OS mounts that use URIs.file:

fs.ftp.impl org.apache.hadoop.fs.ftp.FTPFileSystem The filesystem for URIs.ftp:

fs.har.impl.disable.cache True The default value does not cache har filesystem instances. Set this value to False to enable caching of har filesystem instances.

fs.har.impl org.apache.hadoop.fs.HarFileSystem The filesystem for Hadoop archives.

fs.hdfs.impl org.apache.hadoop.hdfs.DistributedFileSystem The filesystem for URIs.hdfs:

fs.hftp.impl org.apache.hadoop.hftp.HftpFileSystem The filesystem for URIs.hftp:

fs.hsftp.impl org.apache.hadoop.hdfs.HsftpFileSystem The filesystem for URIs.hsftp:

fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem The filesystem for URIs.kfs:

fs.maprfs.impl com.mapr.fs.MapRFileSystem The filesystem for URIs.maprfs:

fs.mapr.working.dir /user/$USERNAME/ Working directory for MapR-FS


fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem The filesystem for URIs.ramfs:

fs.s3.blockSize 33554432 Block size to use when writing files to S3.

fs.s3.buffer.dir ${hadoop.tmp.dir}/s3 Specifies the location on the local filesystem where Amazon S3 stores files before the files are sent to the S3 filesystem. This location also stores files retrieved from S3.

fs.s3.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem The filesystem for URIs.s3:

fs.s3.maxRetries 4 Specifies the maximum number of retries for file read or write operations to S3. After the maximum number of retries has been attempted, Hadoop signals failure to the application.

fs.s3n.blockSize 33554432 Block size to use when reading files from the native S3 filesystem using s3n: URIs.

fs.s3n.impl org.apache.hadoop.fs.s3native.NativeS3FileSystem The filesystem for URIs.s3n:

fs.s3.sleepTimeSeconds 10 The number of seconds to sleep between S3 retries.

fs.trash.interval 0 Specifies the number of minutes between trash checkpoints. Set this value to zero to disable the trash feature.

hadoop.logfile.count 10 This property is deprecated.

hadoop.logfile.size 10000000 This property is deprecated.

hadoop.native.lib True Specifies whether to use native Hadoop libraries if they are present. Set this value to False to disable the use of native Hadoop libraries.

hadoop.rpc.socket.factory.class.default org.apache.hadoop.net.StandardSocketFactory Specifies the default socket factory. The value for this parameter must be in the format package.FactoryClassName.

hadoop.security.authentication simple Specifies authentication protocols to use. The default value of simple uses no authentication. Specify kerberos to enable Kerberos authentication.

hadoop.security.authorization False Specifies whether or not service-level authorization is enabled. Specify True to enable service-level authorization.

hadoop.security.group.mapping org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback Specifies the user-to-group mapping class that returns the groups a given user is in.

hadoop.security.uid.cache.secs 14400 Specifies the timeout for entries in the NativeIO cache of UID-to-UserName pairs.

hadoop.tmp.dir /tmp/hadoop-${user.name} Specifies the base directory for other temporary directories.

hadoop.workaround.non.threadsafe.getpwuid False Some operating systems or authentication modules are known to have broken implementations of getpwuid_r and getpwgid_r that are not thread-safe. Symptoms of this problem include JVM crashes with a stack trace inside these functions. Enable this configuration parameter to include a lock around the calls as a workaround. An incomplete list of some systems known to have this issue is available at http://wiki.apache.org/hadoop/KnownBrokenPwuidImplementations

io.bytes.per.checksum 512 The number of checksum bytes. The maximum value for this parameter is equal to the value of the io.file.buffer.size parameter.

io.compression.codecs org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec

A list of compression codec classes available for compression and decompression.

io.file.buffer.size 8192 Specifies the buffer size for sequence files. For optimal performance, set this parameter's value to a multiple of the hardware page size (the Intel x86 architecture has a hardware page size of 4096). This value determines how much data is buffered during read and write operations.


io.mapfile.bloom.error.rate 0.005 This value specifies the acceptable rate of false positives for BloomFilter-s, which is used in BloomMapFile. The size of the BloomFilter-s file increases exponentially as the value of this property decreases.

io.mapfile.bloom.size 1048576 This value sets the maximum number of keys in a BloomFilter-s file used in BloomMapFile. Once a number of keys equal to this value is appended, the next BloomFilter is created inside a DynamicBloomFilter. Larger values decrease the number of individual filters. A lower number of filters increases performance and consumes more space.

io.map.index.skip 0 Number of index entries to skip between each entry. Values for this property larger than zero can facilitate opening large map files using less memory.

io.seqfile.compress.blocksize 1000000 The minimum block size for compression in block compressed Seq.uenceFiles

io.seqfile.lazydecompress True . Set this value to False to always decompressDeprecatedblock-compressed .SequenceFiles

io.seqfile.sorter.recordlimit 1000000 . The limit on number of records kept in memory in aDeprecatedspill in .SequenceFiles.Sorter

io.serializations org.apache.hadoop.io.serializer.WritableSerialization A list of serialization classes available for obtaining serializers anddeserializers.

io.skip.checksum.errors False Set this property to True to skip an entry instead of throwing anexception when encountering a checksum error while reading asequence file.

ipc.client.connection.maxidletime 10000 This property's value specifies the maximum idle time inmilliseconds. Once this idle time elapses, the client drops theconnection to the server.

ipc.client.connect.max.retries 10 This property's value specifies the maximum number of retryattempts a client makes to establish a server connection.

ipc.client.idlethreshold 4000 This property's value specifies number of connections after whichconnections are inspected for idleness.

ipc.client.kill.max 10 This property's value specifies the maximum number of clients todisconnect simultaneously.

ipc.client.max.connection.setup.timeout 20 This property's value specifies the time in minutes that a failoverRPC from the job client waits while setting up initial connectionwith the server.

ipc.client.tcpnodelay True Change this value to False to enable Nagle's algorithm for the TCPsocket connection on the client. Disabling Nagle's algorithm uses agreater number of smaller packets and may decrease latency.

ipc.server.listen.queue.size 128 Indicates the length of the listen queue for servers accepting clientconnections.

ipc.server.tcpnodelay True Change this value to False to enable Nagle's algorithm for the TCPsocket connection on the server. Disabling Nagle's algorithm usesa greater number of smaller packets and may decrease latency.


daemon.conf

The /opt/mapr/conf/daemon.conf file specifies the user and group under which MapR services run, and whether all MapR services run as the specified user/group, or only ZooKeeper and FileServer. The configuration parameters operate as follows:

If mapr.daemon.user and mapr.daemon.group are set, the ZooKeeper and FileServer run as the specified user/group. Otherwise, they run as root.
If mapr.daemon.runuser.warden=1, all services started by the warden run as the specified user. Otherwise, they run as root.

Sample daemon.conf file

mapr.daemon.user=mapr
mapr.daemon.group=mapr
mapr.daemon.runuser.warden=1


disktab

On each node, the /opt/mapr/conf/disktab file lists all of the physical drives and partitions that have been added to MapR-FS. The disktab file is created by disksetup and automatically updated when disks are added or removed (either using the MapR Control System, or with the disk add and disk remove commands).

Sample disktab file

# MapR Disks Mon Nov 28 11:46:16 2011

/dev/sdb 47E4CCDA-3536-E767-CD18-0CB7E4D34E00
/dev/sdc 7B6A3E66-6AF0-AF60-AE39-01B8E4D34E00
/dev/sdd 27A59ED3-DFD4-C692-68F8-04B8E4D34E00
/dev/sde F0BB5FB1-F2AC-CC01-275B-08B8E4D34E00
/dev/sdf 678FCF40-926F-0D04-49AC-0BB8E4D34E00
/dev/sdg 46823852-E45B-A7ED-8417-02B9E4D34E00
/dev/sdh 60A99B96-4CEE-7C46-A749-05B9E4D34E00
/dev/sdi 66533D4D-49F9-3CC4-0DF9-08B9E4D34E00
/dev/sdj 44CA818A-9320-6BBB-3751-0CB9E4D34E00
/dev/sdk 587E658F-EC8B-A3DF-4D74-00BAE4D34E00
/dev/sdl 11384F8D-1DA2-E0F3-E6E5-03BAE4D34E00


hadoop-metrics.properties

The hadoop-metrics.properties files direct MapR where to output service metric reports: to an output file (FileContext) or to Ganglia 3.1 (MapRGangliaContext31). A third context, NullContext, disables metrics. To direct metrics to an output file, comment out the lines pertaining to Ganglia and the NullContext for the chosen service; to direct metrics to Ganglia, comment out the lines pertaining to the metrics file and the NullContext. See Service Metrics.

There are two hadoop-metrics.properties files:

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties specifies output for standard Hadoop services
/opt/mapr/conf/hadoop-metrics.properties specifies output for MapR-specific services

The following table describes the parameters for each service in the hadoop-metrics.properties files.

Parameter Example Values Description

<service>.class org.apache.hadoop.metrics.spi.NullContextWithUpdateThread, org.apache.hadoop.metrics.file.FileContext, com.mapr.fs.cldb.counters.MapRGangliaContext31 The class that implements the interface responsible for sending the service metrics to the appropriate handler. When implementing a class that sends metrics to Ganglia, set this property to the class name.

<service>.period 10, 60 The interval between two service metrics data exports to the appropriate interface. This is independent of how often the metrics are updated in the framework.

<service>.fileName /tmp/cldbmetrics.log The path to the file where service metrics are exported when the cldb.class property is set to FileContext.

<service>.servers localhost:8649 The location of the gmond or gmetad that is aggregating metrics for this instance of the service, when the cldb.class property is set to GangliaContext.

<service>.spoof 1 Specifies whether the metrics being sent out from the server should be spoofed as coming from another server. All our fileserver metrics are also on the CLDB, but to make it appear to end users as if these properties were emitted by the fileserver host, we spoof the metrics to Ganglia using this property. Currently only used for the FileServer service.

Examples

The hadoop-metrics.properties files are organized into sections for each service that provides metrics. Each section is divided into subsections for the three contexts.

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties


# Configuration of the context "dfs" for nulldfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the context file"dfs" for#dfs.class=org.apache.hadoop.metrics.file.FileContext#dfs.period=10#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the context ganglia"dfs" for# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31# dfs.period=10# dfs.servers=localhost:8649

# Configuration of the context "mapred" for nullmapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the context file"mapred" for#mapred.class=org.apache.hadoop.metrics.file.FileContext#mapred.period=10#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the context ganglia"mapred" for# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31# mapred.period=10# mapred.servers=localhost:8649

# Configuration of the context "jvm" for null#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the context file"jvm" for#jvm.class=org.apache.hadoop.metrics.file.FileContext#jvm.period=10#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the context ganglia"jvm" for# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext# jvm.period=10# jvm.servers=localhost:8649

# Configuration of the context "ugi" for nullugi.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the context "fairscheduler" for null#fairscheduler.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the context file"fairscheduler" for#fairscheduler.class=org.apache.hadoop.metrics.file.FileContext#fairscheduler.period=10#fairscheduler.fileName=/tmp/fairschedulermetrics.log

# Configuration of the context ganglia"fairscheduler" for# fairscheduler.class=org.apache.hadoop.metrics.ganglia.GangliaContext# fairscheduler.period=10# fairscheduler.servers=localhost:8649#

/opt/mapr/conf/hadoop-metrics.properties


###########################################################################
# hadoop-metrics.properties
###########################################################################

#CLDB metrics config - Pick one out of null, file or ganglia.
#Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context

# Configuration of the "cldb" context for null
#cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#cldb.period=10

# Configuration of the "cldb" context for file
#cldb.class=org.apache.hadoop.metrics.file.FileContext
#cldb.period=60
#cldb.fileName=/tmp/cldbmetrics.log

# Configuration of the "cldb" context for ganglia
cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=localhost:8649
cldb.spoof=1

#FileServer metrics config - Pick one out of null, file or ganglia.
#Uncomment all properties in null, file or ganglia context, to send fileserver metrics to that context

# Configuration of the "fileserver" context for null
#fileserver.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#fileserver.period=10

# Configuration of the "fileserver" context for file
#fileserver.class=org.apache.hadoop.metrics.file.FileContext
#fileserver.period=60
#fileserver.fileName=/tmp/fsmetrics.log

# Configuration of the "fileserver" context for ganglia
fileserver.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
fileserver.period=37
fileserver.servers=localhost:8649
fileserver.spoof=1

###########################################################################


mapr-clusters.conf

The /opt/mapr/conf/mapr-clusters.conf configuration file specifies the CLDB nodes for one or more clusters that can be reached from the node or client on which it is installed.

Format:

clustername1 <CLDB> <CLDB> <CLDB>
[ clustername2 <CLDB> <CLDB> <CLDB> ]
[ ... ]

The <CLDB> string format is one of the following:

host,ip:port - Host, IP, and port (uses DNS to resolve hostnames, or provided IP if DNS is down)
host:port - Hostname and port (uses DNS to resolve host, specifies port)
ip:port - IP and port (avoids using DNS to resolve hosts, specifies port)
host - Hostname only (default, uses DNS to resolve host, uses default port)
ip - IP only (avoids using DNS to resolve hosts, uses default port)
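
For illustration, a mapr-clusters.conf describing two reachable clusters might look like the following sketch (the cluster names, hosts, and the 7222 CLDB port shown here are placeholder assumptions):

my.cluster.com cldb1.example.com:7222 cldb2.example.com:7222 cldb3.example.com:7222
backup.cluster.com 10.10.1.10:7222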

You can use the configure.sh script to add entries to the mapr-clusters.conf file by using the following syntax:

configure.sh -C <cluster CLDB nodes> -N <cluster name>
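
For example, a hypothetical invocation that registers a three-CLDB cluster named my.cluster.com (the node names are placeholders) might be:

/opt/mapr/server/configure.sh -C cldb1.example.com,cldb2.example.com,cldb3.example.com -N my.cluster.com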


mapred-default.xml

The mapred-default.xml configuration file provides defaults that can be overridden using mapred-site.xml, and is located in the Hadoop core JAR file (/opt/mapr/hadoop/hadoop-<version>/lib/hadoop-<version>-dev-core.jar).

Do not modify mapred-default.xml directly. Instead, copy parameters into mapred-site.xml and modify them there. If mapred-site.xml does not already exist, create it.

You can examine the current configuration information for this node by using the hadoop conf -dump command from a command line.

The format for a parameter in both mapred-default.xml and mapred-site.xml is:

<property>
  <name>io.sort.spill.percent</name>
  <value>0.99</value>
  <description>The soft limit in either the buffer or record collection
  buffers. Once reached, a thread will begin to spill the contents to disk
  in the background. Note that this does not imply any chunking of data to
  the spill. A value less than 0.5 is not recommended.</description>
</property>

The <name> element contains the parameter name, the <value> element contains the parameter value, and the optional <description> element contains the parameter description. You can create XML for any parameter from the table below, using the example above as a guide.

Parameter Value Description

hadoop.job.history.location   If the job tracker is static, the history files are stored in this single well-known place on the local filesystem. If no value is set here, by default it is in the local file system at $<hadoop.log.dir>/history. History files are moved to mapred.jobtracker.history.completed.location, which is on the MapRFS JobTracker volume.

hadoop.job.history.user.location   User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in the output directory. The files are stored in "_logs/history/" in the directory. User can stop logging by giving the value "none".

hadoop.rpc.socket.factory.class.JobSubmissionProtocol   SocketFactory to use to connect to a Map/Reduce master (JobTracker). If null or empty, then use hadoop.rpc.socket.class.default.

io.map.index.skip 0 Number of index entries to skip between each entry. Zero by default. Setting this to values larger than zero can facilitate opening large map files using less memory.

io.sort.factor 256 The number of streams to merge at once while sorting files. This determines the number of open file handles.

io.sort.mb 100 Buffer used to hold map outputs in memory before writing final map outputs. Setting this value very low may cause spills. If the average input to a map is "MapIn" bytes then typically the value of io.sort.mb should be 1.25 times MapIn bytes.

io.sort.record.percent 0.17 The percentage of io.sort.mb dedicated to tracking record boundaries. Let this value be r, and io.sort.mb be x. The maximum number of records collected before the collection thread must block is equal to (r * x) / 4.

io.sort.spill.percent 0.99 The soft limit in either the buffer or record collection buffers. Once reached, a thread will begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. A value less than 0.5 is not recommended.


job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jobId&amp;jobStatus=$jobStatus Indicates the URL that will be called on completion of a job to inform the end status of the job. User can give at most 2 variables with the URI: $jobId and $jobStatus. If they are present in the URI, they will be replaced by their respective values.

job.end.retry.attempts 0 Indicates how many times hadoop should attempt to contact the notification URL.

job.end.retry.interval 30000 Indicates the time in milliseconds between notification URL retry calls.

jobclient.completion.poll.interval 5000 The interval (in milliseconds) between which the JobClient polls the JobTracker for updates about job status. You may want to set this to a lower value to make tests run faster on a single node system. Adjusting this value in production may lead to unwanted client-server traffic.

jobclient.output.filter FAILED The filter for controlling the output of the task's userlogs sent to the console of the JobClient. The permissible options are: NONE, KILLED, FAILED, SUCCEEDED and ALL.

jobclient.progress.monitor.poll.interval 1000 The interval (in milliseconds) between which the JobClient reports status to the console and checks for job completion. You may want to set this to a lower value to make tests run faster on a single node system. Adjusting this value in production may lead to unwanted client-server traffic.

map.sort.class org.apache.hadoop.util.QuickSort The default sort class for sorting keys.

mapr.localoutput.dir output The path for local output.

mapr.localspill.dir spill The path for local spill.

mapr.localvolumes.path /var/mapr/local The path for local volumes.

mapred.acls.enabled false Specifies whether ACLs should be checked for authorization of users for doing various queue and job level operations. ACLs are disabled by default. If enabled, access control checks are made by the JobTracker and TaskTracker when requests are made by users for queue operations like submitting a job to a queue and killing a job in the queue, and job operations like viewing the job-details (see mapreduce.job.acl-view-job) or modifying the job (see mapreduce.job.acl-modify-job) using Map/Reduce APIs, RPCs or via the console and web user interfaces.

mapred.child.env   User added environment variables for the task tracker child processes. Example: 1) A=foo This will set the env variable A to foo. 2) B=$B:c This will inherit the tasktracker's B env variable.

mapred.child.java.opts   Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[email protected] The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.

mapred.child.oom_adj 10 Increase the OOM adjust for the oom killer (Linux specific). We only allow increasing the adj value. (valid values: 0-15)

mapred.child.renice 10 Nice value to run the job in. On Linux the range is from -20 (most favorable) to 19 (least favorable). We only allow reducing the priority. (valid values: 0-19)

mapred.child.taskset true Run the job in a taskset. man taskset (Linux specific). 1-4 CPUs: no taskset. 5-8 CPUs: taskset 1- (processor 0 reserved for infrastructure processes). 9-n CPUs: taskset 2- (processors 0,1 reserved for infrastructure processes).

mapred.child.tmp ./tmp To set the value of the tmp directory for map and reduce tasks. If the value is an absolute path, it is directly assigned. Otherwise, it is prepended with the task's working directory. The java tasks are executed with the option -Djava.io.tmpdir='the absolute path of the tmp dir'. Pipes and streaming are set with the environment variable TMPDIR='the absolute path of the tmp dir'.


mapred.child.ulimit   The maximum virtual memory, in KB, of a process launched by the Map-Reduce framework. This can be used to control both the Mapper/Reducer tasks and applications using Hadoop Pipes, Hadoop Streaming etc. By default it is left unspecified to let cluster admins control it via limits.conf and other such relevant mechanisms. Note: mapred.child.ulimit must be greater than or equal to the -Xmx passed to the JavaVM, else the VM might not start.

mapred.cluster.map.memory.mb -1 The size, in terms of virtual memory, of a single map slot in the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, up to the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.cluster.max.map.memory.mb -1 The maximum size, in terms of virtual memory, of a single map task launched by the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, up to the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.cluster.max.reduce.memory.mb -1 The maximum size, in terms of virtual memory, of a single reduce task launched by the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, up to the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.cluster.reduce.memory.mb -1 The size, in terms of virtual memory, of a single reduce slot in the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, up to the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.compress.map.output false Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.

mapred.healthChecker.interval 60000 Frequency of the node health script to be run, in milliseconds.

mapred.healthChecker.script.args   List of arguments, comma separated, which are to be passed to the node health script when it is launched.

mapred.healthChecker.script.path   Absolute path to the script which is periodically run by the node health monitoring service to determine if the node is healthy or not. If the value of this key is empty or the file does not exist in the location configured here, the node health monitoring service is not started.

mapred.healthChecker.script.timeout 600000 Time after which the node health script is killed if unresponsive, and considered to have failed.

mapred.hosts.exclude   Names a file that contains the list of hosts that should be excluded by the jobtracker. If the value is empty, no hosts are excluded.

mapred.hosts   Names a file that contains the list of nodes that may connect to the jobtracker. If the value is empty, all hosts are permitted.

mapred.inmem.merge.threshold 1000 The threshold, in terms of the number of files, for the in-memory merge process. When we accumulate the threshold number of files we initiate the in-memory merge and spill to disk. A value of 0 or less than 0 indicates we don't want any threshold and instead depend only on the ramfs's memory consumption to trigger the merge.


mapred.job.map.memory.mb -1 The size, in terms of virtual memory, of a single map task for the job. A job can ask for multiple slots for a single map task, rounded up to the next multiple of mapred.cluster.map.memory.mb and up to the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off iff mapred.cluster.map.memory.mb is also turned off (-1).

mapred.job.map.memory.physical.mb   Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt will be FAILED.

mapred.job.queue.name default Queue to which a job is submitted. This must match one of the queues defined in mapred.queue.names for the system. Also, the ACL setup for the queue must allow the current user to submit a job to the queue. Before specifying a queue, ensure that the system is configured with the queue, and access is allowed for submitting jobs to the queue.

mapred.job.reduce.input.buffer.percent 0.0 The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.

mapred.job.reduce.memory.mb -1 The size, in terms of virtual memory, of a single reduce task for the job. A job can ask for multiple slots for a single reduce task, rounded up to the next multiple of mapred.cluster.reduce.memory.mb and up to the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off iff mapred.cluster.reduce.memory.mb is also turned off (-1).

mapred.job.reduce.memory.physical.mb   Maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt will be FAILED.

mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per JVM. If set to -1, there is no limit.

mapred.job.shuffle.input.buffer.percent 0.70 The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.

mapred.job.shuffle.merge.percent 0.66 The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapred.job.shuffle.input.buffer.percent.

mapred.job.tracker.handler.count 10 The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job history files are stored at this single well-known location. If nothing is specified, the files are stored at $<hadoop.job.history.location>/done in the local filesystem.

mapred.job.tracker.http.address 0.0.0.0:50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port.

mapred.job.tracker.persist.jobstatus.active false Indicates if persistency of job status information is active or not.

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job status information is persisted in a file system, to be available after it drops off the memory queue and between jobtracker restarts.

mapred.job.tracker.persist.jobstatus.hours 0 The number of hours job status information is persisted in DFS. The job status information will be available after it drops off the memory queue and between jobtracker restarts. With a zero value the job status information is not persisted at all in DFS.

mapred.job.tracker localhost:9001 JobTracker address ip:port, or use the URI maprfs:/// for the default cluster or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster.

mapred.jobtracker.completeuserjobs.maximum 100 The maximum number of complete jobs per user to keep around before delegating them to the job history.

mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst Expert: The instrumentation class to associate with each JobTracker.


mapred.jobtracker.job.history.block.size 3145728 The block size of the job history file. Since job recovery uses job history, it's important to dump job history to disk as soon as possible. Note that this is an expert level parameter. The default value is set to 3 MB.

mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

mapred.jobtracker.maxtasks.per.job -1 The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.

mapred.jobtracker.plugins   Comma-separated list of jobtracker plug-ins to be activated.

mapred.jobtracker.port 9001 Port on which the JobTracker listens.

mapred.jobtracker.restart.recover true "true" to enable (job) recovery upon restart, "false" to start afresh.

mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job statuses to keep in the cache.

mapred.jobtracker.taskScheduler.maxRunningTasksPerJob   The maximum number of running tasks for a job before it gets preempted. No limits if undefined.

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler The class responsible for scheduling the tasks.

mapred.line.input.format.linespermap 1 Number of lines per split in NLineInputFormat.

mapred.local.dir.minspacekill 0 If the space in mapred.local.dir drops under this, do not ask for more tasks until all the current ones have finished and cleaned up. Also, to save the rest of the tasks we have running, kill one of them, to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes.

mapred.local.dir.minspacestart 0 If the space in mapred.local.dir drops under this, do not ask for more tasks. Value in bytes.

mapred.local.dir $<hadoop.tmp.dir>/mapred/local The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.

mapred.map.child.env   User added environment variables for the task tracker child processes. Example: 1) A=foo This will set the env variable A to foo. 2) B=$B:c This will inherit the tasktracker's B env variable.

mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error%p.log Java opts for the map tasks. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@[email protected] The configuration variable mapred.<map/reduce>.child.ulimit can be used to control the maximum virtual memory of the child processes. MapR: The default heapsize (-Xmx) is determined by memory reserved for mapreduce at the tasktracker. A reduce task is given more memory than a map task. Default memory for a map task = (Total Memory reserved for mapreduce) * (#mapslots / (#mapslots + 1.3*#reduceslots)).

mapred.map.child.ulimit   The maximum virtual memory, in KB, of a process launched by the Map-Reduce framework. This can be used to control both the Mapper/Reducer tasks and applications using Hadoop Pipes, Hadoop Streaming etc. By default it is left unspecified to let cluster admins control it via limits.conf and other such relevant mechanisms. Note: mapred.<map/reduce>.child.ulimit must be greater than or equal to the -Xmx passed to the JavaVM, else the VM might not start.

mapred.map.max.attempts 4 Expert: The maximum number of attempts per map task. In other words, the framework will try to execute a map task this many times before giving up on it.

mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec If the map outputs are compressed, how should they be compressed?


mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel.

mapred.map.tasks 2 The default number of map tasks per job. Ignored when mapred.job.tracker is "local".

mapred.max.tracker.blacklists 4 The number of blacklists for a taskTracker by various jobs after which the task tracker could be blacklisted across all jobs. The tracker will be given tasks later (after a day). The tracker will become a healthy tracker after a restart.

mapred.max.tracker.failures 4 The number of task-failures on a tasktracker of a given job after which new tasks of that job aren't assigned to it.

mapred.merge.recordsBeforeProgress 10000 The number of records to process during a merge before sending a progress notification to the TaskTracker.

mapred.min.split.size 0 The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.

mapred.output.compress false Should the job outputs be compressed?

mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec If the job outputs are compressed, how should they be compressed?

mapred.output.compression.type RECORD If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.

mapred.queue.default.state RUNNING This value defines the state the default queue is in. The values can be either "STOPPED" or "RUNNING". This value can be changed at runtime.

mapred.queue.names default Comma separated list of queues configured for this jobtracker. Jobs are added to queues and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention mapred.queue.$QUEUE-NAME.$PROPERTY-NAME, e.g. mapred.queue.default.submit-job-acl. The number of queues configured in this parameter could depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Before adding more queues, ensure that the scheduler you've configured supports multiple queues.

mapred.reduce.child.env    

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error%p.log Java opts for the reduce tasks. MapR: The default heapsize (-Xmx) is determined by memory reserved for mapreduce at the tasktracker. A reduce task is given more memory than a map task. Default memory for a reduce task = (Total Memory reserved for mapreduce) * (1.3*#reduceslots / (#mapslots + 1.3*#reduceslots)).

mapred.reduce.child.ulimit    

mapred.reduce.copy.backoff 300 The maximum amount of time (in seconds) a reducer spends on fetching one map output before declaring it as failed.

mapred.reduce.max.attempts 4 Expert: The maximum number of attempts per reduce task. In other words, the framework will try to execute a reduce task this many times before giving up on it.

mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy (shuffle) phase.

mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.

mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel.


mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".

mapred.skip.attempts.to.start.skipping 2 The number of task attempts AFTER which skip mode will be kicked off. When skip mode is kicked off, the task reports the range of records which it will process next to the TaskTracker, so that on failures, the tasktracker knows which ones are possibly the bad records. On further executions, those are skipped.

mapred.skip.map.auto.incr.proc.count true The flag which, if set to true, SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS is incremented by MapRunner after invoking the map function. This value must be set to false for applications which process the records asynchronously or buffer the input records. For example streaming. In such cases applications should increment this counter on their own.

mapred.skip.map.max.skip.records 0 The number of acceptable skip records surrounding the bad record PER bad record in the mapper. The number includes the bad record as well. To turn off the feature of detection/skipping of bad records, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that the framework need not try to narrow down. Whatever records (depends on application) get skipped are acceptable.

mapred.skip.out.dir   If no value is specified here, the skipped records are written to the output directory at _logs/skip. User can stop writing skipped records by giving the value "none".

mapred.skip.reduce.auto.incr.proc.count true The flag which, if set to true, SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS is incremented by the framework after invoking the reduce function. This value must be set to false for applications which process the records asynchronously or buffer the input records. For example streaming. In such cases applications should increment this counter on their own.

mapred.skip.reduce.max.skip.groups 0 The number of acceptable skip groups surrounding the bad group PER bad group in the reducer. The number includes the bad group as well. To turn off the feature of detection/skipping of bad groups, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that the framework need not try to narrow down. Whatever groups (depends on application) get skipped are acceptable.

mapred.submit.replication 10 The replication level for submitted job files. This should be around the square root of the number of nodes.

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce stores control files.

mapred.task.cache.levels 2 This is the max level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level.

mapred.task.profile.maps 0-2 To set the ranges of map tasks to profile. mapred.task.profile has to be set to true for the value to be accounted.

mapred.task.profile.reduces 0-2 To set the ranges of reduce tasks to profile. mapred.task.profile has to be set to true for the value to be accounted.

mapred.task.profile false To set whether the system should collect profiler information for some of the tasks in this job. The information is stored in the user log directory. The value is "true" if task profiling is enabled.

mapred.task.timeout 600000 The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string.

mapred.task.tracker.http.address 0.0.0.0:50060 The task tracker http server address and port. If the port is 0 then the server will start on a free port.


mapred.task.tracker.report.address 127.0.0.1:0 The interface and port that the task tracker server listens on. Since it is only connected to by the tasks, it uses the local interface. EXPERT ONLY. Should only be changed if your host does not have the loopback interface.

mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController TaskController which is used to launch and manage task execution.

mapred.tasktracker.dns.interface default The name of the Network Interface from which a task tracker should report its IP address.

mapred.tasktracker.dns.nameserver default The host name or IP address of the name server (DNS) which a TaskTracker should use to determine the host name used by the JobTracker for communication and display purposes.

mapred.tasktracker.expiry.interval 600000 Expert: The time-interval, in milliseconds, after which a tasktracker is declared 'lost' if it doesn't send heartbeats.

mapred.tasktracker.indexcache.mb 10 The maximum memory that a task tracker allows for the index cache that is used when serving map outputs to reducers.

mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst Expert: The instrumentation class to associate with each TaskTracker.

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum number of map tasks that will be run simultaneously by a task tracker.

mapred.tasktracker.memory_calculator_plugin   Name of the class whose instance will be used to query memory information on the tasktracker. The class must be an instance of org.apache.hadoop.util.MemoryCalculatorPlugin. If the value is null, the tasktracker attempts to use a class appropriate to the platform. Currently, the only platform supported is Linux.

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50) : 1 The maximum number of reduce tasks that will be run simultaneously by a task tracker.

mapred.tasktracker.taskmemorymanager.monitoring-interval 5000 The interval, in milliseconds, for which the tasktracker waits between two cycles of monitoring its tasks' memory usage. Used only if tasks' memory management is enabled via mapred.tasktracker.tasks.maxmemory.

mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 The time, in milliseconds, the tasktracker waits before sending a SIGKILL to a process, after it has been sent a SIGTERM.

mapred.temp.dir $<hadoop.tmp.dir>/mapred/temp A shared directory for temporary files.

mapred.user.jobconf.limit 5242880 The maximum allowed size of the user jobconf. The default is set to 5 MB.

mapred.userlog.limit.kb 0 The maximum size of user-logs of each task in KB. 0 disables the cap.

mapred.userlog.retain.hours 24 The maximum time, in hours, for which the user-logs are to be retained after the job completion.

mapreduce.heartbeat.10 300 Heartbeat in milliseconds for a small cluster (less than or equal to 10 nodes).

mapreduce.heartbeat.100 1000 Heartbeat in milliseconds for a medium cluster (11 - 100 nodes). Scales linearly between 300ms - 1s.

mapreduce.heartbeat.1000 10000 Heartbeat in milliseconds for a large cluster (101 - 1000 nodes). Scales linearly between 1s - 10s.

mapreduce.heartbeat.10000 100000 Heartbeat in milliseconds for a very large cluster (1001 - 10000 nodes). Scales linearly between 10s - 100s.


mapreduce.job.acl-modify-job   Job specific access-control list for 'modifying' the job. It is only used if authorization is enabled in Map/Reduce by setting the configuration property mapred.acls.enabled to true. This specifies the list of users and/or groups who can do modification operations on the job. For specifying a list of users and groups the format to use is "user1,user2 group1,group". If set to '*', it allows all users/groups to modify this job. If set to ' ' (i.e. space), it allows none. This configuration is used to guard all the modifications with respect to this job and takes care of all the following operations: killing this job; killing a task of this job; failing a task of this job; setting the priority of this job. Each of these operations is also protected by the per-queue level ACL "acl-administer-jobs" configured via mapred-queues.xml, so a caller should have the authorization to satisfy either the queue-level ACL or the job-level ACL. Irrespective of this ACL configuration, the job-owner, the user who started the cluster, cluster administrators configured via mapreduce.cluster.administrators, and queue administrators of the queue to which this job is submitted (configured via mapred.queue.queue-name.acl-administer-jobs in mapred-queue-acls.xml) can do all the modification operations on a job. By default, nobody else besides the job-owner, the user who started the cluster, cluster administrators and queue administrators can perform modification operations on a job.

mapreduce.job.acl-view-job   Job specific access-control list for 'viewing' the job. It is only used if authorization is enabled in Map/Reduce by setting the configuration property mapred.acls.enabled to true. This specifies the list of users and/or groups who can view private details about the job. For specifying a list of users and groups the format to use is "user1,user2 group1,group". If set to '*', it allows all users/groups to view this job. If set to ' ' (i.e. space), it allows none. This configuration is used to guard some of the job-views and at present only protects APIs that can return possibly sensitive information of the job-owner, such as: job-level counters; task-level counters; tasks' diagnostic information; task-logs displayed on the TaskTracker web-UI; and job.xml shown by the JobTracker's web-UI. Every other piece of information about jobs is still accessible by any other user, e.g., JobStatus, JobProfile, the list of jobs in the queue, etc. Irrespective of this ACL configuration, the job-owner, the user who started the cluster, cluster administrators configured via mapreduce.cluster.administrators, and queue administrators of the queue to which this job is submitted (configured via mapred.queue.queue-name.acl-administer-jobs in mapred-queue-acls.xml) can do all the view operations on a job. By default, nobody else besides the job-owner, the user who started the cluster, cluster administrators and queue administrators can perform view operations on a job.

mapreduce.job.complete.cancel.delegation.tokens true If false, do not unregister/cancel delegation tokens from renewal, because the same tokens may be used by spawned jobs.

mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limits if set to -1.

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery Recovery Directory.

mapreduce.jobtracker.recovery.job.initialization.maxtime   Maximum time in seconds the JobTracker will wait for initializing jobs before starting recovery. By default it is the same as mapreduce.jobtracker.recovery.maxtime.

mapreduce.jobtracker.recovery.maxtime 480 Maximum time in seconds the JobTracker should stay in recovery mode. The JobTracker recovers a job after talking to all running tasktrackers. On a large cluster, if many jobs are to be recovered, mapreduce.jobtracker.recovery.maxtime should be increased.

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job files. In practice, this should be the directory where users' home directories are located (usually /user).

mapreduce.maprfs.use.checksum true Deprecated; checksums are always used.


mapreduce.maprfs.use.compression true When true, MapReduce uses compression during the Shuffle phase.

mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job is failed. A value of -1 means that there is no limit set.

mapreduce.task.classpath.user.precedence false Set to true if the user wants to set a different classpath.

mapreduce.tasktracker.group   Expert: Group to which the TaskTracker belongs. If LinuxTaskController is configured via mapreduce.tasktracker.taskcontroller, the group owner of the task-controller binary should be the same as this group.

mapreduce.tasktracker.heapbased.memory.management false Expert only: If an admin wants to prevent swapping by not launching too many tasks, use this option. A task's memory usage is based on the max java heap size (-Xmx). By default -Xmx will be computed by the tasktracker based on slots and memory reserved for mapreduce tasks. See mapred.map.child.java.opts/mapred.reduce.child.java.opts.

mapreduce.tasktracker.jvm.idle.time 10000 If a JVM is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds), the tasktracker will kill it.

mapreduce.tasktracker.outofband.heartbeat false Expert: Set this to true to let the tasktracker send an out-of-band heartbeat on task-completion for better latency.

mapreduce.tasktracker.prefetch.maptasks 1.0 How many map tasks should be scheduled in advance on a tasktracker, given as a percentage of map slots. Default is 1.0, which means the number of tasks overscheduled = total map slots on the tasktracker.

mapreduce.tasktracker.reserved.physicalmemory.mb   Maximum physical memory the tasktracker should reserve for mapreduce tasks. If tasks use more than the limit, the task using maximum memory will be killed. Expert only: Set this value iff the tasktracker should use a certain amount of memory for mapreduce tasks. In the MapR distribution, the warden figures this number based on services configured on a node. Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 will disable physical memory accounting and task management.

mapreduce.tasktracker.volume.healthcheck.interval 60000 How often the tasktracker should check for the mapreduce volume at $<mapr.localvolumes.path>/mapred/. Value is in milliseconds.

mapreduce.use.fastreduce false Expert only. The reducer won't be able to tolerate failures.

mapreduce.use.maprfs true If true, then mapreduce uses maprfs to store task related data.

keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.

keep.task.files.pattern .*_m_123456_0 Keep all files from tasks whose task names match the given regular expression. Defaults to none.

tasktracker.http.threads 2 The number of worker threads for the http server. This is used for map output fetching.


mapred-site.xml

The /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml file specifies MapReduce formulas and parameters.

Each parameter in the local configuration file overrides the corresponding parameter in the cluster-wide configuration unless the cluster-wide copy of the parameter includes <final>true</final>. In general, only job-specific parameters should be set in the local copy of mapred-site.xml.
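
For example, a cluster-wide entry like the following sketch (the property was chosen arbitrarily for illustration) cannot be overridden by a local mapred-site.xml because it is marked final:

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>12</value>
  <final>true</final>
</property>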

There are three parts to mapred-site.xml:

JobTracker configuration
TaskTracker configuration
Job configuration

Jobtracker Configuration

These parameters should be changed only by the administrator. When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapred.job.tracker maprfs:/// JobTracker address ip:port, or use the URI maprfs:/// for the default cluster or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster. Replace localhost with one or more IP addresses for the jobtracker.

mapred.jobtracker.port 9001 Port on which JobTracker listens. Read by JobTracker to start RPC Server.

mapreduce.tasktracker.outofband.heartbeat True The task tracker sends an out-of-band heartbeat on task completion to improve latency. Set this value to false to disable this behavior.

webinterface.private.actions False By default, jobs cannot be killed from the job tracker's web interface. Set this value to True to enable this behavior.

MapR recommends properly securing your interfaces before enabling this behavior.

maprfs.openfid2.prefetch.bytes 0 Expert: the number of shuffle bytes to prefetch by the reduce task.

mapr.localoutput.dir output The path for map output files on shuffle volume.

mapr.localspill.dir spill The path for local spill files on shuffle volume.

mapreduce.jobtracker.node.labels.file   The file that specifies the labels to apply to the nodes in the cluster.

mapreduce.jobtracker.node.labels.monitor.interval 120000 Specifies how often to poll the node labels file for changes.

mapred.queue.<queue-name>.label   Specifies a label for the queue named in the <queue-name> placeholder.

mapred.queue.<queue-name>.label.policy   Specifies a policy for the label applied to the queue named in the <queue-name> placeholder. The policy controls the interaction between the queue label and the job label:

PREFER_QUEUE — always use label set on queue
PREFER_JOB — always use label set on job
AND (default) — job label AND node label
OR — job label OR node label
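
As a sketch of how these two parameters combine, the following mapred-site.xml entries (the queue name "highmem" and the label "production" are hypothetical) pin a queue to nodes labeled production, preferring the queue label over any job label:

<property>
  <name>mapred.queue.highmem.label</name>
  <value>production</value>
</property>
<property>
  <name>mapred.queue.highmem.label.policy</name>
  <value>PREFER_QUEUE</value>
</property>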

Jobtracker Directories

When changing any parameters in this section, a JobTracker restart is required.

Volume path = mapred.system.dir/../

Parameter Value Description


mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReducestores control files.

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job statusinformation is persisted in a file system to beavailable after it drops of the memory queueand between jobtracker restarts.

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' jobfiles In practice, this should be the directorywhere users' home directories are located(usually /user)

mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the splitmetainfo file. The JobTracker won't attemptto read split metainfo files bigger than theconfigured value. No limits if set to -1.

mapreduce.maprfs.use.compression True Set this property's value to False to disablethe use of MapR-FS compression for shuffledata by MapReduce.

mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job status to keep inthe cache.

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job history files are stored atthis single well known location. If nothing isspecified, the files are stored at${hadoop.job.history.location}/done in localfilesystem.

hadoop.job.history.location   If job tracker is static the history files arestored in this single well known place onlocal filesystem. If No value is set here, bydefault, it is in the local file system at${hadoop.log.dir}/history. History files aremoved tomapred.jobtracker.history.completed.locationwhich is on MapRFs JobTracker volume.

mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

JobTracker Recovery

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery The recovery directory. Stores the list of known TaskTrackers.

mapreduce.jobtracker.recovery.maxtime 120 The maximum time in seconds that the JobTracker should stay in recovery mode.

mapreduce.jobtracker.split.metainfo.maxsize 10000000 This property's value sets the maximum permissible size of the split metainfo file. The JobTracker does not attempt to read split metainfo files larger than this value.

mapred.jobtracker.restart.recover true Set to "true" to enable job recovery upon restart, or "false" to start afresh.

mapreduce.jobtracker.recovery.job.initialization.maxtime 480 This property's value specifies the maximum time in seconds that the JobTracker waits to initialize jobs before starting recovery. This property's default value is equal to the value of the mapreduce.jobtracker.recovery.maxtime property.

Enable Fair Scheduler


When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapred.fairscheduler.allocation.file conf/pools.xml The allocation file that defines Fair Scheduler pools.

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler The class responsible for task scheduling.

mapred.fairscheduler.assignmultiple true Set to true to allow the Fair Scheduler to assign multiple tasks.

mapred.fairscheduler.eventlog.enabled false Enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/

mapred.fairscheduler.smalljob.schedule.enable True Set this property's value to False to disable fast scheduling for small jobs in the FairScheduler. TaskTrackers can reserve an ephemeral slot for small jobs when the cluster is under load.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition: the maximum number of map tasks allowed in a small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition: the maximum number of reducers allowed in a small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition: the maximum input size in bytes allowed for a small job. Default is 10 GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition: the maximum estimated input size for a reducer allowed in a small job. Default is 1 GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition: the maximum memory in megabytes reserved for an ephemeral slot. Default is 200 MB. This value must be the same on JobTracker and TaskTracker nodes.
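A minimal sketch of the corresponding mapred-site.xml entries for enabling the Fair Scheduler, using the scheduler class and allocation file values shown in the table above:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>conf/pools.xml</value>
</property>

As noted at the top of this section, a JobTracker restart is required for these changes to take effect.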

TaskTracker Configuration

When changing any parameters in this section, a TaskTracker restart is required.

When mapreduce.tasktracker.prefetch.maptasks is greater than 0, you must disable Fair Scheduler preemption and label-based job placement.

Parameter Value Description

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum number of map tasks that will be run simultaneously by a task tracker. For example, on a 4-CPU node the default formula yields 3 map slots.

mapreduce.tasktracker.prefetch.maptasks 1.0 How many map tasks should be scheduled in advance on a TaskTracker, given as a percentage of map slots. The default of 1.0 means the number of overscheduled tasks equals the total map slots on the TaskTracker.

mapreduce.tasktracker.reserved.physicalmemory.mb.low 0.8 This property's value sets the target memory usage level when the TaskTracker kills tasks to reduce total memory usage. This property's value represents a percentage of the amount in the mapreduce.tasktracker.reserved.physicalmemory.mb value.

mapreduce.tasktracker.task.slowlaunch False Set this property's value to True to wait after each task launch for nodes running critical services like the CLDB, JobTracker, and ZooKeeper.

mapreduce.tasktracker.volume.healthcheck.interval 60000 This property's value defines the frequency in milliseconds at which the TaskTracker checks the MapReduce volume defined in the ${mapr.localvolumes.path}/mapred/ property.

mapreduce.use.maprfs True Use MapR-FS for shuffle and sort/merge.

mapred.userlog.retain.hours 24 This property's value specifies the maximum time, in hours, to retain the user-logs after job completion.


mapred.user.jobconf.limit 5242880 The maximum allowed size of the user jobconf. The default is set to 5 MB.

mapred.userlog.limit.kb 0 Deprecated: the maximum size of user-logs of each task, in KB. 0 disables the cap.

mapreduce.use.fastreduce False Expert: Merge map outputs without copying.

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50): 1 The maximum number of reduce tasks that will be run simultaneously by a task tracker.

mapred.tasktracker.ephemeral.tasks.maximum 1 The number of slots reserved for small job scheduling.

mapred.tasktracker.ephemeral.tasks.timeout 10000 The maximum time in ms that a task is allowed to occupy an ephemeral slot.

mapred.tasktracker.ephemeral.tasks.ulimit 4294967296 The ulimit (bytes) on all tasks scheduled on an ephemeral slot.

mapreduce.tasktracker.reserved.physicalmemory.mb   The maximum physical memory the TaskTracker should reserve for MapReduce tasks. If tasks use more than the limit, the task using the most memory will be killed. Expert only: set this value only if the TaskTracker should use a specific amount of memory for MapReduce tasks. In the MapR distribution, the warden figures this number based on the services configured on a node. Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 disables physical memory accounting and task management.

mapred.tasktracker.expiry.interval 600000 Expert: This property's value specifies a time interval in milliseconds. After this interval expires without any heartbeats sent, a TaskTracker is marked lost.

mapreduce.tasktracker.heapbased.memory.management false Expert only: use this option if the admin wants to prevent swapping by not launching too many tasks. A task's memory usage is based on its maximum Java heap size (-Xmx). By default, -Xmx is computed by the TaskTracker based on the slots and memory reserved for MapReduce tasks. See mapred.map.child.java.opts/mapred.reduce.child.java.opts.

mapreduce.tasktracker.jvm.idle.time 10000 If a JVM is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds), the TaskTracker kills it.

mapred.max.tracker.failures 4 The number of task failures on a TaskTracker for a given job after which new tasks of that job aren't assigned to it.

mapred.max.tracker.blacklists 4 The number of blacklistings of a TaskTracker by various jobs after which the TaskTracker could be blacklisted across all jobs. The tracker is given tasks again later (after a day), and becomes a healthy tracker after a restart.

mapred.task.tracker.http.address 0.0.0.0:50060 This property's value specifies the HTTP server address and port for the TaskTracker. Specify 0 as the port to make the server start on a free port.

mapred.task.tracker.report.address 127.0.0.1:0 The IP address and port that the TaskTracker server listens on. Since it is only connected to by the tasks, it uses the local interface. EXPERT ONLY. Only change this value if your host does not have a loopback interface.

mapreduce.tasktracker.group mapr Expert: The group to which the TaskTracker belongs. If LinuxTaskController is configured via the mapreduce.tasktracker.taskcontroller value, the group owner of the task-controller binary $HADOOP_HOME/bin/platform/bin/task-controller must be the same as this group.


mapred.tasktracker.task-controller.config.overwrite True The LinuxTaskController needs a configuration file set at $HADOOP_HOME/conf/taskcontroller.cfg. The configuration file takes the following parameters:

mapred.local.dir = local dir used by the TaskTracker, taken from mapred-site.xml
hadoop.log.dir = hadoop log dir, taken from the system properties of the TaskTracker process
mapreduce.tasktracker.group = groups allowed to run the TaskTracker; see 'mapreduce.tasktracker.group'
min.user.id = don't allow any user below this uid to launch a task
banned.users = users who are not allowed to launch any tasks

If set to true, the TaskTracker always overwrites the config file with the default values: min.user.id = -1 (check disabled), banned.users = bin, mapreduce.tasktracker.group = root. To disable this configuration and use a custom configuration, set this property's value to False and restart the TaskTracker.

mapred.tasktracker.indexcache.mb 10 This property's value specifies the maximum amount of memory allocated by the TaskTracker for the index cache. The index cache is used when the TaskTracker serves map outputs to reducers.

mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst Expert: The instrumentation class to associate with each TaskTracker.

mapred.task.tracker.task-controller org.apache.hadoop.mapred.LinuxTaskController This property's value specifies the TaskController that launches and manages task execution.

mapred.tasktracker.taskmemorymanager.killtask.maxRSS False Set this property's value to True to kill tasks that are using maximum memory when the total number of MapReduce tasks exceeds the limit specified in the TaskTracker's mapreduce.tasktracker.reserved.physicalmemory.mb property. Tasks are killed in most-recently-launched order.

mapred.tasktracker.taskmemorymanager.monitoring-interval 3000 This property's value specifies an interval in milliseconds that the TaskTracker waits between monitoring the memory usage of tasks. This property is only used when task memory management is enabled by setting the mapred.tasktracker.tasks.maxmemory property to True.

mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 This property's value sets the time in milliseconds that the TaskTracker waits before sending a SIGKILL to a process after it has been sent a SIGTERM.

mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp A shared directory for temporary files.

mapreduce.cluster.map.userlog.retain-size -1 This property's value specifies the number of bytes to retain from map task logs. The default value of -1 disables this feature.

mapreduce.cluster.reduce.userlog.retain-size -1 This property's value specifies the number of bytes to retain from reduce task logs. The default value of -1 disables this feature.

mapreduce.heartbeat.10000 100000 This property's value specifies a heartbeat time in milliseconds for a cluster of 1001 to 10000 nodes. Scales linearly between 10s and 100s.

mapreduce.heartbeat.1000 10000 This property's value specifies a heartbeat time in milliseconds for a cluster of 101 to 1000 nodes. Scales linearly between 1s and 10s.

mapreduce.heartbeat.100 1000 This property's value specifies a heartbeat time in milliseconds for a cluster of 11 to 100 nodes. Scales linearly between 300ms and 1s.

mapreduce.heartbeat.10 300 This property's value specifies a heartbeat time in milliseconds for a cluster of 1 to 10 nodes.

mapreduce.job.complete.cancel.delegation.tokens True Set this property's value to False to prevent delegation tokens from being unregistered or cancelled when the job completes.


mapreduce.jobtracker.inline.setup.cleanup False Set this property's value to True to make the JobTracker attempt to set up and clean up the job by itself, instead of doing it in separate setup/cleanup tasks.

Job Configuration

Set these values on the node from which you plan to submit jobs, before submitting the jobs. If you are using the Hadoop examples, you can set these parameters from the command line. Example:

hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m"

When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order (a sketch of an override follows the list):

1. mapred-default.xml
2. The local mapred-site.xml - overrides identical parameters in mapred-default.xml
3. Any settings in the job code itself - overrides identical parameters in mapred-site.xml
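For example, a value set in the local mapred-site.xml overrides the shipped default, and is in turn overridden by any job-level setting. A minimal sketch of such an override (the value 8 is purely illustrative):

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
</property>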

Parameter Value Description

keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.

mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per jvm. If set to -1, there is no limit.

mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel.

mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel.

mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when the value of the mapred.job.tracker property is local.

mapred.job.map.memory.physical.mb   The maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt is FAILED.

mapred.job.reduce.memory.physical.mb   The maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt is FAILED.

mapreduce.task.classpath.user.precedence false Set to true if the user wants to set a different classpath.

mapred.max.maps.per.node -1 The per-node limit on running map tasks for the job. A value of -1 signifies no limit.

mapred.max.reduces.per.node -1 The per-node limit on running reduce tasks for the job. A value of -1 signifies no limit.

mapred.running.map.limit -1 The cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit.

mapred.running.reduce.limit -1 The cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit.

mapreduce.tasktracker.cache.local.numberdirectories 10000 This property's value sets the maximum number of subdirectories to create in a given distributed cache store. Cache items in excess of this limit are expunged whether or not the total size threshold is exceeded.

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log Java opts for the reduce tasks. In MapR, the default heap size (-Xmx) is determined by the memory reserved for MapReduce at the TaskTracker. A reduce task is given more memory than a map task. Default memory for a reduce task = (total memory reserved for MapReduce) * (2 * #reduceslots / (#mapslots + 2 * #reduceslots)).

mapred.reduce.child.ulimit    


io.sort.factor 256 The number of streams to merge simultaneously during file sorting. The value of this property determines the number of open file handles.

io.sort.mb 380 This value sets the size, in megabytes, of the memory buffer that holds map outputs before the final map outputs are written. Lower values for this property increase the chance of spills. Recommended practice is to set this value to 125% of the average map input size, in megabytes (the default of 380 corresponds to an average map input of roughly 304 MB).

io.sort.record.percent 0.17 The percentage of the memory buffer specified by the io.sort.mb property that is dedicated to tracking record boundaries. The maximum number of records that the collection thread can collect before blocking is one-fourth of the multiplied values of io.sort.mb and io.sort.percent.

io.sort.spill.percent 0.99 This property's value sets the soft limit for either the buffer or record collection buffers. Threads that reach the soft limit begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. Do not reduce this value below 0.5.

mapred.reduce.slowstart.completed.maps 0.95 The fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.

mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job fails. A value of -1 means that there is no limit set.

mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy (shuffle) phase.

jobclient.completion.poll.interval 5000 This property's value specifies the JobClient's polling frequency in milliseconds to the JobTracker for updates about job status. Reduce this value for faster tests on single node systems. Adjusting this value on production clusters may result in undesired client-server traffic.

jobclient.output.filter FAILED This property's value specifies the filter that controls the output of the task's userlogs that are sent to the JobClient's console. Legal values are:

NONE
KILLED
FAILED
SUCCEEDED
ALL

jobclient.progress.monitor.poll.interval 1000 This property's value specifies the JobClient's status reporting frequency, in milliseconds, to the console, and how often it checks for job completion.

job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jobId&jobStatus=$jobStatus This property's value specifies the URL to call at job completion to report the job's end status. Only two variables are legal in the URL: $jobId and $jobStatus. When present, these variables are replaced by their respective values.

job.end.retry.attempts 0 This property's value specifies the maximum number of times that Hadoop attempts to contact the notification URL.

job.end.retry.interval 30000 This property's value specifies the interval in milliseconds between attempts to contact the notification URL.

keep.failed.task.files False Set this property's value to True to keep files for failed tasks. Because this storage is not automatically reclaimed by the system, keep files only for jobs that are failing. Setting this property's value to True also keeps the map outputs in the reduce directory as they are consumed, instead of deleting them on consumption.

local.cache.size 10737418240 This property's value specifies the number of bytes allocated to each local TaskTracker directory to store Distributed Cache data.


mapr.centrallog.dir logs This property's value specifies the relative path under a local volume path that points to the central log location: ${mapr.localvolumes.path}/<hostname>/${mapr.centrallog.dir}.

mapr.localvolumes.path /var/mapr/local The path for local volumes.

map.sort.class org.apache.hadoop.util.QuickSort The default sort class for sorting keys.

tasktracker.http.threads 2 The number of worker threads for the HTTP server.

topology.node.switch.mapping.impl org.apache.hadoop.net.ScriptBasedMapping The default implementation of the DNSToSwitchMapping. It invokes a script specified in the topology.script.file.name property to resolve node names. If no value is set for the topology.script.file.name property, the default value of DEFAULT_RACK is returned for all node names.

topology.script.number.args 100 The maximum number of arguments that the script configured with topology.script.file.name runs with. Each argument is an IP address.

mapr.task.diagnostics.enabled False Set this property's value to True to run the MapR diagnostics script before killing an unresponsive task attempt.

mapred.acls.enabled False This property's value specifies whether or not to check ACLs for user authorization during various queue and job level operations. Set this property's value to True to enable access control checks made by the JobTracker and TaskTracker when users request queue and job operations using Map/Reduce APIs, RPCs, the console, or the web user interfaces.

mapred.child.oom_adj 10 This property's value specifies the adjustment to the out-of-memory value for the Linux-specific out-of-memory killer. Legal values are 0-15.

mapred.child.renice 10 This property's value specifies an integer from 0 to 19 for use by the Linux nice utility.

mapred.child.taskset True Set this property's value to False to prevent running the job in a taskset. See the taskset(1) manual page for more information.

mapred.child.tmp ./tmp This property's value sets the location of the temporary directory for map and reduce tasks. Set this value to an absolute path to directly assign the directory. Relative paths are located under the task's working directory. Java tasks execute with the option -Djava.io.tmpdir=<absolute path of the tmp dir>. Pipes and streaming are set with the environment variable TMPDIR=<absolute path of the tmp dir>.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 This property's value specifies the maximum size in megabytes for small jobs. This value is reserved in memory for an ephemeral slot. JobTracker and TaskTracker nodes must set this property to the same value.

mapred.cluster.map.memory.mb -1 This property's value sets the virtual memory size of a single map slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, up to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.cluster.max.map.memory.mb -1 This property's value sets the virtual memory size of a single map task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, up to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.


mapred.cluster.max.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce task launched by the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, up to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.cluster.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce slot in the Map-Reduce framework used by the scheduler. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, up to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature. Set this value to a useful memory size to enable the feature.

mapred.compress.map.output False Set this property's value to True to compress map outputs with SequenceFile compression before sending the outputs over the network.

mapred.fairscheduler.assignmultiple True Set this property's value to False to prevent the FairScheduler from assigning multiple tasks.

mapred.fairscheduler.eventlog.enabled False Set this property's value to True to enable scheduler logging in ${HADOOP_LOG_DIR}/fairscheduler/.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 This property's value specifies the maximum input size, in bytes, that defines a small job.

mapred.fairscheduler.smalljob.max.maps 10 This property's value specifies the maximum number of maps allowed in a small job.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 This property's value specifies the maximum estimated input size, in bytes, for a reducer in a small job.

mapred.fairscheduler.smalljob.max.reducers 10 This property's value specifies the maximum number of reducers allowed in a small job.

mapred.healthChecker.interval 60000 This property's value sets the frequency, in milliseconds, at which the node health script runs.

mapred.healthChecker.script.timeout 600000 This property's value sets the timeout, in milliseconds, after which the node health script is killed for being unresponsive and reported as failed.

mapred.inmem.merge.threshold 1000 When a number of files equal to this property's value accumulate, the in-memory merge triggers and spills to disk. Set this property's value to zero or less to force merges and spills to trigger solely on RAMFS memory consumption.

mapred.job.map.memory.mb -1 This property's value sets the virtual memory size of a single map task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single map task via mapred.cluster.map.memory.mb, up to the limit specified by the value of mapred.cluster.max.map.memory.mb. The default value of -1 disables the feature if the value of the mapred.cluster.map.memory.mb property is also -1. Set this value to a useful memory size to enable the feature.

mapred.job.queue.name default This property's value specifies the queue a job is submitted to. This property's value must match the name of a queue defined in mapred.queue.names for the system. The ACL setup for the queue must allow the current user to submit a job to the queue.

mapred.job.reduce.input.buffer.percent 0 This property's value specifies the percentage of memory relative to the maximum heap size. After the shuffle, remaining map outputs in memory must occupy less memory than this threshold value before reduce begins.


mapred.job.reduce.memory.mb -1 This property's value sets the virtual memory size of a single reduce task for the job. If the scheduler supports this feature, a job can ask for multiple slots for a single reduce task via mapred.cluster.reduce.memory.mb, up to the limit specified by the value of mapred.cluster.max.reduce.memory.mb. The default value of -1 disables the feature if the value of the mapred.cluster.reduce.memory.mb property is also -1. Set this value to a useful memory size to enable the feature.

mapred.job.reuse.jvm.num.tasks -1 This property's value sets the number of tasks to run on each JVM. The default of -1 sets no limit.

mapred.job.shuffle.input.buffer.percent 0.7 This property's value sets the percentage of memory allocated from the maximum heap size to storing map outputs during the shuffle.

mapred.job.shuffle.merge.percent 0.66 This property's value sets a percentage of the total memory allocated to storing map outputs in mapred.job.shuffle.input.buffer.percent. When memory storage for map outputs reaches this percentage, an in-memory merge triggers.

mapred.job.tracker.handler.count 10 This property's value sets the number of server threads for the JobTracker. As a best practice, set this value to approximately 4% of the number of TaskTracker nodes.

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done This property's value sets a location to store completed job history files. When this property has no value specified, completed job files are stored at ${hadoop.job.history.location}/done in the local filesystem.

mapred.job.tracker.http.address 0.0.0.0:50030 This property's value specifies the HTTP server address and port for the JobTracker. Specify 0 as the port to make the server start on a free port.

mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst Expert: The instrumentation class to associate with each JobTracker.

mapred.jobtracker.job.history.block.size 3145728 This property's value sets the block size of the job history file. Dumping job history to disk is important because job recovery uses the job history.

mapred.jobtracker.jobhistory.lru.cache.size 5 This property's value specifies the number of job history files to load in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

mapred.job.tracker maprfs:/// The JobTracker address as ip:port, or a URI: use maprfs:/// for the default cluster, or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster. Use "local" for standalone mode.

mapred.jobtracker.maxtasks.per.job -1 Set this property's value to any positive integer to set the maximum number of tasks for a single job. The default value of -1 indicates that there is no maximum.

mapred.job.tracker.persist.jobstatus.active False Set this property's value to True to enable persistence of job status information.

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo This property's value specifies the directory where job status information persists after dropping out of the memory queue and between JobTracker restarts.

mapred.job.tracker.persist.jobstatus.hours 0 This property's value specifies the job status information persistence time in hours. Persistent job status information is available after the information drops out of the memory queue and between JobTracker restarts. The default value of zero disables job status information persistence.

mapred.jobtracker.port 9001 The IPC port on which the JobTracker listens.

mapred.jobtracker.restart.recover True Set this property's value to False to disable job recovery on restart.

mapred.jobtracker.retiredjobs.cache.size 1000 This property's value specifies the number of retired job statuses kept in the cache.

mapred.jobtracker.retirejob.check 30000 This property's value specifies the frequency interval used by the retire job thread to check for completed jobs.


mapred.line.input.format.linespermap 1 Number of lines per split in NLineInputFormat.

mapred.local.dir.minspacekill 0 This property's value specifies a threshold of free space in the directory specified by the mapred.local.dir property. When free space drops below this threshold, no more tasks are requested until all current tasks finish and clean up. When free space is below this threshold, running tasks are killed in the following order until free space is above the threshold:

Reduce tasks
All other tasks in reverse percent-completed order

mapred.local.dir.minspacestart 0 This property's value specifies a free space threshold for the directory specified by mapred.local.dir. No tasks are requested while free space is below this threshold.

mapred.local.dir /tmp/mapr-hadoop/mapred/local This property's value specifies the local directory where MapReduce stores job, jar, and xml files. MapReduce also creates work directories for tasks under this local directory. The MapR distribution for Hadoop uses a local volume for map outputs.

mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log This property stores Java options for map tasks. When present, the (taskid) symbol is replaced with the current TaskID. As an example, to enable verbose garbage collection logging to a file named for the taskid in /tmp and to set the heap maximum to 1 GB, set this property to the value -Xmx1024m -verbose:gc -Xloggc:/tmp/(taskid).gc. The configuration variable mapred.{map/reduce}.child.ulimit controls the maximum virtual memory of the child processes. In the MapR distribution for Hadoop, the default -Xmx is determined by the memory reserved for MapReduce by the TaskTracker. Reduce tasks use more memory than map tasks. The default memory for a map task follows the formula (total memory reserved for MapReduce) * (#mapslots / (#mapslots + 1.3 * #reduceslots)).

mapred.map.child.log.level INFO This property's value sets the logging level for the map task. The allowed levels are:

OFF
FATAL
ERROR
WARN
INFO
DEBUG
TRACE
ALL

mapred.map.max.attempts 4 Expert: This property's value sets the maximum number of attempts per map task.

mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec Specifies the compression codec to use to compress map outputs, if compression of map outputs is enabled.

mapred.maptask.memory.default 800 The Xmx value a map task attempt JVM gets when map slots are not set.

mapred.map.tasks 2 The default number of map tasks per job. Ignored when the value of the mapred.job.tracker property is local.

mapred.maxthreads.generate.mapoutput 1 Expert: The number of intra-map-task threads to sort and write the map output partitions.

mapred.maxthreads.partition.closer 1 Expert: The number of threads that asynchronously close or flush map output partitions.

mapred.merge.recordsBeforeProgress 10000 The number of records to process during a merge before sending a progress notification to the TaskTracker.

mapred.min.split.size 0 The minimum size chunk that map input should be split into. File formats with minimum split sizes take priority over this setting.


mapred.output.compress False Set this property's value to True to compress job outputs.

mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec When job output compression is enabled, this property's value specifies the compression codec.

mapred.output.compression.type RECORD When job outputs are compressed as SequenceFiles, this property's value specifies how to compress the job outputs. Legal values are:

NONE
RECORD
BLOCK

mapred.queue.default.state RUNNING This property's value defines the state of the default queue, which can be either STOPPED or RUNNING. This value can be changed at runtime.

mapred.queue.names default This property's value specifies a comma-separated list of the queues configured for this JobTracker. Jobs are added to queues, and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention mapred.queue.$QUEUE-NAME.$PROPERTY-NAME. The number of queues configured in this parameter can depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Verify that the scheduler supports multiple queues before adding queues.

mapred.reduce.child.log.level INFO The logging level for the reduce task. The allowed levels are:

OFF
FATAL
ERROR
WARN
INFO
DEBUG
TRACE
ALL

mapred.reduce.copy.backoff 300 This property's value specifies the maximum amount of time in seconds a reducer spends on fetching one map output before declaring the fetch failed.

mapred.reduce.max.attempts 4 Expert: The maximum number of attempts per reduce task.

mapred.reducetask.memory.default 1500 Xmx for reduce task attempt JVM when reduce slots are not set.

mapred.skip.attempts.to.start.skipping 2 This property's value specifies a number of task attempts. After that many task attempts, skip mode is active. While skip mode is active, the task reports to the TaskTracker the range of records which it will process next. With this record range, the TaskTracker is aware of which records are dubious and skips dubious records on further executions.

mapred.skip.map.auto.incr.proc.count True SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS increments after MapRunner invokes the map function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly.

mapred.skip.map.max.skip.records 0 The number of acceptable skip records around the bad record, per bad record in the mapper. The number includes the bad record. The default value of 0 disables detection and skipping of bad records. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to prevent the framework from narrowing down the skipped range.


mapred.skip.reduce.auto.incr.proc.count True SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS increments after MapRunner invokes the reduce function. Set this property's value to False for applications that process records asynchronously or buffer input records. Such applications must increment this counter directly.

mapred.skip.reduce.max.skip.groups 0 The number of acceptable skip records around the bad record, per bad record in the reducer. The number includes the bad record. The default value of 0 disables detection and skipping of bad records. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to prevent the framework from narrowing down the skipped range.

mapred.submit.replication 10 This property's value specifies the replication level for submitted job files. As a best practice, set this value to approximately the square root of the number of nodes.

mapred.task.cache.levels 2 This property's value specifies the maximum level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level.

mapred.task.calculate.resource.usage True Set this property's value to False to prevent the use of the ${mapreduce.tasktracker.resourcecalculatorplugin} parameter.

mapred.task.profile False Set this property's value to True to enable task profiling and the collection of profiler information by the system.

mapred.task.profile.maps 0-2 This property's value sets the ranges of map tasks to profile. This property is ignored when the value of the mapred.task.profile property is set to False.

mapred.task.profile.reduces 0-2 This property's value sets the ranges of reduce tasks to profile. This property is ignored when the value of the mapred.task.profile property is set to False.

mapred.task.timeout 600000 This property's value specifies a time in milliseconds after which a task terminates if the task does not perform any of the following:

reads an input
writes an output
updates its status string

mapred.tasktracker.dns.interface default This property's value specifies the name of the network interface that the TaskTracker reports its IP address from.

mapred.tasktracker.dns.nameserver default This property's value specifies the host name or IP address of the name server (DNS) that the TaskTracker uses to determine the JobTracker's hostname.

Oozie

Parameter Value Description

hadoop.proxyuser.root.hosts * Specifies the hosts that the superuser must connect from in order to act as another user. Specify the hosts as a comma-separated list of IP addresses or hostnames that are running Oozie servers.

hadoop.proxyuser.mapr.groups mapr,staff The mapr superuser can act as any member of the listed groups.

hadoop.proxyuser.root.groups root The superuser can act as any member of the listed groups.
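These proxyuser properties are typically set in core-site.xml. A sketch, with hypothetical hostnames standing in for the nodes running Oozie servers:

<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>oozie1.example.com,oozie2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>root</value>
</property>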


mfs.conf

The /opt/mapr/conf/mfs.conf configuration file specifies the following parameters about the MapR-FS server on each node:

Parameter Value Description

mfs.server.ip 192.168.10.10 IP address of the FileServer

mfs.server.port 5660 Port used for communication with the server

mfs.cache.lru.sizes inode:6:log:6:meta:10:dir:40:small:15 LRU cache configuration

mfs.on.virtual.machine 0 Specifies whether MapR-FS is running on a virtual machine

mfs.io.disk.timeout 60 Timeout, in seconds, after which a disk is considered failed and taken offline. This parameter can be increased to tolerate slow disks.

mfs.max.disks 48 Maximum number of disks supported on a single node.

mfs.subnets.whitelist   A list of subnets that are allowed to make requests to the FileServer service and access data on the cluster.

Example

mfs.server.ip=192.168.10.10
mfs.server.port=5660
mfs.cache.lru.sizes=inode:6:log:6:meta:10:dir:40:small:15
mfs.on.virtual.machine=0
mfs.io.disk.timeout=60
mfs.max.disks=48


taskcontroller.cfg

The /opt/mapr/hadoop/hadoop-<version>/conf/taskcontroller.cfg file specifies TaskTracker configuration parameters. The parameters should be set the same on all TaskTracker nodes. See also Secured TaskTracker.

Parameter Value Description

mapred.local.dir /tmp/mapr-hadoop/mapred/local The local MapReduce directory.

hadoop.log.dir /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs The Hadoop log directory.

mapreduce.tasktracker.group root The group that is allowed to submit jobs.

min.user.id -1 The minimum user ID for submitting jobs:

Set to 0 to disallow root from submitting jobs
Set to 1000 to disallow all superusers from submitting jobs

banned.users (not present by default) Add this parameter with a comma-separated list of usernames to ban certain users from submitting jobs.
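Example

A minimal sketch of a taskcontroller.cfg assembled from the parameters above; the banned.users entries are hypothetical, and min.user.id=1000 disallows all superusers from submitting jobs, as described in the table:

mapred.local.dir=/tmp/mapr-hadoop/mapred/local
hadoop.log.dir=/opt/mapr/hadoop/hadoop-0.20.2/bin/../logs
mapreduce.tasktracker.group=root
min.user.id=1000
banned.users=guest,nobody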


warden.conf

The /opt/mapr/conf/warden.conf file controls parameters related to MapR services and the warden. Most of the parameters are not intended to be edited directly by users. The following table shows the parameters of interest:

Parameter Value Description

service.command.jt.heapsize.percent 10 The percentage of heap space reserved for the JobTracker.

service.command.jt.heapsize.max 5000 The maximum heap space that can be used by the JobTracker.

service.command.jt.heapsize.min 256 The minimum heap space for use by the JobTracker.

service.command.tt.heapsize.percent 2 The percentage of heap space reserved for the TaskTracker.

service.command.tt.heapsize.max 325 The maximum heap space that can be used by the TaskTracker.

service.command.tt.heapsize.min 64 The minimum heap space for use by the TaskTracker.

service.command.hbmaster.heapsize.percent 4 The percentage of heap space reserved for the HBase Master.

service.command.hbmaster.heapsize.max 512 The maximum heap space that can be used by the HBase Master.

service.command.hbmaster.heapsize.min 64 The minimum heap space for use by the HBase Master.

service.command.hbregion.heapsize.percent 25 The percentage of heap space reserved for the HBase Region Server.

service.command.hbregion.heapsize.max 4000 The maximum heap space that can be used by the HBase Region Server.

service.command.hbregion.heapsize.min 1000 The minimum heap space for use by the HBase Region Server.

service.command.cldb.heapsize.percent 8 The percentage of heap space reserved for the CLDB.

service.command.cldb.heapsize.max 4000 The maximum heap space that can be used by the CLDB.

service.command.cldb.heapsize.min 256 The minimum heap space for use by the CLDB.

service.command.mfs.heapsize.percent 20 The percentage of heap space reserved for the MapR-FS FileServer.

service.command.mfs.heapsize.min 512 The minimum heap space for use by the MapR-FS FileServer.

service.command.webserver.heapsize.percent 3 The percentage of heap space reserved for the MapR Control System.

service.command.webserver.heapsize.max 750 The maximum heap space that can be used by the MapR Control System.

service.command.webserver.heapsize.min 512 The minimum heap space for use by the MapR Control System.

service.command.os.heapsize.percent 3 The percentage of heap space reserved for the operating system.

service.command.os.heapsize.max 750 The maximum heap space that can be used by the operating system.

service.command.os.heapsize.min 256 The minimum heap space for use by the operating system.

service.nice.value -10 The nice priority under which all services will run.

zookeeper.servers 10.250.1.61:5181 The list of ZooKeeper servers.

services.retries 3 The number of times the Warden tries to restart a service that fails.

services.retryinterval.time.sec 1800 The number of seconds after which the warden will again attempt several times to start a failed service. The number of attempts after each interval is specified by the services.retries parameter.

cldb.port 7222 The port for communicating with the CLDB.

mfs.port 5660 The port for communicating with the FileServer.

hbmaster.port 60000 The port for communicating with the HBase Master.

hoststats.port 5660 The port for communicating with the HostStats service.

jt.port 9001 The port for communicating with the JobTracker.

kvstore.port 5660 The port for communicating with the Key/Value Store.

Page 673: Quick Start Installation Administration - MapR · Quick Start Installation Administration Development Reference. ... In this section, you can learn about MapR's unique features and

MapR v2.1.1 Documentation, Page 671For the latest documentation visit http://www.mapr.com/doc

Copyright © 2012, MapR Technologies, Inc.

mapr.home.dir /opt/mapr The directory where MapR is installed.

centralconfig.enabled true Specifies whether to enable central configuration.

pullcentralconfig.freq.seconds 300000 The frequency to check for configuration updates, in seconds.

Example

services=webserver:all:cldb;jobtracker:1:cldb;tasktracker:all:jobtracker;nfs:all:cldb;kvstore:all;cldb:all:kvstore;hoststats:all:kvstore
service.command.jt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start jobtracker
service.command.tt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start tasktracker
service.command.hbmaster.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start master
service.command.hbregion.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start regionserver
service.command.cldb.start=/etc/init.d/mapr-cldb start
service.command.kvstore.start=/etc/init.d/mapr-mfs start
service.command.mfs.start=/etc/init.d/mapr-mfs start
service.command.nfs.start=/etc/init.d/mapr-nfsserver start
service.command.hoststats.start=/etc/init.d/mapr-hoststats start
service.command.webserver.start=/opt/mapr/adminuiapp/webserver start
service.command.jt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop jobtracker
service.command.tt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop tasktracker
service.command.hbmaster.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop master
service.command.hbregion.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop regionserver
service.command.cldb.stop=/etc/init.d/mapr-cldb stop
service.command.kvstore.stop=/etc/init.d/mapr-mfs stop
service.command.mfs.stop=/etc/init.d/mapr-mfs stop
service.command.nfs.stop=/etc/init.d/mapr-nfsserver stop
service.command.hoststats.stop=/etc/init.d/mapr-hoststats stop
service.command.webserver.stop=/opt/mapr/adminuiapp/webserver stop
service.command.jt.type=BACKGROUND
service.command.tt.type=BACKGROUND
service.command.hbmaster.type=BACKGROUND
service.command.hbregion.type=BACKGROUND
service.command.cldb.type=BACKGROUND
service.command.kvstore.type=BACKGROUND
service.command.mfs.type=BACKGROUND
service.command.nfs.type=BACKGROUND
service.command.hoststats.type=BACKGROUND
service.command.webserver.type=BACKGROUND
service.command.jt.monitor=org.apache.hadoop.mapred.JobTracker
service.command.tt.monitor=org.apache.hadoop.mapred.TaskTracker
service.command.hbmaster.monitor=org.apache.hadoop.hbase.master.HMaster start
service.command.hbregion.monitor=org.apache.hadoop.hbase.regionserver.HRegionServer start
service.command.cldb.monitor=com.mapr.fs.cldb.CLDB
service.command.kvstore.monitor=server/mfs
service.command.mfs.monitor=server/mfs
service.command.nfs.monitor=server/nfsserver
service.command.jt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status jobtracker
service.command.tt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status tasktracker
service.command.hbmaster.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh status master
service.command.hbregion.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh status regionserver
service.command.cldb.monitorcommand=/etc/init.d/mapr-cldb status
service.command.kvstore.monitorcommand=/etc/init.d/mapr-mfs status
service.command.mfs.monitorcommand=/etc/init.d/mapr-mfs status
service.command.nfs.monitorcommand=/etc/init.d/mapr-nfsserver status
service.command.hoststats.monitorcommand=/etc/init.d/mapr-hoststats status
service.command.webserver.monitorcommand=/opt/mapr/adminuiapp/webserver status
service.command.jt.heapsize.percent=10
service.command.jt.heapsize.max=5000
service.command.jt.heapsize.min=256
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64


service.command.hbmaster.heapsize.percent=4
service.command.hbmaster.heapsize.max=512
service.command.hbmaster.heapsize.min=64
service.command.hbregion.heapsize.percent=25
service.command.hbregion.heapsize.max=4000
service.command.hbregion.heapsize.min=1000
service.command.cldb.heapsize.percent=8
service.command.cldb.heapsize.max=4000
service.command.cldb.heapsize.min=256
service.command.mfs.heapsize.percent=20
service.command.mfs.heapsize.min=512
service.command.webserver.heapsize.percent=3
service.command.webserver.heapsize.max=750
service.command.webserver.heapsize.min=512
service.command.os.heapsize.percent=3
service.command.os.heapsize.max=750
service.command.os.heapsize.min=256
service.nice.value=-10
zookeeper.servers=10.250.1.61:5181
nodes.mincount=1
services.retries=3
cldb.port=7222
mfs.port=5660
hbmaster.port=60000
hoststats.port=5660
jt.port=9001


kvstore.port=5660
mapr.home.dir=/opt/mapr


zoo.cfg

The /opt/mapr/zookeeper/zookeeper-3.3.2/conf/zoo.cfg file specifies ZooKeeper configuration parameters.

Example

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=20
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=10
# the directory where the snapshot is stored.
dataDir=/var/mapr-zookeeper-data
# the port at which the clients will connect
clientPort=5181
# max number of client connections
maxClientCnxns=100


Ports Used by MapR

The table below defines the ports used by a MapR cluster, along with the default port numbers.

Service Port

CLDB 7222

CLDB JMX monitor port 7220

CLDB web port 7221

HBase Master 60000

Hive Metastore 9083

JobTracker 9001

JobTracker web 50030

LDAP 389

LDAPS 636

MFS server 5660

NFS 2049

NFS monitor (for HA) 9997

NFS management 9998

NFS VIP service 9997 and 9998

Oozie 11000

Port mapper 111

SMTP 25

SSH 22

TaskTracker web 50060

Web UI HTTPS 8443

Web UI HTTP 8080

ZooKeeper 5181

ZooKeeper follower-to-leader communication 2888

ZooKeeper leader election 3888
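To verify that a service is listening on its expected port, you can use a standard tool such as netstat on the node. For example, to check the default CLDB port from the table:

netstat -an | grep 7222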



Best Practices

File Balancing

MapR distributes volumes to balance files across the cluster. Each volume has a name container that is restricted to one storage pool. The greater the number of volumes, the more evenly MapR can distribute files. For best results, the number of volumes should be greater than the total number of storage pools in the cluster. To accommodate a very large number of files, you can use disksetup with the -W option when installing or re-formatting nodes, to create storage pools larger than the default of three disks each.

Disk Setup

It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O, you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each.
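For example, a hypothetical disksetup invocation that formats the disks listed in /tmp/disks.txt into storage pools of up to 6 disks each (both the disk list file and the stripe width are illustrative):

/opt/mapr/server/disksetup -W 6 -F /tmp/disks.txt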

Setting Up MapR NFS

MapR uses version 3 of the NFS protocol. NFS version 4 bypasses the port mapper and attempts to connect to the default port only. If you are running NFS on a non-standard port, mounts from NFS version 4 clients time out. Use the -o nfsvers=3 option to specify NFS version 3.
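For example, a hypothetical mount command from an NFS client (the node name and mount point are placeholders):

mount -o nfsvers=3 <cluster-node>:/mapr /mapr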

NIC Configuration

For high performance clusters, use more than one network interface card (NIC) per node. MapR can detect multiple IP addresses on each node and load-balance throughput automatically.

Isolating CLDB Nodes

In a large cluster (100 nodes or more), create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
   - PREPARE the node, making sure it meets the requirements.
   - ADD the MapR Repository.

2. CREATE a roles file for the node that lists only the following packages:
   mapr-cldb
   mapr-webserver
   mapr-core
   mapr-fileserver

3. INSTALL the services to your node.
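
For example, a hedged sketch of installing these services on an Ubuntu node; substitute yum for apt-get on Red Hat or CentOS:

$ sudo apt-get install mapr-cldb mapr-webserver mapr-core mapr-fileserver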

To set up a volume topology that restricts the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:

   maprcli node move -serverids <CLDB nodes> -topology /cldbonly

2. Restrict the CLDB volume to the CLDB-only topology, using the MapR Control System or the following command:

   maprcli volume move -name mapr.cldb.internal -topology /cldbonly

3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly, using the MapR Control System or the following command:

   maprcli volume modify -name mapr.cldb.internal -replication <replication factor>

4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value, using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:



1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:

   maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack

2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:

   maprcli volume move -name <volume> -topology /defaultRack

All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.
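
A hedged sketch of one way to do this, assuming the cldb.default.volume.topology configuration parameter (verify the parameter name against your MapR version):

$ maprcli config save -values '{"cldb.default.volume.topology":"/defaultRack"}'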

Isolating ZooKeeper Nodes

For large clusters (100 nodes or more), isolate the ZooKeeper on nodes that do not perform any other function. Isolating the ZooKeeper node enables the node to perform its functions without competing for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages.

Do not install the FileServer package on an isolated ZooKeeper node, in order to prevent MapR from using this node for data storage.

To set up a ZooKeeper-only node:

1. SET UP the node as usual:
   - PREPARE the node, making sure it meets the requirements.
   - ADD the MapR Repository.

2. INSTALL the following packages to the node:
   mapr-zookeeper
   mapr-zk-internal
   mapr-core
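
For example, on an Ubuntu node (a hedged sketch; substitute yum on Red Hat or CentOS):

$ sudo apt-get install mapr-zookeeper mapr-zk-internal mapr-core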

Setting Up RAID on the Operating System Partition

You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID 0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites:

- CentOS
- Red Hat
- Ubuntu

Tuning MapReduce

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate percent, max, and min parameters in the warden.conf file:

...
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
...

The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services, unless you see specific memory-related problems occurring.

MapReduce Memory

The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If necessary, you can use the mapreduce.tasktracker.reserved.physicalmemory.mb parameter to set the maximum physical memory reserved by MapReduce tasks, or you can set it to -1 to disable physical memory accounting and task management.

If the node runs out of memory, MapReduce tasks are killed by the OOM-killer to free memory. You can use mapred.child.oom_adj (copy it from mapred-default.xml) to adjust the oom_adj parameter for MapReduce tasks. The possible values of oom_adj range from -17 to +15.



The higher the score, the more likely the associated process is to be killed by the OOM-killer.
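
For illustration, a hedged mapred-site.xml sketch; the values are examples, not recommendations:

<!-- cap physical memory reserved for MapReduce tasks at 4 GB -->
<property>
  <name>mapreduce.tasktracker.reserved.physicalmemory.mb</name>
  <value>4096</value>
</property>
<!-- a higher oom_adj makes task processes more likely to be killed under memory pressure -->
<property>
  <name>mapred.child.oom_adj</name>
  <value>10</value>
</property>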

Troubleshooting Out-of-Memory Errors

When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems:

1. If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly.

2. If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side MapReduce configuration (see the sketch after this list).

3. If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes.
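
For item 2 above, a hedged sketch of raising the task JVM heap in the client-side mapred-site.xml; the 2 GB value is illustrative only:

<!-- give each map/reduce task JVM a 2 GB heap -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>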

To reduce the number of slots on a node:

1. Stop the TaskTracker service on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker stop

2. Edit the file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml (a sketch of the relevant properties follows these steps):
   - Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum.
   - Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum.

3. Start the TaskTracker on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker start
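
For step 2, a hedged mapred-site.xml sketch; the slot counts shown are examples, and you should choose values that fit the node's memory and workload:

<!-- lower the per-node map and reduce slot counts -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>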

ExpressLane

MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

Parameter Value Description

mapred.fairscheduler.smalljob.schedule.enable true Enable small job fast scheduling inside the fair scheduler. TaskTrackers should reserve a slot, called an ephemeral slot, that is used for small jobs when the cluster is busy.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10 GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in a small job. Default is 1 GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Max memory in MB reserved for an ephemeral slot. Default is 200 MB. This value must be the same on JobTracker and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.
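
For example, a hedged mapred-site.xml fragment that turns ExpressLane on while leaving the other parameters at the defaults from the table above:

<!-- enable small-job fast scheduling inside the fair scheduler -->
<property>
  <name>mapred.fairscheduler.smalljob.schedule.enable</name>
  <value>true</value>
</property>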

HBase

The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at /hbase). Example:

hadoop mfs -setcompression off /hbase

You can check whether compression is turned off in a directory or mounted volume by using hadoop mfs to list the file contents. Example:


hadoop mfs -ls /hbase

The letter Z in the output indicates compression is turned on; the letter U indicates compression is turned off. See hadoop mfs for more information.

On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the /opt/mapr/conf/warden.conf file. See Tuning Your MapR Install for more information.


Glossary

Term Definition

.dfs_attributes A special file in every directory, for controlling the compression and chunk size used for the directory and its subdirectories.

.rw A special mount point in the root-level volume (or read-only mirror) that points to the writable original copy of the volume.

.snapshot A special directory in the top level of each volume, containing all the snapshots for that volume.

access control list A list of permissions attached to an object. An access control list (ACL) specifies users or system processes that can perform specific actions on an object.

accounting entity A clearly defined economic unit that is accounted for separately.

ACL See access control list.

advisory quota An advisory disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the advisory quota, an alert is sent.

AE See accounting entity.

bitmask A binary number in which each bit controls a single toggle.

chunk Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use the chunk size specified by the settings for the directory where the file is written.

CLDB See container location database.

container The unit of sharded storage in a MapR cluster. Every container is either a name container or a data container.

container location database A service, running on one or more MapR nodes, that maintains the locations of services, containers, and other cluster information.

data container One of the two types of containers in a cluster. Data containers typically have a cascaded configuration (master replicates to replica1, replica1 replicates to replica2, and so on). Every data container is either a master container, an intermediate container, or a tail container, depending on its replication role.

desired replication factor The number of copies of a volume that should be maintained by the MapR cluster for normal operation. When the number of copies falls below the desired replication factor, but remains equal to or above the minimum replication factor, re-replication occurs after the timeout specified in the cldb.fs.mark.rereplicate.sec parameter.

disk space balancer A tool that balances disk space usage on a cluster by moving containers between storage pools. Whenever a storage pool is over 70% full (or over a threshold defined by the cldb.balancer.disk.threshold.percentage parameter), the disk space balancer distributes containers to other storage pools that have lower utilization than the average for that cluster. The disk space balancer aims to ensure that the percentage of space used on all of the disks in the node is similar.

disktab A file on each node, containing a list of the node's disks that have been configured for use by MapR-FS.

dump file A file containing data from a volume for distribution or restoration. There are two types of dump files: full dump files containing all data in a volume, and incremental dump files that contain changes to a volume between two points in time.

entity A user or group. Users and groups can represent accounting entities.

full dump file See dump file.

epoch A sequence number that identifies all copies that have the latest updates for a container. The larger the number, the more up-to-date the copy of the container. The CLDB uses the epoch to ensure that an out-of-date copy cannot become the master for the container.

HBase A distributed storage system, designed to scale to a very large size, for managing massive amounts of structured data.

heartbeat A signal sent by each FileServer and NFS node every second to provide information to the CLDB about the node's health and resource usage.

incremental dump file See dump file.

JobTracker The process responsible for submitting and tracking MapReduce jobs. The JobTracker sends individual tasks to TaskTrackers on nodes in the cluster.


MapR-FS The NFS-mountable, distributed, high-performance MapR data storage system.

minimum replication factor The minimum number of copies of a volume that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, re-replication occurs as aggressively as possible to restore the replication level. If any containers in the CLDB volume fall below the minimum replication factor, writes are disabled until aggressive re-replication restores the minimum level of replication.

mirror A read-only physical copy of a volume.

name container A container that holds a volume's namespace information and file chunk locations, and the first 64 KB of each file in the volume.

Network File System A protocol that allows a user on a client computer to access files over a network as though they were stored locally.

NFS See Network File System.

node An individual server (physical or virtual machine) in a cluster.

quota A disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the quota, no more data can be written.

recovery point objective The maximum allowable data loss, expressed as a point in time. If the recovery point objective is 2 hours, then the maximum allowable amount of data loss is 2 hours of work.

recovery time objective The maximum allowable time to recovery after data loss. If the recovery time objective is 5 hours, then it must be possible to restore data up to the recovery point objective within 5 hours. See also recovery point objective.

replication factor The number of copies of a volume.

replication role The replication role of a container determines how that container is replicated to other storage pools in the cluster. A name container may have one of two replication roles: master or replica. A data container may have one of three replication roles: master, intermediate, or tail.

replication role balancer A tool that switches the replication roles of containers to ensure that every node has an equal share of master and replica containers (for name containers) and an equal share of master, intermediate, and tail containers (for data containers).

re-replication Re-replication occurs whenever the number of available replica containers drops below the number prescribed by that volume's replication factor. Re-replication may occur for a variety of reasons, including replica container corruption, node unavailability, hard disk failure, or an increase in replication factor.

RPO See recovery point objective.

RTO See recovery time objective.

schedule A group of rules that specify recurring points in time at which certain actions are determined to occur.

snapshot A read-only logical image of a volume at a specific point in time.

storage pool A unit of storage made up of one or more disks. By default, MapR storage pools contain two or three disks. For high-volume reads and writes, you can create larger storage pools when initially formatting storage during cluster creation.

stripe width The number of disks in a storage pool.

super group The group that has administrative access to the MapR cluster.

super user The user that has administrative access to the MapR cluster.

TaskTracker The process that starts and tracks MapReduce tasks on a node. The TaskTracker receives task assignments from the JobTracker and reports the results of each task back to the JobTracker on completion.

volume A tree of files, directories, and other volumes, grouped for the purpose of applying a policy or set of policies to all of them at once.

warden A MapR process that coordinates the starting and stopping of configured services on a node.

ZooKeeper A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

