GridServer Administration Guide

Version 4.2

The GridServer Administration Series

Proprietary and Confidential

Confidentiality and Disclaimer

Neither this document nor any of its contents may be used or disclosed without the express written consent of DataSynapse. This document does not carry any right of publication or disclosure to any other party.

While the information provided herein is believed to be accurate and reliable, DataSynapse makes no representations or warranties, express or implied, as to the accuracy or completeness of such information. Only those representations and warranties contained in a definitive license agreement shall have any legal effect. In furnishing this document, DataSynapse reserves the right to amend or replace it at any time and undertakes no obligation to provide the recipient with access to any additional information. Nothing contained within this document is or should be relied upon as a promise or representation as to the future.

This product includes software developed by the Apache Software Foundation (www.apache.org/).

This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (www.openssl.org/).

This product includes code licensed from RSA Data Security (java.sun.com/products/jsse/LICENSE.html).

DataSynapse GridServer Administration Guide Version 4.2 Copyright © 2006 DataSynapse, Inc. All Rights Reserved.

GridServer® is a registered trademark, DataSynapse, FabricServer™, the DataSynapse logo, LiveCluster™, and GridClient™ are trademarks, and GRIDesign is a servicemark of DataSynapse, Inc.

Protected by U.S. Patent No. 6,757,730. Other patents pending.

WebSphere® is a registered trademark and CloudScape™ is a trademark of International Business Machines Corporation in the United States, other countries, or both. All other product names are trademarks or registered trademarks of their respective companies.

DataSynapse, Inc. 632 Broadway, 5th Floor; New York, NY 10012 Tel: 212.842.8842 Fax: 212.842.8843 Email: [email protected] Web: www.datasynapse.com

For technical support issues and product updates, please visit customer.datasynapse.com.

We appreciate any comments or suggestions you may have about this manual or other DataSynapse documentation. Please send your feedback to [email protected].

2212006

Contents

Confidentiality and Disclaimer
Contents
Chapter 1 - Introduction
    Before you begin
    GridServer 4.2 Documentation Roadmap
        GridServer Guides
        Other Documentation and Help
    Document Conventions
Chapter 2 - Work
    Introduction
    Services
        Clients
        Service Implementations
        Service Session
        Service benefits
    Jobs
        Job Benefits
    Binary-level Integration
Chapter 3 - Engine Balancing and Client Routing
    Introduction
    Client Routing
        Allowed Brokers Set
        Client Properties Rules
        Driver API
    Engine Routing and Balancing
        Engine Weight-Based Balancer
        Home/Shared Balancer
        Engine Balancer Configuration
    Failover Brokers
    Engine Upper and Lower Bounds
    Example Use Cases
        N+1 Failover with Weighting
        Engine Localization with Sharing
Chapter 4 - Grid Fault-Tolerance and Failover
    Introduction
    The Fault-tolerant GridServer Deployment
    Heartbeats and Failure Detection
    Manager Stability Features
    Engine Failure
    Driver Failure
    Director Failure
    Broker Failure
    Failover Brokers
    Fault-Tolerant Tasks
    Batch Fault-Tolerance
    GridCache Fault-Tolerance
        Client
        Broker Restart
        Failover
Chapter 5 - Scheduling
    Introduction
    Reschedules and Retries
        Retry
        Reschedule
        Timeout Behavior
    The Scheduler
        Scheduler Overview
        Service Priority
        Usage Algorithm
        Time Algorithm
        Serial Priority Algorithm
    Urgent Priority Services and Preemption
    Engine Blacklisting
    Conditions
    Redundant Task Rescheduling
Chapter 6 - The GridServer Administration Tool
    Introduction
    Getting Started
    User Accounts and Access Levels
        Creating User Accounts
        Features Available by Access Level
        User Account Security
    Navigating the Administration Tool
        The Home Page
        Tabs
        Shortcut buttons
        Action Controls
        Links on other pages
    Using Tables
        Pager control
        Search control
        Personalize Table
        Refresh
        Broker and Director Monitors
        Manager Component Indicator
        Status Display
Chapter 7 - Application Resource Deployment
    Introduction
    Grid Libraries
        Grid Library Format
        Using Grid Libraries from a Service
        Deployment
        Grid Library Manager
        C++ Bridges
        JREs
        Grid Library Example
    Legacy Resource Deployment
        Using Default Resources
        Default Resource Paths
        C++ Bridges
        Grid Library features not supported by Default Resources
        Code Versioning Deprecation
    Resource Deployment: Distributing Grid Libraries and Default Resources
        The Resource Deployment Interface
        Resource Deployment File Locations
        Configuring Directory Replication
        Using Engines with Shared Network Directories
        JAR Ordering File
    Remote Application Installation
    Service Run-As
        Types of Credentials
        Using Run-As
Chapter 8 - The Batch Scheduling Facility
    Introduction
    Terminology
    Editing Batch Definitions
    Batch Components
    Service Runners
    Scheduling Batch Definitions
    The Batch Schedule Page
    Running Batches
    Deploying Batch Resources
    Batch Fault-Tolerance
    Using PDriver in a Batch
Chapter 9 - Configuring Security
    Introduction
    Authentication
        Operating System Users
        Grid Users
        GridServer Built-In Authentication
        Extensible Authentication Hooks
        Enabling Client Authentication
    SSL
        Communication Overview
        Certificate Overview
        Keypair and Cert Location
        Types of Connections Using SSL
        Enabling HTTPS on the Application Server
        Enabling HTTPS on all Components
        Driver SSL
        Engines and Engine Daemon SSL
        Brokers and Director SSL
        Resources over HTTPS
        Disabling HTTP
    Resource Protection
Chapter 10 - GridServer Performance and Tuning
    Diagnosing Performance Problems
    Tuning Data Movement
        Stateful Processing
        Compression
        Packing
        Direct Data Transfer
        Shared Directories and DDT
        Caching
        Data References
        Tasks Per Message
        Invocations Per Message
    Tuning for Large Grids
Chapter 11 - Diagnosing GridServer Issues
    Troubleshooting
    Obtaining Log Files
        Manager Logs
        Engine and Daemon Logs
        Driver Logs
        Application Server Logs
Chapter 12 - Administration Howto
    Backup / Restore
        Backup Procedure
        Restore Procedure
    Manager Configuration
        Applying a patch or service pack to GridServer
        Importing and Exporting Manager Configuration
        Installing Manager Licenses
        Setting the SMTP host
        Setting Up a Failover Broker
        Configuring SNMP
        Enabling Enhanced Task Instrumentation
    Engine Management
        Deploying Files to Engines
        Updating the Windows Engine JRE
        Updating the Unix Engine JRE
        Setting the Director Used by Engines
    Running Services
        Running MPI Jobs using PDriver
        Registering a Service Type
        Creating and Running a Batch
        Creating a native stack trace in Linux
        Attaching GDB to Engine native code on Linux
        Logging messages from a Native service to the Engine log
        Running a .NET Driver from an Engine Service
    Configuration Issues
        Installation on Dual-Interface Machines
        Configuring the timeout period for the Administration Tool
        Reconfiguring Managers when Installing a secondary Director
        Using UNC paths in a driver.properties file
Chapter 13 - Database Administration
    Introduction
    Database Types
        The Reporting Database
        The Internal Database
    Internal Database Backup
Appendix A - The grid-library.dtd
    Introduction
Appendix B - Reporting Database Tables
    Introduction
    Batches
    Brokers
    Broker_stats
    Driver_events
    Driver_profiles
    Driver_users
    Engine_events
    Engine_info
    Engine_stats
    Event_codes
    Job_status_codes
    Jobs
    Job_discriminators
    Properties
    Tasks
    Task_status_codes
    Users
    User_events
Index




Chapter 1

Introduction

This guide is a reference for the administrator who maintains GridServer installations. It includes advanced information on how GridServer works, including scheduling, routing, failover, and file deployment, plus a tour of the GridServer Administration Tool. It also provides how-to information for frequent tasks and advanced information on security, tuning, database administration, and log files.

Before you begin

This guide assumes that you already have a GridServer Manager running and know the hostname, username, and password. If this isn’t true, see the GridServer Installation Guide or contact the administrator responsible for the installation.

GridServer 4.2 Documentation Roadmap

The following documentation is available for GridServer 4.2:

GridServer Guides

Four guides and four tutorials are included with GridServer in Adobe Acrobat (PDF) format. They are also available in print format. To view the guides, log in to the Administration tool, select the Admin tab, go to the Documentation page, and select a guide. A search engine is also available on this page for you to search all of the documentation for a phrase or keywords. The PDF files can also be found on the Manager at livecluster/admin/docs. The following guides are available:

Introducing the GridServer Platform Series:

Introducing the GridServer Platform Contains an introduction to GridServer, including definitions of key concepts and terms, such as work, Engines, Directors, and Brokers. This should be read first if you are new to GridServer.

The GridServer Administration Series:

GridServer Administration Guide Covers the operation of a GridServer installation as relevant to a system administrator. It includes basic theory on scheduling, fault-tolerance, failover, and other concepts, plus howto information, and performance and tuning information.

GridServer Installation Guide Covers installation of GridServer for Windows and Unix, including Managers, Engines, and pre-installation planning.



Other Documentation and Help

In addition to the GridServer guides, you can also find help and information from the following sources:

GridServer Administration Tool Help Context-sensitive help is available throughout the GridServer Administration Tool by clicking the help icon located on any page. This provides reference help, plus how-to topics.

API Reference Reference information for the GridServer API is provided in the GridServer SDK in the docs directory. The Java API information is in JavaDoc format, while C++ documentation is presented in HTML, and .NET API help is in HTMLHelp. You can also view and search them from the GridServer Administration Tool; log in to the Administration Tool, click the Admin tab, and select the Documentation link.

Knowledge Base A searchable archive of known issues and support articles is available online. To access the DataSynapse Knowledge Base, go to the DataSynapse customer extranet site at customer.datasynapse.com and log in. You can also use this site to file an issue report, download product updates and licenses, and view documentation.

The GridServer Developer Series:

GridServer Developer’s Guide Contains information on how to develop applications for GridServer, including information on Service Domains, using Services, PDriver (the Batch-oriented GridServer Client), the theory behind development with the GridServer Tasklet API and concepts needed to write and adapt applications.

GridServer Object-Oriented Integration Tutorial

Tutorial on developing applications for GridServer using the object-oriented Tasklet API in Java or C++.

GridServer Service-Oriented Integration Tutorial

Tutorial on developing applications for GridServer using Services, such as Java, .NET, native, or binary executable Services.

GridServer PDriver Tutorial Tutorial on using PDriver, the Parametric Service Driver, to create and run Services with GridServer.

GridServer COM Tutorial Tutorial explaining how client applications in Windows can use COMDriver, GridServer’s COM API, to work with services on GridServer.



Document Conventions

Convention | Explanation | Example

italics | Book titles | The GridServer Developer’s Guide describes this API in detail.

“Text in quotation marks” | References to chapter or section titles | See “Preliminaries.”

bold text | Emphasizes key terminology | Client applications (Drivers) submit work to a central Manager.

bold text | Interface labels or options | Enter your URL in the Address box and click Next.

Courier New | User input, directories, file names, file contents, and program scripts | Run the script in the /opt/datasynapse directory.

Blue text | Hypertext link. Click to jump to the specified page or document. | See the GridServer Developer’s Guide for details.

[GS Manager Root] | The directory where GridServer is installed, such as c:\datasynapse or /opt/datasynapse. | The Driver packages are located in [GS Manager Root]/webapps/livecluster/WEB-INF/driverInstall




Chapter 2

Work

Introduction

GridServer supports a Services model for dividing and processing work. This method takes a large data-intensive or compute-intensive problem and logically breaks it down into units of work that can run independently and combine for a final result. GridServer receives the work unit requests and services them in parallel. Additionally, high-throughput applications or services can be distributed to a Grid. Then, many similar requests for that service can be fulfilled as they arrive. Each request for service is independent, may be stateful, and generally arrives unpredictably at different points in time.

Services also provide a language-independent interface to the GridServer platform. As an alternative, the language-specific Job API can be used to leverage existing Java or C++ development resources. Both models are described below.

Services

The Service-Oriented method of defining work in GridServer is a standards-based model. It uses a thin client model, which promotes easy integration of an existing implementation. It also promotes language interoperability, as clients written in different languages can invoke methods in Service Implementations written in the same or other languages.

There are two components used with the Service-Oriented method: Clients and Service Implementations. Both are described below.

Clients

A client or client application is the implementation that is used to create a Service Session. The client invokes methods that have been distributed on Engines.

You can create a Service client in different ways:

• A client-side API in Java, COM, C++, or .NET.
• A service proxy of Java or .NET client stubs generated by GridServer.
• A Web Service client using SOAP, a lightweight protocol used for exchanging messages with decentralized components.

Service Implementations

Service Implementations are deployed to Engines and process requests from clients. They process data and return results to the client. Service Implementations are registered on a GridServer Manager as a Service Type, which is virtualized on its Engines. When a client makes a request, it sends the request to a Manager instead of directly requesting an Engine to do the work. This one-to-many relationship provides fault tolerance and scalability for Services.

FIGURE 2-1: The relationship between Service Clients and Service Implementations.

Service Implementations can be constructed with any of the following:

• Arbitrary Java classes
• Arbitrary .NET classes
• A Dynamic Library (.so, .DLL) with methods that conform to a simple input-output string interface
• A command, such as a script or binary executable

Integration as a Service in most cases requires minimal changes to the client application.

Service Session

A running Service is referred to as a Service Session. This includes the Service Client, Service Implementation, and Service state on all components. When a client has created a Service and the Service Implementation is running on Engines, this is collectively called the Service Session.

Service benefits

There are many advantages to Services:

Cross-language Client and Service can be in different languages

Dynamic Method names can be determined dynamically, or use generated proxies for type safety

Flexible Use synchronous or asynchronous invocation patterns; can use client proxies generated by GridServer

Virtual Client-Engine correspondence is not one-to-one; Service requests are adaptively load balanced

Stateful Despite being virtual, stateful Services can be handled

Standards Standards-compliant

For more information on Services, see Chapter 3, “Creating Services” on page 23 and Chapter 4, “Accessing Services” on page 33 of the GridServer Developer’s Guide.

Jobs

The Object-Oriented method of defining work in GridServer utilizes easy-to-use C++ and Java APIs to create a rich, empowered client. Using this API, a programmer defines a “Job” as a collection of Tasks, with each Task defined as an atomic sub-partition of the overall workload that is run in its entirety on an Engine.

The client code submits work and administrative commands and retrieves computational results and status information through a simple API.

FIGURE 2-2: Tasks within a Job.


Using the API, you design a Tasklet, which contains the Engine-side code for each Task, and marker interfaces called TaskInput and TaskOutput.

Job Benefits

The Job-Task model differs from the Service model in ways that may be an advantage, depending on your development scenario. Its API is easy to adopt if you are designing new applications in Java or C++, and it lets you leverage existing trained programming resources.

For more information on the Job API, see Chapter 5, “The Tasklet API” on page 45 of the GridServer Developer’s Guide.

Binary-level Integration

Another native Driver, PDriver, enables you to execute command-line programs as a parallel processing Job without using the API.

PDriver, or the Parametric Job Driver, is a Driver that can execute existing command-line programs as a parallel processing service using the GridServer environment, taking full advantage of the parallelism and fault tolerance of GridServer.

PDriver achieves parallelism by running the same program on Engines several times with different parameters. A script is used to define how these parameters change. For example, a distributed search mechanism using the grep command could conduct a brute-force search of a network-attached file system, with each task in the Service being given a different directory or piece of the file system to search.

PDriver uses its scripting language, called PDS, to define jobs. These scripts can also be used to set options for a PDriver Service, such as remote logging and exit code checking.
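The parametric idea behind the grep example can be illustrated outside of PDriver. The following Python sketch is not PDS and does not call any GridServer API; the function name build_grep_tasks and the directory names are hypothetical. It only shows how one command template expands into independent tasks, one per directory, which is the property that lets each task run on a different Engine.

```python
import shlex


def build_grep_tasks(pattern, directories):
    """Return one grep command line per directory (one task each)."""
    return [
        ["grep", "-r", "-l", pattern, directory]
        for directory in directories
    ]


# Hypothetical pieces of a network-attached file system.
tasks = build_grep_tasks("ERROR", ["/data/a", "/data/b", "/data/c"])
for argv in tasks:
    # Each command is independent of the others, so the tasks can run in parallel.
    print(shlex.join(argv))
```

In an actual PDriver Service the parameter changes would be described in a PDS script rather than in Python.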

For more information on the PDriver, see Chapter 6, “PDriver” on page 49 of the GridServer Developer’s Guide.

FIGURE 2-3: Workflow between a Job and an Engine.




Chapter 3

Engine Balancing and Client Routing

Introduction

This chapter covers the various mechanisms used by GridServer Directors to route Engines and Clients to Brokers, and reallocation of Engines based on the changing state of the grid.

Client Routing

The following sections describe methods of routing Clients to Brokers, of which one or more can be used together. However, in most scenarios Clients are associated with a specific Broker, and usually a Failover Broker for fault tolerance.

Allowed Brokers Set

The easiest and most common method of routing clients is to use the Driver Profile’s allowedBrokers property to perform direct routing to a set of Brokers. This is configured using the Driver Profile page on the Driver tab in the GridServer Administration Tool. The profile must be associated with the username of the client using the User Admin page on the Admin tab.

Client Properties Rules

Clients can also be routed to Brokers using rules based on client properties. For centralized management at the Director, user-defined properties are created using the Driver Property List, and are set on a Driver Profile using the Driver Profile page. Additionally, client properties can be set using the DriverManager API or the driver.properties file on the client. The profile must be associated with the username of the client using the User Admin page on the Admin tab. The Broker Routing page is then used to set up routing rules based on these properties.

Driver API

The DriverManager API on all Driver platforms provides a method, connect(String broker), that forces the client to log in to the specified Broker. If a Driver Profile is associated with the client, this profile must permit the specified Broker.

Engine Routing and Balancing

Engines are dynamically allocated resources that can migrate among Brokers based on such criteria as load and policy. The Engine Balancer is the component on the Director that manages login and regularly re-routes Engines to maintain an optimal balance across the Grid. The Primary Director’s balancer always runs, while the Secondary Director’s will only run if the Primary is down.



On a regular basis, the Director polls all Brokers for the state of all Engines on those Brokers. The routing mechanisms are tested against all Engines to determine where all Engines should optimally reside. Typically, changes in state due to load balancing requirements will result in changes in the optimal distribution. If it is determined that Engines should be re-routed, the Director sends a request to each Broker that has Engines that should be moved, to log those Engines off. When an Engine logs off, it will then log back in to the optimal Broker.

There are three balancers available, depending on how the grid is to be used. The weight-based balancer algorithm attempts to distribute Engines equally by relative weights, and it also allows rule-based routing using Engine properties. The Home/Shared Balancer routes Engines based on an Engine’s assigned Home Brokers, and the sharing policy of Home Brokers to other Brokers. Additionally, because version 4.1 used a different routing mechanism, and version 4.2 allows for 4.1 Brokers for staged migration of large grids, a 4.1-based balancer is available. All of the balancers take into account the number of running and pending tasks on each Broker, and the desired maximum and minimum number of Engines for each Broker.

If the Engine Balancer is changed on the Director, the Director must be restarted. Also, all balancer settings must be identical on the Primary and Secondary Directors.

Engine Weight-Based Balancer

The Engine weight-based balancer allocates Engines based on each Broker’s Engine Weight value, which is set on the Broker Admin page. This value is the number of Engines that the Broker will be allocated relative to the other Brokers’ weights when all Brokers are idle. The algorithm also takes session load into account, and idle Engines are reallocated to busy Brokers as they are needed.
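As an illustration of the idle case only, the following Python sketch splits a pool of Engines in proportion to Broker weights. It is not the balancer's actual algorithm: the function name idle_allocation, the Broker names, and the rounding rule (largest remainder first) are assumptions made for the example, and session load, routing rules, and Engine bounds are ignored.

```python
def idle_allocation(total_engines, weights):
    """Split Engines across Brokers in proportion to their weights."""
    total_weight = sum(weights.values())
    shares = {b: total_engines * w / total_weight for b, w in weights.items()}
    # Start from the whole part of each share, then hand the leftover
    # Engines to the Brokers with the largest fractional remainders.
    alloc = {b: int(s) for b, s in shares.items()}
    leftover = total_engines - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda b: shares[b] - alloc[b], reverse=True)
    for b in by_remainder[:leftover]:
        alloc[b] += 1
    return alloc


# Hypothetical grid: 60 Engines, one Broker weighted 3.0 and three weighted 1.0.
print(idle_allocation(60, {"A": 3.0, "B": 1.0, "C": 1.0, "D": 1.0}))
# {'A': 30, 'B': 10, 'C': 10, 'D': 10}
```

With these weights Broker A receives 3 of every 6 Engines, that is, half of the idle grid.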

This balancer also allows for rule-based routing via Engine Properties, when it is necessary to restrict some Engines to a set of Brokers. Engines can be routed via their intrinsic properties, such as cpuTotal, and by user-defined properties, which can be created using the Engine Property List page and assigned using the Engine Properties List page. The Broker Routing page is used to set up routing rules based on these properties.

Home/Shared Balancer

The Home/Shared Engine balancer uses an algorithm based on the idea that every Engine has a set of Home Brokers that it will always work on when they have outstanding tasks, but that it can be shared with other Brokers when there are no outstanding tasks on any home Broker. Engines are assigned a home via their configuration, using the Engine Configuration page. Brokers are configured to share their homed Engines with other Brokers using the Broker Admin page.

This algorithm uses Broker needs and Engine preferences for Brokers to perform allocation. Each Engine divides the existing Brokers into tiers by preference. A tier is an unordered set of Brokers. There are two tiers by default: the Engine’s home Brokers, and the shared Brokers of those home Brokers. A third tier can be introduced by splitting shared Brokers into two groups. The higher the tier, the more the Engine prefers the Brokers in that tier.

The balancer uses the following rules:

1. An Engine is routed to the highest-tiered Broker that has pending tasks. If multiple Brokers in the same tier have pending tasks, the choice is made at random, as if all weights were 1.


2. An Engine will leave its current Broker only if there is a needy Broker in a higher tier. An Engine will not move to a lower-tiered Broker unless it is idle.

3. Failover Brokers are never allocated Engines unless they are needy.

When using the Home/Shared Engine balancer, tiers are shown in the GridServer Administration Tool, in the Broker Sharing field of the Broker Admin page. Brokers are separated into tiers with a semicolon, such as “A,B;C,D,E”.

For example, an Engine configuration’s home Brokers are A and B. A’s shared list is “C,D;E”. B’s shared list is “F;G”. An Engine with this configuration will have the following preferences: first: A, B; second: C, D, F; third: E, G. Within each group, Brokers are equal, and ordering doesn’t matter.
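The preference order in this example can be reproduced with a short Python sketch. The function names parse_shared and engine_preferences are hypothetical, and the merging rule (combine the n-th shared tier of every home Broker into one tier) is an assumption inferred from the example above, not a statement of how the balancer is implemented.

```python
def parse_shared(shared):
    """Split a Broker Sharing value such as "C,D;E" into a list of tiers."""
    return [[b.strip() for b in tier.split(",") if b.strip()]
            for tier in shared.split(";")]


def engine_preferences(home_brokers, sharing):
    """Return tiers of Brokers for an Engine, most preferred first."""
    tiers = [list(home_brokers)]
    shared_tiers = [parse_shared(sharing.get(h, "")) for h in home_brokers]
    depth = max((len(t) for t in shared_tiers), default=0)
    for level in range(depth):
        tier = []
        for per_home in shared_tiers:
            if level < len(per_home):
                for broker in per_home[level]:
                    # Skip duplicates and the Engine's own home Brokers.
                    if broker not in tier and broker not in home_brokers:
                        tier.append(broker)
        if tier:
            tiers.append(tier)
    return tiers


prefs = engine_preferences(["A", "B"], {"A": "C,D;E", "B": "F;G"})
print(prefs)  # [['A', 'B'], ['C', 'D', 'F'], ['E', 'G']]
```

Order inside each inner list carries no meaning, since a tier is an unordered set of Brokers.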

Engine Balancer Configuration

Engine Balancing is configured in the GridServer Administration Tool on the Manager Configuration page, in the Engines and Clients section. These settings must be identical on all Directors:

Setting Description

Engine Balancer The Engine balancer that will be used: Weight-Based, Home/Shared, or 4.1-Compatible.

Rebalance Interval The amount of time, in seconds, between balancing episodes. (Previously called the Poll Period.)

Soft Logoff If true, Engine logoffs do not restart the JVM. This enables them to retain state and log in faster.

Logoff Timeout The amount of time in seconds that an Engine will wait to finish a task before logging off.

Engine Balance Fraction The fraction of extra Engines that will actually be moved to another Broker on a balance. This can be set to less than 1 to dampen Engine movement. For instance, if the fraction is 0.5 and the balancer determines that a Broker has 8 extra Engines, it will only move 4 on the first balance. Assuming those Engines move, on the next balance it will determine that there are 4 extra and move 2, and so on.

Engine Balance Maximum The maximum number of Engines that will be moved to another Broker on a rebalance. The maximum applies over the entire grid. For instance, if this parameter is set to 100 and the balancer determines that 200 Engines should be rebalanced (after taking Engine Balance Fraction into account), then only 100 Engines will actually be rebalanced. Does not apply to 4.1-Compatible balancer.

Engine Threshold The difference between the actual and optimal number of Engines on a Broker must be greater than this value before any Engines are logged off. This threshold minimizes unnecessary Engine reallocation. For example, if the threshold is 2, and a Broker’s optimal number of Engines is calculated to be 8, it must have more than 10 Engines before it will log off any of them. Applies to 4.1-Compatible balancer only.
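The damping effect of Engine Balance Fraction and the cap imposed by Engine Balance Maximum can be checked with a little arithmetic. The Python sketch below is illustrative only: the function name engines_to_move is hypothetical, and rounding down to a whole Engine is an assumption that the table above does not state.

```python
def engines_to_move(extra, fraction, maximum=None):
    """Number of surplus Engines moved in one balancing episode."""
    moved = int(extra * fraction)  # assumed: round down to whole Engines
    if maximum is not None:
        moved = min(moved, maximum)
    return moved


# A Broker with 8 extra Engines and a fraction of 0.5.
extra = 8
history = []
while extra > 0:
    moved = engines_to_move(extra, 0.5)
    if moved == 0:
        break
    history.append(moved)
    extra -= moved
print(history)  # [4, 2, 1]

# Grid-wide surplus of 400 with fraction 0.5 gives 200, capped at 100.
print(engines_to_move(400, 0.5, maximum=100))  # 100
```

A smaller fraction spreads the same movement over more balancing episodes, which is why it dampens Engine movement.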




Note that if the 4.1-Compatible balancer is selected, it forces Engine instance grouping to avoid constant Engine upgrading or downgrading.

Failover Brokers

The purpose of a Failover Broker is to temporarily take over the execution of Service Sessions when the Client has no other Brokers to which it is permitted to connect. As far as Clients are concerned, Failover Brokers become part of the pool of active Brokers when there are no non-Failover Brokers to which the Client is permitted to connect. As far as Engines are concerned, Failover Brokers are considered to be part of the active pool when there are active sessions in progress on that Failover Broker. In either case, the Failover Broker is then treated like a non-Failover Broker by the algorithm. It is important to take this into account when setting up the routing configuration. For example, if you are setting up a Driver Profile to allow a client on only one Broker under normal conditions, you must also include a Failover Broker in its list of allowed Brokers if you want the client to fail over when its main Broker goes down.

See Chapter 4, “Grid Fault-Tolerance and Failover” on page 23 for more information.

Engine Upper and Lower Bounds

Brokers can also be configured to have upper and lower bounds on the number of Engines that can be logged in at a given time. These are set on the Broker Admin page. By default the columns are hidden, so you may need to add them using the Add Column control. The minimum value specifies that the balancer algorithm will always leave at least this number of Engines (assuming there are this many) on the Broker regardless of the state of other Brokers. The maximum value is the cap on the total number of Engines that can be allowed on the Broker. Both values are always considered by the balancing algorithms.

Example Use Cases

Example use cases are presented in this section.

N+1 Failover with Weighting

An organization has four groups using all available Engines in a Grid. One group is guaranteed to be allocated at least half of the Grid any time it needs it, and the other three groups share the remaining Engines.

Brokers Set up five Brokers. Each group gets a Broker, plus one is used for failover.

Drivers Create four Driver Profiles, one for each group. In each profile, set the allowedBrokers value to the group’s Broker and the failover Broker. Assign the Profiles to the appropriate users.

Engines Use the weight-based Engine Balancer. Adjust Engine Weight on the Broker Admin page so the first group’s Broker is weighted at 3.0, and the other three groups’ Brokers are weighted as 1.0. You would most likely set the failover Broker weight at 1.0, so that a group would not be assigned any more resources than normal if their Broker went down.


Engine Localization with Sharing

A company has two groups, one in New York and one in London. Each has a single middleware application that has a Driver that connects to its own Broker. Each group also has a set of CPUs that it expects to always be working on its own calculations. However, there will be times when one group’s Broker is idle, so the groups are allowed to share with each other.

Brokers Set up four Brokers, a regular and a failover for each group. Each regular Broker shares with the other regular Broker, plus its own failover Broker.

Drivers Create two Driver Profiles, one for each group. In each profile, set the allowedBrokers value to the group’s Broker and its failover Broker. Assign the Profile to the middleware application user.

Engines Use the Home/Shared Engine Balancer. Set up two Engine Configurations, “London” and “New York,” which would home the Engines to their respective Broker.

In this scenario, the application always connects to its local Broker, unless it is down, in which case it moves to its failover. Whenever that Broker has pending requests, all of its Engines will always be local. If the other group’s Broker is idle, or if it does not need all of its Engines, any of its idle Engines will be routed to the Broker that needs it.

You may also want to increase the Engine Threshold and decrease the Engine Balance Fraction to minimize wandering of Engines during normal work periods, when a Broker may occasionally have idle Engines for brief times.





Chapter 4

Grid Fault-Tolerance and Failover

Introduction

GridServer is a fault-tolerant and resilient distributed computing platform. The GridServer platform will recover from a component failure, guaranteeing the execution of Services over a distributed computing Grid with diverse, intermittent compute resources. This section describes how GridServer behaves in the event of Engine, Driver, and Manager failure. Failures of components within the Grid can happen for a number of reasons, such as power outage, network failure, or interruptions by end users. For the purposes of this discussion, failure means any event that causes Grid components to be unable to communicate with each other.

The Fault-tolerant GridServer Deployment

A GridServer deployment consists of a primary Director, an optional secondary Director, and one or more Brokers. Drivers and Engines log into the Director, which routes them to one of the Brokers. Directors balance the load among their Brokers by routing Drivers and Engines to currently running Brokers.

A minimal fault-tolerant GridServer deployment contains two Directors, a primary and a secondary, and at least two Brokers. The Brokers, Engines, and Drivers in the Grid have the network locations of both the primary and the secondary Directors. During normal operation, the Engines and Drivers log in to their primary Director; the secondary Director is completely idle.

Other GridServer topologies, such as having multiple Managers to handle volume or to segregate different types of Services to different Managers, are discussed in Chapter 2, “Installation Overview” on page 7 of the GridServer Installation Guide.

Heartbeats and Failure Detection

Heartbeats are lightweight network communications sent at regular intervals between GridServer components, such as from Drivers to Brokers, from Engine Instances to Brokers, and from Engine Daemons to Directors. A Manager detects Driver and Engine failure when it does not receive a heartbeat within the configurable heartbeat interval time. Drivers detect Broker failure by failing to connect when they submit Jobs or poll for results. Engines detect Broker failure when they attempt to report for work or return results. To minimize unnecessary messaging, a heartbeat is only sent if no other message has been sent within the heartbeat interval.
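The two rules described above (send a heartbeat only after a quiet interval, and treat a silent peer as failed) can be sketched in a few lines of Python. This is a generic illustration, not GridServer code: the class names HeartbeatSender and FailureDetector, the peer names, and the 5-second values are all assumptions, and time is passed in explicitly so the example is deterministic.

```python
class HeartbeatSender:
    """Sends a heartbeat only when the link has been quiet for an interval."""

    def __init__(self, interval):
        self.interval = interval
        self.last_sent = 0.0

    def record_message(self, now):
        # Any ordinary message also proves the sender is alive.
        self.last_sent = now

    def heartbeat_due(self, now):
        return now - self.last_sent >= self.interval


class FailureDetector:
    """Receiver-side view: a peer is failed if it has been silent too long."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def message_received(self, peer, now):
        self.last_seen[peer] = now

    def failed_peers(self, now):
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout)


sender = HeartbeatSender(interval=5.0)
sender.record_message(now=2.0)
print(sender.heartbeat_due(now=6.0))   # False: a message went out 4 s ago
print(sender.heartbeat_due(now=7.0))   # True: 5 s of silence

detector = FailureDetector(timeout=5.0)
detector.message_received("engine-1", now=0.0)
detector.message_received("engine-2", now=4.0)
print(detector.failed_peers(now=6.0))  # ['engine-1']
```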

FIGURE 4-1: A typical redundant GridServer configuration.



Manager Stability Features

Several precautions are taken to prevent Manager failure due to excessive traffic. For example, the number of threads used for file update is limited. This keeps a large number of file updates from Brokers to Engines from using all of the HTTP threads on the application server and blocking other HTTP activity; instead, Engines retry the download later when this maximum is reached. By default, the limit is 50 threads, but it can be changed in the GridServer Administration Tool on the Manager Configuration page, in the Communication section, with the Maximum Resource Download Connections property.

The number of Broker/Director messaging threads is also limited. If this limit is reached, clients will retry rather than immediately fail.

Engine Failure

Network connection loss, hardware failure, or errant application code can cause Engine failure. When an Engine goes offline, the work assigned to it is requeued and assigned to another Engine; any work already done on the failed Engine is lost. Engines that have built up a considerable state or cache or that are running particularly long Tasks could cause a larger loss if Engine failure occurs. This can be avoided by shortening Task duration in your application or by using the Engine Checkpointing mechanism. For more information on Task duration, see Chapter 10, “GridServer Performance and Tuning” on page 77.

Each Engine has a checkpoint directory where a Task can save intermediate results. If an Engine fails and the Manager retains access to the Engine machine’s file system, a new Engine will copy the checkpoint directory from the failed Engine. It is the responsibility of the client application to handle correct resumption of work given the contents of the checkpoint directory.
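The resumption logic that the application must supply might look like the following Python sketch. It does not use the GridServer checkpoint API: the function run_task, the file name state.json, and the simulated failure are hypothetical, and a temporary directory stands in for the checkpoint directory (in a real Grid the new Engine would work from a copy of the failed Engine's directory).

```python
import json
import os
import tempfile


def run_task(checkpoint_dir, total_steps, fail_at=None):
    """Sum 0..total_steps-1, saving progress after every step."""
    path = os.path.join(checkpoint_dir, "state.json")
    state = {"next_step": 0, "partial": 0}
    if os.path.exists(path):
        # Resume from the intermediate result left by an earlier attempt.
        with open(path) as f:
            state = json.load(f)
    for step in range(state["next_step"], total_steps):
        if step == fail_at:
            raise RuntimeError("simulated Engine failure")
        state = {"next_step": step + 1, "partial": state["partial"] + step}
        with open(path, "w") as f:
            json.dump(state, f)
    return state["partial"]


with tempfile.TemporaryDirectory() as checkpoint_dir:
    try:
        run_task(checkpoint_dir, total_steps=10, fail_at=6)
    except RuntimeError:
        pass  # the first attempt died after completing steps 0..5
    result = run_task(checkpoint_dir, total_steps=10)  # second attempt resumes at step 6
    print(result)  # 45
```

The second attempt repeats none of the first six steps, which is the saving that checkpointing offers for long Tasks.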

Note that if an Engine Daemon logs off the Director or otherwise fails, it does not log off its Engines. Provided the failure has not caused the Engines to also fail, they will continue working and return results when completed.

Driver Failure

When a client application fails, the Broker detects the failure when the Client does not return a heartbeat and does not log back in within the interval specified by the Client Timeout setting. When this happens, any currently running Services are cancelled, and application failure recovery or restart is the responsibility of your application. The exceptions to cancellation are fully submitted Services of type Collection.LATER, and any Services of type Collection.NEVER. Also, if a Client is collecting results from a Collection.LATER type Service, none of the outputs will be removed until all have been collected and the Client destroys the Service, so that if a Client fails during collection it can restart and recollect the outputs.

All Driver fileservers return a “Server Unavailable” code with instructions to retry if they are processing too many concurrent requests. This significantly reduces the chance of a Service invocation failing due to a temporarily overloaded Driver.


Director Failure

If the primary Director fails, the secondary Director takes over balancing and routing Drivers and Engines to Brokers. Since the Directors do not maintain any state, no work is lost if a Director fails and is restarted. Also, because both Directors follow the same rules for routing to Brokers, it makes no difference which Director is used for login.

The Primary Director is also responsible for the Administrative Database, which contains data needed by the Grid for operation, such as the User list, routing properties, and so on. These values, then, can only be modified on the Primary Director. This database is synchronized to the Secondary Director while both are running, and backed up by the Secondary Director on every database backup, so that the Grid can remain in operation when the Primary Director is down.

Broker Failure

Like the Director, the Broker is designed as a robust application that will run indefinitely, and will typically only fail in the event of a hardware failure, power outage, or network failure. However, the fault-tolerance built into the Drivers guarantees that all Services will complete even in the event of failure.

Because the most likely reason that a Driver will be disconnected from its Broker is a temporary network outage, the Driver does not immediately attempt to log in to another Broker. Instead, it waits a configurable amount of time to reconnect to the Broker to which it was connected. After this amount of time, it will then attempt to log in to any available Broker. This amount of time is specified in the driver.properties file or via the API.
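The two-phase behavior described above (retry the original Broker until the timeout, then accept any Broker) can be simulated as follows. This Python sketch is not the Driver's implementation: the function choose_broker, the polling step, the give_up_after limit, and the Broker names are assumptions added so that the example is self-contained and terminates.

```python
def choose_broker(original, available_at, reconnect_timeout,
                  give_up_after, poll=1.0):
    """Return (broker, time) for the first successful login, or (None, time).

    available_at maps a Broker name to the time at which it accepts logins.
    """
    now = 0.0
    while now <= give_up_after:
        if now < reconnect_timeout:
            candidates = [original]            # only the original Broker
        else:
            candidates = sorted(available_at)  # any known Broker
        for broker in candidates:
            if available_at.get(broker, float("inf")) <= now:
                return broker, now
        now += poll
    return None, now


# Brief outage: Broker A is back after 3 s, inside the 10 s timeout.
print(choose_broker("A", {"A": 3.0, "B": 0.0}, 10.0, 60.0))    # ('A', 3.0)

# Long outage: the Driver gives up on A at 10 s and logs in to B.
print(choose_broker("A", {"A": 100.0, "B": 0.0}, 10.0, 60.0))  # ('B', 10.0)
```

Waiting first avoids moving Services to another Broker because of a short network interruption.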

Once the Driver has timed out and reconnected to another Broker, all Service instances will then resubmit any outstanding tasks and continue. Tasks that are already complete will not be resubmitted. The Service instances will also resubmit all state updates in the order in which they were originally made. From the Service instance point of view, there will be no indication of error, such as exceptions or failure, just the absence of any activity during the time in which the Driver is disconnected. That is, all Services will run successfully to completion as long as eventually a suitable Broker is brought online.

If an Engine is disconnected from its Broker, the process simply shuts down, restarts, and logs in to any suitable Broker. Any work is discarded.

Failover Brokers

In the fault-tolerant configuration, some Brokers can be set up as Failover Brokers. When a Driver logs in to a Director, the Director will first attempt to route it to a non-Failover Broker. If no non-Failover Brokers are available, the Director will consider all Brokers, which would typically then route the Driver to a Failover Broker.

FIGURE 4-1: A GridServer configuration with Failover capability.




A Failover Broker is not considered for Engine routing if there are no active Services on that Broker. Otherwise, it is considered like any other Broker, and follows Engine routing like any other Broker. By virtue of these rules, if a Failover Broker becomes idle, Engines will be routed back to other Brokers.

The primary Director monitors the state of all Brokers on the Grid. If a Driver logged into a Failover Broker is able to log in to a non-Failover Broker, it will be logged off so it can return to the non-Failover Broker. All running Services will be continued on the new Broker by auto-resubmission.

By default, all Brokers are non-Failover Brokers. Designate one or more Brokers within the Grid as Failover Brokers when you want those Brokers to remain idle during normal operation.
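The Driver routing rule for Failover Brokers can be summarized in a few lines. The sketch below is a simplification under stated assumptions: the function name and the dictionary layout are hypothetical, and the real Director also applies its normal routing rules when choosing among eligible Brokers.

```python
from typing import Dict, Optional


def route_driver(available_brokers: Dict[str, bool]) -> Optional[str]:
    """Pick a Broker for a Driver login.

    available_brokers maps a Broker name to True if it is a Failover Broker.
    Only Brokers that are currently available should be listed.
    """
    regular = [name for name, is_failover in available_brokers.items()
               if not is_failover]
    if regular:
        # Prefer any non-Failover Broker.
        return regular[0]
    # No non-Failover Broker is available: consider all Brokers.
    return next(iter(available_brokers), None)
```

A Driver is routed to a Failover Broker only when the list of non-Failover Brokers is empty.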

Fault-Tolerant Tasks

Fault-Tolerant Tasks enable an Engine to continue executing a task even if it logs off of a Broker, so that it does not lose work due to a Broker failure. This feature is intended for use on long-running tasks.

This means that if an Engine is working on a task, and it logs off of the Broker, it will not immediately exit. Rather, it will continue to work on that task, while continuing to attempt to log in to a Broker that has the Service on which it is working. If it does not log back in within a defined time period, it will exit. If it does log back in, it will first notify the Broker that it is working on the task. If it has already completed, it will immediately send the result; otherwise, it will do so upon completion.
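As a rough sketch of the Engine-side behavior just described, the following function returns the action an Engine would take in each situation. The function, its parameters, and the returned strings are hypothetical and exist only to make the cases explicit; only Engine Timeout Minutes corresponds to a setting named in this guide.

```python
def engine_action(logged_back_in: bool,
                  minutes_offline: float,
                  engine_timeout_minutes: float,
                  task_finished: bool) -> str:
    """Describe what an Engine running a fault-tolerant task does next."""
    if not logged_back_in:
        if minutes_offline >= engine_timeout_minutes:
            # The defined time period has passed without a successful login.
            return "exit"
        return "keep working, keep trying to log in"
    if task_finished:
        # The task completed while the Engine was logged off.
        return "send result immediately"
    return "report task in progress, send result on completion"
```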

It is not recommended that you use this feature unless you have individual tasks that take many hours to finish (or the longest task takes nearly as long as the whole job). For example, if a report runs during the night and some tasks take 8 hours to process, then you may want this feature in place to ensure that an 8-hour task does not have to start from the beginning if the Broker fails at 7 AM. On the other hand, enabling fault-tolerant tasks can diminish the efficiency of the Grid, since it will redundantly schedule all outstanding tasks. With short tasks, it’s usually more efficient to simply recalculate tasks in the event of a Broker failure.

As an example of Fault-Tolerant Tasks, consider the following:

1. An Engine and Driver are connected to Broker A.

2. Broker A goes down.

3. The Driver continues for 5 minutes to find the Broker with its Service. The Engine continues working, while it attempts to find the Broker with its Service.

4. After 5 minutes, the Driver connects to Broker B, and resubmits outstanding work.

5. Now that the Service is on Broker B, the Engine logs in to Broker B, and indicates that it has taken that task. When it has finished, it writes its task. If it has already finished, it immediately writes the task.

If another Engine has already taken that task by the time this Engine logs in, no attempt will be made to cancel the task on the Broker. It will essentially be the same as a redundantly rescheduled task.

When an Engine logs into a failover Broker and works on a task, the task is cancelled once the Driver switches to the regular Broker.

To enable Fault-Tolerant Tasks, in the GridServer Administration Tool, click the Manager tab, then click Manager Configuration, then Engines and Clients and change the value of Engine Timeout Minutes and click Save. The timeout should be longer than the Driver’s timeout, which is the value of DSBrokerTimeout set in the driver.properties file.


To use Fault-Tolerant Tasks, another Broker must be available for failover, and the Client running the session will need to fail over to the Broker and resubmit its session.

No attempt will be made upon login of the Engine running a fault-tolerant task to cancel that same task if it has already been taken by another Engine.

Batch Fault-Tolerance

Batch Schedules that exist on a Manager are persistent, provided the Next Run field is not never. This provides failover capability in the event of a Manager failure, as the Batch Schedules will still exist when the Manager is restarted.

The following Batch Schedules are persistent:

• Absolute schedules

• Relative schedules with repeat

• Cron schedules

All persistent Batches are restarted when the Manager is restarted, just like they were scheduled for the first time. Batch runs that were to occur during the time when the Manager was down are ignored.

GridCache Fault-Tolerance

GridCache supports fault-tolerance, as described below. Note that primary and failover Brokers must have their clocks synchronized for GridCache failover.

Client

If any client puts data in the cache and subsequently dies or logs out, that data is still available to all other clients. This is due to the fact that the Broker maintains the master index and complete view of the cached data. This does not apply to the local caching mode where a region has a local loader that does not synchronize with the other local caches.

Broker Restart

GridCache can be configured to survive Manager restart and failure. GridCache’s cache index is rebuilt on system startup; objects persisted on the Broker’s file system will be recovered. If some or all of the cache is stored in memory, that information will be lost.

Failover

A failover Broker can manage a GridServer cache when a regular Broker goes down, provided that the persistent cache directory is on a shared filesystem. The location of this filesystem is configurable from the Manager Configuration page in the GridServer Administration Tool. When the regular Broker goes down and the failover Broker takes over, the failover Broker will build its cache index and begin managing the cache from the shared filesystem. All clients that then fail over to the failover Broker will be able to get references to the existing cache regions on the shared filesystem.




Note that a failover Broker can only be configured to fail over to one shared cache directory. Therefore, a failover Broker can’t serve as a failover for multiple Brokers with different cache directories; a different failover Broker would have to be used for each Broker.


Chapter 5

Scheduling

One of the responsibilities of Brokers is scheduling, which is the management of Services and Tasks on Engines and interactions between Engines and Drivers. This chapter gives more details on how scheduling works, and the method used to determine what Tasks in a Service are sent to what Engines.

Introduction

Most of the time, the scheduling of Services and Tasks on Engines is completely transparent and requires no administration. However, in order to tune performance, or to diagnose and resolve problems, it is helpful to have a basic understanding of how the Broker manages scheduling.

Recall that clients create Service Sessions on the Broker. Each Service Session consists of one or more Tasks, which may be performed in any order. The scheduler determines the optimal match of Engines to Services. Whenever an Engine reports to the Broker to request work, the Broker assigns a Task from that Service to the Engine. When an Engine completes a Task, it is queued on the Broker for collection by the client. If an Engine is interrupted during processing, the Task is requeued by the Broker.

Reschedules and Retries

Before the discussion of scheduling behavior, we must first define the terms Retry and Reschedule within the context of scheduling Tasks.

Retry

A Retry is when a Task is re-queued due to a known failure of the Task. Such failures could be due to an error condition in the implementation, an error due to inability to download data, or a failure of an Engine (the monitor has detected that the Engine is no longer connected but it has not logged off.) It is always the result of the Engine returning the Task as failed to the Broker. When a Task is retried, it is always placed at the front of that session’s queue. The scheduler manages a retry count for each Task, so that a limit can be placed on the number of allowed retries.

Reschedule

A Reschedule is when a Task is re-queued when it may or may not have failed. When a Task is rescheduled, it is by default placed at the back of that session’s queue, unless the Reschedule First configuration option on the Broker (set in the Manager tab, on the Manager Configuration page, in the Services section) is set to true. The scheduler also manages a reschedule count for each Task. The following conditions result in a reschedule:

• Engine Logoff: When an Engine logs off gracefully while running a Task (such as when UI or CPU idle conditions are met, or there is a forced rebalance), the Task is rescheduled, but the reschedule count is not incremented, since there was no Task error.



• Redundant Rescheduler: If any of the Redundant Rescheduler strategies are in effect, Tasks may be rescheduled to other Engines. By default, those Tasks are allowed to continue to run on the current Engines, in case they finish before the rescheduled Tasks. In this case, the reschedule count is increased.
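The difference in queue placement between a Retry and a Reschedule can be illustrated with a double-ended queue. This is a minimal sketch, not the Broker's implementation: the class and method names are hypothetical, and only the Reschedule First option corresponds to a setting named in this guide.

```python
from collections import deque


class SessionQueue:
    """Toy model of one session's waiting-task queue (hypothetical)."""

    def __init__(self, reschedule_first: bool = False):
        self.tasks = deque()
        self.reschedule_first = reschedule_first
        self.retry_count = {}
        self.reschedule_count = {}

    def submit(self, task_id: str) -> None:
        self.tasks.append(task_id)

    def retry(self, task_id: str) -> None:
        # A retried Task always goes to the front of the queue.
        self.retry_count[task_id] = self.retry_count.get(task_id, 0) + 1
        self.tasks.appendleft(task_id)

    def reschedule(self, task_id: str, count: bool = True) -> None:
        # A graceful Engine logoff reschedules without counting (count=False).
        if count:
            self.reschedule_count[task_id] = self.reschedule_count.get(task_id, 0) + 1
        if self.reschedule_first:
            self.tasks.appendleft(task_id)
        else:
            self.tasks.append(task_id)
```

For example, with two Tasks already waiting, a retried Task is placed ahead of both, while a rescheduled Task is placed behind both unless Reschedule First is true.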

Timeout Behavior

When the INVOCATION_MAX_TIME option is set, it specifies that any invocation of a request may not exceed this value. If a Task times out on an Engine, it may be either retried or rescheduled, depending on what makes more sense for your application. If retried, the current Engine’s invoke process is terminated, and the Task is assigned to another Engine. If rescheduled, the current Engine Task is allowed to continue execution. In either case, the appropriate count is incremented.

The default behavior is set on the Broker, and is set to retry by default. It can also be set for the Service Type via the Service Type Registry page, or programmatically when the Service Session is created.

The Scheduler

The Scheduler is the component that is used on a GridServer Broker to assign tasks to Engines. It attempts to make optimal matches based on criteria such as the session priority level, affinity, and Serial Service and Priority execution modes.

Scheduler Overview

The scheduler aims to schedule tasks to Engines by attempting to have the proper number of Engines allocated to all active Service Sessions at any given time. On any given scheduling event, the algorithm decides the number of Engines each Session should have at the time based on static and dynamic criteria, and then assigns the appropriate number of Engines to sessions based on how many the Session needs to reach the ideal level.

Additionally, the scheduler takes into account the amount of usage that the Session has received over a given historical window of time. The “usage” refers to the amount of Engine clock time that the Session has occupied during that window. When a Session is created, its usage is initialized as if it had been running at its ideal allocation over this window.

This usage provides the ordering in which Engines are allocated to Sessions. This addresses starvation issues, round off error (the number of ideal Engines will rarely be an integer), and under/over-utilization due to discrimination, changes in the number of available Engines, and so on.

Essentially, on a scheduling event, sessions are assigned the ideal number of Engines less the amount that are currently allocated, in the order of least to most usage. The following sections will discuss first the general algorithm, and then address specific subclasses of that algorithm for serial service and priority execution modes.

This approach can be seen as analogous to a CPU thread scheduling algorithm. Each session is a “thread”, the engines are the “CPU”, the window is the sample period, and each task is an uninterruptible unit of CPU time allotted to a thread.


Service Priority

Every GridServer Service has an associated priority. Priorities can take any integer value between zero and ten, so that there are eleven priority levels in all. 0 is the lowest priority (a suspended Service), 10 is the highest (an urgent priority Service, see below), and 5 is the default. The GridServer API provides methods that allow the application code to attach priorities to Services at runtime (see the GridServer API documentation for more details) and you can use the GridServer Administration Tool to change priorities while a Service is running.

Priority Weight refers to the weight associated with a Priority Level. The weight defines the number of Engines allocated to a session relative to all other active sessions. For example, if Sessions A and B each have a weight of 2.0, Session C has a weight of 4.0, and there are eight Engines, Sessions A and B are allocated two Engines each, and Session C gets four. The weights are set with the Priority Weights property in the GridServer Administration Tool, on the Manager Configuration page in the Services section.

Usage Algorithm

The usage algorithm is the default mode, and is used when Serial Service Execution mode is not enabled.

Whenever an Engine or set of Engines is available for scheduling, the scheduler decides how many Engines each session should be allocated. In general, that value is:

Ideal Engines per Session = All Engines * Session Priority Weight / Total Weight,

where “Total Weight” is the sum of all Priority Weights of active sessions. This value is rounded up to the next integer to prevent starvation for an ideal calculation of < 0.5, and assures that the sum of Ideal Engines is always at least as large as Total Engines. This algorithm also takes into account cases where the actual number of Engines that can be allocated is less than the ideal, such as when a Session is towards the end, or when Max Engines is used.
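The formula above, including the round-up step, can be checked with a short calculation. The function name and the session names are hypothetical; the sketch ignores Max Engines and end-of-session effects.

```python
import math
from typing import Dict


def ideal_engines(total_engines: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Ideal Engines per Session = All Engines * Priority Weight / Total Weight,
    rounded up to the next integer."""
    total_weight = sum(weights.values())
    return {name: math.ceil(total_engines * weight / total_weight)
            for name, weight in weights.items()}
```

With eight Engines and three sessions weighted 2.0, 2.0, and 4.0, the result is two, two, and four Engines. With ten Engines and three equally weighted sessions, each ideal is rounded up to four, so the sum of the ideals (twelve) is at least as large as the number of Engines.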

Recall that a Session’s usage is considered to be the total Engine clock time spent on the session over the last configurable amount of time. This includes running and completed tasks. When a Session is created, it must initialize its usage. The simplest, most fair method of doing this is to assume it has been operating in a steady state over the window with the ideal non-rounded number of Engines. The variables that monitor usage are then initialized as such. If no sessions are active, it initializes them such that the session's ideal is the total number of Engines currently on the Broker.

Whenever there is any event that requires a scheduling episode, the scheduler assigns the proper number of engines to each session for it to be at its ideal amount. This assignment is performed in order of least to most priority-normalized usage. If there are any unassigned Engines remaining after this initial round based on usage (typically due to disallowed conditions preventing assignment), a second tier round robin assignment is performed.
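A scheduling episode of this kind might look like the following sketch. It rests on several assumptions that are not stated in this guide: priority-normalized usage is taken to be usage divided by Priority Weight, discrimination and other disallowed conditions are not modeled, and all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    name: str
    weight: float       # Priority Weight of the session
    usage: float        # Engine clock time used over the window
    ideal: int          # ideal number of Engines
    allocated: int = 0  # Engines currently allocated


def schedule(sessions: List[Session], idle_engines: int) -> Dict[str, int]:
    """Return how many idle Engines each session receives in this episode."""
    granted = {s.name: 0 for s in sessions}
    # First round: least to most priority-normalized usage (assumed usage / weight).
    for s in sorted(sessions, key=lambda s: s.usage / s.weight):
        take = min(max(0, s.ideal - s.allocated), idle_engines)
        granted[s.name] += take
        idle_engines -= take
    # Second round: hand out any remaining Engines round robin.
    while idle_engines > 0 and sessions:
        for s in sessions:
            if idle_engines == 0:
                break
            granted[s.name] += 1
            idle_engines -= 1
    return granted
```

In this model, a session that has used less Engine time relative to its weight is topped up to its ideal first, which is how the ordering addresses starvation.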

Time Algorithm

The time algorithm is used when Serial Service Execution mode is enabled. This algorithm works as follows:




Session Addition

When a session is added to the Waiting List, it is placed such that it is ordered by Session creation time. Typically this is at the back of the list, although if the session had been removed and then re-added, it may not be.

Scheduling Episode

On each episode, only the first session with waiting tasks is considered for assignment. The scheduler simply attempts to assign all Idle Engines to the session. Affinity is not considered. Note that as soon as the Session has no more waiting tasks, subsequent Sessions may be assigned Engines on the next episode even while the previous session is still running.
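The two steps of the time algorithm can be sketched with a list kept sorted by creation time. The names below are hypothetical, and the sketch assumes that each Engine takes one waiting task, which this guide does not state explicitly.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(order=True)
class WaitingSession:
    created: float                                  # only field used for ordering
    name: str = field(compare=False)
    waiting_tasks: int = field(compare=False, default=0)


def add_session(waiting: List[WaitingSession], session: WaitingSession) -> None:
    # Insert so the Waiting List stays ordered by Session creation time.
    bisect.insort(waiting, session)


def episode(waiting: List[WaitingSession], idle_engines: int) -> Optional[Tuple[str, int]]:
    """Assign idle Engines to the first session that has waiting tasks."""
    for session in waiting:
        if session.waiting_tasks > 0:
            assigned = min(idle_engines, session.waiting_tasks)
            session.waiting_tasks -= assigned
            return session.name, assigned
    return None
```

Because a re-added session is inserted by its original creation time, it can land ahead of sessions that were added to the list before it.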

Serial Priority Algorithm

The Serial Priority Algorithm is used when Serial Priority Execution mode is enabled. Either the Time Algorithm or the Usage Algorithm, depending on whether Serial Service Execution mode is enabled, is applied to the subset of sessions at the highest Priority Level that currently has any session with waiting tasks.

For example, with Serial Service Execution mode off, all sessions at level 9 (assuming highest) will be allocated equal amounts of Engines until no more sessions at level 9 have waiting tasks, after which level 8 sessions are allocated.

On the other hand, with Serial Service Execution mode on, all sessions at level 9 will execute in their order of creation. Note that in this state, if they finish, and level 8 sessions start, and then a new level 9 session is created, that new level 9 session will take over at that point. This is because priority takes precedence over creation time.
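The selection step that precedes either algorithm can be expressed as a filter. This is a sketch only; the function name and the tuple layout are assumptions made for the example.

```python
from typing import List, Tuple


def eligible_sessions(sessions: List[Tuple[str, int, int]]) -> List[str]:
    """Return the sessions at the highest priority level that has waiting tasks.

    Each entry is (name, priority, waiting_tasks).
    """
    with_work = [s for s in sessions if s[2] > 0]
    if not with_work:
        return []
    top = max(priority for _, priority, _ in with_work)
    return [name for name, priority, _ in with_work if priority == top]
```

Once the level 9 sessions have no waiting tasks, the level 8 sessions become eligible. A newly created level 9 session with waiting tasks takes over again, because priority takes precedence over creation time.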

Urgent Priority Services and Preemption

Services with priority of 10 are considered urgent by the scheduler. (The API defines PRIORITY_URGENT to be equal to 10.) An urgent Service’s weight is hard-coded to be essentially infinite, so that they are assigned all available Engines. They may also preempt Engines that are currently working. When an Engine is preempted, the Task it is currently running is cancelled and rescheduled, and the Engine becomes available for new Tasks.

Engines are preempted on a Service under the following conditions: if after being assigned all free Engines a Service can still make use of more Engines, then it may preempt some busy Engines, subject to two constraints that can be adjusted with configuration properties. First, the urgent Service must have been in the queue for Preempt Delay Seconds. Second, the percentage of Engines in the Grid running urgent Services cannot exceed Preemptable Engine Percent. For example, if this property is set to 50, and 47 percent of the Engines are currently running urgent Services, then at most three percent will be preempted. This value is not a hard limit on the number of Engines that may be running urgent Services, because free Engines are allocated to urgent Services regardless of how many Engines are already running urgent Services.
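The two constraints can be combined into a single calculation of how much of the Grid may still be preempted. The function and parameter names are hypothetical; the values correspond to the Preempt Delay Seconds and Preemptable Engine Percent properties described above.

```python
def preemptable_percent_remaining(preemptable_engine_percent: float,
                                  urgent_engine_percent: float,
                                  seconds_in_queue: float,
                                  preempt_delay_seconds: float) -> float:
    """Percentage of Engines in the Grid that may still be preempted."""
    if seconds_in_queue < preempt_delay_seconds:
        # The urgent Service has not yet waited for Preempt Delay Seconds.
        return 0.0
    # Preemption stops once urgent Services reach Preemptable Engine Percent.
    return float(max(0.0, preemptable_engine_percent - urgent_engine_percent))
```

With Preemptable Engine Percent set to 50 and 47 percent of Engines already running urgent Services, at most 3 percent may be preempted. Setting the property to 0 always yields 0.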

The scheduler chooses Engines for preemption based on the following rules: Engines running an urgent Service will never be preempted. An Engine running a Task from a Service with lower priority will generally be selected in preference to one running a higher-priority Task. However, if the lower-priority Task has been running for a long time, a short-running, higher-priority Task may be preempted instead. The Preempt Threshold Minutes property determines the value at which this crossover happens. For example, if this property is set to 30, then an Engine that has just started running a priority 2 Task will be chosen for preemption over an Engine that has been running a priority 1 Task for more than 30 minutes.

Other important points concerning priority Services and preemption:

• Tasks canceled by preemption are not subject to a rescheduling limit, since they are not considered failures.

• To prevent preemption from ever occurring, set Preemptable Engine Percent to 0.

• It is possible that the first Service on the queue will not get all free Engines if it doesn’t have enough Tasks, it is already using its maximum number of Engines, or it discriminates against some Engines. Free Engines that are not taken by the first urgent Service are first offered to the other urgent Services on the queue, and then to all other Services.

Engine Blacklisting

If a Service sets the option “engineBlacklisting” (ENGINE_BLACKLISTING) to true, then Engines that fail on a Task from that Service will not be given any other Tasks from that Service. The default is false. Here, “fail” means any action that results in a failed Task being sent back to the Manager, regardless of whether that failure was due to Engine hardware, Engine environment, or Tasklet code. It does not include events such as the Engine going offline due to user activity, since that does not result in a Task failure.

Blacklisted Engines are excluded for a particular Service Session only; they can freely accept tasks from any other Service, regardless of Service Type, assuming the other Services haven’t also blacklisted the Engine or have some discriminators in place that prevent it.

To remove an Engine from all blacklists, go to the Engine Daemon Admin page in the GridServer Administration Tool and select Clear from Blacklists from the Actions list.
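The per-session scope of blacklisting can be modeled with a set of Engines for each Service Session. This is a hypothetical sketch for illustration; the class and method names do not correspond to any GridServer API.

```python
from collections import defaultdict


class Blacklist:
    """Toy model of per-session Engine blacklists (hypothetical)."""

    def __init__(self):
        self._by_session = defaultdict(set)

    def record_failure(self, session_id: str, engine_id: str,
                       engine_blacklisting: bool) -> None:
        # Only sessions that enabled the option blacklist a failing Engine.
        if engine_blacklisting:
            self._by_session[session_id].add(engine_id)

    def may_take(self, session_id: str, engine_id: str) -> bool:
        return engine_id not in self._by_session.get(session_id, set())

    def clear_engine(self, engine_id: str) -> None:
        # Models removing one Engine from all blacklists.
        for engines in self._by_session.values():
            engines.discard(engine_id)
```

An Engine blacklisted by one session remains eligible for every other session until it fails there as well.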

Conditions

Task Discrimination allows limiting certain Tasks to a subset of Engines. If an Engine is ineligible to take the next waiting Task, it will be assigned the first Task it is eligible to take.

The Broker tracks a number of predefined properties, such as available memory or disk space, performance rating (megaflops), operating system, and so forth, that the Discriminator can use to define eligibility. The site administrator can also establish additional attributes to be defined as part of the Engine installation, or attach arbitrary properties to Engines “on the fly” from the Broker.
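The rule that an Engine receives the first Task it is eligible to take can be sketched as follows. This does not use the Discriminator API: the dictionary layout, the "discriminator" key, and the property names "os" and "memory_mb" are all invented for the example.

```python
from typing import Dict, List, Optional

Engine = Dict[str, object]
Task = Dict[str, object]


def first_eligible_task(engine: Engine, waiting: List[Task]) -> Optional[Task]:
    """Return the first waiting Task whose discriminator accepts the Engine."""
    for task in waiting:
        accepts = task.get("discriminator")  # hypothetical predicate on Engine properties
        if accepts is None or accepts(engine):
            return task
    return None
```

For instance, an Engine reporting 512 MB of memory skips a Task that requires at least 1024 MB and takes the next Task in the queue whose condition it satisfies.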

More information on using the Discriminator API can be found in Chapter 9, “Using Discriminators” on page 85 of the GridServer Developer’s Guide.

Redundant Task Rescheduling

Redundant rescheduling addresses the situation in which a handful of Tasks, running on less-capable processors, might significantly delay or prevent Job completion. The basic idea is to launch redundant instances of long-running Tasks. The Broker accepts the first result to return. The remaining instances are not cancelled immediately; they continue until they finish or until the Job finishes. Redundant rescheduling does not apply to Services. It is also unrelated to any other retry/reschedule behavior described above.




By default, redundant Task rescheduling is not enabled. With pools of more capable or nearly identical Engines, fastest Task execution occurs when there is no redundancy from rescheduling. In general, rescheduling is only appropriate when there are widely different capabilities in Engines.

Three separate strategies, running in parallel, govern rescheduling. Tasks are rescheduled whenever one or more of the three corresponding criteria are satisfied. However, none of the rescheduling strategies comes into play for any Service until a certain percentage of Tasks within that Service have completed; the Strategy Effective Percent parameter determines this percentage.

The rescheduler scans the pending Task list for each Service at regular intervals, as determined by the Poll Period parameter. Each Service has an associated taskMaxTime, after which Tasks within that Service will be rescheduled. When the strategies are active (based on the Strategy Effective Percent), the Broker tracks the mean and standard deviation of the (clock) times consumed by each completed Task within the Service. Each of the three strategies uses one or both of these statistics to define a strategy-specific time limit for rescheduling Tasks.

Each time the rescheduler scans the pending list, it checks the elapsed computation time for each pending Task. Initially, rescheduling is driven solely by the taskMaxTime for the Service; after enough Tasks complete, and the strategies are active, the rescheduler also compares the elapsed time for each pending Task against the three strategy-specific limits. If any of the limits is exceeded, it adds a redundant instance of the Task to the waiting list. (The Broker will reset the elapsed time for that Task when it gives the redundant instance to an Engine.)

The Reschedule First flag determines whether the redundant Task instance is placed at the front or the back of the waiting list; that is, if Reschedule First is true, rescheduled Tasks are placed at the front of the queue to be distributed before other Tasks that are waiting. The default setting is false, which results in less aggressive rescheduling.

Each of the three strategies computes its corresponding limit as follows:

• The Percent Completed Strategy waits until the Service nears completion (as determined by the Remaining Task Percent setting), after which it begins rescheduling every pending Task at regular intervals, based on the average completion time for Tasks within the Service.

• The Average Strategy returns the product of the mean completion time and the Average Limit parameter. That is, this strategy reschedules Tasks when their elapsed time exceeds some multiple (as determined by the Average Limit) of the mean completion time.

• The Standard Dev Strategy returns the mean plus the product of the Standard Dev Limit parameter and the standard deviation of the completion times. That is, this strategy reschedules Tasks when their elapsed time exceeds the mean by some multiple (as determined by the Standard Dev Limit) of the standard deviation.
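The limits for the Average and Standard Dev strategies can be computed directly from the completed Task times. The sketch below makes several assumptions: the function names are hypothetical, the population standard deviation is used (this guide does not say which form the Broker uses), and the Percent Completed Strategy is omitted because its interval is not fully specified here.

```python
import statistics
from typing import List


def average_strategy_limit(completed: List[float], average_limit: float) -> float:
    # Average Limit multiplied by the mean completion time.
    return average_limit * statistics.mean(completed)


def std_dev_strategy_limit(completed: List[float], std_dev_limit: float) -> float:
    # Mean plus Standard Dev Limit multiplied by the standard deviation.
    return statistics.mean(completed) + std_dev_limit * statistics.pstdev(completed)


def should_reschedule(elapsed: float, completed: List[float],
                      average_limit: float, std_dev_limit: float,
                      task_max_time: float, strategies_active: bool) -> bool:
    """True if any active limit is exceeded for a pending Task."""
    if elapsed > task_max_time:
        return True
    if not strategies_active or not completed:
        # Before Strategy Effective Percent is reached, only taskMaxTime applies.
        return False
    return (elapsed > average_strategy_limit(completed, average_limit)
            or elapsed > std_dev_strategy_limit(completed, std_dev_limit))
```

For completed times of 10, 10, 20, and 20 with both limits set to 2, the mean is 15 and the standard deviation is 5, so the Average Strategy limit is 30 and the Standard Dev Strategy limit is 25. A pending Task that has run for 26 would therefore be rescheduled once the strategies are active.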


Chapter 6

The GridServer Administration Tool

Introduction

The GridServer Manager provides the GridServer Administration Tool, a set of web-based tools that allow the administrator to monitor and manage the Manager, its Grid of Engines, and the associated job space.

The GridServer Administration Tool is accessed from a web-based interface, usable by authorized users from any compatible browser, anywhere on the network. Administrative user accounts provide password-protected, role-based authorization.

With the pages in the Administration Tool, you can:

• Monitor Service and Task execution and cancel Services

• Monitor Engine activity and kill Engines

• View and modify Manager and Engine configuration

• Install Engines

• Create administrative user accounts and edit user profiles

• Subscribe to get e-mail notification of events

• Edit Engine Tracking properties and change values

• Configure Broker discrimination

• View the GridServer API

• Download the SDK files necessary to integrate application code and run Drivers

• View and extract log information

• View diagnostic reports

• Run Service Tests

Getting Started

The Administration Tool is accessible via HTTP network access from any supported browser that supports JavaScript and Java applets. Make sure that both of these features are enabled in the browser.

FIGURE 6-1: The GridServer Administration Tool.

FIGURE 6-2: The GridServer Administration Tool.



In the browser, open http://hostname:port/livecluster, where hostname is the address of the GridServer Manager and port is the port on which it is listening. The Manager will prompt you for a username and password. If you are running a browser on the same machine that runs the Manager, you can typically open http://localhost:8000/livecluster to begin.

User Accounts and Access Levels

All of the administrative screens require you to first log in with a user account. The GridServer Administration Tool uses a system of tiered access to provide security and enable different users to access different areas of the interface. This is done by assigning different access levels for user accounts.

There are four account access levels: Configure, Manage, Service, and View. The Configure level is for administrators and allows access to any part of the Administration Tool. By default, the admin account you created at installation is set to the Configure level; you can also create accounts with full access for other administrative users.

Other users can be given accounts with more limited access. When a user account with an access level of View, Service, or Manage is used with the Administration Tool, some pages will either function differently, or will not be available.

Creating User Accounts

To create a User Account:

1. Log in to the GridServer Administration Tool using an account that has configure-level access, such as the one created when you first installed GridServer.

2. Click the Admin tab, then click User Admin.

3. On the User Admin page, select Create New User from the Global Actions list. The New User Information page will open.

4. Enter the User Name, a password, and confirm the password. The following information for a username is optional. You can also:

• Enter a first and last name, and an email address for notifications.

• Select an access level. By default, this will be View.

• If you are using Driver Authentication, you can associate a Driver Profile with a user account, so Drivers using the same username as a user account will also use a specified Driver Profile. Select a Driver Profile from the Driver Profile list to do this.

• Select the users that can be viewed with this account. This user will be able to view any Services submitted by the selected users. Services that don't specify a user will default to the hostname of the Driver and can only be viewed by setting Service Username Access to all.

FIGURE 6-3: Creating a User account.



Features Available by Access Level

The following table lists what pages are available in each level:

Service Session Admin methods or actions require the user to have Service Username Access to the Service in question. For example, the Service Session page will only show a user’s Services, and that user can only cancel their own Services.

User account access levels also affect the ability to use GridServer Web Services to programmatically interact with GridServer. For a list of GridServer Web Service objects and methods enabled by access level, see Chapter 10, “GridServer Admin API” on page 89 of the GridServer Developer’s Guide.

Note that access levels don’t filter Services that were submitted before the access level was changed. For example, if a user’s account is changed from Configure to View while a long-running Service was active, the user would still have Configure-level access to that Service.

User Account Security

User accounts can be secured by assigning minimum username and password length, password aging, and other attributes. To configure User security, click the Manager tab, click Manager Configuration, then click Security. The following are configurable: Minimum Username Length, Minimum Password Length, Password Complexity, Password Aging, Password Aging Expiration, and Driver Fails Login With Expired Password.

Note that when a user’s password expires, they are required to provide a new password when they log into the Manager.

Level: Pages

View: Service Session Admin, Service Group Admin, GridCache Admin (view only), Dataset Admin (view only), Propagator Admin (view only), Engine Home, Engine Admin, Engine Install, Driver Admin, Broker Admin, Broker Monitor, Director Monitor, License Information, Discriminator Admin (view only), Engine Configuration (view only), Manager Configuration (view only), and Documentation.

Service: All pages from the View level, plus SDK Download, Cache Configuration (view only), Resource Deployment (view only), Service Test, Engine Admin - Log URL List, Engine Admin - Remote Engine Log, Engine Admin - Search Logs, Engine Daemon Admin, Engine Daemon Admin - Log Url List, Engine Daemon Admin - Search Logs, Event Subscription, Cache Configuration, Hook Admin, Service Session Admin - Cancel Service, Service Session Admin - Cancel All Services, Service Session Admin - Remove Finished Service, Service Session Admin - Remove Finished Services, Service Session Admin - Set Priority, TaskAdmin - Cancel Task, ServiceSessionAdmin - Update Deployment Files, and Service Test.

Manage: All pages from the Service level (with full rights on all Admin pages), plus Discriminator Admin (full rights), Engine Properties, Broker Routing, Event Subscription, Batch Admin, Batch Schedule, Reports (except Direct Query), Engine Configuration (full rights), Manager Configuration (full rights), Cache Configuration, Hook Admin, Current Log, and Diagnostics.

Configure: All pages.
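Because each level includes all pages from the level before it, the four access levels form a simple ordering. The following sketch shows one way to express that tiering; the function name is hypothetical, and it does not model view-only versus full rights or Service Username Access.

```python
# Access levels from least to most privileged, as listed in the table.
LEVELS = ["View", "Service", "Manage", "Configure"]


def has_access(user_level: str, required_level: str) -> bool:
    """True if an account at user_level can reach a page that first
    becomes available at required_level."""
    return LEVELS.index(user_level) >= LEVELS.index(required_level)
```

For example, a Manage account can open a page introduced at the Service level, while a View account cannot.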




Session timeouts are also configured for logins to the GridServer Administration Tool and Admin Web Services. By default, these are set at 60 minutes for Administration Tool logins and 300 seconds for Admin Web Services. To change these values, click the Manager tab, click Manager Configuration, then click Security. Values are located in the Admin User Management section.

Navigating the Administration Tool

The Administration Tool consists of a number of pages, organized in the following ways:

The Home Page

When you first open the Administration Tool, a home page is displayed with links to every page. Click a link to go to that page. You can return to this home page by clicking the Home button in the shortcut buttons.

Tabs

All of the pages in the Administration Tool are arranged under seven tabs, grouped by component or function. Click a tab to display a home page, which contains a description and link for each of the pages available on the tab. You can click a page link to view that page. Each page in a section is also listed in the page bar, which is located below the tab controls.

Below each tab is a bar containing a link to each page that’s on the home page, including the home page itself. This is useful for returning to the home page, or quickly going to another page without first returning to the home page.

Note that if you have gone to a page other than the home page, clicked on another tab, then clicked on the first tab, you will return to the page you previously viewed, not the home page.

The following tabs are available:

• Services - contains pages used to manage, view, and submit Services.
• Engine - contains pages used to manage, view, install, and configure Engines.
• Driver - contains pages used to manage and install Drivers.
• Manager - contains pages used to manage Brokers and configure your Manager.
• Reports - contains pages used to view statistics and events generated by the Manager.
• Admin - contains various administrative pages used to manage users, view logs, edit Manager hooks, and view Documentation.
• Batch - contains links to create, edit, and manage Batches.

FIGURE 6-4: The Administration Tool Tabs.


Shortcut buttons

The shortcut buttons, shown to the right, are displayed in the upper right of each page. The following buttons are available:

• Home - returns to the home page of the Administration Tool.
• License Information - displays information on your GridServer license.
  This button flashes when your license has expired, or when proxy limits are exceeded. You can turn this off on the Manager tab, in the Manager Configuration page, in the Admin section, by setting the property under the License Manager heading to false. You will also get a license warning starting 14 days before your license is due to expire, on the login page.
• Help Index - opens an index of online help topics in a new window.
• Documentation - opens a list of all documentation, including links and a search engine.

Action Controls

Each table item has an action control, which is a list of actions you can choose. Some of these perform actions on table items, while others open a new page.

Links on other pages

Some pages contain shortcut links to other related pages.

Note that only pages that are accessible from the current account are displayed. If you are not using an administrative account with all privileges enabled, some options will not be visible.

Using Tables

Most pages have controls or information grouped in tables. The following controls can be used to sort or reorganize tables for more convenient viewing:

Pager control

The Pager control enables you to step through multiple pages, or specify how many rows appear on a page. Select a page number from the Page list, or select a range from the second list to display those items. You can select a greater number of items listed per page in a table or display all of the items; type a number in the Results Per Page box and click Go.

Search control

The Search control is displayed on any page containing a table. You can use it to search any column of a table. Select a column from the list, enter a search term, and click Go.

FIGURE 6-5: Shortcut buttons.

FIGURE 6-6: The Pager control.

FIGURE 6-7: The Search control.



Personalize Table

The Personalize Table commands enable you to make changes to a table by removing or adding columns. There are two lists that control this:

Add Column: Select the name of a listed column to add it to the table. Columns previously deleted from the table will be listed, along with any optional columns that are not displayed in a table’s default configuration. Columns will be added to the right of existing columns.

Delete Column: Select the name of a column to remove it from the table. Deleted columns will remain hidden to this account, and these settings will be saved for future login sessions.

Tables are always sorted by a column that has an arrow in it, either facing up or down. You can click this arrow to reverse the sort order of a table, or click another column to change the sort column.

Refresh

To update the list and display the most current information in a table, click the Refresh button. You can also select a time value from the Refresh list to automatically refresh the table at a regular interval. To stop automatic refreshes, select none.

Broker and Director Monitors

While pages such as the Service Session Admin page and the Engine Admin page can be used to oversee the running of Services on your Grid, two graphical tools provide a simpler overview of status information on your system. Both Directors and Brokers have a graphical monitor, which can be displayed in its own window.

To display the Director Monitor, click the button to the left in the Administration Tool. Note that this button is not present in Managers that only host a Broker.

To display the Broker Monitor, click the button to the left in the Administration Tool. Note that this button is not present in Managers running only a Director.

Both monitors display up-to-date information on your Grid. The Director Monitor contains graphs with statistics on Engines, Tasks, Services, and machine status, including thread and memory information. The Broker Monitor contains similar information about one specific Broker. To the right is a sample of a Director Monitor for a Grid with three Engines running several Services at once.

Manager Component Indicator

The Manager Component Indicator graphically displays what part of the Manager is controlled by each page within the Administration Tool. Each page’s functionality will control either the entire Manager, a Broker, or a Director.

FIGURE 6-8: The Add and Delete column controls.

FIGURE 6-9: The Director Monitor.


On Manager pages, a red and a blue sphere will be displayed.

If a page’s functionality is tied to a Director, just the red sphere is shown.

If a page’s functionality is for a Broker, just the blue sphere is shown.

Also, the Manager Component Indicator will show the hostname of the related component.

Status Display

The GridServer Administration Tool contains a Status Bar at the top of each page, which contains four Status displays. Each of these displays is updated at each page reload with information about the status of your Grid. The following Status displays are included:

• Busy Engines and Available Engines
• Drivers and Engine Daemons
• Running Services and Finished Services
• Running Tasks and Pending Tasks




Chapter 7

Application Resource Deployment

Introduction

GridServer provides several options for distributing classes, libraries, and other resources to Engines.

A Grid Library (or GL) provides an enterprise solution to managing versioned sets of resources that may be used by multiple services. Grid Libraries provide the following features:

• Version control, including optional automatic selection of the most current version of a Grid Library.
• Resource upgrading without interrupting current Sessions.
• Specification of dependencies on other Grid Libraries.
• Specification of C++ Bridges and non-default JREs via dependencies.
• All-in-one packaging for JARs, native libraries for multiple OSes, .NET assemblies, Command Service executables, and Engine Hooks.
• Specification of Environment Variables and Java System properties.
• Engines that require different compiler support libraries (GCC2/GCC3) can participate in the same Service Session.
• Optimization of Engine restarts.
• Task reservation when an Engine requires a restart.
• Parameterization of package configuration through the use of property substitution files.

The Resource Deployment feature replicates sets of directories from a Manager to Engines to provide a method of copying and managing files. It can be used for Grid Libraries and for the default set of resources. In the simplest sense, this enables you to copy a JAR, DLL, or another resource to each Engine to run a Service.

Remote Application Installation can install and uninstall applications on remote Windows Engines in non-Grid Library deployment.

This chapter details how to use each of these methods of deployment for your GridServer installation.

Grid Libraries

A Grid Library is essentially a set of resources and properties necessary to run a Grid Service, along with configuration information that describes to the GridServer environment how those resources are to be used. For example, a Grid Library can contain JARs, native libraries, configuration files, environment variables, hooks, and other resources.

A Grid Library is deployed as an archive file in ZIP or gzipped TAR format, with a grid-library.xml file in the root that describes the Grid Library. It may also contain any number of directories that contain resources.


Grid Libraries are identified by name and version. All Grid Libraries must have a name, and typically have a version. The version is used to detect conflicts between a desired library and a library that has already been loaded; it also provides for automatic selection of the latest version of a library. A GridServer Service can specify that it is implemented by a particular Grid Library by specifying the gridLibrary and gridLibraryVersion Service Options or Service Type Registry Options.

Grid Libraries can specify that they depend on other Grid Libraries; like the Service Option, such dependencies can be specified by the name, and optionally the version. Also, nearly all aspects of a Grid Library can be specified to be valid only for a specific operating system. This means that the same Grid Library can specify distinct paths and properties for Windows, Linux, and Solaris, but only the appropriate set of package options will be applied at run-time.

Grid Library Format

The Grid Library can be any archive file in ZIP (.zip) or gzipped TAR format (.tgz or .tar.gz), with a grid-library.xml file in the root. Although the filename has no inherent meaning, we recommend the format:

[library name]-[library version].[zip|tar.gz|tgz]

The directory structure is completely up to the user, since the configuration file is used to specify where resources are found within the Grid Library.

The configuration file must be a well-formed XML file named grid-library.xml, and be in the root of the Grid Library.

The GridServer SDKs include a grid-library.dtd file that can be used to validate the XML file. They also include an example Apache Ant build.xml file that can be used to validate and build Grid Libraries. This DTD can also be found at Appendix A, “The grid-library.dtd” on page 99.
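As a quick pre-deployment check, the packaging rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `read_descriptor` is hypothetical, and the check covers well-formedness and the presence of a library name, not validation against grid-library.dtd (the Python standard library does not validate DTDs).

```python
import io
import tarfile
import xml.etree.ElementTree as ET
import zipfile


def read_descriptor(archive_bytes, filename):
    """Return the parsed grid-library.xml root from a ZIP or gzipped TAR archive.

    Raises ValueError if the descriptor is missing from the archive root,
    is not well-formed XML, or does not name the library.
    """
    data = None
    if filename.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
            if "grid-library.xml" in zf.namelist():
                data = zf.read("grid-library.xml")
    elif filename.endswith((".tgz", ".tar.gz")):
        with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tf:
            for member in tf.getmembers():
                # Some tar tools prefix root entries with "./".
                if member.name in ("grid-library.xml", "./grid-library.xml"):
                    data = tf.extractfile(member).read()
    else:
        raise ValueError("unsupported archive type: " + filename)

    if data is None:
        raise ValueError("grid-library.xml not found in archive root")
    try:
        root = ET.fromstring(data)
    except ET.ParseError as exc:
        raise ValueError("grid-library.xml is not well-formed: %s" % exc)
    if root.tag != "grid-library":
        raise ValueError("root element must be grid-library")
    if not (root.findtext("grid-library-name") or "").strip():
        raise ValueError("grid-library-name is required")
    return root
```

For full validation, the DTD and the example Ant build file shipped with the SDKs remain the reference.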

Following is a table that specifies all elements and attributes of the grid-library.dtd file. It uses the XML schema notation for elements and attributes, such as:

[no tag]: Required
?: Optional
*: Optional and Repeatable

| Element | Description | Elements and Attributes |
| --- | --- | --- |
| grid-library | The root element. | ELEMENTS: grid-library-name, grid-library-version?, dependency*, jar-path*, lib-path*, assembly-path*, command-path*, hooks-path*, environment-variables*, java-system-properties*. ATTRIBUTES: os?, compiler? |
| grid-library-name | The library name. All libraries must be named. | |
| grid-library-version | The version. If not specified, 0 is implied. If in comparable format as defined below, it can be used to determine the latest version. | |
| dependency | A library dependency. If the version is not specified, the latest version is chosen at runtime. | ELEMENTS: grid-library-name*, grid-library-version? |
| conflict | Indicates that this library conflicts with the given library. If this Grid Library is NOT a dependency, and grid-library-name="*", then it indicates that this Grid Library conflicts with all other Grid Libraries (aside from its dependencies). | ELEMENTS: grid-library-name* |
| pathelement | An element containing a relative path, typically set to a directory. This element must be in the proper format for the OS. The path is resolved relative to the Grid Library. | |
| jar-path | The JAR path. If specified, all JARs and classes in the path are loaded. | ELEMENTS: pathelement* |
| lib-path | The native library search path. | ELEMENTS: pathelement* |
| assembly-path | The .NET assembly search path. Absolute assembly paths, mapped drives, and UNC paths will not work. | |
| command-path | The path in which the Engine will search for Command Service executables. | |
| hooks-path | Engine hooks library path. Engine Hooks will be initialized at the time the containing Grid Library is loaded. | |
| name | The name of a property | |
| value | The value of a property | |
| property | A name/value pair, used by environment variables and Java System properties. | ELEMENTS: name, value |
| environment-variables | Environment variables to set. | ELEMENTS: property |
| java-system-properties | Java system properties, which are set immediately prior to executing a task using this library. | ELEMENTS: property. ATTRIBUTES: os, compiler |

The following is a list of attributes used above. Valid values can be found in the Product Info page in the GridServer Administration Tool:

| Attribute | Description |
| --- | --- |
| os | The os attribute specifies that it is only applied to this OS. If the attribute is not this operating system (OS), the containing element and its children and content are ignored. |
| compiler | If the attribute is not this compiler, the containing element and its children and content are ignored. |

Variable Substitution

A file can be created that contains variable substitutions, which are substituted into the grid-library.xml file. This allows for quick changes in properties in the grid-library.xml file without redeploying the Grid Library.

You can have a default properties file in your Grid Library called grid-library.properties that can provide baseline values for your variables. You can also create an external properties file, named with the same name as the Grid Library archive, with the extension .properties, and place it in the Grid Library deployment directory. External properties will substitute over those in the Grid Library.

If the grid-library.xml file contains a property with a value contained with the $ character, such as $mydir$, and the properties file contains an assignment, such as mydir=c:\\dir, the variable is substituted.

NOTE: Substitutions are allowed within the content of property value elements and pathelements only. If the substitution is not found in the file, the empty string, "", is substituted.

Substitutions are allowed anywhere in a string. Multiple substitutions per string are allowed. $ characters can be treated as literals by escaping them with another $ character. Windows paths that are specified in the [library].properties file must escape the \ character with another \.

Versioning

Versioning provides the following functionality:

• It allows for deployment of new versions of libraries and deletion of old versions without interrupting currently executing Service Sessions.
• It provides for specifying conflicts, or libraries that cannot coexist with each other.
• It allows for a Service Session or dependency to specify the use of the latest version of a Grid Library.

To use versioning, you must specify the Grid Library version in the configuration file. An Engine can load only one version of the library with the same name at any time. If the version is not specified, it is implied to be 0.

While the version can be any String, if it follows the proper comparable version format it can also be used to determine the latest version of the library, for automatic loading. This format is

[n1].[n2].[n3]...

where nx is an integer, and there may be one or more version points.

For instance, 4.0.1.1, 4.1, 3

are in the proper comparable version format.

The integers at each version point are compared starting at the first point, continuing until a version point in one version is greater than the corresponding point in the other. If a version point does not exist for one, it is implied as zero.

For instance:

4.0.0.1 > 4.0
4.0.0.5 < 4.0.1.1

To specify that a dependency or Service use a particular version of a Grid Library, the version field is set to that value. To specify that it use the latest version, the field is left blank.

If a version is specified but not in this format, and there are multiple versions of a library, the “latest version” is undefined. Thus, automatic selection of the latest version is only possible when all Grid Libraries with the specified name provide a version in the proper format.

Note that automatic versioning is dynamic. That is, if a Service or dependency specifies the latest version, and a new version of a Grid Library is deployed, the next time that Grid Library is used by any Session it will be the new version.
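The comparison rules above can be sketched as follows. This is a minimal illustration of the documented rules, not the Engine's actual implementation; the function names are hypothetical.

```python
import re
from itertools import zip_longest

# Comparable format: one or more integers separated by dots, e.g. 4.0.1.1
_COMPARABLE = re.compile(r"[0-9]+(\.[0-9]+)*")


def is_comparable(version):
    """True if the version string follows the [n1].[n2].[n3]... format."""
    return _COMPARABLE.fullmatch(version) is not None


def compare_versions(a, b):
    """Return 1 if a > b, -1 if a < b, 0 if equal. Missing points count as zero."""
    points_a = [int(p) for p in a.split(".")]
    points_b = [int(p) for p in b.split(".")]
    for x, y in zip_longest(points_a, points_b, fillvalue=0):
        if x != y:
            return 1 if x > y else -1
    return 0


def latest_version(versions):
    """Return the latest version, or None when the latest version is undefined.

    The latest version is undefined if any candidate is not in comparable format.
    """
    if not versions or not all(is_comparable(v) for v in versions):
        return None
    best = versions[0]
    for candidate in versions[1:]:
        if compare_versions(candidate, best) > 0:
            best = candidate
    return best
```

Note that under these rules 4.0 and 4.0.0 compare as equal, since the missing third point is implied as zero.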

Dependencies

Grid Libraries may specify dependencies on other Grid Libraries. A dependency specification resolves to a particular Grid Library using two values:

• grid-library-name: The name of the Grid Library, as specified in the dependency’s XML.
• grid-library-version: The version of the Grid Library, as specified in the dependency’s XML. If not specified, it will use the latest version supported by the OS. OS compatibility is determined by checking the os and compiler tags for the top-level element in the dependent Grid Library.

Note that if a dependency resolves to more than one Grid Library, the dependency used is undefined.

Two dependent libraries conflict if they have the same library name, but different versions.



Conflicts

A conflict between two Grid Libraries means that these libraries cannot be loaded concurrently. When there is a conflict between a loaded Grid Library and a Grid Library required by a Service, the Engine must restart to unload the current libraries and load the requested library.

The following circumstances result in a conflict:

Version Conflict

The most common conflict arises via versioning, and typically when upgrading versions or using more than one version of the same library concurrently. This conflict arises when a Grid Library with the same grid-library-name as the requested Grid Library, but different version, is loaded.

Explicit Conflict

There can be situations in which different Grid Libraries can conflict with each other due to conflicting native libraries, different versions of Java classes, and so on. Because the Engine cannot determine these implicitly, the conflict element can be used to specify Grid Libraries that are known to conflict with this Grid Library.

Additionally, the value of the grid-library-name can be set to "*". This means that this Grid Library can conflict with all other Grid Libraries (aside from its dependencies), and it is guaranteed that no other Grid Libraries will be loaded concurrently with this Grid Library. Note that this is only allowed if the Grid Library is not a dependency; if the "*" is used as a conflict in a Grid Library that is a dependency, a verification error will occur.

Dynamic Version Conflict

A Grid Library conflict occurs if dynamic versioning is used, and the latest version of a Grid Library or Grid Library dependency has changed due to an addition or removal of a dependency since the Grid Library has been loaded.

Variable Substitution Conflict

A Grid Library conflict occurs if its variable substitution file has changed since it has been loaded.
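The version and explicit conflict rules can be illustrated with a small sketch. The `Lib` type and `find_conflict` function are hypothetical names introduced here; the sketch considers only direct dependencies and leaves out the dynamic version and variable substitution conflicts, which depend on state tracked by the Engine.

```python
from dataclasses import dataclass


@dataclass
class Lib:
    """Simplified stand-in for a Grid Library descriptor (illustrative only)."""
    name: str
    version: str = "0"                   # an unspecified version is implied to be 0
    conflicts: frozenset = frozenset()   # names from conflict elements; may contain "*"
    dependencies: frozenset = frozenset()


def _declares_conflict(a, b):
    """True if library a declares a conflict with library b."""
    if b.name in a.conflicts:
        return True
    # "*" conflicts with every other library except a's own dependencies.
    return "*" in a.conflicts and b.name not in a.dependencies


def find_conflict(loaded, requested):
    """Return a description of the first conflict found, or None if there is none."""
    for lib in loaded:
        if lib.name == requested.name:
            if lib.version != requested.version:
                return "version conflict with %s %s" % (lib.name, lib.version)
            continue
        if _declares_conflict(lib, requested) or _declares_conflict(requested, lib):
            return "explicit conflict with " + lib.name
    return None
```

When a conflict is reported, the documented behavior is for the Engine to restart and load only the requested library.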

Grid Library Loading

When a Service Session is set to use a Grid Library, that library is loaded. Loading is the process of setting up all resources in the Grid Library for use by the Service. A library is loaded only once per Engine session.

First, the library loads itself, and then it loads all dependencies. Libraries are loaded depth-first rather than breadth-first. Certain aspects of a load may require a restart, and possibly re-initialization of the state. The following steps are performed by a load of the root library and all dependencies:

1. Checks for conflicts with currently loaded Grid Libraries. If there is a conflict, the Engine will restart with the requested Grid Library and clear out the current state of any loaded libraries.

2. If new lib-paths have been added for its OS, they will be appended to the current list of lib-paths, and the Engine will restart. The state of loaded libraries will include all libraries already loaded, plus the requested library. Note that specifying a JRE dependency has this effect.

3. If new jar-paths have been added for its OS, the jars and classes will be added to the classloader.

4. If new assembly-paths have been added, it will add them to the .NET search path.

5. If new command-paths have been added for its OS, they are added to the search path for Command Tasklets.

6. If new hooks-paths have been added, any hooks in the path will be initialized.

7. If the default resources are current and a Grid Library is requested, the Engine will restart.
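The load order described above (the library itself first, then its dependencies, depth-first, each library loaded only once) can be sketched as a pre-order traversal. The function name and the dictionary representation of dependencies are assumptions made for illustration.

```python
def load_order(root, dependencies):
    """Return library names in the order they would be loaded.

    dependencies maps a library name to the ordered list of names it depends on.
    Each library is visited before its dependencies (depth-first), and a library
    that has already been loaded is not loaded again.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        order.append(name)            # the library loads itself first
        for dep in dependencies.get(name, []):
            visit(dep)                # then each dependency, depth-first

    visit(root)
    return order
```

For a library MyLib that depends on A and B, where A depends on C, this yields MyLib, A, C, B rather than the breadth-first order MyLib, A, B, C.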

State Preservation

In most cases, when an Engine shuts down, it preserves the current state of which Grid Libraries it has loaded. When it starts back up, it loads all Grid Libraries that were loaded when it shut down. As Grid Libraries are loaded, the pathelements they contain are added to a ‘master’ list of paths for that type of pathelement. For example, if a Grid Library contains a lib-path specification, that lib-path is appended to the list of lib-path values obtained from already-loaded Grid Libraries.

Note that this means that it is up to the creator of the Grid Libraries deployed on the Grid to ensure that the ordering of library paths does not lead to loading the wrong library.

For example, if two different Grid Libraries each provide DLLs in their lib-paths that share the same name, because of OS-specific library load conventions, the one that will be used will be the first one found in the aggregate lib-path from across all loaded Grid Libraries. Likewise for Java classes, when more than one copy of the same class is in the classloader, it is undefined which class will be loaded. Therefore it is important to either subdivide Grid Libraries appropriately when such conflicts could arise, or to use the conflict element to explicitly state conflicts.
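One way to spot such collisions before deployment is to scan the lib-path directories of the libraries that may be loaded together and report any file name provided by more than one of them. The sketch below is illustrative; `find_shadowed` is a hypothetical helper, and exact name matching is an assumption (Windows file names are case-insensitive, which this does not account for).

```python
import os
from collections import defaultdict


def find_shadowed(lib_paths, list_dir=os.listdir):
    """Map each file name found in more than one directory to those directories.

    lib_paths is the aggregate lib-path in load order, so the first directory
    listed for a name is the copy that would be found first.
    """
    providers = defaultdict(list)
    for directory in lib_paths:
        for name in list_dir(directory):
            providers[name].append(directory)
    return {name: dirs for name, dirs in providers.items() if len(dirs) > 1}
```

Any name reported here is a candidate for splitting into a separate Grid Library or for an explicit conflict element.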

If an Engine shuts down due to a conflict, it clears the current state and sets up for only the requested Grid Library upon restart. This is referred to as preloading. If an Engine shuts down due to internal library inconsistencies or a crash, the state is not saved. State is also cleared on all instances for file updates, Daemon restarts, and Daemon disable.

Task Reservation

If an Engine requires a restart to load a Grid Library, the task will be reserved on the Broker for that Engine. The Engine is instructed to log back into the same Broker, and will take that task upon login. The timeout for this is configurable on the Broker on the Manager Configuration page, in the Services section.

Environment Variables and System Properties

All Environment variables and Java System properties for a Grid Library and all dependencies will be set each time a task is taken from a particular service that specified that Grid Library. (They are not cleared after the task is finished.) Environment variables are set via JNI so that they can be used by native libraries or .NET assemblies, and they are also passed into Command Services. Note that environment variables such as PATH and LD_LIBRARY_PATH should not be changed through this mechanism. Rather, lib-path and command-path are reserved for manipulating these variables.

Using Grid Libraries from a Service

Services can specify a Grid Library to use by setting the GRID_LIBRARY and optionally the GRID_LIBRARY_VERSION Service Options. This would typically be set by Service Type in the Service Registry page, although it can be set programmatically on the Session. Jobs can specify a Grid Library to use by setting the corresponding JobOption values. If the version is not set, a Service will use the latest version of a Grid Library.



If a Service needs to find resources in a Grid Library, it can use the Grid Library Path. This value is a path value that includes the root directories of all Grid Libraries currently loaded. This path can be retrieved in the following way:

• ds.GridLibraryPath: Java System property, .NET System.AppDomain.CurrentDomain data entry
• ds_GridLibraryPath: Command Service, native library Service environment variable
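As an illustration, a Command Service written as a script could locate a resource by walking the ds_GridLibraryPath environment variable. This sketch assumes the value uses the platform's usual path separator (os.pathsep); the separator is not stated above, so verify it on your installation. The helper name is hypothetical.

```python
import os


def find_in_grid_library_path(relative_name, env=os.environ):
    """Return the first existing path for relative_name under a Grid Library root.

    Reads the ds_GridLibraryPath environment variable and checks each root
    directory in order. Returns None if the variable is unset or nothing matches.
    """
    raw = env.get("ds_GridLibraryPath", "")
    for root in raw.split(os.pathsep):   # assumption: platform path separator
        if not root:
            continue
        candidate = os.path.join(root, relative_name)
        if os.path.exists(candidate):
            return candidate
    return None
```

A call such as find_in_grid_library_path("config/settings.xml") would return the first match across all loaded Grid Libraries, so resource names should be unique across libraries that may be loaded together.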

Deployment

Grid Libraries are typically deployed by placing them in the Grid Library deployment directory on the Primary Director. The Resource Manager will then replicate these libraries to all Engines. Variable Substitution property files also should be placed in this directory.
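Before placing a properties file in the deployment directory, it can be useful to preview what it will resolve to. The following sketch applies the Variable Substitution rules described earlier in this chapter: $name$ tokens are replaced, unknown names become the empty string, $$ is a literal $, and external properties override the defaults packaged in the library. The parser handles only simple key=value lines, which is a simplification of the Java properties format, and both function names are hypothetical.

```python
import re

# A token is $name$; an empty name ("$$") is an escaped literal dollar sign.
_TOKEN = re.compile(r"\$([^$]*)\$")


def parse_properties(text):
    """Parse simple key=value lines, unescaping doubled backslashes."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip().replace("\\\\", "\\")
    return props


def substitute(text, variables):
    """Replace every $name$ in text; names not found become the empty string."""
    def replace(match):
        name = match.group(1)
        if name == "":
            return "$"
        return variables.get(name, "")
    return _TOKEN.sub(replace, text)


# Defaults packaged in the library, overridden by the external file.
defaults = parse_properties("mydir=c:\\\\dir\nWinVar=base")
external = parse_properties("WinVar=override")
merged = {**defaults, **external}
resolved = substitute("$mydir$\\lib", merged)   # the string c:\dir\lib
```

The merge order in the last lines reflects the rule that external properties substitute over those in the Grid Library.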

Grid Libraries are special resources, in that adding or removing Grid Libraries or property files will not result in an Engine and Daemon restart, like other resources. This is because it is not necessary to restart until the Engine actually needs to use the Grid Library, and even then only if necessary according to the loading procedure. Note that if a Grid Library is changed, the Daemon and Engines will restart like they would in the case of a change to any other resource. Also, it is the responsibility of the user not to delete Grid Libraries via the Resource Deployment page that have been loaded by active Services, as that may lead to library load failures for subsequently executed Tasks.

If you are not using the Resource Manager for replication, you can use an alternate shared Grid Library directory. You must then set the Grid Library Path in all Engine Configurations to point to this directory, instead of the default replicated location. When changes are made to this library, you must then use the Update button on the Resource Deployment page on the Primary Director. This will send a message to all Engines to check and update their Grid Libraries via the Grid Library Manager.

Grid Library Manager

The Grid Library Manager exists on all Engines, and is responsible for maintaining the state of all Grid Libraries deployed. Whenever any change is made to the Grid Library directory (typically due to replication), the Grid Library Manager will update the local status as follows:

1. Any new Grid Library files are unzipped to a directory with the name corresponding to the file name. This new library will be added to the Grid Library Manager’s catalog, but not loaded until needed.

2. If a Grid Library is removed, it will delete the local copy of the zipped Grid Library and the unzipped directory.

3. Variable substitution files are copied into the appropriate directory. If a variable substitution file has been changed, and the corresponding Grid Library has already been loaded, it is marked as dirty so that the next time an Engine attempts to use it, it will restart due to conflict.

4. If any Grid Library uses a latest version in the Grid Library’s catalog, and the latest version has changed, it is marked dirty so that the next time an Engine attempts to use it, it will restart due to conflict.

The Grid Library Manager locks the directory while making any changes, so that if multiple Engine instances are running or multiple Engine Daemons are running from a shared Engine directory, only one Engine will perform any file manipulation. Other Engines will wait until those operations are completed, and then their Grid Library Managers will update their links appropriately.
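The first three update steps amount to comparing the previous directory listing with the current one. The sketch below is a rough model of that bookkeeping, not the Grid Library Manager's implementation: the function names are hypothetical, directory contents are represented as dictionaries of file name to modification stamp, and a properties file is assumed to share the archive's name with the archive suffix removed.

```python
ARCHIVE_SUFFIXES = (".zip", ".tar.gz", ".tgz")


def base_name(filename):
    """Return the archive name without its suffix, or None if not an archive."""
    for suffix in ARCHIVE_SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)]
    return None


def plan_update(known, present, loaded):
    """Compute the actions implied by a change to the Grid Library directory.

    known and present map file names to modification stamps (before and after).
    loaded is the set of library base names currently loaded by the Engine.
    """
    # Step 1: new archives are unpacked and cataloged, but not loaded.
    unpack = sorted(n for n in present if base_name(n) and n not in known)
    # Step 2: removed archives have their local copy and directory deleted.
    delete = sorted(n for n in known if base_name(n) and n not in present)
    # Step 3: a changed substitution file marks an already-loaded library dirty.
    suffix = ".properties"
    dirty = sorted(
        n[: -len(suffix)]
        for n, stamp in present.items()
        if n.endswith(suffix)
        and known.get(n) != stamp
        and n[: -len(suffix)] in loaded
    )
    return {"unpack": unpack, "delete": delete, "dirty": dirty}
```

A library marked dirty is not reloaded immediately; as described above, the restart happens the next time an Engine attempts to use it.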


C++ Bridges

C++ Bridges are the native bridges that allow Engines to execute native Services. They are packaged as Grid Libraries, named cppbridge-[os]-[compiler]-[M]-[m], where M and m are the GridServer major and minor version numbers. All C++ Bridges are pre-packaged and deployed in the Grid Library replication directory upon GridServer Manager installation or upgrade.

Only one version of a bridge can be loaded at any given time, so all bridges for a particular platform are built to explicitly conflict with each other. For example, a Service that uses VC7.1 conflicts with one that uses VC7.0.

JREs

JREs will be packaged as jre-[os]-[version].glz. The Grid Library name will be jre-[os], and the version will be the JRE version, for example, 1.4.2.06. DataSynapse will package JREs for customers as needed, or as they become available; contact DataSynapse support for details.

Grid Library Example

The following example grid-library.xml is for a mixed Java/C++ application that runs on Windows, and on Linux with both gcc2 and gcc3:

Example 7.1: grid-library.xml example

<?xml version="1.0" encoding="UTF-8"?>
<grid-library>
  <grid-library-name>MyLib</grid-library-name>
  <grid-library-version>1.0.0.1</grid-library-version>
  <lib-path os="linux">
    <pathelement>lib/gcc2</pathelement>
  </lib-path>
  <lib-path os="linux" compiler="gcc3">
    <pathelement>lib/gcc3</pathelement>
  </lib-path>
  <dependency>
    <grid-library-name>cppbridge-vc6</grid-library-name>
  </dependency>
  <dependency>
    <grid-library-name>cppbridge-gcc3</grid-library-name>
  </dependency>
  <dependency>
    <grid-library-name>cppbridge-gcc2</grid-library-name>
  </dependency>
  <dependency>
    <grid-library-name>jre-win32</grid-library-name>
    <grid-library-version>1.4.2.06</grid-library-version>
  </dependency>
  <dependency>
    <grid-library-name>MyCalculator</grid-library-name>
  </dependency>
  <hooks-path>
    <pathelement>hooks</pathelement>
  </hooks-path>
  <jar-path>
    <pathelement>jars</pathelement>
    <pathelement>morejars</pathelement>
  </jar-path>
  <lib-path os="win32">
    <pathelement>lib\win</pathelement>
    <pathelement>s:\lib\win</pathelement>
  </lib-path>
  <environment-variables os="win32">
    <property>
      <name>MY_WIN_VAR</name>
      <value>$WinVar$</value>
    </property>
  </environment-variables>
  <environment-variables os="linux" compiler="gcc3">
    <property>
      <name>MY_GCC3_VAR</name>
      <value>$LinuxDriverDir$</value>
    </property>
  </environment-variables>
  <java-system-properties>
    <property>
      <name>foo</name>
      <value>bar</value>
    </property>
  </java-system-properties>
</grid-library>

Legacy Resource Deployment

When it is not necessary or optimal to use Grid Libraries, a default set of resources is also available for use by Engines. For instance, a Grid with only a small number of applications that do not require uninterrupted upgrading may not require Grid Libraries. Also, developing and testing GridServer applications is typically easier using the default resources.

Using Default Resources

Default resources are used when a Service does not specify a Grid Library. They cannot be used concurrently with Grid Libraries, so the default resources can be thought of as a non-versioned Grid Library that conflicts with all other Grid Libraries. Also, rather than using a grid-library.xml file, it uses the Engine Configuration to specify paths.


When using Default Resources, the following Engine Configuration properties take effect; when using Grid Libraries, they do nothing:

• Environment Variables
• Default JAR and Class Path
• Default Library Path
• Common Library Path
• Default Hook Path

Default Resource Paths

The paths used by Default Resources are set in the Engine Configuration, in the Classes, Libraries, and Paths section. By default, these paths are set to replicated resource locations. Following is a list of the paths, and analogs to Grid Libraries:

• JAR and Class Path: The jar-path
• Library Path: The lib-path and assembly-path (for Windows)
• Hooks Path: The hooks-path

C++ Bridges

C++ Bridges are used by simply including the bridge libraries in the Library Path. These libraries are installed by default when the Manager is installed or upgraded, into the default library path. Note that this means that only one version of a bridge may be used. For example, when using the default resources, you cannot use both VC6 and VC7 services for the same Engine configuration.

Grid Library features not supported by Default Resources

The following features are unique to Grid Libraries and cannot be utilized when using Default Resources:

• JRE: Only the default JRE can be used.
• System Properties: Not supported, although they can be set via an Engine Hook or in the Service implementation.
• Environment Variables: Not supported, although they can be set via an Engine Hook or in the Service implementation via JNI.
• Daemon and Engine restart optimization: When default resources are changed, all Engines and Daemons will restart to update those resources.
• Variable Substitution: Not supported.

Code Versioning Deprecation

Code Versioning has been replaced by Grid Libraries as of GridServer version 4.1.




To support migration from Code Versioning to Grid Libraries without changing the client implementation, the following is done: if the CODE_VERSION option is set for a Service, the GRID_LIBRARY value is set to that value.

To migrate, then, you must at minimum perform the following so that legacy clients work correctly:

1. Package all Code Version directories as Grid Libraries with grid-library-name=codeVersion.
2. If any directories include C++ Bridge DLLs, remove them and replace with the proper bridge dependency.
3. If Code Versions conflict with each other, use the conflict element. If all Code Versions conflict with each other, you can simply use the "*" conflict value.

Note that these instructions are the minimum necessary to migrate from Code Versions to Grid Libraries without changing existing client code. As client code is changed, you may find a more optimal division of resources into dependencies.

Resource Deployment: Distributing Grid Libraries and Default Resources

The GridServer system provides a Resource Deployment mechanism for securely distributing Grid Libraries and resources, such as libraries (.dll or .so), Java class archives (JAR), binaries, or large data files that change relatively infrequently. The resources to be deployed are placed within a reserved directory on the Primary Director. The system maintains a synchronized replica of the reserved directory structure for all Engines. The replica of files on the Director is synchronized to Brokers, and then Brokers synchronize the files with Engines. The files are secure in that they cannot be accessed by anyone on the network other than the Engines.

The Resource Deployment Interface

The GridServer Administration Tool provides a graphical interface to manage resources synchronized to Engines. To manage resources, on the Primary Director click the Services tab in the Administration Tool, and click Resource Deployment. The Resource Deployment page, shown in Figure 7-1, features a file browser that can be used to navigate the replicated directories, create new directories, and add or delete files.

To navigate the directories, simply click the displayed file names or the directory names in the current directory, displayed above. You can add new files to a directory by entering a filename and clicking the Upload button, or clicking the Browse button to find files on your computer. Once you have added new files, you can click Update to update the files to your Engines.

Resource Deployment File Locations

The resources directory contains a directory for each Engine OS; each of these is deployed only to Engines with the respective operating system. The gridlib and shared directories are deployed to all Engines.

FIGURE 7-1: The Resource Deployment page.


The default locations for these directories, relative to the livecluster base directory, are in the deploy/resources directory. Files in the resources directory itself are not deployed.

The corresponding Engine-side directory is located under the root directory for the Engine installation, for example, C:\Program Files\DataSynapse\Engine\resources for Windows; or /usr/local/DSEngine/resources for Unix.

There are two reserved file patterns: those that contain a #, and those that end in .tmp. You cannot deploy resources that match either pattern, as they will cause problems with the replication mechanism.
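The two reserved patterns can be checked before files are copied into the deployment directory. The following is a minimal sketch in Python, not part of GridServer; the function names are hypothetical, and it assumes the match is case-sensitive, which this guide does not state.

```python
def is_reserved_name(filename: str) -> bool:
    """Return True if the name contains '#' or ends in '.tmp'.

    Assumption: matching is case-sensitive; the guide does not say.
    """
    return "#" in filename or filename.endswith(".tmp")


def deployable(filenames):
    """Keep only the names that do not match a reserved pattern."""
    return [name for name in filenames if not is_reserved_name(name)]
```

A pre-deployment script could call deployable() on a directory listing and report any names that were filtered out.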

Configuring Directory Replication

The system can be configured to trigger updates of the replicas in one of two modes:

• Automatic update mode. The resources will automatically be deployed to any Engine upon login to the Broker. Also, the Manager continuously polls the file signatures within the designated subdirectories at the time interval specified in Monitor Interval, and triggers Engine updates whenever it detects changes; to update the Engines, the system administrator need only add or overwrite files within the directories. This is the default update method.

• Manual update mode. The administrator ensures that the correct files are located in the designated subdirectories and triggers the updates manually by issuing the appropriate command in the GridServer Administration Tool. Updates also take place at startup.
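The idea behind automatic update mode, polling file signatures and reacting to differences, can be illustrated with a short Python sketch. This is only an illustration: the guide does not say how the Manager computes signatures, so the SHA-256 digest and the function names here are assumptions.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file's path (relative to root) to a digest of its contents.

    Assumption: a SHA-256 content hash stands in for the "file signature".
    """
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed(old, new):
    """Return the sorted paths that were added, removed, or modified."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

A poller would take a snapshot at each interval and trigger an update whenever changed() returns a non-empty list.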

To configure manual updating:

1. Click the Manager tab, then click Manager Configuration.
2. Under Broker Resources and Director Resources, set Monitor Interval for both to 0.

There are two different ways to update files to Engines manually:

1. Click the Services tab, then click Resource Deployment.
2. Click Update.

or:

1. Click the Engine tab, then click Engine Admin.
2. Click Update Deployment Files on the Global Actions menu.

Either of these actions will cause all Engines to update. If you have installed new files and want all Engines to use them immediately, use either of these methods.

During rapid Java development, an alternative to file updating is the use of the JAR_FILE Service Option to dynamically attach a local JAR file to the Service. By default, this option is not available for security reasons, and has certain restrictions.

Using Engines with Shared Network Directories

Instead of using directory replication, you can also provide Engines with common files through a shared network directory, such as an NFS mounted directory. To do this, you must provide a directory on a shared server that can be accessed from all of the Engines. Then the Engines must be configured to use that location. Click the Engine tab in the Administration Tool, click Engine Configuration, and change the directories appropriately.




JAR Ordering File

If you are using multiple JAR files and need the classloader to load them in a specific order to prevent conflicts, you can specify the order in which they are loaded. To do this, create a file called index.libs in the JAR path root and put the names of JAR files, one per line, in the order in which they should be loaded. Those not in the list will be loaded afterwards, in no specified order.
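The ordering rule for index.libs can be sketched in Python as follows. This is an illustration of the rule described above, not the Engine's actual loader; the function name is hypothetical, and unlisted JARs are simply kept in their input order because the guide leaves that order unspecified.

```python
def order_jars(available, index_lines):
    """Return JAR names in load order: listed names first, the rest after.

    available   -- JAR file names found in the JAR path root
    index_lines -- lines of index.libs, one JAR name per line
    """
    listed = [line.strip() for line in index_lines if line.strip()]
    present = set(available)
    ordered = []
    for name in listed:
        # Skip names that are not actually present, and ignore duplicates.
        if name in present and name not in ordered:
            ordered.append(name)
    # Unlisted JARs follow; the guide gives no order, so input order is kept.
    rest = [name for name in available if name not in ordered]
    return ordered + rest
```

For example, with JARs c.jar, a.jar, and b.jar present and an index.libs that lists b.jar and then a.jar, the resulting order is b.jar, a.jar, c.jar.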

Remote Application Installation

The Windows Deployment Scripting Language provides a mechanism by which programs can be executed in conjunction with file updating on Windows Engines. This can be used for such purposes as registering COM DLLs and .NET assemblies, running Microsoft Installer packages, and so on. It runs an installation command when the script is added, and when any dependent files are modified. It can also run an uninstallation command when the script is removed. Note that the Remote Application Installation feature does not work with Grid Libraries.

A deployment script is a file named dsinstall.conf in a resource subdirectory. This is a reserved filename, and the Engine Daemon interprets any file with this name as a deployment script. The script is a properties file, with name and value pairs that govern the command execution.

Typically, the script is placed, with associated files, in its own subdirectory of the win32 deployment directory. This will be referred to as the installation directory.
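As an illustration of the naming convention, the following Python sketch lists every directory under a deployment root that contains a file named dsinstall.conf; each such directory would be an installation directory. The function name is hypothetical, and how the Engine Daemon itself scans for scripts is not described in this guide.

```python
import os


def find_installation_dirs(win32_root):
    """Return the sorted directories under win32_root that hold a dsinstall.conf.

    Assumption: the reserved filename is matched exactly as written.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(win32_root):
        if "dsinstall.conf" in filenames:
            found.append(dirpath)
    return sorted(found)
```

Running this against a local copy of the win32 deployment directory gives a quick inventory of the deployment scripts it contains.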

The following properties are provided:

Property | Description
install_cmd | The installation command. The command should be either in the current directory or the resources/win32/lib directory; you can also specify the full path to a command. This command is run when the dsinstall.conf file is added or modified, and when any dependency is modified.
workdir | Working directory from which the commands are launched. The directory is relative to the installation directory.
uninstall_cmd | Optional. The uninstall command. This is executed when the script is deleted, or prior to subsequent runs of the install command if uninstall_first is true. Supporting files for the uninstall script may be deleted along with the script; the command is executed prior to local deletion of the files. Typically an uninstall is performed by simply removing the entire installation directory.
dependfiles | Comma-delimited list of file names that the script depends on. The files are relative to the installation directory. If any of these files change on a file update, the install command is re-run. A file may contain wildcards only as replacements for the entire name or extension, such as *.dll, *.*, or file.*.
waittime | Number of seconds to wait for install/uninstall command to finish. The default is 30 seconds. If this time is exceeded, the process running the command is killed.
uninstall_first | Optional. If true, the uninstall command will always be run prior to the install command, except for the first time the install command is run. This is for situations in which you need to uninstall software prior to reinstallation.
success_exit_codes | Optional. Comma-delimited list of exit code values that indicate successful command execution. If the exit code does not match any value, an error will be logged with the failure code, and the next time the Daemon restarts it will retry the installation. If this property is not set, exit codes are ignored.
disable_on_fail | Whether an Engine Daemon should disable itself upon the failure of an install. The default is false if not specified in the conf file. When the value is true, the Engine Daemon will disable itself if the exit code returned by the installation is not in the success exit codes.

The : and \ characters must be escaped with a backslash (\) character in the dsinstall.conf file. Also, you should not rename the dsinstall.conf file.

The following is an example of a script that installs a Microsoft Installer package:

Example 7.2: A Microsoft Installer Package Installation Script

dsinstall.conf:
    dependfiles=install.bat,uninstall.bat,mypackage.msi
    workdir=.
    waittime=30
    uninstall_first=true
    install_cmd=install.bat
    uninstall_cmd=uninstall.bat
    success_exit_codes=0

install.bat:
    %SystemRoot%\system32\msiexec /q /i mypackage.msi ALLUSERS=1

uninstall.bat:
    %SystemRoot%\system32\msiexec /q /x mypackage.msi ALLUSERS=1

These three files, plus the mypackage.msi file, are all placed in a subdirectory under win32. Note that the uninstall_first property is used to uninstall the previous version of the software whenever the package is changed. To uninstall the software, simply remove the entire installation directory; the uninstallation is performed prior to deleting the files.

Service Run-As

There are often cases where Services require specific user permissions in order to access needed resources. By creating the Engine process as a given user, all Service invocations executed by the Engine can operate with these permissions. Service Run-as (or RA) allows for specification of authentication domain accounts under which Service invocations will execute.

By default, all RA credentials are authenticated on the Engine Daemon in order to verify that the credentials are valid for the Engine’s authentication domain. Service RA authentication may be disabled on the Broker, but in most installations this is discouraged unless there is a specific reason for doing so. If Service RA authentication is disabled, then Driver user authentication should be enabled to prevent unauthorized users from submitting Services that may run under arbitrary accounts. Also note that while disabling this authentication step removes the need for passwords, such Services may only run on Unix Engines due to restrictions in the Windows API.

Note that Service Run-As only supports the Service model; there is no support for RA using the legacy Job API.

Types of Credentials

There are two ways in which Service Run-as credentials may be specified for a given Service:

Stored Credentials

Service Run-as credentials are entered on the Director with the GridServer Administration Tool and are synchronized with all Brokers. These credentials are linked to Services in the Service Type Registry by specifying the username in the RunAsUser field. Credentials in the repository consist of a username and a password. The username may be in Windows DOMAIN/username format if domain-specific authentication is required. This domain is ignored by Unix Engines.

“Pass through” Credentials

The Driver provides the username of the current Principal that is logged in and is running the Driver. The password is provided as a DriverManager property, CURRENT_USER_PASSWORD. These are referred to as “pass through” credentials. A password set on the Driver is required in order to prevent user account spoofing between authentication domains (for example, logging in as a local user on the Driver machine to pose as an LDAP user in the credentials DB).

“Pass through” credentials are indicated for a Service in the Service Type Registry with the $ token. This token is substituted with the username of the current principal that is executing the Driver process. The token may also be prepended with a Windows domain if domain specific authentication is required. This domain is ignored by Unix Engines.

Using Run-As

To use Run-As, you must do three things: set up Engines, add credentials, and associate credentials with Service Types.

Engine Setup

To set up Engines for Service RA:

Unix Engines

For Unix Engines, from the DSEngine directory, after running configure.sh, but before you start the Engine for the first time, do the following:

1. Change mode of all files to be group read/writable:
    find . | xargs chmod g+u
2. Change ownership of the invokeRA program to root, and change it to be set UID:
    sudo chown root bin/invokeRA
    sudo chmod +s bin/invokeRA
3. Set the Engine user’s umask to make these permissions the default:
    umask 002
4. Start the Engine:
    ./engine.sh

Windows Engines

For Windows Engines:

1. Right-click the Engine’s install directory, select Properties, and under the Security tab use Add... to add all users that you intend to run Services as.
2. Select the Allow check box for Full Control.
3. From the Start menu, click Settings, then Control Panel, then Administrative Tools, then Services. Right-click the Service running the Engine and select Properties. You will need to ensure that the Engine Daemon user is allowed to interact with the desktop. If the Local System user is selected, select the Allow Service to Interact with the Desktop check box.
4. The domain user who launches the Engine service in Windows needs to have the following security privileges set. Click the Start menu, then click Settings, click Control Panel, click Administrative Tools, then click Local Security Policy. Click Local Policies, then click User Rights Assignment, and add the user who launches the Engine service to the following policies:

• SE_TCB_NAME (“Act as part of the operating system”)
• SE_CHANGE_NOTIFY_NAME (“Bypass traverse checking”)
• SE_ASSIGNPRIMARYTOKEN_NAME (“Replace a process level token”)
• SE_INCREASE_QUOTA_NAME (“Increase quotas” or “Adjust memory quotas for a process”)

If you are using .NET Services that use XML serialization, complete the following steps:

1. Right-click the Engine’s temp directory in its Windows system directory (C:\WINNT\temp for Windows 2000, C:\Windows\temp for Windows XP and Windows Server 2003), select Properties, and under the Security tab, use Add... to add all users that you intend to run Services as.

2. Select the Allow check box for the Read, Write, and Delete permissions. Note that the Delete permission is set using the Advanced button on the Security page of the Windows Explorer folder properties dialog box.

Managing Credentials

The Credentials DB is a store of RA credentials on the Director and Brokers to be used for RA services. It is maintained on the Director and synchronized with Brokers.

The Credential Repository page in the GridServer Administration Tool enables you to create, edit, and delete RA credentials.

To add new Credentials to your Manager:

1. Log in to the GridServer Administration Tool.
2. Click the Admin tab, then click Credentials Repository.
3. Enter the name of a credential, a password, and then enter the same password again.
4. Click Add.

Manage Service Types

The Service Type Registry entries allow specification of an RA username for use with that Service.

To specify a Run-As user for a Service Type:

1. Log in to the GridServer Administration Tool.
2. Click the Services tab, then click Service Type Registry.
3. For an existing Service Type, go to the Actions control for that Service Type and select Edit Service Type. This opens the Service Type Editor window.
4. In the Service Type Editor window, under the ContainerBinding header, enter the user name in RunAsUser.

Note that in this field, you can use $ to indicate the Driver’s current user. Leaving this value blank (the default) indicates that the process will run as the same user running the Engine Daemon.

It is also possible to specify a Windows domain in the RunAsUser field. For example, if you are using a Unix Driver (which would not be in a Windows domain) and you want to run Services on Windows Engines using a specific user and domain, you can specify this in the form domain/username. The forward slash will be translated to a backslash. For example, specifying DATASYNAPSE/BILL will run Services as the user BILL in the DATASYNAPSE Windows domain (DATASYNAPSE\BILL).
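The rules for the RunAsUser field can be summarized in a short Python sketch. It is an illustration only, not GridServer code: the function name and the "unix" and "windows" labels are hypothetical, and the DOMAIN/$ form is an assumption about how a domain is prepended to the $ token.

```python
def resolve_run_as_user(run_as_user, driver_user, engine_os):
    """Resolve a RunAsUser value to the account name an Engine would use.

    Returns None for a blank value (run as the Engine Daemon's own user).
    engine_os is "unix" or "windows" (hypothetical labels).
    """
    value = run_as_user.strip()
    if not value:
        return None
    # Split an optional leading domain from the user part.
    domain, sep, user = value.rpartition("/")
    if user == "$":
        # "$" stands for the user currently running the Driver.
        user = driver_user
    if engine_os == "unix" or not sep:
        # Unix Engines ignore the domain; no domain means a plain username.
        return user
    # Windows: the forward slash is translated to a backslash.
    return domain + "\\" + user
```

With these rules, DATASYNAPSE/BILL resolves to DATASYNAPSE\BILL on a Windows Engine and to BILL on a Unix Engine.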


Chapter 8

The Batch Scheduling Facility

Introduction

Commands and Services can be scheduled to run on a regular basis using the Batch Scheduling Facility. A Batch Definition contains instructions in the form of components that define scheduling and what the Batch will execute. When the Batch Definition is scheduled on the Manager, it creates a Batch Entry, which typically waits until its scheduled time, then executes, creating a Batch Execution. Services are executed using an embedded Driver on the Manager.

Using the Batch Editor page in the GridServer Administration Tool, you can write a Batch Definition with specific scheduling instructions. You can specify a Batch Definition to immediately execute when scheduled, or it can wait until a given time and date. A Batch Definition can be submitted to run at a specific absolute time, or a relative time, such as every hour. They can also be written to wait for an event, such as a new, modified, or deleted file.

Batch Definitions contain one or more components contained within a batch component. A Command component contains a program that will be run by the Batch Definition. A schedule or event component will specify when subsequent Command components will run.

Terminology

The following terms are used to describe components related to the Batch Scheduling Facility:

Name | Page | Description
Batch Definition | Batch Registry | How a Batch is written. The Batch Definition is edited with the Batch Editor page and contains a Batch Component, which then contains other components that define the Batch. Once created, it can be managed from the Batch Registry page.
Batch Component | Batch Editor | When a Batch Definition is created, it consists of a Batch component, which can contain other components, such as ServiceCommand components, Conditional components, and other Batch Components. The Batch Editor page enables you to add, remove, and edit Batch components and the other components they contain.
Batch Entry | Batch Schedule | When a Batch Definition has been instantiated by being scheduled on the Batch Schedule page, a Batch Entry is created. The Batch Entry will either run immediately, or wait to run, depending on what scheduling components were added to the Batch Definition.
Batch Execution | Batch Admin | When a Batch Entry runs, it creates a Batch Execution, which does whatever was defined in the Batch Definition. For example, if a Batch Definition uses the ServiceCommand to start ten Service Sessions, the Batch Execution will do that. The Batch Execution is managed on the Batch Admin page. Any actual Service Sessions created can be managed on the Service Session page on the Services tab.
Service Runner | Service Runner Registry | Service Runners enable you to define a registered Service Type with options and init data that can be used in a Batch Definition.

FIGURE 8-1: A Batch Definition consists of Batch Components. When a Batch Definition is scheduled, it creates a Batch Entry, and will run as defined by the Batch Components. When it runs, it creates a Batch Execution, which then executes the components according to the definition.

Editing Batch Definitions

To create a new Batch Definition, click the Batch tab in the Administration Tool, then click Batch Registry. The Batch Registry page contains a list of Batch Definitions on the Manager, plus a blank box for entering the name of a new Batch Definition. In the Action column, there is an Action list for each Batch Definition. From each Action list, you can select Edit Batch Definition to edit a Batch Definition, Rename Batch Definition to rename a Batch Definition, Copy Batch Definition to copy a Batch Definition, Delete Batch Definition to remove a Batch Definition, Export Batch Definition to save an XML file of the Batch Definition, or Schedule Batch Definition to place a Batch Definition in the Manager’s Batch queue. You can also select Batch View to display a graphical representation of the Batch Definition in a new window.

To edit a Batch Definition, either select Edit Batch Definition from an existing Batch Definition’s Action list, or type the name of a new Batch Definition in the empty box at the end of the list and click Add. This opens a window, shown in Figure 8-2, containing parameters for your new Batch Definition. You can then change the values of parameters, and click Save to save the values as a Batch Definition on the Manager, or click Cancel to exit the Batch Editor and discard any changes you have made.

FIGURE 8-2: The Batch Definition Editor.


The Batch Definition parameters are as follows:

Parameter | Description
Batch Component
Name | The name of the Batch Definition. If this is a new Batch Definition, this is the name you initially typed in the blank box prior to selecting Add, and is not editable. (You can rename a Batch Definition by selecting the Rename action from the Batch Registry page.) If an additional Batch component is added to a Batch Definition, you can set its name.
Type | Determines how a Batch Definition is run, either in serial or parallel. If set to parallel, all Batch components are executed when the Batch Definition is scheduled. If set to serial, Batch components are executed in the order in which they were added. If any of the components fail, it prevents the Batch from continuing, and the Batch will fail. The default is serial.
Schedule Component
Type | Sets the type of the Schedule. If Immediate, the Batch Definition will run when scheduled. If Absolute, the Batch Definition will run once according to the date set in startTime. If Relative, the Batch Definition will run after the number of minutes specified in minuteDelay, and can also repeat or execute immediately according to repeat and runNow. If Cron, the Batch Definition will run according to the values set in the cron. If Manager Startup, the Batch Definition will run when the Manager is first initialized.
Add component | Adds a component to the Batch Definition. A Batch Definition can contain one or more components, which are described below.

Batch Components

The parameters in the Batch Editor window correspond to components contained in the Batch Definition. Each Batch Definition can contain one or more Batch components. These components can be commands, events, or other Batch Definitions. For example, a LogCommand component is shown in Figure 8-3. To add a component to a Batch Definition, select a component from the add component list.

Batch components are processed in a Batch Definition in order when Batch Type, described above, is set to serial. You can change the order of Batch components by clicking the Move Up and Move Down buttons in the upper-right corner of each Batch component, to move that component’s order up or down in the Batch Definition. You can also remove a Batch component by clicking the Remove button in the upper-right corner.
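The difference between the serial and parallel Batch types described above can be sketched in Python. This is an illustration under stated assumptions, not the Manager's implementation: each component is modeled as a hypothetical zero-argument callable that returns True on success, and threads stand in for whatever mechanism the Manager uses to run components at the same time.

```python
from concurrent.futures import ThreadPoolExecutor


def run_serial(components):
    """Run components in order and stop at the first failure.

    Returns (succeeded, number_of_components_started).
    """
    for index, component in enumerate(components):
        if not component():
            return False, index + 1
    return True, len(components)


def run_parallel(components):
    """Start every component at once; the Batch succeeds only if all do.

    Returns (succeeded, number_of_components_started).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(components))) as pool:
        results = list(pool.map(lambda c: c(), components))
    return all(results), len(components)
```

In the serial case a failing component prevents later components from starting, which is why a WaitCommand component is only meaningful in a serial Batch.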

FIGURE 8-3: A Batch component.




Each type of Batch component that can be added to a Batch Definition is described below. In the Batch Editor window, a help description is provided for each Batch component shown. By default, Extended Help is displayed. Using the help control in the upper right corner, you can select Help to display only the first sentence of help, or No Help to suppress the help display.

Name | Description
Batch | Contains another Batch Definition. This can be used to create a complex or multi-leveled Batch Definition. For example, a parent Batch Definition could start each day, starting two child Batch Definitions, each with different schedules or conditions. For each new Batch component, you must set the same parameters for a Batch Definition as described above. You can then add additional components to the Batch.
Conditional | Provides conditional processing when running Batches. The component specified by test is run. If it runs successfully, the component specified by success is executed. If it fails, the component specified by failed is executed. The component specified in test returns success in the following conditions: Command returns Command.SUCCESS; ServiceCommand creates the Service and submits the invocation without exception; ServiceRunnerCommand creates the Service and submits all invocations without exception.
BatchReference | Contains a reference to a registered Batch Definition that gets loaded when scheduled from the Batch Registry.
Command | Runs an implemented method in a deployed class.
ServiceCommand | Starts a Service. You can specify a Service type registered on the Manager and method name to run. You can also specify a Service reference ID (this enables you to reference the Service from another Service Command), Service action, and input and init data for the Service. Data is comma-delimited. You can add ServiceDescription, ServiceOptions, and Discriminator components to a Service by using a Service Runner.
ServiceRunnerReference | Loads the specified registered Service Runner. See below for information on registering a Service Runner.
AdminCommand | Executes a command via the GridServer Admin API. For more information on using the Admin API, see Chapter 10, “GridServer Admin API” on page 89 of the GridServer Developer’s Guide.
EmailCommand | Sends an email message from a Batch Definition, for notification or alerts. You can enter a comma-delimited list of email addresses for recipients, and a message string, which will be used as a subject and a body. Note that in order for email to be sent, you must define an SMTP server in your Manager Configuration. To do this, click the Manager tab, click Manager Configuration, click Admin, and enter a value in SMTP Host under the Mail heading.
EmailFileCommand | Sends an email message from a Batch Definition that includes files as attachments, typically used to send the output of a previous command by saving that output to a file. You can enter a subject, a message body string, a comma-delimited list of email addresses, and a semicolon-delimited list of files, which will then be sent as attachments in the message. The setup rules given above in the description of the EmailCommand component also apply to the EmailFileCommand component.
ExecCommand | Executes a command from a Batch. This will execute a command from the application server’s root directory. You can set an input, output, and error file, plus a log file for the command to be run.
LogCommand | Writes a string to the Manager log. This is useful for testing Batches or indicating when a Batch is starting or stopping.
WaitCommand | Halts for a moment before proceeding. The amount of wait time is specified in seconds. Note that this component is only useful for generating a wait time when the Batch type is serial.
EngineWeightCommand | Sets the Engine distribution weighting relative to other Brokers. The Brokers must be logged in to the Director during execution, and also in order to show up in the Batch Editor. The current Broker list is fetched only when adding a new EngineWeightCommand component in the Batch Editor.
Event | Makes a Batch wait for an implemented event to take place. You can use this to pause until a specific condition in a class you deployed has occurred.
FileEvent | Makes a Batch wait for a file event to occur before completing the remaining items in the Batch Definition. Specifically, it enables you to watch a file and wait until it is created, deleted, or modified before proceeding.

Service Runners

Service Runners enable you to define a registered Service Type with options and init data that can be used in a Batch Definition. They can also be used to chain together Service Types and discriminators into a single unit that can be used in a Batch Definition.


To create a Service Runner, click the Service Runner Registry page. Type the name of a Service Runner in the box and click Add. This will open a Service Runner Editor page, where you can choose a Service Type and enter init data, a description, and method names and input data for invocations. You can also use the list at the bottom of the page to add discriminators, Service input description data, and Service options.

The Service Runner Registry also lists all Service Runners existing on a Manager. Using the Actions controls, you can edit, rename, copy, delete, export, or launch each Service Runner.

Scheduling Batch Definitions

After you have created a Batch Definition with the Batch Editor page, it will be listed with the other Batch Definitions on the Batch Registry page. However, these Batch Definitions are not actually running on the Manager yet. To create a Batch from a Batch Definition, you must first schedule it. This actually instantiates a Batch and inserts it into the Manager’s batch queue.

To schedule a Batch Definition, click the Batch Registry page, and find the Batch Definition in the list. Select Schedule Batch Definition from the Actions control. This will schedule the Batch Definition, and open the Batch Schedule page, displaying it as a Batch Entry.

The Batch Schedule Page

Batch Entries on a Manager can be listed and administered on the Batch Schedule page. To do this, click the Batch tab, then click the Batch Schedule page. All Batch Entries resident on the Manager are listed. To remove or edit an existing Batch Entry or view logs or Batch executions, select a command from the Actions control next to the relevant Batch.

Running Batches

Batch Entries will automatically run when they reach the scheduled time or conditions defined in their Batch Definition. When this happens, Batch Executions are created and displayed on the Batch Admin page. PDriver Batches (which are also Batch Executions) are also displayed on this page. On the Batch Admin page, you can monitor Batch Executions, search for logs, and display the Batch Monitor applet to view what parts of a Batch have completed.

Any Services that are run by the Batch Execution are displayed on the Service Session Admin page. From there, you can cancel Service Sessions, view Tasks, or do any other actions you normally would with a Service. Note that it is possible to have a Batch Execution run a Service that continues to run, even after the Batch Execution reports that it is finished.

FIGURE 8-4: The Batch Schedule page.


Deploying Batch Resources

Java Services, Commands, and other resources must be placed in [GS Manager Root]/webapps/livecluster/WEB-INF/batch/jar to be properly loaded by the embedded Driver.

For more information on resource deployment, see Chapter 7, “Application Resource Deployment” on page 43.

Batch Fault-Tolerance

Batch Schedules that exist on a Manager are persistent, provided the Next Run field is not set to never. This provides failover capability in the event of a Manager failure, as the Batch Schedules will still exist when the Manager is restarted.

The following Batch Schedules are persistent:

• Absolute schedules
• Relative schedules with repeat
• Cron schedules

All persistent Batches are restarted when the Manager is restarted, just as if they were scheduled for the first time. Batch runs that were to occur during the time when the Manager was down are ignored.

Using PDriver in a Batch

You can use PDriver within a Batch, with the following configuration changes:

1. Download the GridServer SDK on your Broker machine.
2. Write a batch or shell script to run your PDriver job on the Broker.
3. Create a Batch Definition that uses the ExecCommand component to run that script.


Chapter 9

Configuring Security

Introduction

GridServer provides a rich set of security options for integrating into your organization’s computing environment. GridServer does not impose its own security policy; instead you select from the features available to implement your preferred policy. The key security areas of authentication, access control and authorization, event logging, data validation, and cryptography are discussed.

Authentication

Authentication is the process of determining if an entity is what it claims to be. In keeping with the GridServer philosophy of providing a flexible set of tools that can be used to implement an organization’s security policy, GridServer provides both a built-in authentication service and an extensible set of hooks for integrating with external authentication systems.

Operating System Users

By default, GridServer does not authenticate using operating system accounts. Operating system accounts are used to start GridServer software components, like the Manager, Engine, and Driver. It is not required to use a superuser operating system account to start any GridServer component. Certain features do require superuser level access. For instance, to use GridServer’s UIIdle scheduling mode on Windows, at least the DSHook UI event timing service must run as superuser.

It is possible to use operating system user authentication for GridServer authentication. See “Extensible Authentication Hooks” on page 70 for more information.

Authentication of operating system users is handled by the operating system in question.

Grid Users

Users of Grid Services may be either compute Service users or administrative users. In either case they are authenticated through the same mechanism.

GridServer is responsible for authenticating Grid users according to the policy defined by the administrator. Extensible authentication hooks can be used to interface to an external authentication system such as Active Directory, LDAP, or NIS.

Once a Grid user has been authenticated, they are given an authentication token to use in further correspondence. In the case of Administration Tool or Web Services users, the authentication token is a standard HTTP session cookie. In the case where compute users connect via the DataSynapse APIs, the authentication token is a DataSynapse object.

User accounts are added or modified with the User Admin page, located on the Admin tab in the Administration Tool. Each user account is given an access level, which dictates what features of the Administration Tool they can use. For further details on access levels and their corresponding permissions, see Chapter 6, “The GridServer Administration Tool” on page 36.

GridServer Built-In Authentication

GridServer’s built-in authentication mechanism uses the embedded Director database (the internal database) to authenticate Grid users. Administration Tool users must be authenticated with a username and password before they can access the Administration Tool. Likewise, Web Services users must be authenticated with a username and password. The DataSynapse Client APIs (JDriver, CPPDriver, PDriver) do not require authentication by default, but authentication can be enabled.

GridServer built-in authentication includes options for minimum username length, minimum password length, password complexity, password aging, and application behavior on password failure.

Password authentication can be configured on the Manager Configuration page, in the Security section.

Extensible Authentication Hooks

Many environments already have a suitable authentication service that can be used by GridServer. For instance, the organization may be running an LDAP-based service like Active Directory. In this case the organization’s policy may be to centralize all authentication information in Active Directory. GridServer’s extensible authentication hooks can be used to integrate with existing authentication services.

Since there is no universally-accepted standard for Grid authentication nor for application authentication, DataSynapse has chosen to create its own interfaces, DriverAuthenticationHook and UserDatabaseHook, that can be used to integrate existing authentication models. We provide example implementations for these hooks to integrate with LDAP. Since LDAP bindings for Grid authentication can be expected to vary from organization to organization, it may be necessary to modify the example implementations to work with your bindings. An additional authentication hook example is provided for NTLM.

Enabling Client Authentication

By default, any client is allowed to log in to a Manager. However, the Manager can be configured to allow only Drivers with a valid Grid User identity that is associated with a Driver Profile to log in. Driver Authentication is a Director setting, and should be set on all Directors.

To enable Driver authentication:

1. Click the Manager tab on the Director.
2. Click Manager Configuration.
3. Click Engines and Clients.
4. In Client Authentication Enabled, enter True.
5. Click Save.

After authentication is enabled, you will then need to allow clients to log in. To do this, a Driver Profile must be assigned to a Grid User. For example:

1. Click the Driver tab.
2. Click Driver Profiles.
3. Create a new Driver Profile and save it.
4. Click the Admin tab.
5. Click the User Admin page.
6. Create a new user, and assign the profile to that user.

For Drivers, the username and password are assigned using the driver.properties file or the API.

For SOAP clients, they are set using HTTP basic authentication. Most SOAP packages provide a method for setting the username/password on the proxy.

SSL

SSL (Secure Socket Layer) communication can be enabled at each level in the GridServer architecture, depending on the security requirements of the organization and the deployment scenarios involved. SSL provides both encryption of messaging between components and a means for the client to establish trust in the server. In addition, SSL can be used for resource downloading by Engines, and for use of the Administration Tool. In general, HTTP communication can be completely disabled, and all GridServer components can operate using only HTTPS.

Communication Overview

To understand how SSL is used for messaging, it is important to understand how components establish communication channels with each other. For the remainder of this discussion, the terms “client” and “server” will be used in the traditional way, that is, a client/server relationship. For example, the Engine Daemon is a “client” to the Director’s “server”.

There are two aspects to establishing communication. The first step is the login process. The client requests a login via a known communication channel. At that point, the server may perform authentication or validation, and if successful, it returns a connection for use from then on. Note that this channel may be on a different server. For example, an Engine logs in via a Director, but the connection exists on a Broker.

SSL is configurable for both aspects. If SSL is to be used for login, it must be configured on the client. If SSL is to be used for the connection, it must be enabled on the server. For example, to enable a Driver to log in via SSL, the Driver must be set to the HTTPS URL address of the Director, either via the driver.properties file or the API. To enable HTTPS communication between the Driver and Broker after login, HTTPS must be set on the Broker, typically by configuring all Messaging and Download URLs to the HTTPS URL.

Certificate Overview

All SSL clients establish a trust relationship with their server. This is performed via a certificate on the client side, which essentially is a public key that is associated with a private key on the server. When establishing the trust relationship, the server’s certificate must either have been signed by a key trusted by the client, or be trusted implicitly by the client (a self-signed certificate). Most SSL clients contain a set of trusted Certificate Authorities (CAs), so that if a server has a certificate signed by one of those CAs, the client will automatically trust the server. If the server’s certificate is self-signed, that certificate must be added to the client’s list of trusted servers.


In addition, the client may check the Common Name (CN) of the server’s certificate against the hostname of the server, to verify that the certificate is being used on the intended host.

GridServer is packaged with a default self-signed key-pair and certificate. All clients have a local copy of the certificate added to their list of trusted servers. In addition, hostname verification is disabled by default, as the CN will not match the server’s hostname. This configuration allows immediate use of SSL without any additional setup. This may or may not be sufficient, depending on your needs.
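The trade-off described above (trust a specific certificate, with hostname verification switched off or on) can be illustrated generically. The following is a minimal sketch using Python's standard ssl module, not GridServer code; the function name make_client_context and its parameters are hypothetical and chosen only for illustration.

```python
import ssl

def make_client_context(trust_file=None, verify_hostname=False):
    """Build a client-side SSL context (illustrative, not a GridServer API).

    trust_file: optional path to a PEM file of trusted certificates;
                when None, the platform's default CA set is used.
    verify_hostname: when True, the server certificate must also match
                     the hostname the client connects to.
    """
    ctx = ssl.create_default_context(cafile=trust_file)
    # The certificate chain is still verified either way; only the
    # hostname comparison is toggled here.
    ctx.check_hostname = verify_hostname
    return ctx

default_like = make_client_context()                 # trust check only
strict = make_client_context(verify_hostname=True)   # trust + hostname check
```

With verify_hostname left off, a self-signed certificate whose CN does not match the host can still be accepted as long as it is in the trust file, which mirrors the default behavior described above.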

Keypair and Cert Location

All Managers must contain a keypair, either self-signed or CA-signed. The default keypair is stored in a keystore, located at [GS Manager Root]/webapps/livecluster/WEB-INF/certs/server.keystore. The keystore password is configurable via the Manager Configuration page, in the Security section, under the SSL Certificates heading.

If you’ve replaced the cert on the manager with one signed by your own CA, you need to replace the cert in each downloaded SDK. If you have your own CA and ROOT_CA.pem contains its cert:

• The ROOT_CA.pem file should be imported into config/ssl.keystore as a trusted cert. This is for JDriver and .NET.

• ROOT_CA.pem should be renamed ssl.pem (replacing the existing one) in the config directory. This is for C++-based code (including PDriver).

The default SSL trust files are ssl.keystore, ssl.crt, and ssl.pem for JDriver, .NETDriver, and CPP/PDriver, respectively. New certificates can be used by either importing them into the appropriate one of these files, or by changing the DSSSLTrustFile property in driver.properties or the DriverManager.SSL_TRUST_FILE option through the API to the file containing the certs.

Types of Connections Using SSL

It is possible to enable SSL on several different types of connections within GridServer. SSL can be used for Driver connections, Engine and Engine Daemon connections, Broker and Director communication, and Engine resources.

There are two methods for enabling SSL within GridServer. The first is to enable Manager HTTPS and then enable SSL on some components. The other method is to enable HTTPS to all components. Both methods are detailed below.

Enabling HTTPS on the Application Server

To enable HTTPS, you must first enable HTTPS on the Manager’s application server. You can then configure HTTPS on any of the connections to components.

To enable HTTPS on the application server:

1. Log in to the GridServer Administration Tool.
2. Click the Admin tab, then click Manager Reconfigure.
3. Click the Resin Configuration option.
4. Proceed to step 4 of the Resin Configuration, the Resin SSL page. Click Enable SSL and enter an SSL port, or use the default of 8443.
5. Complete the Manager Reconfigure steps and restart your application server.
6. After restart, open the URL to your GridServer Administration Tool. You will be presented with the Manager Installation page. Complete the installation (enabling HTTPS on components if needed, described in the next section) and restart your application server.

Enabling HTTPS on all Components

Because it is possible to enable SSL on several different types of connections within GridServer, the option is available to enable everything with SSL on installation, for those who want to run a pure SSL environment.

To do this:

1. Complete the above procedure for enabling HTTPS on the application server up to step 6, and start the Manager Installation.

2. On step 3 of the Manager Installation, you are given the option to select Protocol and Port for both Web Administration and Messaging and Resource Download.

The Web Administration settings are used for connections for the GridServer Administration Tool. When this is set to HTTPS (typically with port set to 8443), any attempted HTTP connection will be rerouted to an HTTPS connection on this port.

The Messaging and Resource Download settings are used for all Engine and Client messaging and Resource Downloads. Setting this protocol to HTTPS will cause all connections to use HTTPS. To configure HTTPS for only a subset of these, such as HTTPS only for Resources, you should set this protocol to HTTP, and then set HTTPS for individual components in the Manager Configuration after installation. Each component’s specific settings are described below.

3. Complete the remaining steps in the configuration, then click Start Installation to complete the installation/reconfiguration. You will need to restart your application server.

Note that if you have already installed Drivers from this GridServer installation, their driver.properties files must be edited to point to the new HTTPS URL before they will use SSL. Engines will reconfigure themselves to use the new secure installation: the Director URLs in all Engine Configurations are changed to https://host:sslport.

Driver SSL

All Driver certificates can be found in the SDK, stored in the config directory. Drivers will look for this certificate in this directory by default. The Driver can use a different location if desired; see the API for more information. If your server is using a CA-signed certificate, there is no need for the default certificate. The JDriver keystore includes all certificates packaged with the Java 1.4.2 cacerts file, plus the GridServer default certificate.

HTTPS must be enabled on the Director for login, and on the Brokers for the connection.

To enable SSL for Driver login, you must set the Director URLs to the HTTPS location, either via the driver.properties file (with the DSPrimaryDirector property) or by setting the URL programmatically through the DriverManager API.
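When many Drivers are already installed, editing each driver.properties file by hand is tedious. The following minimal sketch shows one way the DSPrimaryDirector value could be switched to an HTTPS URL. It assumes the file uses plain key=value lines with an unescaped URL; the helper names, the example host, and the example path are hypothetical, and the default SSL port of 8443 is taken from the Resin SSL step described earlier.

```python
from urllib.parse import urlsplit, urlunsplit

def to_https(url, ssl_port=8443):
    """Return url with its scheme changed to https and its port replaced."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return urlunsplit(("https", f"{host}:{ssl_port}",
                       parts.path, parts.query, parts.fragment))

def secure_director_property(properties_text, key="DSPrimaryDirector",
                             ssl_port=8443):
    """Rewrite the given property to an HTTPS URL; other lines are untouched."""
    out = []
    for line in properties_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key and value.strip().startswith("http://"):
            line = f"{name.strip()}={to_https(value.strip(), ssl_port)}"
        out.append(line)
    return "\n".join(out)

# Hypothetical file contents; the host and path are placeholders only.
before = "DSPrimaryDirector=http://manager.example.com:8000/some/path\nOther=1"
after = secure_director_property(before)
```

Review the rewritten file before deploying it, since a real driver.properties file may contain other URL properties that also need to change.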


To enable SSL for Driver communication, you must enable it on all Brokers on which you wish to use it. This setting will affect any Driver that is logged in to that Broker. If your Broker is configured to use HTTPS for all Messaging, Drivers will already use HTTPS.

If you did not enable HTTPS for all messaging and want to enable SSL for Driver communication:

1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under HTTPS Communication, set Use HTTPS for Client Communication to True.
5. Click Save.

If you wish to use hostname verification, it can be enabled via the driver.properties file or API. Keep in mind that you have to create and install your own keypair corresponding to the CN of the host.

Engines and Engine Daemon SSL

The Engine Daemon and Engine use the ssl.pem and ssl.keystore files, respectively, found in the Engine’s root directory.

HTTPS must be enabled on the Director for login and connection for Daemons, and on the Brokers for the connection for Engines.

To enable SSL for Engine and Engine Daemon login, you must set the Directors to the HTTPS location in the Engine Configuration.

To enable SSL for Engine communication, you must enable it on all Brokers on which you wish to use it. SSL is enabled for Engine Daemons on Directors. If your Broker is configured to use HTTPS for all Messaging, Engines will already use HTTPS.

If you did not enable HTTPS for all messaging and want to enable SSL for Engines on a Broker:

1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under HTTPS Communication, set Use HTTPS for Engine Communication to True.
5. Click Save.

To enable SSL for Engine Daemons on a Director:

1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under HTTPS Communication, set Use HTTPS for Engine Daemon Communication to True.
5. Click Save.

If you wish to use hostname verification, it can be enabled via the Engine Configuration. Keep in mind that you have to create and install your own keypair corresponding to the CN of the host.


Brokers and Director SSL

The communication between Brokers and Directors, and the Secondary Director and Primary Director can also be configured to use SSL. Note that because they use pure sockets for communication, HTTPS does not need to be enabled on the Manager.

The default cert is stored in livecluster/WEB-INF/certs/ssl.keystore. Its location is configurable via the Manager Configuration page, in the SSL section.

To enable SSL for Broker and Secondary Director login:

1. Click the Manager tab on the Director.
2. Click Manager Configuration.
3. Click Security.
4. Under Server-side Socket SSL, set Require SSL for Login to True.
5. Click Save.
6. Click the Manager tab on the Brokers and/or Secondary Director.
7. Click Manager Configuration.
8. Click Security.
9. Set Use SSL for Login to True for all applicable categories (such as Broker-Primary Director).
10. Click Save.

WARNING: If a Director requires SSL, all Brokers and the Secondary Director must also use SSL for login.

To enable SSL for the connections:

1. Click the Manager tab on the Director.
2. Click Manager Configuration.
3. Click Security.
4. Set Use SSL for Communication to True for the Broker-Primary Director and/or Broker-Secondary Director Connections.
5. Click Save.

If you wish to use hostname verification, it can be enabled via the Verify Hostname setting on the Security page. Keep in mind that you have to create and install your own keypair corresponding to the CN of the host.

Resources over HTTPS

The resources used by Engines may be downloaded via HTTPS. In addition to Engines downloading resources from Brokers, Brokers also download synchronized resources from the Director. Thus there are two settings. If your Broker is configured to use HTTPS for all Messaging, Resources will already use HTTPS. Otherwise, the following procedure will enable it.

To enable SSL for the connections:

1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Security.
4. Under Broker Resources, set HTTPS Enabled to True for appropriate settings. On a Manager that contains only a Broker or Director, there will only be a single setting.

Disabling HTTP

For security reasons, you may want to disable HTTP on the Director and only use HTTPS.

NOTE: 1-Click install will not work if you are accessing the Manager using SSL (through an HTTPS URL).

To disable HTTP connections:

1. Reconfigure the Manager, setting the URL to use the HTTPS URL.
2. Update all Drivers (in the driver.properties files) to use the HTTPS URL.
3. Shut down the Manager and edit the datasynapse/conf/resin.conf file (or whatever RESIN_CONF refers to) and comment out the <http></http> entry for port 8000. (If you have already successfully gone through the Resin Configuration pages in the Administration Tool, there will be another, uncommented <http></http> entry that contains an SSL-enabled tag.)
4. When you restart the Manager, everything should use SSL, with no HTTP port open.

Resource Protection

Resources that are downloaded by Engines are protected from download via HTTPS. This is done in the following manner:

• The deployment directory is protected such that files cannot be directly downloaded from it.
• When an Engine receives a message to download resources, it is provided a random nonce (a single use token) that will expire. (This expiration time is configurable via the Manager Configuration page, in the Security section, in the Resource Deployment heading, in the Broker Resources section, with the Token Timeout setting.) When the Engine attempts to download data from the URL, it is redirected to the protected deployment directory. The nonce is then validated by the Manager, and the Engine is allowed to download the data.
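The single-use, expiring token idea can be sketched in a few lines. This is an illustrative model only, not the Manager's actual implementation; the class name NonceStore and its methods are hypothetical, and the timeout stands in for the Token Timeout setting.

```python
import secrets
import time

class NonceStore:
    """Issues random single-use tokens that expire after a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._issued = {}            # nonce -> expiry time

    def issue(self):
        """Create a new random nonce and remember when it expires."""
        nonce = secrets.token_urlsafe(16)
        self._issued[nonce] = self._clock() + self._timeout
        return nonce

    def redeem(self, nonce):
        """Return True only for a known, unexpired nonce; it is consumed either way."""
        expires_at = self._issued.pop(nonce, None)
        return expires_at is not None and self._clock() <= expires_at

# Demonstration with a fake clock so that expiry is deterministic.
now = [0.0]
store = NonceStore(timeout_seconds=60, clock=lambda: now[0])
first = store.issue()
first_ok = store.redeem(first)       # valid on first use
replay_ok = store.redeem(first)      # rejected on reuse
late = store.issue()
now[0] = 61.0
late_ok = store.redeem(late)         # rejected after the timeout
```

A longer timeout gives heavily loaded Engines more time to start a download, at the cost of a longer window in which an intercepted token would remain usable.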

Note that if you are using an alternate base directory, resources are NOT protected.


Chapter 10

GridServer Performance and Tuning

Diagnosing Performance Problems

To find bottlenecks in application performance, use GridServer’s Instrumentation feature. With instrumentation enabled, you can get detailed timings of each request submitted to the Broker. These timings highlight scheduling overhead, data marshalling time and network delays.

Note that Instrumentation measures only GridServer-related times. It does not show other application delays due to, for example, excessive database load.

For information on turning on Instrumentation, see Chapter 12, “Administration Howto” on page 89. For more information on instrumentation, see Appendix A, “Task Instrumentation” on page 105 of the GridServer Developer’s Guide.

Tuning Data Movement

Efficient handling of data can often make or break achieving performance gains in a Grid-enabled application. Instrumentation will reveal problems with having too much data per request: serialization, deserialization and network transport times will be high compared to the actual Engine-side compute time. There are a number of remedies for inefficient data movement. We survey them here in order from simplest to most complex.

Stateful Processing

GridServer supports two related mechanisms that link client-side service instances to Engine-side state, thereby reducing the need to transmit the same data many times. The two mechanisms are initialization/update data, and Service affinity.

Data that is constant across an entire set of task requests should be made Service initialization data. Initialization data is transmitted once per Engine, rather than once per request. Long-lived volume-based applications will typically process thousands of requests, and compute-intensive applications should be designed to create many small requests, rather than few large ones, for a variety of reasons (see Chapter 8, “GridServer Design Guidelines” on page 79 in the GridServer Developer’s Guide for more information).

If a piece of data is not constant throughout the life of the application, but changes rarely (relative to the frequency of requests), it can be passed as initialization data and then changed by using an update method. See Chapter 3, “Creating Services” on page 23 of the GridServer Developer’s Guide for details.

The GridServer scheduler uses the fact that an Engine has initialization data and updates from a particular Service to route subsequent requests to that Service. This feature, called affinity, further reduces data movement, because unneeded Engines are not recruited into the Service. (However, if the Service has pending requests, available but uninitialized Engines will be allocated to it.) Affinity can be further exploited by dividing the state of an application across multiple client-side Service instances, called Service Sessions. The application then routes requests to the instance with the appropriate data. For example, in an application dealing with bonds, each Service instance can be initialized with the data from one or several bonds. When a request comes in for the value of a particular bond, it is routed to the service instance responsible for that bond. In this way, a request is likely to arrive on an Engine that already has the bond data loaded, yet no Engine will be burdened with the entire universe of bonds.

There are Engine and Service parameters related to stateful processing. The Service Session Size parameter, located on Engine Configuration pages under the Caches heading, controls how much initialization data can be stored on an Engine in aggregate. In other words, if the total size of init data across all loaded service instances exceeds the set value of the parameter, then the least-recently used Service instance will be purged from the cache. If Instrumentation shows a non-zero time for Engine Download Instance the second or subsequent time an Engine receives a request from a service, that indicates that the service instance was purged from the cache. Increasing Service Session Size may then result in improved performance.
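The purge behavior described above is a size-bounded least-recently-used cache. The following minimal sketch models it so that the effect of the size limit is easy to see. It is an illustration under simplified assumptions, not the Engine's implementation; the class name SessionCache and the byte sizes are hypothetical.

```python
from collections import OrderedDict

class SessionCache:
    """Size-bounded LRU cache of service instance init data (sizes only)."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self._entries = OrderedDict()   # instance id -> init data size

    def touch(self, instance_id, size_bytes):
        """Record a request for an instance.

        Returns True on a cache hit, False if the init data would have to
        be downloaded again (a non-zero download time in Instrumentation).
        """
        hit = instance_id in self._entries
        if hit:
            self._entries.move_to_end(instance_id)      # mark most recent
        else:
            self._entries[instance_id] = size_bytes
        # Purge least-recently used instances until the total fits,
        # always keeping the instance that was just requested.
        while sum(self._entries.values()) > self.max_bytes and len(self._entries) > 1:
            self._entries.popitem(last=False)
        return hit

small = SessionCache(max_bytes=100)
small_hits = [small.touch("a", 60), small.touch("b", 60), small.touch("a", 60)]

large = SessionCache(max_bytes=200)
large_hits = [large.touch("a", 60), large.touch("b", 60), large.touch("a", 60)]
```

In the small cache the third request misses because instance "a" was purged to make room for "b"; with the larger limit both instances stay resident and the third request is a hit.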

The STATE_AFFINITY Service option is a number that controls how strongly the scheduler uses affinity for this service. The default is 1, so set it to a higher value to give your service preference when Engines are being allocated by affinity.

The AFFINITY_WAIT Service option controls how long a queued request will avoid being allocated to an available Engine that has no affinity, in the hope of later being matched to an Engine with affinity. Use this option when the initialization time for a service instance is large. For instance, say it takes five minutes to load a bond. If AFFINITY_WAIT is set to two minutes, then a queued request will not be assigned to an available Engine that lacks affinity for two minutes from the time the first Engine becomes available. If an Engine that already has loaded the bond becomes available in those two minutes, then the request will be assigned to that Engine, saving five minutes of startup time.
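The waiting rule in the bond example can be written as a single decision. This is a simplified sketch of the rule as described in the paragraph above, not the scheduler's actual logic; the function name and its parameters are hypothetical.

```python
def may_assign(engine_has_affinity, seconds_since_first_engine_available,
               affinity_wait_seconds):
    """Decide whether a queued request may go to a given available Engine.

    An Engine with affinity is always acceptable. An Engine without
    affinity is acceptable only once the wait period has elapsed.
    """
    if engine_has_affinity:
        return True
    return seconds_since_first_engine_available >= affinity_wait_seconds

# Two-minute wait, as in the bond example above.
cold_engine_early = may_assign(False, 30, 120)    # keep waiting
warm_engine_early = may_assign(True, 30, 120)     # assign immediately
cold_engine_late = may_assign(False, 120, 120)    # wait expired, assign
```

The wait is only worthwhile when it is short relative to the initialization time it can save, as in the example of a two-minute wait against a five-minute load.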

Compression

Setting the COMPRESS_DATA Service option to true (in the Service client or on the Service Type Registry page) will cause all transmitted data to be compressed. For large amounts of data, the transmission time saved more than makes up for the time to do the compression.

Packing

Packing multiple requests into a single one can improve performance by amortizing the fixed per-request overhead of GridServer and the application over multiple units of work. The fixed overhead includes TCP/IP connection setups for multiple transits, GridServer scheduling, and other possible application initialization steps.

GridServer’s AUTO_PACK_NUM Service option is an easy way to achieve request packing. If its value is greater than zero, then that many requests will be packed into a single request, and responses will be unpacked, transparently to the application. (If the application makes fewer than AUTO_PACK_NUM requests, then the accumulated requests are transmitted after one second.) Auto-packing amortizes per-request overhead, but does not factor out common data.
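A back-of-the-envelope model shows why packing helps most when requests are short. The sketch below is illustrative arithmetic only; the function name and the millisecond figures are hypothetical, and real overhead should be measured with Instrumentation.

```python
import math

def total_ms(num_requests, overhead_ms_per_message, work_ms_per_request,
             pack_size=1):
    """Estimate total time when pack_size requests share one message.

    Fixed overhead is paid once per message; compute time is paid once
    per request regardless of packing.
    """
    messages = math.ceil(num_requests / pack_size)
    return messages * overhead_ms_per_message + num_requests * work_ms_per_request

# 1000 requests, 50 ms fixed overhead per message, 10 ms of work each.
unpacked = total_ms(1000, 50, 10)                # one request per message
packed = total_ms(1000, 50, 10, pack_size=20)    # 20 requests per message
```

In this model the compute portion is unchanged; only the overhead term shrinks, so the benefit fades as per-request work grows relative to the fixed overhead.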

Direct Data Transfer

By default, GridServer uses Direct Data Transfer (DDT) to transfer inputs and outputs between Drivers and Engines. When Driver-Engine DDT is enabled, the Driver saves each request as a file and sends a URL to the Broker. The Engine assigned to the request gets the URL from the Broker and reads the data directly from the Driver. Engine-Driver DDT works the same way in the opposite direction. Without DDT, all data must needlessly go through the Broker.


DDT is efficient for medium to large amounts of data, and prevents the Broker from becoming a bottleneck. However, if the amount of data read and written is small, disabling DDT may boost performance.

Disable Driver-Engine DDT in the driver.properties file on the client. Disable Engine-Driver DDT from the Engine Configuration page.

Shared Directories and DDT

In some network configurations, it may be more efficient to use a shared directory for DDT rather than the internal fileservers included in the Drivers and Engines. In this case, the Driver and Engines are configured to read and write requests and results to the same shared network directory, rather than transferring data over HTTP. All Engines and the Driver must have read and write permissions on this directory. Shared directories are configured at the Job and Service level with the SHARED_UNIX_DIR and SHARED_WIN_DIR options. If using both Windows and Unix Engines and Drivers, you must configure both options to be directories that resolve to the same directory location for the respective operating systems.

Caching

Service initialization data is effectively a caching mechanism for data whose lifetime corresponds to the Service Session. Other caching mechanisms can be used for data with other lifetimes.

If the data is constant or rarely changing, use GridServer’s resource deployment mechanism to distribute it to Engine disks before the computation begins. This is the most efficient form of data transfer, because the transfer occurs before the application starts.

GridCache can also be used to cache data. GridCache data is stored on the Manager and cached by Engines and other clients. GridCache can handle large amounts of frequently updated data. See Chapter 7, “GridCache” on page 73 of the GridServer Developer’s Guide for more information.

Data References

GridServer supports Data References: remote pointers to data. A Data Reference is small, but can refer to an arbitrary amount of data on another machine. Data References are helpful in reducing the number of network hops a piece of data needs to make. For instance, imagine that an Engine has computed a result that another Engine may want to use. It could write this result to GridCache. But if the result is large, it will travel from the writing Engine to the GridCache repository on the Broker, and then to the reading Engine. If the first Engine writes a Data Reference instead, the second Engine can read the data directly from the first Engine. Data References hide this implementation from the programmer, making network programming much simpler.

See Chapter 4, “Accessing Services” on page 39 of the GridServer Developer’s Guide or the GridServer API for more information.

Tasks Per Message

In the Job model, messages are sent to the Engine when TaskInputs are created. To minimize message overhead, a message is only sent for each 20 Tasks in a Job. You may find that when running Jobs with many short-running tasks, message overhead can be minimized by setting the Job option TASKS_PER_MESSAGE to a number higher than the default of 20.

Invocations Per Message

In the Services model, Drivers will send a message per invocation submitted to the Manager. To minimize message overhead, more invocations can be sent in each message. This can increase submission speed on Services when many invocations are submitted in bulk. The Service option INVOCATIONS_PER_MESSAGE can be changed to a number greater than 1, so the Driver will buffer that number of invocations before submitting to the Manager. The buffered invocations are also flushed to the Manager every second if the buffered number doesn't reach the maximum number.
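The buffering behavior described above (send when the buffer is full, or after one second if it is not) can be modeled as follows. This is a minimal sketch of the general pattern, not the Driver's implementation; the class name InvocationBuffer, the tick method, and the injected clock are hypothetical.

```python
import time

class InvocationBuffer:
    """Buffers invocations and sends them in batches."""

    def __init__(self, send, per_message, flush_interval=1.0,
                 clock=time.monotonic):
        self._send = send                    # callable taking a list
        self._per_message = per_message
        self._flush_interval = flush_interval
        self._clock = clock
        self._pending = []
        self._oldest = None                  # time the first item was buffered

    def submit(self, invocation):
        """Buffer one invocation; send a message once the buffer is full."""
        if not self._pending:
            self._oldest = self._clock()
        self._pending.append(invocation)
        if len(self._pending) >= self._per_message:
            self.flush()

    def tick(self):
        """Call periodically; sends a partial buffer once it is old enough."""
        if self._pending and self._clock() - self._oldest >= self._flush_interval:
            self.flush()

    def flush(self):
        """Send whatever is buffered as a single message."""
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()

# Demonstration with a fake clock and a list standing in for the Manager.
sent = []
now = [0.0]
buf = InvocationBuffer(sent.append, per_message=3, clock=lambda: now[0])
for item in (1, 2, 3):
    buf.submit(item)          # third submit fills the buffer and sends
buf.submit(4)
buf.tick()                    # too early, nothing is sent
now[0] = 1.0
buf.tick()                    # one second elapsed, partial buffer is sent
```

A larger batch size reduces the number of messages for bulk submissions, while the timed flush bounds the extra latency added to a lone invocation.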

Tuning for Large Grids

In GridServer installations with a large Grid, Manager performance may become extremely slow. For example, the Broker Monitor may take several seconds to update.

The following changes can improve performance on large Grids:

• Increase the number of Resin request threads from the default of 200 to 300 or more. A good rule of thumb is Resin Threads = Maximum Messaging Connections + Maximum Resource Download Connections + 50. This ensures enough threads to handle all messaging, downloads, and browser requests. To do this, edit the conf/resin.conf file at the top of the Broker's installation directory. Find the line that reads:

<thread-max>200</thread-max>

and change the value to 300 or more. Note that your Broker will restart when the resin.conf file is modified.

• On the Brokers, increase the Engine “Max Millis Per Heartbeat” value to be at least 2 minutes; the default is 30 seconds.

• Increase the SSL “Token Timeout,” which is actually in effect regardless of SSL, for both the “Broker Resources” and “Director Resources” to be 5 minutes. The settings are on the Manager Configuration page, in the SSL section, under the Resource Deployment heading.

• Increase the Assignment Timeout, on the Manager Configuration page, in the Services section, to 60000 ms. Increasing this allows more time for an Engine to connect and pick up an assigned task when the Broker is under heavy load. This value should be increased if you often see 'Task assignment expired:'... messages.

• On the Manager Configuration page, in the communication section, change Maximum Messaging Connections to 200; change Messaging Retry Wait to 10000 ms; change Driver/Engine/Daemon Socket Timeout to 120 seconds.

• Increase the heap size. The Java maximum heap size is set in the server.sh or server.bat file, and is 512 MB by default in GridServer 4.2. It can be increased by changing the environment variable MAX_HEAP in the server.bat or server.sh file.
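The Resin thread rule of thumb from the first item above is simple arithmetic, sketched here as a small function. The function name recommendedResinThreads is hypothetical.

```cpp
// Hypothetical helper that applies the rule of thumb:
// Resin Threads = Maximum Messaging Connections
//               + Maximum Resource Download Connections + 50.
int recommendedResinThreads(int maxMessagingConnections,
                            int maxResourceDownloadConnections) {
    const int kHeadroom = 50;  // fixed extra threads from the rule of thumb
    return maxMessagingConnections + maxResourceDownloadConnections + kHeadroom;
}
```

For example, with Maximum Messaging Connections set to 200 as suggested above and, as an assumed value, Maximum Resource Download Connections set to 100, the function returns 350, so thread-max would be set to at least 350.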

Chapter 11

Diagnosing GridServer Issues

This chapter explains how to find the information needed to diagnose GridServer issues. It covers troubleshooting your installation and gathering information that will be helpful if you contact DataSynapse for support.

Troubleshooting

When troubleshooting a GridServer installation, try the following:

1. Search the GridServer Knowledge Base, located at customer.datasynapse.com. This contains known issues, including those that have occurred since the publication of this guide, and is updated frequently.

2. Check the state of your Grid:
• Check Engine Daemon state configuration.
• Is File Update enabled?
• Are Engine paths set as desired?

3. Read the log files, as described below.

Obtaining Log Files

There are several logs generated by GridServer. Depending on what kind of issue you are troubleshooting, you may need to examine one or more logs. These include Manager, Driver, Engine, and Engine Daemon logs.

Manager Logs

Manager logs are written to the console window on Windows machines if the Manager is not run as a service, or on Unix machines if the Manager is run in the foreground on the console. Because GridServer is usually run as a service or in the background, there are several other ways to view the Manager log:

• In the GridServer Administration Tool, from the Admin menu, select Current Log. This displays new lines of the log as they happen, in a new window. It doesn’t, however, display any historical information. Click the Snapshot button to open a frozen duplicate of the current log window.

• Also in the Administration Tool, from the Admin menu, select Diagnostics. This page enables you to search the Manager log and other logs, then display the results or create a .ZIP file of them. To view Manager Log results, select Manager Log in Choose Files, then select a time range in Choose Manager Log Date/Time. You can then display the log on-screen by clicking Display Below, display it in a new window with Display in Separate Popup Window, or save it in a compressed file with Create .ZIP File.

• The Manager log is available directly at manager_root/webapps/livecluster/WEB-INF/log/server/* or the location specified on the Manager Configuration page in the Logging section, on the Manager tab.



The Manager log can be set to different levels of granularity, ranging from Severe, which provides the least amount of logging information, to Finest, which logs the most information. By default, this level is set at Info. For debugging purposes, it may be necessary to set a more detailed level, such as Finer or Finest.

To change the log level:

1. In the GridServer Administration Tool, select the Manager tab.
2. Select Manager Configuration.
3. Select Logging.
4. In Default Debug Level, select a new level.

Engine and Daemon Logs

Each Engine and Engine Daemon generates its own logs. These can be accessed directly on Engines. However, because Engines are typically installed on several different machines, there are also methods to view logs remotely from other computers. The following procedures describe how to read Engine logs.

To read the log in a scrolling window:

1. In the GridServer Administration Tool, select the Engine tab.
2. Select the Engine Admin page.
3. From the Actions menu, select Remote Log.

This will open a window that displays the log for the Engine. As new logging information is generated, it is displayed. This does not, however, display any prior logging history.

To access previous logs:

1. In the GridServer Administration Tool, select the Engine tab.
2. Select the Engine Admin page.
3. From the Actions menu, select Log URL List.

This will open a window containing hyperlinks to each of the log files on the Engine. You can click on each link to remotely view each log. Note that if you open a log and then more Engine activity occurs, you will need to reload the log to view it.

To directly view log files, look in the following directories in each Engine install directory:

• Instance logs: work/name-instance/log/*
• Daemon logs: profiles/name/logs/engined.log
• Also examine other .log files in the Engine tree.

To change the log level for Engines:

1. In the GridServer Administration Tool, select the Engine tab.
2. Select the Engine Configuration page.
3. Select an Engine Configuration from the list.
4. In the Log section, select a new level in the Level list.
5. Change this setting in each Engine Configuration for which you want to change logging.



Driver Logs

Driver logs are displayed in the command or shell window when a Driver is running. They are also captured in the logs subdirectory of the working directory.

For SOAP access, including Web Service and Batches, an embedded Driver on the Manager is used: no local logs are generated.

Application Server Logs

The application server used to run the GridServer Manager also generates logs that can be helpful in diagnosing issues. For Resin, the logs are in manager_root/log/error.log.





Chapter 12

Administration Howto

This chapter contains several procedures that are commonly used when administering a GridServer Manager. Most of the tasks outlined below use the GridServer Administration Tool, which is also described in Chapter 6, “The GridServer Administration Tool” on page 35. The Administration Tool also has online help, which further describes each page’s features.

Backup / Restore

Backing up and restoring GridServer Managers requires little more than an OS-level file copy of the webapps/livecluster directory in your installation directory. On Director installations you may also have to use the database repair scripts to back up or restore the internal and reporting databases.

Backup Procedure

To back up a GridServer installation:

1. Archive (with tar or zip) or simply copy the [GS Manager Root]/datasynapse/webapps/livecluster directory. Exclude the subdirectories livecluster/dataTransfer and livecluster/localDriverDDT from your archive process.

Restore Procedure

To restore a GridServer installation:

1. Unpack the original GridServer Manager installation using WinZip or a similar tool for Windows. On a Unix system, do the following:

gzip -d -c GridServer_R4*gz | tar xvf -

2. Delete the livecluster directory from [GS Manager Root]/DataSynapse/webapps.
3. Copy the backup livecluster directory to [GS Manager Root]/DataSynapse/webapps.

Manager Configuration

Applying a patch or service pack to GridServer

To apply a patch or service pack to GridServer, do the following:

1. Shut down the GridServer Managers that will be updated.
2. Run the JAR file. The syntax for running the JAR is:

java -jar [Patch or Service Pack].jar [webapp_dir] [basedir1] [basedir2] ...




[webapp_dir] is the livecluster directory on your application server.

[basedirX] is the base directory for each Manager, if using alternate base dirs.

For example, to apply GridServer 3.2 patch 1 to GridServer 3.2 installed in c:\datasynapse:

java -jar GridServer-3_2-Patch1.jar C:\datasynapse\webapps\livecluster

Driver Upgrade:

Be sure to re-download the SDK and update all Drivers after a successful Manager update.

Note:

All files that are changed will be saved in the corresponding directory in [basedirX]\WEB-INF\uninstall.

For instance, the above example will save the old files in c:\datasynapse\webapps\livecluster\WEB-INF\uninstall\3_2-Patch1.

Importing and Exporting Manager Configuration

GridServer Managers support the ability to export the Director and Broker configurations and Engine configuration profiles into a signed JAR file format and later import this same format to migrate settings from one Manager to another. This can be used to migrate Engines from one Manager to another Manager without reconfiguring all of the Engines, to simplify administration of multiple Manager systems, or to disseminate an organization’s preferred default Engine configuration among all clusters in the organization.

To export a configuration:

1. In the GridServer Administration Tool, click the Admin tab and click the Import/Export page.
2. Select the configurations you would like to include in the JAR. This includes the Broker configuration, Director configuration, and any Engine configuration profiles.
3. Click Export.
4. A File Download dialog box appears. Click Save to save the jar file.

To import a configuration:

1. In the GridServer Administration Tool, click the Admin tab and click the Import/Export page.
2. Next to the Provide File for import box, click Browse.
3. Browse to the location of the jar file containing the GridServer Manager configuration export.
4. Click Upload to begin the import.
5. A list of configurations found in the JAR file will be displayed, with configurations highlighted in red if they will install over existing configurations. Select the configurations you wish to import, then click Import.

When completed, the Manager may need to be restarted for changes to take effect; in this case, a message will be displayed and the Manager will automatically shut down.

Installing Manager Licenses

Each GridServer Manager requires a valid license to function. Licenses are limited by date, hostname, and number of Engines. By default, a demo license for four Engines is included with each Manager, but for further evaluation or production use, you must obtain a license by contacting DataSynapse Support.


To view your Manager’s license information in the GridServer Administration Tool, click the Admin tab and click the License Information page.

A Manager license consists of a single XML file, and is typically sent by DataSynapse Support via email as an attached file. Licenses can also be downloaded at any time from the http://customer.datasynapse.com customer support site. You can inspect the license with a text editor to determine its capacity, but you should not make changes to the file.

To install the license:

1. In the GridServer Administration Tool, click the Admin tab, then click the License Information page.
2. Copy the .ser file that was an attachment in your email message from DataSynapse or from a download from the DataSynapse customer support site to a location accessible with your web browser (either a local directory or a shared directory).
3. Click Browse.
4. Find the license file and click Open.
5. Click Upload New License.

If the license file is valid, it will overwrite the existing license and changes will take place immediately. If it is expired, corrupt, or otherwise not valid, an error message will appear and your existing license will remain in place.

Setting the SMTP host

The GridServer Administration Tool can be configured to send notifications by email, using the Event Subscription page. To send the email, there must be an SMTP host configured for the Manager. This is typically configured during Manager installation, but you can later add or change the value.

To set the SMTP host:

1. Click the Manager tab.
2. Click Manager Configuration.
3. Click Admin.
4. In the Mail heading, in SMTP Host, enter the name of your SMTP server. For many organizations, this is simply mail.
5. In Contact Address, enter the email address of an administrative contact. A notification will be sent to this address when new users are added to the Administration Tool.
6. Click Save.

Setting Up a Failover Broker

In the fault-tolerant configuration, some Brokers can be set up as Failover Brokers. When a Broker is designated a Failover Broker, no Director will route Engines to that Broker unless there are no other active Brokers. When there are no Jobs waiting for Service on a Failover Broker and other Brokers in the Grid are available, the Failover Broker will “kick off” idle Engines, causing the Engines to log in to their Primary Director and get reassigned to a non-Failover Broker in the Grid. By default, all Brokers are non-Failover Brokers (they load-balance work). Designate one or more Brokers within the Grid as Failover Brokers when you want those Brokers to remain idle during normal (non-failure) operation.





To set up a Failover Broker:

1. Log in to the GridServer Administration Tool.
2. Click the Admin tab, then click the Manager Reconfigure page.
3. Go through each configuration step. In the third step, set Broker to Failover.
4. After completing the eight steps of the Manager Reconfigure, click Start Installation. This will reinstall GridServer and restart the Broker as a Failover Broker.

Configuring SNMP

The ServerEvent API supports the generation of SNMP traps on a per-event basis. For example, events such as ‘Job Cancelled’ and ‘Engine Died’ can be sent as traps to an SNMP monitoring station. The SNMP interface can be administered through an administrative plugin on the GridServer Manager. The traps themselves are defined in the GridServer application MIB.

To configure and enable SNMP support for your Manager:

1. In the Administration Tool, click the Admin tab and click SNMP Configuration.
2. Enter the hostname and port of your SNMP server in the Host and Port fields, then click Add.
3. If you have multiple SNMP servers, repeat step 2 for each server.
4. In SNMP Version, select the version of the SNMP protocol your servers use.
5. Select each event in the event list for which you would like to have a trap generated.
6. Click the Manager tab, click Manager Configuration, and click Admin.
7. In the SNMP section, set enabled to True for the Broker, Director, or both.

The GridServer MIB can be found in [GS Manager Root]/webapps/livecluster/WEB-INF/etc/snmp.

Some SNMP events generate traps from the Broker, while others generate traps from the Director. The following is a list of events that generate traps, sorted by Broker or Director:

Broker Trap Events      Director Trap Events

DriverAddedEvent        BrokerAddedEvent
DriverRemovedEvent      BrokerRemovedEvent
EngineAddedEvent        EngineDaemonAddedEvent
EngineDiedEvent         EngineDaemonRemovedEvent
EngineRemovedEvent      RemoteDatabaseBackupFailure
JobCancelled            LocalDatabaseBackupFailure
JobFinished             ServerStartedEvent
JobRunning
ServerStartedEvent
TaskFailed


Enabling Enhanced Task Instrumentation

Normally, a submitted task or remote Service Invocation’s execution time is measured only from start to finish. But often it is useful to be able to track the time spent in the various stages of this process, including input serialization, disk writing, task message submission, task queueing, task fetching, data transport, input deserialization, task processing, output serialization, output transport, queuing, and so on. This will allow you to understand the timing characteristics of distributed computing, optimize the process, and diagnose problems with greater ease.

To enable enhanced task instrumentation:

1. In the Administration Tool, click the Manager tab, click Manager Configuration, then click Services.
2. In Instrumentation, set Enable to True.
3. Click Save.

When enabled, task instrumentation applies to all Services on the Manager.

To view data generated by enhanced task instrumentation:

1. Click the Services tab, and click Service Session Admin.
2. Find the Service you wish to view, and select View Instrumentation from the Actions menu. Note that this choice will only appear after the Service has finished running.

A new window will open, displaying a table of data collected by enhanced task instrumentation for the Service. For more information on instrumentation, see Appendix A, “Task Instrumentation” on page 105 of the GridServer Developer’s Guide.

Engine Management

Deploying Files to Engines

Directory Replication enables you to coordinate and synchronize files from your Manager to Engines. You can use this to ensure that Engines all have the latest version of a library, file, data set, or other resources needed to complete work.

By default, the Directory Replication mechanism automatically looks for files in a predefined directory, typically deploy/resources within the livecluster directory on the Manager. This directory contains six subdirectories: one for each supported OS, plus a shared directory that is replicated to all OSes. During each check of the directory (by default, once per minute), if the mechanism notices changes, it sends the new files to each Engine. It also forces the Engine to log out and log back in. This interrupts any current work, but it also ensures that work isn’t completed with incorrect libraries or data.

You can also manually trigger a file update to ensure all Engines have the same files.

WARNING Task instrumentation will slow down the Manager, and also requires additional disk space, so it is important to disable it after you have completed using it. It is NOT recommended for production systems.




To upload files and manually trigger an update:

1. Click the Services tab.
2. Click Resource Deployment.
3. Add your files to the Manager by clicking directory names to navigate to a directory. Then click Browse to find a file on your PC, and Upload to upload it to the Manager.
4. You can also place files in the livecluster/deploy/resources directories on the Manager. There are OS-specific directories for Engines running on Linux, Solaris, and Win32 machines, and a shared directory which is copied to all Engines.
5. Click the Update button.

Updating the Windows Engine JRE

By default, the 1.4.2_03 JRE is used for Windows Engines. You can change what version of the JRE is used. For Windows, the JRE used resides on the Manager, and is updated on Engines, so you only need to change the JRE once on the Manager.

It is not necessary to re-install the Engines after adding the new JRE because they will update themselves automatically.

Note that when downloading a new JRE from Sun, you should download the SDK and use the JRE contained within that package. There is also a downloadable JRE package, but the JRE it contains does not contain the server version of a library required for Engines to run.

To change the JRE version:

1. Unpack the JRE you wish to use into a temporary directory. Also, ensure that the JRE you have is the Server version (included in the Java JDK) and not the client version (from the standalone JRE).

2. Download the Java Cryptography Extension (JCE) from Sun at http://java.sun.com/j2se/1.4.2/download.html (at the bottom, under “Other Downloads”). This download contains the two files local_policy.jar and US_export_policy.jar which should be copied into the jre/lib/security directory.

3. Create a ZIP file of the directory containing the JRE files and additional files.
4. Replace the public_html/register/install/jre/jre.ZIP on the Manager with the ZIP file of the JRE you created. Note: the file name is case-sensitive, and ZIP must be uppercase. If you are using an alternate base directory, a read-only installation, or running multiple Managers on one machine, make sure to copy this file into the same location on each DS_BASEDIR directory.
5. Open the file engineUpdate/Win32/jre.dat in your GridServer distribution with a text editor.
6. Replace the 1.4.2_03 with the version number of the JRE you wish to use.

Updating the Unix Engine JRE

By default, the 1.4.2 JRE is used for Unix Engines. You can change what version of the JRE is used. For Unix, there is no update mechanism similar to the one for Windows, so you need to update the JRE on each Engine.



Note that when downloading a new JRE from Sun, you should download the SDK and use the JRE contained within that package. There is also a downloadable JRE package, but the JRE it contains does not contain the server version of a library required for Engines to run.

To change the JRE version:

1. Shut down any running daemons:

engine.sh stop

2. Change directories to the Engine home directory on the machine running the Engine, for example, DSEngine.

3. Move the current JRE to a new directory:

mv jre jre1_4_2

4. Unarchive the desired JRE into a new directory, such as jre1_4_3.
5. Download the Java Cryptography Extension (JCE) from Sun at http://java.sun.com/j2se/1.4.2/download.html (at the bottom, under “Other Downloads”). This download contains the two files local_policy.jar and US_export_policy.jar which should be copied into the jre/lib/security directory.
6. Symlink the desired JRE to jre:

ln -s jre1_4_3 jre

Setting the Director Used by Engines

The primary and secondary Directors for an Engine are set during Engine installation. You can later change the Directors to which an Engine reports by changing the Engine Configuration used by the Engine.

To configure an Engine’s Directors:

1. Log in to the GridServer Administration Tool.
2. Click the Engine tab, then click the Engine Configuration page.
3. Select the Engine distribution used by the Engine. This is typically the operating system of the Engine.
4. Go to the Directors and Brokers heading and change Primary Director URL and Secondary Director URL to the corresponding addresses and ports of the primary and secondary Directors, in the format http(s)://address:port.

Note that this will change the Directors for all Engines using that Engine distribution.

Running Services

Running MPI Jobs using PDriver

PDriver, the Parametric Job Driver, has support for running MPI Jobs. The following two options in the PDS language supported by PDriver are used when running MPI:

mpiEnabled - Boolean switch which indicates the job is to be run in MPI mode. An MPI mode job is based on a groupsize (see below), with each “group step” treated as a single step of the job. If a single task in an MPI job fails, all other tasks in that “group step” are rescheduled.



mpiGroupsize - The number of nodes used in each MPI group step. The number of tasks for the job must be evenly divisible by this setting.

For more information on writing PDS scripts for PDriver, see Chapter 6, “PDriver” on page 49 of the GridServer Developer’s Guide.
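The divisibility requirement on mpiGroupsize can be checked before a job is submitted. The sketch below is a hypothetical helper, not part of PDriver or the PDS language; it assumes the number of group steps is simply the task count divided by the group size.

```cpp
#include <stdexcept>

// Hypothetical check, not part of PDriver: returns the number of MPI group
// steps for a job, or throws if taskCount is not evenly divisible by
// mpiGroupsize as the option requires.
int mpiGroupSteps(int taskCount, int mpiGroupsize) {
    if (mpiGroupsize <= 0 || taskCount < 0) {
        throw std::invalid_argument("taskCount must be >= 0 and mpiGroupsize > 0");
    }
    if (taskCount % mpiGroupsize != 0) {
        throw std::invalid_argument("taskCount must be evenly divisible by mpiGroupsize");
    }
    return taskCount / mpiGroupsize;
}
```

For example, 12 tasks with an mpiGroupsize of 4 give 3 group steps, while 10 tasks with the same group size are rejected.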

Registering a Service Type

To use a Service, you must first register a Service Type from the GridServer Administration Tool.

To register a Service Type:

1. Log in to the GridServer Administration Tool.
2. Click the Services tab, then click the Service Type Registry page.
3. A list of existing Service Types appears on that page, along with a line for adding a new Service Type.
4. Enter the Service Type Name on the blank line.
5. Select the Service Implementation, then click Add. A window with several options appears after clicking the Add button.
6. For Java Service Types, enter the fully qualified class name for the service; for .NET, dynamic libraries, or commands, enter the classname plus assembly name, library name, or command line, respectively. The window also allows you to enter options for the Service Type.

Note that after you register a Service Type, you must deploy the implementation to your Engines.

Creating and Running a Batch

To run a Batch, you must first create a Batch Definition, which contains components that specify the schedule used by a Batch and what Services or commands are executed.

To edit a Batch Definition:

1. In the Administration Tool, click the Batch tab, and click Batch Registry.
2. Type a name for your Batch Definition in the blank box at the bottom of the list and click Add.
3. The Batch Editor dialog box will open. You can type values for the Batch and Schedule components, and add additional components.

For example, to create a simple Batch Definition named NightlyBatch that runs a registered Service at midnight, do the following:

1. In the Schedule object, select a type of cron.
2. In the Cron subheading, enter 0 for minute and hour. This specifies a starting time of 00:00 on a daily basis, in cron format. You could change the values here to select a different time pattern, or select a type of absolute to enter times in a string, like Sat, 12 Aug 1995 13:30:00 GMT.
3. In the Add Component list, select ServiceCommand.
4. In the ServiceCommand component, select a Service Type from the ServiceName list. This is a list of all Service Types currently registered on your Manager.
5. In the ServiceCommand object, enter a MethodName, initData, inputData, or any other values that will be needed by your Service.
6. Click Save.
7. In the Actions control next to your Batch Definition in the list, select Schedule Batch Definition.
8. Your Batch will now be on the Manager and viewable in the Batch Schedule page on the Batch tab. It will wait until midnight, and then run the specified Service. When the Batch is running, you can monitor it on the Batch Admin page.

Creating a native stack trace in Linux

When troubleshooting native C/C++ code on Linux, you may want to generate a stack trace, for example when a SIGSEGV is raised. Since the JVM on the Engine already traps SIGSEGV and prints out a Java (not native) stack trace, you need to override the actions of the JVM and install your own SIGSEGV handler for debugging. The backtrace() and backtrace_symbols_fd() functions from glibc can be used for this purpose.

To install your own SIGSEGV handler for debugging, add code to your tasklet or service initialization method similar to this:

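#include <execinfo.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

#define TRACE_DEPTH 50

// Must be declared static in MyService so that it can be passed to signal().
void MyService::segv_handler(int signum) {
    void *trace[TRACE_DEPTH];
    // Capture up to TRACE_DEPTH return addresses from the current stack.
    int depth = backtrace(trace, TRACE_DEPTH);
    FILE *fp = fopen("trace.log", "w");
    if (fp != NULL) {
        // Write the symbolic stack frames directly to the file descriptor.
        backtrace_symbols_fd(trace, depth, fileno(fp));
        fclose(fp);
    }
    abort();
}

void MyService::init() {
    // Replace the JVM's handlers for the signals of interest.
    signal(SIGSEGV, segv_handler);
    signal(SIGBUS, segv_handler);
}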

Attaching GDB to Engine native code on Linux

GDB can be used to debug native code in cppdriver or JNI in Linux. Also, GDB can be useful in identifying unusual problems with the Linux JVM. However, there are some subtle issues when trying to use GDB on a JVM, as is the case with the GridServer Engine.

First, when attaching GDB to the Engine, you must set LD_LIBRARY_PATH to include both the Engine components and the JVM components. You must also obtain the process ID of a running “invoke” process from the ps command. It is also somewhat easier to run GDB from the base directory of the Engine install (typically DSEngine). The GDB command used is something like:

LD_LIBRARY_PATH=lib:jre/lib/i386:jre/lib/i386/native_threads:jre/lib/i386/server:resources/lib/linux gdb bin/invoke $INVOKEPID




This method of running GDB works well for troubleshooting those rare JVM problems. However, troubleshooting cppdriver code requires a little more finesse. The issue is that cppdriver loads your application shared objects only when the tasklet or service is instantiated, so it becomes difficult to set a breakpoint in the application shared object. Further, attaching GDB to a running JVM often has undesired side effects, including crashing the JVM, depending on the versions of JVM, pthreads, and GDB being used.

One technique that works in this instance is to have your application tasklet or service method include some conditional code to enter a loop checking some variable value that is never changed by the application code, effectively creating an infinite loop. When you need to attach GDB, trigger the conditional that causes the loop to be entered on the next invocation. Then attach GDB as above. You’ll see that the invoke process is stopped while running in the loop. At that point you can change the loop evaluation value so that the infinite loop is exited, and the code will continue to your breakpoint where you can continue debugging.
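A minimal sketch of this wait-loop technique follows. All names here (g_waitForDebugger, g_debuggerReleased, waitForDebuggerIfRequested) are hypothetical, and how the first flag gets set, for example from a configuration value or input parameter, is left to the application.

```cpp
#include <unistd.h>

// Hypothetical names throughout. Both flags are volatile so the compiler
// re-reads them on every pass and a debugger can change them in place.
volatile int g_waitForDebugger = 0;   // set to 1 by the application to arm the loop
volatile int g_debuggerReleased = 0;  // never written by application code

// Call at the top of the tasklet or service method you want to debug.
void waitForDebuggerIfRequested() {
    if (!g_waitForDebugger) {
        return;  // normal operation: no delay
    }
    while (!g_debuggerReleased) {
        sleep(1);  // idle until the debugger sets g_debuggerReleased to 1
    }
}
```

Once GDB is attached and the process is stopped inside the loop, set your breakpoint with break, release the loop with set var g_debuggerReleased = 1, and resume with continue.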

Logging messages from a Native service to the Engine log

To log messages to the Engine log file, use the UtilFactory::log method. See the C++ API documentation for more information. Alternatively, you may redirect your standard out to a separate log file. See the “Redirecting Engine Output” section in “Log Overview” on page 21 of the GridServer Developer’s Guide for more details.

Also, if you’re using C++ via JNI from a Java Tasklet and Linux Engines, you can log to stderr and it will appear in profiles/.../engine.x.log.

Note that JNI C++ code cannot write to standard out on Windows Engines.

Running a .NET Driver from an Engine Service

To run a .NET Driver from an Engine Service, you must first deploy the driver.properties file to the Engine, and then configure the Engine to use the new file. To do this:

1. If you haven’t already downloaded a copy of the driver.properties file, log in to the GridServer Administration Tool, click the Driver tab, click SDK download, and download the driver.properties to your local machine.

2. In the GridServer Administration Tool, on the Services tab, click the Resource Deployment page.
3. Navigate to the resources\shared\config directory.
4. Click Browse and find your local copy of the driver.properties file, and click Upload.
5. On the Engine tab, click the Engine Configuration page.
6. Select the configuration that your Engines are currently configured to use or create a new one. If you create a new configuration, remember to change the Engines to use that configuration before you test.

7. In the configuration editing screen there will be a section called Properties. In the Environment Variables section, change the value of DSDRIVER_DIR to .\resources\shared\config.

8. Click Save.


Configuration Issues

Installation on Dual-Interface Machines

In some network configurations, a machine may have more than one network interface, and a GridServer component may default to using the incorrect interface. This can be corrected by configuring the component to use the correct interface.

Drivers: To configure the Driver to use a different network interface, set the DSLocalIPAddress property to the IP address of the correct interface. For example:

DSLocalIPAddress=192.168.12.1

Engines: To configure the Engine to use a different network interface, select the Engine Configuration that will be used by the Engine on the Engine Configuration page, and set the Net Mask value under the File Server heading to match the network range on which the Engine should run.

Configuring the timeout period for the Administration Tool

For security purposes, the GridServer Administration Tool will time out and require users to log in again. By default, the timeout period is 60 minutes.

To change the timeout period, log in to the Administration Tool and click the Manager tab. Click Manager Configuration, and then click Security. In the Admin User Management section, type a time in seconds in the Admin Browser Timeout box (for example, 3600 for 60 minutes).

Reconfiguring Managers when Installing a secondary Director

When you install a Manager that includes a secondary Director, you must also configure the Manager containing the primary Director. This will register the secondary Director’s address with the primary Director, as well as reconfigure the Engine and Driver configurations.

To reconfigure the Manager containing the primary Director, click the Admin menu, click Manager Reconfigure, and enter the secondary Director’s address and port on the corresponding page. This configures the primary Director to recognize the secondary Director and updates the Engine and Driver configurations accordingly.

Using UNC paths in a driver.properties file

It is possible to use UNC paths to specify a hostname or directory within a driver.properties file. However, you will need to change all backslashes (\) to forward slashes (/) in the path.

For example, to change the input directory for task (Job) inputs to the UNC path \\homer\job1-dir, change the following line:

DSWebserverDir=./ds-data

to this:

DSWebserverDir=//homer/job1-dir
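If many paths need converting, the substitution can be scripted. The following is a minimal sketch; the helper name unc_to_properties_path is hypothetical and is not part of any GridServer tool.

```python
def unc_to_properties_path(unc_path):
    """Replace every backslash with a forward slash, as driver.properties requires."""
    return unc_path.replace("\\", "/")

# The raw string keeps the backslashes of the UNC path literal.
print(unc_to_properties_path(r"\\homer\job1-dir"))  # prints //homer/job1-dir
```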





Chapter 13

Database Administration

Introduction

Each GridServer Manager has an embedded database running on each Director. This internal, or admin, database stores administrative data, such as User, Engine, Driver, and Broker information. An external reporting database can be used to log events and statistics. By default, GridServer is not configured with a reporting database; the included HSQLDB or a different external reporting database can be used.

Database Types

There are two databases used by the GridServer Manager, each of which is described below.

The Reporting Database

The external reporting database is optionally used to store events and statistics, which, depending on configuration settings, can grow fairly quickly. Use a robust external database if you plan to make extensive use of the reporting capabilities. The specific types of data that are stored in the reporting database are configurable in the Database section of the Manager Configuration page. The external database can be installed on any machine, provided that the Manager is able to create connections to the database through a protocol such as JDBC.

For information on installing an external database for the reporting database, see Appendix B, “Database Configuration” on page 61 of the GridServer Installation Guide.

The Internal Database

GridServer’s internal database stores admin data such as User, Engine, Driver, and Broker information. In typical cases, the internal database is read at Manager startup, and only written to thereafter if user-driven admin events occur, such as adding a user, Engine, Broker, or Driver profile. The internal database is required in order to start the Manager. If it becomes unavailable or corrupt, the Manager will continue to function, but it cannot be restarted until the database is available again. This database is an embedded component of the GridServer software.

Internal Database Backup

The internal database used by GridServer is automatically backed up at a regular interval. The database is backed up to the [GS Manager Root]/webapps/livecluster/WEB-INF/db/internal/backup directory and is also replicated to the secondary Director if one is installed.

Backups take place based on the Backup Cron configuration option, located in the Database section of the Manager Configuration page on the Manager tab in the GridServer Administration Tool. The cron setting is similar to traditional Unix cron settings. It is a string of the form “minute, hour, day of month, month, day of week, year”. A field set to -1 matches every value of that field, which makes the backup repeat. For instance, a setting of “00,23,-1,-1,-1,-1” means the backup will occur daily at 11 PM. A setting of “00,23,1,-1,-1,-1” means the backup will occur on the first of every month at 11 PM.

Ranges are as follows:

| Name | Description |
|---|---|
| minute | Minute of the backup. Allowed values 0-59. |
| hour | Hour of the backup. Allowed values 0-23. |
| dayOfMonth | Day of month of the backup (-1 if every day). This attribute is exclusive with dayOfWeek. Allowed values 1-31. If both dayOfMonth and dayOfWeek are restricted, each backup will be scheduled for the earlier match. |
| month | Month of the backup (-1 if every month). Allowed values 0-11 (0 = January, 1 = February, ...). java.util.Calendar constants can be used. |
| dayOfWeek | Day of week of the backup (-1 if every day). This attribute is exclusive with dayOfMonth. Allowed values 1-7 (1 = Sunday, 2 = Monday, ...). java.util.Calendar constants can be used. If both dayOfMonth and dayOfWeek are restricted, each backup will be scheduled for the earlier match. |
| year | Year of the backup. When this field is not set (i.e. -1), the backup is repetitive (i.e. it is rescheduled when reached). |

NOTE: Database backups can be very resource-intensive. It’s advisable to schedule them to occur during off-peak hours when your Grid usage is minimal.
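To make the Backup Cron format concrete, the sketch below splits a setting into its six named fields and checks each value against the ranges listed above. It is an illustration only: parse_backup_cron and FIELDS are hypothetical names, GridServer's own parser may behave differently, and no range is enforced for year because none is documented.

```python
# Field order and allowed ranges as documented for the Backup Cron option.
FIELDS = [
    ("minute", 0, 59),
    ("hour", 0, 23),
    ("dayOfMonth", 1, 31),
    ("month", 0, 11),      # 0 = January
    ("dayOfWeek", 1, 7),   # 1 = Sunday
    ("year", None, None),  # no range is documented for year
]

def parse_backup_cron(setting):
    """Split a Backup Cron string into named integer fields and range-check them."""
    parts = [p.strip() for p in setting.split(",")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    parsed = {}
    for (name, low, high), text in zip(FIELDS, parts):
        value = int(text)
        # -1 means "every value", so it bypasses the range check.
        if value != -1 and low is not None and not (low <= value <= high):
            raise ValueError(f"{name}={value} is outside {low}-{high}")
        parsed[name] = value
    return parsed

daily = parse_backup_cron("00,23,-1,-1,-1,-1")    # daily at 11 PM
monthly = parse_backup_cron("00,23,1,-1,-1,-1")   # first of every month at 11 PM
print(daily["hour"], monthly["dayOfMonth"])       # prints 23 1
```

Note that the month field is zero-based, so a backup restricted to December would use 11, not 12.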


Appendix A

The grid-library.dtd

Introduction

The grid-library.xml configuration file in the root of a Grid Library must be a well-formed XML file. The GridServer SDKs include a grid-library.dtd file that can be used to validate the XML file. The DTD is also shown below.

Example A.1: grid-library.dtd

<?xml version="1.0" encoding="ISO-8859-1"?>

<!ELEMENT grid-library (grid-library-name, grid-library-version?, dependency*, conflict*, jar-path*, lib-path*, assembly-path*, command-path*, hooks-path*, environment-variables*, java-system-properties*)>
<!ATTLIST grid-library jre (true|false) "false">
<!ATTLIST grid-library bridge (true|false) "false">
<!ATTLIST grid-library os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST grid-library compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT grid-library-name (#PCDATA)>

<!ELEMENT grid-library-version (#PCDATA)>

<!ELEMENT dependency (grid-library-name, grid-library-version?)>

<!ELEMENT conflict (grid-library-name)>

<!ELEMENT jar-path (pathelement*)>
<!ATTLIST jar-path os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST jar-path compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT pathelement (#PCDATA)>

<!ELEMENT lib-path (pathelement*)>
<!ATTLIST lib-path os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST lib-path compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT assembly-path (pathelement*)>
<!ATTLIST assembly-path os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST assembly-path compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT command-path (pathelement*)>
<!ATTLIST command-path os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST command-path compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT hooks-path (pathelement*)>
<!ATTLIST hooks-path os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST hooks-path compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT environment-variables (property*)>
<!ATTLIST environment-variables os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST environment-variables compiler (gcc2|gcc3|gcc34) #IMPLIED>

<!ELEMENT property (name,value)>

<!ELEMENT name (#PCDATA)>

<!ELEMENT value (#PCDATA)>

<!ELEMENT java-system-properties (property*)>
<!ATTLIST java-system-properties os (win32|solaris|solarisX86|linux|linux64|plinux) #IMPLIED>
<!ATTLIST java-system-properties compiler (gcc2|gcc3|gcc34) #IMPLIED>
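As an illustration of what the DTD implies for a grid-library.xml file, the sketch below holds a small sample document and checks that it is well-formed and that the children of grid-library appear in the order the content model requires. Everything in the sample (library names, version, paths, the EXAMPLE_HOME variable) is made up, and check_grid_library is a hypothetical helper. Python's standard library does not perform DTD validation, so this covers only the top-level ordering rule, not the full DTD.

```python
import xml.etree.ElementTree as ET

# Child element order required by the grid-library content model in the DTD.
CHILD_ORDER = [
    "grid-library-name", "grid-library-version", "dependency", "conflict",
    "jar-path", "lib-path", "assembly-path", "command-path", "hooks-path",
    "environment-variables", "java-system-properties",
]

# Hypothetical library description; all values are placeholders.
SAMPLE = """<?xml version="1.0" encoding="ISO-8859-1"?>
<grid-library os="linux">
  <grid-library-name>example-lib</grid-library-name>
  <grid-library-version>1.0</grid-library-version>
  <dependency>
    <grid-library-name>example-base</grid-library-name>
  </dependency>
  <jar-path>
    <pathelement>jars</pathelement>
  </jar-path>
  <environment-variables>
    <property>
      <name>EXAMPLE_HOME</name>
      <value>data</value>
    </property>
  </environment-variables>
</grid-library>
"""

def check_grid_library(xml_bytes):
    """Return a list of problems; an empty list means these basic checks passed."""
    root = ET.fromstring(xml_bytes)  # raises ParseError if the XML is not well-formed
    if root.tag != "grid-library":
        return [f"root element is {root.tag!r}, expected 'grid-library'"]
    tags = [child.tag for child in root]
    unknown = [t for t in tags if t not in CHILD_ORDER]
    if unknown:
        return [f"unexpected elements: {unknown}"]
    problems = []
    positions = [CHILD_ORDER.index(t) for t in tags]
    if positions != sorted(positions):
        problems.append("child elements are not in the order the DTD requires")
    if tags.count("grid-library-name") != 1:
        problems.append("exactly one grid-library-name is required")
    if tags.count("grid-library-version") > 1:
        problems.append("at most one grid-library-version is allowed")
    return problems

print(check_grid_library(SAMPLE.encode("iso-8859-1")))  # prints []
```

For full validation against grid-library.dtd, a validating XML parser is needed.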





Appendix B

Reporting Database Tables

Introduction

GridServer uses a simple relational database to report Grid processing events for historical analysis. This appendix describes the tables in the reporting and internal databases for use by external programs.

Batches

Batches that have been scheduled or executed.

Database: reporting

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| server | Varchar | Manager where the Batch resided or ran |
| batch_id | Bigint | Unique ID number of the Batch Entry |
| time_stamp | Timestamp | Timestamp of the event |
| event | Int | Event code |
| class | Varchar | Class in the Batch |
| execution_id | Bigint | Unique ID number of the Batch Execution, if applicable |
| description | Longvarchar | Description of the Batch Event |

Brokers

Table of all Brokers that have participated in this Grid.

Database: internal

Primary key: broker_id

| Column name | Data type | Description |
|---|---|---|
| broker_id | Int | Broker ID # |
| broker_url | Varchar | Broker’s configured base URL |
| weight_0 | Float | Engine weight for Broker routing |
| weight_1 | Float | Driver weight for Broker routing |
| discriminator_0 | Longvarchar | Engine discriminator for Broker routing* |
| discriminator_1 | Longvarchar | Driver discriminator for Broker routing* |
| broker_name | Varchar | Name of the Broker |
| shared_brokers | Longvarchar | Comma-delimited list of Brokers that share Engines with this Broker |
| min_engines | Int | Minimum number of Engines allowed on the Broker |
| max_engines | Int | Maximum number of Engines allowed on the Broker |

* Stored as XML object

Broker_stats

All statistic reports from Brokers are stored in this table.

Database: reporting

Primary key: broker_id + timestamp

| Column name | Data type | Description |
|---|---|---|
| broker_id | Int | The unique ID of the Broker |
| time_stamp | Timestamp | Timestamp of the report |
| num_busy_engines | Int | Number of Engines busy at report time |
| num_total_engines | Int | Number of Engines logged in at report time |
| num_drivers | Int | Number of Drivers logged in at report time |
| uptime_minutes | Float | Time since Broker start in minutes |
| num_jobs_running | Int | Number of jobs running at report time |
| num_tasks_pending | Int | Number of tasks pending (not yet assigned to Engines) at report time |

Driver_events

Brokers report when a Driver logs in or out.

Database: reporting

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| username | Varchar | Driver user name |
| hostname | Varchar | Hostname Driver is running on |
| broker_id | Int | ID of Broker where event occurred |
| event | Int | 0 for an add, or the reason code for a remove; map these codes to the event_codes table |

Driver_profiles

Profiles that can be used by Drivers.

Database: internal

Primary key: name

| Column name | Data type | Description |
|---|---|---|
| name | Varchar | Profile name |
| driver_properties | Longvarchar | Internal properties* |
| permission_properties | Longvarchar | Permissions* |
| description_discriminator | Longvarchar | Job description discriminator* |

* Stored as XML object

Driver_users

Driver users for internal use.

Database: internal

Primary key: username

| Column name | Data type | Description |
|---|---|---|
| username | Varchar | Driver username |
| password | Varchar | Driver password |
| hostname | Varchar | Hostname Driver is on |
| profile | Varchar | Driver profile used by Driver |
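An external program might use the Broker_stats columns to derive Engine utilization. The sketch below is one way to do that. It uses an in-memory SQLite database purely as a stand-in for the reporting database, the sample rows are invented for illustration, and average_utilization is a hypothetical helper. A real deployment would connect to the configured reporting database instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real reporting database
conn.execute("""
    CREATE TABLE broker_stats (
        broker_id INTEGER, time_stamp TIMESTAMP,
        num_busy_engines INTEGER, num_total_engines INTEGER,
        num_drivers INTEGER, uptime_minutes REAL,
        num_jobs_running INTEGER, num_tasks_pending INTEGER
    )
""")
# Hypothetical sample reports, not real measurements.
conn.executemany(
    "INSERT INTO broker_stats VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "2006-01-01 10:00:00", 5, 10, 2, 60.0, 3, 40),
        (1, "2006-01-01 10:05:00", 10, 10, 2, 65.0, 4, 90),
        (2, "2006-01-01 10:00:00", 0, 4, 0, 12.0, 0, 0),
    ],
)

def average_utilization(connection):
    """Average busy/total Engine ratio per Broker, skipping reports with no Engines."""
    rows = connection.execute("""
        SELECT broker_id,
               AVG(CAST(num_busy_engines AS REAL) / num_total_engines)
        FROM broker_stats
        WHERE num_total_engines > 0
        GROUP BY broker_id
        ORDER BY broker_id
    """).fetchall()
    return dict(rows)

print(average_utilization(conn))  # prints {1: 0.75, 2: 0.0}
```

The WHERE clause guards against division by zero for reports taken while no Engines were logged in.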


Engine_events

The Brokers report when an Engine is added or removed; for example, when an Engine logs in or logs out.

Database: reporting

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| engine_id | Bigint | The unique ID of the Engine |
| broker_id | Int | ID of Broker where event occurred |
| event | Int | 0 for an add, or the reason code for a remove; map these codes to the event_codes table |

Engine_info

This table contains administrative information for all Engines that have ever logged in to this Director.

Database: internal

Primary key: engine_id

| Column name | Data type | Description |
|---|---|---|
| engine_id | Bigint | The unique ID of the Engine |
| username | Varchar | The username used by the Engine |
| guid | Varchar | Another unique identifier, such as a MAC address |
| IP | Varchar | The IP address used by the Engine |
| install_date | Timestamp | When the Engine was installed |
| last_logon_date | Timestamp | When the Engine last logged on* |
| last_file_update_date | Timestamp | The last successful file update to the Engine* |
| properties | Longvarchar | Administratively defined Engine properties** |

* Deprecated; fields are no longer updated

** Stored as XML object

Engine_stats

All statistic reports from Engine Daemons are stored in this table.

Database: reporting

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| engine_id | Bigint | The unique ID of the Engine |
| cpu_utilization | Float | %CPU total utilization |
| ds_cpu_utilization | Float | %CPU utilized by DataSynapse processes |
| total_ram_kb | Bigint | Installed RAM reported by the OS in kilobytes |
| free_ram_kb | Bigint | Free RAM reported by the OS in kilobytes |
| disk_mb | Bigint | Free disk reported by the OS in megabytes |
| num_invokes | Int | Number of Engine processes currently running |

Event_codes

Table mapping event codes to reasons.

Database: reporting or internal

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| code | Int | Numeric code |
| name | Varchar | Description |

Job_status_codes

Table mapping numeric job status codes to descriptive text.

Primary key: none

Jobs

Historical information about all jobs that have been run by GridServer.

Database: reporting

Primary key: job_id+start_time

| Column name | Data type | Description |
|---|---|---|
| job_id | Bigint | Job ID |
| service_type_name | Varchar | The Service Type used for the Service |
| job_class | Varchar | Java or pseudo-Java class used to create the job on the client |
| start_time | Timestamp | When job was started |
| end_time | Timestamp | When job finished |
| job_status | Int | Job status (see job_status_codes table) |
| num_tasks | Int | Number of tasks in the job |
| task_time_std | Float | Standard deviation of task completion time |
| task_time_avg | Float | Mean task completion time |
| priority | Int | Job priority when submitted |
| end_priority | Int | Job priority when complete |
| driver_username | Varchar | Submitting Driver username |
| driver_hostname | Varchar | Submitting Driver hostname |
| job_name | Varchar | Optional descriptive job name from JobDescription |
| app_name | Varchar | Optional descriptive application name from JobDescription |
| description | Varchar | Optional descriptive description from JobDescription |
| dept_name | Varchar | Optional descriptive department name from JobDescription |
| group_name | Varchar | Optional descriptive group name from JobDescription |
| indiv_name | Varchar | Optional descriptive individual name from JobDescription |
| broker_id | Int | ID of Broker that ran the job |

Job_discriminators

Table of Job-based discriminators.

Database: internal

Primary key: name

| Column name | Data type | Description |
|---|---|---|
| name | Varchar | Name of discriminator |
| description_discriminator | Longvarchar | Discriminator on Job description to determine whether to attach job discriminator* |
| job_discriminator | Longvarchar | Engine discriminator for service |

Properties

Properties used by the Manager for its internal processing.

Database: internal

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| name | Varchar | The property name |
| value | Longvarchar | The property value as an XML object |

Tasks

Historical information about all tasks that have been run by GridServer.

Database: reporting

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| job_id | Bigint | Job ID |
| task_id | Int | Task ID |
| engine_id | Bigint | Engine that (finally) ran task |
| start_time | Timestamp | When task was started |
| end_time | Timestamp | When task finished |
| task_status | Int | Task status (see task_status_codes table) |
| num_reschedules | Int | Number of times task was retried |
| engine_instance | Int | Number of Engine instance that ran task |
| task_info | Varchar | Task information |

Task_status_codes

Table mapping numeric task status codes to descriptive text.

Primary key: none

Users

Administrative users for internal use.

Database: internal

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| username | Varchar | User name |
| user_access | Int | Authorized role |
| user_info | Longvarchar | Various internal info about the user |
| personalization | Longvarchar | UI personalization* |

User_events

Table stores historical user events.

Database: reporting

Primary key: none

| Column name | Data type | Description |
|---|---|---|
| server | Varchar | Server where event occurred |
| username | Varchar | User recording event |
| time_stamp | Timestamp | When event occurred |
| handler | Varchar | Internal handler class that recorded event |
| event | Longvarchar | Description of event |
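As an example of how an external program could combine the Jobs and Tasks tables, the sketch below counts tasks and retries per job by joining on job_id. It is illustrative only: an in-memory SQLite database stands in for the reporting database, only a subset of the documented columns is created, the rows and job names are invented, the status value 0 is a placeholder rather than a documented code, and tasks_per_job is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real reporting database
conn.executescript("""
    CREATE TABLE jobs (job_id INTEGER, job_name TEXT, start_time TIMESTAMP,
                       end_time TIMESTAMP, job_status INTEGER, num_tasks INTEGER);
    CREATE TABLE tasks (job_id INTEGER, task_id INTEGER, engine_id INTEGER,
                        start_time TIMESTAMP, end_time TIMESTAMP,
                        task_status INTEGER, num_reschedules INTEGER);
""")
# Hypothetical rows; status 0 is a placeholder, not a documented status code.
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)", [
    (101, "pricing-run", "2006-01-01 09:00:00", "2006-01-01 09:10:00", 0, 2),
    (102, "risk-run", "2006-01-01 09:30:00", "2006-01-01 09:35:00", 0, 1),
])
conn.executemany("INSERT INTO tasks VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (101, 0, 7, "2006-01-01 09:00:05", "2006-01-01 09:04:00", 0, 0),
    (101, 1, 8, "2006-01-01 09:00:05", "2006-01-01 09:09:00", 0, 2),
    (102, 0, 7, "2006-01-01 09:30:05", "2006-01-01 09:34:00", 0, 1),
])

def tasks_per_job(connection):
    """Return (job_id, job_name, task count, total reschedules) for each job."""
    return connection.execute("""
        SELECT j.job_id, j.job_name,
               COUNT(t.task_id),
               COALESCE(SUM(t.num_reschedules), 0)
        FROM jobs j
        LEFT JOIN tasks t ON t.job_id = j.job_id
        GROUP BY j.job_id, j.job_name
        ORDER BY j.job_id
    """).fetchall()

for row in tasks_per_job(conn):
    print(row)
```

The LEFT JOIN keeps jobs that have no recorded tasks, reporting them with a count of zero.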


