Microsoft® HPC++ CompFin Lab Architecture Whitepaper
Microsoft Corporation
Published: March 2008
Abstract
This whitepaper covers an incubation project from HPC++ Labs called the HPC++ CompFin Lab. The lab
explores the hosting of data-centric, high-performance computing (HPC) and high productivity
computing solutions by providing a computational finance service for academic use.
Microsoft HPC++ Labs would like to recognize Lab49, Inc (www.lab49.com) for their work in building the
Microsoft HPC++ CompFin Lab.
The information contained in this document represents the current view of
Microsoft Corporation on the issues discussed as of the date of publication.
Because Microsoft must respond to changing market conditions, it should not
be interpreted to be a commitment on the part of Microsoft, and Microsoft
cannot guarantee the accuracy of any information presented after the date of
publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO
WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN
THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user.
Without limiting the rights under copyright, no part of this document may be
reproduced, stored in or introduced into a retrieval system, or transmitted in
any form or by any means (electronic, mechanical, photocopying, recording,
or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or
other intellectual property rights covering subject matter in this document.
Except as expressly provided in any written license agreement from Microsoft,
the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
© 2008 Microsoft Corporation. All rights reserved.
Microsoft, Excel, SharePoint, SQL Server, Visual Studio, Windows, Windows
Server, and the Windows logo are trademarks of the Microsoft group of
companies.
The names of actual companies and products mentioned herein may be the
trademarks of their respective owners.
Introducing Microsoft HPC++ Labs
For over a year, a program within the Microsoft® Windows® HPC Server product team, referred to as
HPC++ Labs, has hosted a 64-node/256-core compute cluster for universities and researchers to use via
the internet. The resulting workload and user feedback provide the product team with an internal
production environment to:
• Demonstrate end-to-end integrated HPC solutions using Microsoft® Windows® HPC Server 2008
and Microsoft productivity and developer products
• Develop expertise in operating end-user-focused compute- and data-intensive HPC services
• Incubate Microsoft product extensions required to support HPC++ scenarios “out of the box” on
customer premises
Hosting Data-Centric Clusters
One of the initial projects HPC++ Labs embarked on was the investigation of hosting data-centric,
compute-intensive solutions. The theory that such solutions lend themselves to hosting is based on the
following considerations:
• There is an industry-wide trend where large data warehouses within organizations are growing
into the petabytes. Examples include:
o Financial services data warehouses, which maintain large time series databases of
market activity that increase exponentially as trading volume, the number of securities,
and the data used to price the securities increase
o Environmental sensor data from devices that measure atmospheric pressure, river
levels, ocean tides, wind speeds, rainfall, etc.
o Human sensor data accumulated from hospital equipment, portable medical devices
and exercise equipment
o Consumer behavior and advertising data
o Data accumulated from the tracking of radio frequency identification (RFID) tagged goods
o Pedestrian and automobile traffic patterns across a metropolitan area
o Traces accumulated from software executing under diverse loads
o Events accumulated from utility devices installed in remote locations
• A variety of interesting questions could be answered by building models that leverage these
large datasets
• Added computing power is necessary to execute, backtest and optimize such models
• Moving the necessary computing power to these large datasets is more efficient than moving
the datasets to the computing power [1]
Let’s look at a simple, hypothetical example that leverages a large central database and added
computing power. The key question a financial analyst may ask is “Are these mortgage backed securities
currently overpriced or underpriced?” To derive an answer, the analyst uses a newly devised mortgage
backed security (MBS) pricing model based on future interest rate predictions and mortgage payment
patterns. The pricing model itself is a Monte Carlo simulation and, thus, is compute-intensive on its own.
Thousands of potential interest rate paths are fed into the model for it to price the MBS for a particular
day. To better understand the new model’s reliability, the analyst backtests it across a portfolio of
10,000 securities and their trade prices for the last five years. After executing the model across this
range of data, the analyst discovers the model produces negative results for a particular set of securities
during particular periods within the 5 years. After analyzing these specific securities and periods, the
analyst realizes the model fails for 30-year fixed mortgages in a specific geographic area. Cross-
referencing the weighted credit average of the population for that geographic region during that period,
the analyst is able to expand the model to accommodate this type of scenario. Executing the new
version of the model across the same dataset yields no negative results, and the analyst can now
consider using the model to price current securities.
In this scenario the large, central datasets enable the backtesting and optimization of the model by
providing a large number of tests which validate the model and help surface correlations. The added
computing resources allow a compute-intensive pricing model to run for a large number of tests
repeatedly in a reasonable amount of time.
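The compute intensity of such a model is easy to see in miniature. The following sketch prices a simplified instrument by averaging discounted cashflows over thousands of simulated interest-rate paths. It is purely illustrative: the random-walk rate model, the interest-only cashflow structure, and all parameters are invented for this example and are not the model described above.

```python
import random

def simulate_rate_path(r0, steps, vol, rng):
    """Toy random-walk short-rate path (not a real term-structure model)."""
    path, r = [], r0
    for _ in range(steps):
        r = max(0.0, r + rng.gauss(0.0, vol))
        path.append(r)
    return path

def price_toy_mbs(principal, annual_coupon, months, n_paths, seed=42):
    """Average the discounted value of fixed monthly coupons plus the
    principal repaid at maturity over many simulated rate paths."""
    rng = random.Random(seed)
    monthly_coupon = principal * annual_coupon / 12.0
    total = 0.0
    for _ in range(n_paths):
        discount, pv = 1.0, 0.0
        for r in simulate_rate_path(0.05, months, 0.002, rng):
            discount /= 1.0 + r / 12.0       # discount by this month's rate
            pv += monthly_coupon * discount  # discounted coupon cashflow
        pv += principal * discount           # principal returned at maturity
        total += pv
    return total / n_paths

price = price_toy_mbs(principal=100_000, annual_coupon=0.06, months=360, n_paths=2_000)
```

Even this toy version runs 720,000 path steps for a single price; backtesting it daily across 10,000 securities for five years multiplies that by millions, which is the workload the cluster absorbs.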
Microsoft HPC++ CompFin Lab
Overview
The HPC++ CompFin Lab is an initial data-centric HPC service from HPC++ Labs. It is targeted for
university computational finance courses and research. It brings together Microsoft Windows HPC
Server, a central market data database, and Microsoft productivity products, in order to provide
university courses with an online service to publish, execute and manage computational finance models.
It comprises the following components:
• Computing Resources - A 64-node/256-core compute cluster with 5 TB of storage and a low-latency
interconnect
[1] http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
• Central Market Dataset - Historical market data, including 5 years of intraday equity tick data for
the S&P 500, daily and fundamental data for 10,000 stocks, and mortgage backed securities pool
data
• Microsoft® Office SharePoint® Server 2007 Web Portal - A Microsoft Office SharePoint Server
2007 portal to publish, browse and monitor models
• Excel® HPC Task Pane - A Microsoft Excel 2007 user interface for model input and results
• Model Execution Status Notifications - Model execution workflow with status email notifications
Programming Model
The lab also promotes a common programming pattern for parameter sweep models based on
Microsoft® Windows Communication Foundation (WCF) services. The following diagram illustrates the
key methods and operations a model implements and how data flows between them:
[Diagram: a model's split method, calculation operations, and dependency operations such as a final rollup, with the data flow between them]
A model that follows this pattern is implemented as follows:
• A model implements a split method that parallelizes the model’s execution by splitting the
model’s input message into multiple operation messages. The operation messages are then
queued to the cluster and processed in parallel across the number of processor cores allocated
to the model’s job. From a data perspective, the split method derives a set of keys from the
model input message and assigns one or more keys to each operation message. For example, a
model may split its work based on stock symbol and period (i.e. one day). The model input
message specifies the S&P 500 and a date range of one week. As a result, the model’s split
method splits the model input message into 2500 operation messages that are queued to the
cluster.
• A model implements calculation operations that process the messages from the split operation.
The calculation operations implement the model’s algorithms and store their output in results
storage or intermediate storage based on whether the results are targeted to the submitting
user or a dependency operation to be executed later in the job. More information on the
intermediate and results storages can be found later in this document.
• Optionally, a model implements operations that depend on other operations and execute after
the operations they depend on complete. These are often utility operations that process the output
from the calculation operations and prepare the final results of the model. In the diagram
above, a single operation called Rollup depends on all the other operations in the job to
complete before executing, at which time it aggregates the results of the other calculation
operations into a final result for the client.
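The pattern above can be sketched in a few lines. The Python below is illustrative only (the lab implements these as WCF service operations in .Net, and the dictionaries here stand in for the intermediate and results storages); it shows a split method fanning a model input message out into per-symbol, per-day operation messages, calculation operations processing them, and a rollup aggregating the output:

```python
from datetime import date, timedelta

# Stand-ins for the lab's logical storages (backed by SQL Server or a cache
# in the actual service).
intermediate_storage = {}
results_storage = {}

def split(input_message):
    """Fan the model input out into one operation message per (symbol, day) key."""
    start = input_message["start"]
    return [{"symbol": s, "day": start + timedelta(days=d)}
            for s in input_message["symbols"]
            for d in range(input_message["days"])]

def calculate(message):
    """Calculation operation: run the model's algorithm for one key and store
    the output in intermediate storage for the dependency operation."""
    key = (message["symbol"], message["day"])
    intermediate_storage[key] = len(message["symbol"])  # placeholder calculation

def rollup(job_id):
    """Dependency operation: runs after all calculations complete and
    aggregates their output into a final result for the client."""
    results_storage[job_id] = sum(intermediate_storage.values())

ops = split({"symbols": ["MSFT", "AAPL"], "days": 5, "start": date(2008, 3, 3)})
for op in ops:  # on the cluster these are processed in parallel across cores
    calculate(op)
rollup("job-1")
```

With the S&P 500 and a one-week date range, the same split produces 500 symbols times 5 trading days, i.e. the 2500 operation messages in the example above.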
Publishing and Executing a Model
Typically, a professor develops a model in Microsoft® Visual Studio® 2008 in their Microsoft® .Net
language of choice. The model is then published to their students via the HPC++ CompFin Lab portal
as shown below:
[Screenshot: publishing a model to the HPC++ CompFin Lab portal]
A model comprises two files: an Excel workbook template and a cabinet file. The Excel workbook
template is the user interface to the model and allows students to create workbooks for different input
combinations and result layouts. A cabinet file contains the .Net assemblies that implement the model’s
split, calculation and other operations that run on the web server and the cluster.
Once a model is published, students can create workbooks based on its workbook template, specify
inputs, and submit the model to the service for execution via the HPC Excel Task Pane. The user enters
the model’s input and clicks the Submit button to submit the model for execution as
shown below:
[Screenshot: submitting a model for execution from the HPC Excel Task Pane]
Model Execution
The following diagram illustrates the path of a model once it is submitted to a cluster:
[Diagram: Model execution. Clients (Microsoft® Internet Explorer, Microsoft® Excel® 2007) submit a
type-safe job input message to the job execution web service on the web server, which also hosts the
job monitor Windows service, the MOSS 2007 university site, and the job results web service. A
pre-execution workflow (1) creates a working directory, (2) installs the model, (3) splits job input
messages into task input messages, and (4) submits the job to the 64-node/256-core compute cluster.
The job processes the task input messages, pulling binaries and input messages from the file server,
retrieving market data via Linq from the historical market data SQL Server® database (loaded from
Equities.csv, Fixed Income.csv, and Corp Reports.csv by SQL Server® 2005 Integration Services),
storing intra-operation data in intermediate storage (SQL Server®, cache), and storing final results
in results storage (SQL Server®, external store). Clients can get status throughout.]
1. The HPC Excel Task pane extracts the user’s input from the worksheet via XML maps and
submits the model’s input XML to the job execution web service. The input XML’s schema is
specific to the model being executed. The web service is developed in Windows Communication
Foundation and uses WS-Security UserNameTokens to authenticate each message.
2. The job execution web service then executes a configurable Windows Workflow which performs
the following steps when submitting the job to the cluster:
a. Creates a working directory for the job’s execution files in the user’s home directory on
a file server shared between the web servers and the cluster
b. Installs the model into the working directory. The model was previously published to a
specific SharePoint document library as a cabinet file. Installing the model involves
copying the cabinet file to the working directory and extracting its files
c. Invokes the model’s split operation as explained earlier in the document. The resulting
operation messages are stored as text files in the working directory
d. Submits the job to the cluster for execution
3. The job executes across the cluster as multiple WCF service instances each assigned to a core.
Each service processes its operation messages, which include the keys to the market data
needed for that operation invocation to perform its calculation.
4. The operation retrieves the market data required for its calculations based on the specified keys
using .Net 3.5 LINQ and Data Entity objects. The services leverage LINQ and an assembly of
custom Data Entity classes that map to the market data database. Once this assembly is referenced
from a model’s Visual Studio project, the developer can browse the database schema with
the Visual Studio object browser and IntelliSense. Furthermore, the assembly provides compile-
time type checking for database commands, which minimizes job failures.
5. The operations store data targeted to other operations within the same job to the intermediate
storage. The intermediate storage is a logical storage that can be configured to a variety of
physical storages such as a database or distributed cache.
6. The operations store final results targeted to the submitting user to the results storage. The
results storage is a logical storage that can be configured to a variety of physical storages such
as local and external databases.
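The pre-execution workflow in step 2 can be sketched as a short sequence. In the Python sketch below, the directory layout, a .zip archive standing in for the cabinet file, and the stubbed submit step are all assumptions for illustration; the lab's actual implementation is a configurable Windows Workflow:

```python
import os
import tempfile
import zipfile

def run_pre_execution_workflow(user_home, job_id, model_archive, operation_messages):
    # a. Create a working directory for the job's execution files in the
    #    user's home directory on the shared file server
    working_dir = os.path.join(user_home, f"job_{job_id}")
    os.makedirs(working_dir)

    # b. Install the model: copy the published archive into the working
    #    directory and extract its files (a .zip stands in for the cabinet file)
    with zipfile.ZipFile(model_archive) as archive:
        archive.extractall(working_dir)

    # c. Store the operation messages produced by the model's split
    #    operation as text files in the working directory
    for i, message in enumerate(operation_messages):
        with open(os.path.join(working_dir, f"op_{i}.xml"), "w") as f:
            f.write(message)

    # d. Submit the job to the cluster for execution (stubbed out here)
    return working_dir

home = tempfile.mkdtemp()
archive_path = os.path.join(home, "model.zip")
with zipfile.ZipFile(archive_path, "w") as z:
    z.writestr("model.dll", "binary placeholder")
wd = run_pre_execution_workflow(home, 1, archive_path, ["<op>1</op>", "<op>2</op>"])
```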
The following diagram illustrates the steps performed to clean up a job and for a client to retrieve its
results:
[Diagram: Job cleanup and results retrieval. (1) The job monitor Windows service polls the job
execution web service, which checks the status of active jobs. (2) A post-execution workflow cleans
up working directories and sends email to the submitting user. (3) The Microsoft® Excel® 2007 task
pane gets the job status from the job execution web service. (4) The client retrieves a filtered
results stream from the job results web service, backed by the results storage (SQL Server®,
external store).]
1. The job monitor service polls the job execution service. The job execution service checks the
status of the active jobs and updates their status in the web portal.
2. If a job’s status reaches one of the completed states, the job cleanup workflow is executed as
follows:
a. Removes the job’s working directory
b. Sends email to the submitting user
3. Meanwhile, if the model’s workbook is open, the Excel task pane polls the job execution service
for the job’s status and updates its display accordingly. When the job completes successfully,
the task pane allows the user to retrieve the model’s results.
4. The task pane provides a GetResults button that passes an optional XPath filter to the Results
web service, which in turn passes the filter to the results database. The results of the model are
then streamed back to the task pane, which injects them into the workbook using an XML map.
For large datasets, users can filter for specific portions of the results.
SQL Server® XML Storage
Currently the lab provides one physical storage implementation that can be assigned to intermediate or
results logical storages. The SQL Server® XML Storage leverages SQL Server XML column types to store
objects and object hierarchies of varying shapes. The database is laid out as follows:
• Each user is provisioned a single SQL Server database
• Each model is allocated a database schema within a user’s database. A model can then allocate
a table within the schema for each data type used for storing intermediate or final results.
• Each row in the table stores a tagged .Net DataContract object serialized to XML. The XML is
stored using a SQL Server XML column type.
Each .Net DataContract object can be retrieved by its row’s tag, an XPath query or a combination of
both. The XPath statement allows filtering within an object’s XML infoset across multiple rows. The
granularity of the objects (or XML infosets) stored within each row depends on the filtering needs of the
consumers. The more granular the stored XML infosets, the more finely the results can be filtered.
However, this can also impact performance, as more selects, inserts and XML serializations
are necessary to move the data.
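The row layout and XPath-based retrieval can be illustrated outside SQL Server. In the Python sketch below, an in-memory list of (tag, XML) rows and the standard library's limited XPath support stand in for SQL Server XML columns and its XPath engine; the row shape and sample data are invented for this example:

```python
import xml.etree.ElementTree as ET

# Each row holds a tag plus a serialized object, mirroring one table row
# with a SQL Server XML column (sample data invented for this sketch).
rows = [
    ("run1", "<Result><Symbol>MSFT</Symbol><Pv>105.2</Pv></Result>"),
    ("run1", "<Result><Symbol>AAPL</Symbol><Pv>98.7</Pv></Result>"),
    ("run2", "<Result><Symbol>MSFT</Symbol><Pv>104.9</Pv></Result>"),
]

def query(rows, tag=None, xpath=None):
    """Retrieve objects by row tag, by an XPath filter over the XML, or both."""
    out = []
    for row_tag, xml_text in rows:
        if tag is not None and row_tag != tag:
            continue
        root = ET.fromstring(xml_text)
        if xpath is not None and root.find(xpath) is None:
            continue  # filter within the object's XML infoset
        out.append(root)
    return out

# Combine a row tag with an XPath predicate: run1 results for symbol MSFT
matches = query(rows, tag="run1", xpath=".//Symbol[.='MSFT']")
```

Storing one small object per row, as here, allows per-symbol filtering; storing one large object per job would force consumers to retrieve and parse everything, which is the granularity trade-off described above.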
Data in the user database is maintained using a first-in, first-out (FIFO) policy: when the database
fills, the oldest data is purged.