EMC Proven Professional Knowledge Sharing 2010
Design Considerations for Block Based Storage Virtualization Applications

Venugopal Reddy
Global Solutions Architect, EMC
[email protected]
Table of Contents

Introduction
Virtualization – Virtual Entities
    Virtual Target
    Virtual Initiator
    Virtual LUN
    ITL
Virtualization hardware
Applications
    Recoverpoint
        SANTAP Implementation
        Brocade Implementation
    Invista
    Storage Encryption
        Cisco Storage Media Encryption
        Brocade Disk and Tape Encryption
Design Considerations
    Recoverpoint
        SAN design
        Fabric Splitter Sizing
        Journal volume design
        Sizing the RPAs (Appliances)
        Sizing the WAN Pipe
        Databases in Consistency Groups
        Recoverpoint over Invista
        Splitter configuration limits
        Storage performance with RecoverPoint
        SANTAP Performance
        FAP performance
        Network latency and BW requirements
        Statistics and Bottlenecks
    Storage Encryption with Cisco SME
    Storage Encryption with Brocade Encryption Services
Future Directions
Conclusion
References

Disclaimer: The views, processes, or methodologies published in this compilation are those of the authors. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.
Introduction
The tumultuous global events during recent years accentuate the need for organizations
to optimize their information technology budgets by increasing resource utilization,
reducing infrastructure costs, and accelerating deployment cycles. Simultaneously,
regulatory framework and compliance requirements are imposing new demands on the
ways we store, access, and manage data.
“Network based Storage Virtualization,” the term describing virtualization implemented in
Storage Area Networks (SANs), is an enabling technology that has spawned several
innovative applications that are beginning to form a framework to optimize information
life cycle management in data centers and to solve the challenges mentioned above.
These innovations include the ability to create and manage dynamic storage resource
pools, perform seamless data migrations, encrypt stored data, and enable long-distance
replication and serverless backups.
The core functionality of data virtualization in a SAN is the ability to map, redirect, or
copy data within the intelligent SAN switch. Operating on block-level I/O in this way
provides unprecedented flexibility and facilitates
some of these innovations. Products such as EMC’s Recoverpoint and Invista, and
Storage Encryption solutions from Brocade and Cisco are some of these block based
storage virtualization applications. They implement virtualization in the “network” to
achieve innovative uses of data without impacting I/O performance. The intelligent
appliances and SAN switches implementing the virtualization layer provide high
performance and scalability to match the needs of enterprise class application
environments.
Network based storage virtualization, while enabling a number of innovative applications,
poses a number of important design and deployment challenges. The implementation of
the technology requires specialized ‘virtualizing modules’ in the SAN switches that are
highly vendor specific and hardware dependent. The inherent complexities and the
interdisciplinary nature of the technologies also call for special considerations when
designing, deploying and maintaining these new generation applications.
Successful deployment of network based storage virtualization applications requires
meticulous planning and design, and an intimate understanding of the intelligence
implemented in the SAN layer. This article provides a practitioner’s perspective on
implementing these solutions and aims to:
1. Extend the understanding of the mechanisms of various virtual entities that exist
in network based virtualization
2. Provide insight into a broad range of applications that benefit from virtualization
in the SANs
3. Share recommendations and best practices for successful virtualized
infrastructure planning, design and deployment considering scalability, availability
and reliability
4. Describe techniques to maintain the virtualized infrastructure while sustaining
performance
Virtualization – Virtual Entities
The principle behind virtualization is to capture I/O in flight between the host and the
target within the fabric. Once the I/O is intercepted, you can perform various operations
on it, for example making a copy, redirecting the I/O, or encrypting the data.
Virtualization hardware within the fabric creates virtual entities by implementing this
mechanism of intercepting the I/O. There are four main components of virtual entities:
• Virtual Targets (VT)
• Virtual Initiators (VI)
• Virtual LUNs (vLUN)
• Initiator-Target-LUN (ITL) nexus
Virtual Target
A VT is a virtual entity created in the fabric that is presented to the real host as a target. In
some cases, the VT can assume the identity (WWN) of the real storage target (as with
SANTAP on Cisco switches), or it can have a different identity. The physical host performs
I/O on these virtual targets; the virtualization modules intercept the I/O.
Virtual Initiator
A VI is a virtual entity created in the fabric that is presented to the physical target as a host
HBA. In some cases, the VI can assume the identity (WWN) of the real host HBA (as with
SANTAP on Cisco switches), or it can have a different identity. The Virtual Initiator
performs I/O on the physical storage targets; the virtualization module acts as the
‘intermediary’ between the physical Host Initiator (HI) and Physical Target (PT).
Virtual LUN
A vLUN is a virtual entity created on the VT. The physical Host Initiator performs its I/O on
the vLUN. The virtualization hardware intercepts the I/O operation to the vLUN and
then redirects it to the physical target (through the VI) after performing the necessary
operations (copy, map, encrypt, etc.). A vLUN can be created from one or more physical
LUNs from disparate arrays. Features such as striping, mirroring, and concatenation are
implemented at this layer.
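To make the remapping concrete, here is a minimal Python sketch of how a vLUN layer might translate a virtual LBA into a (physical LUN, physical LBA) pair for concatenation and striping. The function names and layouts are hypothetical illustrations, not vendor code.

```python
# Hypothetical sketch of vLUN address remapping (not vendor code).
# A vLUN built from several physical LUNs must translate each
# virtual LBA into a (physical LUN, physical LBA) pair.

def concat_map(vlba, lun_sizes):
    """Concatenation: physical LUNs are laid out back to back."""
    for lun, size in enumerate(lun_sizes):
        if vlba < size:
            return lun, vlba
        vlba -= size
    raise ValueError("virtual LBA beyond vLUN capacity")

def stripe_map(vlba, num_luns, stripe_blocks):
    """Striping: blocks rotate across LUNs in stripe_blocks units."""
    stripe_index = vlba // stripe_blocks       # which stripe unit
    offset = vlba % stripe_blocks              # offset inside the unit
    lun = stripe_index % num_luns              # round-robin LUN choice
    plba = (stripe_index // num_luns) * stripe_blocks + offset
    return lun, plba

# A vLUN concatenated from three 1000-block LUNs:
print(concat_map(1500, [1000, 1000, 1000]))   # (1, 500): second LUN, block 500
# The same capacity striped across 3 LUNs with 100-block stripes:
print(stripe_map(1500, 3, 100))               # (0, 500)
```

Mirroring would simply return one such pair per mirror leg for the same virtual LBA.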
ITL
A front-end ITL (FITL) is a virtual entity that makes up a nexus between the physical host HBA,
the Virtual Target, and the virtual LUN. Front-end ITL counts and their placement in the
fabric play an important role in optimizing the performance and scalability of the virtualized
application; ITL counts define the scalability limits of the virtualization hardware.
A back-end ITL (BITL) is a nexus between the Virtual Initiator, the Physical Target, and the
physical LUN. The virtualization framework in the fabric manages the mapping between a
FITL and a BITL.
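The FITL-to-BITL relationship can be pictured as a simple lookup table. The Python sketch below is a hypothetical model of that mapping; the WWPN names and structure are invented for illustration, not an actual vendor data structure.

```python
# Hypothetical model of front-end to back-end ITL mapping
# (illustrative only; real switch firmware is vendor specific).
from dataclasses import dataclass

@dataclass(frozen=True)
class ITL:
    initiator: str   # WWPN of the initiator (HI or VI)
    target: str      # WWPN of the target (VT or PT)
    lun: int

# Front-end nexus: physical host -> virtual target -> vLUN
fitl = ITL(initiator="hi_wwpn", target="vt_wwpn", lun=0)
# Back-end nexus: virtual initiator -> physical target -> physical LUN
bitl = ITL(initiator="vi_wwpn", target="pt_wwpn", lun=5)

# The virtualization framework maintains the FITL -> BITL map;
# each host I/O arriving on a FITL is re-issued on the mapped BITL.
itl_map = {fitl: bitl}
assert itl_map[ITL("hi_wwpn", "vt_wwpn", 0)].target == "pt_wwpn"
```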
Virtualization hardware
In the current market, virtualization hardware comes from two major fibre channel switch
equipment vendors, Brocade and Cisco. The virtualization modules are available as
blade modules that can be inserted into directors and expandable switches, or in the
form of standalone switches.
Brocade hardware
Virtualization hardware from Brocade comes in the following forms:
1) AP-7600B Storage Application Services switch
2) PB-48k-AP4-18 Storage Application Services blade
3) ES5832 Encryption switch
4) PB-DCX-16EB Encryption Blade
Brocade Application platforms - the standalone AP-7600B switch or PB-48k-AP4-18
blade for the Connectrix ED-48000B, or ED-DCX-B director implement the Storage
Application Services (SAS) framework in the SAN using the Fabric Application
Programming (FAP) layer. Brocade's SAS API, implemented on these specialized
modules, provides hardware-implemented primitives such as mirroring, copying, extent
maps, striping, concatenation, copy-on-write, resync, and dirty region logging. These
primitives enable virtualization applications to implement features such as mirroring,
replication, snapshots, migration, and backup.
The ES-5832B and PB-DCX-16EB implement encryption services for data-at-rest disk array
LUNs using the IEEE P1619 standard with Advanced Encryption Standard (AES) 256-bit
algorithms. I/O redirection using VIs and VTs enables data encryption and compression by
six encryption/compression FPGAs in the blade or switch.
Cisco Hardware
Cisco’s virtualization hardware comes in the following forms:
1) Storage Services Module (SSM)
2) MDS-PBFI-1804 (18/4 port) Multi Services Module
3) 9222i MSM switch
Virtualization functionality is enabled through the Storage Services Module (SSM) line
card that you can insert into any modular switch within the Cisco MDS 9000 family. Each
SSM contains a virtualization daughter card that hosts four virtualization engines. Each
engine has two Data Path Processors (DPPs) and supports storage virtualization, volume
management, reliable replication writes (Cisco SANTAP), SCSI Flow services, Fibre
Channel Write Acceleration, and Network-Accelerated Serverless Backup (NASB).
Storage Media Encryption (SME), or encryption of data on the Tape Libraries, is
facilitated by Cisco’s encryption engines integrated on the Cisco MDS 9000 18/4-Port
Multiservice Module (MSM), MDS-PBFI-1804 and Cisco MDS 9222i Multiservice Module
Switches. SAN-OS’s FC-Redirect creates VIs and VTs to enable data encryption without
any fabric reconfiguration. These Multiservice Modules and switches also support Cisco
SANTAP.
Applications
This section briefly describes the applications that use virtualization features in the
intelligent switches.
Recoverpoint
Recoverpoint is a distance replication and disaster recovery solution that runs on out-of-
band appliances, providing both local and remote protection. The salient features of the
solution are:
1) Heterogeneous storage replication for tiered disaster recovery
2) Innovative data Journaling and Application Data consistency
3) Support of FC and IP replication
Recoverpoint uses splitter technology for replication where a copy of the host writes is
sent to the out-of-band intelligent appliances. The splitters can be host, fabric, or array
based. This article focuses on fabric-based splitters. A Cisco fabric splitter uses SANTAP
technology on its SSM and MSM modules, and a Brocade fabric-based splitter uses the
AP7600B or PB-48k-AP4-18 as the splitting engine.
SANTAP Implementation
SANTAP implemented on Cisco SSM / MSM modules ‘taps’ a copy of the I/O to be sent
to the Recoverpoint appliance. SANTAP in the SSM / MSM module is responsible for
creating the virtual entities in the fabric. SANTAP currently utilizes two VSANs: a
Frontend VSAN where the Host Initiators and the Virtual targets reside, and a Back end
VSAN where the physical targets, VIs and appliance HBAs are situated. In addition to
the host VIs, SANTAP creates a group of virtual entities called Control Virtual Targets
(CVT).
SANTAP Communication
The CVT is the portal through which the appliance (RPA) communicates with SANTAP. In an
SSM module, when a CVT is created in the back-end VSAN, 10 virtual WWNs are
created. Of these, 8 Virtual Initiators (VIs) represent the 8 Data Path Processors (DPPs,
ASICs on the module), the remaining VI represents the Control Path Processor (CPP)
Initiator, and the lone Virtual Target (VT) created represents the CPP Target.
Communications between the SANTAP service and the RPA fit into three classes:
1. Control messages from the RPA to the SANTAP service
2. Control messages from the SANTAP service to the RPA
3. Data traffic (reliable writes) mirrored from a host issuing a write to a storage
array.
The first two classes of communication are messages/notifications between the devices
to control various aspects of the SANTAP service. Both the SANTAP service (CPP VI
and CPP VT) and the RPA appear as both a standard SCSI initiator and target. SCSI
Write operations are used between the SANTAP service and the RPA to convey control
messages.
The appliance discovers SANTAP service when it logs into the FC fabric and queries the
name server (FCNS). Once discovered, the RPA issues Port Login (PLOGI) and
Process Login (PRLI) commands, followed by the standard SCSI device-discovery
process. The SANTAP service (CPP target) responds to a SCSI Inquiry with Vendor
Information set to "CISCO MDS" and Product Identification set to "CISCO SANTAP
CVT."
The Cisco SANTAP service initially issues a pending write log (PWL) to the RPA when
mirroring a host write to the appliance. The PWL is a short SCSI command (several
bytes) consisting only of the write operation’s metadata (the LBA number). Once the RPA
acknowledges the PWL, the Cisco SANTAP service simultaneously performs a write
I/O to both the RPA and the target device (storage array). The RPA then acknowledges
the write I/O. Finally, the pending write log entry is cleared with another short PWL
command. Communications and operations on MSM modules and switches are similar.
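The reliable-write sequence above can be sketched step by step. The following Python model only encodes the message ordering described in the text; the function and class names are invented for illustration, not part of SANTAP or the RPA software.

```python
# Illustrative model of the SANTAP reliable-write sequence
# (invented names; only the message ordering follows the text).

def santap_reliable_write(lba, data, rpa, array):
    log = []
    # 1. Issue a pending write log (metadata only) to the RPA.
    log.append(("PWL_SET", lba))
    rpa.ack_pwl(lba)
    # 2. After the PWL ack, write simultaneously to RPA and array.
    rpa.write(lba, data)
    array.write(lba, data)
    log.append(("WRITE", lba))
    # 3. The RPA acknowledges the write I/O.
    rpa.ack_write(lba)
    # 4. Clear the pending write log with another short command.
    log.append(("PWL_CLEAR", lba))
    return log

class _Stub:
    """Stand-in device that accepts any call and does nothing."""
    def __getattr__(self, name):
        return lambda *args, **kwargs: None

steps = santap_reliable_write(42, b"payload", _Stub(), _Stub())
print([s[0] for s in steps])   # ['PWL_SET', 'WRITE', 'PWL_CLEAR']
```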
Brocade Implementation
RecoverPoint on AP7600B and PB-48k-AP4-18 can be deployed in two modes:
• Multi VI mode
• Frame redirect Mode
Multi VI mode
In this mode, the HIs are zoned with the Virtual Target (created on the switch when an HI is
bound to a VI) and the VI is zoned with the PT. Because of this, you need to mask the VIs
on the PT and reorganize the zones. When the host sends an I/O to the VT, the DPP on
the switch intercepts the I/O. The VI then sends one copy to the Physical Target
and the other to the appliance.
Frame Redirect
Frame redirect ensures that a copy of the I/O can be sent to the Recoverpoint appliance.
The feature uses a combination of Redirect zones and Name Server changes to map
real device WWNs to the FCIDs of the virtual entities. This allows redirecting a flow
between a host and target to the appliance without reconfiguring them. When you
perform binding between an HI and a PT, a new redirect (RD) zone is created. The RD
zones have a prefix of “lsan_” and will contain the HI, PT, VI and VT.
The RD Zone is part of the defined zone configuration and will not appear in the effective
zone configuration. When you create the first RD zone (using the bind_host_initiators
command on the RPA), two additional zone objects are created: a base zone
"red_______base" and a "r_e_d_i_r_c__fg" zone configuration. These additional zone
objects are required by the Frame Redirect implementation and must remain on the
switch as long as other RD zones are defined.
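The zoning side effects of a bind operation can be modeled as follows. The zone object names ("lsan_" prefix, "red_______base", "r_e_d_i_r_c__fg") come from the text, but the bind function and the RD zone name suffix are hypothetical illustrations.

```python
# Hypothetical model of frame-redirect zone bookkeeping.
# Zone object names ("lsan_" prefix, base zone, zone config) are
# those given in the text; the logic itself is illustrative only.

def bind(defined_config, hi, pt, vi, vt):
    """Model the zone objects created by binding an HI to a PT."""
    if "red_______base" not in defined_config:
        # The first RD zone also creates the base zone and zone config.
        defined_config["red_______base"] = []
        defined_config["r_e_d_i_r_c__fg"] = []
    # RD zone name: "lsan_" prefix; the suffix here is invented.
    rd_zone = f"lsan_{hi}_{pt}"
    defined_config[rd_zone] = [hi, pt, vi, vt]
    return rd_zone

cfg = {}
zone = bind(cfg, "hi1", "pt1", "vi1", "vt1")
print(zone)          # lsan_hi1_pt1
# A second bind reuses the existing base zone and zone config:
bind(cfg, "hi2", "pt2", "vi2", "vt2")
print(sorted(cfg))   # base zone and config appear exactly once
```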
Invista
EMC Invista is a network-based storage virtualization solution that utilizes intelligent
Fibre Channel switches to implement centralized storage virtualization services that
span heterogeneous storage systems. Using the virtualization modules in the FC
switches, Invista provides services such as volume management, mirroring, clones, and non-
disruptive data migration across heterogeneous storage systems, with an easy-to-
manage centralized user interface.
Virtual volumes created out of one or more storage systems are presented to the host on
the virtual targets created by Invista. Similarly, Invista VIs perform I/O on the physical
storage systems on behalf of the hosts. I/O remapping occurs in the data path for fast
path commands (read6/write6 and read10/write10) at hardware speeds, with minimal
additional latency. Slow-path commands on the virtual volumes (such as inquiry) are
serviced by the highly available and redundantly configured Invista appliances that
maintain the metadata of the virtualized storage on the highly available LUN in the SAN.
Brocade Implementation
Invista creates 16 VIs and 16 VTs on the virtualizing modules or switches. The VIs are
zoned with the PTs and the VTs are zoned with the HIs. The VIs and VTs are equally
distributed between the two DPPs on the DPC.
Cisco Implementation
Invista creates 9 VIs (one for each DPP and one control VI) and 32 VTs. The VTs are
zoned to the HIs in front-end VSANs, and the VIs are zoned to the storage targets. The SAL
agent installed on the FC switches communicates with the Invista appliances to configure
the intelligent services modules.
Storage Encryption
Storage encryption at the fabric layer is a relatively new application of block-level storage
virtualization. Key advantages of fabric-level storage encryption include:
• The ability to encrypt data at wire speeds
• Central management of Encryption resources
• Simplified, non-disruptive installation and configuration
These encryption solutions are ideal for cases such as:
• Highly sensitive data on the Disk or Tape that needs protection (Data-at-rest).
• Secure data backups for offsite tape storage and long-term archiving
• Centralized management of heterogeneous disk and tape storage environments
• Secure replication of Encrypted data backups to remote facilities
• Scaling data center encryption services by implementing clusters of encryption
blades or switches
Cisco Storage Media Encryption
The Cisco MDS Storage Media Encryption (SME) service enables encryption of data stored
on tape. This protects the backed-up data on the tapes from unauthorized access or
tape loss. SME creates VIs and VTs. An I/O sent to a VT is intercepted, encrypted, and
written to the tape by the MSM module through the VI. SME is a transparent fabric
service and the MSM module can be deployed anywhere in the fabric. It does not need
to be directly in the data path; hence no cabling or configuration changes are required.
Once SME is enabled, traffic that is to be encrypted is redirected to the appropriate
MSM in the fabric using the FC-Redirect service.
FC-Redirect
VIs and VTs are created and placed in the default zone when SME is enabled. When an
HI-PT nexus is configured on the SME, a LOGO (Logoff) is sent to the host to abort any
existing sessions and exchanges to the physical target that may be in transit. The host
then performs another PLOGI, but the MSM module intercepts it and redirects it to VT.
The VI corresponding to the VT then performs a PLOGI on behalf of the Host, and
continues through the PRLI and discovery sequence. Once complete, the VT
acknowledges the host’s PLOGI request and accepts the host’s PRLI request. From then
on, the VT intercepts the host I/O sent to the PT; the encryption module encrypts the
data and forwards it to the VI, which sends the encrypted data to the PT. This is
transparent to the HI and PT.
Brocade Disk and Tape Encryption
Similar to the frame redirect option in a Recoverpoint deployment, the Brocade encryption
engine uses RD zones for encryption. The HI gets the FCID of the VT when it queries for
the FCID of the PT, and the PT gets the FCID of the VI when it queries for the FCID of the
HI. The I/O intercepted by the VT is encrypted by the encryption engine and is written to
the PT by the VI.
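The symmetric write/read path can be illustrated with a toy cipher. XOR stands in for AES-256 here purely for demonstration; the real encryption engines implement IEEE P1619 XTS-AES in hardware, and the function names below are invented.

```python
# Toy illustration of the transparent encrypt/decrypt data path.
# XOR stands in for AES-256; real engines use hardware XTS-AES.

KEY = bytes(range(1, 33))               # stand-in for a 256-bit key

def xor_cipher(data, key=KEY):
    """Toy symmetric cipher: the same operation encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def write_path(host_data):
    """VT intercepts the host write; the VI writes ciphertext to the PT."""
    return xor_cipher(host_data)        # what lands on the PT

def read_path(stored_data):
    """VI reads ciphertext from the PT; the VT returns plaintext to the HI."""
    return xor_cipher(stored_data)

block = b"confidential block data"
assert read_path(write_path(block)) == block   # transparent to HI and PT
assert write_path(block) != block              # data at rest is not plaintext
```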
The reverse happens when data is read from the PT. In addition, there is another entity
named CryptoTarget Container that binds all these virtual entities. A CryptoTarget
Container holds configuration information for a single target, including:
• Target port, initiator, and LUN settings
• Interfaces between the encryption engine and targets
• The initiators that access storage devices
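A CryptoTarget Container can be pictured as a small record grouping these settings. The structure below is a hypothetical illustration of the fields listed above, not Brocade's actual configuration schema.

```python
# Hypothetical illustration of a CryptoTarget Container record
# (field names follow the list above; not Brocade's schema).
from dataclasses import dataclass, field

@dataclass
class CryptoTargetContainer:
    target_port: str                       # the single PT it covers
    encryption_engine: str                 # EE interface bound to it
    initiators: list = field(default_factory=list)
    lun_settings: dict = field(default_factory=dict)   # lun -> policy

ctc = CryptoTargetContainer(
    target_port="pt_wwpn",
    encryption_engine="ee0",
    initiators=["hi_wwpn"],
    lun_settings={0: "encrypt"},
)
assert ctc.lun_settings[0] == "encrypt"
```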
Design Considerations
We will discuss a few design considerations for deploying the above-mentioned
applications using virtualization in the fabrics. These considerations stem from
experience deploying these applications and should be considered complementary to the
product documentation provided by the vendors.
Recoverpoint
Consider four major components when designing a replication solution using
RecoverPoint.
• RecoverPoint Appliances (RPA) — RecoverPoint appliances are Linux based
boxes and are instrumental for replication activities. They accept “split” data and,
based on policy settings, apply bandwidth reduction techniques, ensure write
order fidelity, guarantee data consistency, and route the data to the appropriate
destination volume, either via IP or Fibre Channel. The RPA also acts as the sole
management interface to the RecoverPoint installation.
• RecoverPoint Journal Volumes – Journal volumes are dedicated LUNs on both
Production and Target sides used to stage small aperture, incremental snapshots
of host data. As the personality of production and target can change during
failover and failback scenarios, Journal volumes are required on all sides of
Replication (production, CDP and CRR).
• Intelligent Fabric Splitter — The RecoverPoint splitter driver is a use-specific,
small footprint software that enables continuous data protection (CDP) and
continuous remote replication (CRR). The splitter driver can be loaded on a host,
on an Intelligent Blade within a SAN director, or on a CLARiiON® array. The
intelligent fabric splitter is the intelligent-switch hardware that contains
specialized port-level processors (ASICs) to perform virtualization operations on
IO at line speed. As mentioned in the previous sections, this functionality is
available from two vendors: Brocade and Cisco. Brocade’s intelligent switch, the
AP-7600, can be linked through ISLs to a new or existing SAN. Cisco’s intelligent
blades are the Storage Services Module (SSM) and the MultiServices Module
(MSM) that can be installed in MDS 9513, 9509, 9506, 9216i, 9216A, or 9222i.
• Remote Replication — Two RecoverPoint Appliance (RPA) clusters can be
connected via TCP/IP or FC to perform replication to a remote location. RPA
clusters connected via TCP/IP for remote communication will transfer “split” data
via IP to the remote cluster. The target cluster’s distance from the source is only
limited by the physical limitations of TCP/IP. RPA clusters can also be connected
remotely via Fibre Channel. They can reside on the same fabric or on different
fabrics, as long as the two clusters can be zoned together. The target cluster’s
distance from the source is again only limited by FC’s physical limitations. RPA
clusters can support distance extension hardware (i.e., DWDM) to extend the
distance between clusters.
SAN design
Deciding where to place the SSM modules or AP-7600 switch / PB-48k-AP4-18 module
in the SAN is one of the most common design considerations in Recoverpoint. You also
have to decide on the location of Recoverpoint appliances on the SAN.
Here are guidelines for placing the Intelligent switch modules and SSM modules:
1) As a best practice, the intelligent modules/switches should be placed nearest to
the storage ports. In Core-Edge fabrics, the intelligent modules/switches should
be connected on the Core Switches. Similarly in Host-edge-Core-Storage-edge
fabrics, the most logical place would be on the Storage edge fabrics. However, if
the modules will be used by multiple storage ports on different storage edge
switches, placing the intelligent modules on the core switches is ideal.
2) As a best practice, the Recoverpoint appliances should be placed as close to the
intelligent modules as possible. In AP-7600B deployments, it is preferable to
place the appliance ports on the switch itself. Similarly, with MDS MSM modules
and switches, the appliance ports should be placed on the module/switch. However,
on MDS switches with SSM module, the appliance should be connected to a
regular line card on the switch on non-shared FC ports. Further, the appliance
ports should not be connected to the ports on SSM modules.
3) Inserting an SSM module in a MDS9513 Director reduces the director port count
to 255. For this reason, placing SSM modules in a 9513 director is not
recommended where scalability of the ports in MDS Directors is a concern.
Complex SAN topologies
A complex topology with numerous switches in the fabrics will most likely be in one of
two designs, each discussed in the next sections:
• Core/edge (hosts on an edge switch and storage on a core switch)
• Edge/core/edge (hosts and storage on an edge switch but on different tiers)
Core/Edge configurations
In this model, you connect hosts to the edge tier switches and storage to the core tier
switch(es). The core tier is the centralized location and connection point for all physical
storage in this model. All IO between the host and storage must flow over the ISLs
between the edge and core. It is a one-hop logical infrastructure (an ISL hop is the link
between two switches).
MDS Configurations
The SSM/MSM blade is located in the core. To minimize latency and increase fabric
efficiency, co-locate the SSM/MSM blade with the storage that it is virtualizing, just as
you would do in a Layer 2 SAN.
In these deployments:
• ISLs between the switches do not have to be connected to the SSM/MSM blade
• Hosts do not have to be connected to the SSM/MSM blade
• Storage should not be connected to the SSM blade
Internal routing between blades in the chassis is not considered an ISL hop. Since the
SSM/MSM is located inside the MDS switch, there is additional latency with the
virtualization ASICs. However, there is no protocol overhead associated with routing
between blades in a chassis.
If you are using a switch with an embedded MSM blade (MDS 9222i switch) for
virtualization, it should be ISLed to the core tier switch. The number of ISLs should be
sized to match the amount of virtualization traffic.
Brocade Configurations
In these configurations, you can locate an AP4-18 blade on the core tier director or you
can use an external AP-7600B switch to ISL to the core switch. The considerations for
locating the blade are similar to the MSM blade in the MDS configuration. The AP-7600B
is an external intelligent switch; it must be linked through ISLs to the core switch.
Physical placement of the RecoverPoint Appliances can be anywhere within the fabric
and need not be connected directly to the intelligent switch although it is the most
commonly employed approach.
When using AP-7600B switches, hosts are connected to the edge tier and storage is
connected to the core tier. The core tier is the centralized location and connection point
for all physical storage in this model. All IO between the host and storage must flow over
the ISLs between the edge and core. It is a one-hop infrastructure for non-virtualized
storage. However, all virtualized storage traffic must pass through at least one of the
ports on the AP-7600B. Therefore, the IO from the host must traverse:
1. an ISL between the edge tier and the Core tier
2. an ISL between the Core Tier and the AP-7600B
3. an ISL back from the AP-7600B to the core tier where the IO is terminated. This
is a three-hop logical topology.
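The hop arithmetic above can be checked with a small calculation. The sketch below counts logical ISL hops for the non-virtualized and virtualized paths in this core/edge layout; the switch names are illustrative.

```python
# Count logical ISL hops along each path in a core/edge fabric
# with an external AP-7600B (simple model of the paths above).

def count_hops(path):
    """Each adjacent pair of distinct switches is one ISL hop."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)

# Non-virtualized I/O: edge -> core (storage sits on the core switch).
plain = ["edge", "core"]
# Virtualized I/O: edge -> core -> AP-7600B -> back to the core.
virtual = ["edge", "core", "ap7600b", "core"]

print(count_hops(plain))     # 1 hop
print(count_hops(virtual))   # 3 hops
```

The same helper applied to the edge/core/edge designs discussed next would count two hops for non-virtualized traffic.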
Edge/Core/Edge Topologies
MDS configurations:
In this model, you connect hosts to the Connectrix® MDS host edge tier switches and
storage to MDS storage tier switches. The MDS switches or directors act as a
connectivity layer. The core tier is the centralized location and connection point for edge
and storage tiers. All IO between the host and storage must flow over the ISLs between
the tiers. This is a two-hop logical topology.
Within the fabric, hosts can access virtualized and non-virtualized storage. The location
of the physical storage to be virtualized will determine the location of the SSM/MSM
blade. There are two possibilities:
• Storage to be virtualized is located on a single switch
• Storage to be virtualized is located on multiple switches on the storage tier
Note: Physical placement of the RecoverPoint Appliances can be anywhere within the
fabric and need not be connected directly to the intelligent switch.
Storage on a single switch:
If physical storage is located on one edge switch, place the SSM/MSM modules on the
same switch. The SSM/MSM is co-located with the storage that it is virtualizing to
minimize latency and increase fabric efficiency.
Note: Connections to the SSM/MSM are not required.
• ISLs between the switches do not have to be connected to the SSM blade
• Hosts do not have to be connected to the SSM/MSM blade
• Storage should not be connected to the SSM blade
Internal routing between blades in the chassis is not considered an ISL hop since the
SSM/MSM is located inside the storage edge switch. There is additional latency with the
virtualization ASICs; however, there is no protocol overhead associated with routing
between blades in a chassis.
Storage on multiple switches:
If the physical storage is spread among several edge switches, locate a single
SSM/MSM in a centralized location in the fabric to achieve the highest possible
efficiencies. Because the physical storage ports are divided among multiple edge
switches in the storage tier, place the SSM/MSM in the connectivity layer in the core.
Just as with the Core/Edge design (or any design), all virtualized traffic must flow
through the virtualization ASICs. By locating the SSM/MSM in the connectivity layer,
RecoverPoint’s VIs will only need to traverse a single ISL to access physical storage. If
the SSM/MSM was placed in one of the storage tier switches, most or some of the traffic
between the SSM and storage would traverse two or more ISLs.
With MDS switches, since the SSM/MSM is located inside the switch or director, internal
routing between blades in the chassis is not considered an ISL hop. There is additional
latency with the virtualization ASICs but there is no protocol overhead associated with
routing between blades in a chassis.
Brocade Configurations
In a core/edge/core design with Connectrix B, the AP-7600Bs are linked via ISLs to the
storage edge switches.
Storage on a single switch:
In this model, hosts are connected to the host edge tier switches and storage is
connected to storage switches that form the other edge tier. The core Directors are for
connectivity only. All IO between the host and storage must flow over the ISLs in the
core. It is a two-hop infrastructure for non-virtualized storage.
In these cases, when using an AP4-18i blade, it should be located in the directors of the
core tier. If employing an AP-7600B switch, it should be directly ISLed to the core
directors. The considerations are similar to those for the MDS blades and switches.
In the case of AP-7600B switches, all virtualized storage traffic must pass through at
least one of the ports on the AP-7600B. Therefore, the IO from the host must traverse an
ISL between the edge tier switch and the core director. Then, it must traverse the ISL
between the core director and the AP-7600B. Finally, it must traverse an ISL back from
the AP-7600B to the core director where the IO is forwarded to the storage edge switch.
This is a four-hop design that would require an RPQ.
Storage on multiple switches:
You may spread physical storage amongst several edge switches. In these cases, an
AP-7600B or AP4-18i blade must be located in a centralized location in the fabric to achieve
the highest possible efficiencies. Because the physical storage ports are divided among
multiple edge switches in the storage tier, place the AP-7600B or AP4-18i blade in the
connectivity layer in the core.
Note: When RecoverPoint is added to a SAN, not all physical storage has to be
virtualized. Multiple storage ports may have LUNs to be virtualized. However, if more
than 75% of the LUNs to be virtualized reside on a single switch, locate the AP4-18i
blade on that switch or ISL the AP-7600B on that switch.
Fabric Splitter Sizing
Fabric splitters introduce additional limitations that are independent of the RecoverPoint
cluster limitations.
An ITL is an entity used internally by the switch to uniquely identify a LUN, as accessed
by some initiator. It is composed of the following elements:
I – Initiator’s WWN
T – Target’s WWN
L – LUN
All three values are necessary to uniquely identify a LUN given the possibility of LUN-
mapping per host. Because data replication performed by the switch occurs at ITL
granularity, the limits regarding the number of volumes supported by a service are given
in ITLs. In other words, the relevant “count” of replicated entities in the environment
should be the sum of possible paths for all initiator-target-LUN combinations.
The ITL numbers supported by each splitter vary with splitter type and RecoverPoint
release. Use the ITL calculators to compute the ITLs required in a configuration and to
verify that they are under the limit imposed for that configuration.
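As a rough illustration of the count described above, the sketch below (a hypothetical calculator, not the official EMC tool) sums the initiator-target-LUN path combinations; the host, WWN, and target names are made up for the example:

```python
# Hypothetical ITL calculator: counts initiator-target-LUN combinations.
# Each host entry lists its initiator WWNs, the target ports it is zoned
# to, and the number of LUNs masked to it on each target port.
def count_itls(hosts):
    total = 0
    for host in hosts:
        for _initiator in host["initiators"]:
            for _target, lun_count in host["targets"].items():
                total += lun_count
    return total

# Example: one host with 2 HBAs, zoned to 2 target ports, 10 LUNs each.
hosts = [{"initiators": ["wwn_a", "wwn_b"],
          "targets": {"tgt_1": 10, "tgt_2": 10}}]
print(count_itls(hosts))  # 2 initiators x 2 targets x 10 LUNs = 40
```

The result is compared against the ITL limit documented for the splitter type and release in use.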
Journal volume design
Journal volumes can be seen as append-only, log-structured systems where all the data
that is modified on the source volumes is logged on the journals. Understanding
the I/O profile on these journal volumes helps to properly design the journal volumes to
meet performance requirements.
As the hosts write to the production volumes, the RPA consolidates the incoming copies
of the I/Os into consistent snapshots. It then organizes these I/O blocks along with their
metadata and writes them to the Journal in a sequential write. Because the metadata is
stored with the snapshot, RPA can identify where these blocks belong on the target
volume. Similarly, when it’s time to distribute the snapshot to the target volume, RPA
reads the snapshot as a sequential read. RPA then reads the existing data on the target
volumes and follows it with a write on the target volumes to update the data from the
Journal. These reads and writes to the target volumes are random and are based on the
host I/O profile. RPA again bunches all the random reads from the target volumes and
writes them to the journal's undo stream as a sequential write.
The I/O profile on the Journal volumes is sequential, while the I/O on the target volumes
during distribution is random. For every write done on the production volumes, the RPA performs one
random read and one random write on the target volumes. Keep the journal spindles
separate from random I/O profile volumes when designing the Journal volumes. When
allocating the target volumes, design them with their random I/O profile in mind. For
example, if the production host has a 3:1 read to write, random I/O profile, design the
target volumes performance profile to be 60%-70% of the production volumes.
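As a minimal sketch of that rule (each production write becoming one random read plus one random write on the target during distribution), the illustrative function below estimates the target-volume back-end IOPS; the input figures are assumptions for the example:

```python
# Rough target-volume IOPS estimate during distribution: each production
# write becomes one random read plus one random write on the target,
# before any design headroom is added.
def target_iops(production_iops, read_to_write_ratio):
    writes = production_iops / (read_to_write_ratio + 1)
    return 2 * writes  # one random read + one random write per prod write

# Example: 8000 production IOPS with a 3:1 read:write profile.
print(target_iops(8000, 3))  # 2000 writes/s -> 4000 target IOPS
```

With headroom on top of this raw figure, the result lands in the neighborhood of the 60%-70% guideline given above.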
Sizing the RPAs (Appliances)
A single RecoverPoint appliance can sustain an average of 75 MB/s write I/Os up to
peaks of 110 MB/s. Use this throughput figure to calculate the number of appliances
required for the desired replication. A minimum of two RPAs are required for redundancy
in any RecoverPoint solution. The maximum sustainable incoming throughput for a
single cluster is 600 MB/s.
Note: CGs (unless appropriately configured on Version 3.3) run on a single node. From
version 3.3 onwards, consistency groups can span RPAs, providing up to 250 MB/s
throughput per consistency group.
Below version 3.3, if an end user requires CGs that sustain throughput more than the
maximum per node, utilize the parallel bookmark feature.
In addition to throughput rates, calculate the I/O per second (IOPS) when sizing the
appliances. The maximum IOPS a single RPA can sustain is 16,000 IOPS. The
processing power of an eight-node cluster is theoretically 128,000 IOPS.
But, as already mentioned, the other elements of the SAN (in this case the splitter
location) affect the overall environment. In the chart below, for SANTAP (Cisco) and
SAS (Brocade), values represent the cumulative supported IOPS for a pair of blades.
See the RecoverPoint release notes of the version being used for the latest numbers.
Splitter IOPS        3.0 SP1    3.1
Average Sustained
  Host               12000      14000
  SANTAP             11500      11500
  SAS                12000      16000
  Clariion            6000      10000
Burst
  Host               19000+     19000+
  SANTAP             11500      12000
  SAS                18000+     19000
  Clariion            6500+     20000
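Using the figures quoted above (75 MB/s sustained and 16,000 IOPS per appliance, a two-RPA minimum for redundancy, and a 600 MB/s single-cluster ceiling), a back-of-the-envelope sizing sketch might look like this; it is illustrative only, not an official sizing tool:

```python
import math

# Rough RPA-count estimate from the per-appliance figures in the text:
# 75 MB/s sustained throughput, 16,000 IOPS, two-RPA minimum, and a
# 600 MB/s maximum sustainable incoming throughput per cluster.
def rpas_needed(write_mb_s, write_iops):
    if write_mb_s > 600:
        raise ValueError("exceeds 600 MB/s single-cluster maximum")
    by_throughput = math.ceil(write_mb_s / 75)
    by_iops = math.ceil(write_iops / 16000)
    return max(2, by_throughput, by_iops)

print(rpas_needed(300, 40000))  # max(2, 4, 3) = 4 RPAs
```

The final count should still be checked against the splitter IOPS table above, since the splitter can be the lower ceiling.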
Sizing the WAN Pipe
The RecoverPoint WAN network, in the case of remote replication, must be well-
engineered with no packet loss or duplication, as these would lead to undesirable
retransmissions. When planning the network, ensure that the average utilized throughput
doesn’t exceed the available bandwidth. Oversubscribing available bandwidth will lead to
network congestion, causing dropped packets and TCP slow start. Consider network
congestion between switches as well as between the switch and the end device.
Consider user RPO requirements and I/O fluctuations to determine the BW required.
The relevant data to size the WAN pipe are:
• Average incoming I/O for a representative window in MB/s. (24 hrs/7 days/30
days)
• Compression level achievable on the data. (This is often difficult to obtain and
depends on the compressibility of the data. The rule is 2x to 6x.)
Dedicate a segment or pipe for the replication traffic or implement an external QOS
system to ensure bandwidth allocated to replication is available to meet the required
recover point objectives (RPO).
From these numbers, compute the minimal BW requirements of the environment by
dividing the average incoming data rate by the estimated compression level. Allocating
this BW for replication does not guarantee RPO or the frequency of high loads because
the I/O rate can fluctuate throughout the representative window.
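The minimal-bandwidth estimate can be sketched as follows, dividing the average incoming rate by the achievable compression level (the 2x-6x rule of thumb); the rates used are illustrative assumptions:

```python
# Minimal WAN bandwidth sketch: average incoming write rate divided by
# the achievable compression level (rule of thumb: 2x to 6x).
def min_wan_mb_s(avg_incoming_mb_s, compression_ratio):
    return avg_incoming_mb_s / compression_ratio

# Example: 120 MB/s average incoming writes, conservative 2x compression.
print(min_wan_mb_s(120, 2.0))  # 60.0 MB/s minimum
```

As the text notes, this minimum does not guarantee the RPO during bursts, since the I/O rate fluctuates over the representative window.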
Databases in Consistency Groups
Database files and redo logs should be in a single consistency group (CG) or group set.
Place the archive logs in a different consistency group. This separates their spindles from
the database volumes and also facilitates enabling image access on the archive logs CG at
a later point in time than the image on the database CG, which enables recovery.
Further, the archive logs are created at discrete points in time (when the online redo logs
are switched) and are generally used as a whole. Bookmarks taken just after the log
switch should be enough for database recovery and we will not need the intermediate
bookmarks created while the logs are being switched. For this reason, the journal
volumes on the archive log CG need not be big. The archive log CG is also a good place
to enable fast-forward distribution by specifying maximum journal lag.
RecoverPoint over Invista
You can achieve replication of Invista volumes to a remote location utilizing a
RecoverPoint solution. The Invista volumes can be replicated in Virtual-to-Virtual or
Virtual-to-Physical configurations. The following aspects of the design need to be
carefully evaluated when deploying RecoverPoint over Invista in this fashion:
a) Firmware versions on the intelligent modules: The versions of Invista and
RecoverPoint are generally qualified to work only with a specific version of firmware on
the intelligent modules. The firmware on these modules is closely tied to the firmware on
the Fibre Channel switches hosting them. Due to this tight requirement and dependency
on firmware versions, deploy the modules/switches hosting Invista separately from the
modules/switches hosting RecoverPoint. This enables upgrade or downgrade of one
application’s modules independent of other applications’ modules.
b) All RecoverPoint appliances must be able to access all of the Invista volumes that are
being replicated. RecoverPoint Appliance HBAs accessing Invista volumes add to the
ITL count on the Invista configuration. Consider these counts to design correctly without
running into Invista ITL scalability limits. For example, a two-appliance-per-site
RecoverPoint configuration (with two HBAs per fabric per appliance) replicating Invista
volumes will increase the Invista ITL count by 5 times (Invista ITLs + 2 HBAs x 2 RPAs
x Invista ITLs). Similarly, consider the number of ITLs supported on an Invista VT (256
currently) so as to distribute the VTs among RP appliances and front-end hosts carefully.
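The ITL growth in that example can be sketched as below; the function and its inputs are illustrative, not an Invista utility:

```python
# Hypothetical ITL-growth estimate for RecoverPoint over Invista: every
# RPA HBA path that can see the replicated Invista volumes adds a full
# set of ITLs on top of the existing host ITLs.
def invista_itl_total(base_itls, rpas_per_site, hbas_per_fabric_per_rpa):
    rpa_paths = rpas_per_site * hbas_per_fabric_per_rpa
    return base_itls * (1 + rpa_paths)

# The text's example: 2 RPAs per site, 2 HBAs per fabric per appliance,
# for a 5x increase over the original Invista ITL count.
print(invista_itl_total(100, 2, 2))  # 100 * (1 + 4) = 500
```

The total is then compared against the Invista ITL scalability limits, including the per-VT limit of 256 ITLs.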
Splitter configuration limits
Multiple splitters in the fabric:
Most RecoverPoint cluster configurations contain one fabric splitter per fabric. There is
no restriction on the number of clusters in a fabric when the splitters of the clusters are
maintained separately. The architect must be mindful of the following when there is a
need to deploy two or more splitters within a fabric (for the same cluster) for
performance or resiliency reasons.
Brocade:
You can use several splitters in the same fabric, as long as each target port is handled
exclusively by a single splitter.
Cisco:
You can use several splitters in the same fabric:
As long as each target port (represented by a DVT) is handled exclusively by a
single splitter, or
As long as each target port (represented by a DVT) is handled by several splitters,
each on a different VSAN.
Storage performance with RecoverPoint
There are three different entities in RecoverPoint replication that are important for
performance analysis:
1. The replica volume set, a copy of the production volume set (equal in size)
2. The journal volume set; RecoverPoint stripes across these volumes, so
allocating a large number of them from different RAID groups will increase
performance
3. The production volume set that is usually irrelevant in CRR installations.
However, for CDP that is using the same array for the source and target, or for
CRR in test environments that use the same array for both sets, this set must
also be included in the storage performance.
The journal volume set is used by the system for large IOs, almost all of them
sequential, striped across several LUNs for performance. The journal contains two
streams that we will refer to as the “DO” data containing future writes and the “UNDO”
data containing historic writes.
The RecoverPoint system has three modes of distribution at the remote site. The system
switches automatically between these modes based on storage performance.
Journal Storage Performance Configuration Guidelines:
I suggest the following to increase Journal volume performance:
Allocate a special RAID group for the journal. Writes on the Journal volume are
usually sequential with a large write size. Hence, any random I/O profile LUNs
should not be on the same disk spindles as journal volumes.
Configure journal volumes of different CGs on different storage RAID groups so
one group’s journal will not slow down other groups.
Configure journal volumes on a different RAID group than the user volumes.
RAID 5 is generally a good choice for Journal volumes.
Journal volumes can be corrupted if any host other than the RPA writes to them. To
prevent this, ensure that they are not zoned with any hosts other than the
RPAs. Manually load balance between consistency groups so that the CGs on each
RPA generate, on average, approximately the same amount of data.
Ensure that journal volumes are optimized for sequential access (reads and
writes).
The journal speed is also the maximal burst speed that can be handled without
entering a high-load condition. If the WAN allows it, and you have bursts, be sure the
journal volume is fast enough to handle them.
Run a benchmark on the journal (for example, iometer or iorate from the host)
to measure sequential access speed.
At a minimum, remote user volume should be able to sustain the average
production writes load. The system can provide better Recovery Time Objectives
(RTO) when the remote storage can keep more than twice the average of
production writes load. Note: we only care about production writes and not reads.
2010 EMC Proven Professional Knowledge Sharing 25
The maximum replication distribution rate will be at least 20 percent slower than the
journal volume speed.
If the cluster has more RPAs than defined CGs, I recommend
splitting a consistency group into two or more separate groups to gain
performance. Run the parallel group set on different RPAs. The system will
create bookmarks that are consistent across multiple RPAs when configuring
parallel groups (or group sets).
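The journal benchmark suggested above (iometer or iorate) can be roughly approximated with a sequential-write timing sketch such as the following; the path and transfer sizes are illustrative only, and a real benchmark tool should be preferred for sizing decisions:

```python
import os
import time

# Rough stand-in for the suggested journal benchmark: time large
# sequential writes to a file on the journal LUN's file system.
# The path and block sizes here are illustrative, not RecoverPoint tools.
def seq_write_mb_s(path, block_kb=512, blocks=64):
    buf = os.urandom(block_kb * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # force data to the device before stopping the clock
    elapsed = time.time() - start
    return (block_kb * blocks / 1024) / elapsed  # MB/s

print(round(seq_write_mb_s("/tmp/journal_test.bin"), 1))
```

Note that a file-system test only approximates raw-LUN behavior; caching and file-system overhead will skew the result.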
SANTAP Performance
Supported Host ITLs
The number of supported SANTAP sessions or replicated ITLs is limited to 2048. A
session is the object that SANTAP uses to manage the split write stream for a particular
ITL. When an appliance suffers an outage, SANTAP experiences a significant amount of
exception traffic and must process this additional work in a limited amount of time. The
time budget and the amount of work impose a ceiling on the number of sessions that
SANTAP can support. This is why only 2048 host ITLs per SSM or MSM module are
supported.
Distributing the load among SSM DPCs
Create a DVT for every target port utilizing SANTAP services. The DVT is created in the
Front-end (FE or host) VSAN and is assigned to a DPP on the SSM module to manage
(mirror) the writes sent to the target port. DPP (Data Path Processor) is an ASIC on the
SSM module. There are 8 such ASICs per SSM module. Once a host logs into a DVT,
SANTAP will install a DVTLUN for every masked LUN on the target port for this host.
When a DVT is created in the Front-end VSAN and a host logs into that DVT, SANTAP
creates a pseudo initiator for the host on the DPP (that the DVT was assigned to). Once
a pseudo initiator is tied to a DPP it should not be associated with another DPP (as it
leads to a duplicate VI problem and results in non-deterministic behavior). If the host
needs to talk to another DVT, then that DVT also must be created on the same DPP
where the pseudo initiator is installed. This is an important design consideration when
distributing the load across the DPPs.
Utilize as many DPPs as possible in an SSM-based SANTAP design for performance
reasons. Each Front-end and Back-end VSAN combination gets its own DPP. If a set of
hosts talks exclusively to a set of DVTs, place them in a single Front-End VSAN.
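The placement constraint described above, where a host's pseudo initiator pins every DVT it talks to onto a single DPP, can be sketched as a simple assignment routine; the DVT and host names are illustrative, and this is not a Cisco tool:

```python
# Sketch of the DVT-to-DPP placement constraint: once a host's pseudo
# initiator lands on a DPP, every DVT that host talks to must live on
# that same DPP (otherwise a duplicate-VI problem arises).
def assign_dvts(dvt_to_hosts, num_dpps=8):
    host_dpp, dvt_dpp = {}, {}
    next_dpp = 0
    for dvt, hosts in dvt_to_hosts.items():
        pinned = {host_dpp[h] for h in hosts if h in host_dpp}
        if len(pinned) > 1:
            raise ValueError(f"{dvt}: hosts already pinned to different DPPs")
        if pinned:
            dpp = pinned.pop()
        else:
            dpp = next_dpp % num_dpps  # round-robin for unpinned hosts
            next_dpp += 1
        dvt_dpp[dvt] = dpp
        for h in hosts:
            host_dpp[h] = dpp
    return dvt_dpp

print(assign_dvts({"dvt1": ["hostA"], "dvt2": ["hostA", "hostB"],
                   "dvt3": ["hostC"]}))
```

Here dvt2 is forced onto the same DPP as dvt1 because hostA talks to both, while dvt3 can spread to the next DPP.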
Distributing the load in MSM Configurations
MSM configurations require only a single Front-end VSAN. Manual load balancing
across the DPPs is not required, which adds significant flexibility to SANTAP
deployments without performance impact.
FAP performance
In Brocade FAP deployments with Frame Redirection, each binding is alternately
assigned to one of the two DPCs on the virtualizing module or switch. When binding
performance-critical hosts with multiple paths per fabric, bind the initiators of the host
belonging to a fabric one after the other so that they are assigned to different DPCs.
Network latency and BW requirements
In addition to the size considerations mentioned in the WAN sizing section, the WAN
pipe becomes an issue when there is large latency (more than 100 ms) or packet
drops. Configure or tune the number of streams (sockets) used for replication when
there is a high-latency or low-fidelity WAN.
Run the ‘iperf’ command on the RPA (as user boxmgmt) with 1, 5, 10, 20, 40, and 60
sockets to check WAN performance. Set the number of sockets (num_of_streams) on
the RPA to the highest number at which iperf shows an improvement in results (the
maximum is 40). If there is a significant gain moving from 40 to 60 sockets (more than
10%), the WAN link itself is the bottleneck. Note that RecoverPoint will not achieve
better performance than iperf: if the WAN bandwidth is significantly larger than the iperf
results, RecoverPoint will not be able to utilize it, since the WAN suffers from
bottlenecks (for instance, too many packet drops).
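The socket-count tuning described above can be sketched as follows; the measured iperf rates and the 10% significance threshold applied here are illustrative assumptions:

```python
# Sketch of the socket-count tuning: pick the smallest stream count (up
# to the 40-socket maximum) beyond which iperf shows no significant
# (>10%) throughput gain. The measured rates below are illustrative.
def pick_num_streams(results, max_streams=40, gain_threshold=0.10):
    counts = sorted(c for c in results if c <= max_streams)
    best = counts[0]
    for c in counts[1:]:
        if results[c] > results[best] * (1 + gain_threshold):
            best = c
    return best

# Example iperf throughputs in Mbit/s per socket count tested.
measured = {1: 90, 5: 240, 10: 310, 20: 330, 40: 335, 60: 338}
print(pick_num_streams(measured))  # 10: going to 20 sockets adds <10%
```

In this example the 40-to-60-socket gain is also under 10%, so the link itself, not the stream count, is the limit.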
Statistics and Bottlenecks
“Detect_bottlenecks” in RecoverPoint is a simple tool for the user to see if there is a
problem with system performance, identify the problem, and find a possible resolution.
This tool is often used to iteratively resize and tune the various components of a
RecoverPoint system based on the live system’s performance feedback.
RecoverPoint captures and stores both short-term and long-term statistics on its
performance. The statistics are collected on a wide-ranging and comprehensive set of
counters. Only very experienced users can understand these raw statistics.
‘Detect_bottlenecks’ provides a simple and easy-to-understand analysis interface.
‘detect_bottlenecks’ outputs provide the following:
1) System overview based on the long term statistics for the duration specified
2) Observations on the system exceptions (bottlenecks) that cause improper
behavior
Interpreting Detect_bottlenecks output
There are four sections in the detect_bottlenecks output.
1) Overview of the system:
This contains an overview of the system during the time specified. There are four
subsections in the system overview:
Site Overview
Group Overview
Copy Overview
Link Overview
2) Highload Periods:
The output prints how many high load periods the system observed during the period
specified and their times. The system then prints the system overview for the first 3
highload periods.
Each highload period can contain several highloads. An overview is produced for the
highload period and for the first highload in that period.
The overview for each highload and highload period is similar to the system
overview; it contains the Site, affected RPA, affected Copy, and affected Link
overviews during the period of highload.
3) Initialization Periods:
The output prints how many Initialization periods the system observed during the period
specified and their times. The system then prints the system overview for the first 3
Initializations.
The overview for each Initialization period is similar to the overview of the system. That
is, each Initialization period’s overview contains the Site, Group, Copy, and Link
overviews during the period of Initialization.
4) Peak Periods:
This feature detects the largest data peaks during a specified detection period across all
RPAs at the production site. The command returns write volumes on all data transfer
links, where a data transfer link is uniquely defined by an RPA, target copy (local or
remote), and consistency group. Information at this granularity makes it possible to
identify opportunities for reorganizing consistency groups across the available RPAs to
achieve optimal load balancing (and reduce peaks) across the system.
Bottleneck Analysis:
Bottleneck analysis is done on the statistics collated at each of the analysis cycles
(system overview, highload analysis, Initialization analysis). Bottleneck analysis is the
exception analysis where the algorithm deduces exceptions and identifies the root
cause of each exception using analytical formulas.
Storage Encryption with Cisco SME
Several deployment topologies are possible with Cisco SME modules or switches. The
following guidelines related to SAN topology apply when deploying SME clusters:
The existing and new tape libraries must be connected to the MDS 9500 family
switches and the MDS 9200 family switches.
Switches connected to tape libraries must be running the minimum supported
SAN-OS version or later.
The MSM-18/4 module is supported on the MDS 9500 family of switches and the
MDS 9222i switch. The switch must be running the minimum supported SAN-OS
version or later.
Cisco SME requires a minimum of one SME line card in a cluster.
SME modules should be on the target switch whenever possible.
Core-Edge Topology
In core-edge topology, media servers (or the hosts) are at the edge of the network, and
tape libraries are at the core.
In this topology, use SME modules in the core switch if the targets that require SME
services are connected to only one switch in the core. The number of SME line cards
depends on the throughput requirements.
Edge-Core-Edge Topology
In Edge-Core-Edge topology, the hosts and the targets are at the two edges of the
network, connected via core switches.
If the targets that require SME services are connected to only one switch on the edge,
SME modules should be used on that switch and SME should be provisioned on that
switch only. The number of SME line cards depends on throughput requirements.
Sizing Guidelines
1. Each SME interface supports up to 450 MB/s throughput with compression and
encryption enabled.
2. The number of tape drives that can be serviced by an SME module depends on
the throughput of the tape drives. For example, the peak throughput of each LTO-3
drive is 40-60 MB/s with compression and encryption enabled. Each SME
interface should be connected to 6-8 such tape drives for optimal performance.
3. In addition, the actual throughput also depends on the server performance,
number of concurrent SME streams on the SME interface, and the backup data
(compressibility) so appropriate considerations must be made and a benchmark
is recommended.
4. 32 targets, at most, per switch are supported by FC-redirect.
5. Each FC-redirected target can be zoned with a maximum of 16 hosts.
6. A maximum of 1000 FC Redirect entries are available on each line card on which
two hosts or targets are connected.
7. A Cisco MDS 9500 series switch can accommodate multiple SME line cards.
8. A physical fabric can have, at most, one Cisco SME Cluster. Each cluster can
have up to four switches with multiple SME interfaces provisioned and SME
service enabled.
9. The encryption engine processor on the Cisco MSM module also processes the
traffic on the four Gigabit Ethernet ports on the MSM module and performs
IPSec encryption and data compression for Fibre channel over IP connections.
As a result, using FCIP and SME on the same MSM module is not advisable due
to the performance degradation that may result.
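Guidelines 1 and 2 above can be turned into a rough fan-in estimate; the sketch below assumes the quoted 450 MB/s per SME interface and the LTO-3 peak rates, and is illustrative rather than an official sizing formula:

```python
import math

# Rough tape-drive fan-in estimate: a 450 MB/s SME interface against
# LTO-3 drives peaking at 40-60 MB/s supports roughly 6-8 drives per
# interface, matching the guideline in the text.
def drives_per_interface(interface_mb_s=450, drive_peak_mb_s=60):
    return interface_mb_s // drive_peak_mb_s

def sme_interfaces_needed(num_drives, drive_peak_mb_s=60):
    per_if = drives_per_interface(450, drive_peak_mb_s)
    return math.ceil(num_drives / per_if)

print(drives_per_interface())     # 7 drives at a 60 MB/s peak
print(sme_interfaces_needed(20))  # ceil(20 / 7) = 3 interfaces
```

As guideline 3 notes, the actual throughput also depends on server performance, concurrent streams, and data compressibility, so a benchmark should confirm the estimate.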
Storage Encryption with Brocade Encryption Services
You can employ several deployment topologies using Brocade Encryption modules or
switches. Some commonly deployed topologies are:
1. Single fabric deployment - HA cluster
In this topology, two encryption blades or switches can form an HA cluster
providing redundancy in a Single fabric. If one Encryption Blade or switch fails,
the other switch or blade takes over the Crypto Target Container.
2. Single fabric deployment - DEK cluster
In this deployment, the Encryption modules/switches depend on host Multi Path
I/O software for failover. Each Blade or switch services one Host Initiator and
Target pair per host.
3. Dual fabric deployment - HA and DEK cluster
In this model, the HA cluster is used within a fabric and DEK failover is used
between the fabrics for redundancy.
Sizing Guidelines
The following performance numbers dictate the sizing and the number of modules or
switches to be deployed for Brocade Encryption.
For Disk Encryption, the throughput per module/switch is rated at 96 Gbit/sec
(Mix of both encrypted and clear text traffic). Up to 64K concurrent exchanges
can be processed per module or switch.
For Tape Processing, the throughput per module/switch is rated at 48
Gbit/sec. Up to 96 concurrent tape sessions can be processed per module or
switch.
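These rated figures can be turned into a rough module-count estimate; the sketch below is illustrative, treats disk and tape traffic independently, and ignores the concurrent-exchange and tape-session limits, which must be checked separately:

```python
import math

# Rough module-count estimate from the rated figures above: 96 Gbit/s
# per module/switch for disk traffic, 48 Gbit/s for tape traffic.
def encryption_modules_needed(disk_gbit_s=0, tape_gbit_s=0):
    by_disk = math.ceil(disk_gbit_s / 96) if disk_gbit_s else 0
    by_tape = math.ceil(tape_gbit_s / 48) if tape_gbit_s else 0
    return max(by_disk, by_tape, 1)  # at least one module per fabric

print(encryption_modules_needed(disk_gbit_s=200))  # ceil(200/96) = 3
print(encryption_modules_needed(tape_gbit_s=40))   # 1 module suffices
```

An HA-cluster or DEK-cluster topology from the list above would then double the count for redundancy.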
Future Directions
One of the limiting factors of fabric-based block virtualization technologies is that they
require a specialized virtualization hardware switch or module in the fabrics. Even in
highly redundantly designed fabrics, once these modules or switches are inserted into
the fabric, the entire virtualized application infrastructure depends on them for
virtualization services. These virtualization modules can become the performance
bottleneck and single point of failure within a fabric. Although redundancy can be
introduced within the fabrics by adding multiple virtualizing modules and additional host
initiators and storage targets, this design becomes expensive and difficult to maintain.
The limitation stems from the fact that the enabling virtualization technology operates at
the module or switch level.
Fabric Port level virtualization services could alleviate such limitations in the fabrics.
When the virtualization is done at the port level of a fabric, the redundancy designed in
the existing fabrics can remain the same while the intelligent ports can deliver the
virtualization services. Major strides in design and innovation in port-level technology
are needed for this to be feasible. These enhancements are needed for block-level
virtualization applications to be widely deployed in enterprise environments.
Conclusion
This article briefly reviews block-based storage virtualization applications that are being
deployed by enterprises. It describes the specialized SAN hardware that enables the
virtualization features in these applications, and examines the design aspects of
deploying these solutions with performance and scalability in mind.
References
1. RecoverPoint CLI Reference Guide
2. RecoverPoint Administrator’s Guide
3. RecoverPoint Security and Networking Technical Notes
4. Deploying RecoverPoint with SANTAP and SAN-OS Technical Notes
5. Deploying RecoverPoint with the Connectrix AP-7600B and PB-48K-AP4-18
Technical Notes
6. Cisco MDS 9000 Family Storage Media Encryption Configuration Guide
7. Brocade Fabric OS Encryption Administrator’s Guide