How Nyherji Manages High Availability TSM Environments using FlashCopy Manager

Date post: 19-Jan-2015
By Petur Eythorsson, Nyherji

© 2013 IBM Corporation

High Availability Environments Using TSM & FCM
Pétur Eyþórsson
Nýherji hf

28 May, Denmark

Disclaimer

NÝHERJI ACCEPTS NO LIABILITY FOR THE CONTENT OF THIS PRESENTATION, OR THE CONSEQUENCES OF ANY ACTIONS TAKEN ON THE BASIS OF THE INFORMATION PROVIDED, UNLESS THAT INFORMATION IS SUBSEQUENTLY CONFIRMED IN WRITING.

ANY VIEWS OR OPINIONS PRESENTED IN THIS SESSION ARE SOLELY THOSE OF THE AUTHOR AND DO NOT NECESSARILY REPRESENT THOSE OF IBM OR NÝHERJI.

About Nyherji and Pétur

• Nyherji is one of Iceland's leading information technology service providers. It offers complete IT solutions, including consultancy, the provision of hardware and software, office equipment and technical service.

• Pétur Eyþórsson has been lead designer of TSM and DR planning infrastructure for all of Nyherji's TSM customers for the last 14 years, and is an amateur folk-style wrestler.

Our Environment

• Nyherji manages roughly 50 TSM servers
• TSM servers come in many shapes, protecting 5 - 5,000 TB
• Main OSes: Windows, AIX and Linux
• TSM server versions mostly 6.3
• Mostly midrange customers that historically have used the traditional disk-to-tape approach
• No VTLs, XIV or any other high-end devices exist
• Wide distribution of Storwize V7000 and V3700

Businesses are storing unnecessary data

• Businesses are spending 20% more than they need to on backing up unnecessary data.
  – "The most common mistake businesses make is to fail to update their backup policies. It is not unusual for companies to be using backup policies that are years or even decades old, which do not discriminate between business-critical files and the personal music files of employees." ~Gartner

• An especially notorious problem in Iceland's financial institutions, for historical reasons.

Our History

• TSM has made improvements that open up new approaches
  – TSM 6.1 (new DB2-based TSM database, introduced target deduplication)
  – TSM 6.2 (introduced source [client-side] deduplication)
  – TSM 6.3 (introduced Node Replication, FCM 3.1, and TDP for VMware)
  – IBM acquired FastBack
  – TSM 6.4 (enhancements to Node Replication and TSM server scalability; FCM 3.2 introduced support for NetApp devices as well as Metro Mirror/Global Mirror)

• Our past experience was based solely on conventional TSM disk-to-tape servers

• These new technologies offered new options that show great potential.

Our History

• In November 2010, we decided to move 2 of our TSM servers to a highly deduplicated environment.

• We had no prior experience with TSM deduplication, and there was little experience on the market that we could tap into.

• Since then we have moved 2 other big environments to deduplication and FCM.

• Adjustments have been made along the way, based on that experience.

• The purpose of this presentation is to show you how we use our TSM environments.

Our Environment

• IBM Tivoli Storage Manager Suite for Unified Recovery license
  – Changes everything
  – No more PVU counting
  – Incentive to push for technologies like FCM, deduplication and compression
    • New challenges

• 6 high-performance environments (3 sites) emerged
  – Use deduplication where possible
  – FlashCopy Manager
  – TSM Node Replication
  – Block-level backup where possible
  – High utilization of client compression

Design Goals

• RTO goal on important data: less than 1 hour
• TSM server RTO: less than 6 hours
• Has to be cost-effective!
  – NL-SATA for storage pool
  – TSM deduplication

Pre-Dedup Environment

[Diagram labels: Client Data; Disk Pool; Tape Pool; Copy Pool; Larger files; Small files]

Our TSM Dedup Solution

• 4 domains
  – B/A client data
  – Managed applications (MGAPP)
  – Non-managed applications (N-MGAPP)
  – Virtual data

[Diagram labels: Client Data; Dedup; Traditional Tape; Files; App; Virtual; V-CTL; NON-Dedup; File Device Class]

FCM for VMware

[Diagram labels: Virtual Machines; Data Stores on LUNs; Full Backup; Full Backup; Incremental; D1 D2 D3 D4; TSM]

FCM VMware backup

• Daily incremental, weekly full
  – 2-week cycle

• 2 device classes (FCM)
  – STANDARD
  – INCREMENTAL

• TDP for VMware used for weekly backup, 90-day retention
  – Daily on Linux FCM management machine

• Benefits of FCM for VMware
  – MUCH faster restore speed (data stores)
  – No CBT issues
  – Cheap license (Storwize)

FCM Naming Conventions

• One FCM device class for each backup type
  – Full, Incr, Copy

• TARGET_NAMING must specify a valid target naming schema
  – Difficult or impossible to manage if not structured properly

• Schedules back up whole ESX clusters
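The point about structured target names can be illustrated with a small sketch. The layout `<source>_<type>_<set>`, the function names and the example volume name are all hypothetical; this is not FCM profile syntax, and the format that TARGET_NAMING itself accepts should be taken from the FCM documentation. The sketch only shows why a fixed, parseable scheme keeps target volumes manageable.

```python
BACKUP_TYPES = ("FULL", "INCR", "COPY")  # the three backup types named on the slide


def target_volume_name(source_volume, backup_type, target_set):
    """Build a target name as <source>_<type>_<set>; this layout is hypothetical."""
    if backup_type not in BACKUP_TYPES:
        raise ValueError(f"unknown backup type: {backup_type}")
    return f"{source_volume}_{backup_type}_{target_set}"


def parse_target_volume_name(name):
    """Split a generated name back into (source_volume, backup_type, target_set)."""
    source_volume, backup_type, target_set = name.rsplit("_", 2)
    return source_volume, backup_type, int(target_set)


# Hypothetical datastore volume name, for illustration only.
print(target_volume_name("esx_cluster1_ds01", "INCR", 2))
```

Because the type and set number are always the last two fields, a target volume can be traced back to its source and device class without a lookup table.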

Our TSM Dedup High Availability Solution

[Diagram labels: Primary Site; Secondary Site; Node Replication; Node Replication, active data (no Oracle); Oracle Primary; Metro Mirror]

Why FCM for Oracle

• Restore/backup time reduced to minutes.
• The added workload of client deduplication & compression is not acceptable to the DBAs on production machines.
  – An auxiliary/proxy machine backs up the FCM copies to TSM; it does the deduplication & compression and sends the data to TSM

• TDP for Oracle does not distinguish between active/inactive copies.
  – DR problem when using active-data Node Replication
    • Solved with FCM MM copy

Our Environment

• Total storage of 95 TB, weekly change of 25 TB (before backup/dedup)

• Intel-based servers
  – 120-140 GB RAM
  – TSM DB
    • 8 SSD DB
      – RAID-5
      – Easy Tier, 22 SAS
      – 1.7 TB SSD
        » Total available size 6 TB
    • 48 SAS 15k rpm
      – RAID-10
        » Total available size 6 TB
  – 8-12 cores

• V7000 controller
  – DS3700 & DS3200
  – 3 TB NL-SAS drives
  – 170 TB usable storage
    • RAID-6

• FCM for more demanding RTO needs
• Node Replication for high availability
• Tape for
  – long-term storage
  – data that does not fit in dedup storage

TSM 24 hour work schedule

• FCM backups 18:00
• Main client backups 18:00-02:00
• Expire inventory 02:00-03:30
• Identify duplicates 03:30-04:15
• TSM DB backup 04:30-06:00
• TSM file data Node Replication 06:00-10:00
• TSM virtual Node Replication 10:00-16:00
• TSM database Node Replication 14:00-18:00
• TSM * replication to capture missed and new data
• TSM space reclamation 13:00-18:00 (threshold 10)
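A schedule like this is easier to reason about as data. The following is a minimal sketch, not something from the presentation: it copies the windows from the slide, leaves out the two tasks that have no end time there (the FCM backups and the catch-all replication), and lists which windows run concurrently. The function names are illustrative.

```python
from itertools import combinations

# Windows copied from the slide as (start, end) in 24-hour "HH:MM" form.
SCHEDULE = {
    "Main client backups": ("18:00", "02:00"),
    "Expire inventory": ("02:00", "03:30"),
    "Identify duplicates": ("03:30", "04:15"),
    "TSM DB backup": ("04:30", "06:00"),
    "File data node replication": ("06:00", "10:00"),
    "Virtual node replication": ("10:00", "16:00"),
    "Database node replication": ("14:00", "18:00"),
    "Space reclamation": ("13:00", "18:00"),
}


def to_minutes(hhmm):
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def minutes_covered(start, end):
    """Return the set of minutes-of-day in [start, end), wrapping past midnight."""
    s, e = to_minutes(start), to_minutes(end)
    if e <= s:
        e += 24 * 60
    return {m % (24 * 60) for m in range(s, e)}


def overlapping_pairs(schedule):
    """Return every pair of task names whose windows share at least one minute."""
    covered = {name: minutes_covered(*window) for name, window in schedule.items()}
    return [(a, b) for a, b in combinations(covered, 2) if covered[a] & covered[b]]


for first, second in overlapping_pairs(SCHEDULE):
    print(f"{first} overlaps {second}")
```

With the slide's times, the only concurrent windows are in the afternoon, where virtual and database node replication run alongside space reclamation; the night-time tasks run strictly one after another.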

It matters where you do the deduplication

What we learned

• Achieving increased performance
  – Solved by engineering parallelism
    • Solved differently for different applications

What we learned

• Protecting the dedup storage pool with tape copy pools proved problematic

[Chart: performance based on file size (* fabricated data)]

Node Replication solved this

Why Node Replication as High Availability

• 4 possible solutions
  – OS cluster (Windows Cluster, AIX HACMP)
    • Pros
      – Robust
      – Automated failover
    • Cons
      – Only OS fault tolerant
  – Traditional server-to-server copy pool virtual volumes
    • Pros
      – Robust
      – Volume failure recovery
    • Cons
      – Long RTO
      – Cumbersome and long recovery (especially for dedup TSM servers)
  – TSM Node Replication
    • Pros
      – Relatively simple failover
      – Warm standby server ready to go
    • Cons
      – Young technology
      – No easy way to recover from damaged volumes
  – TSM DB2 HADR
    • Pros
      – Easy failover
      – Cold standby server ready to go
      – Can take over metadata only
    • Cons
      – No installation of the kind we proposed existed in production.

Our Performance Design Formula

IF (X) AND (Y) ≤ 95% THEN (N) + 1

X = CPU load
Y = Disk response time (15 ms = 95%)
N = Number of parallel worker threads in TSM

• Simplified formula to maximise workload in our TSM servers
  – If idle TSM resources are detected, more threads are added.

• CPU or disk response time should always be the bottleneck
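The rule above can be written out as a short sketch. Two things here are assumptions rather than part of the slide: disk response time is mapped to a percentage linearly (the slide only fixes the point 15 ms = 95%), and the function names are made up for illustration.

```python
DISK_MS_AT_LIMIT = 15.0  # from the slide: a 15 ms response time counts as 95%
LIMIT_PERCENT = 95.0


def disk_load_percent(response_ms):
    """Map disk response time to a load percentage (linear scaling is an assumption)."""
    return response_ms / DISK_MS_AT_LIMIT * LIMIT_PERCENT


def next_thread_count(threads, cpu_percent, disk_response_ms):
    """Add one worker thread while CPU load and disk load are both at or below 95%."""
    cpu_ok = cpu_percent <= LIMIT_PERCENT
    disk_ok = disk_load_percent(disk_response_ms) <= LIMIT_PERCENT
    return threads + 1 if cpu_ok and disk_ok else threads


# Idle CPU and fast disk: one more thread is added.
print(next_thread_count(20, cpu_percent=60.0, disk_response_ms=5.0))
# Disk already slower than 15 ms: the thread count stays where it is.
print(next_thread_count(20, cpu_percent=60.0, disk_response_ms=30.0))
```

Applied repeatedly, the rule keeps raising parallelism until either the CPU or the disk saturates, which is what the last bullet asks for.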

Two different sites used to collect information

TSM environment 1
• 2 TSM servers active/inactive (NR)
• Windows 2008 R2
• 8 cores
• 140 GB RAM
• 1.4 TB DB
  – 44 (SAS RAID-10)
• 70 TB dedup storage pool
  – DS3200
    • 2 & 3 TB SATA, RAID-5 and RAID-6
• Current bottleneck
  – CPUs

TSM environment 2
• 2 TSM servers active/inactive (NR)
• Windows 2008 R2
• 12 cores (24 multithreaded)
• 120 GB RAM
• 1.7 TB DB
  – 4 internal SSD (RAID-5)
  – Easy Tier V7000
    • 4 × 400 GB SSD (RAID-5)
    • 20 SAS (RAID-10)
• ~90 TB dedup storage pool
  – DS3700 behind V7000
    • 3 TB SATA RAID-6
• Current bottleneck
  – SSDs

Performance Data

• TSM server: 1.1 TB database, 80 TB dedup pool
• 20-thread deduplication space reclamation (threshold 10)
  – Sustained 5,000-9,000 IOPS
• 17 TB of total data transfer
  – 7 TB write
  – 10 TB read
• CPU near fully utilized

TSM env 1

Performance numbers

• 1.3 TB TSM database on Storwize SSD/SAS Easy Tier
  – Max 30,000 IOPS!
  – Space reclaim moves 4.3 TB per hour (read & write)

[Chart: average DB IOPS 8,000-12,000]

TSM env 2

Performance

• TSM dedup environment, totals over 24 hours
  – The database writes ~1× its size every day
    • A 1 TB database writes 970 GB
  – The database reads ~1.5× its size every day
    • A 1 TB database reads 1.5 TB
  – SSDs in RAID-5 become the bottleneck during write-intensive operations (space reclamation W/R 50/50)

TSM env 2
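As a rough planning aid, the two ratios on this slide can be applied to other database sizes. This is only a sketch: it assumes the ratios scale linearly with database size, which the slide does not claim, and the function name is illustrative.

```python
WRITE_FACTOR = 0.97  # slide: a 1 TB database writes about 970 GB per day
READ_FACTOR = 1.5    # slide: a 1 TB database reads about 1.5 TB per day


def daily_db_io_tb(db_size_tb):
    """Return (written_tb, read_tb) per 24 hours, scaling the slide's ratios linearly."""
    return db_size_tb * WRITE_FACTOR, db_size_tb * READ_FACTOR


# The 1.3 TB database size quoted for TSM env 2 on the previous slide.
written, read = daily_db_io_tb(1.3)
print(f"written ~{written:.2f} TB/day, read ~{read:.2f} TB/day")
```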

What has changed

• Use deduplication where we can
  – Exclusion/special treatment:
    • Very large single objects
    • Encrypted data
    • Large non-repetitive data

• We can't use a storage pool hierarchy based on small-file pools anymore, because of client dedup restrictions

• We assign a specific DISK device class storage pool for VMware control files
  – Reduces the mount point requirement

Summary

What we have learned so far

• Our TSM dedup servers can scale up to 400 TB of managed data (pre-dedup); this is based on DB size

• We can't be cheap when it comes to our TSM server hardware
  – A lot of RAM, 48 GB minimum
  – 12 cores (Intel)
    • Only put multiple TSM instances on AIX.
  – Use really fast disks for your database; it's going to get hammered (5,000+ IOPS). Preferably SSD or a lot of spindles
  – We use the maximum active log size from the start
    • We must be careful about our space reclaim workload; many threads can eat up all the logs really fast.
    • Gigantic single objects (1.0 TB+) will pin the log; we must be careful about workload during that object's backup time.
  – Larger databases.
    • ×2 if you dedup only B/A client data.
    • ×3 - ∞ if you dedup application data as well.
      – Depends a lot on how long you plan to keep your data in the pool
  – Copy pool to tape from the dedup pool proved difficult to use; we used Node Replication instead.

What we have learned so far

• We use client deduplication if we have the backup window to do so.
  – It will cause performance degradation on your backups; application backups are more affected (assuming no transport bottleneck).
  – Send all client data directly to the dedup storage pool
  – Saves a lot of work on your SATA drives

• We use VMware backup when possible; ALL VMs must be on a high enough HW level to support CBT.
  – If not, use FCM only on those machines

• We use client compression to add more space savings.
  – Must be careful about use of 3rd-party compression; it may have adverse effects on the deduplication ratio.
  – We DON'T use compression if we plan on doing server-side deduplication: bad ratio

What we have learned so far

• Utilize client parallelism for greater speed and workload
  – B/A client: resource utilization
  – TDP SQL: multiple concurrent database backups, or stripes
  – VMware: use Vmmaxparallel, be careful not to exceed 8 on each host at the same time
  – Exchange: multiple data movers

• We keep our deduplication volumes small, 12-24 GB
• Run aggressive space reclamation
  – Aim for a 10% threshold (90% or less utilized)
• Keep large objects in a separate storage pool (active log size)
• Keep VMware CTL files in a separate (DISK device class) storage pool.
• Make the TSM database backup as fast as possible
  – All log activity during the backup window will be applied at the end
  – Need increased space in the active log due to this.
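To make the 10% reclamation threshold from the list above concrete, here is a small sketch of the selection rule it implies. It reads the slide literally: a volume that is 90% or less utilized has at least 10% reclaimable space and becomes a candidate. Treating reclaimable space as simply capacity minus used space is a simplification, and the volume names and sizes are invented for the example.

```python
def reclaimable_percent(capacity_gb, used_gb):
    """Share of a volume not holding valid data (simplified: capacity minus used)."""
    return (capacity_gb - used_gb) / capacity_gb * 100.0


def volumes_to_reclaim(volumes, threshold_percent=10.0):
    """Return names of volumes whose reclaimable share meets the threshold.

    `volumes` maps a volume name to a (capacity_gb, used_gb) tuple.
    """
    return [
        name
        for name, (capacity_gb, used_gb) in volumes.items()
        if reclaimable_percent(capacity_gb, used_gb) >= threshold_percent
    ]


# Invented example volumes in the 12-24 GB range the slide recommends.
example = {"vol1": (24, 23), "vol2": (24, 20), "vol3": (12, 6)}
print(volumes_to_reclaim(example))
```

A low threshold combined with small volumes means many volumes qualify at once, which is consistent with the earlier warning that many reclamation threads can fill the active log quickly.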

What we have learned so far

• Plan for a small tape pool for data that is not well suited to dedup storage pools
  – Application databases that change a lot (reorgs, index rebuilds, etc.)
  – Data that requires encryption.
  – Very large single objects, 750 GB+

• For higher-RTO systems we utilize FCM to achieve instant restore of the newest data; alternatively we utilize parallel threads to achieve the RTO goal, but there are drawbacks.

• Readjust the copy rate for FCM; the default setting does not always apply

• In rare cases heavily utilized applications can't handle LUN quiescing
  – FCM or block-level backups can't be used.

The future

Chicago's World Fair 1893

• Moral of the story

• Expect more technological innovation than you can imagine in the coming years

• Don't get your hopes up: IT's innovation won't solve all its problems.

- With new and improved technology, new challenges emerge.

Thank you!

Questions?

